elliott50 Posted January 10, 2011 at 11:16 AM
I thought you all might be interested to know the overlap between the old HSK (advanced level) and the new HSK (level 6) in terms of the characters employed. As I live in the UK, I have also looked at the overlap with the A-Level qualification taken at the end of high school here. My results can be seen on the Venn diagram in the image below. Interpreting the diagram just for the new and old versions of the HSK exam: of the 2631 characters of the new HSK (level 6), 2559 (852+1707) (97%) are common to the old HSK (advanced) but 72 (52+20) are entirely new; while of the 2861 characters of the old HSK, 302 (248+54) (11%) are not relevant to the new HSK. With the level of character knowledge between the old and new HSK relatively close, the main difference seems to be in the number of words created by character combinations that are required. The new HSK (level 6) requires roughly 4249 character combinations to be learnt, whilst the old HSK (advanced) requires knowledge of around 6877 character combinations. So overall, compared with the old HSK (advanced), the new HSK (level 6) requires 8% fewer characters and 38% fewer character combinations to be learnt. In defence of the designers of the new HSK, I imagine they felt that the meaning of many of the combination words not included in their list can be easily guessed from the meaning of their component characters. Yes, yes, I know... I am a very sad man with too much time on my hands! :blink:
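For anyone who wants to re-check the arithmetic in the post above, here is a minimal sketch in Python. The region sizes and word counts are the figures quoted in the post; the variable names and the `pct` helper are just illustrative.

```python
# Venn-region sizes as quoted in the post (character counts).
shared = 852 + 1707      # in both new HSK 6 and old HSK advanced
only_new = 52 + 20       # in new HSK 6 only
only_old = 248 + 54      # in old HSK advanced only

new_total = shared + only_new
old_total = shared + only_old

def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)

# Approximate counts of character combinations (words) quoted in the post.
new_words, old_words = 4249, 6877

print(new_total, old_total)                    # 2631 2861
print(pct(shared, new_total))                  # 97 (% of new HSK chars also in old)
print(pct(only_old, old_total))                # 11 (% of old HSK chars dropped)
print(pct(old_total - new_total, old_total))   # 8  (% fewer characters)
print(pct(old_words - new_words, old_words))   # 38 (% fewer combinations)
```

The totals and percentages come out the same as those stated in the post.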
gato Posted January 10, 2011 at 01:51 PM
Nice, not sad at all. So what are the 119 characters on the A-level list but not on either the old or new HSK list? I am surprised that there are that many, considering that the old HSK list already has 2800 or so characters whereas the A-level list has only 1900. By the way, the old HSK vocab list had 8840 words. See here: http://en.wiktionary.org/wiki/Appendix:HSK_list_of_Mandarin_words ("The total number of words on the HSK committee list is 8840.")
jbradfor Posted January 10, 2011 at 02:19 PM
That's very interesting! It seems very good to me that the largest bucket is the 3-way intersection; it would be rather disconcerting if the three lists had significantly different characters. It also makes sense that the second biggest bucket is the 2-way intersection of the old and the new HSK lists, since those have more words than the UK-A list does. It does seem a bit strange that the two 2-way intersections between the HSK lists and the UK-A list are so small, compared to the number of characters that are in only one list. Could you drill down a bit deeper and see why? Are they somewhat obscure characters that are part of a phrase or something?
Hugh Posted January 10, 2011 at 02:50 PM
Wow, this is impressive. How did you compile this? Please tell me it was a script and not done manually.
elliott50 Posted January 10, 2011 at 03:22 PM
Many thanks for your kind comments. East Asia Student - I generated the vocabulary lists using Clavis Sinica, followed by a lot of spreadsheet manipulation and then some set logic to get the split between the various categories. Gato & jbradfor - unfortunately the method I used did not generate lists of characters in the various categories, sorry.
elliott50 Posted January 11, 2011 at 08:34 AM
Looking at the UK A-Level character list again (taken from the authorised course textbooks), it contains a large number of proper names, for example 孟子 (the philosopher Mencius), which do not appear in the old or new HSK character lists. This is probably where the bulk of the 119 characters that only appear in the UK A-level list come from. Interestingly, the character 孟 is the 1740th most common character (according to Wenlin), so the compilers of the old and new HSK word lists must have consciously chosen not to include most of these common proper-name characters.
elliott50 Posted January 11, 2011 at 04:55 PM
For completeness, the Venn diagram below gives the same information for the new HSK level 5 and old HSK intermediate levels... Of the 1711 characters in the new HSK level 5, 1669 characters were in the old HSK intermediate, but 42 are entirely new. 525 characters from the old HSK intermediate level are not used in the new HSK level 5.
gato Posted January 11, 2011 at 11:50 PM
Can you do the same for the word lists? :-)
cababunga Posted January 12, 2011 at 12:22 AM
I don't know what that UK A-level list you are all talking about is, nor where people get it, but for the other two, the word counts are like this:
new - old = 522 (only in new)
old - new = 4164 (only in old)
new & old = 4473 (in both)
I probably used somewhat different versions of the lists than elliott50, as the counts for characters are different. The old list was taken from http://hskflashcards.com/ with some manual fixes, and the new one from http://lingomi.com/blog/hsk-lists-2010/.
Counts for characters:
new - old = 74
old - new = 301
new & old = 2560
Here are the actual differences:
Only in new:
丐 伦 侠 剔 吝 咀 哺 唉 啬 啰 嗨 嘈 嚏 墅 婪 宠 尬 尴 岳 峭 庇 徙 恍 恕 惮 愣 抒 擎 斟 昔 暧 曝 杖 椎 橙 欧 沐 沧 沮 浏 涮 涵 澈 濒 熨 瘾 硕 磅 篷 紊 纬 纽 绎 肴 胎 腻 舔 苟 荧 虐 裔 觅 讳 诺 谍 账 赁 迄 迸 遏 锲 飙 馅 魅
Only in old:
犁 邪 撇 笆 罗 揽 伊 侍 砌 咏 芬 泣 跺 梗 倘 钮 涝 碟 贞 槽 柠 瞥 梧 瑰 箩 刨 岭 杨 芳 玲 讹 涤 瑞 喽 稼 窿 叁 脊 蹄 拇 浆 秉 寇 腊 炊 晌 盏 珑 呐 燕 乔 哗 毙 橘 蛛 缎 呜 楞 寡 屠 瓣 榆 潦 肠 抡 苯 絮 篱 鹰 茧 韵 驴 磷 酶 灸 沏 秽 驼 竿 蕾 刃 掂 茅 袄 雀 凳 捶 剃 垒 栗 玖 桅 君 驮 蒜 瘟 俄 锡 棱 垦 巫 肝 秆 檬 疮 贱 莲 窟 冶 锹 鲸 掺 尿 禾 镁 铀 乃 菊 捅 绷 匠 蝉 珊 镑 狐 柒 阀 穗 汛 筝 糠 艾 浩 函 硫 揪 潭 屯 曰 勒 卵 拴 顷 狸 闺 硅 鹿 翠 蚁 阁 砂 梅 茄 逆 颊 锌 窑 炕 囱 钙 蜘 倚 芝 垮 龟 侄 骡 粱 枣 僚 沥 唤 斧 鸦 亩 玫 爪 芭 钳 薯 氮 碱 贰 痰 撵 榷 芹 缸 蚕 羿 雁 抠 桂 啄 菇 穆 鹊 捌 柏 屎 荔 蛙 焊 瑚 铝 卜 汞 噢 菠 秧 徽 淫 汪 蜓 凯 尼 拱 兰 柳 丹 绵 靴 婶 壹 奸 藤 妖 歼 凿 虾 刁 翁 蚂 谗 骆 冈 笋 蚊 萍 蘑 徐 弓 熔 蔗 笛 镰 萝 挟 锣 亢 淇 纱 邦 粪 锯 暮 樱 舵 榴 爹 蜻 猿 肾 豁 坊 鹅 柄 蝇 仆 囊 杏 豌 剑 槐 黒 俏 蝗 棚 姜 绞 埠 绢 凤 轧 桐 桩 寨 蹭 坯 葱 凰 铲 葵 痴 荷 闸 捻 棺 籽 轿 蛾
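Counts like "new - old", "old - new" and "new & old" are plain set operations. The sketch below shows one way they might be computed from two word lists; it is not necessarily how cababunga produced the numbers. The function names `char_set` and `compare` are made up for illustration, and the two sample lists are tiny stand-ins, not the real HSK data.

```python
def char_set(words):
    """Return the set of distinct characters used across a word list."""
    return {ch for word in words for ch in word}

def compare(new_words, old_words):
    """Split the characters of two word lists into only-new, only-old and shared."""
    new_chars, old_chars = char_set(new_words), char_set(old_words)
    return {
        "only_new": new_chars - old_chars,   # new - old
        "only_old": old_chars - new_chars,   # old - new
        "both": new_chars & old_chars,       # new & old
    }

# Tiny illustrative lists (assumption: NOT the real HSK word lists).
new_sample = ["欧洲", "硕士", "中国"]
old_sample = ["俄语", "中国", "犁"]

result = compare(new_sample, old_sample)
for name, chars in result.items():
    print(name, len(chars), " ".join(sorted(chars)))
```

Passing whole words instead of characters to the same set operators would give the word-level counts.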
renzhe Posted January 12, 2011 at 12:33 AM
Wow. Most of those are rather common.
cababunga Posted January 12, 2011 at 01:05 AM
Frequency indexes for the top 10 characters from the "Only in new" list:
873 欧
1130 伦
1190 诺
1605 纽
1788 硕
1902 唉
1910 胎
1982 账
2026 岳
2035 侠
Same for the "Only in old" list:
564 罗
670 兰
841 尼
860 杨
1004 伊
1033 俄
1095 梅
1179 君
1216 徐
1275 丹
Character frequency data from http://corpus.leeds.ac.uk/frqc/i-zh-char.num.html
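A "top 10 by frequency index" list of this kind can be produced by looking each character up in a rank table and sorting. The sketch below assumes the table has already been loaded into a dict; the helper name `top_by_frequency` is hypothetical, and the sample ranks are the few values quoted in the post rather than the full Leeds list.

```python
def top_by_frequency(chars, rank, n=10):
    """Return the n most frequent characters as (rank, char) pairs.

    Characters missing from the rank table are skipped.
    """
    ranked = [(rank[c], c) for c in chars if c in rank]
    return sorted(ranked)[:n]

# A few ranks quoted in the post; a real run would load the full table.
rank = {"欧": 873, "伦": 1130, "诺": 1190, "纽": 1605, "硕": 1788}
only_new = ["硕", "丐", "欧", "纽", "伦", "诺"]

top3 = top_by_frequency(only_new, rank, n=3)
for r, c in top3:
    print(r, c)
```

A lower index means a more common character, so an ascending sort puts the most frequent ones first.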
elliott50 Posted January 12, 2011 at 07:20 AM
Excellent work cababunga, thanks! Many of the most common characters that are "only in the old" list (e.g. 罗, 兰, 尼 & 伊) are often used to transliterate foreign words, so I suspect that the new HSK has far fewer foreign proper names in it. Other changes perhaps reflect altered geo-political realities, for example 欧洲 (Europe) is only in the new HSK, while 俄语 (the Russian language) is only in the old HSK. Or changing technology, 纽扣儿 (button, e.g. on a computer) has come in, but 犁 (to plough a field) has gone out. Or altered educational aspirations, 硕士 (master's degree) has come in, but 少先队 (young pioneer) has gone out. I'm sure that a full analysis of the differences would yield a fascinating picture of the changes in China's self-image between the publication of the two lists. Surely a PhD thesis in the making for someone...
anonymoose Posted January 12, 2011 at 07:39 AM
I don't know about the A-level exam, but considering that quite a lot of the vocabulary and characters in the HSK exams come from outside the lists, I think the lists are to be taken with a pinch of salt anyway. What is and what's not included seems rather arbitrary.
BertR Posted January 12, 2011 at 08:08 AM (edited January 12, 2011 at 10:09 AM)
I did something similar for words:
Old HSK 1 => New HSK 1 : 132
Old HSK 2 => New HSK 1 : 8
Old HSK 3 => New HSK 1 : 4
Old HSK 4 => New HSK 1 : 0

Old HSK 1 => New HSK 2 : 123
Old HSK 2 => New HSK 2 : 20
Old HSK 3 => New HSK 2 : 9
Old HSK 4 => New HSK 2 : 2

Old HSK 1 => New HSK 3 : 204
Old HSK 2 => New HSK 3 : 75
Old HSK 3 => New HSK 3 : 17
Old HSK 4 => New HSK 3 : 6

Old HSK 1 => New HSK 4 : 185
Old HSK 2 => New HSK 4 : 318
Old HSK 3 => New HSK 4 : 49
Old HSK 4 => New HSK 4 : 20

Old HSK 1 => New HSK 5 : 96
Old HSK 2 => New HSK 5 : 684
Old HSK 3 => New HSK 5 : 323
Old HSK 4 => New HSK 5 : 122

Old HSK 1 => New HSK 6 : 13
Old HSK 2 => New HSK 6 : 177
Old HSK 3 => New HSK 6 : 694
Old HSK 4 => New HSK 6 : 1289

Words in the New HSK that weren't in the old one.
... => New HSK 1 : 11
... => New HSK 2 : 8
... => New HSK 3 : 14
... => New HSK 4 : 30
... => New HSK 5 : 111
... => New HSK 6 : 347

Words in the Old HSK that aren't in the new one.
Old HSK 1 => ... : 250
Old HSK 2 => ... : 704
Old HSK 3 => ... : 1086
Old HSK 4 => ... : 2128
elliott50 Posted January 13, 2011 at 08:59 AM
Many thanks BertR, your research certainly confirms that there is no direct mapping between the old and new HSK levels, which means that studying the old HSK early level material in order to prepare for the new HSK lower level tests may not be as helpful as one might hope.
gato Posted January 13, 2011 at 09:11 AM
Old HSK 1 => New HSK 6 : 13
Old HSK 2 => New HSK 6 : 177
Old HSK 3 => New HSK 6 : 694
Old HSK 4 => New HSK 6 : 1289
Words in the Old HSK that aren't in the new one.
Old HSK 4 => ... : 2128

BertR, can you explain what these numbers mean? Why is "Old HSK 4 => New HSK 6" 1289, but the number of words in Old HSK 4 but not in new HSK is 2128?
BertR Posted January 13, 2011 at 01:10 PM
New HSK 6 here means the words in the word list of New HSK 6 except those that are already in New HSK 5. So I count only the words that are extra for that level. I did the same for the Old HSK.
Old HSK 1 is the Basic level (基础HSK, 1级-3级)
Old HSK 2 is the Elementary level (3级-5级)
Old HSK 3 is the Intermediate level (6级-8级)
Old HSK 4 is the Advanced level (高等HSK, 9级-11级)

Old HSK 4 => New HSK 6 : 1289
Means: 1289 words of the Old HSK 4 level (these are the new words for the advanced level, not in Old HSK 3) are also in the new HSK 6 level (but not in new HSK 5).

Old HSK 4 => ... : 2128
Means: 2128 words of the Old HSK 4 level (these are the new words for the advanced level, not in Old HSK 3) are not in the new HSK.

With "=>" I meant "moved to". Written mathematically, Old HSK 4 => New HSK 6 would be
Count(Intersection(new words for Old HSK 4, new words for New HSK 6)) = 1289
and Old HSK 4 => ... : 2128 would become
Count(new words for Old HSK 4 \ all words for New HSK) = 2128
with \ the set difference operator. Is this clearer?
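The counting scheme described above (words that are new at each level, then intersections and differences between old and new) can be sketched with Python sets. This is only an illustration under the stated definitions, not BertR's actual tooling: the function names `incremental` and `cross_counts` are hypothetical, and the toy data uses single letters in place of real words.

```python
def incremental(levels):
    """For word sets ordered by level, keep only the words new at each level."""
    seen, out = set(), []
    for words in levels:
        words = set(words)
        out.append(words - seen)   # drop words already counted at a lower level
        seen |= words
    return out

def cross_counts(old_levels, new_levels):
    """Return (moved, dropped, added) counts in the sense of the '=>' notation."""
    old_inc, new_inc = incremental(old_levels), incremental(new_levels)
    all_old = set().union(*old_inc)
    all_new = set().union(*new_inc)
    # Old level i => New level j: size of the intersection of the increments.
    moved = {(i + 1, j + 1): len(o & n)
             for i, o in enumerate(old_inc) for j, n in enumerate(new_inc)}
    # Old level i => ...: words in that increment absent from every new level.
    dropped = {i + 1: len(o - all_new) for i, o in enumerate(old_inc)}
    # ... => New level j: words in that increment absent from every old level.
    added = {j + 1: len(n - all_old) for j, n in enumerate(new_inc)}
    return moved, dropped, added

# Toy cumulative lists (assumption: letters stand in for words).
old_levels = [{"a", "b"}, {"a", "b", "c", "d"}]
new_levels = [{"a", "c"}, {"a", "c", "e"}]

moved, dropped, added = cross_counts(old_levels, new_levels)
print(moved)
print(dropped)
print(added)
```

Because each level is reduced to its increment first, every word is counted in exactly one old level and at most one new level, which is why the per-level numbers can be summed without double counting.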
gato Posted January 13, 2011 at 01:28 PM
I see. Thanks for the clarification. I hadn't seen "=>" used as intersection before. ;) I think the latter sets of statistics, compared against the cumulative list, are the most useful, since there is no claim that individual levels of the old vocab list would map to any particular level of the new vocab list. On second thought, your intersection data might be helpful if you can show them in a bar diagram form like this. Maybe you can rig one up.

Words in the New HSK that weren't in the old one.
... => New HSK 1 : 11
... => New HSK 2 : 8
... => New HSK 3 : 14
... => New HSK 4 : 30
... => New HSK 5 : 111
... => New HSK 6 : 347
Words in the Old HSK that aren't in the new one.
Old HSK 1 => ... : 250
Old HSK 2 => ... : 704
Old HSK 3 => ... : 1086
Old HSK 4 => ... : 2128
BertR Posted January 13, 2011 at 03:25 PM
So the intended meaning for "=>" was "moved to". You want lists such as these?

... => New HSK 1 : 11
Word | dwhyyjzx rank | internetzh rank | lcmc rank | New HSK | Old HSK
北京 253 189 319 1
不客气 10340 1
出租车 32982 1968 1
打电话 1091 4464 1
火车站 4333 3895 3014 1
哪儿 556 1490 2226 1
那儿 787 1420 3532 1
说话 1039 540 763 1
下雨 11414 3746 1
这儿 357 1114 1575 1
中国 40 81 82 1

That might take some time for all lists. I have all the building blocks ready, but currently it takes some manual work to generate these lists... When I have more time, I'll make a website that lets people retrieve these kinds of statistics (the web pages actually already exist, but they are not publicly reachable, and they still need some work before others can actually use them).
gato Posted January 13, 2011 at 03:37 PM
Hmm, I actually don't need any more statistics.... A super list combining the new and old HSK list might actually be useful. I'm not sure about the levels, though.