sparrow Posted December 13, 2013 at 07:04 PM (edited)
Edit: Please read reply #3 by alanmd on this topic. The Wikipedia frequency list is apparently not what it claims to be and repeats words inappropriately.
Edit: Uploaded a corrected copy of the spreadsheet. There was a small error.
Using spreadsheet formulas, I was able to pull apart the Mandarin word frequency list found on Wikipedia.
Wikipedia Source
PDF File Discussing Methodology (Chen, Tseng, et al.)
According to the above PDF, the list comes from a 14-million-character corpus of Chinese newspapers dating from 1993 or earlier.
Attached is the spreadsheet. It contains Simplified, Traditional, Pinyin, and English. The comment in the top-left cell contains the RAND() formula, which can be used to sort groups of characters randomly, essentially shuffling them. They can be put back in order by sorting by entry number.
If people want info on how I personally use this kind of list, let me know and I'll do a write-up.
Statistics
Word Set       Characters in Set
0001–2500      1119
0001–5000      1658
0001–7500      2048
0001–10,000    2397
Mandarin_10000_Word_Frequency_List.xls
Edited December 15, 2013 at 09:25 AM by sparrow
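For readers who would rather not rely on spreadsheet formulas, the "Characters in Set" figures in the opening post can be approximated with a short script. This is a minimal sketch, not sparrow's actual method: it assumes the Simplified column has been exported to a Python list in rank order, and the function name `distinct_characters` and the ten-word sample are made up for illustration.

```python
def distinct_characters(words, cutoffs):
    """Map each cutoff n to the number of distinct characters in the first n words."""
    return {n: len(set("".join(words[:n]))) for n in cutoffs}


# Hypothetical sample in rank order; a real run would load all 10,000 words.
sample = ["一", "在", "有", "个", "我", "不", "这", "了", "我们", "一个"]
print(distinct_characters(sample, [5, 10]))  # {5: 5, 10: 9}
```

On the full list, calling it with the cutoffs 2500, 5000, 7500 and 10000 should give numbers comparable to the Statistics table, provided the export matches the spreadsheet.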
戴 睿 Posted December 13, 2013 at 08:30 PM
I would love to read a bit more about how you work this sort of list into your language study. I look forward to reading a write-up of that nature!
alanmd Posted December 14, 2013 at 11:14 PM
The Wikipedia page and the Chen et al. (1993) paper you linked to have different frequency orderings. I'm not sure that I believe that '的' is only the 28th most frequent word in modern Chinese, as the Wikipedia list and the Excel file show, or that '了' is the 25th most frequent, as the Chen et al. (1993) paper shows. The source texts must be quite strange to give those frequency results.
An interesting feature of the list on Wikipedia is that homographs are broken out into multiple entries, which implies that the meaning of each word was known when the list was compiled. This would be very hard (though not impossible) for a computer to do, and the 1993 paper only talks about word segmentation, not homographs, so its method wouldn't have been able to generate the Wikipedia list. It would be interesting to know how this list was created. There seem to be some errors, though: entries 758 and 759 both show 推出 with the same meaning. I did a quick count in Excel and found 1189 fully duplicated entries.
I usually use the SUBTLEX-CH word frequency lists (Cai & Brysbaert, 2010), as I like their methodology of using subtitles to better measure word usage in modern speech. I don't really use these for studying, except to frequency-order all of the words and characters in the Chinese language scripts I write, e.g. http://hskhsk.pythonanywhere.com/radicals?hsk=16 .
Maybe someday I'll get around to doing an in-depth comparison of these frequency lists and try to see which words differ most between them. That might help to get a better idea of the advantages and disadvantages of each. For comparison, the first few words of each are:
Wikipedia: 一,在,有,个,我,不,这,了,。。。
Chen et al. (1993): 的,一,在,十,是,有,二,三,。。。
SUBTLEX-CH: 的,我,你,是,了,不,在,他,。。。
sparrow Posted December 15, 2013 at 06:37 AM
@alanmd: Interesting. If the list did not come from Chen et al., I wonder where it came from! I noticed there are many repeat entries, and I noticed 的 is quite far down the list, which made me suspicious, so I've continued to use Routledge's frequency dictionary.
To be clear, did you find 1189 duplicate entries for the same word? If so, then this list of 10,000 might be rubbish. In that case, I will make a bold note in the OP about this so that people know they're getting a low-quality list and probably should look elsewhere.
alanmd Posted December 15, 2013 at 06:49 AM
I created a new column which was Simplified & Trad & pinyin & definition. I then sorted on that column, and made a column that flagged each row identical to the row above it as '1', otherwise '0'. The 1189 count is the sum of those flags, so an entry duplicated once contributes 1, an entry duplicated twice contributes 2, etc. I didn't test for the number of unique duplicates, if you know what I mean!
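The spreadsheet steps alanmd describes translate fairly directly to code. The sketch below assumes each entry is a (simplified, traditional, pinyin, definition) tuple; the function name, the sample rows and their glosses are invented for illustration. It also reports the number of distinct duplicated entries, which the post says was not checked.

```python
from collections import Counter


def count_duplicates(rows):
    """Return (repeat_rows, distinct_duplicated_entries) for a list of row tuples."""
    keys = sorted(tuple(row) for row in rows)
    # Spreadsheet-style check: flag a row when it equals the row above it.
    repeat_rows = sum(1 for prev, cur in zip(keys, keys[1:]) if prev == cur)
    # How many different entries occur more than once.
    distinct_duplicated = sum(1 for n in Counter(keys).values() if n > 1)
    return repeat_rows, distinct_duplicated


# Made-up rows: one entry appears twice, another three times.
rows = [
    ("推出", "推出", "tuīchū", "to release"),
    ("了", "了", "le", "completed-action particle"),
    ("推出", "推出", "tuīchū", "to release"),
    ("了", "了", "le", "completed-action particle"),
    ("的", "的", "de", "possessive particle"),
    ("了", "了", "le", "completed-action particle"),
]
print(count_duplicates(rows))  # (3, 2)
```

The first number corresponds to what the 1189 figure measures; the second is smaller whenever some entry appears three or more times.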
imron Posted December 15, 2013 at 07:08 AM
Frequency lists depend completely on the source material that they came from. If two different data sources were used to collect frequency information, then it's understandable that different characters will appear in different positions. It's not an exact or precise figure.
alanmd Posted December 15, 2013 at 07:33 AM
Yes, they do depend on the source material, but they also depend on the way that words are defined. For example, some frequency lists consider 我们 to be one word, and 一个 to be one word, while others consider each to be two words. Even worse, they often use imperfect algorithms to determine word breaks. There can also be errors in the way they are constructed (like the duplicates I noticed above), and as an added complication the list mentioned above attempted to give different entries for homographs, so it's likely they were using some sort of AI or statistical technique to determine which 了 was being used in a given sentence, etc.
When people say "the most common X words in Chinese are ..." as if it is something definitive, they are of course always failing to add the disclaimer (based on corpus Y, using word segmenting algorithm Z, counting words by their appearance in dictionary A, etc...)
If a frequency list throws up some seemingly obscure words as the most frequent, and buries some apparently common words as being less frequent, then it's perfectly valid to question the list: there could be problems with it using a not very representative corpus, or other issues that I mentioned above. The main reason I was comparing the lists above was to show that the Wikipedia entry and the cited paper didn't match, so they were generated in different ways or based on different corpora.
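The point about word definitions can be seen on a toy example. The sketch below counts the same two short sentences under two hand-made segmentations. Both token lists are hypothetical and are not the output of any real segmenter; they only illustrate how the choice of word boundaries changes what ends up in a frequency list.

```python
from collections import Counter

# The same text, 我们有一个问题 / 我们有一个办法, segmented two ways by hand.
coarse = ["我们", "有", "一个", "问题", "我们", "有", "一个", "办法"]
fine = ["我", "们", "有", "一", "个", "问题", "我", "们", "有", "一", "个", "办法"]

coarse_counts = Counter(coarse)
fine_counts = Counter(fine)

print(coarse_counts["我们"], fine_counts["我们"])  # 2 0
print(coarse_counts["我"], fine_counts["我"])      # 0 2
```

Under the coarse segmentation 我们 is a frequent word; under the fine one it never appears at all, even though the underlying text is identical.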
c_redman Posted December 15, 2013 at 02:04 PM
The Wiktionary entry isn't clear at all about where this data came from. In addition, the words have both simplified and traditional forms, but it doesn't say whether the corpus was traditional or simplified and then converted to the other form. And just one more nitpick: it is called "Mandarin Frequency List" but doesn't specify whether it's words or characters until you click on it.
There is a related conversation at https://www.forumosa.com/taiwan/viewtopic.php?f=40&t=122213. It was suggested that the duplicates are due to words being counted by part of speech. Someone also suggested the list came from 中央研究院-現代漢語標記語料庫 Academia Sinica Balanced Corpus of Modern Chinese
imron Posted December 15, 2013 at 08:47 PM
Quoting alanmd: "If a frequency list throws some seemingly obscure words as the most frequent, and buries some apparently common words as being less frequent, then it's perfectly valid to question the list- there could be problems with it using a not very representative corpus, or other issues that I mentioned above."
Yep, I was just trying to point out to people reading the thread that while frequency lists can be useful, they are not absolute.
roddy Posted December 15, 2013 at 10:00 PM
Were they Taiwanese papers? There's suspiciously good coverage of Taiwan city names, and I think 网路 is a Taiwanese usage.*
Personally, for learners I'd just run off the HSK lists, perhaps going back to the older ones, which covered I think almost 9,000 words. You don't get fine-grained frequency information, but I don't see how important that is, and it'll give you more useful vocab: you're 1500 words in before this covers 名字, and for some reason it misses both 明天 and 昨天, and even 昨日,明日.
The subtitle corpus is a good idea, but you could get some oddities:
奋斗 corpus: 1 钱; 2 车; 3 房
武林外传 corpus: 1 确定; 2 一定; 3 肯定
*Ah, just noticed the Academia Sinica reference, probably was.
alanmd Posted December 15, 2013 at 10:27 PM
I agree with roddy #10, the HSK lists introduce words in a very sensible order. They won't be perfect for everyone, but no list is; you need to add words relevant to you as you need or come across them (no way on earth I'm waiting until HSK 6 to learn 串 so that I can order 羊肉串!).
Here are all HSK words, grouped by level and ordered within each level by subtitle corpus frequency (which is a pretty good default ordering to learn them in): http://hskhsk.pythonanywhere.com/hskwords . You can hover over words for more info, or click 'expand definitions' for a massive page with info on each word inline. I also have flashcard file versions of these lists on my site.
The only problem I've found using the subtitle corpus is that it considers some compounds of two words to be words in their own right. This doesn't matter too much with the way I am using it, however.
sparrow Posted December 15, 2013 at 10:34 PM
@戴睿长老 #2: I made a new topic explaining how I use a spreadsheet to study. This is the link.
@alanmd #11: I'm curious: How are you using the SUBTLEX-CH corpus?
alanmd Posted December 16, 2013 at 01:11 AM
I am using it to order characters and words by frequency in the scripts that I've written and linked to above. I created flashcard files of the HSK words ordered by frequency, so I know that within each level I am learning the highest-frequency words first.
I also made a big wall chart of HSK 1-6 words, using the frequency lists as weights so that the highest-frequency words are nearer the middle, with all words on nodes that are coloured by HSK level (it looks very pretty!). On my other HSK charts I make more frequently occurring words at each level slightly darker, to emphasise their importance.
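Ordering words by HSK level and then by corpus frequency, as described in the post above, comes down to a two-part sort key. A minimal sketch follows, assuming two dictionaries: word to HSK level, and word to corpus count. This is not alanmd's actual script; the function name is hypothetical and the counts are invented for illustration, not SUBTLEX-CH values.

```python
def order_by_level_then_frequency(hsk_level, frequency):
    """Sort words by HSK level (ascending), then by corpus count (descending)."""
    # Words missing from the frequency table count as 0 and sort last in their level.
    return sorted(hsk_level, key=lambda w: (hsk_level[w], -frequency.get(w, 0)))


# Illustrative levels and made-up counts.
hsk_level = {"他": 1, "串": 6, "是": 1, "我": 1, "你": 1}
frequency = {"我": 50, "你": 40, "是": 30, "他": 10, "串": 1}

print(order_by_level_then_frequency(hsk_level, frequency))
# ['我', '你', '是', '他', '串']
```

Because the level comes first in the key, a rare HSK 1 word still sorts ahead of a common HSK 6 word; frequency only decides the order inside a level.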
Daedalus Posted September 23, 2017 at 06:51 AM
Thanks, Sparrow. Very useful for determining which synonyms I should memorise first.
xiaobaibai Posted October 3, 2022 at 04:12 PM
Hello bro, I'm truly thankful for this list. Right now I'm building an automatic translator that will include Simplified Mandarin, Traditional Mandarin, Mandarin Pinyin, Cantonese and Cantonese Pinyin. As I need a reference list for creating the code, this will be amazingly helpful! Thank you!
Pall Posted December 8, 2022 at 04:49 AM
On 12/13/2013 at 11:04 PM, sparrow said: "Attached is the spreadsheet."
Thanks a lot! This is great stuff for looking through what's missing after I'm done with the HSK 5,000 list. I'll just arrange the missing words according to my character 'leading features' and other criteria.