Spreadsheet of 10,000 Most Frequent Chinese Words (2397 Characters)


sparrow

The Wikipedia page and the Chen et al. (1993) paper you linked to have different frequency orderings. I'm not sure that I believe that '的' is only the 28th most frequent word in modern Chinese, as the Wikipedia list and the Excel file show, or that '了' is the 25th most frequent, as the Chen et al. (1993) paper shows. The source texts must have been quite strange to give those frequency results.

 

An interesting feature of the list on Wikipedia is that homographs are broken out into multiple entries, which implies that the meaning of each word was known when the list was compiled. This would be very hard (though not impossible) for a computer to do, and the 1993 paper only talks about word segmentation, not homographs, so its method couldn't have generated the Wikipedia list. It would be interesting to know how this list was created. There seem to be some errors though: entries 758 and 759 both show 推出 with the same meaning, and a quick count in Excel found 1189 fully duplicated entries.

 

I usually use the SUBTLEX-CH word frequency lists (Cai & Brysbaert, 2010), as I like their methodology of using subtitles to better measure word usage in modern speech. I don't really use these for studying, except to order by frequency all of the words and characters in the Chinese language scripts I write, e.g. http://hskhsk.pythonanywhere.com/radicals?hsk=16 . Maybe someday I'll get around to doing an in-depth comparison of these frequency lists, to see which words differ most between them. That might give a better idea of the advantages and disadvantages of each.

 

For comparison the first few words of each are:

  • Wikipedia: 一,在,有,个,我,不,这,了,。。。
  • Chen et al. (1993): 的,一,在,十,是,有,二,三,。。。
  • SUBTLEX-CH: 的,我,你,是,了,不,在,他,。。。
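
As a rough idea of what such a comparison could look like, here is a minimal Python sketch that ranks the words shared by two lists by how far apart their positions are. It only uses the eight-word heads quoted above, and the function names `rank_map` and `rank_differences` are made up for illustration; a real comparison would load the full lists.

```python
# First eight words of each list, copied from the comparison above.
wikipedia = ["一", "在", "有", "个", "我", "不", "这", "了"]
chen_1993 = ["的", "一", "在", "十", "是", "有", "二", "三"]
subtlex_ch = ["的", "我", "你", "是", "了", "不", "在", "他"]


def rank_map(words):
    """Map each word to its 1-based position in the list."""
    return {word: rank for rank, word in enumerate(words, start=1)}


def rank_differences(list_a, list_b):
    """Return (word, rank_a, rank_b) for shared words, largest rank gap first."""
    ranks_a, ranks_b = rank_map(list_a), rank_map(list_b)
    shared = ranks_a.keys() & ranks_b.keys()
    rows = [(w, ranks_a[w], ranks_b[w]) for w in shared]
    # Sort by descending gap; break ties on the word itself for stable output.
    return sorted(rows, key=lambda r: (-abs(r[1] - r[2]), r[0]))


for word, rank_wiki, rank_sub in rank_differences(wikipedia, subtlex_ch):
    print(word, rank_wiki, rank_sub)
```

Words that appear in only one list are ignored here, which is itself a choice: with full lists you would probably want to report those separately.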

@alanmd:

Interesting—if the list did not come from Chen et al., I wonder where it came from!

 

I noticed there are many repeat entries, and I noticed 的 is quite far down the list, which made me suspicious, so I've continued to use Routledge's frequency dictionary.

 

To be clear, did you find 1189 duplicate entries for the same word? If so, then this list of 10,000 might be rubbish. In that case, I will make a bold note in the OP about this so that people know they're getting a low-quality list and probably should look elsewhere.


I created a new column which was Simplified & Trad & pinyin & definition. I then sorted on that column, and made a column that counted consecutively identical rows as a '1', otherwise '0'. The 1189 count would have counted once-duplicated entries as '1', and twice-duplicated as '2', etc. I didn't test for the number of unique duplicates, if you know what I mean! :)
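
For anyone who wants to repeat this outside Excel, here is a small Python sketch of the same idea. The rows are invented sample data (not taken from the spreadsheet), and `duplicate_stats` is a hypothetical helper name. It reports both the number I counted (extra copies) and the number I didn't (distinct entries that are duplicated at least once).

```python
from collections import Counter

# Invented sample rows: (simplified, traditional, pinyin, definition).
rows = [
    ("推出", "推出", "tui1 chu1", "to release"),
    ("推出", "推出", "tui1 chu1", "to release"),
    ("的", "的", "de5", "possessive particle"),
    ("了", "了", "le5", "completed action"),
    ("了", "了", "le5", "completed action"),
    ("了", "了", "le5", "completed action"),
]


def duplicate_stats(rows):
    """Return (extra_copies, distinct_duplicated) for a list of row tuples."""
    counts = Counter(rows)
    # An entry appearing n times contributes n - 1 extra copies,
    # matching the "consecutive identical rows" count in the sorted sheet.
    extra_copies = sum(n - 1 for n in counts.values())
    # Number of different entries that occur more than once.
    distinct_duplicated = sum(1 for n in counts.values() if n > 1)
    return extra_copies, distinct_duplicated


extra, distinct = duplicate_stats(rows)
print(extra, distinct)  # 3 2
```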


Frequency lists depend completely on the source material that they came from.  If two different data sources were used to collect frequency information, then it's understandable that different characters will appear in different positions.  It's not an exact or precise figure.


Yes, they do depend on the source material, but they also depend on the way that words are defined. For example, some frequency lists consider 我们 to be one word, and 一个 to be one word, while others consider each to be two words. Even worse, they often use imperfect algorithms to determine word breaks. There can also be errors in the way they are constructed (like the duplicates I noticed above). As an added complication, the list mentioned above attempted to give different entries for homographs, so it's likely they were using some sort of AI or statistical technique to determine which 了 was being used in a given sentence, etc.
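
A toy Python illustration of the segmentation point: the same sentence, split by hand in two different ways, gives different word counts. Both segmentations are made up for this example and don't come from any real segmenter or corpus.

```python
from collections import Counter

# Two hand-made segmentations of the same toy sentence "我们有一个问题".
coarse = ["我们", "有", "一个", "问题"]        # 我们 and 一个 treated as single words
fine = ["我", "们", "有", "一", "个", "问题"]  # each split into two words

coarse_counts = Counter(coarse)
fine_counts = Counter(fine)

# Counter returns 0 for words it has never seen.
print(coarse_counts["我们"], fine_counts["我们"])  # 1 0
print(coarse_counts["我"], fine_counts["我"])      # 0 1
```

Scaled up to a whole corpus, that choice alone can move 我, 一 and 个 a long way up or down a list.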

 

When people say "the most common X words in Chinese are ..." as if it is something definitive, they are of course always failing to add the disclaimer (based on corpus Y, using word segmenting algorithm Z, counting words by their appearance in dictionary A, etc...)

 

If a frequency list throws up some seemingly obscure words as the most frequent, and buries some apparently common words as being less frequent, then it's perfectly valid to question the list: there could be problems with it using a not very representative corpus, or other issues that I mentioned above.
 
The main reason I was comparing the lists above was to show that the Wikipedia entry and the cited paper didn't match, so they were generated in different ways or based on different corpora.

The Wiktionary entry isn't clear at all about where this data came from. In addition, the words have both simplified and traditional forms, but it doesn't say whether the corpus was in traditional or simplified and then converted to the other form. And just one more nitpick: it is called "Mandarin Frequency List" but doesn't specify whether it's words or characters until you click on it.

There is a related conversation at https://www.forumosa.com/taiwan/viewtopic.php?f=40&t=122213. It was suggested there that the duplicates are due to words being counted by part of speech. Someone also suggested the list came from 中央研究院-現代漢語標記語料庫 (Academia Sinica Balanced Corpus of Modern Chinese).


If a frequency list throws up some seemingly obscure words as the most frequent, and buries some apparently common words as being less frequent, then it's perfectly valid to question the list: there could be problems with it using a not very representative corpus, or other issues that I mentioned above.

Yep, I was just trying to point out to people reading the thread that while frequency lists can be useful, they are not absolute.


Were they Taiwanese papers? There's suspiciously good coverage of Taiwan city names, and I think 网路 is a Taiwanese usage.*

 

Personally, for learners I'd just run off the HSK lists, perhaps going back to the older ones, which covered I think almost 9,000 words. You don't get fine-grained frequency information, but I don't see how important that is, and it'll give you more useful vocab: you're 1500 words in before this list covers 名字, and for some reason it misses both 明天 and 昨天, and even 昨日 and 明日.

 

The subtitle corpus is a good idea, but you could get some oddities: 

 

奋斗 corpus: 1 钱; 2 车; 3 房

武林外传 corpus: 1 确定; 2 一定; 3 肯定

 

*Ah, just noticed the Academia Sinica reference, so they probably were.


I agree with roddy #10, the HSK lists introduce words in a very sensible order. They won't be perfect for everyone, but no list is; you need to add words relevant to you as you need them or come across them (no way on earth I'm waiting until HSK 6 to learn 串 so that I can order 羊肉串!).

 

Here are all HSK words, grouped by level and ordered within each level by subtitle corpus frequency (which is a pretty good default order to learn them in): http://hskhsk.pythonanywhere.com/hskwords . You can hover over words for more info, or click 'expand definitions' for a massive page with info on each word inline. I also have flashcard file versions of these lists on my site.
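
The ordering itself is simple to sketch in Python. This is only an illustration of the idea, not the code behind the site: the levels and counts below are invented, and `order_by_level_then_frequency` is a hypothetical name.

```python
# Invented data for illustration only: these are NOT real HSK levels or
# real SUBTLEX-CH counts.
hsk_level = {"我": 1, "的": 1, "名字": 1, "明天": 1, "问题": 2, "确定": 3}
frequency = {"的": 1000, "我": 900, "明天": 50, "名字": 40, "确定": 30}


def order_by_level_then_frequency(levels, freq):
    """Sort words by level, then by descending frequency within each level.

    Words missing from the frequency list count as frequency 0, so they
    end up last within their level.
    """
    return sorted(levels, key=lambda w: (levels[w], -freq.get(w, 0), w))


ordered = order_by_level_then_frequency(hsk_level, frequency)
print(ordered)
```

Treating missing words as frequency 0 is just one option; you could equally list them separately so they don't get forgotten at the bottom of a level.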

 

The only problem I've found using the subtitle corpus is that it considers some compounds of two words to be words in their own right. This doesn't matter too much with the way I am using it however.


I am using it to order characters and words by frequency in the scripts that I've written and linked to above. I created, and use, flashcard files of the HSK words ordered by frequency, so I know that within each level I am learning the highest frequency words first. I also made a big wall chart of HSK 1-6 words, using the frequency lists as weights so that the highest frequency words are nearer the middle, with all words on nodes that are coloured by HSK level (it looks very pretty!). On my other HSK charts I make the more frequently occurring words at each level slightly darker, to emphasise their importance.


  • 3 years later...
  • 5 years later...
  • New Members

Hello bro, I'm truly thankful for this list. Right now I'm building an automatic translator that will include Simplified Mandarin, Traditional Mandarin, Mandarin Pinyin, Cantonese and Cantonese Pinyin. As I need a reference list for creating the code, this will be amazingly helpful!

 

Thank you!


  • 2 months later...
On 12/13/2013 at 11:04 PM, sparrow said:

Attached is the spreadsheet.

Thanks a lot! This is great stuff to look through what's missing after I'm done with the HSK 5,000 list. I'll just arrange the missing words according to my character  'leading features' and other.
