List of word frequency

November 3, 2003 at 04:22 PM

Does anyone know of any lists of word frequency for Chinese? The kind of thing that tells you that 的 is the most common Chinese word, 今天 is the 97th, and 麟 is the 1054th?

Note, I'm not looking for character frequency, but word frequency.

I guess an HSK vocab list would be a step in the right direction - presumably the words at the lower levels are more frequent.

Roddy

November 3, 2003 at 06:38 PM

Sorry, I don't know of a word frequency list - but do you mean that you already know of a character frequency list (on the web preferably)? If so, I'd be interested to know where.

As for the HSK, I guess you are right. But they also seem to be quite selective about the vocab they use. I don't have the list to hand, but I remember that dentist wasn't on there (although doctor was), for example.

November 3, 2003 at 06:39 PM

No, I have never seen one. I think you would be lucky to find that. It is too complex to put together and would be based too much on people's own opinions and language styles rather than researchable evidence.

As you say, there are many character frequency lists.

As long as you can master 90% of the contents of an HSK dictionary then you will be fine.

November 4, 2003 at 12:34 AM

JoH, go to Zhongwen.com and click on Character frequency under Vocabulary - it's the best I know of.

I got some clues here - but it looks like most of the stuff is for characters only and the HSK word lists are still the best bet (unless you want to do something daft like actually pay for a book).

Roddy

March 4, 2010 at 06:46 AM

Here's something:

http://sourceforge.net/projects/libtabe/

Haven't tried it myself but it says it has word frequency (not character freq.)

March 7, 2010 at 11:21 AM

wow, libtabe looks very useful!

March 7, 2010 at 03:20 PM

Here is some related stuff.

His CScanner and CWFC (Chinese Word Frequency Counter) use libtabe. For example, see this output.

March 7, 2010 at 07:46 PM

Like Roddy said, http://www.zhongwen.com has a pretty good frequency list.

March 7, 2010 at 09:25 PM

They're looking for word frequency.

March 8, 2010 at 12:40 AM

Wenlin has one. 的 is #1, 一起 is in the middle and 野 is at the end.

April 13, 2010 at 06:29 PM

Wenlin's word list seems to stop at 1,000 words. Not very comprehensive.

April 14, 2010 at 08:08 PM

Some corpus-derived data from the University of Leeds:

* A collection of Chinese corpora has links to lists from the Lancaster Corpus (top 5000 words) and a home-grown web corpus (top 50k words); and

* Large Corpora used in CTS links to a list from the Chinese Gigaword corpus (top 25k words)

April 15, 2010 at 12:14 AM

Wow, the corpora search engine in that first link is pretty cool. That will certainly come in handy some day if I want to search for real-life sentence examples without stuffing around on Google. Cheers!

April 15, 2010 at 02:37 PM

tooironic: Wow, the corpora search engine in that first link is pretty cool.

I got the English examples to work. I tried some pinyin, which worked. Is it possible to use Chinese characters in a corpus search other than simple words or characters? If so, how about an example? I always wished Google would implement something like this.

xiele,

Jim

April 15, 2010 at 11:20 PM

What do you mean? You can search by hanzi. E.g.

April 16, 2010 at 05:47 PM

Maybe the issue is that you need to put word breaks in the search terms. Unfortunately, it doesn't find a match if the words are not segmented the same way as the corpus. The corpus is segmented programmatically, so it may not be 100% perfect, either.
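To illustrate the segmentation point, here is a rough Python sketch. It is not how the Leeds search engine works internally (nothing in this thread describes that); the `find_phrase` helper and the toy token list are made up for illustration. The idea is only that a token-level search matches when the query is split the same way as the corpus, and fails otherwise.

```python
def find_phrase(corpus_tokens, query_tokens):
    """Return start indexes where query_tokens occur as consecutive tokens."""
    n = len(query_tokens)
    return [
        i
        for i in range(len(corpus_tokens) - n + 1)
        if corpus_tokens[i:i + n] == query_tokens
    ]

# Hypothetical segmentation of one sentence, as a corpus tool might produce it.
corpus = ["我", "今天", "去", "北京大学"]

# Query segmented the same way as the corpus: found.
same = find_phrase(corpus, ["北京大学"])

# Query segmented differently (北京 + 大学): not found, even though
# the characters are identical.
different = find_phrase(corpus, ["北京", "大学"])

print(same, different)
```

So if a search comes back empty, it may be worth trying a different split of the same characters.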

April 16, 2010 at 08:30 PM

Two additional lists, from A Corpus Worker's Toolkit (http://www.humnet.ucla.edu/alc/chinese/ACWT/ACWT.htm). Within the software distribution, there are two files: ldc.dic has 44,000 unique words with frequencies, from a corpus of 4.9 million words; and wordlist.txt (no frequency data).
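For anyone wanting to turn a word-plus-frequency file into the kind of ranking asked about at the top of the thread, here is a minimal Python sketch. It assumes each line is simply a word followed by a count, separated by whitespace; that layout is an assumption, not something checked against ldc.dic, so adjust the parsing to whatever the file really contains. The `rank_words` name and the sample counts are invented for illustration.

```python
def rank_words(lines):
    """Parse 'word count' lines and return {word: rank}, rank 1 = most frequent."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        counts[parts[0]] = int(parts[1])
    # Sort by descending count; ties are broken by the word itself.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

# Made-up counts, purely for illustration (assumed "word count" layout).
sample = ["今天 120", "的 9000", "", "一起 300"]
ranks = rank_words(sample)
print(ranks)
```

With a real file you would pass the open file object (opened with the right encoding) instead of the sample list.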

April 17, 2010 at 01:27 AM

Hmm. I guess my question would be what possible pedagogical uses do these corpora have for the average student?

April 17, 2010 at 03:47 PM

tooironic: What do you mean? You can search by hanzi. E.g.

For example, I can't figure out the corpora syntax for conjunctions like 不但.*而且

xiele,

Jim

April 19, 2010 at 01:28 PM

Yes, the syntax is a little quirky; much of the example syntax doesn't work. But I managed to get something out this way:

不但 . . . . . . . . . . . . 而且

That will match with a gap of up to 12 words.
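If you already have segmented text of your own, the same kind of "two words with a bounded gap" search can be sketched in Python. This only mimics the behaviour described above; it is not the corpus engine's query syntax or implementation, and `find_with_gap` and the sample sentence are hypothetical.

```python
def find_with_gap(tokens, first, second, max_gap=12):
    """Return (i, j) index pairs where tokens[i] == first, tokens[j] == second,
    and at most max_gap tokens lie between them."""
    matches = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # j - i - 1 is the number of tokens in the gap, so j may go
        # as far as i + max_gap + 1 (or the end of the token list).
        last = min(i + max_gap + 1, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if tokens[j] == second:
                matches.append((i, j))
    return matches

# Hypothetical pre-segmented sentence.
sentence = ["他", "不但", "会", "说", "中文", "，", "而且", "会", "写", "汉字"]
hits = find_with_gap(sentence, "不但", "而且")
print(hits)
```

Lowering `max_gap` tightens the search; with the sample sentence there are four tokens between the two words, so anything below 4 finds nothing.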


Posters, in thread order: roddy, JoH, beijingbooty, roddy, baisong, piasano, querido, Cactus543, querido, tooironic, mihobu, c_redman, tooironic, buzhongren, tooironic, c_redman, c_redman, tooironic, buzhongren, c_redman
