roddy Posted November 3, 2003 at 04:22 PM
Does anyone know of any word frequency lists for Chinese? The kind of thing that tells you that 的 is the most common Chinese word, 今天 is the 97th, and 麟 is the 1054th? Note, I'm not looking for character frequency, but word frequency. I guess an HSK vocab list would be a step in the right direction - presumably the words at the lower levels are more frequent.
Roddy
JoH Posted November 3, 2003 at 06:38 PM
Sorry, I don't know of a word frequency list - but do you mean that you already know of a character frequency list (on the web preferably)? If so, I'd be interested to know where. As for the HSK, I guess you are right. But they also seem to be quite selective about the vocab they use. I don't have the list to hand, but I remember that "dentist" wasn't on there (although "doctor" was), for example.
beijingbooty Posted November 3, 2003 at 06:39 PM
No, I have never seen one. I think you would be lucky to find that. It is too complex to put together and would be based too much on people's own opinions and language styles rather than researchable evidence. As you say, there are many character frequency lists. As long as you can master 90% of the contents of an HSK dictionary, you will be fine.
roddy Posted November 4, 2003 at 12:34 AM
JoH, go to Zhongwen.com and click on Character frequency under Vocabulary - it's the best I know of. I got some clues here - but it looks like most of the stuff is for characters only and the HSK word lists are still the best bet (unless you want to do something daft like actually pay for a book).
Roddy
baisong Posted March 4, 2010 at 06:46 AM
Here's something: http://sourceforge.net/projects/libtabe/ Haven't tried it myself, but it says it has word frequency (not character frequency).
piasano Posted March 7, 2010 at 11:21 AM
Wow, libtabe looks very useful!
querido Posted March 7, 2010 at 03:20 PM
Here is some related stuff. His CScanner and CWFC (Chinese Word Frequency Counter) use libtabe. For example, see this output.
Cactus543 Posted March 7, 2010 at 07:46 PM
Like Roddy said, http://www.zhongwen.com has a pretty good frequency list.
querido Posted March 7, 2010 at 09:25 PM
They're looking for word frequency.
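To illustrate the distinction being drawn here, a minimal Python sketch, assuming the text has already been segmented into words. The sample sentence, its segmentation, and the function names are made up for illustration; none of this comes from libtabe, CWFC, or Zhongwen.com.

```python
from collections import Counter

# Hypothetical pre-segmented sample; real text would need a segmenter first.
words = ["我", "今天", "去", "学校", "，", "他", "今天", "也", "去", "学校"]
PUNCTUATION = {"，"}

def word_frequency(tokens):
    """Count whole segmented words, skipping punctuation."""
    return Counter(t for t in tokens if t not in PUNCTUATION)

def char_frequency(tokens):
    """Count individual characters, ignoring word boundaries."""
    return Counter(ch for t in tokens if t not in PUNCTUATION for ch in t)

wf = word_frequency(words)
cf = char_frequency(words)
print(wf.most_common(3))
print(cf.most_common(3))
```

A word list has an entry for 今天 as a unit, while a character list only has separate entries for 今 and 天.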
tooironic Posted March 8, 2010 at 12:40 AM
Wenlin has one. 的 is #1, 一起 is in the middle and 野 is at the end.
mihobu Posted April 13, 2010 at 06:29 PM
Wenlin's word list seems to stop at 1,000 words. Not very comprehensive.
c_redman Posted April 14, 2010 at 08:08 PM
Some corpus-derived data from the University of Leeds:
* A collection of Chinese corpora has links to lists from the Lancaster Corpus (top 5000 words) and a home-grown web corpus (top 50k words); and
* Large Corpora used in CTS links to a list from the Chinese Gigaword corpus (top 25k words)
tooironic Posted April 15, 2010 at 12:14 AM
Wow, the corpora search engine in that first link is pretty cool. That will certainly come in handy some day if I want to search for real-life sentence examples without stuffing around on Google. Cheers!
buzhongren Posted April 15, 2010 at 02:37 PM
tooironic: Wow, the corpora search engine in that first link is pretty cool.
I got the English examples to work. I tried some pinyin, which worked. Is it possible to use Chinese characters in a corpus search for anything other than simple words or characters? If so, how about an example? I always wished Google would implement something like this.
xiele, Jim
tooironic Posted April 15, 2010 at 11:20 PM
What do you mean? You can search by hanzi. E.g.
c_redman Posted April 16, 2010 at 05:47 PM
Maybe the issue is that you need to put word breaks in the search terms. Unfortunately, it doesn't find a match if the words are not segmented the same way as the corpus. The corpus is segmented programmatically, so it may not be 100% perfect, either.
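The segmentation problem described above can be sketched in Python. This is only an illustration under stated assumptions: the corpus segmentation below is invented, and `find_phrase` is a hypothetical helper, not the Leeds search engine's own matching logic.

```python
def find_phrase(corpus_tokens, query_tokens):
    """Return start indexes where query_tokens occur as a contiguous run of tokens."""
    n = len(query_tokens)
    return [
        i
        for i in range(len(corpus_tokens) - n + 1)
        if corpus_tokens[i:i + n] == query_tokens
    ]

# Hypothetical corpus segmentation of 我喜欢北京大学
corpus = ["我", "喜欢", "北京大学"]

# Query segmented the same way as the corpus: found at index 1.
same = find_phrase(corpus, ["喜欢", "北京大学"])

# Same characters, different word breaks: no match.
different = find_phrase(corpus, ["喜欢", "北京", "大学"])

print(same, different)
```

Because matching happens token by token, a query has to use the same word breaks as the corpus even when the underlying characters are identical.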
c_redman Posted April 16, 2010 at 08:30 PM
Two additional lists, from A Corpus Worker's Toolkit (http://www.humnet.ucla.edu/alc/chinese/ACWT/ACWT.htm). Within the software distribution, there are two files: ldc.dic has 44,000 unique words with frequencies, from a corpus of 4.9 million words; and wordlist.txt (no frequency data).
tooironic Posted April 17, 2010 at 01:27 AM
Hmm. I guess my question would be: what possible pedagogical uses do these corpora have for the average student?
buzhongren Posted April 17, 2010 at 03:47 PM
tooironic: What do you mean? You can search by hanzi. E.g.
For example, I can't figure out the corpora syntax for conjunctions like 不但.*而且
xiele, Jim
c_redman Posted April 19, 2010 at 01:28 PM
Yes, the syntax is a little quirky; much of the example syntax doesn't work. But I managed to get something out this way:
不但 . . . . . . . . . . . . 而且
That will match with a gap of up to 12 words.
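For anyone who wants to run the same kind of search on their own segmented text, here is a rough Python sketch of the behaviour described above: 不但 followed by 而且 with at most 12 words in between. The function name, the `max_gap` parameter, and the sample sentence are assumptions for illustration; this does not reproduce the corpus engine's actual query language.

```python
def match_with_gap(tokens, first, second, max_gap=12):
    """Return (i, j) index pairs where `second` follows `first`
    with at most `max_gap` tokens between them (nearest match only)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # `second` may sit anywhere from i + 1 (gap 0) to i + 1 + max_gap.
        last = min(i + 2 + max_gap, len(tokens))
        for j in range(i + 1, last):
            if tokens[j] == second:
                hits.append((i, j))
                break
    return hits

# Hypothetical segmented sentence: 他不但会说中文，而且会写汉字
tokens = ["他", "不但", "会", "说", "中文", "，", "而且", "会", "写", "汉字"]

default_hits = match_with_gap(tokens, "不但", "而且")
narrow_hits = match_with_gap(tokens, "不但", "而且", max_gap=3)
print(default_hits, narrow_hits)
```

Here the gap between 不但 and 而且 is four tokens, so the default limit of 12 finds the pair and a limit of 3 does not.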