Character vs Word Frequency Comparison (Updated)

December 24, 2012 at 05:07 AM

//Updated based on non-truncated character/word list and included jpg of summary sheet

//Oops forgot to update the download link

I'm avoiding studying Chinese and decided to come up with a comparison of character and word frequencies. You always hear people saying you need to learn XX characters to read xx% of Chinese texts out there, and then other people counter that knowing only characters is useless, as they are often combined to form words with different meanings. That is, understanding 安 (character) vs 安慰 (word). So I was curious and decided to work out how many more words than characters you need to study to reach the same percentile of recognition. The summary statistics are below:

By Character/Words Known

If you know...

100

Characters: you will recognize 42.595% of all text (text in corpus, rather)

Words: you will recognize 36.587% of all text

200

Characters: you will recognize 56.063% of all text

Words: you will recognize 43.706% of all text

500

Characters: you will recognize 75.714% of all text

Words: you will recognize 54.123% of all text

1500

Characters: you will recognize 94.520% of all text

Words: you will recognize 67.977% of all text

2000

Characters: you will recognize 97.152% of all text

Words: you will recognize 71.731% of all text

3000

Characters: you will recognize 99.242% of all text

Words: you will recognize 76.993% of all text

As you can see, and as expected, knowing a given number of characters lets you recognize more of the text than knowing the same number of words (irrespective of comprehension of meaning). Word recognition is generally around 70-80% of character recognition at the same count (about 86% at the 100 mark). This comparison is a bit abstract though; much more interesting is the number of characters/words you need to know to reach a certain percentile.
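For anyone who wants to redo this outside of Excel, here is a minimal sketch of the "if you know the top N" calculation. It assumes you already have a list of raw counts, one per character or word; the function name and the toy numbers are made up for illustration and are not taken from the Leeds lists.

```python
def coverage_of_top_n(counts, n):
    """Share of all tokens covered by the n most frequent items."""
    ranked = sorted(counts, reverse=True)  # most frequent first
    total = sum(ranked)
    return sum(ranked[:n]) / total


# Hypothetical toy counts (NOT corpus data): 8 items, 100 tokens in total.
toy_counts = [50, 20, 10, 8, 5, 4, 2, 1]
print(f"{coverage_of_top_n(toy_counts, 2):.1%}")  # 70.0%
```

Running the same function once over the character counts and once over the word counts would give the two columns of the list above.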

By Percentile

If you want to recognize XX% of text...

50%

You will need to know 149 characters

You will need to know 353 words

75%

You will need to know 484 characters

You will need to know 2567 words

80%

You will need to know 614 characters

You will need to know 3838 words

90%

You will need to know 1062 characters

You will need to know 10078 words

95%

You will need to know 1570 characters

You will need to know 19054 words

99%

You will need to know 2783 characters

You will need to know 37676 words

As you can see, the difference here is much more stark. The number of words you need grows much faster than the number of characters for the same recognition level: about 2.4 times as many at 50%, and about 13.5 times as many at 99%. The data is pretty grim, but the silver lining is that the text corpus includes all manner of obscure documents that we'll never come across, so the number of characters needed to recognize non-obscure material should be much lower. Anyway, I thought it was interesting and thought I would share. I've uploaded the Excel file I used if anyone wants to play with it (check my math and formulas and/or calculate more metrics): http://sharesend.com/3050i9hv
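A rough sketch of the percentile side, in case it helps anyone checking the math. The `posted` table is copied from the "By Percentile" figures above; `items_needed` and the toy counts are hypothetical names and data for illustration, not the formulas in the spreadsheet.

```python
# Figures copied from the "By Percentile" list: percent -> (characters, words).
posted = {
    50: (149, 353),
    75: (484, 2567),
    80: (614, 3838),
    90: (1062, 10078),
    95: (1570, 19054),
    99: (2783, 37676),
}

# Words needed per character needed, at each recognition level.
ratios = {pct: words / chars for pct, (chars, words) in posted.items()}
for pct, ratio in ratios.items():
    print(f"{pct}%: {ratio:.1f} words per character")


def items_needed(counts, target):
    """Smallest number of top-ranked items whose cumulative share reaches target."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    running = 0
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= target * total:
            return rank
    return len(ranked)


# Hypothetical toy counts (NOT corpus data).
toy_counts = [50, 20, 10, 8, 5, 4, 2, 1]
print(items_needed(toy_counts, 0.75))  # 3
```

The ratio goes up at every step in the posted table, which is the "much more stark" part in numbers.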

Another interesting thing you might do is truncate the word list at 15k, 20k, or 30k words. This assumes that all words past your cutoff are esoteric words that you will either never come across or that will be irrelevant to your comprehension. It would leave you with a core of useful words and would provide metrics on how many relevant, useful words you would need to know to understand text you might actually come across.
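The truncation idea could be sketched like this. It assumes that everything past the cutoff is simply dropped and the remaining counts are treated as the whole corpus; the function name and the numbers are hypothetical.

```python
def coverage_after_truncation(counts, known, keep):
    """Coverage of the top `known` items when only the top `keep` items are counted as text."""
    ranked = sorted(counts, reverse=True)[:keep]  # drop the tail past the cutoff
    total = sum(ranked)                           # renormalise on what is left
    return sum(ranked[:known]) / total


# Hypothetical toy counts (NOT corpus data).
toy_counts = [50, 20, 10, 8, 5, 4, 2, 1]
full = coverage_after_truncation(toy_counts, known=2, keep=len(toy_counts))
cut = coverage_after_truncation(toy_counts, known=2, keep=4)
print(f"full list: {full:.1%}, truncated to 4 items: {cut:.1%}")
```

Coverage for a fixed number of known items can only go up after truncation, so how optimistic the result looks depends entirely on where you put the cutoff.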

*A brief disclaimer on methodology. I used the frequency lists (around 6800 characters and 47,400 words) from the Chinese Internet Corpus found here: http://corpus.leeds....k/query-zh.html. I didn't calculate the frequencies, only summarized them, so you'll have to check their methodology. I included all the characters and words but it's still safe to assume that as you get to rarer characters/words, the frequency calculations are proportionally more inaccurate. Also, I did a quick cleanup of the list by taking out all non-Chinese entries but I may have overlooked some. All the above should be relatively minor issues but still, caveat emptor.

December 24, 2012 at 06:09 AM

The word figure for the top percentile seems pretty low - I'd guess it'd need to be triple the word figure you listed. For reference, old HSK vocab lists had about 8,000-9,000 words at the top level.

December 24, 2012 at 07:11 AM

You were right, imron; I underestimated the effect of the long-tail data on the frequency statistics. I've corrected this by including the full lists of characters and words as opposed to truncating at 5000. The new results are updated in the original post and look much more grim...

December 24, 2012 at 09:51 AM

Haha, nice, but now I think it's probably a little too pessimistic.

December 24, 2012 at 02:36 PM

It could be argued that recognising 99% of the words in a text gives you more of an understanding of a text than recognising 99% of the characters, so they shouldn't be considered analogous. In a given sentence, you might theoretically know all the characters, but none of the words, whereas if you know all the words, you must know all the characters. I don't know what percentage of words would be equivalent to what percentage of characters.

December 24, 2012 at 05:37 PM

Knowing 100% of the words would be equivalent to knowing 100% of the characters, wouldn't it?

50%

You will need to know 149 characters

You will need to know 353 words

I was under the impression that this is not claiming that knowing 149 characters is equivalent to knowing 353 words... Because you could presumably know 353 words and know fewer than 149 characters, so long as the characters you know are extremely productive ones.
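A tiny illustration of that last point, with a made-up word list: a handful of productive characters can account for more words than there are characters.

```python
def chars_in_words(words):
    """Set of distinct characters that appear in a collection of words."""
    return {ch for word in words for ch in word}


# Hypothetical example: four known words built from only two characters.
known_words = ["人", "大", "大人", "人人"]
known_chars = chars_in_words(known_words)
print(len(known_words), "words,", len(known_chars), "characters")  # 4 words, 2 characters
```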

December 25, 2012 at 04:01 AM

I'm a bit skeptical of this. I know at least 200 characters and I'm nowhere near the point where I can recognize even half of the texts I'm coming across. And I'm not even talking about literature either. I'm talking about advertisements and the newspaper; I think even comic books for children require a lot more characters.

Unless of course you're talking about ignoring the most frequently used characters and focusing on the ones that carry more meaning, which are generally the less commonly used characters and words. The most commonly used characters are generally not the ones that carry the meaning.

December 25, 2012 at 07:07 AM

The percentage figure is not how much you understand; it's how many of the characters or words you encounter you will recognise. Imagine someone reading an English text who could recognise 50% of the words in it: how well do you think they would understand it?

December 25, 2012 at 09:48 AM

I'm a bit skeptical of this. I know at least 200 characters and I'm nowhere near the point where I can recognize even half of the texts I'm coming across

You'd be surprised. 50% is a lot lower than you might think. For example, if I remove 50% of the words from the above quote, keeping only the higher frequency ones, you are left with something like:

I'm a ***** ******* of this. I ****** at ****** 200 ****** and I'm ****** ****** the ******* ****** I can ****** ****** ****** of the ***** I'm ***** ***.

As you can see, knowing even a small number of high frequency words allows you to 'understand' 50% of the words in the text, but it's still next to useless for understanding the actual meaning.
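If you want to try this on other text, a rough sketch follows. The set of "known" words here is hand-picked for the example rather than taken from a real frequency list, so it won't reproduce the masking above exactly.

```python
import re


def mask_unknown_words(text, known_words):
    """Replace every word not in known_words with asterisks of the same length."""
    def mask(match):
        word = match.group(0)
        return word if word.lower() in known_words else "*" * len(word)

    # Only runs of letters/apostrophes are treated as words; digits are left alone.
    return re.sub(r"[A-Za-z']+", mask, text)


# Hypothetical "known" set, chosen by hand for illustration.
known = {"i", "at"}
print(mask_unknown_words("I know at least 200 characters", known))
# I **** at ***** 200 **********
```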

December 25, 2012 at 10:29 AM

Hmm, is it not possible to just click to reply?

Anyways, I'm not really sure there is a relationship like that. Yes, I probably shouldn't have said understand, because in any language that's not something that makes sense in this context.

But, the answer you get is very different depending upon the specifics. I've noticed that it seems to have less to do with the number of characters I know than the number of radicals I know. And moreover, it seems to have far more to do with the amount of time and exposure I have to characters in context. And the amount of time I've spent looking very closely at characters trying to figure out how to look them up in a dictionary.

I don't know, perhaps that's it, trying to learn characters in isolation is so backwards, that I can't fathom how one would be recognizing characters elsewhere.

December 25, 2012 at 03:26 PM

Anyways, I'm not really sure there is a relationship like that.

No, it is exactly like that. According to this page, you can recognize fully 33% of the words you come across in any random sample of English by knowing only the 25 most common words. But if that's all you know, you won't actually understand anything.

I've noticed that it seems to have less to do with the number of characters I know than the number of radicals I know.

What?

December 25, 2012 at 07:18 PM

A key related concept is the Lexical Comprehension Percentage, that is, the percentage of words which are confidently understood in a given text.

I find that extensive reading becomes comfortable around 98% LCP. The remaining 2% only count as in-context exposures, which will be learned after a few more exposures, perhaps within the same few lines of text.

I count a lexeme (a word) as comprehended even if I have never seen it before, so long as I know the morphemes (characters) and the meaning is clear from context.
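As a rough sketch, that counting rule could look like the following. It assumes the text has already been segmented into words, and it leaves out the "meaning is clear from context" part, which only the reader can judge; all names and sample data are hypothetical.

```python
def lexical_comprehension(words, known_words, known_chars):
    """Fraction of words that are known outright or made up entirely of known characters."""
    if not words:
        return 0.0
    understood = sum(
        1
        for word in words
        if word in known_words or all(ch in known_chars for ch in word)
    )
    return understood / len(words)


# Hypothetical pre-segmented text and knowledge sets.
text_words = ["我", "喜歡", "鱗甲", "惡龍"]
known_words = {"我", "喜歡"}
known_chars = set("我喜歡甲惡龍")  # 鱗 is not known
print(lexical_comprehension(text_words, known_words, known_chars))  # 0.75
```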

Chinese learning is especially interesting because of the diversity of the morphemes and the richness of their structure. I find that it is especially easy to guess word meanings in Chinese in contrast to, say, the Romance languages.

I believe the Chinese language learner is best served by a mixed approach of focused character learning and contextual word learning. Learning ~2500 individual characters is a great base that makes word acquisition 朝飯前.

December 25, 2012 at 10:07 PM

I find that extensive reading becomes comfortable around 98% LCP. The remaining 2% only count as in-context exposures, which will be learned after a few more exposures, perhaps within the same few lines of text.

This may be true for phonetically written languages, but not for Chinese, as with extensive reading alone you won't learn the pronunciation (though you may get a hint). Extensive reading in Chinese will teach you only half of the character: the meaning, not the pronunciation.

December 25, 2012 at 10:53 PM

Good point, Silent. You don't get to learn the spoken language as a side-effect of extensive reading for Chinese.

Luckily, once you have a good base of characters, you can easily guess at the pronunciation. I ran across 鱗甲 the other day in the context of an 惡龍. The meaning was clear from context and the 甲 character. Although I am not certain that 鱗 is pronounced like 鄰居的鄰, I'm satisfied with just guessing for now. Sure, I'm not certain, but it's close enough to recognize the word later in spoken language.

December 25, 2012 at 11:55 PM

I find that extensive reading becomes comfortable around 98% LCP.

There was a discussion a while back that talked about this very point. The video linked to in that thread is very good. There was also another thread here talking about word frequency and understanding.

In my own experience, I would say that I'm inclined to agree with this figure. Even though as Silent mentioned you don't get pronunciation, if you're reading at 98% comprehension, many of the new words you come across will be made up of characters you already know, and even when they are not, you can still at least learn the meaning from context and exposure. You don't get pronunciation, but depending on what you're reading, you may or may not find that to be important.

December 25, 2012 at 11:56 PM

No, it is exactly like that. According to this page, you can recognize fully 33% of the words you come across in any random sample of English by knowing only the 25 most common words. But if that's all you know, you won't actually understand anything.

That's subject to a ton of caveats, and it definitely does not reflect my experience with the language. I look at newspaper articles and advertisements and I'm recognizing maybe 1/10 of the characters and words.

Radicals, you know, the things that are actually used for classifying words. If you aren't paying attention to the radicals, then why on earth would you expect to remember any of those characters? Or write them, for that matter? They're in the characters to give you hints about the meaning and the pronunciation. And without them you're at best rote memorizing characters. Even just memorizing them is a lot harder if you're not paying attention to them.

Trust me, if you're treating characters as whole characters you're going to take a much longer time in learning to read.

December 26, 2012 at 01:31 AM

Radicals, you know, the things that are actually used for classifying words. If you aren't paying attention to the radicals, then why on earth would you expect to remember any of those characters? Or write them, for that matter? They're in the characters to give you hints about the meaning and the pronunciation. And without them you're at best rote memorizing characters. Even just memorizing them is a lot harder if you're not paying attention to them.

Trust me, if you're treating characters as whole characters you're going to take a much longer time in learning to read.

I had a feeling you were using the word incorrectly. Radicals are simply the 214 (give or take, depending on which system you're talking about) components used as section headings in dictionaries. Hence the Chinese term 部首, which literally means section heading. You're talking about character components, not radicals. Components can be semantic, phonetic, pictographic, etc., and understanding how those work can indeed help with learning characters. Knowing radicals can only help you make a decent guess as to where to find the character in a dictionary.

I can also guarantee that the radicals aren't there to "give you hints about the meaning and pronunciation." It's an artificial system created as a way of indexing characters in a dictionary long after the writing system had already developed, by people who didn't fully understand how the writing system developed (for instance, the 說文解字 is wrong most of the time, I've seen claims of up to 90%), not something that is all that useful for explaining how characters actually work.

December 26, 2012 at 02:16 AM

I mentioned LCP, but I wonder if it's possible to define a semantic comprehension percentage. That is, the amount of the meaning of the text that you understand. That would be 0% in the case when your LCP is down around 30% for example.

December 26, 2012 at 03:18 AM

Totally off-topic but OneEye can I get some reference links for #17?

December 26, 2012 at 04:05 AM

Well, there's some in this Wikipedia article, but most of what I know of this I've read in books, not online, or I've gotten from lectures I've audited and conversations with PhD students. Anyway, 《文字學概要》 by 裘錫圭 is an excellent introduction, and there's an English translation of it by Jerry Norman and Gilbert Mattos called Chinese Writing. The claim of 許慎 being wrong about 90% of the characters in 說文 was made (if I remember correctly) by 杜忠誥 in his excellent 《說文篆文訛形釋例》. William Boltz's The Origin and Early Development of the Chinese Writing System is also highly recommended, but I haven't read much of it yet.


howlingfantods

imron

howlingfantods

imron

li3wei1

陳德聰

hedwards

li3wei1

imron

hedwards

OneEye

navaburo

Silent

navaburo

imron

hedwards

OneEye

navaburo

陳德聰

OneEye
