Character vs Word Frequency Comparison (Updated)


Posted

The word figure for the top percentile seems pretty low - I'd guess it'd need to be triple the word figure you listed. For reference, old HSK vocab lists had about 8,000-9,000 words at the top level.

Posted

You were right, imron; I underestimated the effect of the long-tail data on the frequency statistics. I've corrected this by including the full lists of characters and words instead of truncating at 5000. The updated results are in the original post and look much more grim...

Posted

It could be argued that recognising 99% of the words in a text gives you more of an understanding of a text than recognising 99% of the characters, so they shouldn't be considered analogous. In a given sentence, you might theoretically know all the characters, but none of the words, whereas if you know all the words, you must know all the characters. I don't know what percentage of words would be equivalent to what percentage of characters.

Posted

Knowing 100% of the words would be equivalent to knowing 100% of the characters, wouldn't it? :P

50%

You will need to know 149 characters

You will need to know 353 words

I was under the impression that this is not claiming that knowing 149 characters is equivalent to knowing 353 words... Because you could presumably know 353 words and know fewer than 149 characters, so long as the characters you know are extremely productive ones.
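
For anyone who wants to check figures like the ones quoted above, here is a minimal Python sketch of how a coverage number can be computed from frequency counts. The function name and the toy text are made up for illustration, and this is not necessarily how the numbers in the original post were produced.

```python
from collections import Counter


def items_needed_for_coverage(counts, target):
    """Return how many of the most frequent items are needed so that
    their occurrences make up at least `target` (0..1) of all occurrences."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk the frequencies from most to least common, accumulating coverage.
    for n, freq in enumerate(sorted(counts.values(), reverse=True), start=1):
        covered += freq
        if covered / total >= target:
            return n
    return len(counts)


# Toy text (hypothetical); a real run would use a large corpus.
text = "我是學生我是老師我們是朋友"
char_counts = Counter(text)  # counts individual characters
print(items_needed_for_coverage(char_counts, 0.5))
```

The same function works for words if you pass in word counts from a segmented corpus. Note that it only says how many items cover a given share of running text; it says nothing about which characters make up those words.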

Posted

I'm a bit skeptical of this. I know at least 200 characters and I'm nowhere near the point where I can recognize even half of the texts I'm coming across. And I'm not even talking about literature. I'm talking about advertisements and the newspaper; I think even comic books for children require a lot more characters.

Unless of course you're talking about ignoring the most frequently used characters and focusing on ones which carry more meaning. Generally the less commonly used characters and words. The most commonly used characters are generally not the ones that carry the meaning.

Posted

The percentage figure is not how much you understand, it's how many of the characters or words you encounter you will recognise. So imagine if someone reading something in English could recognise 50% of the words in something, how well do you think they would understand it?

Posted
I'm a bit skeptical of this. I know at least 200 characters and I'm nowhere near the point where I can recognize even half of the texts I'm coming across

You'd be surprised. 50% is a lot lower than you might think. For example, if I remove 50% of the words from the above quote, keeping only the higher-frequency ones, you are left with something like:

I'm a ***** ******* of this. I ****** at ****** 200 ****** and I'm ****** ****** the ******* ****** I can ****** ****** ****** of the ***** I'm ***** ***.

As you can see, knowing even a small number of high frequency words allows you to 'understand' 50% of the words in the text, but it's still next to useless for understanding the actual meaning.
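
The masking above can be reproduced with a few lines of Python. This is only a sketch: the `known` set below is a hypothetical list of high-frequency words picked by hand, not taken from any real frequency list, and the function names are invented for the example.

```python
import re

# Matches English words, including contractions such as "I'm".
WORD = re.compile(r"[A-Za-z']+")


def mask_unknown(text, known):
    """Replace every word not in `known` (a set of lowercase words)
    with asterisks of the same length. Digits and punctuation are kept."""
    def repl(match):
        word = match.group(0)
        return word if word.lower() in known else "*" * len(word)
    return WORD.sub(repl, text)


def token_coverage(text, known):
    """Fraction of the words in `text` that appear in `known`."""
    words = WORD.findall(text)
    if not words:
        return 0.0
    return sum(w.lower() in known for w in words) / len(words)


# Hypothetical set of "known" high-frequency words.
known = {"i'm", "i", "a", "of", "this", "at", "and", "the", "can"}
quote = ("I'm a bit skeptical of this. I know at least 200 characters "
         "and I'm nowhere near the point where I can recognize even "
         "half of the texts I'm coming across.")
print(mask_unknown(quote, known))
print(token_coverage(quote, known))
```

Comparing the coverage number with how readable the masked text is makes the point concrete: the share of recognised words can look respectable while the sentence stays unintelligible.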

Posted

Hmm, is it not possible to just click to reply?

Anyways, I'm not really sure there is a relationship like that. Yes, I probably shouldn't have said understand, because in any language that's not something that makes sense in this context.

But, the answer you get is very different depending upon the specifics. I've noticed that it seems to have less to do with the number of characters I know than the number of radicals I know. And moreover, it seems to have far more to do with the amount of time and exposure I have to characters in context. And the amount of time I've spent looking very closely at characters trying to figure out how to look them up in a dictionary.

I don't know, perhaps that's it, trying to learn characters in isolation is so backwards, that I can't fathom how one would be recognizing characters elsewhere.

Posted
Anyways, I'm not really sure there is a relationship like that.

No, it is exactly like that. According to this page, you can recognize fully 33% of the words you come across in any random sample of English by knowing only the 25 most common words. But if that's all you know, you won't actually understand anything.

I've noticed that it seems to have less to do with the number of characters I know than the number of radicals I know.

What?

Posted

A key related concept is the Lexical Comprehension Percentage, that is, the percentage of words which are confidently understood in a given text.

I find that extensive reading becomes comfortable around 98% LCP. The remaining 2% only count as in-context exposures, which will be learned after a few more exposures, perhaps within the same few lines of text.

I count a lexeme (a word) as comprehended even if I have never seen it before, so long as I know the morphemes (characters) and the meaning is clear from context.
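
As a rough illustration of that counting rule, here is a Python sketch that computes an LCP for an already-segmented text. It assumes a word counts as comprehended if it is in a known-word set or if all of its characters are known; the "meaning is clear from context" condition cannot be checked mechanically and is not modelled here. The function name and the sample data are hypothetical.

```python
def lexical_comprehension(tokens, known_words, known_chars):
    """Return the share of tokens (0..1) counted as comprehended.

    A token counts if it is a known word, or if every character in it
    is a known character. Context is NOT taken into account.
    """
    if not tokens:
        return 0.0
    comprehended = sum(
        1
        for tok in tokens
        if tok in known_words or all(ch in known_chars for ch in tok)
    )
    return comprehended / len(tokens)


# Hypothetical example: a pre-segmented phrase and a tiny vocabulary.
tokens = ["惡龍", "的", "鱗甲"]
known_words = {"的"}
known_chars = set("惡龍甲")
print(f"LCP: {lexical_comprehension(tokens, known_words, known_chars):.0%}")
```

For real text the tokens would have to come from a word segmenter, and the result depends heavily on how that segmentation is done.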

Chinese learning is especially interesting because of the diversity of the morphemes and the richness of their structure. I find that it is especially easy to guess word meanings in Chinese in contrast to, say, the Romance languages.

I believe the Chinese language learner is best served by a mixed approach of focused character learning and contextual word learning. Learning ~2500 individual characters is a great base that makes word acquisition 朝飯前.

Posted
I find that extensive reading becomes comfortable around 98% LCP. The remaining 2% only count as in-context exposures, which will be learned after a few more exposures, perhaps within the same few lines of text.

This may be true for phonetically written languages, but not for Chinese, since with extensive reading alone you won't learn the pronunciation (though you may get a hint). Extensive reading in Chinese teaches you only half of each character: the meaning, not the pronunciation.

Posted

Good point, Silent. You don't get to learn the spoken language as a side-effect of extensive reading for Chinese.

Luckily, once you have a good base of characters, you can easily guess at the pronunciation. I ran across 鱗甲 the other day in the context of an 惡龍. The meaning was clear from context and the 甲 character. Although I am not certain that 鱗 is pronounced like 鄰居的鄰, I'm satisfied with just guessing for now. Sure, I'm not certain, but it's close enough to recognize the word later in spoken language.

Posted
I find that extensive reading becomes comfortable around 98% LCP.

There was a discussion a while back that talked about this very point. The video linked to in that thread is very good. There was also another thread here talking about word frequency and understanding.

In my own experience, I would say that I'm inclined to agree with this figure. Even though as Silent mentioned you don't get pronunciation, if you're reading at 98% comprehension, many of the new words you come across will be made up of characters you already know, and even when they are not, you can still at least learn the meaning from context and exposure. You don't get pronunciation, but depending on what you're reading, you may or may not find that to be important.

Posted

No, it is exactly like that. According to this page, you can recognize fully 33% of the words you come across in any random sample of English by knowing only the 25 most common words. But if that's all you know, you won't actually understand anything.

Subject to a ton of caveats and this definitely does not reflect my experience with the language. I look at newspaper articles and advertisements and I'm recognizing maybe 1/10 of the characters and words.

Radicals, you know the things that actually are used for classifying words. If you aren't paying attention to the radicals, then why on earth would you expect to remember any of those characters? Or write them for that matter, they're in the characters to give you hints about the meaning and the pronunciation. And without them you're at best rote memorizing characters. Even just memorizing them is a lot harder if you're not paying attention to them.

Trust me, if you're treating characters as whole characters you're going to take a much longer time in learning to read.

Posted
Radicals, you know the things that actually are used for classifying words. If you aren't paying attention to the radicals, then why on earth would you expect to remember any of those characters? Or write them for that matter, they're in the characters to give you hints about the meaning and the pronunciation. And without them you're at best rote memorizing characters. Even just memorizing them is a lot harder if you're not paying attention to them.

Trust me, if you're treating characters as whole characters you're going to take a much longer time in learning to read.

I had a feeling you were using the word incorrectly. Radicals are simply the 214 (give or take, depending on which system you're talking about) components used as section headings in dictionaries. Hence the Chinese term 部首, which literally means section heading. You're talking about character components, not radicals. Components can be semantic, phonetic, pictographic, etc., and understanding how those work can indeed help with learning characters. Knowing radicals can only help you make a decent guess as to where to find the character in a dictionary.

I can also guarantee that the radicals aren't there to "give you hints about the meaning and pronunciation." The radical system is an artificial one, created as a way of indexing characters in a dictionary long after the writing system had already developed, by people who didn't fully understand how the writing system developed (for instance, the 說文解字 is wrong most of the time; I've seen claims of up to 90%). It is not all that useful for explaining how characters actually work.

Posted

I mentioned LCP, but I wonder if it's possible to define a semantic comprehension percentage. That is, the amount of the meaning of the text that you understand. That would be 0% in the case when your LCP is down around 30% for example.

Posted

Well, there's some in this Wikipedia article, but most of what I know of this I've read in books, not online, or I've gotten from lectures I've audited and conversations with PhD students. Anyway, 《文字學概要》 by 裘錫圭 is an excellent introduction, and there's an English translation of it by Jerry Norman and Gilbert Mattos called Chinese Writing. The claim of 許慎 being wrong about 90% of the characters in 說文 was made (if I remember correctly) by 杜忠誥 in his excellent 《說文篆文訛形釋例》. William Boltz's The Origin and Early Development of the Chinese Writing System is also highly recommended, but I haven't read much of it yet.
