Chinese-Forums

How many characters per word?



Posted

I was looking for information on something else (sending SMSs in Chinese) and came across a comment on the Sony-Ericsson web site.

It states that the average Chinese word contains 1.6 characters.

Is that a fact quoted elsewhere or did they make it up?

Posted

From Asia's Orthographic Dilemma:

For running text, DeFrancis estimates Chinese "as only 30 percent monosyllabic as against 50 percent for English material written in a style comparable to that of the Chinese" (1943:235). Zheng gives a higher figure of 40 percent monosyllabicity for Chinese texts (1957:50).

DeFrancis: (1 * 0.3) + (2 * 0.7) = 1.7

Zheng: (1 * 0.4) + (2 * 0.6) = 1.6

I've assumed that anything that isn't a 1-syllable word is a 2-syllable word, as words of 3+ syllables only make up 2% and I don't want the hassle :wink:

Seems about right, but this is for running text (i.e. counting the lengths of words in a normal text) not the total number of words in the dictionary.
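
The arithmetic above can be reproduced with a minimal sketch. It keeps the same simplifying assumption (every word is either one or two characters long); the function name mean_chars_per_word is just illustrative, not something from the book.

```python
def mean_chars_per_word(share_monosyllabic: float) -> float:
    """Expected characters per word, assuming words are only 1 or 2 characters long."""
    if not 0.0 <= share_monosyllabic <= 1.0:
        raise ValueError("share_monosyllabic must be between 0 and 1")
    # Weighted average: 1 character for monosyllabic words, 2 for the rest.
    return 1 * share_monosyllabic + 2 * (1 - share_monosyllabic)


# Monosyllabic shares quoted above: 30% (DeFrancis) and 40% (Zheng).
defrancis = mean_chars_per_word(0.30)
zheng = mean_chars_per_word(0.40)
print(f"DeFrancis: {defrancis:.1f}")
print(f"Zheng: {zheng:.1f}")
```

Folding the 3+ syllable words into the 2-syllable group slightly understates the true average, so these figures are best read as lower bounds.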

Posted

hmm, pretty surprising for written text...

does the percentage of 2-syllable+ words go up even more in the spoken language because of homonyms?

Posted

Any page of my 汉英词典 has a majority of two-syllable "words". I'm not surprised by the 1.6-1.7 syllables (or characters?!) per word.

One thing that I would love to investigate for a thesis at university third-semester level is those claims of "you need to know n characters or words to understand x % of a text in Chinese". I'm curious about how the claimants measure "understanding", their sample sizes, the selection of test subjects, methods used, definitions of a "word" etc. Links/comments, please!

Posted

I have some stats I extracted from the HSK lists some time ago that might help. I wouldn't want to treat these as gospel, as there are issues. One is obviously the limited sample size. Also, some entries have punctuation in them (e.g. "除了。 。 。以外"), and I'm not sure what effect that had. Also, I didn't bother looking for any entries above four syllables / characters. Can't remember offhand if there are any. So with that caveat . . .

The 8785 words I looked at (should be 8821, but for reasons above not all were picked up) break down like this. Note that I was thinking in terms of syllables at the time, not characters, but it's the same thing.

4 syllables: 186 words (744 characters)

3 syllables: 293 words (879 characters)

2 syllables: 6384 words (12768 characters)

1 syllable: 1922 words (1922 characters)

This gives us a total of 16313 characters in 8785 words, making about 1.86 characters per word.
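
As a quick check, the totals can be recomputed from the four counts listed above. This is only a sketch: the dictionary name counts_by_length is made up for illustration, and the numbers are the ones quoted in this post, not a fresh pass over the HSK lists.

```python
# Number of HSK entries by length in syllables/characters, as listed above.
counts_by_length = {1: 1922, 2: 6384, 3: 293, 4: 186}

# Total entries, and total characters (length times number of entries).
total_words = sum(counts_by_length.values())
total_chars = sum(length * n for length, n in counts_by_length.items())

average = total_chars / total_words
print(total_words, total_chars, round(average, 2))  # 8785 16313 1.86
```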

Roddy

Posted

I wonder if the Sony Ericsson people were counting by frequency. A lot of less frequently used words are two, three, and even four characters. On the other hand, a lot of single character words (like 我, 你, 在) are used more often, which would bring the number down towards 1.6. I don't know.
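
The difference between counting each word once (as in a word list) and counting by frequency (as in running text) can be sketched as below. The token list is a tiny made-up example that is assumed to be already segmented into words, and average_lengths is a hypothetical helper name; real figures would need a proper segmented corpus.

```python
from collections import Counter


def average_lengths(tokens):
    """Return (per-distinct-word average, frequency-weighted average) of word length."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    # Dictionary-style: each distinct word counts once.
    by_type = sum(len(word) for word in counts) / len(counts)
    # Running-text style: each word counts as often as it occurs.
    by_token = sum(len(word) * n for word, n in counts.items()) / sum(counts.values())
    return by_type, by_token


# Toy input only, not real corpus data.
tokens = ["我", "在", "学习", "中文", "我", "喜欢", "中文"]
by_type, by_token = average_lengths(tokens)
print(f"per distinct word: {by_type:.2f}, per running word: {by_token:.2f}")
```

Which of the two averages comes out lower depends on which words repeat most, so a frequency-based figure can differ noticeably from a word-list figure.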

Posted

The number is also different for different Chinese dialects.

I wouldn't be surprised if Shanghainese averages 1.8-2.0, while Cantonese and Hakka are under 1.6.

And I wonder if Mandarin words like 花儿 (huar) are being counted as one or two syllables?

And is 社会主义 (socialism) one word or two? According to the Xinhua dictionary, it's two words (Shehui Zhuyi); but according to the Shanghainese tone sandhi pattern it's one word (Zooweitsugni). To me the phonology-based Shanghainese word partition is more solid than the arbitrary Xinhua partition. Hence different definitions of word boundaries will greatly alter the above posts' calculations as well.

Posted

Lugubert

I'm only starting to learn Chinese and stumbled into talk of character frequency: how by knowing 2000 characters you can recognise ~98% of written Chinese, 3000 => 99.9%. Obviously recognising individual characters is not the same as understanding words and meaning.

Cumulative frequency percentages for characters are listed here:

http://lingua.mtsu.edu/chinese-computing/statistics/index.html

HTH

Posted

I think it depends on how you calculate, for example:

(1) using statistics-based methods, then the question is: how good is the corpus?

(2) using a dictionary, then the question is: do you count an entry like 丁 (ding1) as a bare character or as a word?

Posted
I'm only starting to learn Chinese and stumbled into talk of character frequency: how by knowing 2000 characters you can recognise ~98% of written Chinese, 3000 => 99.9%. Obviously recognising individual characters is not the same as understanding words and meaning.

Yup, it's kind of like saying how by knowing just 26 letters in the English alphabet, I can "recognize" 99.9% of all English text.

No one can guess from the characters alone that 社会 means "society"

Posted
I'm only starting to learn Chinese and stumbled into talk of character frequency: how by knowing 2000 characters you can recognise ~98% of written Chinese, 3000 => 99.9%. Obviously recognising individual characters is not the same as understanding words and meaning.

Yup, it's kind of like saying how by knowing just 26 letters in the English alphabet, I can "recognize" 99.9% of all English text.

No one can guess from the characters alone that 社会 means "society"

That would be true for native speakers though. And, learning a radical here and there, and picking up some characters by seeing them in the streets and books and asking parents, native speakers don't really need to "learn" 3000 one by one either.

Posted

Yeah, native speakers are a completely different situation, because native speakers can "pronounce" the characters out and the sound registers as a word that has meaning. Even if they don't know what 社会 is when they see it, once they pronounce the characters 社 and 会 they get she4 hui4, which registers as "society." But this is not possible for non-native beginners in Chinese.

Posted

assuming you know 2000 or 3000 characters, you are not a beginner at all...
