Chinese-Forums

Character Frequency Analysis and Learning Site



Posted

I've put together another frequency list of Chinese characters, based on the usual sample of 100,000 Chinese web pages. This one has a bit of a twist: the program first converted each character into Pinyin, so that specific character-pinyin pairs could then be tallied.
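For anyone curious what tallying character-pinyin pairs might look like, here is a minimal sketch. It assumes the text has already been annotated with one reading per character; the sample data and the `tally_pairs` name are made up for illustration, and how the site's program actually picked each reading is not described in this thread.

```python
from collections import Counter

# Hypothetical input: text already annotated with one reading per character.
# The tone-number spelling (de5 for the neutral tone) is just one convention.
annotated = [
    ("我", "wo3"), ("的", "de5"), ("目", "mu4"), ("的", "di4"),
    ("是", "shi4"), ("我", "wo3"), ("的", "de5"),
]


def tally_pairs(pairs):
    """Count each (character, pinyin) pair as its own entry."""
    return Counter(pairs)


pair_counts = tally_pairs(annotated)
for (char, reading), count in pair_counts.most_common():
    print(char, reading, count)
```

The point of counting pairs rather than bare characters is that 的 read as "de" and 的 read as "di" end up as separate rows in the list.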

http://readmandarin.com/research.htm

If you dig around the site, you'll see it's aimed towards teaching people to read Chinese.

Any feedback on the research or on the site itself would be greatly appreciated.

Also, can anyone explain to me the etymology behind 的? White spoon? I've looked in a number of books and haven't gotten anywhere, and I'd like to update it for the site.

Thank you!

Daniel

Posted

a bit confusing with the line beneath each character, but i suppose i could get used to it.

what is the relation between # seen and cumulative %?

"Sentences, phrases and paragraphs of over 50 characters were included."

can i assume only paragraphs required a minimum 50-character count?

what percentage of text was discarded due to too-short paragraphs?

"Duplicate phrases were discarded."

did phrases have to be an exact match, or of a certain length to qualify?

how many times was "build a harmonious countryside" discarded?

Posted

Toffeeliz:

I wasn't quite sure what you meant by the results being off. Please explain if you have a moment.

mr.stinky:

I've removed the lines under the characters so the page is easier to read.

Cumulative percentage gives you an idea as to how much of the text you would be able to recognize if you learned only a certain number of characters. The top 10 characters make up about 12.2% of the sampled text. The top 1000 characters make up about 90.3% of the sampled text.
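As a rough illustration of how the cumulative % column relates to # seen, here is a small sketch. The counts are invented for the example and are not the site's data; `cumulative_percent` is a hypothetical helper name.

```python
from itertools import accumulate


def cumulative_percent(counts):
    """Return the running share of all occurrences, in percent.

    `counts` holds the "# seen" values sorted from most to least frequent.
    """
    total = sum(counts)
    return [100.0 * running / total for running in accumulate(counts)]


# Made-up counts for four characters, most frequent first.
seen = [50, 30, 15, 5]
coverage = cumulative_percent(seen)
for rank, pct in enumerate(coverage, start=1):
    print(f"top {rank}: {pct:.1f}%")
```

Each row's cumulative % is the sum of # seen for that row and every row above it, divided by the total number of characters counted.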

Yes, paragraphs had to be over 50 characters in length. The idea was that the program would take in text from actual articles, reviews and blogs. I'm not sure what percentage of shorter text was discarded, but I suppose I could run the analysis and find out.

It was only duplicate paragraphs that were omitted. I've changed the description to reflect that.
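A filtering step along these lines could be sketched as follows. Two assumptions that the thread does not settle: "over 50 characters" is read as strictly more than 50, and duplicates are matched exactly after trimming whitespace. The function name and sample data are hypothetical.

```python
MIN_CHARS = 50  # threshold taken from the site's description


def keep_paragraphs(paragraphs, min_chars=MIN_CHARS):
    """Keep paragraphs longer than min_chars and drop exact duplicates.

    Returns the kept paragraphs and the number discarded for being too short.
    """
    seen = set()
    kept = []
    discarded_short = 0
    for para in paragraphs:
        text = para.strip()
        if len(text) <= min_chars:  # assumption: "over 50" means > 50
            discarded_short += 1
            continue
        if text in seen:  # assumption: duplicates must match exactly
            continue
        seen.add(text)
        kept.append(text)
    return kept, discarded_short


sample = ["短", "字" * 60, "字" * 60, "文" * 51, "文" * 50]
kept, discarded_short = keep_paragraphs(sample)
print(len(kept), "kept;", discarded_short, "too short")
```

Counting the short paragraphs while filtering would also answer the question of how much text was thrown away.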

Thanks for the great feedback!

Daniel

Posted

If you went through the work of discarding the HTML markup and the shorter sentences, the raw datafiles could be useful to others doing text processing. I'd be curious to look at them for word-level rather than character-level frequencies myself.

Posted

trevelyan:

That's a cool idea. Would you be looking for words of 2-3 characters in length? How would you determine where one word begins and the other ends? Or would you just look for adjacent characters that come up often?
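The last option, just counting adjacent characters that come up often, might look something like this sketch. It is not real word segmentation, only a tally of two-character sequences, and the CJK check covers just the basic Unified Ideographs block. The function names and sample sentence are made up for illustration.

```python
from collections import Counter


def is_cjk(ch):
    """Rough check: is ch in the basic CJK Unified Ideographs block?"""
    return "\u4e00" <= ch <= "\u9fff"


def adjacent_pairs(text):
    """Count two-character sequences, never pairing across punctuation."""
    counts = Counter()
    for first, second in zip(text, text[1:]):
        if is_cjk(first) and is_cjk(second):
            counts[first + second] += 1
    return counts


sample = "我们学习中文,我们喜欢中文。"
pairs = adjacent_pairs(sample)
print(pairs.most_common(3))
```

Frequent pairs are only candidates for words; a pair like 们学 also gets counted even though it straddles a word boundary, so a dictionary or a proper segmenter would still be needed to clean up the results.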

If you message me your email address, I'll send the text to you as an attachment.

Daniel

  • 2 weeks later...
Posted

Daniel:

Also, can anyone explain to me the etymology behind 的? White spoon?

"的 ...  

勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."

source: http://www.kanjinetworks.com/

hope it inspires,

Ole

  • 1 month later...
Posted
"勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."

I've read the opposite: the spoon was the phonetic and the white was the meaning, at least according to the ABC dictionary.

But I'm not a character guy.

Posted

I copy/pasted the first 3000 characters into an Excel spreadsheet; it weighed in at 415k, which isn't too bad. Two things that this list lacks, for my needs at least:

1. traditional characters

2. additional readings, where applicable

Posted

"勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."

"I've read the opposite. the spoon was the phonetic and the white was meaning"

See here for more detail regarding the origin of 的.
Posted
:) It's very interesting; now I know that I know more than 1000 characters. I had almost no problems in the first 1000. :( I used to think that with 1000 characters I could speak quite fluently, but now I see that 1000 is too few.
Posted

Your next shock will be when you hit 2000 and STILL can't read everything out there by a long shot.
