sunlightatdusk Posted October 7, 2007 at 08:11 AM

I've put together another frequency list of Chinese characters, taking into account the usual 100,000 Chinese web pages. This one has a bit of a twist, as it translated the characters first into Pinyin, so that specific character-pinyin pairs could then be tallied.

http://readmandarin.com/research.htm

If you dig around the site, you'll see it's aimed towards teaching people to read Chinese. Any feedback on the research, or on the site itself, would be greatly appreciated.

Also, can anyone explain to me the etymology behind 的? White spoon? I've looked in a number of books and haven't gotten anywhere, and I'd like to update it for the site.

Thank you!
Daniel
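For readers wondering what tallying character-pinyin pairs might look like, here is a minimal Python sketch. It assumes the text has already been annotated with one reading per character (the thread does not describe how the Pinyin conversion was done), and the name `tally_pairs` and the sample data are hypothetical, not Daniel's actual program.

```python
from collections import Counter

def tally_pairs(annotated):
    """Count each (character, pinyin) pair in already-annotated text.

    `annotated` is an iterable of (character, pinyin) tuples; producing
    those annotations is assumed to happen elsewhere.
    """
    return Counter(annotated)

# Hypothetical sample: 的 read as "de" twice and as "dì" once.
sample = [("的", "de"), ("一", "yī"), ("的", "de"), ("的", "dì")]
pairs = tally_pairs(sample)

# List the pairs from most to least frequent.
for (char, reading), count in pairs.most_common():
    print(char, reading, count)
```

Keying the counter on the pair rather than on the character alone is what lets one character with several readings show up as separate rows in the list.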
Toffeeliz Posted October 7, 2007 at 01:33 PM

"2 一 yī 248730 5.52 one"

summat looks a bit off there mate ;)
mr.stinky Posted October 7, 2007 at 04:58 PM

a bit confusing with the line beneath each character, but i suppose i could get used to it.

what is the relation between # seen and cumulative %?

"Sentences, phrases and paragraphs of over 50 characters were included."
can i assume only paragraphs required a minimum 50-character count? what percentage of text was discarded due to too-short paragraphs?

"Duplicate phrases were discarded."
did phrases have to be an exact match, or of a certain length to qualify? how many times was "build a harmonious countryside" discarded?
sunlightatdusk (Author) Posted October 7, 2007 at 06:50 PM

Toffeeliz: I wasn't quite sure what you meant by the results being off. Please explain if you have a moment.

mr.stinky: I've removed the lines under the characters so the page is easier to read.

Cumulative percentage gives you an idea of how much of the text you would be able to recognize if you learned only a certain number of characters. The top 10 characters make up about 12.2% of the sampled text. The top 1000 characters make up about 90.3% of the sampled text.

Yes, paragraphs had to be at least 50 characters in length. The idea was that the program would take in text from actual articles, reviews and blogs. I'm not sure what % of shorter text was discarded, but I suppose I could run the analysis and find out.

It was only duplicate paragraphs that were omitted. I've changed the description to reflect that.

Thanks for the great feedback!
Daniel
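The filtering and the cumulative percentage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the list: the function names, the `min_len` parameter, and the toy input are hypothetical, and the sketch counts every non-whitespace character, punctuation included.

```python
from collections import Counter

def filter_paragraphs(paragraphs, min_len=50):
    """Keep paragraphs of at least `min_len` characters, dropping exact duplicates."""
    seen = set()
    kept = []
    for p in paragraphs:
        p = p.strip()
        if len(p) >= min_len and p not in seen:
            seen.add(p)
            kept.append(p)
    return kept

def cumulative_coverage(paragraphs):
    """Return (rank, character, times seen, cumulative %) rows, most frequent first."""
    # Assumption: every non-whitespace character is counted, punctuation included.
    counts = Counter(ch for p in paragraphs for ch in p if not ch.isspace())
    total = sum(counts.values())
    rows, running = [], 0
    for rank, (ch, n) in enumerate(counts.most_common(), start=1):
        running += n  # characters covered by the top `rank` entries
        rows.append((rank, ch, n, round(100.0 * running / total, 2)))
    return rows

# Toy input; min_len is lowered so the short sample survives the filter.
toy = ["的的的一一是", "的的的一一是", "短"]
kept = filter_paragraphs(toy, min_len=5)
rows = cumulative_coverage(kept)
for row in rows:
    print(row)
```

The cumulative % for rank N is simply the summed "# seen" of the top N characters divided by the total number of characters counted, which is why it climbs quickly at first and then flattens out.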
trevelyan Posted October 7, 2007 at 08:24 PM

If you went through the work of discarding the HTML markup and the shorter sentences, the raw datafiles could be useful to others doing text processing. I'd be curious to look at them for word-level rather than character-level frequencies myself.
sunlightatdusk (Author) Posted October 7, 2007 at 08:40 PM

trevelyan: That's a cool idea. Would you be looking for words of 2-3 characters in length? How would you determine where one word begins and the other ends? Or would you just look for adjacent characters that come up often?

If you message me your email address, I'll send the text to you as an attachment.

Daniel
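The last option Daniel mentions, looking for adjacent characters that come up often, can be sketched as a simple bigram count. This is a rough illustration only, not a real word segmenter: the names `is_cjk` and `bigram_counts` are hypothetical, and the sketch assumes that only pairs inside the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) are of interest.

```python
from collections import Counter

def is_cjk(ch):
    """True if `ch` is in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def bigram_counts(paragraphs):
    """Count every pair of adjacent Chinese characters in each paragraph."""
    counts = Counter()
    for p in paragraphs:
        # zip(p, p[1:]) yields each character with its right-hand neighbour.
        for a, b in zip(p, p[1:]):
            if is_cjk(a) and is_cjk(b):
                counts[a + b] += 1
    return counts

# Toy input: the pair 我们 occurs twice.
counts = bigram_counts(["我们是我们"])
for pair, n in counts.most_common():
    print(pair, n)
```

Frequent pairs found this way are only candidates for words. Pairs that straddle a word boundary (such as 们是 in the toy input) are counted too, so some further filtering or a dictionary would be needed to get true word-level frequencies.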
Ole Posted October 17, 2007 at 06:36 AM

Daniel: "Also, can anyone explain to me the etymology behind 的? White spoon?"

"的 ... 勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."
source: http://www.kanjinetworks.com/

hope it inspires,
Ole
self-taught-mba Posted November 28, 2007 at 06:00 PM

"勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."

I've read the opposite. the spoon was the phonetic and the white was meaning. At least according to ABC dictionary. But I'm not a character guy.
leosmith Posted December 1, 2007 at 11:15 AM

I copy/pasted the first 3000 characters into an excel spreadsheet; it weighed in at 415k, which isn't too bad. Two things that this list lacks, for my needs at least:
1. traditional characters
2. additional readings where applicable
imron Posted December 1, 2007 at 11:24 AM

"勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."

"I've read the opposite. the spoon was the phonetic and the white was meaning"

See here for more detail regarding the origin of 的.
rezaf Posted December 1, 2007 at 07:27 PM

it's very interesting, now i know that i know more than 1000 characters. i had almost no problems in the first 1000. i used to think that with 1000 characters i could speak quite fluently, but now i see that 1000 is too few.
renzhe Posted December 2, 2007 at 08:13 PM

Your next shock will be when you hit 2000 and STILL can't read everything out there by a long shot.