bozhidao Posted January 7, 2010 at 06:29 PM Hi, I meant to post this a year ago when I put it online. Time flies. I put a small web app for character (and 2-, 3- and 4-character word) knowledge estimation up at: http://bozid.com/quiz It's hosted on Dreamhost (those who read Sinosplice know how that can be), on the cheapest plan, so my apologies if it's not the snappiest. If you try it, please let me know if you think the estimates are in the right range for you. I scraped the words from baike.baidu.com about a year ago, so the frequency is a little heavy on Olympic sports terminology in the mid-range... If there's any interest I'll see about scraping again (takes forever). Thanks! Bo
renzhe Posted January 8, 2010 at 01:39 PM It looks very nice. It is not too well-suited for more advanced people, though. I gave up after 40 characters and 100 bigrams, and I found it mostly too easy. It estimated around 3500 characters, and I know around 3800, so it is very accurate there. For two-character words, it estimated 9500. This might be right, it's hard to say. I don't have a list of all words I've ever learned, but I guess that my estimate would be in that ballpark. I think that having pinyin and the answer makes it too easy to guess the answer even if you don't know it. Most characters will have a radical (which eliminates most meanings) and a phonetic (which eliminates most pronunciations). But I can see how this could be a useful tool for people who don't have a meticulous record of everything they've learned.
HedgePig Posted January 8, 2010 at 03:16 PM For me, it estimated just under 1100 characters, which I think is a little low but certainly not way off. When running the bigrams, it simply stopped on both attempts. (On the second attempt, I'd got to around 75/83 completed, I think.) It seems a little easy to guess correctly, particularly if you know just one character and use a process of elimination.
bozhidao (Author) Posted January 8, 2010 at 03:19 PM Thanks for trying it out, renzhe! It's good to get feedback from someone who is so much farther along in their studies than I am. I see that I should try to shorten the quiz for folks in the high range... Currently, as you answer questions correctly it just walks up through the character list by frequency in steps of 100. So if you know ~3800 characters it will ask you 38 questions before it can even start probing your knowledge at the upper boundary, which typically takes another 5-15 questions. I suppose if a user answers 3 or 4 in a row correctly it could start going up in steps of 200. That should cut around 15 questions for someone at your level. I guess I'll just have to learn another few thousand characters so I can fine-tune it at that range ;) Thanks! Bo
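The stepping idea in the post above can be sketched roughly as follows. This is only an illustration of "steps of 100, then steps of 200 after a streak of correct answers", not the quiz's actual code; the function names, the starting rank of 0, and the streak threshold of 3 are all assumptions.

```python
def next_rank(current_rank, streak, base_step=100, fast_step=200, streak_threshold=3):
    """Return the next frequency rank to quiz after a correct answer.

    Hypothetical helper: uses the larger step once `streak` correct answers
    in a row have been seen, otherwise the base step.
    """
    step = fast_step if streak >= streak_threshold else base_step
    return current_rank + step


def questions_to_reach(target_rank, accelerate):
    """Count the correct answers needed to walk from rank 0 up to target_rank."""
    rank, streak, asked = 0, 0, 0
    while rank < target_rank:
        # With acceleration off, the streak is ignored and the step stays at 100.
        rank = next_rank(rank, streak if accelerate else 0)
        streak += 1
        asked += 1
    return asked


if __name__ == "__main__":
    print(questions_to_reach(3800, accelerate=False))
    print(questions_to_reach(3800, accelerate=True))
```

Under these assumptions the fixed-step walk needs 38 questions to reach rank 3800 and the accelerated walk needs 21, which is in the same neighbourhood as the "around 15" saving estimated in the post.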
bozhidao (Author) Posted January 8, 2010 at 03:51 PM Sorry for the errors on the bigram test, HedgePig. I think I might be pushing the limits on the cheap-o Dreamhost setup. Congrats on 1100+ characters, though! That was my goal for last year -- I know how much work it can be. I think it should get easier (or at least more rewarding) from here on out, though, as we can read more interesting things. For example: As of my last exhaustive scrape of baike.baidu.com (about a year ago) the 90% threshold was 1133 characters, so we should know ~90% of the characters there! Bo
chinopinyin Posted January 8, 2010 at 06:29 PM This is a really nice tool. Which frequency word list are you using to compute the estimates? Would it make sense to allow the user to choose the number of questions?
chrix Posted January 8, 2010 at 06:44 PM Naturally I tried the four-character words, and I got them all right, so it said I knew 4,400 of them (?). Anyhow, I think it is too easy: for the four-character words there are way too many geographical names and personal names that can be easily guessed. Maybe you should restrict yourself to chengyu only. Also, the entries in the pinyin and the English columns match, which makes it much easier to guess in general. Maybe you could randomise both columns?
bozhidao (Author) Posted January 8, 2010 at 07:18 PM chinopinyin: Thanks! I actually generated my own frequency lists by crawling through baike.baidu.com and counting the occurrences of each character/word there. For definitions I used CC-CEDICT (http://cc-cedict.org/wiki/). The algorithm I'm using stops the quiz when it thinks it has a good estimate. If it used an arbitrary number of questions it might have to stop before it reached a good estimate for some people. If you want to stop the quiz early there's a button for that. chrix: Wow, I think you're more advanced than the quiz can handle! Interesting that you say the columns match. They're not supposed to. I'll see if I can reproduce that. Thanks! Bo
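For anyone curious what "counting the occurrences of each character" might look like, here is a minimal sketch in Python. It is not the crawler described in the post: it assumes the page text has already been downloaded into strings, and the helper names and the simple CJK range check are assumptions made for illustration.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def character_frequencies(pages):
    """Count Chinese characters across an iterable of page texts.

    Returns (character, count) pairs, most frequent first. Punctuation,
    Latin letters and digits are skipped by the is_cjk filter.
    """
    counts = Counter()
    for text in pages:
        counts.update(ch for ch in text if is_cjk(ch))
    return counts.most_common()


if __name__ == "__main__":
    # Two made-up sample "pages", standing in for scraped article text.
    sample_pages = ["我是学生。", "我是老师!"]
    for char, count in character_frequencies(sample_pages):
        print(char, count)
```

Counting multi-character words would additionally need word segmentation or a dictionary lookup (for example against CC-CEDICT headwords), which this sketch leaves out.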
c_redman Posted January 8, 2010 at 08:02 PM I can't make it past the first several words. After a point I just get a blank page.
bozhidao (Author) Posted January 8, 2010 at 08:22 PM c_redman: Sorry to hear that. Maybe I should port this to the Google App Engine... Just have to learn Python first.
chrix Posted January 8, 2010 at 08:47 PM bozhidao, I think I might not have phrased this right. What I meant was that in the pinyin and English columns there are four options, and they form four pairs. Thus, if you don't know the meaning of a word, you can exclude options by matching them with the pinyin column. If you randomised this completely, it would take away this possibility.
bozhidao (Author) Posted January 8, 2010 at 09:05 PM chrix: Oh, I see what you mean. That's a good idea. In the next version I'll pull the incorrect meanings and pronunciations independently. Thanks!
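One way to read "pull the incorrect meanings and pronunciations independently" is to draw the wrong pinyin and the wrong meanings in two separate random samples, so the two columns no longer line up as pairs. The sketch below is an assumption about how that could be done, not the quiz's code; the entry layout, the function name, and the sample entries (hand-written, not taken from CC-CEDICT) are all hypothetical.

```python
import random


def build_choices(entries, answer_index, n_wrong=3, rng=random):
    """Build a pinyin column and a meaning column for one question.

    Wrong pinyin and wrong meanings are sampled separately, so a wrong
    meaning cannot be ruled out just by matching it to a wrong pinyin.
    Note: this does not guard against homophones of the correct answer.
    """
    answer = entries[answer_index]
    others = [e for i, e in enumerate(entries) if i != answer_index]
    wrong_pinyin = [e["pinyin"] for e in rng.sample(others, n_wrong)]
    wrong_meanings = [e["meaning"] for e in rng.sample(others, n_wrong)]
    pinyin_column = wrong_pinyin + [answer["pinyin"]]
    meaning_column = wrong_meanings + [answer["meaning"]]
    # Shuffle each column on its own so the correct row position differs too.
    rng.shuffle(pinyin_column)
    rng.shuffle(meaning_column)
    return pinyin_column, meaning_column


if __name__ == "__main__":
    sample_entries = [
        {"hanzi": "猫", "pinyin": "māo", "meaning": "cat"},
        {"hanzi": "狗", "pinyin": "gǒu", "meaning": "dog"},
        {"hanzi": "书", "pinyin": "shū", "meaning": "book"},
        {"hanzi": "水", "pinyin": "shuǐ", "meaning": "water"},
        {"hanzi": "山", "pinyin": "shān", "meaning": "mountain"},
        {"hanzi": "火", "pinyin": "huǒ", "meaning": "fire"},
    ]
    print(build_choices(sample_entries, answer_index=0))
```

The two samples can still overlap by chance, so a larger pool of candidate entries makes accidental pairs less likely.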
chinopinyin Posted January 8, 2010 at 09:31 PM @bozhidao Why don't you include on your website the list of the most common XX characters in order of frequency? I see it as very interesting in its own right. Estimates for people with limited vocabulary may be less reliable, though, since there are many high-frequency words in actual speech (eye, son, today, run, car ...) that may not be that frequent on baike.baidu.com. One could argue that the word lists for the new HSK exam may be a good starting point. The first 150 words one should know are in http://www.confuciusinstitute.qut.edu.au/docs/Chinese_Proficiency_Test_1_Vocab.pdf The first 300 words are in http://www.confuciusinstitute.qut.edu.au/docs/Chinese_Proficiency_Test_2_Vocab.pdf There have to be 4 additional lists for the other levels of this exam that reach up to 5000 characters, but I have not seen them. Some other lists: http://lingua.mtsu.edu/chinese-computing/statistics/ http://www.zein.se/patrick/3000char.html
bozhidao (Author) Posted January 8, 2010 at 10:48 PM chinopinyin: Yeah, the zein.se list is great, and the mtsu lists were an inspiration. I opted to make my own lists for a few reasons:
1. I wanted to have a consistent source for words of each length
2. I wanted to make sure my lists represented real, modern usage
3. It's fun
If you're interested, here's a one-character list from last February. This isn't the exact list used in the quiz, but it was collected at around the same time. http://bozid.com/download/frequency.txt Columns are:
1. rank
2. character
3. total encountered
4. cumulative percentage
There are some obvious anomalies there (e.g. 词), but in general I've been pretty happy with it. Bo
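The four columns listed above (rank, character, total encountered, cumulative percentage) can be derived from raw counts, and the same table answers questions like "how many characters cover 90% of the text?". The sketch below is illustrative only: the function names are hypothetical and the counts are made up, not taken from frequency.txt.

```python
def cumulative_rows(counts):
    """Turn (character, count) pairs, sorted by descending count, into
    (rank, character, count, cumulative_percent) rows."""
    total = sum(n for _, n in counts)
    running = 0
    rows = []
    for rank, (char, n) in enumerate(counts, start=1):
        running += n
        rows.append((rank, char, n, 100.0 * running / total))
    return rows


def rank_for_coverage(rows, target_percent):
    """Return the smallest rank whose cumulative percentage reaches the
    target, or None if the target is never reached."""
    for rank, _char, _count, cumulative in rows:
        if cumulative >= target_percent:
            return rank
    return None


if __name__ == "__main__":
    # Made-up counts for illustration only.
    toy_counts = [("的", 50), ("一", 30), ("是", 15), ("词", 5)]
    rows = cumulative_rows(toy_counts)
    for row in rows:
        print(*row)
    print(rank_for_coverage(rows, 90))
```

Run against a full frequency list, rank_for_coverage(rows, 90) would give the kind of "90% threshold" figure mentioned earlier in the thread.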
stelingo Posted January 8, 2010 at 11:34 PM I did the quiz twice. (Is that cheating?) The first time my character count was 196. The second time was 273. According to Skritter I know 290 characters, so pretty accurate. Only another 2710 to go. *sigh*
Hofmann Posted January 8, 2010 at 11:44 PM It doesn't have an "I don't know" option, leaving one to guess, and a good guesser can get a score much higher than what they actually know. It's also in Simplified Chinese.
stelingo Posted January 8, 2010 at 11:46 PM Just click on submit without choosing an answer.
bozhidao (Author) Posted January 8, 2010 at 11:52 PM stelingo: That's not cheating at all! In fact, if you create an account it will keep track of your scores over time for you so you can chart your progress. If you take the test multiple times in one day it uses the average for that day's score. Thanks for trying it out! Hofmann: That's a good point. If you don't know an answer you'll get a more accurate estimate if you don't guess. Like stelingo said, you can just click submit without answering. And yes, it only has simplified characters for now. No offense to any traditional character users, it's just what I've been studying myself. Bo
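The "average for that day's score" behaviour could be sketched like this. It is a guess at the bookkeeping, not the site's code: the function name and the (date, estimate) record layout are assumptions, and the third sample result is invented to show a second day.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_scores(results):
    """Collapse (day, estimate) pairs into one averaged score per day,
    ordered by date."""
    by_day = defaultdict(list)
    for day, estimate in results:
        by_day[day].append(estimate)
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


if __name__ == "__main__":
    results = [
        (date(2010, 1, 8), 196),  # first attempt reported in the thread
        (date(2010, 1, 8), 273),  # second attempt reported in the thread
        (date(2010, 1, 9), 280),  # invented follow-up for illustration
    ]
    for day, score in daily_scores(results).items():
        print(day.isoformat(), score)
```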
Artem Posted January 9, 2010 at 12:59 AM I think it's definitely too easy. I was doing the two-character test and got bored after 125. It said I knew 11196 of those words. Dunno, a lot of the words were city names and I just matched them to the English without actually knowing the city. It's hard to say how accurate it is. I don't keep count of how many characters I know, nor how many words I know. I'd say I know more than 10,000 words for sure, but I doubt they are all two-character words. Definitely should limit the length of the test. I really got bored in the end.
querido Posted January 9, 2010 at 01:13 AM I quit after only 25 hanzi, sorry. I "know" 1200. It estimated 1060. I'm impressed. Very good.