the.yangist Posted February 10, 2011 at 12:17 AM
I heard a lot of people here were looking for a good word frequency list from which to start learning vocabulary. I made one that contains the most common synsets from Princeton's WordNet 3.0, translated them into small Chinese synsets, and put the result online for everyone. Statistically, about 6 of every 10 words you read will be either a word from this flash card set or a closed word class word (a pro-form, preposition, subordinator, coordinator, etc.). As I learned later, the characters you pick up by practicing the flash card set make everything much more readable and make many dictionary searches much quicker.
I can take my data and organize it however you need it for a flash card applet, but I'll give priority to people who can help me with these goals:
Cleanup: Some of the Pinyin entries are incorrect because I used an automatic Pinyin converter. For example, 了 always comes out as le, even where it should be liǎo (as in 了結), and 地, when it appears in parentheses, comes out as dì rather than de.
Revision: Right now, it's just me and one Taiwanese person working on this. I'd like to increase the number of contributors a bit, but I worry about online vandalism, so I'm only letting in a few people who would really like to take some time to help. I'm especially interested in hearing opinions from HSK 5+ learners and native speakers. I may create an experimental, closed wiki just for this purpose.
Sentence Writing: I'm going to begin working all of these terms into increasingly complex sentences so that people can practice their syntactic understanding while they study the vocabulary list. That should also cut the list down substantially (by about 1/2).
Translators: This isn't the best place to look for them, but I'm looking for speakers of other foreign languages (mine are English, Spanish, and Mandarin). People seem to want Japanese and French lists, too, but I speak neither language, so competent speakers with similar goals should contact me.
I will take a week to make a flash card set of the closed word classes if people really want it.
Here's the link: http://www.flashcardexchange.com/flashcards/view/1628943
Give me comments and whatnot.
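The Cleanup item in the post above could be tackled with a small table of context-specific overrides layered on top of the automatic converter's output. What follows is only a minimal sketch: `DEFAULT_READING`, `WORD_OVERRIDES`, and `to_pinyin` are hypothetical names invented for illustration, and the tiny reading table stands in for whatever converter was actually used.

```python
# Hypothetical default readings, standing in for an automatic converter's output.
DEFAULT_READING = {"了": "le", "結": "jié", "地": "dì"}

# Hypothetical word-level overrides: word -> {character index: corrected reading}.
WORD_OVERRIDES = {"了結": {0: "liǎo"}}


def to_pinyin(word, particle=False):
    """Return one syllable per character of `word`.

    particle=True marks 地 used as the adverbial particle
    (for example when the list shows it in parentheses).
    """
    fixes = WORD_OVERRIDES.get(word, {})
    syllables = []
    for index, char in enumerate(word):
        if index in fixes:
            # A word-specific correction wins over the default reading.
            syllables.append(fixes[index])
        elif char == "地" and particle:
            syllables.append("de")
        else:
            # Unknown characters are flagged with "?" for manual review.
            syllables.append(DEFAULT_READING.get(char, "?"))
    return syllables


print(to_pinyin("了結"))
print(to_pinyin("地", particle=True))
```

Keeping the corrections in a separate table means the converter output stays untouched and each manual fix can be reviewed on its own.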
creamyhorror Posted February 10, 2011 at 02:06 AM
Hmm, so these are Chinese flashcards based on English frequencies of their translations? If so, why not use a Chinese frequency list instead? Don't mean to bash your hard work - just wondering if I understood your post right.
sebhk Posted February 10, 2011 at 04:40 AM
creamyhorror wrote: "Hmm, so these are Chinese flashcards based on English frequencies of their translations? If so, why not use a Chinese frequency list instead?"
I am wondering the same. Is it a good idea to use English synsets as a basis for Chinese ones? When it comes to meaning, a 1-to-1 mapping between English and Chinese words (and thus the same number of synonyms) is not very common.
the.yangist Posted February 10, 2011 at 04:58 AM
I only show the three most common entries for any given WordNet synset. Since WordNet employs a lot of sophisticated lexicography, ideally it should not matter whether the lexicon on which the definitional distinctions are based is English, since the distinctions could exist in every language, even where a single token refers to several of them. (You'll see a lot of overlap with be in the early part of my list, for instance, because many different senses use this token; some of them are very common, and many hardly ever occur.) There are WordNets for other languages, including Chinese, but they are either built from the English WordNet or not nearly as sophisticated as the WordNet database is. This is the bare rationale for favoring the English WordNet.
The following should answer your question: Mere frequency lists in any language are not a good tool for making vocabulary lists, because you don't learn to distinguish the senses of words that way, and without some extra computing you get no clue whether one sense needs more attention than another sense of the same token. A good study of vocabulary will make you step back to consider the semantic domains to which the tokens of a language's lexicon apply. Frequency is then important for selecting the most common semantic domains and picking those domains' most relevant tokens. You can ignore the others at first, until you grasp the fundamentals.
This method is also better because it breaks beginners of a very troubling error: the false belief that there is a bijection between the tokens of two different languages (for instance, that 讓 only means let and nothing more). You can't beat this flash card applet unless you learn to pay attention to the definitions for which all of the cited tokens are common examples.
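The selection described in the post above (rank senses by frequency, then keep the three most common entries per synset) can be sketched roughly as follows. This is an illustration under stated assumptions, not the script actually used for the list: the synset ids, lemmas, and counts in `SYNSETS` are made up, and `build_list` is a hypothetical helper.

```python
# Hypothetical data: synset id -> {lemma: usage count}. All counts are invented.
SYNSETS = {
    "be.v.01": {"be": 120, "exist": 15, "equal": 4, "constitute": 2},
    "be.v.03": {"be": 60, "lie": 5},
    "let.v.01": {"let": 30, "allow": 25, "permit": 10, "countenance": 1},
    "rare.v.01": {"obscure": 1},
}


def build_list(synsets, top_synsets, lemmas_per_synset=3):
    """Keep the most frequent synsets and, within each, the most frequent lemmas."""
    # Rank synsets by the total count of all their lemmas, highest first.
    ranked = sorted(
        synsets.items(),
        key=lambda item: sum(item[1].values()),
        reverse=True,
    )
    result = []
    for synset_id, lemma_counts in ranked[:top_synsets]:
        # Within a synset, keep only the most common lemmas.
        best = sorted(lemma_counts, key=lemma_counts.get, reverse=True)
        result.append((synset_id, best[:lemmas_per_synset]))
    return result


for synset_id, lemmas in build_list(SYNSETS, top_synsets=3):
    print(synset_id, lemmas)
```

In this toy data the token be survives under two different synsets, which mirrors the overlap mentioned in the post: the same token is studied once per common sense rather than once overall.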
the.yangist Posted February 14, 2011 at 09:17 PM
I compared the terms that were submitted for my list against this frequency corpus. There's a pretty good overlap for the top 60% of the open word classes. Some of the major differences come from ambiguities in usage: a term that is quite frequent as a closed word class (CWC) item can see far less use as an open word class (OWC) item. I added the missing terms in their appropriate places when their synsets were available.
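A comparison like the one in the post above might look roughly like this. It is only a sketch of one plausible reading of "overlap for the top 60%": `overlap_with_top` is a hypothetical helper, and the word lists are toy data, not the corpus referred to in the post.

```python
def overlap_with_top(my_terms, corpus_ranked, top_percent=60):
    """Compare a study list against the top slice of a frequency-ranked corpus list.

    Returns (share of the top slice covered by my_terms, missing terms in rank order).
    """
    # Integer arithmetic avoids floating-point surprises in the cutoff.
    cutoff = len(corpus_ranked) * top_percent // 100
    top_slice = corpus_ranked[:cutoff]
    if not top_slice:
        return 0.0, []
    mine = set(my_terms)
    missing = [term for term in top_slice if term not in mine]
    covered = len(top_slice) - len(missing)
    return covered / len(top_slice), missing


# Toy data only: a ten-word ranked list and a five-word study list.
corpus_ranked = ["的", "是", "在", "有", "人", "大", "小", "来", "去", "好"]
my_terms = ["是", "在", "有", "人", "好"]

share, missing = overlap_with_top(my_terms, corpus_ranked)
print(round(share, 2), missing)
```

The `missing` list is what would then be checked against the synset data, since the post only adds terms whose synsets are available.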
the.yangist Posted March 8, 2011 at 05:30 PM
Christiane Fellbaum was nice enough to forward me a link to a Taiwanese-run Mandarin WordNet. That link is here: http://bow.sinica.edu.tw/wn/
As a result of using a more sophisticated method and a complete database, there are now 703 entries, and that puts you at about 50% proficiency (and higher with the closed word classes).