mandarinboy Posted September 17, 2005 at 03:24 PM Report Share Posted September 17, 2005 at 03:24 PM If anyone is interested in it i am now parsing around 1.000.000 words a day from Chinese newspapers of different sorts to build a database with the most requently used words in modern Chinese newspapers. I will put it on my site http://www.mandarinteacher.com as soon as i am back from India. The intention of the database is to help students learn commonly used words in todays newspapers. Later on i will also calculate texts with word density to make it possible to find texts that match the level of words a student do know. What i have found is that the speach paterns varries widely betwwen for example peoples daily and 163.com. Maybe not so surprisingly but interesting. Since the data i am using are from Cedict and other sources the database will be free of cruse but licensed under the same conditions as the Cedict data are today. Quote Link to comment Share on other sites More sharing options...
in_lab Posted September 18, 2005 at 03:28 AM Report Share Posted September 18, 2005 at 03:28 AM If you are just checking for word frequency, why not use a more comprehensive word list? There are large word lists freely available, but you wouldn't have definitions for them. Quote Link to comment Share on other sites More sharing options...
mandarinboy Posted September 23, 2005 at 03:53 PM Author Report Share Posted September 23, 2005 at 03:53 PM We are in fact useing wordslists without translations when parsing. This so that we get all words. What we are using cedict and adso for is to later on create lists for stundents with pinyin and translations. After all, the most frequently used words are in those lists so it is nice combinations. Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and select your username and password later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.