trevelyan Posted November 2, 2004 at 07:07 PM
geek, I've added CEDICT output to my site. Content goes in here: http://www.adsotrans.com/cedict/update.pl ... and comes out nicely formatted and with pinyin: http://www.adsotrans.com/cedict.txt Maybe this would be useful to any mainlanders having difficulty accessing your site. I should have the server for at least the next year. As long as the bandwidth requirements aren't too extravagant, you'd be welcome to mirror or redirect there if it would help solve your access problems from the mainland. Best, --david
mandarinboy Posted November 2, 2004 at 08:29 PM
The CEDICT file at mandarintools.com has a new version out now, along with a new online page for submitting entries. Hopefully this will help us expand the list. I have combined the ADSO MySQL database and the CEDICT file into one Microsoft SQL Server database and added our own lists. This database now has more than 112,000 Chinese words and more than 150,000 English entries. I mention this because it sounds like a great idea to combine it with your list as well. Is it possible to get a copy of your files or database, in whatever format you are using, so we can make one really large file/database out of it? I can do the work. I have added an origin field to the database so that we can track where words come from and link to the proper licence information. My own entries I add to the CEDICT file. I am working on tools to export this list to a CEDICT-like text file.
I have a simple interface to this database at: http://82.182.78.97/dictionary I am still working on this site (it is my development site), so it may be slow at times. Within days I will add word classes, origin, etc. to the result page. This page also has stroke order animations for more than 3000 Chinese characters. I have started a project to add stroke order animations for every Chinese character. To make it as useful as possible today, I am linking to Ocrat's stroke order animations as well as our own. Today we have as many stroke orders as Ocrat has, but some of them are for the same characters. We make both simplified and traditional stroke order animations, while Ocrat only has simplified. In a future version of our database we will include the stroke order animations. Today they are in a different character database (compiled from the Unihan file from the Unicode organization). The result will be one very large database with information about characters, words, stroke order, radical, etc.
ChouDoufu Posted November 3, 2004 at 03:39 PM
Has anyone made contact with the person maintaining the CEDICT project? I'd love to submit words to the dictionary, but want to know if I'll be wasting my time. If it's not updated for 8 months at a time, I don't think it's worth it.
mandarinboy Posted November 3, 2004 at 06:03 PM
Yes, I have made contact with mandarintools.com. There is a new list up now and also an online page for adding new words. I just posted some to test the functionality and it is very impressive! I will now start adding my own new words to the list.
ChouDoufu Posted November 3, 2004 at 11:31 PM
I'm so happy. I have a bunch of word lists I've made that I'm going through carefully. One thing I think the dictionary is lacking, though, is examples as well as collocations (搭配). Do you think he would be willing to add these features? Is it appropriate to add usage or collocations to definitions? For some words it would be extremely useful.
mandarinboy Posted November 4, 2004 at 09:10 AM
You can use this link to add new words: http://www.mandarintools.com/submit.html or mail him at cedict AT chinesetools.com (replace AT with @). I too would like to add examples for many words. Suggest it to him. I am thinking of adding them to my lists anyway, for my own online projects. I have already added compounds to each character, and that is helpful for me.
geek_frappa (Author) Posted November 5, 2004 at 05:18 PM
Quoting trevelyan: Maybe this would be useful to any mainlanders having difficulty accessing your site. I should have the server for at least the next year. As long as the bandwidth requirements aren't too extravagant, you'd be welcome to mirror or redirect there if it would help solve your access problems from the mainland.
wow! thanks. i appreciate it. your site is very nice... i hope we can work together in the next few years.
Quoting mandarinboy: I too would like to add examples for many words. Suggest it to him. I am thinking of adding them to my lists anyway, for my own online projects. I have already added compounds to each character, and that is helpful for me.
a lot of people want examples. should we ask native speakers to create sentences? i have been wondering how to do this without entering them myself... i also like the use of unihan on your site...
trevelyan Posted November 6, 2004 at 04:14 PM
Quoting mandarinboy: The result will be one very large database with information about characters, words, stroke order, radical etc.
This is a fantastic idea. I like the idea of keeping things database-oriented rather than text-oriented, as this scales better and content can always be dumped to a text file if someone wants. I'll switch to using this if I can make it work with what I have. Two recommendations which would make it easier:
(1) A FLAG text field to store miscellaneous data about words. For example, if a word is a country name, this field will contain "COUNTRY" and perhaps "PLACE". Organizations get an "ORG" tag, measure words "MW", etc. A flexible field like this makes the database extensible without forcing people to change its structure or even run their own database. To list just one example, it would make it possible for users to mark words as belonging to special subclasses ("MEDICAL", "LEGAL", "CHENGYU", etc.).
(2) Another idea is to include a frequency tag. This is useful for parsing Chinese text, but difficult to generate. Perhaps someone could be encouraged to create one.... ;)
Quoting geek_frappa: a lot of people want examples. should we ask native speakers to create sentences? i have been wondering how to do this without entering them myself... i also like the use of unihan on your site...
It may be possible to automate this. I'd (1) download several gigabytes of Chinese text to a hard drive, (2) weed out the HTML and print the individual sentences to a text file, and then (3) run some brute-force pattern-matching. Anyone with access to a Unix server can do step 1 with the program wget. I can modify Adso to take care of step 2 quite easily. I'm working on trying to improve the accuracy of my parser right now -- it may be possible to reliably identify POS in simple sentences and fetch appropriate examples shortly.
Definitely error-prone, but probably better than having CSL students generate sentences, and significantly easier to organize. Letting users flag incorrect sentences for review is presumably easier than creating each definition manually. Anyone game? I can help out but don't have the computer resources sitting idle for the pattern matching.
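For anyone who wants to try the second and third steps of that pipeline on a small scale, here is a minimal Python sketch. It is not how Adso does it; the function names (strip_html, split_sentences, find_examples) and the length cutoff are made up for illustration. It assumes the pages are already on disk and decoded to text, and it matches by plain substring, so it knows nothing about word boundaries or POS.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(page):
    """Step 2a: weed out the HTML and keep only the visible text."""
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.chunks)


def split_sentences(text):
    """Step 2b: cut the text after Chinese sentence-final punctuation."""
    pieces = re.findall(r"[^。！？]+[。！？]?", text)
    return [p.strip() for p in pieces if p.strip()]


def find_examples(word, sentences, max_len=30, limit=3):
    """Step 3: brute-force match, keeping short sentences that contain the word."""
    hits = [s for s in sentences if word in s and len(s) <= max_len]
    return hits[:limit]


page = (
    "<html><body><p>我喜欢北京。你去过上海吗？</p>"
    "<script>var x = 1;</script><p>北京很大！</p></body></html>"
)
sentences = split_sentences(strip_html(page))
examples = find_examples("北京", sentences)
print(examples)
```

Because the match is a plain substring test, a two-character word can be found straddling two unrelated words, which is one reason a review flag for bad sentences would still be needed.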
geek_frappa (Author) Posted December 5, 2004 at 02:37 AM
Quoting trevelyan: (1) A FLAG text field to store miscellaneous data about words. For example, if a word is a country name, this field will contain "COUNTRY" and perhaps "PLACE". Organizations get an "ORG" tag, measure words "MW", etc. A flexible field like this makes the database extensible without forcing people to change its structure or even run their own database. To list just one example, it would make it possible for users to mark words as belonging to special subclasses ("MEDICAL", "LEGAL", "CHENGYU", etc.).
then, as you recommended, i might want to add a few extra fields/elements to describe "part of speech", and maybe a keywords field to give it relevance? hmm....
trevelyan Posted December 5, 2004 at 07:42 PM
Not sure exactly how RSS works here. Probably wise to be careful of polymorphic words which can be translated as nouns, verbs or whatever depending on the context. An extensible field for non-definitional information is very useful though. To give you a practical example of the benefit I've personally seen from using it, Mark Swofford of pinyin.info fame was kind enough to pass along a list of Chinese placenames recently. These are now part of the Adso database, tagged with the flag "CHINESEPLACES". As a direct result, it's now possible to select on Chinese and non-Chinese cities, placenames, etc. Adding this functionality required only deciding on an appropriate way to mark up the information in question. Applications using the database can decide for themselves whether this information is relevant/useful. If there are a number of projects that would find this sort of information useful, a good first step would probably be to agree on a standard way to mark up POS content, so that different projects actually generating content are doing so in a compatible manner. I'd be happy to write up a short note on how I am handling things for anyone interested.
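To make the flag idea concrete, here is a rough sketch using Python's built-in sqlite3 module. The flag names (COUNTRY, PLACE, MW, CHINESEPLACES) come from the posts above; everything else is an assumption for illustration, including the table layout, the choice to store flags as one space-separated string, and the helper name words_with_flag. It is not the actual Adso schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one free-form "flags" column holding space-separated tags.
conn.execute("CREATE TABLE words (chinese TEXT, english TEXT, flags TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [
        ("北京", "Beijing", "PLACE CHINESEPLACES"),
        ("巴黎", "Paris", "PLACE"),
        ("法国", "France", "COUNTRY PLACE"),
        ("个", "general measure word", "MW"),
    ],
)


def words_with_flag(conn, flag, without=None):
    """Return words carrying `flag`, optionally excluding those with `without`.

    The column is padded with spaces on both sides so that searching for
    PLACE does not accidentally match inside CHINESEPLACES.
    """
    sql = "SELECT chinese FROM words WHERE ' ' || flags || ' ' LIKE ?"
    params = [f"% {flag} %"]
    if without is not None:
        sql += " AND ' ' || flags || ' ' NOT LIKE ?"
        params.append(f"% {without} %")
    sql += " ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, params)]


chinese_places = words_with_flag(conn, "CHINESEPLACES")
other_places = words_with_flag(conn, "PLACE", without="CHINESEPLACES")
print(chinese_places, other_places)
```

New categories need no schema change: tagging a word MEDICAL or CHENGYU is just another token in the same column, and applications that do not care about a flag can ignore it.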
geek_frappa (Author) Posted December 5, 2004 at 08:05 PM
Quoting trevelyan: I'd be happy to write-up a short note on how I am handling things for anyone interested.
that would be a good first step. a good blueprint and technical specification will start us down the right path. in january, i'll have more time to work on this project more seriously. for now... below is a sample XML file ... http://chinese.primezero.com/soleri/add/u8.xml
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <title>Adding Dictionary Entries Using RSS Feeds BETA</title>
    <description>shi jie zi dian - bei ta</description>
    <entry>
      <chinese>倍塔</chinese>
      <english>Beta...</english>
      <pinyin>bei4 ta3</pinyin>
    </entry>
  </channel>
</rss>
Anyone who wants to test out this RSS-based adding system... Follow these three easy steps...
1. Download the XML.
2. Change the content to contain your name in Chinese [UTF-8], English, and pinyin.
3. Upload the XML to your webserver and provide the URL.
PM Me Your URL!! Don't Post Here.
geek_frappa (Author) Posted December 5, 2004 at 08:06 PM
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <title>Adding Dictionary Entries Using RSS Feeds BETA</title>
    <description>shi jie zi dian - bei ta</description>
    <entry>
      <chinese>倍塔</chinese>
      <english>Beta...</english>
      <pinyin>bei4 ta3</pinyin>
      <field1>more info</field1>
      <field2>more info, too</field2>
    </entry>
  </channel>
</rss>
Sample XML with additional fields, addressing your recommendation...
Quoting trevelyan: Not sure exactly how RSS works here.
RSS --> Database System ... similar to Google News as far as I can gather....
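As a rough idea of what the receiving side of such a feed could look like, here is a small Python sketch that reads the sample XML with the standard library's xml.etree.ElementTree and turns each entry into a CEDICT-like line. The function names (parse_entries, to_cedict_line) and the exact output layout are assumptions for illustration; the thread does not describe how the real system processes submissions.

```python
import xml.etree.ElementTree as ET

# The sample feed from the post above, including the two extra fields.
SAMPLE_FEED = """<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <title>Adding Dictionary Entries Using RSS Feeds BETA</title>
    <description>shi jie zi dian - bei ta</description>
    <entry>
      <chinese>倍塔</chinese>
      <english>Beta...</english>
      <pinyin>bei4 ta3</pinyin>
      <field1>more info</field1>
      <field2>more info, too</field2>
    </entry>
  </channel>
</rss>
"""


def parse_entries(feed_bytes):
    """Yield one dict per <entry>, mapping each child tag to its text."""
    root = ET.fromstring(feed_bytes)
    for entry in root.iter("entry"):
        yield {child.tag: (child.text or "").strip() for child in entry}


def to_cedict_line(entry):
    """Format an entry as 'headword [pinyin] /definition/' (assumed layout)."""
    return f"{entry['chinese']} [{entry['pinyin']}] /{entry['english']}/"


# Parse from bytes so the encoding declaration in the XML header is honoured.
entries = list(parse_entries(SAMPLE_FEED.encode("utf-8")))
for entry in entries:
    print(to_cedict_line(entry))
```

Unknown tags such as field1 and field2 simply end up as extra keys in the dict, so a consumer can store or ignore them without the feed format having to change.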