trevelyan Posted September 6, 2007 at 07:50 AM

I'm looking for volunteers to help build rulesets and make suggestions for the next version of the Adso system. We have working code here: http://www.adsotrans.com/downloads/v5/

Right now we're in the initial stages of teaching the software to recognize times, dates, personal names, and basic compounds. This sets up the building blocks for more complex grammar analysis. Basic processing is still done by the backend engine, but a lot of the advanced functionality is being moved into external XML files that are (optionally) read by the engine. I'll be iterating on these files quite quickly as errors get fixed.

If you're interested in helping out, send me an email and I'll add you to the list. I'm not sure if this forum is the proper place for ongoing and somewhat technical discussions. It all depends on what priorities people have.

Notes: This is BETA software. The quality of the output is not as good as you'll find on the main site (http://www.adsotrans.com). This is because the old system has a lot of hand-coded rules that need to be rewritten, re-evaluated and moved to the new system. For instance, the new version does not yet recognize place names automatically or handle verb conjugation.

On the other hand, this new system is infinitely cooler from an architectural perspective. It is also a lot more flexible. If you wanted to print a list of all of the proper nouns in a document, for instance, you could take care of that with a command like: ./adso -f [input] --extra-code " AND "
zozzen Posted October 20, 2007 at 06:07 PM

Wow! You're doing a really great project. Every time we click "teach me" we have to re-enter the Simplified Chinese characters, and there's no way to input the part of speech. That's quite inconvenient. Also, when a volunteer makes an entry, it would be great to ask them for an example sentence too. That may help enrich your dictionary further. Anyway, this project is amazing, in both its idea and its technology.
zozzen Posted October 20, 2007 at 06:07 PM

Btw, is it possible to add multiple entries at once? Entering them word by word is really tiring.
trevelyan Posted October 28, 2007 at 01:35 PM

Zozzen, I edit the POS entries when reviewing contributions, so it isn't that big a deal if someone doesn't provide them. The "quick add" script attempts to guess the POS from the English definition anyway (i.e. enter verbs in the infinitive).

If anyone wants to make bulk contributions, just send me the data somehow and I'll bulk add it. If there's a good way to automate this, I'm open to any and all suggestions.

We have space in the database for sample sentences, but I don't think it makes much sense to ask people to provide them. For the ChinesePod dictionary we've just indexed all of the lesson content using Lucene and are outputting matching sentences for searches automatically. It works pretty well. We could get better results for news texts just by indexing tons of Xinhua materials, or for classical terminology by indexing books like Dream of the Red Chamber.
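To illustrate the kind of guess described above, here is a minimal sketch in Python. It is not Adso's actual "quick add" code: the function name `guess_pos` and the tag labels are hypothetical, and the only rule taken from the post is that verbs are entered in the infinitive.

```python
def guess_pos(english_definition: str) -> str:
    """Guess a coarse part of speech from an English gloss (hypothetical heuristic)."""
    gloss = english_definition.strip().lower()
    # Verbs are expected in the infinitive, e.g. "to run".
    if gloss.startswith("to "):
        return "VERB"
    # Assumption: everything else defaults to NOUN and is corrected during review.
    return "NOUN"


if __name__ == "__main__":
    for definition in ("to run", "word"):
        print(definition, "->", guess_pos(definition))
```

A default of this sort would only be workable because, as the post says, the POS entries are edited by hand when contributions are reviewed.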
zozzen Posted November 9, 2007 at 10:28 AM

I'm downloading the beta code to see if I can contribute (though it seems too technical for me). This link is dead: http://www.adsotrans.com/downloads/v5/adso-v5.004.tar.gz

It also seems the "teach me" function doesn't allow users to rewrite the definition of a word. Let's say 字 is currently defined in the dictionary as "word". If I want to add another definition (i.e. a floor), the system doesn't accept the edit.
zozzen Posted November 9, 2007 at 10:50 AM

trevelyan, please take a look and see if this can help your project. Dr. Lu Qin (陸勤女士) from the PolyU of Hong Kong has released a Chinese treebank as a public resource. Around a million words are included. http://www4.comp.polyu.edu.hk/~cclab/index.php?p=projects_treebank&lv=2&cat=1,2,1&i=6
trevelyan Posted November 10, 2007 at 09:06 AM

Zozzen, this is the old dictionary editing interface: http://www.adsotrans.com/adso/uniedit.pl

Changes can also be made through the ChinesePod dictionary - they'll filter back to the project. One small note: new entries added manually through this form may not be recognized by the system immediately. The reason is that the content is added in GB2312 rather than UTF-8, and the dictionary needs to be updated before all of the data is copied into the appropriate tables.

Thanks for the link to the new treebank. Hadn't heard of it and am checking it out now...
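For readers unfamiliar with the encoding issue mentioned above, the following Python sketch shows what re-encoding a GB2312 entry as UTF-8 looks like. It only illustrates the encoding difference and makes no claim about how Adso's update step works; the helper name `gb2312_to_utf8` is hypothetical.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Re-encode GB2312 bytes as UTF-8 (hypothetical helper, not Adso code)."""
    # Decode the GB2312 bytes to text, then encode that text as UTF-8.
    return raw.decode("gb2312").encode("utf-8")


if __name__ == "__main__":
    entry = "字".encode("gb2312")      # two bytes in GB2312
    converted = gb2312_to_utf8(entry)  # three bytes in UTF-8
    print(len(entry), len(converted), converted.decode("utf-8"))
```

The same character has different byte sequences in the two encodings, which is why data stored in one cannot simply be copied into tables that expect the other.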