trevelyan Posted February 17, 2008 at 11:21 AM
We've fixed the issues with automatic traditional character recognition that character pointed out in another thread. The updated code (and database) is available for download. Anything from version v5-022 should work: http://adsotrans.com/downloads/adso-v5.022.tar.gz

Have also edited our "advanced editing page" so that traditional characters can be edited. Right now we will fail to parse traditional words if they do not exist in our database, even if the simplified counterpart does. All about maintaining the integrity of the database.

Suggestions on how to improve the system for users/contributors who want to deal mostly with traditional Chinese are welcome. Do we need separate editing and annotating pages? I'm not sure but would like to make whatever changes are necessary to get the fanti crowd more involved. More details on the Adso blog.
character Posted February 17, 2008 at 01:06 PM
trevelyan wrote: "Right now we will fail to parse traditional words if they do not exist in our database, even if the simplified counterpart does. All about maintaining the integrity of the database."

Automatic conversion seems dauntingly difficult: http://www.cjk.org/cjk/c2c/c2cbasis.htm

I guess the internet could be harnessed to see if traditional "matches" exist for simplified phrases. The results could be reviewed before inclusion in Adso.
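A rough sketch of that "look for matches, then review" idea, in Python. This is not Adso code: the tiny `S2T` table, the function names, and the use of a plain string as the "internet" corpus are all assumptions made for illustration. It expands a simplified word into every traditional spelling the table allows, then keeps only the spellings that actually occur in some traditional text, so a human can review them before anything goes into the database.

```python
from itertools import product

# Hypothetical, deliberately tiny simplified-to-traditional table.
# A one-to-many entry such as 发 is exactly what makes conversion hard.
S2T = {
    "发": ["發", "髮"],
    "头": ["頭"],
    "现": ["現"],
}


def candidates(simplified, table=S2T):
    """Return every traditional spelling the table permits for a word."""
    # Characters missing from the table are assumed to be unchanged.
    options = [table.get(ch, [ch]) for ch in simplified]
    return ["".join(combo) for combo in product(*options)]


def attested(simplified, corpus, table=S2T):
    """Keep only the candidates that appear in the traditional corpus text."""
    return [c for c in candidates(simplified, table) if c in corpus]


if __name__ == "__main__":
    corpus = "她的頭髮很長"  # stand-in for text gathered from traditional-script pages
    print(candidates("头发"))        # all spellings the table allows
    print(attested("头发", corpus))  # spellings worth sending to a reviewer
```

Nothing here decides which form is correct. It only narrows the list a reviewer has to look at, which keeps the manual check as the final gate.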
trevelyan (Author) Posted February 17, 2008 at 02:01 PM
The academic team at ChinesePod is using some Adso-related tools to help with lesson preparation, which is helping us flag some of the issues that still exist with duoyinci and pushing forward the project.

Manual review is definitely critical. The best solution is really to find some people who are interested in this sort of thing and are coming at text analysis from a fanti perspective. Then religiously fixing the problems they complain about.
character Posted February 17, 2008 at 03:20 PM
trevelyan wrote: "Then religiously fixing the problems they complain about."

Going entirely to apache licensing would be favorite.

---------------

    ./adso -f file1.txt --code --extra-code " AND " > file2.txt

This produces an empty file. Do I need to be using the non-latin database for this to work? Until this is fixed, is there any chance of an enhanced vocab mode which includes the pinyin in addition to everything else it outputs?

-----------------

    ./adso -f file1.txt -ie utf8 -is traditional -oe utf8 -os traditional --vocab > file2.txt

1) Wenlin says file2.txt has ~1200 UTF-8 format violations.
2) Wenlin seems to be saying that the "U+3000 Ideographic space" in the input is processed into "U+FFFD Replacement character" (which displays as a control character).
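One way to double-check both Wenlin findings independently of Adso is to scan the files directly. The following is a minimal Python sketch, not part of Adso; the function name `find_utf8_problems` is made up for illustration. It reports the byte offsets where the data is not valid UTF-8, and counts how many U+3000 and U+FFFD characters appear in the parts that do decode.

```python
import sys


def find_utf8_problems(data: bytes):
    """Locate invalid UTF-8 sequences and count two characters of interest."""
    invalid_offsets = []
    decoded_parts = []
    pos = 0
    while pos < len(data):
        try:
            decoded_parts.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice that was decoded.
            decoded_parts.append(data[pos:pos + err.start].decode("utf-8"))
            invalid_offsets.append(pos + err.start)
            pos += err.end  # resume after the offending bytes
    text = "".join(decoded_parts)
    return {
        "invalid_offsets": invalid_offsets,
        "ideographic_spaces": text.count("\u3000"),
        "replacement_chars": text.count("\ufffd"),
    }


if __name__ == "__main__":
    # Usage: python check_utf8.py file1.txt file2.txt
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            report = find_utf8_problems(handle.read())
        print(path,
              "invalid sequences:", len(report["invalid_offsets"]),
              "U+3000:", report["ideographic_spaces"],
              "U+FFFD:", report["replacement_chars"])
```

Running it on both file1.txt and file2.txt gives a quick comparison: if the input contains ideographic spaces and the output contains none but has U+FFFD characters instead, that would support Wenlin's second point.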
trevelyan (Author) Posted February 17, 2008 at 05:34 PM
I'm generally happy to let people use the adso materials commercially provided they attribute the materials and contribute back to the project. I don't think it's onerous to send an email asking for permission.

On the traditional side, can you mail me the file you're using so that I can take a look at it myself? My email address is david.lancashire at google.com. I think the command is working for me, so I'd like to replicate things exactly. You are compiling from source, right?
trevelyan (Author) Posted March 3, 2008 at 12:50 AM
Thanks to pressure from Mark at toshuo.com, the annotation engine is now outputting popups in traditional characters (when input is traditional characters). Will be working on hooking up the editing functionality for the traditional stuff later this week and will post when that's done.