
Please Help -- Traditional Characters and Adso



Posted

Hi everyone,

This is a call for volunteers to help add support for complex characters to Adso, the online system for Chinese text annotation, translation and processing -- www.adsotrans.com. We've finished most of the software and database work. What we need now are people to review entries with ambiguous characters. This requires knowledge of 繁体 and a computer capable of traditional character input.

http://www.adsotrans.com/vocab/complex_check.php

If you click on the link above, you'll be taken to a page containing a table of words which start with the same character. The basic work involves looking at a simplified word, figuring out what the appropriate complex form is and telling the system. Everything else is automated.

If you have a bit of time you can spend on this and find Adso useful, please consider contributing and helping to spread the load. There are about 5000 words that need to be reviewed, but when this is done we'll be able to offer all of the services we currently do in the complex character set. We'll also be able to provide sophisticated conversion between simplified and complex texts, cross-character set annotation and more. If you find Adso useful but aren't good with traditional characters, please consider passing word to anyone you know who might be able to help. The more people get involved the faster this is likely to go.

Posted

This is interesting, I would like to try.

Edit: One question, trevelyan: why are most of the terms technical/scientific jargon?

Posted

Just to clarify, what you're requesting is just the identification of the correct traditional characters to use in place of the simplified characters - not looking for different phonetic transcriptions (since that's what most of these suspect words seem to be).

Also, what about variants where two characters are used - currently on 美台, which sometimes is written 美薹 but other times is not. What do we do with those?

Posted
Just to clarify, what you're requesting is just the identification of the correct traditional characters to use in place of the simplified characters - not looking for different phonetic transcriptions (since that's what most of these suspect words seem to be).

Yes. All entries have been flagged because they contain at least one character which maps to multiple complex characters. We want to convert them into their most common traditional form as we will be using them to parse traditional texts and handle conversion from simplified documents.
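For anyone curious what "flagged" means in practice, here is a minimal sketch of that kind of check. It is not Adso's actual code: the `AMBIGUOUS` table, the function names, and the sample words are all assumptions for illustration, and a real one-to-many table would be much larger.

```python
# Hypothetical one-to-many table: simplified character -> possible traditional forms.
# Only a few well-known cases are listed; this is not Adso's real data.
AMBIGUOUS = {
    "发": ["發", "髮"],
    "台": ["台", "臺", "檯", "颱"],
    "后": ["後", "后"],
    "面": ["面", "麵"],
}


def ambiguous_chars(word):
    """Return the characters in `word` that map to more than one traditional form."""
    return [ch for ch in word if len(AMBIGUOUS.get(ch, [])) > 1]


def needs_review(word):
    """True if the word contains at least one ambiguous character."""
    return bool(ambiguous_chars(word))


# Sample entries (illustrative only).
words = ["头发", "发展", "中国", "美台"]
flagged = [w for w in words if needs_review(w)]
```

Here `flagged` keeps every word containing a character from the table, which is why both 头发 and 发展 end up in the review queue even though a human resolves them instantly.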

Also, what about variants where two characters are used - currently on 美台, which sometimes is written 美薹 but other times is not. What do we do with those?

Either is fine. The database can support multiple traditional characters just like it supports variants of simplified ones. The interface isn't designed to handle this kind of complexity in the editing process though, so just try to go with whichever is more common by default. I'll see that the database supports both 美台 and 美薹. If anyone notices other problematic entries please make a note for me and I'll see that they get solved.

Posted

You can semi-automate it with a best-guess approach, using a search engine like Google or, more directly, a large word list.

Search for the phrase/word converted to all possible variations and pick the most common one as reported by Google. This is the approach I am currently using....
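A rough sketch of that best-guess idea, assuming a local frequency table instead of live search queries. The `CANDIDATES` mapping, the `COUNTS` numbers, and the function names are invented for illustration; the counts are placeholders, not real hit counts.

```python
from itertools import product

# Hypothetical candidate table: simplified character -> possible traditional forms.
CANDIDATES = {
    "发": ["發", "髮"],
    "头": ["頭"],
    "展": ["展"],
}

# Placeholder counts standing in for search-engine hits or word-list frequencies.
COUNTS = {"頭髮": 9500, "頭發": 400, "發展": 52000, "髮展": 30}


def variants(word):
    """Expand a simplified word into every possible traditional spelling."""
    options = [CANDIDATES.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]


def best_guess(word, counts):
    """Return the most frequent variant plus the full ranking for manual review."""
    ranked = sorted(variants(word), key=lambda v: counts.get(v, 0), reverse=True)
    return ranked[0], [(v, counts.get(v, 0)) for v in ranked]


guess, ranking = best_guess("头发", COUNTS)
```

Returning the whole ranking along with the top guess lets a reviewer spot cases where two variants have similar counts and the automatic pick should not be trusted.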

good luck...

P.S.

Is your software open source?

Posted

It's an interesting idea Googling for things, but I'm not sure how that method is supposed to differentiate between the different forms of the character 发. At the least, one needs to know all of the ambiguous entries first and then do bulk queries on all variants. Google also indexes a lot of problematically converted texts, so I'm not sure the statistical method would work for low-frequency characters.

We're supporting a few projects as a backend processing engine for Chinese text and have an open source license that provides for non-commercial use of the stuff we release publicly. Contributors are also free to make contributions under the CEDICT license if they wish. Email me if you're curious about the details.

Posted

Yeah, characters like 发 are rather problematic due to their dependency on context... but for a large portion of conversions the character on either side contains enough information to determine the appropriate conversion...
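To make the neighbouring-character idea concrete, here is a small sketch. The rule tables and the `convert` function are hypothetical and cover only a few 发 cases; a real system would need far more rules or a word-level dictionary.

```python
# Hypothetical default mapping used when no context rule applies.
DEFAULT = {"发": "發", "头": "頭", "现": "現"}

# Hypothetical context rules keyed on the character to the left or right.
LEFT_RULES = {("头", "发"): "髮", ("理", "发"): "髮"}   # (previous char, char)
RIGHT_RULES = {("发", "型"): "髮"}                      # (char, next char)


def convert(text):
    """Convert simplified text, letting adjacent characters override the default."""
    out = []
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if (prev_ch, ch) in LEFT_RULES:
            out.append(LEFT_RULES[(prev_ch, ch)])
        elif (ch, next_ch) in RIGHT_RULES:
            out.append(RIGHT_RULES[(ch, next_ch)])
        else:
            # Fall back to the default form, or keep the character unchanged.
            out.append(DEFAULT.get(ch, ch))
    return "".join(out)
```

With these rules, `convert("头发")` picks 髮 because of the preceding 头, while `convert("发展")` falls through to the default 發.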

I was only suggesting you might use this approach to quickly fill in a large number of the 5000 conversions you mentioned... rather than have some poor sod slave over them :wink:

have fun

Posted

I am not of a sufficiently advanced level to help with this project, but I wanted to say that I will be thrilled when it has been completed.
