trevelyan Posted August 4, 2005 at 06:56 PM

Many Chinese words have multiple meanings, with the intended meaning clear only from context. 天 means both "day" and "sky", for example, while a reference to 中乌关系 probably means relations between China and Uzbekistan, not between China and a common crow. As part of its effort to offer context-sensitive annotation, Adso attempts to resolve many of these ambiguities. One way this is done is by associating a "weight" with each definition. Tests in the software automatically increase and decrease the weight associated with different parts of speech, and at the end of processing the part of speech with the highest weight is selected as the most likely translation.

Since the general rule-set is not always suited to every word, and provides no way to differentiate between multiple definitions for a single part of speech, Adso defines a general markup language which can be used to fine-tune word choice. These rules are invoked by providing them in the CODE field of the database. In the cases listed above, Adso should prefer the definition "day" over "sky" if the previous word is a number, and it should select "Uzbekistan" over "crow" should it find the name of the whole country (乌兹别克斯坦) elsewhere in the source text. The tags which apply these two rules match on "Number" and on the fulltext word 乌兹别克斯坦, respectively.

What follows is a brief explanation of the currently existing markup language, intended to help users understand how the rules work and write their own. The rules themselves fall into four categories: (1) searching the fulltext of the document being translated, (2) searching the current sentence, (3) miscellaneous, and (4) meta-markup. There is no limit on the number of rules which may be created for individual words, and suggestions on new rules to support are always welcome.

1. Search the fulltext: TEXTMATCH adds score to an entry for each of the listed Chinese words (e.g. 甲 and 乙) found in the fulltext of the document translated up until that sentence.

2. Search the current sentence: these tags add score to an entry for each of the listed Chinese words (e.g. 甲 and 乙) or parts-of-speech (e.g. Noun and Verb) found within the sentence. Use MATCH to search for the presence of a word anywhere in the sentence. NEXT and LAST indicate a directional search (forwards or backwards from the current word). WORD triggers a search for Chinese characters rather than parts of speech, while LOOP may be added to continue a search until the end of the sentence.

3. Miscellaneous: the BONUS tag adds score to an entry automatically. BOOST adds score to the entry at the distance provided, while ENDING and BEGINNING are effective if the word is found at those locations in the sentence.

4. Meta: these tags contain meta-markup. They do nothing unless there is CODE nested within them; should that nested CODE trigger any changes, the meta action is performed. Currently-supported actions are resegmenting the text after the Xth character in the current word, deleting all units but the current part of speech, and deleting the current part-of-speech. By way of an example, a meta tag can resegment the current sentence by breaking up the current word after the 2nd character should the sentence already contain a Verb.
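To make the weighting idea concrete, here is a minimal Python sketch of the selection logic described above. It is not Adso's actual code (Adso reads these rules from the CODE field of the database); the data structures, rule types and scores below are illustrative assumptions only.

    # Minimal sketch of weight-based definition selection.
    # NOT Adso's implementation; names, weights and rule formats are assumptions.

    def pick_definition(candidates, sentence, fulltext, position):
        """Return the candidate definition with the highest accumulated weight.

        candidates: list of dicts with 'gloss', 'weight' and a 'rules' list.
        sentence:   the segmented sentence (list of words) being processed.
        fulltext:   all text translated so far.
        position:   index of the current word within the sentence.
        """
        for cand in candidates:
            for rule in cand.get("rules", []):
                # Fulltext rule: boost if a trigger word appears anywhere in the
                # document processed so far (cf. TEXTMATCH).
                if rule["type"] == "fulltext" and rule["target"] in fulltext:
                    cand["weight"] += rule["score"]
                # Previous-word rule: boost if the preceding word matches a trigger,
                # e.g. a number immediately before 天 suggesting "day".
                elif (rule["type"] == "last" and position > 0
                        and sentence[position - 1] in rule["targets"]):
                    cand["weight"] += rule["score"]
        # The highest-weight gloss wins.
        return max(candidates, key=lambda c: c["weight"])

    # Example: disambiguating 乌 in 中乌关系 when 乌兹别克斯坦 appears earlier.
    candidates = [
        {"gloss": "Uzbekistan", "weight": 1.0,
         "rules": [{"type": "fulltext", "target": "乌兹别克斯坦", "score": 5}]},
        {"gloss": "crow", "weight": 1.0, "rules": []},
    ]
    best = pick_definition(candidates, ["中", "乌", "关系"],
                           "乌兹别克斯坦总统访问北京。", 1)
    print(best["gloss"])   # -> Uzbekistan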
HANDLING VERB CONJUGATION

The conjugation of verbs can be set using the TENSE commands. These either set the tense for the current verb, or set the tense for any verb located "distance" away from the current verb. The following numbers correspond to the following tenses:

0 -- infinitive
1 -- participle
2 -- gerund
3 -- indicative - present
4 -- indicative - perfect
5 -- indicative - past
6 -- indicative - pluperfect
7 -- indicative - future
8 -- indicative - future perfect
9 -- conditional - present
10 -- conditional - perfect
11 -- imperative
12 -- conjunctive - present
13 -- conjunctive - perfect
14 -- conjunctive - past
15 -- conjunctive - pluperfect
16 -- past participle
17 -- future

A TENSE command with tense 2 and a distance of one therefore sets any verb following the current verb to the gerund.

GENERAL NOTES

1. Be sure to close all tags.
2. When providing a list of parts-of-speech or words, multiple options can be provided as long as they are separated by ASCII spaces (" ").

ADDITIONAL EXAMPLES

你 他 我 -- adds a heavy weighting of 5 to the definition associated with this CODE field if any of the remaining words in the sentence are 你, 我 or 他. Adding 5 points to an entry is a virtual guarantee that the entry will be selected UNLESS a more likely grammatical construct is found through the grammar analysis.

Pronoun -- adds a light weighting of 0.5 to all NOUN entries associated with this CODE field if any of the remaining words in the sentence are tagged as a PRONOUN. Adding 0.5 points to an entry is a light nudge. Try to use this sort of disambiguation before resorting to heavier weightings if possible.

City Country Place -- deletes all entries BUT the entry associated with this CODE field if the next word is a City, Country or Place.
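As a quick illustration of the TENSE numbering listed above, here is a small Python sketch that resolves a tense code against that table and applies it at a given distance from the current verb. The function and data structures are assumptions made up for this example, not Adso internals.

    # Tense codes as listed above; the surrounding code is an illustrative
    # sketch only, not Adso's actual data structures.
    TENSES = {
        0: "infinitive", 1: "participle", 2: "gerund",
        3: "indicative - present", 4: "indicative - perfect",
        5: "indicative - past", 6: "indicative - pluperfect",
        7: "indicative - future", 8: "indicative - future perfect",
        9: "conditional - present", 10: "conditional - perfect",
        11: "imperative", 12: "conjunctive - present",
        13: "conjunctive - perfect", 14: "conjunctive - past",
        15: "conjunctive - pluperfect", 16: "past participle",
        17: "future",
    }

    def apply_tense(words, current_index, tense_code, distance=0):
        """Set the tense of the verb 'distance' positions from the current verb.

        words is a list of dicts with 'pos' (part of speech) and 'tense' keys.
        With distance=0 the current verb itself is conjugated; distance=1
        targets the following verb, mirroring the second form of the command.
        """
        target = current_index + distance
        if 0 <= target < len(words) and words[target]["pos"] == "Verb":
            words[target]["tense"] = TENSES[tense_code]
        return words

    # Example: tense 2 at distance 1 sets the following verb to the gerund.
    sentence = [{"pos": "Verb", "tense": None}, {"pos": "Verb", "tense": None}]
    apply_tense(sentence, 0, 2, distance=1)
    print(sentence[1]["tense"])   # -> gerund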
Manchu Posted August 4, 2005 at 08:16 PM

Re: the TEXTMATCH scores. How exactly are these decided? Would you just keep adding a point until the text looks correct? What if this affects other translations, and they then have errors? I'm not sure how commonly this type of problem would pop up, though. Perhaps the 巴 for Pakistan/Palestine may cause problems like this.

In any case, would a learning algorithm be possible? Example: for a given text, we have a wrongly-translated word, with some related words already marked. Rather than directly modifying the scores of these related words, the person correcting the sentence just chooses the correct translation; the algorithm would then decide which score adjustments are appropriate. Again, I'm not familiar enough with the system, so maybe this doesn't help. As is, I wouldn't personally use any of these tags, but a system like the one above would make it easier for users to correct these mistakes.
trevelyan Posted August 11, 2005 at 06:23 PM

All of the pieces of CODE markup are contextual -- changes happen only in the sentence currently under analysis, and only for that particular word. They don't affect the permanent database, or how words are translated in other contexts.

Specifically, the TEXTMATCH markup is akin to saying "if I find this word in the fulltext of the piece I'm currently translating, boost the probability that this particular word translates in this particular way by a certain amount". It works because there is a fairly high degree of predictability in sorting out the ambiguities. Of course, we get into trouble if we ever need to translate an article about Palestine and Pakistan, but that's a linguistic rather than a technical problem: even native readers would be confused if the author started using shorthand in that kind of piece.

It should be possible to implement machine learning algorithms that automatically generate CODE based on the statistical properties of known parallel-text translations. That is less a technical problem than one of human effort. If anyone is interested in doing this sort of work and is looking for technical support, I'd be happy to provide what help I can.
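For anyone curious what such a pass might look like, here is a rough Python sketch that suggests TEXTMATCH-style trigger words for an ambiguous term from a sentence-aligned parallel corpus. Everything here (the data format, function name and scoring) is an assumption made for illustration; no such module currently exists in Adso.

    # Rough sketch of learning TEXTMATCH-style triggers from a sentence-aligned
    # parallel corpus. Purely illustrative: the data format is an assumption.
    from collections import Counter, defaultdict

    def learn_triggers(pairs, word, glosses, top_n=3):
        """Suggest context words that predict each gloss of an ambiguous word.

        pairs:   list of (chinese_words, english_text) tuples, already segmented.
        word:    the ambiguous Chinese word, e.g. "乌".
        glosses: candidate English glosses, e.g. ["Uzbekistan", "crow"].
        """
        cooccur = defaultdict(Counter)
        for zh_words, en_text in pairs:
            if word not in zh_words:
                continue
            # Which gloss did the human translator actually use?
            used = [g for g in glosses if g.lower() in en_text.lower()]
            if len(used) != 1:
                continue  # skip pairs that are ambiguous or unmatched
            for context_word in zh_words:
                if context_word != word:
                    cooccur[used[0]][context_word] += 1
        # The most frequent co-occurring words become candidate triggers,
        # i.e. the sort of thing a TEXTMATCH rule would search for.
        return {g: [w for w, _ in counts.most_common(top_n)]
                for g, counts in cooccur.items()}

    pairs = [
        (["中", "乌", "关系", "发展"], "China-Uzbekistan relations are developing."),
        (["乌兹别克斯坦", "与", "中", "乌", "合作"],
         "Uzbekistan and China cooperate."),
    ]
    print(learn_triggers(pairs, "乌", ["Uzbekistan", "crow"]))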
gyrm Posted August 19, 2005 at 03:40 AM

I think the use of CODE would be better explained through more examples. For each example code fragment, you could give a couple of sample sentences and show how the probabilities would change in a given annotation. It would be nice to have some more rigour in the description of the code syntax, starting with simple stuff like "close your tags" ... also it seems that wordlists are space-separated. Again, I think more examples of correct usage would make it possible to "cut and paste", in a sense. Alternatively, perhaps some kind of form / GUI might make encoding disambiguation rules less daunting?

I just stumbled upon newsinchinese and adsotrans today, but I'm intensely interested. I haven't read all the forum threads yet, so maybe you can point me in the right direction if this is answered elsewhere ... I'd like to ask why only one definition is given per annotated word. Is this policy meant to make improvements in the system more transparent? Or is the project moving towards unassisted translation as the final goal?

All of the pieces of CODE markup are contextual -- changes happen only in the sentence currently under analysis, and only for that particular word. They don't affect the permanent database, or how words are translated in other contexts.

Right ... so basically the database of CODE information is strictly separated from some other information, ostensibly the information driving normal annotation. When CODE is added, are the changes reflected in the current annotation on newsinchinese, or are the changes rolled out at some later date?

One more idea: it might be nice to provide a visual representation of how the words are currently segmented ... probably not by underlining words, but maybe by incorporating alternating color spans (the difference in color should be very small so as not to be a complete visual distraction). Maybe this behavior could be toggled somehow, with javascript? This would make it much easier to tell when segmentation errors have occurred, as well as when compound words are missing from the database -- you see two words of two characters each where you would mentally segment them into a single four-character phrase.
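For what it's worth, here is a quick Python sketch of how alternating color spans could be generated from segmented output. The class names and markup are invented for this example and are not part of Adso or the NewsinChinese site.

    # Illustrative sketch only: render segmented words as HTML spans with two
    # alternating CSS classes so segmentation boundaries become visible.
    from html import escape

    def render_segmentation(words):
        """Wrap each segmented word in a span, alternating between two classes."""
        spans = []
        for i, word in enumerate(words):
            css_class = "seg-even" if i % 2 == 0 else "seg-odd"
            spans.append('<span class="%s">%s</span>' % (css_class, escape(word)))
        return "".join(spans)

    # The stylesheet would then give the two classes subtly different backgrounds:
    #   .seg-even { background: #ffffff; }
    #   .seg-odd  { background: #f4f4f4; }
    print(render_segmentation(["中", "乌", "关系", "正常", "发展"]))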
trevelyan Posted August 25, 2005 at 11:39 AM

Hi gyrm,

Sorry for taking a while to respond to your post. I'm down south on a visa run, actually accessing the forums from a small bar in Yangshuo.

It would be nice to have some more rigour in the description of code syntax, starting with simple stuff like "close your tags" ... also it seems that wordlists are space-separated. Again I think more examples of correct usage will make it possible to "cut and paste", in a sense. Alternatively, perhaps some kind of form / GUI might make encoding disambiguation rules less daunting?

I'll try to edit the main post to make some changes that will address some of these concerns. The big issues here are (1) finding the time to do some comprehensive write-ups, and (2) knowing the kind of information that is most useful to provide. So it helps to know what sort of stuff isn't obvious. Thanks for the helpful suggestions.

I'd like to ask why only one definition is given per annotated word. Is this policy meant to make improvements in the system more transparent? Or is the project moving towards unassisted translation as the final goal?

Yes -- we're aiming for unassisted translation as the ultimate goal of the project. This is the main reason we set up a special database and have focused on doing a lot of grammatical and ontological classification for dictionary entries instead of just using the flat-file CEDICT-type dictionary format. If you're looking for a tool that provides ALL possible definitions, it would actually be fairly trivial to adjust Adso to do that. The software is basically a command-line tool, so the change to the actual source code would be minimal. The big reason we haven't set up something like this online is that it would involve a lot of work messing around with the GUI when our volunteers mostly have their hands full. If someone wants to volunteer to work on the interface, I can provide all of the backend support needed, anyway.

Right ... so basically the database of CODE information is strictly separated from some other information, ostensibly the information driving normal annotation. When CODE is added, are changes reflected in the current annotation of newsinchinese, or are the changes to be rolled out at some later date?

Changes are *LIVE* -- as soon as anyone makes a change to the database, the software incorporates it into all subsequent translations/annotations/etc. NewsinChinese saves a bit of processing power by serving out pre-annotated RSS feeds, so database changes don't affect what is displayed on the site UNTIL the article is reannotated by clicking on the "full" button. Since this refetches the article, reprocesses it and updates the output, any changes should then be reflected in the output.

One more idea: it might be nice to provide a visual representation of how the words are currently segmented ... probably not by underlining words, maybe by incorporating alternating color spans (the difference in color should be very small so as not to be a complete visual distraction).

Are you using/have you used Firefox? We have a nice highlight-on-mouseover feature that relies on a combination of DHTML and CSS. Firefox does a good job of it, but IE is a bit behind the curve in its CSS support, so the highlighting doesn't show up there. I think it makes a big difference in the usability of the site to be able to see the words highlight on mouseover. Sometimes all that is really needed is a bit of help segmenting content.
Sorry for the lag in getting back to your post, incidentally. I've been a bit on the move and didn't notice it until today.

Cheers,
--dave