Jockster Posted January 15, 2009 at 02:40 PM

I need to create an index in Chinese for a user manual. All the words that are supposed to be included in the index are explicitly tagged in XML, so that part's straightforward. But how should they be sorted in the index? AFAIK there are at least three common ways:

Stroke count
Pinyin
Radical + stroke

...and none of these can be said to be the default one, my Chinese colleague said.

To cut a long story short, ideally I would like to handle the sorting of the index words with the help of homegrown scripts and a lookup table. I have also looked around for commercial solutions, and one vendor said that the only approach that always works is to base it on stroke count. Would you agree with that?

Chinese people write less and less by hand, like the rest of us. Do they typically know how many strokes there are in a character without having to count them mentally? I realize that for characters with a fairly small number of strokes, say 5 or below, that is most likely the case, but what about more complicated characters?

I have not been able to find any public domain lookup tables that list both simplified and traditional Chinese characters by stroke count or radical + stroke. There are lists like that in Chinese dictionaries, but I am looking for an electronic version without copyright. By the way, am I right in suspecting that those lists would not necessarily be 100% identical from one dictionary to the next? I recall a discussion about there being variants of some less frequently used, high stroke count characters.

My immediate concern is with simplified characters, but of course I would like it to work with traditional characters, too.

Thanks if you read this far, doubly so if you take the time to reply.
roddy Posted January 15, 2009 at 03:12 PM

I suspect pinyin would be usual, and a quick check of MS Word's help file and a camera manual back me up*. But if you need stroke count tables, Unihan might have what you need.

Unless you're dealing with something particularly obscure you can probably find similar manuals online and see what they do. Myreadme.com has a collection of manuals for consumer electronics.

*And in both cases the need to use letters means you either use pinyin or have two indexes. Where does CPU镜头 come in a stroke count index?
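As a rough illustration of the Unihan suggestion above: the Unihan data files are tab-separated text with lines of the form `U+XXXX<TAB>field<TAB>value`, and the stroke count lives in the `kTotalStrokes` field. The sketch below builds a lookup table from such lines and uses it as a sort key. The function names and the sample lines are made up for this example, the sample counts should be checked against the real Unihan data, and the rule that characters missing from the table (such as the Latin letters in CPU镜头) sort first is an assumption, not an established convention.

```python
def parse_total_strokes(lines):
    """Build {character: stroke count} from Unihan-style lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != "kTotalStrokes":
            continue  # ignore other Unihan fields
        codepoint, _, value = parts
        char = chr(int(codepoint[2:], 16))  # "U+5BA2" -> "客"
        # Some entries list more than one count; take the first one.
        table[char] = int(value.split()[0])
    return table


def stroke_key(term, table):
    """Sort key: per-character (stroke count, character) pairs.

    Characters not in the table count as 0 strokes (an assumption),
    and the character itself breaks ties between equal counts.
    """
    return [(table.get(ch, 0), ch) for ch in term]


# Hypothetical sample in the Unihan line format, for illustration only.
SAMPLE_LINES = [
    "# sample lines in the Unihan database format",
    "U+6237\tkTotalStrokes\t4",  # 户
    "U+673A\tkTotalStrokes\t6",  # 机
    "U+6280\tkTotalStrokes\t7",  # 技
    "U+5BA2\tkTotalStrokes\t9",  # 客
    "U+79D1\tkTotalStrokes\t9",  # 科
]

STROKES = parse_total_strokes(SAMPLE_LINES)
terms = ["科技", "客机", "客户", "机"]
print(sorted(terms, key=lambda t: stroke_key(t, STROKES)))
```

Breaking ties by code point is only a placeholder; a printed dictionary would more likely break them by radical or first stroke.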
renzhe Posted January 15, 2009 at 03:30 PM

Pinyin is also the only thing that will work on both traditional and simplified characters (they will be located at the same place in the index). Pinyin might be less intuitive to a Hong Kong or Taiwanese user, but I'd imagine that most people search the index anyway.
Jockster Posted January 16, 2009 at 07:53 AM

Thanks for the replies. I like the idea of using Pinyin, but I wasn't sure whether that was because I am not Chinese and have used Pinyin to study Chinese.

I guess using Pinyin is not entirely straightforward either, though. The first step is to establish the pronunciation of each index term. That involves either what I think would have to be a highly reliable dictionary search, or actually hard-coding the Pinyin pronunciation with an attribute. Since many characters can be pronounced in more than one way, just looking at the characters individually would be too unreliable.

Suppose that these are the terms that should be sorted under K:

科技 ke1 ji4
客机 ke4 ji1
科学 ke1 xue2
客户管理 ke4 hu4 guan3 li3
客户机软件 ke4 hu4 ji1 ruan3 jian4

To sort these, Pinyin goes a long way. The terms are compared syllable by syllable on pronunciation + tone, starting with the first character, which yields this list:

科技 ke1 ji4
科学 ke1 xue2
客户管理 ke4 hu4 guan3 li3
客户机软件 ke4 hu4 ji1 ruan3 jian4
客机 ke4 ji1

Suppose you had two terms with exactly the same Pinyin pronunciation: how would you sort those? According to the stroke count, or radical + stroke count? Or would anybody actually be bothered about that? Instances like that should be fairly rare, I suppose (I actually tried to think of some examples for this post but came up empty-handed). So perhaps, if by chance two index terms have identical pronunciations, it does not matter if they appear in arbitrary order (the order in which they were processed)? The odds are it would be a very short list.

Of course, if the indexing depended on a dictionary search, then at some point one will encounter a term for which no match exists. It could be a Chinese name, for example. Well, one could hard-code those with an attribute and create a rule to the effect that if no attribute value exists, then a dictionary search is done.

Please comment, I am sure I have forgotten some aspect.

And to clarify: I don't intend to mix traditional and simplified characters in the same manual, I just want the solution to work for both.

PS: I don't want to follow this example.
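The Pinyin ordering described in the post above could be sketched roughly as follows, assuming numbered-tone Pinyin such as `ke4 ji1`. The names `pinyin_key`, `sort_terms`, `READINGS` and the `overrides` argument (standing in for the hard-coded XML attribute) are hypothetical and only serve the illustration. Terms with identical readings keep their input order, because Python's `sorted` is stable.

```python
import re

# A syllable is letters (including "ü" or the "u:" spelling) plus a tone digit.
SYLLABLE = re.compile(r"^([a-zü:]+)([1-5])$")


def pinyin_key(pinyin):
    """Turn 'ke4 ji1' into [('ke', 4), ('ji', 1)] for comparison."""
    key = []
    for syllable in pinyin.lower().split():
        match = SYLLABLE.match(syllable)
        if match:
            key.append((match.group(1), int(match.group(2))))
        else:
            key.append((syllable, 0))  # no tone digit: treat as tone 0
    return key


def sort_terms(terms, lookup, overrides=None):
    """Sort terms by Pinyin; a hard-coded override wins over the lookup."""
    overrides = overrides or {}

    def reading(term):
        if term in overrides:
            return overrides[term]
        if term in lookup:
            return lookup[term]
        # Flag the term for manual tagging instead of guessing.
        raise LookupError(f"no pinyin reading for {term!r}")

    return sorted(terms, key=lambda term: pinyin_key(reading(term)))


# Readings taken from the example terms in the post.
READINGS = {
    "科技": "ke1 ji4",
    "客机": "ke4 ji1",
    "科学": "ke1 xue2",
    "客户管理": "ke4 hu4 guan3 li3",
    "客户机软件": "ke4 hu4 ji1 ruan3 jian4",
}

# Yields the same order as the sorted list in the post:
# 科技, 科学, 客户管理, 客户机软件, 客机
print(sort_terms(list(READINGS), READINGS))
```

If a deterministic order for identical readings is wanted, a stroke count or the term itself could be appended to the key as a final tie-breaker.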
renzhe Posted January 16, 2009 at 11:08 AM

You could use CC-CEDICT to find the pinyin for every term you are interested in. It's in UTF-8 text format and very easy to parse. I've used Python to look up pinyin + meaning for our First Episode Project, for example. You can do this once, create your internal pinyin table for all the terms in your index, and be done with it.

I also think Adso does something like this (plus text segmentation and other stuff) and IIRC, it's free.

In either case, you'll need to hand-tune it once at the end (find missing information, delete the odd duplicate etc.), but most of the work should be fairly painless.
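A minimal sketch of the one-off lookup described above, assuming the usual CC-CEDICT line layout `traditional simplified [pin1 yin1] /gloss/gloss/` with `#` comment lines. This is not the script used for the First Episode Project; the function names and the sample lines are invented for illustration and are not quoted from the dictionary file. Terms with no match or with more than one reading are set aside for the hand-tuning pass.

```python
import re
from collections import defaultdict

# traditional simplified [pinyin] /definitions/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def load_cedict(lines):
    """Map each word (traditional and simplified) to its list of readings."""
    readings = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and header comments
        match = ENTRY.match(line)
        if not match:
            continue  # skip anything that is not a dictionary entry
        traditional, simplified, pinyin, _glosses = match.groups()
        for word in {traditional, simplified}:
            if pinyin not in readings[word]:
                readings[word].append(pinyin)
    return dict(readings)


def build_index_table(terms, readings):
    """Return (term -> pinyin table, terms that need manual review)."""
    table, needs_review = {}, []
    for term in terms:
        found = readings.get(term, [])
        if len(found) == 1:
            table[term] = found[0]
        else:
            needs_review.append(term)  # missing or ambiguous reading
    return table, needs_review


# Illustrative lines in the CC-CEDICT layout (not copied from the real file).
SAMPLE_LINES = [
    "# CC-CEDICT style sample",
    "科技 科技 [ke1 ji4] /science and technology/",
    "客機 客机 [ke4 ji1] /passenger plane/",
    "科學 科学 [ke1 xue2] /science/",
    "行 行 [hang2] /row/line/",
    "行 行 [xing2] /to walk/to go/",
]

CEDICT = load_cedict(SAMPLE_LINES)
table, needs_review = build_index_table(["科技", "客机", "行", "客户管理"], CEDICT)
print(table)
print(needs_review)
```

With the real file, `load_cedict` could be fed an open file object (opened with `encoding="utf-8"`), since it only iterates over lines.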