Links between simplified characters in database structure

April 1, 2012 at 06:48 PM

Dear all,

I'm a Belgian student and, as you can already guess, in love with China and the language. So first of all, thanks for this great forum; I use it a lot.

Besides being a China lover, I'm also a programmer, and I would like to build my own learning tool.

What is the issue? I've run into a technical problem. I'm using the open-source CC-CEDICT database, but how do you figure out (automatically) which meaning each character has in a word containing several characters? Just matching on the individual character is not enough, as many characters have multiple translations in the database.

The question is thus: how do you link a word (= a combination of several characters) to its original building blocks (= the individual characters)?

Thanks,

Pj

April 1, 2012 at 11:01 PM

When you say "characters with multiple translations" do you mean one character having multiple entries (i.e. separate lines in the dictionary file) or multiple definitions given on the same line for a single character? I think you are talking about the former, so I will answer based on that assumption, but if it is the latter then I don't think it is possible using just the CC-CEDICT file.

Your main clue is pronunciation/tone. You want to find the individual character entry whose pronunciation/tone matches that of the character in the compound word. For example, the compound 处理 (chǔlǐ) is going to match up with the third tone 处 (chǔ), not the fourth tone 处 (chù). You can also probably skip "proper noun" (i.e. capitalized) single-character entries like surnames, which aren't typically going to be used in compound words. I think only recent versions of the CC-CEDICT dictionary separate surnames into different entries.
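A minimal sketch of that matching step in Python. It assumes the usual CC-CEDICT line layout (traditional form, simplified form, bracketed numbered pinyin, slash-separated glosses); the sample lines are abridged for illustration, and the function names are made up for this example.

```python
import re
from collections import defaultdict

# Assumed line layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

# Illustrative sample with abridged glosses, not a verbatim copy of the file.
SAMPLE = """\
# comment lines start with a hash
處 处 [chu3] /to reside/to deal with/
處 处 [chu4] /place/location/department/
理 理 [li3] /texture/reason/to manage/
處理 处理 [chu3 li3] /to handle/to deal with/
"""

def parse(text):
    """Return a list of (simplified, [syllables], [glosses]) tuples."""
    entries = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        m = LINE_RE.match(line.strip())
        if m:
            _trad, simp, pinyin, glosses = m.groups()
            entries.append((simp, pinyin.split(), glosses.split("/")))
    return entries

def index_single_chars(entries):
    """Map each single character to all of its (reading, glosses) entries."""
    index = defaultdict(list)
    for simp, syllables, glosses in entries:
        if len(simp) == 1 and len(syllables) == 1:
            index[simp].append((syllables[0], glosses))
    return index

def link(word, syllables, index):
    """For each character of a compound, keep the single-character entries
    whose reading (including tone) matches the syllable in the compound."""
    linked = []
    for char, syl in zip(word, syllables):
        matches = [
            (reading, glosses)
            for reading, glosses in index.get(char, [])
            # skip capitalized readings (surnames and other proper nouns)
            if not reading[0].isupper() and reading.lower() == syl.lower()
        ]
        linked.append((char, matches))
    return linked

index = index_single_chars(parse(SAMPLE))
for char, matches in link("处理", ["chu3", "li3"], index):
    print(char, matches)
```

With the sample above, 处 in 处理 links only to the chu3 entry, and the chu4 entry is left out. A character with several entries that share the same reading would still return more than one match.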

April 2, 2012 at 03:29 PM

Yes, your assumption is correct. I'm having difficulties with multiple entries (= rows) for the same character.

The 'capitalized' pinyin tip is certainly a good one!

Concerning the tones: yes, in a lot of cases this will be helpful, BUT what about words that have a neutral tone, for example zhuō zi, where zi gets the 5th (= neutral) tone?

Btw, I think I have one of the latest versions of CC-CEDICT, but there is no separation for surnames.

Finally, one remark: I found out that the order in which the words appear lets you find out what the first character is.

For example if you have

学 [xue2] /to learn/to study/science/-ology/

All the lines after that will be words starting with 学.

This helps in finding out what the first character is, but not the second (third, ...).

Any more help in figuring out the links is more than welcome.

Greetings!

April 2, 2012 at 07:38 PM

For neutral tones I would probably just "guess" the tone by using the most common reading for that character from other compounds. It won't be right in all cases, but I think it will be right enough to be useful.
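A rough sketch of that guess in Python, assuming the compounds have already been parsed into (word, syllables) pairs with numbered pinyin where tone 5 marks the neutral tone. The sample list and the function names are illustrative only.

```python
from collections import Counter, defaultdict

# Illustrative (word, syllables) pairs in numbered pinyin; 5 = neutral tone.
COMPOUNDS = [
    ("桌子", ["zhuo1", "zi5"]),
    ("儿子", ["er2", "zi5"]),
    ("子女", ["zi3", "nu:3"]),
    ("子孙", ["zi3", "sun1"]),
    ("电子", ["dian4", "zi3"]),
]

def reading_counts(compounds):
    """Count, per character, how often each non-neutral reading occurs."""
    counts = defaultdict(Counter)
    for word, syllables in compounds:
        if len(word) != len(syllables):
            continue  # skip entries where characters and syllables do not align
        for char, syl in zip(word, syllables):
            if not syl.endswith("5"):
                counts[char][syl.lower()] += 1
    return counts

def guess_reading(char, syllable, counts):
    """Return the syllable unchanged unless it is neutral; for a neutral
    syllable, return the most common toned reading seen for that character,
    or None if the character was never seen with a tone."""
    if not syllable.endswith("5"):
        return syllable
    base = syllable[:-1].lower()
    seen = counts.get(char, Counter())
    # prefer readings with the same base syllable (zi5 -> zi3, not another sound)
    same_base = {s: n for s, n in seen.items() if s[:-1] == base}
    pool = same_base or seen
    if not pool:
        return None
    return max(pool, key=pool.get)

counts = reading_counts(COMPOUNDS)
print(guess_reading("子", "zi5", counts))
```

The guessed reading can then be used to look up the single-character entry in the same way as any toned syllable. As said above, it is only a guess and will be wrong for some words.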

I thought the surnames were showing up separately in the CC-CEDICT version I was transforming into a Mac dictionary plugin, but I might be mistaken (or it might have changed).

Taking a step back, although fooling around with CC-CEDICT is fun and I definitely encourage it, you will never produce a learning tool as generally useful as Pleco. Plus, you can download the CC-CEDICT dictionary plugin for Pleco for free. Pleco doesn't solve your particular problem (when cross-referencing a character it lets you scroll through all the different entries), but overall it already does almost everything you could want with a Chinese dictionary (and then about 1000 other things as well). So, I recommend tooling around with CC-CEDICT for fun, but as your actual study tool I would encourage you to look at Pleco.

April 3, 2012 at 02:51 AM

If you are serious about it, get a beginner's book on linguistics. What we need (but what will probably never happen) is a "meaning" database, i.e. a dictionary that breaks things up like this:

Underlying meaning
- Mandarin Chinese character(s)/word
- Mandarin Chinese pronunciation
- English word(s)

I.e. there is a many-to-many relationship between "Chinese words", "English words" and "Chinese pronunciation"; the only way to hang them all together properly would be to link them based on an underlying meaning relationship. If the above were done, the database could then be expanded:

Underlying meaning
- Mandarin Chinese character(s)/word
- Mandarin Chinese pronunciation
- English word(s)
- Shanghainese word(s)
- Shanghainese pronunciation
- Cantonese pronunciation
- Spanish word
- etc...
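To make the structure concrete, here is a small Python sketch of such a meaning-keyed store. Everything in it (the class name, the meaning ids, the "kind" labels) is hypothetical and only illustrates the many-to-many links; it is not an existing database or format.

```python
from collections import defaultdict

class MeaningDB:
    """Toy store: every form (word, pronunciation, translation) hangs off a
    meaning id, so forms are related only through a shared meaning."""

    def __init__(self):
        self.forms = defaultdict(lambda: defaultdict(set))  # meaning -> kind -> values
        self.reverse = defaultdict(set)                     # (kind, value) -> meanings

    def add(self, meaning_id, kind, value):
        self.forms[meaning_id][kind].add(value)
        self.reverse[(kind, value)].add(meaning_id)

    def meanings_for(self, kind, value):
        """All meanings a given form takes part in."""
        return set(self.reverse.get((kind, value), set()))

    def translate(self, kind, value, target_kind):
        """Follow form -> meaning(s) -> forms of another kind."""
        result = set()
        for meaning_id in self.meanings_for(kind, value):
            result |= self.forms[meaning_id].get(target_kind, set())
        return result

db = MeaningDB()
# hypothetical meaning ids; "kind" can be any language or pronunciation column
db.add("PLACE", "mandarin_word", "处")
db.add("PLACE", "mandarin_pinyin", "chu4")
db.add("PLACE", "english", "place")
db.add("DEAL_WITH", "mandarin_word", "处")
db.add("DEAL_WITH", "mandarin_pinyin", "chu3")
db.add("DEAL_WITH", "english", "to deal with")

print(db.meanings_for("mandarin_word", "处"))
print(db.translate("english", "place", "mandarin_pinyin"))
```

Adding another language or dialect is then just another "kind" attached to the same meaning ids. The hard part remains deciding what the meanings are.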

The difficulty of doing this is probably what is preventing it from ever happening. If there are enough people interested in such a project, let me know. It is something I have been thinking about for a long time now, but it's not something a single person could achieve on their own in a lifetime.

April 3, 2012 at 09:33 AM

You have an interesting idea, but unfortunately, the data doesn't exist to support it. And as feng mentioned, creating such data would be difficult.

Here's something you might try to create the data somewhat automatically with some programming grease:

1) Link the English definitions with WordNet and try to automatically assign each CC-CEDICT definition to a synset.

2) Go through all of the multi-character words and see if each one's definition is similar to a synset of a single character.

3) If it is, then that might be the definition that the word is based on.
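The shape of steps 2 and 3 can be sketched in Python as below. Note the big simplification: instead of WordNet synsets, this stand-in compares glosses by plain word overlap (Jaccard similarity), just so it runs without any external data. The stopword list, the threshold and the function names are all assumptions made for this example.

```python
# Stand-in for synset matching: compare glosses by shared words only.
STOPWORDS = {"to", "a", "an", "the", "of", "with"}

def tokens(gloss):
    """Lowercased content words of one gloss."""
    return {w for w in gloss.lower().split() if w not in STOPWORDS}

def gloss_words(glosses):
    """Union of content words over a list of glosses."""
    words = set()
    for gloss in glosses:
        words |= tokens(gloss)
    return words

def jaccard(a, b):
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_sense(compound_glosses, char_senses, threshold=0.1):
    """char_senses maps a reading to that entry's glosses. Returns the
    (reading, score) with the highest overlap, or None if nothing reaches
    the (arbitrary) threshold."""
    target = gloss_words(compound_glosses)
    scored = [
        (jaccard(target, gloss_words(glosses)), reading)
        for reading, glosses in char_senses.items()
    ]
    if not scored:
        return None
    score, reading = max(scored)
    return (reading, score) if score >= threshold else None

# Abridged, illustrative glosses for 处理 and the two 处 entries.
senses = {
    "chu3": ["to reside", "to deal with"],
    "chu4": ["place", "location", "department"],
}
print(best_sense(["to handle", "to deal with"], senses))
```

A real attempt would replace `jaccard` over raw words with a similarity between synsets, which needs the WordNet data and a way to pick the right synset for each gloss.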

To be honest, I doubt this would work very well (~50% accuracy at best). In reality, to actually do this, one would need a strong knowledge of ancient Chinese in addition to strong modern Chinese skills.

In general, I think this method, while potentially useful, is also potentially dangerous. By themselves, characters have meanings and sometimes are even words, but each character's meaning doesn't always influence the meaning of a multi-character word. It's also easy to make mistakes like saying 危机 (crisis) = danger + opportunity. That's completely wrong.

I'd love to hear more about your project idea!

April 3, 2012 at 09:57 AM

If you are trying to do etymology, you do have to refer to the classical Chinese meanings, just as the etymology section of English dictionaries refers to Greek, Latin, Old English, Middle English, Old French, Middle French, and so forth.


Pj Muller

muirm

Pj Muller

muirm

feng

ChouDoufu

gato
