How to figure out if a Character is used as a word

July 26, 2013 at 08:01 PM

Does anyone know a dictionary that will tell me if a character can be used as a mono-syllabic word (in modern Chinese) or occurs only as part of multi-syllabic words? I find that information to be missing in every dictionary I've tried so far. I don't necessarily need a Chinese-English dictionary; a Chinese-Chinese dictionary or even a list of characters that can be used as mono-syllabic words would be fine too.

July 27, 2013 at 10:03 AM

Although such a dictionary/list might exist, unfortunately the only real way to know this sort of thing is through practice.

July 27, 2013 at 10:23 AM

The ABC Chinese-English Dictionary -- which you can get on Pleco or via Wenlin -- does indicate when a character can only be used as part of a multi-syllabic word. See here: http://www.pleco.com....html#freebound

characters which do have meaning of their own, and often carry this meaning into many different compound words, but which do not occur independently as free words in standard modern Chinese (though they may be free words in classical Chinese or in very formal written styles of the language). Examples are nǚ 女 'female' in nǚrén 'woman', nǚháizi 'girl', nǚde 'woman, female', and fùnǚ 'woman, women';

July 27, 2013 at 11:59 AM

Thanks realmayo, that's exactly what I was looking for. Though a free dictionary/list with this information would be even better.

imron, while I do somewhat agree with your point, I'm actually not planning to use this as a study resource. The reason I'm asking is that I'm writing segmentation software (to split Chinese text into words), and I think this information could significantly improve my results.

July 27, 2013 at 12:26 PM

As someone with an interest in Chinese segmenting algorithms, do you mind if I ask the reasoning?

July 27, 2013 at 12:39 PM

Suppose I have the three Chinese characters abc, and both ab and bc are possible words. Without this information I have no idea whether I should segment to ab c or a bc. But if I know a can occur as a monosyllabic word and c cannot, I know the right segmentation is a bc. Of course this won't completely eliminate mistakes: if both a and c are possible monosyllabic words I still don't know. But I think there are quite a few cases where it would help.
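
As a rough illustration of this rule, here is a minimal Python sketch. The function name `split_overlap` and both sets (`words`, `standalone`) are hypothetical, and the letters a, b, c stand in for Chinese characters exactly as in the post above.

```python
def split_overlap(abc, words, standalone):
    """Return the splits of a three-character string that pass the check.

    words      -- set of known two-character words (hypothetical lexicon)
    standalone -- set of characters usable as one-character words
    """
    a, b, c = abc
    candidates = []
    # "ab c" is only plausible if c can stand alone as a word.
    if a + b in words and c in standalone:
        candidates.append([a + b, c])
    # "a bc" is only plausible if a can stand alone as a word.
    if b + c in words and a in standalone:
        candidates.append([a, b + c])
    return candidates


# a is a free word, c is bound: only "a bc" survives.
print(split_overlap("abc", {"ab", "bc"}, {"a"}))
# Both a and c are free words: still ambiguous, two candidates remain.
print(split_overlap("abc", {"ab", "bc"}, {"a", "c"}))
```

When the function returns more than one candidate, some other tie-breaker is still needed, which is the limitation already noted above.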

July 27, 2013 at 07:55 PM

The reason I'm asking is that I'm writing segmentation software (to split Chinese text into words), and I think this information could significantly improve my results.

Doesn't that exist yet? Nciku used to do that, and quite well, Pleco does it (also well), and other comparable sites too I think.

July 28, 2013 at 12:52 PM

Yes, I know segmentation tools exist. The segmentation in most dictionary-like tools doesn't seem very sophisticated, though; they actually make a lot of mistakes (which may be hard to spot at times). For example, try the simple sentence 我遇到了大麻烦: most segmenters deliver results that might be quite confusing to a beginner.

Anyway, the segmenter I'm writing will only be part of a much larger piece of software (I'll post about it when it's finished).

July 28, 2013 at 11:00 PM

我遇到了大麻烦

A good way to get around many (but not all) of these problems is to do reverse longest matching rather than forward longest matching (i.e. starting from the end of the string and working backwards, rather than the beginning of the string and working forwards). That way you'll get

我 遇到 了 大 麻烦

Rather than

我 遇到 了 大麻 烦
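
The difference between the two directions can be sketched in Python as below. This is only an illustration under stated assumptions: `TOY_LEXICON` is a made-up word list that deliberately leaves out many real entries, `max_len` is an arbitrary cap on word length, and unmatched characters are simply emitted on their own. A fuller lexicon can change the outcome.

```python
def forward_longest(text, lexicon, max_len=4):
    """Greedy longest matching, scanning from the start of the string."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_longest(text, lexicon, max_len=4):
    """Greedy longest matching, scanning from the end of the string."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in lexicon:
                words.append(piece)
                end -= size
                break
    return words[::-1]  # words were collected right-to-left


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"我", "遇到", "了", "大", "大麻", "麻烦", "烦"}

print(" ".join(forward_longest("我遇到了大麻烦", TOY_LEXICON)))
print(" ".join(reverse_longest("我遇到了大麻烦", TOY_LEXICON)))
```

With this toy lexicon the forward pass grabs 大麻 first and strands 烦, while the reverse pass grabs 麻烦 first and leaves 大 on its own.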

Doesn't that exist yet? Nciku used to do that, and quite well,

Another limitation of most segmenters is their speed. They work ok for short paragraphs, but if you want to handle large volumes of data (e.g. a whole novel, or multiple novels) then most tools hang or freeze. The segmenter I'm working on is being designed with performance in mind, to allow large-scale segmenting and processing within reasonable times.

July 29, 2013 at 01:12 AM

That way you'll get 我 遇到 了 大 麻烦

Actually, disregard this. It turns out that 到了 (dàoliǎo) is a word, so reverse longest matching would split it as 我 遇 到了 大 麻烦, which is also incorrect. This is where single-character word differentiation would help, as 遇 can't stand alone. Note that both 大 and 烦 can, so it's not enough just to do forward longest matching with single-character word differentiation.

July 29, 2013 at 08:09 AM

Note that both 大 and 烦 can so it's not enough just to do forward longest matching with single character word differentiation.

So, with my apologies for somewhat derailing the thread and just out of curiosity, how would you resolve this? Anyone with halfway decent Chinese can see that it's 遇到 了 not 遇 到了, but how do you tell a computer?

July 29, 2013 at 10:56 AM

how would you resolve this?

With great difficulty

Actually, doing the reverse longest matching combined with the standalone character check would solve this problem, because 遇 can't stand alone and so it would take precedence over 到了.

e.g., when segmenting you would do something like:

烦 - a word

麻烦 - a word

大麻烦 - not a word, so split at the longest previous word - 麻烦

大 - a word

了大 - not a word, so split at the longest previous word - 大

了 - a word

到了 - a word

遇到了 - not a word, so split at the longest previous word - 到了

遇 - not a standalone word, so check to see if it forms a word with the next char

我遇 - no, so see if it matches with the previous char: 遇到 - yes, so split the previous word at 遇到 instead and continue

我 - a word.

end of input

Leaving you with 我 遇到 了 大 麻烦
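
The walk-through above can be approximated in Python as follows. This is a sketch rather than a definitive implementation: the names `reverse_longest_standalone`, `LEXICON` and `STANDALONE` are hypothetical, both sets are tiny hand-made samples, and the repair step only looks at the one word immediately to the right of a bound character.

```python
def reverse_longest_standalone(text, lexicon, standalone, max_len=4):
    """Reverse longest matching with a repair step for bound characters."""
    words = []  # collected right-to-left
    end = len(text)
    while end > 0:
        # Longest multi-character word ending at `end`, else one character.
        size = next((n for n in range(min(max_len, end), 1, -1)
                     if text[end - n:end] in lexicon), 1)
        piece = text[end - size:end]
        repaired = False
        if size == 1 and piece not in standalone and words:
            # Bound character: try to borrow leading characters from the
            # word already split off to its right (e.g. 遇 + 到|了).
            prev = words[-1]
            for k in range(len(prev) - 1, 0, -1):
                merged, rest = piece + prev[:k], prev[k:]
                if merged in lexicon and rest in lexicon:
                    words[-1:] = [rest, merged]
                    repaired = True
                    break
        if not repaired:
            words.append(piece)
        end -= size
    return words[::-1]


# Hypothetical sample data, for illustration only.
LEXICON = {"我", "遇到", "到了", "了", "大", "大麻", "麻烦", "烦"}
STANDALONE = {"我", "了", "大", "烦"}  # 遇 is deliberately absent

print(" ".join(reverse_longest_standalone("我遇到了大麻烦", LEXICON, STANDALONE)))
```

Because reverse longest matching has already tried every longer word ending at the bound character, the "does it form a word with the next char" check from the walk-through is implicit; only the re-split of the word to the right needs extra code.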

However, even though it works for this specific example, there is always going to be some other sentence which will then trip that up. You can do various things with heuristics and word frequencies and bi-gram/tri-gram/quad-gram frequencies, and 'intelligent' name/profession guessing, but generally speaking, no matter what you do there are always going to be gaps and problems somewhere.
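
To give a flavour of the word-frequency approach, here is a minimal sketch that picks the segmentation with the highest product of word probabilities, using dynamic programming. Everything in it is an assumption made for illustration: the counts in `TOY_COUNTS` are invented, not corpus statistics, and the pseudo-count of 0.5 for unknown single characters is an arbitrary choice.

```python
import math


def best_segmentation(text, counts, max_len=4):
    """Pick the split that maximises the sum of log word probabilities."""
    total = sum(counts.values())
    # best[i] = (score of the best split of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            count = counts.get(word, 0)
            if count == 0 and len(word) > 1:
                continue  # unknown multi-character strings are not words
            # Unknown single characters get a small pseudo-count (assumption).
            score = best[start][0] + math.log(max(count, 0.5) / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Invented counts, for illustration only.
TOY_COUNTS = {"我": 50, "遇到": 10, "到了": 5, "了": 60,
              "大": 30, "大麻": 1, "麻烦": 12, "烦": 3}

print(" ".join(best_segmentation("我遇到了大麻烦", TOY_COUNTS)))
```

With these invented numbers the common words 遇到 and 麻烦 win over 到了 and 大麻, but different counts would give a different split, which is exactly why gaps remain.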

For reference, google splits that sentence as:

Wǒ yù dàole dà máfan

(which is wrong, and appears to be consistent with just doing some sort of reverse longest match).

Even if it got that right, however, it still then borks on the example sentence in my dictionary for '到了': '这件事，到了他也没说清楚'.

Which should be: 'Zhè jiàn shì, dàoliǎo tā yě méi shuō qīngchu', but which Google renders as: 'Zhè jiàn shì, dàole tā yě méi shuō qīngchu'
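
The dependence of the pinyin on the segmentation can be shown with a tiny lookup. This is a sketch only: `READINGS` is a hypothetical three-entry table holding just the readings mentioned in this thread, and `to_pinyin` is a made-up helper, not part of any real library.

```python
# Hypothetical reading table, limited to the readings quoted in this thread.
READINGS = {"到了": "dàoliǎo", "到": "dào", "了": "le"}


def to_pinyin(words, readings):
    """Look up each segmented word; unknown words are passed through as-is."""
    return " ".join(readings.get(word, word) for word in words)


print(to_pinyin(["到了"], READINGS))      # 到了 kept as one word
print(to_pinyin(["到", "了"], READINGS))  # 到 and 了 split apart
```

The same two characters come out with different readings depending on where the segmenter puts the boundary, so a segmentation mistake turns directly into a pinyin mistake.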

July 29, 2013 at 11:56 AM

烦 - a word
麻烦 - a word

大麻烦 - not a word, so split at the longest previous word - 麻烦

Why split at the longest previous word? Why is reverse longest matching better than forward longest matching? 大麻 can be a word and 烦 can also be a separate word. Can the segmenter know this? I know that in Chinese nouns are pre-modified, but this does not exclude the possibility of the last character being a word and the first two characters being a word.

What if this were another set of three characters where reverse longest matching would segment them incorrectly?

July 29, 2013 at 12:10 PM

Google Translate has solved it. Dunno how though.

July 29, 2013 at 12:20 PM

However, even though it works for this specific example, there is always going to be some other sentence which will then trip that up. You can do various things with heuristics and word frequencies and bi-gram/tri-gram/quad-gram frequencies, and 'intelligent' name/profession guessing, but generally speaking, no matter what you do there are always going to be gaps and problems somewhere.

Thanks for the explanation. Is this also one of the reasons automatic translation is still hard? Because in that case I'll have another argument why human translators such as myself won't be put out of business any time soon just yet :-p

July 29, 2013 at 12:32 PM

Especially if you work in literary translation, you will probably never lose your job. Let's put Joyce on google translate and see how the Chinese version turns out.

July 29, 2013 at 12:40 PM

Google Translate has solved it. Dunno how though.

Are you sure? It gets the translation correct, but the pinyin is segmented incorrectly, and it's the segmentation we're interested in here (I believe they do something completely different with their translation engine, relying on massive input of source/target language text rather than on segmentation).

Why split at the longest previous word?

It's just one heuristic to use that doesn't require any sort of advanced analysis of the sentence, and that generally produces an ok result. You can't very well split at the shortest word (or you'd get lots of single characters), and anything else then involves more complicated analysis.

Why is reverse longest matching better than forward longest matching?

Because of the way sentences are put together in Chinese, if you go with forward longest matching you're more likely to run into a situation where going forward to get the longest word actually results in splitting on the wrong words. Reverse longest matching isn't perfect either, but it generally produces noticeably better results than forward longest matching. It's not an exact science, just something that appears to work well.

July 29, 2013 at 12:41 PM

Is this also one of the reasons automatic translation is still hard?

Yes and no. I believe more advanced machine translation systems work more on sentences and groups of words, with massive input from the source and target languages.

It is however the reason that automatic pinyin conversion and simplified/traditional conversions are never going to be perfect.

Let's put Joyce on google translate and see how the Chinese version turns out.

Put Joyce in the hands of a human translator and you're probably not going to get much in the way of anything meaningful either (though I recall reading not too long ago that someone in Shanghai has been translating his work - not sure how it turned out).

July 29, 2013 at 12:48 PM

I get a thrill when Google Translate renders Korean or Vietnamese into complete nonsense-English. It's better at Chinese. I assume that's because there's more translated material out there for it to learn from?

July 29, 2013 at 01:18 PM

Put Joyce in the hands of a human translator and you're probably not going to get much in the way of anything meaningful either (though I recall reading not too long ago that someone in Shanghai has been translating his work - not sure how it turned out).

Here's an article comparing the ending of Ulysses in two different Chinese translations. They try their best and both translations are decent, but neither really works imo, mostly because Chinese lacks a word that does everything that 'yes' does in this section. In the end, even human translation falls short and all you can do is just learn the language.


Posters, in thread order: 陆咔思, imron, realmayo (guest), 陆咔思, imron, 陆咔思, Lu, 陆咔思, imron, imron, Lu, imron, Angelina, roddy, Lu, Angelina, imron, imron, realmayo (guest), Lu
