How to figure out if a Character is used as a word



Posted

Does anyone know a dictionary that will tell me if a Character can be used as a mono-syllabic word (in modern Chinese) or occurs only as part of multi-syllabic words? I find that information to be missing in any dictionary I've tried by now. I don't necessarily need a Chinese-English dictionary, a Chinese-Chinese dictionary or even a list of Characters that can be used as mono-syllabic words would be fine too.

Posted

Although such a dictionary/list might exist, unfortunately the only real way to know this sort of thing is through practice.

Posted

The ABC Chinese-English Dictionary -- which you can get on Pleco or via Wenlin -- does indicate when a character can only be used as part of a multi-syllabic word. See here: http://www.pleco.com....html#freebound

characters which do have meaning of their own, and often carry this meaning into many different compound words, but which do not occur independently as free words in standard modern Chinese (though they may be free words in classical Chinese or in very formal written styles of the language). Examples are 女 'female' in nǚrén 'woman', nǚháizi 'girl', nǚde 'woman, female', and fùnǚ 'woman, women';
Posted

Thanks realmayo, that's exactly what I was looking for. Though a free dictionary/list with this information would be even better.

imron, while I do somewhat agree with your point, I'm actually not planning to use this as a study resource. The reason I'm asking is I'm writing a segmentation software (to split Chinese text into words), and I think this information could significantly improve my results.

Posted

As someone with an interest in Chinese segmenting algorithms, do you mind if I ask the reasoning?

Posted

Assume I have three Chinese characters abc, and both ab and bc are possible words. Without this information I have no idea whether I should segment to ab c or a bc. But if I know a can occur as a monosyllabic word and c cannot, I know the right segmentation is a bc. Of course this won't completely eliminate mistakes: if both a and c are possible monosyllabic words I still don't know. But I think there are quite a few cases where it would help.
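A minimal sketch of that check, using the placeholder characters a, b and c from the description above. The function name `viable_splits` and the two input sets are made up for illustration; a real segmenter would load them from a dictionary such as the one realmayo mentions.

```python
def viable_splits(a, b, c, words, standalone):
    """List the ways to split the three characters a+b+c into a two-character
    word plus one leftover character, keeping only the splits whose leftover
    character is allowed to stand alone as a word."""
    candidates = []
    if a + b in words:
        candidates.append([a + b, c])   # "ab c"
    if b + c in words:
        candidates.append([a, b + c])   # "a bc"
    # A single-character token is only acceptable if it is a free word.
    return [seg for seg in candidates
            if all(len(tok) > 1 or tok in standalone for tok in seg)]


words = {"ab", "bc"}

# a can stand alone, c cannot: only "a bc" survives.
print(viable_splits("a", "b", "c", words, standalone={"a"}))

# Both a and c can stand alone: the ambiguity remains, both splits come back.
print(viable_splits("a", "b", "c", words, standalone={"a", "c"}))
```

If more than one split survives, some other tie-breaker is still needed.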

Posted
The reason I'm asking is I'm writing a segmentation software (to split Chinese text into words), and I think this information could significantly improve my results.
Doesn't that exist yet? Nciku used to do that, and quite well, Pleco does it (also well), and other comparable sites too I think.
Posted

Yes, I know segmentation tools exist. But the segmentation in most dictionary-like tools doesn't seem to be very sophisticated; they actually make a lot of mistakes (which may be hard to spot at times). For example, try the simple sentence 我遇到了大麻烦: most segmenters deliver results that might be quite confusing to a beginner.

Anyway, the segmenter I'm writing will only be part of a much larger piece of software (I'll post about it when it's finished).

Posted
我遇到了大麻烦

A good way to get around many (but not all) of these problems is to do reverse longest matching rather than forward longest matching (i.e. starting from the end of the string and working backwards, rather than the beginning of the string and working forwards). That way you'll get

我 遇到 了 大 麻烦

Rather than

我 遇到 了 大麻 烦
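For anyone who wants to try the two strategies side by side, here is a rough sketch. The word list `WORDS` is a deliberately tiny toy dictionary assumed for this one sentence, and the function names are made up; the output depends entirely on which words the dictionary contains.

```python
# Toy dictionary (an assumption for illustration, not a real word list).
WORDS = {"我", "遇到", "了", "大", "大麻", "麻烦", "烦"}
MAX_LEN = max(len(w) for w in WORDS)


def forward_longest(text, words=WORDS, max_len=MAX_LEN):
    """Scan left to right, always taking the longest dictionary word that
    starts at the current position (a lone character is the fallback)."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in words:
                out.append(text[i:i + n])
                i += n
                break
    return out


def reverse_longest(text, words=WORDS, max_len=MAX_LEN):
    """Scan right to left, always taking the longest dictionary word that
    ends at the current position (a lone character is the fallback)."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if n == 1 or text[j - n:j] in words:
                out.append(text[j - n:j])
                j -= n
                break
    out.reverse()
    return out


print(" ".join(forward_longest("我遇到了大麻烦")))  # 我 遇到 了 大麻 烦
print(" ".join(reverse_longest("我遇到了大麻烦")))  # 我 遇到 了 大 麻烦
```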

Doesn't that exist yet? Nciku used to do that, and quite well,

Another limitation of most segmenters is their speed. They work OK for short paragraphs, but if you want to handle large volumes of data (e.g. a whole novel, or multiple novels) then most tools hang or freeze. The segmenter I'm working on is being designed with performance in mind, to allow large-scale segmenting and processing within reasonable times.

Posted
That way you'll get 我 遇到 了 大 麻烦

Actually, disregard this. It turns out that 到了 (dàoliǎo) is a word, so reverse longest matching would split it as 我 遇 到了 大 麻烦, which is also incorrect. This is where single character word differentiation would help, as 遇 can't stand alone. Note that both 大 and 烦 can so it's not enough just to do forward longest matching with single character word differentiation.

Posted
Note that both 大 and 烦 can so it's not enough just to do forward longest matching with single character word differentiation.
So, with my apologies for somewhat derailing the thread and just out of curiosity, how would you resolve this? Anyone with halfway decent Chinese can see that it's 遇到 了 not 遇 到了, but how do you tell a computer?
Posted
how would you resolve this?

With great difficulty :mrgreen:

Actually, doing the reverse longest matching combined with the standalone character check would get around this problem, because 遇 can't stand alone and so it would take precedence over 到了.

e.g., when segmenting you would do something like:

烦 - a word

麻烦 - a word

大麻烦 - not a word, so split at the longest previous word - 麻烦

大 - a word

了大 - not a word, so split at the longest previous word - 大

了 - a word

到了 - a word

遇到了 - not a word, so split at the longest previous word - 到了

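遇 - not a standalone word, so check to see if it forms a word with the next char (the one before it in the text, since we're working backwards)

我遇 - no, so see if it matches with the previous char (the first char of the word just split off): 遇到, yes, so split the previous word at 遇到 instead and continue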

我 - a word.

end of input

Leaving you with 我 遇到 了 大 麻烦
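The walk-through above could be sketched roughly as follows. This is only one way to code it: the function name `segment`, the `max_len` parameter and both sets are assumptions for illustration, and the "does it form a word with the char before it" step is implicit, since the longest-match lookup would already have found such a word.

```python
def segment(text, words, standalone, max_len=4):
    """Reverse longest matching with a repair step: when a leftover single
    character cannot stand alone, try to re-split the word to its right so
    that the character is absorbed into a dictionary word."""
    out = []        # tokens collected from right to left
    j = len(text)   # everything from index j onward is already segmented
    while j > 0:
        # Longest multi-character dictionary word ending at j, else 1 char.
        n = next((n for n in range(min(max_len, j), 1, -1)
                  if text[j - n:j] in words), 1)
        tok = text[j - n:j]
        j -= n
        if n == 1 and tok not in standalone and out:
            nxt = out[-1]   # the word split off just before (to the right)
            for k in range(len(nxt) - 1, 0, -1):
                if tok + nxt[:k] in words:
                    out.pop()
                    # Re-segment what remains of the old word, then merge.
                    out.extend(reversed(
                        segment(nxt[k:], words, standalone, max_len)))
                    tok += nxt[:k]
                    break
        out.append(tok)
    return out[::-1]


# Toy data taken from the example in this thread.
words = {"我", "遇到", "到了", "了", "大", "大麻", "麻烦", "烦"}
standalone = {"我", "了", "大", "烦"}   # 遇 is deliberately absent

print(" ".join(segment("我遇到了大麻烦", words, standalone)))
# 我 遇到 了 大 麻烦
```

If every character is treated as standalone (for example by passing `set("我遇到了大麻烦")`), the repair step never fires and the result falls back to the plain reverse longest match, 我 遇 到了 大 麻烦.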

However even though it works for this specific example, there is always going to be some other sentence which will then trip that up. You can do various things with heuristics and word frequencies and bi-gram/tri-gram/quad-gram frequencies, and 'intelligent' name/profession guessing, but generally speaking, no matter what you do there are always going to be gaps and problems somewhere.
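As a rough illustration of the word-frequency idea, here is a sketch that scores every possible segmentation by the product of its word frequencies and keeps the best one, using dynamic programming. The counts below are invented toy numbers, not real corpus statistics, and the function name and the pseudo-count for unseen single characters are likewise assumptions.

```python
import math


def best_segmentation(text, counts, max_len=4, unseen=0.5):
    """Return the segmentation of text with the highest product of
    relative word frequencies (unigram model, computed in log space)."""
    total = sum(counts.values())
    # best[i] holds (log-probability, tokens) for the prefix text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for i in range(1, len(text) + 1):
        for n in range(1, min(max_len, i) + 1):
            w = text[i - n:i]
            # Unseen single characters get a small pseudo-count so that a
            # segmentation always exists; unseen longer strings are skipped.
            c = counts.get(w, unseen if n == 1 else 0)
            if c == 0:
                continue
            score = best[i - n][0] + math.log(c / total)
            if score > best[i][0]:
                best[i] = (score, best[i - n][1] + [w])
    return best[len(text)][1]


# Invented toy counts (hypothetical, for illustration only).
counts = {"我": 500, "遇到": 50, "到了": 5, "了": 800,
          "大": 300, "大麻": 2, "麻烦": 60, "烦": 10}

print(" ".join(best_segmentation("我遇到了大麻烦", counts)))
# 我 遇到 了 大 麻烦
```

With different counts the answer can change, which is exactly why this kind of scoring still leaves gaps.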

For reference, google splits that sentence as:

Wǒ yù dàole dà máfan

(which is wrong, and appears to be consistent with just doing some sort of reverse longest match).

Even if it got that right however, it still then borks on the example sentence in my dictionary for '到了' '这件事,到了他也没说清楚'.

Which should be: 'Zhè jiàn shì, dàoliǎo tā yě méi shuō qīngchu', but which Google renders as: 'Zhè jiàn shì, dàole tā yě méi shuō qīngchu'

Posted
烦 - a word

麻烦 - a word

大麻烦 - not a word, so split at the longest previous word - 麻烦

Why split at the longest previous word? Why is reverse longest matching better than forward longest matching? 大麻 can be a word and 烦 can also be a separate word. Can the segmenter know this? I know that in Chinese nouns are pre-modified, but this does not exclude the possibility of the last character being a word and the first two characters being a word.

What if this were another set of three characters where reverse longest matching would segment them incorrectly?

Posted
However even though it works for this specific example, there is always going to be some other sentence which will then trip that up. You can do various things with heuristics and word frequencies and bi-gram/tri-gram/quad-gram frequencies, and 'intelligent' name/profession guessing, but generally speaking, no matter what you do there are always going to be gaps and problems somewhere.
Thanks for the explanation. Is this also one of the reasons automatic translation is still hard? Because in that case I'll have another argument why human translators such as myself won't be put out of business any time soon just yet :-p
Posted

Especially if you work in literary translation, you will probably never lose your job. Let's put Joyce on google translate and see how the Chinese version turns out.

Posted
Google Translate has solved it. Dunno how though.

Are you sure? It gets the translation correct, but the pinyin is segmented incorrectly - and it's the segmentation we're interested in here. (I believe they do something completely different with their translation engine, relying on massive input of source/target languages to do the translation rather than on segmentation.)

Why split at the longest previous word?

It's just one heuristic to use that doesn't require any sort of advanced analysis of the sentence, and that generally produces an ok result. You can't very well split at the shortest word (or you'd get lots of single characters), and anything else then involves more complicated analysis.

Why is reverse longest matching better than forward longest matching?

Because of the way sentences are put together in Chinese, if you go with forward longest matching you're more likely to run into a situation where going forward to get the longest word actually results in splitting on the wrong words. Reverse longest matching isn't perfect either, but it generally produces noticeably better results than forward longest matching. It's not an exact science, just something that appears to work well.

Posted
Is this also one of the reasons automatic translation is still hard?

Yes and no. I believe more advanced machine translation systems work more on sentences and groups of words, with massive input from the source and target languages.

It is however the reason that automatic pinyin conversion and simplified/traditional conversions are never going to be perfect.

Let's put Joyce on google translate and see how the Chinese version turns out.

Put Joyce in the hands of a human translator and you're probably not going to get much in the way of anything meaningful either (though I recall reading not too long ago that someone in Shanghai has been translating his work - not sure how it turned out).

Posted

I get a thrill when Google Translate renders Korean or Vietnamese into complete nonsense-English. It's better at Chinese. I assume that's because there's more translated material out there for it to learn from?

Posted
Put Joyce in the hands of a human translator and you're probably not going to get much in the way of anything meaningful either (though I recall reading not too long ago that someone in Shanghai has been translating his work - not sure how it turned out).
Here's an article comparing the ending of Ulysses in two different Chinese translations. They try their best and both translations are decent, but neither really works imo, mostly because Chinese lacks a word that does everything that 'yes' does in this section. In the end, even human translation falls short and all you can do is just learn the language.