Pinyin Hanzi offline database?

August 27, 2009 at 03:50 PM

I'm writing some kind of dictionary software and as part of this am looking for a good database to provide mapping from Pinyin to Hanzi. The sort of job you would expect an IME to do.

I was thinking of just parsing the CEDICT dictionary and building statistics based on that. But is there a more reliable/comprehensive way?
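For what it's worth, the CEDICT parsing idea can be sketched in a few lines. This is only a rough sketch, assuming the usual CC-CEDICT line layout of `Traditional Simplified [pin1 yin1] /gloss/gloss/`. The names `parse_cedict` and `build_pinyin_index` are made up for illustration, and the sample lines are hand-written rather than copied from the real file.

```python
import re
from collections import defaultdict

# Assumed CC-CEDICT entry layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")

def parse_cedict(lines):
    """Yield (traditional, simplified, pinyin, glosses) for each entry line."""
    for line in lines:
        if line.startswith("#"):  # skip comment/header lines
            continue
        m = ENTRY.match(line)
        if m:
            trad, simp, pinyin, glosses = m.groups()
            yield trad, simp, pinyin.lower(), glosses.split("/")

def build_pinyin_index(entries):
    """Map each pinyin reading (with tone numbers) to a set of simplified headwords."""
    index = defaultdict(set)
    for _trad, simp, pinyin, _glosses in entries:
        index[pinyin].add(simp)
    return index

# Hand-written sample lines, for illustration only.
sample = [
    "# sample header line",
    "美麗 美丽 [mei3 li4] /beautiful/",
    "漂亮 漂亮 [piao4 liang5] /pretty/beautiful/",
    "有 有 [you3] /to have/there is/there are/to exist/to be/",
]
index = build_pinyin_index(parse_cedict(sample))
print(index["mei3 li4"])
```

Frequency data would have to come from a separate source, since the dictionary itself only gives the readings.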

EDIT: Also would be looking for statistics on character frequency in the real world.

August 28, 2009 at 12:26 AM

Mapping from pinyin to Hanzi is quite a complex task. Doing it a character at a time, you're likely to encounter a high error rate. Doing it a word at a time means you'll need a way to segment the input into words, which is not trivial with so many homophones and no spaces, and even less so if the pinyin is typed without tones.

Probably your best bet is to parse a dictionary like CEDICT, then have a read of this page, which mentions some decent word matching algorithms.
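To give a feel for why segmentation is not trivial, here is a minimal sketch of forward maximum matching, a common baseline. It is not necessarily one of the algorithms on that page, and the `max_match` function and the tiny lexicon are made up for illustration.

```python
def max_match(syllables, lexicon, max_len=4):
    """Greedy longest-match segmentation of a pinyin syllable list against a lexicon."""
    out = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; fall back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            key = " ".join(syllables[i:i + n])
            if n == 1 or key in lexicon:
                out.append(key)
                i += n
                break
    return out

# Hypothetical lexicon: 中国, 人民 and 中国人 keyed by their pinyin.
lexicon = {"zhong1 guo2", "ren2 min2", "zhong1 guo2 ren2"}
print(max_match("zhong1 guo2 ren2 min2".split(), lexicon))
```

With this lexicon the greedy match grabs "zhong1 guo2 ren2" first and leaves "min2" stranded, where a reader would expect "zhong1 guo2" + "ren2 min2". That kind of mistake is why frequency information is usually needed on top of a plain word list.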

For character frequency, Jun Da provides character frequency lists for a number of different corpora (classical, modern, imaginative, etc.).

August 28, 2009 at 12:58 AM

You can try the Adso dictionary, too, which is much more comprehensive than CEDICT. It's the dictionary used in Chinese Perapera-kun, among other things.

http://adsotrans.com/downloads/

https://addons.mozilla.org/en-US/firefox/addon/3349

August 28, 2009 at 04:30 AM

Thanks very much guys - that's exactly the sort of info I was looking for - especially the character frequency data.

@imron - I'm not looking to automate the selection of a single hanzi from a single pinyin word. It would be more like with an IME where the user would enter pinyin and then be presented a list of possible characters. This should definitely be ordered according to frequency.

I may have it so that even partially entered pinyin will present a list of possibilities. So, for example, the user types "y" (the suggestion list starts with "一" yi1) and then adds "o" (the suggestion list now starts with "有" you3) and so on.
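Roughly, the lookup I have in mind could be sketched like this. The `suggest` function is just an illustration, and the frequency numbers are invented placeholders rather than values from any real list.

```python
def suggest(prefix, readings, limit=5):
    """Return characters whose toneless pinyin starts with prefix, most frequent first."""
    hits = [(freq, char) for char, pinyin, freq in readings if pinyin.startswith(prefix)]
    hits.sort(key=lambda item: -item[0])  # highest frequency first
    return [char for _freq, char in hits[:limit]]

# (character, toneless pinyin, made-up frequency count) -- illustrative values only
readings = [
    ("一", "yi", 900),
    ("有", "you", 700),
    ("要", "yao", 600),
    ("用", "yong", 400),
    ("又", "you", 300),
]

print(suggest("y", readings))
print(suggest("yo", readings))
```

A linear scan like this is fine for a sketch. With a full character table, a sorted list or a trie keyed on the pinyin would probably be a better fit.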

Are there any frequency lists for multi-character words? This is important for sorting the results of a dictionary lookup. Actually this is the thing I need most since all the stuff mentioned above may be covered by a good IME.

August 28, 2009 at 04:36 AM

The "bigram" (i.e. words composed of two characters) frequency list is for words, but it is probably not very complete and would not include chengyu, which are typically 4 characters long.

http://lingua.mtsu.edu/chinese-computing/statistics/index.html

Bigram frequency lists 汉字双字组频率列表

* 新闻类文本双字组频率：Bigram frequency list of the news sub-corpus

* 一般小说类文本双字组频率：Bigram frequency list of the general fiction sub-corpus

August 28, 2009 at 04:56 AM

Actually, bigrams aren't words, and can be a nonsense combination of characters (that just happen to have a high frequency of appearing next to each other).
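A small sketch shows the difference. It assumes a bigram list is built by counting every pair of adjacent characters with no regard for word boundaries; the `char_bigrams` name and the sample sentence are made up for illustration.

```python
from collections import Counter

def char_bigrams(text):
    """Count every pair of adjacent Han characters, ignoring word boundaries."""
    counts = Counter()
    for a, b in zip(text, text[1:]):
        # Keep only pairs where both characters are in the basic CJK block.
        if "\u4e00" <= a <= "\u9fff" and "\u4e00" <= b <= "\u9fff":
            counts[a + b] += 1
    return counts

counts = char_bigrams("我的书和你的书")
print(counts.most_common(3))
```

In this sentence the most frequent bigram is "的书", which is a particle followed by a noun and not a word in its own right.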

@westmeadboy, actually I figured you might want to do some sort of sentence based prediction (like all modern IMEs), which is why I included the link above for word matching (note this is word matching, not single hanzi matching).

Libtabe (also mentioned in the resources section of the same link) has a word database that provides both character and word frequency information.

August 28, 2009 at 05:35 AM

Just stumbled across this link: http://technology.chtsai.org/wordlist

I looked at libtabe but it looks like quite a bit of work just to extract the information. I'm much more comfortable handling plain text files such as CEDICT.

The Chinese Community Information Center list looks useful though I'm not exactly sure what they mean by "phrases". Most of the entries are bigrams. Maybe they just mean "words"!

This link talks about using search engines to gauge character frequency: http://yong321.freeshell.org/misc/ChineseCharFrequency.html

...and I don't see why this couldn't be extended to words despite the problem of word boundaries making for some potential anomalies in the results.

Edited August 28, 2009 at 05:59 AM by westmeadboy

September 12, 2009 at 04:00 AM

I was wondering about an alternative to using word frequency lists...

If I search for the word "/beautiful/" in CC-CEDICT I get 38 results. Now I want to display these results in a meaningful order. I don't think it's enough to use word frequency alone because what we really want is "word frequency when that word is used to mean 'beautiful'".

I suppose this is something that is very difficult to figure out.
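For reference, searching with the slashes amounts to matching a whole gloss and not a substring. A minimal sketch of the two kinds of lookup, with hand-written entries and made-up function names:

```python
def exact_gloss_matches(entries, query):
    """Headwords that have a gloss exactly equal to query."""
    return [hw for hw, glosses in entries if query in glosses]

def substring_matches(entries, query):
    """Headwords where query appears anywhere inside any gloss."""
    return [hw for hw, glosses in entries if any(query in g for g in glosses)]

# Hand-written (headword, glosses) pairs, for illustration only.
entries = [
    ("美丽", ["beautiful"]),
    ("美女", ["beautiful woman"]),
    ("漂亮", ["pretty", "beautiful"]),
]

print(exact_gloss_matches(entries, "beautiful"))
print(substring_matches(entries, "beautiful"))
```

The exact match returns fewer entries, but it still says nothing about how often each word is actually used in that sense.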

At the other end of the spectrum, I'm looking for the easiest way to order the results reasonably effectively.

For example, what do people think about using the HSK level? I know this only covers a small proportion of the words but I suppose for most people, most of the time those are the ones they are interested in...
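The HSK idea would be cheap to try. A rough sketch, where the levels in `hsk_level` are hypothetical placeholders (a real table would come from an actual HSK word list) and words without a level simply sort last:

```python
def rank_by_hsk(words, hsk_level, unknown=99):
    """Sort words by HSK level (lower first); words without a level go last."""
    # sorted() is stable, so words with the same level keep their original order.
    return sorted(words, key=lambda w: hsk_level.get(w, unknown))

# Hypothetical levels for illustration only.
hsk_level = {"漂亮": 1, "美丽": 3}
results = ["优美", "美丽", "漂亮"]

print(rank_by_hsk(results, hsk_level))
```

Since the sort is stable, the input could be pre-sorted by word frequency so that frequency acts as a tie-breaker within each level.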

September 12, 2009 at 04:45 AM

Two somewhat unrelated ideas:

1. You might want to add the ability to read dictionaries in the Stardict format, as there are lots of them available: http://stardict.sourceforge.net/Dictionaries_zh_CN.php A number of E-C dictionaries are already available.

2. CEDICT lacks example sentences. It would be nice if you could link the words to example sentences. There is a database of 20,000 example sentences here: http://www.mnemosyne-proj.org/node/115 You will probably need to pre-index the corpus.

I actually see 166 result entries for "beautiful" when I searched on mdbg.

http://www.mdbg.net/chindict/chindict.php?page=worddict&wdrst=0&wdqb=beautiful#wordadvanced

Is that more than just CEDICT?

September 12, 2009 at 04:52 AM

Thanks for the ideas.

1. Yeah, I've already (briefly) looked at stardict. I didn't find any info on the data structure but I haven't looked that hard yet.

2. Yes, I'd definitely like to bring in example sentences at some point. That will probably be an optional module.

"beautiful" - I was only talking about exact matches so not things like "beautiful woman".

September 12, 2009 at 05:04 AM

You will be able to find code for reading the Stardict dictionary format in the source code:

http://stardict.sourceforge.net/other.php

stardict-3.0.1.tar.bz2 1964K

stardict-tools-3.0.1.tar.bz2 410K
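As a starting point, here is a rough sketch of reading the .idx index file. It assumes the layout as I understand it from the format notes shipped with the source: each entry is a null-terminated UTF-8 word followed by a 32-bit big-endian offset and a 32-bit big-endian size into the .dict file. Treat that as an assumption and check it against the source before relying on it, since newer files may use 64-bit offsets. The `parse_idx` name and the sample bytes are made up.

```python
import struct

def parse_idx(data):
    """Parse raw .idx bytes into (word, offset, size) tuples (assumed 32-bit layout)."""
    entries = []
    pos = 0
    while pos < len(data):
        end = data.index(b"\x00", pos)  # the word is null-terminated
        word = data[pos:end].decode("utf-8")
        # Two unsigned 32-bit big-endian integers follow the terminator.
        offset, size = struct.unpack_from(">II", data, end + 1)
        entries.append((word, offset, size))
        pos = end + 1 + 8
    return entries

# Hand-built sample index with two entries, for illustration only.
sample_idx = (
    "美丽".encode("utf-8") + b"\x00" + struct.pack(">II", 0, 12)
    + b"cat\x00" + struct.pack(">II", 12, 5)
)
print(parse_idx(sample_idx))
```

The offset and size then tell you which slice of the .dict file holds the definition for each word.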

westmeadboy

imron

gato

westmeadboy

gato

imron

westmeadboy

westmeadboy

gato

westmeadboy

gato