Chinese-Forums
Pinyin Hanzi offline database?


westmeadboy

I'm writing some kind of dictionary software and as part of this am looking for a good database to provide mapping from Pinyin to Hanzi. The sort of job you would expect an IME to do.

I was thinking of just parsing the CEDICT dictionary and building statistics based on that. But is there a more reliable/comprehensive way?
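A minimal sketch of the CEDICT parsing idea, assuming the usual CC-CEDICT line layout (`TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss/gloss/`). The function names and the hand-typed, abbreviated sample lines are illustrative assumptions only, and a real IME would also need frequency data on top of this index.

```python
import re
from collections import defaultdict

# One CC-CEDICT entry per line: TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss 1/gloss 2/
ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")

def parse_cedict(lines):
    """Yield (traditional, simplified, pinyin, glosses) for each entry line."""
    for line in lines:
        if line.startswith("#"):
            continue  # header and comment lines
        m = ENTRY_RE.match(line.rstrip("\n"))
        if m:
            trad, simp, pinyin, glosses = m.groups()
            yield trad, simp, pinyin.lower(), glosses.split("/")

def build_pinyin_index(entries, strip_tones=True):
    """Map a space-free pinyin key to the set of simplified headwords."""
    index = defaultdict(set)
    for _trad, simp, pinyin, _glosses in entries:
        key = pinyin.replace(" ", "")
        if strip_tones:
            key = re.sub(r"[1-5]", "", key)  # drop tone numbers
        index[key].add(simp)
    return index

# Hand-typed, abbreviated sample lines (not copied from the real file).
sample = [
    "# CC-CEDICT sample",
    "一 一 [yi1] /one/single/",
    "有 有 [you3] /to have/there is/",
    "漂亮 漂亮 [piao4 liang5] /pretty/beautiful/",
]
index = build_pinyin_index(parse_cedict(sample))
print(sorted(index["piaoliang"]))
```

With the real file, `parse_cedict(open(path, encoding="utf-8"))` would feed the same index builder, where `path` is wherever the downloaded dictionary lives.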

EDIT: Also would be looking for statistics on character frequency in the real world.

Mapping from pinyin to Hanzi is quite a complex task. Doing it a character at a time, you're likely to encounter a high rate of error. Doing it a word at a time means you'll need a way to segment Chinese words (not trivial with so many homophones and no spaces, and even harder if the pinyin is typed without tones).

Probably your best bet is to parse a dictionary like CEDICT, then have a read of this page, which mentions some decent word matching algorithms.
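To give a feel for what word matching involves, here is a rough sketch of greedy forward maximum matching, one common segmentation technique. It is not necessarily one of the algorithms the linked page describes, and the tiny lexicon is a made-up stand-in for the headwords you would pull out of CEDICT.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

# Toy lexicon for illustration; in practice this would be the dictionary headwords.
lexicon = {"我", "喜欢", "中文", "学习", "学"}
print(segment("我喜欢学习中文", lexicon))
```

Being greedy, it picks the wrong split whenever the longest match is not the intended word, which is part of why segmentation is not trivial.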

For character frequency, Jun Da provides character frequency lists for several kinds of text (classical, modern, imaginative, etc.).

Thanks very much guys - that's exactly the sort of info I was looking for - especially the character frequency data.

@imron - I'm not looking to automate the selection of a single hanzi from a single pinyin word. It would be more like with an IME where the user would enter pinyin and then be presented a list of possible characters. This should definitely be ordered according to frequency.

I may have it so that even partially entered pinyin will present a list of possibilities. So, for example, the user types "y" (the suggestion list starts with "一" yi1) and then adds "o" (the suggestion list now starts with "有" you3) and so on.
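Something like the following sketch would give that behaviour, assuming a small in-memory candidate list. The frequency weights are made-up placeholders rather than real counts, and `suggest` is just a name picked for the example.

```python
# (pinyin with tone number, hanzi, frequency weight). The weights are made-up
# placeholders; real ones would come from a character frequency list.
CANDIDATES = [
    ("yi1", "一", 100),
    ("you3", "有", 90),
    ("yao4", "要", 80),
    ("ye3", "也", 70),
]

def suggest(prefix, candidates=CANDIDATES, limit=5):
    """Return hanzi whose pinyin starts with prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [c for c in candidates if c[0].startswith(prefix)]
    matches.sort(key=lambda c: c[2], reverse=True)
    return [hanzi for _pinyin, hanzi, _weight in matches[:limit]]

print(suggest("y"))   # list starts with 一
print(suggest("yo"))  # list starts with 有
```

A linear scan like this is fine for a few thousand characters. For a full word list, a trie or a sorted list searched with `bisect` would probably be a better fit.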

Are there any frequency lists for multi-character words? This is important for sorting the results of a dictionary lookup. Actually this is the thing I need most since all the stuff mentioned above may be covered by a good IME.

The "bigram" (i.e. words composed of two characters) frequency list is for words, but it is probably not very complete and would not include chengyu, which are typically 4 characters long.

http://lingua.mtsu.edu/chinese-computing/statistics/index.html

Bigram frequency lists 汉字双字组频率列表

* 新闻类文本双字组频率:Bigram frequency list of the news sub-corpus

* 一般小说类文本双字组频率:Bigram frequency list of the general fiction sub-corpus

Actually, bigrams aren't words, and can be a nonsense combination of characters (that just happen to have a high frequency of appearing next to each other).
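A quick sketch of how a character bigram count is typically produced shows why. The sample string is made up for illustration, and this is not claimed to be how the lists above were actually built.

```python
from collections import Counter

def char_bigrams(text):
    """Count every pair of adjacent characters, ignoring word boundaries."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

counts = char_bigrams("我的书和我的笔")
print(counts.most_common(3))
```

Pairs such as "的书" and "和我" get counted even though they straddle a word boundary, so a high count alone does not make a pair a word.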

@westmeadboy, actually I figured you might want to do some sort of sentence based prediction (like all modern IMEs), which is why I included the link above for word matching (note this is word matching, not single hanzi matching).

Libtabe (also mentioned in the resources section of the same link) has a word database that provides both character and word frequency information.

Just stumbled across this link: http://technology.chtsai.org/wordlist

I looked at libtabe but it looks like quite a bit of work just to extract the information. I'm much more comfortable handling plain text files such as CEDICT.

The Chinese Community Information Center list looks useful though I'm not exactly sure what they mean by "phrases". Most of the entries are bigrams. Maybe they just mean "words"!

This link talks about using search engines to gauge character frequency: http://yong321.freeshell.org/misc/ChineseCharFrequency.html

...and I don't see why this couldn't be extended to words despite the problem of word boundaries making for some potential anomalies in the results.

Edited by westmeadboy
  • 2 weeks later...

I was wondering about an alternative to using word frequency lists...

If I search for the word "/beautiful/" in CC-CEDICT I get 38 results. Now I want to display these results in a meaningful order. I don't think it's enough to use word frequency alone, because what we really want is "word frequency when that word is used to mean 'beautiful'".

I suppose this is something that is very difficult to figure out.

At the other end of the spectrum, I'm looking for the easiest way to order the results reasonably effectively.

For example, what do people think about using the HSK level? I know this only covers a small proportion of the words but I suppose for most people, most of the time those are the ones they are interested in...
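As a rough sketch of that idea: filter to exact gloss matches, then sort by HSK level with unlisted words last, using overall word frequency as a tie-breaker. The entry layout, the HSK levels, and the frequency numbers below are all placeholders for illustration, not real data.

```python
def exact_gloss_matches(entries, term):
    """Keep entries where one gloss equals term exactly (the "/beautiful/" style search)."""
    return [e for e in entries if term in e["glosses"]]

def rank(entries, hsk_level, frequency):
    """Sort by HSK level (lowest first, unlisted words last), then by frequency."""
    def key(entry):
        word = entry["word"]
        return (hsk_level.get(word, float("inf")), -frequency.get(word, 0))
    return sorted(entries, key=key)

# Placeholder data: glosses are abbreviated, levels and counts are invented.
entries = [
    {"word": "美丽", "glosses": ["beautiful"]},
    {"word": "优美", "glosses": ["graceful", "beautiful"]},
    {"word": "美女", "glosses": ["beautiful woman"]},
    {"word": "漂亮", "glosses": ["pretty", "beautiful"]},
]
hsk_level = {"漂亮": 1, "美丽": 4}
frequency = {"漂亮": 500, "美丽": 300, "优美": 50}

ranked = rank(exact_gloss_matches(entries, "beautiful"), hsk_level, frequency)
print([e["word"] for e in ranked])
```

This still doesn't solve the "frequency when used in this sense" problem. It only pushes the words most learners care about towards the top.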

Two somewhat unrelated ideas:

1. You might want to add the ability to read dictionaries in the Stardict format, as there are lots of them available: http://stardict.sourceforge.net/Dictionaries_zh_CN.php A number of E-C dictionaries are already available.

2. CEDICT lacks example sentences. Would be nice if you can link up the words to example sentences. There is a database of 20,000 example sentences here: http://www.mnemosyne-proj.org/node/115 Probably will need to pre-index the corpus.
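On the pre-indexing point, a minimal sketch of one way to do it: scan each sentence once and record which dictionary headwords occur in it as substrings. The sentences and headwords here are invented for the example, and the format of the real sentence database is not assumed.

```python
from collections import defaultdict

def index_sentences(sentences, headwords, max_len=4):
    """Map each headword to the indices of the sentences that contain it."""
    index = defaultdict(list)
    for i, sentence in enumerate(sentences):
        seen = set()  # record each headword at most once per sentence
        for start in range(len(sentence)):
            for size in range(1, max_len + 1):
                piece = sentence[start:start + size]
                if piece in headwords and piece not in seen:
                    seen.add(piece)
                    index[piece].append(i)
    return index

# Invented sample data for illustration.
sentences = ["她很漂亮。", "这里的风景很美丽。", "漂亮的花"]
headwords = {"漂亮", "美丽", "风景"}
index = index_sentences(sentences, headwords)
print(index["漂亮"])
```

Plain substring matching will sometimes hit character pairs that straddle a word boundary, so proper segmentation would give cleaner results.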

I actually see 166 result entries for "beautiful" when I searched on mdbg.

http://www.mdbg.net/chindict/chindict.php?page=worddict&wdrst=0&wdqb=beautiful#wordadvanced

Is that more than just CEDICT?

Thanks for the ideas.

1. Yeah, I've already (briefly) looked at stardict. I didn't find any info on the data structure but I haven't looked that hard yet.

2. Yes, I'd definitely like to bring in example sentences at some point. That will probably be an optional module.

"beautiful" - I was only talking about exact matches so not things like "beautiful woman".
