tresgog Posted April 29, 2009 at 03:56 AM
Hi, I hope this thread is in the right category. I would like to find a chart or table with two columns: one is a given pinyin syllable, say "yi" (for now, without any tone specification), and the other is the number of characters corresponding to that pinyin. I have already found many online sources where, if you enter a pinyin syllable, they give you a list of characters (sometimes even with the number of characters). I tried to build this chart myself from that data, but there are so many different pinyin syllables that it takes forever to finish the chart. Does anyone already know where to find that kind of chart? Here are some sources:
1) mdbg: http://us.mdbg.net/chindict/chindict.php?page=chardict&cdcanoce=0&cdqchi=yi%0D%0A&cddmtm=2&cddytm=0 (for instance, typing "yi" gives me 250 characters).
2) A freeware program called "BabelMap_汉化版". It is a very exhaustive "map" of all Chinese characters (maybe not all). In this software, if I look for the pinyin "yi" I get 604 characters! But this tool gives redundant answers because it includes both traditional and simplified characters, and I only want a chart (or statistics) for a given set of characters. Secondly, BabelMap also includes non-Mandarin readings: if I ask for "ng" it gives me the Cantonese "唔" (which is, I think, "wu" in Mandarin pinyin).
Anyway, some of these sources are great, but it takes too long to build the chart from them, which is why I'm looking for a ready-made chart. Thanks for reading.
Don_Horhe Posted April 29, 2009 at 06:12 AM
Wenlin can do this.
roddy Posted April 29, 2009 at 06:16 AM
I can see how you could automate this, but you'd need some database skills (or perhaps even just Excel?). Get the Unihan database. Throw out everything that doesn't have a GB2312 entry. That cuts it down to 6000+ simplified characters, so we're not going to be swamped. Or there may be a GB2312-specific file somewhere. Then do some kind of search and replace to remove the numbers from YI3, WANG2, etc. Then run a query to find how many of each entry there are in the pronunciation column. Having exhausted myself with this high-level conceptual work, I shall leave the actual implementation to others.
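The steps roddy describes could be sketched roughly as follows in Python 3. This is only an illustration under stated assumptions: it assumes the Unihan data comes as tab-separated records of the form code point, field name, value; that the presence of a `kGB0` field marks a GB2312 character; and that `kMandarin` holds space-separated readings ending in a tone digit, such as `YI4`. The function name `count_pinyin` and the sample records are made up for this example, and other Unihan releases may lay the data out differently.

```python
import re
from collections import Counter, defaultdict


def count_pinyin(lines, charset_field="kGB0", keep_tones=False):
    """Count characters per pinyin syllable in Unihan-style records."""
    fields = defaultdict(dict)  # code point -> {field name: value}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a "code point, field, value" record
        code_point, field, value = parts
        fields[code_point][field] = value

    counts = Counter()
    for entry in fields.values():
        if charset_field not in entry or "kMandarin" not in entry:
            continue  # throw out characters outside the chosen set
        readings = set()
        for reading in entry["kMandarin"].lower().split():
            if not keep_tones:
                reading = re.sub(r"\d+$", "", reading)  # yi4 -> yi
            readings.add(reading)
        counts.update(readings)  # a character counts once per syllable
    return counts


# Made-up sample records in the assumed format (values are illustrative).
sample = [
    "U+4E00\tkGB0\t5027",
    "U+4E00\tkMandarin\tYI1",
    "U+4EA6\tkGB0\t5064",
    "U+4EA6\tkMandarin\tYI4",
    "U+6F22\tkMandarin\tHAN4",  # no kGB0 field, so it is skipped
]

for syllable, n in count_pinyin(sample).most_common():
    print(syllable, n)
```

To run it on real data, one would pass an open file instead of `sample`, for example `count_pinyin(open("Unihan.txt", encoding="utf-8"))`. Passing a different `charset_field` would select another character set, provided the data carries a field for it.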
tresgog (Author) Posted April 29, 2009 at 07:39 AM
Thanks for the prompt answers. A few questions regarding the previous posts.
1) Is Wenlin freeware? Or can I at least get this functionality in the free version (if there is one)?
2) How do I edit, manage, sort, and "throw out" entries from this Unicode file? (This question is for the others.)
roddy Posted April 29, 2009 at 08:54 AM
Wenlin is not freeware. My suggestions would only be feasible if you have some experience of working with databases.
tresgog (Author) Posted April 29, 2009 at 11:56 AM
Thank you for your suggestions, although I would prefer a direct answer to my query. I don't mind learning the database skills required to answer my question (I am sure they can be useful in other fields). If someone knows of the "holy grail" list/chart, a link would be most welcome. If such a list does not exist, it will be my honor to generate it. In that case, I need someone to show me the way.
I downloaded the Unihan list. It is very exhaustive: there are some characters which are not even in standard, well-known Chinese dictionaries (especially vulgar terms). I am very impressed. How can I differentiate simplified from traditional? How is GB2312 coded? If I take a given Unicode character, it does not seem to have any label mentioning GB2312. For instance, "汉":
U+6C49 kCCCII 224672
U+6C49 kCangjie EE
U+6C49 kCantonese hon3
U+6C49 kDefinition Chinese people; Chinese language
U+6C49 kEACC 274857
U+6C49 kFourCornerCode 3714
U+6C49 kFrequency 3
U+6C49 kGB0 2626
U+6C49 kHKSCS FAE4
U+6C49 kHanYu 31549.080
U+6C49 kHanyuPinlu han4(227)
U+6C49 kIICore 2.1
U+6C49 kIRGHanyuDaZidian 31549.080
U+6C49 kIRGKangXi 0604.091
U+6C49 kIRG_GSource 0-3A3A
U+6C49 kIRG_HSource FAE4
U+6C49 kIRG_TSource F-2166
U+6C49 kKangXi 0604.091
U+6C49 kMainlandTelegraph 3352
U+6C49 kMandarin YI4 HAN4
U+6C49 kMorohashi 99999
U+6C49 kPhonetic 546
U+6C49 kRSKangXi 85.2
U+6C49 kRSUnicode 85.2
U+6C49 kTotalStrokes 5
U+6C49 kTraditionalVariant U+6F22
U+6C49 kXHC1983 0441.040:hàn
I don't see anything that involves "GB2312", apart from the fact that I know it is a PRC character. Basically, how do I "throw out" the non-GB2312 characters (or the non-GB/T 12345 ones, because I am more interested in traditional characters for now)?
NB: what is the label 227 next to han4?
peekay Posted April 29, 2009 at 02:27 PM
The number represents the frequency count for that character in the "Modern Standard Beijing Chinese Frequency Dictionary" (現代漢語頻率詞典, Xiandai Hanyu Pinlu Cidian). See UAX #38: Unicode Han Database for field definitions.
roddy Posted April 29, 2009 at 02:40 PM
The line "U+6C49 kGB0 2626" is the GB2312 info.
stoney Posted April 29, 2009 at 03:57 PM
You can download Wakan for free. In the dictionary, when you enter the pinyin, it tells you the number of corresponding characters. http://wakan.manga.cz/
m_k_e Posted April 29, 2009 at 04:46 PM
Hey, you could use the CC-CEDICT list. If you're on Linux:
    wget http://www.mdbg.net/chindict/export/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz
Gunzip it and, using sed or VIM (or your favorite text editor), you could extract just the one-character entries. The commands I used in VIM:
    # delete comments
    :g/^#/d
    # delete multiple-character entries
    :g/^[^ ]\{2}/d
    # delete left bracket
    :%s/\[//
    # delete right bracket
    :%s/\]//
    # delete explanation
    :%s/\/.*//
    # convert everything to lowercase
    :%s/[A-Z]/\L&/
And then, in a bash-like shell:
    awk '{pinyinCount[$3]++} END{for(i in pinyinCount) print i, pinyinCount[i]}' input_file.txt | sort -r -k 2nr > pinyin_count.txt
This got me something like the following:
    me@you:/tmp$ head pinyin_count.txt
    yi4 88
    yu4 72
    xi1 69
    bi4 68
    li4 65
    yu2 61
    ji4 57
    zhi4 56
    fu2 54
    qi2 51
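For readers who would rather not chain VIM and awk, the same idea could be sketched in Python 3. This is a rough equivalent and not m_k_e's actual pipeline: it assumes the CC-CEDICT line layout `Traditional Simplified [pin1 yin1] /gloss/`, and the function name `count_single_char_readings` and the sample lines are invented for illustration. Like the pipeline above, it counts entries rather than distinct characters, so a character listed twice with the same reading is counted twice.

```python
import re
from collections import Counter

# Assumed CC-CEDICT layout: Traditional Simplified [pin1 yin1] /gloss/...
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /")


def count_single_char_readings(lines):
    """Count single-character entries per toned pinyin syllable."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        match = ENTRY.match(line)
        if not match:
            continue  # skip anything that is not a dictionary entry
        traditional, _simplified, pinyin = match.groups()
        if len(traditional) != 1:
            continue  # skip multiple-character entries
        counts[pinyin.lower()] += 1  # lowercase so "Yi4" and "yi4" merge
    return counts


# Invented sample lines in the assumed format.
sample = [
    "# CC-CEDICT comment line",
    "一 一 [yi1] /one/",
    "亦 亦 [yi4] /also/",
    "義 义 [yi4] /justice/",
    "傳統 传统 [chuan2 tong3] /tradition/",
    "易 易 [Yi4] /surname Yi/",
]

counts = count_single_char_readings(sample)
# Highest count first, ties broken alphabetically.
for syllable, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(syllable, n)
```

On the real file, `sample` would be replaced by an open file handle for the gunzipped download, read as UTF-8.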
cababunga Posted April 29, 2009 at 05:56 PM
I don't see how this list can be of any use, but it was very easy to make it out of the Unihan database, so here it is: http://mandarinspot.com/static/pinyin-count-no-tones-gb.txt Enjoy.
tresgog (Author) Posted April 30, 2009 at 09:45 AM
How do I generate the "pinyin vs. count" chart using Wenlin? A question less related to the topic: how do I update Wenlin? It does not seem to be absolutely complete (even though it seems to be very powerful software). For instance, I cannot find the character whose Unicode code point is U+216A6 (caution: it is vulgar). Some other databases have a definition and other details for this character, e.g. http://www.zdic.net/zd/zi3/ZdicF0ZdicA1Zdic9AZdicA6.htm (sorry I picked this one, but it was the only example I found...)
tresgog (Author) Posted April 30, 2009 at 10:19 AM
Thank you cababunga, I only noticed now that you have answered my question. Could you give the details of how to regenerate the list? Then I could tune some parameters, add some filters (simplified/traditional, frequency, etc.) and add the different tones. I think this list is intrinsically interesting because it shows, for instance, that the most used pinyin syllable in Chinese is "yi". Maybe because it is easier to pronounce? Of course, the character count for this pinyin includes all the characters in the database, including the ones that are never used. What would be more interesting is to couple a given pinyin with the frequency of use of the characters. That way, we would be able to know which pinyin is the "most-said" (I think it is still "yi"). Anyway, thanks for the answer.
cababunga Posted April 30, 2009 at 06:24 PM
Generating the list is very simple. I think it's so obvious that I don't even know what to explain. You've seen the structure of the Unihan.txt file; you just need some programming skills to collect the necessary data from it.
Accounting for frequency is not as easy as it seems at first. The Unihan database has a frequency rating for most, but not all, characters. Here is one I just picked at random that doesn't have one: 鼋. For those that do, there is no information on what the numbers really mean. Are characters with kFrequency 1 ten times more frequent than those with kFrequency 2, or three times more frequent? Besides, there is often more than one pronunciation for the same character, and although Mandarin pronunciation variants are said to be sorted by frequency, there are no real numbers you could use for calculations.
A more sensible approach would be to also use at least one of the character frequency lists, of which there are plenty on the internet. This will at least give you real character frequencies, but you would still lack the frequency of each pronunciation of a particular character. What would be much better is to derive your data from real word frequencies. Most Chinese words have only one Mandarin pronunciation, with the exception of some single-character words. This would give you something you can rely on, but unfortunately it's too much work just to satisfy curiosity.
OK, here is something for you to play with.
This one is the same as before, but accounting for tones and restricted to the GB2312 character set only (the previous one was run for all GB extensions as well): http://mandarinspot.com/static/pinyin-count-with-tones-gb2312.txt
This one is the same, but for the Big5 character set: http://mandarinspot.com/static/pinyin-count-with-tones-big5.txt
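The character-frequency approach mentioned in this post could look something like the sketch below. It is not cababunga's code. It assumes two hypothetical inputs: a mapping from each character to its readings, most common first, and a mapping from each character to an occurrence count taken from some frequency list. The function name and all numbers are invented. Each character's whole count is credited to its first reading, which is exactly the approximation the post warns about for characters with several pronunciations.

```python
from collections import Counter


def weighted_syllable_counts(readings, char_freq):
    """Sum character frequencies per toneless syllable.

    readings:  {character: [reading, ...]}, most common reading first
    char_freq: {character: occurrence count from a frequency list}
    """
    totals = Counter()
    for char, freq in char_freq.items():
        char_readings = readings.get(char)
        if not char_readings:
            continue  # no pronunciation data for this character
        # Credit everything to the first reading and drop the tone digit.
        syllable = char_readings[0].rstrip("012345")
        totals[syllable] += freq
    return totals


# Invented example data, for illustration only.
readings = {"的": ["de5", "di2", "di4"], "一": ["yi1"], "以": ["yi3"]}
char_freq = {"的": 1000, "一": 400, "以": 150}

for syllable, total in weighted_syllable_counts(readings, char_freq).most_common():
    print(syllable, total)
```

Deriving the counts from word frequencies instead, as suggested above, would need a word list with pronunciations, which this sketch does not attempt.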
YuehanHao Posted April 30, 2009 at 11:22 PM
Writing to say thanks for the lists too. I couldn't agree more that making those lists is simple and obvious, but it would nevertheless have taken me several hours, let's say, to remember or figure out the way to actually implement the solution myself (shameful as it is to admit that). As someone who is seemingly studying Chinese language as much as a general subject of curiosity as to become proficient at using it, those lists are perfect for me. I have casually wondered about distribution of characters to syllables for some time. 约翰好
tresgog (Author) Posted May 4, 2009 at 05:44 AM
I think this topic is nearly closed, but I still would not be able to sort the Unihan text file. The main reason is that I lack some "programming skills". I have used C before to do some easy physics computations (simulations and so on), but I wouldn't be able to sort and do statistics on a data file. Would someone happen to know some basic tutorials for this kind of sorting? cababunga, can you provide the code you used? Thank you again.
Don_Horhe Posted May 4, 2009 at 04:44 PM
@ tresgog
1. Start Wenlin.
2. In the menu, go to Options -> Hanzi filter and set it to ALL 60,000+ Hanzi.
3. Now go to List -> Characters by pinyin and type in "yi".
cababunga Posted May 4, 2009 at 05:35 PM
OK, let me at least make it readable. Requires Python 2.5+. http://mandarinspot.com/static/pinyincount.py
I'll leave it as an exercise for you to figure out how to run it on Windows, in case you use that OS. http://www.python.org/doc/faq/windows/#how-do-i-run-a-python-program-under-windows