phyrex Posted November 10, 2009 at 05:39 AM
I just talked with my girlfriend (Chinese) about it, and she wasn't too convinced. I then did the first 100, which looked more convincing to her. There's a problem with the list, though: it lists traditional and simplified characters separately. So you'd either have to combine them in the program, which is way more work than I care to do, or we'd have to use a list which only lists one kind. Anyhow, I'll attach the first 100 chengyu (please note: this means the first 100 chengyu in the list that tooironic provided, sorted by google results, NOT the top 100 most-used chengyu of the list). If you think it's fine, tell me, and then I'll run the program over the whole list (or a better one, if anybody cares to provide one).

鼻子不是鼻子脸不是脸 4340000
包在...身上 4310000
阿狗阿猫 4130000
阿狗阿貓 4100000
表里一致 1320000
表裡一致 1310000
百无一二 1210000
百無一二 1210000
鼻子不是鼻子臉不是臉 1130000
边都沾不上 1040000
邊都沾不上 940000
爱不释手 779000
愛不釋手 771000
比手划脚 461000
比手劃腳 459000
变化莫测 417000
變化莫測 412000
八千里路云和月 395000
八千里路雲和月 389000
別具一格 387000
办事不牢 313000
辦事不牢 310000
八九不离十 299000
按劳分配 270000
按勞分配 269000
彼一时,此一时 230000
彼一時,此一時 230000
標新立異 198000
标新立异 198000
百感交集 186000
蹦蹦跳跳 166000
按兵不动 136000
按兵不動 135000
愛面子 131000
爱面子 131000
杯水车薪 129000
杯水車薪 129000
鼻子气歪了 120000
背水一战 120000
背水一戰 118000
暴跳如雷 104000
閉門造車 98700
闭门造车 98600
抱佛腳 94800
抱佛脚 94500
阿猫阿狗 93000
阿貓阿狗 90600
八九不離十 87700
鼻子氣歪了 86300
班門弄斧 81200
班门弄斧 80800
愛莫能助 77500
爱莫能助 77500
悲歡歲月 73000
悲欢岁月 72600
半斤八两 72400
半斤八兩 71700
百廢俱興 71600
安分守己 71500
百废俱兴 70300
半推半就 67900
白駒過隙 65600
白驹过隙 65500
豹頭環眼 55100
报仇雪恨 54100
報仇雪恨 54000
暗暗自责 53700
豹头环眼 52500
白刀子进,红刀子出 50300
白刀子進,紅刀子出 49900
扮猪吃老虎 48200
扮豬吃老虎 47800
八仙过海,各显神通 44400
八仙過海,各顯神通 43700
百思不解 41600
八竿子打不著 38000
抱头鼠窜 37400
抱頭鼠竄 37300
百口莫辩 33700
百口莫辯 33500
安之若素 32300
拜倒石榴裙下 31100
杯弓蛇影 29300
八竿子打不着 19000
杯酒釋兵權 17900
杯酒释兵权 17800
綁赴市曹 16300
绑赴市曹 16300
杯盤狼藉 15700
杯盘狼藉 15500
安家立業 9960
安家立业 9900
彪腹狼腰 4790
碧眼童颜 2280
碧眼童顏 2280
暗暗自責 1640
豹头猿臂 1480
豹頭猿臂 1470
报雠雪恨 799
報讎雪恨 793
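The "combine them in the program" option mentioned in this post could look roughly like the sketch below. It is written in Python 3 and is only an illustration: `TRAD_TO_SIMP` is a hypothetical, hand-made two-entry mapping, and a real run would need a complete traditional-to-simplified conversion table or library. Adding the two counts together is also just one possible choice, not something the thread settled on.

```python
# Sketch (Python 3): fold traditional and simplified variants into one entry.
# TRAD_TO_SIMP is a hypothetical, hand-written mapping for illustration only.
TRAD_TO_SIMP = {
    "愛不釋手": "爱不释手",
    "杯水車薪": "杯水车薪",
}


def merge_variants(counts, trad_to_simp):
    """Return a dict keyed by simplified form, summing the counts of both variants."""
    merged = {}
    for phrase, count in counts.items():
        key = trad_to_simp.get(phrase, phrase)  # unmapped phrases keep their own key
        merged[key] = merged.get(key, 0) + count
    return merged


# A few counts taken from the unquoted list in this post.
counts = {
    "爱不释手": 779000,
    "愛不釋手": 771000,
    "杯水车薪": 129000,
    "杯水車薪": 129000,
    "百感交集": 186000,
}
merged = merge_variants(counts, TRAD_TO_SIMP)
for phrase, count in sorted(merged.items(), key=lambda item: item[1], reverse=True):
    print(phrase, count)
```

Keeping the variants separate, as the list does now, remains the simpler option if regional differences are of interest.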
BertR Posted November 10, 2009 at 08:25 AM
Quote: hmkay, I got something, but before I let it run over all 1600 chengyu, I let it do the first 30. Can someone with more Chinese knowledge than me tell me if they're more or less in the right ballpark, concerning relative frequency?
Quote: 阿狗阿猫 4130000
Did you search for "阿狗阿猫" or for 阿狗阿猫 (with or without quotes)? When I search for "阿狗阿猫" I get only 30900 hits using Google.
phyrex Posted November 10, 2009 at 08:27 AM
Good question, didn't even think of that. I'll add some "" and see if that changes things.
phyrex Posted November 10, 2009 at 08:32 AM
Very good suggestion, BertR, the numbers have changed considerably:

爱不释手 654000
百感交集 195000
标新立异 188000
蹦蹦跳跳 160000
愛不釋手 156000
按兵不动 127000
杯水车薪 123000
按劳分配 114000
按勞分配 113000
爱面子 113000
背水一战 105000
暴跳如雷 105000
闭门造车 88500
抱佛脚 81900
变化莫测 75100
變化莫測 74300
班门弄斧 73100
安分守己 71500
八九不离十 68500
八九不離十 68400
半推半就 67900
爱莫能助 66900
阿猫阿狗 66000
白驹过隙 62500
半斤八两 62400
報仇雪恨 56600
报仇雪恨 56500
八仙过海,各显神通 42200
八仙過海,各顯神通 42100
百思不解 41500
八千里路云和月 38400
八千里路雲和月 38100
扮猪吃老虎 36800
扮豬吃老虎 36600
抱头鼠窜 34500
安之若素 32400
杯弓蛇影 29300
百口莫辩 29100
阿貓阿狗 27100
愛面子 20800
杯酒釋兵權 18100
杯酒释兵权 18000
背水一戰 17200
別具一格 17200
八竿子打不着 15500
比手划脚 13400
比手劃腳 13400
抱佛腳 13300
標新立異 12600
按兵不動 11900
杯盘狼藉 11400
閉門造車 10800
愛莫能助 10500
半斤八兩 9840
杯水車薪 9470
百廢俱興 8490
百废俱兴 8320
班門弄斧 7840
表里一致 7620
安家立業 7600
表裡一致 7560
安家立业 7510
八竿子打不著 7180
悲歡歲月 6920
悲欢岁月 6880
办事不牢 6700
阿狗阿貓 6630
辦事不牢 6600
阿狗阿猫 6540
豹頭環眼 5720
豹头环眼 5650
白刀子进,红刀子出 4850
白刀子進,紅刀子出 4780
百口莫辯 4490
杯盤狼藉 4260
边都沾不上 3620
邊都沾不上 3440
白駒過隙 3210
抱頭鼠竄 3020
彼一时,此一时 2890
彼一時,此一時 2860
暗暗自责 2460
拜倒石榴裙下 2340
鼻子气歪了 1760
鼻子氣歪了 1740
鼻子不是鼻子脸不是脸 1450
鼻子不是鼻子臉不是臉 1430
包在...身上 1260
百无一二 1200
百無一二 1200
碧眼童颜 584
碧眼童顏 576
彪腹狼腰 490
報讎雪恨 348
报雠雪恨 347
豹头猿臂 295
豹頭猿臂 295
绑赴市曹 242
綁赴市曹 241
暗暗自責 89
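The difference between the two runs comes down to whether the search term is wrapped in double quotes (exact-phrase search) before it is URL-encoded. A minimal Python 3 sketch of that step follows; `build_query` is a hypothetical helper name, not part of any program posted in this thread, and it only builds the query string without contacting a search engine.

```python
from urllib.parse import urlencode, parse_qs


def build_query(phrase, exact=True):
    """Build a URL query string, optionally wrapping the phrase in double quotes."""
    term = '"%s"' % phrase if exact else phrase
    return urlencode({"q": term})  # non-ASCII text is percent-encoded as UTF-8


quoted = build_query("阿狗阿猫")
unquoted = build_query("阿狗阿猫", exact=False)

# Decoding the query strings again shows the only difference is the quotes.
print(parse_qs(quoted)["q"][0])
print(parse_qs(unquoted)["q"][0])
```

Quoting exactly once matters: a phrase that is wrapped in quotes twice is a different query again.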
tooironic Posted November 10, 2009 at 08:49 AM
Yes, adding quotation marks would be a must! One note though: so many chengyu have developed through history that just doing the first 100 of that already small list would not be much of an indication of anything. Remember, like I said earlier, that list contains not only chengyu but any idiomatic expression in Mandarin, so you get entries like 愛面子, 包在...身上, etc., which are not chengyu. Still, go right ahead and do the entire list if you can! Jiayou~
EDIT: I might also add that the list having both trad and simp forms could be kind of interesting, as it might highlight differences between mainland China and Taiwan, HK, etc.
phyrex Posted November 10, 2009 at 08:52 AM
tooironic, it was just to see if the program works before I let it run for an hour to get the results for the whole list ;) Since it seems to return sort-of-sane results now, I'll let it go over the list. I found out, though, that the numbers I get from the google api are different from the ones you get when you go to google.com. I'll have a look at that. You guys go and find me more and better chengyu lists ;)
chrix (Author) Posted November 10, 2009 at 08:57 AM
phyrex, that's great! Do you think you could do it with the 40,000 from my master list? Since the list is from the Taiwanese MOE, it would be all in traditional characters though...
phyrex Posted November 10, 2009 at 09:01 AM
I don't see why not. Or, if you're not scared of a few lines of code, I'll give you the program and you can do it yourself.
PS: I'll need the list in plaintext format as previously described, though. Preferably only the chengyus, one per line.
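For anyone preparing a list in the requested format (plain text, one chengyu per line), a small clean-up pass can catch blank lines and duplicates before the list is handed over. The Python 3 sketch below is an assumption about what such a pass might look like; `load_phrases` is a hypothetical helper, not part of phyrex's program.

```python
def load_phrases(lines):
    """Strip whitespace, drop blank lines and duplicates, keep first-seen order."""
    seen = set()
    phrases = []
    for line in lines:
        phrase = line.strip()
        if not phrase or phrase in seen:
            continue
        seen.add(phrase)
        phrases.append(phrase)
    return phrases


# In-memory sample standing in for the lines of a UTF-8 text file.
sample = ["愛不釋手\n", "\n", "  杯水車薪 \n", "愛不釋手\n"]
phrases = load_phrases(sample)
print(phrases)
```

With a real file, the same function could be fed an open file object, for example `open("chengyu.txt", encoding="utf-8")`, where the file name is only a placeholder.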
phyrex Posted November 10, 2009 at 09:13 AM
I got around that stupid google api. The results are very different now. I'll need somebody to have a look at both lists and tell me which one is more accurate in assessing the frequency of the chengyus. Below are the first 100 chengyu of the list, sorted by the estimated search results that google gives on google.com:

百無一二 5970000
百无一二 5440000
辦事不牢 4050000
办事不牢 3870000
豹頭環眼 3590000
豹头环眼 3450000
暗暗自责 3340000
百廢俱興 2940000
表裡一致 2840000
百废俱兴 2810000
爱不释手 2810000
边都沾不上 2740000
邊都沾不上 2730000
表里一致 2500000
八竿子打不著 2320000
鼻子氣歪了 1890000
鼻子气歪了 1860000
彼一時,此一時 1390000
八九不離十 1270000
彼一时,此一时 1260000
包在...身上 1160000
百感交集 915000
蹦蹦跳跳 882000
标新立异 870000
拜倒石榴裙下 866000
愛不釋手 832000
按劳分配 775000
按勞分配 774000
按兵不动 631000
暴跳如雷 623000
杯水车薪 608000
爱面子 594000
背水一战 528000
阿貓阿狗 435000
闭门造车 424000
抱佛脚 411000
安分守己 397000
变化莫测 396000
變化莫測 395000
八九不离十 371000
爱莫能助 356000
半推半就 345000
班门弄斧 344000
報仇雪恨 331000
报仇雪恨 331000
半斤八两 330000
白驹过隙 293000
碧眼童颜 290000
阿猫阿狗 290000
碧眼童顏 290000
豹頭猿臂 262000
豹头猿臂 250000
百思不解 224000
白刀子進,紅刀子出 212000
扮豬吃老虎 207000
扮猪吃老虎 207000
抱头鼠窜 205000
八仙过海,各显神通 196000
八仙過海,各顯神通 195000
八千里路云和月 189000
白刀子进,红刀子出 189000
八千里路雲和月 188000
安之若素 177000
杯弓蛇影 162000
報讎雪恨 158000
报雠雪恨 158000
百口莫辩 154000
綁赴市曹 148000
愛面子 142000
绑赴市曹 132000
抱頭鼠竄 118000
杯盤狼藉 117000
白駒過隙 116000
背水一戰 99400
別具一格 95000
杯酒釋兵權 85500
杯酒释兵权 85500
按兵不動 85200
八竿子打不着 84900
抱佛腳 75200
比手划脚 73400
比手劃腳 73300
標新立異 70600
杯盘狼藉 64300
愛莫能助 63100
杯水車薪 62900
半斤八兩 62100
鼻子不是鼻子臉不是臉 58300
閉門造車 57700
班門弄斧 38200
安家立業 36400
安家立业 36300
阿狗阿貓 31700
阿狗阿猫 31500
悲欢岁月 30200
悲歡歲月 30200
百口莫辯 29700
鼻子不是鼻子脸不是脸 24500
彪腹狼腰 16900
暗暗自責 10300
phyrex Posted November 10, 2009 at 09:30 AM
Here's the sorted list of the 1626 chengyus from tooironic. It's based on the numbers from the google webapi, not google.com. I'm holding off on the google.com run until somebody verifies that those numbers are really much more accurate: going over google.com not only goes directly against the google terms of service, it also takes up to 10 seconds for each chengyu, so it'll take forever, and I want to know if it's worth it.
Attachment: chengyu_sorted.txt
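To put "forever" into numbers, the worst case can be estimated from the figures given in the thread: up to 10 seconds per chengyu, 1626 entries in tooironic's list, and roughly 40,000 in the master list chrix mentioned. The Python 3 sketch below only does that arithmetic; it is an upper bound, since 10 seconds is the stated maximum and not an average.

```python
def estimated_hours(n_phrases, seconds_per_query):
    """Total run time in hours if every query takes seconds_per_query seconds."""
    return n_phrases * seconds_per_query / 3600.0


worst_case = estimated_hours(1626, 10)          # tooironic's list
worst_case_master = estimated_hours(40000, 10)  # the roughly 40,000-entry master list

print("1626 entries: about %.1f hours" % worst_case)
print("40000 entries: about %.1f hours" % worst_case_master)
```

So the smaller list would need up to about four and a half hours, and the larger one several days of continuous querying.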
chrix (Author) Posted November 10, 2009 at 09:40 AM
phyrex, still working on the MOE list. It would be great if you could run it for me, but I would also be interested in learning about the code. Xiexie.
tooironic, it looks like a nice list. It shouldn't be too hard to cross-reference it with lists that are already widely available. 1,600 is actually an OK number: provided that the list has really been edited with an eye towards frequency, it would be a good number of core chengyu. Though I doubt that was a concern for the Wiktionary editors... I got maybe 2000-3000 by culling everything from CEDICT that is marked "idiom", "proverb", "literary saying" and the like, but had to enter a lot of quite well-known chengyu later.
tooironic Posted November 10, 2009 at 09:40 AM
Very interesting, thanks heaps for your hard work! It makes sense that chengyu like 随时随地, 无论如何, 不可思议, etc. would be so prominent, as they are very common. It would be great to get some input from native speakers though...
chrix: Yeah, the list from Wiktionary is very random, as most people just add whatever entries interest them at the time. But still, it's a good starting point I think.
chrix (Author) Posted November 10, 2009 at 09:42 AM
yeah, also tooironic, the next exciting question would be, what are the best strategies translators use when faced with chengyu? A lot of chengyu just translate into simplex words in English, some need circumlocutions, some have "English chengyu" counterparts, and in some cases you could use the "the Chinese have a saying" spiel.
phyrex Posted November 10, 2009 at 09:49 AM
The code is ugly but easy. I'll post it here and attach it for easier playing. Comments and criticism are welcome.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import urllib
import os, sys
import codecs
from operator import itemgetter
from xgoogle.search import GoogleSearch, SearchError

# scrape from google.com
def googleWebSearch(currentChengyu):
    try:
        # note: the main loop below already wraps the phrase in quotes,
        # so the phrase ends up quoted twice here
        chengyuCount = GoogleSearch('"%s"' % currentChengyu).num_results
    except SearchError, e:
        print "Google search failed: %s" % e
        sys.exit(1)
    return chengyuCount

# use google web api
def googleAPIsearch(currentChengyu):
    # note: the second dict lands in urlencode's doseq parameter,
    # so 'hl' is not actually sent with the request
    query = urllib.urlencode({'q': currentChengyu}, {'hl': 'cn'})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    est = data['cursor']['estimatedResultCount']
    return int(est)

#START
if len(sys.argv) < 3:
    print "USAGE: chengyu.py listWithChengYus.txt fileNameForSortedChengYus.txt"
    sys.exit(1)

#read chengyu list
chengyulist = codecs.open(sys.argv[1], "r", encoding='utf-8')  #might have to adjust encoding
chengyu_results = {}
i = 1

# get results
for chengyu in chengyulist.readlines():
    chengyu = chengyu.strip().encode("utf-8")
    if len(chengyu) < 1:  # skip blank lines (checked before the quotes are added)
        continue
    chengyu = '"' + chengyu + '"'
    chengyu_results[chengyu] = googleWebSearch(chengyu)
    print i, chengyu, chengyu_results[chengyu]
    i = i + 1

# sort list by frequency
chengyu_sorted = sorted(chengyu_results.items(), key=itemgetter(1), reverse=True)

# write sorted list, one "chengyu count" pair per line
output = open(sys.argv[2], "w")
for chengyu in chengyu_sorted:
    output.write(str(chengyu[0]) + " " + str(chengyu[1]) + "\n")
output.close()

EDIT: whoops, embarrassing. I forgot the quotation marks again for the google websearch! I fixed it in the quote, but not in the uploaded file. Please note this when playing with the code!
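The posted script is Python 2 and depends on live Google queries, so it is hard to try out as-is. The sketch below is a Python 3 rendering of just its rank-and-write steps, with the lookup passed in as a function so that no network access is needed. It is not phyrex's code: `rank_phrases`, `write_ranked` and `sample_counts` are hypothetical names, and the three counts are copied from the quoted-search results earlier in the thread.

```python
import io
from operator import itemgetter


def rank_phrases(phrases, count_fn):
    """Look up a count for each phrase and return (phrase, count) pairs, highest first."""
    results = {phrase: count_fn(phrase) for phrase in phrases}
    return sorted(results.items(), key=itemgetter(1), reverse=True)


def write_ranked(ranked, out):
    """Write one 'phrase count' pair per line to a text stream."""
    for phrase, count in ranked:
        out.write("%s %d\n" % (phrase, count))


# Stand-in for the search step: a fixed table instead of a web query.
sample_counts = {"百感交集": 195000, "暗暗自責": 89, "爱不释手": 654000}

ranked = rank_phrases(sample_counts, sample_counts.get)
buf = io.StringIO()
write_ranked(ranked, buf)
print(buf.getvalue(), end="")
```

Passing the lookup in as `count_fn` keeps the ranking logic testable and would let the web API and the google.com scraper be swapped without touching the rest.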
tooironic Posted November 10, 2009 at 09:54 AM
Quote: the next exciting question would be, what are the best strategies translators use when faced with chengyu?
To be honest, I think that question is out of the scope of this forum, unless they create a sub-forum dedicated to translation studies and professional translators were around to answer such questions. I've tried to request one, but it seems the demand is not there. Still, you could create a new topic and see the answers you get.
chrix (Author) Posted November 10, 2009 at 09:57 AM
well, one question does not an entire subforum make. Is there some literature available on this? I'd like to learn more about it, is all....
roddy was discussing the future of the forums a day or two ago, maybe you can reiterate your request. I'd be happy to second you....
phyrex Posted November 10, 2009 at 10:00 AM
Here are the 'first 100' google.com results, this time WITH quotation marks. Now it looks a bit uglier, but that could be easily fixed if needed.

"鼻子不是鼻子脸不是脸" 223000000
"鼻子不是鼻子臉不是臉" 119000000
"阿狗阿猫" 76300000
"阿狗阿貓" 75900000
"百廢俱興" 12700000
"百废俱兴" 11600000
"包在...身上" 9960000
"比手划脚" 8530000
"百無一二" 7260000
"百无一二" 7250000
"八千里路雲和月" 7200000
"边都沾不上" 6710000
"愛不釋手" 6590000
"爱不释手" 6580000
"八九不离十" 5540000
"彼一時,此一時" 4250000
"邊都沾不上" 3730000
"爱莫能助" 3460000
"彪腹狼腰" 3050000
"按勞分配" 3040000
"表里一致" 2920000
"表裡一致" 2900000
"按兵不动" 2510000
"爱面子" 2430000
"鼻子气歪了" 2370000
"变化莫测" 2230000
"變化莫測" 2220000
"闭门造车" 1830000
"閉門造車" 1830000
"別具一格" 1760000
"阿猫阿狗" 1730000
"拜倒石榴裙下" 1520000
"班門弄斧" 1510000
"鼻子氣歪了" 1380000
"按劳分配" 1100000
"比手劃腳" 985000
"標新立異" 935000
"标新立异" 935000
"八千里路云和月" 917000
"百感交集" 897000
"蹦蹦跳跳" 895000
"八九不離十" 761000
"愛面子" 729000
"阿貓阿狗" 720000
"按兵不動" 710000
"办事不牢" 698000
"辦事不牢" 697000
"抱頭鼠竄" 690000
"杯水車薪" 666000
"杯水车薪" 664000
"背水一战" 624000
"暴跳如雷" 622000
"背水一戰" 619000
"彼一时,此一时" 578000
"杯弓蛇影" 506000
"抱佛腳" 486000
"抱佛脚" 485000
"扮豬吃老虎" 456000
"豹头环眼" 436000
"白刀子進,紅刀子出" 425000
"愛莫能助" 420000
"白刀子进,红刀子出" 414000
"安分守己" 397000
"半斤八两" 393000
"半斤八兩" 391000
"班门弄斧" 382000
"暗暗自責" 350000
"報仇雪恨" 348000
"半推半就" 346000
"暗暗自责" 333000
"杯酒釋兵權" 329000
"报仇雪恨" 326000
"白駒過隙" 311000
"白驹过隙" 310000
"扮猪吃老虎" 231000
"百思不解" 224000
"抱头鼠窜" 222000
"八仙过海,各显神通" 213000
"八仙過海,各顯神通" 212000
"百口莫辩" 184000
"百口莫辯" 183000
"安之若素" 177000
"悲欢岁月" 166000
"悲歡歲月" 166000
"八竿子打不著" 141000
"豹頭環眼" 126000
"綁赴市曹" 105000
"八竿子打不着" 92100
"杯酒释兵权" 85000
"杯盤狼藉" 84900
"杯盘狼藉" 84700
"碧眼童颜" 72300
"碧眼童顏" 72300
"绑赴市曹" 68500
"豹頭猿臂" 68200
"豹头猿臂" 67300
"安家立業" 51300
"報讎雪恨" 49800
"安家立业" 48400
"报雠雪恨" 46000

EDIT: turns out, those numbers are not what google actually says. To get what google actually says, one MUST NOT add the "". *sigh*. Can't anybody settle on a standard?
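The "uglier" output (quotation marks around every chengyu) can be tidied after the fact without rerunning any searches. The Python 3 sketch below assumes each output line has the form `"phrase" count`, as in the list above; `parse_line` is a hypothetical helper and the three sample lines are copied from that list.

```python
def parse_line(line):
    """Split 'phrase count' on the last space and strip surrounding double quotes."""
    phrase, count = line.strip().rsplit(" ", 1)
    return phrase.strip('"'), int(count)


lines = [
    '"百感交集" 897000',
    '"阿狗阿猫" 76300000',
    '"包在...身上" 9960000',
]

# Parse every line, then sort by count, highest first.
rows = sorted((parse_line(line) for line in lines), key=lambda row: row[1], reverse=True)
for phrase, count in rows:
    print(phrase, count)
```

Splitting on the last space only keeps any phrase that itself contains a space intact.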
tooironic Posted November 10, 2009 at 10:08 AM
Quote: well, one question does not an entire subforum make. Is there some literature available on this? I'd like to learn more about it, is all....
Hehe, and neither does just a handful of interested people. Honestly, the more I think about it, the more I realise that an entire subforum wouldn't really be appropriate here. From what I can tell, there are only about half a dozen professional translators who are active posters here, and, at any rate, that does not guarantee that they would be willing to discuss anything beyond language issues, which are adequately covered in the other forums. But I'd be happy to discuss it further with you via email or MSN. I'll PM you.
chrix (Author) Posted November 10, 2009 at 10:08 AM
So, here's the list. It's actually 48,032 entries, but that's because it includes every obscure variant that can be found in chengyu dictionaries...
Attachment: chengyu moe.txt
chrix (Author) Posted November 10, 2009 at 10:11 AM
tooironic, well I know there's the fanyi mailing-list, but if you have half a dozen translators on this forum, that's already a bit more than those interested in Classical Chinese. But we have been quite active and it seems that roddy is gonna give us a subforum. So it's not so much about the number of people but more about constant activity over a period of time...