phyrex Posted November 10, 2009 at 10:13 AM
Wow, nearly one megabyte of chengyu o_O You want me to run my program over it, or will you do that? I'd only use the API variant though, for speed reasons. Do you want it formatted like the other one, or would you rather go without the Google hits number?
tooironic Posted November 10, 2009 at 10:13 AM
"tooironic, well I know there's the fanyi mailing-list, but if you have half a dozen translators on this forum, that's already a bit more than those interested in Classical Chinese. But we have been quite active and it seems that roddy is gonna give us a subforum"
Well, OK, maybe the number is more like 3 or 4, including me haha. But yeah, like I said, just because there are some professional translators here doesn't mean they would be willing to discuss anything beyond the words themselves. Still, that's great about the Classical Chinese forum, it certainly would be useful for a lot of people.
chrix (Author) Posted November 10, 2009 at 10:17 AM
phyrex, my programming abilities are still stuck in Basic/Pascal land, so I'd appreciate it if you could do it for me. But it's a great thing to have as reference, because Python is on the list of things I want to learn at some point. I'm not sure which results, but it would be good to have the chengyu and whatever value Google returns. There are some problems with the list because some two-character combinations included there as chengyu might actually be quite commonly used in other contexts (i.e. not as chengyu), but that's something I would need to take a closer look at anyways. Thanks so much
phyrex Posted November 10, 2009 at 10:36 AM
chrix, no problem. It's running at a rate of about 2 chengyu per second now, so I'll get back to you in a couple of hours ;) Python is very very simple and very very useful. Even simpler than Pascal, in my opinion, and, of course, much more useful! ;) Just look at the code, it's nearly self-explanatory
chrix (Author) Posted November 10, 2009 at 10:43 AM
thanks, I'll definitely check it out! Thanks!
phyrex Posted November 10, 2009 at 12:03 PM
After 5500 chengyu I got an error. I knew I should have checked for valid results *sigh*.. Well, you'll have to wait longer for the full list, it seems :-/
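A rough sketch of the kind of validity check phyrex mentions, so that one bad response does not end a long run. This is not phyrex's program: `fetch_hit_count` is a hypothetical stand-in for whatever actually queries the search API, and `collect_counts` is a name made up for this example.

```python
from typing import Callable, Dict, Iterable, Optional


def collect_counts(
    words: Iterable[str],
    fetch_hit_count: Callable[[str], Optional[int]],  # hypothetical lookup function
    done: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Query each word once, skipping invalid results instead of crashing."""
    results = dict(done or {})
    for word in words:
        if word in results:
            continue  # already fetched in an earlier run, do not query again
        try:
            count = fetch_hit_count(word)
        except Exception:
            continue  # network or parsing failure: leave the word for a later run
        # Only record results that look like a real hit count.
        if isinstance(count, int) and count >= 0:
            results[word] = count
    return results
```

Passing the results of an interrupted run back in as `done` lets a second pass pick up only the words that are still missing.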
renzhe Posted November 10, 2009 at 01:18 PM
Wow, great work here! As soon as you agree on a list of the 1000 most common chengyu, I'll start learning them. 48,000 chengyu....
chrix (Author) Posted November 10, 2009 at 03:35 PM
phyrex, no worries, I'm very happy as it is that you're so kind to do that for us, so whenever you find the time.
renzhe, I think Beijing has a list of about 1,000, and Taiwan has a core set of 1,945 or so. But the vast majority of the 48,000 monster list are variants. If you strip them all away, you'd get far fewer, maybe 8,000, I don't know.. Unfortunately I haven't been able to find either of the two lists, there's only that Singapore list, but it's far too short with 150 or so....
chrix (Author) Posted November 10, 2009 at 11:25 PM
Here are the exact figures:
1. Construction of a large chengyu database: frequency compilation and statistics for the overall database, drawn from more than thirty books, 48,276 entries.
2. Selection and editing of chengyu groups: groups selected for shared origin and for usage worth consulting (1,594 groups, 5,123 items in total), plus 23,385 chengyu from the Appendix (《附錄》) and the Revised Mandarin Chinese Dictionary (《重編國語辭典修訂本》), for 28,508 items in the whole book.
For some reason the list I downloaded only has 48,022 entries, don't know what happened to the 254 entries difference.
phyrex Posted November 11, 2009 at 02:42 AM
Whew. After 14.5 hrs of running the program, and praying that my computer and program and internet connection would remain stable, I'm finally through. I'm attaching the complete list, and for shits and giggles here are the top 100. As chrix expected, most of them are just two character combinations that show up pretty often just as words, but there's a four character chengyu or two in there too:
下 340000000
結果 80300000
受 65500000
目的 59300000
分析 55800000
程度 42800000
伙伴 40900000
注 39500000
英才 23100000
口碑 17700000
漂亮 17500000
便宜 14900000
薄 14200000
發 14000000
女士 13800000
索引 13400000
托 12100000
矛盾 9940000
全身 9780000
潮流 9370000
抱歉 8790000
東西 8770000
亮相 8100000
一面 7120000
真相 6850000
激烈 6330000
二天 6000000
真人 5450000
招賢納士 5180000
一系列 5060000
目光 4920000
吃喝玩樂 4620000
春秋 4510000
不遠 4460000
三千 4300000
大方 4250000
渴望 4250000
隨時隨地 4120000
取代 4070000
春意盎然 3960000
不知不覺 3830000
盲目 3760000
紅樓 3730000
與眾不同 3670000
完璧 3620000
流水 3600000
意想不到 3590000
三十六計 3490000
淑女 3440000
私房 3300000
三昧 3270000
領導幹部 3210000
神州 3170000
宜家 3160000
掌上 3030000
霧里看花 2920000
手足 2910000
人品 2860000
馬上 2820000
霧裡看花 2780000
面子 2760000
霧裏看花 2730000
佳人 2690000
白雪 2540000
元老 2530000
神鬼 2500000
泰山 2470000
石田 2440000
無論如何 2440000
在原 2380000
錦囊 2320000
立案 2300000
北斗 2260000
立地 2200000
端的 2190000
足不出戶 2160000
大亨 2150000
前所未有 2120000
澄清 2060000
千萬 2030000
提高警惕 2030000
化身 2030000
蜜月 1960000
關鍵時刻 1910000
糟糕 1900000
千金 1890000
打招呼 1880000
風情 1860000
火力 1850000
光棍 1850000
知己 1820000
觀光 1750000
疏通 1730000
調查研究 1650000
天下第一 1650000
琢磨 1640000
引人注目 1630000
莫名其妙 1620000
布袋 1610000
屠龍 1580000
chrix, after you've had a nice look at it, maybe you can share with us if you notice anything interesting. Oh, and nice blog, btw
chengyumoe_sorted.txt.zip
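For anyone who wants to reproduce a "top N" cut like the one above from a results file, here is a minimal sketch. It assumes each line holds a word and a hit count separated by whitespace, as in the list posted here; `parse_results` and `top_n` are names invented for this example, not part of phyrex's program.

```python
from typing import List, Tuple

Entry = Tuple[str, int]


def parse_results(text: str) -> List[Entry]:
    """Turn 'word count' lines into (word, count) pairs, skipping malformed lines."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # blank line, header, or anything that is not 'word count'
        entries.append((parts[0], int(parts[1])))
    return entries


def top_n(entries: List[Entry], n: int) -> List[Entry]:
    """Return the n entries with the highest hit counts, highest first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)[:n]
```

Entries with equal counts keep the order they had in the input, since Python's sort is stable.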
gato Posted November 11, 2009 at 03:29 AM
Hmm, the Taiwan MoE list has lots of two- and three-character words that are not chengyu. Don't know why they are in the list. There are also many four-character words in the list, like 研究调查, 一天一夜 and 上上下下, that are not chengyu. I've cleaned it up a bit to obtain the top 260 or so four-character words from the list. phyrex, would you be able to do a quick clean-up of the list to get rid of all non-four-character words? I tried a sort in Excel, but that only sorted the words by their pinyin and not their lengths.
招賢納士 5180000
吃喝玩樂 4620000
隨時隨地 4120000
春意盎然 3960000
不知不覺 3830000
與眾不同 3670000
意想不到 3590000
三十六計 3490000
領導幹部 3210000
霧里看花 2920000
霧裡看花 2780000
霧裏看花 2730000
無論如何 2440000
足不出戶 2160000
前所未有 2120000
提高警惕 2030000
關鍵時刻 1910000
調查研究 1650000
天下第一 1650000
引人注目 1630000
莫名其妙 1620000
不知所云 1540000
不可或缺 1530000
分期付款 1500000
認真落實 1460000
一年一度 1450000
不知所雲 1370000
一如既往 1370000
游山玩水 1370000
堅定不移 1320000
全力以赴 1310000
通貨膨脹 1300000
印象深刻 1270000
疑難雜症 1260000
熱烈歡迎 1250000
小心翼翼 1220000
實實在在 1220000
各式各樣 1200000
順利進行 1200000
自由自在 1190000
實話實說 1150000
可想而知 1150000
全心全意 1140000
完成任務 1120000
各行各業 1100000
因地制宜 1060000
命中注定 1010000
文責自負 998000
成千上萬 992000
精益求精 990000
他山之石 980000
一目了然 972000
有史以來 970000
十字路口 967000
七嘴八舌 946000
大大小小 931000
衣食住行 923000
行之有效 922000
艱苦奮鬥 918000
綜合分析 916000
一應俱全 908000
截然不同 892000
迫不及待 889000
勇往直前 882000
大千世界 866000
堅持不懈 859000
不知所措 846000
男婚女嫁 842000
事半功倍 821000
或多或少 820000
赤手空拳 816000
不由自主 816000
笑破肚皮 811000
盡管如此 796000
深入人心 769000
金榜題名 769000
齊心協力 755000
白手起家 751000
自然而然 747000
從小到大 746000
淚流滿面 738000
新陳代謝 735000
風雲人物 734000
絡繹不絕 722000
氣喘吁吁 718000
久而久之 715000
不可多得 697000
應運而生 696000
不言而喻 691000
如火如荼 685000
濫用職權 677000
物美價廉 674000
情不自禁 667000
不折不扣 666000
目瞪口呆 660000
大街小巷 656000
自言自語 655000
志同道合 649000
形形色色 644000
奄奄一息 644000
自古以來 643000
知名人士 641000
物超所值 640000
此時此刻 637000
天涯海角 634000
持之以恒 633000
耳目一新 627000
出乎意料 626000
食欲不振 624000
名列前茅 622000
來之不易 622000
左鄰右舍 620000
世外桃源 620000
花花公子 612000
取而代之 611000
難以置信 602000
風土人情 597000
玩忽職守 596000
勢在必行 595000
不可思議 590000
因人而異 590000
春暖花開 589000
下定決心 588000
統籌兼顧 584000
愈演愈烈 579000
暢通無阻 577000
哭笑不得 576000
不亦樂乎 575000
與生俱來 574000
時時刻刻 574000
不可抗力 570000
出人意料 569000
見死不救 568000
可見一斑 566000
喜怒哀樂 565000
世界末日 564000
津津樂道 562000
恍然大悟 562000
嚇了一跳 560000
天上人間 558000
近在咫尺 558000
比比皆是 554000
不得而知 550000
胡說八道 549000
一舉一動 547000
栩栩如生 546000
匪夷所思 543000
學以致用 543000
塵埃落定 541000
不可收拾 523000
不顧一切 522000
一目瞭然 522000
盡收眼底 519000
乾乾淨淨 518000
微不足道 517000
參差不齊 516000
古今中外 513000
明明白白 513000
一無所知 506000
千奇百怪 506000
東西南北 502000
風風雨雨 502000
精神實質 497000
卓有成效 496000
拭目以待 494000
清清楚楚 493000
徇私舞弊 492000
記憶猶新 492000
一席之地 491000
一心一意 490000
四面八方 489000
從頭到尾 487000
舞文弄墨 483000
罪魁禍首 482000
關鍵所在 481000
酸甜苦辣 480000
青山綠水 478000
談情說愛 477000
力所能及 468000
迫在眉睫 468000
雞皮疙瘩 465000
在所難免 465000
人間仙境 462000
心中有數 462000
讚不絕口 462000
一覽無餘 459000
毋庸置疑 458000
千千萬萬 452000
原來如此 448000
刮目相看 448000
污衊誹謗 446000
黃金時代 443000
呼之欲出 443000
有所作為 442000
相得益彰 442000
翻天覆地 439000
小心謹慎 438000
博大精深 436000
據我所知 435000
生動活潑 434000
服務周到 433000
興致勃勃 432000
萬事如意 431000
古色古香 428000
量力而行 427000
紅袖添香 422000
上上下下 421000
高高在上 421000
團結一致 419000
無話可說 418000
有目共睹 417000
胡言亂語 417000
縱橫天下 417000
似水流年 415000
大名鼎鼎 415000
血本無歸 414000
半信半疑 413000
屈指可數 412000
日日夜夜 411000
急功近利 411000
竭盡全力 410000
念念不忘 408000
此起彼伏 408000
物質文明 407000
開花結果 406000
準備就緒 403000
一天到晚 402000
首屈一指 401000
生意興隆 401000
硬著頭皮 400000
不一會兒 400000
興緻勃勃 399000
雪上加霜 399000
迷迷糊糊 398000
無所不能 398000
沒完沒了 397000
無家可歸 397000
詩詞歌賦 396000
日復一日 396000
意味深長 395000
起死回生 394000
大起大落 393000
回味無窮 392000
身臨其境 390000
面目全非 390000
德才兼備 390000
措手不及 387000
贊不絕口 386000
天下無敵 385000
熱情洋溢 384000
必由之路 384000
有的放矢 383000
舍己救人 383000
不計其數 382000
萬里長城 380000
精力充沛 380000
盡如人意 378000
五湖四海 378000
舉世矚目 378000
親眼目睹 378000
蒸蒸日上 377000
自給自足 377000
盡心盡力 376000
提出異議 373000
先到先得 372000
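The length filter gato asks for is only a few lines in Python, which counts characters rather than pinyin. A minimal sketch, assuming the list has already been read into (word, count) pairs; the function names are made up for this example.

```python
from typing import List, Tuple

Entry = Tuple[str, int]


def only_four_characters(entries: List[Entry]) -> List[Entry]:
    """Keep only entries whose word is exactly four characters long."""
    return [(word, count) for word, count in entries if len(word) == 4]


def sort_by_length(entries: List[Entry]) -> List[Entry]:
    """Sort by word length, the ordering the Excel sort could not give."""
    return sorted(entries, key=lambda entry: len(entry[0]))
```

Note that this only removes the two- and three-character items. Four-character phrases that are not really chengyu, like the ones gato lists, still have to be weeded out by hand or against a curated list.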
muyongshi Posted November 11, 2009 at 03:32 AM
I've been watching this thread intently. Thanks for your (computer's) hard work. Gato beat me to actually doing a clean-up of it though. And thanks Gato for the cleaned-up version...
muyongshi Posted November 11, 2009 at 03:39 AM
It's interesting, the slight skew that Google brings to something like this. The first one is not used in spoken Chinese (imho) but because of job searches it comes up high on Google. Also the second one is one that just seems.... not very 成语 to me. Or is that just me? Things like 领导干部, 分期付款, 关键时刻.... etc. seem to be in there a lot. I don't think I would include them myself in a "chengyu" list. Any thoughts on that? Anyway: my contribution being a simplified version of what gato posted (note: I haven't checked it for inconsistencies, just did a straight conversion so there may be mistakes)
招贤纳士 5180000
吃喝玩乐 4620000
随时随地 4120000
春意盎然 3960000
不知不觉 3830000
与众不同 3670000
意想不到 3590000
三十六计 3490000
领导干部 3210000
雾里看花 2920000
雾里看花 2780000
雾里看花 2730000
无论如何 2440000
足不出户 2160000
前所未有 2120000
提高警惕 2030000
关键时刻 1910000
调查研究 1650000
天下第一 1650000
引人注目 1630000
莫名其妙 1620000
不知所云 1540000
不可或缺 1530000
分期付款 1500000
认真落实 1460000
一年一度 1450000
不知所云 1370000
一如既往 1370000
游山玩水 1370000
坚定不移 1320000
全力以赴 1310000
通货膨胀 1300000
印象深刻 1270000
疑难杂症 1260000
热烈欢迎 1250000
小心翼翼 1220000
实实在在 1220000
各式各样 1200000
顺利进行 1200000
自由自在 1190000
实话实说 1150000
可想而知 1150000
全心全意 1140000
完成任务 1120000
各行各业 1100000
因地制宜 1060000
命中注定 1010000
文责自负 998000
成千上万 992000
精益求精 990000
他山之石 980000
一目了然 972000
有史以来 970000
十字路口 967000
七嘴八舌 946000
大大小小 931000
衣食住行 923000
行之有效 922000
艰苦奋斗 918000
综合分析 916000
一应俱全 908000
截然不同 892000
迫不及待 889000
勇往直前 882000
大千世界 866000
坚持不懈 859000
不知所措 846000
男婚女嫁 842000
事半功倍 821000
或多或少 820000
赤手空拳 816000
不由自主 816000
笑破肚皮 811000
尽管如此 796000
深入人心 769000
金榜题名 769000
齐心协力 755000
白手起家 751000
自然而然 747000
从小到大 746000
泪流满面 738000
新陈代谢 735000
风云人物 734000
络绎不绝 722000
气喘吁吁 718000
久而久之 715000
不可多得 697000
应运而生 696000
不言而喻 691000
如火如荼 685000
滥用职权 677000
物美价廉 674000
情不自禁 667000
不折不扣 666000
目瞪口呆 660000
大街小巷 656000
自言自语 655000
志同道合 649000
形形色色 644000
奄奄一息 644000
自古以来 643000
知名人士 641000
物超所值 640000
此时此刻 637000
天涯海角 634000
持之以恒 633000
耳目一新 627000
出乎意料 626000
食欲不振 624000
名列前茅 622000
来之不易 622000
左邻右舍 620000
世外桃源 620000
花花公子 612000
取而代之 611000
难以置信 602000
风土人情 597000
玩忽职守 596000
势在必行 595000
不可思议 590000
因人而异 590000
春暖花开 589000
下定决心 588000
统筹兼顾 584000
愈演愈烈 579000
畅通无阻 577000
哭笑不得 576000
不亦乐乎 575000
与生俱来 574000
时时刻刻 574000
不可抗力 570000
出人意料 569000
见死不救 568000
可见一斑 566000
喜怒哀乐 565000
世界末日 564000
津津乐道 562000
恍然大悟 562000
吓了一跳 560000
天上人间 558000
近在咫尺 558000
比比皆是 554000
不得而知 550000
胡说八道 549000
一举一动 547000
栩栩如生 546000
匪夷所思 543000
学以致用 543000
尘埃落定 541000
不可收拾 523000
不顾一切 522000
一目了然 522000
尽收眼底 519000
乾乾净净 518000
微不足道 517000
参差不齐 516000
古今中外 513000
明明白白 513000
一无所知 506000
千奇百怪 506000
东西南北 502000
风风雨雨 502000
精神实质 497000
卓有成效 496000
拭目以待 494000
清清楚楚 493000
徇私舞弊 492000
记忆犹新 492000
一席之地 491000
一心一意 490000
四面八方 489000
从头到尾 487000
舞文弄墨 483000
罪魁祸首 482000
关键所在 481000
酸甜苦辣 480000
青山绿水 478000
谈情说爱 477000
力所能及 468000
迫在眉睫 468000
鸡皮疙瘩 465000
在所难免 465000
人间仙境 462000
心中有数 462000
赞不绝口 462000
一览无馀 459000
毋庸置疑 458000
千千万万 452000
原来如此 448000
刮目相看 448000
污蔑诽谤 446000
黄金时代 443000
呼之欲出 443000
有所作为 442000
相得益彰 442000
翻天覆地 439000
小心谨慎 438000
博大精深 436000
据我所知 435000
生动活泼 434000
服务周到 433000
兴致勃勃 432000
万事如意 431000
古色古香 428000
量力而行 427000
红袖添香 422000
上上下下 421000
高高在上 421000
团结一致 419000
无话可说 418000
有目共睹 417000
胡言乱语 417000
纵横天下 417000
似水流年 415000
大名鼎鼎 415000
血本无归 414000
半信半疑 413000
屈指可数 412000
日日夜夜 411000
急功近利 411000
竭尽全力 410000
念念不忘 408000
此起彼伏 408000
物质文明 407000
开花结果 406000
准备就绪 403000
一天到晚 402000
首屈一指 401000
生意兴隆 401000
硬着头皮 400000
不一会儿 400000
兴致勃勃 399000
雪上加霜 399000
迷迷糊糊 398000
无所不能 398000
没完没了 397000
无家可归 397000
诗词歌赋 396000
日复一日 396000
意味深长 395000
起死回生 394000
大起大落 393000
回味无穷 392000
身临其境 390000
面目全非 390000
德才兼备 390000
措手不及 387000
赞不绝口 386000
天下无敌 385000
热情洋溢 384000
必由之路 384000
有的放矢 383000
舍己救人 383000
不计其数 382000
万里长城 380000
精力充沛 380000
尽如人意 378000
五湖四海 378000
举世瞩目 378000
亲眼目睹 378000
蒸蒸日上 377000
自给自足 377000
尽心尽力 376000
提出异议 373000
先到先得 372000
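A straight traditional-to-simplified conversion makes several variant spellings collapse into the same word (雾里看花 shows up three times above, each with a different count). Here is a small sketch for spotting those collisions after conversion. It assumes (word, count) pairs as input, and `group_duplicates` is a name invented for this example. What to do with the colliding counts (keep the largest, add them up, or rerun the search) is left open, since the thread does not settle that.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_duplicates(entries: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Map each word that appears more than once to all of its counts, in input order."""
    counts_by_word = defaultdict(list)
    for word, count in entries:
        counts_by_word[word].append(count)
    # Keep only the words that actually collided.
    return {word: counts for word, counts in counts_by_word.items() if len(counts) > 1}
```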
gato Posted November 11, 2009 at 03:49 AM
Chengyu is usually used to refer to written idioms derived from classic literature, so words like 天下第一 and 吃喝玩乐 are definitely not chengyu. Hehe. Why are they in the Taiwan Ministry of Education chengyu list?
phyrex Posted November 11, 2009 at 03:51 AM
Here you go, only the four-character results:
cms_onlychengyu.txt
phyrex Posted November 11, 2009 at 03:55 AM
What you guys can do if you're not satisfied with the results from the 'master list' is take a chengyu list that actually only lists chengyu and check that against the master list results. Writing a script to do that should be trivial, and will definitely take less time than pumping a new chengyu file through Google ;)
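The cross-check described here could look roughly like this. It is only a sketch: it assumes the master results have been loaded into a dict of word to hit count and the curated list into a plain list of words, and `lookup_counts` is a hypothetical name, not an existing script from this thread.

```python
from typing import Dict, List, Tuple


def lookup_counts(curated: List[str], master: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (word, count) for every curated chengyu found in the master results."""
    found = [(word, master[word]) for word in curated if word in master]
    # Highest hit count first, like the lists posted earlier in the thread.
    found.sort(key=lambda entry: entry[1], reverse=True)
    return found
```

Since this is just a dictionary lookup per word, it needs no new search queries at all.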
muyongshi Posted November 11, 2009 at 04:02 AM
I'm (slowly) working my way through the list and feeling out the top ones. Removing some and then I'll see where that leaves me. I plan on actually using this, so at some point I will take the ones with the same meaning and just group them, not paying attention to specific order at that point. Oh and that is definitely not a complete list, it doesn't have 对牛弹琴 on it. Also this was an interesting find: 吓了一跳. hmmmm... I am very satisfied with the list. It saves a TON of problems and puts us a lot closer to having an actual usage-based list even if it is not "perfect". I am very grateful! Also, in the simplified list the first "error" I've found is 干干净净
phyrex Posted November 11, 2009 at 04:08 AM
muyongshi, I'm glad that this can actually be of real help to somebody. If you want to manipulate the list in any way where you think some programming magic could make that easier, feel free to contact me.
muyongshi Posted November 11, 2009 at 04:13 AM
Hmmm... there are a lot of things I would like to do but don't know if I will ever have time. Like I might like to take the complete list converted into simplified and then cross-reference it with a physical chengyu dictionary, add missing entries and then rerun the search to see what the results are with simplified. But like I said, I probably will never actually get around to doing that.
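The "add missing entries" step boils down to finding which dictionary headwords are absent from the list. A minimal sketch, assuming both sides are available as lists of words in the same script (all simplified or all traditional); `missing_entries` is a name made up here.

```python
from typing import Iterable, List


def missing_entries(dictionary_words: Iterable[str], list_words: Iterable[str]) -> List[str]:
    """Return dictionary headwords that do not appear in the list, in dictionary order."""
    seen = set(list_words)
    return [word for word in dictionary_words if word not in seen]
```

Only the words this returns would need to go through a new search run.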
chrix (Author) Posted November 11, 2009 at 10:22 AM
phyrex, thank you so much for your efforts, I really appreciate it. And glad you like my blog too. I will cross-reference this against a list of "real" chengyu. The MOE list is taken from 30 chengyu-relevant sources, and they have included anything that appears in those sources. Muyongshi, look for 對牛彈琴, it's at 6390....
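To check where a particular chengyu sits in the results, something like the following would do. It is a sketch under the assumption that the results are (word, count) pairs and that position is counted from 1 with the highest count first; `find_entry` is a hypothetical helper. It returns both the position and the count.

```python
from typing import List, Optional, Tuple


def find_entry(word: str, entries: List[Tuple[str, int]]) -> Optional[Tuple[int, int]]:
    """Return (position, count) of word in the count-sorted list, or None if absent."""
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    for position, (candidate, count) in enumerate(ranked, start=1):
        if candidate == word:
            return position, count
    return None
```

Remember to search for the traditional form (對牛彈琴) in the original list and the simplified form (对牛弹琴) in the converted one.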