Wow, nearly one megabyte of chengyus o_O

You want me to run my program over it, or will you do that? I'd only use the API variant though, for speed reasons. Do you want it formatted like the other one, or would you rather go without the google hits number?

tooironic, well I know there's the fanyi mailing-list, but if you have half a dozen translators on this forum, that's already a bit more than those interested in Classical Chinese. But we have been quite active and it seems that roddy is gonna give us a subforum

Well, OK, maybe the number is more like 3 or 4, including me :) haha. But yeah, like I said, just because there are some professional translators here, doesn't mean they would be willing to discuss anything beyond the words themselves. Still, that's great about the Classical Chinese forum, it certainly would be useful for a lot of people.


phyrex, my programming abilities are still stuck in Basic/Pascal land, so I'd appreciate it if you could do it for me. But it's a great thing to have as reference, because python is on the list of things I want to learn at some point.

I'm not sure which results, but it would be good to have the chengyu and whatever value google returns. There are some problems with the list because some two-character combinations included there as chengyu might actually be quite commonly used in other contexts (i.e. not as chengyu), but that's something I would need to take a closer look at anyway.

Thanks so much :clap


chrix, no problem. It's running at a rate of about 2 chengyu per second now, so I'll get back to you in a couple hours ;)

python is very very simple and very very useful. Even simpler than pascal, in my opinion, and, of course, much more useful! ;) Just look at the code, it's nearly self-explanatory :)


after 5500 chengyu I got an error :( I knew I should have checked for valid results *sigh*.. well, you'll have to wait longer for the full list, it seems :-/
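For anyone curious what "checking for valid results" could look like, here is a minimal sketch, not the actual program. It assumes a hypothetical function fetch_hit_count(chengyu) that returns the hit count as an integer; that name, the retry count and the delay are all made up for illustration. Anything that raises an error or returns something that is not a non-negative integer is retried and then set aside, and entries fetched in an earlier run are skipped so a crash doesn't mean starting over.

```python
import time


def collect_counts(chengyu_list, fetch_hit_count, done=None, retries=3, delay=0.0):
    """Look up every chengyu, tolerating bad results and resuming earlier work.

    fetch_hit_count is a hypothetical callable (chengyu -> int) standing in
    for whatever search lookup is actually used.
    """
    results = dict(done or {})  # counts already collected in a previous run
    failed = []                 # entries that never produced a valid count
    for chengyu in chengyu_list:
        if chengyu in results:
            continue  # already fetched earlier, don't query again
        for _attempt in range(retries):
            try:
                count = fetch_hit_count(chengyu)
            except Exception:
                count = None  # treat any lookup error as an invalid result
            if isinstance(count, int) and count >= 0:
                results[chengyu] = count
                break
            time.sleep(delay)  # short pause before retrying
        else:
            failed.append(chengyu)  # all retries gave invalid results
    return results, failed
```

Saving results to disk every few hundred entries and passing it back in as done on the next run would make a long job restartable.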


Wow, great work here!

As soon as you agree on a list of 1000 most common chengyu, I'll start learning them :)

48,000 chengyu.... :shock::help


phyrex, no worries, I'm very happy as it is that you're so kind to do that for us, so whenever you find the time :clap

renzhe, I think Beijing has a list of about 1,000, and Taiwan has a core set of 1,945 or so. But the vast majority of the 48,000 monster list are variants. If you strip them all away, you'd get far fewer, maybe 8,000, I don't know..

Unfortunately I haven't been able to find either of the two lists, there's only that Singapore list, but it's far too short with 150 or so....


Here are the exact figures:







For some reason the list I downloaded only has 48,022 entries, don't know what happened to the 254 entries difference.


Whew. After 14.5 hrs of running the program, and praying that my computer and program and internet connection would remain stable, I'm finally through :)

I'm attaching the complete list, and for shits and giggles here are the top 100. As chrix expected, most of them are just two character combinations that show up pretty often just as words, but there's a four character chengyu or two in there too:

chrix, after you've had a nice look at it, maybe you can share with us if you notice anything interesting. Oh, and nice blog, btw :)



Hmm, the Taiwan MoE list has lots of two- and three-character words that are not chengyu. Don't know why they are in the list. There are also many four-character words in the list, like 研究调查, 一天一夜 and 上上下下, that are not chengyu.

I've cleaned it up a bit to obtain the top 260 or so four-character words from the list.

phyrex, would you be able to do a quick clean up of the list to get rid of all non-four-character words? I tried a sort in Excel, but that only sorted the words by their pinyin and not their lengths.
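A rough sketch of that kind of clean-up in Python, under one big assumption: that each line of the results file is a chengyu and its google count separated by a tab. The real file layout may differ, so treat the function name and the format as hypothetical. In Python 3, len() counts characters rather than bytes, so a four-character chengyu has length 4.

```python
def four_character_entries(lines):
    """Keep (word, count) pairs whose word is exactly four characters long.

    Assumes each line looks like "<chengyu><TAB><count>"; lines without a
    numeric count are skipped. The result is sorted by count, highest first.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, _, count = line.partition("\t")
        count = count.strip()
        if len(word) == 4 and count.isdigit():
            kept.append((word, int(count)))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept
```

Fed the lines of the results file, this drops the two- and three-character words. It cannot tell a real chengyu from an ordinary four-character phrase like 一天一夜, so that part still needs a human.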

I've been watching this thread intently. Thanks for your (computer's) hard work :mrgreen:

Gato beat me to actually doing a clean up of it though.

And thanks Gato for the cleaned up version...


It's interesting, the slight skew that google brings to something like this. The first one is not used in spoken chinese (imho) but because of job searches it comes up high on google. Also the second one is one that just seems.... not very 成语 to me. Or is that just me? Things like 领导干部,分期付款,关键时刻....etc seem to be in there a lot. I don't think I would include them myself in a "chengyu" list. Any thoughts on that?

Anyway: my contribution is a simplified version of what gato posted

(note: I haven't checked it for inconsistencies, just did a straight conversion so there may be mistakes)

Chengyu is usually used to refer to written idioms derived from classic literature, so words like 天下第一 and 吃喝玩乐 are definitely not chengyu. Hehe.

Why are they in the Taiwan Ministry of Education chengyu list?


What you guys can do if you're not satisfied with the results from the 'master list' is take a chengyu list that actually only lists chengyu, and check that against the master list results. Writing a script to do that should be trivial, and will definitely take less time than pumping a new chengyu file through google ;)
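A minimal sketch of such a script, assuming the master results have already been loaded into a dictionary mapping each chengyu to its google count. The function name and the data shapes are hypothetical. Both lists need to be in the same script; the master list appears to use traditional characters, so a simplified-character list would have to be converted first.

```python
def cross_reference(master_counts, curated):
    """Check a curated chengyu list against the master results.

    master_counts: dict mapping chengyu -> hit count (the 'master list').
    curated: iterable of chengyu from a list that only contains real chengyu.
    Returns (ranked, missing): curated entries found in the master results,
    sorted by count with the highest first, and entries with no count at all.
    """
    curated = list(curated)
    ranked = sorted(
        ((c, master_counts[c]) for c in curated if c in master_counts),
        key=lambda pair: pair[1],
        reverse=True,
    )
    missing = [c for c in curated if c not in master_counts]
    return ranked, missing
```

Only the entries that end up in missing would need a fresh run through google.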


I'm (slowly) working my way through the list and feeling out the top ones. I'm removing some and then I'll see where that leaves me. I plan on actually using this, so at some point I will take the ones with the same meaning and just group them, not paying attention to specific order at that point.

Oh and that is definitely not a complete list- it doesn't have 对牛弹琴 on it.

Also this was an interesting find 吓了一跳. hmmmm...

I am very satisfied with the list. It saves a TON of problems and puts us a lot closer to having an actual usage based list even if it is not "perfect". I am very grateful!

Also in the simplified list the first "error" i've found is 干干净净


muyongshi, i'm glad that this can actually be of real help to somebody :)

If you want to manipulate the list in any way where you think some programming magic could make that easier, feel free to contact me.


Hmmm... there are a lot of things I would like to do but I don't know if I will ever have time. Like I might take the complete list converted into simplified, cross-reference it with a physical chengyu dictionary, add missing entries, and then rerun the search to see what the results are with simplified. But like I said, I probably will never actually get around to doing that.


phyrex, thank you so much for your efforts, I really appreciate it :clap And glad you like my blog too :mrgreen: I will cross-reference this against a list of "real" chengyu. The MOE list is taken from 30 chengyu relevant sources, and they have included anything that appears in those sources.

Muyongshi, look for 對牛彈琴, it's at 6390....

