chrix (Author), posted November 11, 2009 at 01:34 PM:
I already got pretty much everything from CEDICT and HANDEDICT. That's how I started my list. CEDICT, however, lacks a surprising number of common chengyu, though to be fair they do have a lot of them too.
By Wikipedia I suppose you mean Wiktionary? Tooironic provided a list of the 1,500+ entries. Do you think you could provide an Excel table reflecting all the information in Wiktionary? Otherwise it might be hard to export.
漢語辭典: where can I get a machine-readable version of this? I would be most grateful to know...
The best sources for chengyu are, in my opinion:
dict.idioms.moe.edu.tw (no English, however)
www.chinese-tools.com, which seems to have a list from the same source as www.zdic.net/cy, just with some English translations added. I suppose the person running the site got the Chinese data and English data from somewhere else, because there are glitches in some entries indicating that the information was not entered manually...

renzhe, posted November 11, 2009 at 01:44 PM:
Re: attachment, you should try sending Roddy a note, or email him the file. I think that the file size limit is a general thing, and he often makes exceptions for important things, and this probably qualifies.

phyrex, posted November 11, 2009 at 01:51 PM:
chrix, then I guess I've got a present for you: http://stardict.sourceforge.net/Dictionaries_zh_CN.php and http://stardict.sourceforge.net/Dictionaries_zh_TW.php
I don't know which of the listed dictionaries have chengyu (though from a quick look at the page I spotted, for example, 汉语成语词典), but you seem enthusiastic enough to go through them. They all seem to be free to use (GPL, free to use, CC, ...) and are provided in the StarDict format. I don't know too much about it, but apparently some converters exist, so it's probably not too hard to work with.
I don't know in which of the Wikimedia initiatives I'd find chengyu, but you can download all the corpuses (corpi?) [if you have enough disk space] and then work with them, so if you think there is valuable information in there, we could find ways.
Do the websites you mentioned (thanks, btw) list any sources?

chrix (Author), posted November 11, 2009 at 02:11 PM:
Hey, that's great, it's got plenty of my favourite dictionaries in digitised versions.
http://www.zdic.net/cy/ doesn't really tell you about its sources, but it's all Creative Commons at least. So CC means extracting their content automatically is ok?
For chinese-tools, I can't find anything either: http://www.chinese-tools.com/chinese/chengyu/dictionary
I've got some more links on my blog: http://chinesischblog.wordpress.com/%E6%88%90%E8%AA%9E/ and http://chinesischblog.wordpress.com/%E6%88%90%E8%AA%9E/chengyu-listen/
The dictionaries listed in the right-hand sidebar are useful as well, though many seem to just use the sources you posted above.

phyrex, posted November 11, 2009 at 02:15 PM:
I'm feeling a bit stupid, but how do I quote in this forum? Hrm, anyways:
> http://www.zdic.net/cy/ doesn't really tell you about its sources, but it's all Creative Commons at least. So CC means extracting their content automatically is ok?
I suppose so. But if it's CC, they should also be nice enough to give you their database if you ask nicely. And that would be *so much* easier than scraping it from their website (= pretending to be a human with a web browser).

renzhe, posted November 11, 2009 at 02:20 PM:
> So CC means extracting their content automatically is ok?
Extracting contents automatically is ALWAYS ok; the only issue is whether you can redistribute the results (and the original data). This seems to be a no-derivatives license, which implies that you can't distribute the data you extract automatically. Whether anyone would actually go after you is a different question. If you only use small parts, you'd fall under "fair use" or similar; if you extract translations/explanations en masse, then it's questionable. If it's for personal use, you're fine, regardless of the license.
Another issue is how you do your processing. Whether CC or not, most website admins will not appreciate it if you send 50,000 separate requests to their website within an hour. It's generally better to download the whole dictionary in an electronic form and do the processing on your computer, as you save everyone a whole lot of bandwidth. You can do this with CC-CEDICT. I don't know if you can do it with zdic, but you probably can.

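To illustrate the "download once, process locally" approach, here is a minimal Python sketch that reads CC-CEDICT-style lines and keeps only four-character headwords. It assumes the usual entry layout of `TRADITIONAL SIMPLIFIED [pinyin] /definition/definition/` with `#` comment lines; the sample lines and the function names are made up for illustration, and a four-character headword is only a candidate chengyu, not a confirmed one.

```python
import re

# Assumed CC-CEDICT entry layout: "TRAD SIMP [pin1 yin1] /def 1/def 2/"
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")


def parse_cedict_line(line):
    """Return a dict for one entry line, or None for comments, blanks, or malformed lines."""
    if not line.strip() or line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, definitions = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "definitions": definitions.split("/"),
    }


def four_character_entries(lines):
    """Yield parsed entries whose simplified headword is exactly four characters long."""
    for line in lines:
        entry = parse_cedict_line(line)
        if entry is not None and len(entry["simplified"]) == 4:
            yield entry


# Illustrative sample lines (not taken from the thread); in practice these
# would come from the downloaded dictionary file, opened with encoding="utf-8".
SAMPLE_LINES = [
    "# CC-CEDICT style comment line",
    "你好 你好 [ni3 hao3] /hello/hi/",
    "一石二鳥 一石二鸟 [yi1 shi2 er4 niao3] /to kill two birds with one stone/",
]

if __name__ == "__main__":
    for entry in four_character_entries(SAMPLE_LINES):
        print(entry["simplified"], entry["pinyin"], "; ".join(entry["definitions"]))
```

Because everything runs against a local file, the dictionary's server sees one download instead of one request per entry.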
chrix (Author), posted November 11, 2009 at 02:24 PM:
renzhe, I agree. I would still think it's best to ask, just because of bandwidth reasons. Though I think we should first ask the person behind chinese-tools.com, because they have both Chinese and English.

phyrex, posted November 11, 2009 at 02:37 PM:
> Extracting contents automatically is ALWAYS ok
No, it's not. For example, Google EXPLICITLY prohibits doing this, and I can imagine quite a few cases where other people would want to do the same.

gato, posted November 11, 2009 at 02:53 PM:
I've converted the chengyu dictionary (simplified version) for StarDict into a text file using the StarDict Editor. It has about 13,000 entries, I think. See here: http://rapidshare.com/files/305503213/HanYuChengYuCiDian-new_colors.zip or attached (remove .mp3 from the file name).
Attachment: HanYuChengYuCiDian-new_colors.rar.mp3

renzhe, posted November 11, 2009 at 03:06 PM:
> no it's not. For example google EXPLICITELY prohibits doing this
They prohibit using their web interface for this purpose. Once you have the data (like a dictionary) on your hard disk, you can do whatever you want with it, even if you don't own the copyright. Copyright governs copying, not use, and any such restriction is questionable legally.

chrix (Author), posted November 11, 2009 at 03:29 PM:
gato, this looks very promising. I'll have to strip out the HTML and then write a script for importing it though, since the entries have a different number of categories. It's similar to the Unihan database, for which I once wrote a script so I could import it into Excel, so it shouldn't be too complicated...

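A rough Python sketch of that plan, using only the standard library: strip the HTML from each entry, split it into labelled fields, and write a CSV that Excel can open even when entries have different sets of categories. The `label: value` layout, the sample entries, and all function names are assumptions made for illustration; the actual markup in the converted file would need to be inspected first.

```python
import csv
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content, turning <br> tags into line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def entry_to_fields(headword, markup):
    """Split 'label: value' lines into a dict (hypothetical entry layout)."""
    fields = {"headword": headword}
    for line in strip_html(markup).splitlines():
        label, separator, value = line.partition(":")
        if separator and value.strip():
            fields[label.strip()] = value.strip()
    return fields


def write_csv(entries, out):
    """Write entries with differing keys; missing categories become empty cells."""
    columns = []
    for entry in entries:
        for key in entry:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(entries)


if __name__ == "__main__":
    # Made-up sample entries with different numbers of categories.
    sample = [
        entry_to_fields("甲", "<b>pinyin: jia3</b><br>meaning: first"),
        entry_to_fields("乙", "meaning: second<br/>source: example"),
    ]
    buffer = io.StringIO()
    write_csv(sample, buffer)
    print(buffer.getvalue())
```

Taking the union of all labels as the column list is what copes with the varying number of categories: an entry simply leaves a cell empty for any category it lacks.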
phyrex, posted November 11, 2009 at 10:23 PM:
> They prohibit using their web interface for this purpose. Once you have the data (like a dictionary) on your hard disk, you can do whatever you want with it. Even if you don't own the copyright. Copyright governs copying, not use, and any such restriction is questionable legally.
Fair enough. I didn't even consider that point, what with not being able to get to the data automatically without going through their web interface and all.

muyongshi, posted November 12, 2009 at 03:00 AM:
Hey phyrex, I was just wondering if you could take the list you have of only the four-character ones (四个字), convert it to simplified, and then rerun the Google search? Or I can do it... but then I would have to send you the file, and I don't know that that is worth it. On the other hand, there will be some simplified conversion errors, and in that sense it would probably be better if I do it, since I don't want to make you go through and check for those. Let me know how that could work. Thanks!

gato, posted November 12, 2009 at 03:37 AM:
> if you could take the list that you have of only the 四个字 and convert it to simplified and then rerun the google search?
I think Google does an internal conversion to simplified or traditional when you do a search in Chinese. The number of hits is very similar whether your keywords are in simplified or traditional.

muyongshi, posted November 12, 2009 at 04:18 AM:
Ok, I had thought that in some previous searches I did there was a large difference. But I just checked and the results were exactly the same. Never mind.

tooironic, posted November 12, 2009 at 04:22 AM:
There would be differences though if you search google.com.cn and google.com.tw (or even google.com.hk) separately.

gato, posted November 12, 2009 at 04:24 AM:
> There would be differences though if you search google.com.cn and google.com.tw (or even google.com.hk) separately.
That's probably due to censorship on google.cn.
> Ok, I had thought that in some previous searches I did that there was a large difference. But I just checked and the results were exactly the same. Nevermind
Yeah, it's the obvious thing to do if the Google guys know anything about Chinese. 不要小看谷歌哦。呵呵 (Don't underestimate Google. Heh heh.)

phyrex, posted November 12, 2009 at 06:33 AM:
> There would be differences though if you search google.com.cn and google.com.tw (or even google.com.hk) separately.
If you're using the web interface. I'm using the API, and there shouldn't be a difference. But again, the code is up, so please go ahead and play with it.

renzhe, posted November 12, 2009 at 11:14 AM:
> Fair enough. I didn't even consider that point, what with not being able to get to the data automatically without going through their web interface and all
Yes. Downloading the Google database is not exactly an option. On the other hand, downloading CC-CEDICT is, and this is what I use for the word lists in the First Episode Project.

chrix (Author), posted November 13, 2009 at 04:59 AM:
Here's an Anki set for 1,424 frequent chengyu. I've created the list manually based on a variety of lists (it includes all HSK chengyu, the Singapore ones, all MOE chengyu with a count above 20, the 200 most common chengyu from the newspaper corpus, and also some 10 other lists). I'm not fully satisfied with it yet, but I believe it's a good point of departure for devising such a list... Tell me, though, if there are any absolutely essential chengyu missing.
Attachment: frequent chengyu.zip