ChouDoufu Posted August 11, 2007 at 12:41 PM
I have a question about the source code status of Chinese dictionaries. Are there any open source Chinese-English dictionaries? I know CEDict is free, but it's not open source. The guy who started it, Paul D., could show up one day and say, "That's my dictionary and it's no longer free," and the whole community based around CEDict would be put in turmoil. Is Adso open source?
Are there any people out there who would want to help create an open-source dictionary? I definitely want to start one (or contribute to one if it already exists), but I also feel that there needs to be a lot of preparation. Decisions about things like formatting, content, etc. would all need to be discussed. Anyone interested, please send me a PM.
c_redman Posted August 11, 2007 at 04:53 PM
It hasn't been highly publicized, but maintenance of CEDICT seems to have shifted recently to Dennis Vierkant at MDBG (although even the Wikipedia entry for CEDICT is unclear on the actual status). Anyway, this is the header of the latest version (just from this morning!) from mdbg.net, which should reassure you:
# CEDICT
# Community maintained free Chinese-English dictionary.
#
# Published by MDBG
#
# License:
# Creative Commons Attribution-Noncommercial-Share Alike 3.0
# http://creativecommons.org/licenses/by-nc-sa/3.0/
#
# CEDICT can be downloaded from:
# http://www.mdbg.net/chindict/chindict.php?page=cedict
#
# Additions and corrections can be sent through:
# http://www.xuezhongwen.net/cedicteditor/editor.php
#
# New releases are announced on:
# http://groups.google.com/group/chinese-dictionaries
#
# For more information about CEDICT see:
# http://www.mdbg.net/cedictwiki/
This new header notice is quite recent; the version from 2007/5/20 did not have it, but the versions from 2007/6/30 onward do.
ChouDoufu Posted August 11, 2007 at 08:16 PM (Author)
Yes, I noticed this and have tried contacting Dennis, but unfortunately you cannot legally change the copyright terms of a work that doesn't belong to you. CEDict belongs to Paul D., and until the copyright expires (under current US law, sometime in the far, far future) or Paul D. himself changes the terms, it cannot simply be relicensed under Creative Commons.
Imagine if I decided to add some scenes to Star Wars. I looked for George Lucas everywhere but couldn't find him, so I changed the license to a Creative Commons license, added my scenes to the movie, and released it. If George Lucas ever decided to show up, he'd sue me for copyright infringement and would be able to stop distribution of everything I did.
That's why I'd prefer to see CEDict discontinued and a new dictionary with a real Creative Commons license created. Using sources like out-of-copyright dictionaries (if it's from before 1922, its copyright has expired; after 1922 it gets more complicated), accessible in libraries and on sites like Google Books, the Chinese learning community can create the basis of a Chinese dictionary and then start adding the words that have entered the Chinese language since.
gato Posted August 12, 2007 at 01:17 AM
From the CEDICT readme file, it seems that a number of people contributed to it, not just Paul D. Perhaps each of them still holds the copyright to their own contribution (if they didn't get it from a copyrighted source).
http://www.mandarintools.com/download/cedict_readme.txt
"Although CEDICT started out as a one-person project, contributions from the Internet community have become the major source of new entries. Contributors thus far to the project are (in chronological order): Ocrat, Mike Wright, Wenke Wei, Sharlene Liu, Richard Warmington, Erik Peterson, Derek Chadwick, Dave Hiebeler, Steve Swales, Carl Hoffman"
ChouDoufu Posted August 12, 2007 at 02:16 AM (Author)
I made contributions to it too. But the problem there is that it would be nearly impossible to determine who has contributed what. Furthermore, it was published by means of updates with a license that says "copyright Paul D." All these reasons illustrate why there should be an open Chinese dictionary that isn't CEDict. Am I alone in feeling this way?
There is a difference between free and "freely available". CEDict, like JPEG and MP3, is freely available, but it isn't free. A few years back the people who owned JPEG were thinking about making people pay for it. Fraunhofer has talked about selling licenses to MP3, too. It's the possibility of CEDict being taken away that makes me want to create a truly free alternative.
gato Posted August 12, 2007 at 02:35 AM
"Furthermore, it was published by means of updates with a license that says 'copyright Paul D.'"
Just as in your "Star Wars" example, if the copyright isn't his, adding a copyright tag doesn't make it his. But I get what you are saying. I just don't think there's a realistic chance that Paul D. will enforce this "copyright."
character Posted August 12, 2007 at 02:40 AM
"Are there any open source Chinese - English dictionaries?"
It depends on what you mean by open source. AFAIK, with my limited legal knowledge, there are no C-E dictionaries that are free both as in freedom and as in beer. The Introduction of CEDict's readme says its objective is to create a "public-domain" C-E dictionary. Unfortunately, it then appends an IMO poorly worded license at the end of the readme that restricts its use, so it's not in the public domain.
"I know CEDict is free, but it's not open source. The guy who started it Paul D. could show up one day and say: 'that's my dictionary and it's not longer free.'"
Again AFAIK, he couldn't change the copyright on existing versions, and the license isn't written as a lease whose terms can change at any time.
"and the whole community based around CEDict would be put in turmoil."
Parts of the community seem to be ignoring the spirit if not the letter of the license, by making it a selling point of their software and by selling advertising space on their implementation.
"Is Adso open source?"
It's open source and free for noncommercial use, if I recall the license terms correctly. You can get the license and the code from Adso.
"Are there any people out there who would want to help create an open-source dictionary? I definitely want to start one (or contribute to one if it already exists), but I also feel that there needs to be a lot of preparation. Decisions about things like formatting, content, etc. would all need to be discussed."
I think it's a great idea, as long as you either put it in the public domain or make it free for commercial use. Otherwise you'd just be duplicating CEDict. I wouldn't worry about formatting initially; you can use the same format as CEDict to start with. Once you get 5K entries you can then worry about finding the best format. Getting the initial set of entries is the hard part.
imron Posted August 12, 2007 at 06:02 AM
"I wouldn't worry about formatting initially; you can use the same format as CEDict to start with. Once you get 5K entries you can then worry about finding the best format."
Ugh, no, please don't do that. The software world is littered with bad formats that the authors would "get around to changing one day", but by that time many people and programs already relied on the existing format, so it never got changed (tabs in Makefiles, anyone?).
I would also avoid the CEDICT format. It's incredibly inflexible if you want to extend it, and it has serious limitations, e.g. different files for Traditional/Simplified that need to be manually kept in sync, no support for languages other than English, etc.
Re: free dictionaries, you might want to check out www.wiktionary.org, which aims to be a multilingual dictionary released under the GNU Free Documentation License.
character Posted August 12, 2007 at 10:37 AM
"Ugh, no, please don't do that. The software world is littered with bad formats [...]"
A great format would be a good thing, but there are so many ways to format a Chinese dictionary, so many things to put in or leave out, etc. that I fear this dictionary project would die of analysis paralysis. Better to get enough entries to interest people first, then pick a format based on how people want to use it. Reformatting 5k entries by writing a couple of scripts and doing some manual tweaking is IMO a lot better than being months into the project and only having a half-finished design for a format to show for it.
"I would also avoid the CEDICT format, it's incredibly inflexible if you want to extend it,"
So convert the format to XML:
... 阿 阿 a1 an initial particle prefix to names of people ...
Not great, but now it's obvious one could add elements or reformat using XSLT. Never mind that a program could convert the CEDICT file to this format...
"and has serious limitations e.g. different files for Traditional/Simplified that need to be manually kept in sync [...]"
My guess would be that the other files are generated from the Unicode file, which contains both Traditional and Simplified.
---
Anyway, I think the OP's worries that someone could come and take CEDICT away are unfounded, so that probably killed the driving interest behind the project.
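As a rough illustration of the "a program could convert the CEDICT file" point above, here is a minimal Python sketch. It assumes the usual CEDICT line layout (Traditional, Simplified, bracketed pinyin, slash-separated definitions). The XML element names (`dictionary`, `entry`, `traditional`, `simplified`, `pinyin`, `definition`) are hypothetical and chosen only for this example; they are not the tags from the post above, which were lost.

```python
import re
import xml.etree.ElementTree as ET

# Assumed CEDICT layout: Traditional Simplified [pinyin] /definition 1/definition 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def cedict_line_to_element(line):
    """Turn one CEDICT line into an <entry> element.

    Returns None for comments, blank lines, and lines that do not match.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, definitions = match.groups()
    entry = ET.Element("entry")  # hypothetical element names throughout
    ET.SubElement(entry, "traditional").text = traditional
    ET.SubElement(entry, "simplified").text = simplified
    ET.SubElement(entry, "pinyin").text = pinyin
    for definition in definitions.split("/"):
        if definition:
            ET.SubElement(entry, "definition").text = definition
    return entry


def cedict_to_xml(lines):
    """Wrap every convertible line in a single <dictionary> root and serialize it."""
    root = ET.Element("dictionary")
    for line in lines:
        element = cedict_line_to_element(line)
        if element is not None:
            root.append(element)
    return ET.tostring(root, encoding="unicode")


sample = [
    "# CEDICT",
    "阿 阿 [a1] /an initial particle/prefix to names of people/",
]
print(cedict_to_xml(sample))
```

Going the other way (XML back to the flat format) would be a similarly short script, which is part of why the choice of starting format may matter less than it first appears.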
ChouDoufu Posted August 12, 2007 at 01:54 PM (Author)
I would definitely make it open and free for commercial use. The license I would use is Creative Commons Attribution-ShareAlike.
5k entries is a lot, but I think it's doable. I agree with both posters on formatting: it's both important and non-essential. I'd solve the problems with scripting. I personally feel the formatting should be as simple as possible (maybe even something like a comma-separated database with well-defined fields), with scripts used to expand its functionality to XML or other formats. If commercial products add useful features or fields, they can be folded into the original.
I'd want to start it soon, but I'm waiting to hear from the maintainers and contributors. I feel it would be pretty upsetting to them if some random dude (me) came along and decided to start something without them. I need them to succeed.
I started working on an open English-Chinese dictionary. I'm calling it OSED (OpenSino-EnglishDictionary). I don't have a webpage yet, but I have 500 entries. I'll make a webpage for it when I have a few thousand entries and release it in a CEDict format.
I know there are lots of online English-Chinese dictionaries, but it would be nice to have an open one. Also, some of them are really strange. I went to dict.cn/en/ and searched for just one word, "right" (so my research is less than comprehensive). It had a number of Chinese definitions for "right": correct, just, (a person's) right. It lacked "right" as the opposite of "left".
I did look at Wiktionary. I was a little bit shocked. It says there are over 100,000 entries in it, but it is pretty hard to navigate.
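The "comma-separated database with well-defined fields, plus scripts" idea could look something like the Python sketch below. The column names, the use of `;` to separate definitions inside one field, and the two sample rows are all assumptions made for illustration; nothing here is OSED's actual format.

```python
import csv
import io

# Hypothetical column layout for the master file.
FIELDS = ["traditional", "simplified", "pinyin", "definitions"]

# Illustrative sample rows; definitions within a field are separated by ';'.
RAW = """\
traditional,simplified,pinyin,definitions
右,右,you4,right (as opposed to left);right-hand side
對,对,dui4,correct;right
"""


def read_entries(text):
    """Parse the comma-separated master file into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != FIELDS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


def to_cedict_line(entry):
    """Render one entry in the flat CEDICT-style layout."""
    definitions = "/".join(
        part.strip() for part in entry["definitions"].split(";") if part.strip()
    )
    return (
        f"{entry['traditional']} {entry['simplified']} "
        f"[{entry['pinyin']}] /{definitions}/"
    )


lines = [to_cedict_line(entry) for entry in read_entries(RAW)]
for line in lines:
    print(line)
```

With the master file kept this simple, each output format (CEDICT-style, XML, something else) becomes one more small export function rather than a change to the data itself.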
character Posted August 12, 2007 at 05:40 PM
"5k entries is a lot, but I think it's do-able."
Oh, don't listen to me. Um...that time. It sounds like you've got a great start.
"I would definitely make it open and free for commercial use. the license I would use is the creative commons attribution, share alike."
It's your project, but I would hesitate to use it for the following reason:
"Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license."
To me this seems to indicate that anything using the dictionary is subject to the license and must be available under the same or similar terms. The actual legalese is even scarier from my POV.
It's for software, not documents, but the Apache License (http://en.wikipedia.org/wiki/Apache_License) is a commercial-friendly license. The most commercial-friendly option is the public domain.
---
Aside: I wonder what the standard is for adding an entry to a dictionary in such a way that it's free of copyright. Seeing it used or defined the same way in two different sources?
novemberfog Posted August 12, 2007 at 11:09 PM
"I personally feel the formatting should be as simple as possible (maybe even using something like a comma separated database with well defined fields) and then using scripts to expand it's functionality to XML or other formats."
Yes please! This craze to change everything to XML is getting out of hand. I guess if everyone is writing web apps or .NET/Java desktop software, the format really doesn't matter. But as an embedded systems developer, I find that XML files simply eat up precious storage space with all of the tags. A raw dictionary file with scripts to create an XML file from it sounds like a good idea to me.
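To put a rough number on the storage concern, the Python sketch below serializes the same entry once as a flat CEDICT-style line and once as XML, then compares the UTF-8 byte counts. The XML element names are hypothetical, and the exact ratio depends entirely on the tag names chosen, so treat the printed figures as indicative only.

```python
import xml.etree.ElementTree as ET


def raw_record(traditional, simplified, pinyin, definitions):
    """Flat CEDICT-style line, newline-terminated."""
    return f"{traditional} {simplified} [{pinyin}] /{'/'.join(definitions)}/\n"


def xml_record(traditional, simplified, pinyin, definitions):
    """The same data as one XML element (hypothetical tag names), newline-terminated."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "traditional").text = traditional
    ET.SubElement(entry, "simplified").text = simplified
    ET.SubElement(entry, "pinyin").text = pinyin
    for definition in definitions:
        ET.SubElement(entry, "definition").text = definition
    return ET.tostring(entry, encoding="unicode") + "\n"


fields = ("阿", "阿", "a1", ["an initial particle", "prefix to names of people"])

raw_bytes = len(raw_record(*fields).encode("utf-8"))
xml_bytes = len(xml_record(*fields).encode("utf-8"))

print(f"raw: {raw_bytes} bytes")
print(f"XML: {xml_bytes} bytes")
print(f"XML is {xml_bytes / raw_bytes:.1f}x the size of the raw line")
```

Shorter tag names or compression would narrow the gap, but on a device that reads the file uncompressed the tag overhead is paid on every entry.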
ChouDoufu Posted August 14, 2007 at 01:25 PM (Author)
"It's your project, but I would hesitate to use it for the following reason: 'Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.' To me this seems to indicate that anything using the dictionary is subject to the license and must be available under the same/similar terms."
I'd make it clear in the documentation that the license is for the dictionary, not software built for or around it. Thus extensions to the dictionary (adding usage examples or a word's position in Cihai, for example) would be adaptations and would need to carry the same license. A program that allows people to search for a definition would not be subject to the license.
roddy Posted August 14, 2007 at 01:28 PM
If you go ahead with this, some things I would like to see are:
1) Keep the number of entries limited, at least at first, in favor of depth of entries. I'd rather see a relatively small number of complete entries than a huge number of very simple ones. Cedict (via xuezhongwen) says of 的: /(possessive, modifying, or descriptive particle)/of/ That doesn't even scratch the surface. Example sentences, different usages, etc.
2) As part of that, support for different usages of the same word would be good. I'd probably have fields such as CHINESE, PINYIN, USAGE1, USAGE2, or similar.
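One way to picture the "fewer but deeper entries" request is a record that holds several usages, each with its own example sentences, and that can still be flattened to a CEDICT-style definition string when needed. The Python sketch below is only that: the class names, field names, and the sample glosses and sentences for 的 are illustrative assumptions, not a proposed standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Usage:
    """One distinct usage of a word, with optional example sentences."""
    gloss: str
    examples: List[str] = field(default_factory=list)


@dataclass
class Entry:
    """A headword with any number of usages (hypothetical field names)."""
    chinese: str
    pinyin: str
    usages: List[Usage] = field(default_factory=list)

    def to_cedict_definitions(self):
        """Flatten the usages to a slash-delimited gloss list, dropping the examples."""
        return "/" + "/".join(usage.gloss for usage in self.usages) + "/"


# Illustrative content only.
de = Entry(
    chinese="的",
    pinyin="de5",
    usages=[
        Usage("possessive particle", examples=["我的书 (my book)"]),
        Usage("modifying or descriptive particle", examples=["红色的车 (a red car)"]),
    ],
)

print(de.to_cedict_definitions())
for usage in de.usages:
    print(usage.gloss, "->", "; ".join(usage.examples))
```

Keeping the examples attached to a specific usage, rather than to the headword as a whole, is what lets a flat export discard them without losing the gloss list.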
character Posted August 15, 2007 at 02:10 AM
"I'd make it clear in the documentation that the license is for the dictionary, not software built for or around it."
If one could go by the documentation, CEDICT is meant to be in the "public-domain". Since there are a number of commercial dictionaries available for purchase, lawyers might steer their clients to them instead of saying "the license on this one says one thing, but the documentation says another, so you're good."
"Thus extensions to the dictionary (adding usage examples or a word's position in cihai, for example) would be an adaptation and would need to have the same license. A program that allows people to search for a definition would not be subject to the license."
Assuming your dictionary has about the same information per entry as CEDICT, anything beyond a simple search program would require a lot of additional information. I can't imagine everything that would need to be added for a machine translation system, for example. If someone goes to all that trouble, they may not want to give it away, so they may avoid your dictionary.
ChouDoufu Posted August 15, 2007 at 02:47 AM (Author)
I read the license a few times and I think you are misinterpreting it. The license is for a dictionary. If someone made software that uses that dictionary, that is an example of them sharing the dictionary, which the license states they are free to do.
"Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license."
"Alter", "transform" and "build upon" all refer to making changes to the dictionary. Does this clear it up?
gato Posted August 15, 2007 at 03:24 AM
See this post from Mike Love of Pleco. I would agree with him.
http://www.plecoforums.com/viewtopic.php?t=835&start=60
"Interesting take on the online dictionary concept, some of the same community-driven ideas we've been thinking about actually. The content is kind of weak, though, and I'm not sure if they're going to get enough contributors to change that - everyone seems to think that just by hanging their shingle out there on the web they'll get people to generate all of their content for them, but with something like a Chinese dictionary that's decidedly Not Fun to work on it's hard to create a lot of interest. Adding a Wikipedia article on your favorite 18th-century French general is considerably more enjoyable than drafting a concise explanation of the proper uses of 而...
"The few people who are committed enough to the idea of a free online Chinese dictionary database to put in the time required are much more likely to work on an established / avowedly open-source project like Adso, where they know their work will always be free and will be widely redistributed. For our own website project I'm pretty well settled on the notion that we'll need paid contributors if we want to get a useful amount of new content - certainly a community can supply feedback / notes / corrections etc but I don't think you're going to get a database of 50,000 example sentences just by asking people to write them for you."
character Posted August 15, 2007 at 11:00 AM
"Does this clear it up?"
We can just agree to disagree. I think that if a company were to take the dictionary as a starting point and create a database that added parts of speech, tied definitions to context, etc., people would expect or demand access to the contents of that database under the Share Alike terms.
ChouDoufu Posted August 15, 2007 at 10:00 PM (Author)
How interesting that http://nciku.com/ would show up there. I believe a classmate I graduated with is affiliated with that project somehow. This is purely speculation on my part, but that site might not have a large number of native English speakers. Like I said, though, pure speculation.
As for the comment about needing to pay people to provide a quality resource, I'm not willing to concede that. I'd also argue that with the right technology tools it is easy to create a dictionary with a lot of entries (say 50,000+). Look at HanDeDict. I think it's significantly harder to create a dictionary that is more usable (i.e. more examples and usage rules). Ask me again in six months and I might concede the point, though...
gato Posted August 16, 2007 at 03:10 AM
"Look at HanDeDict."
My guess is that HanDeDict is more or less a German translation of an existing Chinese-English dictionary, maybe even CEDICT. It's very different to start a dictionary from scratch.