Chinese-Forums

Open Source Chinese Dictionaries


ChouDoufu

I have a question about the source code status of Chinese dictionaries.

Are there any open source Chinese - English dictionaries?

I know CEDict is free, but it's not open source. The guy who started it, Paul D., could show up one day and say, "That's my dictionary and it's no longer free," and the whole community based around CEDict would be put in turmoil.

Is Adso open source?

Are there any people out there who would want to help create an open-source dictionary? I definitely want to start one (or contribute to one if it already exists), but I also feel that there needs to be a lot of preparation. Decisions about things like formatting, content, etc. would all need to be discussed. Anyone interested, please send me a PM.

It hasn't been highly publicized, but maintenance of CEDICT seems to have shifted recently to Dennis Vierkant at MDBG (although even the Wikipedia entry for CEDICT is unclear on the actual status). Anyway, this is the header of the latest version (just from this morning!) from mdbg.net, which should reassure you:

# CEDICT
# Community maintained free Chinese-English dictionary.
# 
# Published by MDBG
# 
# License:
# Creative Commons Attribution-Noncommercial-Share Alike 3.0
# http://creativecommons.org/licenses/by-nc-sa/3.0/
# 
# CEDICT can be downloaded from:
# http://www.mdbg.net/chindict/chindict.php?page=cedict
# 
# Additions and corrections can be sent through:
# http://www.xuezhongwen.net/cedicteditor/editor.php
# 
# New releases are announced on:
# http://groups.google.com/group/chinese-dictionaries
# 
# For more information about CEDICT see:
# http://www.mdbg.net/cedictwiki/

This new header notice is quite recent; the version on 2007/5/20 did not have it, but the new versions from 2007/6/30 forward do.

Yes, I noticed this and have tried contacting Dennis. Unfortunately, you cannot legally change the copyright terms of a work that doesn't belong to you.

CEDict belongs to Paul D., and until the copyright expires (under current US law, sometime in the far, far future) or Paul D. himself changes the copyright terms, it cannot simply be turned into a Creative Commons work.

Imagine if I decided to add some scenes to Star Wars. I looked for George Lucas everywhere but couldn't find him, so I changed the license to a Creative Commons license, added my scenes to the movie, and released it. If George Lucas ever decided to show up, he'd sue me for copyright infringement and would be able to stop distribution of everything I did.

That's why I'd prefer to see CEDict discontinued and a new dictionary with a real Creative Commons license created.

Using sources like out-of-copyright dictionaries (if it's from before 1922, its copyright has expired; after 1922 it gets more complicated), accessible in libraries and on sites like Google Books, the Chinese learning community can create the basis of a Chinese dictionary and then start adding the words that have entered the Chinese language since.

From the CEDICT readme file, it seems that a number of people contributed to it, not just Paul D. Perhaps they all still each hold the copyright to their own contribution (if they didn't get it from a copyrighted source).

http://www.mandarintools.com/download/cedict_readme.txt

Although CEDICT started out as a one-person project, contributions from the Internet community have become the major source of new entries. Contributors thus far to the project are (in chronological order): Ocrat, Mike Wright, Wenke Wei, Sharlene Liu, Richard Warmington, Erik Peterson, Derek Chadwick, Dave Hiebeler, Steve Swales, Carl Hoffman

I made contributions to it too. But the problem there is it would be nearly impossible to determine who has contributed what. Furthermore, it was published by means of updates with a license that says "copyright Paul D."

All these reasons illustrate why there should be an open Chinese dictionary that isn't CEDict.

Am I alone in feeling this way? There is a difference between free and "freely available". CEDict, like JPEG and MP3, is freely available, but it isn't free. A few years back the people who owned JPEG were thinking about making people pay for it. Fraunhofer has talked about selling licenses to MP3, too. It's the possibility of CEDict being taken away that makes me want to create a truly free alternative.

Furthermore, it was published by means of updates with a license that says "copyright Paul D."

Just as in your "Star Wars" example, if the copyright isn't his, adding a copyright tag doesn't make it his. But I get what you are saying. I just don't think there's a realistic chance that Paul D. will enforce this "copyright."

Are there any open source Chinese - English dictionaries?
It depends on what you mean by open source.

AFAIK, with my limited legal knowledge, there are no free as in freedom and as in beer C-E dictionaries. CEDict's readme's Introduction says its objective is to create a "public-domain" C-E dictionary. Unfortunately it then appends an IMO poorly-worded license at the end of the readme, so it is restricted in its use, so it's not in the public domain.

I know CEDict is free, but it's not open source. The guy who started it, Paul D., could show up one day and say, "That's my dictionary and it's no longer free,"
Again AFAIK, he couldn't change the copyright on existing versions, and the license isn't written as a lease whose terms can change at any time.
and the whole community based around CEDict would be put in turmoil.
Parts of the community seem to be ignoring the spirit if not the letter of the license, by:

Making it a selling point of their software

Selling advertising space on their implementation

Is Adso open source?
It's open source and free for noncommercial use, IIRC the license terms. You can get the license and the code from Adso.
Are there any people out there who would want to help create an open-source dictionary? I definitely want to start one (or contribute to one if it already exists), but I also feel that there needs to be a lot of preparation. Decisions about things like formatting, content, etc. would all need to be discussed.
I think it's a great idea, as long as you either put it in the public domain or make it free for commercial use. Otherwise you'd just be duplicating CEDict. I wouldn't worry about formatting initially; you can use the same format as CEDict to start with. Once you get 5K entries you can then worry about finding the best format. Getting the initial set of entries is the hard part.

I wouldn't worry about formatting initially; you can use the same format as CEDict to start with. Once you get 5K entries you can then worry about finding the best format.
Ugh, no, please don't do that. The software world is littered with bad formats that the authors would "get around to changing one day", but by that time many people/programs already relied on the existing format and so it never got changed (tabs in Makefiles anyone? :roll:)

I would also avoid the CEDICT format: it's incredibly inflexible if you want to extend it, and it has serious limitations, e.g. different files for Traditional/Simplified that need to be manually kept in sync, no support for languages other than English, etc.

re: free dictionaries, you might want to check out www.wiktionary.org, which aims to be a multi-lingual dictionary released under the GNU Free Documentation License.

Ugh, no, please don't do that. The software world is littered with bad formats [...]
A great format would be a good thing, but there are so many ways to format a Chinese dictionary, so many things to put in or leave out, etc. that I fear this dictionary project would die of analysis paralysis. Better to get enough entries to interest people first, then pick a format based on how people want to use it. Reformatting 5k entries by writing a couple scripts and some manual tweaking is IMO a lot better than being months into the project and only having a half-finished design for a format to show for it.
I would also avoid the CEDICT format, it's incredibly inflexible if you want to extend it,
So convert the format to XML:

...

阿
阿
a1

an initial particle
prefix to names of people


...

Not great, but now it's obvious one could add elements or reformat using XSLT. Never mind that a program could convert the CEDICT file to this format...:)
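
A minimal sketch of the kind of conversion program mentioned above, assuming the combined CEDICT line layout `Traditional Simplified [pinyin] /definition/definition/`. The element names (`entry`, `traditional`, `simplified`, `pinyin`, `definition`) and the function name are made up for illustration and are not taken from any existing schema.

```python
import re
import xml.etree.ElementTree as ET

# Assumed CEDICT line layout: Traditional Simplified [pin1 yin1] /def 1/def 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def cedict_line_to_xml(line):
    """Convert one CEDICT-style line to an XML element (hypothetical schema)."""
    match = CEDICT_LINE.match(line)
    if match is None:
        raise ValueError(f"not a CEDICT entry line: {line!r}")
    traditional, simplified, pinyin, definitions = match.groups()

    entry = ET.Element("entry")
    ET.SubElement(entry, "traditional").text = traditional
    ET.SubElement(entry, "simplified").text = simplified
    ET.SubElement(entry, "pinyin").text = pinyin
    # Definitions are separated by "/" inside the trailing /.../ block.
    for definition in definitions.split("/"):
        ET.SubElement(entry, "definition").text = definition
    return entry


sample = "阿 阿 [a1] /an initial particle/prefix to names of people/"
entry = cedict_line_to_xml(sample)
print(ET.tostring(entry, encoding="unicode"))
```

Comment lines (starting with `#`) would have to be skipped before calling the function, since it raises `ValueError` on anything that does not look like an entry.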

and has serious limitations e.g. different files for Traditional/Simplified that need to be manually kept in sync [...]
My guess would be that other files are generated from the Unicode file, which contains both Traditional/Simplified.

---

Anyways, I think the OP's worries that someone could come and take the CEDICT away are unfounded, so that probably killed the driving interest behind the project.

I would definitely make it open and free for commercial use.

The license I would use is Creative Commons Attribution, Share Alike.

5k entries is a lot, but I think it's doable. I agree with both posters on formatting: it's both important and non-essential. I'd solve the problems with scripting. I personally feel the formatting should be as simple as possible (maybe even using something like a comma-separated database with well-defined fields), with scripts used to expand its functionality to XML or other formats. If commercial products add useful features or fields, they can be folded into the original.
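
A rough sketch of the "simple file plus scripts" idea, assuming a comma-separated master file whose column names (`traditional`, `simplified`, `pinyin`, `definitions`) and whose `;` separator between definitions are hypothetical choices for illustration. It reads the master data and renders each row as a CEDICT-style line; an XML or other exporter could be written the same way.

```python
import csv
import io

# Hypothetical master file: one row per entry, definitions separated by ";".
# Fields containing commas are quoted, as the csv module expects.
MASTER_CSV = """traditional,simplified,pinyin,definitions
阿,阿,a1,an initial particle;prefix to names of people
的,的,de5,"(possessive, modifying, or descriptive particle);of"
"""


def read_entries(csv_text):
    """Parse the master CSV into dicts whose 'definitions' value is a list."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["definitions"] = [d.strip() for d in row["definitions"].split(";")]
        entries.append(row)
    return entries


def to_cedict_line(entry):
    """Render one entry in a CEDICT-style layout."""
    defs = "/".join(entry["definitions"])
    return f"{entry['traditional']} {entry['simplified']} [{entry['pinyin']}] /{defs}/"


entries = read_entries(MASTER_CSV)
for entry in entries:
    print(to_cedict_line(entry))
```

Keeping the master file flat means new columns can be added later without breaking the exporters that ignore them.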

I'd want to start it soon, but I'm waiting to hear from the maintainers and contributors. I feel it would be pretty upsetting to them if some random dude (me) came along and decided to start something without them. I need them to succeed.

I started working on an open English-Chinese dictionary. I'm calling it OSED (OpenSino-EnglishDictionary). I don't have a webpage yet, but I have 500 entries. I'll make a webpage for it when I have a few thousand entries and release it in the CEDict format. I know there are lots of online English-Chinese dictionaries, but it would be nice to have an open one. Also, some of them are really strange. I went to dict.cn/en/ and searched for just one word, "right" (so my research is less than comprehensive). It had a number of Chinese definitions for "right": correct, just, (a person's) right. It lacked the "right" of left and right.

I did look at wiktionary. I was a little bit shocked. It says there are over 100,000 entries in it, but it is pretty hard to navigate.

5k entries is a lot, but I think it's do-able.
Oh, don't listen to me. Um...that time. :) It sounds like you've got a great start.
I would definitely make it open and free for commercial use.

the license I would use is the creative commons attribution, share alike.

It's your project, but I would hesitate to use it for the following reason:

"Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license." To me this seems to indicate that anything using the dictionary is subject to the license and must be available under the same/similar terms.

The actual legalese is even scarier from my POV. It's meant for software, not documents, but the Apache License (http://en.wikipedia.org/wiki/Apache_License) is a commercial-friendly license. The most commercial-friendly is public domain. :lol:

---

Aside: I wonder what is the standard for adding an entry to a dictionary in such a way that it's free of copyright? If you see it used/defined the same way in two different sources?

I personally feel the formatting should be as simple as possible (maybe even using something like a comma separated database with well defined fields) and then using scripts to expand its functionality to XML or other formats.

Yes please! This craze to change everything to XML is getting out of hand. I guess if everyone is writing web apps or .NET/Java desktop software the format really doesn't matter. But as an embedded systems developer, I find that XML files simply eat up precious storage space with all of the tags. A raw dictionary file with scripts to create an XML file from it sounds like a good idea to me.
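
To put a rough number on the tag overhead, here is a small sketch that compares the UTF-8 size of one entry as a raw CEDICT-style line and as XML. The XML element names are hypothetical, and the real overhead would depend on the schema chosen and on whether the file is compressed.

```python
# One entry as a raw CEDICT-style line.
raw_line = "阿 阿 [a1] /an initial particle/prefix to names of people/"

# The same entry in a hypothetical XML schema (element names are assumptions).
xml_entry = (
    "<entry><traditional>阿</traditional><simplified>阿</simplified>"
    "<pinyin>a1</pinyin><definition>an initial particle</definition>"
    "<definition>prefix to names of people</definition></entry>"
)

# Storage cost is measured in encoded bytes, not characters.
raw_bytes = len(raw_line.encode("utf-8"))
xml_bytes = len(xml_entry.encode("utf-8"))

print(f"raw: {raw_bytes} bytes, XML: {xml_bytes} bytes, "
      f"overhead: {xml_bytes - raw_bytes} bytes")
```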

quote:

It's your project, but I would hesitate to use it for the following reason:

"Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license." To me this seems to indicate that anything using the dictionary is subject to the license and must be available under the same/similar terms.

I'd make it clear in the documentation that the license is for the dictionary, not software built for or around it. Thus extensions to the dictionary (adding usage examples or a word's position in cihai, for example) would be an adaptation and would need to have the same license. A program that allows people to search for a definition would not be subject to the license.

If you go ahead with this, some things I would like to see are

1) Keep the number of entries limited, at least at first, in favor of depth of entries. I'd rather see a relatively small number of complete entries than a huge number of very simple ones. Cedict (via xuezhongwen) says of 的:

/(possessive, modifying, or descriptive particle)/of/

That doesn't even scratch the surface. A complete entry needs example sentences, different usages, etc.

2) As part of that, support for different usages of the same word would be good. I'd probably have fields like CHINESE, PINYIN, USAGE1, USAGE2, or similar.
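
One way the two wishes above could be modelled, as a sketch only: each entry holds a list of usages, and each usage carries its own gloss and example sentences. The class and field names, the pinyin spelling `de5`, and the example sentences are all illustrative assumptions, not part of any existing dictionary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Usage:
    """One distinct usage of a word, with its own examples."""
    gloss: str
    examples: List[str] = field(default_factory=list)


@dataclass
class Entry:
    """A headword with any number of usages (USAGE1, USAGE2, ...)."""
    chinese: str
    pinyin: str
    usages: List[Usage] = field(default_factory=list)


de = Entry(
    chinese="的",
    pinyin="de5",
    usages=[
        Usage("possessive particle", examples=["我的书 (my book)"]),
        Usage("modifying or descriptive particle", examples=["漂亮的花 (a pretty flower)"]),
    ],
)

for number, usage in enumerate(de.usages, start=1):
    print(number, usage.gloss, usage.examples)
```

A list of usages avoids fixing the number of USAGE columns in advance, at the cost of no longer fitting on a single flat line without some escaping convention.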

I'd make it clear in the documentation that the license is for the dictionary, not software built for or around it.
If one could go by the documentation, CEDICT is meant to be in the "public-domain". Since there are a number of commercial dictionaries available for purchase, lawyers might steer their clients to them instead of saying "the license on this one says one thing, but the documentation says another, so you're good."
Thus extensions to the dictionary (adding usage examples or a word's position in cihai, for example) would be an adaptation and would need to have the same license. A program that allows people to search for a definition would not be subject to the license.
Assuming your dictionary has about the same information per entry as CEDICT, anything beyond a simple search program would require a lot of additional information. I can't imagine everything that would need to be added for a machine translation system, for example. If someone goes to all that trouble, they may not want to give it away, so they may avoid your dictionary.

I read the license a few times and I think your reading is a misinterpretation. The license is for a dictionary. If someone made software that utilizes that dictionary, it is an example of them sharing the dictionary, which the license states they are free to do.

"Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license." Alter, transform and build upon all refer to making changes to the dictionary.

Does this clear it up?

See this post from Mike Love of Pleco. I would agree with him.

http://www.plecoforums.com/viewtopic.php?t=835&start=60

Interesting take on the online dictionary concept, some of the same community-driven ideas we've been thinking about actually. The content is kind of weak, though, and I'm not sure if they're going to get enough contributors to change that - everyone seems to think that just by hanging their shingle out there on the web they'll get people to generate all of their content for them, but with something like a Chinese dictionary that's decidedly Not Fun to work on it's hard to create a lot of interest. Adding a Wikipedia article on your favorite 18th-century French general is considerably more enjoyable than drafting a concise explanation of the proper uses of 而... The few people who are committed enough to the idea of a free online Chinese dictionary database to put in the time required are much more likely to work on an established / avowedly open-source project like Adso, where they know their work will always be free and will be widely redistributed.

For our own website project I'm pretty well settled on the notion that we'll need paid contributors if we want to get a useful amount of new content - certainly a community can supply feedback / notes / corrections etc but I don't think you're going to get a database of 50,000 example sentences just by asking people to write them for you.

Does this clear it up?
We can just agree to disagree. I think if a company were to take the dictionary as a starting point and create a database which added parts of speech, tied definitions to context, etc., people would expect/demand access to the contents of the database under the Share Alike terms.

How interesting that http://nciku.com/ would show up there. I believe a classmate I graduated with is affiliated with that project somehow. This is purely speculation on my part, but that site might not have a large number of native English speakers. Like I said, though, pure speculation.

The comment about needing to pay people to provide a quality resource: I'm not willing to concede that. I'd also argue that with the right technology tools it is easy to create a dictionary with a lot of entries (say 50,000+). Look at HanDeDict. I think it's significantly harder to create a dictionary that is more usable (i.e. with more examples and usage rules). Ask me again in six months and I might concede that point, though...
