trevelyan Posted September 7, 2004 at 09:14 AM I've put together a Windows interface for the Adso database which may be of use to people here. The dictionary has about 115,000 entries, many classified according to part of speech. Usage is as simple as copying text to the clipboard and then hitting the update button. Screenshots are available with the application and database here: http://socrates.berkeley.edu/~david/chinese/tools.html
mandarinboy Posted September 12, 2004 at 08:53 PM It is a very nice idea to be able to get the words within their respective word classes etc. It would be nice to also be able to do reverse lookup from pinyin or English. I love the content but I am very confused about the database design. Why did you not use one single table with FK/PK references instead of one table for each character? Maybe I am missing something here, but from your code I can see that you are in fact taking the character from your main table and then using the PK from there to find the correct table. With one table for all the words you could use a clustered index and gain much better performance and a database design that is much easier to maintain. I have not looked into all of your code yet, but it is very nice and I love the ideas. I think that much more could be done with this list. If you want some help, let me know. I work as a database designer/IT architect so I hope that I can at least give you some ideas. My initial thought was a combination of this list and the Cedict project and maybe geek_frappa's prime zero database etc. Why have so many different projects? I have just started my own on-line version of the Cedict file and Unihan file: http://82.182.78.97/dictionary/ Due to a fire a few weeks ago the work has stood still, but I hope to soon be able to do everything that is on my list with this site. Since your work is open source, maybe I can make a web version of it and include it on this site.
trevelyan Posted September 14, 2004 at 07:29 PM Hey, Answers as best I can.... the database is optimized for what I was intending to use it for. A single table works well for quick word lookups, but scales poorly and is inefficient under heavy loads -- such as those created by gist-translation systems, where the backend database is used to parse/identify as well as define words. As is, frequently used tables are loaded into memory while infrequently used ones don't clutter things up. The result should be an increase in database speed under heavy loads. It should also be more scalable overall. This is the reason I used a database rather than just pouring the data into a text-file like CEDICT. If I'm wrong, please tell me!!! License-wise, there isn't much difference between the ADSO and CEDICT projects. The intended differences are (1) you can't use the ADSO database for online commercial projects and (2) any content explicitly marked as under the ADSO license has been personally vetted by me - for what that's worth. I made the first change because I wasn't happy with the way a number of online commercial projects seemed to be ripping off CEDICT, while the second allowed me to track the entries I'd changed and slowly correct errors in the LDC dataset. If these terms are a stumbling block to wider adoption, I'd be more than happy to change them. I'd also be delighted if there was more cooperation between dictionary projects. I know that Geek Frappa and a few others have online dictionary projects, but without the ability to download or replicate the actual database content from these projects, it is hard to build cumulative projects based on these great resources. And this isn't meant as a critical jab at these efforts -- it is easy to call for more collaboration but hard for people to do the time-consuming work... just look at how slowly CEDICT has grown!!!
I personally think the critical problem is to find a way to ease the burden of data-entry and/or encourage data-sharing. One of the ways I've tried to make this possible is to make it technically possible under Adso to combine content from different licenses in the same database and let people distribute content only if they please. Every definition comes with an OWNER tag which can be used to restrict distribution if desired. I'd be delighted to include content from other projects under the license terms of its creators, and the current database is flexible enough to make it possible. The onus has to be on the owners/managers of other projects to make their content reasonably available though. For what it is worth, my own opinion is that what is really needed is (1) more people willing to do data entry, (2) tools that make it possible for them to do so, and (3) software that draws people to these projects. Throwing together this Windows software was an effort to push towards #2, and maybe even #3.... All that aside, I'm not a professional computer scientist, so if you know of a better way to do things, please let me know. I have Perl/C++ etc. scripts that output the database to tab-delimited files, so it should be easy to switch formats, etc. if required. All someone needs to do is define a more efficient database structure and the results can probably be up-and-running in a weekend.... Send me an email. Cheers, --david
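As a rough illustration of how an OWNER tag could gate distribution when dumping to tab-delimited files, here is a minimal Python sketch. It is not the actual Adso export script: the field names, the `export_tsv` helper and the owner values ("ADSO", "PRIVATE") are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical column layout; the real Adso export may differ.
FIELDS = ["word", "pinyin", "pos", "definition", "owner"]


def export_tsv(entries, allowed_owners):
    """Return tab-delimited text containing only entries whose owner
    tag is in `allowed_owners` (everything else is left out)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    for entry in entries:
        if entry["owner"] in allowed_owners:
            writer.writerow([entry[name] for name in FIELDS])
    return buf.getvalue()


# Made-up sample rows, only to show the filtering.
sample_entries = [
    {"word": "中国", "pinyin": "zhong1 guo2", "pos": "noun",
     "definition": "China", "owner": "ADSO"},
    {"word": "我们", "pinyin": "wo3 men5", "pos": "pronoun",
     "definition": "we", "owner": "PRIVATE"},
]

if __name__ == "__main__":
    # Only the entry tagged "ADSO" is written after the header row.
    print(export_tsv(sample_entries, {"ADSO"}))
```

Keeping the owner in every exported row means a downstream project can still see which license applies to each definition.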
mandarinboy Posted September 15, 2004 at 07:05 PM It is the database design that I am talking about. You are using one table for every character. This means that the database engine needs to store statistics for every table. This will really clutter things up. If you use your character table as the starting table and then add just one more table containing all the words, you will gain speed, but most importantly, your design will be easier to understand and expand. By using a clustered index on that table you can easily join the characters from your main table with the corresponding entries in the words table. The db engine makes calculations for every query you run. If it finds it worthwhile to store stats and values in memory it will do so. If you have several tables, the overhead will be great. With one table the engine only needs to store basic information and then add pointers to the most frequently used values. With a clustered index this is easily done since the values are already arranged. Also, if you want to add more fields to your table, you only have to do so in one table instead of thousands. These are just my thoughts. Your software works very well and I just love the fact that I can get the parts of speech! I would be happy to help you. I have some small functions to create Cedict entries and can easily create them for your db as well. I can have the list in Unicode encoding or Big5 or whatever you prefer. I have a small database with a few thousand English search entries that are not present in the Cedict list, and I am planning to add them. The result can be exported for your database as well. I also have some on-line tools to add new words directly into our database. One problem with letting anyone add new entries is that some people will always add wrong characters etc. There needs to be someone leading the project and verifying the entries. This is the hard work.
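The layout suggested here (a character table plus one words table stored in key order) could look roughly like the Python/SQLite sketch below. This is only an illustration under stated assumptions, not the actual Adso or SQL Server schema: the table and column names are invented, the sample rows are made up, and SQLite's `WITHOUT ROWID` tables are used as a stand-in for a clustered index because they store rows in primary-key order.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions.
SCHEMA = """
CREATE TABLE characters (
    char_id INTEGER PRIMARY KEY,
    hanzi   TEXT NOT NULL UNIQUE
);
-- WITHOUT ROWID stores rows ordered by the primary key, which is
-- SQLite's closest equivalent to a clustered index.
CREATE TABLE words (
    char_id    INTEGER NOT NULL REFERENCES characters(char_id),
    word       TEXT NOT NULL,
    pos        TEXT NOT NULL,
    pinyin     TEXT NOT NULL,
    definition TEXT NOT NULL,
    PRIMARY KEY (char_id, word, pos)
) WITHOUT ROWID;
"""


def build_demo_db():
    """Create an in-memory database with a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO characters VALUES (?, ?)",
                     [(1, "中"), (2, "我")])
    conn.executemany(
        "INSERT INTO words VALUES (?, ?, ?, ?, ?)",
        [(1, "中文", "noun", "zhong1 wen2", "Chinese language"),
         (1, "中国", "noun", "zhong1 guo2", "China"),
         (2, "我们", "pronoun", "wo3 men5", "we")])
    return conn


def words_starting_with(conn, hanzi):
    """Join the character table to the single words table, replacing
    the lookup of a separate per-character table."""
    return conn.execute(
        "SELECT w.word, w.pinyin, w.pos, w.definition "
        "FROM characters AS c JOIN words AS w ON w.char_id = c.char_id "
        "WHERE c.hanzi = ? ORDER BY w.word",
        (hanzi,)).fetchall()


if __name__ == "__main__":
    for row in words_starting_with(build_demo_db(), "中"):
        print(row)
```

With this shape, adding a new field means one `ALTER TABLE` on `words` rather than a change to thousands of per-character tables.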
mandarinboy Posted September 15, 2004 at 07:24 PM Are you interested in combining your list with Cedict entries? It is very easy to run a check to make sure that only words not already present in your database will be added. I can do this if it is OK with you. I can also contribute some more words and some on-line tools to add more words etc.
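The "only add what is missing" check could be sketched along these lines in Python. Everything here is an assumption for illustration: the line shape `headword [pin1 yin1] /definition 1/definition 2/`, the choice of (headword, pinyin) as the duplicate key, and the helper names `parse_line` and `merge_new_entries` are not taken from either project.

```python
import re

# Assumed line shape: headword [pin1 yin1] /definition 1/definition 2/
LINE_RE = re.compile(r"^(\S+)\s+\[([^\]]+)\]\s+/(.+)/\s*$")


def parse_line(line):
    """Return (headword, pinyin, definitions), or None for comment
    lines and lines that do not match the assumed shape."""
    if line.startswith("#"):
        return None
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    headword, pinyin, definitions = match.groups()
    return headword, pinyin.lower(), definitions.split("/")


def merge_new_entries(existing, lines, owner="CEDICT"):
    """Add to `existing` only those entries whose (headword, pinyin)
    key is not already present; return the list of keys added."""
    added = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        headword, pinyin, definitions = parsed
        key = (headword, pinyin)
        if key in existing:
            continue  # already in the database, leave it untouched
        existing[key] = {"definitions": definitions, "owner": owner}
        added.append(key)
    return added
```

Tagging each merged entry with its source project keeps the license of every definition traceable after the two lists are combined.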
trevelyan Posted September 18, 2004 at 05:52 PM Mandarinboy, Thanks for the advice -- I was worried about scalability but if clustered indexes address that, a single table would be best. I'll test the speed of the recommended set-up this weekend. Having a single table would make adding new columns/content easy.... I'm leaning towards dumping everything into CEDICT to simplify life. Let me get this database issue sorted out first though. Cheers.
mandarinboy Posted November 1, 2004 at 01:25 PM I have spent some time with the ADSO database and redesigned it to use only three tables instead of the thousands in the MySQL version. I run this on Microsoft SQL Server but will also convert it to MySQL when I am done. I will add some search pages to my on-line dictionary later this week to test some ideas I have for it. The reason I wanted to redesign the database is that I got better performance with only those three tables, it is easier to maintain, and I need fewer queries to the database this way. Since this database is so much larger than the Cedict file, it will greatly improve the use of free on-line dictionaries and translation tools. I will try to join ADSO and Cedict into one file with references to each project for licence use. I love your work! I will let you know the outcome of my test.
trevelyan Posted November 9, 2004 at 06:58 PM Just a note that the download site has moved: http://www.adsotrans.com/download.html The other server kept coughing up space constraints.
xiaoxiajenny Posted November 10, 2004 at 03:48 PM I think Chinese-English can be great for Chinese people, and English-Chinese would be useful for foreigners who are interested in Chinese and are trying to learn it in China. When I tell my friends to use Chinese-English translation tools, they find them difficult to use because they know the English meaning; what they want to know is how to pronounce it in Chinese. Most of them can only speak some short and easy sentences, something about greeting, eating or dating... but I know they want to learn how to say that in Chinese, and even to write it. Although they can just copy the Chinese word and paste it there, and it could be translated into Chinese, it's still not that useful for them. If they live in China, they try to communicate with people in Chinese, and sometimes they are proud of their ability to speak Chinese. I've been using some online translation tools, but most of them are not accurate enough. If I'd like to translate the sentence "My Chinese is very poor" into Chinese, the result would be 我的中国是非常穷. That's not right, so in the end I need to translate it myself. It's not that hard for me to translate because I'm Chinese, but for foreigners who don't know the Chinese word, it will be difficult to use. I'm not quite sure whether my idea is right or not, I'm really so tired today. Who can tell me some English-Chinese translation tools? --Jenny