Yet another Chinese flashcard program

April 30, 2005 at 03:08 PM

I've released version 0.3 of my flashcard program, the zdt (Zhongwen Development Tool). It's open source (free!), written in Java, and built on the Eclipse RCP framework. It currently uses the April 2005 version of the CEDICT dictionary project, although I'm looking to integrate it with Trevelyan's ADSO database in my next release.

I wrote it specifically to help me learn the vocab for each chapter in my textbook. I also wanted the annotation functionality that I first saw in Clavis Sinica so I could read Chinese text, so it offers simple pop-up definitions over Chinese text. It's similar to the newsinchinese site, only not as smooth or sophisticated. It's still a little rough around the edges (especially the lack of documentation), but I've found it useful and I hope other people will too. If people see some potential and are willing to help out, the Eclipse RCP plugin architecture should make it easy to add new functionality.

I've tested it so far on Windows XP/2000. Please let me know if it works on older versions of Windows. There's also a Linux version in the works (if anyone's interested). The zdt requires Java 5.0. If you have that version of the JDK/JRE installed, you can grab the smaller zdt-0.3.0-setup.exe file (~13MB). If you don't, or have no idea what Java is, grab the larger zdt-0.3.0-setup-full.exe file (~29MB, eep!).

Any feedback/comments are appreciated. Enjoy!

chris

April 30, 2005 at 07:02 PM

This is cool... thanks. I'd definitely be interested in having a Linux version, although I'm not sure it would be worth your time putting one together given the numbers involved.

April 30, 2005 at 07:57 PM

What do you mean by the numbers involved? I was thinking initially of taking the ADSO database as is (just simplified chars), converting it into my db format, and then allowing users to choose between the CEDICT version and the ADSO version.

The Linux version is almost there. There is a bug in the copy/paste functionality; I've reported it to the Eclipse guys and it will be fixed in their next release. When that's fixed I think it will be good to go.

chris

May 4, 2005 at 02:15 PM

Bogleg - I've not had a chance to look at your software, but it might be worth talking to Erik Petersen (sp? www.mandarintools.com), who's written a free Java Chinese reader called Dimsum with quite a lot of functionality.

I've thought quite a bit about what my ideal Chinese reader/text editor/flashcard environment would be able to do. It's a long list, but here goes:

1 a reader with mouseover definitions

2 text segmentation (to clump multicharacter words together)

3 a pinyin input system linked to the CE dictionary to help with character confusion

4 character frequency information integrated into the dictionary

5 access to a big database of Chinese text, allowing testing of characters in situ (as well as in the stand-alone way flashcard packages usually work)

6 right click additions of new characters to the flashcard list

7 characters you've got right n times in a row get tested less frequently as n increases (see www.supermemo.com)

8 automatic linking of individual characters to multi-character words that contain them, so that correct reading of multi-character words counts towards reducing the frequency of testing of the individual character (somewhat)

8 right click identification of characters you should have recognised but didn't (i.e. even though you haven't failed a formal flashcard test for a character, you can 'elect to fail' - increasing the frequency you get tested on that character - if you had to use the software to look up a character that 'should be' in your vocabulary)

9 "concordancing" ability, i.e. analysing the character frequency of text files (preferably based on segmented text so that multi-character words are recognised as such), and then comparing these to your existing vocabulary, so that if the percentage of words you don't know is too high, you can skip that text and pick one more likely to be comprehensible

10 ability to "direct" your reading - i.e. if you know the 1000 most common characters, it will go onto the web, look for text rich in the next few hundred, and harvest it
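
As a rough illustration of the "concordancing" idea in item 9, the check could look something like the Python sketch below. This is not code from zdt, Adso, or DimSum. The names `is_cjk` and `unknown_ratio`, the sample vocabulary, and the 20% cut-off are all made up for the example, and it counts single characters rather than segmented words.

```python
def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def unknown_ratio(text, known_chars):
    """Fraction of the Chinese characters in `text` that are not in `known_chars`.

    Punctuation, Latin letters and whitespace are ignored.
    """
    chars = [ch for ch in text if is_cjk(ch)]
    if not chars:
        return 0.0
    unknown = sum(1 for ch in chars if ch not in known_chars)
    return unknown / len(chars)


# Hypothetical learner vocabulary and text, for illustration only.
known = set("我是学生")
text = "我是老师。"

ratio = unknown_ratio(text, known)
# Assumed threshold: skip texts where more than 20% of characters are unknown.
too_hard = ratio > 0.2
```

A word-based version would need a segmenter in front of the counting step; the comparison against the learner's vocabulary would stay the same.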

Sad huh! But if you think any of that's interesting, PM me and I can explain a bit more...

May 4, 2005 at 03:34 PM

bogleg,

I just meant I'm not sure if there are enough learners who use Linux exclusively to justify your spending time porting the software to that platform. I'll use it though.

onebir,

We're moving in many of the directions you mention with Adsotrans. If you haven't seen the software/site yet, you should check it out. Adso is an open source context-sensitive text annotation and gist translation system. The best way to get a sense of what it does is by going to a language-learning blog we're running at http://www.newsinchinese.com. A variety of other annotation options are available from the main site itself, including pinyin-over-character, tone-over-character, etc.: http://www.adsotrans.com.

Comments six through eight are specific to flashcard programs, so I'm not going to comment too much on them, although I think that once the algorithms are taken care of and there is a flexible database system running in the background (we use SQLite and MySQL), many of the issues probably boil down to getting the interface properly tied to the backend (right click to add words, etc.). On your other points:

(3) We have a beta Chinese IME system written in Java that is ready to go. All we need is someone using a non-Chinese OS to help us put together the actual interface. When this is done, anyone interested in offering Chinese text input would be able to load the applet onto their screen, or bundle it into websites/software in other ways. This is on hold until someone steps forward with some Java experience, or who wants to roll their own using another method. We can provide comprehensive text and character lists.

(9, 10) We have the software to generate frequency information for the entire GB2312 character set as well as every one of the roughly 139,000 entries in our database. What we don't have is the time or computing power for the computation needed to produce the actual data. If you have some knowledge of linguistics, a computer you're willing to let churn through stuff for a day or two full-time, and are interested in helping us add this kind of data to our database, we would love to get you involved.

What we need is someone who can collect a reasonably representative corpus of standard Mandarin (in a worst case scenario, recursively wget the Xinhua site) and run the software over it. Ideally, we'd like to have the whole process automated so that we can regenerate the frequency data every few months as the database expands. When the frequency information is built right into the database, it should be easy for applications to start offering frequency-based transformations and services.
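
For readers wondering what the counting step involves, here is a minimal Python sketch of character-frequency generation. It is not the Adso software; the function name `char_frequencies` and the two-sentence "corpus" are assumptions for illustration, and a real run would read the collected files and also count multi-character database entries.

```python
from collections import Counter


def char_frequencies(texts):
    """Relative frequency of each Chinese character across an iterable of strings.

    Only characters in the main CJK Unified Ideographs block are counted.
    """
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() yields characters from most to least frequent.
    return {ch: n / total for ch, n in counts.most_common()}


# Tiny stand-in corpus; a real one would be thousands of articles.
freqs = char_frequencies(["我是学生", "我是老师"])
```

Because the function takes any iterable of strings, it can be fed one file at a time, so the whole corpus never has to sit in memory at once.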

We're looking at this too, as having reasonably accurate statistical data about the frequency of word and character usage would be tremendously useful for us in developing algorithms to help with machine translation and text annotation, especially with the treatment of Chinese geographic and personal names. It isn't the kind of thing that can be done on a laptop though, nor can our server be devoted to it, so this is on hold until someone volunteers who is capable of basic shell scripting and has a computer they're willing to let churn away at Chinese texts for a day or two. Once it's done, though, we can fairly easily incorporate 9 and 10 within Adso... ranking articles according to their difficulty and even selecting articles which correspond with certain ontological categories or user word-lists.

Anyway, please check out the site, and drop a note if you have any time or interest in getting involved.

Best,

--david

May 4, 2005 at 10:11 PM

Hi onebir,

Yeah, I've downloaded the DimSum app and played around with it. I was actually inspired by its Annotator functionality and my version is pretty similar in function. I think that covers your points 1 & 2. However it didn't have the type of flashcard functionality that I was interested in.

Regarding point 6, I don't currently have right-click to add flashcards to a category. It should be easy to add though. However, the app does support drag and drop of characters between views as well as copy/paste. That's one thing I definitely was looking for: making it as easy as possible to add flashcards to categories.

For point 7, I do have a simplistic algorithm which takes into account the number of times the user has answered correctly in a row, total % correct, and last time tested among other things. I'd like to make the algorithm user customizable in future versions. I call it a "smart filter" but it's not super smart at the moment.
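
To make the idea of such a filter concrete, here is a small Python sketch that combines the three inputs mentioned above (correct answers in a row, overall % correct, and time since the last test) into one priority score. It is not zdt's actual algorithm, which isn't shown in this thread. The `CardStats` fields, the formula, and its weights are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CardStats:
    streak: int             # correct answers in a row
    correct: int            # total correct answers
    attempts: int           # total times tested
    days_since_test: float  # time since the card was last shown


def review_priority(stats):
    """Higher score = test this card sooner (illustrative formula only)."""
    accuracy = stats.correct / stats.attempts if stats.attempts else 0.0
    # A long streak stretches the interval; low accuracy pulls the card forward.
    return stats.days_since_test / (1 + stats.streak) + (1 - accuracy)


def pick_cards(cards, n):
    """Return the n (name, stats) pairs with the highest priority."""
    ranked = sorted(cards.items(), key=lambda kv: review_priority(kv[1]), reverse=True)
    return ranked[:n]


# Hypothetical cards for illustration.
cards = {
    "休": CardStats(streak=0, correct=1, attempts=4, days_since_test=3.0),
    "我": CardStats(streak=5, correct=9, attempts=10, days_since_test=3.0),
}
due_first = pick_cards(cards, 1)
```

Making the algorithm user customizable could then be as simple as exposing the weights in this formula as settings.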

I was wondering if you could expand a little more on your points 3 and 5. Could you give me some examples? Is what trevelyan mentioned about his Chinese IME system what you're looking for?

Thanks for the comments.

chris

May 5, 2005 at 10:10 AM

Quote: "http://www.newsinchinese.com."

You guys are doing some amazing (and long overdue) work. Unfortunately, I'm neither a programmer nor a linguist - just a frustrated learner of Chinese. So I probably can't make a concrete contribution. But for what it's worth, here are some more suggestions:

Quote: "(3) We have a beta Chinese IME system written in Java that is ready to go."

If that could access character lists or character frequency information, it'd be great for writing graded material. You could distribute it to Chinese bloggers who're trying to learn English, and get them to write bilingual blogs (the payoff being that English speakers help correct the English versions). Result: fresh, bilingual material that's maybe a bit more relevant to spoken Chinese.

Quote: "(9, 10) ... What we need is someone who can collect a reasonably representative corpus of standard Mandarin."

There are definitely big academic-derived corpora out there; given you're not-for-profit they might give them to you free if you asked.

Quote: "developing algorithms to help with machine translation and text annotation, especially with the treatment of Chinese geographic and personal names"

I think the Eric Petersen software has an algorithm for dealing with this that works fairly well...

May 5, 2005 at 11:11 AM

Quote: "3 a pinyin input system linked to the CE dictionary to help with character confusion"

Quote: "5 access to a big database of chinese text, allowing testing of characters in situ (as well as in the stand alone way flashcard packages usually work)"

Bogleg, here's what I mean:

3 have an option for the English translation of the character/word to come up next to it when you're inputting, so if the user's not sure which char is which, he/she gets a bit of help.

5 why should a flashcard just test the character on its own? We want to read sentences, after all. So when a character comes up for test, why not pull up a sentence (mainly composed of words already in the user's vocab - so this might take a big corpus) and show that as the test item? (highlighting the char in question).
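
The suggestion in point 5 can be sketched in a few lines of Python. This is only an illustration of the idea, not a feature of any program mentioned here. The names `pick_example` and `highlight`, the `min_known` threshold, and the sample sentences are assumptions, and it works on single characters rather than segmented words.

```python
def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def pick_example(target, sentences, known_chars, min_known=0.8):
    """Pick the sentence containing `target` whose other characters are best known.

    `target` is a single character. Returns None if no sentence reaches
    the `min_known` share of known characters.
    """
    best, best_score = None, -1.0
    for sentence in sentences:
        if target not in sentence:
            continue
        others = [ch for ch in sentence if is_cjk(ch) and ch != target]
        known = sum(1 for ch in others if ch in known_chars)
        score = known / len(others) if others else 1.0
        if score >= min_known and score > best_score:
            best, best_score = sentence, score
    return best


def highlight(sentence, target):
    """Mark the tested character with brackets."""
    return sentence.replace(target, "[" + target + "]")


# Hypothetical corpus and vocabulary, for illustration only.
corpus = ["他在公园休息。", "我休息。"]
vocab = set("我是学生")
example = pick_example("休", corpus, vocab, min_known=0.5)
```

With a small corpus most characters will have no sentence that passes the threshold, which is why a big corpus is needed for this to work in practice.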

I'll just take the opportunity to rant about character frequency, and the order waiguoren should learn characters in, to some people who might be able to do something about it. I think existing graded reading material gets this seriously wrong - particularly for people like me who aren't interested in learning to write (and so miss the reinforcement that makes the order less important).

The considerations I think should affect the order of learning characters are:

1) buildup of radicals/commonly occurring components not traditionally thought of as radicals (see the preface in McNaughton's "Reading and Writing Chinese")

2) Complexity of the character - in terms of number of radicals or other components

3) Quality of mnemonic - if it's a meaning-meaning character, does the combination make sense (like person-tree for 休)? if it's sound-meaning, is that still valid?

4) does the character occur mainly in multi-char words? if so, you can often get away with a hazy recollection of it if you're pretty strong on the other chars in the word

5) does the character have a strong tendency to occur in conjunction with other chars (but in recognised words - eg measure words)? same comment as 4

I think order of learning should be based on a weighting of frequency with these factors, because sometimes they make it easier to learn to recognise several characters, with a cumulative frequency higher than the next highest frequency character.

And factors 4 & 5 can provide a further gain - in terms of improved ability to segment the sentence - that isn't captured by thinking simply about the proportion of characters in a chinese corpus a reader would be able to recognise. [This is why I stressed integration of testing of characters and words in flashcarding - I'm not sure there's enough "leveraging" of known characters in existing methods.]

Anyway - as usual, I'd be happy to elaborate ;-)

May 16, 2005 at 10:24 AM

For our own project, Chinese homework trainer (unfortunately not freeware yet), we have just written a program that collects data from Chinese newspapers to get current statistics on modern Chinese usage. We will run it for a few months, and the results will be free for others, e.g. ADSO, to use. We will add a link to the database file on our site when we have enough data. Please let us know if you need anything else. We are working on a data model in which characters are linked to compound data, etc. As for the other features, we will include them all in our next update, along with several others. Since we also save the texts, we can easily find texts suitable for readers whatever their knowledge of Chinese characters. We are also testing functions for linking words to sentences, which will help students learning new words by giving them an example sentence. Since the sentences are machine-collected, we do not yet know how useful this will be. It looks promising, but I guess we will need some manual work in the end to make sure the sentences we use are suitable.

June 22, 2005 at 04:45 PM

What a great tool this flashcard program is. Thank you so much for posting it.

June 30, 2005 at 02:44 AM

I'd definitely be interested in a Linux version! The only thing I use my Windows for nowadays is learning Chinese... your flashcard program looks great and the Linux version will be the icing on the cake... I have the beginner's Chinese book by Yong Ho, and maybe as a thank-you (when I have the time) I could write the word lists for it.

June 30, 2005 at 09:31 PM

Thanks for the kind words! I've been busy and haven't spent much time working on the program lately. Eclipse 3.1 was recently released and I think that fixes one of the major issues I was having with the Linux version. So I'll have to try that out and see how it works.

chris

July 3, 2005 at 10:46 PM

Two Questions:

First, how do you input Pinyin into the system? I don't mean characters, but for the flashcards you need to be able to input Pinyin answers, and right now I do not have any way to do that.

Second, what can I do if I want to add vocabulary to the flashcard list that does not exist in the current database?

Thanks!

July 4, 2005 at 01:27 AM

The system follows the CEDICT conventions, so pinyin tones are indicated by numbers: 1 = level tone, 2 = rising tone, 3 = falling-rising tone, 4 = falling tone, 5 = neutral tone, e.g. 学生 = xue2 sheng5.
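
For anyone curious how numbered-pinyin answers like this can be checked, here is a small Python sketch. It is not zdt's code. The names `parse_pinyin` and `same_answer` are made up for the example, and the pattern assumes the CEDICT-style input described above (one syllable per space-separated token, tone digit 1-5 at the end, with ü written as "u:").

```python
import re

# One syllable: letters (plus ':' for CEDICT's "u:") followed by a tone digit 1-5.
SYLLABLE = re.compile(r"^([A-Za-z:]+)([1-5])$")


def parse_pinyin(text):
    """Split numbered pinyin into (syllable, tone) pairs.

    Raises ValueError if a token has no valid tone number.
    """
    result = []
    for token in text.split():
        match = SYLLABLE.match(token)
        if match is None:
            raise ValueError("not numbered pinyin: " + repr(token))
        result.append((match.group(1).lower(), int(match.group(2))))
    return result


def same_answer(given, expected):
    """Compare two answers, ignoring letter case and extra spaces."""
    return parse_pinyin(given) == parse_pinyin(expected)


syllables = parse_pinyin("xue2 sheng5")
```

Here `syllables` holds the pairs `("xue", 2)` and `("sheng", 5)`, so an answer typed as `Xue2  sheng5` still counts as correct while `xue2 sheng1` does not.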

If you use the "Add Character" function in the category editor and add a character that doesn't exist, then it will automatically add it to your current database.

Hope that helps!

chris

July 4, 2005 at 04:45 AM

Quote: "The system follows the CEDICT conventions, so pinyin tones are indicated by numbers: 1 = level tone, 2 = rising tone, 3 = falling-rising tone, 4 = falling tone, 5 = neutral tone, e.g. 学生 = xue2 sheng5. If you use the 'Add Character' function in the category editor and add a character that doesn't exist, then it will automatically add it to your current database. Hope that helps! chris"

Great!

I thought the 'add character' just added a character, not a whole word. Doh!

Thanks though, I'll certainly use it now!


Posters, in thread order: bogleg, trevelyan, bogleg, onebir, trevelyan, bogleg, onebir, onebir, mandarinboy, waxwing, codemonkey, bogleg, kangkai, bogleg, kangkai
