Let's talk about new-word lists

August 14, 2010 at 08:38 PM

I have a list of known words.

I have a transcribed new lesson, let's say already parsed.

I would like a simpler and automated method/script/program for extracting the new-word list from the new lesson, just a text file with one unique new word per line. I have or can think of different ways but it always seems too complicated.

What do you use?

Thank you.
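
For reference, the extraction asked for here can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the lesson is already segmented into whitespace-separated words, punctuation has been removed, and the function name `new_words` is hypothetical (it does not come from any tool mentioned in this thread).

```python
def new_words(lesson_text, known_words):
    """Return the lesson's unknown words, unique and in first-seen order.

    Assumes lesson_text is already segmented: words separated by whitespace.
    """
    known = set(known_words)
    seen = set()
    result = []
    for word in lesson_text.split():
        if word not in known and word not in seen:
            seen.add(word)
            result.append(word)
    return result


if __name__ == "__main__":
    known = ["我", "你", "好"]
    lesson = "你 好 我 喜欢 学习 中文 我 喜欢 你"
    # One unique new word per line, as requested.
    print("\n".join(new_words(lesson, known)))
```

In practice the known words would be read from a text file (one per line) and the result written to another, but the core of the job is just this filtering step.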

August 15, 2010 at 02:10 AM

I wrote a very simple gawk script. Works for me.

August 15, 2010 at 03:33 AM

(Edited because I misunderstood the OP's meaning)

This is what I do now:

Open the lesson transcript in Pleco Reader.
Click on the unrecognized character, and enter it into Pleco's flashcard database.
Export the flashcards to Anki.

This isn't as fast as a script, but it still saves me a lot of typing of definitions and pronunciations.

August 15, 2010 at 10:47 AM

I have written a web application that may do what you need. Along with the unique words, it outputs pinyin and definitions from CC-CEDICT, number of occurrences, HSK level, and some other useful stuff.

http://zhtoolkit.com/apps/wordlist/create-list.cgi

August 15, 2010 at 12:12 PM

The website looks really interesting, but seems to have problems with somewhat larger texts.

I took the text from this page:

http://www.5156edu.com/html/7396/2.html (it's the first page of 巴金's 家)

copied it and submitted the query (without changing any other options), and got this error message:

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, webmaster@zhtoolkit.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

I just tried only the first sentence and then it works.

August 15, 2010 at 02:04 PM

I'm debugging it now. So far, it just seems to be randomly occurring, as refreshing the error page returns a valid page the second time.

August 15, 2010 at 05:17 PM

jbradfor: I was using perl or python or bash scripts that I had found and adapted, but in the adaptation (because I don't know what I'm doing) it became complicated. Also, I've just switched back to Windows from Linux and I dread having to dig back into those scripts, make them work in Windows, etc. Would you share your gawk script? I'll study it first before using it. Thank you.

feihong: Yes. I have Plecodict but haven't really decided how to use it yet. That mode appeals to me, as it fits naturally into reading time.

c_redman: That's neat. It worked for my whole new lesson (500+ chars long). But I don't see how to input my list of known words, so it can be used incrementally. Also (less importantly), a plain text list or TSV instead of a table would be convenient.
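
A plain-text or TSV word list of the kind requested here is simple to produce once the words are available as a list. The following Python sketch is an illustration only: it is not how zhtoolkit works, and the function name `words_to_tsv` and the two-column layout (word, occurrence count) are assumptions.

```python
import csv
import io
from collections import Counter


def words_to_tsv(words):
    """Return TSV text with one row per unique word: the word, then its count."""
    counts = Counter(words)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["word", "count"])
    # most_common() sorts by count, highest first; ties keep first-seen order.
    for word, count in counts.most_common():
        writer.writerow([word, count])
    return buffer.getvalue()


if __name__ == "__main__":
    print(words_to_tsv("我 喜欢 中文 我".split()), end="")
```

A TSV file like this opens directly in a spreadsheet and imports into most flashcard programs.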

August 15, 2010 at 07:16 PM

It might not be what you are looking for but you can do this:

1) create a New Account

2) Check both checkboxes "Vocabulary list" and "Include option to mark known words"

3) Create output

4) Check the boxes for the words you know

5) Click on Add to Known Words

There isn't yet the ability to import lists of known words, but it's on my short list of things to add.

August 15, 2010 at 09:34 PM

c_redman: "Invalid seenlist_id 'new' for user_id '143'"

August 16, 2010 at 08:14 AM

Argh, what a stupid bug! It should be fixed now.

P.S., once it's created, you'll need to use the back button to get back to the form, and then refresh to see your list. The navigation controls are still simplistic at this point.

August 16, 2010 at 01:38 PM

@querido: "Would you share your gawk script? I'll study it first before using it. Thank you."

It's really not much, but here it is. Its primary purpose is, given a list of words, to look up the pinyin and definition of each and print them out. It also supports an optional list of words to ignore.

Usage:

gawk -v "DICT=<dictionary file name>" -v "IGNORE=<ignore list>" -f script.awk <input file>

The dictionary must be in CC-CEDICT format (and, in fact, I use CC-CEDICT). The ignore list is read the same way, but only its first two fields are used, so the pinyin and definition can be present or left out. If a word isn't in the dictionary, the script guesses the pinyin character by character, but the guess is often wrong.

BTW, just in case you don't know already, Cygwin is a decent replacement for Linux when running under Windows. It's not great, but I use it all the time for running my little gawk scripts.

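BEGIN {
  # Load the dictionary, indexed by both the traditional ($1) and simplified ($2) form.
  # Comparing getline's result with 0 avoids an endless loop if the file can't be read.
  while ( ( getline < DICT ) > 0 ) {
     # In case of multiple entries for a word, keep the first one
     if ( ! ( $1 in word ) ) {
        word[$1] = $0;
     }
     if ( ! ( $2 in word ) ) {
        word[$2] = $0;
     }
  }
  close(DICT);
  if ( IGNORE != "" ) {
     # Only the first two fields of each ignore-list line are used
     while ( ( getline < IGNORE ) > 0 ) {
        ignore[$1] = 1;
        ignore[$2] = 1;
     }
     close(IGNORE);
  }
}

{
  for ( i=1 ; i<=NF ; ++i ) {
     if ( $i ~ /^[0-9a-zA-Z().一两百千“”/+（），。—、！:：#？-]+$/ || $i == "" ) {
        # Punctuation, numbers or Latin text: do nothing
     } else if ( $i in word ) {
        if ( $i in seen || $i in ignore ) {
           # Already printed, or on the ignore list: do nothing
        } else {
           seen[$i] = 1;
           print word[$i];
        }
     } else {
        # Not in the dictionary: build a guess from the single-character entries
        print "|" $i "| not found";
        trad = "";
        simp = "";
        pinyin = "";
        sep = "";
        for ( j=1; j<=length($i) ; ++j ) {
           ch = substr($i, j, 1);
           # Test membership first: merely referencing word[ch] would create
           # an empty entry and make later "in word" tests succeed wrongly
           if ( ch in word ) {
              split(word[ch], array, " ");
              trad = trad array[1];
              simp = simp array[2];
              # Strip the surrounding [ ] from the pinyin field
              pinyin = pinyin sep substr(array[3], 2, length(array[3])-2);
              sep = " ";
           } else {
              # Unknown character: keep it as-is, with no pinyin guess
              trad = trad ch;
              simp = simp ch;
           }
        }
        if ( pinyin != "" ) {
           print trad " " simp " [" pinyin "] /NEED DEFINITION/";
        }
     }
  }
}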
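
For readers who would rather work in Python than gawk, the same idea (index CC-CEDICT entries by both written forms, then guess pinyin character by character for unknown words) might be sketched as follows. This is an illustrative sketch, not a port of the script above: the names `load_dictionary` and `guess_pinyin` are hypothetical, and the regular expression assumes the usual CC-CEDICT line layout of traditional form, simplified form, bracketed pinyin, and slash-delimited definitions.

```python
import re

# Assumed CC-CEDICT layout: TRAD SIMP [pin1 yin1] /definition 1/definition 2/
ENTRY = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def load_dictionary(lines):
    """Index entries by traditional and simplified form, keeping the first entry seen."""
    entries = {}
    for line in lines:
        if line.startswith("#"):
            continue  # comment line
        match = ENTRY.match(line)
        if not match:
            continue  # skip anything that doesn't look like an entry
        trad, simp, pinyin, definitions = match.groups()
        record = {
            "trad": trad,
            "simp": simp,
            "pinyin": pinyin,
            "definitions": definitions.split("/"),
        }
        entries.setdefault(trad, record)
        entries.setdefault(simp, record)
    return entries


def guess_pinyin(word, entries):
    """Guess pinyin one character at a time; unknown characters become '?'."""
    return " ".join(entries[ch]["pinyin"] if ch in entries else "?" for ch in word)


if __name__ == "__main__":
    sample = [
        "# sample data, not real dictionary output",
        "學 学 [xue2] /to learn/to study/",
        "生 生 [sheng1] /to be born/to give birth/",
    ]
    entries = load_dictionary(sample)
    print(guess_pinyin("学生", entries))
```

As with the gawk version, the per-character guess is often wrong for characters with more than one reading, because it always takes the first entry found.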

@c_redman: "There isn't yet the ability to import lists of known words, but it's on my short list of things to add."

It would also be great if you could provide pre-made lists of words to ignore, e.g. the HSK lists or NPCR, so that a single click could ignore all of them.

August 18, 2010 at 02:49 PM

Thank you, you two.

I don't have time to evaluate/follow up right now.

September 1, 2010 at 10:01 PM

O.K., I've figured out how to make the tools I have do what I want.

I need that instead of scripts, etc., for the same reason I switched from Linux back to Windows: when I'm studying Chinese (or one of my other interests), I can't stand these little technical issues any more.

Solution:

Just let a flashcard program remove duplicates. In case someone needs a howto:

1. Parse into words with your favorite tool.

2. Use find+replace-all to arrange the words one per line (it isn't necessary to remove duplicates).

3. Import into e.g. Anki, which will reject the duplicates.

4. Tag them, e.g., "Lesson_01". You could then export these as a list of unique words if you want to.

5. When you do the same for the next text, that rejection of duplicates will cause only the unique new words to form new cards, "Lesson_02", etc.

The only thing left is remembering the code to put into the find+replace box to insert a carriage return. Wenlin didn't like the codes I googled, so I just copy and paste a carriage return in.

(P.S. for anyone who doesn't know this yet, Plecodict will fill out the cards automatically too, using the dictionary of your choice.)
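
The workflow in this post relies on the flashcard program rejecting duplicates on import. The Python sketch below only imitates that behaviour to show why it produces per-lesson new-word lists. It does not use Anki's actual import code or API, and `import_lesson` and the dict-based "deck" are hypothetical stand-ins.

```python
def import_lesson(deck, words, tag):
    """Add unseen words to deck (a dict of word -> tag) and return the words added.

    Words already in the deck are skipped, imitating a flashcard program
    that rejects duplicate cards on import.
    """
    added = []
    for word in words:
        if word not in deck:
            deck[word] = tag
            added.append(word)
    return added


if __name__ == "__main__":
    deck = {}
    # Splitting on whitespace plays the role of step 2 (one word per item).
    lesson_1 = "我 喜欢 中文 我".split()
    lesson_2 = "我 喜欢 学习 汉字".split()
    print("\n".join(import_lesson(deck, lesson_1, "Lesson_01")))
    print("\n".join(import_lesson(deck, lesson_2, "Lesson_02")))
```

Because the deck persists between imports, the second call returns only the words that did not appear in the first lesson, which is exactly the incremental new-word list.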

September 7, 2010 at 07:51 PM

"I need that instead of scripts, etc., for the same reason I switched from Linux back to Windows: when I'm studying Chinese (or one of my other interests), I can't stand these little technical issues any more."

And that's why I'm happy with Linux; I hate being limited by what programs can do when I can write a script and fix the problem myself in 10 minutes.

September 7, 2010 at 08:32 PM

"And that's why I'm happy with Linux; I hate being limited by what programs can do when I can write a script and fix the problem myself in 10 minutes."

Windows is a perfectly fine scripting environment, as long as you use Python.

September 16, 2010 at 04:41 AM

c_redman,

Love the work you've done with zhtoolkit. Really useful for parsing and making flashcards. I use iFlash. The only problem was that I sometimes just want compound characters (since I already know most of the single characters). Then I found smarthanzi.net, which makes two lists, singles and compounds. But it uses numbers instead of tone marks, so I have to copy and paste the results back into zhtoolkit, then copy and paste into Wenlin and save as .txt (it automatically saves as Unicode), then import into iFlash (custom format: <break> <tab> <break>).

Excellent!

Now if the parser on zhtoolkit let me eliminate known characters from the generated list, it would be easy (haha) to make (even) greater flashcards.

Thanks

David
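
The round trip described in this post exists only to turn numbered pinyin into tone-marked pinyin. That conversion can also be done locally. The Python sketch below follows the common placement rule (mark "a" or "e" if present, the "o" in "ou", otherwise the last vowel). It is a simplified illustration, not the method used by zhtoolkit or smarthanzi.net: the function names are hypothetical, input is assumed to be space-separated syllables, "u:" and "v" are taken to mean "ü", tone 5 is treated as neutral, and output is lowercased.

```python
# Tone-marked forms of each vowel, indexed by tone number 1-4.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable):
    """Convert one numbered syllable, e.g. 'hao3', to tone-marked form."""
    s = syllable.lower().replace("u:", "ü").replace("v", "ü")
    if not s or s[-1] not in "12345":
        return s  # no tone number: leave as-is
    body, tone = s[:-1], int(s[-1])
    if tone == 5:
        return body  # neutral tone carries no mark
    if "a" in body:
        idx = body.index("a")
    elif "e" in body:
        idx = body.index("e")
    elif "ou" in body:
        idx = body.index("o")
    else:
        vowels = [i for i, ch in enumerate(body) if ch in TONE_MARKS]
        if not vowels:
            return body
        idx = vowels[-1]  # otherwise the last vowel takes the mark
    return body[:idx] + TONE_MARKS[body[idx]][tone - 1] + body[idx + 1:]


def numbered_to_marks(text):
    """Convert space-separated numbered pinyin to tone-marked pinyin."""
    return " ".join(mark_syllable(s) for s in text.split())


if __name__ == "__main__":
    print(numbered_to_marks("ni3 hao3"))
```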

October 6, 2010 at 09:26 AM

I second that.

Thanks c_redman for the great work. Looks like a fantastic tool. Keep up the good work.

-David

Posters, in order of the posts above: querido, jbradfor, feihong, c_redman, BertR, c_redman, querido, c_redman, querido, c_redman, jbradfor, querido, querido, jbradfor, feihong, daodeyao, valikor