Chinese-Forums

Let's talk about new-word lists


querido

I have a list of known words.

I have a transcribed new lesson, let's say already parsed.

I would like a simpler, automated method/script/program for extracting the new-word list from the new lesson: just a text file with one unique new word per line. I have, or can think of, several ways to do it, but they always seem too complicated.

What do you use?

Thank you.
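
For reference, the task described above boils down to a set difference that keeps first-seen order. A minimal Python sketch, assuming the lesson is already segmented into whitespace-separated words and both inputs are UTF-8 text; the function names and the file names in the usage note are hypothetical:

```python
def new_words(lesson_words, known_words):
    """Return the lesson words that are not known, unique, in first-seen order."""
    known = set(known_words)
    seen = set()
    result = []
    for w in lesson_words:
        w = w.strip()
        # Skip blanks, known words, and repeats within the lesson
        if not w or w in known or w in seen:
            continue
        seen.add(w)
        result.append(w)
    return result


def read_words(path):
    """Read whitespace-separated words from a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return f.read().split()
```

Something like `"\n".join(new_words(read_words("lesson.txt"), read_words("known.txt")))` would then give one unique new word per line.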


(Edited because I misunderstood the OP's meaning)

This is what I do now:

  1. Open the lesson transcript in Pleco Reader.
  2. Click on each unrecognized character and add it to Pleco's flashcard database.
  3. Export the flashcards to Anki.

This isn't as fast as a script, but it still saves me a lot of typing of definitions and pronunciations.


The website looks really interesting, but seems to have problems with somewhat larger texts.

I took the text from this page:

http://www.5156edu.com/html/7396/2.html (it's the first page of 巴金's 家)

I copied it, submitted the query (without changing any other options), and got this error message:

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, webmaster@zhtoolkit.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

When I tried just the first sentence, it worked.


jbradfor: I was using Perl, Python, or bash scripts that I had found and adapted, but in the adaptation (because I don't know what I'm doing) they became complicated. Also, I've just switched back to Windows from Linux and I dread having to dig back into those scripts, make them work in Windows, etc. Would you share your gawk script? I'll study it first before using it. Thank you.

feihong: Yes. I have Plecodict but haven't really decided how to use it yet. That mode appeals to me as *it fits naturally into reading time*.

c_redman: That's neat. It worked for my whole new lesson (500+ chars long). But I don't see how to input my list of known words, so it can be used incrementally. Also (less importantly), a plain text list or TSV instead of a table would be convenient.


It might not be what you are looking for but you can do this:

1) Create a new account

2) Check both checkboxes "Vocabulary list" and "Include option to mark known words"

3) Create output

4) Check the boxes for the words you know

5) Click on Add to Known Words

There isn't yet the ability to import lists of known words, but it's on my short list of things to add.


Argh, what a stupid bug! It should be fixed now.

P.S., once it's created, you'll need to use the back button to get back to the form, and then refresh to see your list. The navigation controls are still simplistic at this point.


@querido

Would you share your gawk script? I'll study it first before using it. Thank you.

It's really not much, but here it is. Its primary purpose is, given a list of words, to look up each word's pinyin and definition and print them out. It also supports an optional list of words to ignore.

Usage:

gawk -v "DICT=<dictionary file name>" -v "IGNORE=<ignore list>" -f script.awk <input file>

The dictionary is required to be in CCDICT format (and, in fact, I use CCDICT). The ignore list is read only for its first two fields, so it can be a list of CCDICT-format entries; it doesn't need the pinyin or definition, but it's OK for them to be there. If a word isn't in the dictionary, the script guesses the pinyin character by character, but the guess is often wrong.

BTW, just in case you don't know already, Cygwin is a decent replacement for Linux when running under Windows. It's not great, but I use it all the time for running my little gawk scripts.

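BEGIN {
  FS = " ";
  # Load the dictionary, indexed by both of the first two fields.
  # Comparing with > 0 ends the loop on a read error too (getline returns -1),
  # instead of looping forever when the file cannot be opened.
  while ( ( getline < DICT ) > 0 ) {
     # This test is here so in case of multiple entries, we pick the first
     if ( ! ( $1 in word ) ) {
        word[$1] = $0;
     }
     if ( ! ( $2 in word ) ) {
        word[$2] = $0;
     }
     #print $0;
  }
  close(DICT);
  # Load the optional ignore list, again keyed by the first two fields
  if ( IGNORE != "" ) {
     while ( ( getline < IGNORE ) > 0 ) {
        ignore[$1] = $0;
        ignore[$2] = $0;
        #print $0;
     }
     close(IGNORE);
  }
}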

{
  for ( i=1 ; i<=NF ; ++i ) {
     #printf "|" $i "|";
     if ( $i ~ /^[0-9a-zA-Z().一两百千“”/+(),。—、!::#?-]+$/ || $i == "" ) {
           # Do nothing
           #print "punc";
     } else if ( $i in word ) {
        if ( $i in seen || $i in ignore ) {
           # Do nothing
           #print "seen or ignore";
        } else {
           seen[$i] = 13;
           print word[$i];
        }
     } else {
        print "|" $i "| not found";
        trad = "";
        simp = "";
        pinyin = "";
        pinyinTxt = "";
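        for ( j=1; j<=length($i) ; ++j ) {
           #print substr($i, j, 1);
           c = substr($i, j, 1);
           # Test membership first: merely reading word[c] would create an
           # empty entry that a later "in word" test would then match
           if ( ! ( c in word ) ) {
              continue;
           }
           n = split(word[c], array, " ");
           trad = trad array[1];
           simp = simp array[2];
           pinyin = pinyin pinyinTxt substr(array[3], 2, length(array[3])-2);
           pinyinTxt = " ";
        }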
        if ( pinyin != "" ) {
           print trad " " simp " [" pinyin "] /NEED DEFINITION/";
        }
     }
  }
}
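
For anyone who would rather not run gawk, the fallback branch of the script (guessing an entry for a word that isn't in the dictionary) can be sketched in Python. This is a rough translation, not the script itself; it assumes dictionary lines of the form `trad simp [pinyin] /definition/`, which is what the script's field handling implies, and the function names are made up. Because only the first reading of each character is kept, characters with more than one reading are one reason the guess is often wrong.

```python
def load_dict(lines):
    """Index dictionary lines by their first two fields (first entry wins)."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split(" ", 2)
        if line.startswith("#") or len(fields) < 3:
            continue  # skip comments and malformed lines
        index.setdefault(fields[0], line)
        index.setdefault(fields[1], line)
    return index


def guess_entry(word, index):
    """Join per-character entries into a placeholder entry, or return None."""
    trad, simp, syllables = "", "", []
    for ch in word:
        entry = index.get(ch)  # .get() does not create missing keys
        if entry is None:
            continue
        t, s, rest = entry.split(" ", 2)
        trad += t
        simp += s
        # Keep only the text between the square brackets
        syllables.append(rest[rest.index("[") + 1:rest.index("]")])
    if not syllables:
        return None
    return f"{trad} {simp} [{' '.join(syllables)}] /NEED DEFINITION/"
```

With an index built from the single-character lines for 學/学 and 習/习, `guess_entry("学习", index)` produces a line in the same shape as the script's `/NEED DEFINITION/` output.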

@c_redman

There isn't yet the ability to import lists of known words, but it's on my short list of things to add.

It would also be great if you could provide pre-made lists of words to ignore, e.g. the HSK lists or NPCR, so that a single click could ignore all of them.


  • 2 weeks later...

O.K., I've figured out how to make the tools I have do what I want.

I need that instead of scripts, etc., for the same reason I switched from Linux back to Windows: when I'm studying Chinese (or one of my other interests), I can't stand these little technical issues any more.

Solution:

Just let a flashcard program remove duplicates. In case someone needs a howto:

1. Parse into words with your favorite tool.

2. Use find+replace-all to arrange the words one per line (it isn't necessary to remove duplicates).

3. Import into e.g. Anki, which will reject the duplicates.

4. Tag them, e.g., "Lesson_01". You could then export these as a list of unique words if you want to.

5. When you do the same for the next text, that rejection of duplicates will cause only the unique new words to form new cards, "Lesson_02", etc.
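
For comparison, the same incremental idea can be sketched in a few lines of Python, with a plain dict standing in for the flashcard deck. This only illustrates the logic of the steps above and says nothing about how Anki handles duplicates internally; the function names are made up.

```python
def one_per_line(text):
    """Step 2: put whitespace-separated words one per line."""
    return "\n".join(text.split())


def add_lesson(deck, words, tag):
    """Steps 3-5: add only unseen words to the deck under the given tag.

    deck maps word -> tag of the lesson that first introduced it.
    Returns the words that were actually added, in order.
    """
    added = []
    for w in words:
        if w and w not in deck:  # duplicates are rejected
            deck[w] = tag
            added.append(w)
    return added


def words_with_tag(deck, tag):
    """Step 4: export the unique words introduced by one lesson."""
    return [w for w, t in deck.items() if t == tag]
```

Calling `add_lesson` once per lesson on the same `deck` means each call returns only that lesson's new words.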

The only thing left is remembering the code to put into the find+replace box to insert a carriage return. Wenlin didn't like the codes I googled, so I just cut and pasted in a carriage return.

(P.S. for anyone who doesn't know this yet, Plecodict will fill out the cards automatically too, using the dictionary of your choice.)


"I need that instead of scripts, etc., for the same reason I switched from Linux back to Windows: when I'm studying Chinese (or one of my other interests), I can't stand these little technical issues any more."

And that's why I'm happy with Linux; I hate being limited by what programs can do when I can write a script and fix the problem myself in 10 minutes :P


"And that's why I'm happy with Linux; I hate being limited by what programs can do when I can write a script and fix the problem myself in 10 minutes"

Windows is a perfectly fine scripting environment, as long as you use Python :P


  • 2 weeks later...

c_redman,

Love the work you've done with zhtoolkit. It's really useful for parsing and making flashcards. I use iFlash. The only problem was that I sometimes want just the compound characters (since I already know most of the single characters). Then I found smarthanzi.net, which makes two lists: singles and compounds. But it uses numbers instead of tone marks, so I have to copy and paste the results back into zhtoolkit, then copy and paste into Wenlin and save as .txt (which automatically saves as Unicode), then import into iFlash (custom format: <break> <tab> <break>).

Excellent!

Now if the parse on zhtoolkit would let me eliminate known characters in the generated list it would make it easy (haha) to make (even) great flashcards.

Thanks

David
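
The two manual steps in that workflow (splitting singles from compounds, and turning tone numbers into tone marks) can also be sketched in Python. This is not how zhtoolkit or smarthanzi.net do it; it is a minimal sketch using the usual placement rule (the mark goes on a or e if present, on the o of "ou", otherwise on the last vowel), assuming one syllable per call, tone 5 for the neutral tone, and "u:" or "v" standing for ü. The function names are hypothetical.

```python
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def split_by_length(words):
    """Separate single-character words from multi-character compounds."""
    singles = [w for w in words if len(w) == 1]
    compounds = [w for w in words if len(w) > 1]
    return singles, compounds


def mark_syllable(syllable):
    """Convert one numbered syllable such as 'hao3' to 'hǎo' (lowercased)."""
    s = syllable.lower().replace("u:", "ü").replace("v", "ü")
    if not s or s[-1] not in "12345":
        return s  # no tone number: leave unchanged
    body, tone = s[:-1], int(s[-1])
    if tone == 5:
        return body  # neutral tone carries no mark
    if "a" in body:
        pos = body.index("a")
    elif "e" in body:
        pos = body.index("e")
    elif "ou" in body:
        pos = body.index("o")
    else:
        vowels = [i for i, ch in enumerate(body) if ch in "iouü"]
        if not vowels:
            return body
        pos = vowels[-1]
    return body[:pos] + TONE_MARKS[body[pos]][tone - 1] + body[pos + 1:]


def mark_pinyin(text):
    """Convert space-separated numbered syllables, e.g. 'xue2 xi2'."""
    return " ".join(mark_syllable(s) for s in text.split())
```

Note that this lowercases everything, so capitalized proper nouns would need separate handling.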

