querido Posted August 14, 2010 at 08:38 PM
I have a list of known words and a transcribed new lesson that is, let's say, already parsed into words. I would like a simpler, automated method (a script or program) for extracting the new-word list from the new lesson: just a text file with one unique new word per line. I have, or can think of, several ways to do it, but they always seem too complicated. What do you use? Thank you.
jbradfor Posted August 15, 2010 at 02:10 AM
I wrote a very simple gawk script. Works for me.
feihong Posted August 15, 2010 at 03:33 AM
(Edited because I misunderstood the OP's meaning)
This is what I do now:
1. Open the lesson transcript in Pleco Reader.
2. Click on the unrecognized character, and enter it into Pleco's flashcard database.
3. Export the flashcards to Anki.
This isn't as fast as a script, but it still saves me a lot of typing of definitions and pronunciations.
c_redman Posted August 15, 2010 at 10:47 AM
I have written a web application that may do what you need. Along with the unique words, it outputs pinyin and definitions from CC-CEDICT, the number of occurrences, HSK level, and some other useful stuff. http://zhtoolkit.com/apps/wordlist/create-list.cgi
BertR Posted August 15, 2010 at 12:12 PM
The website looks really interesting, but it seems to have problems with somewhat larger texts. I took the text from this page: http://www.5156edu.com/html/7396/2.html (it's the first page of 巴金's 家), copied it, submitted the query (without changing any other options), and got this error message:

Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, webmaster@zhtoolkit.com and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

I then tried only the first sentence, and that works.
c_redman Posted August 15, 2010 at 02:04 PM
I'm debugging it now. So far, it just seems to occur randomly: refreshing the error page returns a valid page the second time.
querido (Author) Posted August 15, 2010 at 05:17 PM
jbradfor: I was using perl or python or bash scripts that I had found and adapted, but in the adaptation (because I don't know what I'm doing) it became complicated. Also, I've just switched back to Windows from Linux and I dread having to dig back into those scripts, make them work in Windows, etc. Would you share your gawk script? I'll study it first before using it. Thank you.
feihong: Yes. I have Plecodict but haven't really decided how to use it yet. That mode appeals to me as *it fits naturally into reading time*.
c_redman: That's neat. It worked for my whole new lesson (500+ chars long). But I don't see how to input my list of known words so that it can be used incrementally. Also (less importantly), a plain text list or TSV instead of a table would be convenient.
c_redman Posted August 15, 2010 at 07:16 PM
It might not be what you are looking for, but you can do this:
1) Create a New Account
2) Check both checkboxes "Vocabulary list" and "Include option to mark known words"
3) Create output
4) Check the boxes for the words you know
5) Click on Add to Known Words
There isn't yet the ability to import lists of known words, but it's on my short list of things to add.
querido (Author) Posted August 15, 2010 at 09:34 PM
c_redman: "Invalid seenlist_id 'new' for user_id '143'"
c_redman Posted August 16, 2010 at 08:14 AM
Argh, what a stupid bug! It should be fixed now.
P.S. Once the list is created, you'll need to use the back button to get back to the form, and then refresh to see your list. The navigation controls are still simplistic at this point.
jbradfor Posted August 16, 2010 at 01:38 PM
@querido wrote: "Would you share your gawk script? I'll study it first before using it. Thank you."

It's really not much, but here it is. Its primary purpose is, given a list of words, to look up the pinyin and definition of each one and print them out. It also supports an optional list of words to ignore.

Usage:

    gawk -v "DICT=<dictionary file name>" -v "IGNORE=<ignore list>" -f script.awk <input file>

The dictionary is required to be in CCDICT format (and, in fact, I use CCDICT). The ignore list is read the same way, but only the first two fields of each line are used, so it doesn't need the pinyin or definition, although it's OK for them to be there. If a word isn't in the dictionary, the script will guess at the pinyin, but the guess is often wrong.

BTW, just in case you don't know already, "cygwin" is a decent replacement for Linux when running under Windows. It's not great, but I use it all the time for running my little gawk scripts.

    BEGIN {
        FS = " ";
        # "> 0" so that a missing or unreadable file ends the loop
        # (getline returns -1 on error, which would otherwise loop forever)
        while ( ( getline < DICT ) > 0 ) {
            # This test is here so in case of multiple entries, we pick the first
            if ( ! ( $1 in word ) ) {
                word[$1] = $0;
            }
            if ( ! ( $2 in word ) ) {
                word[$2] = $0;
            }
            #print $0;
        }
        if ( IGNORE != "" ) {
            while ( ( getline < IGNORE ) > 0 ) {
                ignore[$1] = $0;
                ignore[$2] = $0;
                #print $0;
            }
        }
    }

    {
        for ( i=1 ; i<=NF ; ++i ) {
            #printf "|" $i "|";
            if ( $i ~ /^[0-9a-zA-Z().一两百千“”/+(),。—、!::#?-]+$/ || $i == "" ) {
                # Punctuation, numbers, Latin text: do nothing
                #print "punc";
            } else if ( $i in word ) {
                if ( $i in seen || $i in ignore ) {
                    # Already printed, or on the ignore list: do nothing
                    #print "seen or ignore";
                } else {
                    seen[$i] = 13;
                    print word[$i];
                }
            } else {
                # Not in the dictionary: guess by looking up each character
                print "|" $i "| not found";
                trad = "";
                simp = "";
                pinyin = "";
                pinyinTxt = "";
                for ( j=1; j<=length($i) ; ++j ) {
                    #print substr($i, j, 1);
                    n = split(word[substr($i, j, 1)], array, " ");
                    trad = trad array[1];
                    simp = simp array[2];
                    # Strip the surrounding [ ] from the pinyin field
                    pinyin = pinyin pinyinTxt substr(array[3], 2, length(array[3])-2);
                    pinyinTxt = " ";
                }
                if ( pinyin != "" ) {
                    print trad " " simp " [" pinyin "] /NEED DEFINITION/";
                }
            }
        }
    }

@c_redman wrote: "There isn't yet the ability to import lists of known words, but it's on my short list of things to add."

It would also be great if you could provide pre-made lists of words to ignore, e.g. HSK lists, NPCR, so that a single click could ignore all of them.
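For anyone who would rather read Python than gawk, the core idea of the script (load a dictionary keyed on both the traditional and simplified forms, then print the entry for each word that is neither repeated nor on the ignore list) might be sketched roughly as follows. This is not a port of the script: the punctuation filter and the per-character pinyin guess are left out, the line layout `traditional simplified [pinyin] /definition/` is an assumption about the dictionary file, and the function names are made up for illustration.

```python
import re

# Assumed entry layout: "traditional simplified [pin1 yin1] /definition/"
ENTRY_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+(/.*/)\s*$")


def load_dictionary(lines):
    """Map traditional and simplified forms to the first matching entry line."""
    entries = {}
    for line in lines:
        if line.startswith("#"):  # skip comment lines
            continue
        match = ENTRY_RE.match(line)
        if not match:
            continue
        trad, simp = match.group(1), match.group(2)
        # setdefault keeps the first entry when a word appears more than once
        entries.setdefault(trad, line.strip())
        entries.setdefault(simp, line.strip())
    return entries


def lookup_new_words(words, entries, ignore=()):
    """Return (entry lines for new words, words missing from the dictionary)."""
    ignored = set(ignore)
    seen = set()
    found, missing = [], []
    for word in words:
        if not word or word in seen or word in ignored:
            continue
        seen.add(word)
        if word in entries:
            found.append(entries[word])
        else:
            missing.append(word)
    return found, missing
```

Both functions take plain lists of strings, so the dictionary and ignore list can come from files opened with `encoding="utf-8"` or from anywhere else.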
querido (Author) Posted August 18, 2010 at 02:49 PM
Thank you, you two. I don't have time to evaluate/follow up right now.
querido (Author) Posted September 1, 2010 at 10:01 PM
O.K., I've figured out how to make the tools I have do what I want. I need that instead of scripts, etc., for the same reason I switched from Linux back to Windows: when I'm studying Chinese (or one of my other interests) I can't stand these little technical issues any more.

Solution: Just let a flashcard program remove duplicates. In case someone needs a howto:
1. Parse into words with your favorite tool.
2. Use find+replace-all to arrange the words one per line (it isn't necessary to remove duplicates).
3. Import into e.g. Anki, which will reject the duplicates.
4. Tag them, e.g., "Lesson_01". You could then export these as a list of unique words if you want to.
5. When you do the same for the next text, that rejection of duplicates will cause only the unique new words to form new cards, "Lesson_02", etc.

There only remains remembering the code to put into the find+replace box to insert a carriage return. Wenlin didn't like the codes I googled, so I just cut and paste in a carriage return.

(P.S. for anyone who doesn't know this yet, Plecodict will fill out the cards automatically too, using the dictionary of your choice.)
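The same "reject what is already known" idea can be approximated without a flashcard program. Below is a minimal Python sketch, assuming the lesson has already been segmented into whitespace-separated words (one per line works too). The function names and any file names passed in are hypothetical, and this only mirrors the duplicate-rejection step, not anything Anki actually does internally.

```python
from pathlib import Path


def new_words(lesson_words, known):
    """Return lesson words not in `known`, in first-seen order, without duplicates."""
    seen = set(known)
    fresh = []
    for word in lesson_words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            fresh.append(word)
    return fresh


def read_words(path):
    """Read whitespace-separated words; a missing file counts as an empty list."""
    p = Path(path)
    if not p.exists():
        return []
    # utf-8-sig also accepts files saved with a byte-order mark
    return p.read_text(encoding="utf-8-sig").split()


def process_lesson(lesson_path, known_path, out_path):
    """Write the lesson's new words one per line and add them to the known list."""
    known = read_words(known_path)
    fresh = new_words(read_words(lesson_path), known)
    Path(out_path).write_text("".join(w + "\n" for w in fresh), encoding="utf-8")
    # Rewrite the known list so that it always ends up one word per line
    Path(known_path).write_text(
        "".join(w + "\n" for w in known + fresh), encoding="utf-8"
    )
    return fresh
```

Calling `process_lesson("lesson_02.txt", "known_words.txt", "lesson_02_new.txt")` (hypothetical file names) for each new lesson would then play the role of steps 3 to 5: only words not seen in earlier lessons end up in the output file.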
jbradfor Posted September 7, 2010 at 07:51 PM
querido wrote: "I need that instead of scripts, etc, for the same reason I switched from Linux back to Windows- when I'm studying Chinese (or one of my other interests) I can't stand these little technical issues any more."

And that's why I'm happy with Linux; I hate being limited by what programs can do when I can write a script and fix the problem myself in 10 minutes.
feihong Posted September 7, 2010 at 08:32 PM
jbradfor wrote: "And that's why I'm happy with Linux; I hate being limited by what programs can do when I can write a script and fix the problem myself in 10 minutes"

Windows is a perfectly fine scripting environment, as long as you use Python.
daodeyao Posted September 16, 2010 at 04:41 AM
c_redman,
Love the work you've done with zhtoolkit. Really useful for parsing and making flashcards. I use iFlash. The only problem was that I sometimes just want compound words (since I already know most of the single characters). Then I found smarthanzi.net, which makes 2 lists, singles and compounds. But it uses numbers instead of tone marks, so I have to copy and paste the results back into zhtoolkit, then copy and paste into Wenlin and save as .txt (it automatically saves as Unicode), then import into iFlash (custom format: <break> <tab> <break>). Excellent!
Now if the parse on zhtoolkit would let me eliminate known characters in the generated list, it would make it easy (haha) to make (even) great flashcards.
Thanks
David
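The singles/compounds split and the "drop characters I already know" filter described here are easy to express once the text is a list of words. A minimal Python sketch, assuming the words have already been segmented by some other tool; the function name and the `known_chars` parameter are hypothetical, and this does not reproduce how smarthanzi.net or zhtoolkit work.

```python
def split_singles_compounds(words, known_chars=()):
    """Split words into single characters and multi-character compounds.

    Duplicates are dropped, and single characters in known_chars are omitted.
    """
    known = set(known_chars)
    seen = set()
    singles, compounds = [], []
    for word in words:
        if not word or word in seen:
            continue
        seen.add(word)
        if len(word) == 1:
            if word not in known:
                singles.append(word)
        else:
            compounds.append(word)
    return singles, compounds
```

Because `known_chars` can be any iterable of characters, a plain string of known characters works as well as a list read from a file.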
valikor Posted October 6, 2010 at 09:26 AM
I second that. Thanks c_redman for the great work. Looks like a fantastic tool. Keep up the good work.
-David