jbradfor Posted November 19, 2009 at 11:32 PM (edited)

I've been looking for a tool to generate word lists from a text. I know there are many out there, but I've never found one that meets all my criteria:

Takes a text file as input.
Creates a text file as output that is a word list from which I can build a vocab list. [I've seen web-based tools that will show the output on screen, but from there it's a major effort to feed it into my flashcard program.]
Allows me to specify a list of words to exclude, so words I already know won't get included.

So I finally got around to creating one. It's really, really ugly, hacked together in a fit of pique in about 30 minutes. If people like it, I'll try to improve it. Currently it's probably Linux only, sorry Windows users [unless you install gawk and/or Cygwin].

The first step is to segment the file, i.e. convert the sequence of characters into words. For this, I took Erik Peterson's tool, available here: http://www.mandarintools.com/segmenter.html, and modified it to print each word on a new line (rather than separated by spaces) and, for each word, to print a number indicating what type of object it is (e.g. Chinese word, number, whitespace, other). Thanks to Erik for doing 99% of the work!

The next step is to take this segmented file and, for the Chinese words only (hence the need to identify each object), create the word-list entries. To do this, I wrote a simple gawk script that reads in a dictionary, reads in an "exclude" list, and, the first time each Chinese word is seen, looks up the full entry for that word in the dictionary and outputs it. (A rough sketch of that logic is at the end of this post.)

The files you need to provide are:

dict.txt -- the dictionary. I recommend downloading CEDICT.
ignore.txt -- a list of words to ignore (presumably because you already know them).
sample.txt -- the text you want to create a word list for.

dict.txt and ignore.txt are required to be in CEDICT format. Assuming the above, you would run:

gawk -f gen_lex.awk dict.txt
java segmenter.class -8 sample.txt
gawk -f gen_list.awk sample.txt.seg > list.txt

The first line generates bothlexu8.txt, tradlexu8.txt, and simplexu8.txt, the three input dictionary files for segmenter.class. You need to run it only once, whenever you have a new dict.txt file.

Let me know what you think!

EDITS: YAWLG 1.1: minor bug fix to gen_lex.awk; forgot to create bothlexu8.txt

YAWLG-1.1.zip

Edited November 20, 2009 at 06:14 PM by jbradfor
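Since the post describes what gen_list.awk does rather than showing it, here is a minimal sketch of that logic. This is not the script from the attached zip; the CEDICT field layout, the type code 1 for "Chinese word", and the file name gen_list_sketch.awk are all assumptions made for illustration.

# gen_list_sketch.awk -- NOT the author's gen_list.awk; a minimal sketch of the idea.
# Assumptions: dict.txt and ignore.txt are CEDICT-format lines
# ("traditional simplified [pin1 yin1] /definition/..."), keyed here on the
# simplified form (field 2), and the segmenter output has one word per line
# followed by a type code, with 1 (assumed) marking a Chinese word.

BEGIN {
    # Load the dictionary; the first entry wins if a word appears twice.
    while ((getline line < "dict.txt") > 0) {
        if (line ~ /^#/) continue            # skip CEDICT header comments
        split(line, f, " ")
        if (!(f[2] in dict)) dict[f[2]] = line
    }
    close("dict.txt")

    # Load the exclude list of already-known words.
    while ((getline line < "ignore.txt") > 0) {
        if (line ~ /^#/) continue
        split(line, f, " ")
        ignore[f[2]] = 1
    }
    close("ignore.txt")
}

# Main input: the segmented text, one word (and its type code) per line.
$2 == 1 {                                    # assumed code for "Chinese word"
    w = $1
    if (w in seen || w in ignore) next       # first occurrence only, skip known words
    seen[w] = 1
    if (w in dict) print dict[w]             # emit the full CEDICT entry
}

You would run it the same way as the real script, e.g. gawk -f gen_list_sketch.awk sample.txt.seg > list.txt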
gato Posted November 20, 2009 at 03:36 AM

NJStar can create a vocab list from a text file. The trial version you can download from their website can be used indefinitely, as far as I can tell. I think the registered version provides better fonts for printing, but if you don't care about printing, the trial version seems sufficient. Of course, you should probably pay just to be a good citizen.

http://www.njstar.com/cms/njstar-chinese-word-processor

New Language Study Functions
Version 5.01 has introduced a "study list" for vocabulary study. A "Word Annotation" function is also introduced in this new release. It searches the NJStar dictionary and annotates Pinyin spellings and English meanings at the end of each paragraph.
jbradfor Posted November 20, 2009 at 03:32 PM

Hmm, I tried NJStar at one point, and it didn't do everything I wanted; I don't remember exactly why. Maybe they've fixed it in a newer version.

Anyway, here's an example of the tool in action. I took the transcript from Slow Chinese lesson 29 (http://www.slow-chinese.com/?p=751), created an exclude file consisting of the 500 most frequent characters from ZDT plus an HSK A wordlist from somewhere, and generated a list of words to learn.

ignore.txt
sample.txt
list.txt
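For anyone reproducing this example: since ignore.txt only has to be in CEDICT format, one way to build it is simply to concatenate the exported known-word files before running the usual commands. The input filenames below are made up for illustration:

cat zdt_top500.u8 hsk_a_wordlist.u8 > ignore.txt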