Yet Another Word List Generator

November 19, 2009 at 11:32 PM

I've been looking for a tool to generate word lists from a text. I know there are many out there, but I've never found one that meets all my criteria.

Take a text file as input.
Create a text file as output containing a word list from which I can create a vocab list. [I've seen web-based tools that display the output on screen, but from there it's a major effort to feed it into my flashcard program.]
Allow me to specify a list of words to exclude, so words I already know won't get included.

So I finally got around to creating one. It's really ugly, hacked together in a fit of pique in about 30 minutes. If people like it, I'll try to improve it. Currently it's probably Linux-only, sorry Windows users. [Unless you install gawk and/or Cygwin.]

The first step is to segment the file, i.e. convert the sequence of characters into words. For this, I took Erik Peterson's tool, available here: http://www.mandarintools.com/segmenter.html and modified it to print each word on a new line (rather than separated by spaces) and, for each word, print a number indicating what type of object it is (e.g. Chinese word, number, whitespace, other). Thanks to Erik for doing 99% of the work!
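To illustrate the output format described above, here is a minimal sketch in Python (not the Java the segmenter is written in). It only tags already-segmented tokens with a type number and puts one on each line; it does not do the segmentation itself. The numeric codes and the tab separator are assumptions, since the post doesn't say which values the modified segmenter prints.

```python
# Hypothetical type codes: the actual numbers printed by the modified
# segmenter are not given in the post.
CHINESE, NUMBER, WHITESPACE, OTHER = 1, 2, 3, 4


def classify(token):
    """Return a type code for one already-segmented token."""
    if not token:
        return OTHER
    if token.isspace():
        return WHITESPACE
    if token.isdigit():
        return NUMBER
    # Treat a token as a Chinese word if every character falls in the
    # main CJK Unified Ideographs block (a simplification).
    if all("\u4e00" <= ch <= "\u9fff" for ch in token):
        return CHINESE
    return OTHER


def format_tokens(tokens):
    """Produce one 'code<TAB>token' line per token (assumed layout)."""
    return [f"{classify(t)}\t{t}" for t in tokens]
```

Tagging each token this way is what lets the later step pick out only the Chinese words and skip everything else.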

The next step is to take this file and, for the Chinese words only (hence the need to identify each object), create an entry for each one. To do this, I wrote a simple gawk script that reads in a dictionary and an "exclude" list and, the first time each Chinese word is seen, looks up the full entry for that word in the dictionary and outputs it, unless the word is on the exclude list.
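The logic of that step can be sketched roughly as follows. This is Python rather than gawk, and it is not the actual gen_list.awk: the 'code<TAB>word' input layout, the code "1" for Chinese words, and the function names are all assumptions made for illustration.

```python
def load_cedict(lines):
    """Map each headword (traditional and simplified) to its full line.

    If a word has several entries, only the first one is kept.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        trad, simp = parts[0], parts[1]
        entries.setdefault(trad, line)
        entries.setdefault(simp, line)
    return entries


def build_word_list(seg_lines, dictionary, ignored, chinese_code="1"):
    """Return dictionary entries for new Chinese words, in order of first use."""
    seen = set()
    result = []
    for line in seg_lines:
        # Assumed layout: type code, a tab, then the word.
        code, _, word = line.rstrip("\n").partition("\t")
        if code != chinese_code or word in seen:
            continue
        seen.add(word)
        if word in ignored or word not in dictionary:
            continue
        result.append(dictionary[word])
    return result
```

Since ignore.txt is also in CEDICT format, the exclude set can be built with the same loader, e.g. set(load_cedict(ignore_lines)).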

The files you need to provide are:

dict.txt -- the dictionary. I recommend downloading CEDICT
ignore.txt -- a list of words to ignore (presumably because you already know them)
sample.txt -- the text you want to create a word list for

dict.txt and ignore.txt are required to be in CEDICT format.
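For anyone who hasn't seen it, a CEDICT line looks like "學習 学习 [xue2 xi2] /to learn/to study/": traditional form, simplified form, pinyin in square brackets, then slash-separated definitions. Lines starting with # are comments. Here is a small Python sketch of pulling those fields apart; the function name and the returned field names are mine, not part of YAWLG.

```python
import re

# traditional, simplified, [pinyin], /definition/definition/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def parse_cedict_line(line):
    """Split one CEDICT line into fields; return None for comments or bad lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    trad, simp, pinyin, defs = match.groups()
    return {
        "traditional": trad,
        "simplified": simp,
        "pinyin": pinyin,
        "definitions": defs.split("/"),
    }
```

Something like this could also be handy for turning list.txt into whatever layout your flashcard program imports.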

Assuming the above, you would run

gawk -f gen_lex.awk dict.txt
java segmenter -8 sample.txt
gawk -f gen_list.awk sample.txt.seg > list.txt

The first command generates bothlexu8.txt, tradlexu8.txt, and simplexu8.txt, the three input dictionary files for segmenter.class. You only need to rerun it when you have a new dict.txt file.

Let me know what you think!

EDITS:

YAWLG 1.1: minor bug fix to gen_lex.awk; forgot to create bothlexu8.txt

YAWLG-1.1.zip

Edited November 20, 2009 at 06:14 PM by jbradfor

November 20, 2009 at 03:36 AM

NJStar can create a vocab list from a text file. As far as I can tell, the trial version you can download from their website can be used indefinitely. I think the registered version provides better fonts for printing, but if you don't care about printing, the trial version seems sufficient. Of course, you should probably pay just to be a nice citizen.

http://www.njstar.com/cms/njstar-chinese-word-processor

New Language Study Functions

Version 5.01 has introduced a "study list" for vocabulary study. A "Word Annotation" function is also introduced in this new release. It searches NJStar dictionary and annotates Pinyin spellings and English meanings at the end of each paragraph.

November 20, 2009 at 03:32 PM

Hmm, I tried NJStar at one point, and it didn't do everything I wanted. I don't remember exactly why. Maybe they fixed it in a newer version.

Anyway, here's an example of the tool in action. I took the transcript from Slow Chinese lesson 29 (http://www.slow-chinese.com/?p=751), created an exclude file consisting of the 500 most frequent characters from ZDT plus an HSK A word list from somewhere, and generated a list of words to learn.

ignore.txt

sample.txt

list.txt