
Yet Another Word List Generator


jbradfor


I've been looking for a tool to generate word lists from a text. I know there are many out there, but I've never found one that meets all my criteria:

  • Take a text file as input
  • Create a text file as output containing a word list from which I can build a vocab list. [I've seen web-based tools that display the output on screen, but from there it's a major effort to feed it into my flashcard program.]
  • Let me specify a list of words to exclude, so that words I already know don't get included

So I finally got around to creating one. It's really, really ugly, hacked together in a fit of pique in about 30 minutes. If people like it, I'll try to improve it. Currently it's probably Linux only; sorry, Windows users. [Unless you install gawk and/or Cygwin.]

The first step is to segment the file, i.e. convert the sequence of characters into words. For this, I took Erik Peterson's tool, available here: http://www.mandarintools.com/segmenter.html, and modified it to print each word on a new line (rather than separated by spaces) and, for each word, to print a number indicating what type of object it is (e.g. Chinese word, number, whitespace, other). Thanks to Erik for doing 99% of the work!
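To make the "one word per line, plus a type number" idea concrete, here is a minimal Python sketch. It is not Erik's segmenter or my modification of it: it uses plain forward maximum matching against a lexicon as a stand-in technique, and the type codes (1 to 4), the function names, and the `max_len` parameter are all assumptions made up for illustration.

```python
# Hypothetical type codes; the real numbers printed by the modified
# segmenter are not given in this post.
CHINESE, NUMBER, WHITESPACE, OTHER = 1, 2, 3, 4


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def segment(text, lexicon, max_len=4):
    """Yield (token, type_code) pairs using forward maximum matching."""
    i = 0
    while i < len(text):
        ch = text[i]
        if is_cjk(ch):
            # Try the longest candidate first; a single character always matches.
            for size in range(min(max_len, len(text) - i), 0, -1):
                cand = text[i:i + size]
                if size == 1 or cand in lexicon:
                    yield cand, CHINESE
                    i += size
                    break
        elif ch.isdigit():
            # Group a run of digits into one NUMBER token.
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            yield text[i:j], NUMBER
            i = j
        elif ch.isspace():
            # Group a run of whitespace into one WHITESPACE token.
            j = i
            while j < len(text) and text[j].isspace():
                j += 1
            yield text[i:j], WHITESPACE
            i = j
        else:
            # Punctuation, Latin letters, etc. are emitted one character at a time.
            yield ch, OTHER
            i += 1


if __name__ == "__main__":
    # One token per line followed by its type code (whitespace shown via repr).
    for token, code in segment("中国人民 2010!", {"中国", "人民"}):
        print(repr(token), code)
```

The type code is what lets the next stage keep only the Chinese words and throw away everything else.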

The next step is to take this file and, for the Chinese words only (hence the need to identify each object), create a word-list entry. To do this, I wrote a simple gawk script that reads in a dictionary and an "exclude" list and, the first time each Chinese word is seen, looks up the full entry for that word in the dictionary and outputs it.
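The logic of that step can be sketched in a few lines of Python. This is an illustration only, not the gawk script itself: the function name, the (word, type code) input shape, the type code 1 for Chinese words, and the choice to silently skip words missing from the dictionary are all assumptions.

```python
def build_word_list(segmented, dictionary, ignore, chinese_code=1):
    """Return dictionary entries for new Chinese words, in order of first appearance.

    segmented  -- iterable of (word, type_code) pairs from the segmenting step
    dictionary -- mapping from headword to its full dictionary line
    ignore     -- set of headwords to exclude (words already known)
    """
    seen = set()
    entries = []
    for word, code in segmented:
        if code != chinese_code:
            continue  # skip numbers, whitespace, and other non-Chinese objects
        if word in seen or word in ignore:
            continue  # only the first occurrence of a non-excluded word counts
        seen.add(word)
        entry = dictionary.get(word)
        if entry is not None:  # assumption: words not in the dictionary are dropped
            entries.append(entry)
    return entries


if __name__ == "__main__":
    dictionary = {
        "中国": "中國 中国 [Zhong1 guo2] /China/",
        "人民": "人民 人民 [ren2 min2] /the people/",
    }
    segmented = [("中国", 1), ("人民", 1), (" ", 3), ("中国", 1), ("2010", 2)]
    for line in build_word_list(segmented, dictionary, ignore={"人民"}):
        print(line)
```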

The files you need to provide are:

  • dict.txt -- the dictionary. I recommend downloading CEDICT
  • ignore.txt -- a list of words to ignore (presumably because you already know them)
  • sample.txt -- the text you want to create a word list for

Both dict.txt and ignore.txt must be in CEDICT format.
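For anyone who hasn't looked inside a CEDICT file: each entry is one line of the form "Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/", and lines starting with # are comments. A rough Python sketch of reading that format follows; the function names and the returned field names are made up for illustration, and the gawk scripts do not necessarily parse it this way.

```python
import re

# Traditional, Simplified, [pinyin], then slash-delimited glosses.
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Parse one CEDICT line into a dict, or return None for comments and junk."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "glosses": glosses.split("/"),
    }


def load_headwords(lines):
    """Collect both the traditional and simplified headwords from CEDICT lines."""
    words = set()
    for line in lines:
        entry = parse_cedict_line(line)
        if entry is not None:
            words.add(entry["traditional"])
            words.add(entry["simplified"])
    return words


if __name__ == "__main__":
    print(parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"))
```

Since ignore.txt uses the same format, something like `load_headwords(open("ignore.txt", encoding="utf-8"))` would give the set of words to exclude.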

Assuming the above, you would run

gawk -f gen_lex.awk dict.txt
java segmenter -8 sample.txt
gawk -f gen_list.awk sample.txt.seg > list.txt

The first line generates bothlexu8.txt, tradlexu8.txt, and simplexu8.txt, the three dictionary files that segmenter.class takes as input. You only need to run it once for a given dict.txt, and again only when you get a new dict.txt file.

Let me know what you think!

EDITS:

YAWLG 1.1: minor bug fix to gen_lex.awk; forgot to create bothlexu8.txt

YAWLG-1.1.zip


NJStar can create a vocab list from a text file. As far as I can tell, the trial version you can download from their website can be used indefinitely. I think the registered version provides better fonts for printing, but if you don't care about printing, the trial version seems sufficient. Of course, you should probably pay just to be a good citizen.

http://www.njstar.com/cms/njstar-chinese-word-processor

New Language Study Functions

Version 5.01 has introduced a "study list" for vocabulary study. A "Word Annotation" function is also introduced in this new release. It searches NJStar dictionary and annotates Pinyin spellings and English meanings at the end of each paragraph.


Hmm, I tried NJStar at one point, and it didn't do all that I wanted. I don't remember exactly why. Maybe they fixed it in a newer version.

Anyway, here's an example of the tool in action. I took the transcript from Slow Chinese lesson 29 (http://www.slow-chinese.com/?p=751), created an exclude file consisting of the 500 most frequent characters from ZDT plus an HSK A word list from somewhere, and generated a list of words to learn.

ignore.txt

sample.txt

list.txt
