c_redman · Posted September 24, 2011 at 03:24 AM

Hi everyone. I've just released a new Windows program that takes Chinese text of any size, splits it into words, and creates a summary of the vocabulary. Some of you may be familiar with a similar online web page I made to do the same task; however, it was straining the shared hosting provider to the point of crashing the page. This program works offline, is much more reliable, and can now handle both simplified and traditional text. It also makes it easy to update the dictionary to the most recent CC-CEDICT, or to replace or complement it with another dictionary.

The first screenshot shows the results page, after the text is entered or loaded and analyzed. It looks quite ugly in this format, but it is really intended to be copied and pasted into Excel or another spreadsheet program for further work. The second screenshot shows the preferences page, which sums up what the program is capable of.
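For readers curious what "split it into words" involves: the usual approach for tools like this is dictionary-driven segmentation. Below is a minimal Python sketch using greedy longest-match against a CC-CEDICT word list. It illustrates the general technique only, not necessarily how Chinese Word Extractor works internally; "cedict_ts.u8" is the standard CC-CEDICT distribution filename, and "sample.txt" is a placeholder.

from collections import Counter

def load_cedict_words(path):
    # CC-CEDICT entries look like:
    #   傳統 传统 [chuan2 tong3] /tradition/...
    # so the first two space-separated fields are the traditional
    # and simplified headwords.
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip header/license comments
            parts = line.split(" ", 2)
            if len(parts) >= 2:
                words.add(parts[0])  # traditional form
                words.add(parts[1])  # simplified form
    return words

def segment(text, words, max_len=8):
    # Greedy longest match: at each position, take the longest
    # dictionary word that fits; fall back to a single character.
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in words:
                tokens.append(candidate)
                i += length
                break
    return tokens

# The vocabulary summary is then just each distinct token with a count.
words = load_cedict_words("cedict_ts.u8")
text = open("sample.txt", encoding="utf-8").read()
vocab = Counter(segment(text, words))
for word, count in vocab.most_common(20):
    print(word, count)

Greedy matching occasionally segments wrongly where a long dictionary word straddles a phrase boundary, which is one reason real tools layer heuristics on top, but it recovers the bulk of the vocabulary in practice.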
stoney · Posted September 24, 2011 at 03:45 AM

Thank you very much.
heifeng · Posted September 24, 2011 at 04:54 AM

Pimp! This worked well as a quick sanity check for use along with my Putonghua thread here... a much easier format to work with compared to what I was doing before. Thanks!
Ole · Posted September 24, 2011 at 05:59 AM

I like to use your online tool. Thank you.
creamyhorror · Posted September 24, 2011 at 07:15 AM

Thank you, sir. Your online tool is awesome - I especially like the log-frequency statistic it provides. We've had a few tools somewhat like this before, and one useful feature is the ability to exclude/ignore known words or characters. It looks like your tool has this as well, and thus it almost supersedes the existing tools. Very impressive.
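The exclude-known-words step is also easy to reproduce outside the program. A hedged Python sketch, assuming a plain UTF-8 file with one known word per line (that format is my assumption, not the tool's actual exclusion-list format):

def unknown_only(vocab_counts, known_words_path):
    # Drop every word the reader has marked as known, keeping
    # only the vocabulary that still needs study.
    with open(known_words_path, encoding="utf-8") as f:
        known = {line.strip() for line in f if line.strip()}
    return {w: n for w, n in vocab_counts.items() if w not in known}

Applied to the Counter from the segmentation sketch above, this leaves only the words still to be learned.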
querido · Posted September 24, 2011 at 09:23 AM

Oh goodness gracious thank you!
c_redman (author) · Posted September 24, 2011 at 02:03 PM

creamyhorror said: "Thank you sir. Your online tool is awesome - I especially like the log-frequency statistic it provides."

Unfortunately, I wasn't able to include the word frequency statistics from the Lancaster Corpus in this program. I use it personally, but it wasn't clear from their distribution license that I could include it, so I erred on the side of caution. If there is an obviously free word frequency list, I can include it. Meanwhile, anyone can easily add in any statistics they have access to. Character frequency lists, however, are plentiful. I just wanted to get this basic program out first, and I'll add that data the first chance I get.
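"Adding in your own statistics" can be done with a few lines of script. Here is a sketch of one way to merge an external frequency list into the vocabulary output, assuming a tab-separated file of word/per-million pairs; the format and all names here are my assumptions, so adjust the parsing to whatever list you actually have.

import math

def attach_frequencies(vocab, freq_list_path):
    # Build a word -> frequency-per-million lookup from the list.
    freq = {}
    with open(freq_list_path, encoding="utf-8") as f:
        for line in f:
            word, per_million = line.rstrip("\n").split("\t")
            freq[word] = float(per_million)
    # Extend each vocabulary row with the raw and log10 frequency;
    # words absent from the list get None rather than a fake value.
    rows = []
    for word, count in vocab.items():
        fpm = freq.get(word)
        log_fpm = math.log10(fpm) if fpm else None
        rows.append((word, count, fpm, log_fpm))
    return rows

Because the rows come back in the same order they went in, the new columns can be pasted straight back into a spreadsheet alongside the original ones, which is also roughly what stephanhodges asks about two posts down.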
bunny87 · Posted September 25, 2011 at 08:50 AM

I want it, BUT IT WON'T OPEN THE WEBSITE FOR ME. $& YOU, FIREWALL.
stephanhodges · Posted September 25, 2011 at 04:08 PM

c_redman said: "Unfortunately, I wasn't able to include the word frequency statistics from the Lancaster Corpus in this program. I use it personally, but it wasn't clear from their distribution license that I could include it, so I erred on the side of caution."

Would it be possible to upload the column of words extracted from the analysis using the offline program, and have your site generate the word frequency statistics (in the same order, so that they could be added back as a new column)?
stephanhodges · Posted September 25, 2011 at 04:16 PM

I have looked up the Lancaster Corpus online and read the current license page (http://www.ota.ox.ac.uk/scripts/download.php?otaid=2474). I believe that if you add a provision for people to get their own copy (by submitting an email address and agreeing to their terms for private use), you could incorporate the ability to use it, if present. This model is used by many other software programs.
Gleaves · Posted September 28, 2011 at 05:54 PM

I've used your online tool quite a few times in the past. Thanks for the continued work on these tools.
c_redman (author) · Posted October 3, 2011 at 02:13 PM

I got this working on Ubuntu Linux, although it's not a tidy package at this point. Here are the commands I used:

lsb_release -a
# No LSB modules are available.
# Distributor ID: Ubuntu
# Description:    Ubuntu 11.04
# Release:        11.04
# Codename:       natty

wget http://apt.wxwidgets.org/key.asc -O - | sudo apt-key add -
# if it waits without going back to the prompt, it's waiting for the sudo password

sudo nano /etc/apt/sources.list
# add these lines (replace "natty" with the appropriate codename
# if you're on a different Ubuntu version):
#
#   # wxWidgets/wxPython repository at apt.wxwidgets.org
#   deb http://apt.wxwidgets.org/ natty-wx main
#   deb-src http://apt.wxwidgets.org/ natty-wx main

sudo apt-get update
sudo apt-get install python-wxgtk2.8 python-wxtools wx2.8-i18n
sudo apt-get install subversion
sudo apt-get install python-chardet

mkdir ~/CWE
cd ~/CWE
svn export http://svn.zhtoolkit.com/ChineseWordExtractor/trunk/
cd trunk
python main.py &
Hugh · Posted October 15, 2011 at 07:32 AM

I also have it working nicely in Linux Mint. Already produced a vocab list for the Analects! Thanks very much.
laurenth · Posted October 20, 2011 at 12:58 PM

Hello c_redman,

First, thanks a lot for your very useful tool. I've downloaded and tried the Windows version, which works flawlessly. I would like to use it in my never-ending quest for reading material that is slightly above my current level. To that end, I tried to use the new HSK 4 list as a filter, in the hope of finding texts that contain relatively few words not included in that list. However, it turns out to be harder than expected, as there are quite a lot of very simple words that are not in the HSK 4 list, either because they are too simple by themselves (天, 手…) or because they are transparent expressions or compounds (第一次, 跑出去…). So I'm trying to compile a list of such words, but it might take a long time... Any ideas on how to simplify this? Maybe a function that would allow excluding all words whose frequency per million is above x?

Also, there is a feature of the online tool which I don't find in the local version: the meanU(log10) statistic. Yes, it's only a very rough estimate, but I have tested it with a range of simple, intermediate, and difficult texts, and it gives a useful indication of the degree of difficulty. Any plans to incorporate it into the offline version? On the other hand, I could actually *read* the text and see for myself whether it's difficult :-).

Anyway, just a few questions... Thanks again,
Laurenth
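Neither the frequency cutoff laurenth proposes nor the meanU(log10) statistic is spelled out in the thread, but both are straightforward to approximate. A speculative Python sketch follows; the formula, the floor value, and every name here are my guesses, not the site's actual implementation.

import math

def mean_log10_frequency(tokens, freq_per_million, floor=0.01):
    # One plausible reading of a meanU(log10)-style score: the average
    # log10 per-million frequency over all tokens in the text. Harder
    # texts use rarer words, so they score lower. `floor` is an assumed
    # stand-in frequency for words missing from the list entirely.
    logs = [math.log10(freq_per_million.get(t, floor)) for t in tokens]
    return sum(logs) / len(logs) if logs else 0.0

def drop_frequent(vocab, freq_per_million, cutoff):
    # laurenth's proposed filter: hide every word whose frequency per
    # million exceeds `cutoff`, leaving only the rarer words to review.
    return {w: n for w, n in vocab.items()
            if freq_per_million.get(w, 0.0) <= cutoff}

The cutoff filter would sidestep hand-building a "too simple for HSK 4 but obviously known" list, since transparent compounds like 第一次 tend to have high per-million counts.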