c_redman · Posted September 24, 2011 at 03:24 AM

Hi everyone. I've just released a new Windows program that takes Chinese text of any size, splits it into words, and creates a summary of the vocabulary. Some of you may be familiar with a similar online web page I made to do the same task; however, it was straining the shared hosting provider to the point of crashing the page. This program works offline, is much more reliable, and can now handle both simplified and traditional text. It also makes it easy to update the dictionary to the most recent CC-CEDICT, or to replace or complement it with another dictionary.

The first screenshot shows the results page, after the text is entered or loaded and analyzed. It looks quite ugly in this format, but it is really intended to be copied and pasted into Excel or another spreadsheet program for further work. The second screenshot shows the preferences page, which sums up what the program is capable of.
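For readers curious what "split it into words" involves: the usual approach for tools like this is dictionary-driven segmentation. Below is a minimal Python sketch using greedy longest-match against a CC-CEDICT word list. It illustrates the general technique only, not necessarily how Chinese Word Extractor works internally; "cedict_ts.u8" is the standard CC-CEDICT distribution filename, and "sample.txt" is a placeholder.

from collections import Counter

def load_cedict_words(path):
    # CC-CEDICT entries look like:
    #   傳統 传统 [chuan2 tong3] /tradition/...
    # so the first two space-separated fields are the traditional
    # and simplified headwords.
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip header/license comments
            parts = line.split(" ", 2)
            if len(parts) >= 2:
                words.add(parts[0])  # traditional form
                words.add(parts[1])  # simplified form
    return words

def segment(text, words, max_len=8):
    # Greedy longest match: at each position, take the longest
    # dictionary word that fits; fall back to a single character.
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in words:
                tokens.append(candidate)
                i += length
                break
    return tokens

# The vocabulary summary is then just each distinct token with a count.
words = load_cedict_words("cedict_ts.u8")
text = open("sample.txt", encoding="utf-8").read()
vocab = Counter(segment(text, words))
for word, count in vocab.most_common(20):
    print(word, count)

Greedy matching occasionally segments wrongly where a long dictionary word straddles a phrase boundary, which is one reason real tools layer heuristics on top, but it recovers the bulk of the vocabulary in practice.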
stoney · Posted September 24, 2011 at 03:45 AM

Thank you very much.
heifeng · Posted September 24, 2011 at 04:54 AM

Pimp! This worked well as a quick sanity check for use along with my Putonghua thread here... a much easier format to work with compared to what I was doing before. Thanks!
Ole · Posted September 24, 2011 at 05:59 AM

I like to use your online tool. Thank you.
creamyhorror · Posted September 24, 2011 at 07:15 AM

Thank you, sir. Your online tool is awesome - I especially like the log-frequency statistic it provides. We've had a few tools somewhat like this before, and one useful feature is the ability to exclude/ignore known words or characters. It looks like your tool has this as well, and thus it almost supersedes the existing tools. Very impressive.
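The exclude-known-words step is also easy to reproduce outside the program. A hedged Python sketch, assuming a plain UTF-8 file with one known word per line (that format is my assumption, not the tool's actual exclusion-list format):

def unknown_only(vocab_counts, known_words_path):
    # Drop every word the reader has marked as known, keeping
    # only the vocabulary that still needs study.
    with open(known_words_path, encoding="utf-8") as f:
        known = {line.strip() for line in f if line.strip()}
    return {w: n for w, n in vocab_counts.items() if w not in known}

Applied to the Counter from the segmentation sketch above, this leaves only the words still to be learned.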
querido · Posted September 24, 2011 at 09:23 AM

Oh goodness gracious thank you!
c_redman (author) · Posted September 24, 2011 at 02:03 PM

creamyhorror said: "Thank you sir. Your online tool is awesome - I especially like the log-frequency statistic it provides."

Unfortunately, I wasn't able to include the word frequency statistics from the Lancaster Corpus in this program. I use it personally, but it wasn't clear from their distribution license that I could include it, so I erred on the side of caution. If there is an obviously free word frequency list, I can include it. Meanwhile, anyone can easily add in any statistics they have access to. Character frequency lists, however, are plentiful. I just wanted to get this basic program out first, and I'll add that data the first chance I get.
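"Adding in your own statistics" can be done with a few lines of script. Here is a sketch of one way to merge an external frequency list into the vocabulary output, assuming a tab-separated file of word/per-million pairs; the format and all names here are my assumptions, so adjust the parsing to whatever list you actually have.

import math

def attach_frequencies(vocab, freq_list_path):
    # Build a word -> frequency-per-million lookup from the list.
    freq = {}
    with open(freq_list_path, encoding="utf-8") as f:
        for line in f:
            word, per_million = line.rstrip("\n").split("\t")
            freq[word] = float(per_million)
    # Extend each vocabulary row with the raw and log10 frequency;
    # words absent from the list get None rather than a fake value.
    rows = []
    for word, count in vocab.items():
        fpm = freq.get(word)
        log_fpm = math.log10(fpm) if fpm else None
        rows.append((word, count, fpm, log_fpm))
    return rows

Because the rows come back in the same order they went in, the new columns can be pasted straight back into a spreadsheet alongside the original ones, which is also roughly what stephanhodges asks about two posts down.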
bunny87 · Posted September 25, 2011 at 08:50 AM

I want it, BUT IT WON'T OPEN THE WEBSITE FOR ME. $& YOU, FIREWALL.
stephanhodges · Posted September 25, 2011 at 04:08 PM

c_redman said: "Unfortunately, I wasn't able to include the word frequency statistics from the Lancaster Corpus in this program. I use it personally, but it wasn't clear from their distribution license that I could include it, so I erred on the side of caution."

Would it be possible to upload the column of words extracted from the analysis using the offline program, and have your site generate the word frequency statistics (in the same order, so that they could be added back as a new column)?
stephanhodges · Posted September 25, 2011 at 04:16 PM

I have looked up the Lancaster Corpus online and read the current license page (http://www.ota.ox.ac.uk/scripts/download.php?otaid=2474). I believe that if you add a provision for people to get their own copy (by submitting an email address and agreeing to their terms for private use), you could incorporate the ability to use it, if present. This model is used by many other software programs.
Gleaves · Posted September 28, 2011 at 05:54 PM

I've used your online tool quite a few times in the past. Thanks for the continued work on these tools.
c_redman (author) · Posted October 3, 2011 at 02:13 PM

I got this working on Ubuntu Linux, although it's not a tidy package at this point. Here are the commands I used:

lsb_release -a
# No LSB modules are available.
# Distributor ID: Ubuntu
# Description:    Ubuntu 11.04
# Release:        11.04
# Codename:       natty

wget http://apt.wxwidgets.org/key.asc -O - | sudo apt-key add -
# if it waits without going back to the prompt, it's waiting for the sudo password

sudo nano /etc/apt/sources.list
# add these lines (replace "natty" with the appropriate codename
# if you're on a different Ubuntu version):
#
#   # wxWidgets/wxPython repository at apt.wxwidgets.org
#   deb http://apt.wxwidgets.org/ natty-wx main
#   deb-src http://apt.wxwidgets.org/ natty-wx main

sudo apt-get update
sudo apt-get install python-wxgtk2.8 python-wxtools wx2.8-i18n
sudo apt-get install subversion
sudo apt-get install python-chardet

mkdir ~/CWE
cd ~/CWE
svn export http://svn.zhtoolkit.com/ChineseWordExtractor/trunk/
cd trunk
python main.py &
Hugh · Posted October 15, 2011 at 07:32 AM

I also have it working nicely in Linux Mint. Already produced a vocab list for the Analects! Thanks very much.
laurenth · Posted October 20, 2011 at 12:58 PM

Hello c_redman,

First, thanks a lot for your very useful tool. I've downloaded and tried the Windows version, which works flawlessly. I would like to use it in my never-ending quest for reading material that is slightly above my current level. To that end, I tried to use the new HSK 4 list as a filter, in the hope of finding texts that contain relatively few words not included in that list. However, it turns out to be harder than expected, as there are quite a lot of very simple words that are not in the HSK 4 list, either because they are too simple by themselves (天, 手…) or because they are transparent expressions or compounds (第一次, 跑出去…). So I'm trying to compile a list of such words, but it might take a long time... Any ideas on how to simplify this? Maybe a function that would allow excluding all words whose frequency per million is above x?

Also, there is a feature of the online tool which I don't find in the local version: the meanU(log10) statistic. Yes, it's only a very rough estimate, but I have tested it with a range of simple, intermediate, and difficult texts, and it gives a useful indication of the degree of difficulty. Any plans to incorporate it into the offline version? On the other hand, I could actually *read* the text and see for myself whether it's difficult :-).

Anyway, just a few questions... Thanks again,
Laurenth
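Neither the frequency cutoff laurenth proposes nor the meanU(log10) statistic is spelled out in the thread, but both are straightforward to approximate. A speculative Python sketch follows; the formula, the floor value, and every name here are my guesses, not the site's actual implementation.

import math

def mean_log10_frequency(tokens, freq_per_million, floor=0.01):
    # One plausible reading of a meanU(log10)-style score: the average
    # log10 per-million frequency over all tokens in the text. Harder
    # texts use rarer words, so they score lower. `floor` is an assumed
    # stand-in frequency for words missing from the list entirely.
    logs = [math.log10(freq_per_million.get(t, floor)) for t in tokens]
    return sum(logs) / len(logs) if logs else 0.0

def drop_frequent(vocab, freq_per_million, cutoff):
    # laurenth's proposed filter: hide every word whose frequency per
    # million exceeds `cutoff`, leaving only the rarer words to review.
    return {w: n for w, n in vocab.items()
            if freq_per_million.get(w, 0.0) <= cutoff}

The cutoff filter would sidestep hand-building a "too simple for HSK 4 but obviously known" list, since transparent compounds like 第一次 tend to have high per-million counts.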