Chinese text analysis for full words

November 25, 2008 at 04:03 AM

Does anybody know of a link for Chinese text analysis that can recognize words, rather than just single characters? Goulnik's page is great: it gives me the count and frequency of each individual character, but it does not really recognize words.

http://goulnik.com/chinese/gb/

Didn't get it? What I mean is, 你好 will be listed as 2 characters rather than 1 word. Same for 现在, etc.

Is there such software, or an online tool? Any workaround (other than manual counting) would be appreciated too.
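
To make the character-versus-word distinction concrete, here is a minimal sketch of one common workaround, greedy forward maximum matching against a word list. The tiny `LEXICON` below is a made-up stand-in; a real run would load a full dictionary, and none of the tools mentioned in this thread are confirmed to work this way.

```python
# Hypothetical mini-lexicon for illustration only; a real run would load
# a full word list (for example, the headwords of a dictionary file).
LEXICON = {"你好", "现在", "我们", "学习", "中文"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing longer matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


print(segment("你好我们现在学习中文"))
```

With this lexicon, 你好 and 现在 come out as single two-character items instead of four separate characters. Greedy matching can still pick wrong boundaries in ambiguous text, so the output needs checking.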

November 25, 2008 at 06:31 AM

I would also be interested if such a system exists.

November 25, 2008 at 07:12 AM

I think adsotrans used to generate word lists, although not with frequency information. That function doesn't seem to be available now, but you could drop them a line and see.

November 25, 2008 at 01:17 PM

http://lingua.mtsu.edu/chinese-computing/vp/

Is really useful for this purpose.

Jun Da is Assoc. Prof. of Linguistics at Middle Tennessee State University. As I understand it, he uses cross-entropy distributions to identify what are functionally multi-character Chinese words, and also makes use of standard wordlists such as those for the HSK.

He has a paper on this available online at

http://lingua.mtsu.edu/academic/tclt4/JunDa-TCLT4Paper.pdf

Edited November 25, 2008 at 02:31 PM by student

November 25, 2008 at 02:42 PM

Student's link does do some impressive analysis. It's not foolproof, but should be OK for my purpose.

I'm not looking for simple single-character analysis; Goulnik's site already does that perfectly.

The final aim is actually to find the words of more than one character, to be put into Pleco later.

November 26, 2008 at 03:57 PM

http://www.zhtoolkit.com/apps/wordlist/create-list.cgi

I wrote this for myself as a tool to make vocabulary lists, and am gradually adding more support so others can make use of it. The simplest way to use it is to paste in some Chinese text, submit, and copy the table of words into a spreadsheet for further work. If you create an account, you can check off words you already know, which will filter them out of future vocabulary lists.

It's terribly slow, because it loads CC-CEDICT and any other dictionary you choose, every time it generates a list. However, once the dictionaries are loaded, it's not too bad, even if I paste an entire book chapter.

Another vocabulary creator I know of is DimSum, using the Append Definitions function. However, the CEDICT database embedded in the distribution is rather old, and it just gives the words, not the counts or other data.

November 27, 2008 at 07:25 AM

http://goulnik.com/chinese/gb/

I found some errors on Goulnik's page. I typed in the first passage of "Chinese Breeze" 我一定要找到她...

When I copy and paste it into Goulnik's page I get 1091 characters, 315 unique. When I use http://lingua.mtsu.edu/chinese-computing/vp/ I get 1064 / 212. Quite a difference. The mtsu site breaks the text into single characters with a space in between; when I copy and paste that spaced text back into Goulnik's page I get 1063 / 211.

Taking a closer look at the 315 unique characters, I saw 16 squares or empty places, and a few characters that did not appear in the text. He seems to have a problem with parsing the text.
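
One way to cross-check totals like these independently of either site is to count only Han characters locally. The following is a rough sketch, assuming the text is already a correctly decoded Python string; the range test covers only the basic CJK Unified Ideographs block, which is a simplification.

```python
from collections import Counter


def is_han(ch):
    # Basic CJK Unified Ideographs block only (U+4E00 to U+9FFF);
    # extension blocks and compatibility ideographs are ignored here.
    return "\u4e00" <= ch <= "\u9fff"


def char_stats(text):
    """Return (total, unique, counts) for Han characters only.

    Spaces, line breaks, Latin letters and punctuation are skipped,
    so they cannot inflate either number.
    """
    counts = Counter(ch for ch in text if is_han(ch))
    return sum(counts.values()), len(counts), counts


total, unique, counts = char_stats("我一定要找到她。 我 要 找 她!\n")
print(total, unique)
```

Because whitespace and punctuation are filtered out, the spaced and unspaced versions of the same passage give identical numbers here.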

The mtsu site does that better, I guess because of the extra spaces, but it also failed at parsing 2-character words. Some were found, along with some wrong ones, but so many were missing that I would say it's not useful.

Goulnik's page does the job a little better, but also comes up with illogical combinations, such as 片今, whereas 旅行 was not found.

My conclusion is that wordlist generation, or even parsing, does not work yet. Wordlist generation may also have the problem that you will get a HUGE number of words if you enter, let's say, 1000 characters. How many words would that create? A few thousand, I guess.

November 27, 2008 at 07:57 AM

Yes, Adso (http://adsotrans.com/downloads) can help automate these sorts of tasks. There are two approaches we take. The first is using the engine to print out the segmented text in some form that allows external software to easily count the output. For one word per line this command does the trick nicely:

adso -f [infile] --extra-code " AND "

The nice thing about this approach is that you can selectively process content on fairly arbitrary criteria (part of speech, length, semantics, or pretty much anything else you might like). Personally, I use this approach heavily for text summarization, content extraction, or preparing text for machine indexing for things like search.

adso -f [infile] --extra-code " AND AND AND ...."

The grammar language is not really that tricky to learn, since there are samples in the grammar directory. Once you have a file with all of the words organized line by line you can use standard unix utilities like "sort" and "uniq" to process and count the resulting output.
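
For anyone without the unix utilities to hand, the same counting step can be sketched in Python. This assumes only that the input has one word per line, as described above; the sample list stands in for such a file and is invented for illustration.

```python
from collections import Counter


def count_words(lines):
    """Count non-empty lines (one word per line), most frequent first.

    Roughly the same result as `sort | uniq -c | sort -rn`.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()


# Stand-in for the lines of a one-word-per-line file.
sample = ["现在", "你好", "现在", "", "旅行", "现在", "你好"]
for word, n in count_words(sample):
    print(n, word)
```

To run it on a real file, pass the open file object instead of the list, for example `count_words(open("words.txt", encoding="utf-8"))`, where `words.txt` is a hypothetical file name.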

The second approach is to install the MySQL version and get the software to take care of everything automatically. This is faster and easier, but the frequency data gets stored in the main database. If you don't mind this you can start by resetting all of the frequency information in the database to zero:

adso --frequency-reset

.... and then feed in the text you want to count, instructing the software to update the database with frequency information.

adso -f [infile] --frequency

The downside to this is that you'll need to know some sort of scripting language to get the frequency data out of the database once it is in there. Which isn't really a lot of work, but it requires more than copy-and-paste.

If you do take a look at Adso, I'd recommend you run the software with the --no-phrases flag to avoid counting phrases if that would be a problem. --skip-stage-advanced will also stop the engine from combining certain word combinations into phrases.

December 1, 2008 at 03:27 PM

Note that Goulnik's tool specifies "gb2312-encoded".
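
A small sketch of why the encoding matters: the same text has different bytes in GB2312 and UTF-8, and bytes read with the wrong codec turn into other characters or placeholders. Whether this explains the stray squares reported earlier in the thread is only a guess.

```python
text = "我一定要找到她"

# Correct round trip: encode and decode with the same codec.
gb_bytes = text.encode("gb2312")
utf8_bytes = text.encode("utf-8")
assert gb_bytes.decode("gb2312") == text

# Mismatch: UTF-8 bytes read as if they were GB2312.
# Byte sequences that are not valid GB2312 become U+FFFD placeholders.
garbled = utf8_bytes.decode("gb2312", errors="replace")

print(garbled != text)
print(len(text), len(gb_bytes), len(utf8_bytes))
```

Counting characters in the garbled string would give totals unrelated to the original text, so it is worth checking which encoding a tool expects before comparing its numbers with another tool's.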


flameproof

anon6969

roddy

student

flameproof

c_redman

flameproof

trevelyan

querido
