Chinese-Forums

Chinese text analysis for full words


flameproof

Does anybody know of a link for Chinese text analysis that can recognize words, rather than just single characters? Goulnik's page is great: it gives me the count and frequency of each individual character, but it does not really recognize words.

http://goulnik.com/chinese/gb/

Didn't get it? What I mean is, 你好 will be listed as 2 characters, rather than 1 word. Same for 现在, etc.

Is there such software, or an online tool? Any workaround (other than manual counting) would be appreciated too.

http://lingua.mtsu.edu/chinese-computing/vp/

This is really useful for this purpose.

Jun Da is Assoc. Prof. of Linguistics at Middle Tennessee State University. As I understand it, he uses cross-entropy distributions to identify what are functionally multi-character Chinese words, and also makes use of standard wordlists such as those for the HSK.

He has a paper on this available online at

http://lingua.mtsu.edu/academic/tclt4/JunDa-TCLT4Paper.pdf

Edited by student

Student's link does do some impressive analysis. It's not foolproof, but should be OK for my purpose.

I'm not looking for simple single-character analysis; Goulnik's site already does that perfectly.

The final aim is actually to find the words of more than one character, to be put into Pleco later.

http://www.zhtoolkit.com/apps/wordlist/create-list.cgi

I wrote this for myself as a tool to make vocabulary lists, and am gradually adding more support so others can make use of it. The simplest way to use it is to paste in some Chinese text, submit, and copy the table of words into a spreadsheet for further work. If you create an account, you can check off words you already know, which will filter them out of future vocabulary lists.

It's terribly slow, because it loads CC-CEDICT and any other dictionary you choose, every time it generates a list. However, once the dictionaries are loaded, it's not too bad, even if I paste an entire book chapter.

Another vocabulary creator I know of is DimSum, using the Append Definitions function. However, the CEDICT database embedded in the distribution is rather old, and it just gives the words, not the counts or other data.
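
For readers wondering how a dictionary can be used to split text into words at all, here is a minimal sketch of greedy forward maximum matching. This is an assumption for illustration only, not necessarily how the tool above or DimSum works; the function name `segment`, the `max_len` parameter and the three-word toy dictionary are all made up, and a real word list would come from something like CC-CEDICT.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    that is in the dictionary; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"你好", "现在", "旅行"}

print(segment("你好我现在去旅行", toy_dictionary))
# ['你好', '我', '现在', '去', '旅行']
```

Greedy matching like this is simple but will mis-split ambiguous text, which is one reason word lists generated this way need a manual check.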

http://goulnik.com/chinese/gb/

I found some errors on Goulnik's page. I typed in the first passage of "Chinese Breeze" 我一定要找到她...

When I copy and paste it into Goulnik's page I get 1091 characters, 315 unique. When I use http://lingua.mtsu.edu/chinese-computing/vp/ I get 1064 / 212. Quite a difference. The MTSU site breaks the text into single characters with a space in between; when I copy and paste that spaced text back into Goulnik's page I get 1063 / 211.

Taking a closer look at the 315 unique characters, I saw 16 squares or empty places, and a few characters that did not appear in the text. He seems to have a problem parsing the text.
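
One way to cross-check totals like these locally is to count only the characters that fall in the basic CJK Unified Ideographs range, so that spaces, punctuation and stray symbols are left out. This is just a sketch under that assumption; it is not how either site counts, and the function name `count_hanzi` is made up for the example.

```python
from collections import Counter


def count_hanzi(text):
    """Return (total, unique, counts) for characters in U+4E00..U+9FFF.

    Anything outside that range (spaces, Latin letters, Chinese
    punctuation such as "，" and "。") is ignored.
    """
    hanzi = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    counts = Counter(hanzi)
    return len(hanzi), len(counts), counts


total, unique, counts = count_hanzi("我一定要找到她，我一定要找到她。")
print(total, unique)  # 14 7
```

If the totals from a script like this differ from a website's, the extra "unique characters" are probably not Chinese characters at all.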

The MTSU site does that better, I guess because of the extra spaces, but it also failed at parsing two-character words. Some were found, some wrong ones too, and so many were missing that I would say it's not useful.

Goulnik's page does the job a little better, but also comes up with illogical combinations, such as 片今, whereas 旅行 was not found.

My conclusion is that wordlist generation, or even parsing, does not work yet. Wordlist generation may also have the problem that you will get a HUGE number of words if you enter, let's say, 1000 characters. How many words would that create? A few thousand, I guess.

Yes, Adso (http://adsotrans.com/downloads) can help automate these sorts of tasks. There are two approaches we take. The first is using the engine to print out the segmented text in some form that allows external software to easily count the output. For one word per line this command does the trick nicely:

adso -f [infile] --extra-code " AND "

The nice thing about this approach is that you can selectively process content on fairly arbitrary criteria (part of speech, length, semantics, or pretty much anything else you like). Personally, I use this approach heavily for text summarization, content extraction, and preparing text for machine indexing for things like search.

adso -f [infile] --extra-code " AND AND AND ...."

The grammar language is not really that tricky to learn, since there are samples in the grammar directory. Once you have a file with all of the words organized line by line, you can use standard Unix utilities like "sort" and "uniq" to process and count the resulting output.
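
For anyone without those utilities handy, the same counting step can be sketched in a few lines of Python. This assumes only that the input is one word per line; the function name `count_words` and the file name `words.txt` in the comment are hypothetical, and nothing here is part of Adso itself.

```python
from collections import Counter


def count_words(lines):
    """Count one-word-per-line input, most frequent first.

    Blank lines are skipped and surrounding whitespace is stripped.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()


# In-memory sample standing in for a segmented file. With a real file
# (hypothetical name) you could pass the file object directly:
#     with open("words.txt", encoding="utf-8") as f:
#         results = count_words(f)
sample = ["你好\n", "现在\n", "你好\n", "\n", "旅行\n", "你好\n"]

for word, n in count_words(sample):
    print(n, word)
```

Sorting by count first makes it easy to skim the top of the list and cut off the rare words before importing anything into a flashcard program.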

The second approach is to install the MySQL version and get the software to take care of everything automatically. This is faster and easier, but the frequency data gets stored in the main database. If you don't mind this you can start by resetting all of the frequency information in the database to zero:

adso --frequency-reset

.... and then feed in the text you want to count, instructing the software to update the database with frequency information.

adso -f [infile] --frequency

The downside to this is that you'll need to know some sort of scripting language to get the frequency data out of the database once it is in there. Which isn't really a lot of work, but it requires more than copy-and-paste.

If you do take a look at Adso, I'd recommend you run the software with the --no-phrases flag to avoid counting phrases if that would be a problem. --skip-stage-advanced will also stop the engine from combining certain word combinations into phrases.
