Introducing Chinese Text Analyser

April 9, 2014 at 07:19 AM

Chinese Text Analyser is a new tool I’ve been working on to help with reading and vocabulary acquisition.

You use it by opening a particular file containing Chinese text (or by pasting text from the clipboard), and it will highlight any words that you don’t know. You can then mark words as either known or unknown, and Chinese Text Analyser will then remember them, and over time build up an accurate list of your known vocabulary (you can also import lists of known words from other sources).

You can then use it to:

Easily see which words you do and do not know in any piece of Chinese text
Sort words in a document by frequency to help prioritise which new words to learn
Export wordlists (with optional sentence and cloze deletion extraction) for use with flashcard programs
Create segmented text files - with or without markup
Calculate and export frequency statistics
Search for grammar patterns e.g. 因为…所以

and more
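
For anyone curious about the principle behind sorting unknown words by frequency, here is a minimal Python sketch. It is not how Chinese Text Analyser is implemented. The function name `unknown_by_frequency`, the sample words and the known-word set are all made up for illustration, and the sketch assumes the text has already been segmented into words.

```python
from collections import Counter

def unknown_by_frequency(words, known):
    """Return (word, count) pairs for words not in `known`, most frequent first."""
    counts = Counter(w for w in words if w not in known)
    # most_common() sorts by count, highest first
    return counts.most_common()

# Hypothetical, already-segmented sample text and known-word list
words = ["我", "喜欢", "武侠", "小说", "因为", "武侠", "小说", "有趣", "武侠"]
known = {"我", "喜欢", "因为"}

for word, count in unknown_by_frequency(words, known):
    print(word, count)
```

The words at the top of the output are the ones that would pay off most if learned before reading.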

The current version, 0.99.0, should be considered a pre-release. There are still a number of major features I’m working on in the lead-up to version 1.0.0, and existing features still need various bits of polish. However, the program is now stable and already has a lot of useful functionality. If you’d like to try it out, you can download it from:

http://www.chinesetextanalyser.com/download

Currently it’s Windows only (though it will run on Linux under Wine), but I’m planning native OS X and Linux versions at some point in the future.

Unlike a number of other products with similar features, Chinese Text Analyser is fast. It will happily load and segment a full Chinese novel in under a second, and scrolling anywhere in the document is instant and flicker free.

If you need to process large amounts of data, it also handles multi-megabyte or gigabyte files with ease and without locking up your computer.

If you're interested, check it out and be sure to let me know any feedback you might have.

April 9, 2014 at 07:34 AM

It looks interesting, Imron. What about putting some screenshots on your site? Many people (me included) like to see what a piece of software looks like before they purchase it or download the trial version, to get a rough idea of its functionality and its look and feel.

April 9, 2014 at 07:47 AM

Yeah, that's on my list of things to do.

In the meantime, attached is a screenshot of the program with the script for 武林外传 loaded against the HSK-6 word list.

And also the dialog for exporting vocabulary from a given document.

April 9, 2014 at 08:14 AM

I won't have time to try it before next week, but it sounds just like what I've been waiting for!

April 9, 2014 at 08:36 AM

Since I read pretty much everything on paper, this would be the kind of tool I'd never use, but it sounds so extremely useful that in this case I might actually use it.

April 9, 2014 at 09:23 AM

I've been using this since the weekend and although I don't want to 拍 any 马derator's 屁, I think it's fantastic (I have no idea how it compares to other similar programmes out there). Each time I get to a new chapter in the books I'm reading I've started pasting that text into the programme -- before reading -- and then export all the unknown words: I skim through them and try to 'learn' a certain number of those I think most worthwhile. I particularly like how the programme will also export the actual sentence that each unknown word occurred in. I like to think that when I then start reading the chapter, and come across those words, they are learned better than if it was the other way around (i.e. first read, then take out unknown words, then learn them).
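
The sentence export described here can be approximated in a few lines of Python. This is only a rough sketch of the idea, not the program's actual export logic: the helper names `sentences` and `cloze_rows`, the `____` blank marker and the sample text are all assumptions. It uses plain substring matching, so it ignores word segmentation entirely.

```python
import re

def sentences(text):
    """Split text into sentences, keeping the closing 。！？ with each one."""
    parts = re.findall(r"[^。！？]+[。！？]?", text)
    return [p.strip() for p in parts if p.strip()]

def cloze_rows(text, unknown_words, blank="____"):
    """Return (word, sentence, cloze) for the first sentence containing each word."""
    sents = sentences(text)
    rows = []
    for word in unknown_words:
        for s in sents:
            if word in s:  # naive substring match, no segmentation
                rows.append((word, s, s.replace(word, blank)))
                break
    return rows

# Hypothetical sample text and unknown-word list
sample = "他很固执。我不同意他的看法！你呢？"
for row in cloze_rows(sample, ["固执", "看法"]):
    print("\t".join(row))
```

Each row pairs the word with its original sentence and a blanked-out copy, which is roughly what a flashcard program needs for a cloze card.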

I've also started doing the same with news: collect text from several articles on the same topic, paste it all into the programme, and find the most frequent unknown words across those texts, and skim or learn them before starting reading.

I'm on a reading & vocab binge at the moment so it's been perfect timing.

Lu, there's probably a decent chance what you're reading is online too which would let you copy/paste before returning to the paper book.

April 9, 2014 at 09:40 AM

You should show the PayPal logo earlier, on the homepage and/or the Buy Now page!

April 9, 2014 at 09:50 AM

That's probably a good idea. Have added it to my todo list.

April 9, 2014 at 10:13 AM

I think your desktop shortcut icon looks a little squished. :-)

April 9, 2014 at 11:16 AM

I think your desktop shortcut icon looks a little squished. :-)

It's a stylised 中 so the rectangle can't be too square.

April 9, 2014 at 11:54 AM

Oh? I see a 文. (Windows 8.1, 64 bit)

April 9, 2014 at 11:58 AM

nice

April 9, 2014 at 12:52 PM

Oh? I see a 文. (Windows 8.1, 64 bit)

If you had a lighter background, you'd see the 文 is inside a 中

April 9, 2014 at 02:30 PM

Maybe I'm the only one with a dark background, but I swear I've turned it around.

April 9, 2014 at 06:40 PM

I have a dark background too

Imron, I think the 中 might be a bit easier to recognise if it was in a brush-like style, like the 文. Before you explained it, I didn't get the pun at all.

April 9, 2014 at 07:08 PM

This looks promising, I may well buy a copy. I'm wondering though how similar is this to lwt.sourceforge.net and http://www.mandarintools.com/dimsum.html? And does it offer a way of automatically looking up the definitions and romanization for the words?

You might want to provide a list of formats as I could see this being a tool to work along with LWT and Anki. If the segmenting is any good, this would make my life a lot easier.

Also, I really appreciate that the pricing page warns about the possibility of a foreign transaction fee. It's nice to have a warning about that for those of us who have a choice of credit cards, including one that doesn't charge the fee.

April 9, 2014 at 11:50 PM

I used it a bit today, and overall it looks quite nice.

But a couple of changes would greatly improve things. I appreciate the options for the word, but it would be nice to have one which is the pinyin and the character together. When I do my first run through on vocab, I like to do it like that. Then on later runs I'll match the pinyin to the character.

A preview of what the export is going to look like would also be very helpful. IMHO, it doesn't matter whether it's a generic example, or one that's taken from the actual output, but it would make it easier when exporting.

For whatever reason there's a bunch of boxes in the output. They don't seem to be causing any trouble, so I'm not really worried about it, but it could be confusing for people using the program.

Overall, though, it looks very nice and I'm likely to buy a copy as it appears that it will save me a huge amount of time.

April 10, 2014 at 12:27 AM

I'm wondering though how similar is this to lwt.sourceforge.net and http://www.mandarint...om/dimsum.html

They are similar in quite a few ways, in that they are all designed to segment text and help you learn vocab, but different in others. Probably one of the main differentiators is performance. Chinese Text Analyser was written from the ground up to handle large amounts of text in a short amount of time. Neither of the above tools works well when dealing with anything other than short articles, and both struggle with novel-length texts (though DimSum is significantly better than LWT in this regard). Chinese Text Analyser handles megabytes of text with ease, and gigabytes of text without raising a sweat. I would encourage you to do a side-by-side comparison of segmentation and statistics gathering time for each of the programs with something like the full script for 武林外传, or the full text of a novel of your choice, in order to get a full appreciation of this.

Next is ease of use. LWT is a pain to set up and install on your local machine (some might even say a nightmare), and if you don't do it on your local machine you'll have resource constraints on whatever server you are running it on, making it even more impractical to use for long texts. DimSum is much better than this but still depends on Java runtimes. Case in point: I just downloaded the latest release, but then needed to update and download Java runtimes before it would work - ugh. By comparison, Chinese Text Analyser has no external dependencies and takes two clicks to install once downloaded.

In terms of total features, LWT and DimSum are still ahead of Chinese Text Analyser, but what Chinese Text Analyser does, it does well, and I'll be continuing to add features as time goes by.

And does it offer a way of automatically looking up the definitions and romanization for the words?

It does not have an automatic way of looking up definitions and romanization of words, but this is intentional: I believe automatic lookup encourages bad habits, is detrimental to long-term learning objectives, and lets you fool yourself into thinking you understand things when you do not. As such, Chinese Text Analyser will happily export pinyin and English definitions for words in a document, but it doesn't provide automatic lookup because I want to discourage the user from continually looking up words while reading. Instead they should be actively marking words as unknown and then exporting lists of those words for use in other programs such as an SRS tool. I do plan to add a feature in the future that allows manual lookups (e.g. by right-clicking on the word or similar) but then automatically marks those words as unknown for a fixed period of time.

You might want to provide a list of formats

At the moment it reads UTF-8, UTF-16 or GB*. Adding support for other encodings is trivial if there is a demand. Output is always UTF-8 text.
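
For readers preparing files for a tool like this, the following Python sketch shows one common way to handle those three encoding families. It is an illustration only and says nothing about how Chinese Text Analyser detects encodings. The function name `decode_chinese` is hypothetical, and reading "GB*" as the GB2312/GBK/GB18030 family is an assumption.

```python
import codecs

def decode_chinese(data: bytes) -> str:
    """Decode bytes that are UTF-8, UTF-16 (with BOM), or a GB-family encoding."""
    # A byte-order mark identifies UTF-8 or UTF-16 directly
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")  # the utf-16 codec consumes the BOM itself
    # No BOM: try strict UTF-8 first, then fall back to GB18030
    # (assumed here to cover GB2312 and GBK input as well)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("gb18030")

# Whatever the input encoding was, write the result back out as UTF-8
utf8_bytes = decode_chinese("中文".encode("gbk")).encode("utf-8")
```

Trying UTF-8 before the GB fallback matters because GB decoders accept many byte sequences that are really UTF-8, while strict UTF-8 decoding almost always rejects GB-encoded Chinese text.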

If the segmenting is any good, this would make my life a lot easier.

The segmenting at the moment is passable, but not as good as I want it to be. I have a number of ideas for improvements and have designed the program in such a way that I can easily add or replace segmenters (including for other languages), but I wanted to get the product out first before returning my focus to improving the segmentation.
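
To give a feel for what a simple segmenter does, here is a sketch of forward maximum matching, a textbook dictionary-based technique. Nothing in this thread says which algorithm Chinese Text Analyser uses, so treat this purely as background. The function name `segment`, the `max_len` parameter and the tiny dictionary are invented for the example.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until there is a dictionary hit;
        # a single character is always accepted so the loop cannot stall
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy dictionary
dictionary = {"中文", "文本", "分析", "分析器"}
print(segment("中文文本分析器", dictionary))
```

Any function with the same signature (text in, list of words out) could be swapped in, which is one simple way to keep segmenters replaceable.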

I appreciate the options for the word, but it would be nice to have one which is the pinyin and the character together. When I do my first run through on vocab, I like to do it like that.

Can you explain this in a bit more detail? When you say 'options for the word', what do you mean? You should be able to export pinyin alongside the characters if needed.

A preview of what the export is going to look like would also be very helpful.

Already on my list of things to do.

For whatever reason there's a bunch of boxes in the output.

Can you send me a screenshot, and also a sample of the text that has the problem, so I can figure out what is causing it?

April 10, 2014 at 12:38 AM

This reminds me of what LingQ does but a million times better. I think I am in love.

April 10, 2014 at 12:48 AM

Thanks for the response. I totally understand the lack of automation on parts of it. I've personally got mixed feelings about it, but I definitely see merit in that.

I've attached a screenshot there. I should probably mention that I'm using Crossover to run the application, so that might be the source of the boxes. As far as I can tell, it's how the tabs are being reflected, but I suppose it could be something else. But, since it doesn't show up in the exports, I wouldn't consider it a high priority.

I've found flash cards for Chinese to be a bit of a pain because there are different words that correspond to the same pinyin. There are 3 different cards to do what in most languages would be just one. So, if I'm learning 这个, I'd generally have a card that has 这个 - zhè ge on the front and the definition on the back. Then I'll generally create another card that's just the 这个 on the front and the definition on the back.

If that's a dumb way of doing it, feel free to say so and not implement it, but because of the pinyin not being unique to a given word, any cards that involve pinyin are going to need something to make them unique in a way that's useful to the learner.
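
The two-card scheme described above could be generated from an exported word list with a short script along these lines. This is a sketch under assumptions: the `(hanzi, pinyin, definition)` tuple layout, the function names `make_cards` and `to_tsv`, and the tab-separated front/back output are all hypothetical rather than taken from any program's export format.

```python
import csv
import io

def make_cards(entries):
    """entries: iterable of (hanzi, pinyin, definition) tuples."""
    cards = []
    for hanzi, pinyin, definition in entries:
        cards.append((f"{hanzi} - {pinyin}", definition))  # first pass: characters plus pinyin
        cards.append((hanzi, definition))                  # later pass: characters only
    return cards

def to_tsv(cards):
    """Serialise (front, back) pairs as tab-separated text, one card per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(cards)
    return buf.getvalue()

print(to_tsv(make_cards([("这个", "zhè ge", "this")])))
```

Because the characters appear on the front of both cards, the fronts stay unique even when several words share the same pinyin.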

Posters, in order: imron, laurenth, imron, Ruben von Zwack, Lu, Guest realmayo, querido, imron, querido, imron, querido, ouyangjun, imron, querido, Ruben von Zwack, hedwards, hedwards, imron, 陳德聰, hedwards