Introducing Chinese Text Analyser

April 10, 2014 at 12:53 AM

"I have a number of ideas for improvements ... (including for other languages)"

I hope you will kindly keep Cantonese in mind. <3

April 10, 2014 at 01:02 AM

I should probably mention that I'm using Crossover to run the application, so that might be the source of the boxes.

Yes, that's a known issue when running under any Wine variant - tabs and newlines get printed as boxes. It's already on my list of things to fix. It's only a display issue and doesn't affect segmentation or exports. Note also that in your screenshot the program toolbar is missing several icons on the buttons, and the startup screen is totally transparent. Both of those are Wine issues too, though I probably won't fix them as I'm eventually planning native versions that will solve that problem entirely.

but because of the pinyin not being unique to a given word, any cards that involve pinyin are going to need something to make them unique in a way that's useful to the learner.

Ah ok, now I see what you mean. Unfortunately there's not an easy way to solve this without writing a program that attempts not just to segment, but also understand Chinese - which is a significantly more difficult problem and worth a fortune to whoever solves it. The current (still pre-release) version just takes the easy way out and exports the first dictionary entry for the word. Future versions will output all dictionary entries, or let the user select which variation they want (with recommendations based on relative frequencies).

April 10, 2014 at 01:03 AM

This reminds me of what LingQ does but a million times better. I think I am in love.

Thanks. And it's only going to get better

April 10, 2014 at 01:06 AM

I hope you will kindly keep Cantonese in mind. <3

Do you have any online sources for Cantonese dictionaries with pronunciation?

April 10, 2014 at 01:12 AM

What would help you?

I won't have time tonight, but I would definitely try to help!

"Online sources"? I have big dictionary files I've downloaded, somewhere.

April 10, 2014 at 01:21 AM

All the segmenter needs is a list of words, one per line. It already has this for Chinese, but I imagine there are many Cantonese-specific words not currently in this list.

To show pronunciation it just needs a list of words and their associated pronunciation (tab separated format is fine) - preferably at the word level rather than the character level in order to have better accuracy.
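As a rough illustration of the two inputs described above (not the program's actual loader), here is a minimal Python sketch. The function names `load_words` and `load_pronunciations` and the sample Cantonese entries are made up for the example; the only things taken from the post are "one word per line" and "word, tab, pronunciation".

```python
def load_words(lines):
    """Collect a word list given one word per line; blank lines are ignored."""
    return {line.strip() for line in lines if line.strip()}


def load_pronunciations(lines):
    """Parse 'word<TAB>pronunciation' lines into a dict.

    Lines without a tab are skipped rather than guessed at.
    """
    table = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if "\t" not in line:
            continue
        word, pronunciation = line.split("\t", 1)
        table[word] = pronunciation
    return table


# Hypothetical sample data, word-level rather than character-level.
sample_words = ["唔該", "佢哋", ""]
sample_prons = ["唔該\tm4 goi1", "佢哋\tkeoi5 dei6", "malformed line"]

words = load_words(sample_words)
prons = load_pronunciations(sample_prons)
```

Keeping pronunciation at the word level means a multi-character word gets its own reading instead of one stitched together from per-character readings.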

Meanings are a bit trickier as there are potential copyright issues depending on the source material and the licence under which it was published, so public and open sources are better.

April 10, 2014 at 01:55 AM

Ah ok, now I see what you mean. Unfortunately there's not an easy way to solve this without writing a program that attempts not just to segment, but also understand Chinese - which is a significantly more difficult problem and worth a fortune to whoever solves it. The current (still pre-release) version just takes the easy way out and exports the first dictionary entry for the word. Future versions will output all dictionary entries, or let the user select which variation they want (with recommendations based on relative frequencies).

It's not that big of a deal, I usually just load things up into a spreadsheet and then concatenate the cells I want to compare.

Usually adding something like: B1 & " " & C1 solves the problem, it's just a minor annoyance and not really that hard to work around.

I hope that when you allow choosing different words, this will allow me to tell the program when one character should be treated as two. I think that's the only thing I'm noticing that's in real need of fixing.

But thanks, from what I'm seeing in Crossover, it seems to work mostly flawlessly, except for those couple of minor annoyances.

April 10, 2014 at 02:06 AM

Forgive me, my brain is getting worse and worse. (And I have not even tried to use the program yet. I did make note of its awesome speed though.)

Can I import a Cantonese dictionary (of my own for my own use) to replace whatever is built-in now? Would that work right, now? Is that what you just told me in post #26, above?

Thanks!

"Meanings are a bit trickier as there are potential copyright issues depending on the source material and the licence under which it was published, so public and open sources are better."

I understand. A terrible restriction you face.

April 10, 2014 at 02:43 AM

Would that work right, now?

Probably, with the caveat that it is unsupported and not guaranteed to work well. With that said, assuming you installed Chinese Text Analyser to C:\Program Files\ChineseTextAnalyser, then in the directory C:\Program Files\ChineseTextAnalyser\data there are two files of interest: words.u8, and cedict_ts.u8.

words.u8 is a list of utf8 encoded words used by the segmenter, one unique word per line.

cedict_ts.u8 is where the program pulls pronunciation and definitions from, in standard cedict format.

You could theoretically replace these with another dictionary also in cedict format but with different pronunciations and it would probably work.

I make no guarantees, but if you stuff things up, you can just reinstall the whole program to get it back from scratch.
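For anyone preparing a replacement dictionary, the sketch below shows how a line in the usual CC-CEDICT layout (traditional, simplified, pronunciation in square brackets, slash-separated definitions) can be split apart in Python. It is an illustration only: the regular expression, the function name `parse_cedict_line`, and the sample entry are assumptions for the example, not taken from the program or its data files.

```python
import re

# Traditional, simplified, [pronunciation], /definition/definition/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Return a dict for one dictionary line, or None for comments,
    blank lines, and anything that does not match the expected layout."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pronunciation, senses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pronunciation": pronunciation,
        "definitions": senses.split("/"),
    }


# Illustrative entry; a Cantonese file would carry e.g. Jyutping in the brackets.
entry = parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/")
```

Running every line of a candidate file through a check like this before swapping it in is a cheap way to spot entries that don't follow the layout.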

I usually just load things up into a spreadsheet and then concatenate the cells I want to compare.

Ah, ok, you want concatenated fields in the exporter? I'll have a think to see if there's a good way to do this. It might get a bit unmanageable - depending on all the different possible combinations.

this will allow me to tell the program when one character should be treated as two

Do you mean in the main text area when the program has segmented a word incorrectly?

April 10, 2014 at 02:57 AM

200-300k characters in under a second? That is really, really fast!

Are you segmenting on unicode and parsing longer words from user-generated wordlists or actually doing dictionary lookups? I'm curious because my segmentation speeds for Adso are much, much, much slower and constantly hitting an SQL engine for possible word-matches is currently really time-intensive for me.

April 10, 2014 at 03:01 AM

I usually just load things up into a spreadsheet and then concatenate the cells I want to compare.
Ah, ok, you want concatenated fields in the exporter? I'll have a think to see if there's a good way to do this. It might get a bit unmanageable - depending on all the different possible combinations.

I don't think that every combination is necessary. If it's a pain, just offering one would probably suffice. Offering an option for the simplified/traditional and pinyin would probably be sufficient. Anybody who only cares about the traditional or simplified character can just ignore the other one, and I'm not sure if tone markers or numbers is better. But, you get the idea, that should be more than enough to deal with that typical flashcard problem.

this will allow me to tell the program when one character should be treated as two
Do you mean in the main text area when the program has segmented a word incorrectly?

Right, basically allow the reader to tell the program when it's not right. I'm not sure what the best way of handling that would be. I do realize that this can get rather complicated.

April 10, 2014 at 07:23 AM

200-300k characters in under a second? That is really, really fast!

Thanks! I put a lot of work into it to get it that fast. Actually on my computer (Win7 running under a virtual machine on a MacBook Air) it can process about 2.5 million Chinese characters a second (if you include punctuation and whitespace it's about 3.5 million). It's a little hard for people with no background in Chinese text processing to relate that figure to something more concrete, so saying "a novel in under a second" is just a conservative way to let people know what to expect, while also giving plenty of leeway so that it will still be true even on older machines, and still be faster than what many other programs can do.

In terms of how I do the processing, I have a word list that I load into a custom data structure optimised for looking up matches or partial matches of Chinese words.

Then I process the document, looking up whether there are any matches or partial matches against the wordlist, and segmenting it into words to get total statistics.
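The post doesn't say what the custom data structure is, so the following is only a sketch of one common approach: a trie (prefix tree) built from the word list, with greedy longest-match segmentation. The names `TrieNode`, `build_trie` and `segment` are invented for this example, and the real program may work quite differently.

```python
class TrieNode:
    """One node of a prefix tree keyed by single characters."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Insert every word so that full and partial matches share a path."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def segment(text, root):
    """Greedy longest-match segmentation; unknown characters become
    single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        node = root
        longest = 0
        j = i
        # Walk the trie while the text keeps matching some word prefix.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_word:
                longest = j - i
        step = longest or 1
        tokens.append(text[i:i + step])
        i += step
    return tokens


trie = build_trie(["我", "喜欢", "学", "学习", "中文"])
tokens = segment("我喜欢学习中文", trie)  # ['我', '喜欢', '学习', '中文']
```

Greedy matching is fast but can choose badly: with 中国, 中国人 and 人民 in the list, 中国人民 comes out as 中国人 + 民. That is the kind of mis-segmentation a manual override would need to correct.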

Rendering of the page is then handled separately from this, with segmentation happening in realtime for all visible text, thereby allowing the user to scroll anywhere and still see segmented and highlighted words, even if total document processing hasn't finished yet.

All parsing is done in utf8 (other formats like utf16 or GBK are first converted to utf8 before processing).
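The conversion step can be pictured with a couple of lines of Python. This is a sketch that assumes the source encoding is already known (how the program detects it isn't described here), and `to_utf8` is a made-up helper name.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


gbk_bytes = "中文".encode("gbk")
utf16_bytes = "中文".encode("utf-16")  # Python prepends a byte order mark

from_gbk = to_utf8(gbk_bytes, "gbk")
from_utf16 = to_utf8(utf16_bytes, "utf-16")
```

Normalising everything to one encoding up front means the segmenter only ever has to deal with a single byte layout.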

There's a bunch of other stuff going on too, like efficient allocation of memory, and using memory mapped files so only small portions of the entire file are ever loaded at once and so on.

Dictionary lookup is then handled separately again. On program startup I parse cedict and build an index to each entry, and store the index using the same data structure mentioned above. When I need the meaning or pronunciation, I simply look up the index, then map in the portion of the cedict file required and extract the relevant information.
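To make the "index, then map in the relevant portion" idea concrete, here is a small Python sketch. It is not the program's code: it uses a plain dict for the index where the post describes reusing the custom lookup structure, it keys on the simplified headword only, and the names `build_index` and `lookup` are invented for the example.

```python
import mmap


def build_index(path):
    """Map each simplified headword to the (byte offset, length) of its
    first entry in a cedict-format file."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for raw in f:
            if raw.strip() and not raw.startswith(b"#"):
                parts = raw.split(b" ", 2)
                if len(parts) == 3:
                    headword = parts[1].decode("utf-8")
                    # setdefault keeps the first entry when a word repeats.
                    index.setdefault(headword, (offset, len(raw)))
            offset += len(raw)
    return index


def lookup(path, index, word):
    """Memory-map the file and pull out only the bytes of one entry."""
    entry = index.get(word)
    if entry is None:
        return None
    offset, length = entry
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length].decode("utf-8").rstrip("\r\n")
```

Only the small index stays in memory; the definitions themselves are read from disk on demand, which keeps startup cheap even for a large dictionary file.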

I don't think that every combination is necessary,

I'm sure you don't. It's the next user who asks how come you have this combination but not that combination, and then the one after who asks how come you have those two combinations and not the one I want. So it's a matter of trying to design something that is flexible enough to meet everyone's needs without being too confusing to use.

Simplified, Traditional, Pinyin (tones), Pinyin (numbers): the combinations start to get daunting quickly.

Can you explain a little more about your workflow? Maybe there's another way to achieve the same result. Perhaps allowing you to specify the separator character, then you could export Word + Pinyin with space as a separator rather than tabs?

April 10, 2014 at 08:48 AM

I do plan to add a feature in the future that allows manual lookups (e.g. by right clicking on the word or similar) but then automatically marks those words as unknown for a fixed period of time.

An excellent educational compromise, IMO. When I first tried the program, I desperately hit all available buttons, menus and icons to find some kind of popup dictionary. Then I started to suspect that the lack of such a tool may be intentional...

I've toyed a bit with the program and what it does is really well done. From a personal point of view, I'll probably purchase a license even if the time I spend studying Chinese in front of a computer is close to nil: I use my phone for lookups and reviews, and I consciously try to use more paper and fewer ebooks for reading, precisely for the reason Imron mentions, i.e. I tend to rely too heavily on easy dictionary lookups, rather than on my brains and on vocab learning.

April 10, 2014 at 09:14 AM

I don't actually learn in front of the computer either, because there are too many ways to cheat myself. I read online news articles (with the intention of learning) by printing them out on paper, taking notes and manually looking up the unknown vocabulary on my phone or, when that's unavailable, even in a paper dictionary.

But to be honest, it is painfully slow and tiring, simply because the process is a bit complicated. I believe my study workflow might be way way better with this tool. I can't wait to try it out.

April 10, 2014 at 10:27 AM

Then I started to suspect that the lack of such a tool may be intentional...

Yes. The program prioritises long-term learning objectives over short-term understanding and is designed to force you to acknowledge words that you don't know, rather than let you pretend you know them (but 'just wanted to check'), because accepting that you don't know certain words well enough yet is the way to make real progress.

It then provides tools to let you quickly extract those words (and optionally sentences containing them) for study in other programs dedicated to the task.

The program is designed around two major goals 1) helping you decide if a given piece of text is suitable for your current level, and 2) helping you prioritise which vocabulary you need to learn.

April 10, 2014 at 06:34 PM

Imron, would you consider it useful to have the ability to merge all open windows into one single window? There have been a few times I needed to do this. It's no great effort pasting both texts into Notepad and then pasting it all back into CTA -- just a suggestion, if more people would find it useful too.

April 10, 2014 at 08:50 PM

How about the ability to append to the clipboard, so you could copy everything in window A to the clipboard, append everything in window B to the clipboard and then paste in a new window?

That would probably achieve the same result, but be much easier to implement and use than merging windows.

April 11, 2014 at 12:56 AM

I'm sure you don't. It's the next user who asks how come you have this combination but not that combination, and then the one after who asks how come you have those two combinations and not the one I want. So it's a matter of trying to design something that is flexible enough to meet everyone's needs without being too confusing to use.

Simplified, Traditional, Pinyin (tones), Pinyin (numbers): the combinations start to get daunting quickly.

Can you explain a little more about your workflow? Maybe there's another way to achieve the same result. Perhaps allowing you to specify the separator character, then you could export Word + Pinyin with space as a separator rather than tabs?

A separator character might be a good resolution to this. I'm not sure which one to use, as it must not be allowed in any of the text fields and must be recognized by the programs accepting the files for import. The user would need to manually add the break character between the fields. That doesn't seem unreasonable; I'm sure it's easier than trying to figure out how to mark off certain fields to be joined and others to be kept separate.

I'm not sure where the miscommunication is happening. But, generally I'll enter all my data into a spreadsheet. In the spreadsheet, the first 2 columns are the character with romanization and the definition. This I then import into Anki usually using tabs as separators.

Perhaps this is a case where it's best just to let the user deal with it. Now that I think of it, a judicious use of sed might just solve this problem much more efficiently than anything that's going to be reasonable in the program.

I'm being moderately dense here.

sed 's/\t/ - /' test.txt

Solves the problem as long as it's the first and second field that I want to combine and the file that I'm reading is named test.txt. In retrospect, I probably should have just done this as it was like 5 minutes work.
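For anyone without sed to hand, roughly the same thing can be done in Python. This is just a sketch of an equivalent: the helper name `merge_first_two_fields` is made up, and like the sed one-liner it only joins the first and second tab-separated fields of each line.

```python
def merge_first_two_fields(line, joiner=" - "):
    """Replace only the first tab on the line, like sed 's/\\t/ - /'."""
    return line.replace("\t", joiner, 1)


def merge_file(in_path, out_path, joiner=" - "):
    """Apply the merge to every line of a UTF-8 text file."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(merge_first_two_fields(line, joiner))


merged = merge_first_two_fields("你好\tni3 hao3\thello")
```

Here `merged` keeps the tab before the definition, so the result still imports into Anki as two tab-separated fields.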

April 11, 2014 at 01:36 AM

Ok, I thought Anki allowed you to combine fields when importing, but I checked now and I guess it doesn't. I'll have a think and see if I can come up with a nice way to merge columns, but it probably won't be implemented for a while.

April 11, 2014 at 02:28 AM

Ok, I thought Anki allowed you to combine fields when importing, but I checked now and I guess it doesn't. I'll have a think and see if I can come up with a nice way to merge columns, but it probably won't be implemented for a while.

That was probably a feature from the 1.x and earlier versions. Unfortunately when they upgraded to the 2.x versions a lot of the Chinese-related functions were removed. Anyways, now that I've realized that I can just script the solution to this problem trivially, I think that putting it off for the future is probably the right decision. It's more of a Linux-centric solution, but people who really need it can probably find Windows versions of sed to run this with, or they can just use the recipe I referred to earlier in a spreadsheet.

I think the software is going to meet my needs even as is, so I'll be buying a copy when I get my next pay check.


querido

imron

imron

imron

querido

imron

hedwards

querido

imron

trevelyan

hedwards

imron

laurenth

Ruben von Zwack

imron

Guest realmayo

imron

hedwards

imron

hedwards
