Introducing Chinese Text Analyser

December 3, 2014 at 06:23 AM

Imron,

Version 0.99.9.

With a txt file, whether I open it manually or through "Open recent," it always opens at the top of the document.

Also -- when looking up a word on the far right hand side of the screen, the dictionary popup is obscured by the screen margin. (see picture attached)

December 3, 2014 at 07:24 AM

Can you send me a copy of the file:

C:\users\<username>\appdata\local\chinesetextanalyser\data\recent

Also from within the application:

Close all tabs.

Open a large file

Scroll all the way to the end

Close that file

Reopen that file.

If it hasn't opened at the end, then:

Help->send feedback

Check 'attach log file'

Hit send.

December 3, 2014 at 07:50 AM

Hi Imron,

I just emailed the file "recent" to you.

The txt file still opens at the beginning (it does not remember my place). When I tried to send feedback, the program crashed (see attached picture). I tried sending feedback several times, and it seems to crash every time I attach the log. When I sent feedback without the log (only the message "test feedback"), it went through OK. But whenever I clicked to attach the log, it crashed.

I'm using Windows 8 64. Sending from within China.

December 3, 2014 at 08:04 AM

That's not good.

Does it actually crash out to the desktop, or does it just freeze?

Instead of sending the log file from within the program, can you please send it manually?

It is located in C:\users\<username>\appdata\local\chinesetextanalyser\logs

There should only be one file in there NNNN.log where NNNN is a number. If there is more than one, sort the directory by last modified, and send the most recent one.
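For anyone who would rather script the "sort by last modified and take the most recent" step, here is a minimal Python sketch. It is not part of CTA; the function name `newest_log` is made up for illustration, and it assumes the log files end in `.log` as described above.

```python
from pathlib import Path
from typing import Optional


def newest_log(log_dir: str) -> Optional[Path]:
    """Return the most recently modified *.log file in log_dir, or None if there is none."""
    logs = [p for p in Path(log_dir).glob("*.log") if p.is_file()]
    if not logs:
        return None
    # st_mtime is the last-modified timestamp; the largest value is the newest file.
    return max(logs, key=lambda p: p.stat().st_mtime)
```

Call it with the logs folder given above (with your own username in the path) and send whichever file it returns.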

December 3, 2014 at 08:27 AM

Hi Imron,

Yes, it is crashing to desktop. I tried just now and it is still crashing.

I located the logs folder. The only file in the folder was last updated September 20th. I have used the program many times since that date. (BTW I just checked my email: this date is the same day I won the license giveaway--the day you emailed me a license and I updated to the paid version.)

I'll email you the log now.

December 3, 2014 at 08:41 AM

Thanks. As you mentioned, the log file there appears to be from an earlier date (Sep 20). Were you running CTA when you went to that folder? The log file should only be there if the program is running. If there is no other log file there when CTA is running, then there is a problem going on (probably connected to the issue you're having).

So this is the new set of steps:

Close CTA.

Delete everything in the 'logs' directory.

Restart CTA.

Open a large file.

Scroll all the way to the end.

Close that file.

Reopen that file.

If it hasn't opened at the end, then before closing the application:

Go to the log directory

If there is a log file there send it to me manually.

If there isn't a log file there, let me know.

Thanks.

December 3, 2014 at 09:00 AM

Imron,

To confirm, CTA was closed earlier when I went to (and emailed you) the log from Sept 20.

Here's the behavior I'm seeing. When I open CTA, a new file appears in the local/ChineseTextAnalyser/logs folder. But when I close CTA, it disappears from the folder.

I followed your steps and manually emailed you the log. My computer will not allow me to upload the log file while CTA is open (probably because it is in use). So I saved a copy of the log to my desktop and emailed you that instead. It should be the same file -- it has the same name, date, and filesize.

December 3, 2014 at 09:12 AM

Yep, CTA should clean up its log file if it exits correctly. I should have mentioned the copy thing.

Anyway, I've found the problem. It seems the file position thing isn't working for GB files, and the file you opened is in GB. This is probably because CTA only deals with UTF-8 internally, so for GB files it converts the text to UTF-8, stores it as a temporary file, and then works with that temporary file. Something is getting messed up when saving the offset.
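To illustrate why an offset can go wrong across encodings, here is a small Python sketch. It is not CTA's code, and it assumes (purely for illustration) that a reading position is stored as a byte offset. The same character position lands at different byte offsets in GB and UTF-8, because Chinese characters take 2 bytes in GB2312 but 3 bytes in UTF-8.

```python
TEXT = "中文abc文本分析"


def byte_offset(s: str, char_index: int, encoding: str) -> int:
    """Byte position of the character at char_index when s is stored in the given encoding."""
    return len(s[:char_index].encode(encoding))


# Position of the 6th character (index 5), i.e. just after "中文abc".
gb_offset = byte_offset(TEXT, 5, "gb2312")   # 2 + 2 + 3 bytes
utf8_offset = byte_offset(TEXT, 5, "utf-8")  # 3 + 3 + 3 bytes

# Applying the GB offset to the UTF-8 bytes lands at the wrong place in the text.
wrong_place = TEXT.encode("utf-8")[gb_offset:].decode("utf-8")
right_place = TEXT.encode("utf-8")[utf8_offset:].decode("utf-8")

print(gb_offset, utf8_offset)
print(wrong_place, right_place)
```

In a longer file the gap between the two offsets grows with every Chinese character, and an offset can also fall in the middle of a multi-byte character, in which case decoding fails outright.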

I'll fix that in the next version, but as a quick temporary solution:

Open the file

Ctrl-A to select the entire file

Ctrl-C to copy the entire file

Ctrl-V to paste the entire file to a new tab.

Close the tab and save the file. This will save a UTF-8 version.

Then, in the future use the UTF-8 version rather than the GB version.
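If there are many files to convert, the same result can be had outside CTA with a few lines of Python. This is only a sketch: the function name `convert_to_utf8` is hypothetical, and it assumes the source file is in a GB-family encoding. `gb18030` is used as the default because it is a superset of GB2312 and GBK.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "gb18030") -> None:
    """Read src using source_encoding and write the same text to dst as UTF-8."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

For example, `convert_to_utf8("novel_gb.txt", "novel_utf8.txt")` (file names made up) writes a UTF-8 copy and leaves the original untouched. If the source is not actually GB-encoded, `read_text` will either raise a `UnicodeDecodeError` or produce garbage, so check the output before deleting anything.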

December 3, 2014 at 09:25 AM

Imron,

Wow, that was fast! The files are now opening from where I closed them. Thanks :-)

A reminder that sending logs from the "send feedback" option is still causing the program to crash to desktop. And as mentioned above, when looking up words close to the right margin, the dictionary popup appears partly off screen.

December 3, 2014 at 09:45 AM

Yep, am investigating the crash, and the dictionary popup off-screen is something I've added to my todo list.

December 3, 2014 at 09:49 AM

Regarding: Word segmentation - it's interesting, the more I think about it, the more I wonder if it's possible to do this well without actually having a model that "understands" (very loose usage of this word!) the grammar of the sentence. The correct segmentation of ABCD could be AB CD or A BC D depending on the grammatical functions. I remember the paper on SUBTLEX-CH mentioned using a neural net that was trained on hand-segmented texts - so the neural net developed an internal model of the segmentation that was apparently quite accurate. Their data actually gives part-of-speech frequencies (e.g. how often X is used as a noun or a verb). So I'm interested to know what your plan is. I also wonder if it will be a big speed hit.

Regarding: Character frequencies. I actually can think of a few uses. If I know a word I have at least some knowledge of the characters that make it up (from my personal point of view I almost always can write it, but I know some people mostly recognize, and then you might be relying on recognition of a pair - anyway it's more than 0% knowledge). And if I know the characters that make up a word, I frequently have a very good chance of guessing the meaning of that word, and it is much lower cost to learn than other words. If I know 1/2 characters in a word it's still easier than a word in which I only know 1 character (N+1 principle). Certainly a text made up of 100% characters I know is much easier than one that is 80%. It would be great to be able to see "low hanging fruit" like words in the text that are made of characters you already know, or already have seen in other words you know.

But I think the tricky part is the uses for characters that you don't yet know. Last week I discovered a new grammatical use of 好. So was I lying when I marked it as known?? :-)

December 3, 2014 at 11:08 AM

I remember while my wife was pregnant. My daughter's ultrasound had an entry for 羊水. Sheep water, huh. Turns out 羊水 means "amniotic fluid," with 羊 substituting for 阳 of 阴阳 fame. I learned 羊 and 水 in my first year of Chinese study. And here I was, 6 years later, combining high frequency characters to learn words I almost never use even in English.

Some, though not all, of the word segmentation problems could be improved by having a larger dictionary corpus. CTA's dictionary is pretty good (it has 羊水 for example), but I find myself adding custom words quite often. I think Imron wrote that licensing larger dictionaries was an option for the future.

As to tysond's last point, goodness, words like 好 and 了 etc. are nebulous to learn :-) These problems are present in any language with lots of homonyms. For instance, English learners might know what...

"Fred betrayed Susan."

... means, but not...

"The sweat on Fred's brow betrayed his feelings about her."

Or what...

"John was full of confidence."

...means, but not...

"John was a confidence man."

Readers like Learning with Texts (LWT) and LingQ deal with similar issues in most languages. And in terms of Mandarin, Chinese Text Analyser is ahead by a long shot :-)

December 3, 2014 at 11:38 AM

I also wonder if it will be a big speed hit.

CTA is fast enough that I'm prepared to take a speed hit in the default segmenter of up to about 10x in order to get better segmentation (for a common novel this will still result in about 1-2 seconds for processing, which is within a reasonable limit I think) and will allow the user the option of choosing between segmenters so they can make the tradeoff themselves if they are prepared to sacrifice accuracy for speed.

Regarding segmentation, there are a bunch of things I'd like to explore. Currently the segmenter just does forward longest matching; that is, it moves forwards through a sentence finding the longest matching word in its wordlist and segments at that point, then finds the next longest word, and so on. Probably the first thing I'll try is reverse longest matching (which goes backwards through a sentence rather than forwards), which typically has better results than forward matching. I'd also like to do intelligent name matching by building a list of professions, titles and common characters used in names and matching words in the document against that. Another thing is doing bi-gram, tri-gram and maybe even quad-gram matching, both from n-grams in the article itself and from n-grams generated from a corpus, and using frequency information from that to help split sentences, e.g. maybe doing both a forward longest match and a reverse longest match and then, if there are differences, using the n-gram frequencies to help decide on the optimal segmentation. It'd also be nice to incorporate POS information and stuff like hidden Markov models. There are also interesting tradeoffs to be made, because the more information and data sets you provide, the larger the application becomes, and I don't particularly want a 100MB+ download for the installer :-) Then there are things like intelligent parsing of certain common grammatical structures, e.g. duplication of words (ABB, ABAB, AABB), numbers and so on. Then there's all sorts of playing around trying to mix all of those things together, plus having a systematic way to test, verify and compare the accuracy of the different segmentation algorithms.
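As a rough illustration of the two strategies described above, here is a minimal Python sketch of forward and reverse longest matching. It is not CTA's implementation; the tiny word list, the `max_len` cap of 4 characters and the single-character fallback for unmatched text are all assumptions made for the example.

```python
def forward_longest_match(text, words, max_len=4):
    """Segment left to right, taking the longest dictionary word at each position."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is the fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                result.append(piece)
                i += size
                break
    return result


def reverse_longest_match(text, words, max_len=4):
    """Segment right to left, taking the longest dictionary word ending at each position."""
    result = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in words:
                result.append(piece)
                j -= size
                break
    result.reverse()  # pieces were collected from the end of the sentence backwards
    return result


WORDS = {"研究", "研究生", "生命", "起源"}
SENTENCE = "研究生命起源"

print(forward_longest_match(SENTENCE, WORDS))  # ['研究生', '命', '起源']
print(reverse_longest_match(SENTENCE, WORDS))  # ['研究', '生命', '起源']
```

On this sentence the forward pass greedily takes 研究生 and strands 命, while the reverse pass gives the intended 研究 / 生命 / 起源. Where the two passes disagree like this is where extra information such as n-gram frequencies could be used to pick a winner.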

On top of all that, there are also other optimisations I can make for multi-core processors that will bring the segmenting time down again, gaining back some of the speed lost from doing extra work.

Exploring each one of those things will likely have a measurable improvement on segmentation quality, but each one is also a time sink :-)

Regarding: Character frequencies, at some point I'm going to get around to having ways to get low-hanging fruit from an article (or several articles), and so ideas about what constitutes low-hanging fruit are always welcome.

December 3, 2014 at 11:40 AM

CTA's dictionary is pretty good

It's just CC-CEDICT. I'd really like to incorporate better dictionaries in the future.

December 3, 2014 at 12:50 PM

word segmentation is low on priority compared to other features I want to get in before finally releasing a 1.0.0 version.

And after explaining all of those different ways you could improve it, I can see why... But what about enabling custom word segmentation for the 1.0.0 version, so that I at least don't have to mark, say, two individual characters "A" and "B" as known, just because it's not segmenting word "AB" properly.

Regarding: Character frequencies. I actually can think of a few uses.

All of the stuff tysond said was pretty much what I was thinking as well. I can definitely see how it could help - but I don't see that it could hurt, could it? Or at least, I feel that the potential it has for helping the student outweighs the potential for hurting the student... Or is there something I'm overlooking?

December 4, 2014 at 12:46 AM

But what about enabling custom word segmentation

1.0.0 (and possibly before) will have custom word segmentation where you can manually say how a piece of text can be segmented.

but I don't see that it could hurt, could it?

It doesn't hurt so much as distract. It's focusing on minutiae and details that don't really matter and that add noise in the form of more statistics and numbers that are ultimately useless yet still quite distracting.

Providing low hanging fruit is worthwhile but I will probably look for a way for CTA to present this information to the user in a way that still focuses on words. e.g. use character analysis under the hood to provide a list of 'words you might/almost know' or something like that.
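One possible reading of "words you might/almost know" can be sketched in Python as follows. This is only an illustration of the idea discussed in this thread, not a planned CTA feature; the function names and sample word lists are invented, and it assumes the text has already been segmented into words.

```python
def known_characters(known_words):
    """Every character that appears in at least one known word."""
    return {ch for word in known_words for ch in word}


def almost_known_words(text_words, known_words):
    """Unknown words built entirely from characters seen in known words, in order of first appearance."""
    known_chars = known_characters(known_words)
    seen = set()
    result = []
    for word in text_words:
        if word in known_words or word in seen:
            continue
        seen.add(word)
        if all(ch in known_chars for ch in word):
            result.append(word)
    return result


KNOWN = {"牛肉", "奶茶", "羊肉", "水果"}
TEXT_WORDS = ["牛奶", "羊水", "牛肉", "咖啡", "牛奶"]

print(almost_known_words(TEXT_WORDS, KNOWN))  # ['牛奶', '羊水']
```

The character analysis stays under the hood and the output is still a list of words. As the 羊水 example earlier in the thread shows, knowing the characters does not guarantee the meaning can be guessed, so a list like this is a set of candidates rather than known words.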

December 24, 2014 at 02:55 AM

1.0.0 (and possibly before) will have custom word segmentation where you can manually say how a piece of text can be segmented.

Excellent! I'm looking forward to it!

Another question:

When I export sentences with unknown words, it seems like it only exports the first occurrence of the new word. Is it possible to export all occurrences of each new word (even the ones that repeat) so I get a few sentences with each new word (assuming it comes up in the text more than once) instead of just one?

December 24, 2014 at 03:39 AM

At the moment it's not possible, however I'm working on a sentence mining feature that will allow just that.

January 30, 2015 at 10:15 AM

I've read a few posts mentioning making a "Persistent user corpus"... Not entirely sure what this is, but from what I can tell, this would allow the user to not only see frequency of a word by measuring it against a massive corpus that used hundreds of thousands of newspapers or whatever to get data, but also allow the user to measure word frequency across the texts that they've read, or in the text that they are reading, to help decide if the word is worth learning (or how well to learn it)... Is that about right?

Another question... You said (in the post above this) that you're working on a feature that will allow the user to export all occurrences of an unknown word, not just the first occurrence. I'm assuming that you mean all occurrences in the same file. But what if you could export all occurrences of the words in any of the pieces of text that you read in the past? I'm not sure how you'd want to do this - but for example, what if you made a folder that I could save all of my texts in (saving each as a txt document). Then, when reading my book and I don't know the word 牛奶, I could select an option for CTA to export all of the sentences that contained this word, both in the current document and in any document I had read in the past (or any document that I stored in the special folder).

It would also be awesome if search functions could be extended to those documents too. For example, sure maybe I've studied the word 又, but I want to get used to the pattern 又...又... I've read it before, but my teacher just formally taught us in class, and I want to do a search of all of the texts I've used CTA with before - all of which I've already read, so I know the context for all of them, which will help me further understand the usage of 又...又...
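Until something like this exists in CTA, the folder idea can be approximated with a short Python script. This is a sketch under several assumptions: the texts are UTF-8 `.txt` files in one folder, sentences end at 。！？ (or their ASCII equivalents) or at a line break, and the names `split_sentences` and `find_sentences` are made up for the example.

```python
import re
from pathlib import Path


def split_sentences(text):
    """Split text after sentence-final punctuation or a line break."""
    parts = re.split(r"(?<=[。！？!?\n])", text)
    return [p.strip() for p in parts if p.strip()]


def find_sentences(folder, pattern, encoding="utf-8"):
    """Return (file name, sentence) pairs for every sentence matching the regex pattern."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding=encoding)
        for sentence in split_sentences(text):
            if regex.search(sentence):
                hits.append((path.name, sentence))
    return hits
```

With a folder called `my_texts` (a made-up name), `find_sentences("my_texts", "牛奶")` collects every sentence containing 牛奶 across all files, and `find_sentences("my_texts", "又.+又")` does a rough search for the 又...又... pattern. The pattern is a regular expression, so a literal word containing special characters should be passed through `re.escape` first.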

And a problem:

For some words that are variants of other words, for example the word 動盪, the dictionary just says "Variant of 動蕩" but doesn't actually tell you what the word means.

January 30, 2015 at 10:22 AM

Just happened to notice murrayjames's comment on 羊水 above - it actually came up not that long ago, might be interesting.


Posters, in order: murrayjames, imron, murrayjames, imron, murrayjames, imron, murrayjames, imron, murrayjames, imron, tysond, murrayjames, imron, imron, Yadang, imron, Yadang, imron, Yadang, roddy
