Introducing Chinese Text Analyser

July 24, 2017 at 09:15 AM

It should work in the wordlist view with multiple words. At least it used to. If it doesn't, then a bug has crept in. I'll investigate.

July 25, 2017 at 04:50 AM

I just checked it again on Linux with a document of 5 unknown words to make sure I wasn't just getting lost in the sea of unknown words, and it didn't work. I'm pretty sure it doesn't on Windows either.

August 23, 2017 at 05:47 PM

Three feature requests for the next version:

1. I'd like the ability to drop a .txt file onto the dock icon/window and have it open. I'd also like CTA to tell the OS that it can read .txt files so I can right click->open with->CTA. Right now, the File->Open menu is the only way to load a text file and it's clunky compared to drag and drop.

2. After adding a custom word and then clicking "Show Definition", the popup just says "no definition". Instead, can it show the pinyin of the word? I add names as custom words and then I forget how to pronounce them. Before making it a custom word, I could click on each letter and sound it out but once it's turned into a custom word, I need to copy and paste it into Google Translate just to get the Pinyin.

3. When using flux, the colors are really hard to tell apart. In the light scheme, the color for looked up words looks almost identical to known words and in dark mode, the looked up words are almost invisible.

Looking forward to the next version!

August 24, 2017 at 02:41 AM

1 is easy and can be in the next version

2 I'll add to my todo list

3 You can already change this, but you have to edit a text file:

windows: /Users/<username>/AppData/Local/ChineseTextAnalyser/colour-schemes/default.colours

macos: /Users/<username>/Library/Application Support/ChineseTextAnalyser/colour-schemes/default.colours

linux: /home/<username>/.local/share/ChineseTextAnalyser/colour-schemes/default.colours

The file is just a list of key=value pairs. The keys should be fairly self-explanatory and the values are hex rgb values but without the # sign at the front.
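As a rough sketch of what editing that file could look like in a script, assuming one key=value pair per line as described above: the helper names `read_colours` and `set_colour` are made up for illustration, and so are the key names in any file you try it on. Close CTA before changing the file.

```python
from pathlib import Path


def read_colours(path):
    """Parse key=value lines into a dict, skipping lines without '='."""
    colours = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        colours[key.strip()] = value.strip()
    return colours


def set_colour(path, key, hex_value):
    """Set one key to a hex RGB value, stored without the leading '#'."""
    value = hex_value.lstrip("#")
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    updated = False
    for i, line in enumerate(lines):
        if line.partition("=")[0].strip() == key:
            lines[i] = f"{key}={value}"
            updated = True
    if not updated:
        # Assumption: an unseen key can simply be appended as a new line.
        lines.append(f"{key}={value}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Passing a value such as "#336699" to `set_colour` writes it as 336699, matching the no-# format of the file.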

August 24, 2017 at 02:37 PM

On 7/25/2017 at 0:50 PM, Yadang said:

I just checked it again on Linux with a document of 5 unknown words to make sure I wasn't just getting lost in the sea of unknown words, and it didn't work. I'm pretty sure it doesn't on Windows either.

@Yadang I just checked this, and it works correctly on windows, linux and macos.

What version of Linux are you running?

August 26, 2017 at 05:39 AM

Moved question about scripting here.

August 29, 2017 at 05:04 AM

On 8/24/2017 at 8:37 AM, imron said:

What version of Linux are you running?

lsb_release -a gives me:

Distributor ID:   Ubuntu
Description:   Ubuntu 14.04.5 LTS
Release:   14.04
Codename:   trusty

Is that what you're looking for? Note that I'm running it on a chromebook with crouton. A lot of things don't work the way they're supposed to. As for windows, I was using windows xp when I encountered the problem.

September 6, 2017 at 08:28 PM

This silly question is probably answered somewhere in the depths of this thread, but...

I'm changing computers. How do I transfer my current "known words" list, and other settings, to the new machine?

Thanks.

September 7, 2017 at 03:22 AM

The ChineseTextAnalyser data directory can be found here:

Windows: c:\users\<username>\AppData\Local\ChineseTextAnalyser\

macOS: /Users/<username>/Library/Application Support/ChineseTextAnalyser/

Linux: /home/<username>/.local/share/ChineseTextAnalyser/

And you can just copy the whole thing to the same location on the new computer. If you are changing operating systems, some config options such as remembering size and positions of windows will not be preserved.

You can also just copy specific sub-directories within that directory e.g. wordlists or colour-schemes to get just those things. Custom words you have specified can be found in data/words.u8
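If you'd rather script the copy, here is a minimal sketch built on the three locations listed above. The function names `cta_data_dir` and `backup_cta_data` are made up for illustration, and the sketch assumes the default per-user locations are in use. Close CTA before running it.

```python
import platform
import shutil
from pathlib import Path


def cta_data_dir(system=None, home=None):
    """Return the CTA data directory for the given OS (defaults to this machine)."""
    system = system or platform.system()
    home = Path(home) if home else Path.home()
    if system == "Windows":
        return home / "AppData" / "Local" / "ChineseTextAnalyser"
    if system == "Darwin":  # macOS
        return home / "Library" / "Application Support" / "ChineseTextAnalyser"
    # Assumption: anything else is treated as Linux.
    return home / ".local" / "share" / "ChineseTextAnalyser"


def backup_cta_data(destination, source=None):
    """Copy the whole data directory to destination (which must not exist yet)."""
    source = Path(source) if source else cta_data_dir()
    return shutil.copytree(source, destination)
```

Copy the resulting folder to the matching location on the new computer, for example via a USB stick.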

December 15, 2017 at 11:24 PM

Might have asked this before, but can't find it... Is there a way to import words so that they are added as custom words but not marked as known? Or even: if a list of words is added with the import feature, are they added as custom words if there's no dictionary entry that matches? If so, could I just import them by list, then copy and paste the document and re-mark them all as unknown?

December 16, 2017 at 12:48 AM

There's no way to do this from the user interface, but you can do it by manually editing files.

1. Close CTA if it is already open

2. Go to the CTA data directory (macOS: ~/Library/Application Support/ChineseTextAnalyser/data/, windows: C:\Users\<username>\AppData\Local\ChineseTextAnalyser\data\, linux: ~/.local/share/ChineseTextAnalyser/data/)

3. Open the file called words.u8 (or create it if it doesn't exist; it should be a plain text file in UTF-8 format)

4. Paste custom words to the end of the file - one word per line

5. Save the file and close

6. Re-open CTA and enjoy all your custom words not yet marked as known.
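Steps 3 to 5 could also be scripted. A minimal sketch, assuming words.u8 is plain UTF-8 text with one word per line as described above: the function name `append_custom_words` is made up, and skipping words that are already in the file is my own addition rather than something CTA requires. Run it only while CTA is closed.

```python
from pathlib import Path


def append_custom_words(words_file, new_words):
    """Append words to words.u8, one per line, and return the words actually added."""
    path = Path(words_file)
    # Create-if-missing: treat a missing file as an empty list of lines.
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    seen = {line.strip() for line in existing if line.strip()}
    added = []
    for word in new_words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            added.append(word)
    if added:
        path.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added
```

Point `words_file` at the words.u8 path inside the data directory from step 2.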

January 4, 2018 at 02:03 AM

Hi. I read somewhere that a new version was going to have a more accurate segmenter. Is this the case now?

Also, what if some words I already know (and which I import as a wordlist) are not in the CEDICT dictionary? Will CTA fail to recognize them since they're not in the dictionary?

Thanks!

January 4, 2018 at 07:06 AM

You read it right here in this thread, in a previous post of mine. I was actively working on it, but the results weren't as good as expected: the statistical information it relied on would overmatch words, and a large number of those overmatched words didn't exist in the dictionary, so a dictionary lookup on many of them would just return 'no definition'.

There are a number of ways to solve that problem, but I got caught up with a bunch of other work and haven't gotten around to doing that yet.

Words that you have added as custom words will still be matched by the newer segmenter; CTA will look at the words you've added and give them a statistical bias.

January 8, 2018 at 09:22 PM

I've just started with CTA - looks great so far. Due to my dodgy colour vision, the colours for known and unknown words in the text view are almost indistinguishable to me. In your post of 24 Aug 2017 you said the colours could be changed by editing this file:

windows: /Users/<username>/AppData/Local/ChineseTextAnalyser/colour-schemes/default.colours

I can't find that file. The only folders in AppData/Local/ChineseTextAnalyser are clipboard, data, logs and wordlists. Is it still possible to change the colours in the light scheme? Some colours are much easier for me to distinguish than others.

Thanks

January 8, 2018 at 10:13 PM

What version of CTA are you using?

That file should still be there, but it might not be created on disk until you run CTA for the first time and then quit the program.

When you come up with a suitable set of colours can you let me know and I'll include them in the main program for other people who face similar issues.

January 9, 2018 at 05:31 PM

Thanks. As you said, the file appeared after I closed and reopened CTA.

After some experiments I found that #0072BC (RGB 0, 114, 188) worked well for unknown.foreground.

For me it is easily distinguishable from the colours for known words and hover/looked up words, but still dark enough to read easily.
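In case it helps anyone else converting between the two notations, a small sketch (the function names are just illustrative):

```python
def hex_to_rgb(value):
    """Convert 'RRGGBB' or '#RRGGBB' to an (r, g, b) tuple of ints."""
    value = value.lstrip("#")
    if len(value) != 6:
        raise ValueError("expected six hex digits")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(r, g, b):
    """Convert an (r, g, b) triple to 'RRGGBB' with no leading '#'."""
    return f"{r:02X}{g:02X}{b:02X}"


print(hex_to_rgb("#0072BC"))     # (0, 114, 188)
print(rgb_to_hex(0, 114, 188))   # 0072BC
```

The second form, without the # sign, is what goes into the colours file.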

So far the other colours seem fine. If I have any more issues I will let you know and suggest alternatives. BTW my kind of red-green colour blindness is one of the most common types, so if you are able to cover that in the main program as you said, I'm sure that would be very helpful for others.

I'm using version 0.99.16 - 64 bit, which I recently downloaded.

This program is definitely going to be very helpful. Thanks again.

March 7, 2018 at 08:43 AM

Forgive me if this has been asked before, but is there an English language equivalent of CTA?

March 7, 2018 at 09:48 AM

It has been asked before, and I haven't made an English language version (yet), and I don't know if there is anything equivalent.

September 11, 2018 at 03:29 AM

@imron, is there any update on improving word segmentation in CTA? I recall you said something a couple of years ago about planning to improve it from its current, somewhat "hit and miss" state. I try using it about once a month and put it back with a sigh after seeing the "unknown words" it produces. An example from today:

没有最大限度地利用已有的资源。
is segmented as 没有/最/大限/度/地利/用/已/有的/资源。
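For illustration only (this is not CTA's segmenter, whose algorithm isn't described in this thread): a toy greedy longest-match segmenter with a tiny made-up dictionary shows how a dictionary-driven approach can end up with a split like 地利/用. Once 地利 is matched greedily, 利用 can no longer be found.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Made-up toy dictionary, just large enough to trigger the problem.
toy_dictionary = {"没有", "最大限度", "地", "地利", "利用", "已有", "的", "资源"}

print("/".join(forward_max_match("没有最大限度地利用已有的资源。", toy_dictionary)))
# 没有/最大限度/地利/用/已有/的/资源/。
```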

Asking as a paying customer who resorted to running a Windows VM in order to use a better-working (and free btw) software. There are free(!) public(!) segmenters on Github IIRC ready for copy pasting or at least re-implementing in the language of your choice.

September 11, 2018 at 05:31 AM

1 hour ago, uvwxyz said:

There are free(!) public(!) segmenters on Github IIRC ready for copy pasting

That are not as fast.

1 hour ago, uvwxyz said:

or at least re-implementing in the language of your choice.

Which I did, and I got it to a working state with acceptable performance, but found that the results were just as hit and miss because the segmenter was overbroad: it would rate things as words that were really phrases. That in turn affected dictionary lookups, because many of the hits were on non-dictionary words that returned no results.

There are ways to fix this, and I have investigated some of them, but then life and work got in the way and I haven't had time to get back into things.

2 hours ago, uvwxyz said:

Asking as a paying customer who resorted to running a Windows VM in order to use a better-working (and free btw) software

If you don't mind me asking, which software?
