Introducing Chinese Text Analyser

September 11, 2018 at 08:34 AM

2 hours ago, imron said:

That are not as fast.

I'd say it's better to have perfect segmentation at 5k words per second (Stanford segmenter, GPL) than a blazing fast segmenter that makes single-digit % errors. Why are you so much into extremely fast processing? I chucked the whole 中华上下午天年 into the Stanford segmenter and it took just over a minute. I suppose if someone processes 100 full-length books a day, this might be slightly annoying, but approximately 100% of your users would be happy to wait a few seconds at startup in exchange for a perfect result.

Quote

C:\Users\xxx\Desktop\stanford-segmenter-2018-02-27>segment.bat ctb ..\shangxia.txt UTF-8 0 >words.txt
CTB: Chinese Treebank segmentation
File: "..\shangxia.txt"
Encoding: "UTF-8"
kBest: "0"
-------------------------------
Invoked on Tue Sep 11 17:48:05 AEST 2018 with arguments: ...

...

CRFClassifier tagged 543985 words in 6101 documents at 4387.83 words per second.

And it did segment my example sentence perfectly (spaces):

没有最大限度地利用已有的资源

Another cut and paste example, some HSK5 test IIRC. If you want to compare this result to CTA, remove segmenting spaces before pasting it.

Quote

把电脑回收站清空后，文件是不是就被彻底删除了 ? 答案是否定的。其实那些被你删除的文件还好好地放在原来的位置，一步都没挪动。这也是为什么一些文件被彻底删除了，却还能被数据恢复软件找回来的原因。
假如你对某个硬盘的全部文件都执行了删除命令，那么这些文件立刻就都消失了。但事实上电脑并没有删除它们，而是做了以下的事情 : 第一，将这个盘的文件设为不显示 ; 第二，给这个盘做一个特殊的标记 —— 这个盘里的文件全都没用了，如果要储存新的文件可以存放到这个盘里来。
如果新的数据存放进去后，完全占满了这个盘，那你以前的文件就真的彻底没了。但如果你删除文件后，一直没有新文件存入，那么，这些被删除的文件就会永远留在原处，只不过不显示而已。数据恢复软件的作用就是让它们重新显示出来。电脑之所以这么做，是为了提高工作效率，因为让电脑真正抹杀一份数据所消耗的时间是很长的，如果电脑真的如此处理删除命令，估计每个用电脑的人都会疯掉的。

2 hours ago, imron said:

If you don't mind me asking, which software?

TBH (see above) when I need segmentation (=rarely), I use the Stanford Java tool. Once I have a file with spaced "words", I can do whatever I want with a bit of scripting. I usually just test the % of words I don't know in the book. And I must say CTA, without babysitting it by adding fake words, can be 5%+ off the real value, which is kind of important - 80+% understanding vs 90% is a big deal. (This babysitting is another pain in the *** - your policeman-like approach to looking up words doesn't make it easier to correct the tool's mistakes, but I digress.)
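For anyone wanting to try the "bit of scripting" step, here is a minimal Python sketch of the unknown-word percentage. It assumes the segmenter output is plain text with tokens separated by whitespace (like the words.txt produced above) and that your known words are available as a set; the function name and the sample sentence are made up for illustration, and tokens without any Chinese characters (punctuation, numbers) are simply ignored.

```python
import re
from collections import Counter

# A token counts as a word only if it contains at least one CJK ideograph,
# so punctuation and Latin/number tokens are skipped.
CJK = re.compile(r"[\u4e00-\u9fff]")

def unknown_stats(segmented_text, known_words):
    """Return (percent of running words that are unknown, Counter of unknown words)."""
    tokens = [t for t in segmented_text.split() if CJK.search(t)]
    unknown = Counter(t for t in tokens if t not in known_words)
    if not tokens:
        return 0.0, unknown
    percent = 100.0 * sum(unknown.values()) / len(tokens)
    return percent, unknown

# Hypothetical sample: 7 running words, 1 of them unknown.
sample = "我 喜欢 看 书 ， 我 喜欢 学习 。"
known = {"我", "喜欢", "看", "书"}
percent, unknown = unknown_stats(sample, known)
print(f"{percent:.1f}% unknown")   # 14.3% unknown
print(unknown.most_common(10))
```

For a whole book, read the segmented file as UTF-8 and pass its contents as `segmented_text`. Note this counts running words (tokens), not unique words; dividing `len(unknown)` by the number of distinct tokens would give the unique-word figure instead.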

I checked my Windows VMs - there's Chinese Word Extractor and Chinese Toolbox, yet they are long dead and make exactly the same mistakes on my example sentence, so they are not a good example. I did play with them for a while last year tho, they must have been better at least on some texts. So, there is no Windows-specific segmenter I can recommend, I take that back. Java runs more or less everywhere.

September 11, 2018 at 12:27 PM

3 hours ago, uvwxyz said:

Why are you so much into extremely fast processing?

It makes a big difference for responsiveness, especially with the real time highlighting of known/unknown words, especially if you have a large screen resolution and a full screen window full of text. Assuming a refresh rate of 10 frames per second for a minimum level of responsiveness, 5,000 words/sec gets you 500 words per frame, or about double the amount of text you posted above. That's far less than a screen of text, and that means you get lag in highlighting any time you scroll the screen or mark words as known/unknown. That's something of a pet peeve of mine. I hate it when my code editor flashes as it updates the syntax highlight of a much smaller file (happens all the time with Xcode) and there's not much excuse for it. CTA will open any file instantly and will also update highlighting more or less instantly regardless of size of file or screen size, and that's something I care about.
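The arithmetic behind that frame budget, as a small Python sketch. The 5,000 words/sec and 10 frames/sec figures are the ones used above; the 2,000-word screen is a made-up number purely for illustration.

```python
def words_per_frame(words_per_second, frames_per_second):
    """How many words a segmenter can process within a single frame."""
    return words_per_second / frames_per_second

def seconds_to_segment(word_count, words_per_second):
    """How long it takes to re-segment a given amount of visible text."""
    return word_count / words_per_second

print(words_per_frame(5000, 10))        # 500.0

# Hypothetical: a large full-screen window showing 2,000 words.
delay = seconds_to_segment(2000, 5000)
print(f"{delay:.2f} s to refresh, i.e. {delay * 10:.0f} frames at 10 fps")
```

Under those assumptions a full refresh spans several frames, which is the lag described above.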

3 hours ago, uvwxyz said:

I must say CTA without babysitting it by adding fake words can be 5%+ off the real value, which is kind of important

I used to think this way too, but part of the reason the segmenter hasn't been getting much love is that I found out that for many use cases it's not actually that important. It still gives accurate enough ballpark figures if you're trying to estimate the relative difficulty between two texts or if you're trying to find the words with the highest frequency.

That being said, I'd love to spend more time on the segmenter and get it to the same level of correctness as the Stanford one without sacrificing speed, but unfortunately I'm unlikely to have the time to do that anytime within the next few months.

3 hours ago, uvwxyz said:

Java runs more or less everywhere.

Less more often than more on Macs.

September 12, 2018 at 01:26 AM

12 hours ago, imron said:

That's something of a pet peeve of mine. I hate it when my code editor flashes as it updates the syntax highlight of a much smaller file (happens all the time with Xcode) and there's not much excuse for it

OK, this explains it.

September 17, 2018 at 04:19 AM

In the meantime, couldn't you just paste the pre-segmented text into CTA, then use CTA for all the statistics, exporting etc, like normal?

(See this post. Only you'd have the Stanford segmenter be doing most of this for you)

September 23, 2018 at 12:19 PM

Hi Imron, where are the known words saved? Can you make it so that it is a user-defined location? I want to save it in the cloud so that I can sync across devices. I started to use the app more and more and I would like to use on more than 1 device.

September 23, 2018 at 01:02 PM

Windows: c:\users\<username>\AppData\Local\ChineseTextAnalyser\wordlists

macOS: ~/Library/Application Support/ChineseTextAnalyser/wordlists/

Linux: ~/.local/share/ChineseTextAnalyser/wordlists

You should be careful to only open it on one machine at a time, otherwise you might lose known words. This is because it loads the wordlist at the beginning of a session and saves it out to disk when the application exits.

So if you open it on computer 1, and then make some changes, and then open it up on computer 2 and make some other changes and then close the app on computer 2 and then close the app on computer 1, the version left on disk at the end will be the version on computer 1.
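If you want to script a backup of the wordlists folder, the per-platform locations listed above can be wrapped in a small helper. This is only a sketch: the function name is made up, and it assumes your home directory is the usual C:\Users\<username> on Windows.

```python
import sys
from pathlib import Path

def wordlists_dir(platform=sys.platform, home=None):
    """Return the wordlists folder for the given platform (paths as listed above)."""
    home = Path(home) if home is not None else Path.home()
    if platform.startswith("win"):
        # Assumes the home directory is C:\Users\<username>.
        return home / "AppData" / "Local" / "ChineseTextAnalyser" / "wordlists"
    if platform == "darwin":
        return home / "Library" / "Application Support" / "ChineseTextAnalyser" / "wordlists"
    # Everything else is treated as Linux.
    return home / ".local" / "share" / "ChineseTextAnalyser" / "wordlists"

print(wordlists_dir())
```

Because the wordlist is loaded at startup and written back on exit, copy this folder only while the application is closed.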

September 24, 2018 at 07:40 AM

Hi Imron, thanks for the reply. I found the files and copied them over to the other PC. I will make sure not to have the files open on both machines at the same time.

It would be great though if you could make a future version that is portable (either all files, binary files + wordlists etc. all in the same folder OR the option to save the wordlists at a user defined location).

September 24, 2018 at 09:50 AM

A future version will have automatic sync (online) and also manual sync (export a bunch of files on one computer and import those files on another computer).

October 1, 2018 at 08:32 AM

Hi Imron, not sure whether you know this bug: if you increase the font size (I like really large characters), it does not activate the scrollbar, so you end up with text that flows below the visible area of the text window and you have no way to scroll down, either with the mouse wheel or the keyboard. Enlarge beyond a certain character size and you simply cannot scroll to the end of the text, no matter how long.

Also: sometimes the automatic segmentation is wrong, e.g. for "不了解" it segments as "不了+解". Could you add a function to manually override this?

February 3, 2019 at 04:27 AM

Hi Imron, I am currently using CTA on a MacBook Air. Next week I am switching to PC and want to keep using CTA. What is the process for copying my user files over to the new computer?

February 4, 2019 at 06:33 AM

In Finder press Command-Shift-G and in the text box that appears type in:

~/Library/Application Support/ChineseTextAnalyser/

Copy (or zip up) this entire folder and copy it somewhere on your new machine - easiest is probably to a folder on your desktop called ChineseTextAnalyser.

Then on your Windows machine, install Chinese Text Analyser. When it asks for your licence, there will be a copy in <Desktop>\ChineseTextAnalyser\cta.licence (assuming you copied the files to <Desktop>\ChineseTextAnalyser).

Now quit CTA. It's important to run it once, as it will configure all the user directories for you.

Make sure to really quit, because CTA saves all its config files upon exit, so if you overwrite them while the program is running, they'll be changed back to the old values when CTA quits.

Now open up C:\Users\<username>\AppData\Local\ChineseTextAnalyser

Delete the 'wordlists' folder, and then copy the entire 'wordlists' folder from your copy on the Desktop (<Desktop>\ChineseTextAnalyser\wordlists)

If you've modified the 'colour-schemes' at all, then do the same thing with the 'colour-schemes' folder.

Finally, to get any custom words and dictionary definitions, go in to <Desktop>\ChineseTextAnalyser\data and copy words.u8 and cedict_ts.u8 to 'C:\Users\<username>\AppData\Local\ChineseTextAnalyser\data' which won't have those files in it.

Then restart CTA and you should be good to go.
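If you'd rather script the copy steps than do them by hand, something like the Python sketch below should be roughly equivalent. The function name and arguments are made up for illustration, it only touches the folders and files named above, and it does nothing about the licence. Run it only after CTA has been started once and fully quit.

```python
import shutil
from pathlib import Path

def restore_cta_data(backup_dir, target_dir, include_colour_schemes=False):
    """Copy wordlists (and optionally colour-schemes) plus custom dictionary
    files from a backup of the old CTA folder into the new CTA folder."""
    backup_dir, target_dir = Path(backup_dir), Path(target_dir)
    copied = []

    folders = ["wordlists"]
    if include_colour_schemes:
        folders.append("colour-schemes")
    for name in folders:
        src, dst = backup_dir / name, target_dir / name
        if not src.is_dir():
            continue  # nothing to restore for this folder
        if dst.exists():
            shutil.rmtree(dst)  # delete the freshly created folder first
        shutil.copytree(src, dst)
        copied.append(name)

    # Custom words and dictionary definitions live in the data folder.
    data_dst = target_dir / "data"
    data_dst.mkdir(parents=True, exist_ok=True)
    for filename in ("words.u8", "cedict_ts.u8"):
        src = backup_dir / "data" / filename
        if src.is_file():
            shutil.copy2(src, data_dst / filename)
            copied.append(f"data/{filename}")
    return copied
```

Here `backup_dir` would be the ChineseTextAnalyser folder on your desktop and `target_dir` the ChineseTextAnalyser folder under AppData\Local; the function returns the list of items it actually copied so you can check nothing was missed.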

February 4, 2019 at 08:46 AM

Thank you Imron! One more question--

Quote

When it asks for your licence, there will be a copy in <Desktop>\ChineseTextAnalyser\cta.licence (assuming you copied the files to <Desktop>\ChineseTextAnalyser).

I'm not sure I understand this sentence. Are you saying that when CTA asks for my license, I select cta.licence from the \ChineseTextAnalyser folder on my desktop?

February 4, 2019 at 11:22 PM

Yes, you can do that, or you can use the licence key originally sent to you via email - both will work because when you register CTA, it keeps an internal copy of your licence and that is where it puts it (so it will be an exact copy of the licence file you originally used to register). Doing it this way just saves you hunting through your email archives for the licence key.

Technically speaking, you could just copy 'cta.licence' from the \ChineseTextAnalyser folder on your desktop to c:\users\<username>\AppData\Local\ChineseTextAnalyser and skip the entire registration process. But going through the registration dialog lessens the likelihood of human error - if CTA can't find the internal copy of the licence in the right place, then it will fall back in to 'free trial' mode.

February 5, 2019 at 11:44 PM

Worked perfectly. Thanks @imron

February 5, 2019 at 11:54 PM

Glad to hear it.

February 6, 2019 at 08:30 PM

TLDR: can you make it possible to select multiple word lists at once?

A while back, I mentioned that it would be nice to somehow keep track of how many times I've seen a word in different texts, because I rarely really know a word after learning it for just one text. I think at the time you said you'd try to figure out some way to do this. But I've since been worried that any approach used might be inefficient for some words and insufficient for others. Some words I really do know after just seeing them once (at least I know them well enough to read them - maybe I wouldn't be comfortable using them, but that will come with time and more reading, not more SRS reps), whereas others I really do need to see several times in several books to know them. So I think any approach that, for example, marks any word that has been exported 3 times as known would be inefficient for words in the first case and insufficient for words in the second case.

Recently, though, I've thought I should have a "rotating" known words list. If I learn the words for book x, I add them to my "Known" list to make a "Known + x" list. Then I use that list to see what words I need to study for book y. After having studied book y, I have a "known + x + y" list. Then I use that word list to study book z, etc, etc. I think what I should do (although I haven't actually tried this yet) is to periodically remove word lists. So for example, let's say I have the word list "known + x + y + z + a + b + c" for six books I've read. For my 7th book, I'd like to remove the words that I studied under book x, so I only have "known + y + z + a + b + c". Any words I come across that I feel I really know, I can then add to my "known" word list, which I never rotate out.

This can be done in the current version, but it would be made a whole lot easier if multiple word lists could be selected at once, so I didn't have to concatenate all my wordlists, then periodically un-concatenate the older ones, etc.

February 7, 2019 at 12:34 AM

I don't know if selecting multiple lists is the answer (which list do words get added to if you mark them as known?) but maybe allowing for easy manipulation of word lists would suffice, for example, allowing you to create wordlists by specifying an expression such as "known + a + b - z".
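To make the idea concrete, here is a rough Python sketch of how an expression like "known + a + b - z" could be evaluated. This is not how CTA works or will work; it just treats each word list as a set, reads the expression left to right, and assumes list names contain no spaces, "+" or "-". All names are made up.

```python
import re

def evaluate_wordlist_expression(expression, wordlists):
    """Combine named word lists (sets) left to right: '+' is union, '-' is difference."""
    tokens = re.findall(r"[+-]|[^\s+-]+", expression)
    result = set()
    operator = "+"          # the first list is implicitly added
    expecting_name = True
    for token in tokens:
        if token in ("+", "-"):
            if expecting_name:
                raise ValueError(f"unexpected operator {token!r}")
            operator = token
            expecting_name = True
        else:
            if not expecting_name:
                raise ValueError(f"missing operator before {token!r}")
            words = wordlists[token]   # KeyError if the list name is unknown
            result = (result | words) if operator == "+" else (result - words)
            expecting_name = False
    if expecting_name:
        raise ValueError("expression is empty or ends with an operator")
    return result

lists = {
    "known": {"我", "你"},
    "a": {"书"},
    "b": {"学习"},
    "z": {"你"},
}
print(sorted(evaluate_wordlist_expression("known + a + b - z", lists)))
```

One thing the sketch makes visible: with strict left-to-right evaluation, "- z" also removes words that are in "known". Writing "a + b - z + known" instead keeps everything in "known", which is probably what a never-rotated list should do.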

February 7, 2019 at 01:11 AM

34 minutes ago, imron said:

which list do words get added to if you mark them as known?

Could you select multiple word lists but privilege one over the rest, and it's that one that known words are added to (either upon export, or clicking in the document, etc.)?

36 minutes ago, imron said:

maybe allowing for easy manipulation of word lists would suffice, for example, allowing you to create wordlists by specifying an expression such as "known + a + b - z"

Yes, I think that would work!

February 7, 2019 at 04:52 AM

3 hours ago, Yadang said:

Could you.....

Everything is possible to do, but it's also a matter of trying not to make things too complicated, and to me selecting multiple lists and allowing one to be privileged is starting to feel a little complicated - maybe not to you and me, because we're discussing it here and know what we are talking about, but to every other user who comes across it in the UI and doesn't have the benefit of this context.

3 hours ago, Yadang said:

Yes, I think that would work!

I think this is the much better thing to do - just make it easier to combine/subtract lists together in various ways so you don't have to do it manually.

February 7, 2019 at 06:17 AM

imron, when do you plan to develop English Text Analyser and make millions selling it in the lucrative ESL market?
