Introducing Chinese Text Analyser

February 19, 2016 at 12:11 PM

How do you measure x% understanding? Is it just the percentage of the words in the whole text that are known? As an alternative, I think it would be useful to track understanding at the sentence level, e.g. this set of characters/words is a superset of all characters/words used in x% of the sentences in this text. Or you could first build a normal frequency list and then pair each character with the percentage of sentences you can fully understand (at the character level) using only the characters up to that position on the frequency list.
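A minimal Python sketch of the sentence-level idea above. It is not CTA's code or any script from this thread. The function names, the punctuation used to split sentences, and the CJK range check are all assumptions made for illustration. It ranks characters by frequency and, for each position on that list, reports the fraction of sentences that use only characters up to that position.

```python
import re
from collections import Counter


def is_cjk(ch):
    """Rough check for a common Chinese character (basic CJK block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def sentence_coverage_by_rank(text):
    """Return [(character, fraction of sentences fully readable so far), ...].

    Characters are ordered from most to least frequent in the text itself.
    """
    # Assumption: sentences end at these punctuation marks or at a newline.
    raw = re.split(r"[。！？!?\n]+", text)
    sentences = [set(filter(is_cjk, s)) for s in raw]
    sentences = [s for s in sentences if s]
    if not sentences:
        return []

    freq = Counter(ch for ch in text if is_cjk(ch))
    ranked = [ch for ch, _ in freq.most_common()]
    rank = {ch: i for i, ch in enumerate(ranked)}

    # A sentence becomes fully readable once its rarest character is known.
    unlock = sorted(max(rank[ch] for ch in s) for s in sentences)

    results = []
    done = 0
    for i, ch in enumerate(ranked):
        while done < len(unlock) and unlock[done] <= i:
            done += 1
        results.append((ch, done / len(sentences)))
    return results


coverage = sentence_coverage_by_rank("我好。我很好。你好吗？")
```

Each sentence is reduced to a set of characters, so the result only says whether every character is known, not whether the words are.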

February 19, 2016 at 03:01 PM

At the moment it's very simplistic, just a percentage of unique words known and also a percentage of total words known.
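For illustration only, the two percentages described above could be computed roughly like this in Python. This is a sketch, not CTA's implementation: it assumes the text has already been segmented into words, and `known_percentages` is a made-up name.

```python
from collections import Counter


def known_percentages(words, known):
    """Return (% of unique words known, % of total words known)."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    unique_known = sum(1 for w in counts if w in known)
    total_known = sum(n for w, n in counts.items() if w in known)
    return 100 * unique_known / len(counts), 100 * total_known / total


# Hypothetical, pre-segmented sample text and known-word list.
words = ["我", "喜欢", "学习", "汉语", "我", "喜欢"]
unique_pct, total_pct = known_percentages(words, {"我", "喜欢"})
```

The two figures differ because frequent words count once in the first and every time they occur in the second.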

I do plan to add more advanced metrics at some point.

March 6, 2016 at 11:24 AM

So I wrote a script, and it seems you need to know 55%-65% of all the characters used in a text in order to know every character in 90% of its sentences. 90% of the text uses only 15%-25% of all characters, though. These are just rough numbers: I only used the Glossika sentences and two old texts from Project Gutenberg. Other metrics for complexity/difficulty I came up with are: number of characters used (obviously), median sentence length, and number of bigrams divided by number of characters. The last one could be interesting, but I haven't done any tests yet; Python might be a little too slow for looping through all the characters one by one.
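The three extra metrics could be sketched in Python as below. This is not the script mentioned above. It assumes "number of bigrams divided by number of characters" means distinct bigrams over distinct characters, and that bigrams are counted within a sentence after dropping punctuation; both are interpretations. The names are hypothetical.

```python
import re
from statistics import median


def is_cjk(ch):
    """Rough check for a common Chinese character (basic CJK block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def complexity_metrics(text):
    """Distinct characters, median sentence length, bigrams per character."""
    # Assumption: sentences end at these punctuation marks or at a newline.
    raw = re.split(r"[。！？!?\n]+", text)
    sentences = [[ch for ch in s if is_cjk(ch)] for s in raw]
    sentences = [s for s in sentences if s]
    if not sentences:
        return {"distinct_characters": 0,
                "median_sentence_length": 0,
                "bigrams_per_character": 0.0}

    chars = {ch for s in sentences for ch in s}
    # Adjacent character pairs, collected one sentence at a time.
    bigrams = {a + b for s in sentences for a, b in zip(s, s[1:])}
    return {
        "distinct_characters": len(chars),
        "median_sentence_length": median(len(s) for s in sentences),
        "bigrams_per_character": len(bigrams) / len(chars),
    }


metrics = complexity_metrics("我好。我很好。你好吗？")
```

Each character is visited a fixed number of times and the sets do the deduplication, so the work grows linearly with the length of the text.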

March 20, 2016 at 05:52 PM

Windows 10

CTA 0.99.11

If CTA was maximized the last time it was used, then on the next startup I get an icon on the toolbar for as long as your startup splash screen is open, but then the icon disappears.

If I then minimize CTA, the icon comes back and stays there as I maximize it again and use it.

March 21, 2016 at 02:09 AM

Thanks. I'll look into it.

March 21, 2016 at 02:55 AM

Ok, this will be fixed in the next version.

Incidentally, if you are using a maximised window, you might also be interested in using full-screen mode (toggle by pressing F11, or select from the menu: Window->Full Screen).

March 28, 2016 at 02:41 AM

A new version is now available.

Custom dictionaries are now supported as described in this post. If you have been following that advice you should be able to upgrade and everything should work. If not, please make sure to back up your programfiles\chinesetextanalyser\data directory beforehand.

If you only want to use your custom definitions and not any definitions from the main dictionary, there is a flag in the config file (c:\users\<username>\AppData\Local\ChineseTextAnalyser\data\config) under the '[definition]' section called 'customDefinitionsOnly'. By default it is set to 'false'; set it to 'true' if you only want the custom dictionary entries available. Note: if you have just upgraded, this setting might not be there until you restart CTA and then exit the program.
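As a rough illustration, the flag could be flipped with a few lines of Python. This sketch assumes the config file is INI-style with `key = value` lines, which the thread does not confirm. `SAMPLE` and `set_custom_only` are made up for the example, and it works on a string rather than touching a real config file.

```python
import configparser
import io

# Assumed layout of the relevant part of the config file (not confirmed).
SAMPLE = """[definition]
customDefinitionsOnly = false
"""


def set_custom_only(config_text, enabled):
    """Return config_text with customDefinitionsOnly set to true or false."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the camelCase key exactly as written
    parser.read_string(config_text)
    if not parser.has_section("definition"):
        parser.add_section("definition")
    parser.set("definition", "customDefinitionsOnly",
               "true" if enabled else "false")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = set_custom_only(SAMPLE, True)
```

Editing the file by hand in a text editor does the same job; keep a backup copy of the config either way.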

When doing document exports you can now also use \n, \r and \t to add newlines, carriage returns and tabs in the various pre/post fields.
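To make the escape sequences concrete, here is a small Python sketch of how such a field might be expanded. It is only an illustration of the idea, not how CTA does it, and `expand_escapes` is a hypothetical helper.

```python
import re

# The three escape sequences mentioned above and what they stand for.
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t"}


def expand_escapes(field):
    """Replace the two-character sequences \\n, \\r and \\t with the
    newline, carriage return and tab characters they stand for."""
    return re.sub(r"\\([nrt])", lambda m: _ESCAPES[m.group(1)], field)


# A post-field typed as backslash-t would become a real tab between columns.
separator = expand_escapes("\\t")
```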

Finally, the bug mentioned above by querido is also fixed.

March 28, 2016 at 12:11 PM

Setting customDefinitionsOnly to true gets me the Unexpected Error popup. Setting it back to false fixes it.

I put my Cantonese dictionary in AppData/Local/Chinese Text Analyser/data/My cedict_ts.u8.

That same dictionary, but named cedict_ts.u8, is in Program Files/Chinese Text Analyser/data.

March 28, 2016 at 01:52 PM

Hah, I actually put that option in specifically for you because I figured you'd want the Cantonese-only definitions.

The dictionary you put in AppData/Local/ChineseTextAnalyser/data should also be called cedict_ts.u8 (no 'My' at the beginning). If that doesn't fix the problem, can you please send me a copy of the file so I can investigate?

March 28, 2016 at 02:28 PM

Yes, that fixed it.

[For me? Makes me think I could request - or commission you for - other things I sometimes think of. Nothing at this time. E.g., back when I thought I had to change Mnemosyne/Anki I learned enough Python to do that, etc., but I can't spare the brain cells any more.]

March 28, 2016 at 02:56 PM

Glad to hear that fixed it.

For me?

I remember you mentioned replacing the dictionary entirely for use with Cantonese and figured you might not want to have mixed Mandarin and Cantonese, and probably didn't want to have to keep backing up and replacing the dictionary file every time you upgraded. Limiting CTA to only use the custom dictionary seemed a good way to get around that, and from previous posts I know you don't mind mucking about in the config file. Once the main custom dictionary code was implemented, it was only a couple of extra lines to add the option to only use the custom dictionary, and so the option was born. I'm sure other people will find it useful too.

Regarding features, requests are always welcome, so feel free to let me know of anything you think could be improved but I make no promises (as @realmayo will attest - though I will get multiple wordlists done eventually which will hopefully ease some of his pain).

For commissions I'm always open to discussion, but I'm a little apprehensive about them. I suspect my standard consulting rate is probably going to put even smallish features out of budget, and anything less than my standard consulting rate is going to make it low on my priority list and I might not get around to it for weeks/months because I need to do other work to pay the bills.

March 29, 2016 at 08:09 PM

Imron!

Thanks for the update. To confirm: When I add entries to AppData/Local/ChineseTextAnalyser/data/cedict_ts.u8, I no longer need to paste them into the Program Files/ChineseTextAnalyser/data dictionary file?

==

Changing gears for a second..... in CTA, sometimes small amounts of the text are obscured from view. I've noticed this with:

(1) the last line of a text obscured by the screen frame

(2) the rightmost characters of a line obscured by the scrollbar/word statistics pane

Resizing the window, switching into Full Screen mode, changing the font size, or hiding the word statistics pane usually solves these problems.

I've attached screenshots of both so you can see what I mean. In screenshot #1, the last line of text is hidden from view. (I'm hovering over one of the words, but it's too low to read.) In screenshot #2, several characters on the right margin are hidden from view. The line break appears to occur at a point to the right of the scrollbar.

Problem #2 usually happens when pasting text from the internet into CTA, especially text with tabs or tables. Here is the text I used, if you want to replicate:

http://xybk.fuyin.tv/Bible/NCV/gb/gen/1.htm

This problem isn't new to this version; I noticed it before. It is easy to solve, usually by maximizing the CTA window or otherwise playing with the window dimensions. It's probably unwanted behavior, though, so I'm mentioning it here.

March 29, 2016 at 11:38 PM

To confirm: When I add entries to AppData/Local/ChineseTextAnalyser/data/cedict_ts.u8, I no longer need to paste them into the Program Files/ChineseTextAnalyser/data dictionary file?

That is correct. You now only need to add them to AppData/Local/ChineseTextAnalyser/data/cedict_ts.u8, and this file will be unaffected by upgrades.

Regarding the obscured text, it's a known problem that I haven't got around to fixing yet.

The problem with the right margin is caused by 'Tab' characters earlier in the line, which throw out the line width calculations.
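One way a tab can throw off a width calculation is sketched below in Python. This is a generic illustration counting columns in a fixed-width layout, not CTA's rendering code. The 8-column tab stop is an assumption.

```python
def naive_width(line):
    """Count every character, including a tab, as one column."""
    return len(line)


def width_with_tabs(line, tab_size=8):
    """Count columns after advancing each tab to the next tab stop."""
    return len(line.expandtabs(tab_size))


line = "a\tb"
underestimate = width_with_tabs(line) - naive_width(line)
```

If the wrap point is chosen from the smaller figure, the drawn line ends up wider than expected and can run past the visible area.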

The problem with the bottom margin is caused by always rendering one extra line beyond the edge of the screen to help with smoother scrolling and to handle partial lines. In some situations this throws off the scrolling.

Neither is difficult to fix; they just haven't been a high priority.

April 1, 2016 at 10:09 PM

Nice update - glad I can use it without administrative permissions - thanks!

So, I'm running it off a USB stick on multiple computers. The README in c:\users\<username>\AppData\Local\ChineseTextAnalyser\wordlists says that if I want to back up my word lists and such, I should just copy the whole "wordlists" folder. So, say I run it on a different computer, do I basically just open up CTA, upon which this new "wordlists" folder will be created, close CTA, and then just replace the one "wordlists" folder with the one that I had copied previously from a different computer? Then when I re-open CTA, it will have imported all of my known words?

What if, whenever I decide to use CTA on a different computer, I just go file>export>to file, and then under the "Word List" tab, select

Word List: [Known]

Filter: Include words on list

and then for my selected fields just have "Word" selected (seeing that I'm using it with traditional Chinese, is there any difference between selecting "Word" and selecting "Traditional")?

Then I could create a document, call it "known words" or something, and when I use CTA on a different computer, just import that list and mark all as known.

Would that process produce results any different from copying over the whole "wordlists" folder?

April 1, 2016 at 11:26 PM

If you are running it off a USB key, create a new shortcut to cta.exe on the USB key, then right-click the shortcut, select Properties, and edit the 'Target' to include --portable after the name of the executable, e.g. cta.exe --portable

Then when the program runs, it will store all of the wordlists and config on the USB key in a sub-folder of the CTA directory called 'Portable'. Note for any other users: this only works for the licensed version, and not the Free Trial.

If you wish to copy your existing vocab, take it from c:\users\<username>\AppData\Local\ChineseTextAnalyser\wordlists and put it in <usb-install-folder>\ChineseTextAnalyser\Portable\UserData\ChineseTextAnalyser\wordlists. You might also want to copy over your config file (data/config) and your license file (cta.licence) to the relevant locations in Portable\UserData\ChineseTextAnalyser.
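The folder copy described above can also be scripted. The Python sketch below is one hedged way to do it: `copy_wordlists` is a made-up helper, and the two root folders must be filled in by hand because they depend on the username and on where the USB key is mounted. It needs Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def copy_wordlists(local_root, portable_root):
    """Copy <local_root>/wordlists into <portable_root>/wordlists.

    Existing files with the same name in the destination are overwritten.
    """
    src = Path(local_root) / "wordlists"
    dst = Path(portable_root) / "wordlists"
    if not src.is_dir():
        raise FileNotFoundError(f"no wordlists folder at {src}")
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Example call, with the placeholders from the post left for you to fill in:
# copy_wordlists(
#     r"c:\users\<username>\AppData\Local\ChineseTextAnalyser",
#     r"<usb-install-folder>\ChineseTextAnalyser\Portable\UserData\ChineseTextAnalyser",
# )
```

Close CTA before copying so the files are not in use.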

and then just replace the one "wordlists" folder with the one that I had copied previously from a different computer? Then when I re-open CTA, it will have imported all of my known words?

Yes. This is correct. However if you use the --portable flag, then you won't ever need to worry about this as it will be running entirely off the USB.

What if, whenever I decide to use CTA on a different computer, I just go file>export>to file, and then under the "Word List" tab, select

This won't work, because it will only export words from the current document that are Known, not your entire list of known words. Even if you open up your existing list of known words, due to parsing/segmentation differences it is possible that there will be some words dropped/missing when you export.

Currently the only guaranteed way to get all your vocab is to copy the wordlists folder. Eventually there will be a vocab syncing feature that will allow syncing of vocab across multiple computers.

is there any difference between selecting "Word" and selecting "Traditional"

Yes. Word is the word as it appears in the document, and Traditional is always the Traditional version of the word, even if the document is in simplified. Traditional will be the same as Word if the original document was also Traditional.

Likewise for Word and Simplified - Simplified will always be the simplified version of the word, even if the document is in traditional, and it will be the same as Word if the document was originally in Simplified.

e.g. if the original document contained the word 汉语 then when exporting, Word will be 汉语, Simplified will be 汉语 and Traditional will be 漢語.

If the original document contained the word 漢語 then when exporting, Word will be 漢語, Simplified will be 汉语 and Traditional will be 漢語.
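The same rule, written as a tiny Python sketch. The one-entry conversion table is purely illustrative (a real converter needs a full simplified/traditional mapping), and `export_fields` is a hypothetical name, not part of CTA.

```python
# Hypothetical one-entry conversion table, just for this illustration.
TO_TRADITIONAL = {"汉语": "漢語"}
TO_SIMPLIFIED = {trad: simp for simp, trad in TO_TRADITIONAL.items()}


def export_fields(word):
    """Return the three export fields for a word as it appears in a document."""
    return {
        "Word": word,  # exactly as it appears in the document
        "Simplified": TO_SIMPLIFIED.get(word, word),
        "Traditional": TO_TRADITIONAL.get(word, word),
    }


from_simplified_doc = export_fields("汉语")
from_traditional_doc = export_fields("漢語")
```

Only the Word field depends on which script the document uses; the other two come out the same for both documents.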

April 2, 2016 at 12:28 AM

Excellent - I followed your instructions and deleted the local files on the computer, and it still works! Thanks!

Ah, I see the difference now between Word/Traditional/Simplified. Thanks! That makes me wonder... I think you mentioned (or someone did, in a post a while back, who knows where) that you consider learning simplified (if you've been using traditional), or traditional (if you've been learning simplified), to be the same difficulty, because in the end, you are just memorizing new characters. While I can see why that is helpful in determining if going Trad>Simp or Simp>Trad is probably about the same in difficulty, in terms of the words that I actually know (not those that I'm able to read, but actually know), it doesn't make a difference. Even though when I look at CTA's known words, I am of course looking at the words that I know how to read, it is also a nice way to get a gist of how many words I know (even if the number I know will be higher than the number I can read). The problem is, if I am counting both "汉语" and "漢語" as a word, it will look like I know a lot more words than I do, when really I am just trying to get reading practice in both traditional and simplified. So, for a more accurate count of the words that I know, should I just export Traditional into a text file and count them with Excel? Would that be the most accurate way to filter out words that I can read in both traditional and simplified?

April 2, 2016 at 01:00 AM

You're probably not going to like my answer, but personally I think the best thing to do is not place much importance on that figure.

In fact that was a core goal of CTA - to stop people thinking about raw vocabulary numbers, and instead get them to think about how well they could understand a given piece of text.

Especially as your vocabulary increases, what you can read will not be determined by your total vocabulary size but rather how much overlap there is between your vocabulary and what you are reading.

That being said, what you suggest doing would also not be that accurate, because whenever you 'export' it is only exporting words from the current document.

At some point I hope to provide more information and statistics to users about their vocabulary (graphs over time, etc.) and I'll consider adding a breakdown of simp vs trad.

April 2, 2016 at 08:07 AM

Ah, I see - well, I'll await the official stats then.

I'm confused about how to export cloze cards and then import them into Anki. I thought I had to import notes into Anki, not cards. But it seems as if CTA only exports cards.

For example, let's say I have the following sentence:

要是我認真念書起來，實力強到，連我自己都會害怕。

And the words that are marked unknown are:

認真

實力

害怕

Let's say I have CTA export cloze sentence, word, pinyin and English definition. It looks like CTA will export three sentences (three cards):

要是我[...]念書起來，實力強到，連我自己都會害怕。　　　認真　　　rènzhēn　　serious

要是我認真念書起來，[...]強到，連我自己都會害怕。　　　實力　　　shílì　　　　strength

要是我認真念書起來，實力強到，連我自己都會[...]。　　　害怕　　　hàipà　　　afraid

(I realize the definitions are terrible - they were taken from Google Translate.)

But if I want to import into Anki, I think I'd want to have something like this (one note, with multiple cards built in):

要是我{{c1::認真}}念書起來，{{c2::實力}}強到，連我自己都會{{c3::害怕}}。　　　認真-rènzhēn-serious, 實力-shílì-strength, 害怕-hàipà-afraid

Then of course, after I imported it into Anki, I would change field #2 to lay out the new words nicely and such.

I thought about doing a substitute formula in Excel - to go down the list of unknown words and substitute each one in each sentence with {{c1::<unknown word>}} - but the problem with that is, I'd end up with something like this, with only one cloze deletion for all words sharing a sentence:

要是我{{c1::認真}}念書起來，{{c1::實力}}強到，連我自己都會{{c1::害怕}}。

Also, it would be really hard to try to incorporate all of the other information (pinyin, English definition, and so on).

Is there an easier way to do all of this through CTA?
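Outside CTA, the merge described above could be done with a short script. The Python sketch below is one possible approach, not a CTA feature. It assumes each exported row has already been split into (cloze sentence, word, pinyin, definition) and that the blank is written as [...], as in the example; the real export separator is not shown in the thread. `merge_cloze_rows` is a made-up name. It uses Anki's double-colon cloze syntax, {{c1::...}}.

```python
from collections import defaultdict

BLANK = "[...]"


def merge_cloze_rows(rows):
    """Merge one-blank-per-row cloze cards into one note per sentence.

    rows: iterable of (cloze_sentence, word, pinyin, definition).
    Returns a list of (sentence with {{cN::word}} markers, glossary string).
    """
    grouped = defaultdict(list)
    for cloze, word, pinyin, definition in rows:
        pos = cloze.index(BLANK)
        # Put the word back to recover the original sentence.
        full = cloze[:pos] + word + cloze[pos + len(BLANK):]
        grouped[full].append((pos, word, pinyin, definition))

    notes = []
    for full, items in grouped.items():
        items.sort()  # number the clozes in reading order
        pieces, cursor = [], 0
        for n, (pos, word, _, _) in enumerate(items, start=1):
            pieces.append(full[cursor:pos])
            pieces.append("{{c%d::%s}}" % (n, word))
            cursor = pos + len(word)
        pieces.append(full[cursor:])
        glossary = ", ".join(f"{w}-{p}-{d}" for _, w, p, d in items)
        notes.append(("".join(pieces), glossary))
    return notes


rows = [
    ("要是我[...]念書起來，實力強到，連我自己都會害怕。", "認真", "rènzhēn", "serious"),
    ("要是我認真念書起來，[...]強到，連我自己都會害怕。", "實力", "shílì", "strength"),
    ("要是我認真念書起來，實力強到，連我自己都會[...]。", "害怕", "hàipà", "afraid"),
]
notes = merge_cloze_rows(rows)
```

Rows are grouped by the reconstructed sentence, so words from different sentences never end up in the same note. The position of the blank doubles as the word's position in the full sentence, which is what lets the clozes be numbered in reading order.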

April 2, 2016 at 09:26 AM

I haven't looked at Anki or Anki notes for a while. I'll have a play around with things and see if there's a good way to do it.

April 2, 2016 at 11:48 AM

To Yadang: I'm working straight through a graded reader now. I avoid the extra involvement you describe in #356 and #358 (though I understand it, as I used to do that). With regard to #358, I would just take those three cards generated by CTA, not using Anki cloze notes, and press on.

wibr

imron

wibr

querido

imron

imron

imron

querido

imron

querido

imron

murrayjames

imron

Yadang

imron

Yadang

imron

Yadang

imron

querido
