Introducing Chinese Text Analyser

April 1, 2020 at 08:23 AM

18 hours ago, imron said:

On 2/25/2020 at 6:36 AM, Pall said:

It would be fine to add also a function that can mark characters from HSK1, 2, 3, 4, 5, 6

I get that people are interested in things like this, but CTA aims to subtly push people away from thinking in terms of HSK. In fact I only provide the HSK statistics to drive home the point that for most native content, the vocabulary for the HSK doesn't give you very much at all, and you're better off using frequently occurring words in what you are reading.

On 3/25/2020 at 10:24 AM, Pall said:

It would be especially great if CTA could also mark and count 'head' characters even though they're of conditional nature.

I had a look through your link, but I'm still not entirely sure what you mean by head characters.

I understand your point. It's true, HSK5 is not enough even for reading newspapers. But the idea is to learn some basis exceptionally well, to be able to feel and observe it in one's mind, and that will make things much easier when encountering a new character, since it can be fitted into the firm HSK5 framework. As for me, based on the first three poems, for B-P-M-F, D-T-N-L and Z-C-S, I've learnt all the characters from HSK5 very confidently. However, let's assume I sometimes doubt whether I know a new character whose pinyin I looked up, and it happens to be one of the learnt syllables. I check it in the Table and... (1) see it's there. I'm sure now it's very unlikely that I'll hesitate to recognize it next time. (2) It's not there. OK, I just add it to a certain card and cell in the Table, marking it in green. In this case it's also very likely that I'll memorize it much more quickly than I would without a firm basis in the form of the Table and the cards (two types, 'intermediate poem presentation cards' and 'head character cards').

Head characters are just characters selected to represent an entire syllable in a particular tone. For instance, in HSK5 there are three characters pronounced fáng: 房, 防, 妨. We take one of them as the representative; I selected 房. Head characters are also selected for 'fang' in the other tones: 方 for the 1st tone, 访 for the 3rd, and 放 for the 4th. They're all 'head' characters. We also select one of their meanings to use in formulas (this applies to all characters): 'corner' for 方, 'building' for 房 (though 'flat' might be the more common meaning), 'explore' for 访 and 'advertise' for 放. Head characters are used in 'head character cards': the head character is on one side, and the other characters with the same pinyin are on the reverse, arranged in a special order for better memorization, see picture (at the bottom of it). In the 'head character cards' the other characters are linked to the 'head' ones by a 'horizontal' formula, a phrase connecting their meanings one after the other (in the same word order), with the use of other necessary words, of course. The meanings of the characters are 'target' words.

The possible number of head characters is about 1200-1300 for the whole language, and within HSK5 it's 880 (just to begin with). But the number of head character cards required may be much smaller, because hundreds of syllables are represented by a single character, while others are represented by several characters with the same pinyin. For example, in HSK5 we need only about 350 head character cards.
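To make the card count concrete, here is a minimal Python sketch of the grouping described above. The data is just the 'fang' example from this post; the names `CHARACTERS`, `HEADS` and `head_character_cards` are made up for illustration and are not part of CTA or of any published tool. A card is produced only when a head character has other characters sharing its syllable and tone, which is why fewer cards than head characters are needed.

```python
from collections import defaultdict

# (character, syllable, tone) triples taken from the 'fang' example in the post.
CHARACTERS = [
    ("方", "fang", 1),
    ("房", "fang", 2),
    ("防", "fang", 2),
    ("妨", "fang", 2),
    ("访", "fang", 3),
    ("放", "fang", 4),
]

# The head character chosen for each (syllable, tone) pair in the post.
HEADS = {
    ("fang", 1): "方",
    ("fang", 2): "房",
    ("fang", 3): "访",
    ("fang", 4): "放",
}


def head_character_cards(characters, heads):
    """Return {head character: other characters with the same syllable and tone}.

    Heads that stand alone for their syllable and tone need no card.
    """
    groups = defaultdict(list)
    for char, syllable, tone in characters:
        groups[(syllable, tone)].append(char)

    cards = {}
    for key, members in groups.items():
        head = heads[key]
        others = [c for c in members if c != head]
        if others:
            cards[head] = others
    return cards


cards = head_character_cards(CHARACTERS, HEADS)
print(len(HEADS), "head characters,", len(cards), "card(s):", cards)
```

With this data there are four head characters but only one card, for 房, with 防 and 妨 on the reverse.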

Then one of the 'head' characters is selected as the 'key' character to represent one of the 400 syllables without regard to tone. For 'fang' I chose 方. One meaning of each 'key' character is used in 'poems' composed according to the 声母 vs 韵母 correspondence. All head characters, including key characters, are presented in 'intermediate poem presentation' cards, see pic (at the top of it). Head characters of the same syllable (in different tones) are linked to one another by 'vertical' formulas. I managed to compose these formulas in English.

Thus, one has to learn only 120 'intermediate poem presentation cards' and some hundreds of 'head character cards' (for HSK5 only 350) to know all the characters at one's level. In the Dictionary of Contemporary Mandarin that I have, 20,000 words are based on only 4,500 characters. So, starting from the HSK5 basis, one can move towards the objective of 4,500 by adding new characters marked in green.

April 1, 2020 at 08:26 AM

16 hours ago, roddy said:

Did you ever know that you're my hero
And everything I would like to be?
I can fly higher than an eagle
For you are Imron beneath my wings

Good poetry! I'm sure you could manage to compose in English even long 'horizontal' formulas linking a number of characters in a given order.

April 4, 2020 at 04:01 PM

@Imron: is there a way to export a [word list] to txt or xls?

When I go to [menu: word lists] and [manage...] it lets me delete words etc, but I cannot select all words and copy/paste them elsewhere...

April 5, 2020 at 04:52 AM

There isn't a way to export them, but there is a way to access them.

Assuming you are on Windows you can open file explorer and go to:

C:\Users\USERNAME\AppData\Local\ChineseTextAnalyser\wordlists\cache

Where USERNAME should be replaced with your computer username.

Each file in that directory corresponds to each wordlist, and will be a .txt file with one word per line.

NOTE: You should copy/open these files after closing CTA as recent changes might not have been written out to disk. Also note, these are not the actual saved wordlists themselves, just a cached copy that CTA uses so it doesn't have to rebuild the full list from the actual saved format. If you edit these files, those changes will be overwritten when CTA detects changes have been made and recreates the cached copy.
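If you want everything in one file, a small script can read the cached copies and write them out together. This is only a sketch: the function names are made up for illustration, and it assumes the cache files are UTF-8 text with one word per line (the encoding is an assumption, so adjust it if the output looks garbled). It only reads the cache, so it does not touch CTA's own data.

```python
from pathlib import Path


def read_cached_wordlists(cache_dir):
    """Return {file name without extension: [words]} for each .txt file."""
    lists = {}
    for path in sorted(Path(cache_dir).glob("*.txt")):
        # Assumption: UTF-8; "utf-8-sig" also tolerates a byte-order mark.
        text = path.read_text(encoding="utf-8-sig")
        lists[path.stem] = [line.strip() for line in text.splitlines() if line.strip()]
    return lists


def export_wordlists(cache_dir, out_file):
    """Write every cached word to one tab-separated file: list name, word.

    Returns the number of words written.
    """
    lists = read_cached_wordlists(cache_dir)
    with open(out_file, "w", encoding="utf-8", newline="\n") as out:
        for name, words in lists.items():
            for word in words:
                out.write(f"{name}\t{word}\n")
    return sum(len(words) for words in lists.values())
```

Close CTA first, then call something like `export_wordlists(r"C:\Users\USERNAME\AppData\Local\ChineseTextAnalyser\wordlists\cache", "wordlists.tsv")`, with USERNAME replaced as above. The resulting tab-separated file opens in Excel or any text editor.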

April 18, 2020 at 02:25 AM

@imron

@roddy

Hi,
please, also see how using head characters helps to learn words, the last answer in the

https://www.chinese-forums.com/forums/topic/59743-meaningphonetics-based-system-to-learn-characters-adopted-for-english-speakers/

May 21, 2020 at 05:28 AM

Imron, question for you. Today I updated CTA to the latest version (0.99.18). After updating, I noticed that the known word % of texts I was reading had fallen 0.50–1.00%. Any idea why this happened?

May 21, 2020 at 10:20 AM

The latest version included an update to use the latest version of cedict, which means there’s a bunch of extra words, which means some things will be segmented differently.
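To illustrate why extra dictionary entries can lower the percentage, here is a toy example in Python. It uses greedy forward maximum matching, which is a common textbook approach and not necessarily how CTA segments text; the sentence, dictionaries and known-word list are invented for the example.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position take the longest dictionary word, falling back to a
    single character when nothing longer matches.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


def known_percent(words, known):
    """Percentage of segmented words that are in the known set."""
    return 100 * sum(word in known for word in words) / len(words)


text = "我喜欢看电视剧"
known = {"我", "喜欢", "看", "电视", "剧"}

old_dictionary = {"喜欢", "电视"}
new_dictionary = old_dictionary | {"电视剧"}  # the update adds a longer word

old_words = segment(text, old_dictionary)  # 我 / 喜欢 / 看 / 电视 / 剧
new_words = segment(text, new_dictionary)  # 我 / 喜欢 / 看 / 电视剧

print(old_words, known_percent(old_words, known))
print(new_words, known_percent(new_words, known))
```

With the old dictionary every word in the sentence is known. Once 电视剧 is in the dictionary it is segmented as a single word that has not been marked as known, so the known percentage for the same sentence drops from 100% to 75%.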

May 22, 2020 at 04:34 AM

Ah, that would do it! I was slightly sad when those percentages went down.

May 22, 2020 at 05:52 AM

Keep learning and they'll go back up!

June 1, 2020 at 09:26 AM

On 11/2/2015 at 9:16 AM, Geiko said:

@Imron: according to CTA, my current known vocabulary is at 11912 words, but I've been analysing both simplified and traditional texts, and if I'm not wrong traditional characters and their simplified counterparts are counted as different words, so the real figure must be lower.

On 11/2/2015 at 9:39 AM, imron said:

That's correct, because not all of them will be easily guessable if you know the other.

From a few years back...

I'm wondering what the best way to handle this is, on the assumption that you know both character sets well and are happy to treat them as interchangeable - ie, if I mark 个 as known, I don't want 個 turning up in an unknown list, and vice versa. Off the top of my head...

1) Decide on one character set to use CTA with and convert to that before feeding any text in. FWIW, ~~I find MS Word's Trad>Simp conversion very reliable.~~ (actually, looking at it more carefully now, I'm changing my mind on this. Plenty of mistakes you can easily skim over as they're similar enough, but not as good as I thought). You could then back-convert exported unknown word lists if you wanted (although I'm less sure on MS Word's Simp>Trad conversion, and at that point it's working with a list of words and won't have so much context to go on. Not sure what difference that makes). Conversion issues aside, this seems most elegant as you don't have 'duplicate' entries.

2) Every time you switch character sets, take your known word list, convert it, paste in, mark all as known. Again, possible conversion issues and I'm not keen on what it does to vocabulary size.

3) Manually add as you go along. This seems least efficient.

Would appreciate any 前车之鉴。

I really enjoy using CTA. Not sure how much demand there'd be, but if you ever thought of a Pro / Advanced version with some extra features, I'd be on board.

June 1, 2020 at 10:10 AM

An advanced or pro version should include counting characters, too. Words are specific. If one has learnt all 5,000 words in a book, for example, it doesn't mean that another book will be based on the same words: many new ones appear while some disappear. But if one has learnt 5,000 characters, one can expect to come across other characters very rarely.
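A rough Python sketch of the difference between counting words and counting characters. The function names and the tiny word lists are invented for illustration; it treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese characters, which is a simplification that leaves out the extension blocks.

```python
def is_chinese(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def characters_in(words):
    """Set of distinct Chinese characters used in a collection of words."""
    return {ch for word in words for ch in word if is_chinese(ch)}


def word_coverage(known_words, new_words):
    """Percentage of the new words that are already known as words."""
    return 100 * sum(w in known_words for w in new_words) / len(new_words)


def character_coverage(known_words, new_words):
    """Percentage of characters in the new words that appear in known words."""
    known_chars = characters_in(known_words)
    chars = [ch for word in new_words for ch in word if is_chinese(ch)]
    return 100 * sum(ch in known_chars for ch in chars) / len(chars)


known_words = {"学生", "电视"}
new_words = ["电学", "学生"]

print(word_coverage(known_words, new_words))       # half the words are new
print(character_coverage(known_words, new_words))  # but no character is new
```

Here 电学 is an unknown word built entirely from known characters, so word coverage is 50% while character coverage is 100%.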

June 1, 2020 at 12:32 PM

3 hours ago, roddy said:

Not sure how much demand there'd be, but if you ever thought of a Pro / Advanced version with some extra features, I'd be on board.

Plenty of features I'd love to add, but not enough time to add them at the expense of paying work.

3 hours ago, roddy said:

ie, if I mark 个 as known, I don't want 個 turning up in an unknown list, and vice versa.

These are different enough that you might not know them. It's all very well and good to know both sets "well enough" but it's in the margins where this will make the difference. Perhaps a feature that lets you mark simplified/traditional variants as known, but one that needs to be run manually rather than automatically.
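A manually run pass like that could be sketched in Python as follows. This is not an existing CTA feature; the function names are hypothetical, and `SIMP_TO_TRAD` is a three-entry stand-in for a real conversion table. The character-by-character mapping is deliberately naive: real conversion is not one-to-one (simplified 发 corresponds to both 發 and 髮, for example), which is one reason to run it by hand and check the result.

```python
# Illustrative stand-in for a real simplified-to-traditional table.
SIMP_TO_TRAD = {"个": "個", "书": "書", "学": "學"}


def variants(word, simp_to_trad=SIMP_TO_TRAD):
    """Return the simplified and traditional forms of a word.

    Characters missing from the table are left unchanged.
    """
    trad_to_simp = {trad: simp for simp, trad in simp_to_trad.items()}
    simplified = "".join(trad_to_simp.get(ch, ch) for ch in word)
    traditional = "".join(simp_to_trad.get(ch, ch) for ch in word)
    return {simplified, traditional}


def mark_variants_known(known, simp_to_trad=SIMP_TO_TRAD):
    """Return the known set expanded with each word's other-script form."""
    expanded = set(known)
    for word in known:
        expanded |= variants(word, simp_to_trad)
    return expanded


known = {"个", "學書"}
print(sorted(mark_variants_known(known)))
```

After the pass, both 个 and 個 count as known, as do 學書 and 学书, while the original list is left untouched until you decide to keep the expanded one.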

3 hours ago, roddy said:

Manually add as you go along. This seems least efficient.

It also involves the most amount of work and therefore the most learning.

June 6, 2020 at 03:49 PM

@imron, I'm a bit embarrassed to admit that since I first downloaded CTA I never updated it, my version is 0.99.9 and now I wanted to update it to the latest version but I can't find how on the website. How should I do it?

June 6, 2020 at 11:06 PM

https://www.chinesetextanalyser.com/download

Then click on the link for your operating system.

June 17, 2020 at 04:22 AM

Hello Imron, on MacOS, CTA is not working correctly when I mark a line-breaking word as known.

An example: The word 盛行 spans two lines (i.e., 盛 is the last character of one line; 行 is the first character of the next line). If I place my cursor on 盛 and mark the word 盛行 as known, CTA works as expected. If I place my cursor on 行 and then mark the word 盛行 as known, 盛行 remains unknown, and the character 行 becomes unknown.

June 17, 2020 at 05:01 AM

That's a bug!

June 17, 2020 at 09:03 AM

Ah yes. I caught it flying around my laptop just now.

July 29, 2020 at 11:45 PM

Two suggestions:

1) Keep a running corpus of all recent articles. By that, I mean save and cumulatively sum the word lists from previously opened documents, which can then be used as an alternative sort order for vocabulary in the current document. I do this in a cumbersome way now (save all articles read each week to Evernote, export to a single file, then do a single CTA scour for the most frequently hit unknown words in that group), but building this into CTA would be great. Make it easy to reset this local corpus frequency count, so I can reset weekly, monthly, etc. as I see fit.

2) Timed movement of the highlight along characters. My current habit is to load a document, mark it entirely known, click the right arrow quickly (or hold it down) as I read through an article, and then hit 'd' to pop up a definition as necessary. I'll probably get some form of repetitive stress syndrome from all that clicking (once per word!). I'd prefer if I could set a timer for 0.x seconds per character, with CTA then moving at that set pace (and maybe make left/right arrow increase or decrease the rate). Hitting 'd' once pauses for a definition, hitting it again moves on.
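The running-corpus idea can be sketched in a few lines of Python. This is only an illustration of the suggestion, not anything CTA currently does; the class and method names are invented, and it assumes each document has already been segmented into a list of words.

```python
from collections import Counter


class RunningCorpus:
    """Cumulative word counts across previously opened documents."""

    def __init__(self):
        self.counts = Counter()

    def add_document(self, words):
        """Add one document's segmented words to the running totals."""
        self.counts.update(words)

    def reset(self):
        """Start a new period (weekly, monthly, and so on)."""
        self.counts.clear()

    def rank_unknown(self, document_words, known):
        """Unknown words in a document, most frequent in the corpus first.

        Ties are broken by the word itself so the order is stable.
        """
        unknown = {w for w in document_words if w not in known}
        return sorted(unknown, key=lambda w: (-self.counts[w], w))


corpus = RunningCorpus()
corpus.add_document(["经济", "政策", "经济"])
corpus.add_document(["政策", "经济", "市场"])

ranked = corpus.rank_unknown(["市场", "政策", "经济", "今天"], known={"今天"})
print(ranked)
```

Sorting the current document's unknown words by their corpus-wide count puts the words that keep turning up across articles at the top, and `reset()` clears the totals for the next period.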

July 30, 2020 at 06:40 AM

Adding a corpus feature is a long-standing issue and probably the next major feature I'll tackle.

Adding timed movement is a much easier feature and something I could probably add more quickly - though no promises yet. Interesting piece of trivia: "Chinese Text Analyser" started life as "Chinese Speed Reader" and it got everything implemented except the speed reading part, at which point I decided to just release the analyser part first and then come back and do the speed reader later, but have never had the time to do so.

July 30, 2020 at 11:13 AM

One more request:

Full screen mode is more like "focus" mode. Right now full screen mode essentially maximizes the window, but doesn't change how it looks - menus remain, etc. Instead, it should hide all menus (top, right bar), and status bar on the bottom, with just the body text visible and centered - similar to how an ebook looks, or reader mode in Firefox, when full screen.
