Introducing Chinese Text Analyser

April 2, 2016 at 08:37 PM

At some point I hope to provide more information and statistics to users about their vocabulary (graphs over time, etc.) and I'll consider adding a breakdown of simplified vs. traditional.

Excellent - I think that would be a cool breakdown to see.

I would just take those three cards generated by CTA, not using anki cloze notes, and press on.

Oh I see - do you mean I could just import it as a "regular card" with multiple fields, like this?

Front: 要是我[...]念書起來，實力強到，連我自己都會害怕。

Back: 認真

Pinyin: rènzhēn

English: serious

And even though it's not technically a cloze deletion card, it would serve as one? I think that would work - good idea!

Honestly, I might just use Pleco, have the word field be the clozed-out word, and the definition field be the sentence, like this:

Word: 認真

Definition: 要是我認真念書起來，實力強到，連我自己都會害怕。

Then I could test Definition>Word, and Pleco would automatically blank out 認真 in the sentence when I was seeing it. Then, if I wanted to remind myself, outside of context, what that word meant, I'd just use the popup window during the flashcard test. This would also allow me to have multiple dictionaries at my disposal, instead of importing into Anki, where I'd just be using the definitions given by CTA. Now that I think about it, this seems like a better idea than using Anki anyway.

April 4, 2016 at 08:17 AM

I think querido's advice to just use normal cards rather than cloze notes is probably the best option at this point in time if you are using Anki. In the future I hope to support more advanced export formats, but until then, tab-separated files will have to do.
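For anyone who wants to build such a file by hand, here is a minimal sketch of turning a sentence and a target word into a tab-separated row with the four fields from the example above (Front, Back, Pinyin, English). The function name `cloze_row` and the `[...]` blank marker are assumptions made for illustration, not something CTA or Anki defines.

```python
import csv
import io

def cloze_row(sentence, word, pinyin, english, blank="[...]"):
    """Build one regular (non-cloze) card row: Front, Back, Pinyin, English."""
    if word not in sentence:
        raise ValueError("word does not occur in sentence")
    # Blank out only the first occurrence of the word on the front side.
    front = sentence.replace(word, blank, 1)
    return [front, word, pinyin, english]

rows = [
    cloze_row("要是我認真念書起來，實力強到，連我自己都會害怕。",
              "認真", "rènzhēn", "serious"),
]

# Write the rows as tab-separated text (here into memory rather than a file).
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Saving `tsv_text` to a UTF-8 text file should give something that can be imported as a four-field note type, with the field mapping chosen during import.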

In other news, there's a new version out, which fixes a crash bug in OSX, another bug when searching over very long lines, and it also adds HSK information, with individual words getting their HSK level specified, as well as HSK statistics being calculated for the whole document (how many total/unique words there are per HSK level).

May 25, 2016 at 11:20 AM

I just downloaded this and I'm trying to get a feel for it. So, Frequency is the number of times the word occurs in a given text; %Frequency is the percentage of the text as a whole that the word makes up; but what is Cumulative % Frequency? Unless I'm missing something, this tool is only useful for analyzing large bodies of text, where words occur enough times to give one a sense of a word's overall frequency (thereby allowing a user to determine if it is worth remembering)? If I were to upload a single newspaper article, it's not going to give me any meaningful sense of how frequent a word is, since most advanced words will only occur once in a given article. Is that correct?

An FAQ on the tool's website with answers to basic questions like these would be very useful, I think.

May 25, 2016 at 01:12 PM

@unfadeable: You've got the general gist of it. Cumulative % Frequency is the frequency of that word plus the frequencies of all the words more frequent than it. So if 你 has a 5% frequency and 我 has a 3% frequency, 我 will have a Cumulative % Frequency of 8%, since it and all the words more frequent than it add up to 8%. This will make the most sense if you use the all tab and look at how the Cumulative % Frequency increases alongside the individual % frequencies.
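To make the arithmetic concrete, here is a small sketch of how the three columns could be computed from a list of already-segmented words. It is only an illustration of the definition above, not CTA's actual code, and the toy word list is made up so that 你 is 5% and 我 is 3% of the text.

```python
from collections import Counter

def frequency_table(words):
    """Return rows of (word, count, percent, cumulative_percent), most frequent first."""
    counts = Counter(words)
    total = sum(counts.values())
    rows = []
    running = 0  # running count of this word and every more frequent word
    for word, count in counts.most_common():
        running += count
        rows.append((word, count, 100 * count / total, 100 * running / total))
    return rows

# Toy text of 100 words: 你 occurs 5 times, 我 occurs 3 times,
# and the remaining 92 placeholder words occur once each.
words = ["你"] * 5 + ["我"] * 3 + ["詞%d" % i for i in range(92)]
table = frequency_table(words)

for word, count, pct, cum in table[:2]:
    print(word, count, pct, cum)
```

The second row is 我 with a frequency of 3% and a cumulative frequency of 8%, matching the example.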

As far as determining what words to learn, that's only one side of the tool. Since it can also mark what words you know, the more you use it the better an idea it will give you of how difficult a text will be for you, whether it is a newspaper article or a full-length novel. To me, this is its greatest value. Being able to throw in a longer article from, say, initium.com, and see how much of the text is unknown gives me a good idea of how I want to read the text. Is it a quick extensive read, or does it merit more of an intensive approach? Extensive reading means just reading for meaning and quickly looking up the few unknown words, while intensive reading means I will learn the words and read the text several times. Extensive reading is best done when you know about 97% of the words. Using CTA I aim for 93% or more, though, since there are inevitably some words that I know and haven't marked and others that were just parsed slightly differently. For example: knowing 星期一 will not also mark 星期二 as known.
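The "how much of the text do I know" number can be sketched roughly like this, assuming the text has already been segmented into words and the known words are kept in a set. The names are hypothetical and this is not how CTA is implemented internally.

```python
def known_coverage(tokens, known_words):
    """Percentage of word occurrences in the text that are marked as known."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return 100 * known / len(tokens)

# Hypothetical segmented text and known-word list.
tokens = ["星期一", "我", "去", "學校", "星期二", "我", "休息"]
known_words = {"星期一", "我", "去", "學校", "休息"}

coverage = known_coverage(tokens, known_words)
print(round(coverage, 1))
```

Here 星期二 counts as unknown even though 星期一 is known, which is one reason the reported percentage tends to come out a little lower than your real comprehension.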

You are right that it won't tell you if you should learn a word that only appears once in a short newspaper article. However, based on my experience, there tend to be a few more difficult words that do get repeated several times. I can then take those words and learn them.

I don't see CTA as a tool that does everything for you, but it arms me with the statistics and information to make more informed decisions about what I read and what vocabulary I take the time to learn.

On a final note, the ability to export the unknown words from CTA and import them into Anki simplifies what used to be a long process for me.

May 25, 2016 at 11:58 PM

@艾墨本: thanks for the thoughtful reply--that makes more sense now, and the use cases you described are definitely of value. It just seems like a no-brainer for a tool like this to have an "overall frequency" column that would be based on whatever corpus' frequency rankings. Still, it definitely seems worth buying based on the functions you described. Cheers!

May 26, 2016 at 01:42 AM

Yep, it's just like 艾墨本 described. The Cumulative %Freq is useful when you want to see which words you need to learn to reach a certain % understanding of the text (within various margins of error).
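As a rough illustration of that use, the sketch below walks down the frequency list and collects unknown words until a target percentage of the text would be covered. It assumes a pre-segmented token list and a set of known words; the function name and sample data are made up, and it ignores the margins of error mentioned above.

```python
from collections import Counter

def words_to_learn(tokens, known_words, target_percent):
    """Most frequent unknown words needed to reach target_percent coverage."""
    total = len(tokens)
    if total == 0:
        return []
    counts = Counter(tokens)
    # Occurrences already covered by words marked as known.
    covered = sum(count for word, count in counts.items() if word in known_words)
    to_learn = []
    for word, count in counts.most_common():
        if 100 * covered / total >= target_percent:
            break
        if word not in known_words:
            to_learn.append(word)
            covered += count
    return to_learn

# Hypothetical 100-word text.
tokens = (["的"] * 50 + ["經濟"] * 20 + ["政策"] * 15
          + ["改革"] * 10 + ["泡沫"] * 5)
study_list = words_to_learn(tokens, {"的"}, 90)
print(study_list)
```

With only 的 known (50% coverage), learning 經濟, 政策 and 改革 takes coverage to 95%, so 泡沫 is left off the list.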

I'm actually in the process of writing full documentation for the application at the moment, which will cover all of these things.

"It just seems like a no-brainer for a tool like this to have an 'overall frequency' column that would be based on whatever corpus' frequency rankings"

It *seems* like a no-brainer, but there's actually a very good reason not to have this, especially at the advanced level, and that's because the frequency rankings from a general corpus will be very different from the frequency rankings from what you are reading - often by quite a large margin, and so CTA is designed to help you focus on what you are reading now, rather than let you worry about some arbitrary frequency ranking which might have little to no relevance to what you are reading.

"If I were to upload a single newspaper article, it's not going to give me any meaningful sense of how frequent a word is, since most advanced words will only occur once in a given article. Is that correct?"

It depends on your definition of advanced. But even at say HSK 6 level, you'll find that in any native newspaper article, there will still be plenty of unknown words with multiple frequency counts, and that's really all you need - it's a meaningful sense of how frequent the word is in what you are reading now. Most people will generally read on a small range of topics that interest them, from a small range of sources that provide interesting material and so words that repeat multiple times in a smaller article are likely to be found in similar content also.

The smaller the article is, the less this holds true, but one thing you can do to mitigate this is create a text file, and each day just paste each new article to the top of the file and then open that in CTA. Over time that file will grow to contain many articles - essentially your own mini corpus of content you are interested in, and then you can use CTA to get frequency statistics from that - giving you frequency statistics far more meaningful than a general corpus, because it's targeted precisely to content you are likely to read.
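If you'd rather not do the pasting by hand, the "paste each new article to the top of the file" routine could be scripted along these lines. The file name and function name are placeholders, and the example runs in a temporary directory so it doesn't touch any real files.

```python
import tempfile
from pathlib import Path

def prepend_article(corpus_path, article_text):
    """Put a new article at the top of a plain-text corpus file (UTF-8)."""
    path = Path(corpus_path)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    # Separate articles with a blank line so they stay readable.
    path.write_text(article_text.rstrip("\n") + "\n\n" + existing, encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "my_corpus.txt"  # hypothetical file name
    prepend_article(corpus, "第一篇文章。")
    prepend_article(corpus, "第二篇文章。")
    corpus_text = corpus.read_text(encoding="utf-8")

print(corpus_text)
```

The resulting file can then be opened in CTA like any other text file.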

A future version of CTA will have this corpus functionality built in (it's a highly requested feature) so it will keep track of all articles you have read and allow you to get frequency statistics and mine sentences/vocabulary from the entire corpus.

"it definitely seems worth buying"

It is! Though I'm obviously biased in that regard ;-)

May 29, 2016 at 07:31 AM

"A future version of CTA will have this corpus functionality built in"

Will the corpus be designed as one large (and growing) block of text? I was thinking it might be nice for CTA to remember discretely what text was inputted at what time with the option to label it, tag it, and so on, meaning if I've got a new word I want to learn I can quickly say: show me all the sentences (clozed or otherwise) where this word occurred, in all articles I've tagged as 'news' or in the last novel I read.

May 29, 2016 at 08:07 AM

On the file system level, it will be saved as one big growing text file (it's more efficient than having lots of smaller files) but for the user it will appear as multiple separate documents that can be named, dated, labeled and tagged. This information will then be saved separately along with the location of the document in the bigger file.
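A layout like that might look roughly like the sketch below: documents are appended to one big store, and a separate index remembers each document's name, tags, and byte position. All class and field names are invented for illustration, an in-memory buffer stands in for the big file, and none of this reflects CTA's real storage format.

```python
import io
from dataclasses import dataclass, field

@dataclass
class DocEntry:
    name: str
    offset: int          # byte offset of the document in the big file
    length: int          # byte length of the document
    tags: set = field(default_factory=set)

class Corpus:
    def __init__(self, storage):
        self.storage = storage   # binary file-like object, readable and writable
        self.index = []          # metadata kept separately from the text

    def add(self, name, text, tags=()):
        data = text.encode("utf-8")
        self.storage.seek(0, io.SEEK_END)   # always append at the end
        entry = DocEntry(name, self.storage.tell(), len(data), set(tags))
        self.storage.write(data)
        self.index.append(entry)
        return entry

    def read(self, entry):
        self.storage.seek(entry.offset)
        return self.storage.read(entry.length).decode("utf-8")

    def with_tag(self, tag):
        return [entry for entry in self.index if tag in entry.tags]

corpus = Corpus(io.BytesIO())
corpus.add("article-1", "今天的新聞。", tags={"news"})
corpus.add("novel-ch1", "很久很久以前。", tags={"novel"})
news_texts = [corpus.read(entry) for entry in corpus.with_tag("news")]
print(news_texts)
```

Because the index stores positions rather than text, a query such as "everything tagged news" only needs to read the matching slices of the big file.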

May 29, 2016 at 08:11 AM

Very cool.

June 27, 2016 at 06:54 AM

Very nice!

August 16, 2016 at 01:10 AM

I know that this has been brought up before (by myself and maybe others), but after using CTA more, I really think that there is a huge area which is lacking, which for me really diminishes the usefulness of CTA.

You've said in the past that you'd like to improve the segmentation, but that the problem is a huge amount of time would be necessary and would result in only a little bit of improvement.

While improvement in the segmentation would be nice, I don't think it's really that necessary. After going through a few chapters of a book, CTA has learned most of the words I know (which are relevant to the book at least) and so the time needed to add more words to my list of known words, or add custom known words goes down dramatically. Because of the coloration, I can just find each word CTA says I don't know, see if in the context of a sentence I do know it, and mark it accordingly. This takes very little time.

The problem is this. CTA still doesn't let me mark a word as a custom word if it already has a dictionary entry. It will just tell me that the word already exists. Although I say "mark as a custom word", what I'm really trying to do is custom segmentation. If I could custom segment a whole chapter at a time (which really wouldn't take too much time), then when I import the unknown words into Pleco, I would have very little work to do and could start studying immediately (with the exception of words with different pronunciations and deciding which one is actually being used, etc.).

As it is, there are a few words that are segmented incorrectly so that when I import into Pleco, either I have all these cards for words that aren't actually useful to me (because they've just been segmented incorrectly), or I have all of these cards consisting of a single character each, and sometimes (because of how Pleco maps headwords to dictionary entries), I have four cards all for the same character, but different pronunciations/meanings. So then I have to go through all of the cards and figure out which characters should be spliced together to make a word. Also, at this point, all of this is being done on my phone, so it's pretty slow going.

But if I could just go through a file, make sure all the words were segmented correctly, custom segment the words that weren't, and export to Pleco, it would save a huge amount of time. I think this feature would be much, much more useful than improving the segmentation algorithm, because it's already good enough as it is, IF the user were able to override certain segmentations.

I also feel like (and maybe I'm totally wrong here) it would take much less time to add this feature than to improve the segmentation algorithm (which could be worked on afterwards knowing that at least where the segmentation is lacking, the user can correct for mistakes).

August 16, 2016 at 05:46 AM

Yes you are right that this sort of manual segmentation is much easier to do, and it is also already on my todo list because even an improved segmenter will still make some mistakes.

In the meantime, one possible workaround for your problem is to use CTA to export the entire document and add a space after each word (Export->Document and then set 'Word - Post' to a single space). Then you can manually edit that, moving the spaces around as necessary to correct any segmentation errors. Then you can open (or paste) the space-separated document back into CTA and you should have a much more accurately segmented document.

It's a bit of a hassle, and a future version will help basically automate much of that, but you might find it works as a temporary workaround.
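To show what the hand-editing changes, here is a tiny sketch that treats each space-separated chunk as a word and lists the words that only exist after correction. The two sample lines are invented, and the assumption that spaces act as word boundaries comes from the workaround described above rather than from any knowledge of CTA's parser.

```python
def words_from_spaced(text):
    """Treat every whitespace-separated chunk as one word."""
    return text.split()

# Hypothetical export with a segmentation error, and the hand-corrected version.
exported = "他 認 真念 書"
corrected = "他 認真 念書"

before = words_from_spaced(exported)
after = words_from_spaced(corrected)

# Words that appear only after the spaces were moved.
fixed_words = [word for word in after if word not in before]
print(fixed_words)
```

Moving two spaces turns four badly cut chunks into the words 認真 and 念書.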

The biggest problem for me at the moment is really a lack of time :-/

August 23, 2016 at 06:11 AM

Hi Imron, glad it's on your todo list!

Thank you for the workaround - I'll try it!

August 23, 2016 at 08:37 AM

Unfortunately though, my todo list is long and my time is limited, so no firm timeline for when it will be available.

September 18, 2016 at 08:13 PM

Another thing to add to your long todo list

It would be really cool if CTA could count Verb-Objects as words even if separated:

ex:

I know 吃飯

It would be great if CTA could recognize that I also know:

吃完飯

吃過飯

etc... It would be cool if it still marked 過, 完, etc. as unknown (until I mark as known, of course).
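Just to illustrate the idea, a very naive way to spot a separated verb-object word is a pattern that allows a short gap between the verb and the object. This is a sketch under the assumption that the first character is the verb and the rest is the object; the function names and the `max_gap` limit are made up, and a real segmenter would need much more than this to avoid false matches.

```python
import re

def separable_pattern(word, max_gap=2):
    """Regex matching word[0] + up to max_gap other characters + word[1:]."""
    verb, obj = word[0], word[1:]
    return re.compile(re.escape(verb) + "(.{0,%d}?)" % max_gap + re.escape(obj))

def find_separated(word, text, max_gap=2):
    """Return (matched_text, inserted_characters) for each occurrence."""
    pattern = separable_pattern(word, max_gap)
    return [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

matches = find_separated("吃飯", "我吃完飯了，他吃過飯，我們吃飯吧。")
print(matches)
```

Each match also reports the inserted characters (完, 過), so those could still be treated as separate words with their own known or unknown status.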

Also, is there any way to copy text from CTA to Microsoft Word while still keeping the unknown words highlighted red?

September 19, 2016 at 02:13 AM

Thanks for the feedback.

1) would be pretty tricky, so I can't guarantee that it'll happen anytime soon

2) can't be done at the moment but probably wouldn't be too hard to implement

September 26, 2016 at 11:24 PM

When I export all unknown words as sentences, how does CTA divide sentences up? Whenever it sees a period? I have lines of subtitles that I'm pasting into CTA, and then exporting the lines with unknown words in them. Unexpectedly, so far it seems like it's been treating each line as a sentence (which is what I want). I even put a period in the middle of a line, but it still exported the whole line as a single sentence. It might have been an English period instead of a Chinese one... Would this have made a difference?

For this project, I actually want CTA to treat each line as a new sentence, regardless of any punctuation. But I'm just curious why the period I put in there didn't break up the line into two sentences, as I had expected - and I'm curious how to keep it this way.

September 27, 2016 at 05:07 AM

Splitting on lines is by design. It should definitely split on Chinese full stops so try that.

I'm pretty sure I also made it split on English full stops, but I guess not. I'm on mobile at the moment so can't check but will add English full stops to the list of breaks if not there already.

September 27, 2016 at 09:35 PM

Ah. The Chinese full stop does split. The English full stop and ellipsis don't. Would it be possible to get the list of breaks?

September 28, 2016 at 01:59 AM

Yep, just checked the code and it's currently Chinese full stops only. Will add both the English one and the ellipsis to the sentence breaks.
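The behaviour discussed here (always split on line breaks, then on a configurable set of break characters) can be sketched as follows. This is not CTA's code; the function name, the two break sets, and the sample text are assumptions for illustration.

```python
import re

CURRENT_BREAKS = "。"      # Chinese full stop only
PLANNED_BREAKS = "。.…"    # plus English full stop and ellipsis

def split_sentences(text, breaks):
    """Split on line breaks first, then after any run of break characters."""
    cls = re.escape(breaks)
    # One sentence = a run of non-break characters plus any trailing break characters.
    pattern = re.compile("[^%s]+[%s]*" % (cls, cls))
    sentences = []
    for line in text.splitlines():
        for part in pattern.findall(line):
            part = part.strip()
            if part:
                sentences.append(part)
    return sentences

text = "我喜歡看書.你呢。\n好…走吧"
current = split_sentences(text, CURRENT_BREAKS)
planned = split_sentences(text, PLANNED_BREAKS)
print(current)
print(planned)
```

With the current set, each subtitle line stays in one piece unless it contains a Chinese full stop. With the planned set, the same two lines become four sentences, so anyone relying on one sentence per line would want to avoid those characters within a line.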

Yadang

imron

Unfadeable

艾墨本

Unfadeable

imron

Guest realmayo

imron

Guest realmayo

dungcaxinh

Yadang

imron

Yadang

imron

Yadang

imron

Yadang

imron

Yadang

imron