Introducing Chinese Text Analyser

February 2, 2015 at 03:27 AM

With the corpus feature I'm working on, what will happen is that you will be able to create a new 'corpus', and then every file you open will be added to the corpus automatically. You won't need to save things to a separate folder as CTA will handle this for you.

You will then be able to get statistics, mine sentences and words, and search through all of those documents (including pattern matching), and either view the resulting sentences or export them to a file.

For some words that are variants of other words, for example the word 動盪 the dictionary just says: Variant of 動蕩, but doesn't actually tell you what the word means.

I'll see if there's a good way around this problem.

February 2, 2015 at 03:13 PM

For some words that are variants of other words, for example the word 動盪 the dictionary just says: Variant of 動蕩, but doesn't actually tell you what the word means.

Is this CC-CEDICT? I have a script that takes the base dictionary and modifies all those "reference-only" definitions to include the definitions they point to. merge-cedict-defs.pl
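For anyone curious about the general idea, here is a minimal Python sketch of that kind of merge. It is not the Perl script itself: the assumed line format, the 'variant of' pattern, the function names and the sample definitions are all illustrative assumptions.

```python
import re

# Assumed CC-CEDICT line shape: TRAD SIMP [pinyin] /def1/def2/
ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")
# Assumed shape of a reference-only definition: "variant of TRAD|SIMP[pinyin]"
VARIANT_RE = re.compile(r"^variant of ([^|\[\s]+)")


def parse_entries(lines):
    """Map each traditional headword to its list of definitions."""
    entries = {}
    for line in lines:
        if line.startswith("#"):  # skip comment lines
            continue
        m = ENTRY_RE.match(line.strip())
        if m:
            trad, _simp, _pinyin, defs = m.groups()
            entries.setdefault(trad, []).extend(defs.split("/"))
    return entries


def expand_variants(entries):
    """Append the target word's real definitions to each 'variant of' entry."""
    expanded = {}
    for word, defs in entries.items():
        out = list(defs)
        for d in defs:
            m = VARIANT_RE.match(d)
            if not m:
                continue
            for target_def in entries.get(m.group(1), []):
                # Only copy real definitions, not further references.
                if not VARIANT_RE.match(target_def) and target_def not in out:
                    out.append(target_def)
        expanded[word] = out
    return expanded


# Illustrative sample lines (definitions shortened, not copied from the dictionary).
sample = [
    "動盪 动荡 [dong4 dang4] /variant of 動蕩|动荡[dong4 dang4]/",
    "動蕩 动荡 [dong4 dang4] /unrest/turmoil/",
]
expanded = expand_variants(parse_entries(sample))
print(expanded["動盪"])
```

This version only follows one level of reference; a chain of variants pointing at other variants would need a loop or recursion.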

February 3, 2015 at 01:15 AM

Yep, CC-CEDICT. I'd be looking to do a similar thing, except I would do it in C++ dynamically at runtime for any entry that is displayed/used.

However, that script will save me from hunting through CC-CEDICT to find all variations of 'variant of...'.

Thanks :-)

March 2, 2015 at 12:52 PM

This looks like an already interesting and useful tool with even more interesting plans for the corpus feature in the future; any news on when the OS X version is to be expected?

March 5, 2015 at 06:12 AM

For anyone else interested in the OS X version, it seems that it is still roughly a couple of months away, although there is no hard time frame. I'm going to try Wine for now through PlayOnMac.

March 5, 2015 at 12:15 PM

It'd be great if you could post a follow-up about how it works, or doesn't, with PlayOnMac.

March 5, 2015 at 01:59 PM

and then every file you open will be added to the corpus automatically

Would it be too much to ask to have this not automatic, so you can choose what goes into your corpus? And presumably you have a way of ensuring that things don't get put in multiple times.

March 5, 2015 at 02:31 PM

There will be a separate 'corpus' file type (actually a directory) which you can open from within CTA. When a corpus is open, then any documents you open are added to it automatically.

If you don't want files added to the corpus you can just close the corpus first, and then they won't get put in any corpus.

Manually adding/removing documents from the corpus will also be easy (documents will be shown in a tree view on the side), though I suppose I could also add an option not to automatically add documents to an open corpus if the above is not enough.

Regarding multiple copies, a unique identifier will be created for each document based on its contents, and documents that have identical contents will have the same identifier.

Documents will only be added to the corpus if no other documents have a matching identifier.
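How CTA derives the identifier isn't stated here. As a rough sketch of the idea, assuming a SHA-256 hash of the document text serves as the identifier (the `Corpus` class and its methods are hypothetical names, not CTA's API):

```python
import hashlib


class Corpus:
    """Toy corpus that refuses documents whose contents it already holds."""

    def __init__(self):
        self._docs = {}  # identifier -> document text

    @staticmethod
    def identifier(text: str) -> str:
        # Identical contents always hash to the same identifier.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def add(self, text: str) -> bool:
        """Add a document; return False if an identical one is already stored."""
        doc_id = self.identifier(text)
        if doc_id in self._docs:
            return False
        self._docs[doc_id] = text
        return True

    def __len__(self) -> int:
        return len(self._docs)


corpus = Corpus()
first = corpus.add("你好，世界。")
second = corpus.add("你好，世界。")  # same contents, so it is skipped
third = corpus.add("再见。")
print(first, second, third, len(corpus))
```

Because the identifier depends only on the contents, the same file opened from two different paths or under two different names is still stored once.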

March 13, 2015 at 08:53 PM

Any ETA on the next version? It seems like it has been on 99.9 since I first read about it, which seems like quite a few months.

I'd also like to be able to use it as a reader, and automatically have pinyin annotations over words I don't know. There just isn't software that can do this very well yet, and it seems like it would be fairly easy to build onto what you already have.

During the trial I found that it was mostly helpful for making sure flashcard lists didn't have duplicate or known words in them. However, I haven't found any good software that acts as a casual reader for passive learning, which seems to work best for me.

I found the look-up function (and thus learning new vocabulary with the program) too much trouble to use. One major problem is that it's annoying to restart the program. I sometimes accidentally mark characters or words as known, only to find out later that I was wrong (or maybe I never find out), because there is no way to quickly check the actual definition without blacklisting the word for the session. So if you think you know it, the program strongly disincentivizes you from actually checking the meanings of words.

There is also a problem when the text is segmented wrong: I'm sometimes torn between marking things as known for the segmented meaning, as opposed to the meaning in context. Or even if the word is segmented right, the word has multiple meanings and white-listing the word will white-list everything. So even though I can read most things, I still have a bit of red here and there, and I don't want to look those words up either, because I'm on the fence about white-listing them. So basically it's a dead end as far as studying vocabulary in context. The best I can do is export the word list and study it somewhere else, which seems like a real wasted opportunity.

This software is so close to being great. I just wish it didn't try to babysit so much by forcing you to restart the program and making you left- and right-click through a context menu to check a definition or reading. Not having a reading mode is also very disappointing; given the speedy platform and its ability to identify more or less which words you already know, it seems almost criminal that the option isn't available. The result is that I just ended up not using the software except for quickly creating lists. Even marking off words I knew to add to the list became tedious because I was frustrated by words I was on the fence about, especially words where I know some of the definitions but am not confident about all of them.

The final judgement on the version (99.9) I tried though is that although I was disappointed in the aforementioned deficits, it does perform a fairly essential function quite well. I just wish it was more of a study tool or reading environment than just a list making tool, because it is so close to delivering something I think a lot of people have waited a long time for.

Also, I wish it was open source and donation-ware but I can understand the commercial license too.

March 14, 2015 at 03:35 AM

because I was frustrated by words I was on the fence about, especially words where I know some of the definitions but am not confident about all of them.

For reading, confidence is just as important as every other aspect of the word. If you are not confident about the word then you don't know it well enough yet and would benefit from spending time studying it further.

One of the main goals of CTA is to make you recognise that fact. Many other learning tools gloss over this and make you think you know a word when actually you don't know it well enough to use/read it in context at a speed conducive to reading. CTA on the other hand has a much higher standard of what it means to 'know' a word, one that says you know a word when you can read it comfortably and confidently without resorting to a dictionary (this is not an unreasonable definition).

To do that you basically need instant recognition and recall on the word. If you're hesitant about a word you should mark it as unknown, because you need to study it a bit to make sure that next time you are not hesitant about it.

there is no way to quickly check the actual definition without blacklisting the word for the session

And that is intentional. If you are looking up a word just 'to quickly check', then great, that's a perfect example of your actions showing you don't know that word well enough yet, despite how much you might protest that you really do know it. That's a word that you need to spend more time studying so that next time you don't need to quickly check because you know it with confidence.

The thing is, if you knew the word you wouldn't need to look it up. How often do you look up words in your native language? Probably rarely, if ever. It might seem frustrating that words you 'know' are marked as unknown, but CTA is just reflecting the truth about your current level and the material you're trying to read, and it's using an objective measure rather than letting you believe something that is not the case.

CTA doesn't disincentivise looking up words, it disincentivises believing you know a word just because you looked it up. It wants to give you feedback to say 'hey, you looked up this word and even though it's in your short term memory, you're probably going to forget it again real soon if you don't spend more time studying it'. It then gives you the tools to get those words into other study tools such as flashcard and SRS programs.

Actually, you don't need to restart CTA to mark words back to known, just export unknown words (Export->Wordlists, then choose Wordlist [Known] and Filter 'Exclude words on list') and check the 'mark exported words as known' checkbox. This way you have a list of all the words from a particular article that you need to spend more time studying in order to read them fluently.

So basically it's a dead end as far as studying vocabulary in context. The best I can do is export the word list and study it somewhere else, which seems like a real wasted opportunity.

But this is exactly what you should be doing. Reading native material, encountering a word in context that you don't know, and then marking it down and setting it aside for later study to make sure you'll know it the next time you come across it. That is how you will make real improvements to your Chinese.

I'd also like to be able to use it as a reader, and automatically have pinyin annotations over words I don't know.

This is not a feature that I plan to add. The whole design philosophy around CTA is to encourage you to develop the skills necessary to read without aid. The best way to do that is to read without aid, and then study words you have trouble with. Yes it can be frustrating not to get instant gratification, but it's better for your long term learning.

There is also a problem when the text is segmented wrong

Text segmentation is a problem, and the current algorithm for doing that is too simplistic. I have plans to improve this; however, implementing those ideas will take a whole lot of time for relatively small percentage gains in correctness, so it's lower priority than other features (OS X and Linux versions, corpus feature, graphs and more).
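The thread doesn't say which algorithm CTA actually uses. Purely as an illustration of how a simple dictionary-based segmenter behaves, here is a sketch of greedy forward longest-match segmentation in Python; the `segment` function, the `max_len` limit and the tiny word set are assumptions made for the example.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward longest-match segmentation.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Tiny illustrative word set, not a real dictionary.
dictionary = {"因此", "造成", "负面"}
without_entry = segment("因此造成负面的羯磨", dictionary)
with_entry = segment("因此造成负面的羯磨", dictionary | {"羯磨"})
print(without_entry)
print(with_entry)
```

With this kind of approach, a word that is missing from the dictionary falls apart into single characters, and the greedy choice can also pick a long word that is wrong for the context. Both are the sort of error that takes considerably more work to fix.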

Also, I wish it was open source and donation-ware but I can understand the commercial license too.

I've tried that before with other tools, and the reality is, most people don't donate. The commercial licence of CTA is both permissive and relatively inexpensive.

Although CTA is a labour of love, without any sort of income it is difficult to justify spending time on it at the expense of paid work (I do freelance software development for a living) and in fact this is part of the reason the next version is taking so much time (paid work takes priority because sales of CTA are still far from covering living expenses and I like to eat and have a roof over my head). ETA on the next version is still a month or two away and the main feature of the next release will be the OS X version.

March 14, 2015 at 09:54 AM

Don't get me wrong, I understand the philosophy behind why you made those decisions. I just wish the software didn't babysit so much or had options to toggle between modes. It's a good tool for finding words you don't completely understand; I just wish there were more opportunities to learn those words inside the software as well.

I learned 98% of my Chinese reading ability by simply reading and looking up words as I went along in a dictionary. I only started studying Chinese after I could already read most non-technical writing fairly fluently. So I disagree with the premise that it's not possible to learn by taking the easy way out, and simply looking up the word when it's unfamiliar, repeatedly if necessary. The first actual Chinese class I took was advanced Chinese with 3rd year Chinese majors, and I noticed even though I never actively studied Chinese, I could already read faster and more naturally than most other students, because I never had to try memorizing or drilling characters.

Outside of the classroom environment, I've only spent a few hours on flashcards, never completed an Anki deck, and never opened a textbook on learning Chinese. I consider myself to have a good memory for certain types of information, but for me it's difficult to commit all possible definitions and usages to memory at the same time when studying flashcards. Let me put it this way: repeating a flashcard entry 5-10 times in order to commit to memory 5 possible definitions with different parts of speech (POS), and then recalling each one while reading, may be less efficient for some than looking the word up 5-10 times when encountering the word in 5 different contexts when actually reading. So, looking up the word when it's encountered in a different context allows for more natural learning (in my own personal experience). You don't have to memorize the definitions, you just have to remember the situation you learned it in. Then there is the matter that most people, including myself, prefer to spend more time consuming, not studying. And that's not necessarily a bad thing. Flashcards are useful as primers, but tools that help make the most of my consumption also help me learn more naturally.

I didn't have to /try/ memorizing anything, just like I never had to try to memorize words in my native language (for the most part). I just had to answer the questions I had while consuming, usually with a dictionary, like I did while reading books as a child. My argument is that everyone learns differently, and most people who have already achieved an upper-intermediate or advanced level already understand how they learn best. There are just some features that would work better for /me/. I do think that what the software does, it does very well.

March 14, 2015 at 01:06 PM

I just wish the software didn't babysit so much or had options to toggle

To somewhat mitigate this issue, the next version will colour looked-up words differently from unknown words. Although you still won't be able to mark them as known, you will be able to set that colour to the same colour as known words, effectively hiding them for the session.

So I disagree with the premise that it's not possible to learn by taking the easy way out, and simply looking up the word when it's unfamiliar, repeatedly if necessary

I think this is more feasible at lower levels when there are plenty of high frequency words that you still need to learn. As your level increases, you'll encounter unfamiliar words at a much lower frequency and this is where something like CTA steps up because it reminds you 'hey, here are some words you almost know, make sure you spend some time on them so you don't need to look them up next time'.

and most people who have already achieved an upper-intermediate or advanced level already understand how they learn best.

Possibly, but not necessarily. It's also at this point that crutches that have been useful in getting to that level start to make it difficult to make further advances. CTA purposely makes you try to do without such things and pushes you towards unaided reading.

You don't have to memorize the definitions, you just have to remember the situation you learned it in.

CTA is great for this. Don't export definitions, instead, export Word, Pinyin and Cloze Sentences and drill yourself on those. So for example you can easily make flashcards like this:

F: 你早知道，是爹爹将我许配给他，[...]是我自己作的主么？*

B: 难道 nándào

*from 雪山飞狐 for those interested
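To show the shape of such a card outside CTA, here is a minimal Python sketch that builds the front and back from a sentence, a word and its pinyin. The `make_cloze` function and the `[...]` placeholder default are assumptions for illustration, not CTA's export code.

```python
def make_cloze(sentence: str, word: str, pinyin: str, blank: str = "[...]"):
    """Return (front, back) for a cloze flashcard.

    Every occurrence of the word in the sentence is blanked out, so the
    answer cannot be read off a second occurrence.
    """
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    front = sentence.replace(word, blank)
    back = f"{word} {pinyin}"
    return front, back


front, back = make_cloze(
    "你早知道，是爹爹将我许配给他，难道是我自己作的主么？", "难道", "nándào"
)
print("F:", front)
print("B:", back)
```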

March 14, 2015 at 04:37 PM

I like the idea of cloze sentences, but there are a couple of usage situations where it could use some improvement, mostly revolving around text which includes multiple instances of a word or character: the cloze export will produce multiple copies of the first sentence instead of finding the next instance. This is more of an issue with very large documents and when learning words with multiple uses. Also, more obscure words that aren't correctly identified may cause the first instance to be irrelevant to the word usage you want. For example, if your text contains 我会因此造成负面的羯磨吗？ and 将这宝贝似的绵羯，宰杀做成全羊，放进有福的锅里，烧火将它煮上。 exporting 羯 will give you two instances of the first sentence, '会因此造成负面的羯磨吗？', and no instance of the later sentence, because '羯磨' isn't in CC-CEDICT. There isn't much point exporting duplicate copies of the same sentence/word anyway.

To get around this I put together a script to segment a text into sentences, add them to a list, and then generate all matching sentences using a vocab list (like one exported from CTA) or from a database of example sentences I put together. I like the idea of the cloze sentences, but it would be nice if there were more options for export, especially since your software is a lot faster and more efficient than what I was able to do. Obscure words will always be a problem, but ideally it should use something like a POS analysis and take that into consideration, or just export the sentence in which each instance was found, instead of multiples of the first one.
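A stripped-down Python sketch of that approach, for illustration only: the function names and the choice of sentence-ending punctuation are assumptions, and matching is plain substring search with no segmentation or POS analysis.

```python
import re

# Assumption: a sentence ends at 。, ！ or ？ (other punctuation is ignored).
SENTENCE_RE = re.compile(r"[^。！？]+[。！？]*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]


def mine_sentences(text, words):
    """Map each word to every distinct sentence that contains it."""
    sentences = split_sentences(text)
    found = {}
    for word in words:
        matches = []
        for sentence in sentences:
            if word in sentence and sentence not in matches:
                matches.append(sentence)
        found[word] = matches
    return found


text = (
    "我会因此造成负面的羯磨吗？"
    "将这宝贝似的绵羯，宰杀做成全羊，放进有福的锅里，烧火将它煮上。"
)
mined = mine_sentences(text, ["羯"])
for sentence in mined["羯"]:
    print(sentence)
```

Each instance comes back with its own sentence, and a sentence is only listed once per word, so there are no duplicate copies of the first match.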

Edit: I see you already addressed the issue of sentence mining on the previous page. Looking forward to that.

Anyway, I just want to repeat that for what it does, it's a very well made and FAST piece of software; it does what it does better than anything else. It's just that the use case is a little too narrow in its current form to justify a purchase (for me). With the addition of an annotation function and different dictionaries/word segmentation options, I would definitely shell out 10-15 dollars for a license. Even better if it was possible to edit text inside the window (unless it would cause too much of a speed hit).

March 16, 2015 at 03:25 PM

Even better if it was possible to edit text inside the window (unless it would cause too much of a speed hit).

I did plan on writing an editor at some point, using the same underlying tech as CTA, that allowed you to write content against specific wordlists. It will probably be a while before I get around to doing this, and it would be a separate product from CTA, although if it works well and is fast enough I would take the editor technology and add it to CTA also.

At the moment there is a speed benefit CTA gets from being read-only.

March 17, 2015 at 04:15 PM

Imron, is CTA portable? I mean, could I install it on a USB key and run it from there (on any random Windows machine)?

March 18, 2015 at 02:29 AM

It's almost portable.

It doesn't write any information to the Windows registry, except for the uninstall information required to make it appear in Add/Remove Programs.

You can therefore copy the entire contents of the install directory (by default: c:\program files\chinesetextanalyser) onto a USB key and the application will run fine.

However, all the user data is stored in the user's local application data folder (by default c:\users\<username>\AppData\Local\ChineseTextAnalyser), and this is specific to each machine, so you'd need to make a backup copy of that and copy it to the new machine if you wanted all your data to come across with it.
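If you'd rather script that copy than do it by hand, something along these lines should work. This is a sketch, not part of CTA: the helper names are made up, and it assumes the data folder sits under %LOCALAPPDATA% as in the default path above.

```python
import os
import shutil
from pathlib import Path

APP_FOLDER = "ChineseTextAnalyser"  # folder name from the default path above


def default_data_dir() -> Path:
    # LOCALAPPDATA normally points at c:\users\<username>\AppData\Local on Windows.
    return Path(os.environ.get("LOCALAPPDATA", "")) / APP_FOLDER


def copy_user_data(src: Path, dest: Path) -> int:
    """Copy the whole data folder; return the number of files now in dest."""
    shutil.copytree(src, dest, dirs_exist_ok=True)  # needs Python 3.8+
    return sum(1 for p in dest.rglob("*") if p.is_file())


if __name__ == "__main__":
    source = default_data_dir()
    if source.is_dir():
        # "cta-backup" is a placeholder destination; point it at the USB key.
        print(copy_user_data(source, Path("cta-backup") / APP_FOLDER), "files copied")
    else:
        print("No data folder found at", source)
```

Restoring on the new machine is the same call with the two paths swapped.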

I do have it on my todo list to make it support standalone installations (including data) running from a USB key, and although it wouldn't be difficult to do, it's not currently a high priority.

Feel free to try and convince me of reasons why it should be a higher priority if you like

March 18, 2015 at 09:01 AM

Thanks Imron, it works. Now I no longer have any valid reason to abstain from purchasing a license... I'm starting to play with your programme and my first impression is: What a humbling experience! But that's exactly how you want the user to feel, isn't it?

March 18, 2015 at 09:20 AM

I want users to realise if they rely on a dictionary too much and, if that's the case, get them to do something about it.

It might be humbling initially, but it's certainly not the aim of the program to make you feel that way - hopefully it will give you a sense of accomplishment when you see progress being made.

Speaking of which, even though I don't have fancy progress graphs yet, the program is generating that data so once graphs are added, it will be able to show how you've been learning and improving over time.

March 18, 2015 at 10:30 AM

I've just noticed that my question about a portable version of CTA was already discussed in more detail a year ago. Sorry.

March 18, 2015 at 10:37 AM

No problem. When the next version is eventually released I'll make it support portable data also (though with the same caveats mentioned in the previous post).

Posters, in the order of the posts above: imron, c_redman, imron, 巴特B, 巴特B, imron, li3wei1, imron, Junso1, imron, Junso1, imron, Junso1, imron, laurenth, imron, laurenth, imron, laurenth, imron
