Introducing Chinese Text Analyser

September 28, 2016 at 02:26 AM

For good measure, I've also thrown in English and Chinese question and exclamation marks.

September 28, 2016 at 06:49 PM

Noticed a small bug/typo:

Win7 - Increase font size works when you press Ctrl+= rather than the Ctrl++ it says in the toolbar.

I just bought a license of CTA and I really like it. I haven't figured out exactly how I'm going to use it yet, but I appreciate the great work. vi keyboard shortcuts are a cool feature.

also first post after lurking for years...

September 29, 2016 at 03:36 PM

vi keyboard shortcuts are a cool feature

Yup. I love my vi shortcuts. I've got a whole bunch more I want to add too, top of the list being / for searches.

Good spot with the Ctrl++. I'll update it with the next release, which will also address the line splitting.

In a series of follow-up emails with Yadang, I think I'm also going to add Lua scripting support to CTA so people will be able to write simple scripts to get and export various things from a document (documents, paragraphs, sentences, words and associated statistics, dictionary entries and lists of known/unknown words will all be exposed to the Lua scripts).

I know there are a bunch of other features people are waiting on, but I've decided to move this to the top of my todo list because it should be relatively simple to add and then people will be able to write all sorts of sentence-mining and data extraction scripts without needing to wait for me to get around to it.

September 29, 2016 at 05:27 PM

Did you hear about the Google Neural Machine Translation?

September 29, 2016 at 05:42 PM

Yes I did, but 1) CTA is not a translation tool so it doesn't immediately apply 2) CTA is designed to work offline rather than be dependent on an online API that 3) costs money to use if you embed it in an application and also requires you to include Google branding in that application.

So it's interesting, but not directly relevant to CTA at this point in time. Some of the research papers they've written might come in useful at some point for improving CTA's segmenter, but even then, for the reasons listed above, it'd still need to be my own implementation rather than relying on any premade Google tool.

September 29, 2016 at 05:45 PM

Yes, this is a translation tool, not what CTA is doing, but you can certainly learn from their approach.

September 29, 2016 at 06:08 PM

but you can certainly learn from their approach.

To some extent yes, but their approach relies on training the neural network with very, very large sets of data, using many machines to do so. There are likely some interesting and applicable things in the research papers they've written, but they're unlikely to be directly related to CTA's use case.

September 30, 2016 at 01:31 AM

Minimising Google is a good idea. I get the feeling it is trying to help us too much. It will soon be trying to predict when we need to go to the toilet if it hasn't done so already.

September 30, 2016 at 06:54 AM

here is a thread on it, no need to spam CTA

October 1, 2016 at 12:26 AM

Imron,

Is there a full list of keyboard shortcuts?

Also, I just now received a security warning in Chrome while trying to visit your website https://www.chinesetextanalyser.com/. I had to click two additional links ("If you understand the risks to your security, you may visit this site....") to navigate to it.

October 1, 2016 at 01:28 AM

There is no list documented anywhere except the source code. The vi shortcuts referenced here are j to scroll down a line, k to scroll up a line, ctrl-d to scroll down a page and ctrl-u to scroll up a page.

Thanks for the info about the certificate. Looks like my SSL certificate didn't automatically renew, and stranger still, when I try to manually renew it says it doesn't need renewal -- hmm, will look into it a bit later.

October 1, 2016 at 01:39 AM

SSL issue is now fixed.

October 2, 2016 at 07:12 AM

New version is now up that allows you to export Lines and Cloze Lines. It also breaks on English punctuation for sentences.

October 2, 2016 at 10:22 PM

Could I ask, what are Lines and Cloze Lines and what is their use in the app?

October 3, 2016 at 01:55 AM

It's used when you want to export an entire line that contains the word, rather than just a sentence.

For example, say you are learning the word 麻烦, and you have a transcript or subtitles document that contains the line

掌柜：你也知道是麻烦啊。在你心目中大侠应该是什么样子呢？

If you export the 'Sentence' containing the word, you will get 你也知道是麻烦啊 (and Sentence Cloze will be 你也知道是[...]啊).

If however you export the 'Line', then you'll get the entire line 掌柜：你也知道是麻烦啊。在你心目中大侠应该是什么样子呢？ (and Cloze Line will get you 掌柜：你也知道是[...]啊。在你心目中大侠应该是什么样子呢？)

It's useful mostly for things like subtitles where each line of the document is already going to be a complete thing that someone says. However that 'complete thing' might contain punctuation such as full-stops and question marks, and therefore exporting only the Sentence will get you just part of it, which might not be ideal.
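As a rough illustration of the Sentence/Line distinction described above, here is a plain Lua sketch that reproduces the four outputs for the example line. It is not CTA's code: the function names sentenceContaining and cloze, and the set of characters treated as sentence breaks (including the full-width colon after the speaker name), are assumptions made for this sketch only.

```lua
-- Sketch only: CTA's real segmentation rules are not shown in this thread.
-- Marks treated as sentence boundaries here (an assumption).
local BREAKS = { "。", "？", "！", "：", ".", "?", "!" }

-- Return the part of `line` around the first occurrence of `word`, bounded by
-- the nearest break marks on either side, or nil if the word is absent.
local function sentenceContaining( line, word )
    local wordStart, wordEnd = line:find( word, 1, true )
    if not wordStart then
        return nil
    end

    local left, right = 1, #line
    for _, mark in ipairs( BREAKS ) do
        -- Visit every occurrence of this mark (plain, byte-wise search).
        local from = 1
        while true do
            local s, e = line:find( mark, from, true )
            if not s then
                break
            end
            if e < wordStart and e + 1 > left then
                left = e + 1      -- closest break before the word
            elseif s > wordEnd and s - 1 < right then
                right = s - 1     -- closest break after the word
            end
            from = e + 1
        end
    end

    return line:sub( left, right )
end

-- Replace the first occurrence of `word` in `text` with a cloze marker.
local function cloze( text, word )
    local s, e = text:find( word, 1, true )
    if not s then
        return text
    end
    return text:sub( 1, s - 1 ) .. "[...]" .. text:sub( e + 1 )
end

local line = "掌柜：你也知道是麻烦啊。在你心目中大侠应该是什么样子呢？"
local sentence = sentenceContaining( line, "麻烦" )

print( sentence )                   -- Sentence
print( cloze( sentence, "麻烦" ) )  -- Sentence Cloze
print( line )                       -- Line
print( cloze( line, "麻烦" ) )      -- Cloze Line
```

The byte-wise plain search is safe here because UTF-8 characters can only match at character boundaries. The sketch blanks out only the first occurrence of the word.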

October 17, 2016 at 06:37 AM

For people following along, Lua scripting is progressing very well and on my development version of CTA I can now run Lua scripts that have access to CTA internals.

All that's remaining is to write the GUI for running the scripts and to write documentation for the functionality that CTA exposes to the scripts. Hopefully it will be ready some time around the end of this month or the beginning of next.

It's going to make some very interesting things possible. For example, a script to search all open documents and spit out all 'mostly known' sentences (e.g. sentences where you know more than a certain percentage of the words) is about 30 lines of code. Here's an example:

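-- Returns true when the share of known words in the sentence is at least
-- `threshold` but below 100% (fully known sentences are skipped).
local function sentenceMostlyKnown( sentence, known, threshold )
    if threshold == nil then
        threshold = 0.97
    end

    local total = 0
    local totalKnown = 0
    for word in sentence:words() do
        if known:contains( word ) then
            totalKnown = totalKnown + 1
        end
        total = total + 1
    end

    -- An empty sentence gives 0/0 (NaN), which fails the comparison below.
    local ratio = totalKnown / total

    return ratio >= threshold and ratio < 1
end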

local cta = require 'cta'
local known = cta.knownWords()
for _, document in ipairs( cta.openDocuments() ) do
    for line in document:lines() do
        for sentence in line:sentences() do
            if sentenceMostlyKnown( sentence, known ) then
                print( sentence )
            end
        end
    end
end

October 17, 2016 at 08:09 AM

That looks awesome!

Another suggestion I've been thinking of as I've been using CTA more...

I find that drilling, say, a word in the context of one sentence 20 times does much less than drilling the same word in the context of 2 sentences 10 times or 3 sentences 6 times. I find that I don't really know a word until I've seen it in multiple contexts (I know this is nothing new...).

To this end, it would be cool if there was a functionality in CTA that allowed words to be considered as something other than just "unknown" or "known". If there were three or four categories as opposed to two, that would be helpful.

I can see you saying, well, if the word isn't known well enough to be marked as "known" and to be put into one of the other categories instead, then the word really just isn't known. I agree with this, but I don't think it takes away from the functionality of having more than just the two categories. For example, as it is, when I export words, I don't tell CTA to mark them as known, because I know that just this one exposure won't be enough for me to feel really comfortable with them. What this means is, over time, I will keep exporting words and not marking them as known, and so I'll get a build-up of words that slowly turn from unknown to known that won't be marked.

Of course, I can go through everything and re-mark the words I've come to know through multiple exposures, but it would be nice if CTA kept track of this so I didn't have to, and I could tell CTA to only mark the words known that I've promoted to these other categories within "unknown".

Anyways, definitely not a necessity, and I'd rather see many other features than this one, but I thought I'd pitch it anyway.

October 17, 2016 at 08:47 AM

it would be cool if there was a functionality in CTA that allowed words to be considered as something other than just "unknown" or "known". If there were three or four categories as opposed to two, that would be helpful.

I actually thought about this quite a bit when making CTA and ultimately decided that, when it comes down to it, any word you don't know with full confidence is basically unknown in the context of reading, because it will cause an interruption to the reading process.

Having multiple levels of 'known' encourages a situation where people can pretend they know something when really, for the purposes of using the language, they don't (this happens a lot with SRS).

One way to mitigate the problem you describe is to keep exporting words without marking them as known, and then at regular intervals export words you count as known from Anki and bulk import them to CTA (this is still a multi-step process and still a bit of a hassle).

In the meantime I'll think about keeping track of recently exported words (along with an exported count), which you can then browse from within CTA and mark as known at your pleasure if needed.

I find that drilling say, a word in the context of one sentence 20 times does much less than drilling the same word in the context of 2 sentences 10 times or 3 sentences 6 times

The great thing about the Lua scripting feature is that it will be relatively simple to find and export all sentences containing unknown words, e.g. the following script:

local cta = require 'cta'
-- Maps each unknown word to the list of sentences that contain it.
local sentences = {}

for _, document in ipairs( cta.openDocuments() ) do
    local unknown = document:unknownWords()
    for word, sentence in document:findSentencesContaining( unknown ) do
        if sentences[word] == nil then
            sentences[word] = { sentence }
        else
            table.insert( sentences[word], sentence )
        end
    end
end

for word, sentenceGroup in pairs( sentences ) do
    for _, sentence in ipairs( sentenceGroup ) do
        print( word, sentence:clozeText( word ) )
    end
end

Will give you output like:

快捷	旁人见他身法[...]，出手狠辣，都不禁为这僮儿担心，却见剑光闪动，左僮的剑尖指到了阮士中后心。
快捷	这一招[...]异常，右僮手中长剑正与刘元鹤铁拐相交，忽见剑到，急忙矮身相避，只听刷的一响，小辫上的一颗明珠已被利剑削为两半，跌在地下。
苦功	他在这双肉掌上下了数十年[...]，施展开来果然不同寻常。
苦功	“这单刀功夫，我也曾跟师父下过七八年[...]，知道单刀分『天地君亲师』五位：刀背为天，刀口为地，柄中为君，护手为亲，柄后为师。
若非	瞧她神气，[...]侯门巨室的小姐，就是世代书香人家的闺女，哪里像是江湖大侠之女。
若非	[...]汉奸吴三桂卖国，引清兵入关，这天下就是姓李的了。
若非	[...]胡一刀代我求情，我这条小命是早已不在了。
若非	只是他时常埋怨自己，说道[...]他对我妈不够温存体贴，我妈也不致受了旁人之骗。
慢吞吞	她[...]的取钥匙，开箱子，又跟韩婶子商量该垫银狐的还是水貂的。
慢吞吞	爹[...]地吃起了饭，才吃了几口就将筷子往桌上一放，把碗一推，他不吃了。
慢吞吞	老全眼睛[...]地眨了一下，像是看了一会我们，随后嘴巴动了动，声音沙沙地问我们：“这是什么地方？”
慢吞吞	在牛的脊背上刷动，一些树叶[...]的掉落下去。

And so on for each unknown word, so you'll be able to get lots of example sentences.

It will be trivial to extend that script so that instead of working on the open documents, it processes all files in a specific directory (or directories).

October 18, 2016 at 09:18 PM

One way to mitigate the problem you describe is to keep exporting words without marking them as known, and then at regular intervals export words you count as known from Anki and bulk import them to CTA (this is still a multi-step process and still a bit of a hassle).

That's a good idea. In fact, I bet it wouldn't be very difficult to export all of my cards into Excel, extract just the clozed-out word(s) from each sentence, and then do a count of how many sentences I have containing each word. I could then decide that if I have a word in 3 or more sentences (or whatever), to import only those words into CTA, as there's a good chance that I'll know those words already.
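The counting step could also be done in a few lines of plain Lua rather than a spreadsheet. The sketch below is only an illustration: the function name wordsSeenOften and the layout of the cards table (one entry per card, with the clozed-out word and its sentence) are assumptions, and the sample data is placeholder text rather than a real Anki export.

```lua
-- Return a sorted list of words that appear in at least `minSentences`
-- distinct sentences. `cards` is a list of { word = ..., sentence = ... }
-- tables (a hypothetical layout for the exported card data).
local function wordsSeenOften( cards, minSentences )
    minSentences = minSentences or 3

    local seen = {}    -- word -> set of sentences already counted
    local counts = {}  -- word -> number of distinct sentences
    for _, card in ipairs( cards ) do
        seen[card.word] = seen[card.word] or {}
        if not seen[card.word][card.sentence] then
            seen[card.word][card.sentence] = true
            counts[card.word] = ( counts[card.word] or 0 ) + 1
        end
    end

    local result = {}
    for word, count in pairs( counts ) do
        if count >= minSentences then
            table.insert( result, word )
        end
    end
    table.sort( result )
    return result
end

-- Placeholder data: three distinct sentences for one word, two for another.
local cards = {
    { word = "若非", sentence = "sentence A" },
    { word = "若非", sentence = "sentence B" },
    { word = "若非", sentence = "sentence C" },
    { word = "苦功", sentence = "sentence D" },
    { word = "苦功", sentence = "sentence E" },
    { word = "苦功", sentence = "sentence E" },  -- duplicate card, counted once
}

for _, word in ipairs( wordsSeenOften( cards ) ) do
    print( word )
end
```

Printing one word per line gives a list that could then be saved to a text file for bulk import.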

The only problem is some of the words that are clozed out aren't recognized as words by CTA. When importing words to CTA, will it add as a custom word any word that doesn't have a match?

In the meantime I'll think about keeping track of recently exported words (along with an exported count), which you can then browse from within CTA and mark as known at your pleasure if needed.

That would be a nice feature.

It will be trivial to extend that script so that instead of working on the open documents, it processes all files in a specific directory (or directories).

That would be really great.

October 19, 2016 at 12:41 AM

When importing words to CTA, will it add as a custom word any word that doesn't have a match?

Not currently, but it probably should. I've added it to my todo list.

That would be really great.

As I know you've been researching and learning about Lua, CTA will come with the Lua File System module installed if you want to get familiar with how to iterate over files and directories in Lua.
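For anyone wanting a starting point, here is a minimal sketch of walking a directory tree with LuaFileSystem. It uses only lfs.dir and lfs.attributes from that module; the function name collectFiles and the ".txt" suffix are just choices made for this example. How a script would then hand each file to CTA isn't covered in this thread, so the sketch stops at collecting paths.

```lua
-- Standalone sketch: requires the LuaFileSystem module ("lfs").
local ok, lfs = pcall( require, "lfs" )
if not ok then
    print( "LuaFileSystem (lfs) is not installed; nothing to do." )
    return
end

-- Recursively collect the paths of regular files under `dir` whose names
-- end with `suffix` (for example ".txt"). Results are appended to `found`.
local function collectFiles( dir, suffix, found )
    found = found or {}
    for entry in lfs.dir( dir ) do
        -- lfs.dir also yields the "." and ".." entries, so skip them.
        if entry ~= "." and entry ~= ".." then
            local path = dir .. "/" .. entry
            local mode = lfs.attributes( path, "mode" )
            if mode == "directory" then
                collectFiles( path, suffix, found )
            elseif mode == "file" and entry:sub( -#suffix ) == suffix then
                table.insert( found, path )
            end
        end
    end
    return found
end
```

Calling collectFiles( "some/folder", ".txt" ) returns a table of paths that a script could loop over with ipairs, in place of the loop over open documents.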

imron

sketchc89

imron

Angelina

imron

Angelina

imron

Flickserve

Angelina

murrayjames

imron

imron

imron

LinZhenPu

imron

imron

Yadang

imron

Yadang

imron
