Lua Script and Chinese Text Analyser

January 5, 2017 at 04:57 PM

Chinese Text Analyser now has support for the Lua programming language.

You can use it to write simple (or complicated) scripts to manipulate text and statistics of files containing Chinese text.

There's relatively complete documentation for how this works: see Running Scripts, the sample example, and the full CTA Lua API.

The latest version of CTA also has a bunch of sample scripts for playing around with to give you an idea of what sorts of things are possible.

I'm creating this thread for people who would like help with scripts: either scripts you are trying to write yourself but are having problems with or, if you don't have the time or interest in programming, requests for someone to write a script that performs some task.

I figured it's better to split this off from the main CTA topic which I hope to keep related to release information and feature requests, rather than programming help with Lua.

Anyway, this is probably only going to be a feature used by a smallish group of people, but it allows for some really cool stuff for manipulating Chinese text.

January 6, 2017 at 02:58 PM

Also, just to get things started, here is a description of each of the example scripts that currently ship with CTA, to give you an idea of some of the things that are possible:

* anki-cloze.lua - scans a directory (and all sub-directories) for .txt files and, for each file, finds all mostly known sentences (sentences that contain at least one unknown word and in which more than 97% of the words are known), then outputs each one as an Anki-compatible cloze-deleted sentence, e.g. something like 到了那种时候，谁也不怕死人骨头了，夜里就是挨在一起睡觉也不会做{{c1::噩梦}}。

* highlight-unknown.lua - exports a document, surrounding each unknown word with '_' markers.

* more-than-10-less-than-20.lua - finds all unknown words in a document that appear at least 10 times and no more than 20 times.

* sentences.lua - finds all sentences in a file that contain at least one unknown word and in which more than 97% of the words are known. Outputs all the unknown words in these sentences sorted by frequency, and all the sentences sorted by percentage of known words.

* subs2anki.lua - does everything in this spreadsheet without the need for all the copying and pasting between the different worksheets and CTA.
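As a rough illustration of the "mostly known" test that anki-cloze.lua and sentences.lua are described as using, here is a minimal sketch in plain Lua. It is not the shipped script: the helper name clozeIfMostlyKnown, the plain table used as the known-word set, and the choice to wrap every unknown word in a c1 cloze are all assumptions made for illustration, and it ignores punctuation.

```lua
-- Hypothetical helper, not part of the CTA API.
-- words:     array of the words of one sentence, in order
-- known:     set of known words, e.g. { ["好"] = true }
-- threshold: fraction of known words that must be exceeded, e.g. 0.97
local function clozeIfMostlyKnown( words, known, threshold )
    local knownCount = 0
    for _, word in ipairs( words ) do
        if known[word] then
            knownCount = knownCount + 1
        end
    end

    local total = #words
    -- Reject empty sentences, fully known sentences, and sentences
    -- whose share of known words does not exceed the threshold.
    if total == 0 or knownCount == total or knownCount / total <= threshold then
        return nil
    end

    -- Rebuild the sentence, wrapping each unknown word in cloze markers.
    local parts = {}
    for _, word in ipairs( words ) do
        if known[word] then
            parts[#parts + 1] = word
        else
            parts[#parts + 1] = '{{c1::' .. word .. '}}'
        end
    end
    return table.concat( parts )
end
```

Note that with a threshold of 0.97 and a single unknown word, a sentence needs at least 34 words to qualify (33/34 is just over 97%), so short sentences are filtered out.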

If you want to see what is involved in writing these scripts, you can find them in the scripts directory of your install location. This will typically be:

Windows: c:\program files\chinesetextanalyser\scripts

macOS: /Applications/Chinese Text Analyser.app/Contents/Resources/scripts/

Linux: /opt/ChineseTextAnalyser/scripts

The files can be opened in any text editor, though you'll probably find it easier to use a programming-friendly editor such as Notepad++ or Sublime Text. If you're going to make changes, it's also preferable to copy these scripts elsewhere and edit the copies, because everything in this directory is overwritten each time you upgrade CTA.

January 6, 2017 at 04:13 PM

Imron, this all looks terrific, and thanks for the underscore script (and also the pastable statistics)! I've been trying to work out how to modify that script but I can't get my head around how to do it. Is there an easy way to make a version which instead of scanning a text and underscoring all words that are unknown, instead scans a text and underscores all words which exist in another given list? The idea is that having used CTA to identify words worth studying ahead of reading a text or a chapter of a novel, and looked up & learned those words, that those words are somehow highlighted when I subsequently do the actual reading.

January 6, 2017 at 04:44 PM

Is there an easy way to make a version which instead of scanning a text and underscoring all words that are unknown, instead scans a text and underscores all words which exist in another given list?

Yes there is!

You can load custom wordlists from a file. The file should contain one word per line. Here's a modified version of that script that loads the wordlist from a file and then only adds the markers if the word exists in the wordlist. Save it to a file (e.g. highlight-words.lua) and you should be good to go.

--[[
@cta Exports a document, surrounding each word from a specified wordlist with '_' markers.
--]]

local cta = require 'cta'

local file = cta.askUserForFileToOpen( { title = "Please choose a file with Chinese text" } )

if file == nil then
    error "No file selected"
end

local document = cta.Document( file )

--local wordListFile = "wordlist.txt"
local wordListFile = cta.askUserForFileToOpen( { title = "Please choose a word list file" } )
local wordList = cta.WordList( wordListFile )

for line in document:lines( true ) do
    for word, _, wordType in line:words( true ) do
        if wordList:contains( word ) then
            cta.write( ' _' .. word .. '_ ' )
        else
            cta.write( word )
        end
    end
end

If you're going to be using the same wordlist file each time, delete line 16 (which prompts you for the file), remove the -- from the start of line 15, and replace "wordlist.txt" with the name and path of your file. (-- signifies the start of a comment, so unless you remove it the line is treated as a comment and does nothing.) This will save you from having to choose the file each time.

Other points of difference/interest:

On line 17, instead of calling cta.knownWords() which returns CTA's list of known words, I'm calling cta.WordList( wordListFile ), which loads a word list from the file.

On line 21, we have a much simpler test here, because we are only marking words that are on the list.

If you want markers other than ' _' and '_ ', then on line 22 you can replace them with whatever before and after markers you want.

January 6, 2017 at 04:48 PM

that those words are somehow highlighted when I subsequently do the actual reading.

Out of curiosity, what are you using to read the text with?

If it's something that supports html, or some other kind of colouring markup, it would be quite simple to provide coloured output instead of _ _ markers.
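As a rough sketch of what that could look like, the helpers below wrap a word in an HTML span. The function names, the inline style and the colour are assumptions for illustration, not part of the CTA API; in the script above, the result would be passed to cta.write() in place of the ' _' .. word .. '_ ' expression.

```lua
-- Hypothetical helpers, not part of the CTA API.

-- Escape the characters that have special meaning in HTML text.
local function escapeHtml( text )
    -- The extra parentheses drop gsub's second return value (the count).
    return ( text:gsub( '[&<>]', { ['&'] = '&amp;', ['<'] = '&lt;', ['>'] = '&gt;' } ) )
end

-- Wrap a word in a span with the given CSS colour.
local function colourWord( word, colour )
    return '<span style="color: ' .. colour .. '">' .. escapeHtml( word ) .. '</span>'
end

-- Inside the word loop of the script above, this could be used as:
--   cta.write( colourWord( word, 'red' ) )
-- Words that are not on the list should still be passed through
-- escapeHtml() so that the output remains valid HTML.
```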

January 6, 2017 at 08:59 PM

Very cool and thanks for providing a rewritten script straight away ... I did try but failed to figure it out myself -- apart from some occasional Excel monstrosities, my programming career ended about 25 years ago when I couldn't make the step up from BASIC to Visual Basic....

I convert Chinese Kindle novels into text format, run them through CTA and read them on a tablet running Pleco reader. (I don't think I could give it colours.) CTA combined with a spreadsheet and corpus frequency stats/HSK levels helps me choose which upcoming unknown words I should learn. And I think the underscores will now remind me the first few times these recently-learned words come up in the text, which can only be helpful, albeit something of a crutch.

But also, I'd like a quick way of scanning e.g. a 锵锵三人行 transcript to see if it includes any words that I've already learned over the last few weeks: knowing that a recently-learned word is coming up, seeing the sentence in which it will come up, ought to make sure that I really notice it when it comes up for real, and so reinforce it. I think that can only be helpful. But it wouldn't be worthwhile if it meant spending ages hacking around in spreadsheets each time, sapping my willpower before I even get started on the studying. However thus far everything about CTA has been super slick and easy, so I think this approach will be super quick and painless.

One issue I've noticed previously is that some words I've learned aren't in the Cedict dictionary. So if I'd recently learned 百年老店 then I think CTA would assume that "百年","老" and "店" were all recently-learned words and it should help me by highlighting all instances of those three words in a text. I presume the solution is to periodically paste vocabulary which isn't in Cedict into the cedict_ts file?

January 7, 2017 at 01:33 AM

The wordlist the segmenter uses is actually separate from cedict (though based on it). With your example, you should highlight 百年老店 and then right-click and choose "Add custom word". This makes the segmenter likely to treat those characters as a single word (depending on the surrounding context and whether the characters before it form another word).

You won't be able to get definitions for this word unless you add a custom dictionary entry. You shouldn't add it to the main dictionary file, as this gets overwritten each upgrade. You can however set up a user dictionary. See this post for details.

January 8, 2017 at 03:11 PM

CTA combined with a spreadsheet and corpus frequency stats/HSK levels helps me choose which upcoming unknown words I should learn

Is choosing the words a manual process, or does the spreadsheet just spit out the words based on various formulas you've set up?

If the latter, it's highly likely that a Lua script could be written to do the whole thing, making the process even simpler. Even if it's the former, if you have general heuristics that could be applied, it's likely that a Lua script could be written to do it too.

January 12, 2017 at 07:06 AM

Documentation for Lua scripting is now online. See the first post for updated links.

January 18, 2017 at 10:54 AM

On 08/01/2017 at 3:11 PM, imron said:

Is choosing the words a manual process, or does the spreadsheet just spit out the words based on various formulas you've set up?

It's manual: numerical values relating to frequency in a few corpora are returned in one big table, and I just sort it in descending order a few times, picking out the best candidates. It only takes a couple of minutes to select 20+ words.

January 18, 2017 at 11:00 AM

The Lua guide has a bit about:

Instead of HSK levels you can also load custom word lists from files using the cta.WordList() function as follows:

local wordList = cta.WordList( 'wordlist.txt' )

Which will load the words from a file called ‘wordlist.txt’.

If I understand this right, I need to create a file called wordlist.txt (in the right format etc). Where do I save this file so it can be found and used by CTA? Currently I'm getting an error loading WordList from files "wordlist.txt".

January 18, 2017 at 12:26 PM

It doesn't really matter where, but you probably need to specify the full path to the file, so if the file is saved to c:\users\<your username>\Documents\wordlists.txt then in the Lua script you would write "C:/users/<your username>/Documents/wordlists.txt" (note the forward slashes / rather than backslashes).
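The reason for the forward slashes is that a backslash inside a quoted Lua string starts an escape sequence, so a Windows path pasted as-is is either rejected or silently altered, depending on the Lua version. A minimal sketch of the options in plain Lua; the helper name toForwardSlashes and the example path are made up for illustration:

```lua
-- Hypothetical helper (not part of the CTA API): converts a Windows-style
-- path to the forward-slash form.
local function toForwardSlashes( path )
    -- '\\' is a single literal backslash; the extra parentheses drop
    -- gsub's second return value (the replacement count).
    return ( path:gsub( '\\', '/' ) )
end

-- Three ways to write the same (made-up) path in a script:
local forward = 'C:/Users/me/Documents/wordlists.txt'     -- forward slashes
local doubled = 'C:\\Users\\me\\Documents\\wordlists.txt' -- escaped backslashes
local long    = [[C:\Users\me\Documents\wordlists.txt]]   -- long string, no escapes

-- Inside CTA, the forward-slash form could then be handed to
-- cta.WordList(), for example:
--   local wordList = cta.WordList( toForwardSlashes( long ) )
```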

January 18, 2017 at 01:32 PM

Thanks

June 27, 2017 at 03:45 AM

Uploading a script that will check the coverage of characters in a document that exist in the HSK6 wordlists (see here for context).

char-coverage.lua

June 27, 2017 at 11:46 AM

What do you do once you download the file?

June 27, 2017 at 11:50 AM

Run it from the Tools->Run Script dialog from within Chinese Text Analyser.

See here for documentation.

June 27, 2017 at 12:25 PM

Thank you. Just asking, is there any way to import word lists into Chinese Text Analyser?

June 27, 2017 at 12:46 PM

Yes. Make a file containing one word per line and then choose File -> Import from the menu.

See here for more details.

August 26, 2017 at 02:55 AM

Is it possible to create a script that deletes the lines in which all words are known, leaving only the lines that contain unknown words?

And is it possible to order the lines by the number of unknown words?

1 unknown

2 unknown

3 unknown

etc...

August 26, 2017 at 05:40 AM

Yes, it's possible to do both of these things.

Would you want a single script that does both?
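For anyone curious how the two requests fit together, here is a minimal sketch of the filtering and sorting logic in plain Lua. It is only an illustration under stated assumptions: the function name unknownLinesSorted and the input format (each line already split into words, with known words held in a plain table) are made up, and a real script would get the lines, words and known-word list from the CTA API instead.

```lua
-- Hypothetical helper, not part of the CTA API.
-- lines: array of { text = "...", words = { "...", ... } }
-- known: set of known words, e.g. { ["我"] = true }
-- Returns only the lines with at least one unknown word, ordered by the
-- number of unknown words (fewest first).
local function unknownLinesSorted( lines, known )
    local result = {}
    for index, line in ipairs( lines ) do
        local unknown = 0
        for _, word in ipairs( line.words ) do
            if not known[word] then
                unknown = unknown + 1
            end
        end
        -- Lines made up entirely of known words are dropped.
        if unknown > 0 then
            result[#result + 1] = { text = line.text, unknown = unknown, index = index }
        end
    end

    table.sort( result, function( a, b )
        if a.unknown ~= b.unknown then
            return a.unknown < b.unknown
        end
        return a.index < b.index -- ties keep their original order
    end )
    return result
end
```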


Post authors, in thread order: imron, imron, Guest realmayo, imron, imron, Guest realmayo, imron, imron, imron, Guest realmayo, Guest realmayo, imron, Guest realmayo, imron, RogerGe, imron, RogerGe, imron, dougwar, imron