
Lua Script and Chinese Text Analyser



Posted

Also, just to get things started, here is a description of each of the example scripts that currently ship with CTA, to give you an idea of some of the things that are possible:

 

* anki-cloze.lua - scans a directory (and all sub-directories) for any .txt files and, for each file, finds all mostly known sentences (sentences that contain at least one unknown word and > 97% known words), then outputs an Anki-compatible cloze-deleted sentence, e.g. something like 到了那种时候,谁也不怕死人骨头了,夜里就是挨在一起睡觉也不会做{{c1::噩梦}}。

 

* highlight-unknown.lua - exports a document, surrounding each unknown word with '_' markers.

 

* more-than-10-less-than-20.lua - finds all unknown words in a document that appear at least 10 times and no more than 20 times.

 

* sentences.lua - finds all sentences in a file that contain at least one unknown word and in which more than 97% of the words are known.  Outputs all the unknown words in these sentences sorted by frequency, and all the sentences sorted by percentage of known words.

 

* subs2anki.lua - does everything in this spreadsheet without the need for all the copying and pasting between the different worksheets and CTA.
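
To give a feel for the kind of logic involved, here is a rough plain-Lua sketch of the frequency filter described for more-than-10-less-than-20.lua. It is not the shipped script: the function name unknownWordsInRange is made up for illustration, and the words array and known set are stand-ins for data a real script would get from CTA.

```lua
-- Sketch only: not the code shipped in more-than-10-less-than-20.lua.
-- `words` is a plain array of strings and `known` is a set (word -> true);
-- both stand in for data a real script would get from CTA.
local function unknownWordsInRange( words, known, minCount, maxCount )
    -- Count the occurrences of each unknown word.
    local counts = {}
    for _, word in ipairs( words ) do
        if not known[word] then
            counts[word] = ( counts[word] or 0 ) + 1
        end
    end

    -- Keep only the words whose count falls inside the inclusive range.
    local result = {}
    for word, count in pairs( counts ) do
        if count >= minCount and count <= maxCount then
            result[#result + 1] = { word = word, count = count }
        end
    end

    -- Most frequent first; ties are broken by the word itself so the
    -- order is the same on every run.
    table.sort( result, function( a, b )
        if a.count ~= b.count then
            return a.count > b.count
        end
        return a.word < b.word
    end )
    return result
end
```

Calling unknownWordsInRange( words, known, 10, 20 ) would return a list of { word, count } tables for the unknown words that appear at least 10 and at most 20 times.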

 

If you want to see what is involved in writing these scripts, you can find them in the scripts directory of your install location.  This will typically be:

 

Windows: c:\program files\chinesetextanalyser\scripts

macOS: /Applications/Chinese Text Analyser.app/Contents/Resources/scripts/

Linux: /opt/ChineseTextAnalyser/scripts

 

The files can be opened in any text editor, though you'll probably find it easier to use a programming-friendly editor such as Notepad++ or Sublime Text.  If you're going to make changes, it's best to copy these scripts elsewhere and edit the copies, because everything in this directory is overwritten each time you upgrade CTA.

Posted

Imron, this all looks terrific, and thanks for the underscore script (and also the pastable statistics)! I've been trying to work out how to modify that script but I can't get my head around how to do it. Is there an easy way to make a version which instead of scanning a text and underscoring all words that are unknown, instead scans a text and underscores all words which exist in another given list? The idea is that having used CTA to identify words worth studying ahead of reading a text or a chapter of a novel, and looked up & learned those words, that those words are somehow highlighted when I subsequently do the actual reading.

Posted

Is there an easy way to make a version which instead of scanning a text and underscoring all words that are unknown, instead scans a text and underscores all words which exist in another given list?

 

Yes there is!

 

You can load custom wordlists from a file.  The file should contain one word per line.  Here's a modified version of that script that loads the wordlist from a file and then only adds the markers if the word exists in the wordlist.  Save it to a file (e.g. highlight-words.lua) and you should be good to go.

--[[
@cta Exports a document, surrounding each word from a specified wordlist with '_' markers.
--]]

local cta = require 'cta'

local file = cta.askUserForFileToOpen( { title = "Please choose a file with Chinese text" } )

if file == nil then
    error "No file selected"
end

local document = cta.Document( file )

--local wordListFile = "wordlist.txt"
local wordListFile = cta.askUserForFileToOpen( { title = "Please choose a word list file" } )
local wordList = cta.WordList( wordListFile )

for line in document:lines( true ) do
    for word, _, wordType in line:words( true ) do
        if wordList:contains( word ) then
            cta.write( ' _' .. word .. '_ ' )
        else
            cta.write( word )
        end
    end
end


If you're going to be using the same wordlist file each time, then delete line 16 (which prompts you for the file), remove the -- from the start of line 15, and replace "wordlist.txt" with the name and path of your file (-- signifies the start of a comment, which is why you need to remove it; otherwise the line is treated as a comment and does nothing).  This will save you from having to choose the file each time.

 

Other points of difference/interest:

 

On line 17, instead of calling cta.knownWords() which returns CTA's list of known words, I'm calling cta.WordList( wordListFile ), which loads a word list from the file.

 

On line 21, we have a much simpler test here, because we are only marking words that are on the list.

 

If you want different markers other than ' _', then on line 22 you can replace the contents of the ' _' and '_ ' with whatever before and after markers you want.

Posted
that those words are somehow highlighted when I subsequently do the actual reading.

Out of curiosity, what are you using to read the text with?

 

If it's something that supports html, or some other kind of colouring markup, it would be quite simple to provide coloured output instead of _ _ markers.
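
As a rough idea of what that could look like, here is a plain-Lua sketch that wraps selected words in an HTML span instead of underscores. The helper names escapeHtml and colourWord, the span markup, and the sample words are all assumptions made for illustration; nothing here is part of CTA itself.

```lua
-- Sketch only: the <span> markup and the colour are illustrative choices.
local function escapeHtml( text )
    -- Escape the characters that are special in HTML.
    return ( text:gsub( '[&<>]', { ['&'] = '&amp;', ['<'] = '&lt;', ['>'] = '&gt;' } ) )
end

local function colourWord( word, colour )
    return '<span style="color:' .. colour .. '">' .. escapeHtml( word ) .. '</span>'
end

-- Hypothetical stand-ins for the words in a line and for the word list.
local words = { '夜里', '不会', '做', '噩梦', '。' }
local highlight = { ['噩梦'] = true }

local out = {}
for _, word in ipairs( words ) do
    if highlight[word] then
        out[#out + 1] = colourWord( word, 'red' )
    else
        out[#out + 1] = escapeHtml( word )
    end
end
local html = table.concat( out )
print( html )
```

In the highlight script earlier in the thread, the equivalent change would be to pass something like colourWord( word, 'red' ) to cta.write in place of the ' _' .. word .. '_ ' expression.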

Posted

Very cool and thanks for providing a rewritten script straight away :D   ... I did try but failed to figure it out myself -- apart from some occasional excel monstrosities my programming career ended about 25 years ago when I couldn't make the step up from basic to visual basic....

 

I convert Chinese kindle novels into text format, run them through CTA and read them on a tablet running Pleco reader. (I don't think I could give it colours.) CTA combined with a spreadsheet and corpus frequency stats/HSK levels helps me choose which upcoming unknown words I should learn. And I think now, the underscores will remind me the first few times that these recently-learned words come up in the text which can only be helpful, albeit something of a crutch.

 

But also, I'd like a quick way of scanning e.g. a 锵锵三人行 transcript to see if it includes any words that I've already learned over the last few weeks: knowing that a recently-learned word is coming up, seeing the sentence in which it will come up, ought to make sure that I really notice it when it comes up for real, and so reinforce it. I think that can only be helpful. But it wouldn't be worthwhile if it meant spending ages hacking around in spreadsheets each time, sapping my willpower before I even get started on the studying. However thus far everything about CTA has been super slick and easy, so I think this approach will be super quick and painless.

 

One issue I've noticed previously is that some words I've learned aren't in the Cedict dictionary. So if I'd recently learned  百年老店  then I think CTA would assume that "百年","老" and "店" were all recently-learned words and it should help me by highlighting all instances of those three words in a text. I presume the solution is to periodically paste vocabulary which isn't in Cedict into the cedict_ts file?

Posted

The wordlist the segmenter uses is actually separate from cedict (though based on it).  With your example, you should highlight 百年老店 and then right-click and choose "Add custom word".  This makes the segmenter likely to treat those characters as a single word (depending on the surrounding context and whether the characters before it form another word).

 

You won't be able to get definitions for this word unless you add a custom dictionary entry.  You shouldn't add it to the main dictionary file, as this gets overwritten each upgrade.  You can however set up a user dictionary.  See this post for details.

Posted
CTA combined with a spreadsheet and corpus frequency stats/HSK levels helps me choose which upcoming unknown words I should learn

Is choosing the words a manual process, or does the spreadsheet just spit out the words based on various formulas you've set up?

 

If the latter, it's highly likely that a Lua script could be written to do the whole thing, making the process even simpler.  Even if it's the former, if you have general heuristics that could be applied, it's likely that a Lua script could be written to do it too.

Posted

Documentation for Lua scripting is now online.  See the first post for updated links.

Posted
On 08/01/2017 at 3:11 PM, imron said:

Is choosing the words a manual process, or does the spreadsheet just spit out the words based on various formulas you've set up?

 

It's manual: numerical values relating to frequency in a few corpuses are returned in one big table and I just arrange in descending order a few times, picking out the best candidates. Only takes a couple of minutes to select 20+ words.

Posted

The Lua guide has a bit about:

 

Instead of HSK levels you can also load custom word lists from files using the cta.WordList() function as follows:

local wordList = cta.WordList( 'wordlist.txt' )

Which will load the words from a file called ‘wordlist.txt’.

 

If I understand this right, I need to create a file called wordlist.txt (in the right format etc). Where do I save this file so it can be found and used by CTA? Currently I'm getting an error loading WordList from files "wordlist.txt".

Posted

It doesn't really matter where, but you probably need to specify the full path to the file, so if the file is saved to c:\users\<your username>\Documents\wordlists.txt then in the Lua script you would write "C:/users/<your username>/Documents/wordlists.txt" (note the forward slashes / rather than backslashes).
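
The reason for the forward slashes is that a backslash starts an escape sequence inside a quoted Lua string, so a Windows path typed with single backslashes is not read as written. If you would rather paste the path exactly as Windows shows it, a small helper along these lines could convert it. This is only a sketch: toForwardSlashes is a hypothetical name, not a CTA function, and the path uses a made-up username.

```lua
-- Sketch only: `toForwardSlashes` is a hypothetical helper, not part of CTA.
local function toForwardSlashes( path )
    -- In a Lua pattern a backslash is an ordinary character, but inside a
    -- quoted Lua string it must itself be escaped, hence '\\'.
    return ( path:gsub( '\\', '/' ) )
end

-- [[ ... ]] is a Lua long string, so the backslashes are taken literally.
-- "me" is a placeholder username.
local windowsPath = [[c:\users\me\Documents\wordlists.txt]]
local luaPath = toForwardSlashes( windowsPath )
print( luaPath )
```

The converted string could then be passed to cta.WordList(). Doubling each backslash in a quoted string ("c:\\users\\...") is another way to get the same result.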

  • 5 months later...
Posted

Yes. Make a file containing one word per line and then choose File -> Import from the menu.

 

See here for more details. 

 

  • 1 month later...
Posted

Is it possible to create a script that deletes the lines containing only known words, so that just the lines with unknown words remain?

 

And is it possible to order the lines by number of unknown words?

1 unknown 

2 unknown

3 unknown

etc...

Posted

Yes, it's possible to do both of these things.

 

Would you want a single script that does both?
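
For illustration, the filtering and ordering logic for such a script might look roughly like the plain-Lua sketch below. It is not a finished CTA script: unknownLinesSorted is a made-up name, and the lines and known tables are hypothetical stand-ins for the document lines and known-word list that a real script would get from CTA. A real script would also need to skip punctuation, which this sketch would count as unknown.

```lua
-- Sketch only. `lines` (each an array of words) and `known` (word -> true)
-- are hypothetical stand-ins for data a real script would get from CTA.
local function unknownLinesSorted( lines, known )
    local kept = {}
    for index, words in ipairs( lines ) do
        local unknown = 0
        for _, word in ipairs( words ) do
            if not known[word] then
                unknown = unknown + 1
            end
        end
        -- Drop lines in which every word is known.
        if unknown > 0 then
            kept[#kept + 1] = { index = index, unknown = unknown, text = table.concat( words ) }
        end
    end

    -- Fewest unknown words first; table.sort is not stable, so ties fall
    -- back to the original line order.
    table.sort( kept, function( a, b )
        if a.unknown ~= b.unknown then
            return a.unknown < b.unknown
        end
        return a.index < b.index
    end )
    return kept
end

local known = { ['我'] = true, ['喜欢'] = true, ['看'] = true, ['书'] = true }
local lines = {
    { '我', '喜欢', '看', '书' },      -- no unknown words, dropped
    { '我', '喜欢', '噩梦', '骨头' },  -- 2 unknown words
    { '我', '看', '骨头' },            -- 1 unknown word
}
local sorted = unknownLinesSorted( lines, known )
for _, line in ipairs( sorted ) do
    print( line.unknown, line.text )
end
```

Lines with one unknown word come out first, then two, and so on, which matches the ordering asked for above.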
