Lua Script and Chinese Text Analyser

August 26, 2017 at 02:58 PM

Yes, would be great.

August 26, 2017 at 03:10 PM

If I'm not asking for too much, this is my workflow; maybe you know a way to automate it:

1. First I get a text file and put one sentence per line.

2. I run the highlight-unknown.lua script.

3. In Excel I sort the rows by number of known words.

4. I use some formulas to build rows like the one below, then I export them to Anki:

e.g.

Sentence with blank: 那个学生昨天 __ 了书。
Unknown word: 丢
Complete sentence: 那个学生昨天 _丢_ 了书。

August 28, 2017 at 09:04 AM

Yep, that can be completely automated.

The script could also be made to export sentences in Anki's 'cloze' format if you'd like.

September 1, 2017 at 12:43 PM

Imron, is it possible to change the script so that the percentage of new characters is checked against your current word list rather than HSK6? Thanks.

September 4, 2017 at 03:01 AM

Yes, it's very easy to do. On line 11 of the file, change

for word in cta.hskLevel( lower, upper ):words() do

to

for word in cta.knownWords():words() do

This will then build the list of characters from your known vocabulary rather than the HSK 1-6 vocabulary.

Attached is a copy of the script with that modification (plus a few cosmetic changes to remove reference to HSK from variable names and output).

char-coverage-known.lua
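
The idea behind the change above (collect every character that appears in a word list, then see what share of a text's characters fall in that set) can be sketched in plain Lua. This is only a minimal standalone illustration, not the attached script: it does not call the CTA API, the word list is an ordinary table standing in for whatever `cta.knownWords():words()` would yield, and the helper names `buildCharSet` and `charCoverage` are hypothetical. It skips single-byte (ASCII) characters but makes no attempt to filter Chinese punctuation.

```lua
-- Byte pattern matching one UTF-8 encoded character (works in Lua 5.1 and later).
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Collect every character that occurs in a list of words into a set.
-- (Hypothetical helper; the word list is a plain table, not a CTA object.)
local function buildCharSet(words)
  local set = {}
  for _, word in ipairs(words) do
    for ch in word:gmatch(UTF8_CHAR) do
      set[ch] = true
    end
  end
  return set
end

-- Count how many multi-byte characters of `text` appear in `charSet`.
-- Returns the known count, the total count and the percentage known.
local function charCoverage(text, charSet)
  local known, total = 0, 0
  for ch in text:gmatch(UTF8_CHAR) do
    if #ch > 1 then              -- skip ASCII such as spaces and newlines
      total = total + 1
      if charSet[ch] then known = known + 1 end
    end
  end
  local percent = total > 0 and (known * 100 / total) or 0
  return known, total, percent
end

-- Assumed sample data standing in for a real known-word list and document.
local knownWords = { "学生", "昨天" }
local known, total, percent = charCoverage("学生昨天丢了书", buildCharSet(knownWords))
print(string.format("%d of %d characters known (%.1f%%)", known, total, percent))
```

Swapping the source of `knownWords` is the only thing that changes between an HSK-based and a vocabulary-based check; the counting stays the same.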

September 4, 2017 at 06:17 AM

@dougwar please find attached a script that should do what you want.

It finds all the sentences in a given document that contain unknown words.

Then it sorts those sentences by the number of unknown words, with sentences containing the fewest unknown words appearing first.

Then for each unknown word in each sentence it prints

The total number of unknown words in the sentence

The sentence with the current unknown word replaced with __

The unknown word

The sentence with the word surrounded by __ e.g. _生词_

This means that each unknown word in the sentence will have its own line in the output, so if the sentence has 5 unknown words, that sentence will appear 5 times in the output with a different word replaced each time.

You should then be able to save this file and import it directly into Anki.

Let me know if this does what you want, or if you need any adjustments.

unknown-sentences.lua
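
For readers who want to see the shape of that output, here is a rough sketch of the row-building step in plain Lua. It is not the attached script: it assumes the sentences and their unknown words have already been worked out (in CTA the engine would supply them), it assumes tab-separated fields, and the names `replaceFirst` and `buildRows` are made up for illustration.

```lua
-- Replace the first literal occurrence of `word` in `sentence` with `replacement`.
local function replaceFirst(sentence, word, replacement)
  local s, e = sentence:find(word, 1, true)   -- plain find, no pattern characters
  if not s then return sentence end
  return sentence:sub(1, s - 1) .. replacement .. sentence:sub(e + 1)
end

-- Build one tab-separated row per unknown word, with the sentences that have
-- the fewest unknown words first. `entries` is an assumed input format:
-- { sentence = "...", unknown = { "word", ... } }
local function buildRows(entries)
  local sorted = {}
  for i, entry in ipairs(entries) do sorted[i] = entry end   -- sort a copy
  table.sort(sorted, function(a, b) return #a.unknown < #b.unknown end)

  local rows = {}
  for _, entry in ipairs(sorted) do
    for _, word in ipairs(entry.unknown) do
      rows[#rows + 1] = table.concat({
        #entry.unknown,                                          -- unknown words in sentence
        replaceFirst(entry.sentence, word, "__"),                -- sentence with blank
        word,                                                    -- the unknown word
        replaceFirst(entry.sentence, word, "_" .. word .. "_"),  -- word marked in sentence
      }, "\t")
    end
  end
  return rows
end

-- Assumed sample input.
local rows = buildRows({
  { sentence = "我喜欢看电影和听音乐。", unknown = { "电影", "音乐" } },
  { sentence = "那个学生昨天丢了书。", unknown = { "丢" } },
})
for _, row in ipairs(rows) do print(row) end
```

A sentence with two unknown words produces two rows, each blanking a different word, which matches the behaviour described above.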

May 2, 2018 at 08:08 AM

Uploading a script that extracts a marked word from the first field of a tab-separated file (e.g. from cards exported by Anki).

extract-marked-words.lua

See here for context.

See here for instructions on how to run the script.
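
As a rough illustration of what such an extraction involves, the sketch below takes the first tab-separated field of each line and pulls out a marked word. The marking convention is an assumption: it treats a word wrapped in single underscores (as in `_生词_` earlier in the thread) as "marked", which may not be what the attached script expects. The function names are hypothetical.

```lua
-- Return the first field of a tab-separated line.
local function firstField(line)
  return line:match("^([^\t]*)")
end

-- Return the word wrapped in single underscores, or nil if there is none.
-- ASSUMPTION: marked words look like _word_; adjust the pattern otherwise.
local function markedWord(field)
  return field:match("_([^_]+)_")
end

-- Collect the marked word from the first field of every line that has one.
local function extractMarkedWords(lines)
  local words = {}
  for _, line in ipairs(lines) do
    local word = markedWord(firstField(line))
    if word then words[#words + 1] = word end
  end
  return words
end

-- Assumed sample lines, shaped like a tab-separated card export.
local words = extractMarkedWords({
  "那个学生昨天_丢_了书。\t丢\tto lose",
  "没有标记的句子\tnothing marked here",
})
for _, word in ipairs(words) do print(word) end
```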

March 29, 2020 at 05:10 AM

Uploading a script that finds all unknown characters in a document, and prints them out in order of frequency (highest to lowest).

An 'unknown' character is defined as a character that does not exist in any of your known words.

unknown-chars.lua
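
The definition above translates fairly directly into code. The following is a minimal plain-Lua sketch under stated assumptions, not the attached script: the known-word list is a plain table rather than anything obtained from CTA, the function name `unknownCharsByFrequency` is invented, and only single-byte (ASCII) characters are skipped, so Chinese punctuation would be counted too.

```lua
-- Byte pattern matching one UTF-8 encoded character (works in Lua 5.1 and later).
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Return the unknown characters of `text`, most frequent first, plus their counts.
-- A character is "unknown" if it does not occur in any word of `knownWords`.
local function unknownCharsByFrequency(text, knownWords)
  local knownChars = {}
  for _, word in ipairs(knownWords) do
    for ch in word:gmatch(UTF8_CHAR) do knownChars[ch] = true end
  end

  local counts, order = {}, {}
  for ch in text:gmatch(UTF8_CHAR) do
    if #ch > 1 and not knownChars[ch] then   -- skip ASCII and known characters
      if not counts[ch] then
        counts[ch] = 0
        order[#order + 1] = ch
      end
      counts[ch] = counts[ch] + 1
    end
  end

  table.sort(order, function(a, b)
    if counts[a] ~= counts[b] then return counts[a] > counts[b] end
    return a < b                             -- tie-break so the order is deterministic
  end)
  return order, counts
end

-- Assumed sample data.
local order, counts = unknownCharsByFrequency("学生昨天丢了书 书丢 书", { "学生", "昨天" })
for _, ch in ipairs(order) do print(ch, counts[ch]) end
```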

March 29, 2020 at 08:11 AM

Thanks Imron :))

It worked once. The second time I tried, I got this:

March 29, 2020 at 08:43 AM

You're running a different script "subs2anki.lua" and it is likely expecting a file in a different input format.

March 29, 2020 at 08:53 AM

9 minutes ago, imron said:

You're running a different script "subs2anki.lua" and it is likely expecting a file in a different input format.

Thanks.

It works perfectly.

August 18, 2020 at 09:34 PM

Hi

Is it possible to make a script that reads a directory with several files and outputs the % of known words in each file?

August 19, 2020 at 03:51 AM

Yes, it’s possible. What sort of format would you expect the output to take?

August 19, 2020 at 12:40 PM

8 hours ago, imron said:

Yes, it’s possible. What sort of format would you expect the output to take?

I'm thinking something like this:

File name - Total Words - % Known Total - Total Unique - % Known Unique - Unique Characters

I have a big database of books, and I want to generate a readability list like the one you made on your website, to help me choose which book to read next.

Thanks in advance

August 25, 2020 at 06:25 PM

I wrote a script that reads all the files in a directory and shows the % of known words in each file.

Example:

File Name   ;   Total Words   ;   Know Words   ;   % Know   %
1984.txt   ;   96408   ;   65274   ;   67   %
1Q84.txt   ;   98991   ;   64990   ;   65   %
1Q84BOOK2.txt   ;   85398   ;   55843   ;   65   %
1Q84BOOK3.txt   ;   113353   ;   74226   ;   65   %

The script is in the attachment.

If you have thousands of books, it's a good tool for finding more easily what is at your level to read.

Percent Know Words Directory.lua

February 20, 2022 at 09:16 AM

Thanks, dougwar, for that script, I ran it through my library and found several potential books to read. I had no idea they were at my level.

Hi @imron, would it be possible to extend dougwar's script with the number of unique words and known unique words? Could you give me a few pointers on which functions to look at? Not that I know programming, but I am willing to go to great lengths to find JUST the right book(s) to read...

The reason I am trying to do this is that I found that "Total Percent Known" by itself is not really enough to judge if I can read a book without frequently needing a dictionary. I can memorize some checked words if they appear a couple times, but obviously not if the vast majority of the unique words are unknown to me.

To illustrate with an example of two extremes:

While "Total Percent Known" might be at, say, 96% (my level for 《大智度論》 now), the remaining 4% might still contain an immense number of unknown words (almost 60% of unique words are unknown to me in the same book).
On the other hand simpler texts at the same 96% total percent known level might only have a few hundred unknown unique words (such as 论确实性 On Certainty by Wittgenstein, where only around 20% of unique words are unknown to me).

I am aware that I cannot read 《大智度論》 without a dictionary (and a good teacher), so I would like to find easy reads for my spare time (i.e. not Wittgenstein). I so enjoy not checking the dictionary at all, or only a few times; it makes me feel far more immersed in the book.
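
The difference between the two measures described above is easy to see in a small sketch. This is a plain-Lua illustration only: it assumes the text has already been segmented into a list of words (which CTA would normally do), the known vocabulary is a simple set, percentages are rounded down as a design choice, and the name `knownStats` and its result fields are hypothetical.

```lua
-- Compute total and unique known-word percentages for a segmented text.
-- `tokens` is a list of words in reading order; `knownSet` maps known words to true.
local function knownStats(tokens, knownSet)
  local total, knownTotal = 0, 0
  local unique, knownUnique = 0, 0
  local seen = {}
  for _, word in ipairs(tokens) do
    local isKnown = knownSet[word] == true
    total = total + 1
    if isKnown then knownTotal = knownTotal + 1 end
    if not seen[word] then                 -- first time this word appears
      seen[word] = true
      unique = unique + 1
      if isKnown then knownUnique = knownUnique + 1 end
    end
  end
  -- Whole-number percentage, rounded down; 0 when there is nothing to count.
  local function pct(part, whole)
    return whole > 0 and math.floor(part * 100 / whole) or 0
  end
  return {
    total = total,
    percentKnownTotal = pct(knownTotal, total),
    unique = unique,
    percentKnownUnique = pct(knownUnique, unique),
  }
end

-- Assumed sample: a few very frequent known words and two rare unknown ones.
local tokens = { "我", "的", "书", "的", "我", "的", "般若", "我", "的", "涅槃" }
local stats = knownStats(tokens, { ["我"] = true, ["的"] = true, ["书"] = true })
print(string.format("sample.txt ; %d ; %d%% ; %d ; %d%%",
  stats.total, stats.percentKnownTotal, stats.unique, stats.percentKnownUnique))
```

In this sample 80% of the running words are known but only 60% of the unique words, because the known words are the ones that repeat. The gap between those two numbers is the effect described in the post above.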

February 20, 2022 at 11:14 AM

Can you remind me in a week if I haven’t gotten back to you by then? Thanks.

March 29, 2022 at 01:34 PM

On 2/20/2022 at 7:14 PM, imron said:

remind me in a week

Apologies for the late reply, I tried to use a workaround, had an almost working AutoHotKey script, then Windows died irreparably, so here I am (this time on Linux) without a good solution for this. Could you please help me get started?

March 30, 2022 at 06:59 AM

Here you go.

As a bonus, it should also run considerably faster than the previous script because I'm calling out to get the stats from the CTA engine, rather than totaling manually inside the lua script.

known-words.lua

March 31, 2022 at 03:32 PM

Amazing stuff, thank you so much! I recommend you include it with the program, this is immensely useful.

Previously I had an AHK script that went through my Calibre book catalog, opened the books one by one in CTA, waited for the doc to load by checking the status bar, then clicked on the stats one by one, copied and saved them, went back to Calibre, opened the custom metadata view, and pasted the stats one by one. Not very fast, as you can imagine. After a lot of refinement I could get my script to run at around 10 books per minute (depending on file size).

With your Lua script I just processed 161 books in 11 seconds (almost 100 times faster than my method). Now I just have to find a way to incorporate the info back into Calibre.

Posts above, in order, are by: dougwar, dougwar, imron, RogerGe, imron, imron, imron, imron, Jan Finster, imron, Jan Finster, dougwar, imron, dougwar, dougwar, yaokong, imron, yaokong, imron, yaokong.