
Lua Script and Chinese Text Analyser


imron


If I'm not asking for too much, this is my workflow; maybe you know a way to automate it:

1. First I get a text file and put one sentence per line.

2. I run the highlight-unknown.lua script.

3. In Excel I sort the rows by the number of known words.

4. I use some formulas to build the columns below, then I export them to Anki.

Example:

Sentence with blank          Unknown word                 Complete sentence.

那个学生昨天 __ 了书                 丢                      那个学生昨天 _丢_ 了书。
   

              

 

   

Yes, it's very easy to do.  On line 11 of the file, change

 

for word in cta.hskLevel( lower, upper ):words() do

to

for word in cta.knownWords():words() do

 

This will then build the list of characters from your known vocabulary rather than the HSK 1-6 vocabulary.

 

Attached is a copy of the script with that modification (plus a few cosmetic changes to remove reference to HSK from variable names and output).

 

char-coverage-known.lua
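
For readers without the attachment, here is a minimal sketch of the idea of building a character list from a word list. It is not the attached script: the `knownWords` table is a hypothetical stand-in for the words yielded by `cta.knownWords():words()`, and `utf8Chars` and `charsFromWords` are illustrative helper names. It assumes the file is saved as UTF-8.

```lua
-- Hypothetical stand-in for the words yielded by cta.knownWords():words().
local knownWords = { "学生", "昨天", "学习" }

-- Iterate over the UTF-8 characters of a string:
-- one lead byte followed by any continuation bytes.
local function utf8Chars(s)
  return s:gmatch("[\1-\127\194-\244][\128-\191]*")
end

-- Collect every distinct character that appears in the given words.
local function charsFromWords(words)
  local seen, count = {}, 0
  for _, word in ipairs(words) do
    for ch in utf8Chars(word) do
      if not seen[ch] then
        seen[ch] = true
        count = count + 1
      end
    end
  end
  return seen, count
end

local chars, total = charsFromWords(knownWords)
print(total)  -- number of distinct characters across the word list
```

Swapping the word source (HSK list or known vocabulary) changes only the table that is passed in; the character-collecting loop stays the same.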

 


@dougwar please find attached a script that should do what you want.

 

It finds all the sentences in a given document that contain unknown words.

Then it sorts those sentences by the number of unknown words, with sentences containing the fewest unknown words appearing first.

Then for each unknown word in each sentence it prints

     The total number of unknown words in the sentence

     The sentence with the current unknown word replaced with __

     The unknown word

     The sentence with the word surrounded by underscores, e.g. _生词_

 

This means that each unknown word in the sentence will have its own line in the output, so if the sentence has 5 unknown words, that sentence will appear 5 times in the output with a different word replaced each time.

 

You should then be able to save this file and import it directly into Anki.

 

Let me know if this does what you want, or if you need any adjustments.

 

 

unknown-sentences.lua
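
The behaviour described in this post can be sketched in plain Lua as follows. This is not the attached script: the `sentences` table is hypothetical input in which the unknown words have already been identified (in the real script, Chinese Text Analyser does the segmentation and lookup), `replaceFirst` and `clozeLines` are illustrative names, and the tab separator is an assumption chosen because Anki can import tab-separated text.

```lua
-- Hypothetical input: each sentence paired with the unknown words found in it.
local sentences = {
  { text = "他把钥匙丢在办公室了。", unknown = { "钥匙", "办公室" } },
  { text = "那个学生昨天丢了书。", unknown = { "丢" } },
}

-- Replace the first occurrence of `word` in `text` with `replacement`.
-- The `true` argument makes find() do a plain search, not a pattern match.
local function replaceFirst(text, word, replacement)
  local s, e = text:find(word, 1, true)
  if not s then return text end
  return text:sub(1, s - 1) .. replacement .. text:sub(e + 1)
end

-- Build one output line per unknown word, fewest-unknown sentences first.
local function clozeLines(items)
  local sorted = {}
  for i, item in ipairs(items) do sorted[i] = item end  -- leave input order alone
  table.sort(sorted, function(a, b) return #a.unknown < #b.unknown end)

  local lines = {}
  for _, item in ipairs(sorted) do
    for _, word in ipairs(item.unknown) do
      lines[#lines + 1] = table.concat({
        #item.unknown,                                   -- unknown words in sentence
        replaceFirst(item.text, word, "__"),             -- sentence with blank
        word,                                            -- the unknown word
        replaceFirst(item.text, word, "_" .. word .. "_"), -- word marked in sentence
      }, "\t")
    end
  end
  return lines
end

local lines = clozeLines(sentences)
for _, line in ipairs(lines) do print(line) end
```

A sentence with two unknown words yields two lines, each blanking a different word, which matches the one-line-per-unknown-word behaviour described above.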


  • 7 months later...
  • 1 year later...
  • 4 months later...
8 hours ago, imron said:

Yes, it’s possible. What sort of format would you expect the output to take?

I'm thinking something like this:

 

File name - Total Words - % Known Total - Total Unique - % Known Unique - Unique Characters

 

I have a big database of books, and I want to generate a readability list like the one on your website to help me choose which book to read next.

 

Thanks in advance  


I wrote a script that reads all the files in a directory and shows the percentage of known words in each file.

Example:

File Name     ; Total Words ; Known Words ; % Known
1984.txt      ; 96408       ; 65274       ; 67 %
1Q84.txt      ; 98991       ; 64990       ; 65 %
1Q84BOOK2.txt ; 85398       ; 55843       ; 65 %
1Q84BOOK3.txt ; 113353      ; 74226       ; 65 %

 

The script is attached.

If you have thousands of books, it's a good tool for finding what is at your reading level.

Percent Know Words Directory.lua
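
As a rough illustration of how one row of that output could be produced, here is a small sketch. It is not the attached script: `statsRow` is a hypothetical helper, and the word counts are passed in directly rather than obtained from Chinese Text Analyser. The percentage is rounded down, which is an assumption inferred from the example figures (65274 of 96408 is about 67.7%, shown as 67).

```lua
-- Format one semicolon-separated row: file name, totals, and percent known.
-- The percentage is truncated to a whole number (an assumption, see above).
local function statsRow(fileName, totalWords, knownWords)
  local percent = 0
  if totalWords > 0 then  -- avoid dividing by zero for empty files
    percent = math.floor(knownWords * 100 / totalWords)
  end
  return string.format("%s ; %d ; %d ; %d %%",
                       fileName, totalWords, knownWords, percent)
end

print(statsRow("1984.txt", 96408, 65274))
-- prints: 1984.txt ; 96408 ; 65274 ; 67 %
```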


  • 1 year later...

Thanks, dougwar, for that script, I ran it through my library and found several potential books to read. I had no idea they were at my level.

 

Hi @imron, would it be possible to extend dougwar's script with the number of unique words and known unique words? Could you give me a few pointers on which functions to look at? Not that I know programming, but I am willing to go to great lengths to find JUST the right book(s) to read...

 

The reason I am trying to do this is that I found that "Total Percent Known" by itself is not really enough to judge whether I can read a book without frequently needing a dictionary. I can memorize some of the words I look up if they appear a couple of times, but obviously not if the vast majority of the unique words are unknown to me.

 

To illustrate with an example of two extremes:

  • While "Total Percent Known" might be at, say, 96% (my level for《大智度論》now), the remaining 4% might still contain an immense number of unknown words (almost 60% of unique words are unknown to me in the same book).
  • On the other hand simpler texts at the same 96% total percent known level might only have a few hundred unknown unique words (such as 论确实性 On Certainty by Wittgenstein, where only around 20% of unique words are unknown to me). 

I am aware that I cannot read 《大智度論》 without a dictionary (and a good teacher), so I would like to find easy reads for my spare time (aka not Wittgenstein). I so enjoy not checking the dictionary at all, or just a few times; it makes me feel far more immersed in the book.
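
The difference between the two measures can be made concrete with a small sketch. Everything here is hypothetical: `tokens` stands in for a segmented text, `known` for a known-word set, and `coverage` is an illustrative function, not part of Chinese Text Analyser.

```lua
-- Hypothetical segmented text and known-word set.
local tokens = { "我", "喜欢", "书", "我", "喜欢", "哲学", "我", "读", "书" }
local known = { ["我"] = true, ["喜欢"] = true, ["书"] = true }

-- Compute percent known over all tokens and over unique words.
local function coverage(words, knownSet)
  local total, knownTotal = 0, 0
  local unique, knownUnique = 0, 0
  local seen = {}
  for _, w in ipairs(words) do
    total = total + 1
    if knownSet[w] then knownTotal = knownTotal + 1 end
    if not seen[w] then  -- first time this word appears
      seen[w] = true
      unique = unique + 1
      if knownSet[w] then knownUnique = knownUnique + 1 end
    end
  end
  return {
    total = total,
    unique = unique,
    pctKnownTotal = total > 0 and 100 * knownTotal / total or 0,
    pctKnownUnique = unique > 0 and 100 * knownUnique / unique or 0,
  }
end

local r = coverage(tokens, known)
print(string.format("total %d, known %.1f%%; unique %d, known %.1f%%",
                    r.total, r.pctKnownTotal, r.unique, r.pctKnownUnique))
```

In this toy text 7 of 9 running words are known but only 3 of 5 unique words are, which is the same gap described above: frequent known words inflate the total figure while the unique figure shows how many distinct words would still need a dictionary.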


  • 1 month later...
On 2/20/2022 at 7:14 PM, imron said:

remind me in a week

Apologies for the late reply, I tried to use a workaround, had an almost working AutoHotKey script, then Windows died irreparably, so here I am (this time on Linux) without a good solution for this. Could you please help me get started?


Amazing stuff, thank you so much! I recommend you include it with the program, this is immensely useful.

 

Previously I had an AHK script that went through my Calibre book catalog, opened the books one by one in CTA, waited for the document to load by checking the status bar, then clicked on the stats one by one, copied and saved them, went back to Calibre, opened the custom metadata view, and pasted the stats one by one. Not very fast, as you can imagine. After a lot of refinement I could get my script to run at around 10 books per minute (depending on file size).

 

With your Lua script I just processed 161 books in 11 seconds (almost 100 times faster than my method). Now I just have to find a way to incorporate the info back into Calibre.
