imron Posted March 31, 2022 at 06:48 PM

On 3/31/2022 at 3:32 PM, yaokong said: not very fast as you can imagine

Yes, that would be a very slow way of doing things!
imron Posted March 31, 2022 at 06:50 PM

On 3/31/2022 at 3:32 PM, yaokong said: With your LUA script I just processed 161 books in 11 seconds

CTA was built with this sort of use case in mind - not the Lua scripting per se, but analysing a bunch of books and figuring out which one is going to be the most suitable to read. It's the sort of thing I'd like to incorporate into the program itself, but I added Lua scripting as a stopgap because I don't have as much time as I'd like to work on CTA, and scripting provides a relatively easy way to extend the program's functionality without waiting for a new release.
yaokong Posted April 9, 2022 at 12:22 PM

@imron, I found a bug in this script: it seems to skip filenames and folder names containing Chinese characters. Is that expected? I assume not, since we are talking about scripting CTA, a program literally built for processing Chinese text.

The folder I am testing on is "_YiXi - TED Talks of China" from Chinese Transcripts. The script processes 6 files out of 647 (results attached), and none of those 6 have Chinese characters in the filename. The ones with Chinese characters are skipped entirely.

I just tried again and it failed even for some texts with English filenames; this time it was the folder "华灯初上\Season 1\Plain Text - ZH Simplified", also from Chinese Transcripts. The result is empty - it contains only the header line. If I move those files to a folder whose path contains no Chinese characters, the files are processed just fine.

If it is too much work to fix, don't worry about it; I will find a temporary workaround, like renaming all the files in bulk.

_YiXi - TED Talks of China_knownWordsLUA.txt
华灯初上--Season 1--Plain Text - ZH Simplified_knownWordsLUA.txt
pon00050 Posted April 9, 2022 at 02:01 PM

These days I am generating transcripts for the videos on this YouTube channel: https://www.youtube.com/c/Lindsay说

With the transcripts, I try to identify idiomatic expressions such as chengyu and make Anki cards for them, using the Lua script attached below. At the moment, when I run the script, only the cards whose entries do not already exist in my Anki collection get imported. That is to say, if a card for 一知半解 already exists and the newly generated cards also contain one for 一知半解, the new card doesn't get imported. As a result, I am currently restricted to one example sentence per entry. Not bad for now, but I think multiple example sentences could be useful, and I noticed that some chengyu are used in multiple videos.

After I am done generating transcripts for almost all the videos on that channel, I am thinking about compiling the texts and extracting example sentences across them. Is it possible to write a Lua script that checks all the texts and extracts every sentence that uses a certain expression?

Attached is the Lua script I currently use. It generates only one example sentence per entry (the word that I highlighted as unknown).

anki-export.lua
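Something along these lines should be doable with plain Lua string handling, without relying on CTA's own API. A minimal sketch, assuming each text is already loaded as a UTF-8 string and treating 。！？ as sentence delimiters (the function names and delimiter set are illustrative, not from the attached script). Note that Lua patterns are byte-based, so a character class of multi-byte punctuation would misbehave on UTF-8; normalising the delimiters to newlines first sidesteps that:

```lua
-- Split UTF-8 text into sentences by appending a newline after each
-- full-width sentence-ending mark, then iterating over the lines.
local function split_sentences(text)
  local normalized = text:gsub("。", "。\n"):gsub("！", "！\n"):gsub("？", "？\n")
  local sentences = {}
  for sentence in normalized:gmatch("[^\n]+") do
    sentences[#sentences + 1] = sentence
  end
  return sentences
end

-- Collect every sentence in `text` that contains `target`.
local function sentences_with(text, target)
  local hits = {}
  for _, sentence in ipairs(split_sentences(text)) do
    -- plain find (4th argument true) so `target` is not treated as a pattern
    if sentence:find(target, 1, true) then
      hits[#hits + 1] = sentence
    end
  end
  return hits
end
```

Running this over each transcript and collecting the hits per expression would give multiple example sentences per entry; how to merge them into the cards produced by the attached anki-export.lua would depend on how that script builds its output.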
imron Posted April 10, 2022 at 04:18 AM

On 4/9/2022 at 12:22 PM, yaokong said: is that expected?

It's unexpected. The script just gets all files in all directories; I'm not sure why files and directories with Chinese characters are being excluded. I don't have time to look into it at the moment, so maybe go the workaround route for now.

Edit: using filenames with Chinese characters works for me on macOS.
yaokong Posted April 10, 2022 at 08:06 AM

On 4/10/2022 at 12:18 PM, imron said: using filenames with chinese characters works for me on macos.

You gave me an idea: I just tested on my Linux machine, and there it works just fine. I was using my wife's Windows laptop yesterday; the bug only occurred there. Maybe I need to change the Windows system locale (the language for non-Unicode programs) to Chinese.
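That would fit the symptoms: on Windows, Lua's io functions pass the path bytes through the C runtime's ANSI codepage, so UTF-8 paths containing Chinese characters can fail to open unless the system locale (or the newer "Use Unicode UTF-8 for worldwide language support" option) covers them. A hedged diagnostic, runnable in any Lua 5.x, to check whether the runtime can open a file with a Chinese name at all - the filename here is just an example:

```lua
-- Diagnostic sketch: can this Lua runtime open a file whose name
-- contains Chinese characters? On Linux/macOS with UTF-8 filesystems
-- this should succeed; on Windows with a non-Chinese ANSI codepage it
-- is likely to fail, which would explain the skipped files.
local function can_open(path)
  local f = io.open(path, "r")
  if f then
    f:close()
    return true
  end
  return false
end

-- Create a throwaway file with a Chinese name and try to reopen it.
local name = os.tmpname() .. "_华灯初上.txt"
local f = assert(io.open(name, "w"))
f:write("test")
f:close()
print(can_open(name))
os.remove(name)
```

If this prints false on the Windows laptop but true on Linux, the problem is the codepage rather than the script itself.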