dougwar Posted August 27, 2020 at 01:08 AM

You mean, i+1 sentences?
imron Posted August 27, 2020 at 03:29 AM

6 hours ago, mungouk said:
Generally, it would also be very useful to know if there's a "safe" way of converting/importing a PDF, epub etc. to get it into CTA

The safest way is to open it in another application, copy it to the clipboard, and then paste it into CTA. The one-word-per-line importing is for importing wordlists.
icebear Posted August 27, 2020 at 07:03 PM

Bug: if a multi-character word breaks across two lines, a mouse highlight on the fragment in the second line plus a hotkey (space to mark known/unknown) does nothing. Mouseover of the fragment on the first line + hotkey works as expected.

Additionally: with a 4-character word (错综复杂), using the 'd' hotkey while mousing over the second-line fragment reports no definition available. Mouseover of the first-line fragment + hotkey shows the correct definition.
imron Posted August 28, 2020 at 03:06 AM Author Report Posted August 28, 2020 at 03:06 AM This is known bug on my list of things todo. 1 Quote
roddy Posted August 28, 2020 at 04:36 PM

Quick question, possibly already asked: if simplified is fed in and traditional is exported, I'm assuming that's info from CEDICT, and as such has been looked at by a human rather than automated?
imron Posted August 28, 2020 at 07:06 PM

It's info from CEDICT, and maps to the CEDICT entry. You might want to test exporting a single character like 发, which has multiple traditional characters, to see what it exports when there are multiple entries. From memory, you should get one exported entry for each different character.
icebear Posted September 6, 2020 at 09:00 PM

On 8/13/2020 at 2:39 PM, mungouk said:
I think what I'm missing is some good descriptions of use-cases and tutorials to show what it's capable of, and how I could be using it. Are there any examples out there already on, say, youtube? If not, do any of you power-users feel like explaining how you use it to do things you couldn't do with other tools?

My goal is to understand which words I'm running into frequently across many texts, even if they only appear a few times in any single text. I primarily read short to medium length Chinese articles (500-5,000 characters) on geopolitics and economics, both for personal interest and work, so a personal corpus is very useful; if I were mostly reading long books it might not be as important a step.

Reading:
- Find an article of interest.
- Copy the text into CTA to read, and also into a personal corpus.*
- In CTA: mark the entire document known, then read through and use mouseover + hotkey "d" to show the definition of any word I'm not certain of. I've requested an automatic "read along" feature that would reduce the need for a mouse during this part and help train faster reading (following along), but even as is this is pretty good.
- Find a new article of interest.

Once weekly:
- Load the entire corpus of Chinese articles into CTA.
- Review the top ~x (usually 30-50) words by frequency within the corpus, and mark as known any I have high confidence about.
- Copy the remaining most frequent unknown words into Pleco flashcards.

*The corpus: for a long time I used Evernote Web Clipper to save articles to a "Chinese" folder, then used Evernote desktop once weekly to export that folder as a single text file. More recently I'm using a Firefox add-on called "MarkDownload" that saves the body of a web page as a nearly plain-text .md file, which I can then merge into a single file later. Whatever the method, you need a single file with all read articles in one place to easily copy into CTA.
You can reset this corpus occasionally, although I'm of a few minds on whether that's useful. On the one hand, I don't want to be studying words from very old-to-me texts that suddenly emerge as top frequency; on the other, my list of known and unknown words should be as current as the most recent article I read (since I mark all words known, then show definitions for / mark unknown those I have trouble with while going through). In principle this means that in my weekly review I should only see words I recently had trouble reading. But either point of view can be argued, and I do vacillate between them. I've also requested a CTA feature that would mimic this corpus process (keeping a local word list with a count of occurrences across all documents loaded in the past), but again this is a pretty easy workflow to maintain.
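The merge step above (many saved .md/.txt articles into one corpus file) is easy to script. This is a minimal Python sketch, not part of CTA itself; the function names and the folder layout are my own assumptions:

```python
from pathlib import Path

def merge_articles(texts):
    """Join individual article texts into one corpus string,
    separated by blank lines so article boundaries stay visible."""
    return "\n\n".join(t.strip() for t in texts if t.strip())

def build_corpus(source_dir, out_file, patterns=("*.md", "*.txt")):
    """Read every saved article under source_dir and write a single
    corpus file that can be loaded or pasted into CTA.
    Returns the number of articles merged."""
    src = Path(source_dir)
    paths = sorted({p for pat in patterns for p in src.glob(pat)})
    corpus = merge_articles(p.read_text(encoding="utf-8") for p in paths)
    Path(out_file).write_text(corpus, encoding="utf-8")
    return len(paths)
```

Run it weekly over the folder your web clipper saves into, then load the resulting file into CTA.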
roddy Posted September 22, 2020 at 08:50 AM

I wonder if overall frequency info would be useful - not within the text in question, but in whatever corpora you take the information from. That'd help judgement calls on what's worth paying attention to - "OK, it only appears twice in this text, but it looks to be relatively common in the language, so..." or "Comes up a lot here, but looks pretty obscure in general".
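The judgement call roddy describes could be sketched outside CTA by pairing each word's in-text count with its rank in a general frequency list. This is only an illustration in Python; `general_ranks` is a hypothetical dict (word -> rank, lower = more common) you would build from a frequency list such as SUBTLEX-CH:

```python
from collections import Counter

def annotate_with_general_rank(words_in_text, general_ranks):
    """Pair each word's in-text count with its rank in a general
    frequency list. Words missing from the general list get rank
    None - a hint that they are probably obscure in the language."""
    counts = Counter(words_in_text)
    return [(w, c, general_ranks.get(w)) for w, c in counts.most_common()]
```

A word with a low in-text count but a low (common) general rank is probably worth learning; a high in-text count with rank None suggests a text-specific term.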
Jan Finster Posted November 6, 2020 at 08:06 PM

On 5/25/2014 at 2:14 PM, fabiothebest said:
I read that the program allows you to "Export word lists of known or unknown words for use in SRS or other programs". I'd be interested in creating wordlists for Pleco. Is it easy to do?

@Imron: I cannot figure out how to export my list of known words (under the menu "word lists") at all. When I open menu --> export, all 3 options are greyed out (unclickable). I am talking about my reference list of known words (under the menu "word lists"), not a word list generated after CTA analyses a random text. I would like to review my list of known words to see HSK levels, character counts etc. When I go to menu "word lists" --> my word list --> edit, I can see all the words in the small window and select them, but not copy them (to paste them elsewhere). (?)
imron Posted November 7, 2020 at 11:21 PM

There is currently no easy way from within the app itself to export all your known words (the export function works on the current document). However, there is an easy way to get at the list of known words, because they are all cached in files in the following directories:

Windows: C:\Users\[username]\AppData\Local\ChineseTextAnalyser\wordlists\cache
macOS: ~/Library/Application Support/ChineseTextAnalyser/wordlists/cache/
Linux: ~/.local/share/ChineseTextAnalyser/wordlists/cache

That directory will contain one file per wordlist, with each file containing one word per line. Note: don't make changes to these files, as this is not how CTA stores the wordlists internally. If CTA detects that the cached files have been changed, it will just overwrite them from its internal copy of the wordlist.
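Since the cache format is simply one word per line, reading it from a script is trivial. A minimal read-only Python sketch (the function names are mine, and per imron's note the cache files should never be written to):

```python
from pathlib import Path

def parse_wordlist(text):
    """Parse a one-word-per-line wordlist dump, skipping blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def load_known_words(cache_file):
    """Read a CTA wordlist cache file read-only. Never edit these
    files: CTA overwrites them from its internal copy on change."""
    return parse_wordlist(Path(cache_file).read_text(encoding="utf-8"))
```

From there the set of words can be exported to Pleco, cross-referenced with HSK lists, and so on.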
Borkie Posted November 27, 2020 at 09:09 AM

On 9/7/2020 at 7:00 AM, icebear said:
I primarily read short to medium length Chinese articles (500-5,000 characters) on geopolitics and economics, both for personal interest and work, so a personal corpus is very useful

Would you be willing to share this corpus? I'm studying international relations and it'd be great to run it through CTA and find some geopolitics-related vocab to learn.
laurenth Posted December 23, 2020 at 02:06 PM (edited)

Hello, is there a way I can force CTA to work on a character-by-character basis (no parsing) for classical Chinese?

[Edit] Oops, my question has an answer, sort of, on p. 32. Sorry.

Edited December 23, 2020 at 02:14 PM by laurenth
imron Posted December 26, 2020 at 01:14 AM

There's currently no way to do this.
tiantian Posted January 21, 2021 at 09:51 AM

Hi imron, is there a way to open a pre-segmented txt file (i.e. a book neatly segmented with Stanford CoreNLP) and have Chinese Text Analyser use that segmentation (skipping its own segmentation pass)?
imron Posted January 21, 2021 at 08:48 PM

Not currently, unfortunately.
jannesan Posted January 22, 2021 at 10:26 PM

On 1/21/2021 at 10:51 AM, tiantian said:
is there a way to open a pre-segmented txt file (i.e. a book neatly segmented with Stanford CoreNLP) and have Chinese Text Analyser use that segmentation (skipping its own segmentation pass)?

If you managed to use that library to segment the text yourself, I bet you can also write a script that does the analysis of the segmented text. :) I guess this is even possible with Excel, but I don't have a clue about that. In case you can't figure it out, send me a message and I could write something basic for you.
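The kind of script jannesan suggests is short: once the text is space-segmented, a frequency analysis is just splitting on whitespace and counting. A minimal Python sketch (the word-coverage idea mirrors what CTA reports, but the function and its exact numbers are my own illustration):

```python
import re
from collections import Counter

def analyse_segmented(text, known_words):
    """Frequency analysis of pre-segmented text (words separated by
    spaces). Keeps only tokens containing at least one CJK character,
    so punctuation and Latin text are ignored."""
    words = [w for w in re.split(r"\s+", text)
             if re.search(r"[\u4e00-\u9fff]", w)]
    freq = Counter(words)
    unknown = {w: c for w, c in freq.items() if w not in known_words}
    # Fraction of word tokens in the text that you already know.
    coverage = 1 - sum(unknown.values()) / len(words) if words else 0.0
    return freq, unknown, coverage
```

Feed it the Stanford CoreNLP output and your known-word list, and you get per-word counts plus a rough comprehension percentage for the text.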
Guest realmayo Posted January 27, 2021 at 07:42 AM

On 1/21/2021 at 9:51 AM, tiantian said:
is there a way to open a pre-segmented txt file (i.e. a book neatly segmented with Stanford CoreNLP) and have Chinese Text Analyser use that segmentation (skipping its own segmentation pass)?

Surely if the text has already been segmented - with spaces between words - then CTA will respect those words as words, except where it doesn't have the words in its database (and therefore breaks them down into individual characters)?
tiantian Posted January 27, 2021 at 09:25 AM

Quote: Surely if the text has already been segmented - with spaces between words - then CTA will respect those words as words

Oh, I'll have to try this then!
tiantian Posted January 27, 2021 at 10:14 AM

Something different: based on @imron's provided Lua scripts, I wrote an "example sentence extractor" script that does the following:

- It asks for a corpus txt file of Chinese text.
- It asks for a wordlist txt file of your unknown words (one word per line).
- It outputs a list of example sentences for each unknown word, with the unknown word marked in brackets. Only sentences in which you know 80% of the words are selected (the number is adjustable in the script). This keeps the example sentences easy enough to concentrate on the unknown word.

I am an advocate of learning new words with example sentences, and this is just an easy way to create some for the words you want to learn, consisting mostly of your known vocabulary. I use a corpus of around 1,000 books.

I am no coder, so my question is: is there a way to make this script faster? Obviously, the bigger the corpus, the more sentences you get, but it can take hours to run. Too many loops, I guess. Currently it loops through the list of unknown words, then through all the sentences that CTA found, then through all the words of each sentence.

I share the script here because I find it useful, and perhaps somebody has an idea to make it faster (I kind of just pasted stuff together from imron's scripts, so most credit goes to him).

P.S.: @imron Do you still plan to integrate a corpus feature of this sort natively?

examplesentences.lua
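The slowness likely comes from the nesting tiantian describes: unknown words x sentences x words per sentence. The usual fix is to invert the loops - make one pass over the sentences, check each sentence's words against a set of unknown words, and collect matches as you go. The actual script is Lua; this is only a Python sketch of the idea, with hypothetical names:

```python
from collections import defaultdict

def index_example_sentences(sentences, unknown_words, known_words, threshold=0.8):
    """One pass over the corpus: map each unknown word to the sentences
    containing it in which at least `threshold` of the words are known.
    Each sentence is a list of segmented words."""
    unknown_words = set(unknown_words)  # set membership is O(1),
    known_words = set(known_words)      # unlike scanning a list
    examples = defaultdict(list)
    for words in sentences:
        if not words:
            continue
        known_ratio = sum(w in known_words for w in words) / len(words)
        if known_ratio < threshold:
            continue  # sentence too hard, skip before any word lookup
        for w in set(words):
            if w in unknown_words:
                examples[w].append(words)
    return examples
```

This visits each sentence once instead of once per unknown word, so the runtime grows with corpus size alone rather than corpus size times wordlist size; the same restructuring should apply inside the Lua script.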
imron Posted January 27, 2021 at 07:03 PM

8 hours ago, tiantian said:
Do you still plan to integrate a corpus feature of this sort natively?

Yes, but I haven't had the time to work on it. I'll take a look at your script a little later.