Introducing Chinese Text Analyser

July 30, 2020 at 12:25 PM

I'll add "Reader Mode" to my todo list.

July 30, 2020 at 04:38 PM

Look forward to all of these when available. Great app, highly recommended to others!

August 8, 2020 at 01:43 AM

Request: allow me to configure highlight colors, especially in dark mode. Blue text on a black background is hard on the eyes!

August 8, 2020 at 01:56 AM

You can do this, but it involves manually editing a config file. If you're on Windows, this will be located in:

c:\users\<username>\AppData\Local\ChineseTextAnalyser\colour-schemes\default.colours

(there is a similar file on other OSes so let me know if you're on a different OS).

You can edit that file with any text editor, and the values are hex-colours with the # removed.

August 9, 2020 at 12:11 AM

I'm trying to analyse the 普通话水平测试 (Putonghua Proficiency Test) to work out how to go about learning all the 字 (characters). I want to see whether, if I spend my time learning all the 字 in the 60篇文章 (the 60 essays), I will then know most of the 字 that appear on the test. However, CTA lists 词 (words) on the side and only says how many unique characters there are, without letting me do any analysis based on the characters. Is there any way to work this out?

《普通话水平测试用普通话词语表》.doc 普通话水平测试文章60篇.doc

August 9, 2020 at 01:20 AM

Is CTA meant to analyze things on the character level like that?

In any case, counting the characters in those essays only took a few lines of Python, if that's helpful to you. See the attached csv file.

char_count.csv
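For anyone wondering what "a few lines of Python" might look like, here is a minimal sketch. It is not the script that produced the attached file (that code was not posted). It assumes the essays have already been saved as plain UTF-8 text rather than .doc, and it treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese characters. The function names are made up for illustration.

```python
import csv
from collections import Counter


def is_chinese(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def count_chars(text: str) -> Counter:
    """Count how often each Chinese character occurs; punctuation and Latin text are skipped."""
    return Counter(ch for ch in text if is_chinese(ch))


def write_counts(counts: Counter, path: str) -> None:
    """Write character,count rows to a CSV file, most frequent first."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["character", "count"])
        writer.writerows(counts.most_common())


# Tiny stand-in for the essay text.
sample = "我爱学习，学习使我快乐。"
counts = count_chars(sample)
print(len(counts), "unique characters,", sum(counts.values()), "in total")
```

Calling write_counts(counts, "char_count.csv") would produce a file in the same spirit as the attachment, although its exact columns are a guess.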

August 9, 2020 at 01:53 AM

30 minutes ago, 大块头 said:

Is CTA meant to analyze things on the character level like that?

In any case, counting the characters in those essays only took a few lines of Python, if that's helpful to you. See the attached csv file.

Thank you. Coding is definitely a language I wish I was more interested in. So useful.

But yes, that's kind of what I'm looking for, though not just the raw frequency of the characters. I'm looking to determine what % of the characters in the word list would be covered if I learned all of the characters that show up in the 60 essays, and vice versa (what percentage of the characters in the essays would be covered if I learned the list of words).

Would it be doable in Python to check this?

August 9, 2020 at 02:24 AM

The essays contain 2293 unique characters. The word list contains 1668 unique characters. The intersection of these two sets contains 1307 characters.

I won't share my code just in case there is some way to make CTA do this. My intention isn't to cobble together some 山寨 (knock-off) version of one of its functions...
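The figures above are plain set arithmetic, so they can be turned into the percentages asked for earlier. The sketch below is not the code behind those numbers (which was not shared); the function names are hypothetical and it again assumes plain-text input and the basic CJK block only. Using the counts from the post, about 78.4% of the word-list characters also appear in the essays, about 57.0% of the essay characters appear in the word list, and 361 word-list characters are missing from the essays.

```python
from __future__ import annotations


def unique_chinese(text: str) -> set[str]:
    """Set of distinct characters in the basic CJK block (an assumption about what counts)."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def overlap_report(essay_chars: set[str], list_chars: set[str]) -> dict[str, float]:
    """Compare two non-empty character sets by unique characters, not by frequency."""
    shared = essay_chars & list_chars
    return {
        "shared": len(shared),
        "list_only": len(list_chars - essay_chars),
        "pct_of_list": 100 * len(shared) / len(list_chars),
        "pct_of_essays": 100 * len(shared) / len(essay_chars),
    }


# The three counts reported in the post above.
essays, word_list, shared = 2293, 1668, 1307
print(f"{100 * shared / word_list:.1f}% of word-list characters appear in the essays")
print(f"{100 * shared / essays:.1f}% of essay characters appear in the word list")
print(f"{word_list - shared} word-list characters are not in the essays")
```

With the real texts loaded as strings, overlap_report(unique_chinese(essay_text), unique_chinese(list_text)) would give the same kind of breakdown.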

August 9, 2020 at 02:33 AM

That's great info. Then I'm going to work on learning the 60 essays since that is more fun than the list and then learn the remaining 300+ characters after that. Might take a couple years, though.

Add this to my list of function requests for CTA @imron

August 9, 2020 at 02:51 AM

17 minutes ago, 艾墨本 said:

I'm going to work on learning the 60 essays

Sounds like a great use case for CTA!

August 9, 2020 at 03:33 AM

@艾墨本

Are you going to one day take the Putonghua test that mainland Chinese people take?

August 9, 2020 at 06:37 AM

3 hours ago, LinZhenPu said:

Are you going to one day take the Putonghua test that mainland Chinese people take?

That's my goal. I started working through it last year and got sidetracked by COVID. Four of the essays down, 56 to go. But I'm also focusing on quality over quantity (though quantity will be needed eventually), making sure I can properly recite each line in a "storytelling" fashion. Even after learning just four of them with my tutor (shout-out to @GoEastMandarin) I saw an enormous amount of growth.

CTA helps me determine which words to focus on.

August 10, 2020 at 12:46 AM

22 hours ago, 大块头 said:

I won't share my code just in case there is some way to make CTA do this. My intention isn't to cobble together some 山寨 version of one of its functions...

Thanks for the consideration, but I generally follow a philosophy that more is better than less, so regardless of whether or not CTA can do this, please feel free to share source code or tools that other people might find useful (but maybe start a new thread, to keep this one just about CTA).

That being said, CTA intentionally focuses on 词 rather than 字 and doesn't have this feature built in. I've considered adding it, but am still in two minds about it.

However, what CTA does have is Lua scripting support, and in that sense you can make CTA do whatever you want. For example, here is a script that counts the number of unknown characters in a document. It would be trivial to modify that script to count all characters: just change line 47 from this

        if charType == "Chinese" and knownChars[char] == nil then

to this

        if charType == "Chinese" then

And with a bit of effort, it could also be made to calculate the % coverage of a document with a given word list. In fact, there is already a script that ships with CTA (char-coverage.lua) that does this for HSK6 coverage of a given document.
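Outside CTA, the same coverage idea can be sketched in a few lines of Python. This is not char-coverage.lua and does not use CTA's Lua API; the function name and the sample text are invented for illustration. It counts every occurrence of a character (so frequent characters weigh more), which may or may not match how the shipped script defines coverage.

```python
from __future__ import annotations

from collections import Counter


def coverage(document: str, known_chars: set[str]) -> float:
    """Percentage of Chinese character occurrences in document that are in known_chars.

    Only the basic CJK block (U+4E00 to U+9FFF) is counted, which is an assumption.
    """
    counts = Counter(ch for ch in document if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no Chinese characters at all
    known = sum(n for ch, n in counts.items() if ch in known_chars)
    return 100 * known / total


doc = "我爱学习，学习使我快乐。"
# 我, 学 and 习 each occur twice, so 6 of the 10 character occurrences are known.
print(f"{coverage(doc, set('我学习')):.0f}% of the text is covered")
```

Swapping known_chars for the characters of a word list gives the "coverage of a document with a given word list" described above.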

@大块头, if you don't want to tread on CTA's toes, feel free to make Lua script versions of any scripts and post them in that other thread :mrgreen:

August 10, 2020 at 01:57 AM

1 hour ago, imron said:

@大块头, if you don't want to tread on CTA's toes, feel free to make Lua script versions of any scripts and post them in that other thread

生活苦短，我用Python。 (Life is short, I use Python.) :wink:

August 10, 2020 at 01:50 PM

On 8/10/2020 at 2:57 AM, 大块头 said:

生活苦短，我用Python。

生活苦短，我用bash (Life is short, I use bash):

echo 'Unknown char count:'

comm -13 sortedknowncharlist <(cat file | sed 's/\(.\)/\1\n/g' | sort | uniq) | wc -l

August 10, 2020 at 04:12 PM

For every 100 of us snot-nosed brats scribbling on whiteboards and typing at our fancy-schmancy IDEs, there is some UNIX wizard who is sipping coffee and browsing Usenet because they've already solved the problem with a bash one-liner. :mrgreen:

August 12, 2020 at 04:04 PM

On 8/7/2020 at 5:56 PM, imron said:

You can do this, but it involves manually editing a config file.

Thanks, worked like a charm!

August 13, 2020 at 07:04 AM

On 8/10/2020 at 2:50 PM, philwhite said:

comm -13 sortedknowncharlist <(cat file | sed 's/\(.\)/\1\n/g' | sort | uniq) | wc -l

TMTOWTDI:

comm -13 sortedknowncharlist <(sed 's/./&\n/g' file | sort -u) | wc -l

August 13, 2020 at 10:27 PM

On 8/12/2020 at 4:04 PM, icebear said:

Thanks, worked like a charm!

What colours did you end up using?

15 hours ago, philwhite said:

comm -13 sortedknowncharlist <(sed 's/./&\n/g' file | sort -u) | wc -l

Stray cats are a continual problem with Unix one-liners.

August 13, 2020 at 10:39 PM

I bought CTA a while back, and to be honest have only used it once or twice, to analyse HSK levels of ebooks or similar.

I think what I'm missing is some good descriptions of use-cases and tutorials to show what it's capable of, and how I could be using it.

Are there any examples out there already on, say, YouTube?

If not, do any of you power-users feel like explaining how you use it to do things you couldn't do with other tools?

I guess I'm not the only one who could benefit from your collective wisdom.

Cheers!


Posts above, in order, are by: imron, icebear, icebear, imron, 艾墨本, 大块头, 艾墨本, 大块头, 艾墨本, 大块头, LinZhenPu, 艾墨本, imron, 大块头, philwhite, 大块头, icebear, philwhite, imron, mungouk.
