Chinese-Forums

Introducing Chinese Text Analyser



  • 2 weeks later...
Posted

Request: allow me to configure highlight colors, especially in dark mode. Blue text on a black background is hard on the eyes!

Posted

You can do this, but it involves manually editing a config file.  If you're on Windows, this will be located in:

 

c:\users\<username>\AppData\Local\ChineseTextAnalyser\colour-schemes\default.colours

 

(There is a similar file on other OSes, so let me know if you're on a different OS.)

 

You can edit that file with any text editor, and the values are hex-colours with the # removed.
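As a small illustration of the "# removed" convention, here is a minimal Python sketch that turns a colour copied from a colour picker (e.g. "#ffd700") into the bare form. The function name is made up for this example, and the assumption that values are six-digit RRGGBB is mine; the layout of the .colours file itself is not shown here, so the sketch only prepares the value.

```python
import re


def to_scheme_value(colour: str) -> str:
    """Convert a hex colour such as '#ffd700' to the bare form 'ffd700'.

    Assumes six-digit RRGGBB values; raises ValueError for anything else.
    """
    value = colour.strip().lstrip("#")
    if not re.fullmatch(r"[0-9A-Fa-f]{6}", value):
        raise ValueError(f"not a 6-digit hex colour: {colour!r}")
    return value


if __name__ == "__main__":
    print(to_scheme_value("#ffd700"))
```

Back up the original file before editing so you can restore the default colours if something goes wrong.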

  • Like 1
Posted

I'm trying to do an analysis of the 普通话水平测试 to determine how to go about learning all the 字. I want to see whether, if I spend my time learning all the 字 in the 60篇文章, I will then know most of the 字 that appear on the test. However, CTA works with 词 in the side panel and only says how many unique characters there are, without allowing me to do analysis based on the characters. Is there any way to work this out?

《普通话水平测试用普通话词语表》.doc 普通话水平测试文章60篇.doc

  • Like 1
  • Good question! 1
Posted

Is CTA meant to analyze things on the character level like that?

 

In any case, counting the characters in those essays only took a few lines of Python, if that's helpful to you. See the attached csv file.

char_count.csv
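For anyone curious what "a few lines of Python" might look like, here is a minimal sketch (not the code used to produce the attached file). It assumes the essays have already been exported from .doc to plain text, and it treats only the main CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese characters; the function names and the output file name are made up for this example.

```python
import csv
from collections import Counter


def is_chinese(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def count_chars(text: str) -> Counter:
    """Count how often each Chinese character occurs, ignoring everything else."""
    return Counter(ch for ch in text if is_chinese(ch))


def write_counts(counts: Counter, path: str) -> None:
    """Write (char, count) rows to a CSV file, most frequent first."""
    # utf-8-sig adds a BOM so spreadsheet programs detect the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["char", "count"])
        writer.writerows(counts.most_common())


if __name__ == "__main__":
    counts = count_chars("我们的生活,我们的学习")
    write_counts(counts, "char_count_example.csv")
```

Rare characters outside that block (extension blocks, compatibility ideographs) would be skipped by this filter, so widen the range if your texts need it.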

  • Like 1
  • Helpful 1
Posted
30 minutes ago, 大块头 said:

Is CTA meant to analyze things on the character level like that?

 

In any case, counting the characters in those essays only took a few lines of Python, if that's helpful to you. See the attached csv file.

Thank you. Coding is definitely a language I wish I was more interested in. So useful.

 

But yes, that's kind of what I'm looking for, though not just the raw frequency of the characters. I'm looking to determine what % of the characters in the word list would be covered if I learned all of the characters that show up in the 60 essays, and vice versa (what percentage of the characters in the essays would be covered if I learned the list of words).

 

Would it be doable in Python to check this?

Posted

The essays contain 2293 unique characters. The word list contains 1668 unique characters. The intersection of these two sets contains 1307 characters.
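Counts like these come down to a couple of set operations. The following is a minimal sketch under my own assumptions, not the code that produced the numbers above: both documents are assumed to be available as plain-text strings, only U+4E00 to U+9FFF counts as a Chinese character, and the function names are made up.

```python
def unique_hanzi(text: str) -> set[str]:
    """Return the set of distinct Chinese characters in the text."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def overlap_report(essays: str, word_list: str) -> dict[str, float]:
    """Compare the distinct characters of two texts."""
    essay_chars = unique_hanzi(essays)
    list_chars = unique_hanzi(word_list)
    shared = essay_chars & list_chars
    return {
        "essay_chars": len(essay_chars),
        "list_chars": len(list_chars),
        "shared": len(shared),
        # Share of the word list's characters that also occur in the essays.
        "list_covered_by_essays": len(shared) / len(list_chars) if list_chars else 0.0,
        # Share of the essays' characters that also occur in the word list.
        "essays_covered_by_list": len(shared) / len(essay_chars) if essay_chars else 0.0,
    }


if __name__ == "__main__":
    print(overlap_report("我爱学习中文", "学习汉语"))
```

Plugging in the figures above: 1307 / 1668 is about 78.4% of the word list's characters covered by the essays, 1307 / 2293 is about 57.0% of the essays' characters covered by the word list, and 1668 - 1307 = 361 word-list characters never appear in the essays. These are counts of distinct characters, so they say nothing about how often each character occurs.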

 

I won't share my code just in case there is some way to make CTA do this. My intention isn't to cobble together some 山寨 version of one of its functions...

  • Helpful 2
Posted

That's great info. Then I'm going to work on learning the 60 essays since that is more fun than the list and then learn the remaining 300+ characters after that. Might take a couple years, though.

 

Add this to my list of function requests for CTA @imron

Posted
17 minutes ago, 艾墨本 said:

I'm going to work on learning the 60 essays

 

Sounds like a great use case for CTA!

  • Like 1
Posted
3 hours ago, LinZhenPu said:

Are you going to one day take the Putonghua test that mainland Chinese people take?

That's my goal. I started working through it last year and got sidetracked by COVID. Four of the essays down, 56 to go. But I'm also focusing on quality over quantity (though quantity will be needed eventually), making sure I can properly recite each line in a "storytelling" fashion. Even after learning just four of them with my tutor (shout-out to @GoEastMandarin), I saw an enormous amount of growth.

 

CTA helps me determine which words to focus on.

 

Posted
22 hours ago, 大块头 said:

I won't share my code just in case there is some way to make CTA do this. My intention isn't to cobble together some 山寨 version of one of its functions...

Thanks for the consideration, but I generally follow a philosophy that more is better than less, so regardless of whether or not CTA can do this, please feel free to share source code or tools that other people might find useful (but maybe start a new thread, to keep this one just about CTA).

 

That being said, CTA intentionally focuses on 词 rather than 字 and doesn't have this feature built in.  I've considered adding it, but am still in two minds about it.

 

However, what CTA does have is Lua scripting support, and in that sense you can make CTA do whatever you want.  For example, here is a script that counts the number of unknown characters in a document.  It would be trivial to modify that script to count all characters: just change line 47 from this

 

        if charType == "Chinese" and knownChars[char] == nil then

 

to this

 

        if charType == "Chinese" then

 

And with a bit of effort, it could also be made to calculate the % coverage of a document with a given word list - in fact there is already a script that ships with CTA (char-coverage.lua) that does this for HSK6 coverage of a given document.
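To make the idea concrete outside CTA, here is a minimal Python sketch of the same logic: skip anything that isn't a Chinese character, tally the ones missing from a known set, and report coverage by occurrence. It does not use CTA's Lua API and is not a port of char-coverage.lua; the function name, the U+4E00 to U+9FFF test for "Chinese", and the way the known set is supplied are all assumptions for illustration.

```python
from collections import Counter


def coverage(document: str, known_chars: set[str]) -> tuple[float, Counter]:
    """Return (fraction of Chinese character occurrences that are known,
    Counter of the unknown characters and how often each occurs)."""
    total = 0
    unknown = Counter()
    for ch in document:
        # Equivalent in spirit to the charType == "Chinese" check.
        if not ("\u4e00" <= ch <= "\u9fff"):
            continue
        total += 1
        # Equivalent in spirit to the knownChars[char] == nil check.
        if ch not in known_chars:
            unknown[ch] += 1
    covered = total - sum(unknown.values())
    return (covered / total if total else 0.0), unknown


if __name__ == "__main__":
    fraction, unknown = coverage("我爱中文,我爱学习", set("我爱学习"))
    print(f"{fraction:.0%} covered, unknown: {dict(unknown)}")
```

Because this counts every occurrence, a frequent character weighs more than a rare one, which is usually what you want when asking how much of a text you can read. Passing an empty known set gives the "count all characters" variant.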

 

@大块头, if you don't want to tread on CTA's toes, feel free to make Lua script versions of any scripts and post them in that other thread :mrgreen:

Posted
1 hour ago, imron said:

@大块头, if you don't want to tread on CTA's toes, feel free to make Lua script versions of any scripts and post them in that other thread :mrgreen:

 

生活苦短,我用Python ("Life is short, I use Python"). :wink:

Posted
On 8/10/2020 at 2:57 AM, 大块头 said:

生活苦短,我用Python。

 

生活苦短,我用bash ("Life is short, I use bash"):

echo 'Unknown char count:'

comm -13 sortedknowncharlist <(cat file | sed 's/\(.\)/\1\n/g' | sort | uniq) | wc -l

  • Like 2
Posted

For every 100 of us snot-nosed brats scribbling on whiteboards and typing at our fancy-schmancy IDEs, there is some UNIX wizard who is sipping coffee and browsing Usenet because they've already solved the problem with a bash one-liner. :mrgreen:

  • Like 2
Posted
On 8/7/2020 at 5:56 PM, imron said:

You can do this, but it involves manually editing a config file. 

Thanks, worked like a charm!

  • Like 1
Posted
On 8/10/2020 at 2:50 PM, philwhite said:

comm -13 sortedknowncharlist <(cat file | sed 's/\(.\)/\1\n/g' | sort | uniq) | wc -l

TMTOWTDI:

comm -13 sortedknowncharlist <(sed 's/./&\n/g' file | sort -u) | wc -l

 

  • Like 2
Posted
On 8/12/2020 at 4:04 PM, icebear said:

Thanks, worked like a charm!

What colours did you end up using?

 

15 hours ago, philwhite said:

comm -13 sortedknowncharlist <(sed 's/./&\n/g' file | sort -u) | wc -l

Stray cats are a continual problem with Unix one-liners.

  • Like 1
Posted

I bought CTA a while back, and to be honest have only used it once or twice, to analyse HSK levels of ebooks or similar.

 

I think what I'm missing are some good descriptions of use cases and tutorials to show what it's capable of, and how I could be using it.

 

Are there any examples out there already on, say, YouTube?

If not, do any of you power-users feel like explaining how you use it to do things you couldn't do with other tools?

 

I guess I'm not the only one who could benefit from your collective wisdom.

 

Cheers!

 

 

  • Like 1
  • Good question! 2
