Introducing Chinese Text Analyser

February 7, 2019 at 06:37 AM

I can barely manage to develop CTA. Not sure where I can find the time at the moment to make ETA.

February 7, 2019 at 06:39 AM

OK sure. But $$$$

February 7, 2019 at 10:11 AM

It's potential $$$$ gained versus actual $$$$ lost in opportunity costs. While I'd like to get around to doing something like that, ETA is probably about idea number 10 on my Potential $$$$ list.

April 20, 2019 at 01:55 PM

Hey. I am having an issue with scrolling down in the clipboard on the Mac version. I can scroll down in the panels on the left, but I can't in the clipboard panel. Thanks in advance.

April 21, 2019 at 10:56 AM

Are you able to post a screenshot of what you mean?

Also, which version of macOS are you running?

April 22, 2019 at 01:12 PM

I am using High Sierra (10.13.6). The yellow arrows indicate the parts where I am able to scroll; the red arrow indicates the part where I am unable to scroll.

Thank you.

April 22, 2019 at 02:40 PM

Thanks, I'll look into it.

April 23, 2019 at 11:48 AM

Hey! After installing the new security update, it's working again. Thanks for your prompt reply and sorry for any inconvenience!

April 24, 2019 at 12:14 AM

No worries. Thanks for taking the time to raise the issue.

June 26, 2019 at 07:17 AM

imron, what is the best indicator of the difficulty of a text in CTA, if you've never uploaded a list of your Known Words?

Is it the number of unique words/unique characters in the text? The HSK percentages?

June 26, 2019 at 09:35 AM

Not the HSK percentages. They are there mostly to show how the HSK is not that useful :mrgreen:

The number of unique words is one potential indicator of difficulty, but I'd also look at the number of words it takes to get to 98% comprehension of the text and see how big a proportion of total words that is. I'd also look at what percentage comprehension you get if you learnt every word that appeared more than once.

That gives you an idea of how many words you'd need to know/learn in order to read the book comfortably.
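The two measures described above can be sketched roughly as follows. This is only an illustration, not how CTA computes its statistics: it assumes the text is already segmented into a list of words, that words are learnt most-frequent first, and that "comprehension" simply means the share of running words that are known. The function names and the sample data are made up for the example.

```python
from collections import Counter

def words_needed_for_coverage(words, target=0.98):
    """Count the distinct words, learnt most-frequent first, needed to cover `target` of all tokens."""
    counts = Counter(words)
    total = sum(counts.values())
    covered = 0
    for needed, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return needed
    return len(counts)  # only reached for an empty text

def coverage_if_repeats_known(words):
    """Fraction of tokens covered if every word occurring more than once is known."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for n in counts.values() if n > 1) / total

# Toy, already-segmented text (hypothetical data, not CTA output).
sample = ["我", "喜欢", "看", "书", "我", "喜欢", "学习", "中文", "我", "看", "中文", "书"]
needed = words_needed_for_coverage(sample)
print(needed, "of", len(set(sample)), "unique words needed for 98%")
print(f"{coverage_if_repeats_known(sample):.1%} coverage from repeated words")
```

On a real book the first number would normally be well below the total number of unique words; the closer it gets to that total, the more of the vocabulary you have to learn before the text becomes comfortable.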

June 26, 2019 at 04:08 PM

A thought. Dividing the number of unique words by the total number of words gets you the percentage of unique words in a text. Dividing the other way tells you, for example, that 1 in 5 words is unique. Not sure how closely the density of unique words correlates with difficulty though.
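As a quick worked example of that arithmetic (the first ratio is commonly called the type-token ratio), here is a small sketch. The function name and the numbers are invented for illustration, and it assumes a non-empty, already-segmented word list.

```python
def unique_word_density(words):
    """Return (unique/total, total/unique) for a non-empty list of words."""
    total = len(words)
    unique = len(set(words))
    return unique / total, total / unique

# Hypothetical text: 1000 running words drawn from 200 distinct words.
words = [f"w{i % 200}" for i in range(1000)]
density, one_in = unique_word_density(words)
print(f"{density:.0%} unique, i.e. 1 in {one_in:.0f} words")
```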

June 26, 2019 at 05:07 PM

So when does this integrate with one of those text-to-speech APIs? I want word lists for TV shows.

June 27, 2019 at 12:41 AM

7 hours ago, roddy said:

text-to-speech APIs? I want word lists for TV shows.

Speech to text you mean?

I've actually been toying with writing an application that does this, but for any language, not just Chinese (and by toying I mean I've already written a bunch of code and done test calls with the APIs and got reasonable results back).

Still not sure if I have the time to make it though and if there is any demand for this kind of thing, especially as it would need to be a paid service (because Google/Microsoft charge for each API call).

June 27, 2019 at 12:57 AM

8 hours ago, murrayjames said:

Not sure how closely the density of unique words correlates with difficulty though.

I think you'd also need to look at the frequency of those unique words in the text as a whole. If many of those unique words only appeared once or twice in total, but when combined made up a significant percentage of total words, then that would affect difficulty, because it would mean lots of words you need to put in work to learn, but that don't really lead to increased comprehension for the rest of the text.

Looking at the words it takes to get 98% comprehension (or some other reasonably high percentage) serves as a decent proxy for that.
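To make the point about rarely occurring words concrete, here is a rough sketch that measures what share of the running text comes from words seen only once or twice. It is an illustration under assumptions (pre-segmented, non-empty input; made-up function name and toy data), not something CTA exposes.

```python
from collections import Counter

def rare_word_share(words, max_count=2):
    """Return (share of unique words, share of tokens) made up by words seen at most `max_count` times."""
    counts = Counter(words)  # assumes a non-empty word list
    total = sum(counts.values())
    rare = {w: n for w, n in counts.items() if n <= max_count}
    return len(rare) / len(counts), sum(rare.values()) / total

# Two toy texts with the same density: 11 unique words in 100 tokens each.
text_a = ["的"] * 90 + [f"w{i}" for i in range(10)]                     # rare words appear once
text_b = ["的"] * 80 + [f"w{i}" for i in range(10) for _ in range(2)]  # rare words appear twice

for name, text in (("a", text_a), ("b", text_b)):
    types_share, tokens_share = rare_word_share(text)
    print(name, f"{types_share:.0%} of unique words are rare, covering {tokens_share:.0%} of tokens")
```

Both toy texts have identical unique-word density, yet the rare words account for twice as much of the running text in the second one, which is the kind of difference the density figure alone would hide.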

June 27, 2019 at 09:49 AM

9 hours ago, imron said:

Speech to text you mean?

That sounds more likely.

July 1, 2019 at 02:24 AM

Reinstalling CTA after a hard drive crash. I made a backup of the ChineseTextAnalyser AppData folder before the crash. After reinstalling and running CTA, do I replace the new AppData folder with the old AppData folder to get my known words back?

UPDATE: I did and it worked perfectly. The license copied over too!

July 1, 2019 at 01:13 PM

Happy to have helped :mrgreen:

July 12, 2019 at 07:15 PM

I'm doing the 14 day trial right now. I think it's a useful piece of software and will probably buy it, but I wish the word segmentation was better since it's a core feature of the app. Shouldn't it be possible to segment a txt file with a superior but slower segmentation library, save the segmented version, and have CTA use that?
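For reference, the workflow being asked about might look roughly like this outside of CTA. Everything here is an assumption for illustration: nothing in this thread says CTA can import pre-segmented text, the whitespace-separated file format is hypothetical, and the tiny dictionary with a greedy longest-match segmenter is only a stand-in for a real (slower, better) segmentation library.

```python
import os
import tempfile

# Hypothetical toy dictionary; a real segmenter would use a large lexicon or a model.
DICTIONARY = {"我", "喜欢", "中文", "学习"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def segment(text):
    """Greedy forward maximum matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words

def save_segmented(text, path):
    """Write one line per input line, with words separated by spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for line in text.splitlines():
            f.write(" ".join(segment(line)) + "\n")

def load_segmented(path):
    """Read the saved file back as a list of word lists, without re-segmenting."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

path = os.path.join(tempfile.mkdtemp(), "segmented.txt")
save_segmented("我喜欢学习中文", path)
loaded = load_segmented(path)
print(loaded)
```

The slow step (segmentation) runs once, and any later analysis only has to split on spaces.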

July 13, 2019 at 01:56 AM

5 hours ago, drungood said:

but I wish the word segmentation was better since it's a core feature of the app

Segmentation is something that I've always wanted to improve, and in fact I have worked on implementing a bunch of different segmenters, but the main issue is not having enough time to build something suitable in terms of speed, memory usage and correctness.

As with everything, there are tradeoffs. Most of the problems can be solved; it's just that there's a large amount of work involved for only a minor increase in correctness. So when I have time to work on CTA it usually goes towards other features, because the segmenter is ballpark-level correct, and that is sufficient for what I see as the main features of the app:

1. Finding frequently occurring unknown words in a piece of text.

2. Comparing texts to see the relative difficulties.

Based on tests I've done, and on my own experience, improving the segmenter doesn't significantly improve those two activities.

The current segmenter does mean that CTA is less useful if you want to use it for precise segmentation at a sentence-by-sentence level.

6 hours ago, drungood said:

Shouldn't it be possible to segment a txt file with a superior but slower segmentation library, save the segmented version, and have CTA use that?

Solving for the general case is not that bad; it's the edge cases where things fall down. E.g. what happens if the file is several GBs? Most tools lock up.

CTA on the other hand will open (and highlight text) instantly and let you scroll anywhere through the file (though statistics take a bit longer to generate).

A GB of text might seem a bit extreme, but that's only about 1000 books, which is not unreasonable if generating information for a corpus or similar.
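As a rough check on that figure: 1 GB over 1000 books is about 1 MB per book, and since a common Chinese character takes 3 bytes in UTF-8, that is on the order of 350,000 characters per book. The thread doesn't describe how CTA handles files of this size internally; the sketch below only shows one common way to keep memory bounded, by reading a file in fixed-size chunks instead of loading it whole. The function name, chunk size and demo file are made up for the example.

```python
import os
import tempfile
from collections import Counter

def count_characters(path, chunk_chars=1 << 20):
    """Count CJK Unified Ideographs by reading fixed-size chunks, so memory stays bounded."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)  # text mode reads whole characters, never partial bytes
            if not chunk:
                break
            counts.update(ch for ch in chunk if "\u4e00" <= ch <= "\u9fff")
    return counts

# Tiny demo file standing in for a multi-GB corpus; a small chunk size forces several reads.
demo_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("我喜欢中文。\n我学习中文。\n")

demo_counts = count_characters(demo_path, chunk_chars=4)
print(demo_counts.most_common(3))
```

Because each chunk is processed and discarded, peak memory depends on the chunk size and the number of distinct characters, not on the size of the file.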


Posters (in order of appearance): imron, murrayjames, imron, Juraj 唐优来, imron, Juraj 唐优来, imron, Juraj 唐优来, imron, murrayjames, imron, murrayjames, roddy, imron, imron, roddy, murrayjames, imron, drungood, imron
