Chinese-Forums

Introducing Chinese Text Analyser



Posted

I can barely manage to develop CTA.  Not sure where I can find the time at the moment to make ETA.

Posted

Potential $$$$ gained versus actual $$$$ lost in opportunity costs, and while I'd like to get around to doing something like that, ETA is probably about idea number 10 on my Potential $$$$ list :mrgreen:

  • 2 months later...
  • New Members
Posted

Hey. I am having an issue with scrolling down in the clipboard on the Mac version. I can scroll down in the panels on the left, but I can't in the clipboard panel. Thanks in advance.

Posted

Are you able to post a screenshot of what you mean?

 

Also, which version of macOS are you running?

  • New Members
Posted

I am using High Sierra (10.13.6). The yellow arrows indicate the parts where I am able to scroll; the red arrow indicates the part where I am unable to scroll.

Thank you.

[Attached screenshot]

  • New Members
Posted

Hey! After installing the new security update, it's working again. Thanks for your prompt reply and sorry for any inconvenience!

  • Like 1
  • 2 months later...
Posted

imron, what is the best indicator of the difficulty of a text in CTA, if you've never uploaded a list of your Known Words?

 

Is it the number of unique words/unique characters in the text? The HSK percentages?

Posted

Not the HSK percentages.  They are there mostly to show how the HSK is not that useful :mrgreen:

 

The number of unique words is one potential indicator of difficulty. I'd also look at the number of words it takes to get to 98% comprehension of the text and see how big a proportion of total words that is. I'd also look at what percentage comprehension you'd get if you learnt every word that appears more than once.

 

That gives you an idea of how many words you'd need to know/learn in order to read the book comfortably.
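A rough sketch of those two figures, for anyone who wants to compute them outside CTA. It assumes the text has already been segmented into a list of words; the function names are made up for illustration and this is not how CTA itself calculates anything.

```python
from collections import Counter

def words_needed_for_coverage(words, target=0.98):
    """Count how many of the most frequent unique words are needed
    to cover at least `target` of all word occurrences."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk the unique words from most to least frequent.
    for needed, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return needed
    return len(counts)

def coverage_if_repeated_words_known(words):
    """Share of all word occurrences covered by words that appear more than once."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for n in counts.values() if n > 1) / total

# Toy example: 10 word occurrences, 4 unique words.
sample = ["我"] * 5 + ["你"] * 3 + ["他", "她"]
needed = words_needed_for_coverage(sample, target=0.98)
print(needed, "of", len(set(sample)), "unique words needed")
print(coverage_if_repeated_words_known(sample))
```

Comparing `needed` against the number of unique words (or the total word count) gives the proportion mentioned above.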

  • Like 2
Posted

A thought. Dividing the number of unique words by the total number of words gets you the percentage of unique words in a text. Dividing the other way tells you, for example, that 1 in 5 words is unique. Not sure how closely the density of unique words correlates with difficulty though.
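As a minimal sketch of that arithmetic, assuming the words are already segmented into a list (the function name is just for illustration):

```python
def unique_word_density(words):
    """Return (unique / total, total / unique) for a list of segmented words."""
    total = len(words)
    unique = len(set(words))
    if total == 0:
        return 0.0, 0.0
    return unique / total, total / unique

# 10 words in total, 4 of them unique.
sample = ["我", "你", "我", "他", "我", "你", "她", "我", "我", "我"]
share, one_in = unique_word_density(sample)
print(f"{share:.0%} of words are unique, i.e. 1 in {one_in:.1f}")
```

One caveat: this ratio tends to drop as a text gets longer, so it is probably only meaningful when comparing texts of similar length.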

 

  • Like 1
Posted

So when does this integrate with one of those text-to-speech APIs? I want word lists for TV shows. 

  • Like 1
Posted
7 hours ago, roddy said:

text-to-speech APIs? I want word lists for TV shows. 

Speech to text you mean?

 

I've actually been toying with writing an application that does this, but for any language, not just Chinese (and by toying I mean I've already written a bunch of code and done test calls with the APIs and got reasonable results back).

 

Still not sure if I have the time to make it though and if there is any demand for this kind of thing, especially as it would need to be a paid service (because Google/Microsoft charge for each API call).

Posted
8 hours ago, murrayjames said:

Not sure how closely the density of unique words correlates with difficulty though.

I think you'd also need to look at the frequency of those unique words in the text as a whole.  If many of those unique words only appeared once or twice in total, but when combined made up a significant percentage of total words, then that would affect difficulty, because it would mean lots of words you need to put in work to learn, but that don't really lead to increased comprehension for the rest of the text.

 

Looking at the words it takes to get 98% comprehension (or some other reasonably high percentage) serves as a decent proxy for that.
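To check the "rare words" effect directly, something like the following sketch would do. It assumes a pre-segmented word list; the function name and the cut-off of two occurrences are illustrative choices, not anything taken from CTA.

```python
from collections import Counter

def rare_word_profile(words, max_count=2):
    """Measure how much of a text is made up of words that occur
    at most `max_count` times."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {"rare_unique_share": 0.0, "rare_token_share": 0.0}
    rare = {w: n for w, n in counts.items() if n <= max_count}
    return {
        # Fraction of the unique words that are rare.
        "rare_unique_share": len(rare) / len(counts),
        # Fraction of all word occurrences that belong to rare words.
        "rare_token_share": sum(rare.values()) / total,
    }

sample = ["的"] * 6 + ["书"] * 2 + ["猫", "狗"]
print(rare_word_profile(sample))
```

A high `rare_token_share` would suggest a lot of words that take work to learn but don't help much with the rest of the text.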

Posted
9 hours ago, imron said:

Speech to text you mean?

That sounds more likely. 

Posted

Reinstalling CTA after a hard drive crash. I made a backup of the ChineseTextAnalyser AppData folder before the crash. After reinstalling and running CTA, do I replace the new AppData folder with the old AppData folder to get my known words back?

 

UPDATE: I did and it worked perfectly. The license copied over too!

  • Like 1
  • 2 weeks later...
Posted

I'm doing the 14 day trial right now. I think it's a useful piece of software and will probably buy it, but I wish the word segmentation was better since it's a core feature of the app. Shouldn't it be possible to segment a txt file with a superior but slower segmentation library, save the segmented version, and have CTA use that?

  • Good question! 1
Posted
5 hours ago, drungood said:

but I wish the word segmentation was better since it's a core feature of the app

Segmentation is something I've always wanted to improve, and I have in fact worked on implementing a bunch of different segmenters, but the main issue is not having enough time to build something suitable in terms of speed, memory usage and correctness.

 

As with everything, there are tradeoffs.  Most of the problems can be solved, it's just that there's a large amount of work involved and it only returns a minor increase in correctness, and so when I have time to work on CTA it usually goes towards other features because the segmenter is ball-park level correct, and that is sufficient for what I see as the main features of the app:

 

1. Finding frequently occurring unknown words in a piece of text.

2. Comparing texts to see the relative difficulties.

 

Based on tests I've done, and on my own experience, improving the segmenter doesn't significantly improve those two activities.

 

The current segmenter does mean that CTA is less useful if you are wanting to use it for precise segmentation on a sentence by sentence level.

 

6 hours ago, drungood said:

Shouldn't it be possible to segment a txt file with a superior but slower segmentation library, save the segmented version, and have CTA use that?

Solving for the general case is not that bad; it's the edge cases where things fall down.  E.g. what happens if the file is several GB?  Most tools lock up.

 

CTA on the other hand will open (and highlight text) instantly and let you scroll anywhere through the file (though statistics take a bit longer to generate).

 

A GB of text might seem a bit extreme, but that's only about 1000 books, which is not unreasonable if generating information for a corpus or similar.
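For illustration, here is a rough sketch of the "segment offline and save the result" idea that also copes with large files by working line by line. The toy forward maximum-matching segmenter and its tiny lexicon are stand-ins for whatever slower, better library you'd actually plug in, and it is an assumption that a space-separated file is a useful output format; nothing in this thread says CTA can import pre-segmented text.

```python
import io

def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching. Characters not covered by the
    lexicon come out as single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def segment_stream(src, dst, segment):
    """Read `src` one line at a time and write space-separated tokens to `dst`,
    so memory use depends on the longest line rather than the file size."""
    for line in src:
        dst.write(" ".join(segment(line.rstrip("\n"))) + "\n")

# Hypothetical lexicon and in-memory "files" for demonstration.
lexicon = {"中文", "文本", "分析", "中文文本"}
src = io.StringIO("中文文本\n分析器\n")
dst = io.StringIO()
segment_stream(src, dst, lambda line: max_match(line, lexicon))
print(dst.getvalue())
```

With real files you would pass objects from `open(path, encoding="utf-8")` instead of `io.StringIO`. Note that a file consisting of one enormous line would still be read in one go, which is exactly the kind of edge case mentioned above.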
