
Introducing Chinese Text Analyser


imron


On 6/26/2019 at 12:17 AM, murrayjames said:

what is the best indicator of the difficulty of a text in CTA, if you've never uploaded a list of your Known Words?

 

On 6/26/2019 at 2:35 AM, imron said:

The number of unique words is one potential indicator of difficulty, but I'd also look at the number of words it takes to get to 98% comprehension of the text and see how big a proportion of total words that is, and I'd also look at what percentage comprehension you get if you learnt every word that appeared more than once.

 

This is what I do. In addition, I look at the bottom few words on that list (the list of words I'd have to learn to get to x% comprehension - I use 95%) and how often they appear in the text. If they don't appear at least 3 or 4 times in the text, I consider the text too hard. If a word I've learned shows up only once in the text, I'm probably not going to remember it, so it's not as worth learning. Basically because of what Imron says:

 

On 6/26/2019 at 5:57 PM, imron said:

If many of those unique words only appeared once or twice in total, but when combined made up a significant percentage of total words, then that would affect difficulty, because it would mean lots of words you need to put in work to learn, but that don't really lead to increased comprehension for the rest of the text.
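
The checks discussed above can be sketched outside CTA. This is only an illustration in plain Python, not CTA's own code or scripting API: it assumes the text has already been segmented into a list of words, and the function names and the `known` set are made up for the example.

```python
from collections import Counter


def words_needed_for_coverage(tokens, known, target=0.98):
    """Return (words_to_learn, coverage) for reaching `target` comprehension.

    words_to_learn is a list of (word, count) pairs, most frequent first.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    covered = sum(n for word, n in counts.items() if word in known)
    # Unknown words, most frequent first; ties broken alphabetically.
    unknown = sorted(
        ((word, n) for word, n in counts.items() if word not in known),
        key=lambda item: (-item[1], item[0]),
    )
    to_learn = []
    for word, n in unknown:
        if covered / total >= target:
            break
        to_learn.append((word, n))
        covered += n
    return to_learn, covered / total


def coverage_if_repeated_words_learned(tokens, known):
    """Coverage if every word that appears more than once were learned."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for word, n in counts.items() if word in known or n > 1)
    return covered / total
```

The count on the last entry of `words_to_learn` is the "bottom of the list" figure mentioned above: if it is below 3 or 4, a lot of the required words are ones you would rarely meet again in that text.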

 

 

 

22 hours ago, drungood said:

Shouldn't it be possible to segment a txt file with a superior but slower segmentation library, save the segmented version, and have CTA use that?

It is possible, and this is what I do. The reason is, as Imron said, CTA's native segmentation is perfectly good for comparing texts and finding my next text to read, but less good for segmentation on a sentence by sentence level, which is what I need to create cloze deletion flashcards of unknown words with their corresponding sentences.

 

I use the Stanford Word Segmenter described in this post. It separates the words with spaces, so its output is perfectly compatible with CTA. After I export the cloze sentences, I use Excel to remove the spaces.
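
For anyone who would rather script the last step than use a spreadsheet, here is a minimal sketch in Python. It assumes the exported sentences are plain strings with spaces between segmented words; the function name and the chosen character ranges are my own choices for the example, not part of CTA or the Stanford segmenter.

```python
import re

# Chinese ideographs plus CJK punctuation and full-width forms.
_CJK = r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]"

# Spaces or tabs that sit directly between two CJK characters.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<={_CJK})[ \t]+(?={_CJK})")


def remove_segmentation_spaces(sentence: str) -> str:
    """Drop the spaces a segmenter inserted between Chinese words."""
    return _SPACE_BETWEEN_CJK.sub("", sentence)
```

Spaces next to Latin text or markup are left alone, so something like "iPhone" embedded in a sentence keeps its surrounding spaces. If your cloze format wraps words in brackets or braces, the spaces touching those markers would need separate handling.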

 


  • 4 months later...

I've just released a new version of CTA.

 

This is more of a maintenance release than a big new feature release.  The two main things it adds are:

 

1. Fixing a crash bug in macOS if the document contained characters that didn't exist in the current font, and

2. High DPI support for Windows (in both single and multi-monitor setups).

 

It also adds a bunch of minor bug fixes, plus minor new features such as drag-and-drop for opening files, and keyboard navigation - both with arrow keys and vi `hjkl` keys.  With keyboard navigation, you can also press `d` to show the definition of a word.

 

The full release notes are here.


Hello @imron,

 

Running 0.99.17. Today I noticed a segmentation problem I hadn't seen before. When the characters in a word are split across two lines, CTA treats these characters as separate words.

 

An example: The word 二流子 is split across two lines, with 二 as the last character of one line and 流子 as the first characters of the next line. When I mouseover the character 二, the whole word 二流子 is highlighted, as expected. If I right-click 二 and select Show Definition, CTA should show me the definition for 二流子. Instead, CTA shows me the definition for 二, then marks the character 二 as unknown. (See attached picture.)

 

This seems to happen only when a word is split across two lines. When a word fits within a line, CTA segments correctly.

 

[Attached picture]


  • 1 month later...

@Imron: I like CTA a lot and use it to analyse new texts.

 

I know that CTA tells me what % of words in my text are HSK 3,4,5,6, etc.

 

I wonder if there is a way for CTA to tell me what % of the 1300 HSK 5 words (or the 2500 HSK 1-5 words, or the 5000 HSK 1-6 words) are covered in the text I copy/pasted into CTA? In my eyes this could be useful for selecting suitable texts to study for HSK levels. If I knew this, I could create a "reading list" that would cover all HSK 5 vocabulary...
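
To make the idea concrete, here is a rough sketch in Python of what such a calculation could look like. It is not a CTA feature or script: it assumes you already have each text as a list of segmented words and the HSK list as a list of words, and all names are invented for the example.

```python
def list_coverage(text_words, target_list):
    """Return (percent of target_list found in the text, sorted missing words)."""
    target = set(target_list)
    found = target & set(text_words)
    missing = target - found
    percent = 100 * len(found) / len(target) if target else 0.0
    return percent, sorted(missing)


def greedy_reading_list(texts, target_list):
    """Pick texts one at a time, always taking the one that adds the most
    not-yet-covered target words. `texts` maps a title to its word list.

    Returns (titles in reading order, target words no text covers).
    """
    remaining = set(target_list)
    text_sets = {title: set(words) for title, words in texts.items()}
    order = []
    while remaining and text_sets:
        best = max(text_sets, key=lambda t: (len(text_sets[t] & remaining), t))
        gained = text_sets.pop(best) & remaining
        if not gained:
            break  # no remaining text adds anything new
        order.append(best)
        remaining -= gained
    return order, remaining
```

The greedy pick is a simple heuristic, not an optimal solution, and it ignores how hard each text is, so it would still need to be combined with the usual comprehension figures.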

 

 

 


Having studied formally a few years, I don't know if there's anything (course work or otherwise) that can get you to the level of 98% on a normal novel. The best I've gotten is 90% on some 三毛, and the majority of unknown words are not covered on the HSK or otherwise. 


12 hours ago, Jan Finster said:

If I knew this, I could create a "reading list" that would cover all HSK5 vocabulary...

@dougwar's suggestion is what you are looking for, however, based on my experience, you won't really find many real-world texts that are suitable.  The HSK goes for breadth rather than depth, and HSK 5 will only get you 50-70% comprehension on general native texts, and HSK 6 also falls into the same range (it only gives you a few extra percentage points of comprehension vs HSK 5).

 

This is partly the problem that CTA was designed to solve - it helps you figure out the most relevant vocab to learn in a given piece of text, which is a far better use of time than learning words from HSK lists (see here for some figures on how that plays out).

 

6 hours ago, PerpetualChange said:

I don't know if there's anything (course work or otherwise) that can get you to the level of 98% on a normal novel.

Regular reading of novels is the only thing that will do it.  Train what you want to learn.


  • 3 weeks later...

Could you please remind me - or point me to the right post - about adding words to the dictionary?

Is my memory correct that: CTA knows that two or more characters form a 'word' by looking for them in the "words" file? So, years ago before 特朗普  was a thing, if I wanted it to be recognised as a name, I could simply add it to the "words" file? And then, optionally, if I want to add a definition, I need to modify the cedict_ts file?


1 hour ago, realmayo said:

Could you please remind me - or point me to the right post - about adding words to the dictionary?

Here was my initial post (before I implemented the feature) about how it would work.   Then here is a post confirming how it worked once that feature was finished.

 

Let me know if you run into any problems, or if anything's not clear, and I'll provide more detail on what you need to do.


Thanks Imron. One other thing: if for example I wanted to teach CTA what 朋友们 meant, am I right I'd need to add 朋友们 to the "words" file too? Otherwise it wouldn't be segmented as a single word but would instead be recognised as 朋友 + 们.
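
To illustrate why the word list matters here: the thread doesn't describe how CTA's segmenter actually works, so the following is only a generic dictionary-based technique (forward maximum matching) written in Python, with made-up names and a toy vocabulary. It shows how a dictionary-driven segmenter splits 朋友们 differently depending on whether that entry is in its word list.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: at each position take the longest
    vocabulary entry that fits, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words
```

With a vocabulary of {"我", "的", "朋友"}, the string 我的朋友们 comes out as 我 / 的 / 朋友 / 们; once 朋友们 is added to the vocabulary it comes out as 我 / 的 / 朋友们.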

 


22 hours ago, murrayjames said:

Just FYI

Yep.  A general text analyser is one of about a dozen ideas I have that I'd like to make at some point.  It's just a matter of finding the time (which I don't really have at the moment).

 

22 hours ago, murrayjames said:

$$$

Also $$$ to develop.


  • 2 weeks later...
7 hours ago, imron said:

because ultimately it’s the number of words you know that affects how well you will be able to read a text, not the number of characters.

I understand and somewhat agree. However, would it not make more sense to leave it to whoever uses CTA to decide what they focus on, and to provide as comprehensive a set of statistics as possible? Again, I agree with your point, but at times being able to extract new characters could be useful.


54 minutes ago, Jan Finster said:

However, would it not make more sense to leave it to whoever uses CTA to decide, what they focus on

Yes, however CTA is opinionated software, and some opinions it holds strongly ("words as the main unit", and "looking up a word means you don't know the word" being 2 strong ones).

 

I agree though that users should also be given flexibility to do more than what CTA provides out of the box and that's why I added scripting capability to CTA so that people could extend it without needing me to implement a certain feature, which I might not be able to do due to time or other constraints.

 

For example, it would be possible to write a Lua script that built a list of known characters from CTA's list of known words, and then use that to build a list of characters in a document that were not on that list.  If that's something you would really like, I can probably write a quick script to do it.
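
The logic of such a script is short. The sketch below is plain Python rather than a CTA Lua script, since CTA's scripting API isn't shown in this thread; it assumes the known words and the document text are available as an ordinary list and string, and the function names are invented for the example.

```python
def known_characters(known_words):
    """Every character that appears in at least one known word."""
    return {ch for word in known_words for ch in word}


def is_chinese_character(ch):
    """True for characters in the basic CJK ideograph block."""
    return "\u4e00" <= ch <= "\u9fff"


def unknown_characters(document_text, known_words):
    """Characters in the document that appear in no known word,
    in order of first appearance and without duplicates."""
    known = known_characters(known_words)
    candidates = (
        ch for ch in document_text
        if is_chinese_character(ch) and ch not in known
    )
    # dict.fromkeys keeps the first occurrence of each character, in order.
    return list(dict.fromkeys(candidates))
```

Note the assumption baked in: a character counts as "known" as soon as it appears in any known word, which is the definition suggested above, even though knowing a word doesn't always mean knowing each of its characters on their own.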


1 minute ago, imron said:

it would be possible to write a Lua script that built a list of known characters from CTA's list of known words, and then use that to build a list of characters in a document that were not on that list.  If that's something you would really like, I can probably write a quick script to do it.

 

I would love it if you or anyone else capable of writing such a script could do this.

 
