Chinese-Forums

Introducing Chinese Text Analyser


imron


The reason I'm asking:

 

Out of curiosity, I changed chineseOnly to false and created an "English" word list profile. Then I pasted an English news article into CTA and read through it like I normally would. It was fun, but the lack of word wrapping made the article difficult to read. (I know CTA is not presently set up for reading and analysing English texts. Just checking it out.) :P

 

I don't think I've ever seen word wrapping in Chinese, so that's not an issue.


  • 2 weeks later...

Dealing with words that were segmented incorrectly is something I still struggle with. I am aware of the "Custom word" function, but in many cases the word already exists.

So for example if 对生活 gets segmented into 对生 and 活, is there anything I can do about it? Currently I have no choice other than to mark the word 对生 as unknown.
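To illustrate how a split like this can come about, here is a minimal sketch of greedy forward maximum matching, a common dictionary-based segmentation technique. Whether CTA actually segments this way is an assumption; the function name `segment`, the `max_len` parameter and the toy dictionary are hypothetical and only there for illustration.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if nothing matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary that contains both 对生 and 生活.
TOY_DICTIONARY = {"对", "对生", "生活", "活"}

print("/".join(segment("对生活", TOY_DICTIONARY)))  # prints 对生/活
```

Because the matcher commits to 对生 as soon as it finds it, 生活 never gets a chance, even though it is in the dictionary. Text with no dictionary match at all simply falls apart into single characters.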

 

I see the segmentation as THE central feature of the software. Everything kinda revolves around it. Have you looked into improving the segmentation algorithm?

http://www.oschina.net/project/tag/264/segment?sort=view&lang=0&os=0

 

There seem to be many open source projects dealing with it.


I agree that correct segmentation is a very important feature. It's just that implementing improvements involves significant effort (especially to make them performant) for relatively small gains in correctness, so it's always been on the back burner while other, more user-facing problems were worked on.

 

Segmentation is definitely something I want to improve and have been looking at - both allowing custom segmentation of incorrect words *and* a better segmentation algorithm. Unfortunately there's no timeframe yet for when I'll be able to do that, but it is slowly creeping up the priority list.


In Chinese, whenever a word isn't in the dictionary, the segmenter (as far as I can tell) will just segment the word into characters... What will happen in English? Will this cause any problems in terms of making words "known" that have no dictionary entry? And can I still export unknown English words with no dictionary entry?


Quote

allowing custom segmentation of incorrect words

This would already help a lot. Something like a "segmentation editor" where you get a view that looks something like this:
我/把/人们/对生/活/的/很多/疑问/写成/了/一篇/作文/

And you could just edit the segmentation by placing the "/" differently.
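A rough sketch of what reading such an edited line back could look like, assuming the editor only lets you move the "/" separators and never change the characters themselves. The function name `parse_segmentation` is hypothetical and this is not an existing CTA feature.

```python
def parse_segmentation(edited, original):
    """Split a '/'-delimited line into words and verify that only the
    separators were moved, not the underlying text."""
    words = [word for word in edited.split("/") if word]
    if "".join(words) != original:
        raise ValueError("edited segmentation no longer matches the original text")
    return words


original = "我把人们对生活的很多疑问写成了一篇作文"
# The same sentence with the "/" moved so that 对 and 生活 are separate words.
edited = "我/把/人们/对/生活/的/很多/疑问/写成/了/一篇/作文/"

print(parse_segmentation(edited, original))
```

The check against the original text keeps an accidental deletion or typo in the editor from silently changing what gets analysed.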

 

Quote

 What will happen in English? 

The problem simply doesn't exist in English because words are divided by spaces already, so no word segmentation is necessary.


7 hours ago, Yadang said:

What will happen in English?

CTA groups runs of the English alphabet as a single unit.  An apostrophe or a character with an umlaut or accent will break the 'word'.
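As a small illustration of that behaviour, the sketch below uses a regular expression, assuming "the English alphabet" means the ASCII letters A-Z and a-z. The pattern and the function name `english_units` are assumptions for illustration, not CTA's actual code.

```python
import re

# One unit = one unbroken run of ASCII letters (assumed definition).
ASCII_RUN = re.compile(r"[A-Za-z]+")


def english_units(text):
    """Return every run of ASCII letters; anything else acts as a break."""
    return ASCII_RUN.findall(text)


# \u00ef is an i with an umlaut and \u00e9 is an e with an acute accent.
print(english_units("Don't order the na\u00efve caf\u00e9"))
# prints ['Don', 't', 'order', 'the', 'na', 've', 'caf']
```

So a contraction like "Don't" ends up as two units, and an accented word is cut at the accented letter.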

 

5 hours ago, tiantian said:

so no word segmentation is necessary.

Yadang was likely referring to a discussion a few posts back about using CTA for people learning English and/or other languages.


Recently I've started appending statistics to the end of texts I read in CTA. Something like this:
 

Quote

WORDS
known 10,046
percent known 88.25%

UNIQUE WORDS
known 1,248
percent known 66.31%

VOCABULARY SIZE 8,318


When I finish reading a text, this is what I see at the bottom. Then I check it against the (up-to-date) word statistics pane to see how many new words I've learned.

This is only a rough estimate of progress, of course. Many of these words I've known for years; I am simply encountering them in CTA for the first time. It's nice to have a sense of numerical progress, though. Good for motivation.
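For anyone wanting to reproduce numbers like these outside CTA, here is a minimal sketch. It assumes you already have the text as a list of segmented words plus a set of known words; the function name `reading_stats` and the sample data are made up for illustration.

```python
def reading_stats(tokens, known):
    """Count known words over all tokens and over unique tokens."""
    unique = set(tokens)
    known_tokens = sum(1 for token in tokens if token in known)
    known_unique = len(unique & known)
    return {
        "words_known": known_tokens,
        "words_percent_known": 100 * known_tokens / len(tokens) if tokens else 0.0,
        "unique_known": known_unique,
        "unique_percent_known": 100 * known_unique / len(unique) if unique else 0.0,
    }


# Made-up sample: 7 tokens, 5 unique words, 3 of them known.
tokens = ["我", "喜欢", "读", "书", "我", "读", "报纸"]
known = {"我", "读", "书"}

for name, value in reading_stats(tokens, known).items():
    print(name, round(value, 2))
```

The percentage over all words is usually higher than the percentage over unique words, because the words you know tend to be the ones that repeat most often.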


CTA actually keeps track of all the words you mark known or unknown in a given session (defined by when you start the program to when you close it).

 

It's been doing that since a very early version, so once I get around to adding the GUI for it, there will be graphs with all of this sort of data, as well as the ability to see how well you were able to read a text over time.


29 minutes ago, imron said:

CTA actually keeps track of all the words you mark known or unknown in a given session (defined by when you start the program to when you close it).

 

It's been doing that since a very early version, so once I get around to adding the GUI for it, there will be graphs with all of this sort of data, as well as the ability to see how well you were able to read a text over time.

 

I can't believe it's been doing that all this time without my knowledge! That's awesome!


42 minutes ago, imron said:

CTA actually keeps track of all the words you mark known or unknown in a given session (defined by when you start the program to when you close it).

 

It's been doing that since a very early version, so once I get around to adding the GUI for it, there will be graphs with all of this sort of data, as well as the ability to see how well you were able to read a text over time.

Sounds like a must-have for enthusiastic learners. If the data is inaccessible to users, at least provide a simple action like dumping the learned session words into a list in a popup window or a simple counter "Learned X words in session".


7 hours ago, Yadang said:

I can't believe it's been doing that all this time

It's mentioned a couple of times earlier in the thread.

 

Basically, your known vocabulary is saved as sets of additions and removals.  You can find the files in c:\users\<username>\AppData\Local\ChineseTextAnalyser\wordlists\objects\<wordlist name> - there's one file per Word List you have created.  As long as you don't delete the word lists from the 'Word List Manager' dialog, your entire history of marking words as known/unknown is saved.

 

Lines with a + at the start are words that were marked known, and lines with a minus are words that were marked unknown. Note: you should not change anything in these files, or you might lose data. A checksum is calculated for each set of words and if you change the data the checksum won't match and CTA will likely discard it.

 

The main reason for saving wordlists this way is that it will make it easier when support for online syncing of vocabulary between computers is added - you just apply the additions and removals in order and the checksums allow CTA to make sure it doesn't add the same revision twice.
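The idea can be sketched roughly as follows. Only the +/- line convention comes from the description above; the in-memory layout, the use of SHA-256 as the checksum and the function names are all assumptions for illustration and say nothing about CTA's real file format.

```python
import hashlib


def checksum(lines):
    """Hypothetical checksum: SHA-256 over the changeset's lines."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def apply_changesets(changesets):
    """Apply (lines, checksum) pairs in order, skipping any changeset that
    fails its checksum or has already been applied."""
    known = set()
    applied = set()
    for lines, expected in changesets:
        if checksum(lines) != expected:
            continue  # data was altered, so discard it
        if expected in applied:
            continue  # same revision seen before, do not apply twice
        applied.add(expected)
        for line in lines:
            sign, word = line[0], line[1:]
            if sign == "+":
                known.add(word)
            elif sign == "-":
                known.discard(word)
    return known


first = ["+你好", "+生活"]
second = ["-生活", "+作文"]
# The first changeset arrives twice, as it might when syncing two computers.
history = [(first, checksum(first)), (second, checksum(second)), (first, checksum(first))]

print(sorted(apply_changesets(history)))  # prints ['作文', '你好']
```

Without the duplicate check, the repeated first changeset would add 生活 back after it had been marked unknown.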

 

Being able to then use that same data for graphs and other statistics is an added bonus.

 

7 hours ago, Mati1 said:

Sounds like a must have for enthusiastic learners.

And also a great distraction for enthusiastic learners :lol:

 

One of the things I wanted to get away from with CTA was focusing on the minutiae of learning and look more at the bigger picture.  "Can I read this book" and "What words do I need to learn in order to see the greatest increase in understanding for a given text" are much more useful questions than "how many words did I learn today" or "how many words have I learnt since the start of the year".

 

The number of words you learn is meaningless if it doesn't help you read the things you are wanting to read.


43 minutes ago, imron said:

One of the things I wanted to get away from with CTA was focusing on the minutiae of learning and look more at the bigger picture.  "Can I read this book" and "What words do I need to learn in order to see the greatest increase in understanding for a given text" are much more useful questions than "how many words did I learn today" or "how many words have I learnt since the start of the year".

 

I agree with this. I usually hide the statistics pane as I read, because the goal is reading, not increasing your "word count."

 

On the other hand, it's a good feeling to finish a long article, and see that you know 10% more of the words than when you started. 

 

Quote

once I get around to adding the GUI for it, there will be graphs with all of this sort of data, as well as the ability to see how well you were able to read a text over time.

 

Does this mean that CTA will keep (or is currently keeping) track of our performance on individual texts?


I am thinking of this scenario:

One uses CTA to read a long text because otherwise it would be too hard. At the end of the day before it is closed, it shows the learner the amount of learned words, thereby confirming how awesome the learner and CTA are :D


15 hours ago, murrayjames said:

Does this mean that CTA will keep (or is currently keeping) track of our performance on individual texts?

No, it means the changesets (i.e. the list of additions and removals each session) are timestamped, and CTA is fast enough that it can apply the differences more or less in real time (see for example when you change Word Lists from the Word List Manager).

 

So, you'd open up a text and then choose the date and then based on the timestamps, CTA would build your vocabulary at that date (by applying all additions and removals in order) and apply it to the text.
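In code, the idea might look something like the sketch below. It assumes each changeset is a (timestamp, lines) pair using the +/- convention; that layout, the function names and the sample history are hypothetical and not taken from CTA.

```python
from datetime import datetime


def vocabulary_at(changesets, when):
    """Replay every changeset stamped at or before `when`, oldest first."""
    known = set()
    for stamp, lines in sorted(changesets, key=lambda item: item[0]):
        if stamp > when:
            break
        for line in lines:
            sign, word = line[0], line[1:]
            if sign == "+":
                known.add(word)
            elif sign == "-":
                known.discard(word)
    return known


def percent_known(tokens, known):
    """Share of the text's words that are in the known set."""
    if not tokens:
        return 0.0
    return 100 * sum(1 for token in tokens if token in known) / len(tokens)


# Made-up history of three sessions.
history = [
    (datetime(2017, 1, 10), ["+我", "+读"]),
    (datetime(2017, 3, 5), ["+作文", "-读"]),
    (datetime(2017, 6, 1), ["+读", "+一篇"]),
]
text = ["我", "读", "一篇", "作文"]

for day in (datetime(2017, 2, 1), datetime(2017, 7, 1)):
    print(day.date(), percent_known(text, vocabulary_at(history, day)))
```

Nothing per-text needs to be stored: the same history is replayed up to whichever date you pick and then applied to whatever text is open.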

 

15 hours ago, Mati1 said:

thereby confirming how awesome the learner and CTA are

Using CTA is already confirmation of how awesome the learner is :mrgreen:

 

I do get your point though, and this sort of information will probably be there eventually.


I think this can be an extremely useful tool and am looking forward to future versions. I applaud your work to date!!
 

Like a number of other posters to this thread, it appeared to me that the program does not maintain a running history of known vs. unknown words. Whenever I opened a file for a second time, it seemed to have lost this information.

 

While it is completely understandable that this functionality is currently not exposed in the GUI, I think you should describe this in the release notes (and possibly even in the current documentation). This could reduce confusion about the issue and spare users the need to figure it out for themselves.

 

Note: BTW, I'm a technical writer. Should you need it, I could write some brief content that you could add to the release notes.


6 hours ago, bossidy said:

Like a number of other posters to this thread, it appeared to me that the program does not maintain a running history of known vs. unknown words

Are you talking about the issue mentioned by realmayo here?

 

If so, then as long as you are not changing word lists, it should be maintaining history correctly.

 

I've already got a fix for it waiting to go, but was holding off to add a couple of other features too (and then other work things took priority), because it can be worked around (inconveniently) by closing the program and re-opening it before changing lists you want to save (closing the program saves the words marked known to disk).

 

I may actually just release a new version in the next day or two with just that fix.
