
Introducing Chinese Text Analyser



Posted

The riders are called tabs. The clipboard is what is used for copying and pasting. If you paste something from the clipboard into CTA, it will create a tab called 'clipboard-1', 'clipboard-2', and so on.

Posted

Version 0.99.6 is now released and it fixes the blank item in the wordlist combo box of the export dialog.

 

The rendering glitch with circled numbers is a bit more complex and will take some time to address so in the interim I've also disabled using circled numbers in definitions.

Posted

Imron - any plans to incorporate full keyboard control? I'd prefer to have this rather than clicking/scrolling around with a mouse in longer articles. My thinking is commands for:

  1. Move up/down 1/5/10 line(s), same position
  2. Move left/right by 1 word in the same line, as well as jump forward X words

Not sure about others but at least for me this would make the program much more useful as a reader.

 

However, to be clear about my preferences, functionality to support a personal corpus would be preferred before this! I also read many short articles, usually with lots of words with 1-2 hits, and it would be nice to compare frequency across them more easily rather than simply copying/pasting them all into a single file. Perhaps an easier fix is allowing us to do a combined scan of clipboard-1,2,3... all at once, which at least would group my daily readings together, if not across days.

Posted

I have been working on some keyboard functionality, but there are still a few things I need to do to make it work well.  Keystrokes will be similar to Vi, so <number>w moves <number> words to the right, <number>b moves <number> words back.  <number>j moves <number> lines down, <number>k moves <number> lines up and so on.

 

In the meantime, j and k are working now for single lines, and Ctrl-D and Ctrl-U will scroll (D)own and (U)p respectively.
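To illustrate the count-prefixed keystrokes described above, here is a minimal Python sketch of how such commands could be parsed. It is not CTA's actual code: the `MOTIONS` table and the `parse_motion` function are hypothetical names made up for this example, and only the four keys mentioned in this post are covered.

```python
import re

# Hypothetical command table: key -> (what moves, direction).
# These names are illustrative and are not taken from CTA.
MOTIONS = {
    "w": ("word", +1),   # <number>w: words to the right
    "b": ("word", -1),   # <number>b: words back
    "j": ("line", +1),   # <number>j: lines down
    "k": ("line", -1),   # <number>k: lines up
}

# An optional count (no leading zero) followed by exactly one motion key.
_COMMAND = re.compile(r"([1-9]\d*)?([wbjk])")

def parse_motion(keys):
    """Turn a key sequence such as '5j' into (unit, signed_count)."""
    match = _COMMAND.fullmatch(keys)
    if match is None:
        raise ValueError(f"unrecognised motion: {keys!r}")
    digits, key = match.groups()
    count = int(digits) if digits else 1  # a bare key means a count of 1
    unit, direction = MOTIONS[key]
    return unit, direction * count
```

With this sketch, `parse_motion("5j")` yields a move of five lines down and `parse_motion("b")` a move of one word back.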

 

Will move corpus functionality up on my todo list.  Recently I've also been working on the OSX version.

Posted

Really enjoying this app. I'm using it with the transcripts from www.slow-chinese.com to fill in the gaps in my vocab.

  • 2 weeks later...
Posted

Xiao Kui

 

Thank you for the link to www.slow-chinese.com. I don't have an iPod, but I played one recording on my computer with the character text under it. I could match the recording to the text only once in a while, but the effect is great (recognising the characters at speed should lead to good reading speed down the road). I've downloaded CTA, so I like your idea there, too.

Posted

I was graciously granted a free license for Chinese Text Analyser (CTA), version 0.99.6, and this post is my promised review.

Because I am a frequent reader of the King James version of the Holy Bible, I believe that it would be easier for me to learn Chinese characters by cross-referencing their words and sentences in the English and Chinese versions (which is one of the ways that I learned to read quite a bit of Spanish), and because the material would maintain my interest.

So, I found a website that will display the Holy Bible in Chinese in several different formats, including interlineation with the King James version (KJV) or even Basic English (I prefer the KJV, but the Basic English version was surprisingly acceptable and might even facilitate easier learning of the basic meaning of Chinese characters):
http://www.o-bible.com/cgibin/ob.cgi?chapter=1&book=gen&version=hb5

I quickly discovered that I could not use the simple copy and paste functions of Windows to put the text into the text display area of the CTA. However, I could highlight and copy the text in the usual Windows way; and, then, when CTA is open, I could press the clipboard icon at the upper left of CTA's icons, and the text would paste. A little difference in procedure is not a problem.

I didn't figure out the copy and paste procedure right away, so I first tried to copy and paste Genesis Chapter 1 into Notepad and then save it, but I got a warning message that I would have to choose a format other than plain text (.txt) to preserve the characters. I didn't know which format would be best, but I chose the Big Endian format from Notepad's drop-down list of options. I saved the file and then re-opened it, and the characters were preserved. But, I don't remember how I was able to find the file and open it from within CTA, so today I dragged and dropped the file out of my Notepad list of documents and put it on my desktop so that I could find it from within CTA. Again, a little work is not a problem.

When using the paste procedure, I had a little trouble figuring out how to save the pasted text to a file from within CTA. Choosing the menu option File => Exit will close CTA without prompting to save. However, choosing File => Close will prompt the user to save the text to a file. The default save format is plain text (.txt). I tested it and the characters were preserved, so I'm a little confused, given that Notepad had warned me that plain text would not preserve them.

A problem occurs in the text display area: I lose about three letters of English text and one character of Chinese text under the right-hand margin of the text display area, even after I drag and close the unknown-words box. So, there is an imperfect word wrap in the text display area when using this particular text (I have not yet tried other text).

I tried correcting the problem by increasing the right-hand margin format of the Notepad file, but it did not change how it was displayed in CTA. I remembered discussions of VIM commands in the thread here, so I went online and learned a little about VIM through a tutorial, etc., but I did not come across any way to change the margins of a text with VIM commands. I completely re-read this thread, and it became apparent that the VIM functionality was limited to j for scrolling one line down, k for scrolling one line up, Ctrl-u for scrolling up in big jumps, and Ctrl-d for scrolling down in big jumps.

The thread alerted me to the fact that the characters in CTA are displayed in the UTF-8 format, so I saved a new file of the text in the UTF-8 format, but it did not change the problem of incomplete word wrap in CTA's text display area.
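For anyone puzzled by the different save formats, the sketch below shows in Python how a program could read either kind of file. It assumes that Notepad's "big endian" option writes UTF-16 with a byte-order mark (BOM) at the start of the file and that files without a BOM are UTF-8. The function name `decode_text` is made up for this example and says nothing about how CTA itself reads files.

```python
import codecs

# Byte-order marks that Windows editors commonly write at the start of a file,
# paired with a Python codec that consumes the mark while decoding.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),  # "utf-16" reads the BOM to pick the byte order
    (codecs.BOM_UTF16_LE, "utf-16"),
]

def decode_text(raw: bytes) -> str:
    """Decode file contents, using a leading BOM to choose the encoding."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding)
    # Assumption: a file without a BOM is plain UTF-8.
    return raw.decode("utf-8")
```

The same characters come back whichever of these formats the file was saved in, which would be consistent with both of my files displaying identically.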

Another margin problem is that when a character is right-clicked to get the definition option, the definition box is displayed where the character was clicked, and if it is near the margin, the definition will be cut off by the margins. I could not drag the definition box to where I could see the complete definition.

I like this program, and I want to use it as a reader with the definitions feature, so just a few tweaks would make it perfect for me. If I could wish it, I would only add sound pronunciation to the definitions feature (including alternate pronunciations of the same character when different meanings are indicated). And, if I could be really troublesome, I would allow pronunciation of complete sentences with correct tone sandhi in both a male and female voice. I don't know if a computer-generated voice could do this effectively.

Nonetheless, I like the current program, and I will be working around its problems to enjoy its benefits.

 

Very Respectfully,

Ray Donald Pratt
 

Posted

Thanks for the feedback.  I'll have a look into the margin wrapping problems.  Can you send me a copy of the file that causes the problems?  I'll write a longer reply shortly addressing your other concerns.

Posted
Thank you, Imron.
 
I have emailed you the two files with the exact same content, but which were created in different ways: one from a cut and paste into Notepad, and the other from a paste into CTA and then a save-to-file.
 
That you are asking for my particular files suggests that the problem may be specific to these files or my system, so I will check other text later and see how it behaves.
 
Very Respectfully,

Ray Donald Pratt

Posted

Note, I've found a pretty good fix for personal-corpus needs which I describe here. If it is possible to tie in third-party access to Evernote (or similar services), that may help bridge much of the gap for CTA. Even without it, my current fix greatly improves the utility of CTA for me without any additional programming.

 

Of course a built in corpus may have other advantages (besides consolidated/localized functionality), but just FYI. 

Posted

I have a probably silly question:

I input a text and the Word Statistics box tells me there are 150 Unique Unknown words.

I export and only 124 are exported. Why is that (何以)? :P

Posted

OK I have found where the problem lies:

 

 

The problem occurs when exporting to Excel with the sentence or cloze sentence fields included. Some of the results for unknown words do not go into their own row but merge with the final cell in a preceding row.

For instance, this passage:

 

 

5)老栓正在专心走路,忽然吃了一惊,远远里看见一条丁字街,明明白白横着。他便退了几步,寻到一家关着门的铺子,蹩进檐下,靠门立住了。好一会,身上觉得有些发冷。
"哼,老头子。"
"倒高兴……。"

6)老栓又吃一惊,睁眼看时,几个人从他面前过去了。一个还回头看他,样子不甚分明,但很像久饿的人见了食物一般,眼里闪出一种攫取的光。老栓看看灯笼,已经熄了。按一按衣袋,硬硬的还在。仰起头两面一望,只见许多古怪的人,三三两两,鬼似的在那里徘徊;定睛再看,却也看不出什么别的奇怪。

7)没有多久,又见几个兵,在那边走动;衣服前后的一个大白圆圈,远地里也看得清楚,走过面前的,并且看出号衣⑶上暗红的镶边。——一阵脚步声响,一眨眼,已经拥过了一大簇人。那三三两两的人,也忽然合作一堆,潮一般向前进;将到丁字街口,便突然立住,簇成一个半圆。

8)老栓也向那边看,却只见一堆人的后背;颈项都伸得很长,仿佛许多鸭,被无形的手捏住了的,向上提着。静了一会,似乎有点声音,便又动摇起来,轰的一声,都向后退;一直散到老栓立着的地方,几乎将他挤倒了。

9)"喂!一手交钱,一手交货!"一个浑身黑色的人,站在老栓面前,眼光正像两把刀,刺得老栓缩小了一半。那人一只大手,向他摊着;一只手却撮着一个鲜红的馒头⑷,那红的还是一点一点的往下滴。

 

 

When I export with sentence field it works normally up to 老头子:

 

 

丁字街    1. T-junction    ding1 zi4 jie1    71    5)老栓正在专心走路,忽然吃了一惊,远远里看见一条丁字街,明明白白横着。
铺子    1. store; 2. shop    pu4 zi5    149    他便退了几步,寻到一家关着门的铺子,蹩进檐下,靠门立住了。
蹩    1. limp    bie2    158    他便退了几步,寻到一家关着门的铺子,蹩进檐下,靠门立住了。
檐下    1. underside of the eaves    yan2 xia4    164    他便退了几步,寻到一家关着门的铺子,蹩进檐下,靠门立住了。
发冷    1. to feel a chill (as an emotional response); 2. to feel cold (as a clinical symptom)    fa1 leng3    221    好一会,身上觉得有些发冷。
老头子    1. old fogey; 2. old codger; 3. my old man    lao3 tou2 zi5    239    "哼,老头子。

 

... but then there's nothing. Looking more closely, there are another 11 full entries, but they're all crammed into the single Excel cell that is supposed to contain only the sentence for the 老头子 entry, i.e. "哼,老头子。
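One possible explanation, offered as a guess rather than a diagnosis: the sentence field for 老头子 begins with a double quote that is never closed, and a quote-aware importer treats everything up to the next quote as part of the same cell. The Python sketch below reproduces that effect with made-up rows. Python's `csv` reader stands in for Excel here, and `export_naive`, `export_quoted` and `read_back` are hypothetical helpers, not CTA functions.

```python
import csv
import io

# Hypothetical rows in the shape of the export above: word, pinyin, sentence.
rows = [
    ["老头子", "lao3 tou2 zi5", '"哼,老头子。'],  # sentence starts with a bare quote
    ["攫取", "jue2 qu3", "眼里闪出一种攫取的光。"],
]

def export_naive(rows):
    """Join fields with tabs and no escaping (the suspected failure mode)."""
    return "".join("\t".join(row) + "\n" for row in rows)

def export_quoted(rows):
    """Let the csv module quote any field that contains a quote character."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

def read_back(text):
    """Parse tab-separated text the way a quote-aware importer would."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
```

Reading back the naive export merges the second row into the last cell of the first, while `read_back(export_quoted(rows))` returns the original two rows, because the writer wraps the problem field in quotes and doubles the embedded one.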

Posted

Can you send me (via email) a copy of the text that causes the problem and a copy of the final exported file?  I ask via email because there may be various things like end-of-line characters that will change when you copy/paste and post them online here, but if I have the actual file then it will be easier to reproduce the problem.

Posted

A killer feature for me would be:

 

- Feed the analyzer with a large corpus of Chinese texts/sentences (for example a Chinese internet corpus).

- Let the analyzer find texts/sentences with a predefined "Percent Known" value and display them in a list for me to choose from, or to mark and display all underneath each other.

 

Example 1: I want to display only texts of which I know 80% of all characters. Text analyser scans the corpus and comes up with a list of 50 source texts from the corpus.

Example 2: I want to display only texts of which I know 95% of all characters. Text analyser cannot find any texts that match the criteria and will search for sentences instead. It comes up with a list of 150 sentences in which I know 95% of the characters/words.
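The two examples above could be sketched roughly as follows in Python. This is only an illustration under simplifying assumptions: it counts individual characters in the basic CJK Unified Ideographs block rather than segmented words, it splits sentences on 。!? only, and the names `percent_known` and `find_readable` are invented for this post.

```python
import re

# Split after sentence-final punctuation (full-width and ASCII forms).
_SENTENCE_END = re.compile(r"(?<=[。!?!?])")

def is_hanzi(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def percent_known(text, known_chars):
    """Percentage of the Chinese characters in text that are in known_chars."""
    hanzi = [ch for ch in text if is_hanzi(ch)]
    if not hanzi:
        return 0.0
    known = sum(1 for ch in hanzi if ch in known_chars)
    return 100.0 * known / len(hanzi)

def find_readable(corpus, known_chars, threshold):
    """Return whole texts at or above threshold, else fall back to sentences."""
    texts = [t for t in corpus if percent_known(t, known_chars) >= threshold]
    if texts:
        return "texts", texts
    sentences = [
        s.strip()
        for t in corpus
        for s in _SENTENCE_END.split(t)
        if s.strip() and percent_known(s, known_chars) >= threshold
    ]
    return "sentences", sentences
```

A call such as `find_readable(corpus, known_chars, 95)` first tries whole texts and only returns individual sentences when no text reaches the threshold, mirroring Example 2.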

 

Of course the problem is getting a large open source corpus. Another problem would be the download size of the corpus (the bigger the better, possibly more than one gigabyte).

 

There are automated ways to create internet corpora using search engine crawlers (the steps are described in http://corpus.leeds.ac.uk/internet.html#description).

 

Another way would be to have the user collect their own corpus ("Add to corpus" button). The user could then throw several Chinese novels in and then let the analyzer display sentences with the desired "Percent Known" values.

 

This would transform Chinese Text Analyser into a powerful "n+1" word reader that lets the user filter through his corpus and read texts that consist of lots of known characters/words for a fluent reading experience without too much new vocabulary.

 

"n+1" word readers that already exist:

 

http://www.nulinu.li - Nice, but the corpus is only sentences from Tatoeba, no updates in months, possibly a dead project.

http://bliubliu.com/ - Currently unusable for Chinese (faulty word parsing).

Posted

Those are some good ideas, some of which I already had in mind for the eventual corpus feature I'll be adding.  At some point I'd also like to start offering subscription content where people would get regular new content matched to their current level.

 

That would avoid the large download problem because I'd just keep user word lists on a server, do all the crunching and analysing there and then send the content as needed.

Posted

Having the corpus online is a great idea. Perhaps you could make the whole software browser based. This would remove the barrier of having to download software that only runs on Windows.

Posted
"This would remove the barrier of having to download software that only runs on Windows."

But would introduce other barriers in terms of processing times and in being able to run it offline.  At the moment, I can open a multi-gigabyte file instantly, with correctly highlighted words shown wherever I scroll in the document, full statistics available within a couple of minutes and the program taking up less than 50 MB of RAM.  That's just impossible over the web.

 

Development of OSX and Linux versions is underway, which should mitigate the Windows-only problem.

  • 2 weeks later...
Posted

A new version is now out that has a few minor fixes and allows you to add custom words for the segmenter (no dictionary definitions yet).

 

Next release will hopefully contain either the OSX version, the start of the corpus features, or both :mrgreen:

