Introducing Chinese Text Analyser

March 2, 2017 at 04:00 AM

Crazy question: Is there something similar for Spanish? Or maybe can I use it with Spanish as-is?

I need some way to input a wordlist into some text editor (or other software) and check, preferably in real time, that I'm not writing too many words outside of the list. It's just that a few days ago I had a crazy idea, and a tool like this would make it a lot easier to implement.

Something like xkcd's "simple writer", but with the ability to load any word list.

Edited March 2, 2017 at 04:08 AM by mlescano
added graphic
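
Editor's sketch: the word-list check asked for above can be approximated outside any editor with a few lines of Python. This is only an illustration, not a CTA feature. The function names `load_word_list` and `words_outside_list` and the sample Spanish words are made up for the example.

```python
import re


def load_word_list(lines):
    """Build a set of allowed words (one per line), ignoring blanks and case."""
    return {line.strip().casefold() for line in lines if line.strip()}


def words_outside_list(text, allowed):
    """Return words in `text` that are not in `allowed`, in order of first use."""
    outside = []
    # [^\W\d_]+ matches runs of Unicode letters, so accented words stay whole.
    for word in re.findall(r"[^\W\d_]+", text):
        key = word.casefold()
        if key not in allowed and key not in outside:
            outside.append(key)
    return outside


allowed = load_word_list(["el", "gato", "come", "pescado"])
print(words_outside_list("El gato come pescado y también arroz.", allowed))
```

Running this on each save, or on a timer, would give a rough near-real-time check. It does no stemming, so inflected forms not in the list are reported as unknown.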

March 2, 2017 at 05:51 AM

That is actually an eventual goal of Chinese Text Analyser (or a companion product). To make it easier for people to create graded content.

CTA unfortunately doesn't work very well with Spanish. I'll likely add a space-based word segmenter at some not too distant point in the future, which should make it easier.

March 2, 2017 at 03:41 PM

9 hours ago, imron said:

To make it easier for people to create graded content.

You just read my mind.

9 hours ago, imron said:

I'll likely add a space-based word segmenter at some not too distant point in the future, which should make it easier.

That would be fantastic!

March 7, 2017 at 08:54 PM

On 3/1/2017 at 9:51 PM, imron said:

CTA unfortunately doesn't work very well with Spanish. I'll likely add a space-based word segmenter at some not too distant point in the future, which should make it easier.

Couldn't you just do what you describe here?

Anyways, I've been working at getting this set up for my friend who's learning English. I found an English-Japanese dictionary that I was able to add to the dictionary file (see here). I couldn't figure out how to use the WordNet English-English dictionary that was suggested here, until I found this on SourceForge. Looks like someone extracted all the files from WordNet and put them into a MySQL database? Although I couldn't get the MySQL database to work (probably due to my having no knowledge of how to use it), I was able to download the "azdictionary.txt" file, and it seems to be just a huge dictionary in text form. I'm currently working out how to transform it into the same format as cedict.
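
Editor's sketch: a CC-CEDICT line has the layout `TRADITIONAL SIMPLIFIED [reading] /definition 1/definition 2/`. The layout of "azdictionary.txt" is not described in this thread, so the hypothetical helper below starts from already-parsed (headword, definitions) pairs. Repeating the headword in all three slots is an assumption about what CTA will accept for English, not something confirmed here.

```python
def to_cedict_line(headword, definitions):
    """Format one entry as: HEADWORD HEADWORD [HEADWORD] /def 1/def 2/"""
    if any(ch.isspace() for ch in headword):
        # Spaces delimit the fields, so a multi-word headword would be misread.
        raise ValueError(f"headword contains whitespace: {headword!r}")
    # Slashes delimit definitions, so replace any that occur inside one.
    cleaned = [d.strip().replace("/", ";") for d in definitions if d.strip()]
    return f"{headword} {headword} [{headword}] /" + "/".join(cleaned) + "/"


print(to_cedict_line("emotion", ["a strong feeling", "joy/anger"]))
```

The explicit whitespace check makes the multi-word headword problem visible at conversion time instead of producing a silently broken entry.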

So, I have a question. If I append it to the dictionary file, right below the English-Japanese dictionary file, what would happen if I have duplicate entries (one in the En-Jp and one in the En-En)? Will it just use the entry that comes up first (En-Jp)? Or would it be better to delete the duplicates?

Also, I know in the dictionary you can comment lines with #, but is there any "block comment" like in Lua, where you use --[[ at the beginning of any line and ]] after however many lines and it will comment out the whole chunk? I'm just thinking it would be nice to be able to comment out whole dictionaries...

Also, although I know CTA isn't meant for other languages yet, here're a few things I've noticed that should be kept in mind whenever CTA does become multi-lingual:

- It counts “THE”, “the” and “The” all as different words. Presently, I can get around this by using some Excel formulas and basically adding definitions for all words three times, but eventually it would be nice not to have to do this, especially as my known words count is much less meaningful when duplicates count.
- Because the dictionary entry is for “to”, it not only counts “To” differently from “to” but it won’t show the definition for “To”, only for “to”. Again, I work around this with Excel.
- Dictionary entries for words with spaces are a problem (e.g. “The United Nations” would result in “The” being the headword). Not sure how to get around this, as cedict uses spaces as the delimiter...
- Plural words are a problem (emotions vs emotion). I’m sure words with -ed are a problem too, as are conjugations.
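
Editor's sketch: the case and plural issues listed above are usually handled by normalising a word before looking it up. The hypothetical `lookup` below case-folds the word and then tries a few naive English suffix strippings. It assumes lowercase dictionary headwords and is nowhere near a real stemmer or lemmatiser.

```python
def lookup(word, dictionary):
    """Return (headword, definition) for `word`, or None if nothing matches."""
    key = word.casefold()  # "THE", "The" and "the" all become "the"
    candidates = [key]
    # Naive suffix stripping: fine for "emotions", wrong for many other words.
    for suffix in ("es", "s", "ed", "ing"):
        if key.endswith(suffix) and len(key) > len(suffix) + 2:
            candidates.append(key[: -len(suffix)])
    for candidate in candidates:
        if candidate in dictionary:
            return candidate, dictionary[candidate]
    return None


dictionary = {"to": "toward", "the": "definite article", "emotion": "a feeling"}
print(lookup("To", dictionary))
print(lookup("THE", dictionary))
print(lookup("emotions", dictionary))
```

Counting known words by the returned headword rather than by the surface form would also stop “THE”, “the” and “The” from inflating the totals.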

Anyways, just some things to keep in mind...

March 8, 2017 at 02:27 AM

4 hours ago, Yadang said:

Couldn't you just do what you describe here?

No, because a large number of Spanish words have accents so they'll get chopped in two.
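
Editor's sketch: the chopping effect is easy to reproduce if a tokenizer only treats ASCII letters as word characters. Whether the approach linked above works exactly this way is an assumption. The snippet just contrasts an ASCII-only pattern with a Unicode-aware one on a sample Spanish phrase.

```python
import re

text = "La canción es rápida"

# ASCII-only letters: every accented character acts as a separator.
ascii_tokens = re.findall(r"[A-Za-z]+", text)

# Unicode-aware letters: [^\W\d_] is "word character, minus digits and underscore".
unicode_tokens = re.findall(r"[^\W\d_]+", text)

print(ascii_tokens)    # ['La', 'canci', 'n', 'es', 'r', 'pida']
print(unicode_tokens)  # ['La', 'canción', 'es', 'rápida']
```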

5 hours ago, Yadang said:

what would happen if I have duplicate entries (one in the En-Jp and one in the En-En)?

I would expect it to show both entries, but I haven't tested it. Try it and let me know the results :-)

5 hours ago, Yadang said:

Anyways, just some things to keep in mind.

Yep, lots of things to take care of that I could conveniently just ignore when doing the Chinese segmenter.

March 10, 2017 at 01:31 PM

Just a quick update to say that the next release of CTA (hopefully sometime before the end of this month), in addition to fixing the word list problem mentioned above, will also have a new statistics-based Chinese segmenter that will be more accurate and that will also do a reasonable job of guessing names.

March 13, 2017 at 09:20 AM

I noticed that the ce-dict definitions in CTA are based on a 2013 version of the dictionary. Are you likely to update it to a more recent version or should I do it manually?

Edit: actually it's not exactly complex to do manually!

March 13, 2017 at 12:43 PM

Yup, I haven't updated it at all. I'll add it to the upcoming release.

March 18, 2017 at 07:40 AM

Hi Imron, not sure if this is a bug but it caught me out a bit earlier.

1. In the words.u8 file there is (already) the word 戴高帽子.
- the list of words in the file is alphabetical so 戴高帽子 is, let's say, the 30,000th word in the list.
- 戴高帽子 is recognised as a single word by CTA

2. I add a new word to the words.u8 file: 戴高帽 (i.e. without the 子).
- I add this word as well as a couple of others: this means that 戴高帽 is one of the last words in the list.
(- I save the words.u8 file, close and reopen CTA.)

3. I paste text which includes 戴高帽.
- CTA doesn't recognise it as a single word but as three separate characters.

4. I open the words.u8 file again
- I remove 戴高帽子 from its position in the middle of the list and put it near the end, although still before 戴高帽.
(- I save the words.u8 file, close and reopen CTA.)

5. I paste text which includes 戴高帽.
- CTA this time *does* recognise 戴高帽 as a single word, as well as continuing to recognise 戴高帽子 as a single word.

March 18, 2017 at 08:16 AM

That's strange. The order of the words in the file shouldn't matter. I'll look into it. Do you have a copy of the text that you were using?
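
Editor's sketch: CTA's actual segmenter is not described in this thread. The generic forward maximum matching segmenter below only illustrates why file order should not matter: once the word list is loaded into a set, lookups are independent of where a word appeared in the file.

```python
def segment(text, words):
    """Greedy longest-match segmentation; unknown text falls back to single characters."""
    vocab = set(words)  # a set has no notion of file order
    max_len = max((len(w) for w in vocab), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                result.append(piece)
                i += length
                break
    return result


print(segment("他喜欢戴高帽", ["戴高帽子", "戴高帽"]))
print(segment("他喜欢戴高帽", ["戴高帽", "戴高帽子"]))
```

Both calls give the same segmentation, so an order-dependent result like the one reported above points to a loading or indexing problem rather than to the matching step itself.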

March 18, 2017 at 10:30 AM

The text I was pasting into CTA? Nothing specific: just copying the text of my previous 'reply' produces the same result.

Also I initially was going to send you the feedback via the feedback form in CTA (not sure if you have a preference for that or for here) but my text was too long for the CTA window.

March 18, 2017 at 01:37 PM

3 hours ago, realmayo said:

but my text was too long for the CTA window

That's a bug that has already been fixed in the next version.

April 5, 2017 at 08:16 AM

For future builds, a previous button on the search bar would be nice so that I don't get stuck hitting "find" 13 times to get back to the beginning.

April 5, 2017 at 12:11 PM

Reverse search has been on my todo list for ages, but never had high enough priority. I'll bump it up a few notches.

May 22, 2017 at 05:00 AM

Would you be able to add stats for the number of characters in the text that are known and unknown according to the known word list, and what they are? This is useful for heritage speakers, as I can usually sound out a word with known characters and figure out its meaning, but unknown characters are a complete mystery and a priority for study.

May 23, 2017 at 09:27 AM

I plan to add a character-based segmenter at some point which will work on characters rather than words, which would allow you to do most of what you want.
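
Editor's sketch: until such a segmenter exists, rough character statistics can be computed from an exported known-word list. The hypothetical `character_stats` below assumes that every character appearing in a known word counts as known, which is a simplification, and it only considers the basic CJK Unified Ideographs block.

```python
def character_stats(text, known_words):
    """Return (known, unknown) sorted lists of distinct Chinese characters in `text`."""
    # Assumption: any character that appears in a known word is itself known.
    known_chars = {ch for word in known_words for ch in word}
    # Keep only characters in the basic CJK Unified Ideographs block.
    text_chars = {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}
    known = sorted(text_chars & known_chars)
    unknown = sorted(text_chars - known_chars)
    return known, unknown


known, unknown = character_stats("中国银行很大。", ["中国", "银行"])
print(len(known), "known:", "".join(known))
print(len(unknown), "unknown:", "".join(unknown))
```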

July 22, 2017 at 07:44 PM

Is there any difference between importing words via File > Import and just opening the text file in CTA and marking everything as known? Because the words are each on a separate line with File > Import, does this mean CTA will add custom words for the words that aren't already defined?

Because the wordlist view can be navigated with the arrow keys, it would be cool to have a shortcut to mark words as known without using the mouse (Enter? or Ctrl+Enter?). It would also be cool if I could select and mark as known multiple words at once.

July 22, 2017 at 07:50 PM

Also, and this you can add to the very, very bottom of your list (or not at all): it would be nice if the statistics view didn't get reset each time I switch between documents. For example, I have the HSK and TOCFL stats hidden, but they are extended again if I switch documents.

Edit: this appears to happen only in linux - in windows the un-extended stats stay that way when switching between documents

July 23, 2017 at 02:16 PM

18 hours ago, Yadang said:

Is there any difference between importing words via File > Import and just opening the text file in CTA and marking everything as known?

Yes.

If you use file->import then it will treat each line (up to the first whitespace character) as an entire word.

If you open the file and just mark as known, the segmenter might segment words within the line and so those words will be used instead.

For example, if you had the word 中国银行 on a line and the segmenter split it as 中国 and 银行, then with the import method you'd get 中国银行 as a word, and with the open-the-file-and-mark-everything-as-known method you'd get 中国 and 银行 marked as known.
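
Editor's sketch: the import rule described above (each line, up to the first whitespace character, is one whole word) can be written in a few lines. This is an illustration of the rule as stated, not CTA's code, and `import_words` is a made-up name.

```python
def import_words(lines):
    """Take each line up to its first whitespace character as one whole word."""
    words = []
    for line in lines:
        parts = line.split(None, 1)  # split once on any run of whitespace
        if parts:                    # skip blank or whitespace-only lines
            words.append(parts[0])
    return words


print(import_words(["中国银行 Bank of China", "", "   ", "银行\n"]))
```

No segmentation happens here, which is why 中国银行 survives as a single word on import.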

18 hours ago, Yadang said:

it would be cool to have a shortcut to mark words known without the use of a mouse (enter? or ctrl+enter?).

I'll add it to my list of things to do.

18 hours ago, Yadang said:

It would also be cool if I could select and mark as known multiple words at once.

You can with the mouse.

18 hours ago, Yadang said:

this appears to happen only in linux - in windows the un-extended stats stay that way when switching between documents

I noticed this when I was developing the linux version and thought not to worry about it until someone complains about it :-)

July 24, 2017 at 03:50 AM

13 hours ago, imron said:

On 7/22/2017 at 1:44 PM, Yadang said:

It would also be cool if I could select and mark as known multiple words at once.

You can with the mouse.

You can? Do you mean in the wordlist view or just selecting words in the text? Because I've tried marking multiple words as known in the wordlist view, but only the one I actually click gets marked known (windows & linux).

Posted by, in thread order: mlescano, imron, mlescano, Yadang, imron, imron, realmayo (guest), imron, realmayo (guest), imron, realmayo (guest), imron, 艾墨本, imron, Shalmanese, imron, Yadang, Yadang, imron, Yadang