Chinese-Forums

Introducing Chinese Text Analyser


imron


Crazy question: Is there something similar for Spanish? Or could I maybe use it with Spanish as-is?

I need some way to input a wordlist into some text editor (or other software) and check, preferably in real time, that I'm not writing too many words outside of the list. It's just that a few days ago I had a crazy idea, and a tool like this would make it a lot easier to implement.

 

Something like xkcd's "simple writer", but with the ability to load any word list.
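A rough sketch of the kind of check being asked for, assuming the word list is just a collection of plain words. The function name `words_outside_list` and the sample list are made up for illustration; this is not something CTA provides.

```python
import re

def words_outside_list(text, allowed_words):
    """Return words in `text` that are not in `allowed_words`, in order of first appearance."""
    allowed = {w.casefold() for w in allowed_words}
    seen = set()
    outside = []
    # [^\W\d_]+ matches runs of Unicode letters, so accented words stay whole.
    for token in re.findall(r"[^\W\d_]+", text):
        key = token.casefold()  # compare case-insensitively
        if key not in allowed and key not in seen:
            seen.add(key)
            outside.append(token)
    return outside

allowed = ["el", "gato", "come", "la", "casa"]  # hypothetical word list
print(words_outside_list("El gato come pescado en la casa.", allowed))
# ['pescado', 'en']
```

An editor plugin could call something like this on each save or keystroke and highlight the returned words.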

Screen Shot 2017-03-01 at 11.06.05 PM.png

Edited by mlescano
added graphic

That is actually an eventual goal of Chinese Text Analyser (or a companion product).  To make it easier for people to create graded content.

 

CTA unfortunately doesn't work very well with Spanish.  I'll likely add a space-based word segmenter at some not too distant point in the future, which should make it easier.


9 hours ago, imron said:

To make it easier for people to create graded content.

 

You just read my mind. 

 

9 hours ago, imron said:

I'll likely add a space-based word segmenter at some not too distant point in the future, which should make it easier.

 

That would be fantastic!


On 3/1/2017 at 9:51 PM, imron said:

CTA unfortunately doesn't work very well with Spanish.  I'll likely add a space-based word segmenter at some not too distant point in the future, which should make it easier.

 Couldn't you just do what you describe here?

 

Anyways, I've been working at getting this set up for my friend who's learning English. I found an English-Japanese dictionary that I was able to add to the dictionary file (see here). I couldn't figure out how to use the WordNet English-English dictionary that was suggested here, until I found this on SourceForge. Looks like someone extracted all the files from WordNet and put them into a MySQL database? Although I couldn't get the MySQL part to work (probably because I have no knowledge of how to use it), I was able to download the "azdictionary.txt", and it seems to be just a huge dictionary in text form. I'm currently working out how to transform it into the same format as cedict.

 

So, I have a question. If I append it to the dictionary file, right below the English-Japanese dictionary file, what would happen if I have duplicate entries (one in the En-Jp and one in the En-En)? Will it just use the entry that comes up first (En-Jp)? Or would it be better to delete the duplicates?
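One way to see how many duplicates there are before deciding is to scan the combined file for repeated headwords. This is only a sketch: it assumes cedict-style lines where the headword is the first whitespace-delimited field and `#` starts a comment line. The function name and the sample entries are invented for illustration.

```python
from collections import defaultdict

def find_duplicate_headwords(lines):
    """Map each headword that appears more than once to the line numbers where it occurs."""
    positions = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and # comment lines
        headword = stripped.split()[0]  # first whitespace-delimited field
        positions[headword].append(lineno)
    return {hw: nums for hw, nums in positions.items() if len(nums) > 1}

# Hypothetical combined dictionary: En-Jp entries followed by En-En entries.
sample = [
    "# En-Jp dictionary",
    "cat cat [cat] /猫/",
    "dog dog [dog] /犬/",
    "# En-En dictionary",
    "cat cat [cat] /a small domesticated feline/",
]
print(find_duplicate_headwords(sample))  # {'cat': [2, 5]}
```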

 

Also, I know in the dictionary you can comment lines with #, but is there any "block comment" like in Lua, where you use --[[ at the beginning of any line and ]] after however many lines and it will comment out the whole chunk? I'm just thinking it would be nice to be able to comment out whole dictionaries...

 

Also, although I know CTA isn't meant for other languages yet, here're a few things I've noticed that should be kept in mind whenever CTA does become multi-lingual:

 

  • It counts “THE”, “the” and “The” all as different words. Presently, I can get around this by using some Excel formulas and basically adding definitions for all words three times, but eventually it would be nice not to have to do this, especially as my known words count is much less meaningful when duplicates count.

  • Because the dictionary entry is for “to”, it not only counts “To” differently from “to” but it won’t show the definition for “To”, only for “to”. Again, I work around this with Excel.

  • Dictionary entries for words with spaces are a problem (ex. The United Nations would result in “The” being the headword.) - not sure how to get around this as cedict uses spaces as the delimiter...

  • Plural words are a problem (emotions vs emotion). I’m sure words with -ed are a problem too, as are conjugations.
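To illustrate the case and plural points above, here is a minimal lookup sketch that falls back from the exact form to a case-folded form and then to crudely de-pluralised forms. It is not how CTA works; `lookup` and the tiny dictionary are hypothetical, and the suffix stripping is deliberately naive (a real tool would need a proper lemmatiser for -ed forms and conjugations).

```python
def lookup(word, dictionary):
    """Try the exact form, then the case-folded form, then crude singular forms."""
    lowered = word.casefold()
    candidates = [word, lowered]
    # Naive English plural handling: "boxes" -> "box", "emotions" -> "emotion".
    if lowered.endswith("es"):
        candidates.append(lowered[:-2])
    if lowered.endswith("s"):
        candidates.append(lowered[:-1])
    for candidate in candidates:
        if candidate in dictionary:
            return candidate, dictionary[candidate]
    return None  # no form of the word was found

dictionary = {"the": "definite article", "to": "preposition", "emotion": "a feeling"}
print(lookup("The", dictionary))       # ('the', 'definite article')
print(lookup("emotions", dictionary))  # ('emotion', 'a feeling')
```

With a lookup like this, “THE”, “the” and “The” would all resolve to one headword, so the known-word count would not be inflated by case variants.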

 

Anyways, just some things to keep in mind...


4 hours ago, Yadang said:

Couldn't you just do what you describe here?

No, because a large number of Spanish words have accents so they'll get chopped in two.
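A small illustration of the problem, assuming the chopping comes from a segmenter that only treats unaccented Latin letters as word characters (that cause is an assumption here, and both function names are made up). The second function shows one Unicode-aware alternative.

```python
import re
import unicodedata

def segment_ascii(text):
    # Only A-Z/a-z count as word characters, so accented letters act as separators.
    return re.findall(r"[A-Za-z]+", text)

def segment_unicode(text):
    # Normalise to NFC so a letter plus combining accent becomes a single code point,
    # then match runs of Unicode letters.
    text = unicodedata.normalize("NFC", text)
    return re.findall(r"[^\W\d_]+", text)

sample = "La canción es fácil"
print(segment_ascii(sample))    # ['La', 'canci', 'n', 'es', 'f', 'cil']
print(segment_unicode(sample))  # ['La', 'canción', 'es', 'fácil']
```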

 

5 hours ago, Yadang said:

what would happen if I have duplicate entries (one in the En-Jp and one in the En-En)?

I would expect it to show both entries, but I haven't tested it.  Try it and let me know the results :-)

 

5 hours ago, Yadang said:

Anyways, just some things to keep in mind.

Yep, lots of things to take care of that I could conveniently just ignore when doing the Chinese segmenter.

 


Hi Imron, not sure if this is a bug but it caught me out a bit earlier.

1. In the words.u8 file there is (already) the word 戴高帽子.
- the list of words in the file is alphabetical so 戴高帽子 is, let's say, the 30,000th word in the list.
- 戴高帽子 is recognised as a single word by CTA

 

2. I add a new word to the words.u8 file: 戴高帽 (i.e. without the 子).
- I add this word as well as a couple of others: this means that 戴高帽 is one of the last words in the list.
(- I save the words.u8 file, close and reopen CTA.)

 

3. I paste text which includes 戴高帽.
- CTA doesn't recognise it as a single word but as three separate characters.

 

4. I open the words.u8 file again
- I remove 戴高帽子 from its position in the middle of the list and put it near the end, although still before 戴高帽.
(- I save the words.u8 file, close and reopen CTA.)

 

5. I paste text which includes 戴高帽.
- CTA this time *does* recognise 戴高帽 as a single word, as well as continuing to recognise 戴高帽子 as a single word.
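For comparison, here is a generic forward maximum-matching segmenter in which the order of entries in the word list cannot matter, because the list is loaded into a set. This is a textbook sketch, not CTA's actual algorithm (which isn't described in this thread); `segment` and the sample word lists are hypothetical.

```python
def segment(text, words):
    """Greedy forward maximum matching: at each position take the longest known word."""
    vocab = set(words)  # a set has no ordering, so list position is irrelevant
    max_len = max((len(w) for w in vocab), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                result.append(piece)
                i += length
                break
    return result

words_a = ["喜欢", "戴高帽子", "戴高帽"]
words_b = ["戴高帽", "喜欢", "戴高帽子"]  # same entries, different order
print(segment("他喜欢戴高帽", words_a))  # ['他', '喜欢', '戴高帽']
print(segment("他喜欢戴高帽", words_b))  # ['他', '喜欢', '戴高帽']
```

Under this kind of lookup, the behaviour in steps 3 and 5 would be identical, which is why the order dependence described above looks like a bug.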

 

 


The text I was pasting into CTA? Nothing specific: just copying the text of my previous 'reply' produces the same result.

 

Also I initially was going to send you the feedback via the feedback form in CTA (not sure if you have a preference for that or for here) but my text was too long for the CTA window.


  • 3 weeks later...
  • 1 month later...

Would you be able to add stats for the number of characters in the text that are known and unknown according to the known word list, and show what they are? This is useful for heritage speakers, as I can usually sound out a word made of known characters and figure out its meaning, but unknown characters are a complete mystery and a priority for study.
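A sketch of how such character stats could be computed outside CTA, assuming a character counts as "known" if it appears in any known word. The function name is made up, and only the basic CJK Unified Ideographs block is considered.

```python
def character_stats(text, known_words):
    """Return (known, unknown) lists of the distinct Chinese characters in `text`."""
    # A character is treated as known if it occurs in at least one known word.
    known_chars = {ch for word in known_words for ch in word}
    # Keep only characters in the basic CJK Unified Ideographs block;
    # punctuation, digits and Latin text are ignored.
    text_chars = {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}
    known = sorted(text_chars & known_chars)
    unknown = sorted(text_chars - known_chars)
    return known, unknown

known, unknown = character_stats("中国人民银行。", ["中国", "银行"])
print(len(known), len(unknown))  # 4 2
print(unknown)                   # ['人', '民']
```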


  • 1 month later...

Is there any difference between importing words via file>import and just opening the text file in CTA and marking all as known? Because the words are each on a separate line with file>import, does this mean CTA will add custom words for the words that aren't already defined?

 

Because the wordlist view can be navigated with the arrow keys, it would be cool to have a shortcut to mark words known without the use of a mouse (enter? or ctrl+enter?). It would also be cool if I could select and mark as known multiple words at once.


Also, and this you can add to the very very bottom of your list (or not at all): it would be nice if the statistics view didn't get reset each time I switch between documents. For example, I have the HSK and TOCFL stats hidden, but they are expanded again if I switch documents.

 

 

Edit: this appears to happen only in linux - in windows the un-extended stats stay that way when switching between documents


18 hours ago, Yadang said:

Is there any difference between importing words via file>import and just opening the text file in CTA and marking all as known?

Yes.

If you use file->import then it will treat each line (up to the first whitespace character) as an entire word.

If you open the file and just mark as known, the segmenter might segment words within the line and so those words will be used instead.

 

For example, if you had the word 中国银行 on a line and the segmenter split it as 中国   银行, then with the import method you'd get 中国银行 as a word, and with the open-the-file-and-mark-everything-as-known method you'd get 中国 and 银行 marked as known.
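The import rule described above (each line, up to the first whitespace character, is one word) can be sketched like this. It is only an illustration of the rule as stated, not CTA's code; the function name and sample lines are invented.

```python
def words_from_import_file(lines):
    """Take each line's text up to the first whitespace as one whole word."""
    words = []
    for line in lines:
        fields = line.split(None, 1)  # split once on the first run of whitespace
        if fields:                    # skip empty or whitespace-only lines
            words.append(fields[0])
    return words

sample = ["中国银行\tthe Bank of China", "戴高帽子", "", "银行 bank"]
print(words_from_import_file(sample))  # ['中国银行', '戴高帽子', '银行']
```

Note that nothing here segments the line, which is why 中国银行 survives as a single word with the import method.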

 

18 hours ago, Yadang said:

it would be cool to have a shortcut to mark words known without the use of a mouse (enter? or ctrl+enter?).

I'll add it to my list of things to do.

 

18 hours ago, Yadang said:

It would also be cool if I could select and mark as known multiple words at once.

You can with the mouse.

 

18 hours ago, Yadang said:

this appears to happen only in linux - in windows the un-extended stats stay that way when switching between documents

I noticed this when I was developing the linux version and thought not to worry about it until someone complains about it :-)


 

13 hours ago, imron said:
On 7/22/2017 at 1:44 PM, Yadang said:

It would also be cool if I could select and mark as known multiple words at once.

You can with the mouse.

 

You can? Do you mean in the wordlist view or just selecting words in the text? Because I've tried marking multiple words as known in the wordlist view, but only the one I actually click gets marked known (windows & linux). 

