Chinese-Forums

Creating lists of unknown Hanzi?



Posted (edited)

I have a spreadsheet with all the Hanzi I have learned so far. I have gathered some material (texts and vocabulary) that I want to learn. Is there a way to check every item in the to-be-learned Chinese text against my Hanzi spreadsheet and automatically delete the Hanzi that are already in it? (This would leave a list of the Hanzi I have not learned so far.)

Any ideas - has this been done before?

Edited by HerrPetersen
Posted

HP to HP

I haven't done precisely this but it would be fairly easy if you are happy using VBA.

I would do it as follows

(1) Read in the known hanzi into a collection, using the character as the index

(2) Read through your text and check each character against the collection to see whether it exists. (If a lookup against your reference collection returns an error, then it is a new character.)
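For anyone who would rather see the two steps in code, here is a minimal Python sketch of the same idea (not HedgePig's VBA). The function name `unknown_hanzi` and the sample strings are made up for illustration, and a `set` stands in for the VBA collection. It skips whitespace only, so punctuation and Latin letters would still be reported as "unknown".

```python
def unknown_hanzi(text, known):
    """Return the characters of `text` that are not in `known`,
    each listed once, in order of first appearance."""
    known_set = set(known)      # step (1): load the known characters
    seen = set()
    result = []
    for ch in text:             # step (2): check every character of the text
        if ch.isspace():
            continue            # ignore spaces and line breaks
        if ch not in known_set and ch not in seen:
            seen.add(ch)
            result.append(ch)
    return result


# Hypothetical sample data
known = "这是一个据自"
text = "这不是一个句子"
print(unknown_hanzi(text, known))  # ['不', '句', '子']
```

A set lookup takes roughly constant time per character, so this should stay fast even for a reference list of several thousand characters.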

If you are not comfortable doing this and want to send me your spreadsheet, I may get a chance to look at it (no promises).

Regards

HedgePig

Posted

This is a python script I have lying around for doing exactly that. You'll need a python interpreter, or you'll have to rewrite it in VB or whatever excel uses.

It's not a quick algorithm, and it doesn't filter out duplicates (I use it for filtering character frequency lists, which have no duplicates), but it could be a starting point.

findmissing.zip

Posted

In all the years I've wondered about an Excel function to do this, I just never bothered to look it up or figure it out. I'm just trying it now, and this seems to work:

- Your established reference list is in a column, say A1 through A500

- Your tentative list is in another column, say C1 through C25

- In the cell next to C1 (e.g., B1), enter "=VLOOKUP(C1,A$1:A$500,1,FALSE)". The "$" is important for the next step to work

- highlight the cells B1 through B25, and type Control-D, or Edit->Fill->Down

- You should see "#N/A" for any cell not in the master list

- Select columns B and C (and any other associated columns), and sort by column B.

- Delete all the cells in column C that do not have "#N/A" next to them; what remains is the list of entries not in the master list

Note: IANAEW (I am not an Excel wizard)
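For comparison, the same lookup-and-filter idea can be sketched in Python. This is only an illustration: the two lists are hypothetical stand-ins for columns A and C, and `None` plays the role of "#N/A". It keeps the entries with no match, since those are the ones not in the master list.

```python
# Hypothetical sample data standing in for column A (master list)
# and column C (tentative list).
master = ["这", "是", "一", "个"]
tentative = ["这", "不", "一", "句"]

master_set = set(master)

# Column B analogue: the matched value, or None where VLOOKUP would show #N/A.
lookup = [(item if item in master_set else None, item) for item in tentative]

# Keep only the entries without a match, i.e. those not in the master list.
not_in_master = [item for match, item in lookup if match is None]
print(not_in_master)  # ['不', '句']
```

Unlike the spreadsheet version, no sorting step is needed, because the filter keeps the original order of the tentative list.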

Posted (edited)

I think this spreadsheet should do the trick, but I haven't checked it thoroughly, so no guarantees! Instructions are in the notes tab.

Regards

HedgePig

Edited by HedgePig
Posted (edited)

Wow - thanks for the plentiful replies. I will check out HedgePig's spreadsheet once I am at university. (My home computer runs OpenOffice).

I have very little programming experience (unless I am talking to some pretty good programmers - then it's not "little" but more like "none"), so I will first check out the Excel-sheet before trying my luck with python or VBA.

Edit: @HedgePig - Great job with the file! It works like magic. I also tried to open it with OpenOffice. While there is a button for "Analyse", nothing happens when it is pressed - so unfortunately Microsoft is the way to go here.

Edit2: There is a very minor thing that is strange: I put in a list of roughly 2500 hanzi.

I checked the list of those 2500 hanzi against the list of the 2500 hanzi itself. It did not produce an output of "2500 known hanzi", but rather an output of 2499 known hanzi and 1 unknown hanzi: Now what is so special about 钱?

Not that this takes anything away from the program - it is just a little strange. If interested, here is the file:

Chinese Char Analyser V01test.xls

Edited by HerrPetersen
Posted

Hello H P

What's so special about 钱? Well, as they say, money changes everything :-)

In this case, your reference list includes a space after the 钱 character, so the program is checking to see whether "钱 " is a known character, not "钱".

There is also a space in your "source" list, but this doesn't matter, as any Western characters, punctuation, spaces, etc. are ignored (it's actually a little cruder than this, but it essentially works like this).

I guess I should change the program so that it only picks up the first character in the reference list, or at least pops up a warning or something. I might try that later.
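The fix described above (take only the first character of each reference entry, or at least warn) might look something like this in Python. This is a sketch, not the spreadsheet's actual VBA: the function name `clean_reference` and the sample entries are invented, and it assumes one character per reference cell.

```python
def clean_reference(entries):
    """Strip whitespace from each reference entry.

    Returns (cleaned, warnings). Assumes one character per entry:
    longer entries are cut to their first character and reported.
    """
    cleaned = []
    warnings = []
    for row, raw in enumerate(entries, start=1):
        entry = raw.strip()          # removes the stray space after 钱
        if not entry:
            continue                 # skip blank cells
        if len(entry) > 1:
            warnings.append(
                f"row {row}: {raw!r} has more than one character, using {entry[0]!r}"
            )
            entry = entry[0]
        cleaned.append(entry)
    return cleaned, warnings


# Hypothetical reference column, including the "钱 " case from this thread
cleaned, warnings = clean_reference(["钱 ", "这", "", "是的"])
print(cleaned)  # ['钱', '这', '是']
for message in warnings:
    print(message)
```

With the reference list cleaned this way, "钱 " and "钱" compare as the same character, so checking a list against itself should report no unknowns.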

Regards

HP

P.S. Glad you find it useful.

Posted

Hi H P,

Damn, it was too strange to be just a random bug - deleting the space fixed it. Yea, I like it a lot.

Cheers,

HP

  • 1 month later...
Posted

If you're on a *nix system, this may work, too:

me@you:/tmp$ echo "这
> 是
> 一
> 个
> 据
> 自" > known_chars

me@you:/tmp$ echo "这
> 不
> 十
> 一
> 个
> 句
> 资" > new_chars

me@you:/tmp$ grep -v -f known_chars new_chars

Or, if you have a ruby installation:

me@you:/tmp$ ruby -e 'f1=IO.readlines("new_chars");f2=IO.readlines("known_chars");puts (f1-f2).join()'
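Since Python came up earlier in the thread, here is a rough equivalent of the two one-liners above. It is a sketch under a few assumptions: one entry per line, UTF-8 files, and helper names (`read_entries`, `new_entries`) that are made up here. It strips whitespace from each line, so a trailing space or a missing final newline should not cause a mismatch (Ruby's `IO.readlines` keeps the line endings, so it compares them too).

```python
import tempfile
from pathlib import Path


def read_entries(path):
    """Read one entry per line, dropping blank lines and surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def new_entries(new_path, known_path):
    """Return the entries of new_path that do not appear in known_path."""
    known = set(read_entries(known_path))
    return [entry for entry in read_entries(new_path) if entry not in known]


# Demo with the same sample data as above, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    known_file = Path(tmp) / "known_chars"
    new_file = Path(tmp) / "new_chars"
    known_file.write_text("这\n是\n一\n个\n据\n自\n", encoding="utf-8")
    new_file.write_text("这\n不\n十\n一\n个\n句\n资\n", encoding="utf-8")
    result = new_entries(new_file, known_file)

print(result)  # ['不', '十', '句', '资']
```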

  • 9 months later...
Posted

I have a list of all the Chinese characters I've studied so far. I would like to find a way to pick out characters that are not on that list. For example, when I copy some text, I would like to see which characters are new to me, that is, the ones that have no match on my list.

I've been using Wakan for quite some time and it has done an excellent job. The problem is that the number of characters on my list has grown to a little over 4000, and Wakan's support for Chinese characters is pretty poor. For example, of the 20 newest characters that I've added to my list, it recognizes only 13, which is quite a poor ratio.

So, does anyone know of any software that can help me with that? Any help would be greatly appreciated. :)

Posted

Look here and here.

Do you really want characters, or do you want words? The first focuses more on characters and will only show you new characters; the second will parse a text into words (with some degree of accuracy....) and, for the new words, get the pinyin and the definition.

Both are Linux-based.
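To give a feel for what "parsing a text into words" involves, here is a toy Python sketch using greedy forward maximum matching. It is not how either linked tool works, just one simple and fairly crude technique. The word list, the known-word set and the function name `segment` are hypothetical, and the pinyin/definition lookup is left out.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in `lexicon`; fall back to a single character if nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy data
lexicon = {"我们", "喜欢", "学习", "中文"}
known_words = {"我们", "学习"}

words = segment("我们喜欢学习中文", lexicon)
new_words = [w for w in words if w not in known_words]
print(words)      # ['我们', '喜欢', '学习', '中文']
print(new_words)  # ['喜欢', '中文']
```

Greedy matching picks the wrong split on ambiguous text, which is one reason word-level tools are only accurate to a degree.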

Posted (edited)

A couple of questions for clarification:

1. So you know 4,000 characters now? Are you using an SRS like Anki to help you remember such a large number? Or are you reading a lot to maintain your level?

2. If you do indeed know 4,000 characters, I would strongly advise concentrating on words, not characters. There are multiple threads in the "General Study" forum on this, this one, for instance.

EDIT: oh jbradfor beat me to it :mrgreen:

Edited by chrix
Posted

Many thanks to both of you for responding.

Jbradfor, I'm very grateful to you. I was asking for characters, but I think the second one might actually be even more useful, if I get it to work, of course. I'm gonna try those as soon as I get a chance. I don't use Linux, but I suppose it's OK if I just run that code in Python under Windows (I don't know how that sounds; I'm pretty illiterate as far as both coding and Linux go, but I can certainly manage installing Python and testing the code).

Chrix, you guessed right: I'm using Anki to study characters/words, and I also do a lot of reading. And yeah, I've realized that there's not much use in focusing on characters; with over 4000 I seldom encounter any unfamiliar ones unless I'm deliberately looking for them. It's been four and a half years since I started studying Chinese, but over the last year or so I've added probably less than 15% of those 4k, so the curve is a lot flatter now...

My intention is to use this kind of software mostly for statistical purposes. For example, if I open a web page, a short story, a novel, etc., I'd like to be able to find out how many unknown characters there are in it.
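A statistic like that can be sketched in a few lines of Python. The names here (`is_hanzi`, `unknown_stats`) and the sample text are invented for illustration, and the check only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is an assumption that leaves out rarer characters in the extension blocks.

```python
from collections import Counter


def is_hanzi(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as Hanzi.
    return "\u4e00" <= ch <= "\u9fff"


def unknown_stats(text, known):
    """Count Hanzi in `text` and report how many are not in `known`."""
    known_set = set(known)
    counts = Counter(ch for ch in text if is_hanzi(ch))
    unknown = {ch: n for ch, n in counts.items() if ch not in known_set}
    return {
        "total_hanzi": sum(counts.values()),          # every Hanzi occurrence
        "distinct_hanzi": len(counts),                # different Hanzi in the text
        "distinct_unknown": len(unknown),             # different unknown Hanzi
        "unknown_occurrences": sum(unknown.values()),
        "unknown_counts": unknown,                    # occurrences per unknown Hanzi
    }


# Hypothetical sample: punctuation and Latin letters are ignored
stats = unknown_stats("钱, 钱, 钱! Money 不是一切。", known="是一")
print(stats["distinct_unknown"], "unknown characters,",
      stats["unknown_occurrences"], "of", stats["total_hanzi"], "occurrences")
```

Text copied from a web page or an e-book could be passed in the same way, with the known list read from a file or spreadsheet export.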

Posted

@Ednorog, the second one is horribly kludged together, mixing python and gawk -- and I'm allowed to say that, as I built it :wall

If you really like it, with a bit of work I could move it all into python, if that helps. I've gotten no requests yet, so it's not done.

Posted

Ok, that .xls file by HedgePig was everything I asked for plus more (the number of instances of each character in the sample was a huuuge bonus)! :)

So, thanks a million to both of you HPs! :)

As far as the python scripts are concerned, I got totally nowhere so far and I've pretty much given up since I apparently need to know a lot more about coding than I presently do (which is not very far from zero, actually).

  • 2 months later...
