Creating lists of unknown Hanzi?

February 12, 2009 at 12:07 AM

I have a spreadsheet with all the Hanzi I have learned so far. I have gathered some material (text and vocabs) which I want to learn. Is there a possibility to check all items in the to-be-learned-Chinese text for Hanzi that are in my Hanzi-spreadsheet and automatically delete those? (This would result in a list of the Hanzi I have not learned so far).

Any ideas - has this been done before?

Edited February 12, 2009 at 05:49 PM by HerrPetersen

February 12, 2009 at 01:28 AM

HP to HP

I haven't done precisely this but it would be fairly easy if you are happy using VBA.

I would do it as follows

(1) Read in the known hanzi into a collection, using the character as the index

(2) Read through your text and check each character against the collection to see if it exists. (If a lookup against your reference collection returns an error, then it is a new character, if that makes sense.)
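The two steps above can be sketched in Python rather than VBA. This is only an illustration of the idea, not HedgePig's actual macro: the function name `unknown_chars` and the sample strings are made up, and a Python set stands in for the VBA collection, with a failed membership test playing the role of the lookup error.

```python
def unknown_chars(text, known_hanzi):
    """Return the characters of `text` that are not in `known_hanzi`."""
    # Step 1: read the known hanzi into a collection keyed by character.
    known = set(known_hanzi)
    # Step 2: walk through the text; whitespace is skipped, and anything
    # missing from the collection counts as a new character.
    return [ch for ch in text if not ch.isspace() and ch not in known]


print(unknown_chars("这不是一个句子", "这是一个"))  # prints ['不', '句', '子']
```

Punctuation and repeated characters are not filtered in this sketch; both would need extra handling for real texts.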

If you are not comfortable doing this and want to send me your spreadsheet, I may get a chance to look at it (no promises).

Regards

HedgePig

February 12, 2009 at 01:29 AM

This is a python script I have lying around for doing exactly that. You'll need a python interpreter, or you'll have to rewrite it in VB or whatever excel uses.

It's not a quick algorithm, and it doesn't filter out duplicates (I use it for filtering character frequency lists, which have no duplicates), but it could be a starting point.

findmissing.zip
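The attached script is not reproduced in the thread, so the following is not its code. It is a rough sketch of how the two limitations mentioned above (speed and duplicates) might be handled, assuming both lists fit in memory. The name `missing_unique` is hypothetical.

```python
def missing_unique(candidates, known):
    """Return candidates that are not in `known`, without duplicates,
    in order of first appearance."""
    # A set makes each membership test roughly constant time, instead of
    # scanning the whole known list for every candidate.
    known_set = set(known)
    # dict.fromkeys drops repeats while preserving insertion order
    # (guaranteed for dicts since Python 3.7).
    return list(dict.fromkeys(ch for ch in candidates if ch not in known_set))
```

For example, `missing_unique("不十不句十", "十")` would give `['不', '句']`.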

February 12, 2009 at 03:17 AM

In all the years I've wondered about an Excel function to do this, I just never bothered to look it up or figure it out. I'm just trying it now, and this seems to work:

- Your established reference list is in a column, say A1 through A500

- Your tentative list is in another column, say C1 through C25

- In the cell next to C1 (e.g., B1), enter "=VLOOKUP(C1,A$1:A$500,1,FALSE)". The "$" is important for the next step to work

- Highlight the cells B1 through B25, and type Control-D, or Edit->Fill->Down

- You should see "#N/A" for any cell not in the master list

- Select columns B and C (and any others associated), and sort by column B.

- Delete all the cells in column C which have "#N/A" next to them

Note: IANAEW (I am not an Excel wizard)

February 12, 2009 at 03:31 AM

You'll need a python interpreter,

Python can be downloaded here.

February 12, 2009 at 05:50 AM

I think this spreadsheet should do the trick, but I haven't checked it thoroughly, so no guarantees! Instructions are in the notes tab.

Regards

HedgePig

Edited February 12, 2009 at 09:39 AM by HedgePig

February 12, 2009 at 10:45 AM

Wow - thanks for the plentiful replies. I will check out HedgePig's spreadsheet once I am at university. (My home computer runs OpenOffice).

I have very little programming experience (unless I am talking to some pretty good programmers - then it's not "little" but more like "none"), so I will first check out the Excel-sheet before trying my luck with python or VBA.

Edit: @HedgePig - Great job with the file! It works like magic. I also tried to open it with OpenOffice. While there is a button for "Analyse", nothing happens when it is pressed - so unfortunately Microsoft is the way to go here.

Edit2: There is a very minor thing that is strange: I put in a list of roughly 2500 hanzi.

I checked the list of those 2500 hanzi against the list of the 2500 hanzi itself. It did not produce an output of "2500 known hanzi", but rather an output of 2499 known hanzi and 1 unknown hanzi: Now what is so special about 钱?

Not that this takes anything away from the program - it is just a little strange. If interested, here is the file:

Chinese Char Analyser V01test.xls

Edited February 13, 2009 at 12:57 AM by HerrPetersen

February 13, 2009 at 03:07 AM

Hello H P

What's so special about 钱? Well, as they say, money changes everything :-)

In this case, your reference list includes a space after the 钱 character, so the program is checking to see whether "钱 " is a known character, not "钱"

There is also a space in your "source" list, but this doesn't matter as any Western characters, punctuation, spaces, etc. are ignored (it's actually a little cruder than this, but it essentially works like this).

I guess I should change the program so that it only picks up the first character in the reference list, or at least pops up a warning or something. I might try that later.
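The fix described here (keep only the first character of each reference entry, or at least warn) could look roughly like this in Python. This is a sketch of the idea only, not the spreadsheet's VBA; `clean_reference` is a made-up name, and it assumes the reference list has already been read into a list of strings, one per cell.

```python
import warnings


def clean_reference(entries):
    """Build the set of known characters from raw reference-list cells."""
    known = set()
    for entry in entries:
        # Remove stray spaces around the character, e.g. "钱 " -> "钱".
        stripped = entry.strip()
        if not stripped:
            continue  # empty cell, nothing to add
        if len(stripped) > 1:
            # More than one character left: keep the first and say so.
            warnings.warn(
                f"entry {entry!r} has more than one character; "
                f"using {stripped[0]!r}"
            )
        known.add(stripped[0])
    return known
```

With this, a cell containing "钱 " ends up in the known set as "钱", so checking the list against itself would no longer report it as unknown.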

Regards

HP

P.S. Glad you find it useful.

February 14, 2009 at 12:00 AM

Hi H P,

Damn, it was too strange to be just a random bug - deleting the space fixed it. Yea, I like it a lot.

Cheers,

HP

April 8, 2009 at 08:49 AM

If you're on a *nix system, this may work, too:

me@you:/tmp$ echo "这
> 是
> 一
> 个
> 据
> 自" > known_chars
me@you:/tmp$ echo "这
> 不
> 十
> 一
> 个
> 句
> 资" > new_chars
me@you:/tmp$ grep -v -f known_chars new_chars
不
十
句
资

Or, if you have a ruby installation:

me@you:/tmp$ ruby -e 'f1=IO.readlines("new_chars");f2=IO.readlines("known_chars");puts (f1-f2).join()'

不

十

句

资
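For anyone without grep or ruby, roughly the same comparison can be written in Python. This is a sketch under the same assumption as the commands above: both files hold one character per line and are UTF-8 encoded. The function name `new_lines` is made up.

```python
from pathlib import Path


def new_lines(new_path, known_path):
    """Return the entries of the file at new_path that do not appear in
    the file at known_path, keeping their original order."""
    # split() breaks on any whitespace, so blank lines and trailing
    # newlines are ignored.
    known = set(Path(known_path).read_text(encoding="utf-8").split())
    new = Path(new_path).read_text(encoding="utf-8").split()
    return [item for item in new if item not in known]


# Usage with the files created above:
# print("\n".join(new_lines("new_chars", "known_chars")))
```

Unlike `grep -f`, which treats each line of known_chars as a pattern, this compares whole entries exactly.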

January 26, 2010 at 07:21 AM

I have a list of all Chinese characters I've studied so far. I would like to find a way to sort out characters that are not on that list. That is, for example, when I copy some text, I would like to be able to see the characters that are new to me, that is, the ones that have no match on my list.

I've been using Wakan for quite some time and it has done an excellent job. The problem is that the number of characters on my list has grown to a little over 4000, and Wakan's support for Chinese characters is pretty poor. For example, of the 20 newest characters that I've added to my list, it only recognizes 13, which is quite a poor ratio.

So, does anyone know any software that can help me on that? Any help would be greatly appreciated.

January 27, 2010 at 03:16 PM

Look here and here.

Do you really want characters, or do you want words? The first focuses more on characters and will only show you new characters, the second will parse a text into words (with some degree of accuracy....) and, for the new words, get the pinyin and the definition.

Both are linux based.
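To illustrate the difference between working with characters and working with words: splitting Chinese text into words needs a dictionary and some matching rule. The sketch below uses greedy forward maximum matching, a common simple approach, which is not necessarily what the linked tool does. The names `segment` and `unknown_words` and the three-entry dictionary are made up for the example.

```python
def segment(text, dictionary, max_len=4):
    """Split text into words by always taking the longest dictionary
    match at the current position, falling back to one character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def unknown_words(text, dictionary, known_words):
    """Return (word, pinyin, definition) for each dictionary word in the
    text that is not in known_words, without repeats."""
    found = dict.fromkeys(segment(text, dictionary))
    return [
        (word, *dictionary[word])
        for word in found
        if word in dictionary and word not in known_words
    ]


# Tiny illustrative dictionary: word -> (pinyin, definition)
sample_dict = {
    "我们": ("wǒmen", "we"),
    "学习": ("xuéxí", "to study"),
    "中文": ("Zhōngwén", "Chinese language"),
}
print(unknown_words("我们学习中文", sample_dict, {"我们"}))
```

Greedy matching gets some segmentations wrong, which is one reason such tools only reach "some degree of accuracy".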

January 27, 2010 at 03:41 PM

A couple of questions for clarification:

1. So you know 4,000 characters now? Are you using an SRS like anki to help you remember this large amount? Or are you reading a lot to maintain your level?

2. If you indeed know 4,000 characters, I would strongly advise concentrating on words, not characters. There are multiple threads in the "General Study" forum on this, this one, for instance.

EDIT: oh jbradfor beat me to it

Edited January 27, 2010 at 04:03 PM by chrix

January 27, 2010 at 04:18 PM

Many thanks to both of you for responding.

Jbradfor, I'm indeed very grateful to you. I was asking for characters, but I think the second one might actually be even more useful - if I get it to work, of course. I'm gonna try those as soon as I get a chance. I don't use linux, but I suppose it's ok if I just use those codes in python under windows (don't know how that sounds, I'm pretty illiterate as far as both coding and linux go... but I sure can do as much as installing python and testing the codes)

Chrix, you guessed right, I'm using anki to study characters/words, and I also do a lot of reading. And yeah, I've realized that there's not much use focusing on characters: with over 4000 I seldom encounter any unfamiliar ones unless I'm deliberately looking for them. It's been four and a half years since I started studying Chinese, but over the last year or so I've added probably less than 15% of those 4k, so the curve is a lot flatter now...

My intention is to use this kind of software mostly for statistical purposes. For example, if I open a web page or a short story or a novel etc., I'd be able to find out how many unknown characters there are in it.
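A count like that can be sketched in a few lines of Python. This is an illustration, not any of the tools mentioned in the thread. The names `is_hanzi` and `unknown_stats` are made up, and the hanzi test only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer characters from the extension blocks would be ignored.

```python
from collections import Counter


def is_hanzi(ch):
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def unknown_stats(text, known):
    """Summarise how many hanzi in text are missing from known."""
    known = set(known)
    # Count every hanzi in the text; punctuation, Latin letters and
    # whitespace are skipped.
    counts = Counter(ch for ch in text if is_hanzi(ch))
    unknown = Counter({ch: n for ch, n in counts.items() if ch not in known})
    return {
        "total_hanzi": sum(counts.values()),
        "distinct_hanzi": len(counts),
        "distinct_unknown": len(unknown),
        "unknown_occurrences": sum(unknown.values()),
        # (character, count) pairs, most frequent first
        "most_common_unknown": unknown.most_common(),
    }
```

For the text "我爱钱,钱爱我吗?" with 我 and 爱 known, this reports 7 hanzi in total, 4 distinct, and 2 distinct unknown characters (钱 twice, 吗 once).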

January 27, 2010 at 04:51 PM

@Ednorog, the second one is horribly kludged together, mixing python and gawk -- and I'm allowed to say that, as I built it.

If you really like it, with a bit of work I could move it all into python, if that helps. I've gotten no requests yet, so it's not done.

January 28, 2010 at 08:32 AM

Did you check out this page?

http://www.chinese-forums.com/index.php?/topic/20159-creating-lists-of-unknown-hanzi

January 28, 2010 at 10:37 AM

Merged.

January 28, 2010 at 04:10 PM

Ok, that .xls file by HedgePig was everything I asked plus more (the number of instances of each character in the sample was a huuuge bonus)!

So, thanks a million, to both of you, HP's!

As far as the python scripts are concerned, I got totally nowhere so far and I've pretty much given up since I apparently need to know a lot more about coding than I presently do (which is not very far from zero, actually).

March 31, 2010 at 07:00 PM

Seems I made something similar. See here: http://www.chinese-forums.com/showthread.php?p=228844


Posts above, in order, are by: HerrPetersen, HedgePig, renzhe, c_redman, imron, HedgePig, HerrPetersen, HedgePig, HerrPetersen, m_k_e, Ednorog, jbradfor, chrix, Ednorog, jbradfor, HerrPetersen, chrix, Ednorog, phyrex
