Top character/file parser utility - make your own char lists


Posted (edited)

This is something I wrote myself. I wanted to extract the individual characters from my word lists, and also create top-character lists from large samples of Chinese text.

It's fairly small and uncomplicated, so I hope it's intuitive to use.

Parameters:

- Inc RegEx (Include Regular Expression). Programmers will recognize this name; others don't have to worry. Each character in a file is included in the count if it passes this range test. Set it to [A-Z] to make the application count only English letters. The default works for Chinese, so leave it alone for that purpose.

- Exc RegEx. Similar to the above, but characters matching this one are excluded. It's optional.

- Folder/File. The path to a file or folder. If a file is chosen then only that file is parsed; otherwise the contents of the folder are parsed.

- Search Pattern. A wild card pattern to match against files in folder mode (see above).

- Exclude Char File. A file whose characters will be excluded from this list generation. Only the leftmost character of each line is considered. Comment lines starting with "//" are ignored.

- Recurse Sub-Folders. For folder mode only.

- Ignore Case. Relevant for English (A-Z) only. It treats upper and lower case as equal (it won't count 'A' and 'a' separately).

- Output Tag. This is used to generate the output file name and also the category names for the Pleco file. There's some smart logic in the code to parse out the range when outputting the file. %l and %u are substituted for the range bounds in the Pleco category. Other styles of range specification are also handled, e.g. "-" and "to", with or without spaces.

- Output Encoding. Encoding of both output files. I suggest leaving it as is.

- Output CSV. This will output a file in format "Character,Count,Frequency,SumPercent,CharCode". This is good for analysis purposes (as opposed to flashcard).

- Output Pleco. Outputs a Pleco flash card file (non-xml). Category Step Count defines the size of the category groups, and we can set a maximum number of characters to output with Max Characters.
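For anyone curious how the include/exclude filters and the CSV columns fit together, here is a minimal Python sketch. It is not the tool's actual code (the tool is a .NET executable). The default pattern (the CJK Unified Ideographs block, U+4E00 to U+9FFF), the function names, and the reading of "Frequency" as count divided by total are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed default: the basic CJK Unified Ideographs block. The tool's real
# default pattern is not given in the post.
DEFAULT_INCLUDE = r"[\u4e00-\u9fff]"

def count_chars(text, include=DEFAULT_INCLUDE, exclude=None, ignore_case=False):
    """Count each character that matches `include` and does not match `exclude`."""
    if ignore_case:
        text = text.upper()  # fold 'a' and 'A' into one bucket
    inc = re.compile(include)
    exc = re.compile(exclude) if exclude else None
    counts = Counter()
    for ch in text:
        if inc.fullmatch(ch) and not (exc and exc.fullmatch(ch)):
            counts[ch] += 1
    return counts

def csv_rows(counts):
    """Build rows in the column order Character,Count,Frequency,SumPercent,CharCode."""
    total = sum(counts.values())
    rows = ["Character,Count,Frequency,SumPercent,CharCode"]
    running = 0
    for ch, n in counts.most_common():  # most frequent first
        running += n
        # Frequency is assumed to be count / total; SumPercent is the running share.
        rows.append(f"{ch},{n},{n / total:.4f},{100 * running / total:.2f},{ord(ch)}")
    return rows

if __name__ == "__main__":
    for row in csv_rows(count_chars("我是我, ABC 是我")):
        print(row)
```

Passing `include="[A-Z]"` with `ignore_case=True` would mimic the English-only mode described above.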

For Pleco, when you import a character list like this I think it's best to use the Tuttle dictionary to get the definition, which seems to be their 'character' dictionary (character characteristics?). Note that some, actually many, characters have multiple pronunciations and I can only get Pleco to display one of them on the flash card.

Suggested uses:

- Parse your lesson flashcard files to get individual characters for each. Same for HSK vocab lists.

- Parse the Cedict dictionary, other dictionaries.

- Parse plain text books.

- Parse downloaded HTML files from Chinese sites (beware ads and other repeated content).

- Parse other large samples of written Chinese (see below).

- Dump all of the above into a folder and parse the lot!

(On the last point, I did have it in mind to allow multiple folder sources to each contribute a percentage to the total. Say you figure people read newspapers 20 % of the time; then newspapers should contribute 20 % to a general top character list. I decided that was overkill, for now.)
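The weighted-source idea was never added to the tool, but the arithmetic is simple enough to sketch in Python. This is only one possible interpretation: each source's counts are normalised to relative frequencies first, then blended by weight. The function name and the sample data are made up for illustration.

```python
from collections import Counter

def weighted_frequencies(sources):
    """Blend several sources into one relative-frequency table.

    `sources` is a list of (Counter, weight) pairs. Each source is first
    normalised so that a huge source cannot swamp a small one, then scaled
    by its share of the total weight.
    """
    total_weight = sum(weight for _, weight in sources)
    blended = {}
    for counts, weight in sources:
        n = sum(counts.values())
        if n == 0:
            continue  # an empty source contributes nothing
        for ch, c in counts.items():
            blended[ch] = blended.get(ch, 0.0) + (weight / total_weight) * (c / n)
    return blended

if __name__ == "__main__":
    newspapers = Counter("的的是")  # hypothetical sample, weighted at 20 %
    books = Counter("的我")         # hypothetical sample, weighted at 80 %
    table = weighted_frequencies([(newspapers, 20), (books, 80)])
    for ch, freq in sorted(table.items(), key=lambda kv: -kv[1]):
        print(ch, round(freq, 4))
```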

I've attached the zipped executable which requires .Net framework 3.5. This is the most recent and can be obtained from Microsoft as an update.

Character lists attached...

3575 top characters generated from Mnemosyne's 20,000+ Chinese sentences flash card list. Categories are broken every 500 characters. This seems like a good source for a general written Chinese list, kind of like dictionary sample sentences.

For the above I have attached the Excel generated chart derived from the CSV file which is a chart of Top Chars vs Percent Of All. e.g. the top 1000 characters represent 89 % of all characters in this file.
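The "top N characters cover X % of the text" figure can be reproduced from any count table. A minimal Python sketch, with a toy string standing in for the real data (the function name is hypothetical):

```python
from collections import Counter

def coverage(counts, top_n):
    """Percentage of all counted characters accounted for by the top_n most frequent."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(top_n))
    return 100.0 * top / total

if __name__ == "__main__":
    counts = Counter("aaaabbbccd")  # toy data: 10 characters, 4 distinct
    for n in (1, 2, 4):
        print(n, coverage(counts, n))
```

This is the same running total that the SumPercent column in the CSV file holds, read off at row N.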

The character lists for each HSK vocab level are included. Each level only contains the new characters at that level, so no repeats in any of the files. Grouped in blocks of 200.

HSK level 1: 799 chars.

HSK level 2: 802 chars. (additional)

HSK level 3: 592 chars. (additional)

HSK level 4: 668 chars. (additional)

Total 2861 chars.

Report a bug if you like, cheers.

2009-08-20 (v1.0.1.0)

Added Encoding parameter.

Added numeric character code to CSV output.

Changed some defaults.

2009-08-21 (v1.0.2.0)

Added preset regex strings.

Misc name changes.

2009-09-30 (v1.0.3.0)

Added Exclude Char File list option.

Added HSK character lists.

CharFreq.zip

2719_thumb.attach

Mnemosyne_Chars_csv.txt

Mnemosyne_Chars_pleco.txt

HskL1Chars.txt

HskL2Chars.txt

HskL3Chars.txt

HskL4Chars.txt

Edited by HarryCallahan
Posted

Wow! This is what I have been looking for recently. Unfortunately, it throws an error when trying to parse a file.

Posted

Can you tell me the error? I'd really like to fix it.

Is it this?

"Unable to translate Unicode character uDE00 at index 87 to specified code page."

I get that when I point it to a folder with large binary files; it happens on the output file write. It shouldn't (can't) be used on binary files, so I'm happy to leave that as is.

If you're getting as far as it sounds, then you probably do have the v3.5 Framework, though a missing framework may still be the cause. Some more information would be very helpful.
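Some background on that message: DE00 falls in the low-surrogate range (U+DC00 to U+DFFF), and a surrogate that isn't part of a proper pair can't be written out in a normal encoding. Binary data read as text can easily produce such strays. The Python sketch below shows the same class of failure and one way to guard against it. It is an analogy only, not the tool's .NET code, and the helper names are hypothetical.

```python
def is_encodable(text, encoding="utf-8"):
    """Return True if `text` can be written out in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def drop_unencodable(text, encoding="utf-8"):
    """Remove characters (such as lone surrogates) that `encoding` cannot represent."""
    return "".join(ch for ch in text if is_encodable(ch, encoding))

if __name__ == "__main__":
    stray = "\ude00"  # a low surrogate with no high surrogate before it
    print(is_encodable("中文"))   # ordinary Chinese text encodes fine
    print(is_encodable(stray))    # the lone surrogate does not
    print(drop_unencodable("中" + stray + "文"))
```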

Posted

The error message (I have a non-English version of Windows Vista) says something about not being able to find and load a file or a build of System.Core, Version=3.5.00, Culture=neutral, PublicKeyToken=b77a561934e089 or some other dependent element.

Does it mean that I should install an additional program (framework or something) on my computer?

Posted (edited)

Yeah I'd say you're missing the framework.

Sorry but it's about 200 MB!

This is something Microsoft wants you to have, you can consider it a Windows update.

http://download.microsoft.com/download/6/0/f/60fc5854-3cb8-4892-b6db-bd4f42510f28/dotnetfx35.exe

Edit: Or get the SP1 version. This should be better, as it will save you downloading the updates via Windows Update.

http://download.microsoft.com/download/2/0/e/20e90413-712f-438c-988e-fdaa79a8ac3d/dotnetfx35.exe

Edited by HarryCallahan
Posted

I installed the framework and IT WOOOOOORKS!

THANKS for your parsing utility.

I am often eager to know the character statistics in news articles and other online texts I read. Until now, I have been using a complicated method of copying texts to Word, then breaking the characters one-per-line, saving it as plain text, importing it to a MS Access table and doing some database hacking to get the same result I am now able to achieve in one click!!!

I'm so happy! :clap

Posted

That's good, it is kind of useful and really not too hard to program. Reading files and counting characters isn't rocket science.

Do get the latest version. The Unicode output encoding in particular makes a difference: you can double-click the CSV file and Excel will recognize the Asian characters.

Also, the numeric character code allows you to deduce the lowest and highest characters of a given set of characters you are interested in (sort by it in Excel). Do a parse with no filtering to get everything, then work out the ranges of what you are interested in and do a further filtered parse.
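The "work out the range from the character codes" step can be sketched in a few lines of Python: take the lowest and highest code points in a set of characters and turn them into a range pattern for a second, filtered pass. The function names are hypothetical, and the pattern uses `\uXXXX` escapes as understood by Python's `re` module.

```python
import re

def code_escape(cp):
    """Write a code point as a regex escape (\\uXXXX, or \\UXXXXXXXX beyond the BMP)."""
    return f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"

def range_pattern(chars):
    """Return a character-class pattern spanning the lowest to highest code point."""
    codes = [ord(ch) for ch in chars]
    return f"[{code_escape(min(codes))}-{code_escape(max(codes))}]"

if __name__ == "__main__":
    pattern = range_pattern("中文字")
    print(pattern)
    print(bool(re.fullmatch(pattern, "字")))  # inside the range
    print(bool(re.fullmatch(pattern, "a")))   # outside the range
```

Note that a range built this way also admits every character that happens to lie between the two bounds, so it is a coarse filter rather than an exact one.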

  • 1 month later...
Posted

I added the Exclude Char File option. I wanted my HSK lists to contain only the characters that are new at that level, not duplicates from previous levels. This allows me to exclude those that were previously generated. The HSK character lists have been attached. At least one person is using it, or two with me ;-)
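For illustration, the exclude-file rule (leftmost character of each line, "//" comment lines skipped) and the "only new characters at this level" logic might look like this in Python. This is a rough sketch rather than the tool's code; the function names and sample lines are invented, and skipping blank lines and whitespace is an added assumption.

```python
def load_exclusions(lines):
    """Collect the leftmost character of each line, skipping '//' comments."""
    excluded = set()
    for line in lines:
        if line.startswith("//") or not line.strip():
            continue  # comment line, or blank line (assumed to be skipped)
        excluded.add(line[0])
    return excluded

def new_characters(text, excluded):
    """Distinct characters of `text`, in first-seen order, not already excluded."""
    seen = set(excluded)
    fresh = []
    for ch in text:
        if ch not in seen and not ch.isspace():
            seen.add(ch)
            fresh.append(ch)
    return fresh

if __name__ == "__main__":
    # Hypothetical lines from a previously generated level 1 list.
    level1 = ["// level 1 characters", "我,3", "", "是,2"]
    excluded = load_exclusions(level1)
    print(new_characters("我是学生 学", excluded))
```

When reading a real file, opening it with `encoding="utf-8-sig"` avoids a byte order mark being taken as the first line's leftmost character.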

Posted

So I'm curious. Given that most people on this forum seem to think that learning characters by themselves is worse than learning them in words, what do you see is the purpose of this tool? Do you feel that learning words in isolation is the most productive approach?

Posted

I wouldn't tell anyone to learn characters on their own in preference to words; you would never learn words if you did that.

I created it partly to satisfy my curiosity. I like to know how many characters are in a given set of words; if you've learnt them all then perhaps you can brag about knowing that many characters.

But (contradicting the above) learning a word doesn't necessarily mean you've learnt its characters. I've found that words often get learnt just from the general appearance of the string, and I would like to have learnt the characters in any word I've learnt.

Knowing the characters does make your progress easier, because you will find the same characters in new words and will know how to pronounce them (with exceptions sometimes). Knowing characters and their meanings can also let you estimate the meaning of unknown words you encounter.

I ran it across the NCPR lessons 1-30 and found 1246 distinct characters. Sometimes I just like knowing useless trivia like that.

Posted

A parser utility is not a learning tool in itself, but it is a very useful tool that can be used in a number of sophisticated ways. For example, I use the tool to guess the potential benefit of reading certain texts. It helps me decide whether or not to read a text depending on how many unknown characters it contains. I may choose the text with the fewest unknown characters to read first. Or I may choose a text with as many familiar characters as possible if I want to read familiar characters in new contexts.
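Ranking texts by unknown characters, as described above, boils down to a set difference. A minimal Python sketch, assuming the known characters are available as a string or set and that only characters in the U+4E00 to U+9FFF block are of interest (both assumptions, as are the function names and sample texts):

```python
def unknown_characters(text, known):
    """Distinct Chinese characters in `text` that are not in `known`."""
    distinct = {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}
    return distinct - set(known)

def rank_texts(texts, known):
    """Return text names ordered from fewest to most unknown characters."""
    return sorted(texts, key=lambda name: len(unknown_characters(texts[name], known)))

if __name__ == "__main__":
    known = "我是"
    texts = {"article_a": "我是学生", "article_b": "我是我"}  # made-up samples
    print(rank_texts(texts, known))
    print(unknown_characters(texts["article_a"], known))
```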
