Program for searching Unihan.txt

August 27, 2008 at 03:13 PM

Is there a program for searching through Unihan.txt? I like HanDict, but I would like the ability to filter out some fields. For example, if I search for 才 using HanDict, I get the output listed below.

Since I don't need most of this data, I would like to filter it out. Is there a program that resembles HanDict but lets me do that?

BigFive A47E

Cangjie DH (木竹)

Cantonese coi4

CCCII 213F7B

CihaiT 558.201

CNS1986 1-445F

CNS1992 1-445F

Cowles 4804

DaeJaweon 0763.010

Definition talent, ability; just, only

EACC 28736D

Fenn 22C

FennIndex 539.04 540.04

FourCornerCode 4020.0

Frequency 2

GB0 1837

GB1 1837

GradeLevel 1

GSR 0943a

Hangul 재

HanYu 31824.020

HanyuPinlu cai2(2108) cai5(33)

HKGlyph 1481

IICore 2.1

IRG_GSource 0-3245

IRG_JSource 0-3A4D

IRG_KPSource KP0-EBD8

IRG_KSource 0-6E26

IRG_TSource 1-445F

IRG_VSource 1-5647

IRGDaeJaweon 0763.010

IRGDaiKanwaZiten 11769

IRGHanyuDaZidian 31824.020

IRGKangXi 0416.300

JapaneseKun WAZUKANI ZAE

JapaneseOn SAI ZAI

Jis0 2645

KangXi 0416.300

Korean CAY

KPS0 EBD8

KSC0 7806

Lau 364

MainlandTelegraph 2088

Mandarin CAI2

Matthews 6660

MeyerWempe 3447

Morohashi 11769

Nelson 0270

Phonetic 24 247

RSAdobe_Japan1_6 C+2109+64.3.0

RSKangXi 64.0

RSUnicode 64.0

SBGY 099.49

SpecializedSemanticVariant U+7E94

TaiwanTelegraph 2088

Tang *dzhəi

TotalStrokes 3

Vietnamese tài

Xerox 242:153

XHC1983 0099.010:cái

Edited August 27, 2008 at 03:22 PM by roddy
edited slightly for layout.

August 27, 2008 at 03:36 PM

How comfortable are you with text processing languages such as Python or Perl? It would be trivial to write a script that will process Unihan.txt and strip out all the fields you don't want, and in fact I did that very thing for the Chinese Perakun project.

August 27, 2008 at 09:56 PM

I'm absolutely oblivious to any kind of programming. If it is a simple task, can you point me in the right direction?

August 28, 2008 at 03:38 AM

Fairly simple.

First, download and install Python (version 2.5 or later).

Then copy the code below into a new file (e.g. unihan.py):

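from __future__ import with_statement  # only needed on Python 2.5

import re

# Fields to keep, separated by vertical bars (no spaces).
fields = 'kMandarin|kDefinition|kTotalStrokes'

# Each data line looks like: U+624D<tab>kMandarin<tab>CAI2
# The trailing \s stops a name like 'kFenn' from also matching 'kFennIndex'.
regex = re.compile(r'^U\+[a-fA-F0-9]+\s(?:' + fields + r')\s')

with open('newunihan.txt', 'w') as outf:
    with open('Unihan.txt') as inf:
        for line in inf:
            # Copy only the lines whose field name is in the list above.
            if regex.match(line):
                outf.write(line)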

Change the line fields = 'kMandarin|kDefinition|kTotalStrokes' to contain the fields that you are interested in. The entire group of fields you are interested in should be surrounded by single quotes, with each field separated by a vertical bar without any spaces in between (you should be able to see what I mean from the above code). The field names need to match exactly with the field names used in the Unihan.txt file. See here for the full list.

Then copy Unihan.txt to the same directory as unihan.py.

If you're using Windows and Python is installed, you should be able to just double-click unihan.py to run it, or from the command line you could type: python unihan.py

A new file, "newunihan.txt", will be created with all the fields you are not interested in stripped out. This file will be overwritten each time you run the script.

If you're interested in understanding how it all works, here is a good place to start, followed by here and here.

Basically, what this script does is load the file Unihan.txt and then go through the file line by line checking to see if the line matches one of the fields we are interested in. If it finds a match, then it writes that line to the new file.
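To make the matching step a bit more concrete, here is a minimal sketch of the same check applied to a few sample lines instead of the whole file. It assumes each data line is laid out as code point, tab, field name, tab, value; the helper name keep_line and the sample list are made up for illustration, with values taken from the 才 listing at the top of the thread.

```python
import re

# Fields to keep, in the same vertical-bar format used in unihan.py.
fields = 'kMandarin|kDefinition|kTotalStrokes'

# Assumed line layout: U+XXXX <tab> field name <tab> value.
regex = re.compile(r'^U\+[0-9A-Fa-f]+\s(?:' + fields + r')\s')


def keep_line(line):
    """Return True if the line holds one of the wanted fields (hypothetical helper)."""
    return regex.match(line) is not None


# Sample lines for U+624D; the last one stands in for a comment line.
samples = [
    'U+624D\tkMandarin\tCAI2\n',
    'U+624D\tkCantonese\tcoi4\n',
    '# comment lines in the file start with a hash\n',
]

# Only the lines that pass the check would be written to newunihan.txt.
kept = [line for line in samples if keep_line(line)]

for line in samples:
    print(keep_line(line), line.rstrip())
```

Only the kMandarin line passes here: kCantonese is not in the field list, and the comment line does not start with a code point.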

Having this raw file by itself might not be much use though. I'm not sure if HanDict can load different versions of the Unihan.txt file.

Edited August 28, 2008 at 08:03 AM by imron

August 28, 2008 at 07:31 AM

I just checked the stripped file with HanDict, and it seems to more or less work. Copy the resulting newunihan.txt to the HanDict folder, rename it to unihan.txt, and restart HanDict, and you should be good to go. You may wish to make a backup copy of the original unihan.txt before doing so.
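If you would rather not do the backup and rename by hand, the steps can be sketched in Python as well. This is only a rough sketch: the function name install_filtered and the backup file name unihan_backup.txt are my own choices, and you would have to supply the path to your own HanDict folder.

```python
import os
import shutil


def install_filtered(handict_dir, filtered_path):
    """Back up HanDict's unihan.txt, then replace it with the filtered file.

    Hypothetical helper: handict_dir is wherever HanDict is installed.
    """
    target = os.path.join(handict_dir, 'unihan.txt')
    backup = os.path.join(handict_dir, 'unihan_backup.txt')
    # Make the backup only once, so running this again never
    # overwrites the full original with an already-stripped copy.
    if os.path.exists(target) and not os.path.exists(backup):
        shutil.copy2(target, backup)
    # Copy the filtered file over the one HanDict reads.
    shutil.copy2(filtered_path, target)
    return target, backup


# Example call (fill in your own HanDict folder):
# install_filtered('path/to/HanDict', 'newunihan.txt')
```

Restart HanDict afterwards so that it picks up the new file.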

August 28, 2008 at 12:38 PM

Thanks, mate!

By the way, is anyone working on Chinese Pera-kun?

Edited August 29, 2008 at 04:49 AM by Hofmann
