
Program for searching Unihan.txt


Hofmann


Is there a program for searching through Unihan.txt? I like HanDict, but I would like the ability to filter out some fields. For example, if I search for 才 using HanDict, I get the output listed below.

Since I don't need most of this data, I would like to filter it out. Is there a program that resembles HanDict but lets me do that?

BigFive A47E

Cangjie DH (木竹)

Cantonese coi4

CCCII 213F7B

CihaiT 558.201

CNS1986 1-445F

CNS1992 1-445F

Cowles 4804

DaeJaweon 0763.010

Definition talent, ability; just, only

EACC 28736D

Fenn 22C

FennIndex 539.04 540.04

FourCornerCode 4020.0

Frequency 2

GB0 1837

GB1 1837

GradeLevel 1

GSR 0943a

Hangul 재

HanYu 31824.020

HanyuPinlu cai2(2108) cai5(33)

HKGlyph 1481

IICore 2.1

IRG_GSource 0-3245

IRG_JSource 0-3A4D

IRG_KPSource KP0-EBD8

IRG_KSource 0-6E26

IRG_TSource 1-445F

IRG_VSource 1-5647

IRGDaeJaweon 0763.010

IRGDaiKanwaZiten 11769

IRGHanyuDaZidian 31824.020

IRGKangXi 0416.300

JapaneseKun WAZUKANI ZAE

JapaneseOn SAI ZAI

Jis0 2645

KangXi 0416.300

Korean CAY

KPS0 EBD8

KSC0 7806

Lau 364

MainlandTelegraph 2088

Mandarin CAI2

Matthews 6660

MeyerWempe 3447

Morohashi 11769

Nelson 0270

Phonetic 24 247

RSAdobe_Japan1_6 C+2109+64.3.0

RSKangXi 64.0

RSUnicode 64.0

SBGY 099.49

SpecializedSemanticVariant U+7E94

TaiwanTelegraph 2088

Tang *dzhəi

TotalStrokes 3

Vietnamese tài

Xerox 242:153

XHC1983 0099.010:cái

Edited by roddy
edited slightly for layout.

Fairly simple.

First download and install Python (at least version 2.5).

Then copy the code below into a new file (e.g. unihan.py):

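from __future__ import with_statement  # only needed on Python 2.5

import re

# Fields to keep, separated by vertical bars with no spaces.
fields = 'kMandarin|kDefinition|kTotalStrokes'

# Match lines of the form: U+XXXX <whitespace> field <whitespace> value
regex = re.compile(r'^U\+[0-9A-Fa-f]+\s(?:' + fields + r')\s')

with open('newunihan.txt', 'w') as outf:
    with open('Unihan.txt') as inf:
        for line in inf:
            # Copy the line only if its field is one we want.
            if regex.match(line):
                outf.write(line)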

Change the line fields = 'kMandarin|kDefinition|kTotalStrokes' to contain the fields that you are interested in. The whole value should be enclosed in single quotes, with the field names separated by vertical bars and no spaces in between (as in the code above). The field names must match the field names used in Unihan.txt exactly. See here for the full list.
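If you'd rather keep the field names in a list than edit the bar-separated string by hand, something like the following should build the same value. This is only a sketch: the variable name wanted is my own, and the three field names are just the ones from the example above.

```python
import re

# Example field names from the post; swap in the ones you need.
wanted = ['kMandarin', 'kDefinition', 'kTotalStrokes']

# Join with vertical bars and no spaces. re.escape guards against
# any regex metacharacters sneaking into a field name.
fields = '|'.join(re.escape(name) for name in wanted)

print(fields)
```

The resulting fields string can be used in the script in place of the hand-written one.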

Then copy Unihan.txt to the same directory as unihan.py.

If you're using Windows and Python is installed, you should be able to just double-click unihan.py to run it, or from the command line you could type: python unihan.py

A new file "newunihan.txt" will be created with all the fields you are not interested in stripped out. This file will be overwritten each time you run the script.

If you're interested in understanding how it all works, here is a good place to start, followed by here and here.

Basically, what this script does is load the file Unihan.txt and then go through the file line by line checking to see if the line matches one of the fields we are interested in. If it finds a match, then it writes that line to the new file.
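To see the matching step on its own, here is a small sketch that runs the same kind of check against a few lines held in memory instead of the real file. It assumes each data line in Unihan.txt looks like U+XXXX, a tab, the field name, a tab, then the value; filter_lines is just a helper name I made up, and the sample values are the ones for 才 (U+624D) from the first post.

```python
import re

fields = 'kMandarin|kDefinition|kTotalStrokes'

# Code point, tab, one of the wanted field names, tab.
pattern = re.compile(r'^U\+[0-9A-Fa-f]+\t(?:' + fields + r')\t')

def filter_lines(lines):
    """Return only the lines whose field name is one of the wanted fields."""
    return [line for line in lines if pattern.match(line)]

sample = [
    '# comment lines start with a hash and never match',
    'U+624D\tkCantonese\tcoi4',
    'U+624D\tkDefinition\ttalent, ability; just, only',
    'U+624D\tkMandarin\tCAI2',
    'U+624D\tkTotalStrokes\t3',
]

for line in filter_lines(sample):
    print(line)
```

Only the kDefinition, kMandarin and kTotalStrokes lines get through; the comment line and the kCantonese line are dropped.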

Having this raw file by itself might not be much use though. I'm not sure if HanDict can load different versions of the Unihan.txt file.

Edited by imron

I just checked the stripped file with HanDict, and it seems to more or less work. You just need to copy the resulting newunihan.txt to the HanDict folder, rename it to unihan.txt, and restart HanDict, and you should be good to go. You may wish to make a backup copy of the original unihan.txt before doing so.
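If you end up doing the copy-and-rename often, it could be scripted too. A rough sketch, assuming the data file in the HanDict folder is called unihan.txt as described above; install_filtered and the unihan.txt.bak backup name are my own inventions, and you would need to supply the real path to your HanDict folder.

```python
import os
import shutil

def install_filtered(new_file, handict_dir):
    """Back up the existing unihan.txt, then copy the filtered file over it."""
    target = os.path.join(handict_dir, 'unihan.txt')
    backup = os.path.join(handict_dir, 'unihan.txt.bak')  # hypothetical backup name
    # Only back up once, so a later run doesn't overwrite the full original
    # with an already-filtered copy.
    if os.path.exists(target) and not os.path.exists(backup):
        shutil.copy2(target, backup)
    shutil.copy2(new_file, target)
    return target
```

You would call it as install_filtered('newunihan.txt', handict_folder), where handict_folder is wherever HanDict lives on your machine, and then restart HanDict.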
