Hofmann Posted August 27, 2008 at 03:13 PM (edited)
Is there a program for searching through Unihan.txt? I like HanDict, but I would like the ability to filter out some fields. For example, if I search for 才 using HanDict, I get the output below. Since I don't need most of this data, I would like to filter it out. Is there a program that resembles HanDict but lets me do that?
BigFive A47E
Cangjie DH (木竹)
Cantonese coi4
CCCII 213F7B
CihaiT 558.201
CNS1986 1-445F
CNS1992 1-445F
Cowles 4804
DaeJaweon 0763.010
Definition talent, ability; just, only
EACC 28736D
Fenn 22C
FennIndex 539.04 540.04
FourCornerCode 4020.0
Frequency 2
GB0 1837
GB1 1837
GradeLevel 1
GSR 0943a
Hangul 재
HanYu 31824.020
HanyuPinlu cai2(2108) cai5(33)
HKGlyph 1481
IICore 2.1
IRG_GSource 0-3245
IRG_JSource 0-3A4D
IRG_KPSource KP0-EBD8
IRG_KSource 0-6E26
IRG_TSource 1-445F
IRG_VSource 1-5647
IRGDaeJaweon 0763.010
IRGDaiKanwaZiten 11769
IRGHanyuDaZidian 31824.020
IRGKangXi 0416.300
JapaneseKun WAZUKANI ZAE
JapaneseOn SAI ZAI
Jis0 2645
KangXi 0416.300
Korean CAY
KPS0 EBD8
KSC0 7806
Lau 364
MainlandTelegraph 2088
Mandarin CAI2
Matthews 6660
MeyerWempe 3447
Morohashi 11769
Nelson 0270
Phonetic 24 247
RSAdobe_Japan1_6 C+2109+64.3.0
RSKangXi 64.0
RSUnicode 64.0
SBGY 099.49
SpecializedSemanticVariant U+7E94
TaiwanTelegraph 2088
Tang *dzhəi
TotalStrokes 3
Vietnamese tài
Xerox 242:153
XHC1983 0099.010:cái
Edited August 27, 2008 at 03:22 PM by roddy: edited slightly for layout.
imron Posted August 27, 2008 at 03:36 PM
How comfortable are you with text-processing languages such as Python or Perl? It would be trivial to write a script that processes Unihan.txt and strips out all the fields you don't want. In fact, I did that very thing for the Chinese Pera-kun project.
Hofmann (Author) Posted August 27, 2008 at 09:56 PM
I'm absolutely oblivious to any kind of programming. If it is a simple task, can you point me in the right direction?
imron Posted August 28, 2008 at 03:38 AM (edited)
Fairly simple. First download and install Python (at least version 2.5). Then copy the code below into a new file (e.g. unihan.py):

    from __future__ import with_statement  # only needed on Python 2.5
    import re

    # Fields to keep, separated by vertical bars with no spaces.
    fields = 'kMandarin|kDefinition|kTotalStrokes'

    # Match lines of the form "U+<hex code point><whitespace><field name><whitespace>".
    regex = re.compile(r'^U\+[a-fA-F0-9]+\s(?:' + fields + r')\s')

    with open('newunihan.txt', 'w') as outf:
        with open('Unihan.txt') as inf:
            for line in inf:
                # Copy the line only if it belongs to one of the wanted fields.
                if regex.match(line):
                    outf.write(line)

Change the line

    fields = 'kMandarin|kDefinition|kTotalStrokes'

to contain the fields that you are interested in. The entire group of fields should be surrounded by single quotes, with each field separated by a vertical bar and no spaces in between (you should be able to see what I mean from the above code). The field names need to match the field names used in the Unihan.txt file exactly. See here for the full list.

Then copy Unihan.txt to the same directory as unihan.py. If you're using Windows and Python is installed, you should be able to just double-click on unihan.py to run it, or from the command line you could type:

    python unihan.py

A new file, "newunihan.txt", will be created with all the fields you are not interested in stripped out. This file will be overwritten each time you run the script.

If you're interested in understanding how it all works, here is a good place to start, followed by here and here. Basically, what this script does is open the file Unihan.txt and go through it line by line, checking whether each line is for one of the fields we are interested in. If it finds a match, it writes that line to the new file.

Having this raw file by itself might not be much use though. I'm not sure if HanDict can load different versions of the Unihan.txt file.
Edited August 28, 2008 at 08:03 AM by imron
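For anyone who would rather look at the filtered data directly than load it into HanDict, here is a rough sketch of the same line-by-line idea that groups the kept fields by character. It assumes Python 3 (unlike the 2.5 script above) and assumes each data line in Unihan.txt is laid out as code point, tab, field name, tab, value. The names LINE_RE and collect_fields are made up for this illustration, and the sample lines are typed in by hand from the 才 entry quoted earlier in the thread.

```python
import re

# Assumed layout of a data line: "U+624D<TAB>kMandarin<TAB>CAI2".
# LINE_RE and collect_fields are hypothetical names for this sketch.
LINE_RE = re.compile(r'^U\+([0-9A-Fa-f]+)\t(k\w+)\t(.*)$')

def collect_fields(lines, wanted):
    """Return {character: {field: value}} for the wanted field names."""
    entries = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip('\n'))
        if not m:
            continue  # skips comment lines, blank lines, anything unexpected
        code_point, field, value = m.groups()
        if field in wanted:
            char = chr(int(code_point, 16))  # hex code point -> character
            entries.setdefault(char, {})[field] = value
    return entries

# Hand-typed sample lines; a real run would pass an open file instead.
sample = [
    '# a comment line\n',
    'U+624D\tkCantonese\tcoi4\n',
    'U+624D\tkDefinition\ttalent, ability; just, only\n',
    'U+624D\tkMandarin\tCAI2\n',
    'U+624D\tkTotalStrokes\t3\n',
]
result = collect_fields(sample, {'kMandarin', 'kDefinition', 'kTotalStrokes'})
for field, value in sorted(result[chr(0x624D)].items()):
    print(field, value)
```

To run it on the real file, something like `with open('Unihan.txt', encoding='utf-8') as f: result = collect_fields(f, wanted)` should work, on the assumption that the file is UTF-8.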
imron Posted August 28, 2008 at 07:31 AM
I just checked the stripped file with HanDict, and it seems to more or less work. Copy the resulting newunihan.txt to the HanDict folder, rename it to unihan.txt, then restart HanDict and you should be good to go. You may wish to make a backup copy of the original unihan.txt before doing so.
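If you end up swapping the file often, the backup-and-replace step can be scripted as well. This is only a sketch: it assumes Python 3, the function name install_filtered_file and the backup name unihan.txt.bak are invented here, and the HanDict folder location is something you would have to supply yourself. The demo at the bottom uses a temporary folder as a stand-in so it does not touch a real installation.

```python
import os
import shutil
import tempfile

def install_filtered_file(handict_dir, filtered_path):
    """Back up unihan.txt in handict_dir, then replace it with filtered_path.

    Hypothetical helper: the file names are assumptions based on the
    manual steps described in the thread.
    """
    target = os.path.join(handict_dir, 'unihan.txt')
    backup = os.path.join(handict_dir, 'unihan.txt.bak')
    # Back up only once, so a second run does not overwrite the
    # original data with an already-filtered file.
    if os.path.exists(target) and not os.path.exists(backup):
        shutil.copy2(target, backup)
    shutil.copy2(filtered_path, target)
    return backup

# Demo in a temporary folder standing in for the HanDict folder.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'unihan.txt'), 'w', encoding='utf-8') as f:
    f.write('original data\n')
filtered = os.path.join(demo_dir, 'newunihan.txt')
with open(filtered, 'w', encoding='utf-8') as f:
    f.write('filtered data\n')

backup_path = install_filtered_file(demo_dir, filtered)
print('backup written to', backup_path)
```

HanDict still needs to be restarted afterwards to pick up the new file.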
Hofmann (Author) Posted August 28, 2008 at 12:38 PM (edited)
Thanks, mate! By the way, is anyone working on Chinese Pera-kun?
Edited August 29, 2008 at 04:49 AM by Hofmann