Chinese-Forums

Unknown Characters in texts


phyrex

I just slapped something together for private use, and thought maybe this would be interesting for other people too.

It's a tiny (and badly written and buggy) script that shows you which characters are new to you and where they appear (line number and line in question).

It needs a list of your known characters in a text file (I got mine by hacking around in hanzistats so that it would spit out a list of all the characters in my Anki deck), and the text you're interested in, also as a text file.
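
For anyone without a hanzistats setup, one rough way to produce such a list is to collect every distinct Chinese character from any plain-text export of the material you already know. The following is only a Python 3 sketch under assumptions: `collect_hanzi` and `write_known_chars` are hypothetical names, not part of the attached script, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is treated as "Chinese characters".

```python
def collect_hanzi(text):
    """Return the distinct characters of `text` that fall in the basic CJK
    block (U+4E00..U+9FFF), in the order they are first seen."""
    seen = set()
    ordered = []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff" and ch not in seen:
            seen.add(ch)
            ordered.append(ch)
    return "".join(ordered)


def write_known_chars(src_path, dst_path):
    """Hypothetical helper: read a UTF-8 text file and write its distinct
    Chinese characters to another UTF-8 file."""
    with open(src_path, encoding="utf-8") as src:
        known = collect_hanzi(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(known)
    return known
```

The resulting file is just one long string of characters, which is all the script needs, since it only ever checks whether a character occurs somewhere in that file.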

Here's an example of how I use it and the output it produces:

equilibrium:Desktop phyrex$ python unknownchars.py /Users/phyrex/hanzistats/ankiHanzi2010325.txt /Users/phyrex/Downloads/Fendou\ ep.3\ complete.txt
Total new chars: 12 

冥
63 : 冥冥中自有天意 
63 : 冥冥中自有天意 
馄
234 : 我最爱吃鲜虾馄饨了  走
哆
795 : 下得我浑身直哆嗦
饨
234 : 我最爱吃鲜虾馄饨了  走
驰
141 : 我给你买奔驰车  我带你去巴黎
攒
739 : 一辈子攒那么多钱容易吗
倔
831 : 她这个人本来性子就倔
酗
815 : 酗酒  赌博  爱说瞎话
筹
618 : 公司正准备筹备也没太多事儿
涛
25 : 我觉得陆涛说得有道理  什么人都能原谅  除了自己没见过面的父亲
80 : 陆涛
83 : 陆涛
91 : 我一定要为你和陆涛做件事儿
96 : 我还有陆涛
136 : 我也爱你陆涛
168 : 陆涛  你先回家
184 : 我爱你  陆涛
195 : 陆涛  你别这么心里阴暗了
240 : 陆涛  求求你了
251 : 陆涛  你已经长大成人了
315 : 陆涛  你记住
324 : 你陆涛想用你的青春做什么
355 : 陆涛  你现在在干什么
369 : 我说陆涛  你的理想我很欣赏
400 : 陆涛
485 : 陆涛  你怎么出来啦
541 : 陆涛
593 : 来了陆涛  我给你介绍一下
608 : 这是徐伯伯的公子  陆涛
610 : 你好  我叫陆涛
612 : 陆涛  你方伯伯千斤
619 : 老方  放心把灵珊交给陆涛吗
621 : 陆涛哥 
642 : 陆涛哥  你再带我去别的地方玩嘛
689 : 陆涛回来了
764 : 陆涛  你别忘了答应我的事儿
768 : 陆涛  你对你爸怎么那样啊
787 : 陆涛  你能不能答应我
808 : 你知足吧  陆涛
喽
649 : 那我们AA喽
卤
653 : 那我带你去吃卤煮火烧
equilibrium:Desktop phyrex$ 

I also added an option to exclude characters that you're not interested in (such as names). Pass them as a third argument, like so:

equilibrium:Desktop phyrex$ python unknownchars.py /Users/phyrex/hanzistats/ankiHanzi2010325.txt /Users/phyrex/Downloads/Fendou\ ep.3\ complete.txt "涛冥"
Total new chars: 10 

馄
234 : 我最爱吃鲜虾馄饨了  走
哆
795 : 下得我浑身直哆嗦
饨
234 : 我最爱吃鲜虾馄饨了  走
驰
141 : 我给你买奔驰车  我带你去巴黎
攒
739 : 一辈子攒那么多钱容易吗
倔
831 : 她这个人本来性子就倔
酗
815 : 酗酒  赌博  爱说瞎话
筹
618 : 公司正准备筹备也没太多事儿
喽
649 : 那我们AA喽
卤
653 : 那我带你去吃卤煮火烧
equilibrium:Desktop phyrex$ 
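
The effect of that third argument can be pictured as a plain filter over the results. This is a minimal Python 3 sketch, assuming the results are held in a dict that maps each character to its context lines; `drop_excluded` is a hypothetical helper and not how the attached script does it (the script skips excluded characters while scanning).

```python
def drop_excluded(unknown, exclude_chars=""):
    """Return a copy of `unknown` without any character in `exclude_chars`."""
    return {ch: lines for ch, lines in unknown.items()
            if ch not in exclude_chars}


# Two entries taken from the example output above.
found = {
    "涛": ["80 : 陆涛", "83 : 陆涛"],
    "卤": ["653 : 那我带你去吃卤煮火烧"],
}
remaining = drop_excluded(found, "涛冥")
print("Total new chars:", len(remaining))
for ch, lines in remaining.items():
    print(ch)
    for entry in lines:
        print(entry)
```

Characters in the exclusion string that never occur in the text (such as 冥 in this tiny example) are simply ignored.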

Please understand that this program doesn't check for errors, is positively user-hostile, and is useless if you don't have a list of the characters you know or don't know how to use Python. However, it works for me, and I thought somebody else might want something like it too. So, unless you want to pay me, I cannot and will not provide support, add features, build a pretty GUI, or whatnot. You're all welcome to use or ignore this thing as you see fit, but that's it from my side. Sorry, I've had bad experiences ;)

I've attached the file to this post, but since it's so small, here's the code as well, in case you want to see whether this is interesting to you. Corrections are welcome, but hold your criticism: I know my programs aren't pretty, but I'm not a programmer, so deal with it :P

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script (print statements, str.decode).
# Usage: python unknownchars.py KNOWN_CHARS_FILE TEXT_FILE [CHARS_TO_EXCLUDE]

import sys
import codecs, string


CHARFILE = codecs.open(sys.argv[1], "r", encoding='utf-8')
TXTFILE = codecs.open(sys.argv[2], "r", encoding='utf-8')
exclude_chars = u""
if len(sys.argv) > 3:
    exclude_chars = sys.argv[3].decode("utf-8")

# Never reported: ASCII letters, digits, whitespace and punctuation,
# plus a handful of Chinese punctuation marks.
IGNORED_CHARS = (string.ascii_letters + string.digits + string.whitespace
                 + string.punctuation
                 + u"。,&;:-!◎#‘、“, 》─《’′☆—”?·¥%……※×()+§『")

known_chars = CHARFILE.read()
txtfilelines = TXTFILE.readlines()
i = 1                  # current line number (blank lines are counted too)
unknown_chars = {}     # unknown char -> list of "lineno : line" strings
processed_chars = u""  # chars already handled on the current line

for line in txtfilelines:
    line = line.strip()  # strip() returns a new string, so keep the result
    if line != "":       # compare by value; "is not" tests identity
        for char in line:
            if char not in IGNORED_CHARS and char not in processed_chars and char not in exclude_chars:
                if char not in known_chars:
                    context = [str(i) + " : " + line.encode("utf-8")]
                    if char in unknown_chars:
                        if context[0] not in unknown_chars[char]:
                            unknown_chars[char].extend(context)
                    else:
                        unknown_chars[char] = context
            processed_chars = processed_chars + char
        processed_chars = u""
    i = i + 1

# Summary: every new character with the number of lines it occurs in.
print "Total new chars:", len(unknown_chars)
print "-----"
for char in unknown_chars.keys():
    print char.encode('utf-8') + "(" + str(len(unknown_chars[char])) + ")",
print "\n-----"

# Details: each new character followed by its numbered context lines.
for char in unknown_chars.keys():
    print "\n======="
    print char.encode('utf-8'), "(", len(unknown_chars[char]), ")"
    print "======="
    for entry in unknown_chars[char]:
        print "\t", entry
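
The script above is Python 2. For readers on Python 3, the core loop can be sketched roughly as follows. This is not a port of the attached file, just an illustration under assumptions: `find_unknown_chars` is a hypothetical name, and instead of the hand-written punctuation list it swaps in a simple code point range check (basic CJK Unified Ideographs block, U+4E00 to U+9FFF), so rarer characters outside that block are ignored.

```python
def find_unknown_chars(lines, known_chars, exclude_chars=""):
    """Map each unknown Chinese character to the numbered lines it appears in."""
    known = set(known_chars)
    excluded = set(exclude_chars)
    unknown = {}
    # Line numbers start at 1 and count blank lines too.
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        # set(line) visits each character once per line,
        # so the same line is never listed twice for one character.
        for ch in set(line):
            is_hanzi = "\u4e00" <= ch <= "\u9fff"  # assumption: basic CJK block only
            if is_hanzi and ch not in known and ch not in excluded:
                unknown.setdefault(ch, []).append("%d : %s" % (lineno, line))
    return unknown


sample = ["我爱你\n", "\n", "陆涛  你好\n"]
result = find_unknown_chars(sample, known_chars="我你好", exclude_chars="陆")
for ch, entries in result.items():
    print(ch)
    for entry in entries:
        print("\t" + entry)
```

In real use the `lines` argument would come from something like `open(path, encoding="utf-8")`, which replaces the `codecs.open` calls of the Python 2 version.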

unknownchars.py.zip

Edited by phyrex

I don't know if anybody is reading this, but here's a bugfix release, which

- makes things prettier

- shows all new characters at the beginning

- shows how often these new characters occur

- fixes a problem with lines appearing a couple of times

- adds a lot of Chinese punctuation marks to the list of ignored characters.
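
Two of the points above (no repeated lines, and a summary with counts at the beginning) can be sketched in a few lines of Python 3. The names `add_context` and `summary_line` are hypothetical, and sorting the summary by frequency is an assumption made here for readability; the posted script prints the characters in whatever order the dict yields them. As in the script, the count is the number of distinct lines a character appears in.

```python
def add_context(unknown, ch, lineno, line):
    """Record that `ch` occurs on line `lineno`, skipping repeated entries."""
    entry = "%d : %s" % (lineno, line)
    entries = unknown.setdefault(ch, [])
    if entry not in entries:
        entries.append(entry)


def summary_line(unknown):
    """Format 'char(count)' for every unknown character, most frequent first."""
    ranked = sorted(unknown.items(), key=lambda item: len(item[1]), reverse=True)
    return " ".join("%s(%d)" % (ch, len(entries)) for ch, entries in ranked)


unknown = {}
add_context(unknown, "冥", 63, "冥冥中自有天意")
add_context(unknown, "冥", 63, "冥冥中自有天意")  # same line again, ignored
add_context(unknown, "涛", 80, "陆涛")
add_context(unknown, "涛", 83, "陆涛")
print("Total new chars:", len(unknown))
print(summary_line(unknown))
```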

I'll add this to the top as well, but if anybody is reading this, please let me know, so I know whether I should keep posting updates here.

This is a sample of what the new output looks like:

Total new chars: 297
-----
阀(1) 沃(4) 茂(4) 禅(1) 暇(4) 霆(4) 辉(1) 崭(1) 紊(1) 椎(1) 凰(7) 氯(1) 隅(1) 肛(1) 挚(1) 窟(1) F(2) 疡(1) 蚣(2) 奚(1) 隧( 弦(3) 涩(1) 刨(1) 帆(1) 箭(9) 尬(1) 贮(3) 骼(1) 尴(1) ·(60) 眶(3) 簸(1) 眺(1) 羽(1) 吼(11) 殴(1) 4(1) 巅(3) 茧(5) 翩(1) 勋(6) 镊(1) 豌(1) 蝎(4) 寝(2) 揽(1) 滓(1) 卒(1) 沸(1) 艘(43) 苛(3) 茎(1) 篝(1) 坞(3) 沈(32) 绣(1) 卢(1) 嗥(1) 啭(1) 胧(3) 诵(1) 菩(1) 桨(1) 惫(9) 啪(1) 蹬(6) 僚(4) 腮(3) 忱(2) 驰(4) 儒(2) 韵(1) 凹(1) 槽(3) 灼(1) 獾(1) 傀(1) 蠃(6) 茅(1) 焉(1) 颈(6) 憎(20) 1(3) 庐(1) 谴(2) 5(2) 胯(2) 伎(1) 隘(1) 湮(1) 撂(2) 膛(1) 吟(2) 迄(1) 砾(1) 颠(3) 枢(1) 姚(1) 汰(1) 眩(1) 鲨(2) 循(5) 缭(1) 怯(1) 垮(7) 撰(7) 舷(1) 蒸(2) 儡(1) 粼(1) 耿(1) 胀(3) 拈(1) 晋(9) I(2) 遏(1) 俎(1) 慑(1) 荚(1) 哔(1) 烘(1) 跚(1) 靡(2) 瘸(1) 矢(2) 畜(2) 懦(3) 笆(1) 竽(1) 鹫(6) 朦(2) 彭(6) 惚(1) 叮(3) 捱(2) 亩(1) 扳(5) 潭(1) 顷(1) 拦(1) 呻(2) 跺(1) 煽(2) 恿(2) 苔(11) 梁(3) 缀(2) 怂(2) 袅(1) 龇(1) 舆(2) 劈(2) 蜈(2) 垦(1) 侏(2) 萎(2) 0(6) 貂(1) 唔(2) 咙(3) 8(2) 涛(1) 倚(1) 咝(1) 愠(1) 琢(2) 窥(3) 翱(1) 萦(2) 钩( 垫(4) 瘪(2) 梭(1) 晦(1) 澈(1) 渲(1) 玷(1) 嘶(1) 炽(2) 戾(1) 镀(1) 埃(2) 遂(2) 髅(1) 掷(3) 穆(1) 刍(1) 僧(1) 柏(1) 噎(2) 仑(2) 捐(1) 摒(1) 剖(6) 潘(1) 诛(1) 黝(3) 啜(2) 藤(1) 2(1) 杠(1) 揣(1) 骛(1) 盥(14) 牧(1) 岐(1) 黩(1) 琪(1) 婪(1) 峭(2) 佬(1) 尼(60) 拱(3) 蝰(1) 呲(1) 灶(1) 蛹(1) 浸(2) 跻(1) 豺(2) 绽(3) 栩(1) 煌(1) 喀(1) 咂(1) 栅(1) 隆(1) 嘉(1) 镭(2) 稼(1) 逍(2) 瞌(1) 辐(3) 3(1) 耕(3) 骷(1) 肘(1) 忑(1) (1) 洛(61) 颚(1) 啄(1) 匣(1) 梢(1) 弧(1) 翰(1) 谩(1) 膨(3) 溪(3) 渭(1) 奠(1) 掴(2) 蜷(2) 溶(3) 纺(1) 碾(3) 臀(2) 坊(1) 柄(2) 魇(1) 裆(5) 冈(1) 腋(1) 囊(2) 衍(1) 诌(1) 筏(17) 胎(6) 忐(1) 卓(2) 翔(6) 彗(1) 蹙(1) 绚(1) 硝(1) 偻(1) 绞(2) 陡(1) 壑(1) 佣(1) 淤(1) 荧(1) 峦(1) 桩(2) 叨(1) 赫(2) 咦(2) 橱(1) 寰(1) 蹒(1) 惶(1) 汹(1) 叱(1) 篱(1) 哺(1) 睐(1) 
-----

=======
阀 ( 1 )
=======
3035 :   “那些怪异的梦就象是个安全阀,安德,在你的生命中,我第一次把重担压在了你身上。你的身体在压力下寻求补偿,就是这样。你是个大小伙了。不要再害怕漆黑的夜晚了。”

unknownchars.py.zip
