phyrex Posted March 31, 2010 at 03:20 AM (edited)

I just slapped something together for private use and thought it might be interesting for other people too. It's a tiny (and badly written and buggy) script that shows you which characters in a text are new to you and where they appear (line number and the line in question). It needs a list of your known characters in a text file (I got mine by hacking around in hanzistats so that it would spit out a list of all the characters in my Anki deck) and the text you're interested in, again as a text file.

Here's an example of how I use it and the output it produces:

    equilibrium:Desktop phyrex$ python unknownchars.py /Users/phyrex/hanzistats/ankiHanzi2010325.txt /Users/phyrex/Downloads/Fendou\ ep.3\ complete.txt
    Total new chars: 12
    冥
        63 : 冥冥中自有天意
        63 : 冥冥中自有天意
    馄
        234 : 我最爱吃鲜虾馄饨了 走
    哆
        795 : 下得我浑身直哆嗦
    饨
        234 : 我最爱吃鲜虾馄饨了 走
    驰
        141 : 我给你买奔驰车 我带你去巴黎
    攒
        739 : 一辈子攒那么多钱容易吗
    倔
        831 : 她这个人本来性子就倔
    酗
        815 : 酗酒 赌博 爱说瞎话
    筹
        618 : 公司正准备筹备也没太多事儿
    涛
        25 : 我觉得陆涛说得有道理 什么人都能原谅 除了自己没见过面的父亲
        80 : 陆涛
        83 : 陆涛
        91 : 我一定要为你和陆涛做件事儿
        96 : 我还有陆涛
        136 : 我也爱你陆涛
        168 : 陆涛 你先回家
        184 : 我爱你 陆涛
        195 : 陆涛 你别这么心里阴暗了
        240 : 陆涛 求求你了
        251 : 陆涛 你已经长大成人了
        315 : 陆涛 你记住
        324 : 你陆涛想用你的青春做什么
        355 : 陆涛 你现在在干什么
        369 : 我说陆涛 你的理想我很欣赏
        400 : 陆涛
        485 : 陆涛 你怎么出来啦
        541 : 陆涛
        593 : 来了陆涛 我给你介绍一下
        608 : 这是徐伯伯的公子 陆涛
        610 : 你好 我叫陆涛
        612 : 陆涛 你方伯伯千斤
        619 : 老方 放心把灵珊交给陆涛吗
        621 : 陆涛哥
        642 : 陆涛哥 你再带我去别的地方玩嘛
        689 : 陆涛回来了
        764 : 陆涛 你别忘了答应我的事儿
        768 : 陆涛 你对你爸怎么那样啊
        787 : 陆涛 你能不能答应我
        808 : 你知足吧 陆涛
    喽
        649 : 那我们AA喽
    卤
        653 : 那我带你去吃卤煮火烧
    equilibrium:Desktop phyrex$

I also added an option to exclude characters that you're not interested in (such as names), like so:

    equilibrium:Desktop phyrex$ python unknownchars.py /Users/phyrex/hanzistats/ankiHanzi2010325.txt /Users/phyrex/Downloads/Fendou\ ep.3\ complete.txt "涛冥"
    Total new chars: 10
    馄
        234 : 我最爱吃鲜虾馄饨了 走
    哆
        795 : 下得我浑身直哆嗦
    饨
        234 : 我最爱吃鲜虾馄饨了 走
    驰
        141 : 我给你买奔驰车 我带你去巴黎
    攒
        739 : 一辈子攒那么多钱容易吗
    倔
        831 : 她这个人本来性子就倔
    酗
        815 : 酗酒 赌博 爱说瞎话
    筹
        618 : 公司正准备筹备也没太多事儿
    喽
        649 : 那我们AA喽
    卤
        653 : 那我带你去吃卤煮火烧
    equilibrium:Desktop phyrex$

Now please understand that this program doesn't check for errors, is positively user-hostile, and is useless if you don't have a list of the characters you know or don't know how to use Python. However, it works for me, and I thought maybe somebody else would want something like this too. So, unless you want to pay me, I cannot and will not provide support, add features, build a pretty GUI and whatnot. You're all welcome to use or ignore this thing as you see fit, but that's it from my side. Sorry, I've had bad experiences ;)

I've attached the file to this post, but since it's so small, here's the code as well, in case you want to see whether it's interesting to you. Corrections are welcome, but hold your criticism: I know my programs aren't pretty, but I'm not a programmer, so deal with it.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # Python 2 script: lists the characters in a text that are not in the known-character file.
    import sys
    import codecs, string

    # argv[1]: file with the known characters, argv[2]: text to check (both UTF-8)
    CHARFILE = codecs.open(sys.argv[1], "r", encoding='utf-8')
    TXTFILE = codecs.open(sys.argv[2], "r", encoding='utf-8')

    # optional argv[3]: characters to ignore, e.g. names
    exclude_chars = u""
    if len(sys.argv) > 3:
        exclude_chars = sys.argv[3].decode("utf-8")

    known_chars = CHARFILE.read()
    txtfilelines = TXTFILE.readlines()

    # Chinese punctuation and symbols that should never be reported
    cn_punctuation = u"。,&;:-!◎#‘、“, 》─《’′☆—”?·¥%……※×()+§『"

    i = 1                # current line number
    unknown_chars = {}   # unknown character -> list of "line number : line" strings
    for line in txtfilelines:
        line = line.strip()
        processed_chars = u""   # characters already handled on this line
        if line != "":
            for char in line:
                if (char not in string.ascii_letters
                        and char not in string.digits
                        and char not in string.whitespace
                        and char not in string.punctuation
                        and char not in cn_punctuation
                        and char not in processed_chars
                        and char not in exclude_chars):
                    if char not in known_chars:
                        context = str(i) + " : " + line.encode("utf-8")
                        if char in unknown_chars:
                            if context not in unknown_chars[char]:
                                unknown_chars[char].append(context)
                        else:
                            unknown_chars[char] = [context]
                    processed_chars = processed_chars + char
        i = i + 1

    # summary: every new character with the number of lines it appears in
    print "Total new chars:", len(unknown_chars)
    print "-----"
    for char in unknown_chars.keys():
        print char.encode('utf-8') + "(" + str(len(unknown_chars[char])) + ")",
    print "\n-----"

    # details: each new character followed by the lines it appears in
    for char in unknown_chars.keys():
        print "\n======="
        print char.encode('utf-8'), "(", len(unknown_chars[char]), ")"
        print "======="
        for entry in unknown_chars[char]:
            print "\t", entry

Attachment: unknownchars.py.zip

Edited March 31, 2010 at 08:00 AM by phyrex
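For readers on Python 3, the idea described in the post above (compare every character of a text against a list of known characters and record the line number and line where each new character appears) could be sketched roughly as follows. This is not phyrex's script: the names `is_hanzi` and `find_unknown_chars` are hypothetical, and the filter assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as a character worth reporting, instead of maintaining a punctuation list.

```python
def is_hanzi(char):
    """Assumption: only the basic CJK Unified Ideographs block counts as hanzi."""
    return "\u4e00" <= char <= "\u9fff"


def find_unknown_chars(known_chars, lines, exclude_chars=""):
    """Map each unknown character to a list of (line_number, line) pairs."""
    known = set(known_chars)
    excluded = set(exclude_chars)
    unknown = {}
    for number, raw_line in enumerate(lines, start=1):
        line = raw_line.strip()
        # dict.fromkeys keeps the first occurrence of each character in order,
        # so a line is recorded at most once per character
        for char in dict.fromkeys(line):
            if not is_hanzi(char) or char in known or char in excluded:
                continue
            unknown.setdefault(char, []).append((number, line))
    return unknown


if __name__ == "__main__":
    sample = ["我最爱吃鲜虾馄饨了", "", "我给你买奔驰车"]
    result = find_unknown_chars("我最爱吃鲜虾了给你买奔车", sample)
    print("Total new chars:", len(result))
    for char, hits in result.items():
        print(char, "(%d)" % len(hits))
        for number, line in hits:
            print("\t%d : %s" % (number, line))
```

In real use, `known_chars` and `lines` would come from the two UTF-8 text files, for example via `open(path, encoding="utf-8")`.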
phyrex (Author) Posted March 31, 2010 at 07:59 AM

I don't know if anybody is reading this, but here's a bugfix release, which
- makes things prettier
- shows all new characters at the beginning
- shows how often these new characters occur
- fixes a problem with lines appearing a couple of times
- adds a lot of Chinese punctuation signs.

I'll add this to the top, but if anybody is reading this, please let me know, so I know whether I should post further updates here or not.

This is a sample of what the new output looks like:

    Total new chars: 297
    -----
    阀(1) 沃(4) 茂(4) 禅(1) 暇(4) 霆(4) 辉(1) 崭(1) 紊(1) 椎(1) 凰(7) 氯(1) 隅(1) 肛(1) 挚(1) 窟(1) F(2) 疡(1) 蚣(2) 奚(1) 隧( 弦(3) 涩(1) 刨(1) 帆(1) 箭(9) 尬(1) 贮(3) 骼(1) 尴(1) ·(60) 眶(3) 簸(1) 眺(1) 羽(1) 吼(11) 殴(1) 4(1) 巅(3) 茧(5) 翩(1) 勋(6) 镊(1) 豌(1) 蝎(4) 寝(2) 揽(1) 滓(1) 卒(1) 沸(1) 艘(43) 苛(3) 茎(1) 篝(1) 坞(3) 沈(32) 绣(1) 卢(1) 嗥(1) 啭(1) 胧(3) 诵(1) 菩(1) 桨(1) 惫(9) 啪(1) 蹬(6) 僚(4) 腮(3) 忱(2) 驰(4) 儒(2) 韵(1) 凹(1) 槽(3) 灼(1) 獾(1) 傀(1) 蠃(6) 茅(1) 焉(1) 颈(6) 憎(20) 1(3) 庐(1) 谴(2) 5(2) 胯(2) 伎(1) 隘(1) 湮(1) 撂(2) 膛(1) 吟(2) 迄(1) 砾(1) 颠(3) 枢(1) 姚(1) 汰(1) 眩(1) 鲨(2) 循(5) 缭(1) 怯(1) 垮(7) 撰(7) 舷(1) 蒸(2) 儡(1) 粼(1) 耿(1) 胀(3) 拈(1) 晋(9) I(2) 遏(1) 俎(1) 慑(1) 荚(1) 哔(1) 烘(1) 跚(1) 靡(2) 瘸(1) 矢(2) 畜(2) 懦(3) 笆(1) 竽(1) 鹫(6) 朦(2) 彭(6) 惚(1) 叮(3) 捱(2) 亩(1) 扳(5) 潭(1) 顷(1) 拦(1) 呻(2) 跺(1) 煽(2) 恿(2) 苔(11) 梁(3) 缀(2) 怂(2) 袅(1) 龇(1) 舆(2) 劈(2) 蜈(2) 垦(1) 侏(2) 萎(2) 0(6) 貂(1) 唔(2) 咙(3) 8(2) 涛(1) 倚(1) 咝(1) 愠(1) 琢(2) 窥(3) 翱(1) 萦(2) 钩( 垫(4) 瘪(2) 梭(1) 晦(1) 澈(1) 渲(1) 玷(1) 嘶(1) 炽(2) 戾(1) 镀(1) 埃(2) 遂(2) 髅(1) 掷(3) 穆(1) 刍(1) 僧(1) 柏(1) 噎(2) 仑(2) 捐(1) 摒(1) 剖(6) 潘(1) 诛(1) 黝(3) 啜(2) 藤(1) 2(1) 杠(1) 揣(1) 骛(1) 盥(14) 牧(1) 岐(1) 黩(1) 琪(1) 婪(1) 峭(2) 佬(1) 尼(60) 拱(3) 蝰(1) 呲(1) 灶(1) 蛹(1) 浸(2) 跻(1) 豺(2) 绽(3) 栩(1) 煌(1) 喀(1) 咂(1) 栅(1) 隆(1) 嘉(1) 镭(2) 稼(1) 逍(2) 瞌(1) 辐(3) 3(1) 耕(3) 骷(1) 肘(1) 忑(1) (1) 洛(61) 颚(1) 啄(1) 匣(1) 梢(1) 弧(1) 翰(1) 谩(1) 膨(3) 溪(3) 渭(1) 奠(1) 掴(2) 蜷(2) 溶(3) 纺(1) 碾(3) 臀(2) 坊(1) 柄(2) 魇(1) 裆(5) 冈(1) 腋(1) 囊(2) 衍(1) 诌(1) 筏(17) 胎(6) 忐(1) 卓(2) 翔(6) 彗(1) 蹙(1) 绚(1) 硝(1) 偻(1) 绞(2) 陡(1) 壑(1) 佣(1) 淤(1) 荧(1) 峦(1) 桩(2) 叨(1) 赫(2) 咦(2) 橱(1) 寰(1) 蹒(1) 惶(1) 汹(1) 叱(1) 篱(1) 哺(1) 睐(1)
    -----

    =======
    阀 ( 1 )
    =======
        3035 : “那些怪异的梦就象是个安全阀,安德,在你的生命中,我第一次把重担压在了你身上。你的身体在压力下寻求补偿,就是这样。你是个大小伙了。不要再害怕漆黑的夜晚了。”

Attachment: unknownchars.py.zip
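The summary in the sample output above lists the characters in no particular order. One possible refinement, which is not part of phyrex's script, is to sort the summary so that the characters appearing on the most lines come first. A minimal Python 3 sketch, assuming a dictionary that maps each unknown character to the list of lines it was found in (the function name `summary_line` is hypothetical):

```python
def summary_line(unknown_chars):
    """Return 'char(count)' items separated by spaces, most frequent first.

    unknown_chars maps a character to the list of lines it appears in;
    the count is the length of that list. Ties are broken by code point.
    """
    ranked = sorted(
        unknown_chars.items(),
        key=lambda item: (-len(item[1]), item[0]),
    )
    return " ".join("%s(%d)" % (char, len(lines)) for char, lines in ranked)


if __name__ == "__main__":
    # made-up demo data, not taken from the thread
    demo = {
        "阀": ["line a"],
        "沃": ["line b", "line c", "line d", "line e"],
    }
    print("Total new chars:", len(demo))
    print("-----")
    print(summary_line(demo))
    print("-----")
```

Sorting this way puts characters like names, which tend to recur on many lines, at the front, where they are easy to spot and add to the exclude list.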
HerrPetersen Posted March 31, 2010 at 04:12 PM

This seems to be similar to the Excel sheet that Hedge Pig created. Link: http://www.chinese-forums.com/index.php?/topic/20159-creating-lists-of-unknown-hanzi
phyrex (Author) Posted March 31, 2010 at 06:59 PM

I haven't tried it, but from what the others write, it does seem to be a lot like this, yes! But even the thought of Excel and of copying and pasting stuff back and forth makes me sick, so I don't feel like I wasted my time.