uni419 Posted February 11, 2015 at 07:09 AM
I have some lengthy transcripts in dual character/pinyin format and am looking for efficient utilities to get rid of the pinyin. Any and all help greatly appreciated.
imron Posted February 11, 2015 at 07:17 AM
Can you provide an example of some of the text? Also, what operating system are you using? There are a bunch of text processing tools that come with OS X or Linux that could probably help strip out unwanted characters.
uni419 Posted February 11, 2015 at 07:46 AM Author
ChinesePod transcripts. See below.
EDIT: OS is OS X, and a plus for anything that can do batch processing.
EDIT2: To clarify, I'm also looking to use this with some longer documents, and just want to see what tools are available.
A: 司令,侦查卫星显示蓝方的部队已向前推进了两公里,并有数十辆坦克已朝我方移动了十公里。估计是蓝方的试探性进攻。
sīlìng,zhēnchá wèixīng xiǎnshì lánfāng de bùduì yǐ xiàngqián tuījìn le liǎng gōnglǐ,bìng yǒu shù shí liàng tǎnkè yǐ cháo wǒfāng yídòng le shí gōnglǐ.gūjì shì lánfāng de shìtàn xìng jìngōng.
Commander, the spy satellite shows that the Blue troops have moved forward two kilometres, and there are dozens of tanks that have already moved ten kilometres toward us. I reckon it's the Blue's probing attack.
B: 距离我方前哨阵地大约多久路程?
jùlí wǒfāng qiánshào zhèndì dàyuē duōjiǔ lùchéng?
They are at approximately how long of a distance from our advance guards' position?
A: 根据目前的移动速度,估计30分钟左右。
gēnjù mùqián de yídòng sùdù,gūjì sānshí fēnzhōng zuǒyòu.
According to the present speed of movement, I estimate it will be about 30 minutes.
B: 告诉反坦克连,从014地区进行拦截,并全力牵制住他们。另外,我们工程连搭桥的进展怎么样?
gàosu fǎntǎnkè lián,cóng dòngyāosì dìqū jìnxíng lánjié,bìng quánlì qiānzhì zhù tāmen.lìngwài,wǒmen gōngchénglián dāqiáo de jìnzhǎn zěnmeyàng?
Tell the anti-tanks company to intercept from area 014 and spare no effort to pin them down. Besides that, what is the progress of our engineering company in building the bridge?
A: 正在进行中,他们连长说还需要15分钟。
zhèngzài jìnxíng zhōng,tāmen liánzhǎng shuō hái xūyào shíwǔ fēnzhōng.
It is in progress and their company commander says it will take another 15 minutes.
B: 看来蓝方已经揣摩出我们想要偷袭他们的补给线,然后派出坦克来进行骚扰。
kànlái lánfāng yǐjīng chuǎimó chū wǒmen xiǎngyào tōuxí tāmen de bǔjǐxiàn,ránhòu pàichū tǎnkè lái jìnxíng sāorǎo.
It looks like Blue has already figured out that we want to make a raid on their supply line, and so then they sent out tanks to carry out harassment.
C: 司令,突击团已经准备完毕,现正在075地区待命。
sīlìng,tūjītuán yǐjīng zhǔnbèi wánbì,xiàn zhèngzài dòngguǎiwǔ dìqū dàimìng.
Commander, the shock team has already finished its preparations and is now at area 075 awaiting orders.
B: 好,告诉工程连,我只给他们5分钟。桥一搭好,便让突击团全力攻击他们的补给线。一定要拿下。
hǎo,gàosu gōngchénglián,wǒ zhǐ gěi tāmen wǔ fēnzhōng.qiáo yī dā hǎo,biàn ràng tūjītuán quánlì gōngjī tāmen de bǔjǐxiàn.yīdìng yào náxià.
Ok, tell the engineering company that I am only giving them 5 minutes. As soon as the bridge is built, then let the shock team spare no effort to attack their supply line. We must get it.
C: 已通知他们的通讯员。
yǐ tōngzhī tāmen de tōngxùnyuán.
I already have informed their correspondent.
B: 我们要尽可能地避开他们的装甲部队,把他们的步兵引入到伏击圈中。
wǒmen yào jǐnkěnéng de bìkāi tāmen de zhuāngjiǎbùduì,bǎ tāmen de bùbīng yǐnrù dào fújīquān zhōng.
We have to do our utmost to avoid their armoured forces and lure their infantry into an ambush ring.
A: 司令,我们三连在C城遇上了蓝方的特种部队,已展开巷战。
sīlìng,wǒmen sān lián zài C chéng yùshang le lánfāng de tèzhǒngbùduì,yǐ zhǎnkāi xiàngzhàn.
Commander, our third company has run into Blue's special forces and this has already set off street fighting.
C: 不好,我们的主控电脑中毒,通讯中断!
bùhǎo,wǒmen de zhǔkòngdiànnǎo zhòngdú,tōngxùn zhōngduàn!
That's not good. Our main control computer has been infected. Discontinue communications!
B: 赶快开启备用数据!
gǎnkuài kāiqǐ bèiyòng shùjù!
Open the alternate data immediately!
C: 我们的卫星连接已被蓝方控制。
wǒmen de wèixīng liánjiē yǐ bèi lánfāng kòngzhì.
Our satellite connection is already being controlled by Blue.
B: 糟了。
zāole.
We're ruined.
imron Posted February 11, 2015 at 09:17 AM
In Terminal there is a whole host of text processing tools that can do this kind of thing if you know about regular expressions. E.g. assuming a UTF-8 text file, something like this would work:
    sed -e "s/[A-Za-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ:.,'?\!\-]//g" -e '/^[[:space:]]*[0-9]*[[:space:]]*$/d' original.txt > new.txt
where original.txt is the name of the file and new.txt is the name of the file that receives the stripped-down content. That's basically just a regular expression that deletes any alphabetic or pinyin character plus various punctuation, and then deletes any line left with nothing but whitespace or digits. It's not perfect, because it also strips out the A:, B:, and C: at the start of the Chinese lines. However, if you are always going to have A:, B:, C: or similar at the start of the Chinese lines, you can do it with a simpler command that prints only those lines:
    sed -n '/^[ABC]:/p' original.txt > new.txt
It's not too hard to hook this up to handle batch processing if you know bash scripting.
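As a rough illustration of the batch processing mentioned above, here is a minimal sketch that runs the simpler `sed -n` filter over every `.txt` file in a folder. The function name `strip_transcripts` and the directory arguments are made up for this example, and it assumes each transcript keeps the Chinese, pinyin, and English on separate lines, with the Chinese lines starting with `A:`, `B:`, or `C:`.

```shell
# Hypothetical helper: keep only the speaker-labelled Chinese lines
# from every .txt file in src_dir, writing same-named files to out_dir.
strip_transcripts() {
    local src_dir="$1" out_dir="$2" f
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.txt; do
        # If there are no .txt files the glob stays literal; skip it.
        [ -e "$f" ] || continue
        # Print only lines that begin with A:, B:, or C:.
        # Widen the bracket list if your transcripts use other labels.
        sed -n '/^[ABC]:/p' "$f" > "$out_dir/$(basename "$f")"
    done
}
```

Called as `strip_transcripts transcripts stripped`, this would leave the originals untouched and write the filtered copies into `stripped/`. Because the pattern is plain ASCII, it should not depend on the locale settings of the shell.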
dwq Posted February 12, 2015 at 10:42 PM
Haven't tried it, but that would seem to strip out the English translation as well. Not sure if that is what the OP wants.
dwq Posted February 12, 2015 at 10:48 PM
I think something along the lines of "grep -v '[āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]' original.txt > new.txt" should mostly work to strip only lines containing pinyin. (The square brackets matter: without them grep looks for that whole string of vowels as one literal sequence, which never occurs.)
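A sketch of the same idea (dropping only the lines that contain a tone-marked vowel, so the Chinese and English lines survive), written with fixed-string patterns instead of a bracket expression. This is a swapped-in variant, not the command from the post: `grep -F` compares literal strings, so it does not rely on how the current locale handles multibyte characters inside brackets. The function name `drop_pinyin_lines` is invented for the example.

```shell
# Hypothetical filter: read a transcript on stdin, write it to stdout
# without the lines that contain any tone-marked pinyin vowel.
drop_pinyin_lines() {
    local tones
    # One fixed-string pattern per line; grep treats a newline-separated
    # list as alternatives. $(...) strips the trailing newline, which
    # avoids an empty pattern that would match every line.
    tones=$(printf '%s\n' ā á ǎ à ē é ě è ī í ǐ ì ō ó ǒ ò ū ú ǔ ù ǖ ǘ ǚ ǜ)
    grep -vF -e "$tones"
}
```

Usage would look like `drop_pinyin_lines < original.txt > new.txt`. As with the original suggestion this only "mostly" works: a pinyin line made up entirely of neutral-tone syllables has no tone marks and would be kept, and an English line containing an accented letter such as é would be dropped.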