mlescano Posted February 13, 2021 at 10:56 PM

Hi! If I have around 1,000 .txt files with combined pinyin + hanzi, how can I replace all the pinyin (and other non-Chinese characters) with spaces so I can keep the word segmentation intact?

Here's an example of how the original rich text looks:

候鳥Hòuniǎo導航dǎoháng的de本能běnnéng已經yǐjīng很hěn令lìng人rén驚訝jīngyà,但dàn有些yǒuxiē動物dòngwù更gèng厲害lìhai,牠們tāmen就算jiùsuàn被bèi帶dài到dào陌生mòshēng的de地方dìfang也yě懂得dǒngde怎樣zěnyàng回huí老家lǎojiā。例如Lìrú,研究員yánjiūyuán曾céng從cóng太平洋Tàipíngyáng中心zhōngxīn的de小島xiǎodǎo帶dài了le18隻zhī信天翁xìntiānwēng,搭dā飛機fēijī到dào幾千jǐqiān公里gōnglǐ之zhī外wài的de不bù同tóng地方dìfang,

But when it's pasted to a TXT, it looks like this:

候鳥Hòuniǎo導航dǎoháng的de本能běnnéng已經yǐjīng很hěn令lìng人rén驚訝jīngyà,但dàn有些yǒuxiē動物dòngwù更gèng厲害lìhai,牠們tāmen就算jiùsuàn被bèi帶dài到dào陌生mòshēng的de地方dìfang也yě懂得dǒngde怎樣zěnyàng回huí老家lǎojiā。例如Lìrú,研究員yánjiūyuán曾céng從cóng太平洋Tàipíngyáng中心zhōngxīn的de小島xiǎodǎo帶dài了le18隻zhī信天翁xìntiānwēng,搭dā飛機fēijī到dào幾千jǐqiān公里gōnglǐ之zhī外wài的de不bù同tóng地方dìfang,

I want to keep just the hanzi: replace any non-hanzi string with a single space, so I can then use AntConc to analyze the corpus. Thank you for your help!
Demonic_Duck Posted February 14, 2021 at 01:28 AM

https://onlinetexttools.com/replace-text

Replace regex: /[a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+/gi
With: <SPACE>

Output from your sample looks like this:

候鳥 導航 的 本能 已經 很 令 人 驚訝 ,但 有些 動物 更 厲害 ,牠們 就算 被 帶 到 陌生 的 地方 也 懂得 怎樣 回 老家 。例如 ,研究員 曾 從 太平洋 中心 的 小島 帶 了 18隻 信天翁 ,搭 飛機 到 幾千 公里 之 外 的 不 同 地方 ,

Edit: As you have around 1000 files, doing this one file at a time might get a bit laborious. So you could also try the following method:

1. Install Visual Studio Code
2. Open the parent folder of your files in it
3. Press Ctrl+Shift+H (multi-file find and replace)
4. Use the little icons to the right of the input box to turn on the regex option, and make sure the search is case-insensitive (in VS Code, that means leaving "Match Case" off)
5. Replace regex [a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+ with <SPACE>

Make sure the text files you're modifying are backed up before doing this, as it'll overwrite them in place (depending on your settings, you might have to save each one individually in the editor).

Edit 2: If you literally only want hanzi remaining (i.e. you also want to remove all punctuation, Arabic numerals, and so on), you can use this regex instead:

Online text tools version: /\P{Script=Han}+/giu
VS Code version: \P{Script=Han}+

Sample output using this regex:

候鳥 導航 的 本能 已經 很 令 人 驚訝 但 有些 動物 更 厲害 牠們 就算 被 帶 到 陌生 的 地方 也 懂得 怎樣 回 老家 例如 研究員 曾 從 太平洋 中心 的 小島 帶 了 隻 信天翁 搭 飛機 到 幾千 公里 之 外 的 不 同 地方
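For anyone who would rather script the replacement than click through an editor, here is a minimal Python sketch of the same two regexes. It is only an illustration, not something anyone in this thread actually ran. The function names and the idea of writing to a separate output folder are my own assumptions. Python's built-in `re` module has no `\P{Script=Han}`, so the hanzi-only version approximates it with explicit CJK ideograph ranges.

```python
import re
import unicodedata
from pathlib import Path

# Pinyin pattern taken from the post above: Latin letters plus tone-marked vowels.
PINYIN_RUN = re.compile(r"[a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+", re.IGNORECASE)

# Assumption: explicit ranges standing in for \P{Script=Han}, which `re` lacks.
# Covers CJK Unified Ideographs, Extension A, compatibility ideographs,
# and the supplementary-plane extensions.
NON_HAN_RUN = re.compile(
    "[^\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\U00020000-\U0002fa1f]+"
)


def strip_pinyin(text: str) -> str:
    """Replace each run of pinyin letters with a single space."""
    # NFC turns "vowel + combining tone mark" into one precomposed character
    # so the class above can match it (it also folds CJK compatibility
    # ideographs to their unified forms).
    text = unicodedata.normalize("NFC", text)
    return PINYIN_RUN.sub(" ", text)


def keep_only_hanzi(text: str) -> str:
    """Replace each run of non-hanzi characters with a single space."""
    return NON_HAN_RUN.sub(" ", text)


def convert_folder(src: Path, dst: Path, hanzi_only: bool = False) -> int:
    """Clean every .txt file in src and write the result to dst.

    The originals are left untouched; returns the number of files written.
    """
    dst.mkdir(parents=True, exist_ok=True)
    clean = keep_only_hanzi if hanzi_only else strip_pinyin
    count = 0
    for path in sorted(src.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        (dst / path.name).write_text(clean(text), encoding="utf-8")
        count += 1
    return count
```

Something like `convert_folder(Path("corpus"), Path("corpus_clean"))` would then process a whole folder, where both folder names are placeholders and the files are assumed to be UTF-8. One thing to watch with the pinyin-only pattern: the character class has no plain ü, so a syllable such as lüè would leave a stray ü behind; adding ü to the class should cover that.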
Jan Finster Posted February 14, 2021 at 08:55 AM

Maybe this thread is helpful: https://www.chinese-forums.com/forums/topic/58401-extracting-chinese-characters-from-text-file/

Publius' method worked for me and may be adapted to your needs.
calculatrix Posted February 14, 2021 at 11:35 AM

Do you have Linux?

1. cat myfile.txt | grep -P -o '[\p{Han}]+'
gives you one word per line, but you lose the line feeds in your text.

2. cat myfile.txt | grep -P -o '[\p{Han}]+' | sort -u
gives you a sorted, unique word list.

3. cat myfile.txt | grep -P -o '[\p{Han}]+' | paste -sd " "
joins all your words, space-separated, into one long line.

And here comes the big one-liner:

for i in *.txt; do echo "converting $i"; cat "$i" | grep -P -o '[\p{Han}]+' > "$i.converted"; done

It takes all files with the extension "txt" in your current directory, extracts the Chinese words (one word per line, as in 1.) and writes the output to a file with the same name but the extension "txt.converted". The quotes around $i keep the loop working for file names that contain spaces.
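If you don't have grep with -P at hand, the three pipelines above can be roughly mirrored in Python. This is a hedged sketch, not a drop-in equivalent: the function names are made up for illustration, and because Python's built-in `re` has no `\p{Han}`, the pattern uses explicit CJK ideograph ranges as an approximation.

```python
import re

# Assumption: explicit ranges approximating grep's \p{Han}.
HAN_RUN = re.compile(
    "[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\U00020000-\U0002fa1f]+"
)


def han_runs(text: str) -> list[str]:
    """Like pipeline 1: every run of hanzi, in order of appearance."""
    return HAN_RUN.findall(text)


def unique_sorted(text: str) -> list[str]:
    """Like pipeline 2: the distinct runs, sorted.

    Python sorts by code point, which may differ from the locale-based
    order that sort -u produces.
    """
    return sorted(set(han_runs(text)))


def joined(text: str) -> str:
    """Like pipeline 3: all runs joined by single spaces on one line."""
    return " ".join(han_runs(text))
```

For the original question, the third function is probably the closest fit, since it keeps the word segmentation from the source text and gives AntConc space-separated tokens.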
alantin Posted February 14, 2021 at 09:23 PM

Another way to do this, if you don't have a Linux machine and/or coding skills, is to open the text in Word. Provided that the hanzi in the text are in a different font than the pinyin, you can run search and replace to replace anything in a given font with <empty>. This will effectively remove everything written in that font.
mlescano Posted February 14, 2021 at 09:33 PM Author

Thanks everyone for your replies! I ended up using Demonic_Duck's regex, applied to a whole folder using Notepad++. AntConc is now working perfectly with the resulting files. Thanks a lot!
china-euro Posted March 24, 2021 at 02:22 PM

regex is best