How can I replace all non-Chinese characters in a text with spaces?

February 13, 2021 at 10:56 PM

Hi! If I have around 1,000 .txt files with combined pinyin + hanzi, how can I replace all the pinyin (and other non-Chinese characters) with spaces so I can keep the word segmentation intact?

Here's an example of how the original rich text looks:

候鳥Hòuniǎo導航dǎoháng的de本能běnnéng已經yǐjīng很hěn令lìng人rén驚訝jīngyà，但dàn有些yǒuxiē動物dòngwù更gèng厲害lìhai，牠們tāmen就算jiùsuàn被bèi帶dài到dào陌生mòshēng的de地方dìfang也yě懂得dǒngde怎樣zěnyàng回huí老家lǎojiā。例如Lìrú，研究員yánjiūyuán曾céng從cóng太平洋Tàipíngyáng中心zhōngxīn的de小島xiǎodǎo帶dài了le18隻zhī信天翁xìntiānwēng，搭dā飛機fēijī到dào幾千jǐqiān公里gōnglǐ之zhī外wài的de不bù同tóng地方dìfang，

But when it's pasted to a TXT, it looks like this:

候鳥Hòuniǎo導航dǎoháng的de本能běnnéng已經yǐjīng很hěn令lìng人rén驚訝jīngyà，但dàn有些yǒuxiē動物dòngwù更gèng厲害lìhai，牠們tāmen就算jiùsuàn被bèi帶dài到dào陌生mòshēng的de地方dìfang也yě懂得dǒngde怎樣zěnyàng回huí老家lǎojiā。例如Lìrú，研究員yánjiūyuán曾céng從cóng太平洋Tàipíngyáng中心zhōngxīn的de小島xiǎodǎo帶dài了le18隻zhī信天翁xìntiānwēng，搭dā飛機fēijī到dào幾千jǐqiān公里gōnglǐ之zhī外wài的de不bù同tóng地方dìfang，

I want to keep just the hanzi. Replace any non-hanzi string with a single space, so I can then use AntConc to analyze the corpus.

Thank you for your help!

February 14, 2021 at 01:28 AM

https://onlinetexttools.com/replace-text

Replace regex: /[a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+/gi

With: <SPACE>

Output from your sample looks like this:

候鳥 導航 的 本能 已經 很 令 人 驚訝 ，但 有些 動物 更 厲害 ，牠們 就算 被 帶 到 陌生 的 地方 也 懂得 怎樣 回 老家 。例如 ，研究員 曾 從 太平洋 中心 的 小島 帶 了 18隻 信天翁 ，搭 飛機 到 幾千 公里 之 外 的 不 同 地方 ，
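If you would rather script this than paste text into a web tool, the same replacement can be sketched in Python. This is only an illustration, not part of the original answer: the function name `strip_pinyin` is made up here, and the character class is copied from the regex above, so it assumes the tone-marked vowels in your files are precomposed characters (the usual case for copied web text).

```python
import re

# Same character class as the regex above: ASCII letters plus the
# tone-marked pinyin vowels. re.IGNORECASE also catches capitalised
# syllables such as "Hòuniǎo".
PINYIN_RUN = re.compile(r"[a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+", re.IGNORECASE)


def strip_pinyin(text: str) -> str:
    """Replace each run of pinyin letters with a single space."""
    return PINYIN_RUN.sub(" ", text)


sample = "候鳥Hòuniǎo導航dǎoháng的de本能běnnéng，帶dài了le18隻zhī信天翁xìntiānwēng"
print(strip_pinyin(sample))
```

Punctuation and digits are left alone, exactly as with the regex it is copied from.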

Edit: As you have around 1,000 files, doing this one file at a time would be laborious, so you could try the following method instead:

Install Visual Studio Code
Open the parent folder of your files in it
Press Ctrl+Shift+H (multi-file find and replace)
Use the little icons to the right of the input box to turn on the regex option, and make sure the search is case-insensitive (the "Match Case" toggle should be off)
Replace regex [a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+ with <SPACE>

Make sure the text files you're modifying are backed up before doing this, as it'll overwrite them in-place (depending on your settings, you might have to save each one individually in the editor).
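One way to sidestep the in-place overwrite is to write cleaned copies into a separate folder. The following is a rough Python sketch under a few assumptions of mine: the files are UTF-8, they all sit directly in one folder (no subfolders), and the name `convert_folder` and the folder names in the comment are hypothetical.

```python
import re
from pathlib import Path

# Pinyin character class copied from the regex in this post.
PINYIN_RUN = re.compile(r"[a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+", re.IGNORECASE)


def convert_folder(src_dir: str, dst_dir: str, encoding: str = "utf-8") -> int:
    """Write a pinyin-free copy of every .txt file in src_dir to dst_dir.

    Returns the number of files written. The originals are not modified.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.txt")):  # top level only, not recursive
        text = path.read_text(encoding=encoding)
        (dst / path.name).write_text(PINYIN_RUN.sub(" ", text), encoding=encoding)
        count += 1
    return count


# Example call (hypothetical folder names):
# convert_folder("corpus", "corpus_clean")
```

Because the output goes to a different folder, a mistake in the regex costs nothing more than deleting the output and running it again.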

Edit 2: If you literally only want Hanzi remaining (i.e. you also want to remove all punctuation, Arabic numerals, and so on), you can use this regex instead:

Online text tools version: /\P{Script=Han}+/giu
VS Code version: \P{Script=Han}+

Sample output using this regex:

候鳥導航的本能已經很令人驚訝但有些動物更厲害牠們就算被帶到陌生的地方也懂得怎樣回老家例如研究員曾從太平洋中心的小島帶了隻信天翁搭飛機到幾千公里之外的不同地方
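For anyone doing the hanzi-only version in Python: the built-in `re` module does not understand `\p{...}` properties, so the sketch below lists the main Han code-point blocks by hand. Treat that list as my approximation of `Script=Han`, not an exact equivalent (it omits radicals and a few symbols such as 〇), and note that `keep_han` is a made-up name. It replaces each non-Han run with one space, as asked in the original question.

```python
import re

# Approximation of Script=Han using explicit ranges (an assumption of this
# sketch; the stdlib re module has no \p{...} support).
HAN = (
    "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U00020000-\U0002ebef"  # Extensions B through F
)
NON_HAN_RUN = re.compile(f"[^{HAN}]+")


def keep_han(text: str) -> str:
    """Collapse every run of non-Han characters into one space.

    Leading and trailing spaces are stripped for convenience.
    """
    return NON_HAN_RUN.sub(" ", text).strip()


print(keep_han("帶dài了le18隻zhī信天翁xìntiānwēng，搭dā飛機fēijī"))
```

Line breaks count as non-Han characters here, so a multi-line file comes out as a single line.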

February 14, 2021 at 08:55 AM

Maybe this thread is helpful: https://www.chinese-forums.com/forums/topic/58401-extracting-chinese-characters-from-text-file/

Publius' method worked for me and may be adapted to your needs.

February 14, 2021 at 11:35 AM

Do you have Linux?

1.

cat myfile.txt | grep -P -o '[\p{Han}]+'

gives you one word per line, but you lose the line feeds in your text.

2.

cat myfile.txt | grep -P -o '[\p{Han}]+' | sort -u

gives you a sorted unique wordlist

3.

cat myfile.txt | grep -P -o '[\p{Han}]+' | paste -sd " "

joins all your words space separated into one long line.

And here comes the big one-liner:

for i in *.txt; do echo "converting $i"; grep -P -o '[\p{Han}]+' "$i" > "$i.converted"; done

It takes all files with the extension "txt" in your current directory, extracts the Chinese words (one word per line, as in 1.), and writes the output to a file with the same name but the extension "txt.converted".
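If losing the line feeds is a problem, the extraction can be done line by line instead. Here is a small Python sketch of that idea; it is not calculatrix's method, the function names are invented, and the Han range is a hand-written approximation covering only the Basic Multilingual Plane blocks. `unique_words` is a rough counterpart to the `sort -u` pipeline, although Python sorts by code point, which may differ from your locale's order.

```python
import re

# Runs of Han characters (approximate: BMP blocks only, an assumption here).
HAN_RUN = re.compile("[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]+")


def han_words_per_line(text: str) -> str:
    """Keep the Han runs of each line, space-separated, one output line per input line."""
    return "\n".join(" ".join(HAN_RUN.findall(line)) for line in text.splitlines())


def unique_words(text: str) -> list[str]:
    """Return the distinct Han runs in text, sorted by code point."""
    return sorted(set(HAN_RUN.findall(text)))


text = "候鳥Hòuniǎo導航dǎoháng\n的de本能běnnéng"
print(han_words_per_line(text))
print(unique_words(text))
```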

February 14, 2021 at 09:23 PM

Another way to do this, if you don't have a Linux machine and/or coding skills, is to open the text in Word. Provided that the hanzi in the text are in a different font than the pinyin, you can run search and replace to replace anything in a given font with <empty>. This effectively removes everything written in that font.

February 14, 2021 at 09:33 PM

Thanks everyone for your replies! I ended up using Demonic_Duck's regex, applied to a whole folder using Notepad++

AntConc is now working perfectly with the resulting files.

Thanks a lot!

March 24, 2021 at 02:22 PM

regex is best

mlescano

Demonic_Duck

Jan Finster

calculatrix

alantin

mlescano

china-euro
