
How can I replace all non-Chinese characters in a text with spaces?



Posted

Hi! If I have around 1,000 .txt files with combined pinyin + hanzi, how can I replace all the pinyin (and other non-Chinese characters) with spaces so I can keep the word segmentation intact?

 

Here's an example of how the original rich text looks:

 

候鳥Hòuniǎo導航dǎoháng的de本能běnnéng已經yǐjīng很hěn令lìng人rén驚訝jīngyà但dàn有些yǒuxiē動物dòngwù更gèng厲害lìhai牠們tāmen就算jiùsuàn被bèi帶dài到dào陌生mòshēng的de地方dìfang也yě懂得dǒngde怎樣zěnyàng回huí老家lǎojiā例如Lìrú研究員yánjiūyuán曾céng從cóng太平洋Tàipíngyáng中心zhōngxīn的de小島xiǎodǎo帶dài了le18隻zhī信天翁xìntiānwēng搭dā飛機fēijī到dào幾千jǐqiān公里gōnglǐ之zhī外wài的de不bù同tóng地方dìfang

 

But when it's pasted to a TXT, it looks like this:

 

候鳥Hòuniǎo導航dǎoháng的de本能běnnéng已經yǐjīng很hěn令lìng人rén驚訝jīngyà,但dàn有些yǒuxiē動物dòngwù更gèng厲害lìhai,牠們tāmen就算jiùsuàn被bèi帶dài到dào陌生mòshēng的de地方dìfang也yě懂得dǒngde怎樣zěnyàng回huí老家lǎojiā。例如Lìrú,研究員yánjiūyuán曾céng從cóng太平洋Tàipíngyáng中心zhōngxīn的de小島xiǎodǎo帶dài了le18隻zhī信天翁xìntiānwēng,搭dā飛機fēijī到dào幾千jǐqiān公里gōnglǐ之zhī外wài的de不bù同tóng地方dìfang,

 

I want to keep just the hanzi. Replace any non-hanzi string with a single space, so I can then use AntConc to analyze the corpus.

 

Thank you for your help!

Posted

https://onlinetexttools.com/replace-text

 

Replace regex: /[a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+/gi

 

With: <SPACE>

 

Output from your sample looks like this:

 

候鳥 導航 的 本能 已經 很 令 人 驚訝 ,但 有些 動物 更 厲害 ,牠們 就算 被 帶 到 陌生 的 地方 也 懂得 怎樣 回 老家 。例如 ,研究員 曾 從 太平洋 中心 的 小島 帶 了 18隻 信天翁 ,搭 飛機 到 幾千 公里 之 外 的 不 同 地方 ,

 

Edit: As you have around 1000 files this might get a bit laborious. So you could also try the following method:

  1. Install Visual Studio Code
  2. Open the parent folder of your files in it
  3. Press Ctrl+Shift+H (multi-file find and replace)
  4. Use the little icons to the right of the input box to turn on the regex option, and make sure "Match Case" is off so the search is case-insensitive
  5. Replace regex [a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+ with <SPACE>

 

Make sure the text files you're modifying are backed up before doing this, as it'll overwrite them in-place (depending on your settings, you might have to save each one individually in the editor).
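If you'd rather script it than click through an editor, here is a rough Python sketch of the same replacement. It uses the same character class as the regex above. The names `strip_pinyin` and `convert_folder` are made up for illustration, and it assumes the files are UTF-8 and that you want cleaned copies in a separate folder rather than overwriting the originals.

```python
import re
from pathlib import Path

# Same character class as the regex above; IGNORECASE also catches capitals
# such as the "H" in "Hòuniǎo".
PINYIN_RUN = re.compile(r"[a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ]+", re.IGNORECASE)


def strip_pinyin(text: str) -> str:
    """Replace each run of pinyin letters with a single space."""
    return PINYIN_RUN.sub(" ", text)


def convert_folder(src: Path, dst: Path) -> None:
    """Write a pinyin-free copy of every .txt file in src into dst.

    Assumes UTF-8 files; the folder layout is hypothetical.
    """
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.txt")):
        cleaned = strip_pinyin(path.read_text(encoding="utf-8"))
        (dst / path.name).write_text(cleaned, encoding="utf-8")


if __name__ == "__main__":
    # Small demo on a fragment of the sample text.
    print(strip_pinyin("候鳥Hòuniǎo導航dǎoháng的de本能běnnéng"))
```

To process a whole folder you would call something like `convert_folder(Path("corpus"), Path("corpus_clean"))`, with your own folder names. Punctuation and numerals are left alone, as with the regex above.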

 

Edit 2: If you literally only want Hanzi remaining (i.e. you also want to remove all punctuation, Arabic numerals, and so on), you can use this regex instead:

  • Online text tools version: /\P{Script=Han}+/giu
  • VS Code version: \P{Script=Han}+

 

Sample output using this regex:

 

候鳥 導航 的 本能 已經 很 令 人 驚訝 但 有些 動物 更 厲害 牠們 就算 被 帶 到 陌生 的 地方 也 懂得 怎樣 回 老家 例如 研究員 曾 從 太平洋 中心 的 小島 帶 了 隻 信天翁 搭 飛機 到 幾千 公里 之 外 的 不 同 地方 
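A similar Hanzi-only pass can be sketched in Python. One caveat: Python's built-in `re` module does not support `\p{...}` classes, so this sketch lists the main CJK ideograph blocks by hand. That is an approximation of `\P{Script=Han}` and not an exact match for it (a few Han-script symbols outside these blocks are not covered). The name `keep_hanzi` is just for illustration.

```python
import re

# Approximation of \P{Script=Han}: the built-in re module has no \p{...}
# support, so the main CJK ideograph blocks are listed by hand. This is an
# assumption for the sketch, not the full Han script.
NON_HANZI_RUN = re.compile(
    "[^"
    "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U00020000-\U0002fa1f"  # Extension B through the compatibility supplement
    "]+"
)


def keep_hanzi(text: str) -> str:
    """Replace every run of non-Hanzi characters with a single space."""
    return NON_HANZI_RUN.sub(" ", text)


if __name__ == "__main__":
    print(keep_hanzi("帶dài了le18隻zhī信天翁xìntiānwēng,搭dā飛機fēijī"))
```

In this sketch line breaks count as non-Hanzi too, so they are turned into spaces along with everything else. Process the text line by line if you need to keep them.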

Posted

Do you have Linux? 

 

1.

cat myfile.txt | grep -P -o '[\p{Han}]+'

gives you one word per line, but you lose the line feeds in your text.

 

2.

cat myfile.txt | grep -P -o '[\p{Han}]+' | sort -u

gives you a sorted unique wordlist

 

3.

cat myfile.txt | grep -P -o '[\p{Han}]+' | paste -sd " "

joins all your words space separated into one long line.

 

And here comes the big one-liner:

 

for i in *.txt; do echo "converting $i"; cat "$i" | grep -P -o '[\p{Han}]+' > "$i.converted" ; done

 

It takes all files with the extension "txt" from your current directory, extracts the Chinese words (one word per line, as in 1.), and writes the output to a file with the same name but the extension "txt.converted". The quotes around $i keep it working for file names that contain spaces.
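For anyone without grep -P, here is a rough Python take on the same loop. It is a sketch with one deliberate difference: it keeps the original line breaks and puts the words of each line on one space-separated line, instead of one word per line. Python's `re` has no `\p{Han}`, so a hand-listed set of CJK ideograph blocks stands in for it (an approximation). The function names are hypothetical and the files are assumed to be UTF-8.

```python
import re
from pathlib import Path

# Hand-listed CJK ideograph blocks standing in for \p{Han}, which Python's
# built-in re module lacks (an approximation, not the full Han script).
HANZI_RUN = re.compile(
    "[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\U00020000-\U0002fa1f]+"
)


def extract_line(line: str) -> str:
    """Return the Hanzi runs of one line, joined by single spaces."""
    return " ".join(HANZI_RUN.findall(line))


def convert_file(path: Path) -> Path:
    """Write the Hanzi-only version of path next to it as '<name>.converted'."""
    lines = path.read_text(encoding="utf-8").splitlines()
    out = path.with_name(path.name + ".converted")
    out.write_text(
        "\n".join(extract_line(line) for line in lines) + "\n",
        encoding="utf-8",
    )
    return out


def convert_all(folder: Path) -> None:
    """Convert every .txt file in folder, leaving the originals untouched."""
    for txt in sorted(folder.glob("*.txt")):
        print("converting", txt)
        convert_file(txt)


if __name__ == "__main__":
    # Demo on one line; call convert_all(Path(".")) to process real files.
    print(extract_line("候鳥Hòuniǎo導航dǎoháng的de本能běnnéng"))
```

Calling `convert_all(Path("."))` would mirror the shell loop for the current directory.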

 

 

Posted

Another way to do this, if you don't have a Linux machine and/or coding skills, is to open the text in Word. Provided that the hanzi in the text is in a different font than the pinyin, you can run search and replace to replace anything in a given font with <empty>. This will effectively remove everything written in that font.

Posted

Thanks everyone for your replies! I ended up using Demonic_Duck's regex, applied to a whole folder using Notepad++

AntConc is now working perfectly with the resulting files. 

Thanks a lot!
