Extracting Chinese characters from text file

May 17, 2019 at 09:38 AM

I wonder if there is an automatic way to extract only the Chinese sentences from a text file (.txt) that contains both English and Chinese. (I was thinking of searching for a certain Unicode range, but this all seems way above my head.)

Any ideas?

May 17, 2019 at 12:00 PM

If the English part contains no non-ASCII characters (e.g. é, €, ©, —) and the Chinese part does not use English punctuation (it shouldn't) or Arabic numerals (it might),

then a quick-n-dirty Search & Replace in an editor with regular expression support (e.g. Notepad++) will do:

search for "[ -~]+"

replace with ""

This removes all printable ASCII characters (0x20 through 0x7E) from the text, so make a backup copy of the original beforehand.
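If you would rather script this than do it in an editor, the same replacement can be sketched in Python. This is only an illustration: the function name `strip_printable_ascii` and the sample string are made up for the example, and the same caveats apply (any Arabic numerals or ASCII punctuation inside the Chinese part are deleted too).

```python
import re

# Same character class as the editor search: space (0x20) through tilde (0x7E).
PRINTABLE_ASCII = re.compile(r"[ -~]+")

def strip_printable_ascii(text: str) -> str:
    """Delete every run of printable ASCII; line breaks are left untouched."""
    return PRINTABLE_ASCII.sub("", text)

# Hypothetical sample, one English line and one Chinese line.
sample = "This is my name card.\n这是我的名片。\n"
print(strip_printable_ascii(sample))
```

Note that accented letters such as ǒ or è are not ASCII, so they are left behind by this pattern.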

Usually there is something you can look for to separate the English part from the Chinese part. For example, a delimiter may be used (a pair of parentheses, two slashes, four spaces, a tab, a dash, or whatever).

Or if the Chinese is followed immediately by English, like this:

　这是一个中文句子。This is an English sentence.

then you can discard anything beyond the last Chinese punctuation mark that can end a sentence (。！？…—”」).

Quite simple, but we need to see your sample text to know for sure.
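As a rough sketch of the "cut after the last Chinese punctuation mark" idea, here is how it might look in Python. The function name `keep_through_last_ending` is invented for the example, and it assumes the English part never uses any of the listed marks itself (an ellipsis character, a long dash or a curly closing quote in the English would throw it off).

```python
# Sentence-ending marks listed in the post above.
CHINESE_ENDINGS = "。！？…—”」"

def keep_through_last_ending(line: str) -> str:
    """Keep text up to and including the last Chinese sentence-ending mark.

    Returns an empty string when the line contains none of the marks,
    which this sketch treats as "no Chinese on this line".
    """
    last = max(line.rfind(mark) for mark in CHINESE_ENDINGS)
    return line[: last + 1]  # last == -1 gives line[:0], i.e. ""

line = "这是一个中文句子。This is an English sentence."
print(keep_through_last_ending(line))  # prints: 这是一个中文句子。
```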

May 17, 2019 at 01:25 PM

Thank you, Publius, for your suggestion!!

Here is a sample of the text. Not too unusual:

May I introduce myself?

Wǒ kěyǐ jièshào wǒ zìjǐ ma? 我可以介绍我自己吗?

(Woh ker-ee jeh-shou woh-dzu-jee mah)

My name is _____.

Wǒ de míngzi shì _____.

(Woh-der meeng-dzu shr ____) 我的名字是 _____。

This is my name card.

Zhè shì wǒ de míngpiàn.

(Jur shr woh-der meeng pee-an) 这是我的名片。

I tried your search and replace in Notepad++ (entering "[ -~]+" in the search box after Ctrl+F), but it did not find any entries. Not quite sure what I am doing wrong?

May 17, 2019 at 02:03 PM

You need to change "Search mode" to "Regular expression" at the lower left, then

1)

Find ".* (?=[^ -ɏ])"

Replace with ""

2)

Find "^[ -ɏ\r\n]"

Replace with ""
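For reference, the two steps could be approximated in Python roughly as follows. This is a sketch, not a transcription of the editor behaviour: step 1 reuses the same pattern, while step 2 is written as a plain line filter that drops every line not starting with a non-Latin character. The name `extract_chinese_lines` is made up, and the sample is taken from the text posted earlier in the thread.

```python
import re

# U+0020 (space) through U+024F (end of Latin Extended-B), as in the
# character class above; this covers ASCII plus tone-marked pinyin vowels.
LATIN = " -\u024f"
# Step 1: everything up to the last space that precedes a non-Latin character.
LEADING_LATIN = re.compile(rf".* (?=[^{LATIN}])")
# Step 2 (intent): keep a line only if it starts with a non-Latin character.
STARTS_NON_LATIN = re.compile(rf"[^{LATIN}]")

def extract_chinese_lines(text: str) -> list[str]:
    kept = []
    for line in text.splitlines():
        line = LEADING_LATIN.sub("", line)   # step 1
        if STARTS_NON_LATIN.match(line):     # step 2
            kept.append(line)
    return kept

sample = """My name is _____.
(Woh-der meeng-dzu shr ____) 我的名字是 _____。
This is my name card.
(Jur shr woh-der meeng pee-an) 这是我的名片。
"""
for line in extract_chinese_lines(sample):
    print(line)
```

Working line by line keeps the logic easy to follow; blank lines and pure English or pinyin lines simply fail the step 2 check and are dropped.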

May 17, 2019 at 02:04 PM

Sounds like a job for UNIX command-line tools... are you using a Mac or a Linux machine by any chance?

May 17, 2019 at 02:23 PM

Hello Publius,

thank you so much again (beats me how anyone can know this).

Initially, your method worked. The search terms in your second post identified the Mandarin characters (which is helpful). The search term in your first post worked like a charm to delete the normal English words, but left the pinyin (tone marks) undeleted.

Somehow, however, after trying it again, neither of the search terms works anymore (see picture; unfortunately, I have the German version, but it says "invalid expression").

May 17, 2019 at 02:30 PM

@Publius: somehow your search term "^[ -ɏ\r\n]" works on and off. The other ".* (?=[^ -ɏ])" still produces an error. At the moment I am at the stage of having Hanzi and lines with pinyin tone marks (e.g. " è ì ĭ í à à"). Any idea how to get rid of the pinyin marks?

Thanks! :))

May 17, 2019 at 02:30 PM

Well, did you do step 2)? It should remove any line that does not begin with a non-Latin character.

May 17, 2019 at 02:35 PM

@Publius: My bad. Made a silly mistake. It works! You really helped me a lot!!!! :))

Thanks man!!!!

May 17, 2019 at 02:37 PM

Okay. No problem.

May 18, 2020 at 06:09 AM

It worked easily. Thanks


Jan Finster

Publius

Jan Finster

Publius

mungouk

Jan Finster

Jan Finster

Publius

Jan Finster

Publius

lokaiwen
