
Extracting Chinese characters from text file


Posted

I wonder if there is an automatic way to extract only the Chinese sentences from a text file (.txt) that contains both English and Chinese. (I was thinking of searching for a certain Unicode range, but this all seems way above my head.)

 

Any ideas?
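
For reference, the Unicode-range idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming the Chinese text falls in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the function name `hanzi_runs` is made up for this example, and Chinese punctuation is deliberately not matched.

```python
import re

# Basic CJK Unified Ideographs block (U+4E00..U+9FFF).
# Assumption: rarer characters in the extension blocks are not needed.
HANZI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def hanzi_runs(text: str) -> list[str]:
    """Return every unbroken run of Chinese characters found in text."""
    return HANZI_RUN.findall(text)


if __name__ == "__main__":
    for run in hanzi_runs("Wǒ de míngzi shì 我的名字是 _____。"):
        print(run)
```

Because punctuation such as 。 lies outside that block, this returns the characters only, not whole sentences.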

 

Posted

If the English part contains no non-ASCII characters (e.g. é, ©, —) and the Chinese part does not use English punctuation (it shouldn't) or Arabic numerals (it might), then a quick-and-dirty Search & Replace in an editor with regular expression support (e.g. Notepad++) will do:

    search for "[ -~]+"

    replace with ""

This removes all printable ASCII characters (0x20 through 0x7E) from the text, so make a backup copy of the original beforehand.
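
The same replacement can be tried outside an editor. The following is a rough Python sketch of it, not a Notepad++ feature; the function name `strip_printable_ascii` is made up for this example.

```python
import re

# Same character class as the search pattern above: space (0x20) to tilde (0x7E).
PRINTABLE_ASCII = re.compile(r"[ -~]+")


def strip_printable_ascii(text: str) -> str:
    """Delete every run of printable ASCII characters; keep everything else."""
    return PRINTABLE_ASCII.sub("", text)


if __name__ == "__main__":
    print(strip_printable_ascii("This is my name card. 这是我的名片。"))
    # Letters with tone marks are not ASCII, so they are left behind:
    print(strip_printable_ascii("Zhè shì 这是"))
```

Line breaks are not in the 0x20 to 0x7E range, so the line structure of the file is preserved. Accented letters are not in that range either, which is why text containing them does not come out clean.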

 

Usually there is something you can look for to separate the English part from the Chinese part. For example, a delimiter may be used (a pair of parentheses, two slashes, four spaces, a tab, a dash, or whatever).

Or if the Chinese is followed immediately by English, like this:

 这是一个中文句子。This is an English sentence.

then you can discard anything beyond the last Chinese punctuation mark that can end a sentence (。!?…—”」).
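
A minimal Python sketch of that idea follows. It assumes the Chinese sentence enders are the full-width forms (。！？ and so on) and that the English part never uses any of them; the names `CHINESE_END_MARKS` and `keep_chinese_prefix` are made up for this example.

```python
# Assumed set of Chinese marks that can end a sentence (full-width forms).
CHINESE_END_MARKS = "。！？…—”」"


def keep_chinese_prefix(line: str) -> str:
    """Keep the line up to and including its last Chinese end mark.

    Returns an empty string when the line contains none of the marks.
    """
    # rfind gives -1 for a missing mark, so the slice below becomes line[:0].
    last = max(line.rfind(mark) for mark in CHINESE_END_MARKS)
    return line[: last + 1]


if __name__ == "__main__":
    print(keep_chinese_prefix("这是一个中文句子。This is an English sentence."))
```

If the English text itself contains a dash (—) or a curly closing quote, the cut lands too late, so the set of marks would need trimming for such files.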

Quite simple, but we need to see your sample text to know for sure.

 

Posted

Thank you publius for your suggestion!! :)

 

Here is a sample of the text. Not too unusual:

 

May I introduce myself?

Wǒ kěyǐ jièshào wǒ zìjǐ ma? 我可以介绍我自己吗?

(Woh ker-ee jeh-shou woh-dzu-jee mah)

My name is _____.

Wǒ de míngzi shì _____.

(Woh-der meeng-dzu shr ____) 我的名字是 _____。

This is my name card.

Zhè shì wǒ de míngpiàn.

(Jur shr woh-der meeng pee-an) 这是我的名片。

 

I tried your search and replace in Notepad++ (entering [ -~]+ in the search box after pressing Ctrl+F), but it did not find any entries. Not quite sure what I am doing wrong?

Posted

You need to change "Search mode" to "Regular expression" at the lower left, then

1)

Find ".* (?=[^ -ɏ])"

Replace with ""

2)

Find "^[ -ɏ\r\n]"

Replace with ""
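
For anyone who would rather script the two steps than use the editor, here is a rough Python approximation. It is a sketch of the intended result, not an exact port of how Notepad++ applies the patterns: step 1 reuses the first pattern on each line, while step 2 is replaced by a plain filter that drops empty lines and lines starting with a character in the space-to-ɏ (U+024F) range. The names `LATIN_RANGE` and `extract_chinese_lines` are made up for this example.

```python
import re

# Space through U+024F (ɏ): ASCII plus accented Latin letters, pinyin included.
LATIN_RANGE = " -\u024f"

# Step 1: everything up to the last space that is followed by a
# character outside the Latin range.
LEADING_LATIN = re.compile(rf".* (?=[^{LATIN_RANGE}])")
# Step 2 (approximated): a line that starts with a Latin-range character.
STARTS_LATIN = re.compile(rf"[{LATIN_RANGE}]")


def extract_chinese_lines(text: str) -> list[str]:
    """Return only the Chinese parts of a mixed English/pinyin/Chinese text."""
    kept = []
    for line in text.splitlines():
        line = LEADING_LATIN.sub("", line, count=1)
        # Skip empty lines and lines that still begin with Latin text.
        if line and not STARTS_LATIN.match(line):
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = (
        "This is my name card.\n"
        "Zhè shì wǒ de míngpiàn.\n"
        "(Jur shr woh-der meeng pee-an) 这是我的名片。\n"
    )
    print("\n".join(extract_chinese_lines(sample)))
```

Because the range runs up to U+024F, letters such as ǒ and è count as Latin here, so pure pinyin lines are dropped along with the English ones.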

 

Posted

Sounds like a job for UNIX command-line tools... are you using a Mac or a Linux machine by any chance?

 

Posted

Hello Publius,

 

thank you so much again (beats me how anyone can know this).

 

Initially, your method worked. The search terms in your second post identified the Mandarin characters (which is helpful). The search term in your first post worked like a charm to delete the normal English words, but left the pinyin letters with tone marks undeleted.

Somehow, however, after trying it again, neither of the search terms works anymore (see picture: unfortunately, I have the German version, but it says "invalid expression").

example.jpg

Posted

@Publius: somehow your search term "^[ -ɏ\r\n]" works on and off.  The other ".* (?=[^ -ɏ])" still produces an error. At the moment I am at the stage of having Hanzi and lines with pinyin tone marks (e.g. " è  ì  ĭ  í  à  à"). Any idea how to get rid of the pinyin marks?

 

Thanks! :))  

Posted

Well, did you do step 2)? It should remove any line that begins with a Latin character, which includes the lines of pinyin with tone marks.

 

Posted

@Publius: My bad. Made a silly mistake. It works! You really helped me a lot!!!! :))

 

Thanks man!!!!
