Jan Finster Posted May 17, 2019 at 09:38 AM I wonder if there is an automatic way to extract only the Chinese sentences from a text file (.txt) that contains both English and Chinese. (I was thinking of searching for a certain Unicode range, but this all seems way above my head.) Any ideas?
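The Unicode-range idea in the question can be sketched in Python. This is only a minimal illustration, assuming that "Chinese" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) plus CJK and fullwidth punctuation; the function name `extract_chinese` is hypothetical, and rarer characters outside these ranges would be missed.

```python
import re

# Assumption: Han characters live in U+4E00-U+9FFF; CJK punctuation in
# U+3000-U+303F; fullwidth forms (e.g. fullwidth "?") in U+FF00-U+FFEF.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff\u3000-\u303f\uff00-\uffef]+")
HAN = re.compile(r"[\u4e00-\u9fff]")


def extract_chinese(text):
    """Return runs of Chinese text that contain at least one Han character."""
    return [run for run in CHINESE_RUN.findall(text) if HAN.search(run)]


line = "This is my name card. Zhè shì wǒ de míngpiàn. 这是我的名片。"
print(extract_chinese(line))  # ['这是我的名片。']
```

Pinyin letters with tone marks fall in the Latin blocks, so they are not picked up by these ranges.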
Publius Posted May 17, 2019 at 12:00 PM If the English part contains no non-ASCII characters (e.g. é, €, ©, —) and the Chinese part does not use English punctuation (it shouldn't) or Arabic numerals (it might), then a quick-n-dirty Search & Replace in an editor with regular expression support (e.g. Notepad++) will do: search for "[ -~]+" and replace with "" (nothing). This removes all printable ASCII characters (0x20 through 0x7E) from the text, so make a backup copy of the original beforehand. Usually there is something you can look for to separate the English part from the Chinese part. For example, a delimiter may be used (a pair of parentheses, two slashes, four spaces, a tab, a dash, or whatever). Or if the Chinese is followed immediately by English, like this: 这是一个中文句子。This is an English sentence. then you can discard anything beyond the last Chinese punctuation mark that can end a sentence (。!?…—”」). Quite simple, but we need to see your sample text to know for sure.
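For anyone who prefers a script to an editor, the two ideas in this post can be sketched in Python as follows. This is a rough sketch, not a tested tool: the function names are hypothetical, and the set of sentence-ending marks assumes fullwidth Chinese punctuation, so adjust it to your text.

```python
import re

# Printable ASCII, 0x20 (space) through 0x7E (~), as in the "[ -~]+" search.
PRINTABLE_ASCII = re.compile(r"[ -~]+")

# Assumed set of Chinese sentence-ending marks (fullwidth forms).
CHINESE_END = "。！？…—”」"


def strip_printable_ascii(text):
    """Delete every run of printable ASCII; line breaks are left in place."""
    return PRINTABLE_ASCII.sub("", text)


def trim_after_last_chinese_mark(line):
    """Discard everything after the last Chinese sentence-ending mark, if any."""
    last = max(line.rfind(mark) for mark in CHINESE_END)
    return line if last == -1 else line[: last + 1]


sample = "这是一个中文句子。This is an English sentence."
print(strip_printable_ascii(sample))         # 这是一个中文句子。
print(trim_after_last_chinese_mark(sample))  # 这是一个中文句子。
```

Note that the first function leaves accented pinyin letters behind, because they are not ASCII.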
Jan Finster Posted May 17, 2019 at 01:25 PM Thank you, Publius, for your suggestion! Here is a sample of the text. Not too unusual: May I introduce myself? Wǒ kěyǐ jièshào wǒ zìjǐ ma? 我可以介绍我自己吗? (Woh ker-ee jeh-shou woh-dzu-jee mah) My name is _____. Wǒ de míngzi shì _____. (Woh-der meeng-dzu shr ____) 我的名字是 _____。 This is my name card. Zhè shì wǒ de míngpiàn. (Jur shr woh-der meeng pee-an) 这是我的名片。 I tried your search and replace in Notepad++ (entering "[ -~]+" in the search box after Ctrl+F), but it did not find any matches. Not quite sure what I am doing wrong?
Publius Posted May 17, 2019 at 02:03 PM You need to change "Search mode" to "Regular expression" at the lower left, then: 1) Find ".* (?=[^ -ɏ])" and replace with "". 2) Find "^[ -ɏ\r\n]" and replace with "".
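The two steps can also be approximated in Python. The sketch below reuses the step 1 pattern as posted (ɏ is U+024F, the end of Latin Extended-B). Step 2 is an interpretation based on a later reply in this thread, which says it should remove any line that does not begin with a non-Latin character. The function names and the line layout of the sample are assumptions.

```python
import re

# Step 1 pattern from the post: everything up to a space that is followed by
# a character outside U+0020-U+024F (ASCII through Latin Extended-B).
LEADING_LATIN = re.compile(r".* (?=[^ -\u024f])")


def starts_with_latin(line):
    """True if the first character lies in U+0020-U+024F."""
    return 0x20 <= ord(line[0]) <= 0x24F


def chinese_only(text):
    """Apply both steps line by line and return the surviving lines."""
    kept = []
    for line in text.splitlines():
        line = LEADING_LATIN.sub("", line)        # step 1
        if line and not starts_with_latin(line):  # step 2 (interpretation)
            kept.append(line)
    return kept


# Hypothetical line layout for part of the sample text.
sample = (
    "This is my name card.\n"
    "Zhè shì wǒ de míngpiàn. 这是我的名片。\n"
    "(Jur shr woh-der meeng pee-an)\n"
)
print(chinese_only(sample))  # ['这是我的名片。']
```

Because pinyin letters with tone marks fall inside U+0020 to U+024F, lines of pinyin are dropped together with the English ones.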
mungouk Posted May 17, 2019 at 02:04 PM Sounds like a job for UNIX command-line tools... Are you using a Mac or a Linux machine by any chance?
Jan Finster Posted May 17, 2019 at 02:23 PM Hello Publius, thank you so much again (beats me how anyone can know this). Initially, your method worked. The search terms in your second post identified the Mandarin characters (which is helpful). The search term in your first post worked like a charm to delete the normal English words, but left the pinyin (tone marks) undeleted. Somehow, however, after trying it again, neither of the search terms works anymore (see picture: unfortunately, I have the German version, but it says "invalid expression").
Jan Finster Posted May 17, 2019 at 02:30 PM @Publius: Somehow your search term "^[ -ɏ\r\n]" works on and off. The other one, ".* (?=[^ -ɏ])", still produces an error. At the moment I am at the stage of having Hanzi and lines with pinyin tone marks (e.g. " è ì ĭ í à à"). Any idea how to get rid of the pinyin marks? Thanks! :))
Publius Posted May 17, 2019 at 02:30 PM Well, did you do step 2)? It should remove any line that does not begin with a non-Latin character.
Jan Finster Posted May 17, 2019 at 02:35 PM @Publius: My bad. Made a silly mistake. It works! You really helped me a lot! :)) Thanks, man!
Publius Posted May 17, 2019 at 02:37 PM Okay. No problem.
lokaiwen Posted May 18, 2020 at 06:09 AM It worked easily. Thanks.