How could I insert “||” right before the first Chinese character of each line?

December 15, 2010 at 03:24 AM

Below is a sample of the TXT-format terminology list I was trying to import into my Déjà Vu terminology database, without success. The problem seems to be that a separator is missing between each English term and its Chinese equivalent. The list is very large, about 20,000 terms, so I don’t want to add the “||” mark one by one; that could take me months! Do you have any ideas for a solution? Thanks!

English 中文

abatement 消减

abatement notice 消减噪音通知书

abattoir 屠场

abrasive cleaning 研磨去污

December 15, 2010 at 03:42 AM

Open your favorite word processing program, like Open Office.

Set it to show all characters. You'll see there's a space between the English and Chinese words. Go to Find and Replace and replace all instances of [space] with [space]||[space]. It works, but you'll note that since you've also got some lines with two English words and a space between them, you also get [space]||[space] added there, and you don't want that. So you'll have to go through the alphabet and replace all [space]||[space]a with [space]a, then [space]||[space]b with [space]b, etc. Cumbersome, but easier than dealing with 20,000 listings individually.
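For anyone who would rather script it, the same two-pass idea can be sketched in Python. This is only an illustration under the same assumptions as above (a single space before the Chinese, English words starting with a lowercase letter); the function name `two_pass` is made up for the example.

```python
import string

def two_pass(line: str) -> str:
    # Pass 1: put the separator at every space.
    marked = line.replace(" ", " || ")
    # Pass 2: undo it wherever the next character is a lowercase letter a-z,
    # one letter at a time, like the manual Find and Replace.
    for letter in string.ascii_lowercase:
        marked = marked.replace(f" || {letter}", f" {letter}")
    return marked

print(two_pass("abatement notice 消减噪音通知书"))
```

English words that start with a capital letter or a digit keep the stray separator in this sketch, so they would need extra passes of their own.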

Also note that "saving" a file like this as a .txt file in OpenOffice seems to destroy the Chinese characters. I got around this by cutting and pasting the final OpenOffice text to a notepad file, then saving the notepad file as Unicode.

December 15, 2010 at 04:19 AM

889, What version of Open Office do you use that destroys your characters when you save as .txt?

December 15, 2010 at 04:22 AM

If you have perl installed on your machine and you know how to work the command prompt, something like this would do the trick:

[user@prompt:~]$ perl -pi -e 's/([a-z])\s+([^a-z\s])/$1 || $2/i' <filename>

where <filename> is the path to your text file. Note that this modifies the file in place, so if you want to see what the output looks like before modifying your file you could do:

[user@prompt:~]$ perl -pe 's/([a-z])\s+([^a-z\s])/$1 || $2/i' <filename>

which would spew all 20,000 modified lines to your terminal.
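If perl isn't available, roughly the same substitution can be written in Python. This is a sketch, not a drop-in replacement: the pattern is copied from the one-liner above, `add_separator` is a name invented here, and the sample lines come from the first post.

```python
import re

# Same pattern as the perl one-liner: a Latin letter, some whitespace, then
# the first character that is neither a Latin letter nor whitespace.
PATTERN = re.compile(r"([a-z])\s+([^a-z\s])", re.IGNORECASE)

def add_separator(line: str) -> str:
    # count=1 mirrors perl's s/// without the /g modifier:
    # only the first match on the line is replaced.
    return PATTERN.sub(r"\1 || \2", line, count=1)

sample = ["abatement 消减", "abatement notice 消减噪音通知书", "abattoir 屠场"]
for line in sample:
    print(add_separator(line))
```

Like the `perl -pe` form, this only prints the result and leaves the original text untouched.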

December 15, 2010 at 04:28 AM

@yezze, almost all of them. If you save as .txt, you get an 8-bit ASCII output. To get unicode, you need to save as something else. I forgot exactly what it's called, I'll find it tomorrow at work.

@kenny2006woo, there is an easier way using a regular expression. Here is an explanation of them for OpenOffice: http://wiki.services...sions_in_Writer . If you've never used regex it looks like nonsense; if you want to wait a bit I'll try to get one for you tomorrow. Offhand you want to search for something like '( [^ ]*)' and replace it with ' || $1' , but I'm sure that's wrong as I never get regex right the first time.

@muirm, wouldn't that replace all groups of spaces with ' || '? Not just the last one?

December 15, 2010 at 04:38 AM

Nope, without the '/g' global modifier it would only replace the first case of a letter followed by some space followed by a non-letter with the letter followed by " || " followed by the non-letter. It would match incorrectly if the english words are separated by more than 1 space, so I will modify it accordingly.

I think the first part of your regular expression would need to be anchored also, otherwise it would put pipes between english words separated by spaces (and that approach assumes the chinese phrases never have spaces).
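To make the anchoring point concrete, here is a small Python sketch. It uses a slight variation of the "space followed by non-spaces" idea (the space is moved outside the group so the output has no doubled spaces), and it assumes the Chinese side contains no spaces. The names `unanchored` and `anchored` are just for illustration.

```python
import re

line = "abatement notice 消减噪音通知书"

# Without an anchor, every " word" on the line matches,
# so a separator also lands between the English words.
unanchored = re.sub(r" ([^ ]*)", r" || \1", line)

# With $ the match must run to the end of the line,
# so only the last space gets the separator.
anchored = re.sub(r" ([^ ]*)$", r" || \1", line)

print(unanchored)
print(anchored)
```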

December 15, 2010 at 04:38 AM

I have OpenOffice 3.1.1.

There may well be a quick way around the problem or a setting I've got wrong, but it was simpler to just cut-and-paste to notepad than try to cure OpenOffice.

December 15, 2010 at 04:51 AM

@muirm, ah, right, see that now. But I believe mine was anchored with the $ for end of line? But you're right, I was assuming no spaces in the Chinese.

December 15, 2010 at 04:52 AM

Kenny, can you upload the file so this lot can play with it?

I suspect this might actually be possible in Excel, using the SUBSTITUTE function - I think that can be used to do a find and replace on the last occurrence of a string (here, a space) in a cell. Or you could first reverse the contents of the cell, substitute the first space, then reverse back.
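The "reverse, substitute the first space, reverse back" idea is easy to try outside Excel too. Below is a rough Python sketch of it (not an Excel formula); `replace_last_space` is a hypothetical helper name, and it assumes the Chinese part has no spaces in it.

```python
def replace_last_space(line: str) -> str:
    # Reverse the line, so the last space becomes the first one.
    reversed_line = line[::-1]
    # Replace only that first space. " || " reads the same backwards,
    # so it survives the second reversal unchanged.
    replaced = reversed_line.replace(" ", " || ", 1)
    # Reverse back to the original order.
    return replaced[::-1]

print(replace_last_space("abatement notice 消减噪音通知书"))
```

If a Chinese phrase does contain a space, the separator ends up inside the Chinese, which is the weak point of any "last space" approach.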

December 15, 2010 at 04:52 AM

If you have a text editor that can handle regular expressions, you should be able to do exactly what you want in a few seconds without the need to do repeated search and replaces. For example, with vim, you can use the following regex to split the lines at the point just before the first Chinese character on the line:

%s/\([[:print:]]\+\)\([^[:print:]]\)/\1|| \2

Vim is not the easiest of editors to get used to. If you don't mind posting the file here, I can update it for you without any trouble.

@muirm, your regex may produce unintended results if there is any punctuation or there are any digits on the English side of the expression.
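A quick way to see the punctuation and digit problem is to run the letter-based pattern on a couple of made-up entries. The sketch below uses Python's `re` with the same pattern as the perl one-liner; the two sample lines are invented for illustration and are not from the actual term list.

```python
import re

# Letter-based pattern from the perl one-liner earlier in the thread.
letter_based = re.compile(r"([a-z])\s+([^a-z\s])", re.IGNORECASE)

def add_separator(line: str) -> str:
    # Replace only the first match, like s/// without /g.
    return letter_based.sub(r"\1 || \2", line, count=1)

# Punctuation on the English side: "(" is a non-letter, so the
# separator is inserted before it instead of before the Chinese.
with_punctuation = add_separator("noise level (dB) 噪声级")

# English side ending in a digit: no letter sits directly before the
# space, so nothing matches and the line comes back unchanged.
with_digit = add_separator("PM10 可吸入颗粒物")

print(with_punctuation)
print(with_digit)
```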

December 15, 2010 at 04:57 AM

Every time someone mentions regular expressions Arthur C. Clarke comes to mind:

Any sufficiently advanced technology is indistinguishable from magic.

December 15, 2010 at 05:02 AM

Here's the file: 环保.txt. Thank you, comrades, for being so helpful.

If anyone can do it successfully, please let me know how I could do the job myself.

December 15, 2010 at 05:10 AM

Here you go.

To do it yourself, read this book, and then learn to use this editor.

It may take some time

环保.txt

December 15, 2010 at 05:17 AM

Thank you very much, comrade Imron, and thanks to everyone else as well. It's finally sorted out.

December 15, 2010 at 05:35 AM

Ok, I've figured out an easier way to do it without needing to use Vim or learn about regular expressions (you'll still need to copy and paste the ones I provide below though).

First, download Notepad++, which is a relatively user-friendly editor that supports regular expressions.

Open the file you need to edit.

From the menu, select Search->Replace (Ctrl-H)

Make sure the "Search Mode" is set to "Regular Expression" (正则表达式)

In the "Find what" field, copy and paste the following regular expression:

^([\x20-\x7e]+)(.*)$

This will match everything from the beginning of the line up to (but not including) the first non-ASCII character in one group, and everything from the first non-ASCII character to the end of the line in another group.

In the "Replace with" field, copy and paste the following:

\1|| \2

This will then replace the line with the first group of characters, followed by a ||, followed by the second group of characters.

Following this procedure should allow you to update new entries yourself in the future.
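For very large lists, or to repeat the job without opening an editor, the same find/replace can be scripted. The following is a minimal Python sketch built around the regular expression above. The function names, the extra guard for lines with nothing after the ASCII part, and the UTF-8 default are assumptions added for the example, not part of the Notepad++ steps.

```python
import re
from pathlib import Path

# Same pattern as in the "Find what" field: group 1 is the leading run of
# printable ASCII characters, group 2 is the rest of the line.
PATTERN = re.compile(r"^([\x20-\x7e]+)(.*)$")

def add_separator(line: str) -> str:
    match = PATTERN.match(line)
    # Guard (an addition for this sketch): leave blank lines and lines
    # with nothing after the ASCII run exactly as they are.
    if match is None or not match.group(2):
        return line
    # Same as the "Replace with" field: \1|| \2
    return f"{match.group(1)}|| {match.group(2)}"

def convert_file(src: Path, dst: Path, encoding: str = "utf-8") -> None:
    # Read the whole list, fix each line, and write the result to a new file.
    lines = src.read_text(encoding=encoding).splitlines()
    fixed = [add_separator(line) for line in lines]
    dst.write_text("\n".join(fixed) + "\n", encoding=encoding)

print(add_separator("abatement notice 消减噪音通知书"))
```

Something like `convert_file(Path("terms.txt"), Path("terms_fixed.txt"))` would process a whole file (both file names are hypothetical). If the file was saved from Notepad as "Unicode", the encoding is probably UTF-16, so passing `encoding="utf-16"` may be needed.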

December 15, 2010 at 05:38 AM

Thank you so much, Imron. You're a computer genius!

December 15, 2010 at 03:55 PM

Most any text editor that supports regular expression search/replace (e.g. OpenOffice) should work as well, if you already have a favorite, using the search imron suggests.

For OpenOffice, to save Chinese text (well, any non-ASCII text), "Text Encoded" is the format you want.
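Why the encoding choice matters can be shown with a few lines of Python. This is only a sketch of the general effect, not of what OpenOffice does internally: an ASCII-only encoding has no way to represent Chinese characters, while a Unicode encoding such as UTF-8 round-trips them.

```python
text = "abatement || 消减"

# Strict ASCII cannot represent the Chinese characters at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A lossy ASCII conversion swaps each unsupported character for "?".
lossy = text.encode("ascii", errors="replace")

# UTF-8 keeps everything and decodes back to the original text.
round_trip = text.encode("utf-8").decode("utf-8")

print(ascii_ok, lossy, round_trip == text)
```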

December 15, 2010 at 04:00 PM

Imron has fixed the problem for me, but thank you as well.

Kenny同志

889

Yezze

muirm

jbradfor

muirm

889

jbradfor

roddy

imron

roddy

Kenny同志

imron

Kenny同志

imron

Kenny同志

jbradfor

Kenny同志