Chinese-Forums

How could I insert “||” right before the first Chinese character of each line?


Kenny同志


Below is a sample of the TXT-format terminology list I was trying to import into my Déjà Vu terminology database, without success. The problem seems to be that there is no separator between each English term and its Chinese equivalent. The list is very large, about 20,000 terms, so I don’t want to add the “||” mark by hand; that could take me months! Any ideas for a solution? Thanks!

English 中文

abatement 消减

abatement notice 消减噪音通知书

abattoir 屠场

abrasive cleaning 研磨去污


Open your favorite word processing program, like Open Office.

Set it to show all characters. You'll see there's a space between the English and Chinese words. Go to Find and Replace and replace all instances of [space] with [space]||[space]. It works, but you'll note that since you've also got some lines with two English words and a space between them, you also get [space]||[space] added there, and you don't want that. So you'll have to go through the alphabet and replace all [space]||[space]a with [space]a, then [space]||[space]b with [space]b, etc. Cumbersome, but easier than dealing with 20,000 listings individually.
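For anyone who would rather script those two passes than click through them, here's a rough Python sketch of the same idea. The function name is made up, and it assumes the English side uses only plain ASCII letters after each space (it undoes the separator before capitals as well as lowercase letters, which goes slightly beyond the a, b, ... steps described above).

```python
import string


def add_separator_naive(line: str) -> str:
    """Hypothetical helper mirroring the two-pass find-and-replace."""
    # Pass 1: replace every space with " || ".
    out = line.replace(" ", " || ")
    # Pass 2: undo the replacement wherever the next character is an
    # ASCII letter, i.e. where the space sat between two English words.
    for letter in string.ascii_letters:
        out = out.replace(" || " + letter, " " + letter)
    return out


print(add_separator_naive("abatement notice 消减噪音通知书"))
# abatement notice || 消减噪音通知书
```

A term whose second word starts with a digit or punctuation would keep a stray separator, so it's worth spot-checking the result.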

Also note that "saving" a file like this as a .txt file in OpenOffice seems to destroy the Chinese characters. I got around this by cutting and pasting the final OpenOffice text into a Notepad file, then saving the Notepad file as Unicode.


If you have Perl installed on your machine and you know how to work the command prompt, something like this would do the trick:

[user@prompt:~]$ perl -pi -e 's/([a-z])\s+([^a-z\s])/$1 || $2/i' <filename>

where <filename> is the path to your text file. Note that this modifies the file in place, so if you want to see what the output looks like before modifying your file you could do:

[user@prompt:~]$ perl -pe 's/([a-z])\s+([^a-z\s])/$1 || $2/i' <filename>

which would spew all 20,000 modified lines to your terminal.
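If Perl isn't available, roughly the same substitution can be sketched in Python. This is only an approximation of the one-liner, not a drop-in replacement: the function name is invented, and `count=1` is used to mimic Perl's behaviour without the /g modifier (only the first match on each line is replaced).

```python
import re

# Same pattern as the Perl one-liner: a letter, some whitespace,
# then a character that is neither a letter nor whitespace.
PATTERN = re.compile(r"([a-z])\s+([^a-z\s])", re.IGNORECASE)


def add_separator(line: str) -> str:
    """Hypothetical helper: insert ' || ' at the first match only."""
    return PATTERN.sub(r"\1 || \2", line, count=1)


for sample in ["abatement 消减", "abatement notice 消减噪音通知书"]:
    print(add_separator(sample))
# abatement || 消减
# abatement notice || 消减噪音通知书
```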


@yezze, almost all of them. If you save as .txt, you get an 8-bit ASCII output. To get Unicode, you need to save as something else. I forget exactly what it's called; I'll find it tomorrow at work.

@kenny2006woo, there is an easier way using a regular expression. Here is an explanation of them for OpenOffice: http://wiki.services...sions_in_Writer . If you've never used regex, it looks like nonsense; if you want to wait a bit I'll try to get one for you tomorrow. Offhand you want to search for something like '( [^ ]*)' and replace it with ' || $1' , but I'm sure that's wrong as I never get regex right the first time.

@muirm, wouldn't that replace all groups of spaces with ' || '? Not just the last one?


Nope, without the '/g' global modifier it only replaces the first occurrence of a letter followed by some whitespace followed by a non-letter, turning it into the letter, then " || ", then the non-letter. It would match incorrectly if the English words are separated by more than one space, so I will modify it accordingly.

I think the first part of your regular expression would need to be anchored as well; otherwise it would put pipes between English words separated by spaces (and that approach assumes the Chinese phrases never contain spaces).
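To make the anchoring point concrete, here's a small Python sketch. The patterns are adapted from the OpenOffice suggestion rather than copied (the group leaves out the leading space, and Python's `\1` stands in for `$1`), the function names are invented, and it assumes the Chinese side never contains a space.

```python
import re


def unanchored(line: str) -> str:
    # Matches a space plus the run of non-space characters after it,
    # so every word boundary gets a separator.
    return re.sub(r" ([^ ]*)", r" || \1", line)


def anchored(line: str) -> str:
    # The trailing $ forces the match to reach the end of the line,
    # so only the last space-separated chunk gets a separator.
    return re.sub(r" ([^ ]*)$", r" || \1", line)


line = "abatement notice 消减噪音通知书"
print(unanchored(line))  # abatement || notice || 消减噪音通知书
print(anchored(line))    # abatement notice || 消减噪音通知书
```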


Kenny, can you upload the file so this lot can play with it?

I suspect this might actually be possible in Excel, using the SUBSTITUTE function - I think that can be used to do a find and replace on the last occurrence of a string (here, a space) in a cell. Or you could first reverse the contents of the cell, substitute the first space, then reverse back.
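The "last space" idea doesn't depend on Excel. As a hedged illustration in Python (not the spreadsheet formula itself), `str.rpartition` splits a string at its last occurrence of a separator, which does the job under the same assumption that the Chinese side never contains a space. The function name is made up.

```python
def split_at_last_space(line: str) -> str:
    """Hypothetical helper: put ' || ' in place of the last space."""
    english, space, chinese = line.rpartition(" ")
    if not space:
        # No space on this line at all (e.g. a blank line): leave it alone.
        return line
    return f"{english} || {chinese}"


print(split_at_last_space("abrasive cleaning 研磨去污"))
# abrasive cleaning || 研磨去污
```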


If you have a text editor that can handle regular expressions, you should be able to do exactly what you want in a few seconds, without repeated search-and-replace passes. For example, with vim, you can use the following regex to split the lines at the point just before the first Chinese character on the line:

%s/\([[:print:]]\+\)\([^[:print:]]\)/\1|| \2

Vim is not the easiest of editors to get used to. If you don't mind posting the file here, I can update it for you without any trouble.

@muirm, your regex may produce unintended results if there is any punctuation or there are any digits on the English side of the expression.
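One way around the punctuation-and-digits problem is to look for the first Chinese character directly instead of the first non-letter. Here's a minimal Python sketch, assuming every Chinese term starts with a character in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the function name and the "ISO 9001" sample line are made up for illustration.

```python
import re

# Basic CJK Unified Ideographs block. Assumption: the Chinese side
# always begins with a character in this range.
FIRST_CJK = re.compile(r"[\u4e00-\u9fff]")


def separate_before_first_cjk(line: str) -> str:
    """Hypothetical helper: insert ' || ' before the first CJK character."""
    match = FIRST_CJK.search(line)
    if match is None:
        # No Chinese on this line: leave it unchanged.
        return line
    english = line[:match.start()].rstrip()
    return f"{english} || {line[match.start():]}"


print(separate_before_first_cjk("abatement notice 消减噪音通知书"))
# abatement notice || 消减噪音通知书
print(separate_before_first_cjk("ISO 9001 标准"))
# ISO 9001 || 标准
```

Digits and punctuation on the English side are left alone because the split point is decided by the Chinese side only.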


Ok, I've figured out an easier way to do it without needing to use Vim or learn about regular expressions (you'll still need to copy and paste the ones I provide below though).

First, download Notepad++, which is a relatively user-friendly editor that supports regular expressions.

Open the file you need to edit.

From the menu, select Search->Replace (Ctrl-H)

Make sure the "Search Mode" is set to "Regular Expression" (正则表达式)

In the "Find what" field, copy and paste the following regular expression:

^([\x20-\x7e]+)(.*)$

This will match everything from the beginning of the line up to the first non-ASCII character in one group, and everything from the first non-ASCII character to the end of the line in another group.

In the "Replace with" field, copy and paste the following:

\1|| \2

This will then replace the line with the first group of characters, followed by a ||, followed by the second group of characters.

Following this procedure should allow you to update new entries yourself in the future.
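For repeat imports, the same find/replace pair can also be run from a script. Below is a rough Python sketch that reuses the Notepad++ pattern on a whole block of text; the function name is invented, and `re.MULTILINE` is needed so that `^` and `$` apply to each line rather than to the whole string.

```python
import re

# Same pattern as in the "Find what" field: a run of printable ASCII
# (including spaces) at the start of the line, then the rest of the line.
LINE = re.compile(r"^([\x20-\x7e]+)(.*)$", re.MULTILINE)


def add_separators(text: str) -> str:
    """Hypothetical helper: apply the replacement to every line of text."""
    return LINE.sub(r"\1|| \2", text)


sample = "English 中文\n\nabatement 消减\n\nabatement notice 消减噪音通知书\n"
print(add_separators(sample))
```

Blank lines are left untouched because the first group needs at least one character. A line made up entirely of ASCII would still match and get a trailing "|| ", so it's worth checking for stray English-only lines first.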


Almost any text editor that supports regular expression search/replace (e.g. OpenOffice) should work as well, if you already have a favorite, using the search imron suggests.

For OpenOffice, to save Chinese text (well, any non-ASCII text), "Text Encoded" is the format you want.
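The same encoding concern applies if the file is written from a script. As a small Python sketch (the helper names and the file name are hypothetical), passing an explicit encoding keeps the Chinese characters intact, whereas plain ASCII cannot represent them at all.

```python
import tempfile
from pathlib import Path


def save_utf8(text: str, path: Path) -> None:
    # An explicit encoding avoids relying on the platform default.
    path.write_text(text, encoding="utf-8")


def load_utf8(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def fits_in_ascii(text: str) -> bool:
    # True only if every character can be encoded as plain ASCII.
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "terms.txt"  # hypothetical file name
    save_utf8("abattoir || 屠场\n", target)
    print(load_utf8(target) == "abattoir || 屠场\n")  # True

print(fits_in_ascii("abattoir || 屠场"))  # False
```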
