Kenny同志 Posted December 15, 2010 at 03:24 AM
Below is a sample of the TXT-format terminology list I was trying to import into my Déjà Vu terminology database, without success. The problem seems to be that a separator is missing between each English term and its Chinese equivalent. The list is very large, about 20,000 terms, so I don't want to add the "||" mark one by one; that could take me months! Do you have any ideas for a solution? Thanks!
English 中文
abatement 消减
abatement notice 消减噪音通知书
abattoir 屠场
abrasive cleaning 研磨去污
889 Posted December 15, 2010 at 03:42 AM
Open your favorite word processor, such as OpenOffice, and set it to show all characters. You'll see there's a space between the English and Chinese words. Go to Find and Replace and replace all instances of [space] with [space]||[space]. It works, but since some lines have two English words with a space between them, [space]||[space] gets added there too, and you don't want that. So you'll have to go through the alphabet and replace all [space]||[space]a with [space]a, then [space]||[space]b with [space]b, etc. Cumbersome, but easier than dealing with 20,000 listings individually.
Also note that saving a file like this as a .txt file in OpenOffice seems to destroy the Chinese characters. I got around this by cutting and pasting the final OpenOffice text into a Notepad file, then saving the Notepad file as Unicode.
Yezze Posted December 15, 2010 at 04:19 AM
889, which version of OpenOffice do you use that destroys your characters when you save as .txt?
muirm Posted December 15, 2010 at 04:22 AM
If you have perl installed on your machine and you know how to work the command prompt, something like this would do the trick:
    [user@prompt:~]$ perl -pi -e 's/([a-z])\s+([^a-z\s])/$1 || $2/i' <filename>
where <filename> is the path to your text file. Note that this modifies the file in place, so if you want to see what the output looks like before modifying your file you could do:
    [user@prompt:~]$ perl -pe 's/([a-z])\s+([^a-z\s])/$1 || $2/i' <filename>
which would spew all 20,000 modified lines to your terminal.
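For readers without perl, the same substitution can be sketched in Python. This is only an illustration of the one-liner above, not something posted in the thread; the function name `insert_pipes` is made up for the example.

```python
import re

# A Latin letter, then whitespace, then a character that is neither a
# Latin letter nor whitespace (for this file, the first Chinese character).
PATTERN = re.compile(r"([a-z])\s+([^a-z\s])", re.IGNORECASE)


def insert_pipes(line: str) -> str:
    """Insert ' || ' at the first letter/whitespace/non-letter boundary.

    count=1 mirrors the Perl one-liner, which has no /g modifier and
    therefore replaces only the first match on each line.
    """
    return PATTERN.sub(r"\1 || \2", line, count=1)


if __name__ == "__main__":
    for sample in ["abatement 消减", "abatement notice 消减噪音通知书"]:
        print(insert_pipes(sample))
```

Like the perl version, this treats any non-letter as the start of the Chinese side, so punctuation or digits after a space in the English term would trigger an early match.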
jbradfor Posted December 15, 2010 at 04:28 AM
@yezze, almost all of them. If you save as .txt, you get 8-bit ASCII output. To get Unicode, you need to save as something else. I forget exactly what it's called; I'll find it tomorrow at work.
@kenny2006woo, there is an easier way using a regular expression. Here is an explanation of them for OpenOffice: http://wiki.services...sions_in_Writer . If you've never used regex it looks like nonsense; if you want to wait a bit I'll try to get one for you tomorrow. Offhand you want to search for something like '( [^ ]*)' and replace it with ' || $1' , but I'm sure that's wrong as I never get regex right the first time.
@muirm, wouldn't that replace all groups of spaces with ' || '? Not just the last one?
muirm Posted December 15, 2010 at 04:38 AM
Nope, without the '/g' global modifier it only replaces the first case of a letter followed by some space followed by a non-letter, turning it into the letter followed by " || " followed by the non-letter. It would match incorrectly if the English words are separated by more than 1 space, so I will modify it accordingly.
I think the first part of your regular expression would need to be anchored also, otherwise it would put pipes between English words separated by spaces (and that approach assumes the Chinese phrases never have spaces).
889 Posted December 15, 2010 at 04:38 AM
I have OpenOffice 3.1.1. There may well be a quick way around the problem or a setting I've got wrong, but it was simpler to just cut-and-paste to Notepad than try to cure OpenOffice.
jbradfor Posted December 15, 2010 at 04:51 AM
@muirm, ah, right, I see that now. But I believe mine was anchored with the $ for end of line? But you're right, I was assuming no spaces in the Chinese.
roddy Posted December 15, 2010 at 04:52 AM
Kenny, can you upload the file so this lot can play with it? I suspect this might actually be possible in Excel, using the SUBSTITUTE function. I think that can be used to do a find and replace on the last occurrence of a string (here, a space) in a cell. Or you could first reverse the contents of the cell, substitute the first space, then reverse back.
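The "replace only the last space" idea can also be sketched outside Excel. The Python below is a hypothetical illustration (the name `split_on_last_space` is invented here), and it assumes, as discussed earlier in the thread, that the Chinese side never contains a space.

```python
def split_on_last_space(line: str) -> str:
    """Replace the last space on the line with ' || '.

    Assumes the Chinese equivalent contains no spaces, so the last space
    is always the boundary between the English term and the Chinese term.
    """
    stripped = line.rstrip()  # drop the trailing newline and stray spaces
    english, sep, chinese = stripped.rpartition(" ")
    if not sep:
        # No space on this line: nothing to split, leave it alone.
        return stripped
    return f"{english} || {chinese}"


if __name__ == "__main__":
    print(split_on_last_space("abrasive cleaning 研磨去污"))
```

`str.rpartition` searches from the right, so it does the same job as reversing the cell, substituting the first space, and reversing back.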
imron Posted December 15, 2010 at 04:52 AM
If you have a text editor that can handle regular expressions, you should be able to do exactly what you want in a few seconds without the need for repeated search and replaces. For example, with vim, you can use the following regex to split the lines at the point just before the first Chinese character on the line:
    %s/\([[:print:]]\+\)\([^[:print:]]\)/\1|| \2
Vim is not the easiest of editors to get used to. If you don't mind posting the file here, I can update it for you without any trouble.
@muirm, your regex may produce unintended results if there is any punctuation or any digits in the English side of the expression.
roddy Posted December 15, 2010 at 04:57 AM
Every time someone mentions regular expressions Arthur C. Clarke comes to mind: "Any sufficiently advanced technology is indistinguishable from magic."
Kenny同志 Posted December 15, 2010 at 05:02 AM
Here's the file: 环保.txt
Thank you all for being so helpful, comrades. If anyone can do it successfully, please let me know how I could do the job myself.
imron Posted December 15, 2010 at 05:10 AM
Here you go: 环保.txt
To do it yourself, read this book, and then learn to use this editor. It may take some time.
Kenny同志 Posted December 15, 2010 at 05:17 AM
Thank you very much, Comrade Imron, and thanks to everyone. It's finally sorted.
imron Posted December 15, 2010 at 05:35 AM
Ok, I've figured out an easier way to do it without needing to use Vim or learn about regular expressions (you'll still need to copy and paste the ones I provide below though).
1. Download Notepad++, which is a relatively user-friendly editor that supports regular expressions.
2. Open the file you need to edit.
3. From the menu, select Search->Replace (Ctrl-H).
4. Make sure the "Search Mode" is set to "Regular Expression" (正则表达式).
5. In the "Find what" field, copy and paste the following regular expression:
    ^([\x20-\x7e]+)(.*)$
   This will match everything from the beginning of the line to the first non-ASCII character in one group, and everything from the first non-ASCII character to the end of the line in another group.
6. In the "Replace with" field, copy and paste the following:
    \1|| \2
   This will then replace the line with the first group of characters, followed by a ||, followed by the second group of characters.
Following this procedure should allow you to update new entries yourself in the future.
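The same find/replace pair can be applied from a script for anyone who would rather not install an editor. The following is a rough Python sketch reusing the two expressions above; the function names, the file paths passed in, and the default `utf-8` encoding are assumptions for illustration, not details from the thread.

```python
import re

# Same expression as the "Find what" field: a run of printable ASCII
# (including the trailing space), then the rest of the line.
PATTERN = re.compile(r"^([\x20-\x7e]+)(.*)$")


def split_at_first_non_ascii(line: str) -> str:
    """Apply the replacement \\1|| \\2 to a single line."""
    return PATTERN.sub(r"\1|| \2", line.rstrip("\r\n"))


def convert_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Write a converted copy of src to dst.

    The encoding is an assumption; a file saved as "Unicode" from
    Notepad would need encoding="utf-16" instead.
    """
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(split_at_first_non_ascii(line) + "\n")


if __name__ == "__main__":
    print(split_at_first_non_ascii("abatement notice 消减噪音通知书"))
```

Because the first group swallows the space before the Chinese text, the replacement only has to supply the space after the pipes. Lines that start with a non-ASCII character do not match and pass through unchanged.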
Kenny同志 Posted December 15, 2010 at 05:38 AM
Thank you so much, Imron. You're a computer genius!
jbradfor Posted December 15, 2010 at 03:55 PM
If you already have a favorite text editor that supports regular expression search/replace (e.g. OpenOffice), it should work as well, using the search imron suggests. For OpenOffice, to save Chinese text (well, any non-ASCII text), "Text Encoded" is the format you want.
Kenny同志 Posted December 15, 2010 at 04:00 PM
Imron has fixed the problem for me, but thank you as well.