adamlau Posted September 12, 2005 at 05:38 AM
I want to rename the following: 阿爾及利亞 阿尔及利亚 ...to: 阿爾及利亞 [阿尔及利亚] ...can someone give me a regexp to use?
roddy Posted September 12, 2005 at 10:26 AM
Wow, regexp with Chinese characters. There's something I wouldn't want to have to try. Are all your strings exactly in that format? Could you just a) add a ] to the end of each string and then b) add a [ after the space between the trad / simple characters? Roddy
trevelyan Posted September 12, 2005 at 12:22 PM
Roddy's plan is best. It's not a language issue so much as a question of which programming language you're using, and what encoding you're using for your Chinese characters. PHP is a mess at handling Unicode and still has limited and experimental support for non-ASCII functions. The reason is that once you shift to Unicode you get a lot of variable-length characters -- so the fundamental parsing engine needs to be overhauled. Last I checked, Perl does regexp decently on GB2312 (fixed-length), but has trouble with Unicode. There are some new libraries there which might help though. Advice: find a language that allows you to do regexp on Unicode, and then convert any content to that encoding before making any of the changes. I know IBM has a dedicated library in C++ that will do regexp on Unicode strings, but that may be overkill.
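As a rough illustration of the "convert first, then match" advice, here is a minimal Python sketch. It assumes the data arrives as raw bytes in either GB2312 or UTF-8; the helper name `to_text` is made up for this example. Once the bytes are decoded to a Unicode string, a regex `.` matches one whole character regardless of how many bytes it took on disk.

```python
import re

def to_text(raw: bytes, encoding: str) -> str:
    """Decode raw bytes into a Unicode string (hypothetical helper)."""
    return raw.decode(encoding)

# The same five-character word stored in two different encodings.
raw_gb = "阿尔及利亚".encode("gb2312")   # 2 bytes per character
raw_utf8 = "阿尔及利亚".encode("utf-8")  # 3 bytes per character here

text = to_text(raw_gb, "gb2312")

# After decoding, both byte strings give the same Unicode text,
# and the regex engine sees characters rather than bytes.
chars = re.findall(r".", text)
```

Here `len(raw_gb)` and `len(raw_utf8)` differ, but `chars` has one entry per Chinese character either way.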
chinesetools Posted September 12, 2005 at 04:01 PM
With Perl 5.8 you can treat each Chinese character as one unit, if you are careful how you load them from file. In that case, a simple regex would do: s/^(\S+\s)(\S+)/$1[$2]/; See http://www.chinesecomputing.com/programming/perl.html for some other possibilities.
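For readers without Perl, the same idea can be sketched in Python. This assumes the substitution is meant to capture a first run of non-whitespace characters plus one whitespace character, then bracket the second run; the function name `bracket_second` is invented for illustration, and the input is assumed to be already decoded to a Unicode string.

```python
import re

# First group: a run of non-whitespace followed by one whitespace character.
# Second group: the next run of non-whitespace, which gets bracketed.
PAIR_RE = re.compile(r"^(\S+\s)(\S+)")

def bracket_second(line: str) -> str:
    """Wrap the second whitespace-separated word of a line in [ ]."""
    return PAIR_RE.sub(r"\1[\2]", line, count=1)

def bracket_all(lines):
    """Apply bracket_second to every line of an iterable of strings."""
    return [bracket_second(line) for line in lines]
```

To run this over a whole dictionary file, one option is to open it with `open(path, encoding="utf-8")` and pass the file object to `bracket_all`; lines that do not match the pattern are returned unchanged.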
adamlau (Author) Posted September 13, 2005 at 09:47 PM
"a) add a ] to the end of each string and then b) add a [ after the space between the trad / simple characters?" The problem is that there are 29,079 entries in the latest CEDICT UTF-8 database. Would rather use a regexp... "s/^(\S+\s)(\S+)/$1[$2]/;" Now how would I include this in a replace command? Replace: s/^(\S+\s)(\S+) With This: $1[$2] Is that correct?
adamlau (Author) Posted October 10, 2005 at 08:51 AM
I still have not figured out how to transform: 爱沙尼亚 愛沙尼亞 ai4 sha1 ni2 ya4 Estonia (trad word)(simp word)(pinyin)(definition) to: 爱沙尼亚 [愛--亞] ai4 sha1 ni2 ya4 Estonia (simp word)([trad word with - replacements])(pinyin)(definition) Can someone give me a nice regexp to use? The above examples were great, but I could not apply them successfully...
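A single regex replacement cannot easily compare the two words character by character, so one possible approach is a regex to split the line plus a short loop to do the comparison. The Python sketch below follows the example line literally: it keeps the first word, and in the second word replaces every character that is identical to the one at the same position in the first word with "-". The name `mask_shared` is hypothetical, and the behaviour for words of different lengths (bracket without masking) is an assumption, since the thread does not cover that case.

```python
import re

# word 1, word 2, then everything else (pinyin and definition).
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.*)$")

def mask_shared(line: str) -> str:
    """Bracket the second word, replacing characters shared with the first by '-'."""
    m = LINE_RE.match(line)
    if m is None:
        return line  # leave lines in an unexpected format untouched
    first, second, rest = m.groups()
    if len(first) != len(second):
        # Assumption: with mismatched lengths, just bracket the word as-is.
        return f"{first} [{second}] {rest}"
    masked = "".join("-" if a == b else b for a, b in zip(first, second))
    return f"{first} [{masked}] {rest}"
```

Calling `mask_shared` on the example line from the post should give the requested form; strip trailing newlines from each line before passing it in.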
Konglong Posted October 10, 2005 at 01:03 PM
Sometimes the easiest approach is to take your Chinese text, convert it to decimal codes (&#number;) in Wenlin (the demo has this function for free), and use regex on the numbers instead. Works well. If you want to spend the money, PowerGREP 3 works with double-byte characters now. I highly recommend this program to anyone who uses regex on a frequent basis. J
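The decimal-code workaround can also be done without Wenlin. The Python sketch below assumes the "decimal codes" are HTML-style numeric character references such as `&#20122;`; both function names are invented for this example. Every non-ASCII character is replaced by its decimal code point, so an ASCII-only regex tool can process the text, and the result can be converted back afterwards.

```python
import re

def to_decimal_refs(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference like &#20122;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

def from_decimal_refs(text: str) -> str:
    """Turn decimal references back into the characters they stand for."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)
```

Note that the round trip is only safe if the original text does not already contain literal sequences of the form `&#digits;`.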