Chinese Character Syntax For RegExps?

September 12, 2005 at 05:38 AM

I want to rename the following:

阿爾及利亞 阿尔及利亚

...to:

阿爾及利亞 [阿尔及利亚]

...can someone give me a regexp to use?

September 12, 2005 at 10:26 AM

Wow, regexp with Chinese characters. There's something I wouldn't want to have to try.

Are all your strings exactly in that format? Could you just

a) add a ] to the end of each string and then

b) add a [ after the space between the trad / simple characters?
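The two-step plan above needs no regexp at all. A minimal Python 3 sketch, assuming every entry is exactly two words separated by a single space (the function name `add_brackets` is made up for illustration):

```python
def add_brackets(entry: str) -> str:
    """Insert '[' after the first space and append ']' to the end."""
    head, sep, tail = entry.partition(" ")
    if not sep:
        # No space found: leave the entry unchanged rather than guess.
        return entry
    return f"{head} [{tail}]"


print(add_brackets("阿爾及利亞 阿尔及利亚"))
```

Entries that do not match the assumed two-word format are returned untouched, so they can be checked by hand afterwards.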

Roddy

September 12, 2005 at 12:22 PM

Roddy's plan is best.

It's not a language issue so much as a question of which programming language you're using, and what encoding you're using for your Chinese characters. PHP is a mess at handling Unicode and still has only limited, experimental support for non-ASCII text in its string functions. The reason is that once you shift to Unicode you get a lot of variable-length characters -- so the fundamental parsing engine needs to be overhauled.

Last I checked, Perl does regexps decently on GB2312 (fixed-length), but has trouble with Unicode. There are some new libraries there which might help, though. Advice: find a language that allows you to do regexps on Unicode, and then convert any content to that encoding before making any of the changes. I know IBM has a dedicated C++ library that will do regexps on Unicode strings, but that may be overkill.
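The "convert first, then regexp" advice can be sketched in Python 3 roughly as follows. This assumes the source data is GB2312-encoded bytes; the sample string and the helper name `decode_entries` are made up for illustration.

```python
import re


def decode_entries(raw: bytes, encoding: str = "gb2312") -> str:
    """Decode legacy-encoded bytes into a Unicode string (hypothetical helper)."""
    return raw.decode(encoding)


# Made-up sample data standing in for the contents of a GB2312-encoded file.
raw = "阿尔及利亚 中国".encode("gb2312")
text = decode_entries(raw)

# After decoding, each Chinese character is one unit for the regexp engine,
# so \S{5} matches exactly five characters regardless of their byte length.
match = re.fullmatch(r"(\S{5})\s(\S{2})", text)

# Re-encode as UTF-8 once all changes are done.
utf8_bytes = text.encode("utf-8")
```

The point is only that the regexp runs on decoded text, never on raw bytes; which encodings the data actually uses has to be checked separately.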

September 12, 2005 at 04:01 PM

With Perl 5.8 you can treat Chinese characters as one unit, if you are careful how you load them from file. In that case, a simple regex would do:

s/^(\S+\s)(\S+)/$1\[$2\]/;

See http://www.chinesecomputing.com/programming/perl.html for some other possibilities.
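For comparison, here is a minimal Python 3 sketch of the same kind of substitution. It assumes the text has already been decoded to a Unicode `str` and that the two forms are separated by one whitespace character; the function name `bracket_second_word` is made up for illustration.

```python
import re

# Assumption: each entry is "<traditional> <simplified>", separated by whitespace.
PATTERN = re.compile(r"^(\S+\s)(\S+)")


def bracket_second_word(entry: str) -> str:
    """Wrap the second whitespace-delimited word in square brackets."""
    # \1 and \2 refer to the two captured groups; '[' and ']' are literal here.
    return PATTERN.sub(r"\1[\2]", entry, count=1)


print(bracket_second_word("阿爾及利亞 阿尔及利亚"))
```

In Python 3, `\S` and `\s` are Unicode-aware for `str` patterns, so each Chinese character counts as a single non-whitespace character.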

September 13, 2005 at 09:47 PM

a) add a ] to the end of each string and then

b) add a [ after the space between the trad / simple characters?

The problem is that there are 29,079 entries in the latest CEDICT UTF-8 database. Would rather use a regexp...

s/^(\S+\s)(\S+)/$1\[$2\]/;

Now how would I include this in a replace command?

Replace:

s/^(\S+\s)(\S+)

With This:

$1[$2]

Is that correct?
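One way to picture the split: the `s/.../.../` wrapper is Perl syntax, so in a separate search/replace setting only the pattern goes in the search part and only the replacement goes in the other. A hedged Python 3 sketch of that split, applied to many lines at once; the names `SEARCH`, `REPLACE` and `replace_all` are made up, and group references are written `\1` here, whereas other tools may expect `$1`:

```python
import re

# Search part only: no leading "s/". [ \t] is used instead of \s so that,
# in multi-line mode, the separator can never match a line break.
SEARCH = r"^(\S+[ \t])(\S+)"

# Replacement part only: group references plus the literal brackets.
REPLACE = r"\1[\2]"


def replace_all(text: str) -> str:
    """Apply the substitution to the start of every line in `text`."""
    return re.sub(SEARCH, REPLACE, text, flags=re.MULTILINE)


sample = "阿爾及利亞 阿尔及利亚\n愛沙尼亞 爱沙尼亚\n"
print(replace_all(sample))
```

Whether a given editor's replace dialog accepts this pattern unchanged depends on its regexp flavour, so it is worth trying on a few lines first.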

October 10, 2005 at 08:51 AM

I still have not figured out how to transform:

愛沙尼亞 爱沙尼亚 ai4 sha1 ni2 ya4 Estonia

(trad word)(simp word)(pinyin)(definition)

to:

爱沙尼亚 [愛--亞] ai4 sha1 ni2 ya4 Estonia

(simp word)([trad word with - replacements])(pinyin)(definition)

Can someone give me a nice regexp to use? The above examples were great, but I could not apply them successfully...
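The dash replacement needs a character-by-character comparison of the two words, which is awkward to express as a single regexp. A minimal Python 3 sketch under stated assumptions: fields are separated by single spaces, the traditional word comes first, and both words have the same length. The names `mask_shared` and `convert_entry` are made up for illustration.

```python
def mask_shared(trad: str, simp: str) -> str:
    """Replace each traditional character identical to its simplified
    counterpart (same position) with '-'."""
    if len(trad) != len(simp):
        # Lengths differ: positions cannot be compared, so keep the word as is.
        return trad
    return "".join("-" if t == s else t for t, s in zip(trad, simp))


def convert_entry(line: str) -> str:
    """Turn 'trad simp rest...' into 'simp [masked trad] rest...'."""
    parts = line.split(" ", 2)
    if len(parts) < 2:
        # Not in the assumed format: return the line unchanged.
        return line
    trad, simp = parts[0], parts[1]
    head = f"{simp} [{mask_shared(trad, simp)}]"
    return f"{head} {parts[2]}" if len(parts) > 2 else head


print(convert_entry("愛沙尼亞 爱沙尼亚 ai4 sha1 ni2 ya4 Estonia"))
```

Running `convert_entry` over each line of the file would cover the whole database; lines that do not fit the assumed layout pass through unchanged.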

October 10, 2005 at 01:03 PM

Sometimes the easiest thing is to take your Chinese text and convert it to decimal codes (&#number;) in Wenlin (the demo has this function for free) and use regex with numbers instead. Works well.
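The same decimal-code round trip can be sketched without Wenlin. This Python 3 sketch assumes the `&#number;` form with decimal code points and leaves ASCII characters alone; the function names are made up for illustration.

```python
import re


def to_decimal_refs(text: str) -> str:
    """Replace every non-ASCII character with its decimal code, e.g. &#12345;."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


def from_decimal_refs(text: str) -> str:
    """Convert decimal codes of the form &#12345; back into characters."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


encoded = to_decimal_refs("阿爾及利亞 阿尔及利亚")
# Each character is now a run of ASCII digits, so a byte-oriented regexp
# tool can treat "&#<digits>;" as one unit.
decoded = from_decimal_refs(encoded)
```

After the regexp work is done on the numeric form, `from_decimal_refs` restores the original characters.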

If you want to spend the money, PowerGREP 3 works with double byte characters now. I highly recommend this program to anyone who uses regex on a frequent basis.

J

adamlau

roddy

trevelyan

chinesetools

adamlau

adamlau

Konglong