
Perl: Searching Chinese strings



Posted

I'm trying to write a simple program in Perl for indexing Chinese words in a text file, but I think the Chinese encoding is giving me problems. I just want to do things like searching and replacing strings, for example:

$line =~ s/。/。\n/g;

if ($line =~ /$word/) ...

In the above line, searching for most words works fine, but if the word has 加 in it, for example, the program crashes with an "unmatched [" error. Another word, 社會, is found in every line. I'm using Big5 encoding for the files. I tried UTF-8, but I couldn't get that to work. I don't want to have to become a Perl/Chinese encoding expert just to write a simple program. Any advice?

Posted

I've had success treating Chinese text as regular Perl strings in the GB2312 and GB18030 encodings. Can you use those, or are you locked into the traditional (complex) character set?

There are libraries on CPAN designed for handling Unicode text. It sounds as if one of the characters in the string is being interpreted as some kind of end-of-string marker, and the program goes wrong after that. Why not post the details of what you're trying to do below? There may be alternatives to Perl as well.
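The symptoms in the question are consistent with Big5 trail bytes that fall in the ASCII range and get read as regex metacharacters. Here is a minimal sketch of that idea, assuming the words are held as raw Big5 bytes and that the core `Encode` module is available. The variable names and the sample line are made up for illustration. It is only one plausible explanation, not a diagnosis of the original program.

```perl
use strict;
use warnings;
use utf8;                 # assumption: this sketch's own source is saved as UTF-8
use Encode qw(encode);

# Hex dump of a byte string, one two-digit value per byte.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# The two words from the question, as raw Big5 bytes.
my $jia    = encode('big5', '加');     # trail byte is ASCII '['
my $shehui = encode('big5', '社會');   # final byte is ASCII '|'
print hex_bytes($jia), "\n";
print hex_bytes($shehui), "\n";

# A Big5 line (illustrative) that contains neither word.
my $line = encode('big5', '今天天氣很好');

# Interpolating the raw bytes lets the regex engine treat '[' as the start
# of a character class, so compiling the pattern fails.
my $compiled = eval { my $re = qr/$jia/; 1 } ? 1 : 0;

# The trailing '|' adds an empty alternative, which matches any line.
my $unquoted_hit = ($line =~ /$shehui/) ? 1 : 0;

# \Q...\E escapes the bytes so they are matched literally.
my $quoted_hit = ($line =~ /\Q$shehui\E/) ? 1 : 0;

print "compiled=$compiled unquoted=$unquoted_hit quoted=$quoted_hit\n";
```

Quoting with `\Q...\E` only hides the metacharacters. A byte-level match can still line up across character boundaries, so decoding the text into Perl characters first is usually the more robust route.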

Posted

Does Perl know what encoding your source file is in? It may be assuming UTF-8 or Latin-1 or something.
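One way to tell Perl the encoding of the data files is an `:encoding(...)` layer on `open`, so that regexes see characters rather than bytes. Here is a rough sketch, assuming Big5 input and output. The helper `split_sentences`, the file names and the sample text are hypothetical. Note that `use utf8` here describes the script's own source, which this sketch assumes is saved as UTF-8, not Big5.

```perl
use strict;
use warnings;
use utf8;                         # assumption: this sketch's source is UTF-8
use File::Temp qw(tempdir);

# Hypothetical helper: copy a Big5 file, adding a newline after each 。 and
# counting the input lines that contain $word.
sub split_sentences {
    my ($in_path, $out_path, $word) = @_;

    # The :encoding layer decodes Big5 bytes into Perl characters on read
    # and encodes them back to Big5 on write.
    open my $in,  '<:encoding(big5)', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(big5)', $out_path or die "Cannot write $out_path: $!";

    my $hits = 0;
    while (my $line = <$in>) {
        $hits++ if $line =~ /\Q$word\E/;   # matches characters, not bytes
        $line =~ s/。/。\n/g;
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $hits;
}

# Demo with throwaway files (names and text are illustrative).
my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/in.txt", "$dir/out.txt");
open my $fh, '>:encoding(big5)', $src or die "Cannot write $src: $!";
print {$fh} "社會進步。大家參加。\n", "今天天氣很好。\n";
close $fh or die "Cannot close $src: $!";

my $hits = split_sentences($src, $dst, '加');
print "lines containing the word: $hits\n";
```

With this approach the finished text stays in Big5 on disk, and no round trip through another encoding is needed.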

Posted

travelyan,

The finished text needs to be Big5, but I can use whatever works for processing. So, as you suggested, I tried processing it as GBK (I don't know whether that counts as GB2312 or GB18030), and it works! Thanks a lot! I've already wasted a lot of time looking for an answer; I wish I had posted here sooner. Now I'll just use Convertz to convert texts to GBK and then back to Big5.

adrian,

I don't know whether Perl knows the encoding. I tried:

use encoding 'big5', STDIN => 'big5', STDOUT => 'big5';

If I run the program by double-clicking the .pl file, the output to STDOUT is Chinese. If I run the program from the command line, I get 亂碼 (garbled characters).
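One possible cause, not confirmed in this thread, is that the two ways of launching the script give the console different code pages. On Windows, `chcp` reports the active one. The `encoding` pragma has also been deprecated since Perl 5.18, so an explicit layer on STDOUT may be a safer choice. Here is a small sketch, assuming the console expects Big5. The `$console_encoding` variable and the sample text are hypothetical.

```perl
use strict;
use warnings;
use utf8;                          # assumption: this sketch's source is UTF-8

# Assumption: the console expects Big5 (for example code page 950 on a
# Traditional Chinese Windows). Change this to match your terminal.
my $console_encoding = 'big5';

# Encode every character string written to STDOUT for the console.
binmode(STDOUT, ":encoding($console_encoding)")
    or die "Cannot set STDOUT encoding: $!";

print "搜尋完成\n";
```

If the console uses some other code page, changing `$console_encoding` to match it is the only adjustment this sketch needs.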
