in_lab
Posted April 7, 2005 at 06:43 AM

I'm trying to write a simple Perl program that indexes Chinese words in a text file, but I think the Chinese encoding is giving me problems. I just want to do simple things like searching and replacing strings, such as

$line =~ s/。/。\n/g;
if ($line =~ /$word/) ...

In the match above, searching for most words works fine, but if the word has 加 in it, for example, the program crashes with an "unmatched [" error. Another word, 社會, is found in every line. I'm using Big5 encoding for the files. I tried UTF-8, but I couldn't get that to work. I don't want to have to become a Perl/Chinese-encoding expert just to write a simple program. Any advice?
trevelyan
Posted April 7, 2005 at 07:18 AM

I've had success treating Chinese text as regular Perl strings in the GB2312 and GB18030 encodings. Can you use those, or are you locked into the traditional (complex) character set? There are also libraries on CPAN designed for handling Unicode characters. It sounds as if one of the characters in the string is triggering an End-Of-String command and the program is subsequently screwing up. Why not post the details of what you're trying to do below? There may be alternatives to Perl as well.
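A possible explanation for both symptoms in the question, offered as a sketch rather than a confirmed diagnosis: in Big5 the second byte of a character can fall in the ASCII range, and if Perl sees the pattern as raw bytes, such a byte can act as a regex metacharacter. A trailing `[` byte would explain the "unmatched [" error for 加, and a trailing `|` byte would turn 社會 into an alternation with an empty branch, which matches every line. The script below assumes the core Encode module is installed; `big5_hex` is a hypothetical helper, and the characters are written as Unicode escapes so the script does not depend on its own file encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the Big5 bytes of a word as hex pairs.
sub big5_hex {
    my ($word) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode('big5', $word);
}

my $jia    = "\x{52A0}";            # the character that crashed the program
my $shehui = "\x{793E}\x{6703}";    # the word that was found in every line

print big5_hex($jia), "\n";
print big5_hex($shehui), "\n";

my $jia_bytes    = encode('big5', $jia);
my $shehui_bytes = encode('big5', $shehui);
my $line         = "plain ascii line";

# Unquoted byte pattern: a '[' byte opens a character class that never closes.
my $compiled = eval { my $hit = $line =~ /$jia_bytes/; 1 };
print "unquoted pattern failed to compile\n" unless $compiled;

# Unquoted byte pattern: a '|' byte leaves an empty alternative, so anything matches.
print "unquoted word matches an unrelated line\n" if $line =~ /$shehui_bytes/;

# \Q...\E (quotemeta) escapes the bytes, so they are matched literally.
print "quoted word does not match\n" unless $line =~ /\Q$shehui_bytes\E/;
```

Wrapping the word in `\Q...\E` is the smallest change to the original match, but it still compares bytes, so a match could in principle start in the middle of a two-byte character.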
adrian440
Posted April 7, 2005 at 10:47 PM

Does Perl know what encoding your source file is in? It may be assuming UTF-8 or Latin-1 or something.
in_lab
Posted April 8, 2005 at 04:51 AM

trevelyan,
The finished text needs to be Big5, but I can use whatever works for processing. So, as you suggested, I tried processing it as GBK (I don't know if that's GB2312 or GB18030), and it works! Thanks a lot! I've already wasted lots of time looking for an answer; I wish I had posted here sooner. Now I'll just use Convertz to convert texts to GBK and then back to Big5.

adrian,
I don't know if Perl knows the encoding. I tried

use encoding 'big5', STDIN => 'big5', STDOUT => 'big5';

If I run the program by double-clicking the .pl file, the output to STDOUT is Chinese. If I run the program from the command line, I get garbled text (亂碼).
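An alternative to the GBK round trip, sketched under the assumption that the core Encode module is available: decode the Big5 bytes into Perl character strings, do the matching and substitution on characters, and encode back to Big5 at the end. GBK also allows ASCII-range second bytes, so other words might still cause trouble in a byte-level match. The helper names `split_sentences_big5` and `line_has_word` are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: put a newline after every ideographic full stop (U+3002).
sub split_sentences_big5 {
    my ($big5_bytes) = @_;
    my $text = decode('big5', $big5_bytes);    # bytes -> characters
    $text =~ s/\x{3002}/\x{3002}\n/g;
    return encode('big5', $text);              # characters -> Big5 bytes
}

# Hypothetical helper: true if the Big5 line contains the Big5 word.
sub line_has_word {
    my ($line_bytes, $word_bytes) = @_;
    my $line = decode('big5', $line_bytes);
    my $word = decode('big5', $word_bytes);
    # \Q...\E keeps any punctuation in the word from acting as regex syntax.
    return $line =~ /\Q$word\E/ ? 1 : 0;
}

# Sample data built from Unicode escapes, then encoded to Big5 bytes.
my $word = encode('big5', "\x{793E}\x{6703}");
my $line = encode('big5', "\x{53F0}\x{7063}\x{793E}\x{6703}\x{3002}");

print line_has_word($line, $word), "\n";
print line_has_word("plain ascii line", $word), "\n";
print length(split_sentences_big5($line)), "\n";
```

When reading from files, the same decoding can be done by the I/O layer, for example `open my $in, '<:encoding(big5)', $path`, with a matching `'>:encoding(big5)'` layer on the output handle.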