MySQL/PostgreSQL: Unicode support?

November 28, 2005 at 03:37 AM

I was originally going to send this to Roddy by e-mail, but then I thought why not just post it to the forum?

I'm working on a website that will be using Perl on Linux to create and access a bunch of data stored in a database, data that includes both English and Chinese characters. Naturally, I will be storing this data in Unicode, UTF-8 in particular.

Since I've never worked with a dedicated database, I'm looking for advice on what package to use. Particularly, I'm wondering about feasibility and ease of storing and working with Unicode data in this database. It took quite some time to competely understand how Perl deals with Unicode, and I'm not looking forward to another similarly protracted battle. Generally, I'm also wondering about speed, maintainability, and stability of the database.

Server space is not an issue (to a reasonable degree). Cost is an issue, so I prefer free/OSS, hence the two packages I'm looking at.

November 28, 2005 at 04:00 AM

I've only ever used MySQL

Never had any issues when setting up databases for storing multiple languages in UTF-8, and between weblogs and the forums I've got quite a few. There's English, both forms of Chinese, Korean, Japanese and I think bits of Russian on here, all with no problem.

The only problems I've had was with MySQL's new collocation setting, which caused me hassles when moving databases from an older version to a more up to date one, but never when setting them up - although I always run install scripts, never build them myself.

As for speed, maintainability and stability, I've never had a single problem with these related to language / encoding issues. As far as I can tell, it goes in as ones and zeros, comes out as ones and zeros, and the database doesn't too much care what it actuallly is.

You probably want to speak to the team of suave bilingual geniuses who run Adsotrans, if you haven't already - they've got a lot more experience of this kind of thing than I do, as they're running their own server and are also using Perl (I only know php, and that through a glass, darkly)

Roddy

November 28, 2005 at 04:03 AM

Why perl? Why not Python? Unless we're talking job security through obscurity....

November 28, 2005 at 05:06 AM

kudra: I'm using Perl because it's what I know. Besides, Python has separate functions for unicode and non-unicode strings, I hear? That seems a little silly to me. As long as you take a few (obscure... Perl 6 will support Unicode natively) precautions, Perl has pretty good transparent support for Unicode.

But anyhoo, I'm not here to rehash the perl-python flamewars of old.

Roddy: Thanks for the reply; one reason I posted this here was because I figured you might not have experience with PostgreSQL, so that's OK. Really, I'm just trying to sound out if there are any show-stoppers with either format. I think I've already got MySQL installed, so if the Adso crew gives it a Pass than I'll probably go with that choice. The current plan only calls for two small (500 and 1000 entry) databases, so performance is probably not something to worry about.

(EDIT: oh yeah, I'm guessing that the encoding bit matters at the point where you want to do sorts and finds, when the database package has to actually parse and understand the data that you've inputted. I pulled up some stuff accusing MySQL of mangling data based on encoding mix-ups, but they are old documents.)

Project peek:

http://msittig.wubi.org/test/bus01.html

November 28, 2005 at 05:23 AM

Regarding sorting and so on - I think that's where MySQL's collation (not collocation, as I thought) setting comes in, so you might want to read up on that before going ahead. I remember seeing unicode / utf-8 settings in there.

Roddy

November 28, 2005 at 06:28 AM

I havn't had problem with chinese or korean characters, just remember to create the db with UNICODE

$ createdb -E UNICODE dbname

Why not use PHP? It very similar to perl since it was derived from it.

You can run PHP like shell script like

$ php myprogram.php

November 28, 2005 at 07:29 AM

It depends which variant of Unicode you're working with, and what you need from your database. All of the Unicode variants work differently. Some are variable length, some are fixed-length. Some will give you problems feeding data into your database system by screwing up your SQL commands. Some will not.

If all you want is to be able to get data into and out of your database without screwing it up, want to store English and Chinese in the same database, and also want to display output directly to the web, your best bet is using UTF8. This is a variable-width encoding designed in a way that non-ASCII characters will never be confused with ASCII characters. This means that there will never be a situation where Chinese characters screw up your SQL command by confusing the system into seeing an "unexpected" ASCII quote or bracket sign hidden in the encoding for any particular character.

This has some cool features -- the best of which is that it tricks most applications into playing nicely with Chinese text even if they don't natively support it. So UTF8 works smoothly with applications like PHP which don't technically support Unicode and tend to cough up blood at the sight of GB2312. If you need to have your text processed by other applications and particularly third-party libraries, you should almost definitely go with UTF8. Since UTF8 plays nicely with ASCII, you can leave your default character set in MySQL as latin1.

The disadvantage is that parsing UTF8 is non-trivial since it is a variable-width encoding. So if you want to do something like grab the first/second/third/etc. character from a string to do some sort of lookup you have some coding ahead of you.

So why change character sets? The main reason to specify character sets explicitly is if you want your database to be able to automatically sort information and present it ascending and descending order, etc. based on the characteristics of the encoded. In most cases, this isn't a concern for Chinese the way it is for other languages. We get by at Adso with storing Chinese in a combination of GB2312 and UTF8 and keeping the database in latin1 for compatibility with other applications. No problems with this approach to date, with the exception of some PHP functions screwing up GB2312 and forcing us to find workarounds.

Good luck.

November 28, 2005 at 09:03 AM

Thanks for the reply, trevelyan. Like I said above, 1) I've already settled on using utf-8, and 2) I'm using Perl, which means I don't have to mess with bytes because Perl abstracts that away, allowing me to deal with text at the character level.

What I'm really looking for is advice on which database package to use. Maybe you can tell me which one Adsotrans uses/has tried, and how your experience has been? Was it a hassle to set up PostgreSQL to handle utf-8, while MySQL was a breeze? Was MySQL regularly corrupting data, while PostgreSQL has been stable about handling several different encodings?

And a little off-topic, "keeping the database in latin1 for compatibility with other applications"? I assume you didn't use that database to store Chinese characters... and what application was giving you trouble?

November 28, 2005 at 10:14 AM

I use postgres as a backend and didnt have to deal with any particular nuances. All my data is stored as UTF-8, although utf-8 is variable length, chinese chars will always be 3 bytes in length, so if you are only dealing with english and chinese characters its trival enough to differeniate between the two. although having said the data in my db is never mixed within a single field, its always chinese or english so its simple enough to randomly access characters in a string etc...

what type of application are you creating?

good luck

November 28, 2005 at 10:56 AM

hoodmulti and 笨笨德: Thanks for the tips. I hope it's all that simple ;)

I'm not using php for the reasons given here:

http://tnx.nl/php

Plus like I said, I know Perl, and I'm not a CS person who picks up new programming languages quickly.

The site is going to be about the Shanghai bus system, and I'm storing a buncha route information in a database for easy look-up. Right now Chinese-speakers can use the tools on the Shanghai Traffic Authority's website (http://www.jt.sh.cn/) or call the Transportation Inquiry Hotline to look up bus routes, but at the moment there is no English language equivalent. The goal of my site will be to help non-Chinese-reading folk in Shanghai use the bus system.

(Shanghai had a total of 90k expat residents in 2004.)

November 28, 2005 at 04:26 PM

>> All my data is stored as UTF-8, although utf-8 is variable length, chinese chars will always be 3 bytes in length <<

This isn't technically true, although it is a small subset of Chinese characters that fall outside the three-character range. Probably nothing in the Shanghai bus system though.

I use MySQL with the default character encoding set to latin1. Have never had any problems with it corrupting data yet, although one issue is portability. MySQL handles SQL syntax slightly differently than other SQL programs (ie. "LIKE BINARY" in MySQL is the same as " = " in SQLite, and " = " in MySQL works like "LIKE"). We added support for SQLite it required a few changes to the code that might have been avoidable using postgresql.

Site looks like it'll be useful.

November 29, 2005 at 03:58 AM

>> All my data is stored as UTF-8, although utf-8 is variable length, chinese chars will always be 3 bytes in length <<
This isn't technically true, although it is a small subset of Chinese characters that fall outside the three-character range. Probably nothing in the Shanghai bus system though.

good point, although its far from a small subset these days, the Extensions B section has more characters than the main CJK unified ideographs section... over 40,000 i think...

does anyone actually use these characters? to my knowledge they are subtle variations and the more rare characters... is this true? I have a font that supports them, and half the time i cant even see the variation in the characters...

personally i think it imperitive that every bus station website contains all 40k of these characters

Sign In

MySQL/PostgreSQL: Unicode support?

Recommended Posts

msittig

roddy

kudra

msittig

roddy

hoodmulti

trevelyan

msittig

笨笨德

msittig

trevelyan

笨笨德

Join the conversation