Chinese-Forums

Certain characters delete themselves and everything after them.


Hofmann


MySQL (in a utf8_general_ci column, though I also just tried utf8_unicode_ci and utf8_bin) truncates the text at certain rare characters, dropping the character and everything after it. The same thing happens with this character, adjacent to the one linked above. I'm seeing this on different installs and servers.

If anyone's got any ideas, I'd like to hear them. It's not an everyday problem, but it is annoying. Also, if anyone can think of any old posts that might have used similar rarely seen characters, take a look and check they're all in one piece. If not, I should be able to restore them from a backup.

Will move this into the computing section in the hope of attracting some gurus. It would be useful to know the scope of the problem, whether it's restricted to certain MySQL versions, and whether anything can be done about it (binary columns?).


I clicked on the link in the first post and ... in Firefox the "Your Browser" area contains a rectangle with hex (I assume) inside; in IE it's an empty rectangle.

So even if you fix the problem, I wouldn't be able to see the characters anyway ;) Is it only me with this issue?


I have Arial Unicode MS which I thought contained everything. Apart from the links on this thread, everything else works fine, so I'll leave it as it is and hope the characters talked about on here aren't used frequently ;)


Unfortunately this problem reflects a weakness in MySQL 5's Unicode implementation. Currently, MySQL can only store characters within the Basic Multilingual Plane (BMP, also known as Plane 0.) Hofmann's character lies outside the BMP, in the Supplementary Ideographic Plane (SIP or Plane 2.)

If a character's assigned Unicode codepoint has more than four hex digits, then it's outside the BMP. For example, the character in question has the codepoint U+25D32 (five hex digits), so it can't be stored by MySQL. Fortunately, commonly used Chinese characters should all be within the BMP.
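To check whether a given post contains any such characters, you could scan it for codepoints above U+FFFF. Here is a minimal Python 3 sketch; the helper name `non_bmp_chars` is made up for illustration and isn't part of MySQL or the forum software.

```python
def non_bmp_chars(text):
    """Return (index, 'U+XXXXX') pairs for every character outside the BMP."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF  # the BMP covers U+0000 through U+FFFF
    ]


# Hypothetical sample: two common characters, the problem character, one more.
sample = "中文" + chr(0x25D32) + "字"
print(non_bmp_chars(sample))  # [(2, 'U+25D32')]
```

An empty list means the text stays within the BMP and should be safe in a MySQL 5 utf8 column.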

This problem will be fixed in MySQL 6 (still in Alpha.) Technically, it's possible to store these characters in a MySQL 5 binary column, but that's not generally recommended (e.g., full-text searches don't work on binary columns.)

For a more technical description of this issue: the current implementation can only store Unicode characters that encode to three UTF-8 bytes or fewer. As you can see from the page Hofmann linked to, U+25D32 encodes to four UTF-8 bytes (F0 A5 B4 B2). MySQL can't handle the extra byte and truncates the string.
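The byte count is easy to verify yourself. The Python 3 sketch below encodes U+25D32 and then imitates the symptom by cutting a string at the first character that needs more than three UTF-8 bytes. Both function names are hypothetical, and `simulate_truncation` only approximates the behaviour described in this thread; it is not MySQL's actual code.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a string as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def simulate_truncation(text, max_bytes=3):
    """Cut text at the first character whose UTF-8 form exceeds max_bytes.

    An approximation of the symptom reported above, for illustration only.
    """
    for i, ch in enumerate(text):
        if len(ch.encode("utf-8")) > max_bytes:
            return text[:i]
    return text


print(utf8_hex(chr(0x25D32)))  # F0 A5 B4 B2
print(utf8_hex("中"))           # E4 B8 AD (three bytes, fits)
print(simulate_truncation("中文" + chr(0x25D32) + "字"))  # 中文
```

Comparing a post's stored text against the output of `simulate_truncation` on a backup copy would be one way to spot posts that were cut short.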
