Chinese-Forums

UTF-8 or UTF-16BE?



Posted

Hello everybody,

Could someone please recommend which encoding to use for text files containing Chinese and other languages? When I save a new file with Wenlin, the "Save as type" field in the "Save As" window says: Unicode UTF-16 (recommended, international).  But it seems that UTF-8 is used more frequently than UTF-16 in XML declarations. So I guess that choosing one or the other depends on what one wants to use the text files for, and that one can always convert them from one encoding to the other with some piece of software.
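
On converting from one encoding to the other: a minimal Python sketch, assuming the source encoding is known in advance. The function name `convert_encoding` and its parameters are made up for illustration and do not come from any tool mentioned in this thread.

```python
from pathlib import Path


def convert_encoding(src, dst, src_enc="utf-16", dst_enc="utf-8"):
    """Re-encode a text file from src_enc to dst_enc (hypothetical helper).

    Works on raw bytes so that line endings are passed through unchanged.
    Raises UnicodeDecodeError if src_enc does not match the file.
    """
    text = Path(src).read_bytes().decode(src_enc)
    Path(dst).write_bytes(text.encode(dst_enc))
```

With Python's "utf-16" codec the BOM is consumed on decoding and used to pick the byte order; for a BOM-less big-endian file you would pass "utf-16-be" explicitly. The "utf-8" codec writes no BOM.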

Thank you in advance for any comments/suggestions.

 

Posted

Use UTF-8.  There's no benefit to UTF-16 and the files are larger.  

  • Like 1
Posted

Thank you vellocet, and wibr. While we're at it, I know that in UTF-8 the BOM (byte order mark) is optional. Now I'm reading this https://en.wikipedia.org/wiki/Byte_order_mark , but I'd like to know whether you normally use a BOM or not.

Thank you again.

 

Edit: Added from Some reasons to use utf-8 everywhere

Q: What do you think about Byte Order Marks?

A: According to the Unicode Standard (v6.2, p.30): Use of a BOM is neither required nor recommended for UTF-8.

Byte order issues are yet another reason to avoid UTF-16. UTF-8 has no endianness issues, and the UTF-8 BOM exists only to manifest that this is a UTF-8 stream. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, most UTF-8 text files omit BOMs today.

Using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation. This is unacceptable.
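
To illustrate the concatenation point, here is a small Python sketch. The two byte strings stand in for two hypothetical files saved as UTF-8 with a BOM; joining them byte for byte is what a naive concatenation tool would do.

```python
import codecs

# Two hypothetical files, each saved as UTF-8 with a BOM.
file_a = codecs.BOM_UTF8 + "第一".encode("utf-8")
file_b = codecs.BOM_UTF8 + "第二".encode("utf-8")

# Naive byte-level concatenation, unaware of BOMs.
joined = file_a + file_b

# "utf-8-sig" strips a BOM only at the very start of the data,
# so the second file's BOM survives as an invisible U+FEFF mid-text.
text = joined.decode("utf-8-sig")
print(repr(text))
print("stray U+FEFF inside text:", "\ufeff" in text)
```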

 

Posted

A BOM is not recommended for UTF-8. A text file with a BOM may cause mysterious, hard-to-diagnose failures in some software, especially programs originally developed for Linux. The best way to create a text file on Windows is to use Notepad++, which defaults to UTF-8 without a BOM.

  • Like 1
Posted
3 hours ago, vellocet said:

There's no benefit to UTF-16 and the files are larger

It depends on the language. For Chinese, utf8 turns out to be larger than utf16: the great majority of commonly used characters take 3 bytes in utf8 compared to 2 in utf16 (and both encodings use 4 bytes for the rarer characters outside the Basic Multilingual Plane).
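
The byte counts are easy to check in Python. A minimal sketch; `encoded_sizes` is a hypothetical helper, and "utf-16-le" is used so that no BOM is counted.

```python
def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) for a string, BOM excluded."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "ASCII letter": "a",
    "Latin with accent": "é",
    "common Chinese character": "中",
    "CJK Extension B character": "\U00020000",
}

for label, ch in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

So the balance depends on the mix: mostly-ASCII text is smaller in UTF-8, mostly-Chinese text is smaller in UTF-16.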

 

Anyway, I agree that despite the larger file size you should still use utf8, and that if you must use utf16, you should use utf16-LE instead of BE. 

 

I also agree that it's preferable to leave the BOM off any utf8 you generate. It's completely unnecessary from a technical standpoint because by its very nature utf8 is byte order independent. It is however used by some Microsoft programs to determine that a file is utf8 encoded and is not using a legacy encoding. So if you're going to be opening the files in Notepad I'd add the BOM; otherwise I'd leave it off.
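
If you generate files from a script, Python exposes both variants as codecs. A short sketch of how they differ, not a recommendation for either: "utf-8-sig" writes the BOM and tolerates its absence when reading, while plain "utf-8" never writes one and does not strip one.

```python
import codecs

text = "中文"

with_bom = text.encode("utf-8-sig")   # BOM + UTF-8 bytes
without_bom = text.encode("utf-8")    # UTF-8 bytes only

# The only difference is the three-byte BOM prefix (EF BB BF).
bom_is_prefix = with_bom == codecs.BOM_UTF8 + without_bom

# Decoding with "utf-8-sig" accepts both variants.
tolerant_a = with_bom.decode("utf-8-sig")
tolerant_b = without_bom.decode("utf-8-sig")

# Decoding BOM-prefixed data with plain "utf-8" keeps the BOM as U+FEFF.
strict = with_bom.decode("utf-8")
```

Reading with "utf-8-sig" is therefore a reasonable default when you don't control whether incoming files carry a BOM.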

Posted

UTF-8 with no BOM. Nowadays pretty much any decent text editor will default to UTF-8 if it doesn't know the format of a file and the file doesn't have some format-specifying extension like .gb or .b5. If you're using Notepad to edit Chinese, don't; there are lots of other text editors that do a much better job. (I'd personally recommend EmEditor which, though not free, is extremely well optimized for CJK, not to mention insanely fast. It's just about the only editor in which we can edit a CJK-Extension-B-heavy 200MB dictionary data file without it crashing or becoming uselessly slow. It's Windows-only, but EmEditor in a virtual machine is better than any Mac-native text editor we're aware of.)

 

Space is only a concern if these files are being moved around a lot uncompressed, but assuming you .zip / .7z anything large before you send it to someone there should not be a significant size difference between -8 and -16 even for an exclusively Chinese file.

 

For internal use, however, UTF-16: it makes Chinese text manipulation a whole lot easier, and it's what the default iOS/Android string classes work with.
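
One plausible reading of "easier" is that every character in the Basic Multilingual Plane is exactly one 16-bit unit, so position i sits at byte offset 2*i. A Python sketch of that idea and of where it stops working; `char_at_utf16` is a hypothetical helper, not an API of any platform named above.

```python
def char_at_utf16(data, i):
    """Decode the i-th 16-bit unit of UTF-16-LE data as a character.

    Only valid when that unit is a complete BMP character; half of a
    surrogate pair raises UnicodeDecodeError.
    """
    return data[2 * i:2 * i + 2].decode("utf-16-le")


bmp_text = "汉字处理".encode("utf-16-le")
third = char_at_utf16(bmp_text, 2)  # fixed-width lookup, no scanning

# A CJK Extension B character needs two units (a surrogate pair),
# so the fixed-width assumption breaks for it.
ext_b = "\U00020000".encode("utf-16-le")
try:
    char_at_utf16(ext_b, 0)
    surrogate_ok = True
except UnicodeDecodeError:
    surrogate_ok = False
```

For text that stays within commonly used characters the shortcut holds; code that may see Extension B characters still has to handle surrogate pairs.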

  • Like 1
Posted
18 hours ago, mikelove said:

I'd personally recommend EmEditor which, though not free, is extremely well optimized for CJK

Thank you, Mike

I've been using EditPad Pro for quite some time now. Lately, I've been thinking of buying PowerGREP, but I'm not sure whether it's exactly a text editor or something else. PowerGREP and EmEditor cost more or less the same, and seem to have similar functions (search & replace, join & split files, change text encoding, etc.). Although I don't know what "[EmEditor] is extremely well optimized for CJK" means, I think it would be a good idea to install the trial version to see how it looks and feels.

 

P.S. PowerGREP, as its name makes it clear, is something else, something more focused on regex, I think. My apologies for getting off topic. I'm not a programmer.

 

Posted

For plain text that is mainly in Chinese, I prefer UTF-16 for files that I create; it is just a personal preference.  For texts that I need to pass to other tools, UTF-8 might work better depending on the tool, but I try to use a BOM if possible.

 

Others argued, "if you use a decent editor..."  Well, if you are using a decent editor, UTF-8 or UTF-16 doesn't make a difference, it can handle them both, and many other encodings others might throw at you.  If you are distributing files to others, UTF-8 w/BOM is unambiguous.

 

With text files that start in Chinese it is not much of a problem, as most editors will be able to auto-detect those even without a BOM. But I sometimes receive text files that seem to be purely ASCII but turn out to have, e.g., an é way down inside, and the editor chose the wrong encoding because it was only looking at the first x bytes of the file when it did the detection.  And the files were in different encodings, so any default would be wrong.
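
The failure mode described above can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the 4096-byte prefix and the `looks_like_utf8` helper are illustrative, not how any particular editor works.

```python
def looks_like_utf8(data):
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A long run of pure ASCII, then one Latin-1 encoded "é" near the end.
sample = b"plain ASCII line\n" * 1000 + "café\n".encode("latin-1")

prefix_guess = looks_like_utf8(sample[:4096])  # inspects only the start
full_check = looks_like_utf8(sample)           # inspects the whole file

print("prefix says UTF-8:", prefix_guess)
print("whole file is UTF-8:", full_check)
```

The prefix is pure ASCII, which is valid in almost any encoding, so it tells the detector nothing; only a scan of the whole file finds the byte that rules UTF-8 out. Note that cutting a prefix out of real UTF-8 text can also split a multi-byte character, which a careful detector has to allow for.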

 

So, I try to make it unambiguous when I save files, and I breathe a sigh of relief if I see a file in UTF-8 w/BOM because I don't need to guess whether the editor got the encoding right.

 

I don't do much heavy text processing, so I can't say much about tools that cannot handle UTF-8 w/BOM. I have encountered them, though, and yes, it was a hassle.

 

However, my opinion is:

 

1) "hoping everyone will fix their software" is about as futile as "hoping everyone will send you files in UTF-8 w/o BOM only" so we're screwed either way :) 

2) Using non-Unicode-aware tools to process Unicode text is a kludge; if it doesn't break now just because you don't feed it a BOM, it'll break later in other ways.

 

So, I'm doing it in the way that suits my workflow best.

 

Posted
37 minutes ago, dwq said:

Using non-Unicode-aware tools to process Unicode text is a kludge,

Using a BOM in utf8 text is a kludge :mrgreen:

Posted

I accept that BOM was not originally intended for UTF-8, call it a kludge if you want, but in my case it works much better than the alternative.
