kenneth540 Posted July 6, 2005 at 09:33 AM

Hi All,

I am writing a Visual Basic program that pulls data from a web page (it reads the HTML file), and the page contains some BIG5 codes. I need to convert these BIG5 codes into the corresponding Chinese characters before storing them in a database (MS Access). Does anyone have any idea how to accomplish this?

For example, my input data would be "&#39080;&#37319;&#20381;&#28982;" and I need to convert it to "風采依然".

TIA,
Kenneth
trevelyan Posted July 6, 2005 at 02:50 PM

I'm not sure I understand what you mean. BIG5 *is* an accepted encoding for Chinese characters... do you mean that your database is having trouble with BIG5 and you want to convert the data to another encoding like GB2312 or UTF-8?

If this is what you actually mean, take a look at the GNU program iconv (library == libiconv). It comes with everything necessary to handle conversions between different encodings. You should be able to compile the functionality directly into your software.

If this isn't what you mean, it would be helpful if you'd clarify the problem.
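The kind of conversion iconv performs can be sketched in a few lines. This is a minimal illustration in Python rather than Visual Basic or libiconv itself, using Python's built-in codecs; the function name `transcode` and the sample string are assumptions made for the example. On the command line the equivalent would be something like `iconv -f BIG5 -t UTF-8 in.txt > out.txt`.

```python
def transcode(data: bytes, source: str = "big5", target: str = "utf-8") -> bytes:
    """Re-encode a byte string from one character encoding to another."""
    # Decode the source bytes to Unicode text, then encode to the target.
    text = data.decode(source)
    return text.encode(target)


if __name__ == "__main__":
    # Stand-in for bytes read from a genuinely Big5-encoded file.
    big5_bytes = "風采依然".encode("big5")
    utf8_bytes = transcode(big5_bytes)
    print(len(big5_bytes), "Big5 bytes ->", len(utf8_bytes), "UTF-8 bytes")
    print(utf8_bytes.decode("utf-8"))
```

Note that this only helps when the input really is Big5-encoded bytes; it does nothing for codes that are written out as text inside the HTML.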
kenneth540 Posted July 6, 2005 at 05:06 PM

Hi trevelyan,

My input data contains the BIG5 code representations, in the form &#xxxxx; where xxxxx is a 5-digit code. I'd like to convert each of these to the corresponding Chinese character before I write it to the MS Access database.

Thanks,
Kenneth
Jose Posted July 7, 2005 at 12:24 AM

Like Trevelyan, I also find it difficult to understand what you mean by "converting the Big5 representation into the corresponding character". A Chinese character, when stored electronically, is nothing more and nothing less than a numeric representation. There are different encoding schemes (Big5, GB2312, UTF-8 and so on), and there are tools to convert between them, as Trevelyan said. But you'll always be using one particular encoding.

I don't know if there's anything I'm missing in your question. Maybe I haven't understood it properly, but I would say that converting a numeric code into a proper Chinese character is part of the rendering process that a text processor performs to display the glyphs associated with particular letters or characters from their underlying numeric value. If you're handling the characters programmatically, a character and its numeric representation are the same thing as far as I can tell.
kenneth540 Posted July 7, 2005 at 03:10 AM

Let me try to clarify it further. The attached file is a sample HTML file that I am dealing with. In the file, there are four BIG5 codes, representing four different Chinese symbols. Note that I had to change the file extension from .HTML to .TXT in order to attach it here.

When the file is viewed in my web browser (after the extension has been changed back to .HTML), I see the four Chinese symbols instead of the codes themselves.

Now, when the same file is fed into my Visual Basic program, the program treats it as a normal text file. It only sees the BIG5 codes. If I output what I read straight to, let's say, an MS Access database, it stores the BIG5 codes as seen in the .HTML file in the Access table. What I want is to store the Chinese symbols in the Access table instead. I know Access XP (English version) supports the storage of Chinese symbols, as I can copy a Chinese symbol from my web browser displaying the attached .HTML file and paste it into the Access table.

I understand that Chinese symbols, when stored in their most basic electronic form, are nothing more than numbers (1's and 0's). However, I am not dealing with the most basic level here. I am dealing with the MS Access level.

Hope I've explained myself clearly here, and thanks for the input so far.

Kenneth
kenneth540 Posted July 7, 2005 at 10:43 AM

One more note: if there is a way in Visual Basic to capture the rendered HTML output of a web page rather than the actual HTML file, it will solve my problem.
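One way to approximate "the rendered output" without a browser is to run the markup through an HTML parser and keep only the text, letting the parser decode the &#xxxxx; references along the way. The sketch below uses Python's standard `html.parser` module for illustration (not Visual Basic); the class name `TextExtractor` and the helper `visible_text` are hypothetical names chosen for this example, and the sketch does not skip script or style content the way a real browser would.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, with character references already decoded."""

    def __init__(self) -> None:
        # convert_charrefs=True makes the parser turn &#NNNNN; into characters
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def visible_text(markup: str) -> str:
    """Return the concatenated text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    # Build a sample page from the characters quoted in the thread.
    refs = "".join(f"&#{ord(ch)};" for ch in "風采依然")
    page = f"<html><body><p>{refs}</p></body></html>"
    print(visible_text(page))
```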
smalltownfart Posted July 7, 2005 at 05:06 PM

I have no experience with MS Access or VB, but I have encountered similar problems on other platforms.

Is your MS Access database using Unicode? I would check the help to see how to find this out. Typically, when you are inserting these kinds of strings into a DB, you have to convert them into the encoding used by your DB. If your DB is Unicode/UTF-XX, there should be some way to convert from Big5 to Unicode; check the API available to your VB program.
Jose Posted July 8, 2005 at 01:46 AM

I think I understand what your problem is now. Sorry that I didn't read your first post more carefully. I was thinking that the HTML files you were using were Big5-encoded but, if I understand correctly, they use a Western encoding and they store Chinese characters in ANSI text format as "&# + (Big5 code)", so your input consists of the Big5 codes AS TEXT.

If I understand things correctly, I think you will have to convert that code (written as text in the HTML) into binary format, so that you get real Big5 text. To do that, you would use a function like atoi in C, like this:

    /* Example in C */
    #include <stdlib.h>      /* for atoi */
    char *big5_code_as_text; /* This will store the input, a string of text with the code */
    int big5_code;           /* This will store the binary value of the Big5 code, as used in Chinese text files */
    (...)                    /* Store input (the number after &#) in big5_code_as_text */
    /* Now we can convert it to binary */
    big5_code = atoi(big5_code_as_text);

Now, if you were working with a Traditional Chinese version of MS Access, you could write the "big5_code" values into a file and that would appear as Chinese text when viewed under MS Access or whatever. However, if, as seems to be your case, you're working under a non-Chinese system, the program won't recognise the text as Chinese (it will look something like "AüÍñÖè...").

I'm not sure how MS Office programs treat multibyte text, but I guess that if you convert the encoding to UTF-16, the characters will appear correctly. To convert from Big5-coded characters to a UTF encoding, you will need a function that does that, like the libiconv library mentioned by Trevelyan. The example in C would continue like this:

    int utf_code;
    utf_code = WhateverConversionFunction(big5_code, bla, bla, bla); /* Just an example */

So, we would get utf_code as the output of a conversion function. Now, I guess you could write those utf_code values into your database.

If things don't work, check the Microsoft documentation for MS Office developers to see if there is any useful information about storing Unicode strings in MS Office files.

This is all I can think of. Sorry that I cannot be more concrete about conversion functions or MS Access stuff, but I'm not an expert on this. Hope this helps.
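A note on the codes themselves: in standard HTML, a numeric character reference of the form &#NNNNN; names a Unicode code point in decimal, not a Big5 value. If the attached file follows the standard (which the browser rendering suggests), the number obtained from the atoi-style step already is the Unicode value, and no Big5-to-Unicode conversion is needed. The sketch below shows that step in Python rather than C or Visual Basic; the function name `decode_decimal_refs` is an assumption for illustration, and `html.unescape` is the standard-library call that handles the general case.

```python
import html
import re

# Matches decimal numeric character references such as &#39080;
_DECIMAL_REF = re.compile(r"&#(\d+);")


def decode_decimal_refs(text: str) -> str:
    """Replace each decimal reference &#NNNNN; with the character it names."""
    # int() plays the role of atoi; chr() maps the code point to a character.
    return _DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)


if __name__ == "__main__":
    # Build the sample input from the characters quoted in the thread.
    sample = "".join(f"&#{ord(ch)};" for ch in "風采依然")
    print(sample)
    print(decode_decimal_refs(sample))
    # The standard library also handles hex and named references.
    print(html.unescape(sample))
```

The resulting string is ordinary Unicode text, which is the form a Unicode-capable database column would expect.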
smalltownfart Posted July 9, 2005 at 01:56 PM

Whoops, I didn't see your text file earlier. What you have is an HTML file with encodings for the Big5 chars. It will render correctly in a browser, but the text is not actually in Big5 or Unicode, as the previous poster noted.

If, like me, you have WinXP, what you *should* be able to do is:

- rename the file extension back to HTML
- open it in Internet Explorer/Firefox/whatever browser
- select all the text and copy it to the clipboard
- open Notepad and create a file of type Unicode
- paste the text into Notepad and save it

I think saving it as ANSI would also work, but you may need to set your "Language for non-unicode programs" in regional options.

You may want to get another Unicode editor. Although Notepad supports Unicode in Win2K and WinXP, it is kind of spartan. You will find quite a few free ones out there.
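The final save step can also be done in code. Notepad's "Unicode" file type on Win2K/WinXP is UTF-16 little-endian with a byte order mark, so a program can write the same format directly. This is a small Python sketch under that assumption; the function names `save_as_notepad_unicode` and `load_unicode` and the file name are made up for the example.

```python
import codecs
import os
import tempfile


def save_as_notepad_unicode(text: str, path: str) -> None:
    """Write text as UTF-16 little-endian, preceded by a byte order mark."""
    with open(path, "wb") as handle:
        handle.write(codecs.BOM_UTF16_LE)
        handle.write(text.encode("utf-16-le"))


def load_unicode(path: str) -> str:
    """Read the file back; the utf-16 codec uses the BOM to pick byte order."""
    with open(path, "r", encoding="utf-16", newline="") as handle:
        return handle.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "sample.txt")
        save_as_notepad_unicode("風采依然", target)
        print(load_unicode(target))
```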
smalltownfart Posted July 10, 2005 at 04:57 PM

If you need to do this programmatically, you should check out the BIG5 to Unicode mapping table at: http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

There is a freeware (GPL'd) utility that does the reverse of what you want (Big5 to character reference codes): Tsai Chih-Hao's B5TOUNI, http://technology.chtsai.org/b5touni/

Maybe you can also use this as a guide for your coding efforts (it does basically what you want, in PHP): http://annevankesteren.nl/2005/05/character-references

Please keep us posted on progress. I am sure other people could use your utility.
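For reference, the reverse direction described above (Big5 bytes in, character references out) is short enough to sketch. This is not B5TOUNI's actual implementation, only an illustration in Python built on the standard `xmlcharrefreplace` error handler; the function name `big5_to_char_refs` is an assumption for the example.

```python
import html


def big5_to_char_refs(data: bytes) -> str:
    """Turn Big5-encoded bytes into ASCII text with decimal references."""
    text = data.decode("big5")
    # Characters outside ASCII are replaced by &#NNNNN; references.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # Stand-in for bytes read from a Big5-encoded file.
    big5_bytes = "風采依然".encode("big5")
    refs = big5_to_char_refs(big5_bytes)
    print(refs)
    # Round trip back to characters, the direction the original question needs.
    print(html.unescape(refs))
```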