colinuk Posted September 25, 2008 at 12:16 PM I have a very long PDF document written in Simplified Chinese that I would like to study. Because it is a PDF, I can't use Kingsoft's Powerword with it on Windows, nor can I use CEDICT / Pera-Kun with it on my Mac to look up words. So I wanted to convert the PDF to HTML or Word format so that I could use one of these methods. However, when I try to convert the file using the export feature in Adobe Acrobat Professional 8 (or its 'save as' function, for that matter), the resulting HTML or Word document shows a really weird character set - it's all gobbledygook, basically. I can't get it to 'save as' or export to HTML or Word and preserve the Chinese characters. Would anyone know what is going wrong here? Is there anything I should be tweaking to make sure that the characters transfer correctly during export? Any help would be appreciated. Cheers Colin
renzhe Posted September 25, 2008 at 02:37 PM It's hard to say, so I'm taking blind guesses. Have you tried all possible character sets on the resulting HTML file? UTF-8, UTF-16, GBK, GB2312, Big-5, etc.? Can you cut and paste characters from the PDF into other programs (like Word)? Which encoding do they end up in? Could you try cutting and pasting excerpts out manually if the file is not too long?
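The "try every character set" check suggested above can be done mechanically rather than through a browser menu. Below is a minimal Python sketch, assuming you have the exported file's raw bytes; the function name `try_decodings` and the two-byte sample are illustrative and not taken from the thread. Note that a decode which succeeds is not necessarily the right one: several encodings may accept the same bytes and produce different characters, so the results still need checking by eye.

```python
# Candidate encodings from the list in the post above (Python codec names).
CANDIDATES = ["utf-8", "utf-16", "gbk", "gb2312", "big5"]


def try_decodings(raw, encodings=CANDIDATES):
    """Decode `raw` with each candidate; map codec name -> text, or None on failure."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Illustrative sample: the character 在 encoded as GBK.
sample = "在".encode("gbk")
decoded = try_decodings(sample)

for name, text in decoded.items():
    print(name, "->", repr(text))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. If none of the candidates yields readable Chinese, the problem is probably not the choice of encoding at all.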
roddy Posted September 25, 2008 at 02:48 PM Is it available online? If so let us know where and the forum geeks . . er . . technical experts . . will no doubt take a look.
colinuk Posted September 25, 2008 at 02:59 PM Hi there, thanks for the thoughts. When I export to HTML from Adobe Acrobat Professional, the settings offer several encodings to choose from: UTF-8, UTF-16, UCS-4, ISO-Latin-1, HTML / ASCII, and 'Use mapping table default'. I have tried all of them to create the HTML document. When I then open the HTML document in a web browser, the characters are all just a mess. I then change the view settings in the browser, going through the full range of Simplified Chinese character encodings available (e.g. ISO-2022-CN, HZ, GB18030, etc.). None of them displays the characters correctly, as they appeared in the original document. Likewise, if I export to a Word document there are no encoding settings to choose from, and when I open the exported Word document I go through every Chinese font on my list to try to make the characters display properly, but nothing has worked so far. Is this the kind of thing that you are talking about? I am really at a bit of a loss on this one. I don't know enough about PDF documents to know what is going on in the background. Cheers Colin
colinuk Posted September 25, 2008 at 03:11 PM I have attached one of the 280 pages of the document here in case anyone on the forum dares have a look at it! There must be a way to sort out the coding for this, it's just escaping me right now..... (Attachment: One page extract.pdf)
colinuk Posted September 25, 2008 at 03:15 PM Oh, I forgot to say: if I try to cut and paste out of the PDF document directly, I get the same problem......a whole load of unrecognisable rubbish.....or just dots
roddy Posted September 25, 2008 at 03:15 PM "Document" eh? Looks suspiciously like the adventures of a certain boy wizard to me. Google search specific phrases in quotes, you'll find what you're looking for easily enough. Might be worth checking for quality, as there may be different versions out there.
colinuk Posted September 25, 2008 at 03:31 PM Yep, 'tis the boy wizard himself. Maybe his magic is stopping me from exporting it to HTML!! There are a lot of really bad translations out there, and many are not complete. I have done quite a lot of searching already. The one I have is faithful to the English version, well translated and complete, so I would like to stick with it, if I could only get it to export......
davidj Posted September 25, 2008 at 03:38 PM It's using an embedded subset font. It's quite possibly coded in order of appearance and only suitable for printing or OCR.
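To see why a subset font "coded in order of appearance" defeats copy and paste, here is a toy Python model. It is only an illustration of the idea, not real PDF font machinery: the function `subset_encode` and the sample sentence are hypothetical. Each distinct character is given the next free code the first time it appears, so the stored codes say nothing about Unicode unless the document also carries a table mapping them back.

```python
def subset_encode(text):
    """Assign each distinct character a code in order of first appearance.

    Returns the list of codes for the text and the char -> code table.
    """
    table = {}
    codes = []
    for ch in text:
        if ch not in table:
            table[ch] = len(table) + 1  # codes start at 1
        codes.append(table[ch])
    return codes, table


original = "我在家，他在学校"
codes, table = subset_encode(original)

# What a naive converter sees: treating the codes as if they were Unicode.
naive = "".join(chr(c) for c in codes)

# What is possible only if the code -> character table is available.
reverse = {code: ch for ch, code in table.items()}
recovered = "".join(reverse[c] for c in codes)

print(codes)              # the repeated 在 reuses the same code
print(repr(naive))        # control characters, not Chinese
print(recovered)
```

In this model the text can only be recovered with the reverse table. If the PDF producer did not store an equivalent mapping, the remaining route is to recognise the glyph shapes, which is what OCR does.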
roddy Posted September 25, 2008 at 03:40 PM Fair enough, but I'd be surprised if this isn't online already. There's an Adobe email address which accepts PDF attachments and sends back plain text, but it says it works for 'English and most West European languages'. I've sent the single page in just in case, but it hasn't come back yet.
colinuk Posted September 25, 2008 at 03:50 PM "It's using an embedded subset font. It's quite possibly coded in order of appearance and only suitable for printing or OCR." David, thanks for that. Does that mean there are no tweaks one could make in Adobe Acrobat Professional, or any other tool, to export the Chinese to another document format? Roddy, I didn't know Adobe had such a service. I'll try to look into that too; let me know if you get any results from your email. Thanks Colin
roddy Posted September 25, 2008 at 04:29 PM See here. Hasn't come back though, so I assume either it doesn't work, or our attempts to feed it Chinese have broken it. Actually, the FAQ does say: "Languages requiring double-byte characters, such as Japanese, Chinese, Arabic, and Hebrew are not supported."
trevelyan Posted September 25, 2008 at 04:29 PM The internal PDF format doesn't necessarily contain recognizable encoding data for any particular character set. There is not necessarily a way to get the text back.
Luobot Posted September 25, 2008 at 05:43 PM Here is a PDF to Word converter and a PDF to text converter. Both claim to support Simplified and Traditional Chinese characters. (Their PDF to HTML converter doesn't mention Chinese character support.) I don't know if it will work, but it's free to try. Let us know how it comes out if you try it. There are also a lot of other PDF to HTML converters that you can try for free.
renzhe Posted September 26, 2008 at 12:44 AM If it really is a special coding embedded in the document and specific to this particular text, then there's no way to convert it to anything, short of using a Chinese-aware OCR program.
ipsi() Posted September 26, 2008 at 04:27 AM (edited September 26, 2008 at 04:28 AM by ipsi(): rephrased slightly) It's very definitely using a special encoding - 在 comes out as
davidj Posted September 26, 2008 at 07:48 AM My impression was that CID Identity-H can't produce more than 16-bit characters, so I suspect that it is actually spoofing 32-bit ones by triggering a Unicode escape mechanism. Actually, Windows, and I think Linux, don't support storing UTF-32. However, PDF is a final-form document format. It was never designed to allow conversion back to revisable form (although it was a design aim that you can cut and paste plain text, subject to permission flags, and authors are supposed to make that possible - but not all authoring tools comply; more recently, accessibility requirements have meant that accessibility tools should be able to get at the plain text). As there should be a revisable-form document that underlies the PDF file, you should ask the PDF file creator for that document. If they have given you a copyright licence to convert the document to HTML, that is the least they can do; they really ought to have set security flags on the document if they didn't want that, but it is not safe to infer a licence from a lack of DRM. Of course, if it really is a derivative of a J.K. Rowling work, I believe she is very strict on requiring royalties, so I'm surprised that you managed to get that permission legitimately.
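The 16-bit point can be made concrete with a small Python sketch. It assumes, in line with the PDF specification's description of Identity-H, that each character code is two bytes read big-endian, which caps a code at 0xFFFF. The helper names `identity_h_codes` and `to_text`, the sample bytes, and the mapping table are all hypothetical; a real extractor would read the mapping from the PDF's ToUnicode data, if the producer included any.

```python
REPLACEMENT = "\ufffd"  # shown when a code has no known Unicode mapping


def identity_h_codes(data):
    """Split a byte string into 2-byte big-endian codes (0x0000 to 0xFFFF)."""
    if len(data) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_text(codes, to_unicode):
    """Translate codes with a code -> character table, marking unknown codes."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


# Hypothetical string as it might sit in a page's content stream.
raw = bytes([0x00, 0x01, 0x00, 0x02, 0xFF, 0xFF])
codes = identity_h_codes(raw)

# Hypothetical mapping; without one, the codes are just glyph selectors.
hypothetical_map = {1: "我", 2: "在"}
text = to_text(codes, hypothetical_map)

print(codes)
print(text)
```

Any code missing from the table comes out as the replacement character, which is roughly the "rubbish or dots" effect described earlier in the thread.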
flameproof Posted September 26, 2008 at 09:10 AM Looks like a piece of cake.... till one actually tries to do a click and paste.... New to me.
ABCinChina Posted September 26, 2008 at 11:39 AM Did the author do this on purpose to make it hard to cut and paste the translation? Seems like an interesting read... Does anybody have a Microsoft Word version?
roddy Posted September 26, 2008 at 11:44 AM Quite possibly - one of the ideas behind the PDF format is that authors can protect content. PDFs are also used for ebooks, as the DRM allows documents to be tied to certain computers. Or it could have been done accidentally - at a guess, the creator opted to embed the fonts so users wouldn't need to have them installed, with the unintended consequence of making the text not copyable. There are links to some Harry Potter books in Chinese here, not sure if this one is included.