Converting Chinese PDFs to HTML

September 25, 2008 at 12:16 PM

I have a very long PDF document written in Simplified Chinese that I would like to study. Because it is a PDF, I can't use Kingsoft's Powerword with it on Windows, nor CEDICT / Pera-Kun on my Mac, to look up words. So I wanted to convert the PDF to HTML or Word format so that I could use one of these methods. However, when I try to convert the file using the export feature in Adobe Acrobat Professional 8 (or its 'save as' function, for that matter), the resulting HTML or Word document shows a really weird character set - it's all gobbledygook, basically. I can't get it to 'save as' or export to HTML or Word and preserve the Chinese characters.

Does anyone know what is going wrong here? Is there anything I should be tweaking to make sure that the characters transfer correctly during export?

Any help would be appreciated.

Cheers

Colin

September 25, 2008 at 02:37 PM

It's hard to say, so I'm taking blind guesses.

Have you tried all possible character sets on the resulting HTML file? UTF-8, UTF-16, GBK, GB2312, Big-5, etc?

Can you cut and paste characters from the PDF into other programs (like Word)? Which encoding do they end up in?

Could you try cutting and pasting excerpts out manually if the file is not too long?
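The "try every character set" check can also be done outside the browser. Below is a minimal Python sketch, assuming you have the raw bytes of the exported text to hand; the function name `try_decodings` and the sample string are made up for illustration. Note that a decode that succeeds without error is not proof that the encoding is right, so the output still needs to be read by eye.

```python
# Candidate encodings from the list above, written as Python codec names.
CANDIDATES = ["utf-8", "utf-16", "gbk", "gb2312", "big5"]

def try_decodings(raw, candidates=CANDIDATES):
    """Return {encoding: text} for every candidate that decodes without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; skip it.
            continue
    return results

# Hypothetical sample: four Chinese characters saved as GBK bytes.
sample = "简体中文".encode("gbk")
decoded = try_decodings(sample)
for name, text in decoded.items():
    print(name, repr(text))
```

If none of the candidates gives readable Chinese, the bytes were probably never in a standard encoding to begin with.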

September 25, 2008 at 02:48 PM

Is it available online? If so let us know where and the forum geeks . . er . .. technical experts . . will no doubt take a look.

September 25, 2008 at 02:59 PM

Hi there

Thanks for the thoughts. When I export to HTML from Adobe Acrobat Professional, it gives several options in the settings to choose from: UTF-8, UTF-16, UCS-4, ISO-Latin-1, HTML / ASCII, Use mapping table default. I have tried all of them to create the HTML document. When I then open the HTML document in a web browser, the characters are all just a mess. I then change the view settings in the browser, going through the full range of Simplified Chinese character encodings that are available (e.g. ISO-2022-CN, HZ, GB18030 etc.). None of them displays the characters correctly, as they appeared in the original document.

Likewise, if I export to a Word document, there are no encoding settings to choose from when exporting. When I open the exported Word document, I go through every Chinese font on my list to try to make the characters display properly, but nothing has worked so far.

Is this the kind of thing that you are talking about?

Am really at a bit of a loss on this one. I don't know enough about PDF documents to know what is going on in the background really.

Cheers

Colin

September 25, 2008 at 03:11 PM

I have attached one of the 280 pages of the document here in case anyone on the forum dares have a look at it! There must be a way to sort out the coding for this, it's just escaping me right now.....

One page extract.pdf

September 25, 2008 at 03:15 PM

Oh, I forgot to say: if I try to cut and paste out of the PDF document directly, I get the same problem......a whole load of unrecognisable rubbish.....or just dots

September 25, 2008 at 03:15 PM

"Document", eh? Looks suspiciously like the adventures of a certain boy wizard to me. Google-search specific phrases in quotes and you'll find what you're looking for easily enough. Might be worth checking for quality, as there may be different versions out there.

September 25, 2008 at 03:31 PM

Yep 'tis the boy wizard himself. Maybe his magic is stopping me from exporting it to HTML!!

There are a lot of really bad translations out there, and many are not complete. I have done quite a lot of searching already. The one I have is faithful to the English version, well translated and complete, so I would like to stick with it, if I could only get it to export......

September 25, 2008 at 03:38 PM

It's using an embedded subset font. It's quite possibly coded in order of appearance and only suitable for printing or OCR.
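To illustrate what "coded in order of appearance" would mean, here is a toy Python model. It is only a sketch of the idea, not how Acrobat or any real PDF producer builds its fonts; `subset_encode` and `naive_extract` are hypothetical names. Each distinct character gets the next free code the first time it appears, and the embedded font only knows how to draw the glyph for each code.

```python
def subset_encode(text):
    """Give each distinct character a code in order of first appearance.

    Returns (codes, glyph_table). Code n draws glyph_table[n - 1];
    code 0 is left unused in this toy model.
    """
    index_of = {}
    glyph_table = []
    codes = []
    for ch in text:
        if ch not in index_of:
            glyph_table.append(ch)
            index_of[ch] = len(glyph_table)
        codes.append(index_of[ch])
    return codes, glyph_table

def naive_extract(codes):
    """What an exporter gets if it treats the codes as Unicode code points."""
    return "".join(chr(c) for c in codes)

codes, glyph_table = subset_encode("我在家在")
print(codes)                        # the numbers stored in the page
print(repr(naive_extract(codes)))   # control characters, not Chinese
```

The page still prints correctly because the font draws the right shapes, but unless the file also carries a table mapping codes back to Unicode, the codes say nothing about which characters they are. That would explain why every export encoding and every browser setting gives rubbish.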

September 25, 2008 at 03:40 PM

Fair enough, but I'd be surprised if this isn't online already.

There's an Adobe email address which accepts PDF attachments and sends back plain text, but it says it works for 'English and most West European languages'. I've sent the single page in just in case, but it hasn't come back yet.

September 25, 2008 at 03:50 PM

"It's using an embedded subset font. It's quite possibly coded in order of appearance and only suitable for printing or OCR."

David, thanks for that. However, does that mean there are no tweaks one could make in Adobe Acrobat Professional or whatever to export the Chinese to another document format?

Roddy, I didn't know Adobe had such a service. I'll try to look into that too; let me know if you get any results from your email, though. Thanks

Colin

September 25, 2008 at 04:29 PM

See here. It hasn't come back though, so I assume either it doesn't work, or our attempts to feed it Chinese have broken it. Actually, the FAQ does say:

Languages requiring double-byte characters, such as Japanese, Chinese, Arabic, and Hebrew are not supported.

September 25, 2008 at 04:29 PM

The internal PDF format doesn't necessarily contain recognizable encoding data for any particular character set. There is not necessarily a way to get the text back.

September 25, 2008 at 05:43 PM

Here is a pdf to word converter and a pdf to text converter. Both claim to support Chinese simplified & traditional characters. (Their pdf to html converter doesn't mention Chinese character support.) I don't know if it will work, but it's free to try. Let us know how it comes out if you try it.

There are also a lot of other pdf to html converters that you can try for free.

September 26, 2008 at 12:44 AM

If it really is a special coding embedded in the document and specific to this particular text, then there's no way to convert it to anything, short of using a Chinese-aware OCR program.

September 26, 2008 at 04:27 AM

It's very definitely using a special encoding - 在 comes out as

Edited September 26, 2008 at 04:28 AM by ipsi()
Rephrased slightly

September 26, 2008 at 07:48 AM

My impression was that CID Identity-H can't produce characters wider than 16 bits, so I suspect that it is actually spoofing 32-bit ones by triggering a Unicode escape mechanism. Actually, Windows, and I think Linux, don't support storing UTF-32.
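If the "escape mechanism" here means UTF-16 surrogate pairs (an assumption on my part; the post doesn't say), the effect can be seen with a few lines of Python. The helper name `utf16_units` is made up for this sketch. A character inside the 16-bit range takes one code unit, while one outside it, such as U+20000 from CJK Extension B, takes two.

```python
def utf16_units(ch):
    """Return the 16-bit UTF-16 code units used to store one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+5728 fits in a single 16-bit unit.
print([hex(u) for u in utf16_units("在")])
# U+20000 does not, so it is stored as a high and a low surrogate.
print([hex(u) for u in utf16_units("\U00020000")])
```

Software that reads each 16-bit unit as a separate character will see the two surrogates as two meaningless values rather than one character.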

However, PDF is a final form document format. It was never designed to allow conversion back to revisable form (although it was a design aim that you can cut and paste plain text, subject to permission flags, and authors are supposed to make that possible - but not all authoring tools comply; more recently, accessibility requirements have meant that accessibility tools should be able to get at the plain text).

As there should be a revisable form document that underlies the PDF file, you should ask the PDF file creator for that document. If they have given you a copyright licence to convert the document to HTML, that is the least they can do; they really ought to have set security flags on the document if they didn't want that, but it is not safe to infer a licence from a lack of DRM.

Of course, if it really is a derivative of a J.K. Rowling work, I believe she is very strict on requiring royalties, so I'm surprised that you managed to get that permission legitimately.

September 26, 2008 at 09:10 AM

Looks like a piece of cake.... till one actually tries to do a click and paste.... New to me.

September 26, 2008 at 11:39 AM

Did the author do this on purpose to make it hard to cut and paste the translation?

Seems like an interesting read...Does anybody have a Microsoft Word version?

September 26, 2008 at 11:44 AM

Quite possibly - one of the ideas behind the pdf format is that authors can protect content - they're also used for ebooks as the DRM allows documents to be tied to certain computers. Or it could have been done accidentally - at a guess the creator opted to embed the fonts so users wouldn't need to have them installed, with the unintended consequence of making it not-copyable.

There are links to some Harry Potter books in Chinese here, not sure if this one is included.

Posters, in order of the posts above: colinuk, renzhe, roddy, colinuk, colinuk, colinuk, roddy, colinuk, davidj, roddy, colinuk, roddy, trevelyan, Luobot, renzhe, ipsi(), davidj, flameproof, ABCinChina, roddy