Chinese-Forums

Converting Chinese PDFs to HTML



Posted

I have a very long PDF document written in Simplified Chinese that I would like to study. I am unable to use Kingsoft's Powerword with it on Windows, nor can I use CEDICT / Pera-Kun on my Mac to look up words, as it is a PDF. So I wanted to convert the PDF to HTML or Word format so that I could use one of these methods. However, when I try to convert the file using the export feature in Adobe Acrobat Professional 8 (or its 'save as' function, for that matter), the resulting HTML or Word document shows a really weird character set; it's all gobbledygook, basically. I can't get it to 'save as' or export to HTML or Word and preserve the Chinese characters.

Would anyone know what is going wrong here? Is there anything I should be tweaking to make sure that the characters transfer correctly during export?

Any help would be appreciated.

Cheers

Colin

Posted

It's hard to say, so I'm taking blind guesses.

Have you tried all possible character sets on the resulting HTML file? UTF-8, UTF-16, GBK, GB2312, Big-5, etc?
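A minimal sketch of that trial-and-error check in Python, assuming you can get at the raw bytes of the exported file. The sample bytes and the `try_encodings` helper are made up for illustration; the codec names are the standard Python ones.

```python
# Candidate encodings to try, roughly the ones suggested above.
CANDIDATE_ENCODINGS = ["utf-8", "utf-16", "gbk", "gb2312", "gb18030", "big5"]


def try_encodings(raw, encodings=CANDIDATE_ENCODINGS):
    """Decode raw bytes with each candidate; None means the bytes are invalid in it."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: two characters saved as GBK bytes.
sample = "汉字".encode("gbk")
for name, text in try_encodings(sample).items():
    print(name, repr(text))
```

A decode that succeeds is not automatically the right one, so the output still has to be checked by eye. If every candidate gives nonsense, the bytes probably never held a standard encoding in the first place.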

Can you cut and paste characters from the PDF into other programs (like Word)? Which encoding do they end up in?

Could you try cutting and pasting excerpts out manually if the file is not too long?

Posted

Is it available online? If so let us know where and the forum geeks . . er . .. technical experts . . will no doubt take a look.

Posted

Hi there

Thanks for the thoughts. When I export to HTML from Adobe Acrobat Professional, it gives several options in the settings to choose from: UTF-8, UTF-16, UCS-4, ISO-Latin-1, HTML / ASCII, Use mapping table default. I have tried all of them to create the HTML document. When I then open the HTML document in a web browser, the characters are all just a mess. I then change the view settings in the browser, going through the full range of Simplified Chinese character encodings available (e.g. ISO-2022-CN, HZ, GB18030, etc.). None of them displays the characters correctly, as they appeared in the original document.

Likewise, if I export to a Word document, there are no encoding settings to choose from when exporting. When I open the exported Word document, I go through every Chinese font on my list to try to make the characters display properly, but nothing has worked so far.

Is this the kind of thing that you are talking about?

Am really at a bit of a loss on this one. I don't know enough about PDF documents to know what is going on in the background really.

Cheers

Colin

Posted

I have attached one of the 280 pages of the document here in case anyone on the forum dares have a look at it! :mrgreen: There must be a way to sort out the coding for this, it's just escaping me right now..... :roll:

One page extract.pdf

Posted

Oh, I forgot to say: if I try to cut and paste out of the PDF document directly, I get the same problem......a whole load of unrecognisable rubbish.....or just dots

Posted

"Document" eh? Looks suspiciously like the adventures of a certain boy wizard to me. Google specific phrases in quotes and you'll find what you're looking for easily enough. It might be worth checking for quality, as there may be different versions out there.

Posted

Yep 'tis the boy wizard himself. :lol: Maybe his magic is stopping me from exporting it to HTML!!

There are a lot of really bad translations out there, and many are incomplete. I have done quite a lot of searching already. The one I have is faithful to the English version, well translated and complete, so I would like to stick with it, if I could only get it to export...... :help

Posted

It's using an embedded subset font. It's quite possibly coded in order of appearance and only suitable for printing or OCR.
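To illustrate why that matters, here is a toy model in Python. It is not real PDF parsing, and the function names and sample sentence are invented for the example. Each distinct character gets a code in order of first appearance, and `glyph_table` stands in for the information a viewer would need (glyph shapes, or an optional ToUnicode map) to get back to real characters.

```python
def subset_encode(text):
    """Assign each distinct character a code in order of first appearance."""
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            # Start at 1; code 0 is kept free here for "missing glyph".
            code_for[ch] = len(code_for) + 1
        codes.append(code_for[ch])
    glyph_table = {code: ch for ch, code in code_for.items()}
    return codes, glyph_table


def naive_extract(codes):
    """What copy-and-paste yields if the codes are treated as character values."""
    return "".join(chr(c) for c in codes)


def extract_with_table(codes, glyph_table):
    """Recovery is only possible when a code-to-character table is available."""
    return "".join(glyph_table[c] for c in codes)


original = "我在家，他在学校"
codes, table = subset_encode(original)
print(codes)
print(repr(naive_extract(codes)))
print(extract_with_table(codes, table))
```

In this model the codes mean nothing outside the document that defined them, so trying different character sets on the exported text cannot help. Only the table, or OCR of the rendered glyphs, gets the text back.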

Posted

Fair enough, but I'd be surprised if this isn't online already.

There's an Adobe email address which accepts PDF attachments and sends back plain text, but it says it works for 'English and most West European languages'. I've sent the single page in just in case, but it hasn't come back yet.

Posted

"It's using an embedded subset font. It's quite possibly coded in order of appearance and only suitable for printing or OCR."

David, thanks for that. However, does that mean there are no tweaks one could make in Adobe Acrobat Professional (or anything else) to export the Chinese to another document format?

Roddy, I didn't know Adobe had such a service. I'll try to look into that too; let me know if you get any results from your email though. Thanks

Colin

Posted

See here. It hasn't come back though, so I assume either it doesn't work, or our attempts to feed it Chinese have broken it. Actually, the FAQ does say:

Languages requiring double-byte characters, such as Japanese, Chinese, Arabic, and Hebrew are not supported.

Posted

The internal PDF format doesn't necessarily contain recognizable encoding data for any particular character set. There is not necessarily a way to get the text back.

Posted

If it really is a special coding embedded in the document and specific to this particular text, then there's no way to convert it to anything, short of using a Chinese-aware OCR program.

Posted (edited)

It's very definitely using a special encoding - 在 comes out as

Posted

My impression was that CID Identity-H can't produce characters wider than 16 bits, so I suspect that it is actually spoofing 32-bit ones by triggering a Unicode escape mechanism. Actually, Windows, and I think Linux, don't support storing UTF-32.
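A small sketch of the 16-bit point, assuming the usual reading of Identity-H in which the text is a sequence of 2-byte big-endian codes that map directly to glyph IDs in the font. The byte string and the `identity_h_codes` helper are invented for illustration and were not taken from the attached page.

```python
import struct


def identity_h_codes(raw):
    """Split a byte string into 2-byte big-endian codes (each 0..0xFFFF)."""
    if len(raw) % 2:
        raise ValueError("expected a whole number of 2-byte codes")
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


# Hypothetical string operand as it might appear in a page's content stream.
raw = bytes.fromhex("0001000200030002")
print(identity_h_codes(raw))
```

Each code is a glyph index and not a Unicode value, which is why a separate mapping is needed before the text can be copied out meaningfully.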

However, PDF is a final form document format. It was never designed to allow conversion back to revisable form (although it was a design aim that you can cut and paste plain text, subject to permission flags, and authors are supposed to make that possible - but not all authoring tools comply; more recently, accessibility requirements have meant that accessibility tools should be able to get at the plain text).

As there should be a revisable form document that underlies the PDF file, you should ask the PDF file creator for that document. If they have given you a copyright licence to convert the document to HTML, that is the least they can do; they really ought to have set security flags on the document if they didn't want that, but it is not safe to infer a licence from a lack of DRM.

Of course, if it really is a derivative of a J.K. Rowling work, I believe she is very strict on requiring royalties, so I'm surprised that you managed to get that permission legitimately.

Posted

Did the author do this on purpose to make it hard to cut and paste the translation?

Seems like an interesting read... Does anybody have a Microsoft Word version? :D

Posted

Quite possibly - one of the ideas behind the PDF format is that authors can protect content. PDFs are also used for ebooks, as the DRM allows documents to be tied to certain computers. Or it could have been done accidentally - at a guess, the creator opted to embed the fonts so users wouldn't need to have them installed, with the unintended consequence of making the text impossible to copy.

There are links to some Harry Potter books in Chinese here, not sure if this one is included.
