
Difficulty editing Chinese pdfs on Mac OS X?



Posted

Hi all.

I'm stumped. Hopefully this thread isn't too out of context here.

I work a lot with Chinese PDFs on my MacBook Pro (OS X 10.8.2): ones written or converted by me in Word, etc.; ones downloaded from the web; and ones scanned and then OCR'd (with ReadIris). I use a number of different PDF programs, including a couple of editors (PDFPen Pro and a newcomer, PDF Nomad).

In the past (up to when, I'm not too certain; maybe the beginning of the year or late last year), I never had any trouble editing Chinese PDFs and then re-saving them; meaning, if the document was a searchable PDF before editing, it would remain so afterwards; I could search for characters, copy text, etc. All the characters' metadata was intact (and the fonts were, of course, embedded in the files).

However, I recently kept running into a problem with one of the editors I use: after re-saving a file in this program, the characters' metadata seemed to get messed up, if not eliminated altogether. After re-saving (or exporting) the file, the text looked fine in a PDF reader, but as soon as I copied any text (say, four characters) and pasted it into another document, all I got was the same symbol four times, something akin to a box. At first I thought this was just this particular program's fault, but a few days ago I realized that it is actually system-wide: no matter what program I use to alter and re-save a Chinese PDF (from Preview to Adobe to various PDF utilities), the same thing happens. The text appears OK in a reader, but none of its information is behind the veneer.

And, given my work and research, this problem renders these PDFs nearly useless for me. A great frustration.

I have spoken with Apple and they too are stumped. I was wondering if a Mac cleaning utility (MacKeeper) erased my Chinese fonts on my computer, but I've restored the operating system from scratch (with all the language packs) with no luck, and even restored from Time Machine back to a point where I believe things were working better. Still, this issue persists.

Am I the only one experiencing this? Any thoughts?

If this is news to anyone, could someone with a Mac do an experiment? Take a Chinese PDF that is searchable (meaning you can find 是 or whatever), then use Preview to highlight a text (or change something in the file), and then export it to a new file, open it, copy some text and see what you have.

I'd be very interested to know whether this is indeed, as one PDF editor's developer told me, a Mac OS X issue at the moment.

Much gratitude for reading this far, and for any feedback and/or insights.

--Kongmu (釋空目)

Posted

I was able to reproduce this problem. What I am seeing is that characters are being re-mapped to code points in a Unicode "private use" area. For example, U+662F '是' in the original PDF is being encoded as U+10FC54 (which falls in Supplementary Private Use Area-B, plane 16) in the exported PDF. That's something that might be done when special, non-standard characters or glyphs are used, but using the "private" space is not "portable" and certainly not necessary in this case. It seems like a bug that Apple should be told about.
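
For anyone who wants to check pasted text for this symptom, here is a minimal Python sketch. It only tests each character against the three Unicode private-use ranges; the function names are made up for illustration and are not part of any PDF tool.

```python
# Unicode private-use ranges: the BMP Private Use Area plus
# Supplementary Private Use Areas A and B (planes 15 and 16).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(ch):
    """Return True if the single character ch is a private-use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_chars(text):
    """List the private-use code points found in text, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text if is_private_use(ch)]
```

Paste the text copied from the exported PDF into `private_use_chars(...)`; a non-empty result suggests the copy came back as private-use codes rather than real Chinese characters.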

Posted

Matthew....thank you very much; excellent information.

I'm happy to hear that this may indeed be a bug in OS X, as it seems I've exhausted just about every other possibility.

I will pass along your information to the Apple people I've been working with. Thank you for being so thorough.

--Kongmu

Posted
but using the "private" space is not "portable" and certainly not necessary in this case.

It depends. If the fonts are being embedded, then there's a significant space reduction if they include only a small subset of the font in the document (especially for CJK fonts) and then use a custom encoding based on the subset of characters used. My guess is that in this case they are mapping to the private range when doing this. PDFs, however, usually have a mapping table (the font's ToUnicode CMap) so that when you copy from such a document, the viewer can map these characters back to their original Unicode code points. So it seems that either the mapping table doesn't exist or it is being corrupted somehow.
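
The idea described above can be sketched as a toy model in Python. This is not how any real PDF library works internally: the private-use base value, the function names, and the dictionary standing in for the mapping table are all assumptions made for illustration.

```python
def build_subset(text, base=0x10FC00):
    """Give each distinct character a private-use code (a stand-in for a
    subset font's custom encoding) and build the reverse mapping table."""
    to_private = {}
    for ch in text:
        if ch not in to_private:
            to_private[ch] = chr(base + len(to_private))
    to_unicode = {code: ch for ch, code in to_private.items()}
    return to_private, to_unicode


def encode(text, to_private):
    """Store the text using the subset's custom codes."""
    return "".join(to_private[ch] for ch in text)


def copy_text(encoded, to_unicode=None):
    """Simulate copying text out of the document.

    With a mapping table, the custom codes are mapped back to the original
    characters. Without one, the raw private-use codes leak out as-is.
    """
    if not to_unicode:
        return encoded
    return "".join(to_unicode.get(ch, ch) for ch in encoded)
```

With the table, `copy_text(encode("是不是", to_private), to_unicode)` round-trips to the original string; without it, the result is a run of private-use characters that most fonts draw as boxes, which matches the symptom in the first post.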

Posted

I checked, and the fonts are embedded in the PDFs I tested. So, it looks like PDFs exported from Preview are missing the mapping table or have a bad one. In comparison, exports from Acrobat Reader did not show this problem.
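
A very crude way to look for the mapping table is to search the raw file for the `/ToUnicode` key that PDF font dictionaries use. The sketch below is only a rough heuristic with a made-up function name: font dictionaries stored inside compressed object streams will not show up in a raw byte scan, so a count of zero does not prove the table is missing, and a non-zero count does not prove the table is correct.

```python
def count_tounicode_markers(pdf_bytes):
    """Count raw occurrences of the /ToUnicode key in the file's bytes.

    Heuristic only: keys inside compressed object streams are not visible
    to a plain byte search.
    """
    return pdf_bytes.count(b"/ToUnicode")
```

To compare two exports, read each file with `open(path, "rb").read()` and pass the bytes in (the paths are whatever files you are testing).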

Posted

Thank you again, Matthew; and thanks as well, imron.

Yes, I had confirmed the other day that the fonts are also embedded in the files I've been working with.

It is interesting that Adobe Reader can manipulate Chinese PDF files OK (I just confirmed this myself as well). The other PDF apps I use share the same PDF libraries with Preview, according to a developer I emailed about this issue.

The consensus, however, is that this still is an Apple OS issue, correct?

Posted

I have had a similar problem when using the Find function in Chinese PDFs (also 10.8.2 and Preview), which is that I can only search for one character at a time - if I search for two characters it will *ding* at me and only search for the first one.

Is this related to the OP's problem?

Posted
The consensus, however, is that this still is an Apple OS issue, correct?

Yeah, it looks like a problem with Preview itself or in lower levels of Apple's machinery.

  • 2 months later...
Posted

Regarding this issue again, for what it is worth: I just spoke with Apple, and their engineers are aware of this issue and are working on a fix. No time was mentioned as to when it may come out. In the meantime, much of my work here in Taiwan regarding Chinese texts is on hold, as this issue affects combining, splitting, highlighting, OCR'ing, or manipulating and re-saving Chinese PDFs in any way. I'm thinking of installing Parallels and running Windows on this machine so I can continue to work.
