Shi Kongmu Posted March 2, 2013 at 10:04 AM
Hi all. I'm stumped. Hopefully this thread isn't too out of context here. I work a lot with Chinese PDFs on my MacBook Pro (OS X 10.8.2), ones either written/converted by me in Word, etc.; downloaded from the web; or scanned and then OCR'd (with ReadIris). I use a number of different PDF applications, including a couple of editors (PDFPen Pro, and a newcomer, PDF Nomad). In the past (up to when, I'm not too certain; maybe the beginning of the year or late last year), I never had any trouble editing Chinese PDFs and then re-saving them; meaning, if the document was a searchable PDF before editing, it would remain so afterwards; I could search for characters, copy text, etc. All the characters' metadata was intact (and the fonts were, of course, embedded in the files).
However, recently I kept having a recurring problem with one of the editors I used, where the characters' metadata, after re-saving a file in this program, seemed to get messed up if not eliminated altogether. After re-saving (or exporting) the file, although the text looked fine in a PDF reader, as soon as I copied any text (say four characters) and pasted it into another document, all I got were four of the same symbol, something akin to a box.
Again, I thought this was just this particular program's fault, but just a few days ago I realized that this is actually system-wide: no matter what program I use to alter and re-save a Chinese PDF (from Preview to Adobe to various PDF utilities), the same thing happens: the text appears OK in a reader, but none of its information is behind the veneer. And, given my work and research, this problem renders these PDFs nearly useless for me. A great frustration. I have spoken with Apple and they too are stumped.
I was wondering if a Mac cleaning utility (MacKeeper) erased the Chinese fonts on my computer, but I've restored the operating system from scratch (with all the language packs) with no luck, and even restored from Time Machine back to a point where I believe things were working better. Still, this issue persists. Am I the only one experiencing this? Any thoughts?
If this is news to anyone, could someone with a Mac do an experiment? Take a Chinese PDF that is searchable (meaning you can find 是 or whatever), then use Preview to highlight some text (or change something in the file), and then export it to a new file, open it, copy some text and see what you have. I'd be very interested to know if this is indeed, as one PDF editor's developer told me, a Mac OS X issue at the moment. Much gratitude for reading this far, and for any feedback and/or insights. --Kongmu (釋空目)
ManManLai Posted March 3, 2013 at 03:42 AM
I was able to reproduce this problem. What I am seeing is that characters are being re-mapped to code points in a Unicode "private use" plane. For example, U+662F '是' in the original PDF is being encoded as U+10FC54 in the exported PDF. That's something that might be done when special, non-standard characters or glyphs are used, but using the "private" space is not "portable" and certainly not necessary in this case. It seems like a bug that Apple should be told about.
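For anyone who wants to check their own pasted text the same way, here is a minimal Python sketch that reports whether each character falls in one of Unicode's private use ranges. The function names `private_use_area` and `report` are made up for illustration; the ranges themselves are the ones defined by the Unicode Standard.

```python
import unicodedata

# Private use ranges defined by the Unicode Standard.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF, "Private Use Area (BMP)"),
    (0xF0000, 0xFFFFD, "Supplementary Private Use Area-A (plane 15)"),
    (0x100000, 0x10FFFD, "Supplementary Private Use Area-B (plane 16)"),
]


def private_use_area(ch):
    """Return the name of the private use range containing ch, or None."""
    cp = ord(ch)
    for lo, hi, name in PRIVATE_USE_RANGES:
        if lo <= cp <= hi:
            return name
    return None


def report(text):
    """Describe every character in text, one line per character."""
    lines = []
    for ch in text:
        area = private_use_area(ch)
        # Fall back to the standard character name for ordinary characters.
        label = area if area else unicodedata.name(ch, "unnamed")
        lines.append(f"U+{ord(ch):04X} {label}")
    return lines


if __name__ == "__main__":
    # The two code points mentioned in the post above.
    for line in report("\u662f" + chr(0x10FC54)):
        print(line)
```

Pasting text copied from an affected PDF into `report(...)` should show at a glance whether it consists of private use code points rather than ordinary CJK ideographs.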
Shi Kongmu Posted March 3, 2013 at 03:44 AM
Matthew... thank you very much; excellent information. I'm happy to hear that this may indeed be a bug in OS X, as it seems I've exhausted just about every other possibility. I will pass along your information to the Apple people I've been working with. Thank you for being so thorough. --Kongmu
imron Posted March 3, 2013 at 06:48 AM
[Quoting ManManLai] but using the "private" space is not "portable" and certainly not necessary in this case.
It depends. If the fonts are being embedded, then there's a significant space reduction if they only include a small subset of the font in the document (especially for CJK fonts) and then use a custom encoding based on the subset of characters used. My guess is that in this case they are mapping to the private range when doing this. PDFs, however, usually have a mapping table so that when you copy from such a document, it's able to map these characters back to their original Unicode code points, so it seems like either the mapping table doesn't exist, or it is being corrupted somehow.
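The idea described above can be sketched with a toy model in Python. This is not how a PDF is actually structured and not Apple's implementation; it only illustrates why copied text degrades when the reverse mapping table (in PDF terms, a font's ToUnicode CMap) is missing. The names `build_subset`, `encode`, `extract`, and the starting value `PUA_BASE` are all assumptions made for the example.

```python
# Assumption: an arbitrary starting point inside Supplementary
# Private Use Area-B; real software may pick codes differently.
PUA_BASE = 0x100000


def build_subset(text):
    """Give each distinct character a private use code and record the
    reverse mapping (the toy equivalent of the mapping table)."""
    to_private = {}
    to_unicode = {}
    for ch in text:
        if ch not in to_private:
            code = chr(PUA_BASE + len(to_private))
            to_private[ch] = code
            to_unicode[code] = ch
    return to_private, to_unicode


def encode(text, to_private):
    """Store the text using the custom subset encoding."""
    return "".join(to_private[ch] for ch in text)


def extract(encoded, to_unicode=None):
    """Simulate copying text out of the document.

    With a mapping table the original characters come back; without
    one, the raw private use codes are all that can be returned."""
    if to_unicode is None:
        return encoded
    return "".join(to_unicode.get(code, "\ufffd") for code in encoded)


if __name__ == "__main__":
    original = "\u662f\u4e0d\u662f"  # 是不是
    to_private, to_unicode = build_subset(original)
    stored = encode(original, to_private)
    print(extract(stored, to_unicode) == original)  # table present
    print(extract(stored) == original)              # table missing
```

In this model the glyphs would still draw correctly either way, because drawing only needs the subset codes; it is only search and copy that depend on the table, which matches the symptoms reported in this thread.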
ManManLai Posted March 3, 2013 at 09:06 AM
I checked, and the fonts are embedded in the PDFs I tested. So, it looks like PDFs exported from Preview are missing the mapping table or have a bad one. In comparison, exports from Acrobat Reader did not show this problem.
Shi Kongmu Posted March 3, 2013 at 10:48 AM
Thank you again, Matthew; and thanks as well, imron. Yes, I had confirmed the other day that the fonts are also embedded in the files I've been working with. It is interesting that Adobe Reader can manipulate Chinese PDF files OK (I just confirmed this myself as well). The other PDF apps I use share the same PDF libraries with Preview, according to a developer I emailed about this issue. The consensus, however, is that this still is an Apple OS issue, correct?
hbuchtel Posted March 4, 2013 at 03:28 AM
I have had a similar problem when using the Find function in Chinese PDFs (also 10.8.2 and Preview), which is that I can only search for one character at a time: if I search for two characters it will *ding* at me and only search for the first one. Is this related to the OP's problem?
ManManLai Posted March 4, 2013 at 06:58 AM
[Quoting hbuchtel] Is this related to the OP's problem?
Dunno. I'd still report the problem to Apple.
ManManLai Posted March 4, 2013 at 07:00 AM
[Quoting Shi Kongmu] The consensus, however, is that this still is an Apple OS issue, correct?
Yeah, it looks like a problem with Preview itself or in lower levels of Apple's machinery.
Shi Kongmu Posted May 17, 2013 at 01:14 AM
Regarding this issue again, for what it is worth, I just spoke with Apple, and their engineers are aware of this issue and are working on a fix. No time was mentioned as to when it may come out. In the meantime, much of my work here in Taiwan regarding Chinese texts is on hold, as this issue affects combining, splitting, highlighting, OCR'ing, or manipulating and re-saving of any kind of Chinese PDFs. I'm thinking of installing Parallels and running Windows on this machine so I can continue to work.
roddy Posted May 17, 2013 at 09:43 AM
Thanks for the follow up.