markhavemann Posted October 16, 2020 at 04:12 AM
I'll post this here in case somebody else comes across the same problem. I bought an e-book at https://www.blcup.com/. Somehow the e-book version was 100 RMB, while the physical book only costs 30 or so on Taobao. But oh well, it's worth it if I can just carry a tablet to class instead of a bunch of heavy books. After paying for the book I was really annoyed to find out it's in some weird .opz format, and you need to download their own really crappy reader to open it. I also couldn't find anything on the internet about the .opz format or about converting it to a better format.
Anyway, here's what I figured out:
1. Rename the file to .pdf.
2. Open it with PDF-XChange Editor. It will open but say there are errors and ask if you want to save a new, fixed file.
3. Save it as a fresh PDF that can be read in any application.
Unfortunately the text seems to have some weird encoding issues, so copying to another application just results in garbage (not so great for quickly looking up characters). I'm trying to figure this out and I'll post the solution if I do.
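The rename step above can be sketched in Python. This is only an illustration under assumptions: the helper names `looks_like_pdf` and `copy_as_pdf` are made up for this sketch, and the check for a `%PDF-` marker near the start of the file is an assumption about what a renamable .opz file would contain, not something the post confirms. It copies rather than renames, so the purchased file stays untouched.

```python
import shutil
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Return True if the PDF header marker appears in the first 1024 bytes."""
    with path.open("rb") as handle:
        return b"%PDF-" in handle.read(1024)


def copy_as_pdf(source: Path) -> Path:
    """Copy e.g. book.opz to book.pdf in the same folder and return the new path."""
    target = source.with_suffix(".pdf")
    shutil.copyfile(source, target)
    return target
```

Calling `copy_as_pdf(Path("book.opz"))` (a hypothetical file name) would produce `book.pdf` next to it. The errors reported by PDF-XChange Editor would still need the "save a new, fixed file" step described in the post.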
thelearninglearner Posted October 16, 2020 at 05:39 AM
As far as the copy-paste thing goes, you can probably use Pleco to OCR the parts you want to copy, since you're using it on a tablet or phone.
markhavemann Posted October 16, 2020 at 06:41 AM Author
1 hour ago, thelearninglearner said: As far as the copy paste thing goes, you can probably use pleco to ocr the parts you want to copy. Since you're using it on a tablet or phone
Yeah, looks like I will have to resort to that. Not as convenient as copying and pasting, but I guess it will do.
thelearninglearner Posted October 16, 2020 at 10:21 AM
3 hours ago, markhavemann said: Yeah looks like I will have to resort to that. Not as convenient as copying and pasting, but I guess it will do.
Maybe also check out some advanced PDF readers (I'm thinking Adobe Reader). They might have some features that can help, such as extra conversions.
大块头 Posted October 16, 2020 at 06:30 PM
ocrmypdf may be a solution.
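For anyone wanting to try that suggestion, here is a rough sketch of calling the ocrmypdf command-line tool from Python. Assumptions, none of which come from the thread: ocrmypdf and Tesseract's `chi_sim` (Simplified Chinese) language data are installed, the file names are placeholders, and `--force-ocr` is the right mode here because the existing text layer is unusable and ocrmypdf would otherwise decline to OCR pages that already contain text. The function names are invented for this sketch.

```python
import subprocess
from typing import List


def build_ocrmypdf_command(source: str, target: str, language: str = "chi_sim") -> List[str]:
    """Build the argument list; --force-ocr rasterizes pages and replaces the old text layer."""
    return ["ocrmypdf", "--language", language, "--force-ocr", source, target]


def run_ocr(source: str, target: str, language: str = "chi_sim") -> None:
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocrmypdf_command(source, target, language), check=True)
```

Something like `run_ocr("book.pdf", "book_ocr.pdf")` would then write a new file with a fresh OCR text layer. Whether the recognition quality is good enough for a Chinese textbook is untested here.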
markhavemann Posted October 17, 2020 at 12:14 AM Author
13 hours ago, thelearninglearner said: Maybe also check out some advanced pdf readers (I'm thinking Adobe reader). Might have some features that can help. Extra conversions
I've never had so many PDF editors installed on my computer at once. Eventually I found a tool to look at the "unicode mapping" tables of the PDF. It looks like the character shapes were saved as vector "glyphs" so that they could be displayed, and a text character is linked to each one, but when the PDF was created it didn't specify WHICH Unicode character was linked to which glyph, meaning the text is completely unrecoverable without identifying each character manually.
5 hours ago, 大块头 said: ocrmypdf may be a solution
I eventually settled on PDF-XChange's built-in OCR, which seems to work much better than Adobe's for some reason. It also had the option to OCR existing "text", which saved me from having to flatten each page into an image or anything like that.
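For context on those "unicode mapping" tables: in a PDF they are normally stored as a ToUnicode CMap, where `beginbfchar` sections pair a character code with its Unicode value in UTF-16BE hex. The sketch below parses such a section from already-decompressed CMap text. The sample table is invented to show what a correct mapping could look like. It is not taken from the book's PDF, and `bfrange` sections are not handled.

```python
import re
from typing import Dict

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Map each character code to the Unicode string given in the bfchar sections."""
    mapping: Dict[int, str] = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for code_hex, unicode_hex in BFCHAR_ENTRY.findall(block):
            # Destination values are UTF-16BE encoded hex strings.
            mapping[int(code_hex, 16)] = bytes.fromhex(unicode_hex).decode("utf-16-be")
    return mapping


# Invented sample: what a working table with two entries might look like.
SAMPLE_CMAP = """
2 beginbfchar
<0033> <7684>
<002B> <4EBA>
endbfchar
"""
```

With this sample, `parse_bfchar(SAMPLE_CMAP)` gives a dictionary mapping code 0x33 to 的 and code 0x2B to 人. When a PDF lacks such a table, or the table is wrong, a viewer has nothing reliable to copy, which matches the garbage output described above.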
ChTTay Posted October 17, 2020 at 03:20 AM
Another option (for next time!?) would be to buy the book and manually scan it in. It sounds tedious but it’s not that bad now that phone scanners are decent. I scanned in a 300-page textbook myself and it took less than an hour. That hour included taking the scans and tweaking a few of them. You could also do a chapter at a time if you wanted to (probably takes less than 5 minutes). It’s not as perfect as it would have been if there was a PDF actually available (it’s an old book), but for personal use on my iPad it’s great. At least it is a PDF file that can be opened by any standard reader.
NinKenDo Posted October 22, 2020 at 11:15 PM
I'm guessing the OPZ format uses some kind of character map and that's why you get garbage out. To get proper text data, you would need to know the mapping of the codepoints. That might be relatively easy if they've just shifted them over by 1000 or something, just to make copy-pasting not work, but if it's a full remapping that might be harder. Given the size of the book, my guess is that they haven't done a complete remapping, as that would cut down on the size dramatically (assuming they didn't do a random shuffle just to prevent copy-paste).
If they have just shifted them over, you could reverse engineer it by looking at the glyph, finding the relevant Unicode codepoint, and calculating the difference from the codepoint that sits under the glyph in the file. Check against a few characters to be sure, and if two or three map the same way, that's probably what they've done.
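The shift check described above could look roughly like this in Python. It is a sketch of the idea only: the function names are made up, and a constant shift is a hypothesis about the file, not an established fact. Each pair holds the character that comes out when copying and the character actually shown on the page.

```python
from typing import Iterable, Optional, Tuple


def infer_shift(pairs: Iterable[Tuple[str, str]]) -> Optional[int]:
    """Return the codepoint offset if every (copied, shown) pair agrees, else None."""
    offsets = {ord(shown) - ord(copied) for copied, shown in pairs}
    return offsets.pop() if len(offsets) == 1 else None


def apply_shift(text: str, shift: int) -> str:
    """Move every character in text by the same codepoint offset."""
    return "".join(chr(ord(ch) + shift) for ch in text)
```

For a made-up example where text really was shifted down by 1000, `infer_shift` on a few pairs would return 1000 and `apply_shift(garbled, 1000)` would restore the original. If the pairs disagree, the function returns `None` and the constant-shift idea can be dropped.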
markhavemann Posted October 23, 2020 at 08:09 AM Author
8 hours ago, NinKenDo said: If they have just shifted them over, you could reverse engineer it by just looking at the glyph, finding the relevant Unicode codepoint, and calculating the difference from the codepoint that sits under the glyph in the file. Check against a few characters to be sure, and if two or three map the same way, probably that's what they've done.
That's a good point. I've noticed that copying "的人" pretty consistently gives ".¶", while 的 alone is "3" and 人 alone is "+", so it does look like there is a method to the madness, but it's slightly beyond my own expertise unfortunately. I've uploaded the PDF here if you or anyone else wants to try to crack the code.
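As a starting point for anyone trying, the three samples reported above can be checked for consistency. The sketch below (function names invented here) lines up each displayed string with what copying produced and lists characters that came out in more than one way. It assumes each displayed character corresponds to exactly one copied character, which the samples happen to satisfy but which may not hold in general.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (text shown on the page, text obtained by copying), as reported in the post above.
THREAD_SAMPLES = [("的人", ".¶"), ("的", "3"), ("人", "+")]


def collect_substitutions(samples: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """For each displayed character, gather every character it was copied as."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for shown, copied in samples:
        if len(shown) != len(copied):
            raise ValueError("shown and copied text must have the same length")
        for shown_char, copied_char in zip(shown, copied):
            seen[shown_char].add(copied_char)
    return dict(seen)


def find_conflicts(samples: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return only the displayed characters that were copied as more than one character."""
    return {ch: outs for ch, outs in collect_substitutions(samples).items() if len(outs) > 1}
```

On these samples both 的 and 人 show up as conflicts, since each copies differently alone than in the pair. The single-character offsets also differ: 的 (U+7684) minus "3" (U+0033) is 30289, while 人 (U+4EBA) minus "+" (U+002B) is 20111. So neither a constant shift nor a single one-to-one table fits these three samples on their own. More samples would be needed to say what the actual scheme is.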