HerrPetersen Posted October 1, 2008 at 09:30 PM

Dajia hao,

I downloaded some PDF files from ChinesePod with a trial account. I found some of the material pretty useful, so I tried to cut out some sentences in order to learn them with my learning program, Anki. However, when I tried to copy the text out, I ended up with nonsense like this:

Hß`Tû` : w´ ang t ` aitai, wˇ o zˇ enme g¯ en nˇ ı li ´ anx` ı? nˇ ı yˇ ou shˇ ouj¯ ı ma?
Mrs. Wang, how can I get in touch with you? Do you have a mobile phone?

So basically the hanzi are all messed up and the pinyin is somewhat messed up. I also have some older PDFs where this works just fine.

My question is: is this due to the ChinesePod staff trying to copy-protect their material more strictly? Or is it due to the PDF being created on a Chinese computer, or some other computer issue? Are there any solutions (besides copying it all by hand)?

I remember downloading subtitles for a Chinese movie where the subtitles ended up looking somewhat like the lines above. So my guess (and hope) is that it is some kind of computer issue.
trevelyan Posted October 1, 2008 at 09:35 PM

How are you importing things into Anki? Is there any way to automate it?
HerrPetersen Posted October 1, 2008 at 09:41 PM

I create spreadsheets in Excel. When I have gathered enough sentences, I use Anki's import function. Here is some typical material from my Excel sheet (this is from a song):

L0001 祝你生日快乐。 zhù nǐ shēng rì kuài lè, [sound:L0001.ogg]
L0002 祝你幸福, 祝你健康。 zhù nǐ xìng fú,zhù nǐ jiàn kāng. [sound:L0002.ogg]
L0003 祝你前途光明。 zhù nǐ qián tú guāng míng。 [sound:L0003.ogg]
L0004 有个温暖家庭。 yǒu gè wēn nuǎn jiā tíng。 [sound:L0004.ogg]

Some fields that I use for definitions and translations are empty here.
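On the automation question: the spreadsheet step in the post above could probably be skipped by writing the rows straight to a tab-separated text file, which Anki's text import should accept. The following is only a minimal sketch in Python. The function name `rows_to_tsv` and the sample `ROWS` are made up for illustration, and the column order (ID, hanzi, pinyin, sound tag) is copied from the post, not from any Anki requirement.

```python
import csv
import io

# Hypothetical sample rows in the column order used in the post above:
# (ID, hanzi, pinyin, sound tag). Definition/translation fields are omitted.
ROWS = [
    ("L0001", "祝你生日快乐。", "zhù nǐ shēng rì kuài lè", "[sound:L0001.ogg]"),
    ("L0003", "祝你前途光明。", "zhù nǐ qián tú guāng míng", "[sound:L0003.ogg]"),
]


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text, one note per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the result; redirect it to a UTF-8 file to import it.
    print(rows_to_tsv(ROWS), end="")
```

Saving the output as UTF-8 matters here, since otherwise the hanzi and tone marks may not survive the import.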
davidj Posted October 1, 2008 at 09:42 PM

There was a thread about this recently, for another source of PDFs.

Without actually seeing it (and please don't post it), I would say it is almost certainly the result of poorly designed software that embeds a font subset without using the standard encoding, rather than a conspiracy. PDF provides other ways of protecting intellectual property, so if they really wanted to protect it, you wouldn't be able to cut and paste at all using Acrobat Reader.

Being able to cut and paste, where permitted, is one of the design goals of PDF. If authors obey the authoring rules, it should be possible whenever copying is marked as permitted.

Actually, using a bogus encoding is potentially illegal in some contexts, because it makes the material unavailable to "assistive technology" used by the disabled, particularly text-to-speech converters. I think the US legislation is very weak in this area, except for federal government projects. The UK legislation is a bit stronger, but not enforced.
trevelyan Posted October 2, 2008 at 04:32 AM

@davidj - PDF isn't a great standard, and the problem here is with automated PDF generation as opposed to manual PDF generation. I'd be very interested if you know of any software that can batch-process PDFs and support copy-and-pasting.

@HerrPetersen - we provide text transcripts of everything at Popup Chinese, so you may find the site a suitable replacement. As long as we're in beta you can get a free premium account by using the voucher number 2008AOYUN. It really depends on your level, though: ChinesePod excels at producing great Newbie content and has quite sizable archives, while we are focusing more on intermediate and advanced students and are quite new (still in beta, in fact).
davidj Posted October 2, 2008 at 07:17 AM

Although I have no experience of the Adobe tools, and negligible experience of Chinese PDFs, I would suggest that randomly encoded font subsets are likely to be the result of the software that creates the PostScript file (or prints to pdfwriter), rather than of any Adobe tool.

I'd be rather suspicious of any third-party tool that directly generates PDF, except for ghostscript. At least one just generates a bitmap with no text underlay, which goes completely against the spirit of PDF, but I suspect many violate the spirit in other ways. ghostscript requires a PostScript intermediate, which is where I think the problem would arise here. It is a good tool for batch processing, but I have little experience of using it with Chinese.

A non-Chinese, non-PDF example of this problem: with the default font handling settings, some versions of Microsoft Visual Studio print gibberish on one of our PostScript printers. This happens because Visual Studio causes its print driver to embed a font subset encoded in the order of first appearance, while the printer (driver) tries to substitute a similar built-in font.
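To make the "font subset encoded in the order of first appearance" idea concrete, here is a toy model in Python. It does not parse any PDF; all function names are invented for illustration, and the fallback used when no code-to-Unicode table is present (reading each code as an offset from 0x40) is an arbitrary stand-in for whatever a real viewer guesses. In real PDFs the table that makes copying work is, as far as I understand it, the font's ToUnicode map.

```python
def build_subset_encoding(text):
    """Assign glyph codes 1, 2, 3, ... in order of first appearance."""
    encoding = {}
    for char in text:
        if char not in encoding:
            encoding[char] = len(encoding) + 1
    return encoding


def draw_codes(text, encoding):
    """The 'page content': a list of glyph codes, not Unicode characters."""
    return [encoding[char] for char in text]


def extract_text(codes, to_unicode=None):
    """Simulate copy-and-paste from the page content.

    With a code-to-Unicode table the original text comes back. Without one,
    this toy viewer guesses by mapping code n to chr(0x40 + n).
    """
    if to_unicode is not None:
        return "".join(to_unicode[code] for code in codes)
    return "".join(chr(0x40 + code) for code in codes)


if __name__ == "__main__":
    text = "你有手机吗"
    encoding = build_subset_encoding(text)
    codes = draw_codes(text, encoding)
    to_unicode = {code: char for char, code in encoding.items()}
    print(extract_text(codes, to_unicode))  # the original hanzi
    print(extract_text(codes))              # unrelated Latin letters
```

The page still displays correctly in both cases, because the embedded glyphs are drawn by code; only the route back from code to character is missing.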
ipsi() Posted October 2, 2008 at 08:27 AM

I can copy/paste from a PDF created by pdfLaTeX on Ubuntu, but I can't do it from one created by MiKTeX's pdfLaTeX, though in both cases they've got some weird font embedding going on. I also didn't get perfect extraction, as it decided to paste an ASCII ',' rather than the full-width '，'. At least it works, I guess.

Also, OpenOffice seems to be very good at exporting PDFs.

Neither of which helps an awful lot, but those are my observations.
jbradfor Posted October 2, 2008 at 03:41 PM

The problem, as many people had guessed, is that the software used to generate the PDF files is broken. ChinesePod staff admit as much. They say they "want" to fix it, but they haven't in over a year, so I wouldn't hold my breath. [This has been discussed quite extensively on ChinesePod and the ChinesePod forums.]

The only solution is to use the .html files instead. Those allow valid cut and paste. I think inside each PDF file there is a link to the HTML file. [At least there was the last time I checked, but that was a long time ago, as the PDF is useless to me.]

Note that if you prefer traditional rather than simplified characters, you can add "trad" before the ".html" or the ".pdf" to get that version.
Normunds Posted December 1, 2008 at 05:00 PM

From what I understand, the problem might arise from the use of ghostscript-based utilities.

I stumbled upon this thread while investigating why I cannot copy Chinese characters from PDFs converted using ghostview or printed using CutePDF, both of which are based on ghostscript.

It looks like ghostscript by default displays Chinese characters using CID fonts, which use a different encoding from Unicode or any other standard used for Asian characters. So copying this "text" does not make sense in any external app that does not expect it.

It looks like it might be possible to configure ghostscript to use some freely available Unicode fonts instead, but I've failed to find any instructions on how to do it, just hints that it might be possible.

Does this ring a bell with anybody, or am I off in my assumptions?
dieterlu Posted January 1, 2009 at 10:39 AM

I had the same problem:

- Copying with the free Foxit PDF reader did not grab all the Chinese characters. I solved this by using the haihai PDF reader. You can copy to the clipboard just by marking the text (no need for Ctrl-C).
- Paste into Word. The Chinese characters come out well. The pinyin is the same mess you mentioned in your post.
- I clean this up with the following macro:

Sub convert_to_unicode()
    ' Each call replaces one garbled "accent + space + vowel" sequence
    ' with the proper pinyin letter throughout the document.

    ' u-umlaut sequences first, so the plain "u" rules below cannot split them
    ReplaceAll "¨ u ¯ u", ChrW(470)
    ReplaceAll "¨ u ´ ü", ChrW(472)
    ReplaceAll "¨ u " & ChrW(711) & " u", ChrW(474)
    ReplaceAll "¨ u ` u", ChrW(476)

    ' a
    ReplaceAll "¯ a", ChrW(257)
    ReplaceAll "´ a", "á"
    ReplaceAll ChrW(711) & " a", ChrW(462)
    ReplaceAll "` a", "à"

    ' e
    ReplaceAll "¯ e", ChrW(275)
    ReplaceAll "´ e", "é"
    ReplaceAll ChrW(711) & " e", ChrW(283)
    ReplaceAll "` e", "è"

    ' i (the PDF yields the dotless i, ChrW(305))
    ReplaceAll "¯ " & ChrW(305), ChrW(299)
    ReplaceAll "´ " & ChrW(305), ChrW(237)
    ReplaceAll ChrW(711) & " " & ChrW(305), ChrW(464)
    ReplaceAll "` " & ChrW(305), "ì"

    ' o
    ReplaceAll "¯ o", ChrW(333)
    ReplaceAll "´ o", "ó"
    ReplaceAll ChrW(711) & " o", ChrW(466)
    ReplaceAll "` o", "ò"

    ' u
    ReplaceAll "¯ u", ChrW(363)
    ReplaceAll "´ u", "ú"
    ReplaceAll ChrW(711) & " u", ChrW(468)
    ReplaceAll "` u", "ù"

    ' u-umlaut where it was extracted as a single character
    ReplaceAll "¯ ü", ChrW(470)
    ReplaceAll "´ ü", ChrW(472)
End Sub

' Runs one replace-all over the whole Word document.
Private Sub ReplaceAll(ByVal findText As String, ByVal replaceText As String)
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Text = findText
        .Replacement.Text = replaceText
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchByte = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Not the most elegant way to do this, but it solves the problem.
HerrPetersen Posted January 2, 2009 at 03:34 PM

Good job finding a way to solve the pinyin issue! :-) Unfortunately I am using OpenOffice right now, so I can't use it until maybe later.

I have three sets of PDFs from ChinesePod:
- one is pretty old (a couple of years)
- one is kind of old (a year or so)
- one is new (some weeks or months)

- In the first set, characters copy out perfectly.
- In the second set, characters do not highlight when selected, but when pasted everything is fine.
- In the third set, characters are just messed up.

So I guess your ChinesePod PDFs are from the second set. It looks to me as if ChinesePod is just trying to make it harder to copy their stuff. (And who would blame them? They are doing a great job and deserve some money.)
dieterlu Posted January 2, 2009 at 05:10 PM

I use the same macro on the whole range of PDFs, and so far it works. I download new lessons as soon as they are available and often use this to save bits and pieces.

I have seen pinyin PDFs from other sites that look the same, so it must have something to do with the encoding. I did not investigate whether that can be changed.

You should be able to write an easy macro for yourself in OpenOffice. All it does is search and replace.
jbradfor Posted January 3, 2009 at 06:25 AM

"It looks to me as if ChinesePod is just trying to make it harder to copy their stuff. (and who would blame them - they are doing a great job and deserve some money)"

Given that they also provide the .html versions, which allow full cut and paste, I think it's just incompetence on their part.
HerrPetersen Posted May 30, 2010 at 10:49 AM

Sorry for the necropost.

I finally got around to this problem again and tried to install the script dieterlu wrote. However, when creating a macro I get:

Runtime error (450) Wrong number of arguments

and when I start the debugger it jumps to:

Selection.Find.ClearFormatting

I do not have any experience with macros in Excel, so could someone please give an explanation for dummies?
hanyu_xuesheng Posted May 30, 2010 at 01:02 PM

Copy and paste of the text in CPod's PDFs doesn't work, but at the bottom of each page of the PDFs there are links to HTML versions.

You may also change the download link from:

http://s3.amazonaws.com/chinesepod.com/xxxx/yyyyyyyyyy/pdf/chinesepod_Zxxxx.pdf

to:

http://s3.amazonaws.com/chinesepod.com/xxxx/yyyyyyyyyy/pdf/chinesepod_Zxxxx.html

(The above links won't work.)

HTH
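For a batch of links, the rewrite described above can be done mechanically. The sketch below is an assumption-laden illustration in Python: the function name is made up, only the trailing extension is changed (as in the two placeholder links), and the `traditional` option follows jbradfor's earlier remark about adding "trad" before the extension, which has not been checked against real URLs.

```python
def html_url_from_pdf_url(pdf_url, traditional=False):
    """Swap a trailing '.pdf' for '.html', optionally adding 'trad' before it."""
    suffix = ".pdf"
    if not pdf_url.endswith(suffix):
        raise ValueError("expected a URL ending in .pdf")
    stem = pdf_url[: -len(suffix)]
    # Assumption: the traditional-character version is named '<stem>trad.html'.
    variant = "trad" if traditional else ""
    return stem + variant + ".html"


if __name__ == "__main__":
    placeholder = (
        "http://s3.amazonaws.com/chinesepod.com/xxxx/yyyyyyyyyy/pdf/"
        "chinesepod_Zxxxx.pdf"
    )
    print(html_url_from_pdf_url(placeholder))
```

Only the suffix is touched, so the "/pdf/" directory in the path stays as it is, matching the example links.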
HerrPetersen Posted May 31, 2010 at 08:26 AM

Copy/paste does work if you use the Sumatra PDF reader. (The haihai reader doesn't seem to do it any more.)

Also, a simpler version for Excel that repairs the messed-up pinyin goes like this:

' ZeichenTauschen = "swap characters"; Tauschliste = "swap list".
' Replaces, within the current selection, every entry in column A of the
' "Tauschliste" sheet with the entry next to it in column B.
Sub ZeichenTauschen()
    Dim Zelle As Range
    For Each Zelle In ThisWorkbook.Sheets("Tauschliste").Cells(1, 1).CurrentRegion.Columns(1).Cells
        Selection.Replace Zelle.Value, Zelle.Offset(0, 1).Value, lookat:=xlPart, MatchCase:=True
    Next
End Sub

You will have to create a worksheet called "Tauschliste" with the following content starting in A1 (column A is the garbled sequence, column B the replacement):

ˇ e    ě
` e    è
¯ ı    ī
´ ı    í
ˇ ı    ǐ
` ı    ì
¯ o    ō
´ o    ó
ˇ o    ǒ
` o    ò
¯ u    ū
´ u    ú
ˇ u    ǔ
` u    ù
¯ ü    ǖ
´ ü    ǘ
ˇ ü    ǚ
` ü    ǜ
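For anyone without Word or Excel, the same repair can be sketched in Python without a hand-written replacement list. This is not the method used in either macro above: instead of a lookup table it turns each spacing accent into the matching Unicode combining mark and lets `unicodedata.normalize` compose the final letter. The function name `repair_pinyin` is invented, and the assumption that the accent always comes before its vowel (with optional spaces in between) is taken only from the samples in this thread.

```python
import re
import unicodedata

# Spacing accent -> combining mark. The set of accents is an assumption
# based on the garbled samples quoted in this thread.
COMBINING = {
    "\u00af": "\u0304",  # macron, first tone
    "\u00b4": "\u0301",  # acute, second tone
    "\u02c7": "\u030c",  # caron, third tone
    "\u0060": "\u0300",  # grave, fourth tone
}

# An accent, optional spaces, then a lowercase vowel.
# \u0131 is the dotless i that the PDFs produce in place of "i".
PATTERN = re.compile("([\u00af\u00b4\u02c7\u0060]) *([aeiou\u00fc\u0131])")


def repair_pinyin(text):
    """Replace 'accent + vowel' sequences with precomposed pinyin letters."""

    def fix(match):
        accent, vowel = match.group(1), match.group(2)
        if vowel == "\u0131":
            vowel = "i"  # restore the dotted i so composition can succeed
        # NFC composes e.g. "o" + combining caron into a single letter.
        return unicodedata.normalize("NFC", vowel + COMBINING[accent])

    return PATTERN.sub(fix, text)


if __name__ == "__main__":
    print(repair_pinyin("n\u02c7 \u0131 y\u02c7 ou sh\u02c7 ouj\u00af \u0131 ma?"))
```

One limitation carries over from the table approach: a stray space before the accent (as in "t ` aitai" in the first post) is left in place, because the code cannot tell it apart from a real word boundary. The "¨ u ¯ u" style sequences from dieterlu's macro are also not handled here.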
mnspg Posted June 29, 2010 at 12:28 PM

Why go to all this trouble when you can just click the [text version] link at the bottom of each page and get an HTML page you can easily copy from?
PhilipLean Posted July 7, 2010 at 01:25 AM

I know this is a very old topic, but the problem still seems to exist: copying and pasting from ChinesePod PDFs does not work properly.

A very simple solution is to reprint the ChinesePod PDF to PDF using a PDF generation program. I use the free version of pdf995, and it works fine. The output looks good and I can copy and paste normally.
Farzad Malik (New Member) Posted July 13, 2018 at 12:41 PM

@PhilipLean How can I reprint a PDF with pdf995? Can you please explain?