trisha2766 Posted December 16, 2010 at 08:12 PM
I'm sure this information is somewhere on this site, but I did a search and couldn't find anything. Maybe I used the wrong search terms, or maybe it has something to do with my ability to concentrate with a 2 year old around the house. At any rate, at one time I heard here that there was a way to extract the subtitles from DVDs, but it sounded difficult and I thought I could manage without it. At this point I know I need the extra help for developing my listening skills, and however much work it might be, I think it would be worth it. Someone here may even have told me how to do it at one point. So how do you get the subtitles off of DVDs? And can they be put in a format where I could copy and paste new characters into an online dictionary? Thanks
renzhe Posted December 16, 2010 at 09:04 PM
"And can they be put in a format where I could copy and paste new characters into an online dictionary?"
AFAIK, no. The subtitles on DVDs are basically pictures. It makes it much easier to support all languages in all players.
Gleaves Posted December 16, 2010 at 09:17 PM
I think that's right. I don't think there are programs smart enough to "read" DVD subtitle images in Chinese and produce plain text. You might want to have a look on shooter.cn to see if you can find corresponding subtitle files.
trisha2766 (Author) Posted December 16, 2010 at 09:42 PM
Well, even if I can't copy and paste, it would still be useful - better than pausing the DVD player every few seconds.
jbradfor Posted December 16, 2010 at 10:05 PM
As mentioned, the subtitles on DVDs are stored as pictures. HOWEVER, there are subtitle formats used for other purposes, e.g. rips of DVDs to avi files, that are text. ".srt" is the most common. You can do a web search and see if any are available for your DVD. If not, there are plenty of programs that will at least put the subtitles into graphics files for you. You might want to take a look at SupRip, which I think can. From this post, you might want to look at esrXP and subOCR.
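For reference, an .srt file is plain text: a cue number, a timing line such as 00:00:01,000 --> 00:00:03,500, one or more lines of subtitle text, then a blank line. Below is a minimal Python sketch of reading one so that the characters can be copied out. The sample subtitles and the function name parse_srt are made up for illustration, and real files may need extra care with encodings.

```python
import re

# Hypothetical sample; a real file would come from a subtitle site or a ripping tool.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
你好

2
00:00:04,000 --> 00:00:06,000
我们开始吧
今天学习颜色
"""

# Timing line: start and end as hours:minutes:seconds,milliseconds.
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text):
    """Return a list of (start, end, text) tuples, one per cue."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue  # not a complete cue
        match = TIMING.match(lines[1].strip())
        if not match:
            continue  # skip anything that does not look like a cue
        cues.append((match.group(1), match.group(2), "\n".join(lines[2:])))
    return cues


if __name__ == "__main__":
    for start, end, body in parse_srt(SAMPLE_SRT):
        print(start, end, body.replace("\n", " "))
```

Once the text is in this form, each cue's characters can be pasted straight into a dictionary.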
fanglu Posted December 16, 2010 at 10:08 PM
For English subtitles there are programs that will extract the subtitles from a DVD and attempt to turn them into text through OCR. No idea if there is anything similar for Chinese. I doubt it, since the resolution of DVD subtitles is pretty low, making it hard for a program to recognise the characters. It would be much easier to just download the subtitles as others mentioned.
c_redman Posted December 17, 2010 at 02:14 PM
I think SubRip will work, but the OCR process is quite time-consuming and requires a lot of manual corrections. Ironically, you will learn the characters through the process of creating the text file, and won't really need the file once it's done. As others have mentioned, shooter.cn or another subtitle site is preferable, if you can find the subtitle file there.
trisha2766 (Author) Posted December 17, 2010 at 04:11 PM
With SupRip I was able to get a bunch of .bmp files. I guess that will have to do; its OCR didn't seem to recognize Chinese. Chinese was listed as a language option somewhere, but maybe they were thinking only of pinyin. I don't think I will be able to find the subtitles on the internet anywhere - I'm working on Elmo's World DVDs.
jbradfor Posted December 17, 2010 at 04:49 PM
You might want to look at esrXP and subOCR -- there are reports that they do a better job with OCR. And I wouldn't give up the fight without at least a Google search!
Viktor (New Members) Posted January 8, 2011 at 10:52 PM
I looked around for OCR software to process the sequence of BMPs generated by SubRip. I successfully tested Abbyy Finereader 10. It is expensive, but the 15-day trial version does the job (for 15 days anyway). Here is what I did:
1. Generate BMPs with SubRip using the "Custom Colors and Contrast" option, and choose black characters on a white background (I set the "border" areas to white as well). I set a minimum width of 360 pixels (and a minimum height of 50 pixels) for the output bitmaps.
(1.1) In my case SubRip had a problem generating a correct sequence of BMPs from the original VOB file on the DVD (ca. 5% of the subtitle BMPs were either blank or a grainy mess). I solved this problem by generating an IDX file out of the VOB file using VSRip. I then used that IDX file with SubRip.
2. Insert all the BMPs into a Word document (e.g. simply select the whole directory in the insert dialogue in Word). The result should show all bitmaps stacked vertically, i.e. one bitmap per "line".
3. Save that document as a PDF file.
4. Open that PDF in Abbyy Finereader. Then comes the time-consuming part of resolving the ambiguities that Abbyy Finereader has marked (i.e. manually selecting, from a list of suggestions, the right match for each character that the program wasn't able to recognize with certainty). It took me around 2 hours of work for one hour of movie. I haven't found a way to "teach" the software to learn from my corrections. For example, there may be a very common character in the text which Abbyy Finereader repeatedly fails to read. I then had to resolve each instance separately, rather than tell the software once and have it apply the correction to all instances. If anyone knows how to "teach" Abbyy Finereader, please let me know.
5. The 15-day trial version of Abbyy Finereader doesn't allow you to save the resulting document, so instead I simply copied and pasted each page separately into a new Word document. It is a good idea to insert a page break into the Word document after each page copied from the Finereader output.
6. You will notice that Finereader does not preserve the line breaks of the original document, instead putting a single space between each line of subtitles. That is not a problem: once you have copied and pasted the complete, clean OCR output into Word, you can simply use Find and Replace to replace " " with "^p" (Word's code for a paragraph mark), and you are good. This works because the Chinese text itself normally contains no spaces.
7. If you inserted a page break in Word after each copied and pasted page from Finereader, then it is easy to error-check the line breaking in Word: each page should have exactly the same number of rows (provided that each of the original BMPs had the same height).
8. The result is a text that contains all subtitles as plain text -- one row per original subtitle BMP. You can go on to convert it into an SRT file (e.g. take the time stamps from the corresponding English subtitle files, which SubRip can OCR without additional software).
In this way, I obtained Chinese subtitles of the first episode of 痞子英雄 (Black & White). It is easy to create Pinyin subtitles from that file, too. Overall I am very surprised and satisfied by the OCR performance of Abbyy Finereader. This procedure is nevertheless time-consuming. It could be sped up significantly if I found a way to "teach" Finereader. If anyone can tell me how to do that, I would highly appreciate the help.
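Step 8 could be scripted roughly as follows. This is only a sketch, and it rests on a big assumption: that the English track has exactly one cue per Chinese subtitle line, in the same order, which will not always be true. The function names extract_timings and build_srt and the sample data are hypothetical, not part of SubRip or Finereader.

```python
import re

# A timing line looks like: 00:00:01,000 --> 00:00:03,500
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def extract_timings(english_srt):
    """Collect the timing line of every cue in an existing SRT text."""
    return [line.strip() for line in english_srt.splitlines()
            if TIMING.fullmatch(line.strip())]


def build_srt(timings, subtitle_lines):
    """Pair each timing with one OCR'd subtitle line, in order."""
    lines = [line.strip() for line in subtitle_lines if line.strip()]
    # Assumption: one text row per cue. Stop rather than misalign silently.
    if len(timings) != len(lines):
        raise ValueError(
            f"{len(timings)} timings but {len(lines)} subtitle lines")
    cues = []
    for number, (timing, text) in enumerate(zip(timings, lines), start=1):
        cues.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    # Made-up sample data standing in for the English SRT and the OCR output.
    english = (
        "1\n00:00:01,000 --> 00:00:02,000\nHello\n\n"
        "2\n00:00:03,000 --> 00:00:04,500\nGoodbye\n"
    )
    chinese_rows = ["你好", "再见"]
    print(build_srt(extract_timings(english), chinese_rows))
```

The length check plays the same role as the row count in step 7: if the numbers differ, a subtitle was lost or split somewhere and the text needs fixing before the timings are attached.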