Extracting Chinese hardsubs from a video

June 7, 2014 at 06:41 AM

I have heard that this is possible, but haven't been able to find anything that works. Have any of you succeeded at doing this?

June 7, 2014 at 05:16 PM

I've looked into it for many hours but with no success.

Apparently there's some software out there that can OCR softsubs (the kind stored as graphics, as on a DVD), but I never managed to get it working. It's very complex, all the documentation is in Chinese, and you need old versions of the software.

Some of the non-language-specific software can sort of OCR Chinese, but the characters are frequently split in half, and as far as I can see there's no way to save your character set, so you end up having to "teach" it thousands of characters that you can never reuse.

As for what you actually asked for: hardsubs are even harder. Recognizing subtitles with no background is already very hard; it's going to be even harder if there is background imagery.

The best solutions I have are to avoid the problem by looking for SRT files on sites like shooter.cn. The problem is they are not available for most native Chinese programs because Chinese people don't need them. Sometimes movies have them (especially for movies dubbed from English), but it's almost impossible to find TV shows unless someone has gone to the trouble of transcribing them and uploads them somewhere, and there's no central site for them.

It's a bit sad because almost all Chinese TV has subtitles, which means someone had to create them, but you can't find them anywhere.

If anyone has any success with soft or hard subs I'd love to hear about it.

June 7, 2014 at 10:08 PM

Softsubs really ought to be trivial as they're an independent stream that should be extracted without any issues.

Hard subs are basically impossible at this point. You have to OCR those, and I don't know of any OCR software that can do it reliably and automatically. Probably the best approach would be to take screenshots and then run them through Pleco or similar. I don't think that Google Translate is good enough to do the job at this point.

June 8, 2014 at 01:28 AM

I've read posts in some forums that make it seem quite possible, but upon further investigation I always fail. To compound matters further, I have a Mac, so software is limited.

I'm glad that Chinese shows usually come with Chinese subtitles, and I tend to watch them on Viki, which has English subtitles too. This is a good thing, but I want to go one step further and use a mouse-over dictionary. I'm spoiled that way. I want the transcripts for 3 reasons. 1) English subtitles are great to check my understanding, but I also like being able to instantly look up individual words' meanings, which often get lost in translation. 2) The audio of the video is usually sufficient to figure out correct pronunciation, but it's also nice to be able to instantly look up pinyin when it's unclear. 3) Because I don't have scripts for my shows, and I like to read with a mouse-over dictionary for the reasons mentioned, I have been reading additional material. Unfortunately it's pretty hard to find long, interesting transcripts right at my level (or i+1) that also have audio, and I have milked most of my sources almost dry.

If I can't figure out how to do this with software, I might hire a cheap Chinese typist to type them out. A friend is in the middle of doing this with a Russian show, and he just loves the results. Very expensive, but for a one-time deal, it might be a way to pay back the language learning community for all the help it has given me over the years.

Anyway, I'm still hopeful that someone will know a way. I know many of you can't see YouTube videos, but here is the first episode of the show I want to extract from:

盛夏晚晴天杨幂超清版01 HD第1集

June 10, 2014 at 05:31 AM

I tried this years ago, with limited success. There are a few software tools that attempt this. What they do is look for contiguous patches of white (or another color) surrounded by black (or another contrasting color). With average quality (non-HD) videos from TV, the anti-aliasing and the bleed-through of the video behind the text would result in about a 60% success rate at best. Correcting the numerous OCR errors made it not much better than simply typing in the subtitles by hand. If you just want bitmap images for each subtitle instead of OCR results, that's slightly less work but still requires a significant amount of cleanup.
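To make the "bright patch surrounded by a dark outline" idea concrete, here is a minimal Python sketch. It is not how any of the tools listed below are actually implemented; the function name `find_text_patches`, the grayscale list-of-lists input, and the thresholds (200, 60, 0.8) are all assumptions chosen for illustration.

```python
from collections import deque


def find_text_patches(gray, text_min=200, outline_max=60, min_outline_ratio=0.8):
    """Return a list of pixel sets, one per bright patch mostly bordered by dark pixels.

    gray is a 2D list of 0-255 brightness values (hypothetical input format).
    """
    height, width = len(gray), len(gray[0])
    seen = [[False] * width for _ in range(height)]
    patches = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or gray[y][x] < text_min:
                continue
            # Flood-fill one contiguous bright region (4-connected).
            patch, border = set(), []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                patch.add((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue
                    if gray[ny][nx] >= text_min:
                        if not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                    else:
                        # Remember the brightness of every non-text neighbour.
                        border.append(gray[ny][nx])
            # Keep the patch only if most of its neighbours look like outline.
            if border:
                dark = sum(1 for value in border if value <= outline_max)
                if dark / len(border) >= min_outline_ratio:
                    patches.append(patch)
    return patches


# Tiny made-up frame: one white pixel with a black outline, one without.
frame = [
    [120, 120, 120, 120, 120, 120, 120],
    [120,   0,   0,   0, 120, 255, 120],
    [120,   0, 255,   0, 120, 120, 120],
    [120,   0,   0,   0, 120, 120, 120],
]
print(find_text_patches(frame))
```

On real TV footage, anti-aliasing puts mid-gray pixels between the text and its outline, which is probably one reason fixed thresholds like these miss so much.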

I haven't kept up with newer versions of these tools, so I'll just mention the names and you can try them out. It sounds like you'll also need a way to save the FLV files from YouTube.

AVISubDetector - can only read AVI format directly, but can also read an AVISynth stream. The creator's website has vanished from the web. Its features are incredibly complex.

esrXP - haven't tried this one. Can't find the creator's original site.

SubRip - not too bad for beginners; you can quickly get started by dragging out the subtitle area and then clicking to pick the text and outline colors. It can detect where each title begins and ends to create a timing (SRT) file. It has built-in OCR, but you either need to do a lot of manual training for Chinese characters or else type in the text manually.

AVISynth - the tools above seem to be picky about what files they can read. AVISubDetector, for example, only works with AVI format. But AVISynth is a frame server that can serve different formats, and AVISubDetector and SubRip can read these streams directly without extra conversion.

sub2srs - I haven't used this myself. If you somehow obtained the timings of a video but not the OCR text, you might be able to use this program to get screenshots of the video frames containing the subtitles. Not as good as a text file of the transcribed text, but a lot less work.

Good luck!

June 10, 2014 at 04:10 PM

I just found a previous post on using AVISubDetector. However, I never had as much luck detecting subtitles as that poster apparently had. The problem I have is that TV shows often jump from one subtitle straight to the next, with no text-free gap between the two. AVISubDetector is good at detecting the presence or absence of a subtitle, but it can't tell when two adjacent frames carry different subtitles, especially if the two subtitles are the same length.

SubRip is more sensitive to text changes, but often too sensitive. However, once you learn the shortcuts, it's easy to tell it to merge a false positive into the previously detected title. It's easier to do it that way, than to manually insert a bunch of missed titles with the correct timings in a later step.
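The sensitivity trade-off described above can be sketched as a single threshold on how much a binarized subtitle strip changes between adjacent frames. This is only an illustration in Python, not the actual logic of AVISubDetector or SubRip; `changed_fraction`, `segment_starts`, the 0/1 mask format, and the threshold values are all hypothetical.

```python
def changed_fraction(mask_a, mask_b):
    """Fraction of pixels whose text/non-text flag differs between two frames.

    Each mask is a 2D list of 0/1 values covering the subtitle area only.
    """
    total = 0
    differing = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            total += 1
            differing += a != b
    return differing / total if total else 0.0


def segment_starts(masks, threshold=0.05):
    """Return the frame indices where a new subtitle appears to start."""
    starts = [0] if masks else []
    for i in range(1, len(masks)):
        if changed_fraction(masks[i - 1], masks[i]) > threshold:
            starts.append(i)
    return starts


# Two identical frames, then a different subtitle with the same amount of text.
first = [[1, 1, 0, 0]]
second = [[0, 1, 1, 0]]
print(segment_starts([first, first, second], threshold=0.05))  # cut detected
print(segment_starts([first, first, second], threshold=0.6))   # cut missed
```

A threshold that is too high merges back-to-back subtitles of similar length; one that is too low fires on background noise, which then has to be merged away by hand.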

June 11, 2014 at 06:22 AM

esrXP software

esrXP (Embed Subtitle Ripper) is a program that helps rip the subtitles embedded in a video.

https://sites.google.com/site/cphktool/esrxp

OCR software

CAJViewer or 尚书７号 is recommended; however, correction and proofreading are still required.

How to use EsrXP to extract embedded subtitles (hard subtitles) from __ files (original title: 使用EsrXP提取__文件中内嵌字幕（硬字幕）的方法)

http://www.360doc.com/content/12/0911/23/2793098_235625294.shtml

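Step-by-step guide: how to extract external subtitle files from RMVB __ (reposted from HDC) (original title: 手把手教_如何从RMVB__中提取出外挂字幕文件(轉自HDC))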

http://www.360doc.com/content/11/0426/11/6373468_112401863.shtml

June 11, 2014 at 04:15 PM

Thanks eslang. Have you ever succeeded doing this for an entire episode? I wonder how long it takes.

June 12, 2014 at 06:47 AM

For a roughly 45-minute episode of a family drama (e.g. 夫妻那些事) without intensive back-to-back dialogue or technical jargon (medical, military, historical, etc.), maybe around 3 hours.

Bear in mind that the time will depend on the quality of the hard subtitles, the color separation, your familiarity with the software, and your own editing and proofreading ability, since no OCR software converts with 100% accuracy. And typists can make mistakes and typos too.

If the drama is based on a novel and a copy of it is available online, it is easier, since some of the terms or words can be copied and pasted into the transcript for the subtitles.

June 13, 2014 at 05:46 AM

I was curious how long it would take to do the entire process and end up with a transcript, so I went ahead and did it using your YouTube episode. It took about 5 hours in total, with the majority of time spent doing corrections in the OCR. And this is with a good quality video, so this is probably the lower end of effort.

* 5 minutes: using Freecorder to download the video and convert it to mpg format.

* 1.5 hours: using esrXP to detect subtitles and convert them to framegrab images. It took longer than 1 hour because I spent time getting the detection filters to work reliably.

* 3 hours: using ABBYY FineReader 10 to OCR the bitmap files generated by esrXP and get Chinese text. The OCR was fast, but individually checking the characters it flagged as questionable was time-consuming. After pasting the text back into esrXP and matching it with the corresponding bitmap frames, I saved the subtitle file.

* 0.5 hours: using Aegisub to run through the subtitles and check for errors. I set the subtitle font to the same size as the hardsub and quickly went through each line to make sure they matched.

So there you go. It's possible but a painfully time-consuming process. I will write a blog post on how to do the steps in detail at some point, since I learned a few tricks along the way that would help in the future.

盛夏晚晴天杨幂超清版01 HD第1集 - YouTube.srt

June 13, 2014 at 10:39 AM

I tried typing it out when I came across this at the 2:11 mark.

Instead of c_redman's 是恋爱中爱神的庇护了 it had 是恋爱中爱神的疪护了

I was stumped. I wasn't sure if 疪 was a variant for 庇, so, I entered 庇 into the Dictionary of Chinese Character Variants put out by Taiwan's Ministry of Education and saw that it has no variants.

It wasn't until c_redman's post that I thought to look under 疪. Here they also say no variants for 疪. But 疪 itself is a variant for 痹.

So, definitely a typo. Unless there's a dictionary that says that it's commonly mistaken for...

In my Taishanese dialect, when our leg is numb or has fallen asleep we say GEUHK BEIH.

I entered 脚痹 and 腳痹 (the two common variants for foot/leg) into Google search just to see if they use it in Mandarin and HK Cantonese and got another variant for "numb/paralysis", 痺.

ArggggghhhhhhH!!!!!!!

Kobo, sitting in a corner pulling his hair out by the roots, mind going numb.

June 13, 2014 at 03:50 PM

Wow - 5 hrs. Not the quick fix I was hoping for. Thanks very much for the transcript though!

June 13, 2014 at 04:30 PM

Now that I think about it, probably the best thing to do would be to crowdsource it. If a group of folks chose the same video, I'm sure that collectively the extraction could be done in a few minutes.

OCR helps, but I've found that even quality packages like Pleco seem to have serious issues at times. Especially if there's a weird font or the character is underlined.

June 15, 2014 at 04:17 PM

第1集.docx

I hired a typist. What do you think of this format? I find it easier to read in some respects, but I'm used to a new line when the character changes.

June 16, 2014 at 01:01 AM

Out of curiosity, what was the total cost and where did you find the typist?

June 16, 2014 at 03:02 AM

I PM'd you on that, imron. I got lucky and it was cheaper than I expected, so I have budget for one more. I'm going to post a link to them here…they will be hosted on my friend's new site, all free, of course. We made a deal: he would get transcripts for a Russian series, and I would get transcripts for a Chinese series. But I also promised to match expenditures, so I need to choose another Chinese one, and maybe you guys can help me. Here are my requirements:

1) must have full English subtitles available on Viki

2) must be modern real life (no magic cell phones, kung-fu, palace dramas, exploding varmints, etc)

Any suggestions?

June 16, 2014 at 03:13 AM

Any suggestions?

Many. 《奋斗》 and 《我的青春谁做主》 might fit the bill for modern, although 'real life' is always going to be questionable. If comedy is allowed, maybe consider a series of 《爱情公寓》, and finally, if there is some leeway on modern times, you might also consider 《潜伏》 or 《黎明之前》, both of which are good shows.

Edit: Ah sorry, just realised your requirement about English subs. I don't know how the above shows match up with that.

June 16, 2014 at 11:52 AM

The TV mini-series “Le Jun Kai” (乐俊凯), based on a short story by Fei Wo Si Cun (匪我思存), is available on YouTube. It has CC (closed captions) in English, Russian and some other languages, which can be easily extracted using a Google tool.

匪我思存

http://baike.baidu.com/view/937918.htm

April 7, 2020 at 02:39 PM

Sorry to bring up a dead thread, but I have been struggling with the same issue as wulfgar for the past few weeks and I didn't want to hire a translator.

I tried the suggestions in this thread.

I couldn't get to esrXP as its website is banned in China, and I wasn't working with AVI.

I have had really good results from VideoSubFinder. It sometimes crashes, but it has better support for weird formats and codecs.

VideoSubFinder watches the video, finds the subtitles, screen-caps each one, names the screen grab with the subtitle's time codes (when it first appeared and when it disappeared), and cleans the image up a little bit.

When using VSF, drag the detection area down so it only sees the subtitles and as little of the background as possible.
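Because the time codes are in the file names, turning a screen grab into an SRT entry is mostly string handling. A minimal Python sketch follows, assuming names of the form `H_MM_SS_mmm__H_MM_SS_mmm_<suffix>.jpeg` (start time, double underscore, end time). That pattern is an assumption about VSF's output and should be checked against your own files. `srt_times` and `srt_entry` are hypothetical helpers, not taken from the scripts linked below, and the example file name is made up.

```python
import re

# Assumed name layout: start time, two underscores, end time, then anything.
NAME_RE = re.compile(
    r"^(\d+)_(\d{2})_(\d{2})_(\d{3})__(\d+)_(\d{2})_(\d{2})_(\d{3})"
)


def srt_times(filename):
    """Return (start, end) as SRT timestamps parsed from a screen-grab name."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
    start = f"{int(h1):02d}:{m1}:{s1},{ms1}"
    end = f"{int(h2):02d}:{m2}:{s2},{ms2}"
    return start, end


def srt_entry(index, filename, text):
    """Format one numbered SRT block for the given screen grab and OCR text."""
    start, end = srt_times(filename)
    return f"{index}\n{start} --> {end}\n{text}\n"


# Hypothetical file name and timing, for illustration only.
print(srt_entry(1, "0_02_11_040__0_02_13_520_00001.jpeg", "是恋爱中爱神的庇护了"))
```

Joining the entries with blank lines between them, in time order, gives a complete SRT file.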

Then I used tesseract to OCR the subtitle screen grabs.
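For reference, a Tesseract call for a single subtitle image might look like the Python sketch below. It assumes the `tesseract` command-line program is on the PATH with the `chi_sim` (Simplified Chinese) language data installed, and it uses page segmentation mode 7, which treats the image as a single line of text. The helper names are hypothetical, and this is not necessarily how the scripts linked at the end of this post call Tesseract.

```python
import subprocess


def tesseract_command(image_path, lang="chi_sim", psm=7):
    """Build the command line; 'stdout' as the output base prints the text."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout.strip()


# Example (needs Tesseract installed, so it is left commented out):
# print(ocr_image("0_02_11_040__0_02_13_520_00001.jpeg"))
```

Cropping each image tightly around the text before calling this should matter more than the exact flags, given that very large images were reported to hurt accuracy.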

I wrote my own program to clean the images up some more, increase OCR accuracy (I found tesseract struggled with massive pictures produced by VSF) and create the SRT file.

I even put in an option to show you the OCR confidence of each sentence; lower numbers are more likely to contain an error, so you can manually correct them if you wish.
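One way such a confidence check could work is sketched below in Python. It assumes you already have, for each subtitle line, the recognized text plus a list of per-word confidences on a 0 to 100 scale (the scale Tesseract reports). The function name `flag_for_review`, the averaging, and the cutoff of 80 are assumptions for illustration, not a description of the linked scripts.

```python
def flag_for_review(lines, min_confidence=80.0):
    """Return (srt_index, text, mean_confidence) for lines worth rechecking.

    lines is a list of (text, confidences) pairs in subtitle order.
    The result is sorted with the least trustworthy lines first.
    """
    flagged = []
    for index, (text, confidences) in enumerate(lines, start=1):
        # A line with no confidences at all is treated as fully untrusted.
        mean = sum(confidences) / len(confidences) if confidences else 0.0
        if mean < min_confidence:
            flagged.append((index, text, mean))
    return sorted(flagged, key=lambda item: item[2])


# Made-up OCR results for three subtitle lines.
ocr_lines = [
    ("你好", [96, 91]),
    ("疪护", [42, 88]),
    ("谢谢", [99]),
]
for index, text, mean in flag_for_review(ocr_lines):
    print(index, text, mean)
```

Reviewing only the flagged lines keeps manual correction proportional to the number of doubtful subtitles rather than to the length of the episode.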

I just used it on a 40-minute TV show and got about 90% accuracy from the OCR.

Time

I had already downloaded the show, so I'm not counting that.

VSF took about 10 minutes of actual run time, but it crashed once and had to be restarted. However, once it starts it can be left unattended (assuming it doesn't crash).

I manually looked through the images and deleted the ones that were obviously errors, for example white things in the background casting shadows with no text: 5 minutes.

My cleanup took about 10 seconds for 600 images.

Tesseract and building the SRT file took just over a minute (75 seconds).

Manual correction could take a long time for the perfectionist. However, I would certainly prefer to manually correct something that is 90% correct, and only when I actually run into an issue, rather than spend hours doing it by hand (especially with handwriting recognition for the characters I don't know) or pay someone else to do it.

So depending on the speed of your computer, this method should take about a quarter of your show's run time, provides pretty good accuracy, and is entirely open source.

Link to my scripts

https://github.com/hamsolo474/VideoSubFinder_ocr_path

Link to VSF

https://sourceforge.net/projects/videosubfinder/

February 18, 2021 at 10:16 PM

Hey hamsolo474,

That sounds awesome. Would you mind expanding the explanation on your GitHub of how to integrate Tesseract? I just downloaded Tesseract with videocr but got very poor results. 90% success sounds OK.

Posters, in thread order: wulfgar, tysond, hedwards, wulfgar, c_redman, c_redman, eslang, wulfgar, eslang, c_redman, Kobo-Daishi, wulfgar, hedwards, wulfgar, imron, wulfgar, imron, eslang, hamsolo474, Alqua