martindbp Posted May 17, 2022 at 01:13 PM

Hey everyone! I'm a software/ML/computer vision engineer by trade, and I've spent some time building an OCR subtitle extraction algorithm for videos and making the results accessible through a browser extension. My goal is to make pretty much any video of interest online available, whether on YouTube, Netflix or Chinese sites like Bilibili. As of now I'm working only with YouTube, though. You can download it here, and find a short installation/user guide here. For now it's what I would consider "beta" software: it's Chrome only and requires manual installation. The current list of processed shows can be found at browse.zimu.ai. The list is pretty short right now, but I'm processing new videos every day.

As you probably know, there are quite a few similar extensions for soft subs (which are also supported, naturally), but I'm trying out a slightly different concept/philosophy for the subtitles. The idea is to display the minimal yet sufficient information for a learner to understand the content in a reasonable time frame. From the start, the pinyin, hanzi and word translations are visible for all words. Gradually you can hide the information you already know, while new, unknown words stay visible by default, hopefully keeping you in flow. If you keep learning until all the subtitles are completely hidden, voilà, you're fluent! At least that's the idea. Naturally, everyone is free to use it however suits them best; I've tried to keep enough settings to make it flexible.

The extension comes with the standard Anki CSV file export. You can export the usual basic or cloze notes, but I've also added the ability to export the JSON of the whole containing sentence, along with dictionary info, so that you can build very advanced cards in Anki if you wish (example cards are provided in the guide).
That said, (deep) knowledge tracing has been a research interest of mine for quite a while, and I see big potential in minimizing the amount of time we spend in SRS by helping us encode memories more efficiently and by using inter-card dependencies to improve the scheduling. Therefore at some point I'll probably take a stab at an embedded SRS.

As for funding, I'm making this browser extension available for free. I'm putting as much functionality as I can client-side (in the browser) and optimizing for low cost, so that each additional user has a very low marginal cost. For full disclosure, my philosophy here is to reach as many people as possible with something useful, and to find other ways to support it financially rather than a subscription or locking important features behind a paywall. That might be Patreon donations, selling the OCR as a SaaS, or even VPN/affiliate ads on the browsing site (not in the extension).

So, are there any cool YouTube videos or channels with hard subs (or soft) you've been wanting to watch? Any and all feedback is warmly welcome! Hope you find it useful!
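
As a rough illustration of the Anki export described in the post above (cloze notes plus the JSON of the containing sentence), here is a minimal Python sketch. It is not the extension's code. The field layout (cloze text, translation, sentence JSON) and the per-word keys `hanzi` and `pinyin` are assumptions made for the example; only the `{{c1::...}}` cloze syntax comes from Anki itself.

```python
import csv
import io
import json


def cloze_row(words, target_index, translation):
    """Build one CSV row for a cloze note.

    `words` is a list of dicts with a hypothetical "hanzi" key (assumed layout).
    The word at `target_index` is wrapped in Anki's {{c1::...}} cloze marker.
    """
    parts = []
    for i, word in enumerate(words):
        hanzi = word["hanzi"]
        if i == target_index:
            parts.append("{{c1::" + hanzi + "}}")
        else:
            parts.append(hanzi)
    # The third field carries the whole sentence as JSON for advanced card templates.
    return ["".join(parts), translation, json.dumps(words, ensure_ascii=False)]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module handles quoting of the JSON field."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    sentence = [
        {"hanzi": "多", "pinyin": "duō"},
        {"hanzi": "好", "pinyin": "hǎo"},
        {"hanzi": "啊", "pinyin": "a"},
    ]
    print(rows_to_csv([cloze_row(sentence, 1, "How nice!")]))
```

Keeping the sentence JSON in its own field means a card template can decide later how much pinyin or translation to show, without re-exporting.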
wibr Posted May 18, 2022 at 11:39 AM

家有兒女 would be a good candidate; it should be available on YouTube. So you process the videos offline and provide the soft subs using the extension? What's the accuracy of the OCR engine? Personally I have my SRS setup in Pleco, so I would prefer just a list of words, similar to what I provide for some shows, based on soft subs.
martindbp (Author) Posted May 18, 2022 at 12:44 PM

On 5/18/2022 at 1:39 PM, wibr said:
So you process the videos offline and provide the soft subs using the extension? What's the accuracy of the OCR engine?

Yes, that's right, I process them offline at the moment. At one point I checked the accuracy on a few different videos that had soft captions as well as hard captions, and the result then was about 1 character error in every 200 to 1000 characters, depending on difficulty. The difficulty depends on a lot of things: resolution, text blending into the background without a clear border, fade-in/fade-out, rare fonts, etc. For example, I checked the show you suggested, and it's a bit on the challenging side due to the low resolution. Here's an excerpt of the first dialog from the first episode:

你等会儿我还没说完呢
你瞧咱俩结婚刚两个月
这俩孩子就好得跟亲兄弟似的
多好啊
好如果是再多个就更好了
什么意思啊你还想让我再生啊
不是我不是那意思
我是说啊干脆把小雪
从她爷爷家也接过来一块住
你想啊头羊也是赶
一头羊也是
三头羊也是轰

At the end you can see a duplicate line where there was high uncertainty. In cases like this, though, I can generate more specific training data to hopefully improve the model.

On 5/18/2022 at 1:39 PM, wibr said:
Personally I have my SRS setup in Pleco, so I would prefer just a list of words, similar to what I provide for some shows, based on soft-subs.

That sounds like a simple export function I could add for Pleco users. In the extension you can "star" words that you encounter and export them from the dashboard page. Would that work?

By the way, I realized from my post that I should probably have led with a screenshot of what to expect. The attached screenshot shows what it looks like for me. Personally I keep all pinyin and hanzi visible, as I'm not yet focusing on listening comprehension.
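
For readers wondering how a figure like "1 character error in 200 to 1000" could be measured, here is a minimal Python sketch of a character error rate (CER): the edit distance between the OCR output and a reference caption, divided by the reference length. This is an assumption about the kind of metric involved, not the author's actual evaluation code, and the function names are made up for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def char_error_rate(ref, hyp):
    """Edit distance normalized by reference length; ref is the soft caption."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical strings: one wrong character out of four.
    print(char_error_rate("你好世界", "你好世畀"))
```

On this scale, one error per 200 characters is a CER of 0.005, and one per 1000 characters is 0.001.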
shawky.nasr Posted May 20, 2022 at 05:46 AM

Great Martin
martindbp (Author) Posted May 24, 2022 at 10:11 AM

Here's a first-week update.

Added shows:
- 家有儿女 (Jiā yǒu érnǚ) Home with Kids (first 26 episodes). I retrained my segmentation model for improved OCR @wibr
- 爱情公寓 (Àiqíng Gōngyù) iPartment (first season)
- 人民的名义 (Rénmín de míngyì) In the Name of the People
- 摩天大楼 (Mótiān dàlóu) A Murderous Affair in Horizon Tower
- 三国演义 (Sānguó yǎnyì) Romance of the Three Kingdoms. This is the 2010 version, which I've heard is supposed to be good
- 开端 (Kāiduān) Reset

New extension release:
- Export starred words to Pleco format.
- Export subtitles to SRT format for those who use subs2srs or similar tools (found in the subtitle options menu, "Other" tab).
- New auto-pause feature based on limiting the Words Per Second (WPS). When enabled, it pauses only on subtitles that are above a set WPS threshold, and pauses for the remaining duration. I found this very useful when watching with my wife: it gives me a bit more time to process difficult subtitles without having to control things manually all the time.
- YouTube video thumbnails now have an added icon to signify that the video has processed subtitles (see attached image). This makes it a bit easier to tell which videos are supported and to navigate back to the ones you were watching before.

Let me know if there are any features that are blockers for using it effectively; it'll help me prioritize!
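
The WPS-based auto-pause from the update above can be sketched in Python as follows. This is one possible reading, not the extension's implementation: it assumes that "the remaining duration" means the extra time a subtitle would need for its effective rate to drop to the threshold. The function name and parameters are hypothetical.

```python
def pause_seconds(word_count, start, end, max_wps):
    """Return how long to pause after a subtitle, in seconds.

    Assumption: a subtitle faster than `max_wps` words per second gets a
    pause that stretches its total display time to word_count / max_wps.
    Subtitles at or below the threshold get no pause.
    """
    duration = end - start
    if duration <= 0 or max_wps <= 0:
        raise ValueError("duration and max_wps must be positive")
    wps = word_count / duration
    if wps <= max_wps:
        return 0.0
    return word_count / max_wps - duration


if __name__ == "__main__":
    # 6 words shown for 2 s is 3 WPS; with a threshold of 2 WPS the
    # subtitle needs 3 s in total, so the player pauses for 1 more second.
    print(pause_seconds(6, 10.0, 12.0, 2.0))
```

Under this reading, slow subtitles play through untouched and only dense ones trigger a pause, which matches the goal of not having to control playback manually.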