Flickserve Posted November 9, 2016 at 12:05 AM What are the solutions for converting book text into machine text? I have a couple of books with Chinese and English sentences. Basically, they are mass sentences which I want to convert into Anki notes (I already have the audio). Since these are mass sentences, the only easy (time-efficient) way I can think of is to pay somebody to type them out. I live near a university in Hong Kong and so could get a student to do it. Each page has about 12 sentences, and there are nearly 700 pages in total.
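Once the sentences exist as text, getting them into Anki is the mechanical part. Below is a minimal sketch, assuming the sentences have already been split into (Chinese, English) pairs and that the audio files follow a numbered naming scheme such as sentence_0001.mp3 (a hypothetical pattern; the thread does not say how the audio is named). It produces tab-separated text, which Anki can import, and uses Anki's [sound:...] tag for the audio field. The function names are illustrative only.

```python
import csv
import io


def build_anki_rows(pairs, audio_pattern="sentence_{:04d}.mp3"):
    """Turn (chinese, english) pairs into rows of three Anki fields.

    audio_pattern is a hypothetical naming scheme; adjust it to the real files.
    """
    rows = []
    for index, (chinese, english) in enumerate(pairs, start=1):
        # [sound:...] is the tag Anki uses to attach a media file to a field.
        audio_field = "[sound:" + audio_pattern.format(index) + "]"
        rows.append([chinese.strip(), english.strip(), audio_field])
    return rows


def rows_to_tsv(rows):
    """Serialise rows as tab-separated text, one note per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Two made-up sentence pairs standing in for the book's content.
sample_pairs = [
    ("你好吗？", "How are you?"),
    ("我很好。", "I am fine."),
]
sample_tsv = rows_to_tsv(build_anki_rows(sample_pairs))
print(sample_tsv)
```

Saving the returned text to a UTF-8 file gives something that can be imported as notes with three fields. The audio files themselves would still need to be copied into Anki's media folder separately.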
wibr Posted November 9, 2016 at 06:06 AM The solution is either using OCR software like FineReader or, as you said, paying someone to type it. The OCR result will definitely need some corrections; how many depends on the software, the quality of the scans, the font, etc. OCR software packages usually have a free trial, so you can see whether that could work or not.
Zeppa Posted November 9, 2016 at 08:24 AM OCR works well with books, but you would do best to cut out all the pages first so you can scan them flat. However, as wibr says, you will need to read through the result in case some characters have been wrongly identified. You can create PDF files and import them directly into ABBYY FineReader or an alternative OCR program.
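Part of the read-through can be narrowed down with a rough automatic check. The sketch below assumes the OCR output has already been split into (Chinese, English) pairs; it flags pairs where the scripts look mixed up, which is one common symptom of misrecognition. The function name and the rules are assumptions for illustration, the character range covers only the basic CJK block, and a wrong but valid Chinese character will not be caught, so this does not replace proofreading.

```python
import re

# Basic CJK Unified Ideographs block only (an assumption; rare characters fall outside it).
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def needs_review(chinese, english):
    """Return a list of reasons why this sentence pair looks suspicious."""
    reasons = []
    if LATIN.search(chinese):
        # Could be legitimate (e.g. an abbreviation), so this is only a flag.
        reasons.append("Latin letters in Chinese sentence")
    if CJK.search(english):
        reasons.append("Chinese characters in English sentence")
    if not CJK.search(chinese):
        reasons.append("no Chinese characters found")
    return reasons


# Made-up examples: the second and third pairs contain deliberate OCR-style errors.
pairs = [
    ("你好吗？", "How are you?"),
    ("我很l好。", "I am fine."),
    ("他是学生。", "He is a 学生."),
]

flagged = []
for number, (chinese, english) in enumerate(pairs, start=1):
    reasons = needs_review(chinese, english)
    if reasons:
        flagged.append((number, reasons))

for number, reasons in flagged:
    print(number, "; ".join(reasons))
```

With roughly 12 sentences on each of nearly 700 pages, even a crude filter like this can help decide which pages to check against the book first.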
Flickserve Posted November 9, 2016 at 11:05 AM Thanks. I just bought the book, and now I have to sacrifice it?...
querido Posted November 9, 2016 at 11:50 AM I have scanned whole books, intact, with Pleco OCR. Edit: I should say there's pretty impressive tolerance for curvature.
889 Posted November 9, 2016 at 01:22 PM Scanning tends to be pretty slow, especially with 700 pages. Try snapping photos instead; a real camera of course is better than your phone. You may have to convert the .jpgs into .tiffs and do some trial-and-error adjustments. There's a photocopy filter on Photoshop that's good for this, and you can also automate the conversion on Photoshop.
querido Posted November 9, 2016 at 01:41 PM Yes, I photographed all of the pages first. I used my phone, and Pleco could easily access those.
Zeppa Posted November 10, 2016 at 08:58 PM Yes, if you've got lots of time on your hands you can do that. If you photocopy you can use a multi-sheet feeder and get one file. Pleco is great, of course. I haven't tried it with a 700-page book though.
querido Posted November 10, 2016 at 09:55 PM If I were learning a lot of new words as I go - which has been the case - there would be no hurry.
Flickserve Posted November 10, 2016 at 11:20 PM I am just reinstalling my software. Haven't got round to Photoshop yet. Lightroom can do a mass export to TIFF in a straightforward manner, but it has no photocopy filter. What does the photocopy filter actually do? I tried searching on the internet but found not much detail. Wouldn't the original jpg/raw converted to TIFF hold better quality?
wibr Posted November 11, 2016 at 04:23 PM Why would you convert jpg to tiff? For OCR jpg is perfectly fine as long as the text is large enough.
889 Posted November 11, 2016 at 04:32 PM The photocopy filter on Photoshop does a very good job of producing a Xerox-like page from a jpg. Doing a direct conversion from a jpg to a tiff often doesn't work well in practice because uneven lighting produces splotches on the page. Just try it. (Of course, the better and more evenly lit your photos, the better the results, but it's hard to keep the quality up when shooting 700 pages.) The clearer and more distinct the text from the background, the better the results. It may "work" if you just feed in an unadjusted jpg but you'll probably have more errors to correct than if you use images that have been adjusted to resemble sharp black and white photocopies. There are always going to be errors to correct no matter what method you use, but with 700 pages it's important to keep that error rate as low as possible. In any event, you need to make some trial runs and see what works well for you with your camera and software.
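To see why uneven lighting produces splotches, it helps to compare a single global black/white cutoff with a cutoff computed from each pixel's neighbourhood. The sketch below is only an illustration on a tiny synthetic "page" and is not a description of what Photoshop's photocopy filter does internally; the image size, brightness values, radius and offset are all made-up assumptions.

```python
def global_threshold(image, cutoff=128):
    """Black (0) below one fixed cutoff, white (255) otherwise."""
    return [[0 if pixel < cutoff else 255 for pixel in row] for row in image]


def local_threshold(image, radius=2, offset=20):
    """Black only where a pixel is clearly darker than its neighbourhood mean."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Neighbourhood window, clipped at the image borders.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            out_row.append(0 if image[y][x] < mean - offset else 255)
        result.append(out_row)
    return result


# Synthetic page: bright on the left, dim on the right (uneven lighting).
WIDTH, HEIGHT = 10, 5
page = [[220 - 15 * x for x in range(WIDTH)] for _ in range(HEIGHT)]

# Two "ink" pixels, each 70 levels darker than the paper around it.
ink_positions = [(2, 1), (2, 8)]
for y, x in ink_positions:
    page[y][x] -= 70

global_result = global_threshold(page)
local_result = local_threshold(page)

for label, result in (("global", global_result), ("local", local_result)):
    print(label)
    for row in result:
        print("".join("#" if pixel == 0 else "." for pixel in row))
```

In this toy case the fixed cutoff turns the dim right-hand columns solid black and loses the ink pixel on the bright side, while the neighbourhood-based version keeps only the two ink pixels. Real pages are far messier, so trial runs with actual photos are still needed.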
Flickserve Posted November 12, 2016 at 02:10 AM Ahh, so uneven lighting makes a difference. Thanks for that tip! Will try to even it out before starting. One issue is that the pages are not wide and the book is thick, being 700 pages, so the pages are a little difficult to keep flat.
Flickserve Posted December 18, 2016 at 10:24 AM I have put this project on hold for the moment. The book in question is from Taiwan and contains 8000 Chinese-English sentences. However, my learning is orientated towards mainland Chinese-style Putonghua at this initial stage, since I don't yet have enough structure and vocabulary to discern the differences. The majority of first-language Putonghua speakers that I come into contact with are from the mainland. Apparently, there are similar books on the mainland. I will visit Guangzhou early next month, so I will try to pick up such a textbook. Supposedly, the expressions contained in such books are very natural to mainland speakers. I am in no rush. There are plenty of other aspects of language learning to work on.
geraldc Posted December 19, 2016 at 11:19 AM I'd say enter them yourself and treat it as a pinyin typing/listening exercise. Imagine how good you'll be after entering 700 pages.
Flickserve Posted December 19, 2016 at 05:25 PM Even the great TysonD would pay others... :-) Edit: I am not too interested in typing out pinyin, and I do it anyway when looking up words in Pleco or typing in WeChat. Besides, training pinyin is not my primary objective in converting the book into text form.