What app to use - advice for ocr-ing Chinese language pdf files

October 27, 2013 at 12:54 PM

Hello,

I have a number of PDF files (scanned images) in my calibre library, in Chinese and other languages. I'd now like to run OCR on them, in batch mode if possible, and I'm wondering which program to use.

Requirements:

- free or affordable for private use

- should run on Win7 or Ubuntu 12.04 (command line would be OK)

Further,

- obviously it would be cool if the application could recognise whether the language is Chinese, English or German, but I'm afraid that doesn't exist yet.

Btw, has anybody else wondered why PDFs that are OCR-ed with, say, Simplified Chinese can't encode the parts written in the Western alphabet? It would be great to fix this somehow and have a PDF file that just knows all Unicode characters.

Thanks in advance!
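A minimal sketch of what a free command-line batch run could look like on Ubuntu. It assumes two tools that nobody in this thread has named: `pdftoppm` (from poppler-utils) to render PDF pages to images, and Tesseract with the `chi_sim`, `eng` and `deu` language packs installed. Tesseract accepts several languages joined with `+`, which is one possible answer to the mixed Chinese/Western text question, though how well it handles such pages is not something this sketch can promise. The function and folder names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf: Path, image_prefix: Path, dpi: int = 300) -> list[str]:
    """Command that renders every page of a PDF to PNG images."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(image_prefix)]


def tesseract_command(image: Path, out_base: Path, langs: list[str]) -> list[str]:
    """Command that OCRs one image into a searchable PDF.

    Several languages are joined with "+", e.g. chi_sim+eng+deu.
    """
    return ["tesseract", str(image), str(out_base), "-l", "+".join(langs), "pdf"]


def ocr_folder(folder: Path, langs: list[str]) -> None:
    """OCR every PDF in a folder, page by page (assumes both tools are installed)."""
    for pdf in sorted(folder.glob("*.pdf")):
        work = folder / (pdf.stem + "_pages")  # hypothetical scratch folder
        work.mkdir(exist_ok=True)
        subprocess.run(pdftoppm_command(pdf, work / "page"), check=True)
        for image in sorted(work.glob("page*.png")):
            # Writes e.g. page-1.pdf next to page-1.png
            subprocess.run(
                tesseract_command(image, image.with_suffix(""), langs), check=True
            )


# Example call (folder name is made up):
# ocr_folder(Path("scans"), ["chi_sim", "eng", "deu"])
```

This produces one searchable PDF per page; merging the pages back into a single file is left out of the sketch.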

October 27, 2013 at 02:45 PM

Have you tried searching? This topic has been discussed numerous times, including as recently as May of this year. Check here and here for starters; they'll give you some ideas of where to start looking.

On another note, I have personally never found anything that even comes close to matching all your requirements, especially the batch processing and free for personal use. Usually free doesn't do batch.

Also, all of the OCR programs I've used with Chinese have too high an error rate to be worth my time. But that is just me.

October 28, 2013 at 03:41 PM

Thank you, I had obviously used the wrong search terms. However, OCR is a valuable technology to make a library full-text-searchable and definitely worth my time.

October 28, 2013 at 03:50 PM

Don't get me wrong, I love OCR. However, the current implementations (for Chinese) that I have used have caused me to spend more time correcting errors than it would have taken me to simply type up the whole document. I did that with multiple contracts and official business letters. I tried OCR and it took me over an hour on one page; I typed it by hand and was done in 20 minutes per page.

October 28, 2013 at 07:04 PM

I am currently getting paid to correct the characters in a scanned book. I'd guess that about one in five to one in four is wrong, some of the mistakes being understandable because the characters are crazy weird, some of them really stupid (it consistently doesn't recognise 平 because in the book, the dots are upside down). I seem to remember that Brendan O'Kane regularly Twitter-rants about the shortcomings of the various programs, but I don't know much about OCR and such, so it might be about something else.
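An error estimate like the one above can be put on firmer ground by comparing a corrected page against the raw OCR output. The following is a rough sketch, assuming both versions of the page are available as plain text; it computes the character error rate as Levenshtein edit distance divided by the length of the corrected text. The sample strings are invented for illustration and do not come from the book mentioned.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Share of reference characters that the OCR output got wrong."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Invented example: both 平 were misread as 干, 2 errors in 10 characters.
reference = "天下太平，平安无事。"
ocr_text = "天下太干，干安无事。"
print(character_error_rate(ocr_text, reference))  # 0.2, i.e. one in five
```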

Depending on whether you have more time or more money, your options appear to be:

- type the things you want digitally;

- pay a Chinese person to type your things;

- scan your things, then pay someone to correct them.

November 3, 2013 at 07:20 PM

We are talking about somewhat different use cases here. I agree with both of you about the shortcomings of OCR. However, many of us will probably collect more or less extensive archives of PDF files over the years: lecture materials, articles, whatever. Surely it isn't necessary to have all of them digitized to gold standard, but on the other hand, why not employ the technology to make the directories searchable? As the pattern recognition algorithms and computing capacities evolve, you can always re-run the recognition later.
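To illustrate why imperfect OCR can still be good enough for search, here is a small sketch that indexes OCR text by overlapping character pairs (bigrams) and accepts a document when at least half of the query's bigrams occur in it. It assumes the OCR text of each file is already available as a string; the file names, sample texts and the 0.5 threshold are made up for the example.

```python
from collections import defaultdict


def bigrams(text: str) -> set[str]:
    """Overlapping character pairs, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each bigram to the names of the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for gram in bigrams(text):
            index[gram].add(name)
    return index


def search(index: dict[str, set[str]], query: str, min_share: float = 0.5) -> list[str]:
    """Documents containing at least min_share of the query's bigrams."""
    grams = bigrams(query)
    if not grams:
        return []
    hits: dict[str, int] = defaultdict(int)
    for gram in grams:
        for name in index.get(gram, ()):
            hits[name] += 1
    return sorted(n for n, count in hits.items() if count / len(grams) >= min_share)


# Invented OCR output: 研 was misread as 砑 in the first document.
documents = {
    "lecture1.pdf": "现代汉语语法砑究",
    "article2.pdf": "机器学习导论",
}
index = build_index(documents)
print(search(index, "汉语语法研究"))  # ['lecture1.pdf'] despite the OCR error
```

Three of the five query bigrams survive the misread character, so the document is still found. Re-running OCR later with a better engine only requires rebuilding the index.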


c4oyu4n

muyongshi

c4oyu4n

muyongshi

Lu

c4oyu4n
