What app to use - advice for OCR-ing Chinese-language PDF files


c4oyu4n


Hello,

 

I have a number of PDF files (scanned images) in my calibre library, in Chinese and other languages. I'd now like to OCR them, in batch mode if possible, and I'm wondering which program to use.

 

Requirements:

- free or affordable for private use

- should run on Win7 or Ubuntu 12.04 (command line would be OK)

 

Further,

- it would obviously be cool if the application could recognise whether the language is Chinese, English or German, but I'm afraid that doesn't exist yet.
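A rough sketch of one way to approximate that last wish, assuming you already have some text from a first OCR pass: count how many letters are Chinese characters and how many are Latin letters. The function name `guess_script` and the 80%/20% thresholds are made up for illustration. It only tells Chinese from Western script; telling English from German would need a word list or a language-identification library on top.

```python
import unicodedata


def guess_script(text: str) -> str:
    """Return 'chinese', 'latin', 'mixed' or 'unknown' from character counts.

    Hypothetical helper: only the basic CJK Unified Ideographs block
    (U+4E00..U+9FFF) is counted as Chinese, and the thresholds are arbitrary.
    """
    cjk = 0
    latin = 0
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            cjk += 1
        elif ch.isalpha() and "LATIN" in unicodedata.name(ch, ""):
            # Covers plain a-z as well as letters such as ä, ö, ü and ß.
            latin += 1
    total = cjk + latin
    if total == 0:
        return "unknown"  # digits, punctuation or empty input only
    share = cjk / total
    if share >= 0.8:
        return "chinese"
    if share <= 0.2:
        return "latin"
    return "mixed"
```

The result could then be used to pick which language model to hand to the OCR program on a second pass.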

 

Btw, has anybody else wondered why PDFs that are OCR-ed as, say, Simplified Chinese can't encode the parts that are written in the Western alphabet? It would be great to fix this somehow and have a PDF file that just knows all Unicode characters.
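For what it's worth, here is a sketch of how mixed Chinese and Western text might be handled with the open-source Tesseract engine. Tesseract is not named anywhere in this thread, so treat the whole thing as an assumption. Its `-l` option accepts several language models joined with `+`, and the `pdf` config asks for a searchable PDF, which needs a Tesseract version that supports PDF output. Tesseract reads images rather than PDFs, so the scanned pages are first rendered with `pdftoppm` from poppler-utils. The helper names and the 300 dpi default are made up for illustration; the functions only build the command lines and do not run anything.

```python
def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Build the command that renders each PDF page to a PNG image.

    Assumes poppler-utils is installed; 300 dpi is an arbitrary choice.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, out_base, langs=("chi_sim", "eng"), make_pdf=True):
    """Build the command that OCRs one page image with several languages at once.

    Assumes the chi_sim and eng language data are installed. Add "deu" to
    langs for German.
    """
    cmd = ["tesseract", str(image_path), str(out_base), "-l", "+".join(langs)]
    if make_pdf:
        # The "pdf" config makes Tesseract write out_base.pdf with a text layer.
        cmd.append("pdf")
    return cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Whether the recognition quality is good enough for Chinese is a separate question.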

 

Thanks in advance!


Have you tried searching? This topic has been discussed numerous times including as recently as May of this year. Check here and here for starters and they'll give you some ideas of where to start looking.

 

On another note, I have personally never found anything that even comes close to matching all your requirements, especially batch processing combined with being free for personal use. Free tools usually don't do batch.

 

Also, every OCR program I've used with Chinese has had too high an error rate to be worth my time. But that is just me.


Don't get me wrong, I love OCR. However, the current implementations (for Chinese) that I have used have caused me to spend more time correcting errors than it would have taken me to simply type up the whole document. I did that with multiple contracts and official business letters: with OCR, a single page took me over an hour; typing by hand, I was done in 20 minutes per page.


I am currently getting paid to correct the characters in a scanned book. I'd guess that about one in five to one in four is wrong. Some of the mistakes are understandable because the characters are crazy weird, some of them are really stupid (it consistently doesn't recognise 平 because in the book, the dots are upside down). I seem to remember that Brendan O'Kane regularly rants on Twitter about the shortcomings of the various programs, but I don't know much about OCR and such, so it might be about something else.

 

Depending on whether you have more time or more money, it appears your options are:

- type the things you want digitally;

- pay a Chinese person to type your things;

- scan your things, then pay someone to correct them.


We are talking about somewhat different use cases here. I agree with both of you about the shortcomings of OCR. However, many of us will probably collect more or less extensive archives of PDF files over the years: lecture materials, articles, whatever... Surely it isn't necessary to have all of them digitized to a gold standard, but on the other hand, why not use the technology to make the directories searchable? As the pattern recognition algorithms and computing capacities evolve, you can always re-run the recognition later.
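To make the "searchable archive" idea a bit more concrete, here is a minimal sketch. It assumes that whatever OCR program you end up using writes its plain-text output to a sidecar file next to each PDF (`article.pdf` gets `article.txt`); that layout and both function names are my own invention, not something any tool in this thread prescribes. One function lists the PDFs that still need OCR, the other does a plain substring search over the sidecar text.

```python
from pathlib import Path


def pending_pdfs(root):
    """Return PDFs under root that have no sidecar .txt file yet.

    Matches the lowercase ".pdf" extension only on case-sensitive file systems.
    """
    return sorted(
        p for p in Path(root).rglob("*.pdf")
        if not p.with_suffix(".txt").exists()
    )


def search_sidecars(root, query):
    """Return the PDF paths whose sidecar text contains query.

    Plain substring match, so an imperfect OCR result still finds a
    document as long as the search term itself was recognised correctly.
    """
    hits = []
    for txt in sorted(Path(root).rglob("*.txt")):
        text = txt.read_text(encoding="utf-8", errors="replace")
        if query in text:
            hits.append(txt.with_suffix(".pdf"))
    return hits
```

Re-running the recognition later with a better engine then just means deleting the old sidecar files, so that `pending_pdfs` picks those PDFs up again.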
