Chinese-Forums

Adobe OCR Woes - Alternatives? a.k.a. What are you all using for your Chinese OCR needs?


陳德聰


Lately, I have been looking into ways to make my work easier and it has been quite frustrating to discover that the paid version of Adobe Acrobat (which thankfully I use for other things than processing my Chinese language documents) is painfully incompetent at running OCR on Chinese text. Often, what I get looks like a casualty of wrong encoding, but in fact it’s Adobe’s honest attempt at recognising the text (!?!?).

 

Am I just extremely behind the times? Is there an alternative magical moderately priced program that does not involve uploading my clients’ sensitive documents to the internet in order to produce a workable text version of the contents of my PDFs? Does Pleco have some kind of PDF-to-Word converter I didn’t know about? Does Chinese Text Analyser do exactly this work? Translators using CAT tools, have you found that they have been able to handle such a task? (SDL Trados seems to have terrible support for Chinese language docs as far as I can tell.)

 

I have stopped ripping my hair out, but I don’t know that I can realistically continue to avoid figuring out how to do this and survive in this market. This man-child needs help please.


Who would know better than Chinese users which OCR program is best for Chinese? I tried searching in Chinese, and the consensus seems to be ABBYY FineReader. There is also a cloud-based version at finereaderonline.com; if you register with your email you get 10 page credits, so you can check how accurate it is before purchasing.

 

It also seems that the open-source Tesseract library works well enough with Chinese, if you know how to program.


12 hours ago, 陳德聰 said:

Is there an alternative magical moderately priced program that does not involve uploading my clients’ sensitive documents to the internet in order to produce a workable text version of the contents of my PDFs?

 

If your OS is Windows 10 or Windows 7 Ultimate (the multilingual version) and you have Word (Microsoft Office) 2013 or later, you already have that magic tool. Don't open the PDF; instead, right-click the file, choose 'Open with', and select 'Word' from the list of programs (or select 'Other' and search for Word).

Let Word do its job (you may have to click 'allow editing', and click 'pdf files' if Word gives you a list of possible conversions to choose from).

 

If you're using an earlier version of Windows 10, you may need to download the Chinese language packs from Optional Windows Features and include OCR and 'rare characters' in the options. Ideally, use the latest version of Windows 10 (v. 1803, soon-ish to be updated to v. 1809) with Office 2016 or Office 365, which are much easier and faster for this kind of job.

 

Can't say I have tested this extensively or in depth, but I have been using this method for years and have never noticed any problem with either simplified or traditional characters.

 

For some PDF files where Adobe Reader can't figure out the font, I use Kingsoft (金山) PDF (there's also a free full Kingsoft Office suite that includes PDF-to-Word conversion). These may have vulnerabilities, so make sure your devices are fully patched (I haven't experienced any problem with the PDF reader).

http://www.kingsoftstore.com/index.

 

(Edited to add links to Kingsoft)

 

 


It very much depends on what you're working with. Something that was originally a Word document, has been saved as a PDF, and for some arcane reason the Word document is no longer available - not too bad. Scans of handwritten medical notes in PDF format - good luck.

 

Thankfully it's not something I have to deal with very often. A couple of times in the past I have simply paid someone to type the damn things out for me. 

 

I also suspect OCR is a bit like voice recognition now - for the best results, you need to be accessing cloud servers. 


I hear good things about ABBYY FineReader, although none of the people who say the good things use it for Chinese.

 

The last time I was asked to translate a scanned Chinese periodical article, I promised the client a 10% discount (off a rate inflated by the same amount) if they could get it to me as flat text. They tried but couldn't. I used half that amount to hire a Chinese student to type the article up for me and fed that flat text to the CAT.

 

I use MemoQ. I haven't tried feeding it PDFs, but for regular documents it works alright for Chinese, as far as I have used it. (Most of my translating from Chinese is literary, and I don't use the CAT for that.)


14 hours ago, fabiothebest said:

It also seems that the open-source Tesseract library works well enough with Chinese, if you know how to program.

 

I use Tesseract often to OCR English PDFs. I happen to use a front-end for it called OCRmyPDF that has a command line interface, but I'm sure you can find a more user-friendly open-source implementation. Programming skills shouldn't be necessary.

 

I was curious to see how well it does with Chinese text, so I fed it a page from a low-quality scan of a textbook. The results weren't perfect, but they were pretty good for a free piece of software!

 

[Image: screenshot of the input PDF page]

 

OCRmyPDF command:


ocrmypdf --sidecar -l eng+chi_sim testpage.pdf testpage_ocr.pdf
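For anyone who would rather run this from a script, here is a minimal Python sketch that builds the same kind of command and runs it with `subprocess`. It assumes `ocrmypdf` is installed and on the PATH along with Tesseract's `chi_sim` language data. The helper names `build_ocrmypdf_command` and `run_ocr` are made up for illustration, and unlike the command above it passes an explicit filename to `--sidecar`. Everything runs on the local machine, so nothing is uploaded.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(input_pdf, output_pdf,
                           languages=("eng", "chi_sim"), sidecar=None):
    """Return the argument list for one ocrmypdf run (hypothetical helper)."""
    # Tesseract language codes are joined with "+", as in "eng+chi_sim".
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if sidecar is not None:
        # --sidecar writes the recognised text to a plain-text file.
        cmd += ["--sidecar", str(sidecar)]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def run_ocr(input_pdf, output_pdf, sidecar):
    """Run ocrmypdf locally and return the text from the sidecar file."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    cmd = build_ocrmypdf_command(input_pdf, output_pdf, sidecar=sidecar)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(sidecar).read_text(encoding="utf-8")
```

Calling `run_ocr("testpage.pdf", "testpage_ocr.pdf", "testpage.txt")` would produce the searchable PDF and return the recognised text; for traditional characters the language tuple would presumably be `("eng", "chi_tra")`.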

 

Output Text File:


参考 译文

老师 : 那 是 谁 的 衬衫 ?

老师 : 戴 夫 , 这 是 你 的 衬衫 吗 ?
戴 夫 : 不 , 先 生 。 这 不 是 我 的 衬衫 。

戴 夫 : 这 是 我 的 衬衫 。 我 的 衬衫 是 蓝 色 的 。

老师 : 这 件 衬衫 是 蒂 姆 的 四?
戴天 : 也 许 是 , 先 生 。 蒂 姆 的 衬衫 是 白色 的 。

老师 : 蒂 姆
蒂 姆 : 什么 事 , 先 生 ?

老师 : 这 是 你 的 衬衫 吗 ?
ay}: 是 的 ,先生 。

老师 : 给 你 。 接 着 !
蒂 姆 : 谢谢 您 , 先 生 。
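The output above has spaces between most of the characters, which gets in the way of searching or feeding the text to a CAT tool. A rough way to clean that up, sketched here in Python with only the standard library, is to delete any run of spaces that sits between two Chinese characters or punctuation marks. The function name `strip_cjk_spaces` and the exact character ranges are my own choices, not anything Tesseract or OCRmyPDF provides, and this will not fix misrecognised characters such as 四 for 吗.

```python
import re

# CJK Unified Ideographs, CJK punctuation, and full-width forms.
_CJK = r"\u4e00-\u9fff\u3000-\u303f\uff00-\uffef"
# ASCII punctuation that the OCR output mixes in with the Chinese text.
_PUNCT = r":;,.!?"
# A run of spaces or tabs with a CJK character or punctuation on both sides.
_GAP = re.compile(rf"(?<=[{_CJK}{_PUNCT}])[ \t]+(?=[{_CJK}{_PUNCT}])")


def strip_cjk_spaces(text):
    """Remove spaces between Chinese characters; leave Latin text alone."""
    return _GAP.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("老师 : 那 是 谁 的 衬衫 ?"))
```

Spaces between Latin words are left alone because neither neighbour is a Chinese character or one of the listed punctuation marks. A space between an ASCII full stop and a following Chinese character would also be removed, which is probably fine for text that is mostly Chinese.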


 


Yeah, I didn't search, but I thought there might be a frontend for it. Anyway, I'm good enough with Python (there is a Python implementation; otherwise it's in C++), I just thought not everyone is. The CLI option seems easy to use. I just checked, and there are many GUIs available, as well as versions for Android and iOS.


  • 10 months later...

One alternative is OneNote (2016, for example). Unedited results for the text above:


参 考 译 文
老 师 · 那 是 谁 的 衬 衫 ?
戴 夫 , 这 是 你 的 衬 衫 吗 ?
老 师
戴 夫
不 , 先 生 。 这 不 是 我 的 衬 衫 。
戴 夫
这 是 我 的 衬 衫 , 我 的 衬 衫 是 蓝 色 的
老 帅
这 件 衬 衫 是 蒂 姆 的 吗 ?
戴 人
也 许 是 , 先 生 。 蒂 姆 的 衬 衫 是 白 色 的 。
老 师
蒂 姆 !
蒂 姆 , 什 么 事 , 先 生 ?
老 师
这 是 你 的 衬 衫 吗 ?

是 的 , 先 生 “
老 帅
给 你 。 接 着 !
蒂 的
谢 谢 您 , 先 生 。
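To put a rough number on how two OCR results differ, one option is to compare them character by character after stripping whitespace. The Python sketch below uses `difflib` from the standard library; the function name `char_similarity` is hypothetical. Note that this only measures agreement between two outputs. Without a hand-checked transcription as the reference, it says where the programs disagree, not which one is right.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring all whitespace."""
    a_clean = "".join(a.split())
    b_clean = "".join(b.split())
    return SequenceMatcher(None, a_clean, b_clean).ratio()


if __name__ == "__main__":
    # The same sentence as recognised by the two programs in this thread
    # (speaker label left out); they differ in a single character.
    tesseract_line = "这 件 衬衫 是 蒂 姆 的 四?"
    onenote_line = "这 件 衬 衫 是 蒂 姆 的 吗 ?"
    print(round(char_similarity(tesseract_line, onenote_line), 2))
```

Running the same comparison against a manually typed reference page would give a crude accuracy figure for each program.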

