Results for even the most plain and recognizable cases (English-language screenshots) are absolutely terrible. This aligns with my other experiences with Tesseract. In fact, the only OCR I have ever had any success with is ABBYY.
Yeah I tried Tesseract with screenshots from the Spotify UI, but with terrible results. Makes you wonder where the competition is. Is anyone building an OCR based on DL?
If you upload an image file to Google Drive and right-click on it to open the file in Google Docs, you get very accurate OCR of the text in the image. I’ve used it a lot for both English and Japanese, and it has consistently worked very well.
“This building was erected around 1874 to provide a location for a seismometer of the British Association. A seismometer is an instrument which is designed to record earthquarkes and the one located in this building was only one of a series of such instruments located in the vicinity of Comnie to investigate the earthquakes which had been, and continue to be, prevalent in the area.”
The only mistake seems to be “Comnie” instead of “Comrie.” (The misspelling “earthquarkes” appears in the original.)
In contrast, the In-Browser OCR gives the following for the same four lines:
“m .me W m m: swm mm m mm .. lucauan m a smmmm. m m. anus»
Lssm'm'm» A swmmm ‘5 3,. WWW mm. vs damned .a mom “mum: m m:
an» army! m a... “mm was Only an. a: .1 gem a; 5m m<llumems mm m we mm
0. Camus w WNW: we earmquakes Wm»... mm W. m coulmu: m be Wax/Nam m m M”
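To put a number on a gap like this, one common yardstick is the character error rate: edit distance between the OCR output and the true text, divided by the length of the true text. Here is a minimal sketch in plain Python. The function names are my own, and the sample strings are just the "Comrie"/"Comnie" phrase from the Google Docs result above; this is not how any of these tools score themselves.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Phrase from the sign, and the same phrase as Google Docs returned it.
reference = "located in the vicinity of Comrie to investigate"
google_docs = "located in the vicinity of Comnie to investigate"

print(f"Google Docs CER: {character_error_rate(reference, google_docs):.3f}")
```

Feeding the in-browser output through the same function would give a rate close to 1, which matches how unreadable it is.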
I just tried that same image in the new real-time OCR built into the iOS 15 beta, where you can select the text directly inside of Safari and copy it out. On a first scan, the results look flawless aside from Comrie becoming “Come”.
Ok, I tried your benchmark in Tesseract 4.0.0-beta.1 in Ubuntu 18.04, and it gave me:
> EARTHQUAKE HOUSE
> ‘his building was erected around 1874 to provide a location for a seismometer of the British
> Association. A seismometer is an instrument which is designed to record earthquarkes and the
> ‘one located in this building was only one of a series of such instruments located in the vicinity
> ‘earthquakes which had been, and continue to be, prevalent in the area.
> of Comrie to investigate
Clearly, Tesseract 4.0 has problems following the baseline of the text. But otherwise, it is much better than the output from the website and even got the title correct. Which makes me think they use an older version (?)
This project is badly in need of an update. It's using an ancient, four-year-old version of tesseract.js. The current version:
https://github.com/naptha/tesseract.js
is based on Tesseract v4.1.1, a newer release than your Ubuntu 18.04's.
The 4.0 version added a new neural network system based on LSTMs, with major accuracy gains.
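If you want to make sure a local test is really hitting the LSTM engine rather than the legacy one, here is a rough sketch of calling the `tesseract` command line from Python. It assumes a Tesseract 4.x binary on your PATH with the English data installed. The file name `earthquake_house.png` is a placeholder, and page segmentation mode 6 (a single uniform block of text) is only my guess at a reasonable setting for a sign like this.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", oem=1, psm=6):
    """Build the argument list for the tesseract CLI.

    --oem 1 selects the LSTM-only engine added in Tesseract 4.
    --psm 6 treats the image as a single uniform block of text.
    "stdout" as the output base prints the text instead of writing a file.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def run_tesseract(image_path, **options):
    """Run tesseract and return the recognised text (needs the binary installed)."""
    result = subprocess.run(
        build_tesseract_command(image_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Show the command that would be run for the (hypothetical) test image.
print(" ".join(build_tesseract_command("earthquake_house.png")))
```

Calling `run_tesseract("earthquake_house.png", oem=0)` instead should show what the legacy engine does with the same image, provided your traineddata files include the legacy model.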
Ok, I tried the demo page of project naptha on exactly the same image, and it gave me:
> ARTHQUAKE HOUSE 1
> Ths builing was erecied Ground 1874 to provide a location for a seismomater of the Bitish Aasociation. A sefémomatar is an nsirument which is designed 10 racord earthquarkes and the ne ocate n his buiding was only one of a seies of such instruments located in the vicinity of Coms tonvestgete the earthauakes which had been, and continue to be, prevalet n the area.
My impression is that the open source tesseract could be a component of an OCR system, but much more has to be done in preprocessing, registration, and segmentation to be able to use it for what we could consider “OCR”. Quite disappointing that OCR is still something that is proprietary, closed source, and expensive in 2021.
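As one concrete example of what "preprocessing" can mean here: binarizing the image before recognition, so the engine sees clean black-on-white text. Below is a minimal pure-Python sketch of Otsu's method for picking the threshold. This is a technique I am swapping in for illustration, not something any of the tools mentioned is confirmed to do, and the pixel values are made up. In practice you would use an image library rather than plain lists.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates the pixels into
    a dark class (<= threshold) and a bright class (> threshold) by
    maximising the between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        if weight_low == 0:
            continue  # no dark pixels yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if value > threshold else 0 for value in pixels]


# Made-up greyscale values: dark text pixels and a bright background.
sample = [20, 22, 25, 30] * 10 + [200, 210, 220, 230] * 10
threshold = otsu_threshold(sample)
clean = binarize(sample, threshold)
print(threshold, clean.count(0), clean.count(255))
```

Thresholding is only one piece. Deskewing, cropping to the text region, and upscaling small screenshots are the other usual suspects.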
ABBYY is still very good, but for some languages other tools are now better, or at least much cheaper or free: Google Cloud OCR, Amazon Textract, Azure OCR, OCR.space
Project Naptha [https://projectnaptha.com/] delivers a much more impressive result, but isn’t open source. Naptha uses text detection and extraction to isolate text from the image, which greatly improves accuracy.
What is the top-of-the-line, publicly available (open source) OCR engine right now? Still Tesseract? I'm not interested in proprietary cloud solutions.