
Yes. Despite the unusual font, Google rendered it perfectly.


Ok, I tried your benchmark with Tesseract 4.0.0-beta.1 on Ubuntu 18.04, and it gave me:

> EARTHQUAKE HOUSE

> ‘his building was erected around 1874 to provide a location for a seismometer of the British Association. A seismometer is an instrument which is designed to record earthquarkes and the ‘one located in this building was only one of a series of such instruments located in the vicinity ‘earthquakes which had been, and continue to be, prevalent in the area.

> of Comrie to investigate

Clearly, Tesseract 4.0 has trouble following the baseline of the text: the fragment "of Comrie to investigate" ended up at the end instead of inline. Otherwise, the result is much better than the output from the website, and it even got the title right, which makes me think the site uses an older version (?)

My command line:

    tesseract 1587043_dcd093c4.jpg output -l eng
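For anyone reproducing this, here is a rough sketch that prints the installed version and pins the engine and page segmentation mode explicitly. It assumes Tesseract 4.x is on the PATH and the image is in the current directory. `--oem 1` selects the LSTM engine and `--psm 6` treats the image as a single uniform block of text; the choice of `--psm 6` is my own guess and I have not checked whether it fixes the line-ordering problem above.

```shell
#!/bin/sh
# Sketch only: the flag values below are assumptions, not the settings used in the thread.
img="1587043_dcd093c4.jpg"   # image name from the command above
oem=1                        # 1 = LSTM engine only (Tesseract 4.x)
psm=6                        # 6 = assume a single uniform block of text (a guess)

# Show the command so the flags can be inspected before anything runs.
echo "tesseract $img output -l eng --oem $oem --psm $psm"

# Run only if tesseract is installed and the image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$img" ]; then
    # Knowing the exact version matters when comparing results across machines.
    tesseract --version 2>&1 | head -n 1
    # Writes the recognized text to output.txt.
    tesseract "$img" output -l eng --oem "$oem" --psm "$psm"
fi
```

Leaving out `--oem` and `--psm` falls back to the defaults, which is what the plain command above does.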


This project is badly in need of an update. It's using an ancient, four-year-old version of tesseract.js. The current version (https://github.com/naptha/tesseract.js) is based on Tesseract v4.1.1, which is newer than the one in your Ubuntu 18.04.

Version 4.0 added a new neural network engine based on LSTMs, with major accuracy gains.

https://fossies.org/diffs/tesseract/4.0.0_vs_4.1.0/ChangeLog...


Ok, I tried the demo page of project naptha on exactly the same image, and it gave me:

> ARTHQUAKE HOUSE 1

> Ths builing was erecied Ground 1874 to provide a location for a seismomater of the Bitish Aasociation. A sefémomatar is an nsirument which is designed 10 racord earthquarkes and the ne ocate n his buiding was only one of a seies of such instruments located in the vicinity of Coms tonvestgete the earthauakes which had been, and continue to be, prevalet n the area.

The 4.0.0 version looked better to me ...



