
> it's not possible to unambiguously segment the text into distinct letters (which is a necessary first step in any OCR engine that I'm aware of)

In my experience, the ability to handle overlapping letters (very common in typewritten text and professionally typeset material) is one of the key things that separates the relatively lightweight OCR engines (like Ocrad and GOCR) from the big, complicated ones (Tesseract, Cuneiform, Abbyy, etc.). Segmenting characters by the whitespace between them cannot be taken for granted if you want to do any useful OCR of "historical" material.
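To make the point concrete, here is a toy sketch of whitespace segmentation, where glyphs are split wherever a pixel column contains no ink. It is not how any of the engines named above work internally; the function name `segment_by_blank_columns` and the tiny bitmaps are made up for illustration.

```python
def segment_by_blank_columns(bitmap):
    """Split a binary bitmap (list of equal-length strings, '#' = ink)
    into (start, end) column ranges separated by fully blank columns."""
    width = len(bitmap[0])
    # A column "has ink" if any row is set at that x position.
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]

    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a glyph begins here
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends the glyph
            start = None
    if start is not None:
        segments.append((start, width))    # glyph runs to the right edge
    return segments


# Two vertical strokes with a blank column between them.
spaced = [
    "#.#",
    "#.#",
    "#.#",
]

# Two parallel diagonal strokes (think of a tightly kerned "AV").
# The strokes never touch, but their column ranges overlap,
# so no column is completely blank.
kerned = [
    "#..#...",
    ".#..#..",
    "..#..#.",
    "...#..#",
]

print(segment_by_blank_columns(spaced))  # [(0, 1), (2, 3)]
print(segment_by_blank_columns(kerned))  # [(0, 7)]
```

The second bitmap comes back as a single segment even though it holds two separate strokes, which is the failure case for kerned or typewritten text: there is simply no blank column to cut on.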


