HOCR

hOCR is an open standard that defines a data format for representing text recognition (OCR) results. In addition to the recognized text, the format can capture its layout, recognition confidence, formatting, and other information. The format is based on XHTML (or HTML). Metadata is stored in HTML meta tags, following the Dublin Core convention for embedding metadata in HTML.
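To make the idea of embedding recognition results in HTML concrete, here is a minimal sketch that reads word text, bounding boxes, and confidence values from a small hOCR-style fragment using only the Python standard library. The fragment is hand-written for illustration and is not output from any real engine. The class names (`ocr_page`, `ocr_line`, `ocrx_word`), the `bbox` and `x_wconf` properties in the `title` attribute, and the `ocr-system` meta name are not given in the text above; they are assumed from the hOCR specification and should be checked against it. `parse_title` and `WordCollector` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, hand-written for illustration only.
SAMPLE = """<html>
<head>
<meta name="ocr-system" content="example-engine 1.0" />
</head>
<body>
<div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocr_line" title="bbox 10 20 300 50">
<span class="ocrx_word" title="bbox 10 20 120 50; x_wconf 93">Hello</span>
<span class="ocrx_word" title="bbox 130 20 300 50; x_wconf 87">world</span>
</span>
</div>
</body>
</html>"""


def parse_title(title):
    """Split a title attribute such as 'bbox 1 2 3 4; x_wconf 90'
    into a dict mapping each property name to its list of values."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class WordCollector(HTMLParser):
    """Collects meta tags and word-level elements (assumed class: ocrx_word)."""

    def __init__(self):
        super().__init__()
        self.meta = {}       # meta name -> content
        self.words = []      # one dict per recognized word
        self._current = None  # word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "span" and "ocrx_word" in classes:
            props = parse_title(attrs.get("title") or "")
            conf = props.get("x_wconf")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(conf[0]) if conf else None,
            }

    def handle_data(self, data):
        # Only text inside a word element is kept.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Word spans are assumed to contain no nested spans.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


collector = WordCollector()
collector.feed(SAMPLE)
for word in collector.words:
    print(word["text"], word["bbox"], word["conf"])
```

The sketch keeps only word-level elements. Line and page elements carry their own `title` properties and could be collected the same way if layout information is needed.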

Software

The format was introduced in Google's OCRopus. Apart from OCRopus, the format can also be generated directly by CuneiForm, by HOCR, a text recognition program specializing in Hebrew script, and, since version 3.0, by Tesseract.

The hocr-tools are a suite of tools for processing (merging, splitting, inserting metadata) and analyzing hOCR data. hocr2pdf is a command-line tool that generates searchable image PDF files from hOCR data. The Firefox extension moz-hocr-edit allows recognition results in hOCR format to be corrected.
