Tesseract (Software)

Tesseract is free software for text recognition (OCR). It is a pure character recognition program: it uses no statistical language models and has no graphical user interface, but it delivers very good results at the character level.

It is developed in the programming languages C and C++.

OCR data for many languages is already available in add-on modules. With the appropriate module, recognition of German Fraktur type is also possible to some extent.

History

The software was originally developed by Hewlett-Packard between 1985 and 1995. In a 1995 test at the University of Nevada, Las Vegas (UNLV), it emerged as one of the three most accurate candidates. After HP left the OCR market, development largely lay dormant until the code was handed over to the Information Science Research Institute of UNLV in 2005. There it was discovered that the former developer Ray Smith was by then working at Google. When asked whether it was interested in the code, Google took over the source code, brought it up to date, and released it that same year under the Apache License on SourceForge.

For the world of free software, this meant a great leap in quality in the area of text recognition. The project later migrated from SourceForge to Google's own developer platform, Google Code, where it continues to be developed under Google's supervision.

Since 2006, the program has been developed further as the basis of Google Books. Since version 3.0, released in September 2010, results can be output directly in the hOCR format, and a new module for page layout analysis was introduced.

Version 3.02, dated 28 October 2012, introduced, among other things, the recognition of Arabic and Hebrew texts in bidirectional mode.

Application

Tesseract is controlled from the command line, following the usual Unix conventions even on Windows, and is invoked in the following format:

    tesseract.exe imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

Tesseract reads the image in Tagged Image File Format (TIFF) and writes the text as a continuous stream to the output file. Because no layout analysis takes place (that is the subject of the ongoing OCRopus project), text columns must be split into separate image files. The more recent research project Leptonica also targets page layout analysis and additional image formats.
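The invocation format above can be wrapped in a small script. The following is a minimal Python sketch, not an official interface: the function names `build_tesseract_command` and `run_ocr` and the file names are made up for illustration, and it assumes that a `tesseract` executable is on the search path and writes its result to `<outputbase>.txt`.

```python
import subprocess
from typing import List, Optional, Sequence


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: Optional[str] = None,
    configs: Sequence[str] = (),
    executable: str = "tesseract",
) -> List[str]:
    """Assemble the argument list: imagename outputbase [-l lang] [configfile ...]."""
    command = [executable, image, output_base]
    if lang:
        # The language module is selected with "-l", e.g. "deu" for German.
        command += ["-l", lang]
    command += list(configs)
    return command


def run_ocr(image: str, output_base: str, lang: Optional[str] = None) -> str:
    """Run Tesseract and return the recognized text.

    Assumption: the tesseract binary is on PATH and writes its
    result to "<output_base>.txt".
    """
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the argument order.
    print(build_tesseract_command("page.tif", "page", lang="deu"))
```

Keeping the command construction separate from the actual call makes the argument order easy to inspect without Tesseract being installed.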

Automated processing can be implemented with ImageMagick, for example.
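One way such automation might look is sketched below in Python: each scan is first converted to TIFF with ImageMagick and then passed to Tesseract. The helper names `plan_batch` and `run_batch` and the directory layout are hypothetical. The sketch assumes the ImageMagick 6 command name `convert` (ImageMagick 7 prefers `magick`) and a `tesseract` executable on the search path.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def plan_batch(images: Iterable[Path], work_dir: Path, lang: str = "eng") -> List[List[str]]:
    """Return the commands that convert each image to TIFF and then OCR it."""
    commands: List[List[str]] = []
    for image in images:
        tiff = work_dir / (image.stem + ".tif")
        output_base = work_dir / image.stem
        # Assumption: ImageMagick 6 naming; use "magick" with ImageMagick 7.
        commands.append(["convert", str(image), str(tiff)])
        commands.append(["tesseract", str(tiff), str(output_base), "-l", lang])
    return commands


def run_batch(commands: List[List[str]]) -> None:
    """Execute the planned commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical input files; only the planned commands are printed here.
    for command in plan_batch([Path("scans/a.png"), Path("scans/b.jpg")], Path("out")):
        print(" ".join(command))
```

Planning the commands before running them allows a dry run, which is useful when many scans are processed unattended.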

Since version 3, Tesseract can store the recognition results in hOCR format by means of an undocumented parameter, so that the page layout is retained.
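hOCR is an HTML-based format in which each recognized element carries its bounding box in the `title` attribute. The Python sketch below shows how word positions might be read back from such a file using only the standard library. The sample markup is invented for illustration, and the sketch assumes that words are marked as `span` elements with the class `ocrx_word` and contain no nested `span`; the exact markup can differ between Tesseract versions.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, ...]


def parse_bbox(title: str) -> Optional[BBox]:
    """Extract "bbox x0 y0 x1 y1" from a semicolon-separated title attribute."""
    for field in title.split(";"):
        parts = field.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from span elements with class "ocrx_word"."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._bbox: Optional[BBox] = None  # set while inside a word element
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attributes.get("title") or "")
            self._chunks = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word span contains no nested span elements.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._chunks).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr: str) -> List[Tuple[str, BBox]]:
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Invented sample in the style of hOCR output, for illustration only.
SAMPLE = """
<span class='ocr_line' title='bbox 36 92 580 122'>
<span class='ocrx_word' title='bbox 36 92 96 116'>Hello</span>
<span class='ocrx_word' title='bbox 109 92 180 122; x_wconf 91'><strong>world</strong></span>
</span>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Because the coordinates survive alongside the text, tools further down the chain can place the recognized words back onto the page image.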

A number of programs integrate Tesseract as a backend. Tesseract can be used as the character recognition module in OCRopus, which additionally provides analysis of the document structure and statistical language models. Since version 0.4, however, OCRopus has used its own character recognition module based on neural networks by default; in earlier versions Tesseract was the default module. Tesseract can also be used, alongside other possible backends, for character recognition in the desktop OCR solution OCRFeeder. Together with hocr2pdf it is used, for example, in the Linux-based document management system Archivista to generate a text layer for raster images of scanned paper documents, making them machine-searchable.

Availability

Tesseract is distributed as free software in source code form under the terms of version 2.0 of the Apache License (Apache Software License, ASL). It can be installed directly from the standard repositories of virtually all major Linux distributions.

Among others, the following programs use Tesseract as the basis of their OCR:

  • FreeOCR for Windows is available as version 4.2.2 (October 2013).
  • TesseractOCR Mac makes it available for Mac OS X.
  • YAGF is one of several front-ends that can be used under Linux.
  • PDFScanner is a program for scanning documents on the Mac.
  • OCRextrACT provides Tesseract 3.0 as an online service. It processes PDF, PS, TIF, PNG, JPEG, BMP, and PBM/PGM/PPM files.