OCRopus

OCRopus is free software for document analysis and OCR with a highly modular design. It is developed with the support of Google Inc. under the direction of Thomas Breuel at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern and is released under version 2.0 of the Apache License.

Description

OCRopus is an OCR system that combines document structure analysis, optical character recognition and statistical language models in a modular way. Because each part is a separate module, components can easily be replaced. Its initial target application is reading large volumes of text, especially the retro-digitization of books for Google Book Search, but it is also meant to be suitable for office and home use and for visually impaired users. The program is developed in C++ and Python, with Jam as the build system, under Ubuntu Linux.

Currently, Tesseract, originally developed by Hewlett-Packard, is the only recognition module available for OCRopus. In the future, other so-called "engines" are to be integrated (the code for this already exists and only needs to be built in), so that OCRopus can be used without Tesseract. It would then be possible, for example, to switch to a handwriting recognition engine when needed.
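The idea of interchangeable engines can be illustrated with a minimal Python sketch. It is not the OCRopus API: the names `RecognitionEngine`, `PrintedTextEngine`, `HandwritingEngine`, `ENGINES` and `recognize_page` are hypothetical, and the two engines are placeholders that return a dummy string instead of recognizing anything.

```python
from abc import ABC, abstractmethod


class RecognitionEngine(ABC):
    """Hypothetical interface for a text-line recognizer (not the OCRopus API)."""

    @abstractmethod
    def recognize_line(self, line_image: bytes) -> str:
        """Return the text recognized in one line image."""


class PrintedTextEngine(RecognitionEngine):
    # Placeholder for a printed-text engine, e.g. a wrapper around Tesseract.
    def recognize_line(self, line_image: bytes) -> str:
        return f"[printed:{len(line_image)} bytes]"


class HandwritingEngine(RecognitionEngine):
    # Placeholder for a handwriting engine.
    def recognize_line(self, line_image: bytes) -> str:
        return f"[handwriting:{len(line_image)} bytes]"


# Registry of available engines; adding an engine means adding an entry here.
ENGINES = {"printed": PrintedTextEngine, "handwriting": HandwritingEngine}


def recognize_page(line_images, engine_name="printed"):
    """Run the selected engine over every line image of a page."""
    engine = ENGINES[engine_name]()
    return [engine.recognize_line(image) for image in line_images]


if __name__ == "__main__":
    page = [b"\x00" * 10, b"\x00" * 20]  # made-up line images
    print(recognize_page(page, "printed"))
    print(recognize_page(page, "handwriting"))
```

Because the caller only depends on the shared interface, switching from printed text to handwriting is a matter of passing a different engine name.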

OCRopus already provides better analysis of the document structure than Tesseract alone. OCRopus does not yet have a language model system of its own and uses that of Tesseract; it is to be replaced by a system based on the OpenFST project once that project has reached its first official release.

History

In 2004, Google Inc. launched Google Book Search (then Google Print), which was intended to enable online search of conventionally published printed books. OCRopus was started to provide the retro-digitization this required.

It is based on two research projects: a powerful handwriting recognition system developed in the mid-1990s, which is also used by the U.S. Census Bureau, and newer methods for document structure analysis.

The project was announced in a press release on April 9, 2007, and the code was made available to developers via the Subversion version control system.

The first alpha version, 0.1, was released on October 22, 2007. Various preliminary versions appeared between December 2007 and October 2008, while the announced first stable release was postponed several times.

Use

In its current pre-release version, OCRopus is a pure command-line program; a graphical user interface is planned for the first stable (final) release. It is developed primarily for Linux, but should be able to run on many platforms as long as its dependencies are met. To use it, the input image is passed on the command line. For finer control, additional options can be passed to perform specific actions, such as recognizing a single line. The results are written to standard output (stdout) in hOCR, a format based on HTML and CSS with special markup.
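Since the output is HTML, it can be post-processed with ordinary HTML tooling. The following Python sketch uses only the standard library to pull text lines and their bounding boxes out of an hOCR document. It assumes the usual hOCR conventions, in which a text line is an element with class `ocr_line` whose `title` attribute carries a `bbox x0 y0 x1 y1` property. The `SAMPLE` document and the helper names `parse_bbox`, `HocrLineParser` and `extract_lines` are made up for illustration and are not part of OCRopus.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0  # nesting depth inside the current ocr_line element
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.lines.append((self._bbox, "".join(self._text).strip()))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def extract_lines(hocr):
    parser = HocrLineParser()
    parser.feed(hocr)
    return parser.lines


# Made-up hOCR fragment, for illustration only.
SAMPLE = """<html><body><div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' title='bbox 10 20 300 45'>Hello <b>world</b></span>
<span class='ocr_line' title='bbox 10 50 280 75'>second line</span>
</div></body></html>"""

if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE):
        print(bbox, text)
```

In practice the string passed to `extract_lines` would be the text captured from the program's standard output rather than the embedded sample.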
