Optical character recognition

Optical character recognition (OCR; German: optische Zeichenerkennung, rarely abbreviated OZE) is a term from information technology that refers to the automated recognition of text within images.

Basics

Text recognition is necessary because optical input devices (scanners, digital cameras, and also fax receivers) can deliver only raster graphics, that is, points of different colors (pixels) arranged in rows and columns. OCR denotes the task of recognizing the depicted letters as such, that is, identifying them and assigning each one the numerical value it has in a common text encoding (ASCII, Unicode). In German-speaking countries, "automatic text recognition" and "OCR" are often used interchangeably. In technical terms, however, OCR refers only to the pattern-matching step, in which separated image parts are compared as candidates for individual characters. This step is preceded by a global pattern recognition stage, which first distinguishes text blocks from graphic elements, then recognizes the line structure, and finally separates the individual characters. When deciding which character is present, further algorithms can take the linguistic context into account.

Originally, specially designed fonts were developed for automatic text recognition and used, for example, for printing check forms. These fonts were designed so that an OCR reader could distinguish the individual characters quickly and without much computational effort. The OCR-A font (DIN 66008, ISO 1073-1) is characterized by characters that are particularly dissimilar to one another, especially the digits. OCR-B (ISO 1073-2) more closely resembles a sans-serif, non-proportional font, while OCR-H (DIN 66225) was modeled on handwritten digits and uppercase letters.

The increased performance of modern computers and improved algorithms now also allow the recognition of "normal" printer fonts and even handwriting (for example, in mail sorting). Where readability by humans is not a priority, however, bar codes are used instead, since they are easier to print and to recognize.

Modern text recognition now includes more than just OCR in the narrow sense, that is, the translation of individual characters. In addition, methods of context analysis, known as Intelligent Character Recognition (ICR), are used to correct the actual OCR results. Thus, a character that was originally identified as "8" can be corrected to a "B" when it occurs within a word. Instead of "8aum", the word "Baum" (German for "tree") is recognized. In industrial text recognition, such systems are therefore referred to as OCR/ICR systems. However, the boundaries of the term OCR are fluid, since OCR and ICR also serve as marketing terms for promoting technical developments. Intelligent Word Recognition (IWR) also falls into this category. This approach attempts to solve the problem of recognizing cursive handwriting, in which the individual characters cannot be clearly separated and therefore cannot be recognized by conventional OCR methods.
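
The "8aum" correction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the confusion table and the three-word dictionary are hypothetical stand-ins, and real ICR systems rely on much larger word lists and statistical models.

```python
# Hypothetical confusion table: digits that OCR may mistake for letters.
CONFUSIONS = {"8": "B", "0": "O", "1": "l", "5": "S"}

# Tiny stand-in dictionary; a real system would use a full word list.
DICTIONARY = {"Baum", "Boot", "Sonne"}


def correct_word(word, dictionary=DICTIONARY, confusions=CONFUSIONS):
    """Return a dictionary word reachable by swapping confusable characters."""
    if word in dictionary or word.isdigit():
        # Known words and pure numbers are left untouched.
        return word
    candidate = "".join(confusions.get(ch, ch) for ch in word)
    # Accept the substitution only if it yields a known word.
    return candidate if candidate in dictionary else word


print(correct_word("8aum"))  # Baum
print(correct_word("8000"))  # 8000
```

The check for pure numbers reflects the condition in the text: the "8" is reinterpreted only when it appears inside a word.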

A fundamentally different approach to text recognition is used for handwriting recognition on touch screens or input fields (PDAs, etc.). Here, vector-based patterns are processed, either "offline" as an entire word or "online" with additional analysis of the input flow (for example, Apple's Inkwell).

A special form of text recognition arises, for example, in the automatic processing of incoming mail at large companies. One task is sorting the documents. For this purpose, the content does not always need to be analyzed; sometimes it is enough to recognize coarse features, such as the characteristic layout of forms or company logos. Like OCR, this classification of document types relies on pattern recognition, but it refers globally to the entire sheet or to defined areas rather than to individual letters.

Method

The starting point is an image file (raster graphic) generated from the original by a scanner, digital camera, or video camera. Text recognition itself is performed in three steps:

Page and layout detection

The image file is divided into relevant areas (text, captions) and irrelevant areas (figures, white space, lines).
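
As a rough illustration of how text areas can be separated from white space, the sketch below finds text lines by scanning for rows that contain ink. This row-scanning (projection) technique is an assumption chosen for simplicity and is not named in the text; the function name and the 0/1 image representation are hypothetical.

```python
def find_text_lines(image):
    """Return (start_row, end_row) pairs for runs of rows containing ink.

    `image` is a list of rows; each row is a list of 0 (background) or 1 (ink).
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                     # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # the line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(image) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(find_text_lines(page))  # [(1, 2), (4, 4)]
```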

Pattern Recognition

Error correction at the pixel level

The raw pixels can be corrected on the basis of their relationships to neighboring pixels. Isolated pixels are deleted, and missing pixels can be added. This increases the hit rate of pure pattern matching. The result depends heavily on the contrast of the original.
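
A minimal sketch of such a neighborhood-based correction follows, assuming a binary image stored as nested lists. The rule used here (delete ink pixels with no ink neighbors, fill background pixels with at least `fill_threshold` ink neighbors) and the threshold value are illustrative assumptions, not the rule of any particular OCR product.

```python
def count_neighbors(image, y, x):
    """Count ink pixels among the up to eight neighbors of (y, x)."""
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            ny, nx = y + dy, x + dx
            if 0 <= ny < len(image) and 0 <= nx < len(image[0]):
                total += image[ny][nx]
    return total


def clean_pixels(image, fill_threshold=7):
    """Delete isolated ink pixels and fill holes surrounded by ink."""
    result = [row[:] for row in image]  # work on a copy, read from the original
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            n = count_neighbors(image, y, x)
            if value == 1 and n == 0:
                result[y][x] = 0  # isolated speck: remove
            elif value == 0 and n >= fill_threshold:
                result[y][x] = 1  # hole inside a stroke: fill
    return result


noisy = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
]
print(clean_pixels(noisy))
# [[1, 1, 1, 0, 0], [1, 1, 1, 0, 0], [1, 1, 1, 0, 0]]
```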

Pattern matching

The pixel patterns of the text regions are compared with patterns in a database, producing raw digitized text.
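
The comparison against a pattern database can be illustrated with a toy template matcher. The 3x3 templates and the use of the Hamming distance (number of differing pixels) are assumptions made for this sketch; production systems use far richer features and many variants per character.

```python
# Hypothetical 3x3 templates; a real pattern database holds many glyph variants.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def hamming_distance(a, b):
    """Number of pixel positions in which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph, templates=TEMPLATES):
    """Return (character, distance) of the best matching template."""
    best = min(templates, key=lambda ch: hamming_distance(glyph, templates[ch]))
    return best, hamming_distance(glyph, templates[best])


noisy_t = ["111", "010", "011"]  # a "T" with one stray pixel
print(recognize(noisy_t))        # ('T', 1)
```

The returned distance can serve as a crude confidence value: a large distance to the best template marks the character as a candidate for later correction.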

Error correction at the character level (Intelligent Character Recognition, ICR)

The raw digitized text is checked against dictionaries and assessed for probable errors using linguistic and statistical methods. Depending on this evaluation, the text is either output or, where appropriate, fed back into layout detection or pattern recognition with modified parameters.

Error correction at the word level (Intelligent Word Recognition, IWR)

Cursive handwriting, in which the individual characters cannot be recognized separately from one another, is compared with dictionaries on the basis of global characteristics. Recognition accuracy decreases as the dictionary grows, because the possibilities for confusion increase. Typical applications are defined form fields with a limited range of possible entries, for example handwritten addresses on envelopes.
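
The effect of dictionary size on ambiguity can be shown with a deliberately coarse sketch. The "global characteristics" used here (word length, first and last letter) and the word lists are hypothetical; real IWR systems extract shape features from the handwriting itself.

```python
def word_features(word):
    """Coarse whole-word features: length, first letter, last letter."""
    return (len(word), word[0].lower(), word[-1].lower())


def candidates(observed_features, dictionary):
    """All dictionary words whose global features match the observation."""
    return sorted(w for w in dictionary if word_features(w) == observed_features)


# Features assumed to have been extracted from a handwritten city name.
observed = (6, "b", "n")

small = {"Berlin", "Hamburg", "Munich"}
large = small | {"Bremen", "Bergen", "Bochum"}

print(candidates(observed, small))  # ['Berlin']
print(candidates(observed, large))  # ['Bergen', 'Berlin', 'Bremen']
```

With the small dictionary the observation is unambiguous; with the larger one, three words fit equally well, which is why restricted fields such as addresses suit this approach.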

Manual error correction

Many programs also offer a special mode in which the user can manually correct those text areas that were flagged as "uncertain".

Encoding in the output format

Depending on the task, the output is written to a database or to a text file in a defined format such as ASCII or XML, where appropriate with layout information (for example, as HTML or PDF).
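
As a small example of the XML case, the following sketch serializes recognized lines with Python's standard `xml.etree.ElementTree` module. The element names `document` and `line` and the attribute `n` are invented for illustration and do not correspond to any established OCR output schema.

```python
import xml.etree.ElementTree as ET


def to_xml(lines):
    """Serialize recognized text lines as a small XML document (string)."""
    root = ET.Element("document")
    for number, text in enumerate(lines, start=1):
        line = ET.SubElement(root, "line", n=str(number))
        line.text = text  # special characters such as "<" are escaped on output
    return ET.tostring(root, encoding="unicode")


result = to_xml(["Invoice 4711", "Total: 12 < 20"])
print(result)
```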

The quality of text recognition is determined by several factors, among them:

  • Quality of the layout detection,
  • Scope and quality of pattern database,
  • Scope and quality of dictionaries,
  • Quality of the algorithms for error correction,
  • Color, contrast, layout, and font of the original document,
  • Resolution and quality of the image file.

The number of undetected errors in a document can be estimated (see spelling errors). Running text contains redundancy and can therefore tolerate a higher error rate, whereas lists of numbers, such as telephone numbers, require repeated proofreading.
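
The text does not say which estimation method is meant. One commonly used possibility, shown here purely as an assumption, is a capture-recapture estimate based on two independent proofreading passes: if the passes find a and b errors, with c found by both, the total is estimated as a·b/c.

```python
def estimate_undetected(found_a, found_b, found_both):
    """Estimate errors missed by two independent proofreading passes.

    Capture-recapture estimate (an assumption for this sketch):
    total errors ~ found_a * found_b / found_both.
    """
    if found_both == 0:
        raise ValueError("no overlap between the passes: estimate undefined")
    estimated_total = found_a * found_b / found_both
    detected = found_a + found_b - found_both  # distinct errors actually found
    return estimated_total - detected


# Pass A finds 20 errors, pass B finds 15, and 10 are found by both.
print(estimate_undetected(20, 15, 10))  # 5.0
```

The estimate is only meaningful if the two passes are independent and every error is equally likely to be found, which rarely holds exactly in practice.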

Successes by neural networks in handwriting recognition competitions

Recently, artificial neural networks have often achieved better results in handwriting applications than competing learning methods. Between 2009 and 2012, the recurrent and deep feedforward neural networks of Jürgen Schmidhuber's research group at the Swiss AI lab IDSIA won a series of eight international competitions in pattern recognition. In particular, their recurrent LSTM networks won three connected handwriting recognition competitions at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any built-in a priori knowledge of the three different languages to be learned. The LSTM networks learned segmentation and recognition simultaneously. These were also the first international competitions to be won by deep learning or by recurrent networks.

Deep feedforward networks, such as Kunihiko Fukushima's convolutional network from the 1980s, are also important again today for handwriting recognition. They consist of alternating convolutional layers and layers of neurons that compete with one another. In 1989, the team of Yann LeCun (now at New York University) applied the already well-known backpropagation algorithm to such networks. Modern versions use so-called "max-pooling" for the competition layers. At the end, the deep network is topped by several ordinary layers of neurons. Fast GPU implementations of this combination were introduced in 2011 by Dan Ciresan and colleagues in Schmidhuber's group and have since won several competitions for the recognition of handwriting and other patterns. GPU-based max-pooling convolutional networks were also the first method able to recognize the handwritten digits of the MNIST benchmark as well as humans do.
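
To make the term "max-pooling" concrete, the following sketch downsamples a small feature map by keeping only the strongest activation in each non-overlapping 2x2 block. It is a plain-Python illustration of the operation under the assumption of a 2x2 window, not the GPU implementation mentioned above.

```python
def max_pool_2x2(feature_map):
    """Downsample by keeping the maximum of each non-overlapping 2x2 block."""
    pooled = []
    for y in range(0, len(feature_map) - 1, 2):
        row = []
        for x in range(0, len(feature_map[0]) - 1, 2):
            block = (
                feature_map[y][x], feature_map[y][x + 1],
                feature_map[y + 1][x], feature_map[y + 1][x + 1],
            )
            row.append(max(block))  # only the "winning" neuron survives
        pooled.append(row)
    return pooled


activations = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(activations))  # [[4, 2], [2, 8]]
```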

Applications

  • Retrieving text from image files so that it can be edited further in a word processor or made electronically searchable
  • Identification of relevant features (for example, postal code, contract number, invoice number) for mechanical (mail sorting line) or electronic (workflow management system) sorting of a document
  • Extended full-text search in databases or document management systems, so that PDFs and images can also be searched
  • Recognition of features for registering and, where appropriate, tracking objects (for example, license plates)
  • Layout recognition: a formatted document is created that matches the original as closely as possible in its arrangement of text, images, and tables
  • Aids for the blind: OCR makes it possible for blind people to read scanned text on a computer with a Braille display or to have it read aloud by speech output

OCR software

Proprietary software

  • BIT-Alpha from B.I.T. Bureau Ingénieur Tomasi
  • FineReader from ABBYY
  • FormPro from OCR Systems
  • OCRKit for Mac OS and iOS
  • OmniPage from Nuance Communications (formerly ScanSoft)
  • Readiris from Image Recognition Integrated Systems Group (I.R.I.S.), part of Canon since 2013
  • NSOCR from Nicomsoft
  • ARGUS_Script from Planet IS GmbH

As a secondary function in proprietary software:

Free software

  • OCRopus
  • GOCR
  • CuneiForm
  • Ocrad
  • Tesseract
  • OCRFeeder