Text normalization

Text normalization is the conversion of text into a different form in which only the information relevant to the intended application is retained. Depending on the application, normalization can proceed very differently.

Examples

Some character sets, in particular Unicode, allow the same character to be represented in different ways. Applications, however, usually want only one of the possible forms, so normalization must convert the text into that form. For Unicode specifically, there are four normalization forms: NFC, NFD, NFKC, and NFKD.
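As a minimal sketch of the Unicode case, the following Python snippet uses the standard library function unicodedata.normalize to show that two different code point sequences for "é" become equal after normalization. The helper name all_forms and the sample strings are illustrative assumptions, not part of any standard.

```python
import unicodedata

# Two representations of the same visible character "é".
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # letter "e" followed by COMBINING ACUTE ACCENT

def all_forms(text):
    """Return the text in each of the four Unicode normalization forms.

    Hypothetical helper for illustration only.
    """
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) for form in forms}

# Without normalization the two strings compare as different.
raw_equal = composed == decomposed

# After normalizing both to the same form, they compare as equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# The compatibility forms (NFKC, NFKD) additionally fold characters such as
# the ligature "ﬁ" (U+FB01) into their plain equivalents.
ligature_forms = all_forms("\ufb01")
```

Which of the four forms is appropriate depends on the application; the canonical forms (NFC, NFD) preserve distinctions such as ligatures, while the compatibility forms remove them.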

When creating a search index, normalization must meet different requirements depending on what users expect. Some possibilities are:

  • Punctuation can be removed.
  • Accented characters can be replaced by their base letters. Similarly, ä can be replaced by ae and ß by ss.
  • All characters can be converted to uppercase.
  • Characters from other alphabets can be transliterated.
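
The first three possibilities in the list above can be sketched in Python roughly as follows. The function name index_key and the expansion table are assumptions made for illustration; the entries for ö and ü are added by analogy with ä and are not taken from the text. Transliteration of other alphabets is left out because it needs language-specific tables.

```python
import unicodedata

# Hypothetical expansion table. Only "ä" and "ß" come from the text above;
# the remaining entries are assumed by analogy.
GERMAN_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ß": "ss",
}

def index_key(text):
    """Build a simplified search key: expand, strip accents, drop punctuation, uppercase."""
    # Compose first so that "a" + combining diaeresis is seen as "ä".
    text = unicodedata.normalize("NFC", text)
    text = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in text)

    # Decompose, then drop combining marks to leave only base letters.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Remove punctuation (Unicode general categories beginning with "P").
    no_punct = "".join(
        ch for ch in base if not unicodedata.category(ch).startswith("P")
    )
    return no_punct.upper()
```

With this sketch, index_key("Café, Straße!") yields "CAFE STRASSE", and "Bär" and "BAER" map to the same key. Whether such merging is desirable depends on the language and on what users of the index expect.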

Some of these requirements can be met using the Unicode Collation Algorithm.

To prevent spoofing, for example two users registering in an Internet forum with names that look identical, normalization must replace visually similar characters with the same character. For instance, both the digit 1 and the lowercase letter l could be replaced by the capital letter I.
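
A minimal sketch of this idea in Python might look as follows. The table CONFUSABLES contains only the two replacements mentioned above, and the names skeleton and is_name_available are hypothetical; a real system would need a much larger table of visually similar characters.

```python
# Map visually similar characters to one representative character.
# Only the two replacements named in the text are included here.
CONFUSABLES = {
    "1": "I",  # digit one
    "l": "I",  # lowercase letter L
}

def skeleton(name):
    """Return the name with confusable characters replaced by their representative."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in name)

def is_name_available(name, taken_names):
    """Reject a new name if it looks identical to an already registered one."""
    taken_skeletons = {skeleton(existing) for existing in taken_names}
    return skeleton(name) not in taken_skeletons
```

Here is_name_available("Bi11", ["Bill"]) returns False, because both names reduce to the same skeleton. The normalized form is used only for comparison; the name the user typed can still be stored and displayed unchanged.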

For speech synthesis, numbers, special characters, and abbreviations must be expanded into words, partly depending on the context, so that they can be read aloud correctly.
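
The following Python sketch illustrates such an expansion in a deliberately simplified form. All tables and the function name expand_for_speech are assumptions for illustration: digits are read one at a time, and context is ignored entirely, so an abbreviation such as "Dr." is always read as "Doctor" even where "Drive" would be meant.

```python
import re

# Hypothetical lookup tables; real systems use far larger, context-aware rules.
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera"}
SYMBOLS = {"%": "percent", "&": "and"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def expand_for_speech(text):
    """Replace abbreviations, symbols, and digits with speakable words."""
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)

    # Pad with spaces so that "42%" becomes two separate tokens.
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, f" {spoken} ")

    # Read each ASCII digit individually (no grouping into "forty-two").
    text = re.sub(r"[0-9]", lambda m: f" {DIGITS[int(m.group())]} ", text)

    # Collapse the extra whitespace introduced by the padding.
    return " ".join(text.split())
```

For example, expand_for_speech("Dr. Smith owns 42% & more") returns "Doctor Smith owns four two percent and more". Reading "42" as "forty-two", or choosing between readings of an abbreviation, is exactly the context-dependent part that this sketch omits.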
