UTF-7

UTF-7 is an encoding of the Unicode character set defined in RFC 2152. Despite the similarity of its name to other encodings, UTF-7 is not part of the Unicode standard. UTF-7 allows Unicode to be used in environments that are not 8-bit clean.

Motivation

Many Internet protocols (such as SMTP for e-mail and NNTP for news) require the use of ASCII. This encoding allows only 128 different characters, each stored in 7 bits. All other encodings, including UTF-8, use at least 8 bits to encode a character. Transferring UTF-8 over such protocols would therefore require an additional 7-bit encoding.

Various encoding methods (see MIME), such as Base64 and Quoted-printable, convert arbitrary 8-bit binary data into 7-bit ASCII text. Depending on the method and the data being encoded, the encoding inflates the amount of data. UTF-7 was designed to keep this overhead as low as possible for texts that contain only a few non-ASCII Unicode characters, while leaving passages that can be represented in 7-bit ASCII readable.
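The overhead comparison can be tried out with a small sketch. It assumes Python's standard `utf-7` codec together with the `base64` and `quopri` modules; the function name `encoded_sizes` and the sample sentence are made up for illustration, and the resulting numbers depend entirely on the text chosen.

```python
import base64
import quopri

def encoded_sizes(text: str) -> dict:
    """Return the size in bytes of three 7-bit-safe encodings of text."""
    utf8 = text.encode("utf-8")
    return {
        # UTF-7 applied directly to the Unicode text
        "utf-7": len(text.encode("utf-7")),
        # UTF-8 wrapped in the two common MIME transfer encodings
        "base64": len(base64.b64encode(utf8)),
        "quoted-printable": len(quopri.encodestring(utf8)),
    }

# Mostly ASCII, with a single non-ASCII character (illustrative sample).
sample = "The price is 5 € per item."
sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For a mostly ASCII sample like this one, UTF-7 should come out smallest; for text in a non-Latin script the ranking can look different.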

Encoding

In UTF-7, the characters A-Z a-z 0-9 ' ( ) , - . / : ? are transmitted as they are. The ASCII characters ! " # $ % & * ; < = > @ [ ] ^ _ ` { | } may be transmitted directly, but should preferably be encoded as well, because not all e-mail gateways pass them through correctly.
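The two character classes can be written down as a minimal sketch. The names `SET_D` and `SET_O` follow the terminology of RFC 2152; the function `classify` and its return strings are hypothetical. Treating space, tab, CR and LF as direct follows RFC 2152 rather than the text above.

```python
import string

# Characters that are always sent as they are (RFC 2152 "Set D").
SET_D = set(string.ascii_letters + string.digits + "'(),-./:?")
# Characters that may be sent directly but are safer encoded ("Set O").
SET_O = set('!"#$%&*;<=>@[]^_`{|}')
# Whitespace that RFC 2152 also allows to be sent directly
# (not mentioned in the text above).
WHITESPACE = set(" \t\r\n")

def classify(ch: str) -> str:
    """Return 'direct', 'optional' or 'encode' for a single character."""
    if ch in SET_D or ch in WHITESPACE:
        return "direct"
    if ch in SET_O:
        return "optional"
    # Everything else, including '+', '\\', '~' and all non-ASCII characters.
    return "encode"
```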

All other characters are specially encoded. For this purpose, a sequence of input characters is treated as a stream of two-byte units (UTF-16, with surrogates where needed) and converted into a stream of ASCII characters using a modified Base64 method (without the trailing "=" padding). The start of such an encoded sequence is marked by a plus sign ("+"), the end by a minus sign ("-") or by the first ASCII character that cannot occur in Base64 output. Unused bits at the end of the encoding must be set to 0.
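The procedure described above can be sketched as a small encoder. This is an illustration under stated assumptions, not a reference implementation: the names `DIRECT`, `_encode_run` and `utf7_encode` are hypothetical, the optional direct characters are always encoded (the conservative choice), every encoded sequence is closed with an explicit "-", and a literal plus sign is written as "+-", a rule taken from RFC 2152 rather than from the text above.

```python
import base64
import string

# Characters passed through unchanged in this sketch: the always-direct
# ASCII characters plus space, tab, CR and LF (the latter per RFC 2152).
DIRECT = set(string.ascii_letters + string.digits + "'(),-./:? \t\r\n")

def _encode_run(run: str) -> str:
    """Encode a run of characters as one '+...-' sequence."""
    raw = run.encode("utf-16-be")            # two-byte units, surrogates if needed
    b64 = base64.b64encode(raw).decode("ascii")
    # Modified Base64: drop the '=' padding. Standard Base64 already fills
    # the unused trailing bits with zeros.
    return "+" + b64.rstrip("=") + "-"

def utf7_encode(text: str) -> str:
    out = []
    run = []                                 # pending characters needing Base64
    for ch in text:
        if ch in DIRECT or ch == "+":
            if run:                          # close the pending encoded sequence
                out.append(_encode_run("".join(run)))
                run = []
            # Assumption from RFC 2152: a literal '+' is written as '+-'.
            out.append("+-" if ch == "+" else ch)
        else:
            run.append(ch)
    if run:
        out.append(_encode_run("".join(run)))
    return "".join(out)

print(utf7_encode("日本語"))   # +ZeVnLIqe-
print(utf7_encode("Grüße"))    # Gr+APwA3w-e
```

Python's built-in `utf-7` codec can decode the output of this sketch, which gives a convenient round-trip check.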

English text in this encoding is easy for people to read, because encoded special characters occur only very rarely. The special characters of other Western European languages, however, have to be encoded, which already distorts the text considerably. Texts in languages that do not use the Latin alphabet are not readable by people without decoding.

Despite its higher encoding efficiency, UTF-7 did not become established, because other methods such as Quoted-printable and Base64 are understood by virtually every e-mail and news program, and their larger encoding overhead is irrelevant in practice.

  • Unicode