Punycode

Punycode is an encoding method, standardized in RFC 3492, for converting Unicode strings into ASCII-compatible strings that consist only of the characters a to z, 0 to 9 and the hyphen. Punycode was designed to represent internationalized domain names made up of Unicode characters uniquely and reversibly using ASCII characters.

Motivation

An important motivation for introducing Punycode was that the established Domain Name System permits only names consisting of the 26 Latin letters, the digits 0 to 9 and the hyphen. This was sufficient for English, but most other languages contain additional characters; German, for example, has the umlauts ä, ö and ü as well as ß. To handle text in such languages, the Internationalizing Domain Names in Applications procedure was introduced in 2003, with Punycode as its encoding method.

If a text is to be passed to a system that handles only ASCII, it is first converted to ASCII with Punycode. Note that in many cases this makes the text longer. Conversely, when the text is received back from the ASCII system, Punycode converts it back to its original form. A text that contains no special characters is not changed by this procedure.
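The round trip described above can be tried out with Python's built-in "punycode" codec. This is a minimal sketch; the sample word "bücher" is chosen here purely for illustration.

```python
original = "bücher"

# Encode to an ASCII-only byte string with the built-in "punycode" codec.
encoded = original.encode("punycode")

# Decode the ASCII form back to the original Unicode text.
restored = encoded.decode("punycode")

print(encoded)                        # b'bcher-kva'
print(len(original), len(encoded))    # 6 9
print(restored == original)           # True
```

The encoded form is three characters longer than the original, which illustrates the lengthening mentioned above.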

The Punycode conversion method was designed with the following goals in mind:

  • Completeness: Every name can be converted
  • Uniqueness: Every name has exactly one converted form
  • Reversibility: Every converted name can be converted back
  • Efficiency: The converted name is not much longer than the source name
  • Simplicity: The method is relatively simple to implement
  • Readability: The converted name remains largely readable because the characters a to z are not changed

Rules of transformation

In the following, the letters a to z and the digits 0 to 9 are the basic characters. Together with the hyphen "-", which serves as a delimiter, these 37 characters are the only valid characters in a text encoded with Punycode.

If the string to be converted contains

  • only basic characters, it is left unchanged: under IDNA, Punycode is not applied to such a string (the bare algorithm of RFC 3492 would merely append a trailing hyphen).
  • both basic and non-basic characters, all basic characters are copied first in their original order, followed by a hyphen and then the encoded non-basic characters.
  • only non-basic characters, the result consists of the code sequence alone, without a delimiter.
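The three cases can be illustrated with a short sketch. The helper name `convert` is hypothetical. It also assumes that every ASCII character counts as basic (as RFC 3492 does), which is broader than the letters and digits listed above. The encoding itself is delegated to Python's built-in "punycode" codec.

```python
def convert(text: str) -> str:
    """Apply the three cases to a single string (illustrative helper)."""
    if text.isascii():
        # Case 1: only basic characters, so the string is returned unchanged.
        return text
    # Cases 2 and 3: let the built-in "punycode" codec do the encoding.
    return text.encode("punycode").decode("ascii")

print(convert("buecher"))  # buecher    (only basic characters)
print(convert("bücher"))   # bcher-kva  (basic characters, hyphen, code sequence)
print(convert("ü"))        # tda        (code sequence only, no delimiter)
```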

To make the resulting string as compact as possible, the special characters are not encoded "one-to-one" but according to the Punycode encoding method. The non-basic characters are first ordered by their numeric value. The difference between the values of successive characters is then combined with the respective position in the original string to form a number. This number is written using the basic characters (a to z and 0 to 9) as digits and appended to the encoded text. The details of this process are defined in RFC 3492, which also includes a reference implementation in the C programming language for encoding and decoding, as well as numerous examples.
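The following Python sketch follows the encoding procedure and parameter values of RFC 3492 to make the steps above concrete. It is not the reference implementation. The function and variable names are chosen here for illustration, and the overflow checks of the C version are omitted because Python integers do not overflow.

```python
# Parameter values defined for Punycode in RFC 3492.
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # digit values 0 to 35

def adapt(delta: int, num_points: int, first_time: bool) -> int:
    """Bias adaptation: tunes the digit thresholds to the deltas seen so far."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)

def punycode_encode(text: str) -> str:
    # Copy the basic (ASCII) characters first, in their original order.
    basic = [c for c in text if ord(c) < INITIAL_N]
    output = list(basic)
    if basic:
        output.append("-")  # delimiter between basic part and code sequence
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    handled = len(basic)
    while handled < len(text):
        # Take the non-basic characters in increasing order of code point.
        m = min(ord(c) for c in text if ord(c) >= n)
        # Fold the code point difference and the string length into delta.
        delta += (m - n) * (handled + 1)
        n = m
        for c in text:
            cp = ord(c)
            if cp < n:
                delta += 1  # counts positions skipped before the next insertion
            elif cp == n:
                # Write delta as a variable-length integer in base-36 digits.
                q, k = delta, BASE
                while True:
                    t = min(max(k - bias, TMIN), TMAX)
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = adapt(delta, handled + 1, handled == len(basic))
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)

print(punycode_encode("bücher"))   # bcher-kva
print(punycode_encode("münchen"))  # mnchen-3ya
```

The hyphen acts only as the delimiter; the numbers themselves are written with the 36 letters and digits.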

When domain names are formed according to the Internationalizing Domain Names in Applications (IDNA) standard, the prefix "xn--" is prepended to the encoded name if non-basic characters are present; otherwise (only basic characters) Punycode is not applied.
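The prefix rule can be sketched as follows. The helper names `to_ace_label` and `to_ace_domain` are hypothetical, and the sketch deliberately skips the normalization step (Nameprep) that full IDNA processing performs before encoding, so it is a simplification rather than a complete IDNA implementation.

```python
ACE_PREFIX = "xn--"

def to_ace_label(label: str) -> str:
    """Convert one domain label (the part between two dots)."""
    if label.isascii():
        # Only basic characters: Punycode is not applied, no prefix is added.
        return label
    # Non-basic characters present: Punycode-encode and prepend the prefix.
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def to_ace_domain(domain: str) -> str:
    """Apply the rule to each label of a domain name separately."""
    return ".".join(to_ace_label(label) for label in domain.split("."))

print(to_ace_domain("bücher.de"))   # xn--bcher-kva.de
print(to_ace_domain("example.de"))  # example.de
```

Python also ships a built-in "idna" codec that includes the normalization step; for this simple input, `"bücher.de".encode("idna")` yields the same result as the sketch.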
