Percent-encoding

URL encoding (also called percent-encoding) is a mechanism for encoding information in a URL under certain circumstances. Only certain characters of the ASCII character set are used for the encoding.

Without this encoding, some information could not be expressed in a URL. For example, a browser usually interprets a space as the end of the URL; the characters that follow are ignored or cause an error. With URL encoding, a space can be passed as the string %20. RFC 3986 defines how a URI (and therefore also a URL) should be constructed syntactically and under what conditions URL encoding is applied.
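As a minimal illustration of the space example, the following Python sketch uses the standard library functions urllib.parse.quote and urllib.parse.unquote. The file name is a hypothetical value chosen for the example.

```python
from urllib.parse import quote, unquote

raw = "my file.html"       # hypothetical path segment containing a space
encoded = quote(raw)       # the space is replaced by %20

print(encoded)             # my%20file.html
print(unquote(encoded))    # my file.html
```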

URL encoding with the percent sign is also used for characters that are not contained in the ASCII character set. So far, however, RFC 3986 only gives a recommendation for this; a binding standard is still missing.

Reserved and unreserved characters

URLs can contain a maximum of the following fields:

Certain characters within this expression identify and separate the individual segments of the URL and allow the expression to be split and processed. In an HTTP request, for example:

  • the question mark (?) introduces the data part (query string) of the URL,
  • the equal sign (=) stands between the name of a parameter and its value,
  • the ampersand (&) serves as a delimiter between "parameter=value" elements in the data part,
  • the number sign (#) is followed by the name of a document anchor (see also URI reference).
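The role of these delimiters can be sketched with Python's urllib.parse module. The URL below, including its host, path, parameter names and anchor, is a hypothetical example.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL with a query string and a document anchor
url = "http://example.com/page.html?name=value&lang=de#section2"

parts = urlsplit(url)
print(parts.path)       # /page.html
print(parts.query)      # name=value&lang=de   (everything after "?")
print(parts.fragment)   # section2             (everything after "#")

# "&" separates the elements, "=" separates name and value
pairs = parse_qsl(parts.query)
print(pairs)            # [('name', 'value'), ('lang', 'de')]
```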

Other characters have specific meanings in the document path. Altogether, the following characters are considered to be reserved:

The following characters and character groups are not reserved and therefore have no predefined meaning in a URL:
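For reference, the two character classes as RFC 3986 gives them (sections 2.2 and 2.3) can be written out in a short Python sketch. The constant names and the classify helper are illustrative and not part of any library.

```python
import string

# Reserved characters according to RFC 3986, section 2.2
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = set(GEN_DELIMS + SUB_DELIMS)

# Unreserved characters according to RFC 3986, section 2.3:
# letters, digits, hyphen, period, underscore and tilde
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the class of a single character (hypothetical helper)."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    return "other"      # must be percent-encoded to appear in a URL


print(classify("?"))    # reserved
print(classify("a"))    # unreserved
print(classify(" "))    # other
```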

% Representation

A URL consists of the reserved and unreserved characters mentioned above; it must not contain any other characters. In principle, however, there is a need to represent arbitrary byte sequences in URLs, that is, all values between 0 and 255. In addition, there must be a way to write reserved characters in a URL so that they lose their special meaning (see also: escape sequence).

The % representation of characters meets both requirements. It is based on an encoding that assigns to each character code a three-character sequence: a percent sign followed by the two-digit hexadecimal representation of the character code.

A reserved character must be written in %-encoded form in a URL when it has a special meaning at the position where it occurs but is not supposed to have that meaning in the given context. Unreserved characters can be %-encoded, but should not be. For other characters (including binary data), there is usually no option but to write them in %-encoded form in a URL (the exception is the reserved character "+" used instead of a space in the query string).
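These rules can be tried out with Python's urllib.parse module. The sketch below assumes a hypothetical parameter value; quote with safe="" encodes every reserved character, while quote_plus additionally uses "+" for spaces as is customary in query strings.

```python
from urllib.parse import quote, quote_plus, unquote_plus

value = "Tom & Jerry"              # hypothetical parameter value

# The reserved "&" must be encoded so it is not read as a delimiter
strict = quote(value, safe="")
print(strict)                      # Tom%20%26%20Jerry

# In a query string, a space may be written as "+" instead of %20
plus = quote_plus(value)
print(plus)                        # Tom+%26+Jerry
print(unquote_plus(plus))          # Tom & Jerry

# Unreserved characters are left as they are
print(quote("a-b_c.d", safe=""))   # a-b_c.d
```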

In ASCII, the character "#" is assigned the hexadecimal character code 23. Accordingly, the expression "%23" is the %-encoded form of the "#" character.
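The three-character form can be derived by hand in a few lines of Python. The function name percent_encode_byte is a hypothetical helper introduced only for this sketch.

```python
def percent_encode_byte(value: int) -> str:
    """Return '%' followed by the two-digit hexadecimal code of one byte."""
    if not 0 <= value <= 255:
        raise ValueError("a byte value must be between 0 and 255")
    return "%{:02X}".format(value)


print(percent_encode_byte(ord("#")))   # %23
print(percent_encode_byte(32))         # %20
print(percent_encode_byte(255))        # %FF

# Arbitrary binary data, byte by byte
data = bytes([0, 10, 255])
print("".join(percent_encode_byte(b) for b in data))   # %00%0A%FF
```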

The interpretation of

is clear: a URL parameter named session is defined and assigned the value A54C6FE2, and a document anchor named info is specified. In this context, the character "#" has the special meaning that the name of a document anchor follows it. If it is to lose this meaning, that is, if the value A54C6FE2#info is to be assigned to the URL parameter session, the "#" character must appear in %-encoded form in the URL:
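A sketch in Python shows how a parser sees the two variants. The host and path (example.net/index.html) are assumptions made for illustration; only the parameter session, the value A54C6FE2 and the anchor info come from the text.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical host and path; "#" once as a delimiter, once as %23
plain = "http://example.net/index.html?session=A54C6FE2#info"
escaped = "http://example.net/index.html?session=A54C6FE2%23info"

results = []
for url in (plain, escaped):
    parts = urlsplit(url)
    results.append((parse_qs(parts.query), parts.fragment))
    print(parse_qs(parts.query), repr(parts.fragment))

# {'session': ['A54C6FE2']} 'info'
# {'session': ['A54C6FE2#info']} ''
```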

In practice, this mechanism is not always applied consistently. However, there are cases in which its use is necessary, for example when calling an anchor via a dereferrer service.

Non-ASCII characters

For characters that are not included in the ASCII character set, the bytes are likewise encoded with a preceding %. Which byte sequence represents a character, however, depends on the character encoding used. RFC 3986 recommends UTF-8 for the encoding, since this Unicode format covers all international characters. This makes UTF-8 the de facto standard encoding for URIs, but an explicit standard does not yet exist. To encode the URL, one therefore has to know or guess the character encoding used for the file being requested, or use the encoding of the target computer. For this reason, it is still advisable to use only characters from the ASCII repertoire.

With the recommended UTF-8, the letter ö (Unicode character value 246) is represented as %C3%B6. All character values above 127 are represented as two, three or four bytes, each of which is carried over into the % encoding. Many common accented Latin letters are represented with two bytes; other characters, such as the marks for a change of writing direction, need more.

Sometimes ISO 8859-1 (Latin-1) is still used for the representation, and the identical character value 246 (decimal) is inserted directly into the URL using the % encoding. The umlaut ö is then represented as %F6.
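The difference between the two encodings can be reproduced with the encoding parameter of Python's urllib.parse.quote, which defaults to UTF-8. The euro sign is added here only as an example of a character that needs three bytes.

```python
from urllib.parse import quote

utf8_form = quote("ö")                         # UTF-8 is the default
latin1_form = quote("ö", encoding="latin-1")   # ISO 8859-1

print(utf8_form)      # %C3%B6
print(latin1_form)    # %F6
print(quote("€"))     # %E2%82%AC  (three bytes in UTF-8)
```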

Unambiguity

The distinction between two separately encoded ASCII characters (such as %23%23 for ##) and one 2-byte UTF-8 character (e.g. %C3%B6) follows from the way UTF-8 encodes. The individual bytes are not valid ASCII characters by themselves, because C3 corresponds to decimal 195 and B6 to 182. Since ASCII characters have a maximum value of 127, these cannot be two individual ASCII characters, and the bytes together are assumed to be one UTF-8-encoded character. The two therefore cannot be confused. On a similar basis, some servers can also determine which encoding is used in the URL.

Another peculiarity of the % encoding of UTF-8 characters is that the first code value is always a character code from the upper range of ISO 8859-1 (194-244), and the following code value (or values, if there are more than two bytes) is always between 128 and 191. A server can of course exploit this only if the requested URL contains characters with values greater than 127; otherwise no distinction is necessary, since UTF-8 coincides with ASCII in the first 128 characters (indices 0-127).
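One way a server might apply this reasoning is to decode the percent-encoded bytes and test whether they form valid UTF-8. The following Python sketch does this with the standard function urllib.parse.unquote_to_bytes; the helper looks_like_utf8 is a hypothetical name and not a description of how any particular server works.

```python
from urllib.parse import unquote_to_bytes


def looks_like_utf8(encoded: str) -> bool:
    """Return True if the percent-decoded bytes are valid UTF-8."""
    raw = unquote_to_bytes(encoded)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(list(unquote_to_bytes("%C3%B6")))   # [195, 182]
print(looks_like_utf8("%C3%B6"))          # True:  lead byte + continuation byte
print(looks_like_utf8("%F6"))             # False: not a valid UTF-8 sequence
print(looks_like_utf8("%23%23"))          # True:  plain ASCII is also valid UTF-8
```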

Character encodings that allow values above 127 cannot be clearly distinguished from one another. ISO 8859-1, for example, permits the characters C3 and B6 (meaning Ã and ¶).
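The ambiguity can be seen by decoding the same percent-encoded string under both assumptions, here with the encoding parameter of Python's urllib.parse.unquote:

```python
from urllib.parse import unquote

encoded = "%C3%B6"

as_utf8 = unquote(encoded)                        # UTF-8 is the default
as_latin1 = unquote(encoded, encoding="latin-1")  # ISO 8859-1

print(as_utf8)      # ö
print(as_latin1)    # Ã¶
```

Both results are valid strings, so the bytes alone do not reveal which one was intended.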

MIME type

URL-encoded data can be labeled with the MIME type "application/x-www-form-urlencoded". When web form data is transmitted using the POST method, this MIME type is given as the content type (Content-Type), sometimes with an explicit encoding: "application/x-www-form-urlencoded; charset=UTF-8".
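A form body of this type can be produced with Python's urllib.parse.urlencode, as the following sketch shows. The field names and values are hypothetical, and the headers dictionary only illustrates what a client would send; no request is actually made.

```python
from urllib.parse import urlencode, parse_qsl

# Hypothetical form fields
form = {"name": "Jörg Müller", "city": "Köln"}

# urlencode uses UTF-8 by default and writes spaces as "+"
body = urlencode(form)
print(body)    # name=J%C3%B6rg+M%C3%BCller&city=K%C3%B6ln

# Header a client would send along with the POST body
headers = {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}

# The receiving side reverses the encoding
print(parse_qsl(body))    # [('name', 'Jörg Müller'), ('city', 'Köln')]
```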
