Whitespace character

Whitespace (also written white space, /ˈwaɪtspeɪs/) is a term in computer science for characters in a text that are normally not displayed by a text editor or word processor but nevertheless occupy space (in memory). They are primarily used to separate words or lines from each other.

Which characters count as whitespace depends on the context: almost always at least the space and the tab, and usually line breaks as well. Many programs can also make these characters visible and distinguishable by displaying substitute formatting symbols (for example, "¶" for line breaks, "·" for spaces, and "→" or ">" for tabs).
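The substitution of visible symbols can be sketched in a few lines of Python. This is only an illustration: the mapping VISIBLE and the function show_whitespace are hypothetical names, and the symbols are simply the ones given as examples above, not those of any particular editor.

```python
# Hypothetical mapping from whitespace characters to visible symbols.
# The line break keeps its "\n" so the text still wraps at the same place.
VISIBLE = {
    " ": "·",
    "\t": "→",
    "\n": "¶\n",
}


def show_whitespace(text: str) -> str:
    """Return text with spaces, tabs and line breaks made visible."""
    return "".join(VISIBLE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(show_whitespace("a b\tc\nd"))
```

A real editor would typically draw these symbols without changing the stored text; the sketch rewrites the string only to keep the example short.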

In programming, these characters usually play a special role. In many programming languages they separate individual words, such as reserved words and variable names. Some languages (such as Python) require a particular layout of the source code by means of whitespace (indentation of blocks).
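As a small illustration of the Python case, the built-in compile function can be used to check whether a piece of source code is indented acceptably. The helper name indentation_is_valid is hypothetical, and the sketch only distinguishes indentation errors from everything else.

```python
def indentation_is_valid(source: str) -> bool:
    """Return False if Python rejects the source because of its indentation.

    Other syntax errors are not handled here and propagate to the caller.
    """
    try:
        compile(source, "<example>", "exec")
    except IndentationError:
        return False
    return True


# The same statements, once with and once without the indented block.
INDENTED = "if True:\n    x = 1\n"
NOT_INDENTED = "if True:\nx = 1\n"

if __name__ == "__main__":
    print(indentation_is_valid(INDENTED))
    print(indentation_is_valid(NOT_INDENTED))
```

The two sample strings differ only in four space characters, yet only the first one is accepted as a program.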

When the characters in a text document are counted, whitespace is sometimes left out of the count.
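The two ways of counting can be compared with a short sketch. The function name count_characters and its parameter are hypothetical; the sketch relies on Python's str.isspace to decide what counts as whitespace, which is one possible definition among several.

```python
def count_characters(text: str, include_whitespace: bool = True) -> int:
    """Count the characters in text, optionally skipping whitespace."""
    if include_whitespace:
        return len(text)
    # str.isspace is Python's own notion of whitespace (an assumption here).
    return sum(1 for ch in text if not ch.isspace())


if __name__ == "__main__":
    sample = "two words\n"
    print(count_characters(sample))
    print(count_characters(sample, include_whitespace=False))
```

For the sample string the first count includes the space and the line break, the second only the eight letters.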

Regular Expressions

For regular expressions, two slightly different definitions of the character class \s or [:space:] are in widespread use. In Perl-compatible regular expressions (PCRE), whitespace includes at least the space (U+0020), the horizontal tab (U+0009), the line feed (U+000A), the form feed (U+000C), and the carriage return (U+000D). In regular expressions according to the POSIX standard, the vertical tab (U+000B) additionally counts as whitespace. In both cases further characters may be added depending on the locale in effect, for example the ideographic space (U+3000) for Japanese.
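How much the meaning of \s depends on the engine and its mode can be tried out in Python. Note the assumptions: Python's re module is neither PCRE nor a POSIX implementation, so the sketch only illustrates the general idea. With the re.ASCII flag, \s covers the six ASCII characters named above; without it, \s also matches further Unicode whitespace. The names CANDIDATES and matching_names are hypothetical.

```python
import re

# Characters named in the text, keyed by a readable label.
CANDIDATES = {
    "space": "\u0020",
    "horizontal tab": "\u0009",
    "line feed": "\u000A",
    "vertical tab": "\u000B",
    "form feed": "\u000C",
    "carriage return": "\u000D",
    "ideographic space": "\u3000",
}

ASCII_WS = re.compile(r"\s", re.ASCII)  # ASCII-only whitespace
UNICODE_WS = re.compile(r"\s")          # Unicode-aware whitespace (default for str)


def matching_names(pattern: "re.Pattern[str]") -> list:
    """Return the labels of all candidates that the pattern matches."""
    return [name for name, ch in CANDIDATES.items() if pattern.fullmatch(ch)]


if __name__ == "__main__":
    print("ASCII mode:  ", matching_names(ASCII_WS))
    print("Unicode mode:", matching_names(UNICODE_WS))
```

In this sketch the ideographic space is matched only in Unicode mode, which plays a role comparable to the locale-dependent extension described above.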

The ECMAScript standard, and therefore JavaScript, has its own definition of the characters that count as whitespace in regular expressions. It includes, among others, the non-breaking space (U+00A0), the byte order mark (U+FEFF), and all characters defined as whitespace in version 3.0 of the Unicode Standard.

Unicode

In Unicode, every code point, that is, every Unicode character, is assigned categories, classes, and properties. Among other things, the characters are divided into general categories (General_Category, gc). The characters regarded as whitespace are found in the category for control characters (Cc) and in the three separator categories for line, paragraph, and space separators (Zl, Zp, and Zs). A dedicated whitespace category does not exist. In addition, every character is assigned a bidirectional class (Bidi_Class, bc). Here a class named White_Space (WS) does exist for use within the Unicode bidirectional algorithm, but it only covers various spaces. Characters such as tabs and line breaks do not count as whitespace here; they are assigned to separate bidirectional classes for common separators (CS), segment separators (S), and paragraph separators (B).
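The general category and the bidirectional class of a character can be looked up with Python's standard unicodedata module. The following sketch prints both values for a few of the characters discussed here; the helper classify and the list SAMPLES are hypothetical names chosen for the example.

```python
import unicodedata

# Tab, line feed, space, non-breaking space, line and paragraph separator.
SAMPLES = ["\u0009", "\u000A", "\u0020", "\u00A0", "\u2028", "\u2029"]


def classify(ch: str) -> tuple:
    """Return (General_Category, Bidi_Class) for a single character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


if __name__ == "__main__":
    for ch in SAMPLES:
        gc, bc = classify(ch)
        print(f"U+{ord(ch):04X}  gc={gc}  bc={bc}")
```

The tab, for instance, is reported as a control character (Cc) with the bidirectional class S, while the ordinary space is a space separator (Zs) with the class WS.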

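26 characters count as whitespace; they are marked with the White_Space property. (Since version 6.3 of the Unicode Standard, the Mongolian vowel separator no longer carries this property, which leaves 25.)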

  • Several control characters, namely the horizontal tab (U+0009), the vertical tab (U+000B), the line feed (U+000A), the form feed (U+000C), and the carriage return (U+000D)
  • The space character (U+0020)
  • The control character for next line (U+0085)
  • The non-breaking space (U+00A0)
  • The Ogham space mark (U+1680) and the Mongolian vowel separator (U+180E)
  • Eleven spaces of various widths, including the thin space, the hair space, and the em-based spaces (U+2000 to U+200A)
  • The line separator and the paragraph separator (U+2028 and U+2029)
  • The narrow no-break space (U+202F)
  • The medium mathematical space (U+205F)
  • The ideographic space (U+3000)
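The list above can be written down as a set of code points, which makes the count easy to check. This is a sketch transcribed from the list in this article, including U+180E, and not read from the Unicode Character Database; the names WHITE_SPACE and has_white_space_property are hypothetical.

```python
# Code points from the list above, grouped in the same order.
WHITE_SPACE = (
    set(range(0x0009, 0x000D + 1))               # tab, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680, 0x180E}   # space, NEL, NBSP, Ogham, Mongolian
    | set(range(0x2000, 0x200A + 1))             # eleven spaces of various widths
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}   # separators and remaining spaces
)


def has_white_space_property(ch: str) -> bool:
    """Check a single character against the transcribed list."""
    return ord(ch) in WHITE_SPACE


if __name__ == "__main__":
    print(len(WHITE_SPACE))
    print(has_white_space_property("\u00A0"))
    print(has_white_space_property("a"))
```

Python's own str.isspace follows a somewhat different rule, so its results are not guaranteed to agree with this set for every character.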

For use in software development, and especially in programming languages, Unicode defines a second property called Pattern_White_Space (named after patterns such as regular expressions) with only 11 characters (U+0009 to U+000D, U+0020, U+0085, U+200E, U+200F, U+2028, and U+2029). In particular, the non-breaking and language-specific spaces are missing here.

This list, too, is only a recommendation and can be modified by the developers of a programming language, although it is recommended to use the Unicode Standard as the basis for any deviating definition.
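What the smaller set means for a programming language can be sketched with a tiny tokenizer that separates words only at Pattern_White_Space characters. The set is transcribed from the code points listed above; the names PATTERN_WHITE_SPACE and split_on_pattern_white_space are hypothetical, and a real language would be free to use a different set.

```python
# The 11 code points listed above for Pattern_White_Space.
PATTERN_WHITE_SPACE = (
    set(range(0x0009, 0x000D + 1))
    | {0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029}
)


def split_on_pattern_white_space(text: str) -> list:
    """Split text into tokens at Pattern_White_Space characters only."""
    tokens = []
    current = []
    for ch in text:
        if ord(ch) in PATTERN_WHITE_SPACE:
            if current:  # close the token collected so far
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    # The non-breaking space (U+00A0) is not in the set, so it does not split.
    print(split_on_pattern_white_space("let\u00A0x = 1"))
```

Because the non-breaking space is not part of the set, "let" and "x" stay together in one token in this example, whereas the ordinary spaces around "=" separate tokens.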
