String (computer science)

In computer science, a character string, or simply string, is a sequence of characters (e.g., letters, digits, special characters and control characters) from a defined character set. Characters may be repeated in a string, and their order is significant. Strings are thus sequences of symbols of finite length.

In programming, a string is a data type that holds a sequence of characters of fixed or variable length. It is mainly used to store words, sentences and whole texts. Almost every programming language has such a data type, and some languages work exclusively with it; examples are sed, awk and bash. In the source code of a computer program, strings represent text that is not interpreted as a programming command but carries information. For example, error messages or other output for the user can be held as strings in the source code, and user input can be stored as strings in variables.

The foundations of programming languages are studied in theoretical computer science. There, the given character set is called the alphabet and the strings are called words. The theory of such words is a subject of formal languages. In the context of programming languages, by contrast, the questions that arise concern the representation, storage and handling of strings.

Representation

Strings can be represented at different levels. One is the source code of a program, which is read and interpreted by the translator. Another is the way a string is stored in memory while the program is running.

Syntax for literals

In general, a string literal is written in source code by simply placing the characters one after another, enclosed in single or double quotes.

Such literals must normally be written on a single line. In some programming languages, such as Python, strings delimited by tripled quotation marks may span multiple lines.
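The literal forms described above can be sketched in Python. The text "Wikipedia" and the variable names are only sample values chosen for illustration.

```python
# Single and double quotes delimit the same kind of string literal.
single = 'Wikipedia'
double = "Wikipedia"

# A triple-quoted literal may span several source lines;
# the line break becomes part of the string.
multi = """first line
second line"""

print(single == double)      # True
print(multi.splitlines())    # ['first line', 'second line']
```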

Internal representation

There are several methods for storing strings efficiently. For example, one character of the character set in use can be defined as a terminator. A string then ends before the first occurrence of that character. Another possibility is to store the length of the string separately.

Representation with terminator

In programming languages such as C, strings are stored contiguously in memory and terminated with the null character (NUL in ASCII). The null character is the character whose binary representation consists only of zeros. Consider a string of 5 characters stored in a buffer 10 bytes long.
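Such a buffer can be modelled in Python with a bytearray. This is only a sketch of the memory layout, not C code; the payload "FRANK" and the leftover bytes "kefw" are assumed sample values.

```python
# A 10-byte buffer holding the 5-character string "FRANK" in C style.
buffer = bytearray(10)      # ten bytes, all initially zero
buffer[0:5] = b"FRANK"      # the payload occupies bytes 0..4
buffer[5] = 0               # the terminating NUL sits at index 5
buffer[6:10] = b"kefw"      # leftover bytes that are not part of the string

# Print every byte of the buffer with its index.
for index, byte in enumerate(buffer):
    print(index, "NUL" if byte == 0 else chr(byte))
```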

The length of this string is 5, but it occupies 6 bytes in the buffer. Bytes after the NUL character no longer belong to the string; they may belong to another string or simply be unused. A string in C is an array of type char that contains a null character as its end marker. That is why such strings are also called null-terminated. Since the null character itself also occupies storage, the memory required by a string is always at least one character larger than its usable length. The "string length" is the number of characters before the end marker. It is determined by the C function strlen().

The advantage of this method is that the length of a string is limited in practice only by the available memory. The disadvantages are that the string cannot contain null characters and that handling is comparatively cumbersome and inefficient; for example, the length of such a string can only be determined by counting its characters.
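The counting mentioned above can be sketched in Python on a bytes object. The function name c_strlen is hypothetical; it only imitates what C's strlen() does and is not the C library routine itself.

```python
def c_strlen(buffer: bytes) -> int:
    """Count the bytes before the first NUL byte, one at a time."""
    length = 0
    while buffer[length] != 0:   # raises IndexError if there is no terminator
        length += 1
    return length

print(c_strlen(b"FRANK\x00kefw"))   # 5
print(c_strlen(b"AB\x00CD\x00"))    # 2, an embedded NUL cuts the string short
```

The second call shows why a null-terminated string cannot contain a null character: everything after the first NUL is invisible to the length count.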

Representation with separate length specification

Another way to store strings, used in the programming languages Pascal, BASIC, PL/I and others, is to record the length explicitly instead of marking the end.

Strings stored in this way cannot exceed a certain length. In Turbo Pascal, for example, the length is stored in the "zeroth" character. Since a character is 8 bits wide, the length is limited to 255 characters. The successor language Object Pascal expanded the length field to 31 bits and supports strings of up to 2 gigabytes in length. In REXX, the length is stored in four bytes, so the maximum length is virtually unlimited for most practical purposes.
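A one-byte length field of the Turbo Pascal kind can be sketched in Python as follows. The function names pascal_encode and pascal_decode are hypothetical, and the Latin-1 encoding is an assumption made so that every character fits into one byte.

```python
def pascal_encode(text: str) -> bytes:
    """Store the length in the "zeroth" byte, followed by the characters."""
    data = text.encode("latin-1")   # assumption: one byte per character
    if len(data) > 255:
        raise ValueError("a one-byte length field allows at most 255 characters")
    return bytes([len(data)]) + data

def pascal_decode(stored: bytes) -> str:
    """Read the length byte, then exactly that many characters."""
    length = stored[0]
    return stored[1:1 + length].decode("latin-1")

print(pascal_encode("FRANK"))                  # b'\x05FRANK'
print(pascal_decode(pascal_encode("A\x00B")))  # a NUL inside the string survives
```

Unlike the terminator method, the length is available without counting, and null characters may appear inside the string.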

Storage in the pool

Storing strings takes a lot of memory and is a very common task. Many high-level languages therefore use a special management system to make it as efficient as possible. This system is, however, out of reach of the application programmer; as a rule there is no way to access it directly or even to determine whether it is active.

All strings are stored in a central "pool". The goal is that each required string is stored exactly once. The variable in the application program receives only an identification number with which it can access the string when needed.

The management system uses fast and efficient methods (usually a hash table) for its organization. Every time a string is to be stored, a check is made to see whether an identical one is already known. If so, the identification number of the existing string is returned; otherwise, a new entry must be created.

Every time a string is stored, its reference count is incremented by one. When a string is no longer needed at some point in the program (because a subroutine has finished and the literals in it have become meaningless, or because a variable is given a different value), this is reported to the management system and the reference count is decremented by one. This makes it possible to determine which of the stored strings are currently in use: if the reference count is zero, the string is not in use at the moment. When space runs short, the system could therefore reorganize its storage and delete unneeded strings (garbage collection). This is avoided where possible, however, because identical strings might otherwise be created anew on every call of a repeatedly invoked subroutine; advanced management systems record how often strings are stored and delete only long strings that are rarely used.
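The pool described above can be sketched as a small Python class. The class StringPool and its method names are hypothetical and serve only to illustrate the idea (hash table lookup, identification numbers, reference counts, deletion of unused entries); real language runtimes organize this differently and usually hide it from the programmer.

```python
class StringPool:
    """Toy string pool: each distinct string is stored exactly once."""

    def __init__(self):
        self._ids = {}        # hash table: string -> identification number
        self._strings = {}    # identification number -> string
        self._refcount = {}   # identification number -> reference count
        self._next_id = 0

    def store(self, text: str) -> int:
        """Return the id of text, creating a new entry only if needed."""
        ident = self._ids.get(text)
        if ident is None:                 # not yet known: create the entry
            ident = self._next_id
            self._next_id += 1
            self._ids[text] = ident
            self._strings[ident] = text
            self._refcount[ident] = 0
        self._refcount[ident] += 1        # one more user of this string
        return ident

    def lookup(self, ident: int) -> str:
        return self._strings[ident]

    def release(self, ident: int) -> None:
        """Report that one user no longer needs the string."""
        self._refcount[ident] -= 1

    def collect(self) -> int:
        """Delete all strings whose reference count is zero."""
        unused = [i for i, n in self._refcount.items() if n == 0]
        for ident in unused:
            del self._ids[self._strings.pop(ident)]
            del self._refcount[ident]
        return len(unused)

pool = StringPool()
a = pool.store("FRANK")
b = pool.store("FRANK")     # identical string: same id, count becomes 2
print(a == b)               # True
pool.release(a)
pool.release(b)
print(pool.collect())       # 1
```

This sketch deletes every unused string on collect(); the frequency-based policy mentioned above is deliberately left out.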

In a programming language whose source code is compiled and the result stored in an object file, the static strings that remain after all preprocessor operations have been resolved are usually managed in a similar table in the data section of that file. Here, however, there is neither deletion nor reference counting. These literals are also not available to the central string management, because with dynamic linking there is no guarantee that this data section is always loaded.

Multibyte characters

Traditionally, 8 bits, that is one byte, were used to represent a single character, which allows up to 256 different characters. This is not enough to process characters from many languages at the same time, in particular non-Latin scripts such as Greek.

Programming environments now provide 2 or 4 bytes for storing a single character; consequently, the word byte is avoided in this context today and one speaks generally of characters (char).
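The difference between characters and bytes can be observed in Python by encoding the same text in several ways. The sample text and the choice of encodings (Latin-1 as a one-byte encoding, UTF-8, UTF-16 and UTF-32 as Unicode encodings) are assumptions made for illustration.

```python
text = "Ωmega"   # one Greek letter followed by four Latin letters

print(len(text))                        # 5 characters
print(len(text.encode("utf-8")))        # 6 bytes: the Greek letter needs two
print(len(text.encode("utf-16-le")))    # 10 bytes: 2 bytes per character
print(len(text.encode("utf-32-le")))    # 20 bytes: 4 bytes per character

# A one-byte character set such as Latin-1 has no code for the Greek letter.
try:
    text.encode("latin-1")
    fits_in_one_byte = True
except UnicodeEncodeError:
    fits_in_one_byte = False
print(fits_in_one_byte)                 # False
```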

Under Microsoft Windows, all system functions that use strings are available in a version with the suffix A (for ANSI, meaning 1 byte per character according to ISO 8859) and in a version with the suffix W (for wide, multibyte). It is simpler, however, not to specify this explicitly: if a program is compiled with the appropriate option, all neutral function calls are switched automatically to 1 byte per character or to multibyte. Likewise, there are preprocessor macros for the programming languages C++ and C with which all standard functions and literals can be written in an undetermined form in the source code; the appropriate function is then used at compile time. By definition, the historical standard functions in C always process exactly 1 byte per character.

Internally, it is now common in virtually all current programming languages to use several bytes per character and to store in them the larger code numbers of UCS ("Unicode").

A proprietary intermediate form was in use on Microsoft systems in the 1990s under the name "Multibyte Character Set". Various formats and encodings were used to work around the problem of having to cover Asian scripts as well with 1 byte per character. It is still supported for input and output; internal representations and new developments no longer use it, however, and use Unicode instead.

Basic operations with strings

The basic operations on strings that occur in almost all programming languages are copying, determining the length, concatenation, forming substrings, pattern matching, and searching for substrings or individual characters.

In many high-level languages, strings are copied with the assignment operator (usually "=" or ":="). In C, copying is done with the standard function strcpy. How time-consuming copying is depends heavily on the representation of the strings. With reference counting, copying consists only of incrementing the reference counter. With other methods, the entire string may have to be copied.

Many programming languages provide operators for concatenation, such as "+" (BASIC, Pascal, Python, Java), "&" (Ada, BASIC), "." (Perl, PHP) or "||" (REXX). In C, the function strcat serves this purpose.

To append to an existing string, some languages provide a dedicated operator ("+=" in Java and Python, ".=" in Perl and PHP). Usually the operand is not simply appended in place; instead, the expression old + new is evaluated and the result is assigned to the variable old, since strings are usually regarded as immutable. It is therefore merely a shorthand notation. Many modern programming languages, such as Java, C# or Visual Basic .NET, do however offer so-called string builder classes that represent mutable strings. String and string builder are usually not interchangeable, though, and must be converted into one another.
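The shorthand nature of the append operator can be observed in Python. Python has no class called string builder; io.StringIO is used here as an assumed stand-in for the same idea, and the variable names are hypothetical.

```python
import io

old = "FRANK"
alias = old               # a second name for the same string object
old += "ENSTEIN"          # shorthand for: old = old + "ENSTEIN"
print(old)                # FRANKENSTEIN
print(alias)              # FRANK, the original string was not modified

# A mutable buffer collects the parts and is converted to a string at the end.
builder = io.StringIO()
for part in ("FRANK", "EN", "STEIN"):
    builder.write(part)
result = builder.getvalue()
print(result)             # FRANKENSTEIN
```

The final getvalue() call is the explicit conversion from the mutable buffer back to an ordinary string.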

In some languages (Python, REXX), strings written directly one after another (with or without whitespace between them) are concatenated implicitly.
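In Python this implicit concatenation applies to adjacent string literals, as the following short sketch shows. The texts are sample values only.

```python
# Adjacent literals are joined by the compiler, with or without whitespace.
name = "Ottos" " " "Mops"
tight = "Ottos"' Mops'      # different quote styles may be mixed

# A common use is splitting a long literal across source lines.
message = ("This literal is split "
           "across two source lines.")

print(name)      # Ottos Mops
print(message)   # This literal is split across two source lines.
```

Note that Python inserts nothing between the parts; any space must be part of one of the literals.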

There are several ways to obtain a substring. A substring can be specified unambiguously by (string, start index, end index) or by (string, start index, length). This operation is often called substr. Some programming languages, such as Python, provide syntactic sugar for this operation (see examples).
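The two ways of specifying a substring can be compared in a short Python sketch. The helper names substr_by_length and substr_by_end are hypothetical; indices are counted from 0 and the end index is exclusive, as is the convention in Python.

```python
def substr_by_length(text: str, start: int, length: int) -> str:
    """Substring given as (string, start index, length)."""
    return text[start:start + length]

def substr_by_end(text: str, start: int, end: int) -> str:
    """Substring given as (string, start index, end index), end exclusive."""
    return text[start:end]

s = "Ottos Mops"
print(substr_by_length(s, 6, 4))   # Mops
print(substr_by_end(s, 6, 10))     # Mops
print(s[6:10])                     # Mops, the same operation as slice syntax
```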

PL/SQL

In Oracle, the following basic operations are possible in stored procedures, functions and PL/SQL blocks:

    DECLARE
      Text1 varchar2(30);
      Text2 varchar2(30);
      Text3 varchar2(61);
    BEGIN
      Text1 := 'Frank';
      Text2 := 'Meier';
      Text3 := Text1 || ' ' || Text2;
    END;
    /

BASIC

    text$ = "FRANK"
    text2$ = text$

The trailing dollar sign indicates that the variable is a string variable. Because a string is delimited by double quotes, a double quote itself can only be built into the string with the Chr(34) or CHR$(34) function; 34 is the ASCII code of the double quotation mark.

Depending on the BASIC dialect, several strings can be joined ("concatenated") with the plus sign or the ampersand "&":

    text2$ = "***" + text$ + "***"
    text2$ = "***" & text$ & "***"

C

This C program defines two string variables that can each hold 5 characters of "payload". Since strings are terminated with a null character, each array must have 6 elements. The text "FRANK" is then copied into both variables.

    #include <string.h>

    int main(void) {
        char text1[6];  /* 5 characters plus the terminating null character */
        char text2[6];

        strcpy(text1, "FRANK");
        strcpy(text2, text1);

        return 0;
    }

A simple way to concatenate two strings is provided by the standard function strcat:

    #include <string.h>

    int main(void) {
        char buffer[13];  /* room for "FRANKENSTEIN": 12 characters plus the null character */

        strcpy(buffer, "FRANK");
        strcat(buffer, "ENSTEIN");

        return 0;
    }

Java

    String text1 = "FRANK";
    String text2 = text1;

Strings in Java are objects of the class String. They cannot be changed after creation. In the example above, text1 and text2 refer to the same object.

Strings are concatenated with the plus operator, which is overloaded for this purpose:

    String text1 = "FRANK";
    String text2 = "ENSTEIN";
    String fullName = text1 + text2;

Pascal

(Strictly speaking, the following has only worked since Turbo Pascal, because the original Pascal language created by Niklaus Wirth knew only packed arrays of char, which were somewhat more cumbersome to handle.)

    var firstname, lastname, name: string;
    { ......}
    firstname := 'FRANK';
    lastname := 'SMITH';
    name := firstname + ' ' + lastname;

PHP

In PHP, strings behave much as they do in Perl.

    $text = "FRANK";
    $text2 = $text; // $text2 yields "FRANK"
    $text3 = <<<HEREDOC
    I am a longer text with quotes, such as " or '
    HEREDOC;

Texts are concatenated with a period.

    $text = "FRANK";
    $text = "FRANK" . "ENSTEIN"; // $text yields "FRANKENSTEIN"

    $text = "FRANK";
    $text .= "ENSTEIN"; // $text yields "FRANKENSTEIN"

Rexx

In Rexx, everything, including numbers, is represented as a string. A string value is assigned to a variable like this: a = "Ottos Mops". The following expressions each evaluate to "Ottos Mops":

  • "Ottos" "Mops" (implicitly concatenated; exactly one space is inserted automatically)
  • "Ottos" || ' Mops' (explicitly concatenated; no space is inserted)
  • "Ottos"' Mops' (implicitly concatenated by directly appending another string that is delimited by the other kind of quotation mark)

Further operations

Identify substrings

Suppose the variable s contains the string Ottos Mops hopst fort. Then the first character (O), the first five characters (Ottos), the seventh to tenth (Mops) and the last four (fort) can be determined as follows:

Python

  • s[0] → O
  • s[:5] or s[0:5] → Ottos
  • s[6:10] → Mops
  • s[-4:] → fort

This process is called slicing. The first character has index 0.

Rexx

  • SubStr(s, 1, 1) or Left(s, 1) → O
  • Left(s, 5) or Word(s, 1) → Ottos
  • SubStr(s, 7, 4) or Word(s, 2) → Mops
  • Right(s, 4) or Word(s, 4) → fort

Rexx can also process strings word by word, where words are separated by any number of spaces. As in Pascal strings, the first character has index 1.

  • PARSE VAR s A 2 1 O M F ⇒ the variables A, O, M, F contain 'O', 'Ottos', 'Mops', 'hopst fort'

This process is called tokenizing (from "token", here meaning roughly "piece" or "chunk") and is a standard feature in other languages as well.
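Word-by-word processing of this kind can be sketched in Python with str.split. The sample sentence and the variable names are chosen for illustration; this is an analogue of the Rexx behaviour, not a translation of PARSE.

```python
s = "Ottos   Mops hopst fort"      # extra spaces between the first two words

# Without arguments, split() treats any run of whitespace as one separator.
words = s.split()
print(words)                       # ['Ottos', 'Mops', 'hopst', 'fort']

# With maxsplit, the last variable receives the rest of the string.
first, second, rest = s.split(maxsplit=2)
print(first, second, rest, sep="|")   # Ottos|Mops|hopst fort
```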

PHP

  • substr($s, 0, 5) → Ottos
  • substr($s, 6, 4) → Mops
  • substr($s, -4) → fort

Blitz Basic

  • Left(s, 5) → Ottos
  • Mid(s, 7, 4) → Mops
  • Right(s, 4) → fort

Algorithms

Various algorithms work primarily with strings:

  • String matching algorithms check whether a string is part of a larger string.
  • Determining substrings that are described by a regular expression.
  • Sorting algorithms, in the context of strings mainly for sorting the suffixes of a text (suffix tree, suffix array).
  • Parsers
  • Code conversions (Unicode, etc.)

Today, programmers mostly no longer write algorithms of this kind themselves, but use constructs of a language or library functions.
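As an illustration of the first two items and of the remark about library functions, the following Python sketch places a hand-written naive string matcher next to the built-in str.find and a regular expression search. The function find_naive and the sample text are hypothetical; the naive method is only the simplest of many string matching algorithms.

```python
import re

def find_naive(text: str, pattern: str) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    for start in range(len(text) - len(pattern) + 1):
        # Compare the pattern with the window beginning at start.
        if text[start:start + len(pattern)] == pattern:
            return start
    return -1

text = "Ottos Mops hopst fort"
print(find_naive(text, "Mops"))        # 6
print(text.find("Mops"))               # 6, the library function agrees
print(find_naive(text, "Katze"))       # -1

# Substrings described by a regular expression: words with a capital initial.
capitalized = re.findall(r"[A-Z]\w+", text)
print(capitalized)                     # ['Ottos', 'Mops']
```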

Buffer overflow: strings and computer security

Whenever strings are taken from the outside world into the internal representation, special precautions should be taken. In addition to unwanted control characters and formatting, the maximum length of the string in particular must be checked.

Example: An international telephone number is to be read from a file. It should contain only digits and be separated from the address by a tab character (ASCII 9). A fixed-length string of 16 characters is provided to hold it; this is sufficient for all valid telephone numbers. The input data, however, could contain spaces or hyphens and thus lengthen the telephone number. Even if a space, which merely looks the same, accidentally follows the number instead of a TAB, the result is more than 16 characters.

If this is not checked by suitable tests and handled appropriately, a buffer overflow occurs, which may crash the program or cause mysterious follow-on errors.

Buffer overflows are among the most common methods of attack on web servers. The attacker tries to assign to a string variable a content whose length exceeds the length of the variable. This overwrites other, neighboring variables in memory. With skillful use of this effect, a program running on a server can be manipulated and abused for attacks on the server. It can already be enough, however, to make the server software crash; since that software is supposed to guard the network connection ("gateway"), its failure opens a gap that leaves a weakly secured server defenseless against any manipulation.

Unless validity has already been checked in a controlled environment, string operations should be carried out only with functions that check the maximum length of the string. In C these would be functions such as strncpy(), snprintf(), ... (instead of strcpy(), sprintf(), ...).
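The checks from the telephone number example can be sketched in Python. Python strings cannot overflow a buffer the way C arrays can, so this sketch only illustrates the validation step; the function name read_phone_field, the constant MAX_PHONE_LENGTH and the line format "number, TAB, address" are assumptions based on the example above.

```python
MAX_PHONE_LENGTH = 16   # size of the fixed-length field in the example

def read_phone_field(line: str) -> str:
    """Extract and validate the telephone number from one input line."""
    phone, separator, _address = line.partition("\t")
    if not separator:
        raise ValueError("no tab character between number and address")
    if len(phone) > MAX_PHONE_LENGTH:
        raise ValueError("telephone number longer than 16 characters")
    if not (phone.isascii() and phone.isdigit()):
        raise ValueError("telephone number may contain digits only")
    return phone

print(read_phone_field("4930123456\tBerlin"))   # 4930123456

# A space instead of the TAB is detected instead of overrunning the field.
try:
    read_phone_field("4930123456 Berlin Alexanderplatz")
except ValueError as error:
    print("rejected:", error)
```

Rejecting the input is only one possible reaction; truncating or asking for corrected input would be alternatives, depending on the application.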
