Sed

sed stands for stream editor and is a Unix tool for editing text data streams. The data stream can also be read from a file. Unlike with a text editor, the original file is not changed.

The sed command set is based on that of the line-oriented text editor ed. For text matching, a particular variety of regular expressions is used, the so-called (POSIX) Basic Regular Expressions (BRE), as laid down in the POSIX specification. However, the widespread GNU implementation uses GNU BREs, which differ slightly from POSIX BREs.

Even though the language scope of sed appears rather limited and specialized, it is nevertheless a Turing-complete language. Turing-completeness can be proven by programming a Turing machine in sed or by writing an interpreter for another Turing-complete language in sed.


Operation

sed can work both within a pipeline and on files. Output always goes to stdout, error messages to stderr. The typical call therefore looks as follows:

    sed 'Instruction1
         Instruction2
         ...
         InstructionN' inputfile > outputfile

    ... | sed 'Instruction1
               Instruction2
               ...
               InstructionN' | ...

sed reads an input file (or an input stream on stdin) one line at a time. This input first lands in the so-called pattern space. Every instruction of the given program is then executed on this pattern space in sequence. Each of these instructions may change the pattern space; the following instructions are then executed on the result of the preceding one. If an instruction deletes the pattern space (as the d command does), processing stops at that point and the instruction list starts over with the next input line. Otherwise, the result of the last instruction is written to stdout and the instruction list likewise starts over with the next input line.
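The processing cycle described above can be tried out with a small sketch. The function name demo_cycle and the sample lines are assumptions chosen for illustration. The program chains two substitutions, so that the second one operates on the result of the first, and it deletes lines containing "skip", which ends the cycle for those lines without any output.

```shell
# Hypothetical helper: runs a three-instruction sed program on stdin.
demo_cycle() {
    sed '/skip/d
         s/cat/dog/
         s/dog/bird/'
}

printf '%s\n' 'a cat' 'skip this cat' 'a dog' | demo_cycle
# prints:
# a bird
# a bird
```

The first line becomes "a dog" and then "a bird", which shows that each instruction sees the pattern space as left by the previous one.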

Programming

sed instructions can be roughly divided into three groups: text manipulation, branches, and others. (Most sed manuals, as well as the POSIX specification, instead divide the instructions into 2-address, 1-address and address-less ones, see below, but this grouping is not suitable for introductory purposes.)

Text manipulation

This is by far the most frequently used function, and the instruction set is particularly rich here. In general, an instruction has the following structure (2-address command):

    Address1,Address2 command [options]

Address1 and Address2 can both be omitted. If both addresses are specified, the command is executed for every line from the one that matches Address1 up to the one that matches Address2. If neither Address1 nor Address2 is specified, the command is executed for every line; if only Address2 is omitted, the command is executed only for lines that match Address1. An address is either a line number or a regular expression. Regular expressions are enclosed in two / characters. Two examples:

    sed '/start/,/end/ s/old/NEW/' inputfile

    Input       Output
    x old       x old
    start       start
    y old       y NEW
    end         end
    z old       z old

"old" is replaced by "NEW", but only from the line that contains "start" up to the line that contains "end" (2-address variant). In contrast, in the second example the same substitution is carried out in all lines that begin with "y" or "z" (1-address variant):

    sed '/^[yz]/ s/old/NEW/' inputfile

    Input       Output
    x old       x old
    start       start
    y old       y NEW
    end         end
    z old       z NEW
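Addresses may also be line numbers, a case the examples here do not cover. The following sketch, with an assumed helper name replace_in_lines and made-up sample input, applies the same substitution only to input lines 2 to 3:

```shell
# Hypothetical helper: replace "old" by "NEW" on input lines 2 to 3 only.
replace_in_lines() {
    sed '2,3 s/old/NEW/'
}

printf '%s\n' 'x old' 'y old' 'z old' 'w old' | replace_in_lines
# prints:
# x old
# y NEW
# z NEW
# w old
```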

Instead of a single command, command can also be a list of instructions enclosed in { ... }. The rules described above apply to these instructions as well; they may in turn consist of further compound commands. An example:

    sed '/^[yz]/ {
            s/^\([yz]\)/(\1)/
            s/old/NEW/
         }' inputfile

    Input       Output
    x old       x old
    start       start
    y old       (y) NEW
    end         end
    z old       (z) NEW

Branches

sed has two types of branches: unconditional branches (jump instructions) and conditional ones, which are executed depending on whether or not a substitution was made beforehand. A typical example is the following: a source text was indented with leading tab characters, and each of these leading tabs is to be replaced by 8 blanks. Tabs other than those at the beginning of the line may occur in the text but must not be changed. The problem is that multiplicative relations (replace N tabs by N * 8 blanks) cannot be expressed as a RegExp. On the other hand, a global replacement would also affect the tab characters within the text. Therefore, a loop is formed with jump instructions (in the following, blanks and tabs are written as "<b>" and "<t>" for clarity):

    sed ':start
         /^<b>*<t>/ {
                      s/^\(<b>*\)<t>/\1<b><b><b><b><b><b><b><b>/
                      b start
                    }' inputfile

In each line, the first tab character is replaced by 8 blanks, provided that it is preceded only by zero or more blanks; the jump instruction b then makes program execution return to the first line. Once the last leading tab character has been replaced, the expression /^<b>*<t>/ no longer matches and the block is not executed, so that the program reaches its end and the next line is read.

Here the resemblance to assembly language becomes clear: a condition and a label are used to build a control structure comparable to the repeat-until loop common in high-level languages.
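The conditional branch mentioned at the start of this section can be sketched for the same task. This is an illustration under assumptions, not the example given in the text: the helper name expand_leading_tabs is made up, and the literal tab and the 8 blanks are produced with printf. The t command jumps back to the label only if a substitution has been made since the last input line was read or the last t was taken.

```shell
# Hypothetical helper: expand each leading tab to 8 blanks using "t".
expand_leading_tabs() {
    tab=$(printf '\t')        # a literal tab character
    blanks=$(printf '%8s' '') # eight blanks
    sed -e ':start' \
        -e "s/^\( *\)${tab}/\1${blanks}/" \
        -e 't start'
}

# Two leading tabs are expanded; the tab between x and y is left alone.
printf '\t\tx\ty\n' | expand_leading_tabs
```

Compared with the b variant, the explicit address test is no longer needed, because a failed substitution ends the loop by itself.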

Other instructions

Hold Space Manipulation

A powerful (although relatively unknown) feature of sed is the so-called hold space. This is a freely available memory area that works similarly to the accumulator known from some assembly languages. Direct manipulation of the data in the hold space is not possible, but data in the pattern space can be moved or copied to the hold space, or exchanged with its contents. Appending the pattern space to the hold space, or vice versa, is also possible.

The following example illustrates the function of the hold space: the text of a "chapter heading" is stored and appended to each line of the respective "chapter", while the line with the chapter heading itself is suppressed:

    sed '/^=/ {
            s/^=//
            s/^/ (/
            s/$/)/
            h
            d
         }
         G
         s/\n//' inputfile

    Input:
    =Chapter 1
    Line 1
    Line 2
    Line 3
    =Chapter 2
    Line A
    Line B
    Line C

    Output:
    Line 1 (Chapter 1)
    Line 2 (Chapter 1)
    Line 3 (Chapter 1)
    Line A (Chapter 2)
    Line B (Chapter 2)
    Line C (Chapter 2)

Whenever a line begins with "=", the instruction block is executed, which removes this character and gives the rest of the line a leading blank and parentheses. This text is then copied to the hold space ("h") and deleted from the pattern space ("d"), which ends the program for this line and causes the next line to be read. Since the condition of the opening block does not apply to the "normal" lines, only the last instructions are carried out for them: "G" appends the contents of the hold space to the pattern space, and the final substitution removes the line separator that G inserts between the two parts.
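The hold space operations can be combined in other ways as well. As a further sketch, not taken from the chapter example, the well-known idiom below prints its input in reverse line order by accumulating all lines in the hold space; the function name reverse_lines is an assumption.

```shell
# Hypothetical helper: print the input lines in reverse order.
reverse_lines() {
    # 1!G  on every line but the first, append the hold space to the pattern space
    # h    copy the pattern space (current line plus all earlier ones) to the hold space
    # $p   on the last line, print the accumulated text
    sed -n '1!G
            h
            $p'
}

printf '%s\n' one two three | reverse_lines
# prints:
# three
# two
# one
```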

Multi-line instructions

Not all text manipulations can be performed within individual lines. Sometimes information from other lines must be included in the decision-making, and sometimes substitutions must be performed across lines. For this purpose, the sed programming language provides the instructions N, P and D, with which several lines of the input text can be loaded into the pattern space at the same time ("N") and parts thereof can be printed ("P") or deleted ("D"). A typical example is the following one-liner (actually two one-liners), which provides a text with line numbers:

    sed '=' inputfile | sed 'N; s/\n/<t>/'

Here "<t>" stands for a literal tab character. The first sed call prints, for each line of the input text, the line number and then the line itself. The second sed call joins each pair of these lines into a single one by first reading the respective following line ("N") and then replacing the automatically inserted line separator ("\n") with a tab character.
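P and D do not appear in the numbering example. A minimal sketch of how N, P and D interact is the following common idiom, which drops consecutive duplicate lines; the helper name squeeze_duplicates and the sample input are assumptions.

```shell
# Hypothetical helper: remove consecutive duplicate lines.
squeeze_duplicates() {
    # $!N               unless on the last line, append the next line to the pattern space
    # /^\(.*\)\n\1$/!P  if the two lines differ, print the first of them
    # D                 delete the first line and restart with the remainder
    sed '$!N
         /^\(.*\)\n\1$/!P
         D'
}

printf '%s\n' a a b b b c | squeeze_duplicates
# prints:
# a
# b
# c
```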

Applications, options, notes

Capacity limits

sed is not subject to any (real) limitations regarding file sizes. Apart from the available disk space, which represents a practical limit, most implementations realize the line counter as int or long int. With today's common 64-bit processors, the risk of an overflow can therefore be neglected.

However, like most text-manipulating tools in UNIX, sed is subject to a limitation regarding the line length (more precisely, the number of bytes up to the next newline character). The minimum size is defined by the POSIX standard; the actual size can vary from system to system and can be looked up as the value of the constant LINE_MAX in the header file /usr/include/limits.h. The length is specified in bytes, not characters (which is why a conversion is needed, for instance, when processing UTF-encoded files that represent single characters with multiple bytes).
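On systems that provide the POSIX getconf utility, the limit can also be queried from the shell instead of being read from the header file. The sketch below assumes that getconf is installed and falls back to the word "unknown" otherwise; the variable name line_max is made up for illustration, and the value printed depends on the system.

```shell
# Query the line length limit; "unknown" is an assumed fallback value.
line_max=$(getconf LINE_MAX 2>/dev/null || echo unknown)
echo "LINE_MAX on this system: ${line_max} bytes"
```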

Greediness

Within regexps, a distinction is made between greedy and non-greedy matching. sed regexps are always greedy, which means that the RegExp always matches the longest possible range (marked here with square brackets):

    /a.*B/                           "a", followed by zero or more arbitrary characters, followed by "B"
    [axyBBBskdjfhaaBBpweruB]jdfh     longest possible range (greedy)
    [axyB]BBskdjfhaaBBpweruBjdfh     shortest possible range (non-greedy)

The reason is that sed is optimized for speed and non-greedy regexps would require costly backtracking. If non-greedy behavior is to be forced, this is usually achieved with negated character classes. In the example above:

    /a[^B]*B/                        "a", followed by zero or more non-"B", followed by "B"

Practical limitations in shell programming

It should be mentioned that the most frequent application of sed (and of awk, tr and similar filter programs) in practice is, strictly speaking, an abuse: the ad hoc manipulation of the output of other commands, like so:

    ls -l /path/to/myfile | sed 's/^\([^ ][^ ]*\) .*/\1/'    # prints file type and file mode

Since every call of an external program requires the costly system call fork(), shell-internal methods such as the so-called variable expansion are usually preferable to calling external programs, even if they take considerably longer to write. The rule of thumb is: if the output of the filtering process is a file or a data stream, the filter program should be used; otherwise variable expansion is preferable.
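The rule of thumb can be illustrated with a sketch that strips the directory part from a path held in a shell variable. The variable names and the sample path are assumptions. Both forms give the same result, but the parameter expansion runs inside the shell without starting another process.

```shell
path=/path/to/myfile

# External filter: starts an additional process for the sed call.
name_sed=$(printf '%s\n' "$path" | sed 's|.*/||')

# Shell-internal parameter expansion: removes the longest prefix matching "*/".
name_exp=${path##*/}

echo "$name_sed $name_exp"
# prints: myfile myfile
```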

In-Place Editing

Because of the way sed performs text manipulation, it cannot work directly on the input file. A separate file is needed for the output, which is then, if required, copied over the input file.

    sed '... ...' /path/to/inputfile > /path/to/output
    mv /path/to/output /path/to/inputfile

This is how the POSIX standard provides for it. In addition to the POSIX standard, the GNU version of sed offers the command-line option -i. It seemingly allows a file to be changed without a detour (in place), but in fact a temporary file is created in the background here as well. This file is not deleted if an error occurs, and the metadata (owner, group, inode number, ...) of the original file are changed in any case.
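A minimal sketch of the POSIX-style procedure as a shell function follows. The name sed_replace_file, the ".tmp" suffix and the argument order are assumptions made for illustration. The original is overwritten only if sed succeeded, and the replaced file still ends up with new metadata because it is a different file.

```shell
# Hypothetical helper, called as: sed_replace_file SCRIPT FILE
sed_replace_file() {
    script=$1
    file=$2
    tmp="$file.tmp"    # assumed temporary name; must not already exist
    if sed "$script" "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"   # leave the original untouched on failure
        return 1
    fi
}
```

A call would look like sed_replace_file 's/old/NEW/' /path/to/inputfile.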

RegExp notation

It has become customary to delimit regular expressions with slashes, as in the examples above. sed, however, does not require this. Any character that follows a substitution command is accepted as the delimiter and is then expected in the rest of the command. These two instructions are therefore equivalent:

    s/^\([^ ][^ ]*\) \([^ ][^ ]*\)/\2 \1/     swaps the first and second word of a line
    s_^\([^ ][^ ]*\) \([^ ][^ ]*\)_\2 \1_     the same with "_" instead of "/"

This is convenient if the slash is needed as part of the RegExp, because it saves the tedious escaping (prefixing with a backslash so that the character is taken literally). One simply switches to another, unused character.
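A typical case is a path name, where every slash would otherwise need a backslash. The sketch below uses made-up paths and two hypothetical helpers; both sed programs are equivalent, and the second one uses "|" as the delimiter.

```shell
# Hypothetical helpers: the same substitution with "/" and with "|" as delimiter.
relocate_escaped() { sed 's/\/usr\/local/\/opt/'; }
relocate()         { sed 's|/usr/local|/opt|'; }

echo '/usr/local/bin/tool' | relocate_escaped   # prints /opt/bin/tool
echo '/usr/local/bin/tool' | relocate           # prints /opt/bin/tool
```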

Some typical methods

Deletion of parts of text

This is done by replacing the text with nothing. Explicit deletion of part of a line is provided only for the range from the beginning of the line up to the first line separator (D). The expression

    /expression/d

however, does NOT delete the matched part but every line that contains expression! Here, expression acts as the address (see above, 1-address variant of the command d).
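The difference between the two forms can be seen in a short sketch; the helper names and the sample input are assumptions.

```shell
# Hypothetical helpers contrasting substitution by nothing with the d command.
delete_part() { sed 's/foo//'; }   # removes the matched text only
delete_line() { sed '/foo/d'; }    # removes every line containing the match

printf '%s\n' 'a foo b' 'c d' | delete_part
# prints:
# a  b
# c d

printf '%s\n' 'a foo b' 'c d' | delete_line
# prints:
# c d
```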

Matching at least one character

The quantifier \+ for one or more occurrences of the preceding expression is not provided within POSIX BREs, in contrast to GNU BREs. To write portable sed scripts that run not only with GNU sed, the expression should therefore be doubled and the * quantifier (zero or more) used.

    /xa\+y/      GNU variant of: "x", followed by one or more (but not zero) "a", followed by "y"
    /xaa*y/      the same in POSIX: "x", followed by "a", followed by zero or more "a", followed by "y"

Replacement of several or all occurrences within a line

Without further options, only the first occurrence of the search text in each line is subject to the replacement rule:

    sed 's/old/NEW/' inputfile

    Input                    Output
    old                      NEW
    old old                  NEW old
    old old old              NEW old old
    old old old old          NEW old old old
    old old old old old      NEW old old old old

This behavior can, however, be changed by specifying a command option: if a number N is given, only the Nth occurrence is changed; a g (for global) changes all occurrences:

    sed 's/old/NEW/g' inputfile

    Input                    Output
    old                      NEW
    old old                  NEW NEW
    old old old              NEW NEW NEW
    old old old old          NEW NEW NEW NEW
    old old old old old      NEW NEW NEW NEW NEW

Filtering specific lines

By default, sed always outputs the contents of the pattern space after the last instruction. To suppress this behavior for individual lines, one can either delete certain lines by rule (explicit filtering) or switch this behavior off altogether with the command-line option -n (implicit filtering). Output is then only what is explicitly requested with the print command (p). p can serve either as an instruction of its own or as an option to other instructions. The following example outputs only the "chapter headings" from the text used above:

    sed -n 's/^=\(.*\)$/Chapter heading: \1/p' inputfile

    Input:
    =Chapter 1
    Line 1
    Line 2
    Line 3
    =Chapter 2
    Line A
    Line B
    Line C

    Output:
    Chapter heading: Chapter 1
    Chapter heading: Chapter 2

Debugging

For troubleshooting, it can be useful to have intermediate results printed in order to better follow the development in the pattern space. The option p mentioned above can be used for this. Lines may well be output several times. In the example program above, for instance:

    sed '/^=/ {
            s/^=//p
            s/^/ (/p
            s/$/)/p
            h
            d
         }
         p
         G
         s/\n//' inputfile
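To see what such intermediate output looks like, a debugging variant of the chapter program can be run on a two-line input. The sketch below is an illustration under assumptions: the helper name debug_chapters and the sample lines are made up, and the program ends with a substitution that removes the line separator inserted by G.

```shell
# Hypothetical helper: chapter program with p added after each step.
debug_chapters() {
    sed '/^=/ {
            s/^=//p
            s/^/ (/p
            s/$/)/p
            h
            d
         }
         p
         G
         s/\n//'
}

printf '%s\n' '=Chapter 1' 'Line 1' | debug_chapters
# prints:
# Chapter 1
#  (Chapter 1
#  (Chapter 1)
# Line 1
# Line 1 (Chapter 1)
```

The first three output lines show the heading after each substitution, the fourth is the unchanged text line printed by p, and the last is the regular output at the end of the cycle.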