Self-information

The information content (or surprise value) of a message is a logarithmic quantity that indicates how much information is transmitted by that message. The term was first formalized by Claude Shannon in his information theory: the information content of a character is its statistical significance, i.e., the minimum number of bits needed to represent or transmit the character (the information). Importantly, this does not necessarily correspond to the number of bits actually received (the amount of data), since the information content depends on the semantic context. The information content is therefore measured in a unit of its own, the shannon (Sh).

Definition

The information content I(x) of a character x with probability of occurrence p(x) is defined as

I(x) = log_a (1 / p(x)) = − log_a p(x)

Here a corresponds to the number of possible states of the message source.

The unit of the information content is the bit; the alternative unit shannon (Sh) has not caught on.

In the following, a = 2 is assumed (the binary system), so the result is obtained as a number of binary digits. Any other number system could be used instead.
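
A minimal Python sketch of this definition (the function name self_information and its base parameter are illustrative choices, not part of the original text):

    import math

    def self_information(p, base=2):
        """Information content I(x) = -log_a p(x) of a character with probability p."""
        if not 0 < p <= 1:
            raise ValueError("probability must be in (0, 1]")
        return -math.log(p, base)

    # A character that occurs with probability 1/16 carries 4 bits of information:
    print(self_information(1 / 16))  # 4.0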

General

The concept of information as used in Shannon's information theory must be clearly distinguished from the everyday use of the term. In particular, it must not be equated with the concept of meaning. In Shannon's theory, two messages, one of which is of special significance while the other is mere "nonsense", can contain exactly the same amount of information. For the simple case in which a choice is made between only two possible messages, it is defined, somewhat arbitrarily, that the information associated with this situation equals 1. The two messages between which the choice is made can themselves be completely arbitrary: one message could be the text of a telephone book, the other the single letter "A". These two messages could then be encoded, for example, by the symbols 0 and 1.

In general, any message source produces a sequence of selections from a set of elementary characters; the selected sequence then constitutes the actual message. The probabilities with which the characters are generated are of particular importance here, because when successive characters are selected, this selection is governed, at least from the standpoint of the communication system, by these probabilities. In most cases these probabilities are not independent of one another, i.e., they depend on the preceding selections. For example, if the last word of a sentence fragment is the article "the", the probability that the next word is again an article or a verb is very low.

A measure that satisfies the natural requirements one places on such an information measure turns out to be exactly the quantity known in statistical physics as entropy. How this information measure depends on the corresponding probabilities is explained in the following sections.

Formally, the information to be transmitted is described in terms of characters. Only a finite character set is available, but characters can be combined arbitrarily. The minimum number of bits required to represent and transmit a character depends on its probability: characters that occur frequently need fewer bits than characters that occur rarely. Data compression techniques exploit this, in particular entropy coding methods such as arithmetic coding and Huffman coding. A similar principle is used when balancing binary trees.
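
As an illustration of this principle, the following Python sketch computes Huffman code lengths for a small, made-up character distribution using the standard heapq module; the symbols and probabilities are assumptions for the example.

    import heapq
    from itertools import count

    def huffman_code_lengths(probs):
        """Return a dict symbol -> code length for a probability table."""
        # Each heap entry: (total probability, tie-breaker, symbols in the subtree).
        tick = count()
        heap = [(p, next(tick), [sym]) for sym, p in probs.items()]
        heapq.heapify(heap)
        lengths = {sym: 0 for sym in probs}
        while len(heap) > 1:
            p1, _, syms1 = heapq.heappop(heap)
            p2, _, syms2 = heapq.heappop(heap)
            # Merging two subtrees makes every contained symbol one bit longer.
            for sym in syms1 + syms2:
                lengths[sym] += 1
            heapq.heappush(heap, (p1 + p2, next(tick), syms1 + syms2))
        return lengths

    # Frequent characters receive short codes, rare characters long ones:
    print(huffman_code_lengths({"e": 0.5, "n": 0.25, "x": 0.125, "q": 0.125}))
    # {'e': 1, 'n': 2, 'x': 3, 'q': 3}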

In principle, the information content is calculated differently for statistically independent events and for statistically dependent events.

Information content of statistically independent events

Let x_1, x_2, …, x_n be a sequence of n statistically independent successive events. The information content of the sequence is then the sum of the information contents of the individual characters, each with its probability of occurrence:

I_total = I(x_1) + I(x_2) + … + I(x_n) = − Σ log2 p(x_i)

The total information content can also be calculated from the entropy H (the average information content of a character):

I_total = n · H

For a uniform distribution of the probabilities over all characters of the alphabet Z, the total information can also be calculated from the maximum entropy, i.e., from the alphabet size:

I_total = n · H_max = n · log2 |Z|

For a uniform distribution we always have:

H = H_max = log2 |Z|

From the standpoint of statistically independent events, the information content of the two sources "01010101…" and "10010110…" is, according to the formula above, the same. Yet the characters of the first source clearly follow a repetitive pattern, so one would intuitively expect less information in the first string than in the second. When the events are treated as statistically independent, however, each character is considered individually and no context spanning several characters is taken into account.
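
A short Python sketch of this observation, assuming both sources emit the characters 0 and 1 with equal probability (the probabilities and the helper name are assumptions for the example):

    import math

    def total_information(message, probs):
        """Sum of the self-information of each character, treating characters as independent."""
        return sum(-math.log2(probs[ch]) for ch in message)

    probs = {"0": 0.5, "1": 0.5}
    print(total_information("01010101", probs))  # 8.0 bits
    print(total_information("10010110", probs))  # 8.0 bits, despite the "more random" look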

Another definition of the information content of a character is provided by the conditional entropy. There, the occurrence of the preceding characters is taken into account, and the successive characters are treated as statistically dependent events.

Information content of statistically dependent events

For statistically dependent events the context of the events is known more precisely, and conclusions can be drawn that affect the information content: in most cases subsequent events can be partly guessed by elimination and by exploiting dependencies. An example of statistically dependent events is a text in the German language: the letter "c" occurs most often paired with an "h" or a "k". Other characters are also subject to such pairwise dependencies.

For this purpose, analogously to the case of statistically independent events, the average context-dependent information content of a character is multiplied by the number of characters present:

I_total = n · H(X|Y)

The conditional entropy is calculated as follows:

H(X|Y) = − Σ p(x, y) · log2 p(x|y), where the sum runs over all pairs (x, y)

Conditional entropy as the difference between the source entropy and the mutual information:

H(X|Y) = H(X) − I(X; Y)

Interpretation: Let X and Y be two stationary, dependent sources. H(X) is the entropy of the source X considered on its own. I(X; Y) is the mutual information, the information that flows from X to Y, i.e., the amount of information from which one can draw conclusions from X about Y. If this mutual information is large, the dependence between X and Y is large. Accordingly, the information content of X remaining after an observation of Y is then not very large, because the observation already supplies much of the information about X.

Conditional entropy as the total (joint) information minus the entropy H(Y):

H(X|Y) = H(X, Y) − H(Y)

Interpretation: In the statistically dependent case, the shared information (= I(X; Y)) of X and Y is subtracted from the total information (the joint entropy H(X, Y)). In addition, the new information that Y contributes by itself must not be included either, because in the end one wants to obtain only the amount of information that X contains on its own. One therefore calculates: H(X|Y) = H(X, Y) − I(X; Y) − H(Y|X).

Note: The information content of statistically dependent events is always less than or equal to that of statistically independent events, since H(X|Y) ≤ H(X).

Joint entropy H(X, Y)

If x_1, …, x_n are the possible events of a source X and y_1, …, y_m the possible events of a source Y, then the joint probability p(x_i, y_j) is the probability that an event x_i occurs together with an event y_j.

Using the conditional probability, the joint probability is then given by

p(x_i, y_j) = p(x_i | y_j) · p(y_j)

The average information content per pair of statistically dependent events, the joint entropy (Verbundentropie), is thus defined by:

H(X, Y) = − Σ p(x_i, y_j) · log2 p(x_i, y_j), where the sum runs over all pairs (x_i, y_j)
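
The following Python sketch checks these relations numerically for a small, made-up joint distribution of two dependent binary sources X and Y (the probability table is an assumption for the example):

    import math

    # Made-up joint probabilities p(x, y) for two dependent binary sources.
    p_xy = {("0", "0"): 0.4, ("0", "1"): 0.1,
            ("1", "0"): 0.1, ("1", "1"): 0.4}

    def entropy(dist):
        """Entropy of a distribution given as a dict of probabilities."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    # Marginal distributions p(x) and p(y).
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p

    H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy)
    H_X_given_Y = H_XY - H_Y          # H(X|Y) = H(X,Y) - H(Y)
    I_XY = H_X + H_Y - H_XY           # mutual information I(X;Y)

    print(round(H_X_given_Y, 4))      # conditional entropy H(X|Y)
    print(round(H_X - I_XY, 4))       # same value via H(X) - I(X;Y)
    print(H_X_given_Y <= H_X)         # True: H(X|Y) <= H(X)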

Information content for analog signals

The information content of a single value from an analog signal is in principle infinite, because the probability that any particular value of a continuous probability distribution occurs is zero. For the average information content of a real-valued, continuous signal, the differential entropy can be calculated instead of the Shannon entropy.

Alternatively, the signal can be converted into a digital signal by an analog-to-digital converter, although information is lost in the process. Since only discrete values occur after the conversion, their information content can be determined again.
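
A small Python sketch of this idea, assuming a Gaussian signal and a 3-bit converter (both are assumptions for the example): the signal is quantized to eight levels, after which the entropy of the discrete values can be estimated from their relative frequencies.

    import math
    import random
    from collections import Counter

    random.seed(0)
    samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # analog-like signal

    # 3-bit analog-to-digital conversion: clip to [-4, 4) and map to 8 levels.
    def quantize(value, levels=8, lo=-4.0, hi=4.0):
        value = min(max(value, lo), hi - 1e-9)
        return int((value - lo) / (hi - lo) * levels)

    digital = [quantize(v) for v in samples]
    counts = Counter(digital)
    n = len(digital)

    # Entropy of the discrete (quantized) values, estimated from relative frequencies.
    H = -sum((c / n) * math.log2(c / n) for c in counts.values())
    print(round(H, 3), "bits per quantized sample (at most 3 bits)")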

Examples of statistically independent events

For a = 2

In the following examples the base of the logarithm is a = 2, so the results are obtained in the unit bit.

Example 1

A character x occurs at a source with probability p(x) = 0.0625. For maximally efficient transmission over a channel, an information content of I(p(x)) = I(0.0625) = 4 bits is required for each character x.

Example 2

Given the string "Mississippi". It consists of n = 11 characters from the alphabet {M, i, s, p} with the probabilities of occurrence

p(i) = 4/11; p(M) = 1/11; p(p) = 2/11; p(s) = 4/11

Total information:

I_total = 4 · I(p(i)) + 1 · I(p(M)) + 2 · I(p(p)) + 4 · I(p(s)) ≈ 20.05 bits

From this follows a total of 21 bits, which are necessary to encode the word "Mississippi" optimally in binary.

Example 3

Alphabet Z = {a, b} with p(a) = 0.01 and p(b) = 0.99. The string consists of 100 characters, in which "a" occurs once and "b" 99 times.

I(p(a)) = 6.6439 bits (rare occurrence ⇒ high information content when it occurs)
I(p(b)) = 0.0145 bits (frequent occurrence ⇒ little information content when it occurs)

Total information: I_total = 1 · I(p(a)) + 99 · I(p(b)) ≈ 8.08 bits

Rounded up, a total of 9 bits of information follows.
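
The three examples can be checked with a few lines of Python; the helper function I below is an illustrative stand-in for the definition of the information content given above.

    import math

    def I(p):
        return -math.log2(p)

    # Example 1: p(x) = 0.0625
    print(I(0.0625))                                   # 4.0 bits

    # Example 2: "Mississippi", n = 11 characters
    word = "Mississippi"
    probs = {ch: word.count(ch) / len(word) for ch in set(word)}
    total = sum(I(probs[ch]) for ch in word)
    print(round(total, 2), math.ceil(total))           # ~20.05 bits -> 21 bits

    # Example 3: one 'a' (p = 0.01) and 99 'b' (p = 0.99)
    total = 1 * I(0.01) + 99 * I(0.99)
    print(round(total, 2), math.ceil(total))           # ~8.08 bits -> 9 bits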

One could also say that the information content of a character is proportional to the negative logarithm of the probability with which it can be guessed. The information content is thus a measure of the maximum efficiency with which information can be transmitted.

An alternative measure for the information content of a string is the Kolmogorov complexity, or algorithmic information content: it is defined as the length of the shortest program that can generate the string. Another approach, the so-called algorithmic depth, indicates how costly it is to generate a particular message. Gregory Chaitin has also gone beyond Shannon's definition of the entropy of information (see algorithmic information theory).

In this context, the cross-entropy and the Kullback–Leibler divergence also play a role as measures of the bits wasted by a poor encoding.
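
As a small illustrative Python sketch (both distributions are made up): the cross-entropy H(p, q) gives the average number of bits per character when a source with distribution p is encoded with a code optimized for q, and the excess over the entropy H(p) is the Kullback–Leibler divergence.

    import math

    p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true source distribution
    q = {"a": 0.25, "b": 0.25, "c": 0.5}   # distribution assumed by the (poor) code

    H_p = -sum(px * math.log2(px) for px in p.values())          # entropy of p
    H_pq = -sum(px * math.log2(q[x]) for x, px in p.items())     # cross-entropy H(p, q)
    D_kl = H_pq - H_p                                            # Kullback-Leibler divergence

    print(round(H_p, 3), round(H_pq, 3), round(D_kl, 3))  # 1.5 1.75 0.25 bits wasted per character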
