Entropy (information theory)

Entropy (after the portmanteau εντροπία ) is a measure of the average information content or information density of a message. The term in information theory is named in analogy to entropy in thermodynamics and statistical mechanics.

The information- theoretical understanding of the term entropy goes back to Claude E. Shannon and has been around for 1948. This year, Shannon published his fundamental work A Mathematical Theory of Communication and thus shaped the modern information theory.

The entropy is usually denoted with a capital Eta ().


Claude Elwood Shannon defined the entropy of a discrete memoryless source ( discrete random variable) over a finite alphabet as follows: First, one assigns to each probability of an event to its information content. Then, the entropy of a character is defined as the expected value of the information content of

Be, then the probability of experiencing the characters of the alphabet, or equivalent

With. It is set (corresponding to the limit). Summands with vanishing probability have therefore by definition not contribute to the sum.

The entropy for words of length is made up by

Wherein the probability is, by which the word occurs. The entropy is then the limit of which:

If the individual characters are stochastically independent of each other, then for all, so ( cf. Blockentropie )


Entropy is a measure of the average information content of each character of a source, which is a system or an information sequence. In information theory, one speaks with information as a measure of uncertainty removed. The more characters are received from a source in general, the more information is obtained and simultaneously decreases the uncertainty about what could have been sent.

Can be graphically the definition of the information content found as follows: If an independent event that can occur with probability actually occurs, then by a specific event from a hypothetical quantity is selected by equiprobable independent events. To distinguish between this number of events required to binary bits. This value thus indicates the information content of a special event in bits. Weight is the actual information content of the possible events with their probability, we obtain the average or expected information content of a character.

The unit 1 is defined as the Shannon information content of an event with a probability. An example of such an event is the result of a coin toss head.

The base 2 logarithm of the is arbitrary. It just turns out that especially just leave bits ( binary digits ) technically handle. Would a different base can be selected, for example 3, one would obtain ternary digits ( trits ). The information content can be converted to trits easily by multiplying by the modulus of bits.

The minimum required number of bits ( of the text) are necessary for the presentation of the information, results from the product of the average information content of a character and the number of different characters in the information text ( = alphabet size ): or.

Shannon's original intention to use the entropy as the measure of the required bandwidth of a transmission channel has been generalized fast. The entropy has been generally regarded as a measure of the information content. For a small entropy contains the information text redundancies or statistical regularities.

Entropy is generally not given by ( 1). For example, the probability to find a 0 or 1 in the string, as great as in a string by statistically independent events (such as repeated coin toss ) has emerged. Therefore, the entropy of individual characters for both strings the same, although the first chain is less random. Already shows a difference: The first string provides, the second provides. This can also be interpreted as the probability of a character depends on the previous symbols. Stochastic independence is therefore not given.

For successive events that are not stochastically independent, the entropy of successive dependent events continuously reduced. In such a case, one can also work with the conditional entropy and Quellentropie, both based on composite probabilities.

In close connection with conditional entropy is the mutual information, which indicates the strength of the statistical relationship between two random variables.

Formulated more easily, the entropy is the average number of decisions ( bits) that are needed to identify or isolate a character from a character set.

It makes sense that an alphabet of at least two different characters present. An alphabet size of one means that you get neither new incoming characters from the transmitter source new information, nor can reduce the uncertainty relating to the previous character.

Maximum entropy and standardization

If you want to have a normalized measure of the entropy of an arbitrary discrete distribution, it is advantageous to select the maximum possible entropy that into consideration for uniform distribution is achieved for normalization. Be the number of characters in over the alphabet, then the maximum entropy is given by:

It follows, for example, for a binary distribution (), so you need a bit per character and characters for the complete information. This value is achieved when zeros and ones occur with equal frequency. Is now normalized entropy of any distribution with different characters with one obtains:

The entropy thus obtained is always at most equal.

In order to compare the entropies of messages of different lengths, one has introduced the entropy rate based on the single character relates the entropy ( see there).



With uniform distribution can be dispensed with an alphabet on no sign. In contrast, the frequency of letters in the German language is uneven. For example, the letter E is seven times as likely as M or O, which leads to redundancy in the alphabet. Now you want to determine how big this redundancy is.

Let N = 26, the size of the alphabet. The redundancy R is calculated with R = Hmax - H. For the German alphabet are calculated with the letter frequency an entropy H of 4.0629 bit / symbol. The maximum entropy is Hmax = log2 | 26 | = 4.7004 bit / symbol. This is followed by a redundancy of R = 4.7004-4.0629 = 0.6375 bit / symbol. If we calculate on the whole redundancy that results from the sum of the redundancies of each character, we obtain R ⋅ N = 16.575 bits. Now would be interesting to know how many characters this represents from our alphabet. If one divides the redundant bits by the average information content of a uniformly distributed character is obtained (R ⋅ N) / Hmax = 3.53 characters → 3 characters ( without redundancies). If one adds however (R ⋅ N ) / H = 4.08 ( ⇒ 4 characters ), we determined the number of characters with a redundancy, as it is also present in the alphabet.

Without loss of information, the alphabet could be so reduced by four letters. This consideration takes into account only the statistical distribution of the letters. Common letter combinations such as SCH or remain as ST ignored ( conditional entropy ) as the same sounding letters (Q, K).


In a coin toss heads or tails are ideally equally likely. When one conceives the entropy as a measure of uncertainty, it is here to have a maximum value. It is quite uncertain, but whether number is thrown on the next roll or head.

Let be a discrete random variable and the expected value with

It is clear from the above definition, entropy H = 1 bit.

Not so with a jointed coin, such as a coin, which in 60 % of cases, head and only in 40% of cases indicating number in the middle. The uncertainty is lower than for the normal coin, because you have a certain preference for head. Measured as the entropy of the uncertainty is only about 0,971.

The sum of the probabilities is always 1

The entropy can be in this case calculated with the following formula:

Replacing by the expression, we obtain the formula

This can be represented graphically as follows:

For each they tell you what the entropy directly. The function is symmetrical about the line. She falls in steeply to an entropy value of 0. Even at levels which approach the certain event of the entropy falls to 0.

This relationship is valid for a random event. For multiple random events one must add up the individual entropies and you can easily reach entropy values ​​over 1. The probability, however, remains even with repeats by definition always between 0 and 1

If you repeat the toss twice the number of possibilities grows to four. The probability of each option is 0.25. The entropy of the two-time coin toss is more then 2 Sh. If you repeat an ideal coin tossing several times, then the entropy simply added. The entropy of a set of 20 coin tosses ideal calculated simply. This is shown in the picture.

You can not just calculate a value of the probability entropy. The entropy affects the entire random process. Each partial probability of a possible outcome is included in the calculation of the entropy of the overall process. Specifying a Teilentropie for every possible outcome is little sense. In the Shannon's entropy formula ie, the sum of the probabilities 1 result, otherwise the result may be misleading.

If you save a sequence of coin tosses as a bit sequence, then it makes sense to head always be 0 and to always represent the number by 1 (or vice versa). In the coin -jointed compact encodings are possible. These are obtained for example by the Huffman code.

Ideally cube

On a roll of a perfect cube with six ways the entropy is greater than 1 In general, the entropy is greater than 1 for a random event with stochastically independent random variables from a random experiment with more than two equal possibilities in the sample space. Its value is calculated at equally likely possibilities in the sample space as follows:

Let n be the number of ways, then the probabilities and the entropy

In the ideal cube are six ways in the outcome space. It follows from the entropy for single throw:

Easy to calculate the entropy of an ideal of a litter of eight cube: He has eight equal opportunities.

The entropy of a litter with the ideal roller dice equal to the entropy of three litters with the ideal coin.

The figure shows the relationship between the entropy and the number of equal possibilities of a random experiment dar.

Entropy tests

In order to test how well the data are compressed, or to test random numbers entropy tests may be used. As random number test, the entropy of a certain number of random numbers is determined, and from a minimum value, for example 7 bits per byte, it is considered passed. However, there are many such assays because the entropy is not unique; they may be defined, for example bitbasiert or bytebasiert.

A simple example:

A source, such as a dice or a coin, give only the values ​​0xAA (decimal 170) and 0x55 (decimal 85) from, each with equal probability. Bitwise is the output to 50 % 0 or 1, byte-wise it is 50% 0xAA or 0x55. The bitwise entropy is ( with )

While the byte- entropy with

Is significantly less.

The manufacturer of this random number generator is, of course, as the entropy of the device bitwise entropy, ie 1, specify. Similarly, choose the one possible basis, in which the entropy is minimal ( here bytes), so take your best compress the data a programmer of a compression program.

This example is not very realistic, as are used only two of 256 possible values ​​, but when the other bytes are output with a small probability of, for example 1/123456789, so this changes nothing to the bitwise entropy and the byte-wise is barely larger; it remains at 1/2. Only with the approach of the byte probabilities at 1/256 the byte-wise entropy reaches the value 1, but then it can be even correlations of bytes, so the consequence of 0xAAAA be much more common than the result 0x5555. This is the main reason why there are many different random number tests.

This ambiguity is not possible for the entropy, since there is summed up not only probabilities, but on ergodic probabilities of states, state transition probabilities and conditional probabilities. It is calculated with the theory of Markov chain. However, the computational cost of this is high for real random number generators.

Data compression and entropy

The entropy coding is a compression algorithm to compress data without loss.

In this context, the cross-entropy and the Kullback -Leibler divergence play a role as a measure of the triggered by poor encoding wastes of bits.


  • If the string is ABBCAADA (see also entropy ).
  • The letters probability:; ;
  • Maximum Entropy ():
  • The Maximum Entropy can be calculated as well with the formula of maximum entropy:

Alternative means of quantifying information

Another approach, to measure the information content of a message is represented by the Kolmogorov complexity, wherein the shortest possible the algorithm indicates the complexity of the message for presentation of a given character string. Similarly, the logical depth is defined, but refers to the time complexity of an algorithm for generating the data. Gregory Chaitin has also gone beyond the Shannon's definition of entropy of information (see Algorithmic Information Theory ).

Similarity to the entropy in physics

In physics (see thermodynamics, specifically entropy ( thermodynamics) ) plays an equivalently named size a significant role. The similarities are in fact more than formal: the physical entropy differs from the Shannon information entropy only by an additional normalization factor, is Boltzmann's constant, and by the replacement of the logarithm used in the base ( the dual logarithm is replaced by the natural logarithm ). Thus, physical and mathematical entropy entropy differ by the positive conversion factor, provided with its physical unit. The connection was made ​​by a Maxwell's demon thought experiment called.