Zipf's law

Zipf's law (named after George Kingsley Zipf, who formulated it in the 1930s) is a model that predicts the size of elements from their rank when the elements are placed in rank order. The law is frequently applied in linguistics, especially in corpus linguistics and quantitative linguistics, where, for example, it relates the frequency of words in a text to their rank. Zipf's law marked the beginning of quantitative linguistics.

Mathematically, it is based on a power law, which is described by the Pareto distribution.

Simple Zipf distribution

The simplified statement of Zipf's law: if the elements of a set, for example the words in a text, are ranked in order of frequency, the probability of their occurrence is inversely proportional to their position within the ranking:

p(n) ∝ 1/n

For N elements, the normalization factor is given by the N-th partial sum of the harmonic series,

H_N = 1 + 1/2 + 1/3 + … + 1/N,

and can therefore be specified only for finite sets. Thus follows:

p(n) = 1 / (n · H_N)
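This finite-N form can be sketched in a few lines of Python (the function name `zipf_probabilities` is illustrative, not from the source):

```python
# Minimal sketch of the simple Zipf distribution over a finite set of
# N elements: p(n) = 1 / (n * H_N), with H_N the N-th harmonic number.

def zipf_probabilities(N):
    """Return the simple Zipf probabilities p(1), ..., p(N)."""
    H_N = sum(1.0 / k for k in range(1, N + 1))  # normalization: harmonic number
    return [1.0 / (n * H_N) for n in range(1, N + 1)]

probs = zipf_probabilities(10)
print(sum(probs))            # sums to 1, up to floating-point error
print(probs[0] / probs[1])   # p(1)/p(2) ≈ 2: rank 1 is twice as likely as rank 2
```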

Probability distribution

Zipf's law has its origin in linguistics. It states that certain words occur much more frequently than others, and that the distribution resembles a hyperbola. For example, in most languages, the longer words are, the less frequently they occur. The rank-order parameter n can be understood as a cumulative quantity: the rank n equals the number of elements that are as large as, or larger than, the element of rank n. For rank 1 there is exactly one such element, namely the largest; for rank 2 there are two, namely the first and the second element; for rank 3 three, and so on.

Zipf assumes a simple inverse proportionality to the rank:

p(n) ∝ 1/n^a

In its original form, Zipf's law is free of parameters; there, a = 1.

The Zipf distribution corresponds exactly to the Pareto distribution when ordinate and abscissa are exchanged:

x(n) ∝ n^(-a)

It is the inverse function of the Pareto distribution, n(x) ∝ x^(-1/a). As this is a cumulative distribution function that obeys a power law, the exponent of the corresponding probability density function is:

e = 1 + 1/a

For the simple case a = 1, this gives e = 2.
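These exponent relations can be expressed directly; the function name below is illustrative, not from the source:

```python
# Sketch of the exponent relations: a Zipf law x(n) ∝ n^(-a) corresponds
# to a Pareto cumulative exponent 1/a and a density exponent e = 1 + 1/a.

def pareto_exponents(a_zipf):
    """Return (cumulative exponent, density exponent) for a Zipf exponent."""
    a_pareto = 1.0 / a_zipf
    return a_pareto, 1.0 + a_pareto

print(pareto_exponents(1.0))   # simple Zipf case: (1.0, 2.0)
print(pareto_exponents(0.83))  # ≈ (1.20, 2.20), the word-frequency fit
```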

Examples

The distribution of word frequencies in a text (left graph) roughly follows a simple Zipf distribution.

Zipf's law predicts the exponent a of the cumulative distribution function to be a = 1.

The fitted value for word frequencies, however, is a = 0.83, corresponding to a Pareto distribution with exponent a_Pareto = 1.20 and to a power-law probability density function with exponent e = 2.20.
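A word-frequency ranking of the kind discussed here can be produced in a few lines of Python; the sample sentence is made up for illustration:

```python
# Count word frequencies in a (made-up) sample text and rank them,
# the classic setting in which Zipf's law is observed.
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog "
        "the fox and the dog then rest")
counts = Counter(text.lower().split())
ranking = counts.most_common()   # (word, frequency) pairs, most frequent first

for rank, (word, freq) in enumerate(ranking[:3], start=1):
    print(rank, word, freq)      # rank 1 is "the" with frequency 4
```

Under the simple Zipf law, the frequency at rank n would fall off roughly as 1/n.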

The distribution of letter frequencies also resembles a Zipf distribution, but a statistic based on only 20 to 30 letters is not sufficient to reliably fit a power function.

Another example, taken from the article on the Pareto distribution, concerns the size distribution of cities. In some countries (e.g. Germany) one again finds a relationship that appears to obey a power law. The chart on the right shows the Zipf approximation to the measured values. The linear trend in the log-log plot supports the assumption of a power law. Contrary to Zipf's assumption, however, the exponent is not 1 but 0.77, corresponding to an exponent of the power-law density of e = 2.3.
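Fitting such an exponent is commonly done by linear regression on the log-log data. The sketch below uses seven synthetic rank-size values (not the measured city populations from the source) to show the procedure:

```python
# Estimate a Zipf exponent a from rank-size data via a least-squares fit
# of log(size) against log(rank). The sizes below are synthetic examples.
import math

sizes = [3400, 1800, 1100, 950, 700, 600, 500]  # sorted largest first (made up)
xs = [math.log(n) for n in range(1, len(sizes) + 1)]
ys = [math.log(s) for s in sizes]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
a_zipf = -slope   # fitted Zipf exponent (Zipf's original assumption: a = 1)
print(round(a_zipf, 2))
```

The linear trend of the points in the log-log plot is what justifies the power-law fit in the first place.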

The importance of the Zipf distribution lies in the rapid qualitative description of distributions from very different areas, while the Pareto distribution refines the exponent of the distribution.

For example, when only the populations of seven cities are given, the data base is too small for a fit. Zipf's law nevertheless provides an approximation.

Reasons for the occurrence of power-law distributions are discussed under the headings power law, scaling law, and self-organization.
