Gini coefficient

The Gini coefficient or Gini index is a statistical measure developed by the Italian statistician Corrado Gini to represent unequal distributions. Inequality coefficients can be calculated for any distribution.

The Gini coefficient takes a value between 0 (equal distribution) and 1 (only one person receives the entire income, i.e. maximum inequality). Equal distribution here does not mean the uniform distribution in the mathematical sense, but rather a distribution with a variance of 0. In the most common use case, the distribution of income in a country, it means that everyone's income is the same, not that every income level is equally common.


Applications

Economics

The Gini coefficient is used particularly in welfare economics, for example, to describe the degree of equality or inequality of the distribution of wealth or income.

Information Theory

In information theory, it is used as a measure of the "purity" or "impurity" of information.

Machine Learning

In machine learning, the Gini index, or more precisely the change in the Gini index (also called "Gini gain"), can be used when building a decision tree as the criterion for selecting the decision rule whose child nodes are as "pure" as possible. The idea is that a decision tree is finished once its nodes are "pure", which makes the change in the Gini index a suitable measure.
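The selection criterion can be illustrated with a minimal sketch. It assumes the usual definition of Gini impurity for class labels, 1 minus the sum of squared class shares, which the text itself does not spell out; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels (assumed definition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, children):
    """Decrease in impurity when `parent` is split into the lists in `children`."""
    n = len(parent)
    # Each child's impurity is weighted by the share of samples it receives.
    weighted = sum(len(child) / n * gini_impurity(child) for child in children)
    return gini_impurity(parent) - weighted

# Toy data: two candidate decision rules for the same parent node.
parent = ["a", "a", "a", "b", "b", "b"]
pure_split = [["a", "a", "a"], ["b", "b", "b"]]
mixed_split = [["a", "a", "b"], ["a", "b", "b"]]

print(gini_gain(parent, pure_split))
print(gini_gain(parent, mixed_split))
```

The rule that produces pure child nodes yields the larger gain, so it would be the one selected.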

Banking

In banking, the Gini coefficient is used as a measure of how well a rating system separates good customers from bad ones (discriminatory power).

Standardization

Depending on the application, the range of possible values runs from 0 to 1, from 0 to 100, or from 0 to 10,000. Also depending on the application, either the smallest or the largest value stands for equal distribution.

The value of absolute inequality can generally be reached only asymptotically. This can be avoided by renormalization.
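A minimal sketch of why the maximum is not reached and how a renormalization can fix it. It assumes a population of n equally weighted individuals and the factor n / (n - 1), which is one common choice and not necessarily the renormalization the text has in mind; the function names are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values via the area under the Lorenz curve."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    cumulative, area = 0.0, 0.0
    for x in xs:
        # Trapezoid under the Lorenz curve for one individual (width 1/n).
        area += (cumulative + x / 2) / total / n
        cumulative += x
    return (0.5 - area) / 0.5

def gini_renormalized(values):
    """Assumed renormalization: scale by n / (n - 1) so the maximum becomes 1."""
    n = len(values)
    return gini(values) * n / (n - 1)

one_has_all = [0, 0, 0, 0, 10]
print(gini(one_has_all))               # approaches 1 only as n grows
print(gini_renormalized(one_has_all))
```

With five individuals the unscaled value is 1 - 1/5, and the scaled value reaches the upper bound.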

Calculation (discrete distributions)

Often a certain (large) number of items is distributed over a (small) number of categories that are fixed a priori: for example, all coins in a cash register over their denominations, or all letter occurrences in a text (for example, the entire Bible) over the individual letter types. In this case one also speaks of the Herfindahl index.

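One uses here

$$H = \sum_{i=1}^{n} p_i^2$$

where $p_i$ is the relative share of category $i$ and $n$ is the number of categories.

If $p_i = 1$ for some $i$ and $p_j = 0$ for all others, then $H = 1$. A value of exactly 0 cannot occur; the smallest possible value is $1/n$, the reciprocal of the number of categories.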
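The letter example can be sketched in a few lines, assuming the usual definition of the Herfindahl index as the sum of squared relative shares. The function name and the sample text are hypothetical.

```python
from collections import Counter

def herfindahl(counts):
    """Sum of squared relative shares of the given category counts."""
    total = sum(counts)
    return sum((count / total) ** 2 for count in counts)

# Distribute all letter occurrences of a text over the letter types.
text = "abracadabra"
letter_counts = Counter(text)
h = herfindahl(letter_counts.values())
print(letter_counts, h)

# Boundary cases: everything in one category, and four equal categories.
print(herfindahl([7, 0, 0]))
print(herfindahl([2, 2, 2, 2]))
```

The two boundary cases show the range described above: 1 when a single category holds everything, and the reciprocal of the number of categories when all shares are equal.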

Distribution with quantiles

A certain part of a set A is assigned to a part of another set B. This can be, for example, money (A) to people (B) or electricity consumption (A) to cities (B). What is crucial is that A is a homogeneous, divisible quantity. Car ownership, for example, would not be suitable, because cars are neither homogeneous (the individual types differ considerably) nor divisible into small units.

The Gini coefficient is the area between the Lorenz curve of an equal distribution and the Lorenz curve of the observed distribution, normalized by the area under the Lorenz curve of the equal distribution:

$$\mathrm{GUK} = \frac{A - B}{A}$$

with GUK as the Gini inequality coefficient, A the area under the Lorenz curve of an equal distribution, and B the area under the Lorenz curve of the observed distribution.

Example

A is distributed over B; for example, wealth (A) is distributed over the population (B).

50 percent of B (b1) holds 2.5 percent of A (v1).
40 percent of B (b2) holds 47.5 percent of A (v2).
9 percent of B (b3) holds 27.0 percent of A (v3).
1 percent of B (b4) holds 23.0 percent of A (v4).

In a first step, the data are written in normalized form:

b1 = 0.50, v1 = 0.025, v1/b1 = 0.05
b2 = 0.40, v2 = 0.475, v2/b2 = 1.188
b3 = 0.09, v3 = 0.270, v3/b3 = 3
b4 = 0.01, v4 = 0.230, v4/b4 = 23

In the second step, the Gini coefficient is calculated.

The Gini inequality coefficient (GUK) is obtained by evaluating a Lorenz curve.

Before the Lorenz curve is actually constructed, the above values may have to be reordered. All value pairs must first be sorted such that:

$$\frac{v_1}{b_1} \le \frac{v_2}{b_2} \le \dots \le \frac{v_n}{b_n}$$

In the above example the pairs are already in the correct order, so nothing needs to be reordered.

The desired Lorenz curve is obtained by plotting the pairs (xi, yi) as points in a Cartesian coordinate system and connecting adjacent points with straight lines. The (xi, yi) pairs are formed from the (bi, vi) pairs according to the following rule:

$$x_i = \sum_{k=1}^{i} b_k, \qquad y_i = \sum_{k=1}^{i} v_k$$

In the second step, the following values are determined from the data of the first step by summation (the starting point x0 = y0 = 0 is fixed):

x0 = 0.00, y0 = 0.000
x1 = 0.50, y1 = 0.025
x2 = 0.90, y2 = 0.500 (because 0.5 + 0.4 = 0.9 and 0.025 + 0.475 = 0.5)
x3 = 0.99, y3 = 0.770
x4 = 1.00, y4 = 1.000

With completely equal distribution of wealth, the Lorenz curve is a straight line from point (0 | 0) to point (1 | 1).
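The sorting and summation steps can be sketched as follows. This is a minimal illustration using the example values; the function name `lorenz_points` is hypothetical, and the input is deliberately given out of order to show the sorting step.

```python
from itertools import accumulate

def lorenz_points(b, v):
    """Return the Lorenz curve points (x_i, y_i) for group shares b and v."""
    # Sort the groups by v_i / b_i so that the least endowed groups come first.
    pairs = sorted(zip(b, v), key=lambda pair: pair[1] / pair[0])
    # Cumulative sums, starting from the fixed point (0, 0).
    xs = [0.0] + list(accumulate(bi for bi, _ in pairs))
    ys = [0.0] + list(accumulate(vi for _, vi in pairs))
    return list(zip(xs, ys))

points = lorenz_points(b=[0.01, 0.50, 0.09, 0.40], v=[0.23, 0.025, 0.27, 0.475])
for x, y in points:
    print(f"x = {x:.2f}, y = {y:.3f}")
```

Up to floating-point rounding, the points match the values of the example.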

To determine the Gini coefficient, two quantities are determined first, both of which are areas when viewed graphically. The first is the area under the equal-distribution line; we call it A. The second is the area under the curve of the actual distribution; we call it B. With these two quantities, the Gini inequality coefficient is calculated as follows:

$$\mathrm{GUK} = \frac{A - B}{A}$$

Calculating the y-values of the Lorenz curve of the actual distribution:

y0 = 0.000
y1 = v1 = 0.025
y2 = v1 + v2 = 0.500
y3 = v1 + v2 + v3 = 0.770
y4 = v1 + v2 + v3 + v4 = 1.000

Calculation of the area B under the Lorenz curve of the actual distribution (see below):

(y1 - 0.5 · v1) · b1 = 0.00625
(y2 - 0.5 · v2) · b2 = 0.10500
(y3 - 0.5 · v3) · b3 = 0.05715
(y4 - 0.5 · v4) · b4 = 0.00885
B = 0.17725

Since a normalized representation is used, the line of completely equal distribution connects the points (0 | 0) and (1 | 1). The triangle with the area A therefore has the area 0.5. The Gini inequality coefficient is therefore:

$$\mathrm{GUK} = \frac{A - B}{A} = \frac{0.5 - 0.17725}{0.5} = 0.6455$$

Viewed graphically, the Gini coefficient is the ratio of the area between the Lorenz curve and the equal-distribution line (A - B) to the area under the equal-distribution line (A).
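The whole calculation of the example can be checked with a short sketch. The function name `gini_from_groups` is hypothetical; the per-segment area uses the same expression as the example, (y_i - 0.5 · v_i) · b_i.

```python
def gini_from_groups(b, v):
    """Gini inequality coefficient from group shares b (of B) and v (of A)."""
    # Sort by v_i / b_i so that the Lorenz curve is built in the right order.
    pairs = sorted(zip(b, v), key=lambda pair: pair[1] / pair[0])
    y = 0.0
    area_b = 0.0
    for bi, vi in pairs:
        y += vi                          # cumulative share y_i
        area_b += (y - 0.5 * vi) * bi    # area under one Lorenz segment
    area_a = 0.5                         # triangle under the equal-distribution line
    return (area_a - area_b) / area_a

guk = gini_from_groups(b=[0.50, 0.40, 0.09, 0.01], v=[0.025, 0.475, 0.27, 0.23])
print(round(guk, 4))
```

For the example data this gives B = 0.17725 and a coefficient of 0.6455.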

Explanation of the calculation

The total Gini area is a square with sides 1 by 1. The Gini area of an equal distribution is half of the total Gini area. To calculate the area under the curve, all the individual areas are added. Take, for example, the segment b2. The rectangle with height y1 and width b2 (i.e. from x1 to x2) counts in full. Of the rectangle that extends from height y1 to height y2, only half is counted, since the other half lies above the Lorenz curve and does not belong to the Gini area. So the area is

$$y_1 \cdot b_2 + \tfrac{1}{2}\,(y_2 - y_1) \cdot b_2$$

or, since $y_2 - y_1 = v_2$,

$$\left(y_2 - \tfrac{1}{2}\,v_2\right) \cdot b_2$$

Alternative view of the area calculation: the individual area is the rectangle with the corners (x1, 0), (x2, 0), (x2, y2), (x1, y2) (area: $y_2 \cdot b_2$), less the right triangle with the corners (x1, y1), (x2, y2), (x1, y2) (area: $\tfrac{1}{2}\,v_2 \cdot b_2$), with the same result.
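Applied to every segment, the same reasoning gives a closed form for the area B. The following is a sketch in the notation of the example, with $x_0 = y_0 = 0$ and $b_i = x_i - x_{i-1}$, $v_i = y_i - y_{i-1}$; the index bound $n$ for the number of groups is an assumption of this note.

```latex
B = \sum_{i=1}^{n} \left( y_i - \tfrac{1}{2}\, v_i \right) b_i
  = \sum_{i=1}^{n} \frac{y_{i-1} + y_i}{2}\,\left( x_i - x_{i-1} \right),
\qquad
\mathrm{GUK} = \frac{A - B}{A} = 1 - 2B \quad \text{for } A = \tfrac{1}{2}.
```

For the example, B = 0.17725 gives 1 - 2 · 0.17725 = 0.6455.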

Data reduction

The Gini coefficient is a statistical measure used to quantify unequal distribution. Such measures always reduce more or less complex data to a single simple number, which can lead to misinterpretation if it is not used properly.

In the case of the Gini coefficient, for example, for almost every Lorenz curve there is at least one other Lorenz curve with exactly the same Gini value. It is obtained by mirroring the original Lorenz curve about the line that passes through the points (0,1) and (1,0). If the amounts are divided 10 % / 90 % among 50 % / 50 % of the carriers of the trait, this results in the same Gini coefficient as dividing the amounts 50 % / 50 % among 90 % / 10 % of the carriers of the trait. The two Lorenz curves are shown in the figure. The only exceptions are Lorenz curves that are symmetrical with respect to that line. The two different curves share a Gini coefficient of 0.4. Indeed, there are infinitely many possible Lorenz curves for a given Gini coefficient (except for absolute equality or absolute inequality).

In this respect, the Gini coefficient resembles any other measure that is derived by aggregating a larger amount of data. Inequality indicators such as the Gini coefficient arise from the aggregation of data with the specific intention of reducing complexity. The associated loss of information is therefore not an unintended side effect. In general, complexity reductions become a disadvantage only if one forgets that they exist and how they map the data.
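The mirroring can be verified numerically. This is a minimal sketch with hypothetical function names; reflecting a point about the line through (0,1) and (1,0) maps (x, y) to (1 - y, 1 - x).

```python
def gini_from_points(points):
    """Gini coefficient from Lorenz curve points sorted by x, from (0, 0) to (1, 1)."""
    # Trapezoid rule for the area B under the piecewise linear Lorenz curve.
    area_b = sum((x1 - x0) * (y0 + y1) / 2
                 for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return (0.5 - area_b) / 0.5

def mirror(points):
    """Reflect Lorenz points about the line through (0, 1) and (1, 0)."""
    return sorted((1 - y, 1 - x) for x, y in points)

# 50 % of the carriers hold 10 % of the amount, the other 50 % hold 90 %.
curve = [(0.0, 0.0), (0.5, 0.1), (1.0, 1.0)]
mirrored = mirror(curve)  # 90 % of the carriers hold 50 %, 10 % hold 50 %

print(mirrored)
print(gini_from_points(curve), gini_from_points(mirrored))
```

Both curves yield a coefficient of 0.4 (up to floating-point rounding) although they describe different distributions.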

Sources of error in comparisons

Statements that compare inequality coefficients with each other require a particularly critical review of how the individual coefficients were calculated. For a correct comparison, the coefficients must have been calculated in the same way in all cases. For example, differences in the granularity of the input data lead to different results when inequality is calculated. A Gini coefficient calculated from a few quantiles usually shows slightly lower inequality than one calculated from more quantiles. With more quantiles, the higher measurement resolution captures inequality within the ranges (that is, between the quantiles) that remains unevaluated at the coarser resolution.

Put simply: better data (almost always) show a less equal distribution.
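The effect of granularity can be sketched with the data of the worked example: merging the top two groups (9 % and 1 % of B) into one group of 10 % hides the inequality between them. The function name `gini_from_groups` and the merged grouping are illustrative assumptions.

```python
def gini_from_groups(b, v):
    """Gini coefficient from group shares b (of B) and v (of A)."""
    pairs = sorted(zip(b, v), key=lambda pair: pair[1] / pair[0])
    y, area_b = 0.0, 0.0
    for bi, vi in pairs:
        y += vi
        area_b += (y - 0.5 * vi) * bi  # area under one Lorenz segment
    return (0.5 - area_b) / 0.5

# Four groups, as in the worked example.
fine = gini_from_groups(b=[0.50, 0.40, 0.09, 0.01], v=[0.025, 0.475, 0.27, 0.23])
# Coarser data: the two richest groups are reported as a single group.
coarse = gini_from_groups(b=[0.50, 0.40, 0.10], v=[0.025, 0.475, 0.50])

print(round(fine, 4), round(coarse, 4))
```

The coarser grouping gives 0.6275 instead of 0.6455, so two coefficients are only comparable if they were computed at the same resolution.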
