Benford's law

Benford's law, also called the Newcomb-Benford law (NBL), describes a regularity in the distribution of the digit structure of numbers in empirical data sets, for example in the distribution of their first digits.

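The law can be observed, for example, in data sets of city populations, monetary amounts in accounting, physical constants, and so on. In short, it states that the lower the numerical value of a digit sequence at a given position in a number, the more likely that sequence is to occur.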

Discovery

The law was discovered in 1881 by the mathematician Simon Newcomb and published in the "American Journal of Mathematics". He had noticed that in well-used books of logarithm tables, the pages listing numbers with a leading 1 were noticeably dirtier than the other pages, evidently because they were consulted more often. Newcomb's paper went unnoticed and had already been forgotten when the physicist Frank Benford (1883-1948) rediscovered the law and published it again in 1938. Since then it has carried his name, although more recently the term "Newcomb-Benford law" (NBL) has also been used to credit the original discoverer. Until a few years ago the law was not known even to all statisticians. It has become much better known only since the American mathematician Theodore Hill worked to make the Benford distribution usable for practical problems.

Benford distribution

General case

Let a set of numbers obey Benford's law. Then the probability that the digit d occurs in base B at the n-th position (counting from the front, starting with 0) is

$$p_B(d, n) = \sum_{k=\lfloor B^{n-1} \rfloor}^{B^{n}-1} \log_B\left(1 + \frac{1}{kB + d}\right),$$

where $\lfloor \cdot \rfloor$ denotes the floor function.

For the first digit in particular (n = 0, d = 1, ..., B - 1), the formula simplifies to

$$p_B(d, 0) = \log_B\left(1 + \frac{1}{d}\right) = \log_B(d + 1) - \log_B(d).$$

It is easy to verify that the probabilities of all digits at a given position sum to one: applying the logarithm rule already used above for the first digit turns the sum into a telescoping sum.
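The digit probabilities described in this section can be checked numerically. The Python sketch below assumes the standard form of the law, in which the digit d at position n (counted from 0) in base B has probability equal to the sum of log_B(1 + 1/(kB + d)) over k from floor(B^(n-1)) to B^n - 1. The function name digit_probability and its parameters are illustrative choices, not part of the original text.

```python
import math


def digit_probability(d, n, base=10):
    """Benford probability of digit d at position n (0 = first digit)."""
    if n == 0:
        if d == 0:
            return 0.0  # a leading digit is never zero
        lower = 0  # floor(base ** -1) is 0
    else:
        lower = base ** (n - 1)
    upper = base ** n - 1
    # Sum over all possible prefixes k that can precede the digit d.
    return sum(
        math.log(1 + 1 / (k * base + d), base)
        for k in range(lower, upper + 1)
    )


if __name__ == "__main__":
    for position in (0, 1, 2):
        row = [digit_probability(d, position) for d in range(10)]
        print(position, [round(p, 4) for p in row], round(sum(row), 6))
```

Each printed row ends with the sum of the ten probabilities, which should be 1 up to rounding, and the rows flatten toward 0.1 per digit as the position moves to the right.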

Decimal system

Let a set of decimal numbers obey Benford's law. Then the digit d occurs in the first position with probability

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right) = \log_{10}(d + 1) - \log_{10}(d), \qquad d = 1, \dots, 9.$$

Graph

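In its simplest consequence, Benford's law states that the leading digits n (n = 1 ... 9) appear with the probabilities log10(n + 1) - log10(n), or, rounded: 30.1 % for 1, 17.6 % for 2, 12.5 % for 3, 9.7 % for 4, 7.9 % for 5, 6.7 % for 6, 5.8 % for 7, 5.1 % for 8 and 4.6 % for 9.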

Validity of the NBL

A data set is a Benford variable (that is, Benford's law holds for it) if the mantissas of the logarithms of its values are uniformly distributed between 0 and 1. This is generally the case when the variance within the data set does not fall below a certain minimum value, which depends on the class of distribution that the logarithms of the data follow.

The Fibonacci numbers (each Fibonacci number is the sum of its two predecessors) show, in the first digits of the first 30 numbers, a distribution that is remarkably close to a Benford distribution. The same holds for similar sequences with different starting values (e.g. the Lucas sequences). Many number sequences obey Benford's law, but many others do not and are therefore not Benford variables.
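The claim about the first 30 Fibonacci numbers can be checked with a few lines of Python. This is only an illustrative sketch; the helper names fibonacci and first_digit_counts are assumptions made for the example, and the sequence is taken to start with 1, 1.

```python
import math
from collections import Counter


def fibonacci(count):
    """Return the first `count` Fibonacci numbers, starting 1, 1, 2, ..."""
    numbers = []
    a, b = 1, 1
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b
    return numbers


def first_digit_counts(numbers):
    """Count how often each leading decimal digit occurs."""
    return Counter(int(str(n)[0]) for n in numbers)


if __name__ == "__main__":
    counts = first_digit_counts(fibonacci(30))
    for d in range(1, 10):
        expected = 30 * math.log10(1 + 1 / d)
        print(d, counts[d], round(expected, 2))
```

The script prints, for each digit, the observed count next to the count expected under the Benford distribution for 30 numbers.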

Why many data sets follow the NBL

The NBL states that the digit sequences in the numbers of real data sets are not uniformly distributed but follow a logarithmic law. "Real" here means data that have not been manipulated; the data sets must also be sufficiently large and contain numbers spanning a sufficient range of orders of magnitude, that is, they must be reasonably widely dispersed. The law means that a digit sequence is more probable the smaller its value and the further to the left it begins in the number. The most frequent is the leading sequence "1", with a theoretical share of 30.103 %.

The NBL rests on the uniform distribution of the mantissas of the logarithms of the values in the data set. The reason the NBL holds so surprisingly often is that many real data sets are log-normally distributed: it is not the frequencies of the data themselves but the frequencies of their logarithms that follow a normal distribution. If the normally distributed logarithms are dispersed widely enough (standard deviation of at least about 0.74), the mantissas of the logarithms stably follow a uniform distribution. If the standard deviation is smaller, the mantissas are also normally distributed and the NBL no longer holds, at least not in the simple form shown. With a standard deviation below 0.74, an effect occurs that is not very common in statistics: even the mean of the normal distribution of the logarithms influences how often the digit sequences occur.
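The effect of the spread of the logarithms can be illustrated without simulation. The Python sketch below assumes that the standard deviation refers to base-10 logarithms (the text does not say which base is meant) and computes the exact first-digit probabilities of a log-normal variable from the normal distribution function. All function names are illustrative.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def first_digit_probs_lognormal(mu, sigma):
    """First-digit probabilities when log10(X) is normal(mu, sigma)."""
    lo = math.floor(mu - 10 * sigma) - 1
    hi = math.ceil(mu + 10 * sigma) + 1
    probs = {}
    for d in range(1, 10):
        a, b = math.log10(d), math.log10(d + 1)
        # Add up the mass of [k + a, k + b) over all relevant decades k.
        probs[d] = sum(
            normal_cdf((k + b - mu) / sigma) - normal_cdf((k + a - mu) / sigma)
            for k in range(lo, hi + 1)
        )
    return probs


def max_deviation_from_benford(mu, sigma):
    """Largest absolute gap to the Benford first-digit probabilities."""
    probs = first_digit_probs_lognormal(mu, sigma)
    return max(abs(probs[d] - math.log10(1 + 1 / d)) for d in range(1, 10))


if __name__ == "__main__":
    for sigma in (0.1, 0.3, 0.74, 1.5):
        for mu in (0.0, 0.5):
            print(sigma, mu, max_deviation_from_benford(mu, sigma))
```

Under this assumption the deviation is negligible at a standard deviation of 0.74 regardless of the mean, while for small standard deviations the result depends visibly on the mean, as described above.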

On the one hand, taking the NBL in its present form, there are numerous data sets that do not satisfy it. On the other hand, a formulation of the NBL already exists that all data sets satisfy.

Benford's law applies in particular to quantities that are subject to natural growth processes, because such numbers change over time by multiplication. The first digit of the mantissa stays at 1 for about 30 % of the time, at 2 for about 18 % of the time, and so on. This corresponds to the logarithmic distribution predicted by Benford's law and is independent of the time it takes for the value to multiply. The cycle then starts again at 1. A snapshot of the prices in a supermarket shows exactly this distribution, no matter when the survey is taken.
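The growth argument can be made concrete with a small deterministic experiment. The Python sketch below lets a value grow by a constant factor per time step and records how often each leading digit is seen. The growth factor of 1.001 and the number of steps are arbitrary assumptions chosen for illustration.

```python
from collections import Counter


def leading_digit_time_shares(growth_factor=1.001, steps=100_000):
    """Share of time steps a steadily growing value spends on each leading digit."""
    value = 1.0
    counts = Counter()
    for _ in range(steps):
        # Scientific notation puts the leading digit first, e.g. '3.141590e+02'.
        counts[int(f"{value:e}"[0])] += 1
        value *= growth_factor
    return {d: counts[d] / steps for d in range(1, 10)}


if __name__ == "__main__":
    for digit, share in leading_digit_time_shares().items():
        print(digit, round(share, 3))
```

Because the value passes through many powers of ten, the shares come out close to the logarithmic distribution, with roughly 30 % of the steps on the digit 1 and roughly 18 % on the digit 2.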

Scale invariance

Data sets whose leading digits are Newcomb-Benford distributed remain Benford distributed when they are multiplied by a constant. Multiplying the data by a constant corresponds to adding a constant to the logarithms. If the data are dispersed widely enough, this does not change the distribution of the mantissas.
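Scale invariance can be demonstrated on an artificial data set. The Python sketch below builds values whose base-10 logarithms are evenly spaced over three full decades, so that the mantissas are uniform by construction, and compares the first-digit shares before and after multiplication by a constant. The sample construction, the factor 1.95583 and all names are assumptions made for the example.

```python
def first_digit(x):
    """Leading decimal digit of a positive number."""
    return int(f"{x:e}"[0])


def digit_shares(values):
    """Relative frequency of each leading digit (index 0 stays unused)."""
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return [c / len(values) for c in counts]


def log_uniform_sample(n=9000, decades=3):
    """Values whose base-10 logarithms are evenly spaced over whole decades."""
    return [10 ** (decades * i / n) for i in range(n)]


def max_share_change(values, factor):
    """Largest change in any first-digit share after rescaling by `factor`."""
    before = digit_shares(values)
    after = digit_shares([factor * v for v in values])
    return max(abs(a - b) for a, b in zip(before, after))


if __name__ == "__main__":
    sample = log_uniform_sample()
    print(digit_shares(sample)[1:])
    print(max_share_change(sample, 1.95583))
```

The second printed number is the largest change in any digit share caused by the rescaling; it stays close to zero for any positive factor.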

This property directly explains why the Newcomb-Benford law holds for tax returns, financial statements and, more generally, data sets whose numbers represent amounts of money. If there is a universally valid distribution of first digits in such data sets at all, it must be independent of the currency in which the data are stated, and it must not change with inflation. Both requirements mean that the distribution must be scale invariant. Since the Newcomb-Benford distribution is the only one that meets this condition, it must be the distribution in question.

Base invariance

A data set that obeys Benford's law in base B1 also obeys it in base B2. More concretely, a decimal data set that satisfies Benford's law still satisfies it when the decimal numbers are converted into another numeral system (e.g. binary, octal or hexadecimal).
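One way to try out base invariance is to look at the leading digits of the same numbers in several numeral systems. The Python sketch below uses the first 2000 Fibonacci numbers as test data, which is an assumption made for illustration, and compares their leading digits in bases 8, 10 and 16 with the expected shares log_B(1 + 1/d). The function names are hypothetical.

```python
import math


def fibonacci(count):
    """Return the first `count` Fibonacci numbers, starting 1, 1, 2, ..."""
    numbers = []
    a, b = 1, 1
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b
    return numbers


def leading_digit(n, base):
    """Leading digit of the positive integer n written in the given base."""
    while n >= base:
        n //= base
    return n


def max_deviation(numbers, base):
    """Largest gap between observed and Benford leading-digit shares."""
    counts = [0] * base
    for n in numbers:
        counts[leading_digit(n, base)] += 1
    return max(
        abs(counts[d] / len(numbers) - math.log(1 + 1 / d, base))
        for d in range(1, base)
    )


if __name__ == "__main__":
    data = fibonacci(2000)
    for base in (8, 10, 16):
        print(base, round(max_deviation(data, base), 4))
```

For this data set the largest deviation remains small in each of the three bases.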

Applications

If real data sets meet the parametric requirements of Benford's law but the frequency of a particular digit deviates significantly from the expectation given by the law, an auditor will subject the records that begin with this digit to a more in-depth analysis in order to find the cause(s) of the deviation. This quick method can lead to deeper insights into peculiarities of the data set under examination or to the detection of manipulation during data compilation.

Example

A table reports the harvest results from the year 2002. In the chart, the blue bars give the frequency of the first digits of the 87 recorded numbers. The Benford distribution is shown as a red line. It reflects the observed distribution much better than a uniform distribution (green line). Despite the small sample, the preference for smaller values in the first digit is visible, as is a tendency in the second digit.

The table summarizes the results. The column "1st digit" gives how often each digit occurs in first position, and the column "Benford" how often it is expected there according to the Benford distribution. The same applies to the column "2nd digit" for the numbers with that digit in second position. The digit 1 thus occurs 27 times in first position, where 26.19 occurrences were expected. The digit 4 occurs 17 times in first position; according to Benford it should occur 8.43 times on average.
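The expected counts quoted for this example follow directly from the first-digit probabilities multiplied by the sample size of 87. The short Python sketch below recomputes them; the function name is an illustrative choice.

```python
import math


def expected_first_digit_counts(sample_size):
    """Expected number of occurrences of each leading digit under Benford's law."""
    return {d: sample_size * math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, expected in expected_first_digit_counts(87).items():
        print(digit, round(expected, 2))
```

For the digits 1 and 4 this reproduces the values 26.19 and 8.43 given in the text.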

With decreasing place value of the digit, the Benford distribution given above approaches the uniform distribution of the digits more and more closely.

In business

Benford's law is applied in detecting fraud in the preparation of financial statements, falsification of accounts and, in general, in quickly finding blatant irregularities in accounting. With the help of Benford's law, the remarkably "creative" accounting at Enron and Worldcom was uncovered, through which management had cheated investors out of their investments (→ white-collar crime). Today, auditors and tax investigators use methods based on Benford's law. These methods are an important part of the mathematical and statistical methods that have been in use for several years to detect accounting fraud, tax fraud, investor fraud and data fraud in general. It has also been shown that the leading digits of market prices follow Benford's law (el Sehity, Hoelzl and Kirchler, 2005). The manipulation of Greece's economic data could also be demonstrated using Benford's law. Even regular thefts from the cash register of a beverage store could be detected.

In research

Benford's law can also help to detect data fabrication in science. It was data sets from the natural sciences that led to Benford's law in the first place. Karl-Heinz Tödter of the Research Centre of the Deutsche Bundesbank used the same law to check the results of 117 economics papers in a contribution to the German Economic Review.

Using Benford's law, political scientists examined the results of several federal elections at constituency level and found significant irregularities in isolated cases.

Fraud in the 2009 presidential election in Iran could also be detected.

Size of the cities in Germany

The figure on the right shows the size distribution of German cities. The chart is based on the populations of the 998 largest cities. A Benford analysis yields the following frequencies of the first digits:

The frequencies of the digits 3 and 4 match the expectation. The digit 1, by contrast, occurs more often. The deviation is especially pronounced for the digit 2, at the expense of the digits 7, 8 and 9, which are rarely observed in first position.

This example shows once again that data sets must meet certain requirements in order to satisfy the NBL; the data at hand do not. The reason here is the restriction to cities; the distribution of all municipalities should give a closer match. In addition, there is a natural minimum settlement size, and municipal mergers also affect the distribution. Curiously, about 50 % of the examples that Benford cited in his publication as evidence for the NBL belong to the class of data sets whose leading digits are not Benford distributed but at most roughly resemble a Benford distribution.

Significance

How large the deviations of the observed distribution from the theoretically expected one must be before a reasonable suspicion of manipulation can be regarded as confirmed is determined with methods of mathematical statistics (e.g. the chi-square test or the Kolmogorov-Smirnov test, "KS test"). For the chi-square test, a sample of 109 numbers should suffice when testing for random deviations in the first digit (the test's requirement on the expected counts is then satisfied for all digits). If the sample is much smaller, the results of the chi-square test may be open to challenge and the KS test may be too tolerant. In such a case one can fall back on a very laborious but exact test, for example one based on the multinomial distribution. Moreover, the values in the data set must be statistically independent of each other. (For this reason, the numbers of the Fibonacci sequence, for example, cannot be tested for goodness of fit with the chi-square test, since the result would be unreliable.)

That balance lists, invoice lists and similar statements in particular behave according to Benford's law is due to the fact that most such series of numbers are collections of figures that have passed through a wide variety of arithmetic processes and therefore behave like quasi-random numbers. If business and accounting processes are left to run freely, then above a certain company size the laws of chance take effect, and with them Benford's law. If, however, these figures are consistently influenced during an accounting period, by frequently embellishing them, making certain numbers disappear or inventing others, or even by manipulating transactions because of given limits of authority, then chance is noticeably disturbed. These disturbances show up as significant deviations from the theoretically expected digit distribution.

In practice it often turns out that the conventional significance tests are not entirely reliable for Benford analyses. In addition, the values of a data set are sometimes not completely independent of each other, which is why the chi-square test, for example, must not be used for such data sets. Significance tests better adapted to the NBL are being developed.

Example: If an employee is allowed to place orders of up to EUR 1,000 without the approval of management and, when offers exceed EUR 1,000, frequently splits the orders into several smaller items to save himself the trouble of approval, then the Benford distribution of the order amounts will show significant deviations from the theoretical expectation.

However, this example also shows that statistical methods cannot uncover isolated irregularities. A certain consistency in the manipulation is required. The larger the sample, the more sensitively a significance test generally responds to manipulation.

Test for significant deviations

Benford analyses are among the simplest analyses in mathematical statistics. The example below is the result of counting the first digits of a sample of 109 amounts from a list. The actual (observed) counts are compared with the counts expected for 109 first digits, and the chi-square test is used to examine whether the deviations found can be random or can no longer be explained by chance alone. The decision criterion in this example is that the deviations are regarded as more than random as soon as the observed distribution of first digits belongs to the 5 % of distributions with deviations this large or larger (statistical test). Since in our example 52 % of all distributions show deviations this large or larger, an auditor will not reject the hypothesis that the deviations arose by chance.
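The chi-square comparison described here can be sketched in a few lines of Python. The observed counts below are hypothetical numbers invented for illustration (they are not the counts of the example in the text); the critical value 15.507 is the 95 % quantile of the chi-square distribution with 8 degrees of freedom, which applies to nine digit classes.

```python
import math

# 95 % quantile of the chi-square distribution with 8 degrees of freedom.
CHI2_CRITICAL_5PCT_DF8 = 15.507


def chi_square_statistic(observed):
    """Chi-square statistic of first-digit counts against Benford's law."""
    n = sum(observed.values())
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (observed.get(d, 0) - expected) ** 2 / expected
    return statistic


def looks_non_random(observed):
    """True if the deviation is significant at the 5 % level."""
    return chi_square_statistic(observed) > CHI2_CRITICAL_5PCT_DF8


if __name__ == "__main__":
    # Hypothetical counts for 109 amounts, invented for this sketch.
    plausible = {1: 35, 2: 17, 3: 15, 4: 9, 5: 10, 6: 6, 7: 7, 8: 5, 9: 5}
    suspicious = {1: 15, 2: 10, 3: 10, 4: 10, 5: 12, 6: 12, 7: 13, 8: 13, 9: 14}
    for name, counts in (("plausible", plausible), ("suspicious", suspicious)):
        print(name, round(chi_square_statistic(counts), 2), looks_non_random(counts))
```

A statistic below the critical value means that the hypothesis of purely random deviations is not rejected at the 5 % level; the sketch compares against the critical value instead of computing the share of distributions with larger deviations.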

Deeper Benford analysis

For very long lists with thousands of numbers, a Benford test can be carried out on more than the first digit. Such a wealth of data also makes it possible to check the 2nd digit, the 3rd digit, the first two digits together (1+2) and possibly even the first three digits together (1+2+3) at the same time (for the latter, however, one should have at least 11,500 numbers, because otherwise the chi-square test could give uncertain results). Benford distributions exist for these tests as well, although they are somewhat more extensive. For example, the theoretical expectation for the leading digits 123... is 0.35166 %, whereas only 0.13508 % of all numbers have the leading digits 321....
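The two percentages quoted for the leading sequences 123 and 321 follow from the same logarithmic rule as the first digit, applied to the whole sequence read as one integer. The Python sketch below recomputes them; the function name is an illustrative choice.

```python
import math


def leading_sequence_probability(sequence):
    """Benford probability that a decimal number starts with the given digits.

    `sequence` is the leading digit sequence read as a positive integer
    without leading zeros, e.g. 123.
    """
    if sequence < 1:
        raise ValueError("sequence must be a positive integer")
    return math.log10(1 + 1 / sequence)


if __name__ == "__main__":
    for seq in (1, 123, 321):
        print(seq, f"{100 * leading_sequence_probability(seq):.5f} %")
```

The probabilities of all 900 three-digit leading sequences from 100 to 999 add up to one, which gives a quick sanity check on the formula.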

As a rule, the lower the place value of the digits, the more closely they follow a uniform distribution. Cent amounts follow a uniform distribution almost exactly, so the logarithmic approach is generally unnecessary for them. For very small currencies, tests of the fractional coin amounts (e.g. kopeck - RUS, Heller - CZ, Fillér - H, Lipa - HR) for uniform distribution lose their sharpness, because amounts are often rounded in practice. Large currencies (U.S. dollar, pound sterling, euro), however, mostly do allow such tests.

Estimation and planning of corporate transactions

Benford's law can also be used to estimate a company's sales figures. The orders of magnitude of the invoice amounts are assumed to follow approximately a normal distribution and the leading digits of the invoice amounts to follow the Benford distribution, so that the expected value of the leading digits (the significand, a number between 1 and 10) is about 3.91. Knowing the highest invoice amount and the number of valid invoices that make up the revenue to be estimated, a useful estimate of the turnover is possible, as the following example from practice shows. The magnitude given in the table refers to the integer part of the logarithm. The actual sales amounted to 3.2 million currency units. Sales estimates do not always come this close to the actual result, however. If the assumption of normally distributed magnitudes does not hold, an estimating distribution that is closer to the real one has to be chosen. In most cases the magnitudes of the invoice amounts then follow a log-normal distribution.
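The value of about 3.91 can be reproduced under one reading, stated here as an assumption: it matches the mean of the full significand (the number between 1 and 10 formed by all leading digits) of a Benford variable, which is 9 / ln 10, and not the mean of the first digit alone, which is about 3.44. The Python sketch below computes both quantities; the function names are illustrative.

```python
import math


def expected_first_digit():
    """Mean of the first digit alone under Benford's law."""
    return sum(d * math.log10(1 + 1 / d) for d in range(1, 10))


def expected_significand():
    """Mean of 10**U for U uniform on [0, 1), i.e. the integral 9 / ln(10)."""
    return 9 / math.log(10)


if __name__ == "__main__":
    print(round(expected_first_digit(), 4))
    print(round(expected_significand(), 4))
```

The second value follows from integrating 10^u over the unit interval, which is where the uniform distribution of the mantissas of the logarithms enters.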

Although the actual distribution of the invoice amounts coincides with that of the estimate only by chance, the estimation errors per order of magnitude almost always cancel out to a fairly small total.

This method can also be used in the planning of company revenues to check the plausibility of forecast sales, which are mostly produced by sales-oriented departments from estimates and extrapolations of past experience. To do so, one determines how many invoices are expected for reaching the planned revenue and how large the largest invoice amount will be. This analysis often shows that the estimates on which the planning is based should not be relied upon too heavily. The Benford analysis then gives the sales department reality-based feedback for correcting its expectations.

If one assumes that the logarithms of the individual sales are uniformly distributed, then the sales are, so to speak, "logarithmically uniformly distributed". With a suitable classification (e.g. nine classes, for comparison with the first digit), a histogram of the density function of the sales then looks very similar to the Benford distribution of the digit sequences.

Generating Benford-distributed leading digits

Random numbers with practically Benford-distributed leading digits can be generated quite easily on a PC.

Uniformly distributed numbers

The function [...] generates numbers with Benford-distributed leading digits, where r is a uniformly distributed random positive integer from a fixed interval and s is a uniformly distributed random number between 0 and 1.
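One simple construction that fits this description is y = 10^(r + s): the integer r sets the order of magnitude and s is the uniformly distributed mantissa of the logarithm. This is an assumption made for the sketch below and not necessarily the function meant in the text; the interval for r, the seed and all names are likewise illustrative.

```python
import math
import random


def benford_number(rng, min_magnitude=1, max_magnitude=6):
    """Draw y = 10 ** (r + s) with integer r and s uniform on [0, 1)."""
    r = rng.randint(min_magnitude, max_magnitude)  # order of magnitude
    s = rng.random()  # uniformly distributed mantissa of log10(y)
    return 10 ** (r + s)


def first_digit(x):
    """Leading decimal digit of a positive number."""
    return int(f"{x:e}"[0])


def first_digit_shares(sample):
    """Relative frequency of each leading digit in the sample."""
    counts = {d: 0 for d in range(1, 10)}
    for value in sample:
        counts[first_digit(value)] += 1
    return {d: c / len(sample) for d, c in counts.items()}


if __name__ == "__main__":
    rng = random.Random(42)
    sample = [benford_number(rng) for _ in range(20_000)]
    for d, share in first_digit_shares(sample).items():
        print(d, round(share, 3), round(math.log10(1 + 1 / d), 3))
```

Because s alone determines the leading digits, the choice of the interval for r only affects the spread of the magnitudes, not the digit distribution.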

Normally distributed numbers

The function [...], with a uniformly distributed random variable [...], generates numbers y with approximately normally distributed orders of magnitude and Benford-distributed leading digits. For practical purposes r should be chosen fairly large (r > 1000). For r < 1000 it becomes apparent, as r decreases, that the distribution of the numbers resembles a log-normal distribution in shape. For r < 50 the first digits of the generated numbers are generally no longer Benford distributed. For practical applications, the broad spread of the orders of magnitude of y that the squared tangent function produces, especially for large r, is in many cases not optimal.
