Data binning

Binning, also called classing or classification in statistics, is the division of feature values or statistical series into separate groups, called classes. Each element of the population under study is assigned to exactly one class according to its value on the variable in question. Binning is used when the number of distinct values of an (observed) random variable is too large to be processed or displayed in a practical way. It is also appropriate when the collected values can only be regarded as approximations of the true values, or when (quasi-)continuous variables are to be examined with methods for discrete variables.

All values of a class lie between its lower and upper class limit, whose difference is the class width. The class midpoint is the "representative" value of a class used for further analysis. The class frequency, or occupation number, is the number of elements contained in the class.

Class and classification

Classes are disjoint (non-overlapping), contiguous intervals of feature values, each uniquely defined by a lower and an upper class limit.

A classification groups identical or similar feature values into a group or class. In statistical studies it is often not possible or sensible to record or process every individual (distinct) feature value or realization of the random variable under study, so a classification can give a better overview of the data. This applies in particular to continuous or quasi-continuous features, for which the number of (distinct) feature values is very large.

The disadvantage of classification is the loss of information: when only the classes are considered, the individual observed values are "lost", and only representative quantities such as the number of observations in a class or the class midpoint remain for further analysis. Within a class, the observations should be distributed as uniformly as possible over the feature values. That is, they should not cluster in only a limited region of the class, so that the class midpoint and class width are representative of the observations the class contains.

Class limit

A class limit is a value of a metrically scaled (random) variable that bounds a class from above or below. A class is defined by two class limits, the lower class limit and the upper class limit. Writing $l_j$ and $u_j$ for the lower and upper limit of the $j$-th class, the upper class limit of the $j$-th class coincides with the lower class limit of the $(j+1)$-th class, i.e.

$u_j = l_{j+1}$.

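Class limits can be assigned to classes in two ways. Either the lower class limit belongs to class $j$ and the upper class limit to class $j+1$, or the lower class limit belongs to class $j-1$ and the upper class limit to class $j$. With $l_j$ and $u_j$ denoting the lower and upper limit of class $j$, the class is therefore the interval

$[l_j, u_j)$ or $(l_j, u_j]$.

The following example illustrates the two alternatives: with class limits 0, 10 and 20, the value 10 falls into the class from 10 to under 20 under the first alternative, and into the class from above 0 up to 10 under the second.

An observed value $x_i$ of a statistical unit is thus assigned to class $j$ if $l_j \le x_i < u_j$ or $l_j < x_i \le u_j$, respectively.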
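The two assignment rules can be sketched in a few lines of Python. This is only an illustration: the function name `class_index`, the `closed` parameter and the class limits are assumptions made for the example, not part of the text above.

```python
from bisect import bisect_left, bisect_right

def class_index(x, limits, closed="left"):
    """Return the 0-based index of the class containing x, or None if x lies outside.

    limits: sorted class limits; adjacent classes share one limit.
    closed="left":  classes are [lower, upper)
    closed="right": classes are (lower, upper]
    """
    if closed == "left":
        j = bisect_right(limits, x) - 1   # a value on a limit goes to the upper class
    else:
        j = bisect_left(limits, x) - 1    # a value on a limit goes to the lower class
    return j if 0 <= j < len(limits) - 1 else None

limits = [0, 10, 20, 30]  # hypothetical class limits
print(class_index(10, limits, closed="left"))   # 1: 10 lies in [10, 20)
print(class_index(10, limits, closed="right"))  # 0: 10 lies in (0, 10]
```

The only difference between the two conventions is what happens to a value that lies exactly on a class limit. Values strictly inside a class are assigned identically.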

Class width

The class width is the difference between the upper class limit $u_j$ and the lower class limit $l_j$ of a class:

$w_j = u_j - l_j$.

The classes of a feature may have different widths. The optimal number of classes and their widths depend on the specific situation (the data and the goals of the analysis). Some "rules of thumb" for determining the number of classes, or alternatively the class width, can be found in the article on the histogram. The Jenks-Caspall algorithm provides a method for automatic classification.
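The text does not name a specific rule of thumb. As one commonly cited example, Sturges' rule suggests about $1 + \log_2 n$ classes for $n$ observations. The sketch below uses that rule purely as an assumption, together with a hypothetical helper that splits a range into classes of equal width. It does not implement the Jenks-Caspall algorithm.

```python
import math

def sturges_class_count(n):
    """Sturges' rule of thumb: ceil(1 + log2(n)) classes for n observations."""
    return math.ceil(1 + math.log2(n))

def equal_width_limits(lo, hi, k):
    """Return k + 1 class limits that split [lo, hi] into k classes of equal width."""
    width = (hi - lo) / k
    # Append hi itself so the last limit is not affected by rounding.
    return [lo + j * width for j in range(k)] + [hi]

k = sturges_class_count(32)
print(k)  # 6
print(equal_width_limits(0, 60, k))
```

Such a rule only gives a starting point. As noted above, the appropriate number of classes depends on the data and on the purpose of the analysis.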

Class midpoint

After classification, the class midpoint can be used as the representative value of a class for further analysis. If the elements of a class are distributed symmetrically over the feature values contained in that class, the midpoint can be determined as the arithmetic mean of the lower and upper class limit.
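Class width and class midpoint both follow directly from the class limits. A minimal sketch, assuming the limits are given as one sorted list in which adjacent classes share a limit. The function names and the limits are illustrative assumptions.

```python
def class_widths(limits):
    """Width of each class: upper limit minus lower limit."""
    return [hi - lo for lo, hi in zip(limits, limits[1:])]

def class_midpoints(limits):
    """Midpoint of each class: arithmetic mean of lower and upper limit."""
    return [(lo + hi) / 2 for lo, hi in zip(limits, limits[1:])]

limits = [0, 200, 300, 400, 500, 700]  # assumed class limits, unequal widths
print(class_widths(limits))     # [200, 100, 100, 100, 200]
print(class_midpoints(limits))  # [100.0, 250.0, 350.0, 450.0, 600.0]
```

The midpoint is a good representative only under the symmetry condition stated above. If the observations cluster near one end of a class, the midpoint will misrepresent them.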

Frequency density

As an example, consider the metric, continuous feature "net annual income" of a well-defined population of individuals. Because the number of people decreases as income rises, the upper income classes are typically chosen wider than the middle and lower ones so that the presentation remains clear.

If a feature is divided into classes of different widths, however, the (absolute or relative) class frequency is not very meaningful unless the class width is also given. The frequency density is therefore needed to make the classes comparable. It relates the class frequency to the class width and corresponds to the column height in a histogram. The frequency density of a class is the ratio of the absolute or relative frequency of the class to the corresponding class width.

The frequency density $d_j$ of class $j$ with class width $w_j$ is therefore

$d_j = \frac{n_j}{w_j}$

with $n_j$ the absolute frequency of class $j$, or

$d_j = \frac{f_j}{w_j}$

with $f_j$ the relative frequency of class $j$.
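The calculation can be sketched as follows. The function name, the `relative` flag, the class limits and the frequencies are assumptions chosen for illustration.

```python
def frequency_densities(limits, counts, relative=False):
    """Frequency density per class: (absolute or relative) frequency / class width."""
    n = sum(counts)
    densities = []
    for lo, hi, count in zip(limits, limits[1:], counts):
        freq = count / n if relative else count
        densities.append(freq / (hi - lo))
    return densities

limits = [0, 200, 300, 400, 500, 700]  # assumed class limits, unequal widths
counts = [5, 6, 6, 9, 6]               # assumed absolute class frequencies
print(frequency_densities(limits, counts))  # [0.025, 0.06, 0.06, 0.09, 0.03]
```

In this example the second and the last class both contain 6 observations, but the last class is twice as wide, so its density is only half as large. This is the effect that makes raw class frequencies misleading when widths differ.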

Representation of classified variables

A frequency table provides a systematic and clear way of presenting a binned continuous random variable.

The number of objects is $n$. Cross tabulations can be used to represent multidimensional frequency distributions. Classified variables can be displayed graphically as a histogram, a bar or column chart or, when there are very few classes, a pie chart.
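A simple frequency table can be built from raw values as sketched below. The data, the class limits and the function name are hypothetical, and the sketch assumes classes that include their lower limit and exclude their upper limit.

```python
from bisect import bisect_right

def frequency_table(values, limits):
    """Return rows (lower, upper, absolute frequency, relative frequency).

    Classes are [lower, upper); values outside all classes are ignored.
    """
    counts = [0] * (len(limits) - 1)
    for x in values:
        j = bisect_right(limits, x) - 1
        if 0 <= j < len(counts):
            counts[j] += 1
    n = sum(counts)
    return [(lo, hi, c, c / n if n else 0.0)
            for lo, hi, c in zip(limits, limits[1:], counts)]

values = [3, 7, 12, 15, 18, 21, 29, 10]  # hypothetical observations
limits = [0, 10, 20, 30]                 # hypothetical class limits
for lo, hi, c, f in frequency_table(values, limits):
    print(f"[{lo:>3}, {hi:>3})  {c:>3}  {f:.3f}")
```

Here the relative frequencies are computed from the number of values that actually fall into a class, so they sum to 1 even if some observations lie outside the chosen limits.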

Location parameter

Because a classification provides only intervals and no exact values, only intervals, not exact values, can be determined for the location parameters. As an example, consider the number of cars per thousand inhabitants in selected European countries.

  • Arithmetic mean
  • Quartiles
  • Mode
  • Modal class
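How such intervals arise can be sketched in code. The class frequencies (5, 6, 6, 9, 6) and midpoints (100, 250, 350, 450, 600) are taken from the arithmetic-mean calculation given later in this section. The class limits below are an assumption consistent with those midpoints, and the function names are illustrative. The modal class is taken here as the class with the highest frequency density, which is also an assumption.

```python
def mean_bounds(limits, counts):
    """Interval that must contain the arithmetic mean of the binned data."""
    n = sum(counts)
    lower = sum(c * lo for c, lo in zip(counts, limits[:-1])) / n
    upper = sum(c * hi for c, hi in zip(counts, limits[1:])) / n
    return lower, upper

def quantile_class(limits, counts, p):
    """Class in which the cumulative frequency first reaches the share p."""
    n = sum(counts)
    cumulative = 0
    for lo, hi, c in zip(limits, limits[1:], counts):
        cumulative += c
        if cumulative >= p * n:
            return lo, hi

def modal_class(limits, counts):
    """Class with the highest frequency density (assumed definition)."""
    densities = [c / (hi - lo) for lo, hi, c in zip(limits, limits[1:], counts)]
    j = densities.index(max(densities))
    return limits[j], limits[j + 1]

limits = [0, 200, 300, 400, 500, 700]  # assumed class limits
counts = [5, 6, 6, 9, 6]               # class frequencies from the example
print(mean_bounds(limits, counts))            # (300.0, 434.375)
print(quantile_class(limits, counts, 0.25))   # (200, 300)
print(quantile_class(limits, counts, 0.5))    # (300, 400)
print(quantile_class(limits, counts, 0.75))   # (400, 500)
print(modal_class(limits, counts))            # (400, 500)
```

The bounds for the mean are obtained by placing every observation at the lower or at the upper limit of its class, which are the two extreme cases compatible with the binned data.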

Note: A frequency distribution is often considered under the following additional assumptions:

  • The values within each class are uniformly distributed, i.e. adjacent values are a distance of class width / frequency = 1 / frequency density apart.
  • The values within each class are symmetric about the class midpoint.

With a finer analysis and geometric considerations (e.g. applying the intercept theorem), concrete values for the location parameters can then be determined. Alternatively, the two assumptions define a unique raw data list.

In the example, the following unique raw data list can be constructed:

From this list, the following values are then obtained:
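The list itself is not reproduced here, but under the two assumptions it can be regenerated from the class limits and frequencies. A minimal sketch: the frequencies are those used in the arithmetic-mean calculation in this section, while the class limits and the function name are assumptions consistent with the midpoints 100, 250, 350, 450 and 600.

```python
def reconstruct_values(limits, counts):
    """Place the values of each class evenly and symmetrically about its midpoint."""
    values = []
    for lo, hi, c in zip(limits, limits[1:], counts):
        if c == 0:
            continue
        step = (hi - lo) / c  # spacing = class width / frequency
        # Offsetting by half a step keeps the values symmetric within the class.
        values.extend(lo + (i + 0.5) * step for i in range(c))
    return values

limits = [0, 200, 300, 400, 500, 700]  # assumed class limits
counts = [5, 6, 6, 9, 6]               # class frequencies from the example
values = reconstruct_values(limits, counts)
print(len(values))                          # 32
print([round(v, 2) for v in values[5:11]])  # values of the second class
```

For the second class this yields values spaced 100 / 6 apart, including 241.67 and 258.33, the two values that appear in the first-quartile calculation.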

  • Arithmetic mean = (5 * 100 + 6 * 250 + 6 * 350 + 9 * 450 + 6 * 600) / 32 = 367.1875
  • 1st quartile = (241.67 + 258.33) / 2 = 250
  • 2nd quartile = median = (375 + 391.67) / 2 = 383.33
  • 3rd quartile = (472.22 + 483.33) / 2 = 477.78
  • Every value is a mode, because each value occurs exactly once

Measures of dispersion can then also be calculated from such a unique raw data list.
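As a rough check, the values above and a measure of dispersion can be computed from a regenerated list. The class limits are an assumption consistent with the midpoints in the arithmetic-mean calculation. The quartiles are taken as the average of the two values adjacent to positions n/4 and 3n/4, as in the list above, which works here because n = 32 is divisible by 4. The population standard deviation is just one possible choice of dispersion measure.

```python
import statistics

def reconstruct_values(limits, counts):
    """Evenly spaced values per class, symmetric about the class midpoint."""
    values = []
    for lo, hi, c in zip(limits, limits[1:], counts):
        step = (hi - lo) / c
        values.extend(lo + (i + 0.5) * step for i in range(c))
    return values

limits = [0, 200, 300, 400, 500, 700]  # assumed class limits
counts = [5, 6, 6, 9, 6]               # class frequencies from the example
values = reconstruct_values(limits, counts)
n = len(values)

mean = statistics.mean(values)
q1 = (values[n // 4 - 1] + values[n // 4]) / 2
median = statistics.median(values)
q3 = (values[3 * n // 4 - 1] + values[3 * n // 4]) / 2
spread = statistics.pstdev(values)     # population standard deviation

print(round(mean, 4), round(q1, 2), round(median, 2), round(q3, 2))
print(round(spread, 2))
```

Any dispersion value obtained this way depends on the two assumptions. It describes the reconstructed list, not necessarily the original observations.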
