Boxplot

The boxplot (also called a box-and-whisker plot or box chart) is a diagram used to represent graphically the distribution of cardinally scaled (metric) data. It combines several robust measures of dispersion and central tendency in a single display. A boxplot is meant to give a quick impression of the range in which the data lie and of how they are distributed over that range. To this end it shows all values of the so-called five-number summary: the median, the two quartiles and the two extreme values.
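The five-number summary can be computed in a few lines. The following Python sketch is only an illustration: the function name `five_number_summary` is hypothetical, and the quartiles are taken as the medians of the lower and upper half of the sorted data, which is one of several common quartile conventions and not necessarily the one used for any particular diagram.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    if not data:
        raise ValueError("need at least one value")
    half = len(data) // 2
    # Assumed convention: quartiles are the medians of the lower and upper
    # half; for an odd number of values the middle value is left out of both.
    lower_half = data[:half] or data
    upper_half = data[-half:] if half else data
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])


print(five_number_summary([2, 4, 4, 5, 7, 8, 9, 11]))
# (2, 4.0, 6.0, 8.5, 11)
```

Other quartile definitions (for example those based on linear interpolation) can give slightly different box edges for the same data.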

Construction

A boxplot always consists of a rectangle, called the box, and two lines that extend this rectangle. These lines are called "antennas", or more rarely "feelers" or "whiskers", and each ends in a short stroke. As a rule, the line inside the box represents the median of the distribution.

Box

The box corresponds to the range in which the middle 50 % of the data lie. It is bounded by the lower and upper quartiles, so its length equals the interquartile range (IQR). The IQR is a measure of the spread of the data and is defined as the difference between the upper and lower quartiles. In addition, the median is drawn as a continuous line inside the box. This line splits the data into two halves, each containing 50 % of the values. Its position within the box therefore gives an impression of the skewness of the underlying distribution: if the median lies in the left part of the box, the distribution is skewed to the right, and vice versa.
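The reading rule for skewness can be written as a small check. This is a minimal sketch under stated assumptions: the function name `skew_hint` is hypothetical, the box is assumed to be drawn horizontally with values increasing to the right, and the result is only a visual hint, not a formal skewness statistic.

```python
def skew_hint(q1, med, q3):
    """Guess the direction of skew from where the median sits in the box."""
    lower_part = med - q1   # box length to the left of the median
    upper_part = q3 - med   # box length to the right of the median
    if lower_part < upper_part:
        return "right-skewed"      # median in the left part of the box
    if lower_part > upper_part:
        return "left-skewed"       # median in the right part of the box
    return "roughly symmetric"


print(skew_hint(10, 11, 14))   # right-skewed
print(skew_hint(10, 13, 14))   # left-skewed
print(skew_hint(10, 12, 14))   # roughly symmetric
```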

Antennas (whiskers)

The antennas represent the values that lie outside the box. Unlike the box, the antennas have no uniform definition.

One possible definition, due to John W. Tukey, limits the length of each whisker to at most 1.5 times the interquartile range (1.5 × IQR). The whisker does not end exactly at that distance, however, but at the most distant data value that still lies within the limit. The length of the whiskers is thus determined by the data values and not by the interquartile range alone. This is also why the two whiskers need not be of equal length. If no values lie beyond the 1.5 × IQR limit, the whiskers end at the minimum and maximum values. Otherwise, the values beyond the whiskers are plotted individually. Such values can be treated as suspected outliers or labelled as outliers outright.
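The rule described above can be sketched in Python as follows. The function name `tukey_whiskers` and the parameter `k` are hypothetical; the quartiles are passed in explicitly so that the sketch does not depend on any particular quartile convention, and it assumes that at least one data value lies within the limits.

```python
def tukey_whiskers(values, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, values beyond the whiskers)."""
    iqr = q3 - q1
    low_limit = q1 - k * iqr    # furthest point the lower whisker may reach
    high_limit = q3 + k * iqr   # furthest point the upper whisker may reach
    inside = [v for v in values if low_limit <= v <= high_limit]
    outside = sorted(v for v in values if v < low_limit or v > high_limit)
    # Each whisker ends at an actual data value, not at the limit itself.
    return min(inside), max(inside), outside


values = [2, 10, 11, 12, 13, 14, 15, 30]
print(tukey_whiskers(values, q1=10.5, q3=14.5))
# (10, 15, [2, 30])
```

Here the limits are 4.5 and 20.5, but the whiskers stop at the data values 10 and 15, while 2 and 30 are reported separately.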

Outliers lying between 1.5 × IQR and 3 × IQR from the box are often called "mild" outliers, and values more than 3 × IQR away are called "extreme" outliers. The two kinds are then usually marked differently in the diagram.
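A minimal Python sketch of this classification is given below. The function name `classify_outlier` is hypothetical, distances are measured from the nearer edge of the box, and a value lying exactly on a limit is assumed to belong to the milder class.

```python
def classify_outlier(value, q1, q3):
    """Classify a value as 'none', 'mild' or 'extreme' relative to the box."""
    iqr = q3 - q1
    # Distance from the nearer box edge; zero for values inside the box.
    distance = max(q1 - value, value - q3, 0)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return "none"


# Box from 10 to 14, so IQR = 4, 1.5 x IQR = 6 and 3 x IQR = 12.
for v in (12, 20, 3, -3):
    print(v, classify_outlier(v, q1=10, q3=14))
# 12 none
# 20 none
# 3 mild
# -3 extreme
```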

Another possible definition lets the whiskers extend to the largest and smallest values in the data. In this representation no outliers are visible, since the box together with the whiskers covers the entire range of the data.

In a further variant, the lower whisker is placed at the 2.5 % quantile and the upper whisker at the 97.5 % quantile. Thus 95 % of all observed values lie within the whisker limits. In this representation, from a certain sample size onward (depending on the quantile definition), there are always points that are drawn individually; these should not automatically be interpreted as outliers.
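The effect can be illustrated with Python's standard `statistics.quantiles` function. The helper name `percentile_whiskers` is hypothetical, and the "inclusive" interpolation method is only one possible quantile definition; other definitions shift the limits slightly.

```python
from statistics import quantiles


def percentile_whiskers(values):
    """Return the 2.5 % and 97.5 % quantiles as whisker ends."""
    # n=40 yields 39 cut points in steps of 2.5 %; the first and last
    # are the 2.5 % and 97.5 % quantiles.
    cuts = quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


data = list(range(101))   # evenly spread values 0..100, no unusual points
low, high = percentile_whiskers(data)
outside = [v for v in data if v < low or v > high]
print(low, high, outside)
# 2.5 97.5 [0, 1, 2, 98, 99, 100]
```

Even though these data are perfectly regular, six points fall beyond the whiskers, which shows why such points should not be read as outliers by default.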

Modifications

One variation is to plot the arithmetic mean in the boxplot as well. It is usually drawn as a star. Since the boxplot otherwise contains only robust measures of dispersion and central tendency, the arithmetic mean, as a non-robust measure of central tendency, does not strictly belong in a boxplot.

In a notched boxplot, confidence intervals for the median are also shown.
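The text does not say how the notch is computed. One widely used convention, attributed to McGill, Tukey and Larsen, places the notch at the median ± 1.57 × IQR / √n; the Python sketch below assumes that convention, and the function name `notch_limits` is hypothetical.

```python
from math import sqrt


def notch_limits(med, q1, q3, n):
    """Approximate notch limits around the median (assumed 1.57 * IQR / sqrt(n) rule)."""
    half_width = 1.57 * (q3 - q1) / sqrt(n)
    return med - half_width, med + half_width


low, high = notch_limits(med=50, q1=40, q3=60, n=100)
print(round(low, 2), round(high, 2))
# 46.86 53.14
```

Under this convention, non-overlapping notches of two boxplots are commonly read as a rough indication that the medians differ.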

Summary of characteristics

The advantage of boxplots is that certain characteristics of a distribution can be read directly from the graph.

Application

Because they are simple to construct, boxplots are mainly used when a quick overview of existing data is wanted. The distribution underlying the data need not be known. The box marks the range containing the middle 50 % of the data, and the box together with the whiskers marks the range containing the bulk of the data. The position of the median within the box shows whether a distribution is symmetric or skewed. The boxplot is less suitable for bimodal or multimodal distributions. To detect such features, histograms or plots of kernel density estimates are recommended.

Boxplots whose whiskers are limited to one and a half times the interquartile range are also useful for identifying possible outliers and for judging whether the data follow a particular distribution. If the boxplot is strongly asymmetric, or contains unusually many outliers or outliers far from the box, this suggests, for instance, that the data are not normally distributed.

The main advantage of the boxplot is that distributions in different subgroups can be compared quickly. Whereas a histogram extends in two dimensions, a boxplot is essentially one-dimensional, so several data sets can easily be drawn side by side (or one above the other in a horizontal layout) on the same scale and compared.

Example

This example is based on a series of measurements with the following 20 data points:

A boxplot gives a very quick overview of these data. One sees directly that the median (solid line) lies exactly at 8.5 and that 25 % of the data lie below 7 and 25 % above 9.5, because these are the edges of the box, which contains the middle 50 % of the measured values. Consequently the interquartile range, which equals the length of the box, is exactly 2.5.

This boxplot was drawn with whiskers of at most 1.5 times the interquartile range, so they are at most 3.75 units long. A whisker, however, only ever extends to a data value that still lies within those 3.75 units. The upper whisker therefore reaches only to 10, since the data contain no larger value, and the lower whisker only to 5, since the next smaller value lies more than 3.75 units from the lower edge of the box.

The values 1 and 3 are marked as outliers in the boxplot because they lie outside the box and the whiskers. Such values should be examined to determine whether they are genuine outliers, typing errors or otherwise unusual values.

Since the median lies slightly to the right within the box, one can also infer a left skew in the underlying distribution of the measured data. The distribution is also unlikely to be a normal distribution, since the boxplot is asymmetric and contains a relatively large number of outliers.
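The figures quoted in this example can be checked with a short Python sketch. The data set below is an assumption: it is not taken from the text but is a hypothetical series of 20 values chosen so that it reproduces the quoted median, quartiles, whisker ends and outliers. The function name `boxplot_stats` is likewise hypothetical, and the quartiles are taken as the medians of the lower and upper half of the sorted data.

```python
from statistics import median

# Hypothetical sample of 20 values, chosen to match the numbers in the text;
# it is NOT claimed to be the original series of measurements.
data = [9, 6, 7, 7, 3, 9, 10, 1, 8, 7, 9, 9, 8, 10, 5, 10, 10, 9, 10, 8]


def boxplot_stats(values, k=1.5):
    """Compute box, whiskers and outliers (needs at least two values)."""
    s = sorted(values)
    half = len(s) // 2
    # Assumed convention: quartiles are medians of the lower and upper half.
    q1, med, q3 = median(s[:half]), median(s), median(s[-half:])
    iqr = q3 - q1
    low_limit, high_limit = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in s if low_limit <= v <= high_limit]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in s if v < low_limit or v > high_limit],
    }


stats = boxplot_stats(data)
print(stats["q1"], stats["median"], stats["q3"], stats["iqr"])  # 7.0 8.5 9.5 2.5
print(stats["whiskers"])                                        # (5, 10)
print(stats["outliers"])                                        # [1, 3]
```

The lower limit is 7 − 3.75 = 3.25, so the lower whisker stops at 5 and the values 1 and 3 are reported as outliers, in agreement with the description above.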
