Box plot

Figure 1. Box plot of data from the Michelson–Morley experiment

In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers. In addition to the points themselves, box plots allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.
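The L-estimators named above are simple combinations of the values a box plot displays. The following Python sketch computes them from raw data; the function name `l_estimators` is illustrative, and the "inclusive" quartile method is an assumption, since several quartile definitions are in use and they can give slightly different values.

```python
import statistics


def l_estimators(data):
    """Return the L-estimators that can be read off a box plot."""
    xs = sorted(data)
    # Assumption: quartiles by the "inclusive" interpolation method.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    lowest, highest = xs[0], xs[-1]
    return {
        "interquartile_range": q3 - q1,
        "midhinge": (q1 + q3) / 2,
        "range": highest - lowest,
        "mid_range": (lowest + highest) / 2,
        "trimean": (q1 + 2 * median + q3) / 4,
    }


summary = l_estimators([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

For this symmetric sample the midhinge, mid-range, and trimean all coincide with the median.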

Types of box plots

Figure 2. Box plot with whiskers from minimum to maximum

Figure 3. The same box plot with whiskers extending at most 1.5 IQR from the box

Box-and-whisker plots are uniform in their use of the box: the bottom and top of the box are always the first and third quartiles, and the band inside the box is always the second quartile (the median). The ends of the whiskers, however, can represent one of several alternatives, among them:

the minimum and maximum of all of the data^[1] (as in figure 2)
the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile (often called the Tukey boxplot)^[2]^[3] (as in figure 3)
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.

Any data not included between the whiskers should be plotted as an outlier with a dot, small circle, or star, but occasionally this is not done.
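As an illustration of the Tukey convention listed above, the following Python sketch finds the whisker ends and the outliers for a small sample. The function name `tukey_box_stats`, the parameter `k`, and the "inclusive" quartile method are assumptions made for the example; software packages differ in how they compute quartiles.

```python
import statistics


def tukey_box_stats(data, k=1.5):
    """Box, whisker ends, and outliers under the 1.5 IQR convention."""
    xs = sorted(data)
    # Assumption: quartiles by the "inclusive" interpolation method.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme data still inside the fences.
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


stats = tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
```

Here the fences lie at -3 and 13, so the upper whisker stops at 8 and the value 30 is drawn as an individual outlier.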

Some box plots include an additional character to represent the mean of the data.^[2]

On some box plots a crosshatch is placed on each whisker, before the end of the whisker.

Rarely, box plots can be presented with no whiskers at all.

Because of this variability, it is appropriate to describe the convention being used for the whiskers and outliers in the caption for the plot.

The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to show the seven-number summary. If the data are normally distributed, the seven marks on the box plot will be approximately equally spaced.
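The spacing of the seven marks can be checked numerically for a standard normal distribution. This is a minimal sketch using Python's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Percentiles of the seven-number summary: whisker ends (2%, 98%),
# cross-hatches (9%, 91%), and the quartiles with the median.
PERCENTILES = [0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98]

std_normal = NormalDist(mu=0.0, sigma=1.0)
marks = [std_normal.inv_cdf(p) for p in PERCENTILES]

# Distance between each pair of neighbouring marks, in standard deviations.
gaps = [upper - lower for lower, upper in zip(marks, marks[1:])]
```

Each gap comes out close to two thirds of a standard deviation, so the marks are nearly, though not exactly, evenly spaced.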

Variations

Figure 4. Four box plots, with and without notches and variable width

Since the mathematician John W. Tukey introduced this type of visual data display in 1969, several variations on the traditional box plot have been described. Two of the most common are variable width box plots and notched box plots (see Figure 4).

Variable width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.^[1]
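The square-root convention can be sketched in a few lines of Python. The function name `box_widths` and the parameter `max_width` (the drawn width given to the largest group) are hypothetical choices for this example, not part of any plotting library.

```python
import math


def box_widths(group_sizes, max_width=0.8):
    """Box widths proportional to the square root of each group size."""
    roots = [math.sqrt(n) for n in group_sizes]
    largest = max(roots)
    # Scale so that the largest group is drawn with width max_width.
    return [max_width * r / largest for r in roots]


widths = box_widths([25, 100, 400])
```

A group sixteen times larger than another is therefore drawn four times as wide.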

Notched box plots apply a "notch" or narrowing of the box around the median. Notches offer a rough guide to the significance of a difference between medians: if the notches of two boxes do not overlap, this is evidence of a statistically significant difference between the medians.^[1] The width of the notches is proportional to the interquartile range (IQR) of the sample and inversely proportional to the square root of the sample size. However, there is uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples).^[1] One convention is to draw the notch at the median $\pm \frac{1.58\,IQR}{\sqrt{n}}$, where $n$ is the sample size.^[3]
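A notch under the 1.58 convention can be computed as in the following Python sketch. The function names `notch_interval` and `notches_overlap` are illustrative, and the "inclusive" quartile method is an assumption; the overlap check is only the rough visual guide described above, not a formal test.

```python
import math
import statistics


def notch_interval(data, multiplier=1.58):
    """Return (low, high) ends of the notch around the median."""
    # Assumption: quartiles by the "inclusive" interpolation method.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    half_width = multiplier * (q3 - q1) / math.sqrt(len(data))
    return median - half_width, median + half_width


def notches_overlap(sample_a, sample_b):
    """True if the notch intervals of the two samples intersect."""
    low_a, high_a = notch_interval(sample_a)
    low_b, high_b = notch_interval(sample_b)
    return low_a <= high_b and low_b <= high_a


group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
group_b = [x + 10 for x in group_a]
separated = not notches_overlap(group_a, group_b)
```

With nine observations and an IQR of 4, each notch extends about 2.1 units either side of its median, so the notches of these two groups do not overlap.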

Adjusted box plots are intended for skewed distributions. They rely on the medcouple statistic of skewness.^[4] For a medcouple value of MC, the lengths of the upper and lower whiskers are respectively defined to be

$$\begin{matrix}1.5\,IQR\cdot e^{3MC},&1.5\,IQR\cdot e^{-4MC}&\text{if }MC\geq 0\\1.5\,IQR\cdot e^{4MC},&1.5\,IQR\cdot e^{-3MC}&\text{if }MC\leq 0\end{matrix}$$

Observe that for symmetrical distributions, the medcouple will be zero, and this reduces to Tukey's boxplot with equal whisker lengths of $1.5IQR$ for both whiskers.
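The whisker-length rule for the adjusted box plot translates directly into code. In this Python sketch the function name `adjusted_whisker_lengths` is illustrative, and the medcouple is taken as an input rather than computed from data.

```python
import math


def adjusted_whisker_lengths(iqr, mc):
    """Return (upper, lower) whisker lengths for a medcouple value mc."""
    if mc >= 0:
        upper = 1.5 * iqr * math.exp(3 * mc)
        lower = 1.5 * iqr * math.exp(-4 * mc)
    else:
        upper = 1.5 * iqr * math.exp(4 * mc)
        lower = 1.5 * iqr * math.exp(-3 * mc)
    return upper, lower


symmetric = adjusted_whisker_lengths(iqr=2.0, mc=0.0)
right_skewed = adjusted_whisker_lengths(iqr=2.0, mc=0.2)
```

With a zero medcouple both lengths equal 1.5 IQR, while a positive medcouple (right skew) lengthens the upper whisker and shortens the lower one.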

Visualization

Figure 5. Box plot and probability density function (pdf) of a normal N(0, σ²) population

The box plot is a quick way of examining one or more sets of data graphically. Box plots may seem more primitive than a histogram or kernel density estimate, but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data (see Figure 1 for an example). They also require fewer tuning choices: the number and width of bins can heavily influence the appearance of a histogram, and the choice of bandwidth can heavily influence the appearance of a kernel density estimate.

Because most readers are more familiar with the shape of a statistical distribution than with a box plot, comparing the box plot against the probability density function (theoretical histogram) of a normal N(0, σ²) distribution may be a useful tool for understanding the box plot (Figure 5).
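The landmarks of such a comparison can be computed for a standard normal distribution (σ = 1), assuming the Tukey convention of whiskers reaching at most 1.5 IQR from the box. This is a minimal sketch using Python's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Box edges: first and third quartiles, in units of sigma.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Furthest points the whiskers can reach under the 1.5 IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass inside the box and inside the fences.
inside_box = std_normal.cdf(q3) - std_normal.cdf(q1)
inside_fences = std_normal.cdf(upper_fence) - std_normal.cdf(lower_fence)
```

The quartiles fall at about ±0.6745σ, the IQR is about 1.349σ, and the fences lie at about ±2.698σ, so roughly 99.3% of a normal population falls between the fences and about 0.7% would be flagged as outliers.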

References

1. McGill, Robert; Tukey, John W.; Larsen, Wayne A. (February 1978). "Variations of Box Plots". The American Statistician. 32 (1): 12–16. doi:10.2307/2683468. JSTOR 2683468.
2. Frigge, Michael; Hoaglin, David C.; Iglewicz, Boris (February 1989). "Some Implementations of the Boxplot". The American Statistician. 43 (1): 50–54. doi:10.2307/2685173. JSTOR 2685173.
3. "R: Box Plot Statistics". R manual. Retrieved 26 June 2011.
4. Hubert, M.; Vandervieren, E. (2008). "An adjusted boxplot for skewed distributions". Computational Statistics and Data Analysis. 52 (12): 5186–5201. doi:10.1016/j.csda.2007.11.008.

External links

Visual Presentation of Data by Means of Box Plots
On-line box plot calculator with explanations and examples (Has beeswarm example)
Beeswarm Boxplot - superimposing a frequency-jittered stripchart on top of a box plot
Complex online box plot creator with example data - see also BoxPlotR: a web tool for generation of box plots Spitzer et al. Nature Methods 11, 121–122 (2014)

This article is issued from Wikipedia - version of the 10/18/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.