Akaike information criterion
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.
AIC is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model.
AIC does not provide a test of a model in the sense of testing a null hypothesis, so it can tell nothing about the quality of the model in an absolute sense. If all the candidate models fit poorly, AIC will not give any warning of that.
Definition
Suppose that we have a statistical model of some data. Let L be the maximum value of the likelihood function for the model, and let k be the number of estimated parameters in the model. Then the AIC value of the model is the following:[1][2]
AIC = 2k − 2 ln(L)
Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, because increasing the number of parameters in the model almost always improves the calculated goodness of the fit.
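As a minimal sketch of the definition above (the function name and the example log-likelihood values are hypothetical; the only inputs are the maximized log-likelihood ln(L) and the parameter count k):

```python
def aic(max_log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2 ln(L), where max_log_likelihood is ln(L)."""
    return 2 * k - 2 * max_log_likelihood

# Hypothetical candidates: (maximized log-likelihood, number of estimated parameters).
candidates = {"model A": (-120.3, 3), "model B": (-118.9, 5)}
aic_values = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(aic_values, key=aic_values.get)   # the preferred model has the minimum AIC
```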
AIC is founded in information theory. Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: g1 and g2. If we knew f, then we could find the information lost from using g1 to represent f by calculating the Kullback–Leibler divergence, DKL(f ‖ g1); similarly, the information lost from using g2 to represent f could be found by calculating DKL(f ‖ g2). We would then choose the candidate model that minimized the information loss.
We cannot choose with certainty, because we do not know f. Akaike (1974) showed, however, that we can estimate, via AIC, how much more (or less) information is lost by g1 than by g2. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary (see AICc, below).
How to apply AIC in practice
To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the "true" model (i.e. the process that generates the data). We wish to select, from among the candidate models, the model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss.
Suppose that there are R candidate models. Denote the AIC values of those models by AIC1, AIC2, AIC3, …, AICR. Let AICmin be the minimum of those values. Then exp((AICmin − AICi)/2) can be interpreted as being proportional to the probability that the ith model minimizes the (estimated) information loss.[3]
As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is exp((100 − 102)/2) = 0.368 times as probable as the first model to minimize the information loss. Similarly, the third model is exp((100 − 110)/2) = 0.007 times as probable as the first model to minimize the information loss.
In this example, we would omit the third model from further consideration. We then have three options: (1) gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) simply conclude that the data is insufficient to support selecting one model from among the first two; (3) take a weighted average of the first two models, with weights proportional to 1 and 0.368, respectively, and then do statistical inference based on the weighted multimodel.[4]
The quantity exp((AICmin − AICi)/2) is the relative likelihood of model i.
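The relative likelihoods from the three-model example above can be reproduced with a short calculation; the sketch below uses only the AIC values already given in the text.

```python
from math import exp

aic_values = [100.0, 102.0, 110.0]             # the three candidate models from the example
aic_min = min(aic_values)

# Relative likelihood of model i: exp((AIC_min - AIC_i) / 2)
relative_likelihoods = [exp((aic_min - a) / 2) for a in aic_values]
print(relative_likelihoods)                    # [1.0, 0.3678..., 0.0067...]
```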
If all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models, whereas AIC (and AICc) has no such restriction.[5]
AICc
AICc is AIC with a correction for finite sample sizes. The formula for AICc depends upon the statistical model. Assuming that the model is univariate, linear, and has normally-distributed residuals (conditional upon regressors), the formula for AICc is as follows:[4][6]
AICc = AIC + 2k(k + 1)/(n − k − 1)
where n denotes the sample size and k denotes the number of parameters.
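As a sketch of this formula (under the same univariate linear model with normal residuals; the function name is illustrative):

```python
def aicc(max_log_likelihood: float, k: int, n: int) -> float:
    """AICc = AIC + 2k(k + 1) / (n - k - 1); requires n > k + 1."""
    aic = 2 * k - 2 * max_log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)
```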
If the assumption of a univariate linear model with normal residuals does not hold, then the formula for AICc will generally change. Further discussion of the formula, with examples of other assumptions, is given by Burnham & Anderson (2002, ch. 7) and Konishi & Kitagawa (2008, ch. 7–8). In particular, with other assumptions, bootstrap estimation of the formula is often feasible.
AICc is essentially AIC with a greater penalty for extra parameters. Using AIC, instead of AICc, when n is not many times larger than k², increases the probability of selecting models that have too many parameters, i.e. of overfitting. In some cases, the probability of AIC overfitting can be substantial.[7][8]
Brockwell & Davis (1991, p. 273) advise using AICc as the primary criterion in selecting the orders of an ARMA model for time series. McQuarrie & Tsai (1998) ground their high opinion of AICc on extensive simulation work with regression and time series. Burnham & Anderson (2004) note that, since AICc converges to AIC as n gets large, AICc—rather than AIC—should generally be employed.
Note that if all the candidate models have the same k, then AICc and AIC will give identical (relative) valuations; hence, there will be no disadvantage in using AIC instead of AICc. Furthermore, if n is many times larger than k², then the correction will be negligible; hence, there will be negligible disadvantage in using AIC instead of AICc.
History
The Akaike information criterion was developed by Hirotugu Akaike, originally under the name "an information criterion". It was first announced by Akaike at a 1971 symposium, the proceedings of which were published in 1973.[9] The 1973 publication, though, was only an informal presentation of the concepts.[10] The first formal publication was in a 1974 paper by Akaike.[2] As of October 2014, the 1974 paper had received more than 14,000 citations in the Web of Science, making it the 73rd most-cited research paper of all time.[11]
The initial derivation of AIC relied upon some strong assumptions. Takeuchi (1976) showed that the assumptions could be made much weaker. Takeuchi's work, however, was in Japanese and was not widely known outside Japan for many years.
AICc was originally proposed for linear regression (only) by Sugiura (1978). That instigated the work of Hurvich & Tsai (1989), and several further papers by the same authors, which extended the situations in which AICc could be applied. The work of Hurvich & Tsai contributed to the decision to publish a second edition of the volume by Brockwell & Davis (1991), which is the standard reference for linear time series; the second edition states, "our prime criterion for model selection [among ARMA models] will be the AICc".[12]
The first general exposition of the information-theoretic approach was the volume by Burnham & Anderson (2002). It includes an English presentation of the work of Takeuchi. The volume led to far greater use of AIC, and it now has more than 31000 citations on Google Scholar.
Akaike originally called his approach an "entropy maximization principle", because the approach is founded on the concept of entropy in information theory. Indeed, minimizing AIC in a statistical model is effectively equivalent to maximizing entropy in a thermodynamic system; in other words, the information-theoretic approach in statistics is essentially applying the Second Law of Thermodynamics. As such, AIC has roots in the work of Ludwig Boltzmann on entropy. For more on these issues, see Akaike (1985) and Burnham & Anderson (2002, ch. 2).
Usage tips
Counting parameters
A statistical model must fit all the data points. Thus, a straight line, on its own, is not a model of the data, unless all the data points lie exactly on the line. We can, however, choose a model that is "a straight line plus noise"; such a model might be formally described thus: yi = b0 + b1xi + εi. Here, the εi are the residuals from the straight line fit. If the εi are assumed to be i.i.d. Gaussian (with zero mean), then the model has three parameters: b0, b1, and the variance of the Gaussian distributions. Thus, when calculating the AIC value of this model, we should use k=3. More generally, for any least squares model with i.i.d. Gaussian residuals, the variance of the residuals’ distributions should be counted as one of the parameters.[13]
As another example, consider a first-order autoregressive model, defined by xi = c + φxi−1 + εi, with the εi being i.i.d. Gaussian (with zero mean). For this model, there are three parameters: c, φ, and the variance of the εi. More generally, a pth-order autoregressive model has p + 2 parameters. (If, however, c is not estimated, but given in advance, then there are only p + 1 parameters.)
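To illustrate the parameter count for the straight-line-plus-noise model above, here is a minimal sketch, assuming NumPy and data held in arrays x and y; the function name is hypothetical. It uses k = 3, counting b0, b1, and the noise variance.

```python
import numpy as np

def aic_straight_line(x: np.ndarray, y: np.ndarray) -> float:
    """AIC for the "straight line plus i.i.d. Gaussian noise" model.
    k = 3: the intercept b0, the slope b1, and the noise variance."""
    n = len(y)
    b1, b0 = np.polyfit(x, y, 1)                       # least-squares estimates
    rss = float(np.sum((y - (b0 + b1 * x)) ** 2))
    sigma2 = rss / n                                   # ML estimate of the noise variance
    max_log_likelihood = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = 3
    return 2 * k - 2 * max_log_likelihood
```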
Transforming data
The AIC values of the candidate models must all be computed with the same data set. Sometimes, though, we might want to compare a model of the data with a model of the logarithm of the data; more generally, we might want to compare a model of the data with a model of transformed data. Here is an illustration of how to deal with data transforms (adapted from Burnham & Anderson (2002, §2.11.3)).
Suppose that we want to compare two models: a normal distribution of the data and a normal distribution of the logarithm of the data. We should not directly compare the AIC values of the two models. Instead, we should transform the normal cumulative distribution function to first take the logarithm of the data. To do that, we need to perform the relevant integration by substitution: thus, we need to multiply by the derivative of the (natural) logarithm function, which is 1/x. Hence, the transformed distribution has the following probability density function:
f(x) = (1/x) · (1/(σ√(2π))) · exp(−(ln x − μ)²/(2σ²))
where μ and σ denote the mean and standard deviation of the normal distribution of the logarithm of the data. This is the probability density function for the log-normal distribution. We then compare the AIC value of the normal model against the AIC value of the log-normal model.
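To make the comparison concrete, here is a sketch, assuming NumPy and strictly positive data in an array y, that computes AIC for the normal model and for the log-normal model on the same, untransformed data scale; the function names are illustrative.

```python
import numpy as np

def aic_normal(y: np.ndarray) -> float:
    """AIC for a normal model of y; k = 2 (mean and variance)."""
    n = len(y)
    sigma2 = np.var(y)                                  # ML estimate (divides by n)
    max_log_likelihood = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * 2 - 2 * max_log_likelihood

def aic_lognormal(y: np.ndarray) -> float:
    """AIC for a log-normal model of y, with the likelihood expressed on the
    original data scale; the Jacobian term -sum(ln y) comes from the 1/x factor."""
    n = len(y)
    z = np.log(y)                                       # y must be strictly positive
    sigma2 = np.var(z)
    max_log_likelihood = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1) - np.sum(z)
    return 2 * 2 - 2 * max_log_likelihood
```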
Software unreliability
Some statistical software will report the value of AIC or the maximum value of the log-likelihood function, but the reported values are not always correct. Typically, any incorrectness is due to a constant in the log-likelihood function being omitted. For example, the log-likelihood function for n independent identical normal distributions (with mean μ and variance σ²) is
ln(L) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ(xi − μ)²
where the sum runs over the n observations xi. This is the function that is maximized when obtaining the value of AIC. Some software, however, omits the constant term (n/2) ln(2π), and so reports erroneous values for the log-likelihood maximum, and thus for AIC. Such errors do not matter for AIC-based comparisons if all the models have normally-distributed residuals, because then the errors cancel out. In general, however, the constant term needs to be included in the log-likelihood function.[14] Hence, before using software to calculate AIC, it is generally good practice to run some simple tests on the software, to ensure that the function values are correct.
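One such simple test is to evaluate the full log-likelihood, constant term included, at the software's reported parameter estimates and compare it with the reported maximum; a sketch assuming NumPy:

```python
import numpy as np

def full_log_likelihood(y: np.ndarray, mu: float, sigma2: float) -> float:
    """Log-likelihood of n i.i.d. N(mu, sigma2) observations, including the
    constant term -(n/2) ln(2*pi) that some software omits."""
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - np.sum((y - mu) ** 2) / (2 * sigma2))
```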
Comparisons with other model selection methods
Comparison with BIC
The AIC penalizes the number of parameters less strongly than does the Bayesian information criterion (BIC). A comparison of AIC/AICc and BIC is given by Burnham & Anderson (2002, §6.4). The authors show that AIC and AICc can be derived in the same Bayesian framework as BIC, just by using a different prior. They also argue that AIC/AICc has theoretical advantages over BIC. First, AIC/AICc is derived from principles of information; BIC is not, despite its name. Second, the (Bayesian-framework) derivation of BIC has a prior of 1/R (where R is the number of candidate models), which is "not sensible", since the prior should be a decreasing function of k. Additionally, the authors present a few simulation studies that suggest AICc tends to have practical/performance advantages over BIC. See also Burnham & Anderson (2004).
Further comparison of AIC and BIC, in the context of regression, is given by Yang (2005). In particular, AIC is asymptotically optimal in selecting the model with the least mean squared error, under the assumption that the exact "true" model is not in the candidate set (as is virtually always the case in practice); BIC is not asymptotically optimal under that assumption. Yang additionally shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible.
For a more detailed comparison of AIC and BIC, see Vrieze (2012) and Aho et al. (2014).
Comparison with least squares
Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions (with zero mean). That gives rise to least squares model fitting.
In this case, the maximum likelihood estimate for the variance of a model's residuals distributions, σ², is RSS/n, where RSS is the residual sum of squares: RSS = Σ(yi − ŷi)², with ŷi denoting the fitted values. Then, the maximum value of a model's log-likelihood function is
−(n/2) ln(RSS/n) + C1
where C1 is a constant independent of the model, and dependent only on the particular data points, i.e. it does not change if the data does not change.
That gives AIC = 2k + n ln(RSS/n) − 2C1 = 2k + n ln(RSS) + C2.[15] Because only differences in AIC are meaningful, the constant C2 can be ignored, which conveniently allows us to take AIC = 2k + n ln(RSS) for model comparisons. Note that if all the models have the same k, then selecting the model with minimum AIC is equivalent to selecting the model with minimum RSS—which is a common objective of least squares fitting.
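A sketch of the resulting comparison formula (the function name is illustrative; the dropped constant is the same for every model fitted to the same data):

```python
from math import log

def aic_least_squares(rss: float, n: int, k: int) -> float:
    """AIC = 2k + n ln(RSS/n), up to a constant shared by all models fitted
    to the same data; k includes the estimated noise variance."""
    return 2 * k + n * log(rss / n)
```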
Comparison with cross-validation
Leave-one-out cross-validation is asymptotically equivalent to the AIC, for ordinary linear regression models.[16] Such asymptotic equivalence also holds for mixed-effects models.[17]
Comparison with Mallows's Cp
Mallows's Cp is equivalent to AIC in the case of (Gaussian) linear regression.[18]
See also
- Deviance information criterion
- Focused information criterion
- Hannan–Quinn information criterion
- Occam's razor
- Principle of maximum entropy
Notes
1. Burnham & Anderson 2002, §2.2
2. Akaike 1974
3. Burnham & Anderson 2002, §2.9.1, §6.4.5
4. Burnham & Anderson 2002
5. Burnham & Anderson 2002, §2.12.4
6. Cavanaugh 1997
7. Claeskens & Hjort 2008, §8.3
8. Giraud 2015, §2.9.1
9. Akaike 1973
10. deLeeuw 1992
11. Van Noorden, R.; Maher, B.; Nuzzo, R. (2014), "The top 100 papers", Nature, 514.
12. Brockwell & Davis 1991, p. 273
13. Burnham & Anderson 2002, p. 63
14. Burnham & Anderson 2002, p. 82
15. Burnham & Anderson 2002, p. 63
16. Stone 1977
17. Fang 2011
18. Boisbunon et al. 2014
References
- Aho, K.; Derryberry, D.; Peterson, T. (2014), "Model selection for ecologists: the worldviews of AIC and BIC", Ecology, 95: 631–636, doi:10.1890/13-1452.1.
- Akaike, H. (1973), "Information theory and an extension of the maximum likelihood principle", in Petrov, B.N.; Csáki, F., 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, September 2-8, 1971, Budapest: Akadémiai Kiadó, pp. 267–281.
- Akaike, H. (1974), "A new look at the statistical model identification", IEEE Transactions on Automatic Control, 19 (6): 716–723, doi:10.1109/TAC.1974.1100705, MR 0423716.
- Akaike, H. (1985), "Prediction and entropy", in Atkinson, A.C.; Fienberg, S.E., A Celebration of Statistics, Springer, pp. 1–24.
- Boisbunon, A.; Canu, S.; Fourdrinier, D.; Strawderman, W.; Wells, M. T. (2014), "Akaike's Information Criterion, Cp and estimators of loss for elliptically symmetric distributions", International Statistical Review, 82: 422–439, doi:10.1111/insr.12052.
- Brockwell, P. J.; Davis, R. A. (1987), Time Series: Theory and Methods, Springer, ISBN 0387964061.
- Brockwell, P. J.; Davis, R. A. (1991), Time Series: Theory and Methods (2nd ed.), Springer, ISBN 0387974296. Republished in 2009: ISBN 1441903194.
- Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed.), Springer-Verlag, ISBN 0-387-95364-7.
- Burnham, K. P.; Anderson, D. R. (2004), "Multimodel inference: understanding AIC and BIC in Model Selection" (PDF), Sociological Methods & Research, 33: 261–304, doi:10.1177/0049124104268644.
- Cavanaugh, J. E. (1997), "Unifying the derivations of the Akaike and corrected Akaike information criteria", Statistics & Probability Letters, 31: 201–208, doi:10.1016/s0167-7152(96)00128-9.
- Claeskens, G.; Hjort, N. L. (2008), Model Selection and Model Averaging, Cambridge University Press.
- deLeeuw, J. (1992), "Introduction to Akaike (1973) information theory and an extension of the maximum likelihood principle" (PDF), in Kotz, S.; Johnson, N.L., Breakthroughs in Statistics I, Springer, pp. 599–609.
- Fang, Yixin (2011), "Asymptotic equivalence between cross-validations and Akaike Information Criteria in mixed-effects models" (PDF), Journal of Data Science, 9: 15–21.
- Giraud, C. (2015), Introduction to High-Dimensional Statistics, CRC Press.
- Hurvich, C. M.; Tsai, C.-L. (1989), "Regression and time series model selection in small samples", Biometrika, 76: 297–307, doi:10.1093/biomet/76.2.297.
- Konishi, S.; Kitagawa, G. (2008), Information Criteria and Statistical Modeling, Springer.
- McQuarrie, A. D. R.; Tsai, C.-L. (1998), Regression and Time Series Model Selection, World Scientific, ISBN 981-02-3242-X.
- Stone, M. (1977), "An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion", Journal of the Royal Statistical Society: Series B (Methodological), 39 (1): 44–47, JSTOR 2984877.
- Sugiura, N. (1978), "Further analysis of the data by Akaike's information criterion and the finite corrections", Communications in Statistics - Theory and Methods, A7: 13–26.
- Takeuchi, K. (1976), "Distribution of informational statistics and a criterion of model fitting" (in Japanese), Suri-Kagaku [Mathematical Sciences], 153: 12–18.
- Vrieze, S. I. (2012), "Model selection and psychological theory: a discussion of the differences between the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC)", Psychological Methods, 17: 228–243, doi:10.1037/a0027127.
- Yang, Y. (2005), "Can the strengths of AIC and BIC be shared?", Biometrika, 92: 937–950, doi:10.1093/biomet/92.4.937.
Further reading
- Anderson, D. R. (2008), Model Based Inference in the Life Sciences, Springer.
- Burnham, K. P.; Anderson, D. R.; Huyvaert, K. P. (2011), "AIC model selection and multimodel inference in behavioral ecology" (PDF), Behavioral Ecology and Sociobiology, 65: 23–35, doi:10.1007/s00265-010-1029-6.
- Pan, W. (2001), "Akaike's information criterion in generalized estimating equations", Biometrics, 57: 120–125, doi:10.1111/j.0006-341X.2001.00120.x.
- Parzen, E.; Tanabe, K.; Kitagawa, G., eds. (1998), Selected Papers of Hirotugu Akaike, Springer, doi:10.1007/978-1-4612-1694-0.
- Saefken, B.; Kneib, T.; van Waveren, C.-S.; Greven, S. (2014), "A unifying approach to the estimation of the conditional Akaike information in generalized linear mixed models", Electronic Journal of Statistics, 8: 201–225, doi:10.1214/14-EJS881.
- Watanabe, S. (2010), "Asymptotic equivalence of Bayes cross validation and widely applicable information criterion" (PDF), Journal of Machine Learning Research, 11: 3571–3594.
External links
- Hirotugu Akaike comments on how he arrived at the AIC, in "This Week's Citation Classic", Current Contents Engineering, Technology, and Applied Sciences, 51: 22 (21 December 1981)
- Akaike Information Criterion (North Carolina State University)
- Example AIC use (Honda USA, Noesis Solutions, Belgium)
- Model Selection (University of Iowa)