Divergence (statistics)

In statistics and information geometry, a divergence or contrast function is a function that measures the "distance" from one probability distribution to another on a statistical manifold. A divergence is a weaker notion than a distance: in particular, it need not be symmetric (in general, the divergence from p to q differs from the divergence from q to p), and it need not satisfy the triangle inequality.

Definition

Suppose S is a space of all probability distributions with common support. Then a divergence on S is a function D(· || ·): S×S → R satisfying the following two conditions:^[1]

D(p || q) ≥ 0 for all p, q ∈ S,
D(p || q) = 0 if and only if p = q.
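
The two conditions can be checked numerically on a small example. The following is a minimal sketch, assuming discrete distributions with strictly positive probabilities on a shared finite support and taking the Kullback–Leibler divergence as the divergence D; the function name and the sample distributions are illustrative choices, not part of the definition.

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D(p || q) = sum_x p(x) ln(p(x)/q(x)).

    Assumes p and q are sequences of strictly positive probabilities over the
    same finite support (the 'common support' condition in the definition).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Two illustrative distributions on a three-point support.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)

print("D(p || q) =", d_pq)                   # non-negative
print("D(q || p) =", d_qp)                   # non-negative, differs from D(p || q)
print("D(p || p) =", d_pp)                   # zero on the diagonal
print("symmetric?", math.isclose(d_pq, d_qp))
```

The example also illustrates the lack of symmetry: swapping the arguments changes the value.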

The dual divergence D* is defined as

D^{*}(p\parallel q)=D(q\parallel p).
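
In code, the dual divergence amounts to swapping the two arguments. The sketch below assumes discrete distributions with positive probabilities and uses the Kullback–Leibler divergence as D; the helper names `dual`, `kl` and `jeffreys` are hypothetical. It also checks that the symmetrized sum D + D* of the Kullback–Leibler divergence coincides with the Jeffreys divergence listed among the examples.

```python
import math

def kl(p, q):
    """Discrete Kullback-Leibler divergence, assuming positive probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dual(divergence):
    """Return the dual divergence D*, defined by D*(p || q) = D(q || p)."""
    return lambda p, q: divergence(q, p)

def jeffreys(p, q):
    """Discrete Jeffreys divergence: sum_x (p(x) - q(x)) (ln p(x) - ln q(x))."""
    return sum((pi - qi) * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

kl_dual = dual(kl)

# D* evaluates D with the arguments swapped; dualizing twice recovers D.
print(kl_dual(p, q) == kl(q, p))
print(dual(kl_dual)(p, q) == kl(p, q))

# The symmetrized sum D + D* of the KL divergence is the Jeffreys divergence.
print(math.isclose(kl(p, q) + kl_dual(p, q), jeffreys(p, q)))
```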

Geometrical properties

Many properties of divergences can be derived if we restrict S to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system θ, so that for a distribution p ∈ S we can write p = p(θ).

For a pair of points p, q ∈ S with coordinates θ_p and θ_q, denote the partial derivatives of D(p || q) as

{\begin{aligned}D((\partial _{i})_{p}\parallel q)\ \ &{\stackrel {{\mathrm {def}}}{=}}\ \ {\tfrac {\partial }{\partial \theta _{p}^{i}}}D(p\parallel q),\\D((\partial _{i}\partial _{j})_{p}\parallel (\partial _{k})_{q})\ \ &{\stackrel {{\mathrm {def}}}{=}}\ \ {\tfrac {\partial }{\partial \theta _{p}^{i}}}{\tfrac {\partial }{\partial \theta _{p}^{j}}}{\tfrac {\partial }{\partial \theta _{q}^{k}}}D(p\parallel q),\ \ {\mathrm {etc.}}\end{aligned}}

Now we restrict these functions to the diagonal p = q, and denote ^[2]

{\begin{aligned}D[\partial _{i}\parallel \cdot ]\ &:\ p\mapsto D((\partial _{i})_{p}\parallel p),\\D[\partial _{i}\parallel \partial _{j}]\ &:\ p\mapsto D((\partial _{i})_{p}\parallel (\partial _{j})_{p}),\ \ {\mathrm {etc.}}\end{aligned}}

By definition, the function D(p || q) is minimized at p = q, and therefore

{\begin{aligned}&D[\partial _{i}\parallel \cdot ]=D[\cdot \parallel \partial _{i}]=0,\\&D[\partial _{i}\partial _{j}\parallel \cdot ]=D[\cdot \parallel \partial _{i}\partial _{j}]=-D[\partial _{i}\parallel \partial _{j}]\ \equiv \ g_{{ij}}^{{(D)}},\end{aligned}}

where the matrix g^(D) is positive semi-definite and defines a unique Riemannian metric on the manifold S.
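
These identities can be illustrated numerically on a one-parameter family. The sketch below assumes the Bernoulli family with parameter θ ∈ (0, 1) and the Kullback–Leibler divergence as D, and estimates D[∂∂ || ·] and −D[∂ || ∂] by central finite differences; the step size h and the function names are illustrative assumptions. For this family the Fisher information is 1/(θ(1 − θ)), which both estimates should approximate.

```python
import math

def kl_bernoulli(theta_p, theta_q):
    """KL divergence between Bernoulli(theta_p) and Bernoulli(theta_q)."""
    return (theta_p * math.log(theta_p / theta_q)
            + (1 - theta_p) * math.log((1 - theta_p) / (1 - theta_q)))

def metric_from_divergence(D, theta, h=1e-4):
    """Finite-difference estimates of D[dd || .] and -D[d || d] at p = q = theta."""
    # Second derivative in the first argument, evaluated on the diagonal.
    second_in_p = (D(theta + h, theta) - 2 * D(theta, theta)
                   + D(theta - h, theta)) / h**2
    # Mixed derivative: one derivative in each argument, on the diagonal.
    mixed = (D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h**2)
    return second_in_p, -mixed

theta = 0.3
g_pp, g_pq = metric_from_divergence(kl_bernoulli, theta)
fisher = 1.0 / (theta * (1.0 - theta))

# First derivative on the diagonal vanishes because D is minimized at p = q.
h = 1e-4
first = (kl_bernoulli(theta + h, theta) - kl_bernoulli(theta - h, theta)) / (2 * h)

print("D[dd || .]  ~", g_pp)
print("-D[d || d]  ~", g_pq)
print("Fisher info =", fisher)
print("D[d || .]   ~", first)
```

Finite differences are used only to keep the example dependency-free; an automatic differentiation library would give the same quantities without discretization error.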

The divergence D(· || ·) also defines a unique torsion-free affine connection ∇^(D) with coefficients

\Gamma _{{ij,k}}^{{(D)}}=-D[\partial _{i}\partial _{j}\parallel \partial _{k}],

and the dual to this connection ∇* is generated by the dual divergence D*.

Thus, a divergence D(· || ·) generates a unique dualistic structure (g^(D), ∇^(D), ∇^(D*)) on a statistical manifold. The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced by some globally defined divergence function (which, however, need not be unique).^[3]

For example, when D is an f-divergence for some function ƒ(·), it generates the metric g^(D_f) = c·g and the connection ∇^(D_f) = ∇^(α), where g is the canonical Fisher information metric, ∇^(α) is the α-connection, c = ƒ′′(1), and α = 3 + 2ƒ′′′(1)/ƒ′′(1).
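
As a worked instance of these formulas, take the convention D_f(p || q) = ∫ p(x) f(q(x)/p(x)) dx used in this article. Under that convention the Kullback–Leibler divergence corresponds to the generator f(u) = −ln u, and Chernoff's divergence with parameter α₀ corresponds to f(u) = 4/(1 − α₀²) · (1 − u^((1+α₀)/2)). Both identifications are made here for illustration and can be checked by substitution; the symbol α₀ is introduced only to distinguish the parameter of the divergence from the resulting α of the connection.

```latex
\begin{aligned}
\text{Kullback--Leibler:}\quad
  & f(u) = -\ln u, \quad f''(u) = u^{-2}, \quad f'''(u) = -2u^{-3}, \\
  & c = f''(1) = 1, \qquad
    \alpha = 3 + \frac{2f'''(1)}{f''(1)} = 3 - 4 = -1. \\[4pt]
\text{Chernoff:}\quad
  & f(u) = \frac{4}{1-\alpha_0^{2}}\Big(1 - u^{\frac{1+\alpha_0}{2}}\Big), \quad
    f''(u) = u^{\frac{\alpha_0-3}{2}}, \quad
    f'''(u) = \frac{\alpha_0-3}{2}\,u^{\frac{\alpha_0-5}{2}}, \\
  & c = f''(1) = 1, \qquad
    \alpha = 3 + (\alpha_0 - 3) = \alpha_0.
\end{aligned}
```

In both cases c = 1, so the induced metric is the Fisher information metric itself; the Kullback–Leibler divergence gives the connection with α = −1, and Chernoff's divergence with parameter α₀ gives the connection with α = α₀.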

Examples

The largest and most frequently used class of divergences is that of the so-called f-divergences; however, other types of divergence functions are also encountered in the literature.

f-divergences

Main article: f-divergence

This family of divergences is generated by functions f(u) that are convex on u > 0 and satisfy f(1) = 0. An f-divergence is then defined as

D_{f}(p\parallel q)=\int p(x)f{\bigg (}{\frac {q(x)}{p(x)}}{\bigg )}dx

Kullback–Leibler divergence:	$D_{{\mathrm {KL}}}(p\parallel q)=\int p(x)\ln \left({\frac {p(x)}{q(x)}}\right)dx$
squared Hellinger distance:	$H^{2}(p,\,q)=2\int {\Big (}{\sqrt {p(x)}}-{\sqrt {q(x)}}\,{\Big )}^{2}dx$
Jeffreys divergence:	$D_{J}(p\parallel q)=\int (p(x)-q(x)){\big (}\ln p(x)-\ln q(x){\big )}dx$
Chernoff's α-divergence:	$D^{{(\alpha )}}(p\parallel q)={\frac {4}{1-\alpha ^{2}}}{\bigg (}1-\int p(x)^{{\frac {1-\alpha }{2}}}q(x)^{{\frac {1+\alpha }{2}}}dx{\bigg )}$
exponential divergence:	$D_{e}(p\parallel q)=\int p(x){\big (}\ln p(x)-\ln q(x){\big )}^{2}dx$
Kagan's divergence:	$D_{{\chi ^{2}}}(p\parallel q)={\frac 12}\int {\frac {(p(x)-q(x))^{2}}{p(x)}}dx$
(α,β)-product divergence:	$D_{{\alpha ,\beta }}(p\parallel q)={\frac {2}{(1-\alpha )(1-\beta )}}\int {\Big (}1-{\Big (}{\tfrac {q(x)}{p(x)}}{\Big )}^{{\!\!{\frac {1-\alpha }{2}}}}{\Big )}{\Big (}1-{\Big (}{\tfrac {q(x)}{p(x)}}{\Big )}^{{\!\!{\frac {1-\beta }{2}}}}{\Big )}p(x)dx$
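
The definition can be sketched for discrete distributions by replacing the integral with a sum. The code below assumes strictly positive probabilities on a shared finite support; the generator functions `f_kl`, `f_hellinger` and `f_kagan` are identifications made here for illustration, and each is compared against the corresponding closed-form expression from the list above.

```python
import math

def f_divergence(f, p, q):
    """Discrete analogue of D_f(p || q) = sum_x p(x) f(q(x) / p(x))."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

# Generators, each convex on u > 0 with f(1) = 0 (assumed identifications).
def f_kl(u):
    return -math.log(u)                      # Kullback-Leibler divergence

def f_hellinger(u):
    return 2.0 * (1.0 - math.sqrt(u)) ** 2   # squared Hellinger distance, as normalized above

def f_kagan(u):
    return 0.5 * (1.0 - u) ** 2              # Kagan's divergence

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

# Closed-form expressions written out directly from the list of examples.
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
hellinger_direct = 2.0 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                             for pi, qi in zip(p, q))
kagan_direct = 0.5 * sum((pi - qi) ** 2 / pi for pi, qi in zip(p, q))

print(math.isclose(f_divergence(f_kl, p, q), kl_direct))
print(math.isclose(f_divergence(f_hellinger, p, q), hellinger_direct))
print(math.isclose(f_divergence(f_kagan, p, q), kagan_direct))
```

Passing the generator as a function keeps the divergence itself generic, so further members of the family can be added by supplying another f.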

References

Amari, Shun-ichi; Nagaoka, Hiroshi (2000). Methods of information geometry. Oxford University Press. ISBN 0-8218-0531-2.
Eguchi, Shinto (1985). "A differential geometric approach to statistical inference on the basis of contrast functionals". Hiroshima mathematical journal. 15 (2): 341–391.
Eguchi, Shinto (1992). "Geometry of minimum contrast". Hiroshima mathematical journal. 22 (3): 631–647.
Matumoto, Takao (1993). "Any statistical manifold has a contrast function — on the C³-functions taking the minimum at the diagonal of the product manifold". Hiroshima mathematical journal. 23 (2): 327–332.

This article is issued from Wikipedia - version of the 8/22/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.