Rao–Blackwell theorem

In statistics, the Rao–Blackwell theorem, sometimes referred to as the Rao–Blackwell–Kolmogorov theorem, is a result which characterizes the transformation of an arbitrarily crude estimator into an estimator that is optimal by the mean-squared-error criterion or any of a variety of similar criteria.

The Rao–Blackwell theorem states that if g(X) is any kind of estimator of a parameter θ, then the conditional expectation of g(X) given T(X), where T is a sufficient statistic, is typically a better estimator of θ, and is never worse. Sometimes one can very easily construct a very crude estimator g(X), and then evaluate that conditional expected value to get an estimator that is in various senses optimal.

The theorem is named after Calyampudi Radhakrishna Rao and David Blackwell. The process of transforming an estimator using the Rao–Blackwell theorem is sometimes called Rao–Blackwellization. The transformed estimator is called the Rao–Blackwell estimator.^[1]^[2]^[3]

Definitions

An estimator δ(X) is an observable random variable (i.e. a statistic) used for estimating some unobservable quantity. For example, one may be unable to observe the average height of all male students at the University of X, but one may observe the heights of a random sample of 40 of them. The average height of those 40—the "sample average"—may be used as an estimator of the unobservable "population average".
A sufficient statistic T(X) is a statistic calculated from data X such that no other statistic calculated from the same data provides any additional information about the parameter θ. Formally, it is an observable random variable such that the conditional probability distribution of all observable data X given T(X) does not depend on the unobservable parameter θ (for example, the mean or standard deviation of the whole population from which the data X were taken). In the most frequently cited examples, the "unobservable" quantities are parameters that parametrize a known family of probability distributions according to which the data are distributed.

In other words, a sufficient statistic T(X) for a parameter θ is a statistic such that the conditional distribution of the data X, given T(X), does not depend on the parameter θ.

A Rao–Blackwell estimator δ₁(X) of an unobservable quantity θ is the conditional expected value E(δ(X) | T(X)) of some estimator δ(X) given a sufficient statistic T(X). Call δ(X) the "original estimator" and δ₁(X) the "improved estimator". It is important that the improved estimator be observable, i.e. that it not depend on θ. Generally, the conditional expected value of one function of these data given another function of these data does depend on θ, but the very definition of sufficiency given above entails that this one does not.
The mean squared error of an estimator is the expected value of the square of its deviation from the unobservable quantity being estimated.
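
The sufficiency condition in the definitions above can be checked by brute force in a small case. The sketch below is only an illustration under assumptions that are not part of the article: the data are n independent Bernoulli(p) draws, T(X) is their sum, and the function name `conditional_dist_given_sum` is hypothetical. It enumerates all outcomes and shows that the conditional distribution of the data given the sum is the same for every p.

```python
import math
from itertools import product


def conditional_dist_given_sum(n, t, p):
    """Return P(X = x | sum(X) = t) for n i.i.d. Bernoulli(p) draws.

    Computed by brute-force enumeration of all 0/1 vectors of length n.
    """
    # Joint probability of each outcome x whose entries sum to t.
    joint = {
        x: p ** t * (1 - p) ** (n - t)
        for x in product((0, 1), repeat=n)
        if sum(x) == t
    }
    total = sum(joint.values())  # this is P(sum(X) = t)
    return {x: prob / total for x, prob in joint.items()}


if __name__ == "__main__":
    for p in (0.2, 0.7):
        dist = conditional_dist_given_sum(n=4, t=2, p=p)
        # Each arrangement should get probability 1 / C(4, 2), whatever p is.
        print(p, sorted(set(round(v, 12) for v in dist.values())))
```

Because the parameter p cancels in the ratio, the conditional distribution is uniform over the arrangements with the given sum, which is what sufficiency of the sum requires in this assumed model.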

The theorem

Mean-squared-error version

One case of the Rao–Blackwell theorem states:

The mean squared error of the Rao–Blackwell estimator does not exceed that of the original estimator.

In other words

\operatorname{E}((\delta_1(X)-\theta)^2)\leq \operatorname{E}((\delta(X)-\theta)^2).

The essential tools of the proof besides the definition above are the law of total expectation and the fact that for any random variable Y, E(Y²) cannot be less than [E(Y)]². That inequality is a case of Jensen's inequality, although it may also be shown to follow instantly from the frequently mentioned fact that

0 \leq \operatorname{Var}(Y) = \operatorname{E}((Y-\operatorname{E}(Y))^2) = \operatorname{E}(Y^2)-(\operatorname{E}(Y))^2.
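
As a sketch of how these two tools combine (the standard argument, written out here for illustration), apply the inequality conditionally on T(X) with Y = δ(X) − θ, then take the outer expectation:

```latex
\begin{align}
\operatorname{E}\big((\delta(X)-\theta)^2\big)
  &= \operatorname{E}\Big(\operatorname{E}\big((\delta(X)-\theta)^2 \mid T(X)\big)\Big) \\
  &\geq \operatorname{E}\Big(\big(\operatorname{E}(\delta(X)-\theta \mid T(X))\big)^2\Big) \\
  &= \operatorname{E}\big((\delta_1(X)-\theta)^2\big).
\end{align}
```

The first line is the law of total expectation, the second is the conditional form of E(Y²) ≥ [E(Y)]², and the third uses δ₁(X) = E(δ(X) | T(X)).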

Convex loss generalization

The more general version of the Rao–Blackwell theorem speaks of the "expected loss" or risk function:

\operatorname{E}(L(\delta_1(X)))\leq \operatorname{E}(L(\delta(X)))

where the "loss function" L may be any convex function. For the proof of the more general version, Jensen's inequality cannot be dispensed with.
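
A sketch of that argument, assuming the expectations involved are finite: the conditional form of Jensen's inequality for the convex function L gives a pointwise bound, and taking expectations of both sides yields the stated inequality.

```latex
\begin{align}
L(\delta_1(X))
  = L\big(\operatorname{E}(\delta(X)\mid T(X))\big)
  &\leq \operatorname{E}\big(L(\delta(X))\mid T(X)\big), \\
\operatorname{E}\big(L(\delta_1(X))\big)
  &\leq \operatorname{E}\Big(\operatorname{E}\big(L(\delta(X))\mid T(X)\big)\Big)
  = \operatorname{E}\big(L(\delta(X))\big).
\end{align}
```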

Properties

The improved estimator is unbiased if and only if the original estimator is unbiased, as may be seen at once by using the law of total expectation. The theorem holds regardless of whether biased or unbiased estimators are used.
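
In symbols, the law of total expectation shows that the two estimators have the same expected value, so one equals θ exactly when the other does:

```latex
\operatorname{E}(\delta_1(X))
  = \operatorname{E}\big(\operatorname{E}(\delta(X)\mid T(X))\big)
  = \operatorname{E}(\delta(X)).
```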

The theorem seems very weak: it says only that the Rao–Blackwell estimator is no worse than the original estimator. In practice, however, the improvement is often enormous.

Example

Phone calls arrive at a switchboard according to a Poisson process at an average rate of λ per minute. This rate is not observable, but the numbers X₁, ..., X_n of phone calls that arrived during n successive one-minute periods are observed. It is desired to estimate the probability e^−λ that the next one-minute period passes with no phone calls.

An extremely crude estimator of the desired probability is

\delta_0=\left\{\begin{matrix}1 & \text{if}\ X_1=0, \\ 0 & \text{otherwise,}\end{matrix}\right.

i.e., it estimates this probability to be 1 if no phone calls arrived in the first minute and zero otherwise. Despite the apparent limitations of this estimator, the result given by its Rao–Blackwellization is a very good estimator.

The sum

S_n = \sum_{i=1}^n X_{i} = X_1+\cdots+X_n

can be readily shown to be a sufficient statistic for λ, i.e., the conditional distribution of the data X₁, ..., X_n, given this sum, does not depend on λ. Therefore, we find the Rao–Blackwell estimator

\delta_1=\operatorname{E}(\delta_0\mid S_n=s_n).

After doing some algebra we have

\begin{align} \delta_1 &= \operatorname{E} \left (\mathbf{1}_{\{X_1=0\}} \Bigg| \sum_{i=1}^n X_{i} = s_n \right ) \\ &= P \left (X_{1}=0 \Bigg| \sum_{i=1}^n X_{i} = s_n \right ) \\ &= P \left (X_{1}=0, \sum_{i=2}^n X_{i} = s_n \right ) \times P \left (\sum_{i=1}^n X_{i} = s_n \right )^{-1} \\ &= e^{-\lambda}\frac{\left((n-1)\lambda\right)^{s_n}e^{-(n-1)\lambda}}{s_n!} \times \left (\frac{(n\lambda)^{s_n}e^{-n\lambda}}{s_n!} \right )^{-1} \\ &= \frac{\left((n-1)\lambda\right)^{s_n}e^{-n\lambda}}{s_n!} \times \frac{s_n!}{(n\lambda)^{s_n}e^{-n\lambda}} \\ &= \left(1-\frac{1}{n}\right)^{s_n} \end{align}
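
The cancellation of λ in this derivation can be checked numerically. The following is a minimal sketch using only the Python standard library. The function names `poisson_pmf` and `prob_first_zero_given_sum`, and the particular values of n, λ and s_n, are illustrative assumptions. It evaluates the ratio of Poisson probabilities from the derivation and compares it with (1 − 1/n)^{s_n}.

```python
import math


def poisson_pmf(k, mu):
    """P(N = k) for N ~ Poisson(mu)."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def prob_first_zero_given_sum(s, n, lam):
    """P(X_1 = 0 | X_1 + ... + X_n = s) for i.i.d. Poisson(lam) counts, n >= 2."""
    # X_1 and X_2 + ... + X_n are independent; the latter is Poisson((n-1)*lam).
    joint = poisson_pmf(0, lam) * poisson_pmf(s, (n - 1) * lam)
    marginal = poisson_pmf(s, n * lam)  # S_n ~ Poisson(n*lam)
    return joint / marginal


if __name__ == "__main__":
    n = 5
    for lam in (0.5, 2.0, 7.0):
        for s in range(4):
            # The two printed values should agree for every lam.
            print(lam, s, prob_first_zero_given_sum(s, n, lam), (1 - 1 / n) ** s)
```

The agreement across different values of λ reflects the sufficiency of S_n: the conditional probability depends on the data only.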

Since the average number of calls arriving during the first n minutes is nλ, one might not be surprised if this estimator has a fairly high probability (if n is big) of being close to

\left(1-{1 \over n}\right)^{n\lambda}\approx e^{-\lambda}.

So δ₁ is clearly a very much improved estimator of that last quantity. In fact, since S_n is complete and δ₀ is unbiased, δ₁ is the unique minimum variance unbiased estimator by the Lehmann–Scheffé theorem.
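
The size of the improvement can be seen in a small Monte Carlo experiment. This is a sketch under stated assumptions, not part of the original example: the values λ = 2, n = 10, the number of trials, the seed, and the function names are all illustrative, and Poisson counts are drawn with Knuth's multiplication method because the Python standard library has no Poisson sampler.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def simulate(lam=2.0, n=10, trials=20000, seed=0):
    """Estimate the mean and mean squared error of delta_0 and delta_1."""
    rng = random.Random(seed)
    target = math.exp(-lam)  # the quantity being estimated
    sum_crude = sum_rb = sq_crude = sq_rb = 0.0
    for _ in range(trials):
        counts = [poisson_draw(lam, rng) for _ in range(n)]
        crude = 1.0 if counts[0] == 0 else 0.0   # delta_0
        rb = (1.0 - 1.0 / n) ** sum(counts)      # delta_1 = (1 - 1/n)^{S_n}
        sum_crude += crude
        sum_rb += rb
        sq_crude += (crude - target) ** 2
        sq_rb += (rb - target) ** 2
    return {
        "target": target,
        "mean_crude": sum_crude / trials,
        "mean_rb": sum_rb / trials,
        "mse_crude": sq_crude / trials,
        "mse_rb": sq_rb / trials,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.5f}")
```

Both estimators should average close to e^−λ, since Rao–Blackwellization preserves unbiasedness, while the simulated mean squared error of δ₁ should come out far below that of δ₀.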

Idempotence

Rao–Blackwellization is an idempotent operation. Using it, with the same sufficient statistic, to improve the already improved estimator does not obtain a further improvement, but merely returns as its output the same improved estimator.
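
In symbols, assuming the same sufficient statistic T(X) is used for both steps, the improved estimator δ₁(X) is already a function of T(X), so conditioning on T(X) again leaves it unchanged:

```latex
\operatorname{E}\big(\delta_1(X)\mid T(X)\big)
  = \operatorname{E}\Big(\operatorname{E}\big(\delta(X)\mid T(X)\big)\,\Big|\,T(X)\Big)
  = \operatorname{E}\big(\delta(X)\mid T(X)\big)
  = \delta_1(X).
```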

Completeness and Lehmann–Scheffé minimum variance

If the conditioning statistic is both complete and sufficient, and the starting estimator is unbiased, then the Rao–Blackwell estimator is the unique "best unbiased estimator": see Lehmann–Scheffé theorem.

An example of an improvable Rao–Blackwell improvement, when using a minimal sufficient statistic that is not complete, was provided by Galili and Meilijson in 2016.^[4] Let $X_1, X_2, \ldots, X_n$ be a random sample from a scale-uniform distribution $X \sim U\left((1-k)\theta, (1+k)\theta\right)$, with unknown mean $E[X] = \theta$ and known design parameter $k \in (0,1)$. In the search for "best" possible unbiased estimators for $\theta$, it is natural to consider $X_1$ as an initial (crude) unbiased estimator for $\theta$ and then try to improve it. Since $X_1$ is not a function of $T = \left(X_{(1)}, X_{(n)}\right)$, the minimal sufficient statistic for $\theta$ (where $X_{(1)} = \min(X_i)$ and $X_{(n)} = \max(X_i)$), it may be improved using the Rao–Blackwell theorem as follows: $\hat{\theta}_{RB} = E_\theta\left[X_1 \mid X_{(1)}, X_{(n)}\right] = \frac{X_{(1)} + X_{(n)}}{2}$. However, the following unbiased estimator can be shown to have lower variance: $\hat{\theta}_{LV} = \frac{1}{2\left(k^2 \frac{n-1}{n+1} + 1\right)} \left[(1-k) X_{(1)} + (1+k) X_{(n)}\right]$. And in fact, it could be even further improved when using the following estimator: $\hat{\theta}_{BAYES} = \frac{n+1}{n} \left[1 - \frac{\frac{X_{(1)}/(1-k)}{X_{(n)}/(1+k)} - 1}{\left[\frac{X_{(1)}/(1-k)}{X_{(n)}/(1+k)}\right]^{n+1} - 1}\right] \frac{X_{(n)}}{1+k}$.
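
The variance comparison between the first two estimators can be explored by simulation. The sketch below is illustrative only: the parameter values θ = 1, k = 0.5, n = 10, the number of trials, the seed, and the function name `simulate_scale_uniform` are assumptions, and the third (Bayes) estimator is left out to keep the example short.

```python
import random
import statistics


def simulate_scale_uniform(theta=1.0, k=0.5, n=10, trials=20000, seed=0):
    """Compare the Rao-Blackwell and lower-variance estimators by simulation."""
    rng = random.Random(seed)
    low, high = (1 - k) * theta, (1 + k) * theta
    # Constant in front of the lower-variance estimator given in the text.
    c = 1.0 / (2.0 * (k ** 2 * (n - 1) / (n + 1) + 1.0))
    rb_values, lv_values = [], []
    for _ in range(trials):
        sample = [rng.uniform(low, high) for _ in range(n)]
        x_min, x_max = min(sample), max(sample)
        rb_values.append((x_min + x_max) / 2.0)                      # theta_RB
        lv_values.append(c * ((1 - k) * x_min + (1 + k) * x_max))    # theta_LV
    return {
        "mean_rb": statistics.fmean(rb_values),
        "mean_lv": statistics.fmean(lv_values),
        "var_rb": statistics.pvariance(rb_values),
        "var_lv": statistics.pvariance(lv_values),
    }


if __name__ == "__main__":
    for name, value in simulate_scale_uniform().items():
        print(f"{name}: {value:.6f}")
```

Both simulated means should lie close to θ, and the simulated variance of the second estimator should be smaller than that of the Rao–Blackwell estimator, in line with the claim above.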

References

[1] Blackwell, D. (1947). "Conditional expectation and unbiased sequential estimation". Annals of Mathematical Statistics. 18 (1): 105–110. doi:10.1214/aoms/1177730497. MR 19903. Zbl 0033.07603.
[2] Kolmogorov, A. N. (1950). "Unbiased estimates". Izvestiya Akad. Nauk SSSR. Ser. Mat. 14: 303–326. MR 36479.
[3] Rao, C. Radhakrishna (1945). "Information and accuracy attainable in the estimation of statistical parameters". Bulletin of the Calcutta Mathematical Society. 37 (3): 81–91.
[4] Galili, Tal; Meilijson, Isaac (2016). "An Example of an Improvable Rao–Blackwell Improvement, Inefficient Maximum Likelihood Estimator, and Unbiased Generalized Bayes Estimator". The American Statistician. 70 (1): 108–113. doi:10.1080/00031305.2015.1100683.

External links

Nikulin, M.S. (2001), "Rao–Blackwell–Kolmogorov theorem", in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4

This article is issued from Wikipedia - version of the 8/22/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.