Empirical risk minimization

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on the performance of learning algorithms.

Background

Consider the following situation, which is a general setting for many supervised learning problems. We have two spaces of objects $X$ and $Y$ and would like to learn a function $h:X\to Y$ (often called a hypothesis) which outputs an object $y\in Y$ given $x\in X$. To do so, we have at our disposal a training set of $m$ examples $(x_{1},y_{1}),\ldots ,(x_{m},y_{m})$, where $x_{i}\in X$ is an input and $y_{i}\in Y$ is the corresponding response that we wish to get from $h(x_{i})$.

To put it more formally, we assume that there is a joint probability distribution $P(x,y)$ over $X$ and $Y$, and that the training set consists of $m$ instances $(x_{1},y_{1}),\ldots ,(x_{m},y_{m})$ drawn i.i.d. from $P(x,y)$. Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because $y$ is not a deterministic function of $x$, but rather a random variable with conditional distribution $P(y|x)$ for a fixed $x$.

We also assume that we are given a non-negative real-valued loss function $L({\hat {y}},y)$ which measures how different the prediction ${\hat {y}}$ of a hypothesis is from the true outcome $y$. The risk associated with a hypothesis $h$ is then defined as the expectation of the loss function:

R(h)={\mathbf {E}}[L(h(x),y)]=\int L(h(x),y)\,dP(x,y).

A loss function commonly used in theory is the 0-1 loss function: $L({\hat {y}},y)=I({\hat {y}}\neq y)$, where $I(\dots )$ is the indicator function.
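
As an illustration of the risk as an expectation, the following minimal sketch estimates $R(h)$ under the 0-1 loss by Monte Carlo sampling. The joint distribution used here is purely an assumption made for the example ($x$ uniform on $[0,1]$, $y=1$ when $x>0.5$, with the label flipped with probability 0.1), as are the helper names `sample_pair` and `monte_carlo_risk`. For a threshold hypothesis $h(x)=I(x>t)$ under this toy distribution, the risk works out to $0.1+0.8\,|t-0.5|$, which the estimate should approach.

```python
import random


def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction differs from the true outcome, else 0."""
    return int(y_hat != y)


def sample_pair(rng, noise=0.1):
    """Draw one (x, y) pair from a toy joint distribution (assumed for this sketch)."""
    x = rng.random()              # x is uniform on [0, 1)
    y = int(x > 0.5)              # noiseless label
    if rng.random() < noise:      # flip the label with probability `noise`
        y = 1 - y
    return x, y


def monte_carlo_risk(h, n, seed=0):
    """Approximate R(h) = E[L(h(x), y)] by averaging the loss over n fresh draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        x, y = sample_pair(rng)
        total += zero_one_loss(h(x), y)
    return total / n


def h(x):
    """Hypothetical threshold hypothesis with t = 0.7."""
    return int(x > 0.7)


estimate = monte_carlo_risk(h, 200_000)
exact = 0.1 + 0.8 * abs(0.7 - 0.5)   # closed form for this toy distribution only
print(estimate, exact)
```

Such a computation is only possible here because the distribution was chosen by hand; a learning algorithm does not have access to $P(x,y)$.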

The ultimate goal of a learning algorithm is to find a hypothesis $h^{*}$ among a fixed class of functions ${\mathcal {H}}$ for which the risk $R(h)$ is minimal:

h^{*}=\arg \min _{h\in {\mathcal {H}}}R(h).

Empirical risk minimization

In general, the risk $R(h)$ cannot be computed because the distribution $P(x,y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called empirical risk, by averaging the loss function on the training set:

R_{\text{emp}}(h)={\frac {1}{m}}\sum _{i=1}^{m}L(h(x_{i}),y_{i}).
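
The empirical risk can be computed directly from a training set. The sketch below does so for the 0-1 loss; the tiny training set and the threshold hypothesis `h` are made up for illustration, and the function name `empirical_risk` is an assumption of this example.

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction differs from the true outcome, else 0."""
    return int(y_hat != y)


def empirical_risk(h, sample, loss=zero_one_loss):
    """Average loss of hypothesis h over a list of (x, y) training pairs."""
    if not sample:
        raise ValueError("the training set must contain at least one example")
    return sum(loss(h(x), y) for x, y in sample) / len(sample)


# Hypothetical training set of m = 5 examples (x_i, y_i).
training_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.3, 1)]


def h(x):
    """Hypothetical threshold hypothesis: predict 1 when x > 0.5."""
    return int(x > 0.5)


risk = empirical_risk(h, training_set)
print(risk)  # h misclassifies only (0.3, 1), so the empirical risk is 1/5
```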

The empirical risk minimization principle states that the learning algorithm should choose a hypothesis ${\hat {h}}$ which minimizes the empirical risk:

{\hat {h}}=\arg \min _{h\in {\mathcal {H}}}R_{\text{emp}}(h).

Thus the learning algorithm defined by the ERM principle consists of solving the above optimization problem.
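
When the hypothesis class ${\mathcal {H}}$ is finite and small, the optimization problem can be solved by exhaustive search. The following sketch assumes a hypothetical class of five threshold classifiers $h_{t}(x)=I(x>t)$ and a made-up training set; it evaluates the empirical risk of each hypothesis under the 0-1 loss and returns a minimizer. It is meant only to make the principle concrete, not as a practical algorithm for large or infinite classes.

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction differs from the true outcome, else 0."""
    return int(y_hat != y)


def empirical_risk(h, sample):
    """Average 0-1 loss of hypothesis h over the (x, y) pairs in sample."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)


def make_threshold(t):
    """Build the threshold hypothesis h_t(x) = I(x > t)."""
    def h(x):
        return int(x > t)
    return h


def erm(thresholds, sample):
    """Exhaustive ERM: return (t, risk) for a threshold with minimal empirical risk."""
    scored = [(t, empirical_risk(make_threshold(t), sample)) for t in thresholds]
    return min(scored, key=lambda pair: pair[1])


# Hypothetical finite hypothesis class and training set (one noisy label at x = 0.9).
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
training_set = [
    (0.1, 0), (0.2, 0), (0.35, 0),
    (0.55, 1), (0.6, 1), (0.8, 1),
    (0.9, 0),
]

best_t, best_risk = erm(thresholds, training_set)
print(best_t, best_risk)  # t = 0.5 makes a single mistake, on (0.9, 0)
```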

Properties

Computational complexity

Empirical risk minimization for a classification problem with the 0-1 loss function is known to be NP-hard even for such a relatively simple class of functions as linear classifiers.^[1] It can, however, be solved efficiently when the minimal empirical risk is zero, i.e. when the data is linearly separable.
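
For the linearly separable case, one classical method that reaches zero empirical risk is the perceptron algorithm (linear programming is another option). The sketch below is a minimal version under the assumption that labels are encoded as $-1$ and $+1$ rather than 0 and 1; the data set and the `max_epochs` safeguard are choices made for this example.

```python
def perceptron(sample, max_epochs=1000):
    """Perceptron learning rule.

    sample: list of (x, y) pairs with x a tuple of floats and y in {-1, +1}.
    Returns (w, b) such that every training point is classified correctly.
    Raises RuntimeError if no separator is found within max_epochs passes.
    """
    dim = len(sample[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in sample:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                      # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                           # a full pass with no errors
            return w, b
    raise RuntimeError("no separating hyperplane found within max_epochs")


# Hypothetical linearly separable training set in the plane.
training_set = [
    ((2.0, 2.0), 1), ((3.0, 1.0), 1), ((2.5, 3.0), 1),
    ((-1.0, -1.0), -1), ((-2.0, 0.5), -1), ((0.0, -2.0), -1),
]

w, b = perceptron(training_set)
errors = sum(
    1 for x, y in training_set
    if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0
)
print(w, b, errors)
```

If the data is not linearly separable, this loop never completes an error-free pass, which is why the sketch gives up after `max_epochs` passes.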

In practice, machine learning algorithms cope with this either by employing a convex approximation to the 0-1 loss function (such as the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution $P(x,y)$ (and thus ceasing to be agnostic learning algorithms, to which the above result applies).
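
The convex-surrogate approach can be sketched as follows. The example assumes labels in $\{-1,+1\}$ and a linear score $w\cdot x+b$, uses the hinge loss $\max(0,1-y(w\cdot x+b))$, which upper-bounds the 0-1 loss, and minimizes the empirical hinge risk by plain batch subgradient descent. It omits the regularization term that an actual SVM includes, and the step size, epoch count, and data are arbitrary choices for illustration.

```python
def zero_one_loss(s, y):
    """0-1 loss of a real-valued score s against a label y in {-1, +1}."""
    return int(y * s <= 0)


def hinge_loss(s, y):
    """Hinge loss: a convex upper bound on the 0-1 loss."""
    return max(0.0, 1.0 - y * s)


def linear_score(w, b, x):
    """Score of the linear classifier (w, b) on input x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def empirical_risk(loss, w, b, sample):
    """Average of the given loss over the training sample."""
    return sum(loss(linear_score(w, b, x), y) for x, y in sample) / len(sample)


def minimize_hinge(sample, lr=0.1, epochs=200):
    """Batch subgradient descent on the empirical hinge risk (no regularization)."""
    dim = len(sample[0][0])
    m = len(sample)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in sample:
            if y * linear_score(w, b, x) < 1.0:     # only such points contribute
                for j in range(dim):
                    grad_w[j] -= y * x[j] / m
                grad_b -= y / m
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical training set with labels in {-1, +1}.
training_set = [
    ((2.0, 2.0), 1), ((3.0, 1.0), 1), ((2.5, 3.0), 1),
    ((-1.0, -1.0), -1), ((-2.0, 0.5), -1), ((0.0, -2.0), -1),
]

hinge_before = empirical_risk(hinge_loss, [0.0, 0.0], 0.0, training_set)
w, b = minimize_hinge(training_set)
hinge_after = empirical_risk(hinge_loss, w, b, training_set)
error_after = empirical_risk(zero_one_loss, w, b, training_set)
print(hinge_before, hinge_after, error_after)
```

Because the hinge loss is never smaller than the 0-1 loss, driving the empirical hinge risk down also bounds the empirical 0-1 risk from above.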

References

[1] V. Feldman, V. Guruswami, P. Raghavendra and Yi Wu (2009). Agnostic Learning of Monomials by Halfspaces is Hard. (See the paper and references therein)

Literature

Vapnik, V. (2000). The Nature of Statistical Learning Theory. Information Science and Statistics. Springer-Verlag. ISBN 978-0-387-98780-4.

This article is issued from Wikipedia - version of the 10/3/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.