Statistical Learning - Parameter estimation techniques
Terms in this set (24)
Maximum likelihood
A parameter estimation heuristic that seeks parameter values that maximize the likelihood function for the parameter. This ignores the prior distribution and so is inconsistent with Bayesian probability theory, but it works reasonably well. See "Pathologies of Orthodox Statistics".
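As a minimal sketch of the idea (illustrative, not part of the original card; the Gaussian model is an assumed example): for a Gaussian, the likelihood is maximized in closed form by the sample mean and the biased sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data from a Gaussian with mean 5 and variance 4.
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# For a Gaussian, the maximum-likelihood estimates have closed forms:
# the sample mean and the (biased, divide-by-n) sample variance.
mu_ml = data.mean()
var_ml = ((data - mu_ml) ** 2).mean()
```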
Maximum A Posteriori
A parameter estimation heuristic that seeks parameter values that maximize the posterior density of the parameter. This implies that you have a prior distribution over the parameter, i.e. that you are Bayesian. MAP estimation has the highest chance of getting the parameter exactly right. But for predicting future data, it can be worse than Maximum Likelihood; predictive estimation is a better Bayesian method for that purpose. MAP is also not invariant to reparameterization; see MacKay's Bayesian methods FAQ.
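A tiny worked example (an assumed coin-flip setup, not from the original card): with a Beta(a, b) prior on a Bernoulli parameter, the posterior is Beta(a + heads, b + tails), and the MAP estimate is the posterior mode. Note how the prior pulls it away from the ML estimate.

```python
# MAP estimate of a coin's heads probability with a Beta(a, b) prior.
a, b = 2.0, 2.0          # prior pseudo-counts (assumed for illustration)
heads, tails = 3, 1      # observed data

p_ml = heads / (heads + tails)                        # ignores the prior
# Mode of the Beta(a + heads, b + tails) posterior:
p_map = (heads + a - 1) / (heads + tails + a + b - 2)
```

Here `p_ml` is 0.75 while `p_map` is pulled toward the prior mean of 0.5.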
Unbiased estimation
A parameter estimation procedure based on finding an estimator function that minimizes average error. When the average error is zero then the estimator is "unbiased." The error of the function is averaged over possible data sets, including ones you never observed. The best function is then used to get parameter values. See "Pathologies of Orthodox Statistics".
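The "averaged over possible data sets" point can be made concrete by simulation (an assumed Gaussian-variance example, not from the original card): the divide-by-n variance estimator is biased low on average, while the divide-by-(n-1) version averages out to the true variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5  # small samples make the bias visible

# Average each variance estimator over many simulated data sets.
ml_ests, unbiased_ests = [], []
for _ in range(20_000):
    x = rng.normal(0.0, 1.0, size=n)  # true variance is 1.0
    ss = ((x - x.mean()) ** 2).sum()
    ml_ests.append(ss / n)            # biased: averages to (n-1)/n = 0.8
    unbiased_ests.append(ss / (n - 1))  # unbiased: averages to 1.0

avg_ml = float(np.mean(ml_ests))
avg_unbiased = float(np.mean(unbiased_ests))
```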
Predictive estimation
Parameter estimation consistent with Bayesian probability theory. It seeks to minimize the expected "divergence" between the estimated distribution and the true distribution. The divergence is measured by Kullback and Leibler's formula. The distribution which achieves minimum divergence corresponds to integrating out the unknown parameter. Hence predictive estimation can be approximated by averaging over several different parameter choices. See "Inferring a Gaussian distribution", "A Comparison of Scientific and Engineering Criteria for Bayesian Model Selection", Geisser, and Bishop.
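The "integrating out the unknown parameter" point has a classic closed-form instance (an assumed Bernoulli example, not from the original card): with a uniform Beta(1, 1) prior, the posterior predictive probability of heads is Laplace's rule of succession, rather than a plug-in estimate.

```python
# Posterior predictive for a Bernoulli with a uniform Beta(1, 1) prior.
heads, tails = 3, 0

p_ml = heads / (heads + tails)              # plug-in ML estimate: 1.0
# Integrating the parameter out of the predictive distribution gives
# Laplace's rule of succession: (heads + 1) / (n + 2).
p_pred = (heads + 1) / (heads + tails + 2)  # 0.8
```

The plug-in estimate claims the next flip is certainly heads; the predictive estimate hedges, because it averages over all parameter values weighted by the posterior.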
Minimum Message Length
A parameter estimation technique similar to predictive estimation but motivated by information theory. Consider compressing the data via a two-part code: the first part is a parameter setting, encoded with respect to the prior, and the second part is the data, encoded with respect to the model with that parameter. Parameters are continuous, and so cannot be encoded exactly---they must be quantized, which introduces error. So we can't choose the parameters which simply compress the data most; we have to choose parameters which compress the data well even if the parameters are slightly modified. The parameter setting which balances this tradeoff between accuracy and robustness is the MML estimate.
Bootstrapping
A technique for simulating new data sets, to assess the robustness of a model or to produce a set of likely models. The new data sets are created by re-sampling with replacement from the original training set, so each datum may occur more than once. See "What are cross-validation and bootstrapping?" and "The Bootstrap is Inconsistent with Probability Theory".
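A minimal NumPy sketch (the statistic and data here are assumed for illustration): estimate the standard error of the median by re-sampling with replacement.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=200)  # an assumed skewed data set

# Re-sample with replacement; each datum may appear more than once.
boot_medians = []
for _ in range(2000):
    sample = rng.choice(data, size=len(data), replace=True)
    boot_medians.append(np.median(sample))

# Spread of the statistic across resamples = bootstrap standard error.
se_median = float(np.std(boot_medians))
```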
Bagging
Bootstrap averaging. Generate a bunch of models via bootstrapping and then average their predictions. See "Bagging Predictors", "Why does bagging work?", and "Bayesian model averaging is not model combination".
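A small sketch of bagging (the base model, a straight-line fit, is an assumed choice): fit one model per bootstrap resample, then average their predictions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(0, 0.3, size=x.size)  # noisy line, slope 2

# Fit a line to each bootstrap resample, then average the predictions.
n_models = 100
preds = []
for _ in range(n_models):
    idx = rng.integers(0, x.size, size=x.size)  # resample with replacement
    coeffs = np.polyfit(x[idx], y[idx], deg=1)
    preds.append(np.polyval(coeffs, x))

bagged = np.mean(preds, axis=0)  # averaged prediction at each x
```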
Monte Carlo integration
A technique for approximating integrals in Bayesian inference. To approximate the integral of a function over a domain D, generate samples from a uniform distribution over D and average the value of the function at those samples. More generally, we can use a non-uniform proposal distribution, as long as we weight samples accordingly. This is known as importance sampling (which is an integration method, not a sampling method). For Bayesian estimation, a popular approach is to sample from the posterior distribution, even though it is usually not the most efficient proposal distribution. Gibbs sampling is typically used to generate the samples. Gibbs sampling employs a succession of univariate samples (a Markov Chain) to generate an approximate sample from a multivariate density. See "Introduction to Monte Carlo methods", "Probabilistic Inference using Markov Chain Monte Carlo Methods", and the Markov Chain Monte Carlo home page. Software includes BUGS and FBM.
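The first two ideas above can be sketched directly (the integrand and proposal are assumed examples; Gibbs sampling is beyond a few lines): plain Monte Carlo with uniform samples, then importance sampling with a non-uniform proposal and 1/q weights.

```python
import numpy as np

rng = np.random.default_rng(4)

# Approximate the integral of exp(-x^2) over D = [0, 2] (~0.882).
f = lambda x: np.exp(-x ** 2)

# Plain Monte Carlo: uniform samples on D, average f, scale by |D| = 2.
u = rng.uniform(0.0, 2.0, size=100_000)
plain = 2.0 * f(u).mean()

# Importance sampling: sample from a non-uniform proposal q and
# weight each sample by 1/q(x); the weighted average estimates the integral.
x = rng.exponential(scale=1.0, size=100_000)
x = x[x < 2.0]                            # keep samples inside the domain
q = np.exp(-x) / (1.0 - np.exp(-2.0))     # truncated-exponential density
importance = (f(x) / q).mean()
```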
Regularization
Any estimation technique designed to impose a prior assumption of "smoothness" on the fitted function. See "Regularization Theory and Neural Networks Architectures".
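One common instance (an assumed example; ridge regression rather than the neural-network setting of the reference): penalizing the squared norm of the weights shrinks the fit toward a smoother, smaller-norm solution.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 10))          # more features than the data supports
w_true = np.zeros(10)
w_true[0] = 1.0                        # only one feature actually matters
y = X @ w_true + rng.normal(0, 0.5, size=30)

# Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2.
# The penalty acts as a prior preferring small (smooth) weights.
lam = 1.0
w_ols = np.linalg.solve(X.T @ X, X.T @ y)                     # no penalty
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
```

The regularized solution always has smaller norm than the unpenalized one.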
Expectation-Maximization (EM)
An optimization algorithm based on iteratively maximizing a lower bound. Commonly used for maximum likelihood or maximum a posteriori estimation, especially fitting a mixture of Gaussians.
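A compact sketch of the mixture-of-Gaussians case (a two-component 1-D mixture, assumed for illustration): alternate computing responsibilities (E-step) and re-estimating parameters from them (M-step).

```python
import numpy as np

rng = np.random.default_rng(6)
# Data drawn from two well-separated Gaussians at -3 and +3.
data = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])   # initial means (must break symmetry)
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])    # mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = (pi / np.sqrt(2 * np.pi * var)) * \
           np.exp(-(data[:, None] - mu) ** 2 / (2 * var))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / data.size

mu_sorted = np.sort(mu)  # should recover roughly [-3, 3]
```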
Variational bound optimization
A catch-all term for variations on the EM algorithm which use alternative lower bounds (usually simpler ones). The particular lower bound used by EM can lead to an intractable E-step. With a looser bound, the iterative update is more tractable, at the cost of increasing the number of iterations. Another approach, though less often used, is to use a tighter bound, for faster convergence but a more expensive update. See "An introduction to variational methods for graphical models", "Notes on variational learning", "Exploiting tractable substructures in intractable networks".
Variational bound integration
To approximate the integral of a function, lower bound the function and then integrate the lower bound. Not to be confused with Jensen bound integration. Variational Bayes applies this technique to the likelihood function for integrating out parameters. The EM bound can be used for this, or any of the simpler bounds used for variational bound optimization.
Jensen bound integration
To approximate the integral of a function, apply Jensen's inequality to turn the integral into a product which lower-bounds the integral. The bound has free parameters which are chosen to make it as tight as possible. Unlike variational bound optimization, the integrand itself does not need to be bounded, and very different answers can result from the two methods.
Expectation Propagation
To approximate the integral of a function, approximate each factor by sequential moment-matching. For dynamic systems, it generalizes Iterative Extended Kalman filtering. For Markov nets, it generalizes belief propagation. See A roadmap to research on EP.
Newton-Raphson
A method for function optimization which iteratively maximizes a local quadratic approximation to the objective function (not necessarily a lower bound as in Expectation-Maximization). If the local approximation is not quadratic, we have a generalized Newton method. See "Beyond Newton's method".
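A one-variable sketch (the objective is an assumed example): maximizing a local quadratic fit amounts to iterating x <- x - f'(x)/f''(x).

```python
import math

# Maximize f(x) = x * exp(-x), whose maximum is at x = 1.
def fprime(x):
    return (1.0 - x) * math.exp(-x)

def fsecond(x):
    return (x - 2.0) * math.exp(-x)

# Each step jumps to the peak of the local quadratic approximation.
x = 0.5
for _ in range(20):
    x = x - fprime(x) / fsecond(x)
```

Convergence is quadratic near the optimum: the number of correct digits roughly doubles per iteration.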
Iteratively Reweighted Least Squares
A method for maximum likelihood estimation of a generalized linear model. It is equivalent to Newton-Raphson optimization. See McCullagh&Nelder.
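A sketch for the logistic-regression case (synthetic data, assumed for illustration): each iteration solves a weighted least-squares problem with working weights and a working response, which is exactly a Newton-Raphson step.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic binary labels from a logistic model with weights (0.5, 2).
X = np.column_stack([np.ones(500), rng.normal(size=500)])
w_true = np.array([0.5, 2.0])
p = 1.0 / (1.0 + np.exp(-X @ w_true))
y = (rng.uniform(size=500) < p).astype(float)

# IRLS: repeatedly solve a weighted least-squares problem.
w = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ w))
    s = mu * (1.0 - mu)              # working weights
    z = X @ w + (y - mu) / s         # working response
    w = np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (s * z))
```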
Back-propagation
A method for maximum likelihood estimation of a feed-forward neural network. It is equivalent to steepest-descent optimization. See Bishop.
Backfitting
A method for maximum likelihood estimation of a generalized additive regression. You iteratively optimize each f_i while holding the others fixed. It is equivalent to the Gauss-Seidel method in numerical linear algebra. See Hastie&Tibshirani and "Bayesian backfitting".
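A small sketch of the iteration (the data and the choice of a cubic polynomial as each component's "smoother" are assumed for illustration): refit each f_i to the residual left by the others, cycling until the fit stabilizes.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 400
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
# Additive truth: y = f1(x1) + f2(x2) + noise.
y = np.sin(2 * x1) + x2 ** 2 + rng.normal(0, 0.1, n)

# Backfitting: fit each component to the partial residual in turn.
f1 = np.zeros(n)
f2 = np.zeros(n)
for _ in range(20):
    r1 = y - y.mean() - f2
    f1 = np.polyval(np.polyfit(x1, r1, 3), x1)
    f1 -= f1.mean()                  # center each f_i for identifiability
    r2 = y - y.mean() - f1
    f2 = np.polyval(np.polyfit(x2, r2, 3), x2)
    f2 -= f2.mean()

resid = y - y.mean() - f1 - f2
rms = float(np.sqrt((resid ** 2).mean()))  # should approach the noise level
```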
Kalman filtering
An algorithm for inferring the next state or next observation of a Linear Dynamical System. By making the state a constant, it can also be used for incrementally building up a maximum-likelihood estimate of a parameter. See "An Introduction to the Kalman Filter" (with links), "Dynamic Linear Models, Recursive Least Squares and Steepest Descent Learning", "From Hidden Markov Models to Linear Dynamical Systems", and Gelb (Ch.4).
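The constant-state special case mentioned above can be sketched in a few lines (scalar model, assumed for illustration): each observation is blended into the estimate with a gain that shrinks as the estimate's variance shrinks.

```python
import numpy as np

rng = np.random.default_rng(9)
true_value = 4.0
obs = true_value + rng.normal(0, 1.0, size=200)  # noisy measurements

# Scalar Kalman filter with a constant state (no process noise).
x_est, p = 0.0, 100.0   # initial estimate and its variance (vague prior)
r = 1.0                 # observation noise variance
for z in obs:
    k = p / (p + r)                   # Kalman gain
    x_est = x_est + k * (z - x_est)   # blend prediction with observation
    p = (1.0 - k) * p                 # updated estimate variance
```

With a constant state this incrementally reproduces the (prior-weighted) sample mean, i.e. a recursive maximum-likelihood estimate.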
Extended Kalman filtering
Kalman filtering applied to general dynamical systems with Gaussian noise. At each step, the dynamical system is approximated with a linear dynamical system, to which the Kalman filter is applied. The linear approximation can be iteratively refined to improve the accuracy of the Kalman filter output. Despite the name, extended Kalman filtering is not really different from Kalman filtering. See Gelb.
Relaxation labeling
An optimization algorithm for finding the most probable configuration of a Markov random field. It generalizes the Viterbi algorithm for Markov chains. Other approaches to this problem include Iterated Conditional Modes, simulated annealing, network flow, and variational lower bounds. See "Foundations of Relaxation Labeling Processes" (Hummel and Zucker; appears in Readings in Computer Vision), "Self Annealing: Unifying deterministic annealing and relaxation labeling", "Probabilistic relaxation", and Li.
Deterministic annealing
An optimization technique where the true objective function is morphed into a convex function by a continuous convexity parameter. Start by solving the convex problem and gradually morph back to the true objective while iteratively recomputing the optimum. It is called "graduated nonconvexity" in computer vision; the "annealing" name comes from statistical physics, where the convexity parameter often corresponds to temperature.
Boosting
A technique for combining models based on adaptive resampling: different data is given to different models. The idea is to successively omit the "easy" data points, which are well modeled, so that the later models focus on the "hard" data. See Schapire's page, "Additive Logistic Regression: a Statistical View of Boosting", "Prediction Games and Arcing Algorithms", and "Half&Half Bagging and Hard Boundary Points".
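A sketch of the adaptive-resampling idea via AdaBoost with threshold stumps (the 1-D interval problem and the stump learner are assumed for illustration): each round re-weights the data so the next stump focuses on the points the earlier ones got wrong.

```python
import numpy as np

rng = np.random.default_rng(10)
# 1-D toy problem: +1 inside an interval, -1 outside, so no single
# threshold stump can solve it alone.
x = rng.uniform(0, 1, 300)
y = np.where((x > 0.3) & (x < 0.7), 1.0, -1.0)

def best_stump(x, y, w):
    """Weighted-error-minimizing classifier of the form sign(s*(x - t))."""
    best = None
    for t in np.linspace(0, 1, 51):
        for s in (1.0, -1.0):
            pred = np.where(s * (x - t) > 0, 1.0, -1.0)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

# AdaBoost: up-weight hard points so later stumps concentrate on them.
w = np.full(x.size, 1.0 / x.size)
ensemble = []
for _ in range(50):
    err, t, s = best_stump(x, y, w)
    err = max(err, 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)   # stump's vote weight
    pred = np.where(s * (x - t) > 0, 1.0, -1.0)
    w = w * np.exp(-alpha * y * pred)       # boost weight of mistakes
    w = w / w.sum()
    ensemble.append((alpha, t, s))

# Combined prediction: weighted vote of all stumps.
score = sum(a * np.where(s * (x - t) > 0, 1.0, -1.0)
            for a, t, s in ensemble)
train_err = float(np.mean(np.sign(score) != y))
```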
Empirical Risk Minimization
A parameter estimation heuristic that seeks parameter values that minimize the "risk" or "loss" that the model incurs on the training data. In classification, a "loss" usually means an error, so it corresponds to choosing the model with lowest training error. In regression, "loss" usually means squared error, so ERM corresponds to choosing the curve with lowest squared error on the training data. It is thus the most basic (and naive) estimation heuristic. This method only uses a loss function appropriate for the problem and does not utilize a probabilistic model for the data. See "Empirical Risk Minimization is an incomplete inductive principle".
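The regression case above is just ordinary least squares (a minimal sketch with assumed synthetic data): choose the line whose training-set squared error, the empirical risk, is smallest.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.uniform(0, 1, 100)
y = 1.0 + 3.0 * x + rng.normal(0, 0.2, size=100)  # noisy line

# ERM with squared loss: minimize mean squared error on the training
# data, i.e. ordinary least squares. No probabilistic model is assumed.
A = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
train_risk = float(np.mean((A @ w - y) ** 2))  # the minimized empirical risk
```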