Statistical Learning - Statistical models
Terms in this set (41)
Linear regression
A conditional statistical model of random vector y given measurement vector x, where y is a linear transformation of x followed by additive noise, typically Gaussian. That is, y = Ax + e. The matrix A is the parameter of the model. See "Bayesian linear regression".
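As a minimal numpy sketch (the matrix values, sample size, and seed below are invented for illustration), the maximum-likelihood estimate of A under Gaussian noise reduces to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data from the model y = Ax + e with Gaussian noise e.
A_true = np.array([[2.0, -1.0],
                   [0.5,  3.0]])           # 2x2 parameter matrix
X = rng.normal(size=(200, 2))              # measurement vectors (rows)
Y = X @ A_true.T + 0.01 * rng.normal(size=(200, 2))

# Under Gaussian noise, the maximum-likelihood A is the least-squares fit.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves X B ≈ Y
A_hat = B.T                                # so that y ≈ A_hat x
```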
Basis function regression
Linear regression applied to a new feature set, i.e. a new basis. If f_i(x) is basis function i, then the conditional model is y = sum_i a_i f_i(x) + e. The parameters of the model are the weights a_i. Polynomial regression and spline regression are special cases.
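A quick sketch of the polynomial special case (degree, coefficients, and noise level are illustrative): the basis functions are f_i(x) = x^i, and the weights a_i are found by linear regression on the new features.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 100)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.01 * rng.normal(size=x.size)

# Basis functions f_i(x) = x**i (polynomial regression).
F = np.vander(x, 3, increasing=True)        # columns: 1, x, x^2
a, *_ = np.linalg.lstsq(F, y, rcond=None)   # the weights a_i
```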
Gaussian process regression
A non-parametric regression model which is equivalent to basis function regression with a Gaussian prior density on the coefficients. The basis functions are implicitly determined by the covariance function of a Gaussian process. The advantage of this formulation is that the number of basis functions is potentially infinite, and they can be adapted, making it competitive with feed-forward neural network regression. See "Gaussian processes: A replacement for supervised neural networks?", Rasmussen's Web site, and Cressie.
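A bare-bones sketch of the GP posterior mean at one test point, assuming a squared-exponential covariance function (the length scale, noise level, and sine-wave data are invented for the example):

```python
import numpy as np

def sq_exp_kernel(a, b, length=0.5):
    """Squared-exponential covariance function k(a, b)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / length**2)

rng = np.random.default_rng(2)
x_train = np.linspace(0, 3, 20)
y_train = np.sin(x_train) + 0.01 * rng.normal(size=20)
x_test = np.array([1.5])

noise = 1e-4
K = sq_exp_kernel(x_train, x_train) + noise * np.eye(20)
k_star = sq_exp_kernel(x_test, x_train)

# Posterior mean of the GP at the test point.
mean = k_star @ np.linalg.solve(K, y_train)
```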
Radial Basis Function regression
Basis function regression where each new feature is based on the distance to a prototype, hence the basis is "radial." The resulting curve is a superposition of "bumps," one at each prototype. Such models are motivated by regularization theory and Gaussian process theory. The basis functions are adapted by moving the prototypes or reshaping the bumps. See "Introduction to Radial Basis Function Networks", "Regularization Theory and Neural Networks Architectures", and "How do MLPs compare with RBFs?".
Generalized linear model
Linear regression except the conditional density of y is an arbitrary function of Ax, i.e. there is not simply additive noise. For example, a Poisson model for y, where the Poisson rate is equal to Ax, is a generalized linear model. Logistic regression is also a kind of generalized linear model. The attraction of such models is that the parameter A can be learned in much the same way as for linear regression. See McCullagh&Nelder.
Logistic regression
A conditional statistical model of binary variable y given measurement vector x. The probability that y is 1 is given by the logistic function applied to a linear combination of x. That is, p(y=1) = 1/(1+exp(-a*x)). Logistic regression is a generalized linear model (and is really logistic linear regression). The row vector a is the parameter of the model. Logistic regression gives rise to a linear classifier. See a logistic regression overview and "Why the logistic function?".
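The log-likelihood of this model is concave in a, so a plain gradient ascent recovers the parameter. A small sketch (the true parameter, step size, and iteration count are illustrative):

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
a_true = np.array([3.0, -2.0])            # illustrative parameter row vector
X = rng.normal(size=(500, 2))
y = (rng.random(500) < logistic(X @ a_true)).astype(float)

# Fit a by gradient ascent on the (concave) log-likelihood.
a = np.zeros(2)
for _ in range(4000):
    grad = X.T @ (y - logistic(X @ a))    # gradient of the log-likelihood
    a += 0.005 * grad
```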
Gaussian process classifier
A non-parametric logistic regression model analogous to Gaussian process regression. It is equivalent to using basis functions in logistic regression. See "Bayesian Classification with Gaussian Processes" and Rasmussen's Web site.
Feed-forward neural network regression
Basis function regression with adaptive basis functions. Given a measurement vector, each layer of the network makes a linear transformation and then applies a nonlinearity to each vector component. For example, a two-layer network has y = A2 f(A1 x) + e. The matrix A2 contains the basis coefficients while the matrix A1 adapts the basis. The nonlinearity f is fixed and is typically chosen to be a smoothed step function. The resulting basis contains oriented step functions with varying slope. See Bishop and "How many hidden layers should I use?".
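The two-layer form y = A2 f(A1 x) + e can be written down directly; here is a forward pass with f = tanh, a common smoothed step function (the layer sizes and random weights are illustrative, not trained):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-layer network y = A2 f(A1 x) with f = tanh (a smoothed step).
A1 = rng.normal(size=(5, 3))    # adapts the basis (3 inputs -> 5 hidden)
A2 = rng.normal(size=(2, 5))    # basis coefficients (5 hidden -> 2 outputs)

def network(x):
    return A2 @ np.tanh(A1 @ x)

y = network(rng.normal(size=3))
```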
Feed-forward neural network density model
Neural network regression except the conditional density of y is an arbitrary function of the network output, i.e. there is not simply additive noise. An example is logistic network regression. Follows the same principle as a generalized linear model.
Additive regression
A multivariate regression expressed as the sum of univariate regressions. That is, y is a sum of functions applied to the components of x, plus additive noise: y = sum_i f_i(x_i) + e. When the conditional density of y is an arbitrary function of sum_i f_i(x_i), such as a logistic additive regression, then you have a generalized additive regression. The parameters of the model are the functions f_i, which are learned recursively by another regression model. These models are very useful for visualization (also see projection pursuit regression). See Hastie&Tibshirani and Backfitting.
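The recursive learning of the f_i is the backfitting loop: cycle over coordinates, fitting each f_i to the residual left by the others. A minimal sketch in which each univariate regression is a cubic polynomial fit (the target functions, noise level, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1]**2 + 0.05 * rng.normal(size=n)

# Backfitting: cycle over coordinates, fitting each f_i to the residual
# of the others. Here each univariate fit is a cubic polynomial.
coefs = [np.zeros(4), np.zeros(4)]
for _ in range(10):
    for i in range(2):
        other = sum(np.polyval(coefs[j], X[:, j]) for j in range(2) if j != i)
        coefs[i] = np.polyfit(X[:, i], y - other, 3)

fit = sum(np.polyval(coefs[i], X[:, i]) for i in range(2))
```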
Projection pursuit regression
Additive regression on an adaptive feature space. That is, y = sum_i f_i(a_i*x) + e where the row vector a_i takes a linear combination of x. The parameters of the model are the axes a_i as well as the functions f_i, which are learned recursively. See "Projection Pursuit Regression" and "Generalized Projection Pursuit Regression".
Robust regression
A regression model which includes the possibility of measurement outliers. It also refers to parameter estimation that can handle outliers. (But changing your estimator without changing your model is inconsistent behavior.) See Venables&Ripley, Barnett&Lewis, code and papers at Antwerp, and an illustrative applet.
Independent Feature Model ("Naive Bayes")
Any statistical model which assumes that the components of the measurement vector, i.e. the features, are conditionally independent given the class. Like additive regression, this assumption allows each component to be modeled separately. Independent Component Analysis/Projection Pursuit can be used to extract better features for the Independent Feature Model. See Duda&Hart, "A Comparison of Event Models for Naive Bayes Text Classification", and "Beyond independence: Conditions for the optimality of the simple Bayesian classifier".
Linear classifier
Any classifier with straight-line decision boundaries in feature space. Such classifiers can arise from many different statistical models, especially Gaussian class-conditional densities. They can be implemented using a single-layer neural network or Perceptron. See Duda&Hart, Bishop.
Generalized linear classifier
Any classifier implemented by subjecting data to a nonlinear transformation before feeding it to a linear classifier. The transformation may increase or decrease the number of features. The new features are like basis functions in basis function regression, and the classifier is essentially thresholding a basis function regression. It can have arbitrarily shaped decision boundaries. A Radial Basis Function classifier is a special case where the transformation is based on distances to a set of prototypes (the basis functions are "bumps"). See Duda&Hart.
Support Vector Machine
A generalized linear classifier with a maximum-margin fitting criterion. This fitting criterion provides regularization which helps the classifier generalize better. The classifier tends to ignore many of the features. For example, a Support Vector Machine can be used as a regularized Radial Basis Function classifier.
Finite mixture model
Any probability density expressed as the weighted sum of simpler densities. The simpler densities are the components or states of the mixture. An equivalent definition is hierarchical sampling: to sample from the density, first draw a state at random (using a distribution over states) and then sample from that component. Using Gaussian components leads to a mixture of Gaussians. The distribution over states can itself be a mixture, leading to a deeper hierarchy, which is a hierarchical mixture model. Any sort of stochastic decision tree is a hierarchical mixture model. See David Dowe's website, "Cluster-Based Probability Model and Its Application to Image and Texture Processing", Bishop, and Expectation-Maximization.
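The hierarchical-sampling definition translates directly into code. A sketch for a two-component mixture of Gaussians (weights, means, and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

weights = np.array([0.3, 0.7])           # distribution over states
means = np.array([-5.0, 5.0])            # Gaussian component means
stds = np.array([1.0, 1.0])              # Gaussian component std devs

# Hierarchical sampling: draw a state at random, then sample from
# that component.
states = rng.choice(2, size=10000, p=weights)
samples = rng.normal(means[states], stds[states])
```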
Mixture of subspaces
A mixture of Gaussians where each Gaussian models only a subset of the features. The other features have a much smaller variance, producing a flat, pancake-like density. In other words, the covariance matrix of the Gaussian is nearly singular, reducing the number of parameters to estimate. Each Gaussian applies some feature extraction technique like PCA to determine the features to use. It is thus a combination of clustering and feature extraction. See "The EM Algorithm for Mixtures of Factor Analyzers" (with software), "Mixtures of Principal Component Analyzers", "Mixtures of local linear subspaces for face recognition", and "Modeling the Manifolds of Images of Handwritten Digits".
Coupled mixture model
A joint statistical model for a set of variables, where each variable has a mixture distribution. The mixtures are coupled by a joint distribution over states. For example, you could constrain two mixtures to never be in the same state. To sample the variables, you first sample a state for each variable (using the joint distribution over states) and then sample each variable from its chosen component. Coupled mixture models include Hidden Markov Models and Hidden Markov random fields.
Markov chain
Any multivariate probability density whose independence diagram is a chain. In other words, the variables are ordered, and each variable "depends" only on its neighbors in the sense of being conditionally independent of the others. An equivalent definition is that you sample the variables left-to-right, conditional only on the last outcome.
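The left-to-right sampling definition, sketched for a two-state chain (the transition matrix and chain length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two-state chain: each variable depends only on the previous one.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])               # P[i, j] = Pr(next=j | current=i)

# Sample left-to-right, conditional only on the last outcome.
chain = [0]
for _ in range(5000):
    chain.append(rng.choice(2, p=P[chain[-1]]))
chain = np.array(chain)
```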
Autoregression
A Markov chain model for a sequence of variables, where the next variable is predicted from the previous variable via regression, typically linear.
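A first-order linear autoregression, with its coefficient recovered by regressing each variable on its predecessor (the coefficient and series length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

# AR(1): x_t = a * x_{t-1} + e_t, a linear autoregression.
a, n = 0.8, 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + rng.normal()

# The parameter is recovered by linear regression of x_t on x_{t-1}.
a_hat = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])
```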
Hidden Markov Model
A joint statistical model for an ordered sequence of variables. It is the result of stochastically perturbing the variables in a Markov chain (the original variables are thus "hidden"). The Markov chain has discrete variables which select the "state" of the HMM at each step. The perturbed values can be continuous and are the "outputs" of the HMM. A Hidden Markov Model is equivalently a coupled mixture model where the joint distribution over states is a Markov chain. If you made a scatterplot of the data, ignoring the ordering, it would look like a finite mixture. Only when you consider the ordering will you see that the progression of states has Markov structure: the next state depends on the previous state.
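A sampling sketch makes the "perturbed Markov chain" view concrete: a hidden discrete chain selects the state, and each output is the state's mean plus Gaussian noise (the transition matrix and output means are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)

# Hidden Markov chain over 2 states; Gaussian output per state.
P = np.array([[0.95, 0.05],
              [0.05, 0.95]])             # state transition matrix
means = np.array([-3.0, 3.0])            # output mean for each state

n = 2000
states = np.zeros(n, dtype=int)
for t in range(1, n):
    states[t] = rng.choice(2, p=P[states[t - 1]])
outputs = rng.normal(means[states], 1.0)  # perturbed (observed) values
```

Ignoring the ordering, a histogram of `outputs` looks like a two-component mixture; the Markov structure shows up only in the persistence of states over time.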
Autoregressive Hidden Markov Model
A Hidden Markov Model except that each output depends not only on the state but also the previous output, usually via a regression model. This extension can be implemented with surprisingly little extra work. (This differs from Rabiner's terminology. Rabiner's "autoregressive HMM" refers to a particular choice of emission density, and would now be called a crude Hidden Markov Segment Model.) See "Hidden Markov models using shared and global vector linear predictors", "ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition" (Digalakis, Rohlicek, & Ostendorf, IEEE Trans Speech and Audio 1(4), 1993), and "The Double Chain Markov Model".
Input/Output Hidden Markov Model
A Hidden Markov Model which is controlled by an input sequence. The input sequence can influence the choice of state at each step or it can influence the choice of output at each step. See "Markovian Models for Sequential Data" and "Input/Output HMMs for Sequence Processing".
Factorial Hidden Markov Model
The result of summing the outputs of several Hidden Markov Models. More generally, any arithmetic operation could be used, as long as it only applies to the outputs at a particular time. It is equivalent to one large HMM with a factored state distribution: the state is made up of independent parts. Factoring the state distribution is a general approach to reducing the number of parameters in a hidden variable model. Summing the outputs of ordinary mixture models gives a factorial mixture model. See "Factorial Learning and the EM Algorithm" and "Factorial Hidden Markov Models" (software).
Hidden Markov Decision Tree
A Hidden Markov Model where the distribution over states is hierarchical, as in a Hierarchical Mixture Model. The hierarchical path chosen for the next state depends on the hierarchical path for the previous state. In other words, the outcome at a particular node in the hierarchy is governed by a Markov chain. Thus it is both a Factorial Hidden Markov Model and a Coupled Hidden Markov Model. See "Hidden Markov Decision Trees".
Switching Hidden Markov Model
The result of switching among the outputs of several Hidden Markov Models. That is, only one of the outputs is measured at any particular time. The switching is controlled by a separate Markov chain. See "Variational Learning for Switching State-Space Models".
Hidden Markov Model with Duration
A Hidden Markov Model where the next state depends not just on the previous state, but on how long you have been in that state. Equivalently, it is an HMM with a nonstationary state transition matrix, that changes whenever you stay in the same state, and resets when you leave the state. See Rabiner&Juang (Ch.6).
Hidden Markov Segment Model
A Hidden Markov Model where the outputs are not simply vectors but sequences of arbitrary length ("segments"). The length of a segment and the variables in a segment can have an arbitrary probability distribution conditional on the state. In particular, the variables can be correlated. For example, the segment model can itself be an HMM, giving a Hierarchical Hidden Markov Model. The segment model can be a Gaussian process or (more specifically) a Linear Dynamical System. A Hidden Markov Model with Duration is a special case. A common refinement is that each segment influences the next segment, e.g. to provide continuity; this is the analogue of an autoregressive HMM.
Linear Dynamical System
A time-series model closely related to the Hidden Markov Model. It generalizes the HMM in some ways and is more restricted in other ways. The generalization is that there is a continuum of possible hidden states; it is an infinite mixture. The restrictions are (1) the state transitions are linear (plus Gaussian noise) and (2) the observations are linear in the state value (plus Gaussian noise). The Linear Dynamical System generalizes linear autoregression in the same way that Hidden Markov Models generalize Markov models. Hence it might be called hidden autoregression. All variables in the sequence are jointly Gaussian: it is a special kind of Gaussian process. See "From Hidden Markov Models to Linear Dynamical Systems", Gelb (Ch.4), and Kalman filtering.
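A simulation sketch with a scalar state, showing both restrictions: linear state transitions plus Gaussian noise, and linear observations plus Gaussian noise (the transition coefficient and noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)

# Linear Dynamical System with a scalar hidden state.
A = 0.9          # state transition coefficient
C = 1.0          # observation coefficient

n = 100
state = np.zeros(n)
obs = np.zeros(n)
for t in range(1, n):
    state[t] = A * state[t - 1] + 0.1 * rng.normal()   # linear + Gaussian
    obs[t] = C * state[t] + 0.1 * rng.normal()         # linear + Gaussian
```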
General dynamical system
A general time-series model based on stochastically perturbing the variables in a Markov chain. The Markov chain may be continuous, the state transitions may be nonlinear with non-Gaussian noise, and the observations may be a nonlinear function of the state plus non-Gaussian noise. In such models, exact Bayesian inference is not feasible. Instead we use special approximation techniques which operate sequentially in time, such as extended Kalman filtering and Monte Carlo filtering. Monte Carlo filtering uses Monte Carlo integration to solve the necessary integrals at each step.
Mixture of Experts
A finite mixture model for random variable y where all the components and the distribution over components are conditional on measurement x. The mixture components are the "experts" and are typically linear regressions. A piecewise linear regression is a special case where x selects a unique expert. The distribution over experts can be hierarchical, as in a hierarchical mixture model, giving a Hierarchical Mixture of Experts. The measurement x influences the hierarchical path, as in a decision tree. See "Learning in modular and hierarchical systems", "A statistical approach to decision tree modeling", "Hierarchical mixtures of experts and the EM algorithm", and Steve Waterhouse's papers.
Decision tree
The deterministic ancestor of the hierarchical mixture of experts. Used widely in Machine Learning, where the most emphasis is on structuring the tree. See Murthy's survey, Ripley, and web site.
Markov random field
A set of random variables whose independence relationships are represented by a neighborhood structure, i.e. an undirected graph. The Markov property asserts conditional independence: given its immediate neighbors in the graph, a variable is independent of all other variables. This property is particularly useful for specifying the conditional distribution of a single variable, making the representation well suited to Gibbs sampling. Specifying all of the conditional distributions completely determines the joint distribution of the variables. This is a very general model, subsuming Markov chains. The color values in an image or the temperature across an object can be modeled as Markov random fields. When the variables are binary, this model is a Boltzmann machine. See Cressie, Li, Hertz (ch 7), Kindermann, and Relaxation labeling.
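The suitability for Gibbs sampling can be sketched on a small binary (Ising-style) field: each variable's conditional distribution depends only on its four grid neighbors (grid size, coupling strength, and sweep count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)

# Gibbs sampling on a binary Markov random field with a toroidal grid:
# each site's conditional distribution depends only on its 4 neighbors.
size, beta = 16, 0.8
x = rng.choice([-1, 1], size=(size, size))

for sweep in range(50):
    for i in range(size):
        for j in range(size):
            nb = (x[(i - 1) % size, j] + x[(i + 1) % size, j]
                  + x[i, (j - 1) % size] + x[i, (j + 1) % size])
            p = 1.0 / (1.0 + np.exp(-2.0 * beta * nb))  # Pr(x_ij=+1 | nbrs)
            x[i, j] = 1 if rng.random() < p else -1
```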
Simultaneous autoregression
A Markov random field where each variable is predicted from its neighbors via regression, typically linear. It generalizes autoregression from Markov chains to Markov random fields. See Cressie.
Hidden Markov random field
The result of stochastically perturbing the variables in a Markov random field (the original variables are thus "hidden"). The result is technically still a Markov random field, but some variables are never observed. It is equivalently a coupled mixture model where the coupling distribution is a Markov random field. This generalizes a Hidden Markov Model by including more dependencies between the hidden states. A noisy image or depth map can be modeled as a Hidden Markov random field. See the references on Markov random fields.
Constrained mixture model
A finite mixture model with a restrictive prior distribution on the possible parameters of the components. Most importantly, the component parameters are statistically dependent. For example, a mixture of Gaussians where the means must form a circle or grid is a constrained mixture. Constrained mixtures are useful for approximating principal curves as well as for constrained clustering. Known historically as a self-organizing map.
Constrained Hidden Markov Model
A Hidden Markov Model with a restrictive prior distribution on the possible parameters of the output densities, which makes them statistically dependent. Equivalently, the Markov generalization of a constrained mixture. Has its origins in speaker adaptation for speech recognition.
Coupled Hidden Markov Model
A set of Hidden Markov Models whose Markov chains influence each other. There are two orthogonal ways to produce coupling. One is to couple the simultaneous states of the Markov chains, e.g. constraining them to always be in different states at any particular time. Another way is to allow the next state of a chain to depend on the previous state of another chain. Unlike a Factorial Hidden Markov Model, the outputs of the Markov chains do not interact. A Hidden Markov Decision Tree is both coupled and factorial.
Stochastic context-free grammar
A joint statistical model for an arbitrarily-long sequence of variables which generalizes a Markov chain. Each sequence is generated, starting from a single symbol, by stochastically applying the productions of a context-free grammar, until no more productions apply. Like a Markov chain, the most likely productions are easy to find for a given sequence, yet the sequence can have long-range correlations.
Stochastic program
A joint statistical model expressed as an explicit sampling procedure. Stochastic grammars are a special case. See "Effective Bayesian Inference for Stochastic Programs", "Bellman Equations for Stochastic Programs", "Nondeterministic Lisp as a Substrate for Constraint Logic Programming", SCREAMER (software), and the modeling language of BUGS.