425 Final Exam
Terms in this set (201)
T/F Classification models must have categorical input (features) and categorical outputs (outputs); they cannot take continuous input (features).
F (inputs can be continuous)
T/F A perceptron classifier is a discriminative model.
T
T/F Clustering is a supervised learning problem and dimensionality reduction is an unsupervised learning problem.
F (clustering and dimensionality reduction are unsupervised)
T/F Logistic regression refers to regression models that use the logistic sigmoid function to help make prediction.
F (Despite its name, logistic regression is a classification model, not a regression model)
T/F K-means guarantees to find the globally optimal solution.
F (Can only guarantee local optimal solution)
T/F SVM aims to minimize the margin.
F (Aims to maximize the margin)
T/F Convolutional neural networks can be used to model image data (e.g., performing image classification); they can also be used to model sequence data.
T (CNNs with 1D convolutions can also model sequence data)
T/F Compared to normal/basic RNNs, LSTM can alleviate the vanishing gradient issue and hence model long-range dependencies/relations better.
T
Definition of Overfitting
When an algorithm corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
Definition of Unsupervised Learning
Given only inputs, automatically discover representations, features, structures, etc. The model learns from data that has not been labeled, classified, or categorized.
Definition of Parametric Models
A learning model that summarizes data with a set number of parameters (or features).
Definition of Regularization
A collection of techniques that can be used to prevent overfitting. Regularization adds information to a problem, often in the form of a penalty against complexity.
Definition of Cross Validation
A model validation technique used for assessing how the results of a statistical analysis will generalize to an independent data set. The main example used in class was n-fold cross validation, where 1/n of the training data is used as validation data, and the (n - 1)/n remaining data is still used as training. You then repeat this process for each fold, a total of n times, and take the average performance of all n validation folds.
Definition of Clustering
An unsupervised learning problem. Grouping similar training cases together and identifying a prototype example to represent each group.
Definition of Local Minima
Refers to the local minimum within some neighbourhood which may not be the global minimum.
Definition of Naive Bayes Classifier
Sort data into "bins" according to the output classes, C_k. Compute the marginal probabilities, p(C_k). For each class, estimate the probabilities of each input, p(x_i | C_k).
Definition of Viterbi Algorithm
Dynamic programming algorithm for obtaining the maximum p(z|x) and the most likely sequence of hidden states that results in a sequence of observed events.
Definition of Backpropagation
Used to train feedforward neural networks. Computes gradient of loss function with respect to the weights.
Definition of models averaging/ensembling
When using the same training data to train different models and can either let the models vote or average their output probabilities.
Definition of dropout
Randomly drop hidden units or input units.
T/F The two major types of machine learning models are supervised models and reinforcement learning models. Unsupervised learning is a special form of supervised learning.
F (The two major types of machine learning models are supervised and unsupervised)
T/F In regression, both the input (i.e., features) and output (i.e., predictions) must take continuous values; they cannot be discrete.
F (Inputs can be discrete or continuous)
T/F In K Nearest Neighbour (KNN), if K is too large, it often causes overfitting.
F (It causes underfitting)
T/F The product of any two kernel functions is still a kernel function.
T
T/F In SVM, once a model is trained, only support vectors are kept; other non-support vectors are not needed in the test stage.
T
T/F In the M step of Mixture of Gaussian, the centre (say, centre-1) of a cluster (say cluster-1), is moved to the weighted mean of all data points that have been assigned to cluster-1 in the previous E step; data points assigned to other clusters will not be considered when M step computes centre-1
F (In Mixture of Gaussians, all data points contribute to every cluster's update, weighted by their soft assignments)
T/F The EM algorithm is an unsupervised clustering method.
F (It is an optimization algorithm)
T/F In CNN, if you use max pooling, the backpropagation takes derivative with regard to every input unit of the pooling layer, all input units in the pooling area get the same gradient value.
F (Only the input unit that produced the max value receives the gradient; all other units in the pooling area get zero)
T/F Vanishing/exploding gradient is less serious in RNN than in a feed-forward network, because RNN share parameters at different time stamps.
F (It is a serious problem for RNNs)
T/F CNN is designed for a 2D data (e.g., images) and cannot be used for modelling sequence data. To process sequence data, one should use RNNs instead.
F (CNNs with 1D convolutions can also model sequence data)
Definition of Underfitting
When a model is too simple to capture the underlying structure of the data. In KNN, when K is too large (close to the number of data points), the model stops learning from the data and becomes a simple majority classifier.
Definition of Development Dataset
A validation dataset. Used to select models during classification and other machine learning tasks.
Definition of Data Augmentation
Automatically augment existing training data to create larger training data. (Ex. Rotating, scaling or translating images).
Definition of Discriminative Classification Models
A function that takes an input vector x and assigns it to one of K classes, denoted C_k. Includes Perceptron, Logistic Regression and SVM.
Definition of max pooling
Pooling operation that calculates the maximum value in each sub-region of each feature map.
Definition of Rectified Linear Unit
An activation function used in neural networks. When x is less than 0, the output is 0; when x is greater than or equal to 0, the output is x.
Definition of autoencoder
Unsupervised neural network. Goal is to remember most salient information in order to reconstruct the data with the fewest number of errors. Can be used for dimensionality reduction and feature learning.
Definition of Divisive Clustering
Start with all data points in a single cluster and successively split clusters.
Definition of decision tree
A tree-structured model that classifies data through a sequence of tests on feature values; each leaf assigns a class label.
T/F Supervised models are trained on labeled data while unsupervised models learn from unlabeled data.
T
T/F Classification models have categorical output and regression models have continuous output.
T
T/F KNNs are parametric models.
F
T/F Perceptron classifiers are generative models.
F
T/F Discriminative models use Bayes' theorem to find the posterior class probability.
F
T/F SVM tries to minimize the margin between two classes.
F
T/F Both K-means and Mixture of Gaussian guarantee to find the globally optimal solution for clustering.
F
T/F Long Short-Term Memory (LSTM) is a type of convolutional neural networks, and is designed to capture long-distance dependency.
F
T/F When applying PCA dimensionality reduction, you can reconstruct the original data with no information loss even when you leave out some eigenvector.
F
T/F Autoencoders are supervised models.
F
Definition of Generative Models
Uses Bayes' Theorem to compute the probabilities p(C_k|x). Can generate new, unseen data points from joint distribution p(x, C_k).
Definition of Non-Parametric Models
The number of parameters (or features) is not fixed.
Definition of Support Vector Machine
Selects the data points that lie on the maximum margin hyperplanes separating data in the feature space (kernelized maximum-margin hyperplane classifier).
Definition of Agglomerative Clustering
Start with each data point in its own cluster and join data points until a single cluster remains. Requires finding distance between pairs of points and cluster centers.
Definition of the Vanishing Gradient
When successive differentiation (over 100s of time steps) causes small gradients to vanish during RNN backpropagation.
Definition of Long Short-Term Memory
Type of RNN architecture, that uses gates to control the flow of information to solve the exploding/vanishing gradient problem found in RNN models.
Definition of Supervised Learning
Learning from labeled data: given example input-output pairs, learn a function that maps inputs to outputs (e.g., classification of data points).
T/F The assumption used in Naive Bayes can be extended to deal with features that have continuous values; an example is the Gaussian Naive Bayes classifier.
T
T/F Nonparametric models refer to machine learning models that have no parameters.
F
T/F The difference between classification and regression is that classification uses categorical inputs/features but regression uses continuous input/features.
F
T/F In a decision tree, one major goal of training is to pick the order of features to split training data.
T
T/F Training is easy for KNN classifiers. KNN just remembers the training data.
T
T/F KNN is a parametric model.
F
T/F During testing, KNN may need to calculate the distance between a test data point and every training data point.
T
T/F KNN may consume a lot of memory and disk space to save all the training data.
T
T/F The perceptron algorithm can be kernalized; i.e., it can be rewritten to a form that depends only on dot-products of datapoints.
T
T/F For data that are linearly separable, the perceptron always converges, i.e., the training process can find a decision boundary in a finite number of steps.
T
T/F When overfitting happens, the training error is high and test error is low.
F
T/F In some situations you should use validation but not cross validation, e.g., when training takes too long
T
T/F A validation set can be used to prevent overfitting.
T
What model(s) is/are discriminative?
Perceptron, Logistic Regression, SVM, Decision Tree
What is a feature space?
The encoding of everything we know about a task (all the inputs and outputs, as a single matrix)
Steps of Classification
1. Encode data into feature space (and class table)
2. Scale and normalize
3. Split data into training and testing set
4. Use training set to create a model
5. Use test set to grade model
What is the purpose of feature scaling?
Feature scaling is used to remove the bias of features with a broad range of values. It also improves the speed of gradient descent.
Typical ways to normalize
Min-max scaling, mean normalization, standardization
N-fold cross validation
1. Split data into n-folds
2. Use n-1 folds to train, use the remaining fold to test
3. Repeat until each fold has been used to test
4. Average results
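The four steps above can be sketched in a few lines of numpy; `train_and_score` is a hypothetical callback standing in for whatever model and metric you use (it is not from the course material):

```python
import numpy as np

def n_fold_cv(X, t, train_and_score, n=5):
    """Sketch of n-fold cross validation: each fold serves as the
    validation set exactly once; the average score over folds is returned."""
    folds = np.array_split(np.arange(len(X)), n)
    scores = []
    for i in range(n):
        val_idx = folds[i]
        # all other folds form the training set
        train_idx = np.concatenate([folds[j] for j in range(n) if j != i])
        scores.append(train_and_score(X[train_idx], t[train_idx],
                                      X[val_idx], t[val_idx]))
    return np.mean(scores)
```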
What are the steps for k-nearest neighbour?
1. Store all training data
2. For a test-point, find the sphere around it, enclosing k neighbours
3. Classify x according to the majority of k neighbours
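A minimal sketch of those three steps, assuming Euclidean distance and majority vote:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Sketch of KNN: store the training data, find the k nearest
    neighbours of x by Euclidean distance, take the majority class."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```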
In KNN, what happens if k is too small?
Overfitting
In KNN, what happens if k is too large?
Underfitting
Definition of non-parametric models
The number of parameters (or features) is not fixed.
Benefits of KNN
Learning is cheap. High performance for low dimensional feature space.
Drawbacks of KNN
Prediction is expensive. If data points are far apart in high dimensions, performance can be reduced.
Goal of linear classifier
Find the line/hyperplane that separates 2 classes.
Linear discriminant function
y(x) = w^T x + w_0
where a given vector x is assigned to class 1 if y(x) > 0 and class 2 otherwise.
What is a discriminant function
A function that perceptron uses to classify points.
What is the perceptron algorithm
y(x) = f(w^T x)
Aims to place a decision boundary to minimize an error function known as the perceptron criterion.
What is the perceptron converge theorem
If there exists an exact solution (i.e., the training set is linearly separable), the perceptron will find an exact solution in a finite number of steps.
Drawbacks of perceptron
1. Many solutions can exist
2. Does not converge if data is non-linearly separable
3. Can take a long time to converge
4. Does not generalize to more than 2 classes
What is the Gaussian classifier
Find the mean vector and covariance matrix for the training data. Using maximum likelihood, determine classes for test data.
What is the Naive Bayes classifier
Sort data into "bins" according to the output classes, C_k. Compute the marginal probabilities p(C_k). For each class, estimate the probabilities of each input, p(x_i | C_k).
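The counting recipe above can be sketched for categorical features; the `1e-9` fallback below is an illustrative stand-in for proper smoothing, not part of the course definition:

```python
import numpy as np

def train_naive_bayes(X, y):
    """Sketch of categorical Naive Bayes training: estimate priors p(C_k)
    and per-class feature likelihoods p(x_i | C_k) by counting."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {}
    for c in classes:
        Xc = X[y == c]
        for i in range(X.shape[1]):
            vals, counts = np.unique(Xc[:, i], return_counts=True)
            for v, n in zip(vals, counts):
                likelihoods[(c, i, v)] = n / len(Xc)
    return priors, likelihoods

def predict(priors, likelihoods, x):
    """Pick the class maximizing p(C_k) * prod_i p(x_i | C_k)."""
    best, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for i, v in enumerate(x):
            p *= likelihoods.get((c, i, v), 1e-9)  # unseen value: tiny stand-in prob
        if p > best_p:
            best, best_p = c, p
    return best
```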
What is logistic regression?
p(C_k|x) = sigma(w^T x)
Used in binary classification, the posterior probability is a logistic sigmoid applied to a linear function of the feature vector phi.
T/F Linear Classifier is discriminative
T
T/F Decision Tree is parametric
T
T/F Decision Tree is discriminative
T
Many classifiers calculate the distance between two points using ___?
Euclidean Distance
What is often used as the method to find the optimal solution, which converges much faster with feature scaling than without?
Gradient Descent
List 3 examples of supervised learning
Classification, Regression, Time Series Prediction
Goal of classification
Select correct class for new inputs.
Goal of regression
Predict outputs accurately for new inputs.
T/F The Mixture of Gaussians assumes that each cluster has its own mean but all clusters share the same covariance matrix.
F
In clustering, the furthest-first strategy states that:
During initialization, have cluster centers as far as possible from one another.
T/F The EM algorithms is an optimization method used in situations when we have some unobserved variable.
T
T/F PCA can be defined as the orthogonal projection of the data onto a lower dimensional linear space, known as the principal subspace, such that the variance of the projected data is minimized.
F
T/F The situation of overfitting in regression can be less severe if we have more training data
T
T/F In linear regression, the squared error function includes a regularization term of the form (lambda/2) * sum_j |w_j|^q. When q equals 1, the regularization term is called the Lasso regularizer
T
T/F When training a constant regression model y = a, we need to use the training data to estimate a. Given a training set {(xn, tn)}, where xn is the feature vector of a data point, and tn is the output at that data point, if the error function that the model aims to minimize is the absolute error, e = sum_n |t_n - a|, the model should use the mean of {tn} as the estimate of a.
False - it should use the median of {tn}. (Squared error -> mean, absolute error -> median)
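The mean-vs-median rule can be checked numerically; this toy grid search over candidate constants is illustrative, not from the exam:

```python
import numpy as np

t = np.array([1.0, 2.0, 2.0, 10.0])  # toy outputs with an outlier

# for the constant model y = a, evaluate both error functions on a grid
candidates = np.linspace(0, 10, 1001)
sq_err = [np.sum((t - a) ** 2) for a in candidates]
abs_err = [np.sum(np.abs(t - a)) for a in candidates]

best_sq = candidates[np.argmin(sq_err)]    # minimizer of squared error
best_abs = candidates[np.argmin(abs_err)]  # minimizer of absolute error
```

The squared-error minimizer lands on the mean (3.75, pulled toward the outlier), while the absolute-error minimizer lands on the median (2.0).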
The ___ clustering algorithms start with one data point per cluster then group them to reach a single cluster that contains all data points, while the ___ clustering algorithms start with one cluster then split them to smaller clusters.
agglomerative, divisive
T/F. PCA can be used to create lossless transformation where data is transformed to a new space that has the same dimensionality as the original space. In such a situation, the original data can be fully reconstructed from the transformed data without any information loss.
T
T/F. In Mixtures of Gaussians, in the M step, all data points (which have been softly assigned to clusters in the E step) are used to update the means and covariances of all clusters.
T
In PCA, we get the following eigenvalues and eigenvectors for the covariance matrix of the training data:
eigenvalue1 = 2.5 and eigenvector1 = (-0.5, 0.866)
eigenvalue2 = 1 and eigenvector2 = (0.866, 0.5)
The first principal component is:
(-0.5, 0.866)
T/F. K-means always converges
T
T/F. Clustering algorithms group similar training cases together and identify a "prototype" example to represent each group.
T
T/F. K-means does not guarantee to find the globally optimal solution(s).
T
T/F. In regression, when we increase the order of polynomials, the training error will decrease (until the error is zero); the test error will decrease first and then when overfitting happens, the test error will start to increase.
T
T/F. A Lasso regularizer often gives us small/sparse models, since it often sets weights to zero.
T
T/F. Non-linear basis functions can extend a linear model to do non-linear regression.
T
T/F. K-medoids updates the centre of a cluster to be the median of all data points assigned to that cluster.
F
Why do we use kernels/the kernel trick
Sometimes data is not linearly separable in its feature space. However, in a higher dimension, this data might be linearly separable.
What is a support vector machine
Selects the data points that lie on the maximum margin hyperplanes separating data in the feature space (kernalized maximum-margin hyperplane classifier)
What is regression
Supervised task with continuous output. Conditional density estimation, p(y|x). Finding a curve that can generate an output with some distribution (Gaussian) with the highest probability.
How can you reduce the effects of overfitting regression models
If there is not enough training data, a high order model will be highly susceptible to noise. Decreasing the model order or increasing the number of training samples will reduce overfitting.
What are the different regularization terms
q = 0.5
q = 1 (lasso regularizer)
q = 2 (quadratic regularizer)
q = 4
What is the goal of regularization
Regularization aims to balance minimizing the unregularized error and keeping the weights small
Which regularization term provides sparse models
Lasso regularization often set some weights to 0, resulting in small/sparse models
What are the steps to perform k-means clustering
1. Generate k clusters
2. Define clusters by randomly determining cluster centers
3. Determining which cluster the data points belong to
4. Find cluster assignment to minimize the sum of the distance from each data point to its center
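Those steps are Lloyd's algorithm; a minimal sketch assuming Euclidean distance and random initialization:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Sketch of k-means: alternate assigning points to the nearest
    centre and moving each centre to the mean of its assigned points.
    Converges, but only to a local optimum."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(X[:, None] - centres[None, :], axis=2)
        assign = dists.argmin(axis=1)
        # update step: move each centre to the mean of its points
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return centres, assign
```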
What are some variations on the k-means cluster
1. Distance can be calculated using a variety of different methods (Euclidean, Manhattan, etc)
2. K-means
3. K-medians
4. K-medoids
What are some tips to speed up clustering
1. Set initial cluster centers as far away as possible from each other
2. Add points to smaller clusters if there is a tie
3. Pick the correct number of clusters and the correct error function using validation
4. Use data structures
5. To avoid local minima, randomly restart and split and merge clusters
What is hierarchical clustering
An algorithms that breaks the dataset into a series of nested clusters organized in a tree
What is the basic idea of "Mixture of Gaussian"
Use the general idea of k-means clustering. Rather than a true/false flag, use probabilities.
What do the E and M stand for in EM algorithm
Expectation and Maximization
What is the EM algorithm used for
Situations when we have unobserved variables.
What is the E step
Estimate the "expectation" of unobserved variables from observed data and current parameters.
What is the M step
Using the result of the E step as "completely" observed data, find the new maximum likelihood parameter estimates.
When should the EM algorithm be used
When there are some missing inputs, labels, or latent variables in a model
What is Principal Component Analysis
The orthogonal projection of data onto a lower dimension, while maximizing the variance of the projected data
What is the principal subspace
A lower-dimensional linear space that the data is projected onto in PCA
What does PCA maximize/minimize
Maximizes the variance of the projected data. Minimizes the average projection cost.
What is the average projection cost in PCA
The mean squared distance between the data points and their projections
What is covariance
Determines how much variables vary from the mean with respect to each other.
T/F If covariance is positive, the two variables tend to increase together.
T
How do you perform PCA
1. Adjust original data to be the difference from the mean of each dimension
2. Calculate covariance matrix
3. Calculate the eigenvectors/eigenvalues for the covariance matrix
4. Order eigenvectors by eigenvalues, highest to lowest
5. Construct the new space using principal components
6. Project the dataset onto the new space
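The six steps map directly onto numpy calls; a sketch:

```python
import numpy as np

def pca(X, d):
    """Sketch of the PCA recipe above: centre the data, eigendecompose
    the covariance matrix, keep the top-d eigenvectors, project."""
    mean = X.mean(axis=0)
    adjusted = X - mean                      # step 1: subtract the mean
    cov = np.cov(adjusted, rowvar=False)     # step 2: covariance matrix
    vals, vecs = np.linalg.eigh(cov)         # step 3: eigh for symmetric matrices
    order = np.argsort(vals)[::-1]           # step 4: largest eigenvalue first
    components = vecs[:, order[:d]]          # step 5: principal components
    return adjusted @ components, components, mean  # step 6: project
```

Projecting then un-projecting recovers the data exactly when it truly lies in the kept subspace: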
How do you reconstruct the original data from PCA
DataAdjust^T = NewSpace x Transformed
DataOriginal^T = DataAdjust^T + OriginalMean
What is the first order Markov model
A chain structured process where future states depend only on the current state
What are some Markov model parameters
State/transition probability, start probability
What is a second order Markov model
Conditional distribution of a particular observation depends on the values of the previous two observations
What are some advantages of Markov models
Training can be easy, just need to count and calculate transition probabilities based on training data
What are some disadvantages of Markov models
Higher order Markov models may not contain "all" sequences in training data. Use "smoothing" to handle unseen cases
T/F The EM algorithm is unsupervised clustering model
F
What are some HMM parameters
Start/Initial Probability, pi
State Transition Probability, A
Emission Probability, E
What is a state transition probability
The probability of switching from one hidden state to another
What is an emission probability
The probability of observing a particular emission/output given a hidden state
How do you decode an HMM
Given HMM parameters and a sequence, find the most likely sequence of hidden states
How do you evaluate an HMM
Given HMM parameters and a sequence, find the probability of that sequence to occur
How do you train an HMM
Given a set of sequences, find the most likely HMM parameters
What is the "bugs on a grid" method of decoding HMM
Start a bug at each state at t = 1. At each time step, every bug advances to each next state, multiplying its probability by the transition and emission probabilities; at each state, keep only the bug with the highest probability. Repeat until t = n, then keep the bug with the largest probability at t = n.
What is a significant problem of Viterbi (how to solve it)
Underflows the probabilities, numbers become very small. Can be solved by using logs.
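A log-space Viterbi sketch; probabilities become sums of logs, so long sequences no longer underflow. The parameter names pi, A, E follow the HMM cards above:

```python
import numpy as np

def viterbi_log(pi, A, E, obs):
    """Sketch of Viterbi in log space: replace products of probabilities
    with sums of log-probabilities to avoid underflow."""
    K, N = len(pi), len(obs)
    logd = np.log(pi) + np.log(E[:, obs[0]])        # best log-prob per state at t=1
    back = np.zeros((N, K), dtype=int)               # backpointers
    for t in range(1, N):
        scores = logd[:, None] + np.log(A)           # scores[i, j]: state i -> state j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(E[:, obs[t]])
    # trace back the most likely hidden state sequence
    path = [int(logd.argmax())]
    for t in range(N - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```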
What are linear neurons
Compute the weighted sum of input features.
Simple but computationally limited.
A multilayer network of linear neurons is still a linear model
y = b + sum(x * w)
What are binary threshold neurons
First compute the weighted sum of inputs.
Send a fixed size spike of activity if the weighted sum exceeds a threshold.
Each neuron combines input truth values to compute the truth value of another proposition.
a = b + sum(x * w)
y = 1, if a > threshold
y = 0, otherwise
What are Rectified Linear Neurons
Compute the weighted sum of inputs.
Passed to Rectified Linear Unit.
a = b + sum(x * w)
y = a, if a > 0
y = 0, otherwise
or
y = max(0, a)
What are sigmoid neurons
Give a real-valued, smooth and bounded function of total input
Learning is easy since derivatives are nice.
a = b + sum(x * w)
y = 1 / (1 + e^-a)
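The neuron types above differ only in the nonlinearity applied to the pre-activation a = b + sum(x * w); a minimal sketch:

```python
import numpy as np

def linear(a):
    return a                                   # linear neuron: y = a

def binary_threshold(a, threshold=0.0):
    return (a > threshold).astype(float)       # spike if weighted sum exceeds threshold

def relu(a):
    return np.maximum(0.0, a)                  # rectified linear: y = max(0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))            # smooth, bounded in (0, 1)
```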
How do you pick between different neurons/activation functions
Use a validation set to choose between types
What neuron is used for feed forward nets
ReLUs
What defines the "intelligence" of a neural network
Structure of network and number of neurons
What is the most common type of neural network
Feed-Forward
What is a deep neural network
A network with more than one hidden layer
Describe the layers in a FF neural network
First (bottom) layer is the input.
Last (top) layer is the output.
The activities of the neurons in each layer are a non-linear function of the activities of the layer below.
What is the Universal Approximator Theorem
Networks with one hidden layer are enough to represent an approximation of any function to an arbitrary degree of accuracy.
Why would you want a deep network
Able to generalize data better than shallow networks (which are better at memorizing)
What is the benefit of a deep network over a shallow network
Deep networks are less susceptible to overfitting. They achieve better performance with the same number of parameters.
How deep should a neural network be
No universal answer
What are the steps to train a FF neural network
1. Forward propagation - compute hidden units and output
2. Compute loss/error
3. Back propagation - compute gradients of the loss with respect to the weights
4. Use an error optimization algorithm to determine good parameters
What happens during FF propagation
Compute hidden units and output
What is the problem with exploding/vanishing gradient?
Small weights will make the gradients vanish.
Large weights will make the gradients explode.
Makes long range dependencies difficult to deal with.
What is a limitation of HMM
At each time instance, the hidden state must be one of K discrete states. When there are many hidden states, HMM requires a very large transition matrix and much more training data.
What are the benefits of RNN
Distributed hidden states allow for more efficient storage of past information.
What is used to solve exploding/vanishing gradients
LSTM. Uses a memory cell with read and write flags/gates.
What are limitations of RNN
Training is time consuming and vanishing/exploding gradient
What are CNNs
Using smaller units as filters and subsampling to improve performance.
Convolve with filters to create smaller maps.
Subsample to reduce map size.
What is pooling
A subsampling method to reduce the size of the map. Reduces complexity.
What is max pooling
Break map into "pools". Recreate the map by taking the largest value from each pool. Introduces invariance in maps.
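A sketch of non-overlapping max pooling using a numpy reshape (a toy stand-in for a framework's pooling layer):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Sketch of max pooling: split the feature map into non-overlapping
    size x size pools and keep the largest value in each pool."""
    h, w = fmap.shape
    return fmap[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size) \
        .max(axis=(1, 3))                       # max within each pool
```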
What is average pooling
Similar to max pool. Rather than a maximum of a pool, take the average of a pool for reconstruction.
What are some methods to improve model performance
Adding prior knowledge, data augmentation, dropout, averaging models, autoencoders
When is your neural network deep enough
When it starts overfitting. Can apply dropout to reduce overfitting after this point.
What are the two methods to classify non-linear data?
1. Kernel Trick - mapping the data to a higher dimension
2. Deep neural networks - Use hidden layer
Layers of an autoencoder
input layer -> hidden layer -> code layer -> hidden layer -> output layer
Why are autoencoders more powerful than PCA
We can design a FF network trying to make the output the same as the input with a hidden layer. PCA occurs when the hidden and output layers are linear. Autoencoders can better reconstruct the data with non-linear multiple layers. Autoencoders can be used for dimensionality reduction and feature learning.
Definition of Denoising Autoencoders
DAE corrupts input data (e.g., by adding noise). It is trained to predict the original, uncorrupted data point as its output.
Draw RNN and its equations
h_t = f(W_hh * h_(t-1) + W_xh * x_t + b_h)
y_t = g(W_hy * h_t + b_y)
(where f and g are non-linear activation functions)
Why are RNNs more powerful than HMM
1. RNN stores more information about history
2. The hidden units are updated with non-linear functions, instead of being controlled by a simple transition matrix.
3. The output contains non-linear neurons, instead of being controlled by a simple emission matrix.
What are the two views of PCA
1. Orthogonal projection of data onto a lower dimensional linear space, known as the principal subspace, s.t. the variance of the projected data is maximized.
2. The linear projection that minimizes the average projection cost, defined as the mean squared distance between the data point and their projections.
Equation for sigmoid logistic function
sigmoid (alpha) = 1 / (1 + exp(- alpha))
Time and Space Complexity for Viterbi
O(K^2N) and O(KN)
Draw LSTM
f_t = sigmoid(W_f [h_(t-1), x_t] + b_f) (forget gate)
i_t = sigmoid(W_i [h_(t-1), x_t] + b_i) (input gate)
o_t = sigmoid(W_o [h_(t-1), x_t] + b_o) (output gate)
c_t = f_t * c_(t-1) + i_t * tanh(W_c [h_(t-1), x_t] + b_c) (memory cell, * is element-wise)
h_t = o_t * tanh(c_t)
What does the Forward Algorithm in HMM computes
At each iteration, the sum over all previous states of (previous probability x transition probability), multiplied by the current emission probability, i.e., p(x_1..t, z_t).
How to change the Viterbi algorithm to make it a forward algorithm.
Instead of taking the max probability, take the sum of probabilities
What is backpropagation used for?
Used to train feed forward networks. Computes gradient loss function w.r.t the weights.
What is a convolution layer
Filter shifts on input to detect local features
What is subsampling layer
Subsample pixels to make image smaller, so there are fewer parameters used to characterize the image, and therefore computation is less expensive.
What is a feature map
Result of convolutional filter
What are fully connected layers
Used after convolution and subsampling layers. Dense layer to output layer.
Draw graph that represents training data and testing data as complexity increases
Training error decreases monotonically as model complexity increases; test error decreases at first, then rises again once the model begins to overfit.
Draw Lasso and Quadratic Regularization
Lasso: diamond-shaped constraint region (|w1| + |w2|), whose corners on the axes produce sparse solutions. Quadratic: circular constraint region (w1^2 + w2^2).