Deep Learning
Terms in this set (54)
Forward propagation def.
A procedure that performs the computations mapping the n_i inputs u(1), ..., u(n_i) to an output u(n). This defines a computational graph where each node computes a numerical value u(i) by applying a function f(i) to the set of arguments A(i), which comprises the values of previous nodes u(j), j < i, with j ∈ Pa(u(i)). The input to the computational graph is the vector x, set into the first n_i nodes u(1) to u(n_i). The output of the computational graph is read off the last (output) node u(n).
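The procedure above can be sketched in a few lines of Python. The graph representation (a list of `(f, parent_indices)` pairs in topological order) and the example graph are illustrative, not from the text:

```python
import math

def forward(inputs, nodes):
    """Forward propagation over a computational graph.

    inputs: values for the first n_i nodes u(1)..u(n_i).
    nodes: list of (f, parent_indices) in topological order (j < i).
    Returns the value of the last (output) node u(n)."""
    u = list(inputs)                          # u(1) .. u(n_i) are the inputs
    for f, parents in nodes:                  # each node applies f(i) to A(i)
        u.append(f(*[u[j] for j in parents]))
    return u[-1]

# Example graph: u3 = x1 * x2, u4 = tanh(u3), for input x = (2.0, 0.5)
value = forward([2.0, 0.5],
                [(lambda a, b: a * b, [0, 1]),
                 (math.tanh, [2])])
```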
Forward propagation algorithm
Tanh activation
An activation function with range (-1, 1). tanh is also sigmoidal (s-shaped).
Tanh function properties, pros and cons
The advantage is that strongly negative inputs are mapped strongly negative and inputs near zero are mapped near zero in the tanh graph.
The function is differentiable.
The function is monotonic, while its derivative is not.
The tanh function is mainly used for classification between two classes.
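A small sketch of tanh and its derivative, 1 − tanh(z)², using only the standard library (helper names are illustrative):

```python
import math

def tanh(z):
    # tanh(z) = (e^z - e^-z) / (e^z + e^-z), range (-1, 1)
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def tanh_prime(z):
    # derivative: 1 - tanh(z)^2, maximal (= 1) at z = 0
    t = tanh(z)
    return 1.0 - t * t
```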
RELU activation
Rectified Linear Unit. The most widely used activation function right now, since it appears in almost all convolutional neural networks and deep learning models. The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
RELU properties, pros and cons
Range: [ 0 to infinity)
The function and its derivative both are monotonic.
But the issue is that all negative values become zero immediately, which decreases the ability of the model to fit or train from the data properly. Any negative input to the ReLU activation function is turned into zero immediately, so the negative part of the input is not mapped appropriately.
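A sketch of ReLU and its (sub)derivative; the convention of returning 0 at z = 0 for the derivative is a common choice, not something the card specifies:

```python
def relu(z):
    # f(z) = max(0, z): zero for z < 0, identity for z >= 0
    return z if z > 0 else 0.0

def relu_prime(z):
    # derivative is 0 for z < 0 and 1 for z > 0; we pick 0 at z == 0
    return 1.0 if z > 0 else 0.0
```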
RELU equation and derivative
Tanh equation and derivative
Logistic sigmoid equation σ
Logistic sigmoid definition, graph and properties
The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input. Its range is (0, 1).
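The saturation described above is easy to see numerically: the derivative σ'(z) = σ(z)(1 − σ(z)) becomes tiny for large |z|. A minimal sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    # logistic sigmoid: 1 / (1 + e^-z), range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # derivative sigma(z) * (1 - sigma(z)); near zero when sigmoid saturates
    s = sigmoid(z)
    return s * (1.0 - s)
```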
Softmax equation
Softmax definition
Calculates the probability distribution of an event over n different events. In other words, this function calculates the probability of each target class over all possible target classes; the calculated probabilities then help determine the target class for the given inputs.
The main advantage of softmax is the output probability range: each probability lies between 0 and 1, and the sum of all the probabilities equals one. When the softmax function is used in a multi-class classification model, it returns the probability of each class, and the target class has the highest probability.
The formula computes the exponential (e-power) of each input value and the sum of the exponentials of all input values. The ratio of each input's exponential to that sum is the output of the softmax function.
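That ratio-of-exponentials description translates directly to code (a naive version without the overflow fix discussed below):

```python
import math

def softmax(xs):
    # exponentiate every input, then normalize by the sum of exponentials
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```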
Softmax is used for ... classification while sigmoid is used for ... classification
Multiclass classification, binary classification
Softmax overflow problem
What we do is rescale so there's no overflow by subtracting the maximum of the x values from each x[i]. That converts overflow into underflow. Underflow is no problem, because that rounds off to zero, which is a well-behaved floating point number.
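A sketch of the rescaling trick: subtracting the maximum makes the largest exponent exactly 0, so `exp` never overflows, and any underflow rounds harmlessly to zero:

```python
import math

def stable_softmax(xs):
    m = max(xs)                          # subtract the max of the x values
    exps = [math.exp(x - m) for x in xs] # largest exponent is now 0
    total = sum(exps)
    return [e / total for e in exps]
```

With inputs like `[1000.0, 1001.0]` a naive `math.exp(1000)` would raise `OverflowError`, while the rescaled version works fine.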
Simplifed Backpropagation algorithm (6.2)
Backpropagation algorithm (6.5)
Backpropagation algorithm (6.6)
The function that needs to be minimized is called
Objective Function or Criterion
The objective function being minimized can also be called
Cost function, loss function or error function
Gradient descent def.
An optimization algorithm used to find the values of parameters (coefficients) of a function (f) that minimizes a cost function (cost). Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
On the picture: the current position, gamma is the weighting factor (step size), and the rest is the gradient term, which tells the direction of steepest descent.
Stochastic Gradient Descent
GD can be slow to run on very large datasets. Because one iteration of batch GD requires a prediction for every instance in the training dataset, it can take a long time when you have many millions of instances.
The idea of SGD is to perform the coefficient update for each training instance rather than at the end of the batch of instances.
The first step of the procedure randomizes the training dataset. Because the coefficients are updated after every instance, SGD is generally noisy.
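A minimal sketch of that procedure for a one-weight linear model y = w·x with squared-error loss; the model, data, and hyperparameters are illustrative:

```python
import random

def sgd(data, lr=0.1, epochs=50, seed=0):
    """SGD on y = w*x with loss (w*x - y)^2, updating after each instance."""
    rng = random.Random(seed)
    data = list(data)                     # copy so shuffling is local
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)                 # randomize the training order
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad                # update per instance, not per batch
    return w
```

On data generated by y = 2x, the weight converges to roughly 2.0.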
Cost functions general definition
Describes how good a prediction the model makes for a given set of parameters.
Gradient descend formulas with SSE as an example
Gradient descend vs Stochastic Gradient Descent comparison
Maximum Likelihood estimator def.
Maximum Likelihood intuitiion
Attempts to find the parameter values that maximize the likelihood function, given the observations. The resulting estimate is called a maximum likelihood estimate, which is also abbreviated as MLE.
KL divergence formula
A measure of how one probability distribution diverges from another.
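For discrete distributions given as probability lists, KL(P ∥ Q) = Σ p(i) log(p(i)/q(i)); terms with p(i) = 0 contribute nothing. A sketch (function name illustrative):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) for discrete distributions; skip terms where p_i == 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note it is not symmetric: KL(P ∥ Q) generally differs from KL(Q ∥ P), and it is zero only when the distributions match.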
Likelihood
The hypothetical probability that an event that has already occurred would yield a specific outcome. The concept differs from that of a probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes.
Maximum likelihood casual definition
A method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.
Log - likelihood
For many applications, the natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with. This is because we are generally interested in where the likelihood reaches its maximum value: the logarithm is a strictly increasing function, so the logarithm of a function achieves its maximum at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques. Finding the maximum of a function often involves taking its derivative and solving for the parameter being maximized. This is often easier with the log-likelihood than with the original likelihood, because the probability of the conjunction of several independent variables is the product of the probabilities of the variables, and solving an additive equation is usually easier than solving a multiplicative one.
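A small worked example of the product-to-sum point: the log-likelihood of i.i.d. Gaussian data with unit variance is a sum of per-example log-densities, and it peaks at the sample mean. The unit-variance Gaussian is an assumption chosen for simplicity:

```python
import math

def log_likelihood(mu, data):
    # sum of log N(x | mu, 1) over i.i.d. samples: the product of densities
    # becomes a sum of log-densities
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2
               for x in data)
```

For data `[1.0, 2.0, 3.0]` the log-likelihood is maximized at mu = 2.0, the sample mean.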
Mean square error loss derived by MLE from Gaussian prior
Learning rate decay
A process of slowly reducing the learning rate value.
1 epoch = 1 pass through the data.
learning_rate = initial_learning_rate / (1 + decay_rate * epoch_num)
Other variations:
learning_rate = 0.95^(epoch_num) * initial_learning_rate -> also called exponential decay
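Both schedules as functions (names illustrative):

```python
def inverse_time_decay(initial_lr, decay_rate, epoch_num):
    # learning_rate = initial_lr / (1 + decay_rate * epoch_num)
    return initial_lr / (1.0 + decay_rate * epoch_num)

def exponential_decay(initial_lr, epoch_num, base=0.95):
    # learning_rate = base^epoch_num * initial_lr
    return base ** epoch_num * initial_lr
```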
SGD with momentum algorithm (8.2)
Difference between Nesterov's momentum and Momentum
The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.
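The contrast can be written as two one-step update rules; `grad` is any callable computing dL/dw, and the hyperparameter values are illustrative:

```python
def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    # standard momentum: gradient evaluated at the current point w
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.1, mu=0.9):
    # Nesterov momentum: gradient evaluated after the velocity is applied,
    # i.e. at the lookahead point w + mu * v
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v
```

With zero velocity the two coincide; once the velocity is nonzero, the lookahead gradient acts as the correction factor described above.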
SGD with Nesterov's momentum
Adagrad 8.4 algorithm
RMSProp algorithm
Root Mean Square Prop
Adam algorithm
Adaptive Moment Estimation
Bias correction terms in Adam algorithm
Early Stopping
a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set.
Early stopping validation based algorithm strategy
1. Split the training data into a training set and a validation set, e.g. in a 2-to-1 proportion.
2. Train only on the training set and evaluate the per-example error on the validation set once in a while, e.g. after every fifth epoch.
3. Stop training as soon as the error on the validation set is higher than it was the last time it was checked.
4. Use the weights the network had in that previous step as the result of the training run.
L1 regularization formula
L1 regularization def.
Adds an L1 penalty equal to the absolute value of the magnitude of coefficients. In other words, it limits the size of the coefficients. L1 can yield sparse models (i.e. models with few coefficients); Some coefficients can become zero and eliminated. Lasso regression uses this method.
L2 regularization
L2 regularization def.
Adds an L2 penalty equal to the square of the magnitude of coefficients. L2 will not yield sparse models and all coefficients are shrunk by the same factor (none are eliminated). Ridge regression and SVMs use this method.
Difference between L1 and L2
The difference between L1 and L2 is that the L2 penalty is the sum of the squares of the weights, while the L1 penalty is the sum of the absolute values of the weights.
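Both penalty terms in code, with a strength factor `lam` (an illustrative name for the regularization coefficient):

```python
def l1_penalty(weights, lam=1.0):
    # L1: lambda * sum of absolute values of the weights
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=1.0):
    # L2: lambda * sum of squares of the weights
    return lam * sum(w * w for w in weights)
```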
What is Regularization?
Regularization is a way to avoid overfitting by penalizing high-valued regression coefficients. In simple terms, it reduces parameters and shrinks (simplifies) the model. This more streamlined, more parsimonious model will likely perform better at predictions. Regularization adds penalties to more complex models and then sorts potential models from least overfit to greatest; The model with the lowest "overfitting" score is usually the best choice for predictive power.
What is Dropout
When applying dropout to a layer, we drop each neuron independently with a probability of p (usually called dropout rate). To the rest of the network, the dropped neurons have value of zero.
Dropout is performed only when training; during inference no nodes are dropped. However, in that case we need to scale the activations down by a factor of 1−p to account for more neurons being active than during training.
Alternatively, we can scale the activations up during training by a factor of 1/(1−p), so that nothing needs to change at inference time.
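A sketch of the second (inverted) variant: drop each activation independently with probability p during training and scale survivors by 1/(1−p); at inference the layer is the identity. Names are illustrative:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    # inference: no nodes are dropped and no scaling is needed
    if not training:
        return list(activations)
    # training: zero each unit with probability p, scale the rest by 1/(1-p)
    scale = 1.0 / (1.0 - p)
    return [a * scale if rng.random() >= p else 0.0 for a in activations]
```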
How dropout works?
Dropout randomly mutes some neurons in the neural network, so we get a sparse network, which greatly decreases the possibility of overfitting. More importantly, dropout makes the weights spread over the input features instead of focusing on only some of them.
Batch normalization def.
a technique that normalizes the mean and variance of each of the features at every level of representation during training. The technique involves normalization of the features across the examples in each mini-batch. Empirically it appears to stabilize the gradient (less exploding or vanishing values) and batch-normalized models appear to overfit less.
Why batch norm works well?
1. It basically normalises the data, and normalised data is almost always good before input to any layer.
2. It acts as a regularizer, because the input to the next layer for a training sample is computed from the combined statistics of all training samples in the batch-normed minibatch. So there is some regularization here.
3. It allows you to use higher learning rates, so networks with batch normalization usually train faster than those without it.
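The per-feature normalization across a mini-batch can be sketched as follows; `gamma` and `beta` are the usual learned scale and shift, and `eps` guards against division by zero:

```python
import math

def batch_norm(feature, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(feature) / len(feature)
    var = sum((x - mean) ** 2 for x in feature) / len(feature)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta
            for x in feature]
```

The output has (approximately) zero mean and unit variance before gamma and beta are applied.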
Write down equations of how convolution of a given image is computed. Assume the input is an image I of size H ×W with C channels, the kernel K has size N ×M, the stride is T ×S, the operation performed is in fact cross-correlation (as usual in convolutional neural networks) and that O output channels are computed.
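A pure-Python sketch of that computation with valid padding: the image is indexed as `image[h][w][c]`, the kernel as `kernel[n][m][c][o]`, and the output position (i, j) sums the elementwise products over the N×M window and all C channels, for each of the O output channels:

```python
def conv2d(image, kernel, stride):
    """Cross-correlation of an H x W x C image with an N x M x C x O kernel
    at stride (T, S), no padding."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    N, M = len(kernel), len(kernel[0])
    O = len(kernel[0][0][0])
    T, S = stride
    out = []
    for i in range(0, H - N + 1, T):          # slide vertically by T
        row = []
        for j in range(0, W - M + 1, S):      # slide horizontally by S
            row.append([sum(image[i + n][j + m][c] * kernel[n][m][c][o]
                            for n in range(N)
                            for m in range(M)
                            for c in range(C))
                        for o in range(O)])   # one value per output channel
        out.append(row)
    return out
```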
same padding scheme
we pad the original image with zero pixels so that every input pixel contributes; with stride 1 the result is exactly the size of the input
out_height = ceil(float(in_height) / float(strides[1]))
out_width = ceil(float(in_width) / float(strides[2]))
valid padding scheme
we only use valid pixels (no padding), which causes the result to be smaller than the input
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))
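The two output-size rules as functions (one spatial dimension each; names illustrative):

```python
import math

def same_out(in_size, stride):
    # same padding: out = ceil(in / stride), independent of filter size
    return math.ceil(in_size / stride)

def valid_out(in_size, filter_size, stride):
    # valid padding: out = ceil((in - filter + 1) / stride)
    return math.ceil((in_size - filter_size + 1) / stride)
```

For a 28-pixel input: same padding at stride 2 gives 14, while a 3-wide filter with valid padding at stride 1 gives 26.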