Lecture 4 - Image classification with convolutional neural networks
Terms in this set (30)
What are some good properties of initial weight values?
• If they are close to a solution, it will lead to faster learning
• They should not be symmetrical or identical -> redundancy
• They should be large enough to break symmetry
• Not too large -> numerical issues
List some common ways of initializing the weights
• Draw from Gaussian or Uniform distribution
• Normalized Xavier initialization
• Random orthogonal matrices
• Sparse initialization
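As a rough sketch (not from the lecture; the layer sizes and generator seed are made up), normalized Xavier initialization for a fully connected layer could look like:

import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    # Normalized Xavier/Glorot initialization: random (breaks symmetry),
    # small enough to avoid numerical issues, scaled by the layer's fan-in/fan-out.
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(784, 128)  # e.g. a 784 -> 128 fully connected layer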
Explain batch learning
Train on the whole training set
1) For each observation, calculate the error
2) Accumulate all the errors
3) Calculate an average error
4) Update the weights
5) Repeat steps 1-4
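A minimal sketch of batch learning for an illustrative linear model with squared error (the data and learning rate are placeholders, not from the lecture):

import numpy as np

def batch_epoch(w, X, y, lr=0.1):
    preds = X @ w                       # 1) predictions give the error for each observation
    grad = X.T @ (preds - y) / len(y)   # 2-3) accumulate the errors and average the gradient
    return w - lr * grad                # 4) one weight update per pass over the whole set

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 3)), rng.standard_normal(100)
w = np.zeros(3)
for _ in range(50):                     # 5) repeat steps 1-4
    w = batch_epoch(w, X, y)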
Explain Stochastic gradient descent learning
Train on a single observation at a time
1) Draw a random observation from training set
2) Train on the observation
3) Calculate the error
4) Update weights
5) Repeat steps 1-4
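The same kind of illustrative linear model, trained with stochastic gradient descent (one randomly drawn observation per update; all values are placeholders):

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 3)), rng.standard_normal(100)
w, lr = np.zeros(3), 0.01

for step in range(5000):
    i = rng.integers(len(y))     # 1) draw a random observation from the training set
    err = X[i] @ w - y[i]        # 2-3) train on it and compute its error
    w -= lr * err * X[i]         # 4) update the weights from this single gradient
                                 # 5) the loop repeats steps 1-4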
Explain Minibatch learning
Same as batch learning, but train on subsets of the training set.
What is a reasonable size for a minibatch?
Often a power of 2:
32, ..., 256
Should fit in memory!
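A sketch of minibatch learning with an illustrative power-of-two batch size of 64 (data and hyperparameters are made up):

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1024, 3)), rng.standard_normal(1024)
w, lr, batch_size = np.zeros(3), 0.05, 64       # 64: a power of 2 that fits in memory

for epoch in range(20):
    order = rng.permutation(len(y))             # reshuffle the training set each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # average error over the minibatch
        w -= lr * grad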
Why is batch normalization needed?
Because all layers are updated simultaneously, which leads to second-, third-, ..., order effects.
It also reduces the risk of the vanishing gradient problem
Explain batch normalization
z_{i}: observation i in the (mini)batch
--------
μ_{b}: expected value of the (mini)batch
μ_{b} = (1/m) * \sum_{i=1}^{m} z_{i}
--------
σ²_{b}: variance of the (mini)batch
σ²_{b} = (1/m) * \sum_{i=1}^{m} [z_{i} - μ_{b}]²
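A sketch of these statistics in code, together with the normalization step that uses them; the small ε and the learnable scale/shift (γ, β) are standard additions that the card does not spell out:

import numpy as np

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    # z: (mini)batch of activations, shape (m, features)
    mu_b = z.mean(axis=0)                        # μ_b = (1/m) Σ z_i
    var_b = z.var(axis=0)                        # σ²_b = (1/m) Σ (z_i - μ_b)²
    z_hat = (z - mu_b) / np.sqrt(var_b + eps)    # zero mean, unit variance per feature
    return gamma * z_hat + beta                  # learnable scale and shift

z = np.random.randn(32, 128) * 5 + 3
print(batch_norm(z).mean(), batch_norm(z).std())  # roughly 0 and 1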
How should we design the learning rate when we use Stochastic learning methods?
It has to be decaying
What are the two conditions for convergence?
1) The sum of learning rates goes to infinity: \sum_{k=1}^{∞} α_{k} = ∞
2) The sum of squared learning rates is bounded: \sum_{k=1}^{∞} α²_{k} < ∞
What is a common decaying learning rate?
A linear combination of the initial and final learning rates:
α_{k} = (1 - λ)α_{0} + λα_{τ}
λ = k/τ
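A sketch of this schedule with made-up values (α_τ chosen as 1% of α_0, in line with the guideline on the next card):

def lr_schedule(k, alpha_0=0.1, alpha_tau=0.001, tau=1000):
    # α_k = (1 - λ)α_0 + λα_τ with λ = k/τ; held at α_τ after iteration τ
    lam = min(k / tau, 1.0)
    return (1 - lam) * alpha_0 + lam * alpha_tau

print(lr_schedule(0), lr_schedule(500), lr_schedule(1000))  # 0.1, 0.0505, 0.001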
What are some good guidelines when choosing initial and final learning rates?
The initial rate should be larger than what the first results suggest
The final rate should be about 1% of the initial rate
Explain Stochastic Gradient Descent with Momentum
The gradient is problematic due to curvature: a lot of unnecessary "jumping" back and forth.
A "velocity" term is added. Instead of following the gradient fully, the update direction is conditioned on the previous update direction.
How quickly the momentum changes is determined by λ. A common λ is 0.9, which usually gives a speed-up of about 10x.
What is the equation for Stochastic Gradient Descent with Momentum?
Δθ = λ*Δθ - (1 - λ)∇_{θ}J
θ = θ + αΔθ
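A sketch of one momentum update following these equations (the gradient and hyperparameters are placeholders):

import numpy as np

def momentum_step(theta, delta, grad, alpha=0.01, lam=0.9):
    delta = lam * delta - (1 - lam) * grad   # Δθ = λΔθ - (1 - λ)∇_θ J
    theta = theta + alpha * delta            # θ = θ + αΔθ
    return theta, delta

theta, delta = np.zeros(3), np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])            # illustrative ∇_θ J
theta, delta = momentum_step(theta, delta, grad)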
List some popular loss functions
1) Maximum likelihood estimation
2) Conditional log-likelihood
3) Cross-entropy cost function
4) Surrogate loss function
5) Hinge loss
What is the equation for Maximum likelihood estimation?
θ_{ML} = argmax_{θ} \prod_{i=1}^{N} P(x^i; θ) = argmax_{θ} \sum_{i=1}^{N} log[P(x^i; θ)]
What is the equation for Conditional log-likelihood?
θ_{ML} = argmax_{θ} \sum_{i=1}^{N} log[P(y^i | x^i, θ)]
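A small numeric sketch of the conditional log-likelihood for a Bernoulli model (the labels and predicted probabilities are made up):

import numpy as np

y = np.array([1, 0, 1, 1])            # observed labels y^i
p = np.array([0.9, 0.2, 0.7, 0.6])    # model's P(y^i = 1 | x^i, θ)

# Σ_i log P(y^i | x^i, θ): higher (less negative) means θ fits the data better
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_lik)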
What is the intuitive explanation for how Cross-entropy cost function works?
Cross-entropy loss increases as the predicted probability diverges from the actual label. For example, if the model predicts a probability of 0.012 for label 1 when the ground truth is 1, the cross-entropy loss will be large.
What is the equation for Cross-entropy cost function?
C = -(y_{i} ln[h(x_{i})] + (1 - y_{i}) ln[1 - h(x_{i})])
Use the sigmoid for binary classification:
h(x_{i}) = 1 / (1 + e^(-θ*x_{i}))
For multi-class classification, use the softmax:
h_{k}(x_{i}) = e^(θ_{k}*x_{i}) / \sum_{l} e^(θ_{l}*x_{i})
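A sketch of the binary cross-entropy with the sigmoid, plus the softmax for the multi-class case (the θ·x inputs are made-up numbers):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

def binary_cross_entropy(y, h):
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(binary_cross_entropy(1, sigmoid(3.0)))    # small: confident and correct
print(binary_cross_entropy(1, 0.012))           # large: 0.012 predicted for a true label of 1
print(softmax(np.array([2.0, 1.0, 0.1])))       # class probabilities that sum to 1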
Why is regularization needed?
To keep the size of the parameters small, which reduces overfitting.
List some regularization methods
Norm-based regularization adds a term to the error function:
1) L2-regularization
2) L1-regularization
Other methods:
3) Dataset augmentation
4) Early stopping
5) Bagging / ensemble methods
6) Dropout
Explain dropout
Randomly set a fraction of the units in the network to zero during training
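A sketch of (inverted) dropout; the rescaling by 1/(1 - p) keeps the expected activation unchanged, a detail the card leaves out:

import numpy as np

def dropout(a, p=0.5):
    mask = np.random.rand(*a.shape) >= p   # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)            # rescale so the expected value is preserved

print(dropout(np.ones(10)))                # roughly half the units are zeroed at train time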
Explain Bagging / ensemble methods
Train several different models separately, then have all of the models vote on the output.
Explain early stopping
Stop the training when the validation set error goes up
Explain Dataset augmentation
Generate new training data by systematically transforming the existing data (rotate, scale, etc.)
Explain L1-norm regularization
Prefer parameter vectors with small absolute-value norms
Explain L2-norm regularization
Prefer parameter vectors with small Euclidean norms
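A sketch of adding a norm penalty to an error function J(θ); λ here is the regularization strength, not the λ used for momentum or learning-rate decay above:

import numpy as np

def regularized_loss(data_loss, theta, lam=1e-3, norm="l2"):
    if norm == "l2":
        penalty = lam * np.sum(theta ** 2)     # prefer small Euclidean norms
    else:
        penalty = lam * np.sum(np.abs(theta))  # prefer small absolute-value (L1) norms
    return data_loss + penalty

theta = np.array([0.5, -2.0, 0.1])
print(regularized_loss(1.0, theta))                # L2-regularized
print(regularized_loss(1.0, theta, norm="l1"))     # L1-regularized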
List some famous network architectures
1) LeNet 5
2) AlexNet
3) ResNet
4) Inception v4
What are some ways of making convolutions more efficient?
1) Parallel computation resources (GPU)
2) Clever convolution algorithms (Fourier transform)
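A 1-D sketch of the Fourier-transform trick in point 2: convolution in the spatial domain becomes pointwise multiplication in the frequency domain, which is cheaper for large kernels:

import numpy as np

rng = np.random.default_rng(0)
x, k = rng.standard_normal(256), rng.standard_normal(16)
n = len(x) + len(k) - 1                        # pad to avoid circular wrap-around

direct = np.convolve(x, k)                                            # O(len(x) * len(k))
via_fft = np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(k, n)))   # O(n log n)
print(np.allclose(direct, via_fft))            # True: same result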
What are some ways to avoid supervised training (as it's expensive)?
1) Random features
2) Hand-designed features
3) Unsupervised training of features