Supervised Learning
What is regression?
Regression is supervised training where the targets have continuous values; the trained model is used to predict outputs for unseen inputs. An example is estimating a house price from the number of rooms and the land area. Another example is predicting a car's petrol consumption per km from its number of cylinders, weight, and engine size.
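To make the regression card concrete, here is a minimal NumPy sketch that fits the house-price example with ordinary least squares; the data values are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy data: [rooms, land area in m^2] -> price.
# The numbers are invented for illustration only.
X = np.array([[2, 120], [3, 150], [4, 200], [5, 260]], dtype=float)
y = np.array([100_000, 140_000, 190_000, 250_000], dtype=float)

# Add a bias column and solve the least-squares problem directly.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the price of an unseen 3-room house on 180 m^2 of land.
print(A @ coef)                      # fitted values for the training points
print(np.array([3, 180, 1]) @ coef)  # prediction for the unseen input
```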
What is overfitting?
Overfitting is when training reaches a state where the model classifies the training data perfectly while classifying any unseen data poorly. It happens when a complex model with many parameters is trained on a small amount of data. It can also happen when the number of epochs becomes far more than what is necessary.
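The overfitting card can be illustrated with a toy experiment: a polynomial with as many parameters as data points reproduces the noisy training set exactly, yet it typically does worse on unseen inputs than a simpler model. A sketch, with made-up data and a fixed random seed:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)                       # small training set
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)

# A complex model (degree 7, i.e. 8 parameters) fits the 8 points exactly...
over = np.polyfit(x, y, 7)
# ...while a simpler model (degree 3) cannot fit the noise.
simple = np.polyfit(x, y, 3)

x_test = np.linspace(0, 1, 100)                # "unseen" inputs
y_test = np.sin(2 * np.pi * x_test)            # the true underlying function
print(np.mean((np.polyval(over, x_test) - y_test) ** 2))    # test error, overfit model
print(np.mean((np.polyval(simple, x_test) - y_test) ** 2))  # test error, typically smaller
```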
Why does overfitting have to be avoided while training a model?
Overfitting produces a poor model that fits the training data very well but performs badly on novel unseen data that did not exist during training.

What do we mean by generalization?
Generalization means the trained model performs well both on the training data and on data unseen while training.

The feature extraction task is problem- and domain-dependent and thus requires knowledge of the domain. True or false?
True. It requires a sound understanding of all attributes of the training data. This is necessary to know which features carry more information about the underlying structure of the data.

What is gradient descent?
Gradient descent is an optimization algorithm for finding the minimum of a function. The algorithm searches for the best model parameters x_n using the update
x_{n+1} = x_n āˆ’ Ī· Ā· df(x_n)/dx,
where f(x_n) is the cost function, mainly the squared error between the predicted model output and the actual target value, and Ī· is the learning rate, which is typically taken as a small value (a minimal sketch of this update follows these cards).

Increasing the number of features may consistently increase the accuracy of the prediction, as the model becomes more specific to the problem. True or false?
False. More features may not carry important information about the targeted class, and they increase the model's complexity, which makes it hungry for more training data. Extra features may also carry noise, which harms the trained model.

What is the difference between batch gradient descent and stochastic gradient descent (SGD)? Which one is faster in reaching the optimum model?
Batch gradient descent adjusts the parameters of the model once per entire epoch, whereas SGD adjusts the model parameters after each example, so SGD is usually faster in reaching a good model.

Can we reach the exact local minimum of the cost function with iterative training? If yes, why? If not, how do we stop the recursion?
The exact local minimum may never be reached, and the iteration may keep going over and over near the minimum without stopping. In this case, we set a small threshold value ε and stop the recursion when the cost function J(θ) < ε. In some cases, we also set an upper limit on the number of iterations to keep the recursion from looping forever.

What is a convex function? Why is it beneficial to have a convex function? Give an example of a convex function, and explain how we know whether a function is convex.
If the function is convex, we are assured there is one optimum (global) minimum and no other local minima. One example of a convex function is f(x) = x². If the second derivative of a function is positive, the function is convex; in this example f''(x) = 2, which is positive.

In linear regression, why do we use the squared loss function instead of the absolute loss, which is computationally simpler?
The squared loss function is quadratic, which means it is convex. Accordingly, reaching the global minimum of the function is guaranteed. This is not the situation with the absolute loss function.
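Tying the gradient-descent, stopping-threshold, and convexity cards together, here is a minimal sketch that minimizes the convex function f(x) = x² with the update rule above; the learning rate, starting point, and threshold are arbitrary choices for illustration.

```python
# Gradient descent on the convex function f(x) = x^2 from the cards above.
# f'(x) = 2x, so the update is x_{n+1} = x_n - eta * 2 * x_n.
eta = 0.1          # learning (training) rate, kept small
x = 5.0            # arbitrary starting point
epsilon = 1e-8     # stopping threshold on the cost J
max_iters = 1000   # upper limit so the loop cannot run forever

for i in range(max_iters):
    cost = x ** 2
    if cost < epsilon:        # stop when J < epsilon
        break
    grad = 2 * x              # df/dx at the current point
    x = x - eta * grad        # the gradient-descent update rule

print(i, x)   # x has converged close to the true minimum at 0
```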
What is the main difference between classification and regression problems? Give one example of each.
In classification the model outputs are discrete values referring to distinct classes, while in regression the model outputs are continuous values. Recognizing digits or discriminating animals from each other are classification examples. A regression example is predicting a house price from the number of rooms and land area.

What is binary classification?
Binary classification means the model is set to discriminate between exactly two classes. An example is discriminating dogs from cats. Another example is discriminating spam from non-spam emails.

What is logistic regression, and how is it different from linear regression?
Logistic regression applies the logistic (sigmoidal) function g(z) on top of the linear regression hypothesis function. Linear regression is where the output is linearly related to the input: if the input is z, then g(z) = kĀ·z, where k is a constant.

Why is the sigmoidal function commonly used in machine learning systems? Mention two advantages.
Sigmoidal functions are favored because they fit well with gradient-descent training. Gradient descent is best used when the parameters cannot be calculated analytically and must be searched for by an optimization algorithm, and the derivative is an essential operation in that algorithm. One advantage is therefore its easy derivative formula. The second advantage is its probabilistic output: every value is compressed into the range between 0 and 1 (see the sketch after these cards).

What are the advantages of the Softmax classifier over the logistic regression classifier?
The Softmax classifier is the multiclass generalization of the two-class binary logistic regression classifier. Softmax can suitably and directly be used for multiclass classification problems, while logistic regression is suitable only for binary classification problems.

In multiclass classification, why do we use Softmax when checking the highest score is more straightforward and gives the same result?
With outputs expressed as probabilities, you can quantify confidence, share scores between models, and be confident about a model's performance. Different models produce outputs on different ranges: if your model outputs 1, 2, or 5, a benchmark model will not produce the same numbers, which makes raw-score comparison much harder.

Where is the knowledge acquired from data stored in an ANN?
The acquired knowledge is stored in the weight values of the connections between the input, hidden, and output layers.

What do we mean by the generalization property in an ANN? Why is it important?
Generalization is the ability of the model to work well on both the training data and unseen data. If the model does not generalize well, it will perform poorly on the novel unseen data where the model is needed to perform at its best.

When does the favored adaptivity property of an ANN become problematic, and how can such situations be overcome?
Adaptivity does not always lead to robustness; it may do the opposite. For example, an adaptive system with short time constants may change a well-trained model, deviating its parameters to respond to spurious disturbances as if they were genuine inputs, causing severe degradation in system performance. To realize the full benefits of adaptivity, the principal time constants of the system should be long enough for the system to ignore spurious disturbances and yet short enough to respond to meaningful changes in the environment.
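A small sketch of the sigmoid and softmax functions from the cards above; the score values passed in at the end are arbitrary examples.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoidal) function: squashes any z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """The derivative has the simple closed form g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

def softmax(scores):
    """Turns arbitrary class scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5: the binary decision boundary
print(softmax(np.array([1.0, 2.0, 5.0])))  # comparable probabilities, not raw scores
```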
What is the main limitation of using a single perceptron? How do we overcome this limitation?
A single-layer perceptron can only classify linearly separable data. To classify non-linearly separable data, we need more layers of perceptrons, and possibly many perceptrons per layer, depending on the complexity of the data.

What do we mean by a shallow neural network?
Shallow neural networks are those with only one, or sometimes two, hidden layers.

What is the ReLU activation function? Mention one reason why it is commonly adopted in training multilayer neural networks.
ReLU, f(x) = max(0, x), is also known as a ramp function and is analogous to half-wave rectification in electrical engineering. It has a low computational cost, and its derivative is very straightforward to find.

A training program trains a neural network on 1000 data examples, using 50 examples per training batch. Calculate the number of training cycles in one epoch of mini-batch training. How many training cycles are there per epoch with SGD? With full-batch training?
The number of training cycles is how many times we adjust the weights (by Ī”w_ij) while going over all the data examples, i.e., in one epoch (see the code sketch below):
Mini-batch: 1000 / 50 = 20
SGD: 1000
Batch: 1

Which training scheme converges faster, batch or SGD?
SGD converges faster than the batch training scheme.

A neural network is trained on 1000 examples, each consisting of 50 features. It has two hidden layers of 100 neurons each, and there are 20 classes to classify. How many parameters need to be trained in this neural network? Which training approach is better, SGD or batch training, and why?
Counting the connection weights between consecutive layers (ignoring biases): 50Ɨ100 + 100Ɨ100 + 100Ɨ20 = 17,000 parameters to train. SGD is faster than batch training because the parameters are optimized after each example, while batch training makes only one parameter adjustment per epoch, which is very slow.

In which of the following scenarios is overfitting not expected to happen? • A multilayer neural network with a few layers and a big dataset. • A multilayer neural network with a significant number of layers and a small dataset. • Training a neural network for a very large number of epochs.
Overfitting is not expected in the first scenario: a few layers trained on a big dataset. The two other scenarios are very prone to overfitting.
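The two calculation cards above can be checked with a few lines of code; this assumes fully connected layers and counts weights only, ignoring biases.

```python
# Updates per epoch for the 1000-example training set.
n_examples = 1000
batch_size = 50

print(n_examples // batch_size)   # mini-batch: 20 updates per epoch
print(n_examples)                 # SGD: 1000 updates per epoch
print(1)                          # full batch: 1 update per epoch

# Weight count for a fully connected 50 -> 100 -> 100 -> 20 network.
layers = [50, 100, 100, 20]       # input features, two hidden layers, classes
weights = sum(a * b for a, b in zip(layers, layers[1:]))
print(weights)                    # 50*100 + 100*100 + 100*20 = 17000
```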
Describe two strategies that can be used to stop the training cycles of a neural network. What are the disadvantages of these strategies?
One strategy is to set a certain number of epochs and stop the training process when it is reached. The disadvantage is that we have no guarantee or evidence that the model is well trained at that number of epochs; the model may underfit or overfit without any indication. The second strategy is to set a specific error threshold between the predicted model outputs and the actual targets and stop training when it is reached. The disadvantage is that we have no idea of the best error threshold: if it is too small and beyond the model's capability, training will go on forever without stopping; on the other hand, the error threshold may be big enough to be reached before the model is fully trained. The best strategy is to set aside a validation set and compare the model's performance on the training set with its performance on the validation set. When the two performance curves start to split, that is the onset of overfitting, and we have to stop the training. Plotting the training and validation errors against the epochs makes this split visible.

What do we mean by vanishing gradients, and how does it constrain the maximum number of layers in a multilayer neural network?
In principle, the backpropagation algorithm can train a network with as many layers as we like. However, a fully connected MLP runs the risk of vanishing gradients: the gradient becomes nearly zero at the layers close to the input layer. At that point the weight values can no longer be adjusted to improve the loss, because the weight update depends on the derivative (gradient) of the cost function with respect to the connection weight, and that gradient is zero or nearly zero.

How do we initialize the weights in a multilayer neural network? Mention one method.
The weights can be initialized by setting small positive and negative random values for all network weights. Another method is to set them all to zero. A more advanced way is to take the weights of a well-trained neural network with a similar structure, trained on an extensively big dataset, and then train the new neural network on the dataset required by the design engineer.

What do we mean by data scaling, and why is it essential for training neural networks? Describe three ways to scale the data.
Data scaling is a typical pre-processing step applied to the input data before training the models, in which all inputs are constrained within certain boundaries. One approach normalizes all values to have zero mean and unit variance; this is called normalization or standardization. Another approach constrains the input values to a specific range, such as 0 to 1 (min-max scaling). It is also possible to scale all inputs with respect to the maximum input values. Scaling the input attributes (features) is essential to prevent large values from dominating small ones. Variables measured on different scales do not contribute equally to the model fitting, and the trained model can be biased if the inputs are not correctly scaled. Suppose some attributes vary between 1,000 and 10,000 while other attributes vary between 1 and 10. In that case, the classification results will be skewed toward the attributes with higher absolute values, and almost no effect will be noticed from the small values, although they may carry more critical information about the class they belong to.
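The three scaling methods from the last card, sketched in NumPy on a made-up feature matrix whose columns live on very different scales.

```python
import numpy as np

# Hypothetical feature matrix: column 0 varies in the thousands,
# column 1 between 1 and 10, as in the card above.
X = np.array([[1000.0, 2.0], [4000.0, 7.0], [10_000.0, 9.0]])

# Standardization: zero mean, unit variance per feature.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: constrain every feature to the range [0, 1].
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Scaling with respect to the maximum absolute value per feature.
max_abs = X / np.abs(X).max(axis=0)

print(standardized, min_max, max_abs, sep="\n")
```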