Machine Learning - From CourseraTeam
Terms in this set (232)
Well-posed Learning Problem
A computer program is said to learn from (E)xperience E with respect to some (T)ask T and some (P)erformance measure P, if its performance on T, as measured by P, improves with experience E.
Machine Learning
Field of study that gives computers the ability to learn without being explicitly programmed.
Machine Learning broad definition
Field of study that gives computers the ability to learn without being explicitly programmed.
Abstract Essence of ML
Representation + Evaluation + Optimisation
Machine Learning
Learning from experience. It's also called supervised learning, where experience E is the supervision.
Pattern Recognition
Finding patterns without experience. It's also called unsupervised learning.
Unsupervised Learning
- We only have xi values, but no explicit target labels.
- You want to do 'something' with them.
Unsupervised Learning Tasks
- Outlier detection: Is this a 'normal' xi ?
- Data visualization: What does the high-dimensional X look like?
- Association rules: Which xij occur together?
- Latent-factors: What 'parts' are the xi made from?
- Ranking: Which are the most important xi ?
- Clustering: What types of xi are there?
Classification
ML task where T has a discrete set of outcomes. Often classification is binary.
Examples:
• face detection
• smile detection
• spam classification
• hot/cold
Regression
ML task where T has a real-valued outcome on some continuous sub-space
Examples:
• Age estimation
• Stock value prediction
• Temperature prediction
• Energy consumption prediction
Labels
Values that h aims to predict
Example:
• Facial expressions of pain
• Impact of diet on astronauts in space
• Predictions of house prices
Training algorithm
Given a model h with Solution Space S and a training set {X,Y}, a learning algorithm finds the solution that minimizes the cost function J(S)
Features/Attributes
Measurable values of variables that correlate with the label y
Examples:
• Sender domain in spam detection
• Mouth corner location in smile detection
• Temperature in forest fire prediction
• Pixel value in face detection
Cost Function
Squared error cost function. J(S)
Local minima
A point where the function value is smaller than at all nearby points. It is not necessarily the global minimum, and there may be more than one.
General classification problem
If classes are disjoint, i.e. each pattern belongs to one and only one class then input space is divided into decision regions separated by decision boundaries or surfaces
Decision surfaces are
• Linear functions of x
• Defined by (D-1) dimensional hyperplanes in the D dimensional input space.
Linear Separability
Linearly separable data:
• Datasets whose classes can be separated by linear decision surfaces
• Implies no class-overlap
• Classes can be divided by e.g. lines for 2D data or planes in 3D data
Orthogonality
- Two vectors a and b are orthogonal if they're perpendicular
- If their inner product is 0:
a · b = 0
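The inner-product test above can be sketched in a few lines. The function names and the tolerance `tol` are illustrative assumptions, added so that floating-point rounding does not break the check.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def is_orthogonal(a, b, tol=1e-12):
    """True when the inner product is (numerically) zero."""
    return abs(dot(a, b)) <= tol


print(is_orthogonal([1.0, 0.0], [0.0, 3.0]))  # vectors along perpendicular axes
print(is_orthogonal([1.0, 2.0], [2.0, 1.0]))  # inner product is 4, so not orthogonal
```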
LDA
- Linear Discriminant Analysis
- Most commonly used as dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications.
- The goal is to project a dataset onto a lower-dimensional space with good class-separability in order to avoid overfitting
Training LDA objective:
Find (i.e. learn) the parameters that minimize some error function on the training set.
Significant approaches:
• Least squares
• Fisher
• Perceptron
Artificial Neural Nets
• The feed-forward neural network/Multilayer Perceptron is one of many ANNs
• We focus on the Multilayer Perceptron
• Really multiple layers of logistic regression models
The simplest ANNs consist of
• A layer of D input nodes
• A layer of hidden nodes
• A layer of output nodes
• Fully connected between layers
Curse of Dimensionality
The curse of dimensionality refers to how certain learning algorithms may perform poorly in high-dimensional data.
First, it's very easy to overfit the training data, since we can have a lot of assumptions that describe the target label (in the case of supervised learning). In other words, we can easily express the target using the dimensions that we have.
Second, we may need to increase the amount of training data exponentially to overcome the curse of dimensionality, and that may not be feasible.
Third, in learning algorithms that depend on distance, like k-means for clustering or k-nearest neighbours, all points can become far from each other and the distances between data points become difficult to interpret.
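The third point can be illustrated with a small experiment. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, and `relative_contrast` is a hypothetical helper that compares the farthest and nearest neighbour of a random query point. As the dimension grows, the contrast shrinks, so "near" and "far" become hard to tell apart.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p))))
    return (max(dists) - min(dists)) / min(dists)


for d in (2, 20, 500):
    print(d, relative_contrast(d))
```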
EXTREME Dimensionality Case
In an extreme, degenerate case, if D > n, each example can be uniquely described by a set of feature values.
Hidden layer(s) can
- Have arbitrary number of nodes/units
- Have arbitrary number of links from input nodes and to output nodes (or to next hidden layer)
- There can be multiple hidden layers
Hidden Unit Activation
Common functions for h(a) are unit step, sigmoid (logistic) and tanh
RELU
Rectified Linear Unit
New trend, responsible for great deal of Deep Learning success. Advantages:
• Reduces the 'vanishing gradient' problem (the gradient is 1 for every positive input)
• Can model any positive real value
• Can stimulate sparseness
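The hidden-unit activation functions named in these cards can be written out directly. This is a minimal sketch using only the standard library; the convention that the unit step returns 1 at exactly zero is an assumption.

```python
import math


def unit_step(a):
    """1 for non-negative input, 0 otherwise (value at 0 is a convention)."""
    return 1.0 if a >= 0 else 0.0


def sigmoid(a):
    """Logistic sigmoid, squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def tanh(a):
    """Hyperbolic tangent, squashes any real input into (-1, 1)."""
    return math.tanh(a)


def relu(a):
    """Rectified Linear Unit: passes positive inputs, zeroes the rest."""
    return max(0.0, a)


for a in (-2.0, 0.0, 2.0):
    print(a, unit_step(a), sigmoid(a), tanh(a), relu(a))
```

ReLU outputs exactly zero for all negative inputs, which is where the sparseness mentioned above comes from.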
Output layer can be
• Single node for binary classification
• Single node for regression
• n nodes for multi-class classification
Network Topology
Variations include:
• Arbitrary number of layers
• Fewer hidden units than input units (causes in effect dimensionality reduction, equivalent to PCA)
• Skip-layer connections
• Fully/sparsely interconnected networks
Training a network
Training a NN involves finding the parameters that minimize some error function
Choice of activation function depends on the output variables:
- Unity for regression
- Logistic sigmoid for (multiple independent) binary classification
- Softmax for exclusive (1-of-K) multiclass classification
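As an illustration of the last choice, a softmax turns K output activations into K probabilities that sum to one. The sketch below subtracts the maximum activation before exponentiating, a common numerical-stability trick that does not change the result.

```python
import math


def softmax(activations):
    """Map K real activations to K probabilities that sum to 1."""
    m = max(activations)  # shift for numerical stability
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```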
NN Error Functions
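- Regression: sum-of-squares error
- Binary classification: cross-entropy error
- Multiple independent binary classification: sum of the cross-entropy errors of the individual outputs
- Multi-class classification (mutually exclusive): multi-class cross-entropy error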
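The formulas for this card are missing from the set. As a hedged sketch, the standard textbook error functions for these four cases are the sum-of-squares error (regression) and cross-entropy errors (classification). The function names are illustrative; `y` holds network outputs and `t` holds targets.

```python
import math


def sum_of_squares(y, t):
    """Regression: half the summed squared difference between output and target."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))


def binary_cross_entropy(y, t):
    """Binary classification: y are predicted probabilities, t are 0/1 targets.
    For multiple independent binary outputs, sum this over the outputs."""
    return -sum(ti * math.log(yi) + (1 - ti) * math.log(1 - yi)
                for yi, ti in zip(y, t))


def multiclass_cross_entropy(Y, T):
    """Mutually exclusive classes: rows of Y are softmax outputs,
    rows of T are 1-of-K target vectors."""
    return -sum(tk * math.log(yk)
                for y, t in zip(Y, T)
                for yk, tk in zip(y, t) if tk > 0)


print(sum_of_squares([1.0, 2.0], [1.0, 4.0]))
print(binary_cross_entropy([0.5], [1]))
print(multiclass_cross_entropy([[0.25, 0.75]], [[0, 1]]))
```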
Gradient points in the direction of
Steepest Ascent
How to train ANN
An error function on the training set must be minimized. This is done by adjusting:
- Weights connecting nodes.
- Parameters of non-linear functions h(a).
Eigenvalue
Given an invertible matrix M, an eigenvalue equation can be found in terms of a set of orthogonal vectors u_i and scalars λ_i such that:
M u_i = λ_i u_i
Derivative
A measure of how fast a function's value changes as its argument changes. For example, for f(x) = x^2 the derivative f'(x) = 2x tells you how fast f(x+t) changes for small enough t. This gives you knowledge about the basic dynamics of the function.
Gradient
For a multidimensional function, the gradient gives the direction of the largest increase in value (it is built from the partial derivatives). For example, for g(x,y) = -x + y^2 the gradient is (-1, 2y), so g increases fastest when x is decreased and y is moved away from zero. This is the basis of gradient-based methods, such as the steepest descent technique.
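The example function g(x,y) = -x + y^2 can be checked numerically. This is a minimal sketch using central finite differences; the step size `h` and the helper name are assumptions. The analytic gradient is (-1, 2y).

```python
def g(x, y):
    return -x + y ** 2


def numerical_gradient(f, x, y, h=1e-6):
    """Approximate (df/dx, df/dy) with central finite differences."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy


# At (1, 3) the analytic gradient is (-1, 2 * 3) = (-1, 6).
print(numerical_gradient(g, 1.0, 3.0))
```

Steepest descent moves against this vector, steepest ascent moves along it.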
Batch Gradient descent
- Vanilla gradient descent, aka batch gradient descent
- Make small change in weights that most rapidly improves task performance
Gradient descent computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset
- Can be very slow
- Intractable for datasets that don't fit in memory
- Doesn't allow us to update our model online, i.e. with new examples on-the-fly.
- With a suitably chosen learning rate, guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
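A minimal sketch of batch gradient descent, assuming a one-dimensional linear model y = w*x + b and a mean-squared-error cost (neither is specified in the cards). The learning rate `lr` and the number of `epochs` are illustrative choices. Note that every update sums over the whole training set.

```python
def batch_gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w*x + b by minimising the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the cost over the ENTIRE training set.
        grad_w = 2 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = 2 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
        # One small step in the direction that most rapidly reduces the cost.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # noise-free data from y = 2x + 1
print(batch_gradient_descent(xs, ys))
```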
On-line Gradient Descent
On-line (or stochastic) gradient descent, also known as incremental gradient descent, updates the parameters one data point at a time.
- Handles redundancy better. (Batch GD has redundancy)
- Usually much faster than Batch GD.
- SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily
- Can deal with new data better.
- Good chance of escaping local minima. However, when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent
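For contrast with the batch version, here is a hedged sketch of on-line (stochastic) gradient descent on a one-dimensional linear model y = w*x + b with squared error. The model, the learning rate, the per-epoch shuffling, and the fixed seed are assumptions made for illustration. The parameters change after every single example.

```python
import random


def sgd(xs, ys, lr=0.01, epochs=500, seed=0):
    """Fit y = w*x + b, updating the parameters one data point at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order
        for x, y in data:
            err = w * x + b - y
            # Gradient of the squared error of THIS example only.
            w -= lr * 2 * err * x
            b -= lr * 2 * err
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # noise-free data from y = 2x + 1
print(sgd(xs, ys))
```

Because each update needs only one example, new data can be fed in as it arrives.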
Backpropagation
- Used to calculate derivatives of error function efficiently
- Errors propagate backwards layer by layer
Backprop is for:
Arbitrary feed-forward topology
Differentiable nonlinear activation functions
Broad class of error function
Error Backpropagation
1. Apply input vector to network and propagate forward
2. Evaluate d(k) for all output units
3. Backpropagate d's to obtain d(j) for all hidden units
4. Evaluate error derivatives as: dE/dw(ji) = d(j) z(i), where z(i) is the output of the unit that feeds weight w(ji)
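The four steps can be sketched for a tiny network. The architecture is an assumption chosen for brevity: one tanh hidden layer, linear outputs, a sum-of-squares error, and no bias terms. `W1[j][i]` connects input i to hidden unit j, and `W2[k][j]` connects hidden unit j to output k.

```python
import math


def forward(x, W1, W2):
    """Step 1: apply the input vector and propagate forward."""
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]
    return z, y


def error(x, t, W1, W2):
    """Sum-of-squares error for a single example."""
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))


def backprop(x, t, W1, W2):
    z, y = forward(x, W1, W2)
    # Step 2: deltas of the output units (linear outputs, squared error).
    d_out = [yk - tk for yk, tk in zip(y, t)]
    # Step 3: backpropagate to the hidden units; tanh'(a) = 1 - tanh(a)^2.
    d_hid = [(1 - z[j] ** 2) * sum(W2[k][j] * d_out[k] for k in range(len(W2)))
             for j in range(len(W1))]
    # Step 4: derivative = delta of the receiving unit * output of the sending unit.
    grad_W1 = [[dj * xi for xi in x] for dj in d_hid]
    grad_W2 = [[dk * zj for zj in z] for dk in d_out]
    return grad_W1, grad_W2


x, t = [0.5, -1.0], [0.3]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.7, -0.5]]
print(backprop(x, t, W1, W2))
```

A useful sanity check is to compare any one of these derivatives with a finite-difference estimate of `error`.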
Regularization
Regularization is a technique used in an attempt to solve the overfitting problem in statistical models.
Regularization
- Maximum likelihood generalization error (i.e. cross-validation)
- Regularized error (penalize large weights)
- Early stopping
Deep Learning
- Basically a Neural Network with Many hidden layers
- Can be used for unsupervised learning and dimensionality reduction
What is Deep Learning?
Definition:
• Hierarchical organization with more than one (non-linear) hidden layer in-between the input and the output variables
• Output of one layer is the input of the next layer
Deep learning methods
(Deep) Neural Networks
• Convolutional Neural Networks
• Restricted Boltzmann Machines/Deep Belief Networks
• Recurrent Neural Networks
Convolutional Neural Networks
Type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex, whose individual neurons are arranged in such a way that they respond to overlapping regions tiling the visual field.
How is deep neural network optimized?
Optimized through gradient descent! (Forward-Backward algorithm)
- Penalize complex solutions to avoid overfitting
Cost function - ℓ2 norm
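In order to avoid over-fitting, one common approach is to add a penalty term to the cost function. A common choice is the ℓ2-norm penalty, given as:
C = C0 + λ Σ w^2
where C0 is the unregularized cost, the sum runs over all weights w, and λ > 0 sets the strength of the penalty.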
Cost function - ℓ1 norm
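The ℓ1 penalty replaces the squared weights of the ℓ2 penalty with their absolute values. A minimal sketch of both follows. The penalty strength `lam` and the exact scaling are assumptions, since the original formulas are missing from the cards and some texts divide the penalty by 2 or by the number of examples.

```python
def l2_regularized_cost(c0, weights, lam):
    """Unregularized cost c0 plus lam times the sum of squared weights."""
    return c0 + lam * sum(w * w for w in weights)


def l1_regularized_cost(c0, weights, lam):
    """Unregularized cost c0 plus lam times the sum of absolute weights."""
    return c0 + lam * sum(abs(w) for w in weights)


weights = [1.0, -2.0]
print(l2_regularized_cost(1.0, weights, 0.5))  # 1 + 0.5 * (1 + 4) = 3.5
print(l1_regularized_cost(1.0, weights, 0.5))  # 1 + 0.5 * (1 + 2) = 2.5
```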
Dropout
A very different approach to avoiding over-fitting is dropout.
Here, the outputs of a randomly chosen subset of the neurons are temporarily set to zero during the training of a given mini-batch. This means the neurons cannot overly adapt to the output from prior layers, as these are not always present. It has enjoyed widespread adoption, and there is substantial empirical evidence of its usefulness.
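A hedged sketch of dropout applied to one layer's outputs. It uses the "inverted dropout" variant, which rescales the surviving outputs by 1 / (1 - drop_prob) so their expected value is unchanged. That rescaling is an assumption not stated in the card, and `drop_prob` must be below 1.

```python
import random


def dropout(outputs, drop_prob, rng):
    """Zero each output independently with probability drop_prob.

    Use only during training; at test time the outputs are left untouched.
    """
    keep = 1.0 - drop_prob
    return [o / keep if rng.random() < keep else 0.0 for o in outputs]


rng = random.Random(0)
print(dropout([0.2, 0.9, 0.5, 0.7], drop_prob=0.5, rng=rng))
```

A new random mask is drawn on every call, so each mini-batch sees a different thinned network.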
Cost function - Euclidean distance
Distance measure between a pair of samples p and q in an n-dimensional feature space
Cost function - Manhattan or City block distance
Calculates the distance between real vectors as the sum of their absolute differences.
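Both distance measures can be written in a couple of lines. This is a minimal sketch for samples p and q given as equal-length lists of numbers.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared coordinate differences (L2 norm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute coordinate differences (L1 norm, city block)."""
    return sum(abs(a - b) for a, b in zip(p, q))


print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```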
Basic Decision Tree
Decision trees apply a series of linear decisions, that often depend on only a single variable at a time. Such trees partition the input space into cuboid regions, gradually refining the level of detail of a decision until a leaf node has been reached, which provides the final predicted label.
Tree components
Root node, branch, node, leaf node.
Branching Factor
Branching factor of node at level L is equal to the number of branches it has to nodes at level L + 1
CART
Classification And Regression Trees
Six general questions to decide on decision tree algorithm:
1. How many splits per node (properties binary or multi valued)?
2. Which property to test at each node?
3. When to declare a node to be leaf?
4. How to prune a tree that has become too large (and when is a tree too large)?
5. If a leaf node is impure, how to assign a category label?
6. How to deal with missing data?
Tree varieties
Trees are called monothetic if one property/variable is considered at each node, polythetic otherwise
What trees are preferable?
We prefer simple, compact trees, following Occam's Razor
How to make trees compact?
To do so, we will seek to minimise impurity of data reaching descendant nodes
Occam's Razor
All things being equal - the simplest explanation is the best
The Principle of Plurality
Plurality should not be posited without necessity.
The Principle of Parsimony
It is pointless to do with more what is done with less.
Misclassification Impurity
Minimum probability that training example will be misclassified at node N
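One common way to compute this, taken from standard textbook treatments rather than from the card itself, is one minus the fraction of the most frequent class among the examples that reach the node. The sketch below assumes the node's data is given as a list of class labels.

```python
from collections import Counter


def misclassification_impurity(labels):
    """1 - (fraction of the majority class) among the labels at a node."""
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)


print(misclassification_impurity(["a", "a", "a", "a"]))  # pure node: 0.0
print(misclassification_impurity(["a", "a", "a", "b"]))  # 0.25
print(misclassification_impurity(["a", "a", "b", "b"]))  # 0.5
```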
Bayes' Error
- The Bayes Error rate is the theoretical lowest possible error rate that any classifier can achieve on a given problem (data distribution).
- For real data, it is not possible to calculate the Bayes Error rate, although upper bounds can be given when certain assumptions on the data are made.
- The Bayes Error functions mostly as a theoretical device in Machine Learning and Pattern Recognition research.
Generalisation
Generalization is the desired property of a classifier to be able to predict the labels of unseen examples correctly. A hypothesis generalizes well if it can predict an example coming from the same distribution as the training examples well.
Overfitting
A hypothesis is said to be overfit if its prediction performance on the training data is overoptimistic compared to that on unseen data. It presents itself in complicated decision boundaries that depend strongly on individual training examples.
Stopping Criteria
Reaching a node with a pure sample is always possible but usually not desirable as it usually causes over-fitting.
Three common ways to decide when to stop splitting decision tree
- Validation set
- Cross-validation
- Hypothesis testing (chi-squared statistic)
Evaluation Procedures
• For large datasets, a single split is usually sufficient.
• For smaller datasets, rely on cross validation
Validation set Criterion
Split training data into a training set and a validation set (e.g. 66% training data and 34% validation data). Keep splitting nodes, using only the training data to learn decisions, until the error on the validation set stops going down.
Cross-validation
In k-fold cross-validation, a dataset is split into k roughly equally sized partitions, such that each example is assigned to one and only one fold. At each iteration a hypothesis is learned using k-1 folds as the training set and predictions are made on the k'th fold. This is repeated until a prediction is made for all k folds, and an error rate for the entire dataset is obtained.
Cross-validation maximises the amount of data available to train and test on, at cost of increased time to perform the evaluation.
• Test data segments of different folds should never overlap
• Training and test data in the same fold should never overlap
Error estimation can either be done per fold separately, or delayed by collating all predictions per fold.
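The fold construction can be sketched as an index generator. The function name, the shuffling, and the fixed seed are illustrative assumptions. Each example lands in exactly one test fold, and the training indices of a fold never include its test indices.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal partitions
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


for train, test in k_fold_indices(10, 3):
    print(sorted(test), len(train))
```

Collating the predictions made on every test fold gives one error rate for the entire dataset.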
Cross-validation criterion
Split training data in a number of folds. For each fold, train on all other folds and make predictions on the held-out test fold. Combine all predictions and calculate error. If error has gone down, continue splitting nodes, otherwise, stop
Pruning
First fully train a tree, without stopping criterion
After training, prune tree by eliminating pairs of leaf nodes for which the impurity penalty is small
Multivariate Trees
Instead of monothetic decisions at each node, we can learn polythetic decisions. This can be done using many linear classifiers, but keep it simple!
Missing Attributes
It is common to have examples in your dataset with missing attributes/variables.
One way of training a tree in the presence of missing attributes is removing all data points with any missing attributes.
A better method is to only remove data points that miss a required attribute when considering the test for a given node for a given attribute.
This is a great benefit of trees (and, in general, of combined models).
ID3
Iterative Dichotomiser 3
Used for nominal, unordered, input data only.
Every split has a branching factor equal to the number of values the tested variable can take (e.g. bins of a discretized variable). The tree has as many levels as input variables.
C4.5
- Successor of ID3.
- Multiway splits are used.
- Pruning of splits based on statistical significance.
Regression Trees
- Trained in a very similar way
- Leaf nodes are now continuous values - the value at a leaf node is that assigned to a test example if it reaches it
- Leaf node label assignment is e.g. mean value of its data sample
Problem: nodes make hard decisions, which is particularly undesired in a regression problem, where a smooth function is sought.
Model Combination View
• Decision Trees combine a set of models (the nodes)
• In any given point in space, only one model (node) is responsible for making predictions
• Process of selecting which model to apply can be described as a sequential decision making process corresponding to the traversal of a binary tree
Random forests
1. Very good performance (speed, accuracy) when abundant data is available.
2. Use bootstrapping/bagging to initialize each tree with different data.
3. Use only a subset of variables at each node.
4. Use a random optimization criterion at each node.
5. Project features on a random different manifold at each node.
Measures of classification accuracy
Classification Error Rate
Cross Validation
Recall, Precision, Confusion Matrix
Receiver Operating Characteristic (ROC) curves, two-alternative forced choice
Estimating hypothesis accuracy
Sample Error vs. True Error
Confidence Intervals
Sampling Theory Basics
Binomial and Normal Distributions
Mean and Variance
Comparing Hypotheses
t-test
Analysis of Variance (ANOVA) test
Simple data splits
- Fixed train, development and test sets
- Bootstrapping
- Cross-validation
Fixed train, development and test sets
- Randomly split data into training, development, and test sets.
- Does not make use of all data to train or test
- Good for large datasets
Bootstrapping
Estimating the sampling distribution of an estimator by resampling with replacement from the original sample.
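A minimal sketch of the idea, using the sample mean as the estimator. The estimator, the number of resamples, and the fixed seed are assumptions made for illustration. `random.choices` draws with replacement.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Approximate the sampling distribution of the mean by resampling."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n))
            for _ in range(n_resamples)]


sample = [2, 4, 4, 5, 7, 9]
means = bootstrap_means(sample)
print(statistics.mean(means), statistics.stdev(means))
```

The spread of the resampled means estimates how much the sample mean would vary from one sample to the next.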
Cross-validation
- Randomly split data into n folds and iteratively use one as test set
- All data used to test, and almost all to train
- Good for small sets
Evaluation procedure for single split or cross validation
- For large datasets, a single split is usually sufficient.
- For smaller datasets, rely on cross validation
Overfitting can occur when
- Learning is performed for too long (e.g. in Neural Networks)
- The examples in the training set are not representative of all possible situations (is usually the case!)
- Model parameters are adjusted to uninformative features in the training set that have no causal relation to the true underlying target function.
Confusion Matrix
Easy to see if the system is commonly mislabelling one class as another
Classification measures - Error Rate
Common performance measure for classification problems
1. Success: Instance's class is predicted correctly (True Positives (TP) / Negatives (TN)).
2. Error: Instance's class is predicted incorrectly (False Positives (FP) / Negatives (FN)).
3. False positives - Type I error. False Negative - Type II error.
4. Classification error rate: Proportion of instances misclassified over the whole set of instances.
F-measure
Comparing different approaches is difficult when using multiple evaluation measures (e.g. Recall and Precision)
F-measure combines recall and precision into a single measure
accuracy
(TP + TN)/ (TP + TN + FP + FN)
accuracy may not be useful measure in cases where
1- There is a large class skew
2- There are differential misclassification costs - say, getting a positive wrong costs more than getting a negative wrong.
3- We are interested in a subset of high confidence predictions
TPR - True Positive Rate - Recall
TP / actual positives = TP / (TP + FN)
False Positive Rate
FP / actual negatives = FP / (TN + FP)
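The measures in these cards can all be computed from the four confusion-matrix counts. In the sketch below, precision (TP / (TP + FP)) and the balanced F1 form of the F-measure are standard definitions that the cards do not spell out, so treat them as assumptions about which variant is meant.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive common classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # true positive rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "error_rate": (fp + fn) / (tp + tn + fp + fn),
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```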
Specificity
...
Sensitivity
...
ROC Curve
Receiver Operating Characteristic (ROC) curves plot
TP vs FP rates
Hypothesis Quality
- We want to know how well a machine learner, which learned the hypothesis h as the approximation of the target function f, performs in terms of correctly classifying novel, unseen examples
- We want to assess the confidence that we can have in this classification measure
The true error of hypothesis h
Probability that it will misclassify a randomly drawn example from distribution D
However, we cannot measure the true error. We can only estimate it by observing the sample error eS
SAMPLE ERROR
In statistics, sampling error is incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population.
A Bernoulli trial
- It is a trial with a binary outcome, for which the probability that the outcome is 1 equals p (think of a coin toss of an old warped coin with the probability of throwing heads being p).
- A Bernoulli experiment is a number of Bernoulli trials performed after each other. These trials are i.i.d. by definition.
Binomial distribution
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
Normal distribution
- The Normal distribution has many useful properties. It is fully described by its mean and variance and is easy to use in calculations.
- The good thing: given enough experiments, a Binomial distribution converges to a Normal distribution.
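This convergence is what makes confidence intervals for the sample error practical. The number of misclassified test examples is binomially distributed, so for a large enough test set the sample error is approximately normal. The sketch below uses the standard normal-approximation interval. The function name and the default z = 1.96 (roughly 95% confidence) are illustrative assumptions.

```python
import math


def error_confidence_interval(n_errors, n, z=1.96):
    """Approximate confidence interval for the true error.

    n_errors misclassifications were observed on n independent test examples.
    """
    e = n_errors / n  # sample error
    half_width = z * math.sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width


print(error_confidence_interval(12, 40))
```

The approximation is poor for small n or for sample errors very close to 0 or 1.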
t-test
Assesses whether the means of two distributions are statistically different from each other
Significance level
Significance level α%: α times out of 100 you would find a statistically significant difference between the distributions even if there was none. It essentially defines our tolerance level.
If the calculated t value is above the threshold chosen for statistical significance then the null hypothesis that the two groups do not differ is rejected in favor of the alternative hypothesis: the groups do differ.
Tests for comparing distributions
- t-test compares two distributions
- ANOVA compares multiple distributions
- If the null hypothesis is refuted, there are at least two distributions with significantly different means. It does NOT tell you which they are!
Latent variables
- Latent variables are variables that are 'hidden' in the data. They are not directly observed, but must be inferred.
- Clustering is one way of finding discrete latent variables in data.
Discrete latent variables
Hidden variables that can take only a limited number of discrete values (e.g. gender or basic emotion).
Clustering Applications
- Market segmentation
- Social Network Analysis
- Vector quantization
- Facial Point detection
K-Means Clustering
Informally, goal is to find groups of points that are close to each other but far from points in other groups
• Each cluster is defined entirely and only by its centre, or mean value µk
K-Means Algorithm
1. Assign each xi to its closest mean.
2. Update the means based on assignment
3. Repeat until convergence
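The three steps can be sketched in plain Python. Points are tuples of numbers. Initialising the means from randomly chosen instances, the iteration cap, and leaving the mean of an empty cluster unchanged are assumptions made to keep the example short.

```python
import random


def k_means(points, k, n_iter=100, seed=0):
    """Return (means, clusters) after alternating assignment and update."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # start from actual instances
    for _ in range(n_iter):
        # 1. Assign each point to its closest mean (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[dists.index(min(dists))].append(p)
        # 2. Update each mean to the centroid of its assigned points.
        new_means = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[i]
            for i, cl in enumerate(clusters)
        ]
        # 3. Repeat until the means stop changing.
        if new_means == means:
            break
        means = new_means
    return means, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = k_means(points, k=2)
print(sorted(means))
```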
K-Means Issues
Convergence is guaranteed but not necessarily optimal - local minima likely to occur
• Depends largely on initial values of µk.
• Hard to define optimal number K.
• K-means algorithm is expensive: requires K × N Euclidean distance computations per iteration (K clusters, N instances).
• Each instance is discretely assigned to one cluster.
• Euclidean distance is sensitive to outliers.
ROBBINS-MONRO
• Addresses the slow update speed of the M-step in K-means
• Uses linear regression (see lecture 1)
K-medoids clustering
Addresses issue with quadratic error function (L2-norm, Euclidean norm)
Replace L2 norm with any other dissimilarity measure (V...)
Random Initialization
- Randomly Initialize K-Means clusters using actual instances as cluster centers
- Run K-Means and store centers and final Cost function
- Pick clusters of iteration with lowest Cost function as optimal solution
- Most useful if K < 10
Choosing K
Elbow method
• Visual inspection
• 'Downstream' Analysis
Probability Theory Recap
p(x) = marginal distribution
p(x,y) = joint distribution
p(x|y) = conditional distribution
Mixture of Gaussians
Simple formulation: density model with richer representation than single Gaussian
i.i.d.
Independent and identically distributed random variables
EM Algorithm issues
• Takes a long time
• Often initialised using k-Means
Data Mining
- Quest to extract knowledge and/ or unknown interesting patterns from apparently unstructured data.
aka Knowledge Discovery from Data (KDD)
• Data mining bit of a misnomer - information/ knowledge is mined, not data.
Knowledge Discovery Process
1. Data cleaning - remove noise and inconsistencies
2. Data integration - combine data sources
3. Data selection - retrieve relevant data from db
4. Data transformation - aggregation etc. (cf. feature extraction)
5. Data mining - machine learning
6. Pattern Evaluation - identify truly interesting patterns
7. Knowledge representation - visualize and transfer new knowledge
ARFF
Attribute-Relation File Format
HDF5
• Much more complex file format designed for scientific data handling
• It can store heterogeneous and hierarchical organized data.
• It has been designed for efficiency.
DM Types of data
• Relational databases
• Data warehouses
• Transactional databases
• Object-relational databases
• Temporal/sequence/time-series databases
• Spatial and Spatio-temporal databases
• Text & Multimedia databases
• Heterogeneous & Legacy databases
• Data streams
DM Functionalities
Concept/Class description
• Characterization
• Discrimination
• Frequent patterns/ Associations/ Correlations
• Classification and Regression (Prediction)
• Cluster analysis
• Outlier analysis
• Evolution analysis
The goal of data mining is to
Find interesting patterns! An interesting pattern is:
1. Easily understood.
2. Valid on new data with some degree of certainty.
3. Potentially useful.
4. Novel.
DM Objective Interest
Support: P(X ∪ Y)
Percentage of transactions that a rule satisfies
Confidence: P(Y | X)
Degree of certainty of a detected association, i.e. the probability that a transaction containing X also contains Y
DM Subjective Interest
Subjective measures require a human with domain knowledge to provide measures:
• Unexpected results contradicting apriori beliefs
• Actionable
• Expected results confirming hypothesis
DM systems can be divided into types based on a number of variables
• Kinds of databases
• Kinds of knowledge
• Kinds of techniques
• Target applications
DM task primitives
DM task primitives form the basis for DM queries.
DM primitives specify:
• Set of task-relevant data to be mined
• Kind of knowledge to be mined
• Background knowledge to be used
• Interestingness measures and thresholds for pattern evaluation
• Representation for visualizing discovered patterns.
DM query languages
• DM query language incorporates primitives
• Allows flexible interaction with DM systems
• Provides foundation for building user-friendly GUIs
• Example: DMQL
DM integration with DBS/Data Warehouses
• No coupling - DMS will not utilize any DB/DW system functionality
• Loose coupling - Uses some DB/DW functionality, in particular data fetching/storing
• Semi-tight coupling - In addition to loose coupling use sorting, indexing, aggregation, histogram analysis, multiway join, and statistics primitives available in DB/DW systems
• Tight coupling
Dirty data
incomplete
noisy
inconsistent
Causes of incomplete data
• "Not applicable" data value when collected
• Different considerations between the time when the data was collected and when it is analyzed.
• Human/ hardware/ software problems
Causes of noisy data (incorrect values)
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
Causes of inconsistent data
• Different data sources
• Functional dependency violation (e.g., modify linked data)
Importance of cleaning data
If you have good data, the rest will follow
Data quality measures
Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
• Broad categories:
• Intrinsic, contextual, representational, and accessibility
Major DM prep tasks
- Data cleaning : Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration : Integration of multiple databases, data cubes, or files
- Data transformation : Normalisation and aggregation
- Data reduction : Obtains reduced representation in volume but produces the same or similar analytical results
- Data discretization : Part of data reduction but with particular importance, especially for numerical data
Noisy data
Noise is a random error or variance in a measured variable
Techniques for canceling out noise
1. Binning - First sort data, then distribute over local bins
2. Regression - Fit a parametric function to the data (e.g. linear or quadratic function)
3. Clustering
Noisy data - Binning
Cancelling noise by binning:
- Sort data
- Create local groups of data
- Replace original values by:
______ The bin mean
______ The closest min/max value of the bin
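Both replacement strategies can be sketched as follows, assuming equal-frequency bins of a fixed size (the card does not say how the local groups are formed). The function names and the sample values are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort, split into bins of bin_size, replace each value by its bin mean."""
    data = sorted(values)
    out = []
    for i in range(0, len(data), bin_size):
        b = data[i:i + bin_size]
        out.extend([sum(b) / len(b)] * len(b))
    return out


def smooth_by_bin_boundaries(values, bin_size):
    """Replace each value by the closer of its bin's minimum and maximum."""
    data = sorted(values)
    out = []
    for i in range(0, len(data), bin_size):
        b = data[i:i + bin_size]
        lo, hi = b[0], b[-1]
        out.extend(lo if v - lo <= hi - v else hi for v in b)
    return out


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))       # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
print(smooth_by_bin_boundaries(prices, 3))  # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```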
Noisy data - Regression
Canceling noise by regression:
1. Fit a parametric function to the data using minimization of e.g. least squares error
2. Replace original values by the parametric function value
Noisy data - Clustering
Canceling noise by clustering
- Cluster data into N groups
- Replace original values by means of clusters
OR:
- Use to detect outliers
Data Integration
• Entity identification problem
• Redundancy detection
• Correlation analysis
• Detection and resolution of data value conflicts
• e.g. weight units, in/exclusion of taxes
Data Transformation
Data transformation alters original data to make it more suitable for data mining.
• Smoothing (noise cancellation)
• Aggregation
• Generalisation
• Normalisation
• Attribute/feature construction
Itemsets
simply a set of items (cf set theory)
Mining frequent patterns
• One approach to data mining is to find sets of items that appear together frequently: frequent itemsets
• To be frequent some minimum threshold of occurrence must be exceeded
• Other frequent patterns of interest:
____ frequent sequential patterns
____ frequent structured patterns
Association Rules
Reflect items that are frequently found (purchased) together, i.e. they are frequent itemsets
• Information that customers who buy beer also buy crisps is e.g. encoded as:
beer -> crisps [support = 2%, confidence = 75%]
Support and Confidence
Are measures of pattern interestingness
Rule support
support(A -> B) = P(A ∪ B)
Rule confidence
confidence(A -> B) = P (B|A)
Frequent Itemset
• Absolute support of an itemset is its frequency count
• Relative support is the frequency count of the itemset divided by the total size of the dataset
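Relative support and rule confidence can be computed directly from a list of transactions. A minimal sketch follows, in which each transaction is a set of items and the toy data is made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) = support(A and B together) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


transactions = [
    {"beer", "crisps"},
    {"beer", "crisps", "bread"},
    {"beer"},
    {"bread"},
]
print(support(transactions, {"beer", "crisps"}))       # 2 of 4 transactions: 0.5
print(confidence(transactions, {"beer"}, {"crisps"}))  # 0.5 / 0.75
```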
Min-max normalization
• Enables cost-function minimization techniques to function properly, taking all attributes into equal account
• Transforms all attributes to lie on the range [0, 1] or [-1, 1]
• Linear transformation of all data
z-score normalisation
Better terminology is zero-mean normalization
• min-max normalization cannot cope with outliers, z-score normalization can.
• Transforms all attributes to have zero mean and unit standard deviation.
• Outliers are in the heavy-tail of the Gaussian.
• Still a linear transformation of all data.
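Both normalizations are simple linear transformations of a single attribute. The sketch below uses the population standard deviation for the z-score (an assumption, since some tools use the sample standard deviation) and assumes the attribute is not constant, otherwise both functions would divide by zero.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale the values to lie in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score_normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


data = [10, 20, 30]
print(min_max_normalize(data))  # [0.0, 0.5, 1.0]
print(z_score_normalize(data))
```

A single extreme value fixes one end of the min-max range and squeezes all other values together, which is why the z-score copes better with outliers.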
Data reduction
Should remove what's unnecessary, yet otherwise maintain the distribution and properties of the original data
• Data cube aggregation
• Attribute subset selection (feature selection)
• Dimensionality reduction (manifold projection)
• Numerosity reduction
• Discretization
Attribute subset selection
Feature selection
Feature selection is a form of dimensionality reduction in ML, hence the DM term 'dimensionality reduction' for manifold projection is problematic.
Approaches:
• Exact solution infeasible
• Greedy forward selection
• Backward elimination
• Forward-backward
• Decision tree induction
Numerosity Reduction
Reduces the number of instances rather than attributes.
Much more dangerous, as it risks changing the data distribution properties.
• Parametrization
• Discretization
• Sampling
Data Discretization
Grouping a possibly infinite space to a discrete set of possible values
For categorical data:
- Super-categories
For real numbers:
- Binning
- Histogram analysis
- Clustering
Instance Reduction
Reduces the number of instances rather than attributes.
Much more dangerous, as it risks changing the data distribution properties
• Duplicate removal
• Random sampling
• Cluster sample
• Stratified sampling
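Stratified sampling is one way to reduce the number of instances while roughly preserving class proportions. The sketch below is one possible implementation, not a reference one: `stratified_sample`, the `label_of` callback and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(instances, label_of, fraction, seed=0):
    """Sample the same fraction from every stratum (group of equal label)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for inst in instances:
        strata[label_of(inst)].append(inst)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced data set: 90 instances of class "a", 10 of class "b".
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(data, label_of=lambda inst: inst[0], fraction=0.1)
```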
Complex Itemsets
The general rule procedure for finding frequent item sets would be:
1. Find all frequent itemsets
2. Generate strong association rules
However, this is terribly costly: for 100 items, the total number of non-empty itemsets to be checked is 2^100 - 1 ≈ 1.27 x 10^30
Simpler Method for complex itemset
Closed frequent itemset: X is frequent, and there exists no proper superset Y of X with the same support count as X
Maximal frequent itemset: X is frequent, and there exists no proper superset Y of X that is also frequent
Apriori algorithm
Apriori algorithm is a fast way of finding frequent itemsets
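The card gives no details, so the following is only a compact sketch of the usual level-wise Apriori idea: candidate k-itemsets are built from frequent (k-1)-itemsets, and any candidate with an infrequent subset is pruned. The function name, the toy transactions and the threshold are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_support_count}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items()
                    if n >= min_support_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Hypothetical baskets.
transactions = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"beer", "bread"},
    {"milk", "bread"},
]
frequent_itemsets = apriori(transactions, min_support_count=2)
```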
Rule based learning
Equivalent in expression power to traditional (mono-thetic) decision trees, but with more flexibility
• They produce rule sets as solutions, in the form of a set of IF... THEN rules
Predicates
A logical statement that evaluates to true or false, generally expressed in Boolean logic
Rulesets
• Single rules are not the solution of the problem, they are members of rule sets
• Rules in a rule set cooperate to solve the problem. Together they should cover the whole search space
Evaluating Rules
• A good rule should not make mistakes and should cover as many examples as possible
Complexity: Favour rules with simple predicates (Occam's Razor)
Evaluating RuleSets
A complete rule set should be good at classifying all the training examples
Complexity: Favour rule sets with the minimal number of rules.
Learning Rulesets
Learning rules sequentially, one at a time
• Also known as separate-and-conquer
Learning all rules together
• Direct rule learning
• Deriving rules from decision trees
There are three reasons to reduce the dimensionality of a feature set
1. Remove features that have no correlation with the target distribution
2. Remove/combine features that have redundant correlation with target distribution
3. Extract new features with a more direct correlation with target distribution.
Degrees of freedom of variability
Number of ways data can change/ number of separate transformations possible
Intrinsic dimensionality
Subspace of data space that captures degrees of variability only, and is thus the most compact possible representation
Feature Selection
Feature Selection returns a subset of original feature set. It does not extract new features.
Benefits:
• Features retain original meaning
• After determining selected features, selection process is fast
Disadvantages:
• Cannot extract new features which have stronger correlation with target variable
Search Methods
• Exhaustive
• Greedy forward selection
• Greedy backward elimination
• Forward-backward approach
Filter Scores
• Correlation
• Mutual information
• Entropy
• Classification rate
• Regression score
Mutual Information
Gives a measure of how 'close' two components of a joint distribution are to being independent
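For discrete variables, mutual information can be computed directly from the joint probability table. The sketch below assumes the joint distribution is given as a nested list; the two example tables are made up.

```python
from math import log2

def mutual_information(joint):
    """joint[i][j] = P(X=i, Y=j). Returns I(X;Y) in bits."""
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with p = 0 contribute nothing
                mi += p * log2(p / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # P(x, y) = P(x) P(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y is a copy of X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table gives zero, and the fully dependent one gives one bit.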
Forward Selection
1. Start with empty SF set and candidate set being all original features
2. Find feature with highest filter score
3. Remove feature from candidate set
4. Add feature to SF set
5. Repeat steps 2-4 until convergence
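The steps above can be sketched as follows. The card does not define "convergence", so this sketch assumes it means "adding another feature no longer improves the filter score". The scoring function `toy_score` and the feature names are hypothetical.

```python
def forward_selection(features, score):
    """Greedy forward selection; `score` maps a list of features to a number."""
    selected, candidates = [], list(features)   # step 1
    best_score = float("-inf")
    while candidates:
        # Step 2: candidate whose addition yields the highest filter score.
        best = max(candidates, key=lambda f: score(selected + [f]))
        new_score = score(selected + [best])
        if new_score <= best_score:  # assumed convergence test: no improvement
            break
        candidates.remove(best)      # step 3
        selected.append(best)        # step 4
        best_score = new_score
    return selected

# Hypothetical filter score: summed relevance minus a penalty per feature.
relevance = {"age": 0.6, "height": 0.3, "noise": 0.0}

def toy_score(subset):
    return sum(relevance[f] for f in subset) - 0.1 * len(subset)

selected_features = forward_selection(relevance, toy_score)
```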
Backward Elimination
1. Start with complete SF set (contains all original features)
2. Find feature that, when removed, reduces the filter score least
3. Remove feature from SF set
4. Repeat steps 2-3 until convergence
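A matching sketch for backward elimination, assuming here that "convergence" means "every possible removal would lower the filter score". As before, `toy_score` and the feature names are hypothetical.

```python
def backward_elimination(features, score):
    """Greedy backward elimination; `score` maps a list of features to a number."""
    selected = list(features)                   # step 1
    best_score = score(selected)
    while len(selected) > 1:
        # Step 2: feature whose removal leaves the highest remaining score.
        worst = max(selected,
                    key=lambda f: score([g for g in selected if g != f]))
        new_score = score([g for g in selected if g != worst])
        if new_score < best_score:  # assumed convergence test: removal now hurts
            break
        selected.remove(worst)      # step 3
        best_score = new_score
    return selected

# Hypothetical filter score: summed relevance minus a penalty per feature.
relevance = {"age": 0.6, "height": 0.3, "noise": 0.0}

def toy_score(subset):
    return sum(relevance[f] for f in subset) - 0.1 * len(subset)

remaining_features = backward_elimination(relevance, toy_score)
```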
Forward-backward algorithm
First applies Forward selection and then filters redundant elements using backward elimination
CFS
• Correlation based feature selection (CFS) selects features in a forward-selection manner.
• Looks at each step at both correlation with target variable and already selected features.
PCA
• Manifold projection
• Assumes Gaussian latent variables and Gaussian observed variable distribution
• Linear-Gaussian dependence of the observed variables on the latent variables
• Also known as Karhunen-Loève transform
PCA requires calculation of
• Mean of observed variables
• Covariance of observed variables
• Eigenvalue/eigenvector Computation of covariance matrix
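The three calculations can be illustrated in plain Python. This sketch recovers only the first principal component and uses power iteration as a stand-in for a full eigensolver. The function name and the data are assumptions.

```python
def pca_first_component(rows, iterations=200):
    """Return (mean, leading eigenvector, leading eigenvalue) of the data."""
    n, d = len(rows), len(rows[0])
    # 1. Mean of the observed variables.
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # 2. Covariance matrix of the observed variables (unbiased estimate).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # 3. Leading eigenvector by power iteration.
    #    Assumes a non-zero covariance matrix and a start vector that is
    #    not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return mean, v, eigenvalue

# Hypothetical data lying exactly on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, direction, variance = pca_first_component(points)
```

For this data the recovered direction is proportional to (1, 2), and all of the variance lies along it.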
ANN feature selection
• Artificial Neural Networks can implicitly perform feature selection
• A multi-layer neural network where the first hidden layer has fewer units (nodes) than the input layer
• Called 'Auto-associative' networks
Parametric methods
• Many methods learn parameters of prediction function (e.g. linear regression, ANNs)
• After training, training set is discarded.
• Prediction purely based on learned parameters and new data.
Memory-based methods
• Uses all training data in every prediction (e.g. kNN)
• Becomes a kernel method if using a non-linear example comparison/ metric
Kernel
A shortcut that lets us perform certain calculations faster, which would otherwise involve computations in a higher-dimensional space.
Kernel methods
Map a non-linearly separable input space to another space which hopefully is linearly separable
• This space is usually higher-dimensional, possibly infinitely
• The key element is that they do not actually map features to this space; instead they return the inner product (a similarity measure) between elements in this space
• This implicit mapping is called the Kernel Trick
Kernel methods
Kernel methods map a non-linearly separable input space to another space which hopefully is linearly separable
• This space is usually higher dimensional, possibly infinitely
• Even the 'non-linear' kernel methods essentially solve a linear optimization problem
Kernel trick
The key element of kernel methods is that they do not actually map features to this space; instead they return the inner product (a similarity measure) between elements in this space. This implicit mapping is called the kernel trick.
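A small numerical check can make the trick concrete. For 2-D inputs, the degree-2 polynomial kernel k(x, z) = (x . z)^2 gives the same value as an inner product in an explicit 3-D feature space, without ever constructing that space. The feature map and the sample vectors below are chosen for illustration.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def explicit_map(x):
    """Explicit feature map matching the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """k(x, z) = (x . z)^2, computed entirely in the input space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
via_mapping = dot(explicit_map(x), explicit_map(z))  # inner product in 3-D
via_kernel = poly2_kernel(x, z)                      # same value, 2-D arithmetic
```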
Sparse Kernel Methods
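Kernel methods have practical drawbacks:
• The kernel must be evaluated on all training examples during testing
• The kernel must be evaluated on all pairs of patterns during training
- Training takes a long time
- Testing too
- Memory intensive (both disk/RAM)
Solution: sparse methods, whose predictions depend on only a subset of the training examples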
Three ways of constructing new kernels
Direct from feature space mappings
Proposing kernels directly
Combination of existing (valid) kernels
• multiplication by a constant
• exponential of a kernel
• sum of two kernels
• product of two kernels
• left/right multiplication by any function of x/x'
Commonly used kernels
• Linear kernel
• Polynomial kernel
• Gaussian kernel
(The Gaussian kernel is probably the most frequently used kernel; it maps to an infinite-dimensional feature space)
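The three kernels listed above can be written in a few lines. The parameter names `degree`, `c` and `sigma` and their defaults are common conventions assumed here, not values taken from the cards.

```python
from math import exp

def linear_kernel(x, z):
    """k(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, c=1.0):
    """k(x, z) = (x . z + c)^degree"""
    return (linear_kernel(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-sq_dist / (2 * sigma ** 2))

x, z = (1.0, 2.0), (3.0, 4.0)
k_lin = linear_kernel(x, z)
k_poly = polynomial_kernel(x, z)
k_gauss_same = gaussian_kernel(x, x)  # identical points give the maximum value 1
```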
Maximum margin classifier
Classifier which is able to give an associated distance from the decision boundary for each example.
Linearly-separable SVM
Satisfying solution (e.g. perceptron algorithm): finds a solution, not necessarily the 'best'
The best solution is the one that promises maximum generalizability
Slack variable
Slack variables are introduced to make the optimization problem solvable by allowing some training data to be misclassified
Slack variables ξ_n >= 0 give a linear penalty to examples lying on the wrong side of the margin or decision boundary (d.b.):
ξ_n = 0 for a point on the correct side of the margin boundary
ξ_n = |t_n - y(x_n)| otherwise
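The slack value can be computed per example. This sketch assumes the standard SVM convention: targets t in {-1, +1}, model output y = y(x), and a margin boundary at t * y = 1. The example pairs are made up.

```python
def slack(t, y):
    """Slack for target t in {-1, +1} and model output y = y(x)."""
    if t * y >= 1:      # on or beyond the correct margin boundary
        return 0.0
    return abs(t - y)   # inside the margin or misclassified; equals 1 - t*y

# Hypothetical (target, output) pairs.
examples = [(+1, 2.0), (+1, 0.4), (+1, -0.5), (-1, -1.0)]
slacks = [slack(t, y) for t, y in examples]
```

A slack above 1, as for the third example, means the point lies on the wrong side of the decision boundary itself.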
Relevance Vector Machines
Model the typical points of a data set rather than the atypical ones (à la density estimation), while remaining a (very) sparse representation (like a heat map)
Returns a true posterior
Naturally extends to multi-classification
Fewer parameters to tune
SVM Margin is defined as
The minimum distance between decision boundary and any sample of a class
SVMs seek a decision boundary
That maximizes the margin
Maximum margin classifiers
- This turns out to be a solution where decision boundary is determined by nearest points only
- Minimal set of points spanning decision boundary sought
- These points are called Support Vectors
Small set of SVs means that
Our solution is now sparse
Non-linearly Separable Problem
Usually problems aren't linearly separable (not even in feature space)
'Perfect' separation of training data classes would cause poor generalization due to massive overfitting
Soft Margin
We have effectively replaced the hard margin with a soft margin
New optimization goal is maximizing the margin while penalizing points on the wrong side of d.b.
Multiclass SVM
SVM is an inherently binary classifier. Two strategies to use SVMs for multiclass classification:
- One-vs-all
- One-vs-one
Problems:
- Ambiguities (both strategies)
- Imbalanced training sets (one-vs-all)
One-class SVM
- Unsupervised learning problem
- Similar to probability density estimation
- Instead of a pdf, goal is to find smooth boundary enclosing a region of high density of a class
Prior Probability
- Probability of encountering a class without observing any evidence
- Can be generalized for the state of any random variable
- Easily obtained from training data (i.e. counting)
Joint Probability
- Joint probability is the probability of encountering a particular class while simultaneously observing the value of one or more other random variables
- Can be generalized for the state of any combination of random variables
Bayes' Theorem
posterior = (likelihood x prior)/evidence
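A worked example of the formula, with made-up priors and likelihoods for a two-class problem. The evidence is obtained with the sum rule over classes, and picking the class with the largest posterior is the minimum-error-rate decision.

```python
def posterior(prior, likelihood):
    """prior[c] = P(c), likelihood[c] = P(x | c). Returns P(c | x) per class."""
    evidence = sum(prior[c] * likelihood[c] for c in prior)  # P(x)
    return {c: prior[c] * likelihood[c] / evidence for c in prior}

# Hypothetical numbers: 20% of mail is spam; the observed word appears in
# 60% of spam messages and 5% of non-spam ("ham") messages.
prior = {"spam": 0.2, "ham": 0.8}
likelihood = {"spam": 0.6, "ham": 0.05}

post = posterior(prior, likelihood)
decision = max(post, key=post.get)  # class with the largest posterior
```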
Minimum Error Rate
Goal is to minimise error rate
Normal Density
By far the most (ab)used density function is the Normal or Gaussian density
All probability theory can be expressed in terms of two rules
- Product rule
- Sum rule
Directed PGN
Edges have direction (Bayesian Network)
Undirected PGN
No edge direction (Markov Random Field)
Directed Acyclic Graphs (DAGs)
are the graphs underlying Bayesian Networks: edges are directed and no directed path leads from any node back to itself
Some variables are observed, others are hidden/latent
Example observed: Labels of a training set
Example hidden: Learned weights of a model
PGNs are generative models
They allow us to sample from the probability distribution they define
Ancestral sampling
is a simple sampling method well suited to PGNs
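In ancestral sampling, each node is sampled after its parents, conditioned on the parent values already drawn. The sketch below uses a hypothetical two-node network (rain -> wet grass) with made-up probabilities.

```python
import random

def ancestral_sample(rng):
    """Draw one joint sample, visiting nodes in topological order."""
    rain = rng.random() < 0.2       # root node: P(rain) = 0.2
    p_wet = 0.9 if rain else 0.1    # child node: P(wet | rain)
    wet = rng.random() < p_wet
    return rain, wet

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [ancestral_sample(rng) for _ in range(10000)]
rain_rate = sum(r for r, _ in samples) / len(samples)
wet_rate = sum(w for _, w in samples) / len(samples)
```

With enough samples, `wet_rate` should approach the marginal 0.2 * 0.9 + 0.8 * 0.1 = 0.26.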
Conditional independence in PGN
is the PGN mechanism to show information in terms of interesting aspects of probability distributions
Blocking Paths
When a path is blocked, no information can flow through it
If observing C blocks the path A-C-B, then observing A adds no further information about B beyond what C already provides: A and B are conditionally independent given C
Sequence Data
Sequence data is data that comes in a particular order
Opposite of independent and identically distributed (i.i.d.) data
Strong correlation between subsequent elements
- DNA
- Time series
- Facial Expressions
- Speech Recognition
- Weather Prediction
- Action planning
1st order Markov models
Restricted to encoding sequential correlation on previous element only
A Lattice/Trellis diagram visualizes
state transitions over time
Also a good tool to visualize the optimal path through states
(Viterbi Algorithm)
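The optimal path through the trellis can be found with the Viterbi algorithm. The sketch below assumes a first-order model with discrete emission tables. The state names, observations and probabilities are made-up illustration values.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical two-state model.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

best_path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```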
Emission Probabilities
Probabilities of observed variables
Acquiring emissions
Wide range of options to model emission probabilities:
- Discrete tables
- Gaussians
- Mixture of Gaussians
- Neural Networks/RVMs etc to mode
Perceptron Algorithm
The perceptron is modeled after neurons in the brain. It has m input values (which correspond to the m features of the examples in the training set) and one output value. Each input value x_i is multiplied by a weight factor w_i. If the sum of these products, plus the bias, is larger than zero, the perceptron is activated and 'fires' a signal (+1). Otherwise it is not activated.
The weighted sum of the input values can be computed as the scalar product <w, x>. To produce the 'firing' behaviour we can use the signum function sgn(): it maps the output to +1 if the input is positive and to -1 if the input is negative.
Thus, the perceptron can be modeled mathematically by the function y = sgn(b + <w, x>). Here b is the bias, i.e. the value of the weighted sum when all feature values are zero.
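The function y = sgn(b + <w, x>) translates directly into code. The training loop below uses the classic perceptron update rule, which the card does not describe, so treat it as an added assumption. The learning rate, epoch limit and AND-function data are likewise illustrative, and sgn(0) is taken to be -1 (not activated).

```python
def sgn(v):
    """+1 if v is positive, otherwise -1 (zero counts as 'not activated')."""
    return 1 if v > 0 else -1

def predict(w, b, x):
    """y = sgn(b + <w, x>)"""
    return sgn(b + sum(wi * xi for wi, xi in zip(w, x)))

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Classic perceptron rule: on a mistake, move w and b towards the target."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, ti in zip(X, y):
            if predict(w, b, xi) != ti:
                w = [wj + lr * ti * xj for wj, xj in zip(w, xi)]
                b += lr * ti
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return w, b

# Hypothetical linearly separable task: the logical AND of two inputs.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, xi) for xi in X]
```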