Predictive Analytics Review
Terms in this set (90)
What is data mining?
- Finding or extracting patterns in data
- Concerned with meaningful previously unknown patterns
- Combines statistics, machine learning and computing
- It is motivated by:
i. Large volumes of data
ii. Different type/dimensions of data
iii. Complex questions that require more than traditional statistical analyses
What is Descriptive Modelling?
-Focus on historical data only and summarises what has happened.
-Developing reports, dashboards, and scorecards are the main objectives/outcome of descriptive modelling. (e.g. What was the best-selling product of company X in the years 2016 and 2017? / what were the airlines with the highest passenger satisfaction rates in 2015-17?)
What are the descriptive modelling features?
-Extracts or presents the main descriptive features from data
-Summarises data w.r.t. specific data dimensions
(e.g. time, manufacturer, product, event, demographics, interests and the similar)
-Find co-occurrences of events or patterns
-Find associations in data elements
-Make no assumptions about data prior to modelling.
CRISP framework
-Stands for Cross Industry Standard Process for data mining.
-Most widely adopted framework for data mining.
What are some Descriptive Techniques?
-Correlation analysis
-Data clustering (or segmentation)
What is Predictive Analysis?
Focus on future outcomes for the business given their historic data. It will help understand what is likely to occur in the future (e.g. what is the likely student retention rate in course X in my university next year?)
What are some Predictive Analytics Features?
-Models existing historic data to be able to predict likely future outcomes or events given similar future unseen data. To do this, predictive analysis creates a model that represents how different variables in the data are related to each other.
-Finds the most relevant part of data that can be used for prediction
-Can predict a single outcome or a series of outcomes over time; the latter is referred to as forecasting.
-Needs existing historic data to be labelled
(requires assumptions on data prior to analysis).
Data preparation
data quality:
-ACCURACY - dealing with data errors or extreme cases that deviate from expectation
-COMPLETENESS- dealing with the lack of attribute values, lack of certain attributes, or presence of aggregates values only
-CONSISTENCY- dealing with discrepancies in the data of processes that generate the data
-TIMELINESS- dealing with the timeframes within which data are prepared
-BELIEVABILITY- are data according to the standard process and can they be relied upon?
-INTERPRETABILITY- can the data be interpreted or understood?
-UNIFORMITY- dealing with the unit of measurement for the data and making sure all data points are in the same unit.
Data Terminology
Data set: collection of data with some defined structure
Data Point: a single instance in the data set
Attribute: it is a single property of each data instance/point
Label: it is the special attribute that needs to be predicted based on input attributes.
Identifier: it is a special attribute that is used for providing context to each data point. They are excluded from actual data mining steps
Training set: it is a portion of the data that is used for model building and tuning purposes only
Test set: it is the remaining portion of the data set that is used for model evaluation only.
*the training set and test set must not overlap or have common data points
Preparations steps
Data cleaning
Data integration
Data transformation
Data reduction
Data Cleaning
It is the process of removing any inaccurate or incomplete data from a data set.
Done via:
-replacing data errors or missing values
-modifying data errors
-deleting data errors or missing values.
Data cleaning (missing value treatment)
- Ignore or delete data points with missing values
- Ignore or delete attributes with missing values
- Replace missing values with a constant value (e.g. 0 for numeric attributes)
- Replace missing values with the central tendency of the attribute (Mean: numeric/ Mode: nominal)
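The central-tendency replacement above can be sketched as follows. This is a minimal illustration, assuming missing values are represented as `None`; the function name `impute_central_tendency` is hypothetical.

```python
from statistics import mean, mode

def impute_central_tendency(values):
    """Replace None with the mean (numeric attribute) or mode (nominal attribute)."""
    observed = [v for v in values if v is not None]
    # Assumption: an attribute is numeric if every observed value is an int or float.
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40]                 # numeric attribute
colours = ["red", "blue", None, "red"]    # nominal attribute
print(impute_central_tendency(ages))
print(impute_central_tendency(colours))
```

Deleting data points or attributes instead would simply filter out the rows or columns that contain `None`.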
What are some predictive analytics techniques?
-Classification
-Numeric estimation/prediction (i.e. regression)
-Time series analysis (i.e. forecasting)
-Anomaly detection
Main steps of CRISP Framework
-Business Understanding
-Data understanding
-Data preparation
-modelling
-evaluation
-deployment
What are the 2 types of Discriminant Analysis?
• DA/Classification
-These models categorize data points or instances into predefined categories based on some measurable characteristics of the data instances. (i.e. predictors)
-Predefined categories or classes are unique nominal values of a target attribute:
>Sales price: high, low, medium
>Exam outcome: pass, fail
>Customer review: good, neutral, bad
-Predictors can be any attribute or characteristics of the data:
>House size, lot frontage, # of bedrooms, location.
>Previous assessment marks, class participation.
># of positive words, # of negative words, length of review.
•Workings of DA/Classification
-Finds a mapping between predictors and the target categories in the data.
-Requires (usually a large amount of) previously categorized data instances
(i.e., labelled data).
-Looks into labelled data and finds a mathematical path or algorithmic path from the predictors to the labels.
What are the applications of discriminant analysis?
-Predictors: mole scans, patient data (e.g. age, gender, etc.)
-Target attribute: skin cancer
-Classes: likely, not-likely
What is a Decision Tree?
•DTs represent a set of decision rules (i.e., if-then statements) to separate (i.e., classify) data into two or more pre-defined classes based on observation attributes
•They are among the most frequently used and interpretable data mining (machine learning) techniques.
•Works with both numeric and non-numeric data types.
•It separates data space into partitions.
•E.g.:
-With 2 predictors: A and B
-With 2 classes: 1 and 2
*Notice the misclassification: it is a normal outcome in many real-world classification problems.
What is entropy?
-it is a measure that gives an indication of how the data points in a data set are similar or different from each other.
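-It is calculated using the formula:
$H = -\sum_{i} p_i \log_2 p_i$, where $p_i$ is the proportion of data points in class $i$.
>If the data are completely homogeneous, entropy = 0.
>If the data are equally divided between two classes, then the entropy = 1.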
What is information gain?
-It is a measure of how much entropy has been decreased after splitting the data with a predictor's values.
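A small sketch of entropy and information gain, assuming the standard Shannon entropy with a base-2 logarithm (which gives 0 for a homogeneous set and 1 for an even two-class split). The function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: sum of p * log2(1/p) over the class proportions p."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

def information_gain(labels, predictor_values):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(labels)
    groups = {}
    for value, label in zip(predictor_values, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - after

labels = ["pass", "pass", "fail", "fail"]
print(entropy(labels))                                  # evenly divided set
print(information_gain(labels, ["x", "x", "y", "y"]))   # predictor separates the classes
print(information_gain(labels, ["x", "y", "x", "y"]))   # predictor tells us nothing
```

A decision tree would split on the predictor with the largest gain at each node.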
When do you stop splitting in Decision Trees?
-In real applications, it is unlikely to reach nodes that are 100% pure
-A better stopping criterion is required
e.g.
-No predictor satisfies a minimum information gain threshold on a split with it, STOP
-A maximal tree depth is reached, STOP
-There are fewer than a certain number of data points in the current sub-tree, STOP
-These are called pre-pruning strategies; they prevent the tree from becoming too deep or un-interpretable.
What is post-pruning useful for?
-The tree can become too deep
-It can become too complicated with many leaves
-It can become a memory of the data with no generalization ability
-The above conditions signal for model overfitting
-Solution: post-pruning strategies after the tree has been built
>Replacing some sub-trees with nodes, using majority voting
>Reducing tree depth and complexity
What are the types of K Nearest Neighbor Classification?
KNN Classifiers:
>Assume same-class instances congregate in a neighbourhood
-Look up proximal data points to classify new instances
-Find no mapping function between instances and class labels
-Go back to the training data to classify test data instances
Classifying 'not too easy' new instances:
-Finds the k nearest instances to new instance
-Determines the class of the new instance as the most frequent class in the nearest K neighbours.
-The key aspects of K-NN classification are:
>How many neighbours to look for/ what to consider
>How to determine the nearest instances/ the measure of proximity
Measure of Proximity:
-Distance;
the Euclidean distance between data points in an n-dimensional space can be used. This is sensitive to the scale and unit of data attributes, which can be alleviated using normalization techniques.
-Correlation;
the correlation between data points (not 2 variables) can be used as a proximity measure. One correlation measure may not cover all types of relationships.
(e.g. Pearson is only for linear relationships)
-There are other measures, such as;
>Simple matching coefficient, for binary attributes
>Jaccard similarity, for text documents or attributes
>Cosine similarity, for text documents and for nominal attributes
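The K-NN procedure described in this card can be sketched as below, assuming Euclidean distance as the proximity measure and majority voting among the k nearest training instances. The name `knn_classify` and the toy data are illustrative only.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_classify(training, new_point, k=3):
    """training is a list of (features, label) pairs."""
    # Lazy learning: no model is built, the training data is searched at prediction time.
    nearest = sorted(training, key=lambda pair: dist(pair[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
            ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_classify(training, (1.1, 1.0)))
print(knn_classify(training, (5.0, 5.2)))
```

Because Euclidean distance is sensitive to scale, attributes would normally be normalized before calling a function like this.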
how do you measure classification Performance?
-It is done by classifying test instances and comparing their predicted classes vs. actual classes.
-One approach is to use a confusion matrix that tabulates prediction results vs. actual classes.
>Correctly classified instances will be in the diagonal
>Accuracy will be the ratio of the number of correctly classified test instances over the total number of test instances.
>Error or misclassification is 1-accuracy.
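As a rough sketch, a confusion matrix can be held as counts of (actual, predicted) pairs, with accuracy read off the matching pairs. The function names and the example labels are assumptions for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual class, predicted class) pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Correctly classified test instances over all test instances."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual    = ["pass", "pass", "fail", "fail", "pass"]
predicted = ["pass", "fail", "fail", "fail", "pass"]
matrix = confusion_matrix(actual, predicted)
print(matrix[("pass", "fail")])          # actual pass, predicted fail
print(accuracy(actual, predicted))       # error is 1 minus this value
```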
What is the Impact of imbalanced data sets?
-E.g.: In binary classification, one class has many instances, and the other class has few instances
-Accuracy may be misleading due to bias towards the larger class, where the smaller class may be of more importance (e.g. intrusion detection)
-The Kappa Statistic accounts for class imbalance:
>It is a measure of agreement between actual and predicted class labels
>Considers the expected random chance accuracy
>It indicates how much better the classifier is performing than a classifier that randomly guesses the class of test instances according to the number of instances in each class (in the training set)
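A hedged sketch of the idea, assuming the statistic meant here is Cohen's kappa, where the chance agreement is estimated from the class proportions of the actual and predicted labels. The function name and the intrusion-detection data are made up for illustration.

```python
from collections import Counter

def cohen_kappa(actual, predicted):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n  # plain accuracy
    actual_counts, predicted_counts = Counter(actual), Counter(predicted)
    expected = sum(actual_counts[c] / n * predicted_counts[c] / n for c in actual_counts)
    # Note: undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)

actual = ["normal"] * 90 + ["intrusion"] * 10
always_normal = ["normal"] * 100          # 90% accurate, yet useless
print(cohen_kappa(actual, always_normal))
print(cohen_kappa(actual, actual))
```

The always-majority classifier scores high on accuracy but gets a kappa of about zero, which is the point of using kappa on imbalanced data.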
What are the 2 types of learning?
-Eager Learning:
>In machine learning, there are classification methods that use the training data to learn a mapping (in an algorithmic or mathematical way) between predictors of data and target class labels.
e.g.: decision trees and regression models
-Lazy Learning:
>In machine learning, there are classification methods that will not learn any mapping between predictors of data and target class labels.
>These methods investigate the training data every time to classify a test instance.
E.g.: K-NN and Case-based reasoning
What are the two model fitting problems?
-Overfitting:
Happens when a model is too specific and too complex after training with a training set. The model will not be able to do any generalization on new unseen data, it works like a memory of the training data.
>Scenarios:
i. Decision trees: a tree that is too deep and has too many leaves
(e.g., equal to the number of training instances)
ii. K-NNs: where k=1, which means there is no generalization over a group of neighbours.
-Underfitting:
Happens when a model is too general and too simple after training with a training set.
The model will do too much generalization on new unseen data; it works as if it has not learnt much from the training set.
>Scenarios:
i. Decision trees: a tree that is very shallow and not many leaves.
(e.g. only a root node)
ii. K-NNs: where K is equal to the number of training instances resulting in too much generalization
What is Reduced Error Pruning?
-For post-pruning:
>Consider each node for pruning
>Removing the subtree at the node, make it a leaf and assign the most common class at that node (i.e., majority voting)
>Return to the previous tree only if the new tree has larger error
>Remove nodes iteratively choosing the node whose removal most decreases error
>Pruning continues until further pruning is harmful (error increases)
Where can data mining be applied?
Can be applied in:
i. Clinical and population health decision support.
ii. Targeted marketing
iii. Fraud detection
iv. Intrusion detection
v. Economy prediction
vi. Risk analysis and prediction
vii. Sports performance modelling and analysis
viii. Market basket analysis
ix. Student retention vs. attrition analysis
What are the two types of relationships?
i. Linear: the relationship can be explained with a straight line, constant change in one attribute because of change in the other.
ii. Non-linear: the relationship can be explained with a curve (or any non-straight line), non-constant change in one attribute as a result of change in the other.
What are the 3 strengths of relationships?
-High: changes in one attribute because of change in the other attribute are drastic/significant.
- Medium: changes in one attribute as a result of change in the other attribute are moderate.
- Low: changes in one attribute because of change in the other attribute are gradual/insignificant.
What is correlation and what are the 3 types of correlation?
-It is a way of finding attribute relationships and their strength
Types:
- Positive: both attributes change in the same direction (increase-increase/decrease-decrease).
- Negative: attributes change in the opposite direction (increase-decrease/decrease-increase).
- No-correlation: is similar to no relationship.
What is the Pearson correlation coefficient?
one of the several methods to find (linear) correlation:
-It is a value in [-1, +1], where;
-1 = perfect negative / 0 = no correlation / +1 = perfect positive
-Does not capture non-linearity (or the type of the relationship)
-It is sensitive to outliers
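A minimal sketch of the Pearson coefficient using its textbook definition (covariance divided by the product of the standard deviations). The function name is hypothetical, and constant attributes (zero variance) are not handled.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long attributes."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return co_variation / sqrt(spread_x * spread_y)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))    # perfect positive relationship
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))    # perfect negative relationship
print(pearson([1, 2, 3, 4], [2, 4, 6, 80]))   # one outlier pulls the value below 1
```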
What is the misconception of correlation?
-Correlation does not always mean causation.
-e.g.: number of people who drown and number of ice-creams sold on the beach, number of cars that have umbrellas in them and the number of road accidents.
-Correlation can be used to understand there may be an underlying cause
-The cause may not necessarily be one of the two attributes, there may exist a third factor.
What is a Correlation Matrix?
A square matrix:
-Diagonal values = 1
-Rows: attributes
-Column: attributes
-Cells: pair-wise correlation of corresponding attributes.
What is Simple Regression?
-Relationship between two variables approximated using a linear equation.
-The line described using the equation is called the regression line with:
> an intercept
> a slope
> an error term (i.e. residual)
-Can be used to predict the dependent variable y for a new unseen x (i.e. estimation).
-The estimation error is the difference between the actual and estimated/predicted y values.
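One common way to find the regression line is ordinary least squares; the card does not name a fitting method, so that choice is an assumption here. The function name and toy data are illustrative.

```python
def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]                      # generated from y = 1 + 2x, no noise
intercept, slope = fit_simple_regression(x, y)
new_x = 10
estimate = intercept + slope * new_x  # estimation for a new unseen x
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
print(intercept, slope, estimate, residuals)
```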
What is the Goodness of Fit Measure?
-a measure that indicates how well the regression line has fit the (training) data points.
-Summarises the discrepancy/differences between the actual and predicted (training) y values.
-R-squared (r^2): coefficient of determination.
>it is the proportion of the variance in y that is
predictable from the independent variable x
>indicates the total variation in y that can be
explained by the regression model.
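R-squared can be sketched as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of y. The function name is hypothetical.

```python
def r_squared(actual, predicted):
    """Proportion of the variance in y explained by the predictions."""
    mean_y = sum(actual) / len(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - mean_y) ** 2 for a in actual)
    return 1 - residual_ss / total_ss

print(r_squared([1, 2, 3], [1, 2, 3]))   # predictions match exactly
print(r_squared([1, 2, 3], [2, 2, 2]))   # no better than predicting the mean of y
```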
How can the overall model significance be estimated?
-using F-value or F-ratio.
-F-value measures how significantly the regression line is different from a crude mean line.
Why should you check the single coefficient slope significance? How would you measure it?
-to see if the changes in y are significantly related or explained by the changes in x.
-Use T-test and p-value to measure
E.g.: predict sale price of a house from its 1st floor surface.
What is Multiple Regression?
-Relationship between a dependent variable and two or more independent variables approximated using a linear equation.
-The line described using the equation is called the regression line with:
> an intercept
> a number of slopes
> an error term (i.e., residual)
-Can be used to predict the dependent variable y for a set of new unseen x values.
What is the concept of multi-co-linearity?
-One predictor can be linearly predicted from other predictors.
-Will not make the model invalid.
-Can result in erratic changes in coefficient estimates in response to small changes in data
-Can be removed for model simplicity.
-It is calculated using r^2 of fit:
> current attribute as the dependent variable
> all other attributes as predictors.
> close to 0: no multi-collinearity exists for the attribute.
> close to 1: multi-collinearity may exist for the attribute.
What is the tolerance of multiple regression?
-For each coefficient, it is calculated as 1-r^2, where r^2 is the model fit with the current attribute as the dependent variable and all the other attributes as predictors.
-It is a measure of a multi-co-linearity, preferred to be close to 1; i.e., r^2 of fit with other attributes to be close to 0.
-E.g.: predict sale price of a house from 1st floor surface and its lot frontage.
How do you Interpret estimation coefficients?
-Lot_Frontage coefficient=424.815:
On average, the sale price of a house will increase by 424.815 as a result of an increase in the lot frontage of 1 meter, given the 1st floor surface remains unchanged.
-1st_Flr_SF coefficient=112.436:
On average, the sale price of a house will increase by 112.436 as a result of an increase in the 1st floor surface of 1 square meter, given the lot frontage remains unchanged.
-Intercept=19728.004:
On average, the sale price of a house is 19728.004 given the lot frontage and 1st floor surface are both 0 (in this particular case, this is not meaningful as lot frontage and 1st floor surface can never be 0).
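The interpretation above can be checked by plugging the three coefficients from this card into the regression equation. The function name and the input values (20, 21, 100) are made up for illustration.

```python
INTERCEPT = 19728.004
LOT_FRONTAGE_COEF = 424.815
FIRST_FLR_SF_COEF = 112.436

def predict_sale_price(lot_frontage, first_flr_sf):
    """Estimated sale price from the multiple regression equation."""
    return (INTERCEPT
            + LOT_FRONTAGE_COEF * lot_frontage
            + FIRST_FLR_SF_COEF * first_flr_sf)

base = predict_sale_price(20, 100)
wider = predict_sale_price(21, 100)   # one more unit of lot frontage, same floor surface
print(wider - base)                   # equals the Lot_Frontage coefficient (up to rounding)
```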
What are the conditions of linear regression that should be met?
=> Need to make sure that:
-All variables are numeric.
-There are no missing values in any attribute/variable.
-There are no extreme cases (outliers detected and removed).
-Predictors are independent from each other (no multi-collinearity).
=> Estimation errors:
> have constant variance.
> are normally distributed.
> are independent from the independent/dependent variables.
What are some Regression Validation Procedures?
-To assess the generalization ability of the regression model on previously unseen data
> Find the regression model with a partition of
the data set.
> Evaluate the regression model with a
separate partition of the data set.
-Can be done using:
> Hold-out validation
> Resampling (K-fold CV, LOOCV, Bootstrapping).
What are some Regression Validation Metrics?
-Predict/estimate y for each new unseen data point (from x).
-Calculate estimation error on each new unseen data point.
-Aggregate the estimation errors for all new unseen data points.
-Aggregation functions:
i. Mean absolute error (MAE).
ii. Root mean square error/deviation (RMSE/RMSD).
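The two aggregation functions can be sketched as follows; the function names and example values are illustrative.

```python
from math import sqrt

def mean_absolute_error(actual, predicted):
    """Average of the absolute estimation errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_square_error(actual, predicted):
    """Square root of the average squared estimation error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [3, 5, 7]
predicted = [2, 5, 9]      # errors of 1, 0 and 2
print(mean_absolute_error(actual, predicted))
print(root_mean_square_error(actual, predicted))
```

RMSE penalises large errors more heavily than MAE because the errors are squared before averaging.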
What are some of the prediction types?
i. Ranking predictions: A model that uses input measurements to optimally rank input cases. (e.g. online document retrieval and ranking, and online recommendation systems)
ii. Estimate predictions: A model that uses input measurements to optimally estimate the target value. (e.g. sale price prediction and temp prediction)
iii. Class/Category predictions: A model that uses input measurements to make the best decision about the category of input cases.
The number of classes/categories can be:
- 2 i.e. binary classification:
(e.g. cancer vs. non-cancer conditions)
- More i.e., multi-class classification:
(e.g. sadness vs. happiness vs. surprise vs. anger in tweets)
Data cleaning (k-NN imputation)
- Fill in the missing values with a value that is most suitable via learning from neighbouring data points (finding the closest neighbours to the given data point with a missing value/ averaging the nearby data points for learning and imputation)
- Needs all training data for learning and imputation
- Needs K to be defined
- Needs the 'closeness' criterion to be defined
- It is usually a robust approach (to different K values and closeness criteria)
- Data can be multidimensional with N attributes
Data cleaning (data error treatment)
- Data errors can be ignored or deleted
- Data errors can be treated as missing values, same strategies can then be applied as with treating missing values in data (i.e. replacement with constant values, replacement with central tendency, and imputation through learning from other data attributes)
Data Integration
It is concerned with combining data from various sources into a single repository of data with a unified format and a unified view.
Two approaches:
-Data warehousing, using ETL (extract, transform, load)
-Mediated schemas, using specific wrappers for each database
Data Transformation
It is concerned with converting data from one format/structure to another format/structure that is suitable for a specific data analysis process or technique.
Two ways:
1. Batch transformation - where specific transformation rules are developed to transform large volumes of data at once.
2. Interactive transformation - where specific interfaces are developed so users can directly interact with large volumes of data and make specific changes or corrections
Data Reduction
It is concerned with reducing the size of a data set either for performance considerations or for specific objectives
Several ways:
1. Reducing the number of data attributes
- to find the optimal subset of attributes that are best suitable for the specific analysis
- Techniques:
- cube aggregation = using the slice operation
- attribute/ feature selection = through finding which attribute should be included in a specific analysis and filtering out the non-relevant attributes
- dimensionality reduction = through merging original attributes and finding aggregate attributes that explain variance in data in a more concise and effective way.
2. reducing the number of attribute values
- To find optimal subset of attribute values that are best suitable for the specific analysis
Several Ways:
- cube aggregation = using the roll-up operation
- binning = to reduce the distinct number of attribute values by grouping them into specific interval or bins
- clustering = to reduce the distinct number of attribute values by way of data clustering and using clusters instead of original attribute values.
3. reducing the number of data points/instances
- to extract a sample of data points and analyze them instead of the entire data set
Techniques include:
- probability-based random sampling = using random numbers that correspond to data points, which ensures no correlation between the data points that are selected (a random seed initialises the random number generator).
- probability-based stratified random sampling = where the data set is broken down into strata; within each stratum, data points have the same chance of being selected.
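Stratified random sampling, the last technique above, can be sketched like this. The function name, the `fraction` parameter and the toy data are assumptions; the fixed seed only makes the draw reproducible.

```python
import random

def stratified_sample(points, stratum_of, fraction, seed=0):
    """Sample the same fraction of data points from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for point in points:
        strata.setdefault(stratum_of(point), []).append(point)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))   # keep at least one per stratum
        sample.extend(rng.sample(members, size))        # equal chance within the stratum
    return sample

points = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
sample = stratified_sample(points, lambda p: p[0], 0.1)
print(len(sample), sum(1 for p in sample if p[0] == "b"))
```

The 80/20 split between the two strata is preserved in the sample, which plain random sampling does not guarantee.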
Data Transformation strategies
1. NORMALIZATION= to align probability distributions of different data attributes with different means and standard deviations
-Techniques:
-Min-max method: to bring all data values into specific interval between a minimum and a maximum value.
-Z-transformation: will centre data(subtract mean from all data points) and scale data(divide all data points by the standard deviation). It will convert every attribute value to its z-score
2. SMOOTHING = removes noise and other fine-scale variation from data attributes.
-lowers/removes the effect of unstable or infrequent data values.
-captures important patterns in data
- Techniques include: average of all previous data, moving average (simple, tailed, weighted), and exponential smoothing
3. RESOLVE SKEWNESS = removes any significant left or right distributional skewness of data.
-If target or label is skewed, then bad predictive models will be built as a result of imbalanced data.
-If the predictor attribute is skewed, then some useful features of data may be lost or ignored by the predictive model.
-Techniques:
-Log transformation: will replace each attribute value with its log transformation
-Statistical approaches: Box-Cox method
4. RESOLVE OUTLIERS = to resolve data points that are far from the mainstream of the data.
-outliers are not to be confused with data errors
-outliers can be hard to define
-Outliers can be treated in two ways:
-identified and removed from data
-identified, then all data points are transformed to lower/remove the effects of the outliers.
-Spatial Sign is one method that projects all attributes onto a multidimensional sphere
-this will make all the samples the same distance from the centre of the sphere (it is necessary to do the data centering and scaling before spatial sign transformation)
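The two normalization techniques listed under item 1 of this card can be sketched as below. The function names are hypothetical, and the z-transformation here assumes the population standard deviation (the sample standard deviation is an equally common choice).

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_transform(values):
    """Centre on the mean and scale by the standard deviation (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

values = [10, 20, 30]
print(min_max(values))
print(z_transform(values))
```

Neither function handles a constant attribute, where the range and the standard deviation are both zero.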
Classification Validation
motivations:
-to estimate the performance of the predictive model, especially for the new unseen data
-a simplistic approach: train/build the model on the entire data available and evaluate it on that same data.
-simplistic approach not the best solution:
- results in overfitting
- will not give a good estimate of performance when new unseen data arrives
- the estimation of the performance will be overly optimistic(sometimes 100% accuracy)
Hold-out validation
Random/stratified random split:
-it is the simplest validation technique
-random selection of a sample for training (approx. 70-80%), the rest for testing
-the classifier is trained with the training set and tested with the testing set, once only.
-more effective: stratified random sampling
-take random samples from each class
-preserve the distribution of instances in the selected sample.
Resampling
motivations:
-hold-out validation shortcomings: each data instance is guaranteed to be seen only once by the model (training/testing)
-the split(training-testing) may not represent the entire data set/population
-testing performance may vary a lot based on the random selection of training/testing partitions
- a lucky split may result in a good estimate and vice-versa.
Resampling Methods
General Method:
-a subset of the entire data(ie. sample) is selected for model building or training
-the remaining subset of data is used for model testing or evaluation
-the above steps are repeated multiple times
-the overall performance of the model is the average of all performance in the above iterations
1. K-fold cross validation
-the entire data set is randomly partitioned into k folds (each fold containing a sample)
-k-1 folds are used for model building or training
-the remaining fold is used for model testing or evaluation
-the above procedure is done k times
-the overall performance will be the average of the model performances in all the above iterations
2. Leave-one-out cross validation (LOOCV)
-the entire data set is partitioned into k-folds
-k = total number of instances in the entire data set
-k-1 folds are used for model building or training
-the remaining fold is used for model testing or evaluation
-the above procedures are done for k times
-the overall performance will be the average of the model performances in all the above iterations
3. Bootstrapping
-randomly take N data instances from the entire data set with N instances
-random selection is done with replacement
-some data instances may not be selected, some may be selected more than once.
-on average, only about 63.2% of instances are selected
-use the sample taken for model building or training
-use the remaining set (ie. the out-of-bag or out-of-bootstrap sample) for model testing or evaluation
-repeat the procedure several times; the overall performance will be the average of those for the above iterations.
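The index bookkeeping behind these resampling methods can be sketched as follows. In each k-fold iteration, k-1 folds are used for training and the remaining fold for testing; LOOCV is the special case where k equals the number of instances. The function names and seeds are assumptions, and model training itself is left out.

```python
import random

def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)          # random partitioning
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def bootstrap_split(n_instances, seed=0):
    """Draw N indices with replacement; unselected indices form the out-of-bag set."""
    rng = random.Random(seed)
    train = [rng.randrange(n_instances) for _ in range(n_instances)]
    chosen = set(train)
    out_of_bag = [i for i in range(n_instances) if i not in chosen]
    return train, out_of_bag

for train, test in k_fold_splits(10, 5):
    print(sorted(test), len(train))

loocv = list(k_fold_splits(4, 4))                 # LOOCV: one instance per fold
train, out_of_bag = bootstrap_split(1000)
print(len(set(train)) / 1000)                     # roughly 63% of instances are selected
```

The performance of a model would be measured on each test (or out-of-bag) set and then averaged over the iterations.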
What Resampling techniques to use when?
LOOCV: best suited for very small data sets (becomes less efficient with large data sets)
K-fold CV: best suited for small data sets, especially to get the best indication of performance
Bootstrapping: best suited for large data sets, especially for model comparisons.
*as data sets get larger, the choice between k-fold CV and Bootstrapping will not make a big difference.
Classification ensembles
motivations:
-Why predictive modelling in the first instance?
-to find a mapping between data points and labels
-to find the best mapping with the least error
-single models may:
-only capture specific features of the data set.
-overfit the data set (ie. have large variance)
-underfit the data set(ie. have a large bias)
Classification Ensembles Methods
General Methods:
-create single(base) models for the data set
-create a meta-model(meta-function) to choose among the models to classify a new single data instance
1. Bagging (bootstrap aggregation)
-create M different subsets of data instances using bootstrapping (ie. in a data set with N instances, extract N instances randomly with replacements)
-create M separate base models for the different subsets of data instances above
-create an aggregation model over the base models created.
-final prediction will be the prediction using the aggregate method.
2. Random forest(feature bagging)
-create M random subsets of data attributes or features(all data instances)
-create M separate base models for different subsets of data features above
-create an aggregation method over the M base models created
-final prediction will be the prediction using the aggregate method.
3. Stacking
-create M separate base models, each for the entire data set (ie. all data instances and all attributes)
-create an aggregation method over the M base models created.
-final prediction will be the prediction using the aggregate method.
4. Boosting
-step 1: extract a training subset of data instances using bootstrapping
-step 2: create a base(weak) model on the training data above and evaluate the weak model
-step 3: reweight data instances
-correctly classified: lower weight
-incorrectly classified: higher weight
-go through steps 1-3 for M times to create M weak learners
-final prediction will be the prediction using the aggregate method on the M weak learners above.
5. AdaBoost (Adaptive Boosting)
-one of the most popular implementations of boosting
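Of the methods above, bagging is the simplest to sketch: draw M bootstrap samples, build one base model per sample, and aggregate by majority voting. The base model here is a hypothetical 1-nearest-neighbour rule on a single numeric predictor, chosen only to keep the example short; the function names and toy data are assumptions too.

```python
import random
from collections import Counter

def nearest_label(sample, x):
    """Hypothetical base model: label of the closest instance in the sample."""
    return min(sample, key=lambda pair: abs(pair[0] - x))[1]

def bagging_predict(data, x, m=25, seed=0):
    """Classify x by majority vote over m base models built on bootstrap samples."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(m):
        # Bootstrapping: N instances drawn from N with replacement.
        sample = [rng.choice(data) for _ in data]
        votes[nearest_label(sample, x)] += 1
    # Aggregation method: majority voting.
    return votes.most_common(1)[0][0]

data = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.5, "low"),
        (8.0, "high"), (8.5, "high"), (9.0, "high"), (9.5, "high")]
print(bagging_predict(data, 1.8))
print(bagging_predict(data, 9.2))
```

Feature bagging would sample attributes instead of instances, and boosting would reweight instances between rounds; neither is shown here.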
Classification validation vs. Ensembles
CV = Only one model is evaluated (to reduce performance estimation error)
Ensembles = several models are created, one on top of the rest for aggregation(to reduce bias and variance)
Classification Baseline
Baseline: refers to the minimum performance requirement for a trained classifier.
-can be used as an indication of how good a trained classifier is.
-zero rule (majority class)
sets the baseline performance at the proportion of data instances that belong to the largest class.
*the zero-rule baseline is not acceptable in many cases
-it ignores other (smaller) classes altogether even if they may be of higher importance.
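A minimal sketch of the zero-rule baseline: always predict the majority class, so the baseline accuracy is that class's share of the data. The function name and the example class counts are illustrative.

```python
from collections import Counter

def zero_rule_baseline(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority_class, count = Counter(labels).most_common(1)[0]
    return majority_class, count / len(labels)

labels = ["normal"] * 95 + ["intrusion"] * 5
print(zero_rule_baseline(labels))
```

A trained classifier that does not beat this figure has learnt nothing useful, yet the baseline itself never detects a single intrusion.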
What are association rules?
-a measure of how strongly two (or more) items co-occur.
-Find patterns in the data rather than predicting anything.
-They are rules extracted from large amounts of data.
-{items A} -> {items B}:
>If A is in the item set, then B will most likely be
there too.
-{Items A and items B} -> {items C and item D and item E}:
> If a shopper buys milk, then they will most
likely buy bread too.
> If a football team gets a penalty, then they
will most likely score a goal.
> If a customer buys one product per quarter,
then they will most likely not churn for a year.
What are containers?
-Frequent item sets reside in:
> Baskets of occurrence (e.g. one transaction, one episode of care, one online session).
> Windows of time (e.g. one day, one quarter [of a game]).
-Data may need to be pre-processed to:
> Create containers.
> Find co-occurrences in those containers.
What are the 3 association rule metrics?
=> Support:
-The relative frequency of occurrence of an item set in the container set.
-Filters out rules that are not worth considering further.
=> Confidence:
-Measures the likelihood of occurrence of the right-side of the rule (i.e. consequent) out of all the items in the container that contain the left-side of the rule (i.e. antecedent).
-This is the reliability of the rule.
=> Lift:
-It is like confidence; however, it considers the support of the right-side of the rule too.
-Values closer to 1 indicate non-useful rules, larger lift values indicate more significant rules.
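The three metrics can be sketched as below, assuming each container is a Python set of items. The function names and the toy baskets are invented for illustration.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """support(A and B) / support(A); assumes A occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """confidence(A -> B) / support(B); assumes B occurs at least once."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

# Toy baskets (made up).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]
sup = support({"milk", "bread"}, baskets)      # 2 of 5 baskets
conf = confidence({"milk"}, {"bread"}, baskets)  # 2 of the 3 milk baskets
lft = lift({"milk"}, {"bread"}, baskets)
print(sup, conf, lft)
```

Here the lift of {milk} -> {bread} is only slightly above 1, so on these toy baskets the rule would count as weak.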
What are the 2 steps of the rule generation process?
1. Finding all frequent item sets:
-Look at all possible combinations of items.
-There will be 2^n - 1 item sets in a set of n items.
-Filter out non-important item sets (using support).
2. Generating/extracting rules from frequent item sets:
-Look at all possible rules.
-For a frequent item set with n items, there will be 2^n - 2 rules.
-Filter out rules that are not significant (using confidence or lift).
What is the Apriori algorithm?
-An algorithm used to find frequent item sets more efficiently.
-Makes use of the support of item sets:
> An item set with support larger than a
threshold is frequent.
> Rule 1: if an item set is frequent, then all its
subsets are frequent.
> Rule 2: if an item set is infrequent, then all its
supersets are infrequent.
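A simplified sketch of the Apriori idea, not a reference implementation. It grows candidates one item at a time and uses Rule 2 to skip any candidate that has an infrequent subset. The function name and toy baskets are invented, and the threshold is treated as inclusive (>=) here, which is an assumption.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return {frozenset: support} for all frequent item sets."""
    n = len(baskets)

    def sup(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = sup(itemset)
        k += 1
        known = set(current)
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Rule 2): drop candidates with any infrequent (k-1)-subset.
        candidates = [c for c in candidates
                      if all(frozenset(s) in known for s in combinations(c, k - 1))]
        current = [c for c in candidates if sup(c) >= min_support]
    return frequent

# Toy baskets (made up).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]
frequent = apriori(baskets, min_support=0.4)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

With these baskets the three single items and {milk, bread} are frequent, so no three-item candidate is ever generated or counted.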
How do you generate and extract association rules?
-Generate all 2^n - 2 rules for each frequent item set with n items.
-Use the confidence or lift of the rules to filter out non-significant rules.
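A sketch of the rule extraction step, assuming the item set passed in is already known to be frequent. Every non-empty proper subset becomes an antecedent, which gives 2^n - 2 candidate rules, and the rules are then filtered by confidence. The names, the baskets and the 0.9 threshold are illustrative only.

```python
from itertools import combinations

def confidence(antecedent, consequent, baskets):
    left = set(antecedent)
    both = left | set(consequent)
    n_left = sum(left <= basket for basket in baskets)
    n_both = sum(both <= basket for basket in baskets)
    return n_both / n_left  # assumes the antecedent occurs at least once

def rules_from_itemset(itemset, baskets, min_confidence):
    """Return (antecedent, consequent, confidence) for rules that pass the filter."""
    items = sorted(itemset)
    kept = []
    for size in range(1, len(items)):  # non-empty, proper subsets only
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            conf = confidence(antecedent, consequent, baskets)
            if conf >= min_confidence:
                kept.append((antecedent, consequent, conf))
    return kept

# Toy baskets (made up).
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk"},
    {"eggs"},
]
itemset = {"milk", "bread", "eggs"}
all_rules = rules_from_itemset(itemset, baskets, min_confidence=0.0)
strong_rules = rules_from_itemset(itemset, baskets, min_confidence=0.9)
print(len(all_rules), strong_rules)
```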
Why do you need to have a sequence database, and how do you do it?
-To find sub-sequences that are frequent or significant.
-Makes use of a minimum support threshold (a minimum number of sequences in which a sub-sequence appears) to filter sub-sequences.
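A small sketch of counting the support of a sub-sequence in a sequence database. It assumes a sub-sequence must keep its order but does not have to be contiguous. The function names and the toy web sessions are invented.

```python
def contains_subsequence(sequence, pattern):
    """True if the items of pattern appear in sequence in the same order."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later items of the pattern must appear after earlier ones.
    return all(item in remaining for item in pattern)

def sequence_support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains_subsequence(seq, pattern) for seq in database)

# Toy database of web sessions (made up).
database = [
    ["home", "search", "cart", "pay"],
    ["home", "cart", "search"],
    ["search", "home", "cart"],
]
print(sequence_support(["home", "cart"], database))
print(sequence_support(["search", "cart"], database))
print(sequence_support(["cart", "search"], database))
```

The last two patterns contain the same items but get different counts, because order matters for sequence patterns.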
What is the difference between sequence patterns and association rules?
Similar in that they both look at the frequency of items, but sequence patterns also look at the order of items, i.e. which item comes before and after a particular item (and how often).
What are the applications of association rules and sequence patterns?
=>Several domains:
-Health
-Text analytics
-Web logs
-Education
-Stock market
-Sports: finding sequences of moves or events.
What is Optimization?
- To find the best solution among a number of
feasible solutions to a problem.
- Feasible solutions are correct and satisfy all
necessary conditions.
- The best solution to a problem:
> Can be based on minimisation, e.g.,
minimising the cost of goods transportation.
> Can be based on maximisation, e.g.,
maximising the revenue of a company
What are the 3 main components of Optimisation?
=> Objective function:
- Is the function that needs to be minimized or maximized.
- e.g., cost, revenue, time, ...
=> Constraints:
- Are certain conditions that need to be met while finding the optimal solution.
-e.g., certain number of sales, certain routes, a limited time, etc.
=> Search method:
-Is an algorithm that searches through the feasible solutions to find the optimal solution that will optimize the objective function.
-E.g., local search (Tabu search, hill climbing, ...) and global search (evolutionary search, ...)
What is Hyper-Parameter Optimisation?
=>Finding the optimal set of hyper-parameters that will maximise the performance of a model.
=> Performance Maximisation:
- E.g., minimising error for a regression model.
- E.g., maximising f-measure for a classifier.
- E.g., minimising avg. SSE for a clustering
method.
=> In machine learning:
-Often, a learner model can take several values for its hyper-parameters...
>E.g., the threshold value for the confidence of
a classifier.
>E.g., the number of clusters in a clustering task.
What is Random Search?
-Searches for the optimal solution among all feasible solutions by randomly selecting one solution at a time.
-Is efficient in cases where the set (or the space) of feasible solutions is small.
-Can be efficiently done in a parallel procedure (since feasible solutions are independent from each other).
-The search stops when:
> A certain number of trials have been
completed.
> A certain level of performance has been
reached.
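A minimal sketch of random search with both stopping conditions. The performance function is a made-up stand-in for training and scoring a model, and all names are hypothetical.

```python
import random

def random_search(performance, sample_solution, max_trials=100, target=None, seed=0):
    """Return (best_solution, best_score, trials_used); higher score is better."""
    rng = random.Random(seed)
    best_solution, best_score, trials = None, float("-inf"), 0
    while trials < max_trials:            # stop 1: trial budget used up
        candidate = sample_solution(rng)
        score = performance(candidate)
        trials += 1
        if score > best_score:
            best_solution, best_score = candidate, score
        if target is not None and best_score >= target:
            break                         # stop 2: target performance reached
    return best_solution, best_score, trials

def toy_performance(k):
    # Made-up objective that peaks at k = 7 (e.g. a number of clusters).
    return -(k - 7) ** 2

best_k, best_score, trials = random_search(
    toy_performance, lambda rng: rng.randint(1, 20), max_trials=50, target=0
)
print(best_k, best_score, trials)
```

Because each trial is independent of the others, the loop body could be handed out to parallel workers.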
What is Grid Search?
-Exhaustively searches through the set of feasible solutions to find the optimal solution.
-Discretization and setting boundaries are necessary for real-valued hyper-parameters, so that the set of feasible solutions is finite:
> The search stops after all of the feasible
solutions have been tried.
> The optimal solution will be the one that
results in the best performance.
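A minimal sketch of grid search over a small, discretized grid. The hyper-parameter names, their values and the toy performance function are assumptions for illustration.

```python
from itertools import product

def grid_search(performance, grid):
    """Try every combination in grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = performance(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_performance(params):
    # Made-up objective that peaks at k = 5 and threshold = 0.5.
    return -(params["k"] - 5) ** 2 - (params["threshold"] - 0.5) ** 2

# The real-valued threshold is bounded and discretized into three values.
grid = {"k": [1, 3, 5, 7, 9], "threshold": [0.25, 0.5, 0.75]}
best_params, best_score = grid_search(toy_performance, grid)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (15 here), which is why the method only suits small search spaces.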
What is Gradient Search?
-Starts with a solution and improves it step by step by gradually changing the hyper-parameters.
-The algorithm follows these steps:
1. Start with a random solution and calculate performance.
2. Calculate the gradient (i.e., the change in performance when hyper-parameters are changed by a very small value).
3. Adjust the solution (hyper-parameters) with respect to the gradient.
4. Re-calculate performance with the new solution.
5. Repeat Steps 2 to 4 until no significant performance changes are achieved.
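The five steps can be sketched as below for real-valued hyper-parameters, using a finite-difference estimate of the gradient. The step size, the tolerance and the smooth toy performance function are assumptions. Real model performance is often noisy or non-smooth, so this is only an illustration of the loop.

```python
import random

def gradient_search(performance, start, step=0.1, eps=1e-4, tol=1e-8, max_iter=1000):
    """Maximise performance by following a numerically estimated gradient."""
    solution = list(start)
    score = performance(solution)                      # Step 1
    for _ in range(max_iter):
        gradient = []
        for i in range(len(solution)):                 # Step 2
            nudged = solution.copy()
            nudged[i] += eps
            gradient.append((performance(nudged) - score) / eps)
        new_solution = [x + step * g for x, g in zip(solution, gradient)]  # Step 3
        new_score = performance(new_solution)          # Step 4
        if abs(new_score - score) < tol:               # Step 5: no significant change
            break
        solution, score = new_solution, new_score
    return solution, score

def toy_performance(params):
    # Made-up smooth objective with its maximum at (2, -1).
    x, y = params
    return -(x - 2) ** 2 - (y + 1) ** 2

rng = random.Random(0)
start = [rng.uniform(-5, 5), rng.uniform(-5, 5)]       # random starting solution
best, best_score = gradient_search(toy_performance, start)
print(best, best_score)
```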
How do you select the best search method?
=> Random search:
-Search space is small
-Optimality is relative to a limit or threshold.
-Efficient
=>Grid search:
-Search space is small
-Optimality is absolute (only the absolutely best possible solution is sought)
-Not too efficient
=>Gradient search:
-Search space is large
-Optimality is absolute (only the absolutely best possible solution is sought)
-Efficient
What are some Optimisation Classification Methods?
=> Decision trees:
- E.g., the depth of the tree
- E.g., the minimum number of data points in
each leaf
- E.g., the minimal information gain achieved at
each split
- E.g., the minimal size of a node for further split
- E.g., the confidence cut-off threshold
- E.g., the set of predictors
=> K-NNs:
- E.g., the number of neighbors or K
- E.g., the distance measure/function
- E.g., the confidence cut-off threshold
- E.g., the set of predictors
What are the Objective functions for Optimising Classification Methods?
Decision trees and K-NNs:
- E.g., classification accuracy
- E.g., overall classification f-measure
- E.g., classification f-measure for a particular
class
- E.g., classification Kappa
What are some Optimisation Clustering Methods?
Prototype-based clustering (e.g., k-means):
-Groups the unlabelled points in a data set into clusters such that members of the same cluster are as similar as possible, while members of different clusters are as dissimilar as possible.
- E.g., the number of clusters
- E.g., maximum number of runs (each run with a
different set of random centroids)
- E.g., maximum number of iterations per run
What are some Objective Functions for Optimising Clustering Methods?
=> using Prototype-based clustering (e.g., k-means):
- E.g., avg. SSE
- E.g., Davies Bouldin index
=> The elbow method:
- A heuristic search method
- Can be used when the values of the objective
function have a monotonically
decreasing/increasing trend
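One possible way to locate the elbow, sketched under the assumption that avg. SSE has already been computed for each candidate number of clusters. The rule used here (pick the point where the improvement drops off the most) is just one heuristic among several, and the SSE values are made up.

```python
def elbow_point(ks, sse_values):
    """Return the k where the decrease in SSE slows down the most."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        drop_before = sse_values[i - 1] - sse_values[i]   # gain from reaching this k
        drop_after = sse_values[i] - sse_values[i + 1]    # gain from going one further
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Made-up, monotonically decreasing avg. SSE values for k = 1..6.
ks = [1, 2, 3, 4, 5, 6]
sse_values = [100, 60, 30, 25, 22, 20]
chosen_k = elbow_point(ks, sse_values)
print(chosen_k)
```

A plain minimum would always pick the largest k on a monotonically decreasing curve, which is why a heuristic like this is used instead.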
What is the concept of Optimal Model vs Optimal Outcome?
=> Optimal model (analytics)
Possible objective functions:
- Avg. SSE
- Davies Bouldin
- Accuracy, kappa, etc.
=> Optimal outcome (Business Objectives)
Possible objective functions:
- The gross profit (of a company)
- The time taken (for a specific process)
- The distance (to a particular destination)
How can we optimise product placement in retail?
=> Problem statement
- Find the best product placement options
(among several outlets) for each item to
maximise the sales of the item.
=> Approach
- Find what affects sales of items (looking at the
data that is available)
- Create an optimal predictive model to predict
sales of items
- Find optimal values for what affects sales (the
values that will maximize sales)
- Make relevant evidence-based
recommendations.
=> Stays as is:
> Item_MRP- the maximum retail price of an item.
=> Tractable features:
-Outlet_Type
- Item_Visibility
- Outlet_Size
- Outlet_Location_Type
=> Most significant and tractable:
> Outlet_Type
> The problem becomes: search through the different
Outlet_Types and find the maximal sales using
the predictive model for sales of items.
What are Time Series Models?
-Predictive models on cross-sectional data:
> Use a set of predictors or independent variables
> To predict the class or dependent variable (different from the predictors)
> Regardless of the time dimension
> E.g. use bus route and trip data to predict the type of bus delay
-Time series models: used when there is a time dimension (data are time-stamped).
-Aim: to predict the value/s of an attribute that is/are changing over time.
What is the application of Demand Forecasting?
-A manufacturing company makes wax tapes for gas and oil pipelines.
-Several types of tapes are manufactured
-Considerations:
> Warm weather seasons: spike in demand because of routine pipeline maintenance.
> Recent years: demand has grown because of growth in emerging economies.
> Changes in pricing: result in stockpiling by customers.
> These all mean: there is seasonality and a trend in the demand for the tapes.
-Aim: the company would like to:
> Forecast the demand for the different tape types (monthly, annually)
> Plan for required resources
> Prepare the production line
What are time periods?
Refers to any time interval that is of interest
(E.g. seconds, minutes, years, etc).
What is the horizon?
The time period for which forecasting is done
(E.g. next week, three weeks from now, or two years from now).
What is forecast error?
-It is the difference between the actual value of the attribute and its predicted value at any time t.
-Calculated as: e_t = y_t - F_t, where y_t is the actual value and F_t is the forecast (predicted) value at time t.
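A short worked example of the forecast error, with made-up actual and forecast values. The mean absolute error at the end is a common way to summarise the errors and is an addition here, not part of the card.

```python
# Made-up monthly demand figures and forecasts for the same four periods.
actual = [120, 135, 150, 160]
forecast = [110, 140, 145, 170]

# Forecast error per period: actual value minus forecast value.
errors = [y - f for y, f in zip(actual, forecast)]

# Mean absolute error, so that positive and negative errors do not cancel out.
mae = sum(abs(e) for e in errors) / len(errors)
print(errors, mae)
```

A positive error means the forecast was too low and a negative error means it was too high.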
What are the components of data?
-Seasonality:
> The data have regular and predictable changes that recur over time, with a (relatively) fixed period.
> This is called 'cyclicality' when the data have regular changes that recur over time, but with a 'non-fixed' period
-Trend: The data have a pattern of gradual change over time.
-Random Noise: The data have some normal fluctuations even after the seasonality and trend components have been taken out.
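A rough sketch of separating the three components, assuming they add up (value = trend + seasonality + noise) and that the seasonal period is known. The trend is estimated with a centred moving average. The function name and the toy series are invented, and the series should cover at least two full periods.

```python
def decompose_additive(series, period):
    """Return (trend, seasonal, noise); trend and noise are None at the edges."""
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: half weights at both ends keep the average centred.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    # Seasonality: average detrended value for each position in the period.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    seasonal = [sum(b) / len(b) for b in buckets]
    mean_seasonal = sum(seasonal) / period
    seasonal = [s - mean_seasonal for s in seasonal]
    # Noise: whatever is left after removing trend and seasonality.
    noise = [series[t] - trend[t] - seasonal[t % period] if trend[t] is not None else None
             for t in range(n)]
    return trend, seasonal, noise

# Toy quarterly series (made up): linear trend plus a repeating pattern, no noise.
pattern = [5, -3, -4, 2]
series = [10 + 2 * t + pattern[t % 4] for t in range(12)]
trend, seasonal, noise = decompose_additive(series, period=4)
print(seasonal)
```

Because the toy series was built without noise, the leftover noise component comes out as (approximately) zero.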