Terms in this set (79)
4 Technical Advances that enabled Data Science
1. Storage is more capable
2. Networking (data can be transferred easily)
3. Algorithms - machine learning, pattern recognition, and applied statistics
4. Computing power has increased.
Data-Driven Decision Making (DDD)
the practice of basing decisions on the analysis of data, rather than purely on intuition
Correlated to higher return on assets/equity, asset utilization, and market value
principles, processes, and techniques for understanding phenomena via the automated analysis of data for improved decision making.
Data Mining (or analytics)
the techniques used in data science.
What kinds of problems should we focus on solving with data science?
1. Problems for which "discoveries" need to be made (our churn problem)
2. Decisions that repeat, especially on a massive scale (it is important to improve these)
Data sets that are too large for traditional data processing systems (usually do not fit into RAM memory), and therefore require new processing techniques.
3 V's that characterize Big Data
Big Data Technologies
The technologies used to process and handle big data, including pre-processing prior to applying data mining techniques.
How can you speed up big data mining?
A PROCESS for using information technology to extract useful (non-trivial, hopefully actionable) knowledge from large bodies of data.
CRISP Business Analysis Process
1. Business Understanding - identify the business problem and objectives
2. Data understanding - identifying strengths and limitations of the data, as well as costs and benefits of each data source
3. Data Preparation - data cleansing and conversion
4. Modeling - decide analytic techniques and build a model
5. Evaluation - test the model
6. Deployment - implement the model and realize some return on investment
These can go back through each other. This cycle is iterative and outcomes are uncertain.
Classification or Class Probability Estimation
For each individual in a population, predict which of a small set of classes the individual belongs to.
PREDICTS WHETHER SOMETHING WILL HAPPEN (binary)
What is the probability/score that the individual belongs to each class?
Regression or Value Estimation
a regression procedure produces a model that, given an individual, estimates the value of the particular variable specific to that individual.
PREDICT HOW MUCH SOMETHING WILL HAPPEN
identify similar individuals based on data known about them.
It can be used to find similar entities
Most popular methods for making product recommendations
group individuals in a population together by their similarity, but not driven by any specific purpose
useful in preliminary exploration
a method in which a specific target can be provided. There is data on the target.
Categorical data: classification
Numerical Data: regression
a method in which there is no specific target or target value. Your model just explores the data set, and there is no guarantee that the information will be useful.
Which business analytic task could be supervised or unsupervised?
categorical target, the probability that it will fall into a group.
knowable attributes that correlate with the target of interest (increased accuracy, alleviate computational problems)
a formula/model for estimating the unknown value of the target variable
segment the population into groups that differ from each other with respect to some quality of interest
Information Gain (IG)
splitting criterion based on a purity measure called entropy.
Measures the change in entropy due to any amount of new information being added.
IG = entropy(parent) - [p(c1) * entropy(c1) + p(c2) * entropy(c2) + ...]
measures the general disorder of the set
At its maximum when every class is equally probable
entropy = -p1 * log2(p1) - p2 * log2(p2) - ...
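The entropy and information gain formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the example counts are assumptions made up for this sketch, and each set is described only by its per-class counts.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a set described by its per-class counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:  # a class with no members contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(parent_counts, children_counts):
    """entropy(parent) minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical example: 10 positives and 10 negatives split into two children.
parent = [10, 10]
children = [[8, 2], [2, 8]]
print(entropy(parent))                      # 1.0: a balanced set has maximum disorder
print(information_gain(parent, children))   # positive: the children are purer than the parent
```

A pure set such as `[5, 0]` has entropy 0, so a split that produces only pure children gains the full entropy of the parent.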
Reasons for selecting only a subset of attributes:
1. Better insights and business understanding
2. Better explanations and more tractable models
3. Reduced costs
4. Faster predictions
5. Better predictions!
Multivariate Supervised Segmentation
we select multiple attributes each giving some information gain and put them together.
Tree-structured model rules
1. No two parents share the same descendants
2. There are no cycles
3. The branches always point downward
4. Every example always ends up at a leaf node with some specific class determination.
How do we create a classification tree/decision tree from data [Tree Induction]
take each data subset and recursively apply attribute selection to find the best attribute to partition it.
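The recursive procedure can be sketched roughly as follows. This is a simplified illustration under several assumptions: attributes are categorical, rows are plain dictionaries, information gain is the selection criterion, and recursion stops only when a node is pure or no attributes remain. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split yields the highest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Recursively partition the data; leaves hold the majority class."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    tree = {"attribute": attr, "children": {}}
    for value in set(row[attr] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        tree["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return tree

# Toy data (made up for illustration): "employed" separates the classes perfectly.
rows = [
    {"employed": "yes", "balance": "low"},
    {"employed": "yes", "balance": "high"},
    {"employed": "no", "balance": "low"},
    {"employed": "no", "balance": "high"},
]
labels = ["repay", "repay", "default", "default"]
tree = build_tree(rows, labels, ["employed", "balance"])
print(tree["attribute"])  # employed
```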
When do we stop when creating a classification/decision tree?
When a node is pure or
there are no more variables or
Why are decision trees popular data mining tools?
1. Easy to understand
2. Easy to implement
3. Easy to use
4. Computationally cheap
These are advantages for model comprehensibility, which is important for evaluating the model and communicating it to others.
Probability estimation trees (frequency-based estimation)
basic assumption: each member of a segment corresponding to a tree leaf has the same probability of belonging to the corresponding class
probability of positive instance: n/(n+m)
However, this estimate is prone to overfitting when a leaf contains only a few instances.
Smoothed version of frequency-based estimate,
probability of belonging to class c:
p(c) = (n+1)/(n+m+2)
n = pos
m = neg
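A small sketch comparing the two estimates, with n positives and m negatives at a leaf. The function names are hypothetical; the formulas are the ones given above.

```python
def frequency_estimate(n, m):
    """Raw frequency-based probability of the positive class at a leaf."""
    return n / (n + m)

def smoothed_estimate(n, m):
    """Smoothed estimate: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf with only 2 instances, both positive:
print(frequency_estimate(2, 0))  # 1.0
print(smoothed_estimate(2, 0))   # 0.75

# A leaf with 20 instances, all positive: more evidence, so less smoothing.
print(frequency_estimate(20, 0))
print(smoothed_estimate(20, 0))
```

With few instances the smoothed estimate is pulled toward 0.5; as the leaf grows, it approaches the raw frequency.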
the target takes on discrete values that are not ordered (binary classification)
model predicts a score where a higher score indicates that the model thinks the example is more likely to be in one class
Probability estimation [classification]
model predicts a score between 0 and 1 that is meant to be the probability of being in that class
Key Analytic Techniques for fitting model to data
1. Linear Regression
2. Logistic Regression
3. Support-Vector Machines (SVM)
a simplified representation of reality created for a specific purpose
a formula for estimating the unknown value of interest (the target variable)
(can be a mathematical formula, a logical statement, a rule, etc.)
estimate an unknown value (the target variable)
the creation of models from data
the input data for the induction algorithm (used to find optimal parameters)
the weights (w) of the linear function are the parameters.
The weights are loosely interpreted as importance indicators of the features, because of their impact on the target variable
We can express the model as the weighted sum of the attribute values
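As a minimal sketch, the weighted sum can be written as f(x) = w0 + w1*x1 + w2*x2 + ... The intercept term w0 and all values below are assumptions chosen for illustration.

```python
def linear_model(weights, intercept, features):
    """Weighted sum of the attribute values plus an intercept (w0)."""
    return intercept + sum(w * x for w, x in zip(weights, features))

# Hypothetical weights and feature values:
weights = [1.0, -1.5]
intercept = 0.5
features = [2.0, 1.0]
print(linear_model(weights, intercept, features))  # 1.0
```

For classification, the sign of this value can serve as the predicted class; for regression, the value itself is the estimate.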
What does the "best line" depend on?
The objective (loss) function
Objective function should represent
determines how much penalty should be assigned to an instance based on the error in the model's predicted value.
What is the key difference between linear regression, logistic regression, and support vector machines?
They each use a different objective function
Logistic regression produces...
a numeric estimate (the probability of a class)
The odds of an event
the ratio of the probability of the event occurring, P+(x), to the probability of the event NOT occurring, 1 - P+(x)
We like odds in reference to logistic regression because it puts us in a range of 0 to infinity (only positive values)
Why do we take the log of odds for logistic regression?
This allows us to solve for the probability, which is what we need for a logistic regression (see slide 8 in Week 4_1)
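A short sketch of the relationship between probability, odds, and log-odds. Odds range from 0 to infinity, while log-odds range over all real numbers, so a linear function can model them; inverting the log-odds recovers the probability. Function names are hypothetical.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def log_odds(p):
    """Log-odds (logit): any real number, negative when p < 0.5."""
    return math.log(odds(p))

def probability_from_log_odds(z):
    """Invert the log-odds to recover the probability (logistic function)."""
    return 1 / (1 + math.exp(-z))

print(odds(0.5))                              # 1.0: even odds
print(log_odds(0.5))                          # 0.0
print(probability_from_log_odds(0.0))         # 0.5
print(probability_from_log_odds(log_odds(0.8)))  # approximately 0.8 (round trip)
```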
Objective function for logistic regression
Maximum Likelihood Model
Is logistic regression a regression model?
No, the value of the target variable is categorical, not numerical.
It is a class probability estimation model, and estimates the probability of class membership over a categorical class.
Linear Support Vector Machines (LSVM)
a non-probabilistic binary linear classifier (produces a binary estimate)
Ex: Iris Setosa vs. Iris Versicolor
Hard margin (linearly separable) for SVM function?
Maximum margin classifier
The linear classifier is chosen as the center of the widest bar that fits between the classes, which maximizes the margin between them
Soft Margin (Not linearly separable) for SVM function?
Hinge loss function
use maximum margin with hinge loss function
-Maximize margin and minimize mistakes
-Hinge loss incurs no penalty for an example that is NOT on the wrong side of the margin.
When does a Hinge loss become positive?
When an example is on the wrong side of the boundary and beyond the margin. It increases linearly with the example's distance from the margin (penalizes points more the farther they are from the boundary)
See picture in slide 14 of W4_1
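A minimal sketch of hinge loss in its common textbook form, max(0, 1 - y * f(x)), assuming labels y of +1 or -1 and a model score f(x) whose sign gives the predicted class. This formulation is an assumption and may differ in detail from the course slides.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {+1, -1} and a real-valued model score."""
    return max(0.0, 1.0 - y * score)

# Correct side, outside the margin: no penalty.
print(hinge_loss(+1, 2.0))   # 0.0
# Wrong side of the boundary: penalty grows linearly with distance.
print(hinge_loss(+1, -1.0))  # 2.0
print(hinge_loss(+1, -3.0))  # 4.0
```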
Zero-one loss (loss function)
assigns a loss of zero for a correct decision and one for an incorrect decision
Squared error (loss function)
specifies a loss proportional to the square of the distance from the boundary.
Usually used for numeric value prediction (regression) rather than classification
Greatly penalizes gross outliers.
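The two loss functions above can be sketched as follows; the function names and example numbers are made up for illustration.

```python
def zero_one_loss(actual, predicted):
    """0 for a correct decision, 1 for an incorrect one."""
    return 0 if actual == predicted else 1

def squared_error(actual, predicted):
    """Square of the difference between the actual and predicted values."""
    return (actual - predicted) ** 2

print(zero_one_loss("churn", "churn"))  # 0
print(zero_one_loss("churn", "stay"))   # 1

# An error 10 times larger is penalized 100 times more:
print(squared_error(5.0, 4.0))    # 1.0
print(squared_error(5.0, -5.0))   # 100.0
```

The last two lines show why squared error punishes gross outliers so heavily.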
the property of a model or modeling process, whereby the model is applied to data that were not used to build the model
(applying the model built on the training data to test data)
finding chance occurrences in the data that look like interesting patterns, but which do not generalize
the most extreme overfitting procedure (only looks at training data and does not generalize).
There is a trade off between over-fitting an
Overfitting vs. Generalization
the tendency of data-mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
The higher the complexity...
the more overfitting occurs
Holdout Validation
"lab test" of generalization performance. You hold out some of the data for which you know the value of the target variable, but which will not be used to build the model
accuracy on training data
Accuracy on test data
shows the generalization performance as well as the performance on training data, plotted against model complexity (# of nodes)
can get some statistics on estimated performance such as mean and variance.
computes its estimates over all the data. Splits the data into k folds and builds a different model each time, holding a different fold out as the test set. Tests the performance of each model on its held-out fold.
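The k-fold procedure can be sketched roughly like this. It is a simplified illustration: the data are assumed to be already shuffled, and `train_and_score` is a hypothetical callback that would build a model on the training indices and return its performance on the test indices.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_examples, k, train_and_score):
    """Return the mean and variance of the k per-fold scores."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_splits(n_examples, k)]
    mean = sum(scores) / k
    variance = sum((s - mean) ** 2 for s in scores) / k
    return mean, variance

# Each example appears in exactly one test fold:
for train, test in k_fold_splits(6, 3):
    print(train, test)
```

Because every fold yields its own score, the mean and variance of those scores summarize the estimated generalization performance.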
accuracy vs. training instances (sample size)
Eventually, the impact of increased training set will
For smaller training-set size, which technique yields better generalization accuracy?
Logistic regression (over tree induction)
Why is tree induction better for larger data sets?
They are more flexible and can better represent non-linear data.
takes a fully-grown decision tree and discards unreliable parts
stops growing a branch when information becomes unreliable.
complexity control of linear models. Optimizes some combination of fit and simplicity.
Places penalties on complexity
the sum of the squares of the weights
L2-norm penalty + standard least-squares linear regression
the sum of the absolute values of the weights
L1-norm penalty + standard least-squares linear regression
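A minimal sketch of the two penalized objectives, commonly known as ridge (L2) and lasso (L1) regression. The penalty strength `lam`, the function names, and the toy data are assumptions introduced for this illustration; the sketch only evaluates the objective and does not fit the weights.

```python
def sum_squared_errors(weights, intercept, X, y):
    """Standard least-squares fit term for a linear model."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = intercept + sum(w * x for w, x in zip(weights, row))
        total += (target - prediction) ** 2
    return total

def l2_penalty(weights):
    """Sum of the squares of the weights."""
    return sum(w ** 2 for w in weights)

def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)

def ridge_objective(weights, intercept, X, y, lam):
    """Least-squares error plus lam times the L2 penalty."""
    return sum_squared_errors(weights, intercept, X, y) + lam * l2_penalty(weights)

def lasso_objective(weights, intercept, X, y, lam):
    """Least-squares error plus lam times the L1 penalty."""
    return sum_squared_errors(weights, intercept, X, y) + lam * l1_penalty(weights)

weights = [3.0, -4.0]
print(l2_penalty(weights))  # 25.0
print(l1_penalty(weights))  # 7.0
```

A larger `lam` favors simpler models (smaller weights) at the cost of fit to the training data.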