Upgrade to remove ads
Data Mining Mid Term
Terms in this set (110)
Data Mining Process Outline
Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Which technique would you answer the the following question with: "Who are the most profitable customers?"
Which technique would you answer the the following question with: "Is there really a difference between profitable customers and the
average customer? "
Statistical Hypothesis Testing
Which technique would you answer the the following question with: "But who really are these customers? Can I characterize them?"
OLAP (manual search), Data mining (automated pattern finding)
Which technique would you answer the the following question with: "Will some particular new customer be profitable? How much revenue should I expect this customer to generate?"
Data mining (predictive modeling)
What is a model?
A simplified representation of reality created to serve a purpose
What is a predictive model?
A formula for estimating the unknown value of interest: the target. The formula can be mathematical, logical statement (e.g. rule), etc.
What is prediction?
Estimate an unknown value (i.e. the target)
What is an instance or example?
Represents a fact or a data point. Described by a set of attributes (fields, columns, variables, or features)
What is model induction?
The creation of models from data.
What is Training Data?
The input data for the induction algorithm.
How do you calculate the dimensionality of a dataset?
You take the sum of the number of numeric features and the number of values of categorical features
What is an example of classification and class probability estimation?
How likely is this consumer to respond to our campaign?
What is an example of regression analysis?
How much will she use the service?
What is an example of similarity matching?
Can we find consumers similar to my best customers?
What is an example of clustering?
Do my customers form natural groups?
What is an example of Co-occurrence Grouping?
What items are commonly purchased together?
What data mining task describes the following tasks: Frequent itemset mining, association rule discovery, and market-basket analysis.
What does profiling or behavioral description ask?
What does "normal behavior" look like? (for example, as baseline to detect fraud)
What does data reduction ask?
Which latent dimensions describe the consumer taste preferences?
Describe how link prediction could be used?
Since John and Jane share 2 friends, should John become Jane's friend?
What is an essential question causal modeling is used to answer?
Why are my customers leaving?
Classification:Supervised or Unsupervised?
Regression::Supervised or Unsupervised?
Causal Modeling::Supervised or Unsupervised?
Similarity Matching:Supervised or Unsupervised?
Link Prediction:Supervised or Unsupervised?
Data Reduction:Supervised or Unsupervised?
Clustering:Supervised or Unsupervised?
Co-occurrence Grouping:Supervised or Unsupervised?
Profiling:Supervised or Unsupervised?
What is the minimum data threshold for supervised data mining?
a min ~500 of each type of classification
What is leakage in data mining?
leakage in data mining (henceforth, leakage) is essentially the introduction of information about the target of a data mining problem, which should not be legitimately available to mine from.
What is a common example of leakage in data mining?
a model that uses the target itself as an input, thus concluding for example that it rains on rainy days
What kind of a target does classification have?
Categorical and often binary
What kind of target does regression mining have?
What is an example of a target proxy sample?
I do not see if a person after seeing an ad bought the book, so lets
model clicks ...
What is an example of a sample proxy?
I want to run a campaign in Spain but only have data on US customers
Survivorship bias is the logical error of concentrating on the people or things that "survived" some process and inadvertently overlooking those that did not because of their lack of visibility.
What is supervised segmentation?
When you segment the population into groups that differ from each with respect to some quantity of interest?
How do you decide how to partition data set when performing supervised segmentation?
partition the customers into subgroups that are less impure - with respect to the class (i.e., such that in each group as many instances as possible belong to the
What is the most common measurement of purity in supervised segmentation?
most common splitting criterion is called information gain
What is the formula for entropy?
𝑒ntrop𝑦 = −p1 log2 p1 − p2 log2 p2 − ...
What does entropy measure?
Measures the general disorder of a set.
How do you calculate information gain?
Entropy Parent - ( probability of child1 x entropy of child1 + probability of child2 x entropy of child2 etc...)
if a leaf contains n positive instances and m negative instances, what is the estimation equation for calculating any new instance being positive?
What is the formula for laplace correction?
p(c) = n+1/n+m+2
N is the number of examples in the leaf belonging to class 𝑐, and m is the
number of examples not belonging to class 𝑐.
What are the 3 different types of classification evaluation procedure?
1. Classifier model: Model predicts the same set of discrete value as the data had
2. Ranking: Model predicts a score where a higher score indicates that the model think the example to be more likely to be in one class
3. Probability estimation: Model predicts a score between 0 and 1 that is meant to be the probability of being in that class
Can a cost/benefit of a ranking classification have a variable solution?
No. cost/benefit is constant, unknown, or difficult to calculate
Can a cost/benefit of a probability classification have a variable solution?
Yes! cost/benefit is not constant across examples and known relatively precisely
Which strategy simplifies a decision tree to prevent over-fitting to noise in the data?
What are pre and post decision tree pruning and which is preferred?
Post-pruning: takes a fully-grown decision tree and discards unreliable parts
Pre-pruning: stops growing a branch when information becomes unreliable
Post-pruning preferred in practice
What type of graph is the following link called? http://imgur.com/PuylZyG
What type of graph is the following link called?http://imgur.com/Ip23GIN
What type of graph is the following link called?
What does it mean to have a parameterized model in a classification function?
the weights of the linear function are the parameters
What is the importance indicator for a classification function?
The weights are often loosely interpreted as importance indicators of the features
What is the distinction between classification and regression classification models?
whether the value for the target variable is categorical or numeric logistic regression model produces a numeric estimate, the values of the target variable in the data are categorical.
How does logistic regression estimate class membership?
Logistic regression is estimating the probability of class membership (a numeric quantity) over a categorical class
Is logistic regression a regression model?
No. Logistic regression is a class probability estimation model and not a regression model
What type of graph is the following link called?
Logistic regression ("sigmoid") curve
How does SVM hinge loss penalization work?
Hinge loss incurs no penalty for an example that is not on the wrong side of the margin
How does a Zero-one loss function work?
Zero-one loss assigns a loss of zero for a correct decision and one for an incorrect decision.
How does a Squared error loss function work?
specifies a loss proportional to the square of the distance from the boundary
What is a Squared error loss function usually used for?
numeric value prediction (regression), rather than classification
How does a square error loss function treat predictions that are wrong?
The squaring of the error has the effect of greatly penalizing predictions that are grossly wrong
What does it mean for a SVM to have a "polynomial kernel"?
Non-linear SVM consider "higher-order" combinations of the original features such as squared features, products of features, etc
What is a neural network?
Think of a neural network as a "stack" of models. On the bottom of the stack are the original features each layer in the stack applies a simple model to the outputs of the previous layer.
What do you worry about when deciding between Linear Models versus Tree Induction?
How "smooth" is the underlying phenomenon being modeled?
trees need a lot of data to approximate curved boundaries
How "non-linear" is the underlying phenomenon being modeled?
if very, much "data engineering" needed to apply linear models
How much data do you have?!
there is a key tradeoff between the complexity that can be modeled and the amount of training data available
What are the characteristics of the data: missing values, types of
variables, relationships between them, how many are irrelevant, etc. trees fairly robust to these complications
What is the holdout set for final evaluation called?
What is accuracy on training data called?
called "in-sample" accuracy, vs. "out-of-sample" accuracy on test data
For smaller training-set sizes, what is the best modeling tool?
For smaller training-set sizes, logistic regression yields better generalization accuracy than tree induction for smaller data, tree induction will tend to over-fit more
Which is more flexible, linear regression or decision trees?
Classification trees are a more flexible model representation than linear logistic regression
What is a learning curve?
A learning curve shows the generalization performance plotted
against the amount of training data used
What is a fitting graph?
A fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model
What graph would you use for a fixed amount of training data?
What is the L2 Norm?
the sum of the squares of the weights
What is ridge regression?
L2-norm + standard least-squares linear regression = ridge regression
What is the L1-norm?
the sum of the absolute values of the weights
What is lasso?
L1-norm + standard least-squares linear regression = lasso
How does nested cross validation work?
What is the formula for precision?
What is the formula for recall?
How do you calculate f-measure?
F-measure = 2 × (precision*recall/precision+recall)
What does the expected value framework decompose data-analytic thinking into?
the structure of the problem, the elements of the analysis that can be extracted from the data, and
the elements of the analysis that need to be acquired from other sources
How do you calculate the expected value framework formula?
Expected benefit of targeting = pr(x)
vr + [1 - pr(x)]
Vnr > 0
vr = benefit of responding, vnr = benefit of not responding
You subtract if there is a negative affect for no response
vr - [1 - pr(x)]
Vnr > 0
vr > [1 - pr(x)]
What is the rule of basic probability?
P(x,y) = p(y) * p(x|y)
What is the expected profit formula?
b(y,n) = expected profit
What are class priors?
the proportion of positive and negative instances in the target population, commonly used in profit curves
What value can the area under the ROC curve have?
The area under a classifier's curve expressed as a fraction of the unit square Its value ranges from zero to one
What does the AUC tell us?
Both are equivalent to the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance
What are other names for the AUC?
ROC Area = Mann-Whitney-Wilcoxon = Gini Coefficient
What does the cumulative response curve measure?
Positives targeted vs percent of test instances
Does a ROC graph predict proportions or class
costs and benefits?
No. ROC graphs are independent of the class proportions as well as the costs and benefits
Will this customer purchase service 𝑆1 if given incentive 𝐼1? What type of problem is this? What target?
Classification problem. Binary target (the customer either purchases or does not)
"Which service package (𝑆1, 𝑆2, or none) will a customer likely purchase if given incentive 𝐼1?"
What type of problem is this? What target?
Classification problem Three-valued target
Classify the type of problem the following is? How much will this customer use the service?
Regression problem Numeric target Target variable: amount of usage per customer
What are the reasons for only selecting a subset of attributes to analyze?
Better insights and business understanding
Better explanations and more tractable models
argOver-fitting (to be continued..)
Which curve tells us if we should invest in data or not?
What is the definition of lift?
The definition of lift is how much better a classifier performs than the random performance
What is a better measurement of overfitting, a fitting graph or a lift curve?
fitting graph is a much better measurement of overfitting than the lift curve
What is plotted, x, y for the following curve: profit curve?
classifiers vs random, profit, percentage of instances
What is plotted, x, y for the following curve: ROC Graph?
True Positive Rate, False Positive Rate
What is plotted, x, y for the following curve: cumulative response curve?
Classifiers, Percentage of positives targeted, percentage of test instances
What is plotted, x, y for the following curve: lift curve?
Classifiers, Percentage of test instances, Lift (how much better a classifier performs than the random performance)
how do you measure impurity for numeric values?
What is a classifier that always selects the majority class?
Base rate classifier
How do you calculate true positive rate?
True Positive Rate = Recall = TP/(TP+FP)
How do you calculate sensitivity?
Sensitivity = True Negative Rate = TN/(TN+FP)
How do you calculate specificity?
THIS SET IS OFTEN IN FOLDERS WITH...
Machine Learning - From CourseraTeam
YOU MIGHT ALSO LIKE...
CIS 375 Quiz 1
Data mining quiz 1