CIS 375 Quiz 1
Choose customers who are most likely to respond to an online ad
Discover rules that indicate when an account has been defrauded
Find patterns indicating what customer behavior is more likely to respond to an online ad
Estimate probability of default for a credit application (specific customer)
Set of principles, concepts, and techniques that structure thinking and analysis of data
_________________ extracts useful information and knowledge from large volumes of data by following a process with reasonably well defined steps
Volume, Variety, Powerful Computers, More efficient algorithms
Structured numeric data
Data in an Excel spreadsheet table
Un-structured Textual Data
Un-structured visual data
Data mining process:
business understanding, data understanding, data preparation, modeling, evaluation, deployment
Understand or create the business problem
Understand the strengths and limitations of the data
Convert the data into the right format
Apply data mining techniques to the data
Assess the performance of the data mining results
Put into real use to realize ROI
Example of clear, pre-packaged business project
Example: Customer churn prediction (Apple/Verizon)
A clear, pre-packaged business project can utilize a:
Ambiguous data mining problem
eBook reading patterns
A simplified representation of reality created to serve a purpose
Estimate an unknown value (i.e. the target)
A formula for estimating the unknown value of interest: the target
The formula for a predictive model can be _____________
mathematical or logical statement (a rule)
An _________________ can represent a fact or a point
An instance/example can be described by:
A set of Attributes
fields, columns, variables, features
An instance/example can contain a ___________ feature collection, otherwise known as ____________
Feature vector, collection of feature values
Creation of models from data
the input data for the induction algorithm
The data for model evaluation
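The two datasets described above (one fed to the induction algorithm, one held back for evaluation) can be sketched as a simple holdout split. This is a minimal sketch: the function name `holdout_split`, the 30% test fraction, and the seed are assumptions for illustration, not part of the course material.

```python
import random

def holdout_split(instances, test_fraction=0.3, seed=0):
    """Shuffle the instances and return (training_set, test_set)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled instances are held out for evaluation.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data: ten customer IDs
training_set, test_set = holdout_split(range(10))
print(len(training_set), len(test_set))
```

The model is induced from `training_set` only; `test_set` is used only to assess performance.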
Numeric or Categorical
Anything that has some order
Stuff that does not have an order
Numeric Feature examples
Numbers, dates, dimensions of 1
Categorical Feature Examples
Binary, text; dimensions = (# of possible values) - 1
How do you find the dimensionality of the dataset?
The sum of the number of numeric features and the number of values of categorical features less one
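A minimal sketch of that count, assuming "less one" applies to each categorical feature separately (one dimension per numeric feature, number of possible values minus one per categorical feature). The feature names and values are hypothetical.

```python
def dataset_dimensionality(numeric_features, categorical_values):
    """Count dimensions: 1 per numeric feature, (k - 1) per categorical
    feature with k possible values."""
    numeric_dims = len(numeric_features)
    categorical_dims = sum(len(set(values)) - 1
                           for values in categorical_values.values())
    return numeric_dims + categorical_dims

# Hypothetical dataset: 2 numeric features, one 3-valued and one binary feature
dims = dataset_dimensionality(
    ["age", "income"],
    {"plan": ["basic", "plus", "premium"], "churned_before": ["yes", "no"]},
)
print(dims)  # 2 + (3 - 1) + (2 - 1) = 5
```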
(True/False): When conducting supervised data mining, the values of all attributes except the target variable are known when the model is used
True or false: Choosing which customers are most likely to leave is an example of the use of DM results
True or false: Finding the characteristics that differentiate my most profitable customers from my less profitable customers is an example of an unsupervised learning task
Predictive model of customer churn steps
1. Define churn
2. Test set/ Training set
5. Model performance
6. Unknown data
8. Retention or no campaign
What is the result of supervised data mining
A model that predicts some quantity
Subclasses of supervised data mining:
9 Common data mining tasks (1-4)
1. Classification and class probability estimation
3. Similarity Matching
9 Common data mining tasks (5-9)
5. Co-occurrence grouping
7. Data Reduction
8. Link Prediction
9. Causal Modeling
3 UNSUPERVISED common data mining tasks
What are the 3 supervised common data mining tasks
What are the 3 unsupervised and supervised common data mining tasks
When should we stop the classification tree from growing
1. Nodes = pure
2. No more variables
3. Information gain from adding an extra variable/node < a set threshold level
Tree Structured Model Rules
1. No 2 parents share descendants
2. There are no cycles
3. The branches always point downwards
4. Every example ends up at a leaf node with some specific class determination
How do we create a classification tree from data? (2)
1. Divide and conquer approach
2. Take each data subset and RECURSIVELY apply attribute selection to find the best attribute to partition it
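The two steps above can be sketched as a small recursive function. This is an illustration under assumptions, not the course's algorithm: attributes are categorical, information gain (entropy-based) is the selection criterion, growth stops on a pure node, no remaining attributes, or gain below a threshold, and the names `build_tree`, `best_attribute`, `predict`, and the toy churn data are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Return (attribute, gain) for the attribute with the highest gain."""
    base = entropy(labels)
    best, best_gain = None, 0.0
    for attr in attributes:
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = attr, gain
    return best, best_gain

def build_tree(rows, labels, attributes, min_gain=1e-9):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: node is pure, or there are no more variables.
    if len(set(labels)) == 1 or not attributes:
        return majority
    attr, gain = best_attribute(rows, labels, attributes)
    # Stop: information gain is below the threshold.
    if attr is None or gain < min_gain:
        return majority
    remaining = [a for a in attributes if a != attr]
    partitions = {}
    for row, label in zip(rows, labels):
        part_rows, part_labels = partitions.setdefault(row[attr], ([], []))
        part_rows.append(row)
        part_labels.append(label)
    # Divide and conquer: recurse on each data subset.
    children = {value: build_tree(r, l, remaining, min_gain)
                for value, (r, l) in partitions.items()}
    return {"attribute": attr, "children": children, "default": majority}

def predict(tree, row):
    """Walk from the root down to a leaf and return its class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attribute"]], tree["default"])
    return tree

# Hypothetical toy data
rows = [
    {"contract": "monthly", "support_calls": "many"},
    {"contract": "monthly", "support_calls": "few"},
    {"contract": "yearly", "support_calls": "many"},
    {"contract": "yearly", "support_calls": "few"},
]
labels = ["churn", "churn", "stay", "stay"]
tree = build_tree(rows, labels, ["contract", "support_calls"])
print(tree["attribute"])
print(predict(tree, {"contract": "yearly", "support_calls": "many"}))
```

In this toy data `contract` separates the classes perfectly, so it is chosen at the root and both children are pure leaves.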
Recursive partitioning purpose
Smaller and smaller rectangular regions. Divide the entire X space to be as "homogeneous" or "pure" as possible
Containing points that belong to just one class
Splitting is done to reduce ______________
Information Gain measures:
The change in entropy due to any amount of new information being added
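A minimal sketch of that change in entropy, using the standard form: parent entropy minus the size-weighted average entropy of the children produced by a split. The function names and the example labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the weighted average entropy of the children."""
    total = len(parent_labels)
    weighted_children = sum(len(group) / total * entropy(group)
                            for group in child_label_groups)
    return entropy(parent_labels) - weighted_children

# Hypothetical split of 10 customers into two segments
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain = information_gain(parent, children)
print(round(gain, 3))
```

A split that yields purer children gives a larger gain; a split that leaves the class mix unchanged gives a gain of zero.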
How can we segment the population into groups that differ from each other with respect to some quantity of interest?
Informative/knowable attributes that correlate with the target variable
What's the most important splitting criterion?
What is information gain based on?
Purity measure called Entropy
Entropy is 0 at __________
Maximal disorder. Equally mixed properties.
The higher the entropy, the ___________
more information content
What is the entropy of a group in which all examples belong to the same class?
Entropy = -1 × log(1) = 0. Not good for learning from the training set
.99 value is pure or impure
Supervised vs. Unsupervised
Supervised: Arranging a fruit basket with known variables and differences (size, color, shape, fruit name). Unsupervised: Arranging a fruit basket into groups with no knowledge of the fruits, their attributes, or their potential grouping categories
Predictive or directed
Descriptive or undirected
"How likely is the consumer to (or will he or she) respond to our campaign?" -> What kind of DM task is this?
"How much will she use this service?" -> What kind of DM task is this?
"Can we find consumers similar to my best customers?" -> What kind of DM task is this?
"Do my customers form natural groups?" -> What kind of DM task is this?
"What items are commonly purchased together?" -> What kind of DM task is this?
"What does "normal behavior" look like? (for example, as baseline to detect fraud)" -> What kind of DM task is this?
"Which latent dimensions describe the consumer taste preferences?" -> What kind of DM task is this?
"Since John and Jane share 2 friends, should John become Jane's friend"? -> What kind of DM task is this?
"Why are my customers leaving?"-> What kind of DM task is this?
To avoid bad habits:
The training sample should be as similar as possible to the USE data
True/False: In a classification tree induction, the next attribute added is the one with the largest increase in entropy
Measures the change in entropy due to any amount of new information being added
The most common splitting criterion is called:
Entropy at minimal disorder means:
Zero - The set has the same members with the same, single property
Entropy at maximal disorder:
1 (for two classes): the properties are equally mixed
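The two extremes can be checked with a short sketch of the entropy calculation, assuming base-2 logarithms and two classes. The labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

pure = entropy(["churn"] * 10)                    # minimal disorder
mixed = entropy(["churn"] * 5 + ["stay"] * 5)     # maximal disorder, two classes
nearly_pure = entropy(["churn"] * 99 + ["stay"])  # 99% one class
print(pure, mixed, round(nearly_pure, 3))
```

A set that is 99% one class has entropy close to zero, so it is nearly pure.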
Reasons for selecting only a subset of attributes:
Better insights, more traceable models, Faster/better predictions, avoid over fitting
How mixed up the classes are
Horizontal and vertical decision boundaries that we use to partition instance space into similar regions.
Purpose of creating homogeneous regions:
To predict the target variable of a new, unseen instance by determining which segment/region it belongs to
A new decision boundary is a straight line, but is not perpendicular to the axes. It is a weighted sum of the values of the various attributes
Classification Function vs. decision trees:
The difference from DTs is that the method for taking multiple attributes into account is to create a mathematical function of them.
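A minimal sketch of such a function, assuming a linear form: an intercept plus a weighted sum of the attribute values, with the class decided by the sign of the result. The weights, intercept, and class names below are made up for illustration; in practice they would be fit to data.

```python
def linear_score(weights, intercept, features):
    """f(x) = intercept + w1*x1 + w2*x2 + ..."""
    return intercept + sum(w * x for w, x in zip(weights, features))

def classify(weights, intercept, features):
    """The boundary f(x) = 0 is a straight line that need not be axis-parallel."""
    return "positive" if linear_score(weights, intercept, features) > 0 else "negative"

# Hypothetical weights on two attributes
weights, intercept = [2.0, -1.0], -1.0
print(classify(weights, intercept, [1.0, 0.5]))  # score = 0.5
print(classify(weights, intercept, [0.0, 1.0]))  # score = -2.0
```

Unlike a tree split, which tests one attribute at a time, every attribute contributes to the score at once.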
Entropy progression from classification tree:
High -> Low
How do we create a classification tree from data?
"divide and conquer", Take each data subset and RECURSIVELY apply attribute selection to find the best attribute to partition with
Classification Function in context of the parameterized model:
The data mining is going to "fit" the parameterized model to a particular dataset, to find a good set of weights on the features.
A classification tree is equivalent to this
Entropy progression from Root -> Branches -> Leaf node
High -> low
Probability estimation trees, Basic assumption
Each member of a segment corresponding to a tree leaf has the SAME probability of belonging to the corresponding class
How do we resolve small samples for tree-based probability estimation?
Smoothing; Laplace correction
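A minimal sketch of the Laplace correction at a leaf, assuming the common form (n_class + 1) / (n_total + number_of_classes), compared with the raw frequency estimate. The function name and the leaf counts are hypothetical.

```python
def leaf_probability(n_class, n_total, n_classes=2, laplace=True):
    """Estimate the class probability for every member of a leaf."""
    if laplace:
        # Add one pseudo-count per class so small leaves are pulled
        # away from extreme estimates of 0 or 1.
        return (n_class + 1) / (n_total + n_classes)
    return n_class / n_total

# Hypothetical leaves: both contain only positive examples
small_raw = leaf_probability(2, 2, laplace=False)
small_smoothed = leaf_probability(2, 2)
large_smoothed = leaf_probability(20, 20)
print(small_raw, small_smoothed, round(large_smoothed, 3))
```

The correction matters most for small leaves: two positives out of two gives 0.75 rather than 1.0, while twenty out of twenty stays close to 1.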
3 Different Solutions to Classification
Classifier model, ranking, probability estimation