CIS 375 Quiz 1
Terms in this set (96)
Choose customers who are most likely to respond to an on-line ad
Use
Discover rules that indicate when an account has been defrauded
DM
Find patterns indicating what customer behavior is more likely to respond to an online ad
DM
Estimate probability of default for a credit application (specific customer)
Use
Data Mining
Set of principles, concepts, and techniques that structure thinking and analysis of data
_________________ extracts useful information and knowledge from large volumes of data by following a process with reasonably well defined steps
Data Mining
Data Opportunities
Volume, Variety, Powerful Computers, More efficient algorithms
Structured numeric data
Data in an excel spreadsheet table
Un-structured Textual Data
Yelp contents
Un-structured visual data
Flickr contents
Data mining process:
business understanding, data understanding, data preparation, modeling, evaluation, deployment
Business Understanding
Understand or create the business problem
Data Understanding
Understand the strength and limitation of data
Data preparation
Convert the data into the right format
Modeling
Apply data mining techniques to the data
Evaluation
Assess the performance of the data mining results
Deployment
Put into real use to realize ROI
Example of clear, pre-packaged business project
Example: Customer churn prediction (apple/verizon)
A clear, pre-packaged business project can utilize a:
Classification tree
Ambiguous data mining problem
eBook reading patterns
Model
A simplified representation of reality created to serve a purpose
Prediction
Estimate an unknown value (i.e. the target)
Predictive Model
A formula for estimating the unknown value of interest: the target
The formula for a predictive model can be _____________
mathematical or logical statement (a rule)
An _________________ can represent a fact or a point
Instance/Example
An instance/Example can be described by:
A set of Attributes
Attributes
fields, columns, variables, features
An instance/example can contain a ___________ feature collection, otherwise known as ____________
Feature vector, collection of feature values
Model induction
Creation of models from data
Training data
the input data for the induction algorithm
Validation data
The data for model evaluation
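A minimal sketch of separating training data from validation data, assuming a simple random holdout split. The function name `holdout_split`, the 25% fraction, and the seed are illustrative assumptions, not something this set prescribes.

```python
import random

def holdout_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for validation."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_val = int(len(shuffled) * validation_fraction)
    # Training data feeds the induction algorithm; validation data evaluates it.
    return shuffled[n_val:], shuffled[:n_val]

train, validation = holdout_split(list(range(20)))
print(len(train), len(validation))
```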
Feature types
Numeric or Categorical
Numeric
Anything that has some order
Categorical
Stuff that does not have an order
Numeric Feature examples
Numbers, dates, dimensions of 1
Categorical Feature Examples
Binary, text; dimensions = number of possible values minus 1
How do you find the dimensionality of the dataset?
The sum of the number of numeric features and, for each categorical feature, its number of values less one
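The counting rule can be sketched as follows, assuming a hypothetical schema that maps each feature name either to the string "numeric" or to a list of its categorical values. The feature names are made up for illustration.

```python
def dataset_dimensionality(schema):
    """Numeric features count as 1; a categorical feature with k values counts as k - 1."""
    total = 0
    for name, kind in schema.items():
        if kind == "numeric":
            total += 1
        else:
            total += len(kind) - 1
    return total

# Hypothetical schema: 2 numeric features, one binary, one with 4 values.
schema = {
    "age": "numeric",
    "income": "numeric",
    "employed": ["yes", "no"],
    "region": ["north", "south", "east", "west"],
}
print(dataset_dimensionality(schema))  # 1 + 1 + (2 - 1) + (4 - 1) = 6
```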
(True/False): When conducting supervised data mining, the values of all attributes except the target variable are known when the model is used
True
True or false: Choosing which customers are most likely to leave is an example of the use of DM results
True
True or false: Finding the characteristics that differentiate my most profitable customers from my less profitable customers is an example of an unsupervised learning task
False
Predictive model of customer churn steps
1. Define churn
2. Test set/ Training set
3/4. Model
5. Model performance
6. Unknown data
7. Predictions
8. Retention or no campaign
What is the result of supervised data mining
A model that predicts some quantity
Subclasses of supervised data mining:
Classification- Categorical
Regression- Numerical
9 Common data mining tasks (1-4)
1. Classification and class probability estimation
2. Regression
3. Similarity Matching
4. Clustering
9 Common data mining tasks (5-9)
5. Co-occurrence grouping
6. Profiling
7. Data Reduction
8. Link Prediction
9. Causal Modeling
3 UNSUPERVISED common data mining tasks
Clustering
Co-occurrence grouping
Profiling
What are the 3 supervised common data mining tasks
Classification
Regression
Causal modeling
What are the 3 unsupervised and supervised common data mining tasks
Similarity matching
Link prediction
Data Reduction
When should we stop the classification tree from growing
1. Nodes = pure
2. No more variables
3. Information gain from adding an extra variable/node < a set threshold level
Tree Structured Model Rules
1. No 2 parents share descendants
2. There are no cycles
3. The branches always point downwards
4. Every example ends up at a leaf node with some specific class determination
How do we create a classification tree from data? (2)
1. Divide and conquer approach
2. Take each data subset and RECURSIVELY apply attribute selection to find the best attribute to partition it
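A minimal sketch of divide-and-conquer tree induction, assuming categorical attributes, rows stored as dictionaries, and information gain as the selection measure. The function names, the `min_gain` threshold, and the tiny churn dataset are illustrative assumptions. The three stopping conditions from this set (pure node, no more variables, gain below a threshold) appear as comments.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Parent entropy minus the size-weighted entropy of the children."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - weighted

def build_tree(rows, attrs, target, min_gain=1e-9):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop 1: the node is pure. Stop 2: no more variables.
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    # Stop 3: the best available split gains less than the threshold.
    if info_gain(rows, best, target) < min_gain:
        return majority
    remaining = [a for a in attrs if a != best]
    subsets = {}
    for r in rows:
        subsets.setdefault(r[best], []).append(r)
    # Recursively apply attribute selection to each data subset.
    return {best: {v: build_tree(s, remaining, target, min_gain)
                   for v, s in subsets.items()}}

rows = [
    {"contract": "monthly", "support_calls": "many", "churn": "yes"},
    {"contract": "monthly", "support_calls": "few", "churn": "yes"},
    {"contract": "yearly", "support_calls": "many", "churn": "no"},
    {"contract": "yearly", "support_calls": "few", "churn": "no"},
]
tree = build_tree(rows, ["contract", "support_calls"], "churn")
print(tree)
```

In this toy data, `contract` separates the classes perfectly, so the recursion stops after one split because both children are pure.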
Recursive partitioning purpose
Split the data into smaller and smaller rectangular regions, dividing the entire x space so that each region is as "homogeneous" or "pure" as possible
"Pure"
Containing points that belong to just one class
Splitting is done to reduce ______________
Impurities
Information Gain measures:
The change in entropy due to any amount of new information being added
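As a worked sketch, information gain for a split can be computed as the parent's entropy minus the size-weighted entropy of the child groups. The function names and the yes/no example data are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that leaves each child mostly, but not fully, one class.
partial = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
# A split that leaves each child pure.
perfect = [["yes"] * 5, ["no"] * 5]

print(round(information_gain(parent, partial), 3))
print(information_gain(parent, perfect))
```

The purer the children, the larger the drop in entropy, which is why the attribute with the highest gain is chosen at each step.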
How can we segment the population into groups that differ from each other with respect to some quantity of interest?
Informative/knowable attributes that correlate with the target variable
What's the most important splitting criterion?
Information Gain
What is information gain based on?
Purity measure called Entropy
Entropy is 1 at __________
Maximal disorder. Equally mixed properties.
The higher the entropy, the ___________
more information content
What is the entropy of a group in which all examples belong to the same class?
entropy = -(1 × log2(1)) = 0. Not good for learning from a training set
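A small sketch of the entropy calculation, assuming base-2 logarithms and class labels held in a plain list. The function name and sample labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Sum over classes of p * log2(1/p), which equals -sum(p * log2(p))."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

pure = ["churn"] * 6                  # every example in the same class
mixed = ["churn"] * 3 + ["stay"] * 3  # two classes, equally mixed

print(entropy(pure))   # 1 * log2(1) = 0
print(entropy(mixed))  # 0.5 * 1 + 0.5 * 1 = 1
```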
Is an entropy value of .99 pure or impure?
very impure
Supervised vs. Unsupervised
Supervised: Arranging a fruit basket with known variables and differences (size, color, shape, fruit name). Unsupervised: Arranging a fruit basket into groups with no knowledge of the fruits, their attributes, or their potential grouping categories
Supervised
Predictive or directed
Unsupervised
Descriptive or undirected
"How likely is the consumer to (or will he or she) respond to our campaign?" -> What kind of DM task is this?
Classification
"How much will she use this service?" -> What kind of DM task is this?
Regression
"Can we find consumers similar to my best customers?" -> What kind of DM task is this?
Similarity Matching
"Do my customers form natural groups?" -> What kind of DM task is this?
...
"What items are commonly purchased together?" -> What kind of DM task is this?
Co-Occurrence grouping
"What does "normal behavior" look like? (for example, as a baseline to detect fraud)" -> What kind of DM task is this?
Profiling
"Which latent dimensions describe the consumer taste preferences?" -> What kind of DM task is this?
Data Reduction
"Since John and Jane share 2 friends, should John become Jane's friend"? -> What kind of DM task is this?
Link Prediction
"Why are my customers leaving?"-> What kind of DM task is this?
Causal Modeling
To avoid bad habits:
The training sample should be as similar as possible to the USE data
True/False: In a classification tree induction, the next attribute added is the one with the largest increase in entropy
False
Information gain:
measures the change in entropy due to any amount of new information being added
The most common splitting criterion is called:
Information Gain
Entropy at minimal disorder means:
Zero - The set has the same members with the same, single property
Entropy at maximal disorder:
1, The properties are equally mixed
Reasons for selecting only a subset of attributes:
Better insights, more traceable models, Faster/better predictions, avoid over fitting
Entropy:
How mixed up the classes are
"Logistic"
Log odds
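To make the term concrete, here is a sketch of the log-odds transform and its inverse, the logistic function. The function names are illustrative, and the natural logarithm is assumed.

```python
import math

def log_odds(p):
    """Map a probability in (0, 1) to the log of its odds p / (1 - p)."""
    return math.log(p / (1 - p))

def logistic(z):
    """Inverse of log_odds: map any real number back to a probability."""
    return 1 / (1 + math.exp(-z))

print(log_odds(0.5))            # even odds give a log-odds of 0
print(logistic(log_odds(0.8)))  # round trip returns roughly 0.8
```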
"Regression"
Numeric target
"Decision Boundaries"
Horizontal and vertical decision boundaries that we use to partition instance space into similar regions.
Purpose of creating homogeneous regions:
To predict the target variable of a new, unseen instance by determining which segment/region it belongs to
Linear Classifier
A new decision boundary is a straight line, but is not perpendicular to the axes. It is a weighted sum of the values for the various attributes
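A minimal sketch of a linear classifier as a weighted sum of attribute values. The weights, intercept, class names, and the rule "score above zero means the positive class" are assumptions chosen for illustration, not fitted values.

```python
def linear_classify(x, weights, intercept):
    """Return a class and the score: intercept + sum of weight * attribute value."""
    score = intercept + sum(w * v for w, v in zip(weights, x))
    # The decision boundary is the set of points where the score equals 0.
    return ("positive" if score > 0 else "negative"), score

weights = [1.0, -1.5]   # hypothetical weights, one per attribute
intercept = 0.5

print(linear_classify([2.0, 1.0], weights, intercept))
print(linear_classify([0.0, 2.0], weights, intercept))
```

Because both attributes contribute to the score, the boundary is a slanted line in the instance space rather than a horizontal or vertical cut.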
Classification Function vs. decision trees:
The difference from DTs is that the method for taking multiple attributes into account is to create a mathematical function of them.
Entropy progression from classification tree:
High -> Low
How do we create a classification tree from data?
"divide and conquer", Take each data subset and RECURSIVELY apply attribute selection to find the best attribute to partition with
Classification Function in context of the parameterized model:
The data mining is going to "fit" the parameterized model to a particular dataset, to find a good set of weights on the features.
A classification tree is equivalent to this
rule set
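The equivalence can be sketched by walking each root-to-leaf path and emitting one rule per leaf. The nested-dictionary tree format, attribute names, and class names here are illustrative assumptions.

```python
def tree_to_rules(tree, conditions=()):
    """Turn a nested-dict tree into a list of (conditions, predicted class) rules."""
    if not isinstance(tree, dict):
        # A leaf: the conditions collected along the path form one rule.
        return [(list(conditions), tree)]
    rules = []
    for attr, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

# Hypothetical tree: split on contract, then on support_calls for monthly customers.
tree = {
    "contract": {
        "monthly": {"support_calls": {"many": "churn", "few": "stay"}},
        "yearly": "stay",
    }
}

for conditions, label in tree_to_rules(tree):
    clause = " AND ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {clause} THEN {label}")
```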
Entropy progression from Root -> Branches -> Leaf node
High -> low
Probability estimation trees, Basic assumption
Each member of a segment corresponding to a tree leaf has the SAME probability to belong in the corresponding class
How do we resolve small samples for tree-based probability estimation?
Smoothing; Laplace correction
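A sketch of the Laplace correction for a leaf's class probability, assuming the common form (n + 1) / (total + number of classes), which for two classes is (n + 1) / (n + m + 2). The function name and the leaf counts are illustrative.

```python
def laplace_probability(n_class, n_total, n_classes=2):
    """Smoothed estimate of the class probability at a leaf."""
    return (n_class + 1) / (n_total + n_classes)

# A leaf with only 2 examples, both positive: raw frequency would be 1.0.
print(laplace_probability(2, 2))
# A leaf with 20 examples, all positive: the estimate moves closer to 1.0.
print(round(laplace_probability(20, 20), 3))
```

The correction pulls estimates from small leaves toward an even split, so a leaf with very few examples does not claim certainty it has not earned.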
3 Different Solutions to Classification
Classifier model, ranking, probability estimation