A critical skill in data science is the ability to decompose a data-analytics problem into pieces such that each piece matches a known task for which tools are available.
Data Science Principle.
Recognizing familiar problems and their solutions avoids wasting time and resources reinventing the wheel. It also allows people to focus attention on the more interesting parts of the process that require human involvement: parts that have not been automated, so human creativity and intelligence must come into play.
Classification and class probability estimation (scoring)
Attempts to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive. An example classification question would be:
"Among all the customers of MegaTelCo, which are likely to respond to a given offer?"
In this example the two classes could be called will respond and will not respond. For a classification task, a data mining procedure produces a model that, given a new individual, determines which class that individual belongs to.
A closely related task is scoring or class probability estimation.
A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that that individual belongs to each class.
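The difference between scoring and classification can be sketched in a few lines. This is a minimal illustration, not a trained model: the feature names, weights, and the 0.5 threshold are all hypothetical, and a real data mining procedure would learn the weights from data.

```python
import math

# Hypothetical, hand-picked weights; a real model would learn these from data.
WEIGHTS = {"months_as_customer": 0.05, "past_responses": 0.8}
BIAS = -2.0


def score(customer):
    """Scoring: return an estimated probability that the customer responds."""
    z = BIAS + sum(WEIGHTS[name] * customer[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def classify(customer, threshold=0.5):
    """Classification: map the score onto one of two mutually exclusive classes."""
    return "will respond" if score(customer) >= threshold else "will not respond"


loyal = {"months_as_customer": 24, "past_responses": 2}
new = {"months_as_customer": 1, "past_responses": 0}

print(classify(loyal), round(score(loyal), 2))
print(classify(new), round(score(new), 2))
```

The scoring function keeps the degree of likelihood, which is useful for ranking customers; the classifier discards it in favor of a single class label.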
Data Mining Tasks
1) Classification and class probability estimation (scoring)
2) Regression ("value estimation")
3) Similarity matching
4) Clustering
5) Co-occurrence grouping
6) Profiling
7) Link prediction
8) Data reduction
9) Causal modeling
Regression ("value estimation")
attempts to estimate or predict, for each individual, the numerical value of some variable for that individual. An example regression question would be: "How much will a given customer use the service?" The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage.
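One simple way to "look at other, similar individuals and their historical usage" is to average the usage of the nearest neighbors. The sketch below assumes a single made-up feature (age) and made-up usage figures; the choice of k = 2 is arbitrary.

```python
# Made-up history: (age, monthly usage in minutes) for existing customers.
history = [(25, 300.0), (30, 280.0), (45, 150.0), (50, 120.0), (60, 90.0)]


def predict_usage(age, k=2):
    """Estimate usage as the mean usage of the k customers closest in age."""
    nearest = sorted(history, key=lambda row: abs(row[0] - age))[:k]
    return sum(usage for _, usage in nearest) / k


print(predict_usage(28))  # averages the 30- and 25-year-old customers
print(predict_usage(55))  # averages the 50- and 60-year-old customers
```

The output is a number rather than a class label, which is what distinguishes regression from classification.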
Classification and Regression
Classification predicts whether something will happen, whereas regression predicts how much something will happen.
Similarity matching
attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities. Similarity matching is the basis for one of the most popular methods for making product recommendations.
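As a rough illustration of similarity matching, the sketch below compares customers by the cosine similarity of their feature vectors. The customers, the three features, and the choice of cosine similarity are assumptions for the example; other distance measures are equally valid.

```python
import math

# Hypothetical customers described by three made-up spending features.
customers = {
    "A": [5.0, 1.0, 0.0],
    "B": [4.0, 2.0, 0.0],
    "C": [0.0, 1.0, 5.0],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(name):
    """Return the other customer whose vector is most similar to `name`."""
    others = (n for n in customers if n != name)
    return max(others, key=lambda n: cosine(customers[name], customers[n]))


print(most_similar("A"))
```

A recommender built on this idea would suggest to customer A the products that the most similar customer bought.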
Clustering
attempts to group individuals in a population together by their similarity, but not driven by any specific purpose. Clustering is useful in preliminary domain exploration to see which natural groups exist, because these groups in turn may suggest other data mining tasks or approaches.
What products should we offer or develop? How should our customer care teams (or sales teams) be structured?
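Grouping by similarity without a target can be sketched with a tiny one-dimensional k-means. The spend figures and the starting centers are made up, and k-means is only one of many clustering methods; it is used here because it is short.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


monthly_spend = [10, 12, 11, 95, 100, 105]  # made-up values
centers, groups = kmeans_1d(monthly_spend, centers=[0.0, 50.0])
print(centers, groups)
```

No class labels are supplied; the two segments emerge from the data alone, which is what makes this an exploratory task.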
Co-occurrence grouping or frequent itemset mining, association rule discovery, and market-basket analysis
attempts to find associations between entities based on transactions involving them. An example co-occurrence question would be: "What items are commonly purchased together?"
While clustering looks at similarity between objects based on the objects' attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
Co-occurrence of products in purchases is a common type of grouping known as market-basket analysis.
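A minimal market-basket sketch counts how often each pair of items appears in the same transaction. The baskets below are made up, and counting raw pair frequency is the simplest possible measure; practical association rule mining adds measures such as support and confidence thresholds.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions; each set is one purchase.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Note that the items are grouped by appearing together in transactions, not by any attribute of the items themselves.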
Profiling or behavior description
attempts to characterize the typical behavior of an individual, group, or population. An example profiling question would be: "What is the typical cell phone usage of this customer segment?" Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions to computer systems.
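A profile can be as simple as a mean and a standard deviation, with anomalies flagged when behavior strays too far from that norm. The usage figures and the three-standard-deviation rule below are assumptions for illustration only.

```python
from statistics import mean, stdev

# Made-up monthly usage (minutes) for one customer segment.
usage = [200, 210, 190, 205, 195]

# The profile: what "typical" looks like for this segment.
mu, sigma = mean(usage), stdev(usage)


def is_anomalous(value, k=3.0):
    """Flag values more than k standard deviations from the segment mean."""
    return abs(value - mu) > k * sigma


print(is_anomalous(208))  # within the normal range
print(is_anomalous(900))  # far outside the profile
```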
Link prediction
attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Link prediction is common in social networking systems: "Since you and Karen share 10 friends, maybe you'd like to be Karen's friend?"
Link prediction can also estimate the strength of a link.
For example, for recommending movies to customers one can think of a graph between customers and the movies they've watched or rated. Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong. These links form the basis for recommendations.
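The "shared friends" idea can be sketched by scoring each missing link with the number of common neighbors. The friendship data is made up and deliberately incomplete (only three people have friend lists), and common-neighbor counting is just one simple scoring choice.

```python
# Made-up friend lists; only these three people are candidates here.
friends = {
    "you": {"ann", "bob", "cat"},
    "karen": {"ann", "bob", "cat", "dan"},
    "eve": {"dan"},
}


def suggest_friend(person):
    """Return (candidate, strength) for the strongest link that does not yet exist.
    Strength is the number of friends the two people share."""
    scores = {
        other: len(friends[person] & friends[other])
        for other in friends
        if other != person and other not in friends[person]
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


print(suggest_friend("you"))
```

The returned count plays the role of link strength: the more shared friends, the stronger the predicted link.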
Data reduction
attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. The smaller dataset may be easier to deal with or to process.
Moreover, the smaller dataset may better reveal the information.
Data reduction usually involves loss of information.
What is important is the trade-off for improved insight.
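One very simple form of data reduction is aggregation: replacing many transaction rows with one summary row per customer. The transactions below are made up, and aggregation is only an illustrative stand-in for more sophisticated reduction techniques.

```python
from collections import defaultdict

# Made-up transaction log: (customer, purchase amount).
transactions = [
    ("alice", 20.0),
    ("bob", 5.0),
    ("alice", 30.0),
    ("bob", 7.0),
    ("alice", 10.0),
]

totals = defaultdict(float)
counts = defaultdict(int)
for customer, amount in transactions:
    totals[customer] += amount
    counts[customer] += 1

# The reduced dataset: one row per customer instead of one per purchase.
summary = {
    c: {"purchases": counts[c], "mean_amount": totals[c] / counts[c]}
    for c in totals
}
print(summary)
```

The individual purchase amounts and their order are lost, which is the information given up in exchange for a smaller, easier-to-read dataset.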
Causal modeling
attempts to help us understand what events or actions actually influence others. For example, consider that we use predictive modeling to target advertisements to consumers, and we observe that indeed the targeted consumers purchase at a higher rate subsequent to having been targeted. Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway?
Techniques for causal modeling include those involving a substantial investment in data, such as randomized controlled experiments (e.g., so-called "A/B tests"), as well as sophisticated methods for drawing causal conclusions from observational data. Both experimental and observational methods for causal modeling generally can be viewed as "counterfactual" analysis: they attempt to understand what the difference would be between two situations, which cannot both happen, where the "treatment" event (e.g., showing an advertisement to a particular individual) were to happen and were not to happen.
When undertaking causal modeling, a business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.
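In an A/B test, the counterfactual comparison reduces to the difference in purchase rates between a randomly assigned treated group and a control group. The counts below are made up, and the sketch assumes random assignment and ignores statistical significance.

```python
def conversion_rate(purchases, people):
    """Fraction of people in a group who purchased."""
    return purchases / people


def estimated_lift(treated, control):
    """Difference in conversion rate between the group shown the ad (treated)
    and the group not shown it (control). Each argument is a
    (purchases, people) pair."""
    return conversion_rate(*treated) - conversion_rate(*control)


# Made-up results: 120 of 1000 treated customers bought, 100 of 1000 controls.
lift = estimated_lift(treated=(120, 1000), control=(100, 1000))
print(round(lift, 3))
```

Because assignment is random, the control group approximates what the treated group would have done without the advertisement, so the lift can be attributed to the ad rather than to good targeting.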
Supervised
A specific target or answer is sought from the data, usually to answer a question. There must also be data on the target.
Unsupervised
No target or answer is sought. Only observations are made about the data.
Classification, regression, and causal modeling are supervised.
Similarity matching, link prediction, and data reduction can be either supervised or unsupervised.
Clustering, co-occurrence grouping, and profiling are unsupervised.
Types of Targets for supervised data mining
Regression - Numeric Target
Classification - Categorical target (Often Binary)
Steps involved in building a model for a problem.
Business Understanding (analyst) - Defining the problem clearly.
Data understanding - In data understanding we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply. It is not unusual for a business problem to contain several data mining tasks, often of different types, and combining their solutions will be necessary.
Data preparation - often proceeds along with data understanding; here the data are manipulated and converted into forms that yield better results. Typical examples of data preparation are converting data to tabular format, removing or inferring missing values, and converting data to different types.
Modeling - the output of modeling is some sort of model or pattern capturing regularities in the data.
Evaluation - The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on. The evaluation stage also serves to help ensure that the model satisfies the original business goals.
Leading to creation of a MODEL and deployment.
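The data preparation step listed above can be sketched as follows. The raw records, column names, and the choice to fill missing values with the column mean are assumptions for illustration; the sketch also assumes every column has at least one known value.

```python
# Made-up raw records: every value arrives as text, and some are missing.
raw = [
    {"age": "34", "usage": "120.5"},
    {"age": "", "usage": "80"},
    {"age": "51", "usage": ""},
]


def to_float(text):
    """Convert text to a number, using None for missing values."""
    return float(text) if text.strip() else None


def prepare(rows):
    """Convert types, then infer each missing value as the column mean."""
    columns = list(rows[0])
    table = [{c: to_float(r[c]) for c in columns} for r in rows]
    for c in columns:
        known = [r[c] for r in table if r[c] is not None]
        fill = sum(known) / len(known)
        for r in table:
            if r[c] is None:
                r[c] = fill
    return table


prepared = prepare(raw)
print(prepared)
```

After this step every row has the same numeric columns with no gaps, which is the tabular form most modeling tools expect.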