Big Data Analytics & Data Mining
Terms in this set (33)
Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than model B on the training data, but slightly less accurate than model B on the validation data. Which model are you more likely to consider for final deployment?
Model B
Assuming that data mining techniques are to be used in the following case, identify whether the task required is supervised or unsupervised learning.
Estimating the repair time required for an aircraft based on a trouble ticket.
Supervised
Assuming that data mining techniques are to be used in the following case, identify whether the task required is supervised or unsupervised learning.
Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously.
Unsupervised
For prediction models, a good rule of thumb is to have ______ records for every predictor variable.
10
Assuming that data mining techniques are to be used in the following case, identify whether the task required is supervised or unsupervised learning.
Automated sorting of mail by zip code scanning.
Supervised
Assuming that data mining techniques are to be used in the following case, identify whether the task required is supervised or unsupervised learning.
Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets whose threat status is known.
Supervised
Assuming that data mining techniques are to be used in the following case, identify whether the task required is supervised or unsupervised learning.
Identifying segments of similar customers.
Unsupervised
A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records with missing values. About how many records would you expect to be removed?
About 92.3% of the records (≈923 of the 1000): a record survives only if all 50 of its values are present, which happens with probability 0.95^50 ≈ 0.077, so roughly 1 − 0.077 = 92.3% of records would be removed.
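The arithmetic can be checked with a short Python sketch, assuming each value goes missing independently with probability 0.05:

```python
# Probability a single record is complete: each of its 50 values
# is present independently with probability 0.95.
p_complete = 0.95 ** 50
p_removed = 1 - p_complete

records = 1000
expected_removed = records * p_removed

print(f"P(record removed) = {p_removed:.4f}")        # ~0.9231
print(f"Expected records removed = {expected_removed:.0f}")  # ~923
```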
Find matches for the data mining procedures.
Linear regression - supervised learning
Collaborative filtering - unsupervised learning
Neural nets - supervised learning
Association rules - unsupervised learning
Regression trees - supervised learning
Logistic regression - supervised learning
Principal components - unsupervised learning
Cluster analysis - unsupervised learning
Classification trees - supervised learning
k-Nearest-neighbors - supervised learning
Find matches for the following terms.
- Unsupervised learning: An analysis in which one attempts to learn patterns in the data other than predicting an output value of interest.
- Supervised learning: The process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known; the algorithm "learns" how to predict this value for new records where the output is unknown.
- Validation set: The portion of the data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried.
- Test set: The portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on new data.
- Training set: The portion of the data used to fit a model.
- Algorithm: A specific procedure used to implement a particular data mining technique: classification tree, discriminant analysis, and the like.
The second principal component represents any linear combination of the variables that accounts for the most variability in the data, once the first principal component has been extracted.
False
What plots do you use to study relation of numerical outcome to categorical predictors?
Bar charts, multiple panels, side by side boxplots
What plots do you use to determine the needs for transformations of the numerical outcome variable or numerical predictors?
boxplots, histograms
Positive correlation indicates that, as one variable increases, the other variable increases as well.
True
The principal components are always mutually exclusive and exhaustive of the variables.
True
When should we normalize data?
When units of measurement are different, so it is unclear how to compare the variability of different variables - Yes
When units of measurement are common - No
When scale reflects importance - No
When units of measurement are different, so scale does not reflect importance - Yes
In multidimensional visualization, to represent additional numerical information, you can use____
size, color intensity
What plots do you use to study relation of categorical outcome to numerical predictors?
Side by side boxplots
What plots do you use to study relation of numerical outcome to numerical predictors?
scatter plots
Chemical Features of Wine. The table below shows the PCA output on data (non-normalized) in which the variables represent chemical characteristics of wine, and each case is a different wine. The data are in the file Wine.xls, which you can refer to when answering the question.
a) Consider the row near the bottom labeled "Variance." Explain why column 1's variance is so much greater than that of any other column.
b) Discuss the use of normalization (standardization) in relating to the output in part a) and the dataset.
JENN:
a. Column 1's variance is so much greater than that of any other column because it is the first principal component, the linear combination that captures the maximum variance in the data. The largest variances are the most important, or most "principal."
b. Normalization (standardization) matters here because the variance of one column dwarfs the rest. Standardizing replaces each original variable with a rescaled version, subtracting its mean and dividing by its standard deviation, so that all variables carry equal weight in terms of variability and no single variable dominates by scale alone.
JENNY:
a) Column 1 is the first principal component and therefore has the maximum variance of the data compared to the other principal components.
b) If you change one variable's units from mg to g (for example, magnesium), it may go from having very little impact on the first principal component to having a large impact. To protect the PCA from being affected by rescaling, you need to standardize the variables. If the scale matters (scale reflects importance), there is no need to standardize; if the scale does not matter, you should standardize.
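The standardization idea can be sketched in Python with made-up numbers (these are illustrative values, not the actual Wine.xls data):

```python
from statistics import pstdev, pvariance

# Hypothetical toy data: two "chemical" variables on very different scales
# (values are made up for illustration, not taken from Wine.xls).
magnesium_mg = [88, 101, 112, 96, 120]    # milligrams: large scale
acidity      = [0.5, 0.7, 0.6, 0.9, 0.8]  # small scale

# Before standardizing, the large-scale variable dominates total variance.
print(pvariance(magnesium_mg), pvariance(acidity))

def standardize(xs):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(xs) / len(xs)
    sd = pstdev(xs)
    return [(x - mean) / sd for x in xs]

# After standardizing, every variable has variance 1, so no variable
# can dominate the first principal component by scale alone.
for var in (magnesium_mg, acidity):
    print(round(pvariance(standardize(var)), 6))  # 1.0
```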
A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so) using 0.5 as the cutoff. Note: fraudulent is the class of interest here.
Construct the confusion matrix in the following table by filling the blanks.
                   Predicted fraud   Predicted non-fraud
Actual fraud              30                  32
Actual non-fraud          58                 920
A large number of insurance records are to be examined to develop a model for predicting fraudulent claims. Of the claims in the historical database, 1% were judged to be fraudulent. A sample is taken to develop a model, and oversampling is used to provide a balanced sample in light of the very low response rate. When applied to this sample (n = 800), the model ends up correctly classifying 310 frauds, and 270 nonfrauds. It missed 90 frauds and classified 130 records incorrectly as frauds when they were not. Note: fraudulent is the class of interest here.
Construct the confusion matrix in the following table and calculate the error rate by filling the blanks.
                   Predicted fraud   Predicted non-fraud
Actual fraud             310                  90
Actual non-fraud         130                 270

Error rate = (90 + 130) / 800 = 0.275
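A quick Python check of the matrix totals and the error rate:

```python
# Confusion matrix entries from the problem (fraud = class of interest):
tp = 310  # frauds correctly classified
fn = 90   # frauds missed
fp = 130  # nonfrauds incorrectly classified as fraud
tn = 270  # nonfrauds correctly classified

n = tp + fn + fp + tn        # total records = 800
error_rate = (fn + fp) / n   # misclassified / total

print(n, error_rate)  # 800 0.275
```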
Consider the example of predicting prices of used Toyota Corolla automobiles (dataset ToyotoCorolla.xlsx). The best-pruned tree is shown below.
What is the predicted price for a used Toyota Corolla with the following specifications?
Age_08-04: 77
Mileage: 11700
Fuel_Type: Petrol
HP: 110
Automatic: No
CC: 1510
Doors: 5
Quarterly_Tax:
Weight:

Predicted price: 9168.75
Assuming that data mining techniques are to be used in the following case, identify whether the task required is supervised or unsupervised learning.
Estimating the repair time required for an aircraft based on the trouble ticket.
Supervised
Find matches for the following terms.
Attribute
Response
Predictor
Model
Score
Success Class
- Attribute: Any measurement on the records, including both the input (X) variables and the output (Y) variable.
- Response: A variable, usually denoted by Y, which is the variable being predicted in supervised learning; also called dependent variable, output variable, target variable, or outcome variable.
- Predictor: A variable, usually denoted by X, used as an input into a predictive model. Also called a feature, input variable, independent variable, or from a database perspective, a field.
- Model: An algorithm as applied to a dataset, complete with its settings (many of the algorithms have parameters that the user can adjust).
- Score: A predicted value or class. Scoring new data means using a model developed with training data to predict output values in new data.
- Success Class: The class of interest in a binary outcome (e.g., purchasers in the outcome purchase/no purchase).
In multidimensional visualization, to represent additional categorical information, you can use____
Textbook: hue, shape, multiple panels
(Quiz 2 did not include "hue" and deducted a point for "color intensity.")
Assume that a confusion matrix has been produced for a model based on a cutoff value of 0.5. Explain how increasing the cutoff would affect the sensitivity and specificity measures of the model.
JENNY:
Increasing the cutoff means fewer records are classified as belonging to the class of interest, so sensitivity would decrease (more true positives are missed) while specificity would increase (fewer negatives are falsely flagged).
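The effect can be sketched in Python with hypothetical propensities and labels (the data and cutoffs below are made up for illustration):

```python
# Hypothetical predicted probabilities (propensities) and actual classes;
# 1 = the class of interest. Values are made up for illustration.
probs  = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actual = [1,   1,   0,   1,   1,   0,   0,   0]

def sens_spec(cutoff):
    pred = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for y, yh in zip(actual, pred) if y == 1 and yh == 1)
    fn = sum(1 for y, yh in zip(actual, pred) if y == 1 and yh == 0)
    tn = sum(1 for y, yh in zip(actual, pred) if y == 0 and yh == 0)
    fp = sum(1 for y, yh in zip(actual, pred) if y == 0 and yh == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Raising the cutoff flags fewer records as positive:
# sensitivity falls, specificity rises.
print(sens_spec(0.5))  # (0.75, 0.75)
print(sens_spec(0.8))  # (0.5, 1.0)
```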
A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so) using 0.5 as the cutoff. Note: fraudulent is the class of interest here.
The overall error rate is:
(32 + 58) / 1040 = 90 / 1040 ≈ 0.0865
Suppose that we have the following data (one variable). Use complete linkage to identify the clusters. Data: 0, 0, 1, 3, 3, 6, 7, 9, 10, 10
Describe all the steps and for each step, discuss the distance between the two merged clusters for each merge, and the resulting clusters, as we did in class.
For example, step 1, the clusters are ...
step 2, merge ..., the distance between the two merged clusters {...} and {...} is ...; ..., at the end of step 2, the resulting clusters are ...
step 3, merge ..., the distance between the two merged clusters {...} and {...} is ...; ..., at the end of step 3, the resulting clusters are ...
...
JENNY (0.9/1)
Your Answer:
Step 1) The clusters are {0}, {0}, {1}, {3}, {3}, {6}, {7}, {9}, {10}, {10}
Step 2) merge {0} and {0}; {3} and {3}; {10} and {10}
the distance between the two merged clusters {0} and {0} is 0;
resulting clusters: {0,0}, {1}, {3}, {3}, {6}, {7}, {9}, {10}, {10}
Step 3) merge {3} and {3};
the distance between the two merged clusters {3} and {3} is 0;
resulting clusters: {0,0}, {1}, {3,3}, {6}, {7}, {9}, {10}, {10}
Step 4) merge {10} and {10};
the distance between the two merged clusters {10} and {10} is 0;
resulting clusters: {0,0}, {1}, {3,3}, {6}, {7}, {9}, {10,10}
Step 5) merge {0,0} and {1};
the distance between the two merged clusters {0,0} and {1} is 1;
resulting clusters: {0,0,1}, {3,3}, {6}, {7}, {9}, {10,10}
Step 6) merge {10,10} and {9};
the distance between the two merged clusters {10,10} and {9} is 1;
resulting clusters: {0,0,1}, {3,3}, {6}, {7}, {9,10,10}
Step 7) merge {6} and {7};
the distance between the two merged clusters {6} and {7} is 1;
resulting clusters: {0,0,1}, {3,3}, {6,7}, {9,10,10}
Step 8) merge {0,0,1} and {3,3};
the complete-linkage distance between the two merged clusters {0,0,1} and {3,3} is max|0 − 3| = 3;
resulting clusters: {0,0,1,3,3}, {6,7}, {9,10,10}
Step 9) merge {6,7} and {9,10,10};
the complete-linkage distance between the two merged clusters {6,7} and {9,10,10} is max|6 − 10| = 4;
resulting clusters: {0,0,1,3,3}, {6,7,9,10,10}
Step 10) merge {0,0,1,3,3} and {6,7,9,10,10};
the complete-linkage distance between the two merged clusters is max|0 − 10| = 10;
resulting clusters: {0,0,1,3,3,6,7,9,10,10}
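The merge sequence can be verified with a minimal pure-Python complete-linkage sketch (ties are broken by scan order; under complete linkage, the distance between clusters is the largest pairwise distance):

```python
# Minimal agglomerative clustering with complete linkage:
# the distance between two clusters is the LARGEST pairwise distance.
def complete_linkage(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, tuple(clusters[i]), tuple(clusters[j])))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

data = [0, 0, 1, 3, 3, 6, 7, 9, 10, 10]
for d, a, b in complete_linkage(data):
    print(f"merge {a} + {b} at distance {d}")
# Merge distances in order: 0, 0, 0, 1, 1, 1, 3, 4, 10
```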
Consider the example of predicting prices of used Toyota Corolla automobiles (dataset ToyotoCorolla.xlsx). The best-pruned tree is shown below. For the first decision node on the right side, what is the meaning of the number 7823.53?
JENNY (0.5/0.5):
IF Age > 56.5 AND Mileage > 128400
THEN the predicted price of a used Toyota Corolla is $7823.53, i.e., the average price of the training records that fall in that node.
Consider the loan acceptance example (dataset universal bank.xlsx). The best-pruned tree is shown below. Summarize all the Classification Rules from this tree in the simplified form, i.e., remove any redundancies for full credit.
JENNY (1.75/2)
1) IF Income < 114.5 AND CCAvg < 2.95 THEN Class = 0
2) IF Income < 114.5 AND CCAvg > 2.95 AND CD Account > 0.5 THEN Class = 1
3) IF Income < 114.5 AND CCAvg > 2.95 AND CD Account < 0.5 AND Income < 100 THEN Class = 0
4) IF Income < 114.5 AND CCAvg > 2.95 AND CD Account < 0.5 AND Income > 100 AND Education < 1.5 THEN Class = 0
5) IF Income < 114.5 AND CCAvg > 2.95 AND CD Account < 0.5 AND Income > 100 AND Education > 1.5 THEN Class = 1
6) IF Income > 114.5 AND Education > 1.5 THEN Class = 1
7) IF Income > 114.5 AND Education < 1.5 AND Family > 2.5 THEN Class = 1
8) IF Income > 114.5 AND Education < 1.5 AND Family < 2.5 THEN Class = 0
A neural net typically starts out with random coefficients; hence, it produces essentially random predictions when presented with its first case. Discuss the key ingredient by which the net evolves to produce a more accurate prediction. *Partial, 0.5/1.5
The key ingredient is error feedback. The net compares its prediction with the actual value for each case, computes the error, and propagates it back through the network (backpropagation), adjusting the weights slightly in the direction that reduces the error. Repeating this process over many records and epochs is what moves the net from essentially random predictions to accurate ones.
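The error-feedback idea can be sketched with a single linear neuron trained by gradient descent (a toy illustration, not full multi-layer backpropagation; the target function y = 2x + 1 and all settings are made up):

```python
import random

# A single linear neuron trained by error feedback: predict, measure the
# error, nudge the weights to reduce it, repeat. Toy target: y = 2x + 1.
random.seed(0)
w, b = random.random(), random.random()   # random start -> random predictions
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

lr = 0.05                     # learning rate (assumed value)
for epoch in range(500):      # one epoch = one pass over all records
    for x, y in data:
        pred = w * x + b
        err = pred - y        # how wrong was the prediction?
        w -= lr * err * x     # move each weight against its error gradient
        b -= lr * err

print(round(w, 2), round(b, 2))  # converges toward 2.0 and 1.0
```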
Consider the used cars (Toyota Corolla) example with 1436 records and details on 38 attributes, including Price, Age, KM, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. Suppose that you use Analytic Solver Data Mining (or R or other tool's) neural network routine to fit a model, and further suppose that you develop 3 models in same procedure and setting, except with 300, 3000, and 10000 as the number of epochs, respectively. For each model, you record the RMS error for the training data and the validation data.
a) What would you expect for the RMS error of the training data as the number of epochs increases?
b) What would you expect for the RMS error of the validation data?
c) Comment on how you would select the appropriate number of epochs for the model, given the 3 models with different number of epochs.
*Partial, 1/1.5
a. An epoch (or sweep, or iteration) is one complete pass of all the training records through the network; the algorithm then returns to the first record and repeats. I would expect the training RMS error to keep dropping as the number of epochs increases: the more passes, the more closely the net fits the training data. We do need to beware of overfitting, however.
b. For the validation data, I would expect the RMS error to drop at first but eventually start to rise. With too many epochs the model learns the training data too well (overfits) and performs worse on validation or new data.
c. I would fit all three models (300, 3000, and 10000 epochs) and choose the number of epochs whose model gives the lowest RMS error on the validation data. That is the point where the net has captured the real structure but has not yet begun to overfit; avoiding overfitting is the most important consideration, because we do not want the model to perform poorly on validation or new data.