Exploratory Data Analysis GOALS
Terms in this set (34)
Create predictors that have a strong relationship with the prediction target
Especially great scope for feature engineering when combining data from multiple sources
Creativity and understanding the problem domain pay off big here

Replacing variables to improve generalization
Feature engineering with dates
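
As a rough illustration of feature engineering with dates, a single date can be expanded into several candidate predictors. This is a minimal sketch using only the Python standard library; the function name `date_features` and the particular features chosen are assumptions for illustration, not part of the card set.

```python
from datetime import date


def date_features(d: date) -> dict:
    """Derive simple candidate predictors from one date (hypothetical helper)."""
    return {
        "year": d.year,
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,  # 1 through 4
        "day_of_week": d.weekday(),         # Monday = 0, Sunday = 6
        "is_weekend": d.weekday() >= 5,     # Saturday or Sunday
    }


features = date_features(date(2024, 3, 16))
print(features)
```

Which derived features actually help depends on the problem domain, so these are starting points rather than a fixed recipe.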
Hierarchical Clustering
Create a table with all distances between people or cases; starting with the shortest distances, we cluster

Data Robot
Hands-on supervised machine learning

Association Rules: Support
Probability that 2 items co-occur: (# transactions with both A and D) / (all transactions)

Association Rules: Confidence
Conditional probability that transactions containing A also contain D: (# transactions with both A and D) / (# transactions with A)

Market Basket Analysis
Modeling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. Look for surprising combos.

Types of Targets
Supervised: event/no event (class label); regression (continuous outcome)

Data Splitting
Training --> Validation, Training --> Validation, Training --> Validation, Training --> Validation --> Holdout

Missing Data
Whole row could get kicked out

Splitting
Done with stats; for categorical targets, some version of chi-squared

Automated Machine Learning Criteria
Accuracy; productivity; ease of use; understanding and learning; resource availability; process transparency; generalizability across contexts; recommended actions

Prioritize Model Criteria
Predictive accuracy; familiarity with model; prediction speed; speed to build model; insights

Success criteria
How much value can the model drive?

Decide whether to continue
Estimate resources required; consider technical risks; understand alternatives to creating a model; estimate model value

Find appropriate data
Internal; external (purchased, public)

1st steps in exploratory data analysis
View raw table contents; descriptive statistics; verify variable column types match expectations

Accuracy
(TP+TN)/(TN+TP+FN+FP)

Sensitivity (True Positive Rate)
TP/(TP+FN); proportion of P's that were correctly ID'd

Specificity (True Negative Rate)
TN/(TN+FP); proportion of N's that were correctly ID'd

Positive Predictive Value
TP/(TP+FP)

Negative Predictive Value
TN/(TN+FN)

Statistical Accuracy (F1)
2TP/(2TP+FN+FP)

5-fold validation
Each fold is tested as its own chunk of data; start at 1, then 2, learning every time to increase sample size

AUC
Area under curve: TPR vs. FPR

False Positive Rate (Fallout)
FP/(FP+TN)
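
The confusion-matrix metrics on these cards can be checked against one small worked example. The sketch below is illustrative only: the function name `confusion_metrics` and the counts passed to it are made up for the example, and it assumes every denominator is nonzero.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute binary-classification metrics from confusion-matrix counts.

    Hypothetical helper; assumes no denominator is zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "ppv": tp / (tp + fp),                # positive predictive value
        "npv": tn / (tn + fn),                # negative predictive value
        "f1": 2 * tp / (2 * tp + fn + fp),
        "fpr": fp / (fp + tn),                # false positive rate (fallout)
    }


# Made-up counts for illustration only.
m = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity and the false positive rate share a denominator, so they always sum to 1.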