OMIS 472 Exam 1
Merge, append, data partition, Input data, Filter, file import
Append node
§ Combines data sets created by different paths of a process flow in a project
§ Stacks the data sets (adds rows) instead of merging variables side by side
· For example, combining Region A and Region B data that have the same variables
Data partition node
§ Partitions data into training, validation, and test data
· Training: used to develop the model
· Validation: used to evaluate the different models created with the training data and select the best one
· Test: used to independently assess the selected model
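The three-way partition can be illustrated with a small Python sketch. This is not the node's own logic; the function name `partition`, the 60/20/20 split, and the fixed seed are assumptions chosen only for illustration.

```python
import random

def partition(rows, train=0.6, valid=0.2, seed=42):
    """Shuffle rows, then split them into training, validation and test lists.

    The split fractions and the seed are hypothetical defaults.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],                    # develop the model
        shuffled[n_train:n_train + n_valid],   # compare models, pick the best
        shuffled[n_train + n_valid:],          # independent final assessment
    )

train_set, valid_set, test_set = partition(range(100))
```

Each observation lands in exactly one of the three sets, so the test set stays untouched while models are fitted and compared.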
File import node
§ Enables you to create a data source directly from an external file such as an Excel file
§ Can connect to any node
Filter node
§ Can be used to eliminate observations with extreme values (outliers) in the variables
§ Should not be used routinely; find out the reason for the outliers first
§ Can follow or precede the data partition node
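As a rough illustration of filtering out extreme observations, the sketch below drops values that lie more than `k` standard deviations from the mean. The rule, the function name `filter_outliers`, and the sample data are assumptions; the notes do not say which filtering method is used.

```python
from statistics import mean, stdev

def filter_outliers(values, k=3.0):
    """Keep only values within k sample standard deviations of the mean.

    The k-standard-deviation rule is an assumed, commonly used criterion.
    """
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

# Hypothetical income data (in thousands) with one extreme value.
incomes = [40, 42, 45, 47, 50, 52, 55, 58, 60, 1000]
kept = filter_outliers(incomes, k=2.0)
```

Unlike capping a value, this removes the whole observation, which is why the reason for an outlier should be understood before it is filtered.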
Input data node
§ First node in any diagram
§ Specifies the data set you want to use in the diagram
Merge node
§ Used to combine different data sets within a project
§ Can combine output data sets from different nodes
§ Combines data by merging variables side by side (adds columns)
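The difference between appending (stacking rows) and merging (joining columns side by side) can be sketched in plain Python. The data sets and the key name `id` below are hypothetical.

```python
# Hypothetical data: two regions that share the same variables.
region_a = [{"id": 1, "sales": 100}, {"id": 2, "sales": 150}]
region_b = [{"id": 3, "sales": 90}]

# Append: stack the rows; the set of variables stays the same.
appended = region_a + region_b

# Merge: match rows on a shared key; the set of variables grows.
demographics = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
age_by_id = {row["id"]: row for row in demographics}
merged = [{**row, **age_by_id[row["id"]]} for row in region_a]
```

After the append there are more observations with the same columns; after the merge there are the same observations with more columns.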
Tools for initial data exploration
Stat explore, stat explore for continuous targets, multiplot node, graph explore node
Stat explore node
§ The chi-square statistic shows the strength of the relationship between the target and each categorical input variable.
§ Can compute chi-square statistics for continuous variables, but they must first be converted to categorical variables: set the Interval Variables property to Yes and specify the number of bins.
§ Shows a chi-square plot and a variable worth plot when run.
§ Variable worth: calculated from the p-values of the chi-square statistics
§ Variables with the highest chi-square values are the most important
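A minimal sketch of the idea: compute the Pearson chi-square statistic between a categorical input and the target, binning an interval input first. Equal-width bins, the function names, and the sample data are assumptions; the tool's own binning may differ.

```python
from collections import Counter

def chi_square(inputs, targets):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(inputs)
    observed = Counter(zip(inputs, targets))
    row_totals = Counter(inputs)
    col_totals = Counter(targets)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n   # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def bin_interval(values, n_bins):
    """Equal-width binning (assumes the values are not all identical)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical data: age separates responders from non-responders, region does not.
response = [0, 0, 0, 0, 1, 1, 1, 1]
ages = [22, 25, 30, 35, 48, 52, 60, 65]
region = ["N", "S", "N", "S", "N", "S", "N", "S"]

chi_age = chi_square(bin_interval(ages, 2), response)
chi_region = chi_square(region, response)
```

Here `chi_age` is much larger than `chi_region`, so age would rank as the more important input.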
Stat explore for continuous targets
§ Set the Interval Variables property to No, and set Correlations, Pearson Correlations, and Spearman Correlations to Yes.
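For reference, the two correlation measures named above can be sketched as follows. This is a hand-rolled illustration, not the tool's implementation, and the simple ranking helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Ranks 1..n in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical input with a monotonic but non-linear link to the target.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because Spearman works on ranks, it picks up any monotonic relationship, while Pearson only measures the linear part.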
Multiplot node
§ Shows plots of all inputs and input levels (if categorical) against the target (frequency plots)
Graph explore node
§ Select variables you want to plot and their roles
Variable clustering node
o Divides the inputs in a predictive modeling data set into disjoint clusters or groups
o Disjoint- If in one cluster, cannot appear in any other cluster
o Inputs within a cluster are strongly inter-correlated, and the inputs in one cluster are NOT strongly correlated with the inputs in any other cluster
o You can then estimate a predictive model by including only 1 variable from each cluster or a linear combination of all variables in that cluster. This reduces the severity of collinearity and results in having fewer variables to deal with in the model.
o Starts with all variables in one cluster and divides it into smaller and smaller clusters using an algorithm
o Automatically excludes variables with the role set to target
Variable selection node
o Used for variable selection
o Measures the strength of the relationship between each input variable and the target variable using chi-square or R-square values.
o Interval targets: R-square
o Binary targets: both R-square and chi-square
R-square for variable selection
§ First, the node rejects any variable whose R-square with the target is below the minimum R-square
§ R-square tells us the proportion of variation in the target variable explained by a single input variable, ignoring the effect of other input variables.
§ To detect non-linear relationships, the VS node creates binned variables from each interval variable. These are called AOV16 variables. These are treated as class variables.
§ VS node then performs a forward stepwise regression to evaluate the variables chosen in the first step.
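The first screening step can be sketched as below: for a single input, R-square is the squared Pearson correlation with the target, and inputs below a minimum are rejected. The function names, the threshold of 0.05, and the data are hypothetical; the AOV16 binning and the stepwise regression step are not shown.

```python
def r_square(x, y):
    """R-square of a one-input linear regression (squared Pearson correlation)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov * cov / (var_x * var_y)

def screen_inputs(inputs, target, min_r2):
    """Keep the inputs whose individual R-square reaches min_r2."""
    return [name for name, values in inputs.items()
            if r_square(values, target) >= min_r2]

# Hypothetical data: 'spend' tracks the target closely, 'noise' does not.
target = [10, 12, 15, 19, 24]
inputs = {
    "spend": [1, 2, 3, 4, 5],
    "noise": [3, 1, 3, 1, 3],
}
kept = screen_inputs(inputs, target, min_r2=0.05)
```

Each input is scored on its own, so this step ignores any overlap between inputs; that is what the later stepwise regression addresses.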
Chi Square for variable selection
§ VS node creates a tree based on chi-square maximization
§ VS node first bins interval variables and uses the binned variable rather than the original inputs in building the tree.
§ Node rejects any split with chi-square below the specified threshold
Tools for data modification
Impute node, drop node, replacement node, interactive binning node, and principal components node
Drop node
§ Used to drop variables from the data set or metadata
Replacement node
§ Can be used to filter out extreme values in a variable without losing any observations
§ Can also be used to change the distribution of any variable in the sample
§ It's like filtering without losing the observation
Impute node
§ Used for imputing missing values of inputs
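A small sketch of two of the ideas above: replacing extreme values with limits so the observation is kept, and imputing missing values. The limits, the choice of mean imputation, and the function names are assumptions for illustration; the notes do not specify the methods used.

```python
def replace_extremes(values, lower, upper):
    """Cap values at the given limits; no observation is removed."""
    return [min(max(v, lower), upper) for v in values]

def impute_mean(values):
    """Fill missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

capped = replace_extremes([5, 50, 500], lower=10, upper=100)
filled = impute_mean([1.0, None, 3.0])
```

In both cases the number of observations is unchanged, in contrast to filtering, which drops rows.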
Interactive binning node
§ Binning helps uncover complex non-linear relationships between the inputs and the target
§ Binning is a method for converting an interval-scaled variable into a categorical variable
§ Forms bins; once bins are formed, the Gini statistic is computed for each input.
· If Gini Stat has a value below the minimum cutoff property, it is rejected.
Principal components node
§ Principal components are new variables constructed from a set of variables. They are linear combinations of the original variables.
§ In general, a small # of principal components can capture most of the information contained in the original inputs.
§ Principal components are uncorrelated with one another, so using them as inputs results in no collinearity
§ Target variables won't be included in constructing the principal components
§ Principal components are calculated as weighted sums of the original variables.
§ Eigenvalues are equal to the statistical variance of the new components.
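The properties above can be checked on a toy case with two inputs, where the eigenvalues of the 2x2 covariance matrix have a closed form. This sketch assumes exactly two inputs with non-zero covariance; the function names and data are hypothetical.

```python
from math import sqrt

def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def principal_components_2d(x, y):
    """Eigenvalues and component scores for two inputs (covariance must be non-zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx, syy = sample_variance(x), sample_variance(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Closed-form eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + root, half_trace - root]
    scores = []
    for lam in eigenvalues:
        vx, vy = sxy, lam - sxx            # eigenvector for this eigenvalue
        norm = sqrt(vx * vx + vy * vy)
        wx, wy = vx / norm, vy / norm      # unit-length weights
        # Each component is a weighted sum of the centered inputs.
        scores.append([wx * (a - mx) + wy * (b - my) for a, b in zip(x, y)])
    return eigenvalues, scores

x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]
eigenvalues, scores = principal_components_2d(x, y)
```

The variance of each score column matches its eigenvalue, and the two score columns are uncorrelated, which is why components can replace collinear inputs.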
Transform variables node
o Best: the node selects the transformation that yields the best chi-square value for the target
o Multiple: makes several transformations of each input and passes them on to the next node; the regression node then eliminates any that aren't necessary
o Simple transformations: Log, Log10, Inverse (1/x), square root, square, exponential, range, centering (x - mean), and standardize ((x - mean) / standard deviation)
o Score Code Export
o Start Groups
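Several of the simple transformations named in the transform variables list can be written out as follows. The formulas are the common textbook definitions and are an assumption here, since the notes do not spell them out; the function name and sample values are hypothetical.

```python
from math import log, log10
from statistics import mean, stdev

def simple_transformations(values):
    """Apply common simple transformations (values must be positive for the logs)."""
    m, s = mean(values), stdev(values)
    lo, hi = min(values), max(values)
    return {
        "log": [log(v) for v in values],                     # natural log
        "log10": [log10(v) for v in values],                 # base-10 log
        "inverse": [1 / v for v in values],                  # 1/x
        "range": [(v - lo) / (hi - lo) for v in values],     # rescale to 0..1
        "centering": [v - m for v in values],                # subtract the mean
        "standardize": [(v - m) / s for v in values],        # mean 0, std dev 1
    }

transformed = simple_transformations([1.0, 10.0, 100.0])
```

Log-type transformations compress a long right tail, while range, centering, and standardize only shift or rescale the variable without changing the shape of its distribution.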