
It is concerned with reducing the size of a data set either for performance considerations or for specific objectives

Several ways:
1. Reducing the number of data attributes
- to find the optimal subset of attributes best suited to the specific analysis
- Techniques:
- cube aggregation = using the slice operation
- attribute/feature selection = finding which attributes should be included in a specific analysis and filtering out the non-relevant attributes
- dimensionality reduction = merging original attributes to find aggregate attributes that explain the variance in the data in a more concise and effective way (see the sketch below)
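
A minimal dimensionality-reduction sketch using PCA from scikit-learn (the 4-attribute random data and the choice of 2 components are assumptions for illustration):

# Dimensionality reduction: merge the original attributes into a few
# principal components that explain most of the variance.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)            # 100 data points, 4 original attributes
pca = PCA(n_components=2)             # keep 2 aggregate attributes (components)
X_reduced = pca.fit_transform(X)      # 100 x 2 matrix of component scores
print(pca.explained_variance_ratio_)  # share of variance each component explains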

2. reducing the number of attribute values
- To find the optimal subset of attribute values best suited to the specific analysis

Several Ways:
- cube aggregation = using the roll-up operation
- binning = to reduce the distinct number of attribute values by grouping them into specific intervals or bins (see the sketch after this list)
- clustering = to reduce the distinct number of attribute values by way of data clustering and using clusters instead of original attribute values.
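
A minimal binning sketch with numpy (the age values and bin edges are made-up examples):

# Binning: group continuous attribute values into intervals (bins) so the
# attribute has only a few distinct values.
import numpy as np

ages = np.array([3, 17, 25, 34, 49, 52, 68, 81])
edges = [0, 18, 35, 65, 120]                 # bin boundaries (illustrative)
bin_index = np.digitize(ages, edges) - 1     # which bin each value falls into
labels = np.array(["child", "young adult", "adult", "senior"])
print(labels[bin_index])                     # binned labels replace the raw ages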

3. reducing the number of data points/instances
- to extract a sample of data points and analyze them instead of the entire data set

Techniques include:
- probability-based random sampling = using random numbers that correspond to data points, which ensures no correlation between the data points that are selected (a random seed is used to generate the random numbers)
- probability-based stratified random sampling = where the data set is broken down into strata; within each stratum, data points have the same chance of being selected
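
A minimal sampling sketch with pandas covering both approaches (the "segment" stratum column, the 10% sample size, and the random_state value are assumptions):

# Simple random sampling vs. stratified random sampling.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "segment": np.repeat(["A", "B", "C"], [700, 200, 100]),  # strata of different sizes
    "value": np.random.rand(1000),
})

# Probability-based random sampling (random_state acts as the random seed).
simple_sample = df.sample(frac=0.10, random_state=42)

# Stratified random sampling: take 10% within each stratum, so every
# stratum keeps its share of the data.
stratified_sample = df.groupby("segment").sample(frac=0.10, random_state=42)

print(len(simple_sample), stratified_sample["segment"].value_counts().to_dict())
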
1. NORMALIZATION = to align probability distributions of different data attributes with different means and standard deviations
-Techniques:
-Min-max method: to bring all data values into a specific interval between a minimum and a maximum value.
-Z-transformation: centres the data (subtracts the mean from all data points) and scales it (divides all data points by the standard deviation), converting every attribute value to its z-score
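
A minimal normalization sketch with numpy showing both techniques (the sample values are made up):

# Min-max scaling and z-transformation.
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])

# Min-max method: bring all values into the interval [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-transformation: centre (subtract the mean) and scale (divide by the
# standard deviation), so every value becomes its z-score.
x_z = (x - x.mean()) / x.std()

print(x_minmax, x_z)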

2. SMOOTHING = removes noise and other fine-scale variation from data attributes.
-lowers/removes the effect of unstable or infrequent data values.
-captures important patterns in data
-Techniques include: average of all previous data, moving average (simple, tailed, weighted), and exponential smoothing
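
A minimal smoothing sketch with numpy (the series, the window size of 3, and the smoothing factor alpha = 0.3 are illustrative assumptions):

# Simple moving average and exponential smoothing.
import numpy as np

series = np.array([3.0, 8.0, 2.0, 9.0, 4.0, 7.0, 5.0, 10.0])

# Simple moving average over a window of 3 points.
window = 3
moving_avg = np.convolve(series, np.ones(window) / window, mode="valid")

# Exponential smoothing: each smoothed value is a weighted mix of the new
# observation and the previous smoothed value.
alpha = 0.3
smoothed = [series[0]]
for value in series[1:]:
    smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])

print(moving_avg, smoothed)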

3. RESOLVE SKEWNESS = removes any significant left or right distributional skewness of data.
-If target or label is skewed, then bad predictive models will be built as a result of imbalanced data.
-If the predictor attribute is skewed, then some useful features of data may be lost or ignored by the predictive model.
-Techniques:
-Log transformation: replaces each attribute value with its logarithm

-Statistical approaches: the Box-Cox method
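
A minimal skewness sketch with numpy and scipy, showing the log transformation and the Box-Cox method on made-up right-skewed data:

# Reduce right skewness with a log transform and with Box-Cox.
import numpy as np
from scipy import stats

x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed, positive values

x_log = np.log(x)                   # log transformation (requires x > 0)
x_boxcox, lam = stats.boxcox(x)     # Box-Cox estimates the best power lambda

print(stats.skew(x), stats.skew(x_log), stats.skew(x_boxcox))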

4. RESOLVE OUTLIERS = to resolve data points that are far from the mainstream of the data.
-outliers are not to be confused with data errors
-outliers can be hard to define

-Outliers can be treated in two ways:
-identify and remove them from the data
-identify them and transform all data points to lower/remove their effect
-Spatial Sign is one method that projects all data points onto a multidimensional sphere
-this makes all the samples the same distance from the centre of the sphere (data centering and scaling must be done before the spatial sign transformation)
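
A minimal spatial sign sketch with numpy (the 3-attribute random data is an assumption):

# Spatial sign: after centering and scaling, project each data point onto
# the unit sphere by dividing it by its own vector length.
import numpy as np

X = np.random.rand(50, 3) * 10                     # 50 points, 3 attributes

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)    # centre and scale first
norms = np.linalg.norm(X_scaled, axis=1, keepdims=True)
X_spatial_sign = X_scaled / norms                  # every row now has length 1

print(np.linalg.norm(X_spatial_sign, axis=1)[:5])  # all ~1.0: same distance from the centre
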
General Method:
-a subset of the entire data(ie. sample) is selected for model building or training
-the remaining subset of data is used for model testing or evaluation
-the above steps are taken multiple times
-the overall performance of the model is the average of the performances over the above iterations

1. K-fold cross validation
-the entire data set is randomly partitioned into k-folds(each fold containing a sample)
-k-1 folds are used for model building or training
-the remaining fold is used for model testing or evaluation
-the above procedure is repeated k times
-the overall performance will be the average of the model performances in all the above iterations
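
A minimal k-fold cross-validation sketch with scikit-learn (k = 5, logistic regression, and the random data are placeholder choices):

# K-fold cross-validation: train on k-1 folds, test on the remaining fold,
# repeat k times, and average the scores.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(np.mean(scores))   # overall performance = average over the k iterations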

2. Leave-one-out cross validation (LOOCV)
-the entire data set is partitioned into k-folds
-k = total number of instances in the entire data set
-k-1 folds are used for model building or training
-the remaining fold is used for model testing or evaluation
-the above procedure is repeated k times
-the overall performance will be the average of the model performances in all the above iterations
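
A minimal LOOCV sketch with scikit-learn, equivalent to k-fold with k equal to the number of instances (the data and model are placeholders):

# Leave-one-out cross-validation: one held-out instance per iteration.
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.linear_model import LogisticRegression

X = np.random.rand(30, 4)
y = np.random.randint(0, 2, size=30)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(scores.mean())     # average of the 30 single-instance evaluations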

3. Bootstrapping
-randomly take N data instances from the entire data set with N instances
-random selection is done with replacement
-some data instances may not be selected, some may be selected more than once.
-on average, only about 63.2% of the distinct instances are selected (roughly 1 - 1/e)

-use the sample taken for model building or training
-use the remaining set (ie. the out-of-bag or out-of-bootstrap sample) for model testing or evaluation
-repeat the procedure several times; the overall performance will be the average of the performances over the above iterations
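
A minimal bootstrapping sketch with numpy (N = 10000 and the seed are arbitrary):

# Bootstrapping: draw N instances with replacement; the unselected
# instances form the out-of-bag (out-of-bootstrap) set.
import numpy as np

N = 10000
rng = np.random.default_rng(42)
boot_idx = rng.integers(0, N, size=N)             # N draws with replacement
out_of_bag = np.setdiff1d(np.arange(N), boot_idx)

# Roughly 63.2% of instances appear in the bootstrap sample (1 - 1/e).
print(len(np.unique(boot_idx)) / N, len(out_of_bag) / N)
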
General Methods:
-create single (base) models for the data set
-create a meta-model (meta-function) to choose among the models to classify a new single data instance

1. Bagging (bootstrap aggregation)
-create M different subsets of data instances using bootstrapping (ie. in a data set with N instances, extract N instances randomly with replacement)
-create M separate base models for the different subsets of data instances above
-create an aggregation model over the base models created.
-final prediction will be the prediction of the aggregation model above.
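
A minimal bagging sketch with scikit-learn's BaggingClassifier, whose default base model is a decision tree (M = 25 and the random data are placeholder choices):

# Bagging: M base models, each trained on its own bootstrap sample,
# aggregated by voting.
import numpy as np
from sklearn.ensemble import BaggingClassifier

X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

bagger = BaggingClassifier(n_estimators=25, bootstrap=True)
bagger.fit(X, y)
print(bagger.predict(X[:5]))   # final prediction = aggregate of the 25 base models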

2. Random forest(feature bagging)
-create M random subsets of data attributes or features(all data instances)
-create M separate base models for different subsets of data features above
-create an aggregation method over the M base models created
-final prediction will be the prediction using the aggregation method.
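
A minimal random forest sketch with scikit-learn (n_estimators = 50, max_features = "sqrt", and the random data are placeholder choices):

# Random forest: feature bagging, where each tree considers a random
# subset of attributes when splitting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 8)
y = np.random.randint(0, 2, size=200)

forest = RandomForestClassifier(n_estimators=50, max_features="sqrt")
forest.fit(X, y)
print(forest.predict(X[:5]))   # final prediction = majority vote of the trees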

3. Stacking
-create M separate base models, each trained on the entire data set (ie. all data instances and all attributes)
-create an aggregation method over the M base models created.
-final prediction will be the prediction using the aggregation method.
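
A minimal stacking sketch with scikit-learn's StackingClassifier, using a decision tree and a k-nearest-neighbours model as base models and logistic regression as the meta-model (all choices and data are placeholders):

# Stacking: different base models on the full data set, combined by a meta-model.
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.predict(X[:5]))    # final prediction comes from the meta-model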

4. Boosting
-step 1: extract a training subset of data instances using bootstrapping
-step 2: create a base(weak) model on the training data above and evaluate the weak model
-step 3: reweight data instances
-correctly classified: lower weight
-incorrectly classified: higher weight
-go through steps 1-3 M times to create M weak learners
-final prediction will be the prediction using the aggregate method on the M weak learners above.
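
A minimal boosting sketch with numpy and scikit-learn decision stumps; it reweights instances directly via sample weights instead of bootstrap resampling, and the doubling rule is a simplification rather than the exact AdaBoost update:

# Boosting: after each weak learner, reweight the instances so the next
# learner focuses on the ones that were misclassified.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)
weights = np.full(len(y), 1.0 / len(y))
learners = []

for m in range(5):                         # M = 5 weak learners (illustrative)
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    wrong = stump.predict(X) != y
    weights[wrong] *= 2.0                  # incorrectly classified: higher weight
    weights /= weights.sum()               # correctly classified end up relatively lower
    learners.append(stump)

# Final prediction: aggregate (majority vote) over the M weak learners.
votes = np.mean([l.predict(X[:5]) for l in learners], axis=0)
print((votes >= 0.5).astype(int))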

5. AdaBoost (Adaptive Boosting)
-one of the most popular implementations of boosting
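
A minimal usage sketch with scikit-learn's AdaBoostClassifier (n_estimators = 50 and the random data are placeholders):

# AdaBoost: adaptive boosting over decision stumps.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

ada = AdaBoostClassifier(n_estimators=50)
ada.fit(X, y)
print(ada.predict(X[:5]))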