CIS 4640: Exam #1 Review
Business Intelligence (BI)
is an umbrella term that refers to a variety of software applications used to analyze an organization's raw data. BI as a discipline is made up of several related activities, including data mining, online analytical processing, querying and reporting.
Data mining
is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques.
Business analytics
is comprised of solutions used to build analysis models and simulations to create scenarios, understand realities and predict future states. Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user.
Data science (data-driven science)
is an interdisciplinary field that uses scientific processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured. It is a continuation of data analysis fields such as statistics, machine learning, data mining, and predictive analytics.
Input Variable (independent variable, predictor, attribute, dimension and feature)
is one whose value determines the value of other variables. It is usually denoted as X.
Output Variable (dependent variable, target variable, response variable, or label)
is one whose value is determined by one or more independent variables. It is usually denoted as Y.
Identifier variable
is a special type of variable that is neither input nor output, but is used to identify individual records (such as SSN, WIN, SKU, ProductID, etc.)
Supervised learning models
predict the value of the response variable based on a set of predictor variables.
-The keyword "supervised" refers to the fact that a correct answer (i.e., the response value) is known for each set of predictor values in the training data.
-E.G. Predict sales (response variable) based on spending on TV ads, new competitors, and new products from competitors (predictor variables).
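The sales example can be sketched in a few lines of Python. This is a minimal illustration only: the numbers are made up, a single predictor (TV ad spending) stands in for the full set, and ordinary least squares is just one of many supervised techniques. The helper name fit_line is hypothetical.

```python
# Hypothetical data: TV ad spending (predictor X) and sales (response Y).
# The known Y values are the "correct answers" that supervise the learning.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(ad_spend, sales)

# Predict the response for a predictor value the model has not seen.
predicted_sales = intercept + slope * 6.0
print(intercept, slope, predicted_sales)
```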
Unsupervised learning models
uncover hidden patterns in the data. There are no response variables to predict.
-E.G. Market segmentation, customer classification, grouping of patients for ER re-admittance
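As a rough sketch of how grouping can work without a response variable, the snippet below clusters customers by a single made-up spending value. It uses k-means, which these notes do not name. It is chosen here only as one common clustering method, and the function name kmeans_1d and the starting centers are assumptions for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny one-dimensional k-means: returns (final centers, clusters)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spending per customer; no labels are provided.
spending = [20, 22, 25, 200, 210, 220]
centers, clusters = kmeans_1d(spending, centers=[20, 220])
print(centers, clusters)
```

The two groups that emerge (low spenders and high spenders) were never given to the algorithm; they come from the data alone.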
Classification
predicts whether a data point belongs to one of the predefined classes. The prediction is based on learning from a known data set.
Regression
predicts the numeric target label of a data point. The prediction is based on learning from a known data set.
Anomaly detection
predicts whether a data point is an outlier compared to other data points in the data set.
Time series forecasting
predicts the value of the target variable for a future time frame based on historical values.
Clustering
identifies natural clusters within the data set based on inherent properties of the data.
Association analysis
identifies relationships within an item set based on transaction data.
1. Business understanding: understand project objectives and requirements from a business perspective and convert them into a data mining or analytics problem definition
2. Data understanding: collect data, identify data quality problems and discover initial insights
3. Data preparation: construct the final data set
4. Modeling: select modeling techniques and calibrate parameters
5. Evaluation: thoroughly evaluate the model(s) and review the steps to identify whether there are business issues not covered by the data mining solution
6. Deployment: carry out activities to apply the results from DM. These range from generating DM reports to implementing a repeatable DM process.
Data Quality: Missing Values
records that may have values missing in one or more columns
Data Quality: Data types and conversion
the type of data that a variable holds needs to be consistent with the selected analytical techniques
E.G. Categorical data can't be used for correlation or as the dependent variable of linear models.
Data Quality: Transformation
converting an existing variable into a different data type (e.g., categorical to numerical) or value (e.g., scaling).
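Both kinds of transformation can be sketched briefly. The example below shows min-max scaling (a change of value) and one-hot encoding (categorical to numerical). Neither method is named in these notes; they are common choices used here for illustration, and the function names and data are hypothetical.

```python
def min_max_scale(values):
    """Rescale numeric values to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Turn each category into a row of 0/1 indicators, one per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

incomes = [20, 40, 60, 100]        # made-up numeric variable
colors = ["red", "blue", "red"]    # made-up categorical variable

scaled = min_max_scale(incomes)
encoded = one_hot(colors)          # columns are ordered: blue, red
print(scaled, encoded)
```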
Data Quality: Outliers
anomalies in the data set. These anomalies may occur legitimately or erroneously
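One common way to flag candidate outliers is the 1.5 x IQR rule, sketched below. The rule is a widely used convention rather than something stated in these notes, the data are made up, and a flagged value still has to be checked to decide whether it is legitimate or an error.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(readings)
print(flagged)
```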
Data Quality: Feature Selection
a technique to reduce the variables or attributes to a smaller set without sacrificing much model accuracy.
Data Quality: Data Sampling
sampling reduces the amount of data that needs to be processed for analysis and modeling.
-It may introduce sampling error, making the sample not representative of the population.
-Sampling is generally preferable to analyzing the entire population when obtaining the population is costly, infeasible or impossible.
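Sampling error can be seen by comparing a sample statistic with the same statistic for the population. The sketch below assumes a made-up population of 1,000 values and a simple random sample of 50; the seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean

# Hypothetical population: the integers 1..1000.
population = list(range(1, 1001))

# Draw a simple random sample without replacement.
rng = random.Random(42)
sample = rng.sample(population, 50)

# Sampling error: how far the sample mean lands from the population mean.
sampling_error = mean(sample) - mean(population)
print(len(sample), mean(population), sampling_error)
```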
Training versus testing (or test) data sets
-Training data set: the data set used to build your model
-Test data set: the data set used to validate your model
-Both come from the same original data set.
-A rule of thumb is to use two thirds of the data for training and one third for testing.
-80/20 and 70/30 splits are also popular.
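A split like this can be sketched with the standard library alone. The function name split_records and the 30 placeholder records are hypothetical; shuffling before cutting is an assumption made so that both parts resemble the original data set.

```python
import random

def split_records(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records, then cut it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(30))  # stand-in for 30 rows of one original data set

train, test = split_records(records)                             # two thirds / one third
train_80, test_20 = split_records(records, train_fraction=0.8)   # 80/20
print(len(train), len(test), len(train_80), len(test_20))
```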
Models built need to be evaluated for their performance.
-We do not want a model to memorize and output only the same values that are in the training data set.
-Why? If a model memorizes the training data set very, very well, there is a good chance that it won't generalize well to unseen data.
-We want our model to be strong on both training data (the data the model has seen) and test data (the data the model has not seen) to ensure that the model is robust.
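The difference between memorizing and generalizing can be shown with a deliberately extreme toy case. Everything below is made up for illustration: the data follow y = 2x + 1, the "memorizer" is a lookup table, and the "rule" model is assumed to have learned the underlying relationship.

```python
# Hypothetical data where y = 2x + 1.
train_data = {1: 3, 2: 5, 3: 7}   # seen while building the model
test_data = {4: 9, 5: 11}         # never seen by the model

def memorizer(x):
    """Returns the stored answer; has nothing to offer for unseen inputs."""
    return train_data.get(x)

def rule(x):
    """A model that captured the underlying pattern."""
    return 2 * x + 1

def accuracy(model, data):
    """Fraction of records for which the model outputs the known answer."""
    return sum(model(x) == y for x, y in data.items()) / len(data)

results = {
    "memorizer": (accuracy(memorizer, train_data), accuracy(memorizer, test_data)),
    "rule": (accuracy(rule, train_data), accuracy(rule, test_data)),
}
print(results)
```

The memorizer looks perfect on the training data and fails on the test data, which is why performance is checked on both.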
Record
an example or observation
Data set
a set of records
Missing completely at random (MCAR)
-The probability that a value is missing is unrelated to the value of any other variables.
-More than 80% of females did not respond to a certain question. A type of MCAR?
-Missing sales data versus machine malfunctions, bad weather, and traffic.
Missing not at random (MNAR)
Missing values in "household income" from low-income respondents
Fixing Missing Values: Imputation
replace missing values with a particular value
-Mean substitution: replaces missing values with the mean of the variable.
-Mode substitution: replaces missing values with the statistical mode, i.e., the most frequent data value.
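Mean and mode substitution can be sketched as follows. The helper name impute, the use of None to mark a missing value, and the sample data are all assumptions for illustration.

```python
from statistics import mean, mode

def impute(values, strategy):
    """Fill None entries with the mean or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40]                 # numeric: mean substitution
colors = ["red", "blue", None, "red"]     # categorical: mode substitution

ages_filled = impute(ages, "mean")
colors_filled = impute(colors, "mode")
print(ages_filled, colors_filled)
```

Mean substitution only makes sense for numeric variables, while mode substitution also works for categorical ones.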
Fixing Missing Values: Modeling
use a DM or statistical technique to predict the missing values.
Classification
use of predictors (i.e., independent variables) to sort or classify records into two or more distinct classes (buckets or categories).