Data Mining Exam 1
Terms in this set (96)
Business Intelligence
is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies
2 Major Objectives of Business Intelligence:
1) Enable easy access to data to provide business managers with the ability to conduct analysis
2) Transform data to information (knowledge) and to decisions that finally lead to action
Data mining is the extraction of interesting ___________ or knowledge from ___________ volumes of data.
patterns; large
Data mining is an attempt to remove some of the ____________ associated with __________-____________ in the business environment.
uncertainty; decision-making
Data Mining Vertical Use Cases
Healthcare, Logistics, Telecommunications
Data Mining Horizontal Use Cases
Customer Analysis, Revenue Generation, Human Resources
What are the 3 data mining approaches?
descriptive, predictive, prescriptive
What is descriptive data mining?
characterizes the properties of the data
What type of data mining approach is this: What has happened?
descriptive
What is predictive data mining?
makes inferences from the data
What type of data mining is this: What may happen?
predictive
What is prescriptive data mining?
attempts to direct decision making
What type of data mining is this: What should be done?
prescriptive
What is the Data Mining Process?
data -> target data -> clean data -> patterns -> insight -> action
What is Big Data?
big data refers to situations where datasets are so large they cannot be stored or analyzed using traditional methods
Data is almost always organized in a _______________ format.
tabular/matrix
Rows
tuples and observations
Columns
variables, dimensions, features, and inputs/targets
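This row/column layout maps directly onto a 2-D array; a minimal sketch with numpy and hypothetical customer data:

```python
import numpy as np

# Hypothetical dataset: each row is an observation (tuple),
# each column a variable/feature; the last column is the target.
data = np.array([
    [34, 52000, 1],   # age, income, churned?
    [41, 61000, 0],
    [29, 48000, 1],
])

X = data[:, :2]   # inputs (independent variables)
y = data[:, 2]    # target (dependent variable)
```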
Common Data Types: Categorical
the value is a name that identifies a specific category
__________ is a commonly occurring nominal variable in business data.
Region
Common Data Types: Ordinal
the value identifies a category but is also associated with a rank, e.g. 1st, 2nd, 3rd
Types of nominal data
ordinal and categorical
Common Data Types: Numeric
data has meaningful intervals between measurements
e.g. temperature
Common Data Types: Ratio
data has meaningful intervals between measurements and a true zero point, e.g. weight or income
Types of interval data
numeric and ratio
What are identifiers?
identifiers are often the primary key in the database from which the dataset was drawn
identifiers have no predictive value in models; they are only there to help you navigate the data
Common Data Types: Text
can often be distinguished from nominal data by the number of levels present
The independent variable is the _______.
input
The dependent variable is the __________.
target
Accuracy
do the data accurately represent what they are intended to represent
Completeness
do we have all of the necessary data
Timeliness
was the data collected recently enough to still be useful
Believability
can the data be trusted
Interpretability
do we really understand what the data shows
When assessing data quality, you are looking for ___________ quality rather than optimal quality
sufficient
Data Cleaning
dealing with missing values and smoothing noisy data
Data Integration
ensure that the incorporation of data from multiple sources has not introduced inconsistencies into the data
Data Reduction
identifying a smaller subset of the data which can produce the same (or similar) analytical results
Data Cleaning: Missing data approaches
ignore the tuple, manual correction, global constant, central tendency, most probable value
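A minimal sketch of the central-tendency approach, filling a missing value with the median of the observed values (hypothetical incomes):

```python
import statistics

# Hypothetical income column with one missing value (None).
incomes = [52000, 61000, None, 48000]

observed = [v for v in incomes if v is not None]
fill = statistics.median(observed)                       # central tendency
filled = [v if v is not None else fill for v in incomes]
```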
Data Cleaning: Noisy data approaches
binning, regression, outlier analysis
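Binning smooths noisy data by replacing each value with a summary of its bin; a sketch using equal-frequency bins and bin means (made-up values):

```python
# Sort, split into equal-frequency bins of 3, then replace each value
# with its bin's mean (smoothing by bin means).
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bins = [values[i:i + 3] for i in range(0, len(values), 3)]
smoothed = [sum(b) / len(b) for b in bins for _ in b]
```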
Data Integration: Entity Identification
how do we match records in one data source with those in another
Data Integration: Redundancy
can a given field be derived from others within the data set
Data Integration: Duplication
can result from data redundancy in underlying data sources
Data Integration: Data Value Conflict
if data source A indicates the price is 24.99 and data source B indicates the price is 19.99, which is correct?
Data Reduction: Reducing Rows
Aggregation: increasing the granularity of the data cube will reduce the number of observations present in the resulting data
Sampling: selecting representative observations for analysis while discarding the bulk of the data
Data Reduction: Reducing Columns
Manual Feature Selection: the data analyst can examine the available dimensions and exclude those they feel are not useful for modeling purposes
Feature Selection based on Objective Function: a modeling approach is used to identify the features that appear to have the most influence on the dependent variable
Feature Extraction: maps high dimensional data onto a lower dimensional subspace
Exploratory Data Analysis (EDA)
EDA is an approach that attempts to develop an understanding of the data to facilitate the selection of the best possible models
Linear Regression
linear regression uses ordinary least squares to allow us to predict a target (dependent) variable based on one or more input (independent) variables
3 requirements of Linear Regression
1) dependent variable must be an interval variable
2) independent variables may be interval or nominal
3) predicted values represent the mean of the target variable at the given values of the independent variables
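The ordinary-least-squares fit can be sketched with numpy's `lstsq` (hypothetical x/y values; the leading column of ones supplies the intercept):

```python
import numpy as np

# Design matrix: a column of 1s (intercept) plus one input variable.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS solution
intercept, slope = beta
```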
What is dummy coding?
dummy coding is the practice of converting a nominal variable into several binary variables (1's and 0's) that can be used in a regression model
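A sketch of dummy coding the Region variable mentioned earlier, omitting one level as the reference category (the level names are made up):

```python
# Each remaining level becomes a binary column; the omitted level
# ("West") is represented by all zeros.
regions = ["East", "West", "North", "East"]
levels = ["East", "North"]   # reference level "West" dropped
dummies = [[1 if r == lvl else 0 for lvl in levels] for r in regions]
```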
Regression Assumptions: Model Parameters
1) Linearity: the relationship between each independent variable and the dependent variable is a line, holding the other variables fixed
2) No Multicollinearity: if there are multiple predictors in the model, they should not be highly correlated with one another
Regression Assumptions: Residuals
1) Homoscedasticity (constant variance): the variance of the residuals should show no pattern with increasing values of the dependent variable
2) No Autocorrelation (statistical independence of errors) there should be no correlation between consecutive error terms
3) Normal Distribution: the residuals should be normally distributed
Regression assumptions are generally evaluated visually using a plot of residuals (y − ŷ) on the y axis and predicted values (ŷ) on the x axis.
Normal distributions:
are desirable but not always present in a given data set
Transformation:
involves mathematically manipulating the data to make modeling more feasible
Significance of the Model (F test and associated p-value)
tests overall significance of the model
Significance of the predictors (t tests and associated p-values): tests the likelihood that the coefficient (b) of the given predictor is zero
Proportion of variance explained (r2 and adjusted r2)
indicates how much of the variation in the dependent variable is explained by the model
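r² can be computed directly from its definition, 1 − SS_res/SS_tot (hypothetical actual/predicted values):

```python
# r^2 = 1 - SS_res / SS_tot: share of variance in y explained by the model.
y     = [3.0, 5.0, 7.0, 9.0]
y_hat = [3.2, 4.8, 7.1, 8.9]

mean_y = sum(y) / len(y)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total variation
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
r2 = 1 - ss_res / ss_tot
```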
Dimensionality reduction
is a set of techniques used to reduce the amount of data necessary for prediction while still providing accurate models
Dimensionality Reduction: Feature selection
selectively exclude dimensions from consideration
Dimensionality Reduction: Feature Extraction
mathematically combine dimensions to produce intrinsic/latent dimensions
maps high dimensional data onto a lower dimensional subspace
2 Types of Feature Selection:
1) Manual feature selection: the data analyst can examine the available dimensions and exclude those they feel are not useful for modeling purposes
2) feature selection based on objective function: a modeling approach used to identify the features that appear to have the most influence on the dependent variable
The Curse of Dimensionality
is based on the fact that statistical methods count observations that occur in a given space
as dimensions increase, the data needed to make accurate inferences grows exponentially and observations become sparser
model performance often suffers as dimensionality increases
Principal Component Analysis (PCA)
is a feature extraction method which takes a classical linear approach to dimension reduction
projects high-dimensional data into a lower dimensional sub-space using linear transformation
What is the goal of PCA?
to reduce dimensionality while retaining as much information as possible in the dataset
2 PCA Requirements
1) PCA only involves predictors, not target variables
2) PCA can only be performed on dimensions which are numeric in nature
Matrix
a rectangular array of rows and columns
Matrix: Principal
diagonal from upper-left to lower right of a matrix
Matrix: principal elements
elements of the diagonal
Matrix: trace
sum of the principal elements
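The principal elements and trace fall out directly with numpy:

```python
import numpy as np

A = np.array([[2, 7],
              [1, 5]])
principal = np.diag(A)   # principal elements: [2, 5]
trace = np.trace(A)      # sum of the principal elements: 7
```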
Diagonal matrix
matrix in which all non-diagonal elements are zero
Identity/unity matrix
scalar matrix in which all diagonal elements equal 1
Scalar matrix
diagonal matrix in which all diagonal elements are equal
Transpose matrix Aᵀ of matrix A
obtained by converting rows to columns and columns to rows
Determinant
a function that associates a scalar to a square matrix
Determinant: Inverse
B is the inverse of A if A*B = I
Determinant: Singularity
A matrix with a non-zero determinant is called non-singular (it has an inverse)
Conversely, a matrix with a zero determinant is called singular (it has no inverse)
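Determinant, inverse, and singularity, sketched with numpy:

```python
import numpy as np

A = np.array([[2.0, 7.0],
              [1.0, 5.0]])      # det = 2*5 - 7*1 = 3, so A is non-singular
B = np.linalg.inv(A)            # inverse exists
I = A @ B                       # A * A^-1 gives the identity matrix

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])      # det = 1*4 - 2*2 = 0, so S is singular
```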
What is an eigenvalue?
a measure of the variation within the data along a particular path
The largest eigenvalue is referred to as the ___________.
principal eigenvalue
What is an eigenvector?
the magnitude and direction of a path through the data
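For a diagonal matrix the eigenvalues can be read straight off the diagonal, which makes a handy check:

```python
import numpy as np

C = np.array([[2.0, 0.0],
              [0.0, 5.0]])            # variation along two perpendicular paths
vals, vecs = np.linalg.eig(C)
principal_eigenvalue = max(vals)      # largest variation: 5.0
```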
2 common approaches for choosing how many principal components to keep:
1) Threshold: determine how much of the information you want to retain, keep enough components to satisfy that threshold
2) Scree plot: order eigenvalues in descending order and plot the nth eigenvalue against n
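A sketch of the threshold approach: center the data, eigendecompose the covariance matrix, and keep enough components to retain 95% of the variance (synthetic data in which one column nearly duplicates another):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)  # near-duplicate dimension

Xc = X - X.mean(axis=0)                # PCA needs centered, numeric data
cov = np.cov(Xc, rowvar=False)
vals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues in descending order
explained = np.cumsum(vals) / vals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1   # components kept for 95%
```

With two strong directions and one near-redundant one, two components clear the threshold.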
What are some issues with PCA?
Covariance is sensitive to large values
PCA assumes the underlying subspace is linear (i.e., that variables are numeric) and thus transformations may be required
What is a cluster?
a collection of data objects
large similarity among objects in the same cluster
dissimilarity among objects in different clusters
Cluster analysis is also known as ____________.
segmentation
Good clustering will produce high quality clusters with:
high intra-class similarity and low inter-class similarity
Quality of the clustering depends on:
the similarity measure used and the implementation
quality can also be measured by the ability to discover hidden patterns
Minkowski distance
a means of calculating the distance between points in n-dimensional space
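A small implementation; order h = 1 gives the Manhattan distance and h = 2 the Euclidean:

```python
def minkowski(p, q, h):
    """Minkowski distance of order h between points p and q."""
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

# |1-4| + |2-6| = 7 (Manhattan); sqrt(3^2 + 4^2) = 5 (Euclidean)
```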
What is standardization?
an important consideration when performing cluster analysis
What are 2 approaches to standardization?
1) z score
2) scaling to [0, 1]
Standardization: z score
subtract the mean from each observation and divide by the sample standard deviation
resulting data will have a mean of zero and a standard deviation of one
Standardization: scaling
subtract the minimum value from each observation and divide by the range
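Both standardization approaches in a few lines (made-up values):

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0]

# z score: resulting data has mean 0 and standard deviation 1
mean = statistics.mean(values)
sd = statistics.stdev(values)           # sample standard deviation
z = [(v - mean) / sd for v in values]

# scaling to [0, 1]: subtract the minimum, divide by the range
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
```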
Clustering Approaches: Partitional Clustering
goal is to partition a dataset containing n objects into k clusters
given k, find a partition of k clusters that optimize the chosen partitioning criterion
k-means
each cluster is represented by a calculated centroid
k-medoids
each cluster is represented by one of the objects in the cluster
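A minimal k-means sketch (k = 2, one-dimensional points, hypothetical starting centroids): assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster:

```python
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [2.0, 11.0]   # hypothetical initial centroids

for _ in range(5):        # a few Lloyd iterations
    clusters = [[], []]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) for c in clusters]  # calculated centroids
```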
Clustering Approaches: Hierarchical Clustering
goal is to identify the hierarchies between objects in the dataset such that they can be represented in a nested tree structure