Data Analytics Journey (part 2)
Study Guide Version
Terms in this set (95)
Business Understanding/Discovery phase
DALC
Tools:
1. Scope Statement
2. Stakeholder Register
3. Gantt Chart
4. Network Diagram
Techniques:
1. Critical Path Method
2. KPIs
3. Budget estimation techniques
4. Schedule estimation techniques
5. SWOT analysis
Business Understanding/Discovery phase
DALC
An analyst defines the major questions of interest that meet the needs of the stakeholders, assesses the resource constraints of the project, and defines the project outcomes.
Business Understanding/Discovery phase
DALC
Potential Problems:
A lack of clear focus on:
- Stakeholders
- Timeline
- Limitations
- Budget
could potentially derail an analysis.
Data Acquisition
DALC
The data collection phase. Data is collected and stored for easy retrieval from a database, perhaps a component of a data warehouse, using languages like SQL. Web scraping and surveys are also used to acquire data.
Data Acquisition
DALC
Potential Problems:
1. Quality: uniqueness, relevance, reliability, validity, and accuracy
2. Type of data: structured, unstructured, semi-structured, quantitative, qualitative
Either of these may make access more difficult.
Data Acquisition
DALC
Tools:
1. SQL
2. Web scraping software
3. Surveys
4. Input data: self-generated data
5. NoSQL: used to collect unstructured data
Techniques:
1. ETL
2. APIs
3. Web scraping
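As an illustration of this phase, here is a minimal Python sketch of SQL-based retrieval using the standard-library sqlite3 module; the sales table and its values are hypothetical:

```python
import sqlite3

# Hypothetical example: build a tiny in-memory database, then query it back out
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 120.0), ("West", 340.5), ("East", 99.5)])

# The acquisition step: retrieve stored data with a SQL query
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
print(rows)  # [('East', 219.5), ('West', 340.5)]
conn.close()
```

In practice the connection would point at an existing database or warehouse rather than an in-memory table.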
Data Cleaning
DALC
Also known as data cleansing, data wrangling, data munging, and feature engineering. The analyst uses SQL, Python, R, or Excel to perform data modifications and transformations.
Data Cleaning
DALC
Tools:
1. Python
2. R
3. SQL
4. Excel
Techniques:
1. Data Reduction: optimize storage capacity
2. Modification
3. Transformation
4. Anomaly detection
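A minimal pandas sketch of the modification and transformation steps above; the column names and the median-imputation rule are hypothetical choices, not a prescribed recipe:

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing value, and inconsistent text
raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "CARLA"],
    "spend": [100.0, 100.0, None, 250.0],
})

clean = (
    raw.drop_duplicates()  # modification: remove exact duplicate rows
       .assign(
           customer=lambda d: d["customer"].str.title(),            # transformation: normalize text
           spend=lambda d: d["spend"].fillna(d["spend"].median()),  # impute the missing value
       )
)
print(clean)
```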
Data Cleaning
DALC
Potential Problems:
1. Some cleaning techniques could dramatically change data/outcomes
2. Outliers that are not dealt with can cause problems for statistical models due to excessive variability.
Data Exploration
DALC
The analyst begins to understand the basic nature of the data, the relationships within it (between variables), the structure of the dataset, the presence of outliers, and the distribution of data values. This phase uses data visualization tools and numerical summaries such as measures of central tendency and variability.
Data Exploration
DALC
Tools:
1. Distributions: normal or skewed curves
2. Visualization tools: Tableau, R, Python, RStudio, and histograms
3. Statistical tools such as mean, median, and mode
Techniques:
1. Correlation discovery
2. Pattern discovery
3. Visualization techniques: histograms, charts, tables, boxplots, etc.
4. Variability: standard deviation, quartiles
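A short Python sketch of this phase, computing central tendency, variability, and histogram bin counts on a made-up sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of one numeric variable; 40 is a likely outlier
values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 7, 40])

# Central tendency and variability
print(values.mean(), values.median(), values.mode().iloc[0])
print(values.std())
print(values.quantile([0.25, 0.5, 0.75]))

# Histogram bin counts reveal the shape of the distribution and the outlier
counts, edges = np.histogram(values, bins=5)
print(counts)  # most observations fall in the first bin; 40 sits alone at the far end
```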
Data Exploration
DALC
Potential Problems:
Skipping this step could:
1. Enable faulty perceptions of the data, which hurt advanced analytics.
2. Leave the analyst without insight into the structure of the dataset.
Predictive Modeling
DALC
Allows the analyst to move beyond describing the data to creating models that enable predictions of outcomes of interest. Python and R are used to automate the training and use of models.
Predictive Modeling
DALC
Tools:
1. Python
2. R
Techniques:
1. Data modeling
2. Correlation modeling
3. Regression modeling
4. Time series modeling
5. Cross-validation
6. Classification models
7. Clustering
8. Training
Predictive Modeling
DALC
Potential Problems:
1. Too many input variables (predictors) can cause problems.
2. Correlation does not imply causation.
3. Time series models often need a sufficient span of time data to offer precise trends.
4. Predictive model accuracy should be assessed using cross-validation.
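A minimal scikit-learn sketch of cross-validation (point 4 above), using a synthetic dataset in place of real project data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a cleaned analytics dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5-fold cross-validation: fit on four folds, score on the held-out fold, repeat
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean())  # average R^2 across folds estimates out-of-sample accuracy
```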
Data Mining
DALC
Looks for patterns in large sets of data. Tools are Python and R. Also called machine learning: a specialized segment of data mining techniques that continually update to improve modeling over time.
Both exploration and mining uncover patterns. The difference is that data exploration is an initial step to uncover initial patterns using both manual and automated methods, while data mining is an in-depth step to discover patterns using automated methods like machine learning.
The Difference between Data Exploration and Data Mining:
Data Mining
DALC
Tools:
1. Python
2. R
Techniques:
1. Training dataset to build models.
2. Testing dataset for model evaluation
3. Classification
4. Clustering
5. AI
6. Machine Learning
7. Deep Learning
Data Mining
DALC
Potential Problems:
1. Running models on the entire dataset is problematic.
2. The data needs to be subset into training and testing datasets to build models.
3. Training data: the machine learns on training data to improve models.
4. Testing data: used to evaluate the model itself.
5. Too small a sample could cause limited insight.
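A minimal sketch of the training/testing split described above, using scikit-learn's bundled iris dataset rather than project data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Subset the data rather than mining the entire dataset at once
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # learn on training data
print(model.score(X_test, y_test))  # evaluate on data the model has never seen
```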
Data Reporting
DALC
The analyst tells the story of the data and uses graphs or interactive dashboards to inform others of the findings from the analyses. Tools such as Tableau are used to spot trends and patterns. The goal is to give actionable insight to stakeholders.
Data Reporting
DALC
Tools:
1. Dashboards: Tableau
2. Storytelling: a feature of Tableau
3. Graphs, charts, images, histograms, etc.
Techniques:
1. Visualization
2. Stakeholder communication
Data Reporting
DALC
Potential Problems:
1. Because reports may reach a large audience, mistakes can cause bad business decisions and loss of revenue.
2. Improper scales on graphs could push inaccurate interpretations of the story.
1. Business Understanding
2. Data Acquisition
3. Data Cleaning
4. Data Exploration
5. Predictive Modeling
6. Data Mining
7. Data Reporting
Data Analytics Life Cycle phases:
1. Planning
2. Wrangling
3. Modeling
4. Applying / Reporting and Visualization
Data Pathway Phases:
1. Find a question
2. Collect the data
3. Prepare the data
4. Create the Model
5. Evaluate the Model
6. Deploy the Model
Data Science Phases:
Descriptive Analytics
What Happened?
Observation/Describe event. It is the interpretation of historical data to better explain market developments.
Diagnostic Analytics
Why did it happen?
Explains the reason for the event. It enables the extraction of value from data by posing the right questions and conducting in-depth investigations into the problems.
Predictive Analytics
What will happen?
Correlation. Predicts what will happen in the future. It uses data, statistical algorithms, and machine learning techniques to determine the likelihood of potential outcomes. The aim is to have the best assessment of what will happen in the future, rather than simply understanding what has happened.
Prescriptive Analytics
How can we make it happen?
Change/Action/Solution/Causality/Manipulation/Decision Making. It helps organizations make decisions.
Predictive and prescriptive analytics are two forward-looking tools used by business leaders. Predictive analytics uses collected data to forecast future outcomes, and prescriptive analytics takes that data and makes decisions that cause future outcomes.
What is the relationship between predictive and prescriptive analysis?
Structured
Data Types:
Numbered and labeled data stored in an organized framework with columns and rows, e.g. SQL databases, Excel, etc.
Semi-Structured
Data Types:
Loosely organized in categories using tags, e.g. emails, CSV, XML, JSON, etc.
Unstructured
Data Types:
Text-heavy information not organized in a clearly defined framework, e.g. text, videos, audio, etc.
Quantitative
Data Types:
Known as numerical, parametric, or interval data
Qualitative
Data Types:
known as nominal or ordinal. Describes the basic features of the data in a study.
Relational Database
Collection of data items with predefined relationships between them, e.g. collection of tables
In-house, open data, web server, data lake, data warehouse, self-generated
List various data sources:
Python
Open-source general-purpose programming language. Provides a more general approach and has several libraries that are useful for data science. Used by engineers and programmers.
R
open-source programming language with new libraries or tools added continuously. Mainly used for statistical analysis. Used by statisticians, educational researchers, etc.
Tableau
Visual analytics engine that makes it easier to create interactive visual analytics in the form of dashboards.
API (Application Programming Interface)
A software intermediary that allows two applications to talk to each other. An ______ is the messenger that delivers your request to the provider that you are requesting it from and then delivers the response back to you, e.g. with PayPal.
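A minimal Python sketch of calling an API with the requests library; the URL, parameters, and token are placeholders, not a real service:

```python
import requests

# Hypothetical endpoint: the request goes to the provider, the response comes back
response = requests.get(
    "https://api.example.com/v1/payments",   # placeholder URL
    params={"status": "completed", "limit": 10},
    headers={"Authorization": "Bearer YOUR_TOKEN"},  # placeholder credential
    timeout=10,
)
response.raise_for_status()  # fail loudly on HTTP errors
data = response.json()       # many APIs deliver their response as JSON
```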
XML (Extensible Markup Language)
A markup language is a set of codes, or tags, that describes the text in a digital document.
SQL
A domain-specific language used in programming and designed for managing data in relational database management systems. Helps pull data from databases.
D3.js (Data-Driven Documents)
a JavaScript library for manipulating documents based on data. D3 helps bring data to life using HTML, SVG, and CSS.
Search Engine
Program that searches for and identifies items in a database that correspond to keywords or characters specified by the user.
JSON
A lightweight format for storing and transporting data on networks. Also an open-standard file format and data-interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and array data types.
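A quick Python illustration of JSON's round trip between an in-memory object and its human-readable text form:

```python
import json

# An object with attribute-value pairs and an array, as described above
record = {"name": "Ada", "active": True, "scores": [91, 85]}

text = json.dumps(record)    # serialize to a string for storage or transport
restored = json.loads(text)  # parse the string back into a Python dict
assert restored == record
print(text)  # {"name": "Ada", "active": true, "scores": [91, 85]}
```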
Boxplot
Provides a concise summary of the quartiles of numerical data (i.e., the cut points that divide the data into 25% segments). This graph is also convenient for detecting outliers and skewness.
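A small numpy sketch of the quartile cut points a boxplot draws, plus the common 1.5 * IQR whisker rule for flagging outliers (the sample values are made up):

```python
import numpy as np

data = np.array([4, 5, 5, 6, 7, 8, 9, 10, 30])  # 30 is a suspected outlier

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # the quartile cut points
iqr = q3 - q1

# Common boxplot convention: points beyond 1.5 * IQR of the box are outliers
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(q1, q2, q3, outliers)  # 30 falls outside the upper fence
```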
MLaaS - Machine Learning as a Service
An array of services that provide machine learning tools as part of cloud computing services. Clients benefit from machine learning without the associated cost, time, and risk of establishing an in-house machine learning team.
ETL (extraction, transformation, and load)
type of data integration that is used to blend data from several sources. It's often used to build a data warehouse.
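A minimal pandas sketch of the extract-transform-load pattern; the file names and columns are hypothetical and assume the source CSVs exist:

```python
import pandas as pd

# Extract: read from two hypothetical source files
orders = pd.read_csv("orders.csv")        # placeholder paths
customers = pd.read_csv("customers.csv")

# Transform: blend the sources and derive a reporting column
merged = orders.merge(customers, on="customer_id")
merged["revenue"] = merged["quantity"] * merged["unit_price"]

# Load: write the blended result where the warehouse can ingest it
merged.to_csv("warehouse/orders_enriched.csv", index=False)
```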
ETLTL - Extract, transform, load, transform, and load
Tends to load anything and everything into a warehouse or data lake, from where it can be analyzed at a later point in time.
Training
_______ dataset is used to build a model. Data points in the _______ set are excluded from the test (validation) set.
Test (or validation)
_______ Data set used to validate the model built.
Normal Distribution (Bell-Shaped)
A symmetrical curve centered around the mean. Its data follows the empirical rule, which indicates the percentage of the dataset that falls within (plus or minus) 1, 2, and 3 standard deviations of the mean.
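The empirical rule can be checked numerically with a simulated sample; a sketch assuming a standard normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # simulated bell-shaped data

# Empirical rule: about 68%, 95%, and 99.7% of values fall within 1, 2, 3 std devs
for k in (1, 2, 3):
    share = np.mean(np.abs(x - x.mean()) <= k * x.std())
    print(k, round(share, 3))  # roughly 0.683, 0.954, 0.997
```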
Bell curve with long tail end
The portion of the distribution having many occurrences far from the central part of the distribution. In sales, it may mean more people buying individualized niche products.
Histogram
A simple and commonly used plot to quickly check the distribution of a sample of data. The data is divided into a pre-specified number of groups called bins. The data is then sorted into each bin and the count of the number of observations in each bin is retained. It helps show outliers in data and skewness.
Qlik
end-to-end platform which includes data integration.
Heatmap
A colorful graph that can visually show frequency or interaction using a range of colors. Red is mostly used for high frequency while blue is used for low frequency.
Scatterplot
A two-dimensional graph which is great for visualizing correlations or relationships. Each dot on the scatterplot represents an outcome for two numerical variables of interest.
Scripting languages are interpreted, and programming languages are compiled. Programming uses a compiler to convert the language to machine language, while scripting uses an interpreter (like PowerShell) to convert the language to machine language.
What is the difference between scripting and programming used in data analytics?
Regression
a technique that allows us to predict an outcome based on a set of predictor variables. It is like providing output given a set of inputs.
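A minimal regression sketch in scikit-learn; the hours-studied and exam-score values are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictor (hours studied) and outcome (exam score)
hours = np.array([[1], [2], [3], [4], [5]])
score = np.array([52, 58, 65, 71, 78])

model = LinearRegression().fit(hours, score)  # learn the input-to-output mapping
print(model.predict([[6]]))                   # predicted outcome for a new input
```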
Trend Analysis
Regression analysis of a value as a function of time; understanding how and why things have changed over time, e.g. stock prices. It involves figuring out the path your data is on: making a graph of changes over time, connecting the points, and trying to find a function for that line, such as the number of people that visit a site, movement, etc.
Time Series
A statistical tool that deals with a sequence of data in chronological order; a technique that looks for trends in data over time. It also involves separating data into an overall trend. Can be phrased as supervised learning.
Decomposition
Breaking a trend over time into components. Decomposition procedures are used in time series to describe the reasons for variations in the trend.
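A short sketch of decomposition using statsmodels on a synthetic monthly series (trend plus seasonal swing plus noise, all made up):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + repeating seasonal swing + noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
t = np.arange(48)
noise = np.random.default_rng(0).normal(0, 0.3, 48)
series = pd.Series(0.5 * t + 3 * np.sin(2 * np.pi * t / 12) + noise, index=idx)

parts = seasonal_decompose(series, model="additive", period=12)
print(parts.trend.dropna().head())  # the trend component, separated from seasonality
```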
Spectral Density
a graph that shows frequencies related to the auto-covariance time domain.
Machine Learning
involves using algorithms and statistical models to analyze and draw inferences from patterns in data. Focuses on the development of computer programs that can access data and use it to learn for themselves.
artificial intelligence (AI)
The development of smart machines capable of performing tasks that typically require human intelligence, e.g. visual perception, speech recognition, online cheque processing, decision-making, natural language processing (NLP).
Classification
Machine Learning Techniques:
A technique in which the analyst wants to assign an item to a specific category based on various conditions.
The general approach the model uses is to find the location of the item needing classification among measurements of interest, compare this item to items close by, then assign it to a group. Also used for object detection, spam detection, cancer detection, etc.
Discuss the general approach to Classification:
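The "compare to items close by" approach described above is essentially k-nearest neighbors; a minimal scikit-learn sketch on the bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Place a new item among the measurements of interest, look at the closest
# items, and assign it to their majority group
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[5.8, 2.8, 4.5, 1.3]]))  # class of the most similar flowers
```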
Clustering
Machine Learning Techniques:
Groupings are unknown, and the analyst wishes to determine if the objects belong to any group. An example is when data on search queries are analyzed to determine if they group in a particular way and how many groups exist. E.g. genome patterns, Google News, point-cloud processing.
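A minimal clustering sketch with k-means on synthetic unlabeled points; the number of clusters is an assumption the analyst supplies:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled points: the groupings are unknown ahead of time
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])  # the group each point was assigned to
```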
Bayes' Theorem
Machine Learning Techniques:
The probability of a hypothesis given the observed data. It gives you the after-the-data (posterior) probability of a hypothesis as a function of the likelihood of the data under that hypothesis, the prior probability of the hypothesis, and the probability of getting the data you found.
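In symbols, for a hypothesis H and observed data D:

\[ P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)} \]

where P(H | D) is the posterior (after-the-data) probability, P(D | H) is the likelihood of the data under the hypothesis, P(H) is the prior probability of the hypothesis, and P(D) is the probability of getting the data you found.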
Naive Bayes
Machine Learning Techniques:
Named after Thomas Bayes, an algorithm that applies Bayes' theorem to estimate the conditional probability of an outcome. It is a machine learning model used to classify objects based on different features.
Dimensionality Reduction
Machine Learning Techniques:
Reduces the number of variables and the amount of data, so you deal with a single score rather than multiple scores or a lot of data. It uses techniques such as Principal Component Analysis (PCA), Factor Analysis, and Feature Selection.
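A minimal PCA sketch in scikit-learn, compressing the four iris measurements into two principal-component scores:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # four measurements per flower

# Reduce four correlated variables to two principal-component scores
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)  # variance each component retains
```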
Data Reduction
Simply reducing the amount or volume of data in storage or a database. One of the goals is to optimize storage capacity.
Hierarchical Clustering
Algorithm that groups similar objects into groups that are called clusters.
Anomaly Detection
Machine Learning Techniques:
The identification of rare items, events, or observations in a dataset which differ from the norm or raise suspicions. It can be used to detect fraud, intrusion, outliers, technical glitches, etc. in a dataset. Tools include R, RStudio, Tableau, MS Excel, Editor, etc. Techniques include local outlier factor (LOF), the alpha function, etc.
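A minimal local outlier factor (LOF) sketch in scikit-learn; the injected far-away point is a stand-in for fraud or a glitch:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),  # normal behavior
               [[8.0, 8.0]]])               # one point far from the norm

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)       # -1 marks suspected anomalies, 1 marks inliers
print(np.where(labels == -1)[0])  # index 100 (the injected point) should be flagged
```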
Neural networks
Algorithms that mimic the operations of a human brain to recognize relationships between vast amounts of data. They are modeled roughly after the neurons inside a biological brain: on and off switches that relate to each other, taking very basic pieces of information and connecting them with many other nodes, building up to very high-level cognitive decisions and classifications.
Deep Learning
Machine Learning Techniques:
a type of neural network capable of performing text classifications. Also, a type of recurrent neural network (RNN) that works best on sequential data.
Decision trees
Machine Learning Techniques:
A tree-like model of alternative decisions and their consequences. It is a whole sequence of binary decisions based on your data that combine to predict an outcome, branching out from one decision to the next.
Optimization Analysis
Machine Learning Techniques:
Finding the best value for one or more target variables given certain constraints. It shows what value a variable should have, given certain conditions or restraints.
Supervised Model
Machine Learning Techniques:
Machine learning algorithm that learns on a labeled dataset, providing an answer key that the algorithm can use to evaluate its accuracy on training data. E.g. Classification and regression.
Unsupervised Model
Machine Learning Techniques:
Provides unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own. Examples: clustering, anomaly detection, neural networks.
Knowing the goals of an organization, resource availability, stakeholders, and the outcome(s) of the project.
What decisions are necessary to initiate a data analytics project?
Project will not be aligned with organizational needs.
What are the implications of undefined outcomes of potential data analytics projects?
Formulate questions that align with the organizational needs.
How does one define research questions within an organization?
Data privacy laws covering the collection and sharing of personally identifiable information (PII). Examples: GDPR in the EU, IRAC, HIPAA.
Summarize the legal frameworks for data governance
Stands for Issue, Rule, Application, and Conclusion. It functions as a methodology for legal analysis. The IRAC format is mostly used in hypothetical questions in law school.
Define IRAC
Conflict of interest in the context of data frameworks refers to not being ethical, or compromising an analysis to let it lean toward favorable results.
Define Conflict of Interest in the context of data frameworks.
The ability for information in digital format to be accessible to the average end user. One of the goals is to allow non-specialists to access data without technical requirements. It means that everyone should have access to the data and there isn't a gatekeeper creating a bottleneck to the data.
Define Democratization
- Project sponsor provides funds
- Program managers provide direction
- Project manager coordinates and manages the triple constraints, and gets the data/reports out of the organization
- Researcher pushes the team to ask interesting questions and identifies key problems
- Data analyst obtains and cleans data, displays data in reports, and searches for trends and outliers
- The unicorn is the ninja that knows everything.
Define various roles in the workplace in a data analytic project:
- Stakeholders are people who have interest/power in any decision or activity of the project/organization.
- Partners are the organizations responsible for carrying out specific project activities in the manner and scope indicated in an application form.
- Third parties may include regulatory agencies/customers
Define the various roles of potential partners and stakeholders in data analytics projects:
The critical path is the longest path of activities on a project, or the minimum amount of time necessary to complete all project work. A delay in critical-path activities could delay the project.
Explain and define the critical path and its relationship to project timeline:
Iron means it is not negotiable. The iron triangle shows in graphical form the project constraints of time, cost, and scope. Other names include the Golden Triangle, Triple Constraints, and Trinity. Quality is a central theme, sitting at the midpoint. If you break the iron triangle by changing one constraint, the other two need to be adjusted accordingly, otherwise quality will suffer. Some variations use quality interchangeably with scope.
Explain the IRON triangle and the challenge of balancing resources in data analytics projects:
include persuasion, verbal communication, non-verbal communication, active listening, problem-solving, and decision-making.
What are the effective interpersonal communication skills?
It involves being able to listen to others with understanding and empathy.
What is active listening?
Means creating meaningful dialogue together that focuses on the problem, opportunity, and solution, using diagrams, charts, and visuals. The strategy aims at bringing together different groups of people and third parties to assist with a project or product development. Examples of tools teams use to co-create include Google Docs, Slack, Microsoft Teams, etc.
Describe co-creation approaches and tools (co-creation is collaboration)