Management 665
Terms in this set (25)
What is the main idea of the article "Telling a Story with Data"?
The main idea is that data can be described analytically in many ways other than a basic report. When the data produces some sort of value, it gives individuals something more than just a report
Workbook
A place where you can create visualizations with your data, which can then be shown on a dashboard
Dashboard
A dashboard is a collection of data snapshots from multiple worksheets
Story
A dashboard where you can narrate the visualizations and help viewers remember concepts
Numerical columns
Refers to integer numbers, such as age, and real numbers, such as price
Categorical columns
Correspond to categorical variables such as gender, social class, blood type and nationality. These variables can only take on a limited number of possible values
Binary columns
Only two values can be present, like pass/fail, yes/no, or true/false.
Handle missing values and inconsistent data in analytics projects
Drop records with missing values, or replace the missing values with specific substitute values
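The two options on this card can be sketched in a few lines. This is a minimal illustration using only the standard library; the `ages` list is made-up sample data, and using the mean as the substitute value is an assumption (the card does not say which value to use). In pandas, `DataFrame.dropna` and `DataFrame.fillna` cover the same two options.

```python
from statistics import mean

# Hypothetical sample column: None marks a missing age.
ages = [34, None, 29, 41, None, 36]

# Option 1: drop the missing entries.
dropped = [a for a in ages if a is not None]

# Option 2: replace missing entries with the mean of the observed values
# (the mean is an assumed choice; a median or a default would also work).
fill_value = mean(dropped)
filled = [fill_value if a is None else a for a in ages]

print(dropped)
print(filled)
```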
What is business intelligence?
A variety of software applications used to analyze an organization's raw data. It is made up of several activities, such as data mining, analytical processing, querying, and reporting
Python packages used for data visualization
matplotlib and seaborn
What types of charts or graphs are possible with Python data visualization packages?
heat map, scatter plots, linear model plots, box plots, histograms
Explain the difference between correlation analysis and causality analysis
A correlation between two variables does not mean that a change in one variable will cause a change in the other. Causality means that one event is the result of the occurrence of another event
Explain the procedure to do correlation analysis when your dataset contains categorical columns or variables
Replace the categorical columns/variables with numerical identifiers. Then use those identifiers to create dummy variables, with each distinct response getting its own column
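The dummy-variable step above can be sketched as follows. This is a minimal standard-library example; the helper names `one_hot` and `pearson` and the sample data are hypothetical, and the sketch builds the 0/1 columns directly from the category labels rather than through intermediate numerical identifiers. In pandas, `pandas.get_dummies` produces the same kind of columns.

```python
from math import sqrt

def one_hot(values):
    """Return {category: [0/1 indicator per row]} for a categorical column."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical data: one categorical column and one numerical column.
blood_type = ["A", "B", "A", "O", "B", "A"]
price = [10.0, 20.0, 12.0, 30.0, 22.0, 11.0]

# Each distinct response gets its own 0/1 column.
dummies = one_hot(blood_type)

# Correlate each dummy column with the numerical column.
for category, indicator in dummies.items():
    print(category, round(pearson(indicator, price), 3))
```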
Explain when to act on correlation and when not to, according to the article
Two factors: the first is the level of confidence that the correlation will reliably recur in the future. The second is the tradeoff between the risk and the reward of taking action.
Variance
Measures how far a dataset is spread out. The average of the squared differences from the mean
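The definition above ("the average of the squared differences from the mean") can be checked by hand. This is a minimal sketch with made-up numbers. Note that the definition on the card is the population variance; the sample variance divides by n - 1 instead of n.

```python
from statistics import pvariance, variance

# Hypothetical dataset.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance computed directly from the definition.
mean_value = sum(data) / len(data)
manual = sum((x - mean_value) ** 2 for x in data) / len(data)

print(manual)           # 4.0
print(pvariance(data))  # population variance from the standard library
print(variance(data))   # sample variance (divides by n - 1)
```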
What is hypothesis testing in general?
A statistical test that is used to determine whether there is enough evidence in a sample of data to infer that a certain condition is true for an entire population
When do you do a t-test?
To test whether the means of two groups of data samples are significantly different
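To make the comparison of two group means concrete, here is a minimal sketch of the pooled (equal-variance) two-sample t statistic using only the standard library. The function name `pooled_t_statistic` and the sample groups are hypothetical. The sketch computes the statistic only, not the p-value; `scipy.stats.ttest_ind` returns both.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic, assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    # Pooled variance: the two sample variances weighted by degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [6.2, 6.8, 7.1, 6.5, 7.4]

t = pooled_t_statistic(group_a, group_b)
degrees_of_freedom = len(group_a) + len(group_b) - 2
print(round(t, 3), degrees_of_freedom)
```

A t statistic far from zero, judged against the t distribution with the printed degrees of freedom, suggests the group means differ.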
When do you do an ANOVA test?
To test for differences in means when there are three or more samples or groups
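A one-way ANOVA compares the variation between group means with the variation within groups. This is a minimal standard-library sketch of the F statistic; the function name `f_statistic` and the three groups are hypothetical. `scipy.stats.f_oneway` computes the same statistic along with a p-value.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_value = f_statistic(groups)
print(f_value)  # 3.0
```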
What is p-value and how do you interpret it?
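The p-value is the probability of observing results at least as extreme as those in the sample, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true. A low p-value (commonly below 0.05) means we can reject the null hypothesis, indicating a significant relationship between the two variables being tested. If the p-value is high, we fail to reject the null hypothesis, meaning there is not enough evidence of a significant relationship between the variables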
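One way to see what a p-value measures is an exact permutation test. This is a different technique from the t-test and ANOVA on the neighboring cards, used here only because it can be written in a few lines. If the null hypothesis holds and the group labels are interchangeable, the p-value is the fraction of all relabelings whose difference in means is at least as large as the one observed. The function name `permutation_p_value` and the data are hypothetical.

```python
from itertools import combinations

def permutation_p_value(a, b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = a + b
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    extreme = total = 0
    # Try every way of assigning len(a) of the pooled values to the first group.
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        first = [pooled[i] for i in idx]
        second = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(first) / len(first) - sum(second) / len(second))
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

p = permutation_p_value([1, 2, 3], [4, 5, 6])
print(p)  # 0.1: only 2 of the 20 relabelings are as extreme as the observed split
```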
General assumptions about dataset when conducting t-tests
Assume the data are normally distributed and the samples have equal variances
What are the five types of machine learning techniques?
classification and multi-class classification, anomaly detection, regression, clustering, and reinforcement learning
Supervised machine learning
Output data (the correct answers) is provided and used to train the machine to produce the desired outputs. The algorithm learns from the training dataset: because we know the correct answers, the predictions it makes on the training data can be checked and corrected
Unsupervised Machine learning
Contains no output data: there is only input data, with no corresponding output variables. The data is clustered into different classes or used for association. It is called unsupervised because there are no correct answers
Statistical analysis
Theory driven, assumptions made about data, often relies on sampling, focusing on causality, requires strong statistical skills
Machine Learning
Less interested in the mechanics of the technique (use it if it makes sense), does not require assumptions to be made about the data, focuses on deploying the results