68 terms

Descriptive analytics

uses data to understand past and present

Predictive analytics

analyzes past performance to forecast future outcomes

Prescriptive analytics

uses optimization techniques to recommend the best course of action, such as how to design processes

Business analytics

The use of data, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers make better, fact-based decisions.

Inferential Statistics

Drawing conclusions about a large group of individuals based on information about a subset thereof

Hypothesis Testing

Examining the veracity of a claim about the population

e.g., New CFL bulbs last no longer than standard incandescents

Population

All the items or individuals about which you want to draw a conclusion (the "large group")

Parameter

A numerical measure that describes a characteristic of a population

Operational Definitions

Universally accepted meanings that are clear to all associated with an analysis

Discrete Variables

variables that take separate, countable values, such as the number of people or the outcome of a coin flip; you can't have half of one

continuous variables

variables that can take any value within a range, including fractions; time is an example

probability mass function

the height of each bar is the probability of getting that value (the % of the total that bin represents). Used when the variable is discrete.

probability density function

the continuous counterpart of the probability mass function: probability is the area under the curve over a range of values. Used for continuous variables, not discrete ones.

both probability functions must

sum (mass function) or integrate (density function) to 1 over all possible values
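
For a discrete variable, the requirement that the probabilities total 1 can be checked directly. The sketch below is only an illustration: the fair six-sided die and the names `pmf` and `total` are assumptions made for the example.

```python
from fractions import Fraction

# Hypothetical example: probability mass function of a fair six-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Each probability lies between 0 and 1, and together they sum to exactly 1.
total = sum(pmf.values())
print(total)  # 1
```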

properties of a normal distribution

bell shaped

symmetric

mean, median and mode all equal

a normal distribution's spread is determined by the

standard deviation σ

the range of a normal distribution is theoretically

-infinity to infinity

why do we use samples?

less time consuming and less costly

more practical than analyzing the whole population

can be more accurate than trying to measure everything in a population

As sample size increases,

the standard error of the sample mean (σ/√n) decreases and we become more confident in our decisions.
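
One way to see the effect of sample size is through the standard error of the sample mean, σ/√n. This is a minimal sketch; the population standard deviation of 10 and the function name `standard_error` are assumptions chosen for illustration.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Hypothetical population standard deviation of 10.
for n in (25, 100, 400):
    print(n, standard_error(10.0, n))
# 25 2.0
# 100 1.0
# 400 0.5
```

Quadrupling the sample size halves the standard error in this setup.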

Type I Error

Rejecting a true null hypothesis

Probability of a Type I Error is:

The level of significance of the test

Set in advance by the researcher

Type II Error

Failing to reject a false null hypothesis

Probability of a Type II Error is:

Directly related to the power of the test (1-Prob. Type II Error = Power)

Generally not computable (requires that population mean be known)

A sampling distribution is

the distribution of all of the possible values of a sample statistic for a given size sample selected from a population

Hypothesis testing is

analyzing the difference between the observed results and what you would expect if the null hypothesis were true
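
As a rough illustration of comparing an observed result with what the null hypothesis predicts, the sketch below runs a two-sided one-sample z-test. It assumes the population standard deviation is known, and the bulb-lifetime numbers and the function name `z_test_p_value` are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_p_value(sample_mean, null_mean, sigma, n):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    # Probability of a result at least this far from the null mean, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: null mean of 1000 hours, 64 bulbs averaging 1030 hours.
p = z_test_p_value(sample_mean=1030, null_mean=1000, sigma=100, n=64)
print(round(p, 4))
```

A small value of `p` means the observed difference would be unusual if the null hypothesis were true.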

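the p-value is the probability of

observing results at least as extreme as those in the sample, assuming the null hypothesis is true. It is compared with the significance level, which is the probability of a Type I error (rejecting a true null hypothesis).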

As Sample Size Increases

Probability of Type II Error Decreases

What is a scatterplot?

Graphical representation of the relationship between two variables

what you need to understand about the relationship from a scatterplot

Magnitude

Correlation

Linearity vs. Nonlinearity

Outliers

limitations of scatterplot

Cannot infer causality

Limited to three variables (at most), and even then can be very difficult to analyze

Sub-population effects can be masked

Magnitude (Direction)

Magnitude is the overall trend in the data points

Positive?

Negative?

Zero (horizontal)?

Can be measured easily in Excel by adding a trend line to the plot

Correlation (strength)

Correlation is how closely related the values of X and Y are

Closely related values will produce points that are closer to a line

Less closely related values will produce a more dispersed cloud of points

Measured by the correlation coefficient
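
The correlation coefficient can be computed by hand in a few lines. The sketch below uses the Pearson formula; the two small data sets (`tight` and `loose`) are made up to contrast points near a line with a more dispersed cloud.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
tight = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical points close to a line
loose = [4.0, 1.0, 7.0, 3.0, 6.0]   # hypothetical, more dispersed cloud

print(correlation(x, tight), correlation(x, loose))
```

The first value comes out close to 1 and the second is much weaker, matching what the two scatterplots would look like.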

Linearity vs. Nonlinearity

Linear relationships produce points clustered around a straight line

Nonlinear relationships produce points that follow a curved line

We can't use linear methods to quantify nonlinear relationships; doing so leads to:

Incorrect conclusions

Invalid predictions

Costly consequent actions

you should delete an outlier if..

the data point is not relevant to the topic of study

the underlying data is flawed

You would want to keep it if the data is actually reflective of the real world

influential points

have a large effect on the resulting analysis and conclusions.

residuals

difference between observed and predicted values

The line of best fit (LOBF) is determined by

minimizing the sum of the squared residuals.
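
A minimal sketch of this idea for one explanatory variable, using the standard closed-form least-squares formulas. The data and the helper names `fit_line` and `sum_squared_residuals` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)
# A line with a slightly different slope has a larger sum of squared residuals.
worse = sum_squared_residuals(xs, ys, b0, b1 + 0.1)
print(b0, b1, best, worse)
```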

R squared

is the percent of variation in the outcome explained by the independent variable(s), relative to the overall variation in the data.

R2 is how much more accurately we can estimate the outcome variable with the independent variable(s) as opposed to simply using the average of the outcome variable
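
The comparison against "simply using the average" can be written out directly as 1 minus the ratio of two sums of squared errors. This is a sketch under assumed data; the `predicted` values stand in for fitted values from some regression line.

```python
def r_squared(observed, predicted):
    """1 - SSE/SST: share of variation around the mean that the model explains."""
    mean_y = sum(observed) / len(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # errors of the model
    sst = sum((y - mean_y) ** 2 for y in observed)                # errors using only the average
    return 1 - sse / sst

observed = [2, 4, 5, 4, 5]             # hypothetical outcomes
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]  # hypothetical fitted values
print(r_squared(observed, predicted))
```

A model that always predicts the average gets a value of 0, and a model with no errors gets 1.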

the standard error mainly helps you when calculating the p-value (and confidence intervals)

...

statistical significance does not mean practical significance

...

adjusted R squared penalizes each additional variable added to the model. It also loses the "% of the variation explained" interpretation that R squared carries.

...
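
The penalty can be seen in the usual adjusted R squared formula, 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables. The sketch below assumes that formula; the inputs are hypothetical.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R squared for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same hypothetical R squared of 0.60 with 30 observations:
print(adjusted_r_squared(0.60, n=30, k=2))
print(adjusted_r_squared(0.60, n=30, k=10))
```

With the same R squared, the model with more variables gets the lower adjusted value.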

A model nests another when

it is a generalized version of it

Includes additional parameters

Does not exclude any parameters

Interaction Terms

Allow the effect of one variable on Y to depend on the value of another variable
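
A small sketch of what this means in a model with two variables and their product. The coefficients and the names `predict` and `effect_of_x1` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Hypothetical model: Y = b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def effect_of_x1(x2, b1=2.0, b3=1.5):
    """Change in Y for a one-unit increase in x1, at a given value of x2."""
    return b1 + b3 * x2

print(effect_of_x1(0), effect_of_x1(2))  # 2.0 5.0
```

Without the interaction term (b3 = 0), the effect of x1 would be the same at every value of x2.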

L.I.N.E. assumptions: what are these assumptions about?

For all values of x, the population errors: have mean zero (Linearity), are probabilistically independent (Independence), are Normally distributed, and have Equal variance.

Linearity test

check residual plots: the sample residuals should scatter randomly around zero with no curved pattern

independent population errors

test for autocorrelation; knowing the value of the errors for any set of x values should provide no information about the others

normally distributed test

plot the residuals on a histogram, which should be roughly bell shaped

Can also use a normal probability (Q-Q) plot, where the points should fall close to a straight line, but it can be harder to read than the histogram of the residuals.

equal variance test

can be checked by plotting y against x or with residual plots

fan shape = heteroskedastic

consistent spread = homoskedastic

Multicollinearity

Occurs when one variable is a linear combination of one or more other variables

No new information available on which to estimate parameters

Transformations

Transformations are simply the original variables altered by some mathematical function

two earlier transformations were scaling and interaction terms

When you transform the outcome variable, you can no longer compare R squared, adjusted R squared, or standard errors with those of the untransformed model

elasticity interpretation

when you log both the variable and the outcome, the coefficient gives the percentage change in the estimate for a one percent change in the variable, which is the elasticity
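
To illustrate, the sketch below builds a made-up demand curve with a constant elasticity of -1.5 and recovers it as the slope of a regression of log quantity on log price. The data and the helper `fit_slope` are assumptions for the example.

```python
import math

def fit_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))

# Hypothetical demand curve with a constant elasticity of -1.5.
prices = [1.0, 2.0, 4.0, 8.0, 16.0]
quantities = [100 * p ** -1.5 for p in prices]

elasticity = fit_slope([math.log(p) for p in prices],
                       [math.log(q) for q in quantities])
print(round(elasticity, 6))  # -1.5
```

Here a 1% increase in price goes with roughly a 1.5% decrease in quantity.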

Stepwise Regression

Any process that automates the include/exclude decision rules based on some measure of fit, e.g.,

Adjusted R2

Parameter p-values

types = forward, backward, stepwise

tradeoffs in stepwise regression

positives: may help identify important variables

can help sift through a large number of variables

negatives:

can capitalize on sample randomness

won't account for transforms or interactions

very bad for understanding relationships but good for predicting

you could find a model that works perfectly with your sample but will be useless to use in the future.

three types of forecasting approaches

Judgmental - Qualitative

Extrapolation - Time Series

Econometric - Regression

Goal of forecasting is to make predictions

Accurately

Out of sample

Early

statistical and practical significance

does not imply causality

there could be a third variable causing both, or

simultaneity: the order of events does not guarantee causality

combined forecasts

forecast errors might cancel out when multiple forecasts are combined. Example: political polls

combining forecasts should increase confidence if...

the errors are random and independent across the polls
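
A small simulation can show why random, independent errors matter. Everything here is assumed for illustration: ten forecasts per trial, each missing the true value by an independent normal error with standard deviation 3.

```python
import random

random.seed(0)
TRUE_VALUE = 50.0   # hypothetical quantity being forecast
N_FORECASTS = 10    # hypothetical number of independent polls
N_TRIALS = 2000

def rmse(errors):
    """Root mean squared error."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

single_errors, combined_errors = [], []
for _ in range(N_TRIALS):
    # Each forecast misses by an independent random error.
    forecasts = [TRUE_VALUE + random.gauss(0, 3) for _ in range(N_FORECASTS)]
    single_errors.append(forecasts[0] - TRUE_VALUE)
    combined_errors.append(sum(forecasts) / N_FORECASTS - TRUE_VALUE)

print(rmse(single_errors), rmse(combined_errors))
```

The averaged forecast has a noticeably smaller typical error than any single forecast. If the errors shared a common bias instead of being independent, averaging would not remove it.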

trend

the overall direction of the data

seasonality

short term repetitive patterns

cycles

long-term shifts and patterns; not as predictable

the time between cycles can change dramatically, and there may be only one or two cycles within a data set

noise

random deviations from the pattern in individual observations

3 common examples of trend

linear

exponential

s-shaped

how to account for cycles in regression?

you must find a leading indicator and include it as a variable in your model

autocorrelation can be helpful in identifying

seasonality

Positive: large values tend to be followed by large values, and small values by small values

Negative: large values tend to be followed by small values, and small values by large values

a larger span on a moving average will result in

a smoother series in which extreme observations have less impact
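
The effect of the span can be checked on a short made-up series containing one extreme observation. This is a sketch using a simple (unweighted) moving average; the series and the function name `moving_average` are assumptions.

```python
def moving_average(values, span):
    """Average of each run of `span` consecutive values."""
    return [sum(values[i:i + span]) / span
            for i in range(len(values) - span + 1)]

# Hypothetical series with one extreme observation (40).
series = [10, 11, 9, 10, 40, 10, 11, 9, 10, 10]

print(max(moving_average(series, 2)))  # 25.0
print(max(moving_average(series, 5)))  # 16.0
```

With a span of 5 the spike is spread over more periods, so the smoothed series peaks much lower than with a span of 2.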

in a least-squares regression with an intercept, the average predicted college GPA, the average observed GPA, and the predicted GPA at the average values of the variables will all equal each other.

...
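
This property of a least-squares line with an intercept can be checked numerically. The GPA figures below are invented for the example, and `fit_line` and `mean` are helper names introduced here.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def mean(values):
    return sum(values) / len(values)

hs_gpa = [2.5, 3.0, 3.2, 3.6, 3.9]       # hypothetical predictor
college_gpa = [2.4, 2.7, 3.1, 3.3, 3.8]  # hypothetical outcome

b0, b1 = fit_line(hs_gpa, college_gpa)
predicted = [b0 + b1 * x for x in hs_gpa]

avg_predicted = mean(predicted)           # average of the predictions
avg_observed = mean(college_gpa)          # average of the observed outcomes
predicted_at_avg = b0 + b1 * mean(hs_gpa) # prediction at the average predictor
print(avg_predicted, avg_observed, predicted_at_avg)
```

The three printed values agree up to floating-point rounding.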

cross-correlation

a variable is correlated with past values of other variables

autocorrelation

a variable is correlated with past values of itself.
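
A minimal sketch of a lag-k autocorrelation, using one common estimator (deviations from the overall mean, divided by the total sum of squares). The quarterly sales pattern is made up so that it repeats every 4 periods.

```python
def autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag` periods."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

# Hypothetical quarterly sales with a pattern that repeats every 4 periods.
sales = [10, 20, 30, 15] * 6

print(autocorrelation(sales, 4), autocorrelation(sales, 2))
```

The lag-4 value is strongly positive and the lag-2 value is negative, which is the kind of signature that points to a four-period seasonal pattern.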

when a graph says "something vs. something else", where is each variable placed?

y vs. x, so the first "something" goes on the y-axis and the second on the x-axis