analyzes past performance to forecast future outcomes
uses optimization techniques to design processes
the use of data, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers make better, fact-based decisions.
Drawing conclusions about a large group of individuals based on information about a subset thereof
Examining the veracity of a claim about the population
e.g., New CFL bulbs last no longer than standard incandescents
All the items or individuals about which you want to draw a conclusion (the "large group")
A numerical measure that describes a characteristic of a population
Universally accepted meanings that are clear to all associated with an analysis
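variables that can take only separate, countable values (typically whole numbers), e.g., number of people or the outcome of a coin flip; you can't have half a person
variables that can take any value within a range, including halves and anything in between; time is continuous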
probability mass function
the height of each bar is the probability of getting that exact value (the % of the total that the bin represents). this is for discrete variables.
probability density function
same idea, but for continuous variables: the probability is the area under the curve over a range of values, not the height of the curve at a single value.
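A minimal sketch of the discrete vs. continuous distinction, using only the Python standard library. The two-coin-flip PMF and the standard normal curve are illustrative choices, not taken from the notes.

```python
from statistics import NormalDist

# Discrete case: PMF for the number of heads in two fair coin flips (illustrative).
pmf = {0: 0.25, 1: 0.50, 2: 0.25}
total_probability = sum(pmf.values())    # the bar heights add up to 1
p_exactly_one_head = pmf[1]              # probability is read off the bar height

# Continuous case: standard normal distribution (illustrative).
z = NormalDist(mu=0.0, sigma=1.0)
density_at_zero = z.pdf(0.0)                 # height of the curve, not a probability
p_within_one_sd = z.cdf(1.0) - z.cdf(-1.0)   # area under the curve from -1 to 1

print(total_probability, p_exactly_one_head)
print(round(density_at_zero, 4), round(p_within_one_sd, 4))
```

For the continuous case the probability of any single exact value is zero, so only ranges have a meaningful probability.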
both probability functions must
properties of a normal distribution
mean, median and mode all equal
a normal distribution's spread is determined by the
standard deviation (σ)
the range of a normal distribution is theoretically
-infinity to infinity
why do we use samples?
less time consuming and less costly
more practical than analyzing the whole population
can be more accurate than trying to record everything in a population
As sample size increases,
the standard error (sigma divided by the square root of n) decreases and we become more confident in our decisions. The population sigma itself does not change.
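A quick sketch of how the standard error of the sample mean shrinks as n grows. The value sigma = 10 and the sample sizes are hypothetical.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation
for n in (25, 100, 400):
    # Quadrupling the sample size halves the standard error.
    print(n, standard_error(sigma, n))
```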
Type I Error
Rejecting a true null hypothesis
Probability of a Type I Error is: the level of significance of the test (alpha), set in advance by the researcher
Type II Error
Failing to reject a false null hypothesis
Probability of a Type II Error is: directly related to the power of the test (1 - Prob. Type II Error = Power); generally not computable (requires that the population mean be known)
A sampling distribution is
the distribution of all of the possible values of a sample statistic for a given size sample selected from a population
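A sampling distribution can be approximated by simulation. This is a rough sketch under assumed conditions: the uniform population, the seed, and the sample sizes are all made up for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 values between 0 and 100.
population = [random.uniform(0, 100) for _ in range(10_000)]

def sample_means(pop, n, draws=2_000):
    """Draw many samples of size n and record each sample's mean."""
    return [statistics.fmean(random.sample(pop, n)) for _ in range(draws)]

means_small = sample_means(population, n=5)
means_large = sample_means(population, n=50)

# The spread of the sample means is narrower for the larger sample size.
spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)
print(round(spread_small, 2), round(spread_large, 2))
```

The collection of sample means for a given n is the (approximate) sampling distribution of the mean for that sample size.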
Hypothesis testing is
analyzing the difference between the observed results and what you would expect if the null hypothesis were true
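the p value is the probability of
observing results at least as extreme as the sample results if the null hypothesis is true. It is not the probability of a type 1 error; that probability is the significance level, which is set in advance.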
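A minimal sketch of a p-value calculation for a two-sided z-test, assuming the population standard deviation is known. The sample mean, null mean, sigma, and n are hypothetical numbers.

```python
import math
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """Probability of a result at least this extreme if the null is true."""
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical test: H0 says the mean is 100.
p = two_sided_p_value(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=100)

alpha = 0.05             # significance level, chosen before looking at the data
reject_null = p < alpha  # reject H0 only when the p-value falls below alpha
print(round(p, 4), reject_null)
```

The p-value is computed from the data; alpha is the Type I error rate the researcher is willing to accept and is fixed beforehand.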
As Sample Size Increases
Probability of Type II Error Decreases
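A sketch of why the Type II error probability falls as n grows, for an upper-tail z-test. It assumes the true population mean is known (which, as noted above, it usually is not); all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def type_ii_error(null_mean, true_mean, sigma, n, alpha=0.05):
    """P(fail to reject H0) for an upper-tail z-test when the mean is true_mean."""
    se = sigma / math.sqrt(n)
    # Reject H0 when the sample mean exceeds this cutoff.
    critical_value = null_mean + NormalDist().inv_cdf(1.0 - alpha) * se
    # Type II error: the sample mean lands below the cutoff even though H0 is false.
    return NormalDist(mu=true_mean, sigma=se).cdf(critical_value)

betas = [type_ii_error(100.0, 103.0, 15.0, n) for n in (10, 40, 160)]
powers = [1.0 - b for b in betas]  # power = 1 - P(Type II error)
print([round(b, 3) for b in betas])
```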
What is a scatterplot?
Graphical representation of the relationship between two variables
relationships needed to understand from scatterplot
Magnitude, Correlation, Linearity vs. Nonlinearity, Outliers
limitations of scatterplot
Cannot infer causality
Limited to three covariates (at most), can be very difficult to analyze
Sub-population effects can be masked
Magnitude is the overall trend in the data points
Positive? Negative? Zero (horizontal)?
Can be measured easily in Excel by adding a trend line to the plot
Correlation is how closely related the values of X and Y are. Closely related values will produce points that are closer to a line; less closely related values will produce a more dispersed cloud of points.
Measured by the correlation coefficient
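A small sketch of the (Pearson) correlation coefficient computed by hand. The two data sets are made up: one falls exactly on a line, the other is a looser cloud.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
tight = [2, 4, 6, 8, 10]   # points fall exactly on a line
loose = [2, 9, 3, 10, 6]   # points form a dispersed cloud

r_tight = pearson_r(x, tight)
r_loose = pearson_r(x, loose)
print(round(r_tight, 3), round(r_loose, 3))
```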
Linearity vs. Nonlinearity
Linear relationships produce points clustered around a straight line
Nonlinear relationships produce points that follow a curved line
We can't use linear methods to quantify nonlinear relationships. Doing so leads to: incorrect conclusions, invalid predictions, costly consequent actions.
you should delete an outlier if..
the data point is not relevant to the topic of study
the underlying data is flawed
You would want to keep it if the data is actually reflective of the real world
have an effect or impact on the resulting analysis and conclusions.
difference between observed and predicted values
LOBF is determined by
the minimum of the sum of the squared residuals.
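A minimal sketch of the least-squares line for one x variable, using the closed-form slope and intercept. The data points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of (observed - predicted)^2 for a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical observations

intercept, slope = fit_line(x, y)
best = sum_squared_residuals(x, y, intercept, slope)
# Any other line, e.g. one with a slightly different slope, has a larger sum.
worse = sum_squared_residuals(x, y, intercept, slope + 0.1)
print(round(intercept, 2), round(slope, 2), best < worse)
```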
is the percent of variation in the dependent variable explained by the independent variable(s), relative to the overall variation in the data.
R2 is how much more accurately we can estimate the outcome variable with the independent variable(s) as opposed to simply using the average of the outcome variable
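A short sketch of R2 as 1 minus (unexplained variation / total variation). The data and the fitted line 0.05 + 1.99x are hypothetical; the line is assumed to come from a least-squares fit of the same data.

```python
def r_squared(observed, predicted):
    """R2 = 1 - SSE/SST, where SST measures variation around the average."""
    mean_y = sum(observed) / len(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    sst = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - sse / sst

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]               # hypothetical observations
line_predictions = [0.05 + 1.99 * xi for xi in x]  # assumed fitted line
mean_predictions = [sum(y) / len(y)] * len(y)      # "just use the average"

r2_line = r_squared(y, line_predictions)
r2_mean = r_squared(y, mean_predictions)  # the average alone explains nothing
print(round(r2_line, 4), round(r2_mean, 4))
```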
the standard error of a coefficient is what you use to calculate its t statistic and p value (and its confidence interval)
statistical significance does not mean practical significance
adjusted r squared penalizes for each additional variable added to the model. It also doesn't carry the % of the variation explained interpretation that r squared carries.
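A sketch of the usual adjusted R2 formula, 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables. The R2 values, n, and k below are hypothetical.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R2 for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical comparison: seven extra variables raise R2 only slightly,
# so the penalty outweighs the gain and adjusted R2 falls.
small_model = adjusted_r_squared(r2=0.80, n=30, k=3)
large_model = adjusted_r_squared(r2=0.81, n=30, k=10)
print(round(small_model, 4), round(large_model, 4))
```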
A model nests another when
it is a generalized version of it
Includes additional parameters; does not exclude any parameters
Allows the effect of one variable on Y to depend on the value of another variable
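A minimal sketch of an interaction term in a two-variable model, Y = b0 + b1*x1 + b2*x2 + b3*(x1*x2). The coefficient values are hypothetical and chosen only to make the arithmetic easy.

```python
# Hypothetical coefficients for Y = B0 + B1*x1 + B2*x2 + B3*(x1*x2)
B0, B1, B2, B3 = 10.0, 2.0, 5.0, 0.5

def predict(x1, x2):
    """Predicted Y, including the interaction term x1*x2."""
    return B0 + B1 * x1 + B2 * x2 + B3 * x1 * x2

def effect_of_x1(x2):
    """Change in predicted Y per one-unit increase in x1, holding x2 fixed."""
    return B1 + B3 * x2

# The effect of x1 is 2.0 when x2 = 0 but 3.0 when x2 = 2.
print(effect_of_x1(0), effect_of_x1(2))
print(predict(4, 2) - predict(3, 2))
```

Without the interaction term (B3 = 0) the effect of x1 would be the same at every value of x2.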
L.I.N.E Assumptions, what are these assumptions on
Have mean zero (Linearity), are probabilistically independent (Independence), are Normally distributed, and have Equal variance.
for all values of x the population errors
check with residual plots: the sample residuals should scatter around a horizontal line at zero with no curved pattern
independent population errors
autocorrelation test, knowing the value of the errors for any set of x values provides no information on others
normally distributed test
plot the residuals on a histogram (should be roughly bell shaped) or a qq plot (points should fall roughly on a straight line)
Can also use a normal probability plot but it will be harder to analyze compared to the histogram of the residuals.
equal variance test
can be seen by plotting y on x OR RESIDUAL PLOTS
fan shape = heteroskedastic
consistent = homoskedastic
Occurs when one variable is a linear combination of one or more other variables
No new information available on which to estimate parameters
Transformations are simply the original variables altered by some mathematical function
two earlier transformations were scaling and interaction terms. When a transformation changes the dependent variable, you can't directly compare R squared, adjusted R squared, or standard errors with those of the untransformed model anymore.
when you log both the variable and y, the coefficient is the percentage change in the estimate due to a one percent change in the variable, which is the elasticity
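A sketch of the log-log idea: when both variables are logged, the fitted slope is the elasticity. The price and quantity data are generated from a made-up demand curve with elasticity -1.5, so the regression should recover that value.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Hypothetical demand curve: quantity = 3 * price^(-1.5)
prices = [1.0, 2.0, 4.0, 8.0]
quantities = [3.0 * p ** -1.5 for p in prices]

log_prices = [math.log(p) for p in prices]
log_quantities = [math.log(q) for q in quantities]

# Slope of log(quantity) on log(price) is the elasticity.
intercept, elasticity = fit_line(log_prices, log_quantities)
print(round(elasticity, 4))
```

Here a 1% increase in price goes with roughly a 1.5% decrease in quantity.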
Any process that automates the include/exclude decision rules based on some measure of fit, e.g., adjusted R2, parameter p-values
types = forward, backward, stepwise
tradeoffs in stepwise regression
positives: may help identify important variables; can help sift through a large number of variables
negatives: can capitalize on sample randomness; won't account for transforms or interactions; very bad for understanding relationships, though it can be good for predicting; you could find a model that works perfectly with your sample but is useless in the future.
three types of forecasting approaches
Judgmental (Qualitative); Extrapolation (Time Series); Econometric (Regression)
Goal of forecasting is to make predictions
Accurately, out of sample, early
statistical and practical significance
does not imply causality
could be a third-party variable causing this, or
simultaneity: the order of events does not guarantee causality
forecast errors might cancel out when combined across multiple forecasts: example= political polls
combining forecasts should increase confidence if...
the errors are random and independent across the polls
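A simulation sketch of why averaging helps when errors are random and independent. The true value, the error size, the seed, and the number of polls are all hypothetical.

```python
import math
import random

random.seed(1)        # fixed seed so the run is repeatable
TRUE_SUPPORT = 50.0   # hypothetical true value, in percent
POLL_ERROR_SD = 3.0   # hypothetical standard deviation of each poll's error

def one_poll():
    """One poll = truth plus an independent random error."""
    return TRUE_SUPPORT + random.gauss(0.0, POLL_ERROR_SD)

def rmse(estimates):
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(sum((e - TRUE_SUPPORT) ** 2 for e in estimates) / len(estimates))

trials = 5_000
single_polls = [one_poll() for _ in range(trials)]
poll_averages = [sum(one_poll() for _ in range(10)) / 10 for _ in range(trials)]

rmse_single = rmse(single_polls)
rmse_average = rmse(poll_averages)  # errors partly cancel in the average
print(round(rmse_single, 2), round(rmse_average, 2))
```

If the polls shared a common bias (errors not independent), averaging would not remove it.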
the overall direction of the data
short term repetitive patterns
long-term shifts and patterns, not as predictable
the time between cycles can change dramatically, and there may only be one or two within a data set
random deviations from the pattern due to given observations
3 common examples of trend
linear, exponential, s-shaped
how to account for cycles in regression?
you must find a leading indicator and implement it within your data
autocorrelation can be helpful in identifying
Positive: large values tend to be followed by large values, and small values by small values
Negative: large values tend to be followed by small values, and small values by large values
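A small sketch of lag-1 autocorrelation using one common estimator (lagged products of deviations divided by the total sum of squared deviations). Both series are made up to show the two signs.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series with itself, shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

trending = [1, 2, 3, 4, 5, 6, 7, 8]      # large follows large, small follows small
alternating = [5, 1, 5, 1, 5, 1, 5, 1]   # large follows small and vice versa

r_trending = autocorrelation(trending)
r_alternating = autocorrelation(alternating)
print(r_trending, r_alternating)
```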
a larger span on a moving average will result in a
smoother series in which extreme observations are less impactful
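A sketch of a simple trailing moving average with two different spans. The sales series, including the single spike of 40, is hypothetical.

```python
def moving_average(series, span):
    """Trailing moving average: mean of the latest `span` observations."""
    return [sum(series[i - span + 1:i + 1]) / span
            for i in range(span - 1, len(series))]

sales = [10, 12, 11, 40, 12, 11, 10, 12]  # one extreme observation (40)

ma2 = moving_average(sales, 2)
ma4 = moving_average(sales, 4)

# The spike is damped more by the larger span.
print(max(sales), max(ma2), max(ma4))
```

The tradeoff is that a larger span also reacts more slowly to real changes and yields fewer smoothed points.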
in a regression, the average predicted college GPA, the average observed GPA, and the predicted GPA at the average values of the variables will all equal each other.
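A sketch checking this property for a one-variable least-squares fit with an intercept. The high school and college GPA values are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

hs_gpa = [2.8, 3.0, 3.4, 3.6, 3.9]        # hypothetical explanatory variable
college_gpa = [2.5, 2.9, 3.0, 3.4, 3.6]   # hypothetical outcome variable

intercept, slope = fit_line(hs_gpa, college_gpa)
predicted = [intercept + slope * x for x in hs_gpa]

mean_observed = sum(college_gpa) / len(college_gpa)
mean_predicted = sum(predicted) / len(predicted)
prediction_at_mean = intercept + slope * (sum(hs_gpa) / len(hs_gpa))
print(round(mean_observed, 4), round(mean_predicted, 4), round(prediction_at_mean, 4))
```

The equality holds because the fitted line always passes through the point of averages when the model includes an intercept.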
a variable is correlated with past values of other variables
a variable is correlated with past values of itself.
when a graph says something vs. something else, where is each variable placed?