any value that falls more than 1.5 x IQR above Q3 or more than 1.5 x IQR below Q1
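A minimal sketch of the 1.5 x IQR rule. The data are made up, the function names are hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption, so Q1 and Q3 may differ slightly from a calculator's values.

```python
import statistics

def outlier_fences(data):
    """Return the (lower, upper) cutoffs from the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(data)
    return [x for x in data if x < low or x > high]

scores = [2, 4, 5, 5, 6, 7, 8, 9, 30]  # made-up data
print(outlier_fences(scores))
print(find_outliers(scores))
```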
y - y hat, the difference between the actual y-value in a scatterplot and the predicted y-value in the LSRL
P(at least 1)
= 1 - P(none)
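A small worked example of the complement rule, assuming the trials are independent (the function name and the dice scenario are illustrative only):

```python
def p_at_least_one(p_single, n_trials):
    """P(at least 1 success) = 1 - P(no successes), for independent trials."""
    p_none = (1 - p_single) ** n_trials
    return 1 - p_none

# Chance of at least one six in 4 rolls of a fair die
print(round(p_at_least_one(1 / 6, 4), 4))
```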
Mean and StanDev of a Binomial Random Variable
mean: mu of x = np
stan.dev.: sigma of x = sq.root (np(1-p))
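The two formulas above, as a quick sketch (n = 20 and p = 0.25 are made-up values):

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial random variable."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean, sd

mean, sd = binomial_mean_sd(n=20, p=0.25)
print(mean, round(sd, 3))
```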
Significance Test 4-step process
STATE: hypotheses, significance level, parameters
PLAN: choose inference method, check conditions
DO: perform calculations, compute test stat, find P-value
CONCLUDE: interpret the result of your test in the context of your problem
Chi-Square Test df and expected counts
Goodness of Fit: df = #categories - 1, exp counts = sample size x hypo. prop. in each category
Homogeneity/Independence: df = (#columns - 1)(#rows - 1), exp counts: (row total)(column total)/table total
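A sketch of the df and expected-count formulas above. The function names and the observed counts are made up for illustration; the two-way table is assumed to be a list of rows.

```python
def gof_expected(sample_size, hypothesized_props):
    """Goodness of fit: expected count = sample size x hypothesized proportion."""
    return [sample_size * p for p in hypothesized_props]

def two_way_expected(table):
    """Homogeneity/independence: (row total)(column total) / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def two_way_df(table):
    """df = (#rows - 1)(#columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

observed = [[20, 30], [30, 20]]  # made-up counts
print(gof_expected(60, [0.5, 0.25, 0.25]))
print(two_way_expected(observed), two_way_df(observed))
```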
z = (value - mean) / stan.dev.
interpreting a residual plot
curved pattern = linear may not be appropriate
small residuals = predictions will be pretty precise
inc/dec spread = predictions for larger/smaller values of x will be more variable
two mutually exclusive events whose union is the sample space (rain/not rain etc)
binomial distribution: calc usage
exactly x: binompdf(n, p, x)
at most x: binomcdf(n, p, x)
less than x: binomcdf(n, p, x-1)
at least x: 1 - binomcdf(n, p, x-1)
more than x: 1 - binomcdf(n, p, x)
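For checking calculator answers, the same five cases can be sketched in plain Python. The functions below are hand-written stand-ins that borrow the calculator's names and argument order; they are not a library API, and n = 10, p = 0.5, x = 3 are made-up values.

```python
from math import comb

def binompdf(n, p, x):
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(X <= x): add up binompdf from 0 through x."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

n, p, x = 10, 0.5, 3
exactly = binompdf(n, p, x)
at_most = binomcdf(n, p, x)
less_than = binomcdf(n, p, x - 1)
at_least = 1 - binomcdf(n, p, x - 1)
more_than = 1 - binomcdf(n, p, x)
print(exactly, at_most, less_than, at_least, more_than)
```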
CI 4 step process
STATE parameter, confidence level
PLAN inference method, check conditions
DO perform calculations, construct the interval
CONCLUDE interpret interval in the context of the problem
types of chi-square tests
Goodness of fit: test the distribution of one group or sample as compared to a hypothesized distribution
Homogeneity: when you have a sample from 2 or more indep. pop.s or 2 or more groups in an experiment
Association: when you have a single sample from a single population with 2 variables
calc tips: normalcdf and invnorm
normalcdf: min, max, mean, standev
invnorm: area to the left as a decimal, mean, standev
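A rough Python equivalent of the two calculator commands, built on the standard library's `statistics.NormalDist`. The wrapper names simply mirror the calculator and are not a real library API; the z-score helper and the numbers are illustrative.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def invnorm(area_left, mean=0.0, sd=1.0):
    """Value with the given area to its left."""
    return NormalDist(mean, sd).inv_cdf(area_left)

def z_score(value, mean, sd):
    """z = (value - mean) / stan.dev."""
    return (value - mean) / sd

print(round(normalcdf(-1, 1), 4))   # about 68% within 1 sd of the mean
print(round(invnorm(0.975), 2))     # z* for 95% confidence
print(z_score(130, 100, 15))
```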
using an LSRL to predict outside the domain of the explanatory variable
use of a control group
gives the researchers a comparison group to be used to evaluate the effectiveness of the treatments
conditions for geometric distributions
Binary - each trial can be classified as success/failure
Independent - trials must be independent of each other
Trials - goal must be to count #trials until the first success
Success - prob of success must be the same for each trial
2 sided test from a confidence interval
we do/don't have enough evidence to reject Ho: mu = __ in favor of Ha: mu (does not equal) __ at the a = 0.05 level because __ falls outside/inside the 95% CI.
a = 1 - confidence level (e.g. 1 - 0.95 = 0.05)
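The decision rule above, sketched as a tiny function (the name and the interval endpoints are made up). It assumes a two-sided test whose significance level matches the interval, i.e. a = 1 - confidence level.

```python
def two_sided_decision(null_value, ci_low, ci_high):
    """Reject H0 only if the hypothesized value falls outside the interval."""
    if ci_low <= null_value <= ci_high:
        return "fail to reject H0"
    return "reject H0"

# Made-up 95% CI of (4.2, 5.8), tested at a = 0.05
print(two_sided_decision(5.0, 4.2, 5.8))
print(two_sided_decision(6.5, 4.2, 5.8))
```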
Conditions: inference for proportions
Random, Normal (at least 10 succ/failures in both groups if it's 2sample), Independent (10% condition)
Shape, Outliers, Center, Spread
the estimated or predicted y-value (context) for a given x-value (context)
a sample taken in such a way that every set of n individuals has an equal chance to be the sample selected
Conditions: binomial distribution
B: binary (success/failure)
I: independent (trials must be independent)
N: number (of trials must be fixed)
S: success (prob. of success must stay constant)
finding sample size given ME
one mean: m = z*(sigma/sq.rt.(n))
proportion: m = z*sq.rt.(p(1-p)/n)
solve for n
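Solving for n gives n = (z* x sigma / m)^2 for a mean and n = (z*)^2 x p(1-p) / m^2 for a proportion, always rounded up. A sketch with hypothetical function names; using p = 0.5 when no estimate is available is the usual conservative choice, and the inputs are made up.

```python
import math
from statistics import NormalDist

def z_star(conf_level):
    """Critical value z* for a given confidence level, e.g. 0.95."""
    return NormalDist().inv_cdf(1 - (1 - conf_level) / 2)

def n_for_mean(margin, sigma, conf_level=0.95):
    """Smallest n with z* x sigma / sqrt(n) <= margin."""
    return math.ceil((z_star(conf_level) * sigma / margin) ** 2)

def n_for_proportion(margin, conf_level=0.95, p_guess=0.5):
    """Smallest n with z* x sqrt(p(1-p)/n) <= margin."""
    z = z_star(conf_level)
    return math.ceil(z**2 * p_guess * (1 - p_guess) / margin**2)

print(n_for_mean(margin=2, sigma=15))
print(n_for_proportion(margin=0.03))
```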
Conditions: inference for means
SRS, Normal (pop. dist. is normal or n is at least 30), Indep. (10%)
SOCS, 'is greater/less than' (compare)
the equation for the line that makes the sum of the squared residuals as small as possible
Does x CAUSE y?
association is not causation!
mean/standev of a sum of 2 random variables
mean: mu of x+y = mu x + mu y
standev: sigma of x+y = sq.rt. (sigma x^2 + sigma y^2), if x and y are independent
can we generalize the results to the population of interest?
Yes, if a large random sample was taken from the same population we hope to draw conclusions about.
factors that affect power
sample size (inc sample size = inc power), alpha (a 5% sig. test will have a greater chance of rejecting Ho than a 1% test), consider an alternative farther from mu 0 (values of mu that are in Ha but lie close to the hyp. value are harder to detect than ones farther away)
adding a to every member of a data set adds a to the measures of position but does not change the measures of spread or the shape; multiplying every member by b multiplies the measures of position by b and multiplies most measures of spread by b but does not change the shape
LSRL "SE of b"
measures the standev of the estimated slope for predicting the y-variable from the x-variable/how far the estimated slope will be from the true slope on average
experiment vs obs. study
a study is an experiment ONLY if researchers impose a treatment on the exp. units
mean and standev of a difference of 2 random variables
mean: mu of x-y = mu x - mu y
standev: sigma of x-y = sq.rt. (sigma x^2 + sigma y^2), if x and y are independent (variances ADD even for a difference)
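A quick sketch for combining two independent random variables. Means add or subtract, but variances always add, so the sum and the difference share the same standev. The function name and numbers are made up.

```python
import math

def combine_independent(mu_x, sigma_x, mu_y, sigma_y):
    """Mean and standev of X + Y and X - Y, assuming X and Y are independent."""
    sd = math.sqrt(sigma_x**2 + sigma_y**2)  # variances add in both cases
    return {
        "sum": (mu_x + mu_y, sd),
        "difference": (mu_x - mu_y, sd),
    }

result = combine_independent(mu_x=10, sigma_x=3, mu_y=4, sigma_y=4)
print(result)
```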
assuming that the null is true, the _______ measures the chance of observing a statistic (or difference) as large or larger than the one observed
type 1 error
rejecting Ho when Ho is actually true
type 2 error
failing to reject Ho when Ho should be rejected
rejecting Ho when Ho should be rejected
correlation measures the strength and direction of the linear relationship between x and y
stratified random sample vs SRS
stratified guarantees that each of the strata will be represented
mean/standev of a discrete random variable
mean: mu x = sum (xp)
standev: sigma x = sq.rt. (sum ((x-mu)^2 p))
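A short sketch of both formulas, where each deviation from the mean is squared before it is weighted by its probability. The distribution is made up (number of heads in 2 fair coin flips).

```python
import math

def discrete_mean_sd(values, probs):
    """Mean and standev of a discrete random variable."""
    mean = sum(x * p for x, p in zip(values, probs))
    variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
    return mean, math.sqrt(variance)

values = [0, 1, 2]
probs = [0.25, 0.5, 0.25]
mu, sigma = discrete_mean_sd(values, probs)
print(mu, round(sigma, 4))
```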
systematic favoring of certain outcomes due to flawed sample selection, poor wording, undercoverage, nonresponse, etc
2-sample t-test phrasing hints and Ho/Ha
key phrase: difference in the means!
Ho: mu1 - mu2 = 0 OR mu1 = mu2
Ha: mu1 - mu2 < 0, > 0, not equal to 0
2-sample t-test conclusion
we do/don't have enough evidence at the a = ? level to conclude that the difference between the mean ? for all ? and the mean ? for all ? is ?
measures spread by giving the typical or average distance that the observations are away from their mean
% of the variation in y that is explained by the LSRL of y on x
goal/benefit of blocking
goal: to create groups of homogeneous exp units
benefit: the reduction of the effect of variation within the exp units
the long-run average outcome of a random phenomenon carried out a very large number of times
data is collected in such a way that there is no systematic tendency to over- or under-estimate the true value of the pop. parameter
paired t-test: phrasing hints and Ho/Ha
key phrase: mean difference
Ho: mu diff = 0
Ha: mu diff < 0, > 0, not equal to 0
paired t-test: conclusion
we do/don't have enough evidence at the a = ? level to conclude that the mean difference in ? for all ? is ?
LSRL y-intercept 'a'
when the x variable is 0, the y variable is estimated to be ______.
Completely Randomized Design, Randomized Block Design (homogeneous blocks), Matched Pairs
the proportion of times the outcome would occur in a long series of repetitions
Central Limit Theorem
if the pop. dist. is normal, the samp. dist. will also be normal. also, as n increases, the samp. dist.'s standev will decrease.
if the pop.dist. is not normal, the samp. dist. will become more and more normal as n increases.
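A small simulation sketch of this idea, assuming a right-skewed (exponential) population with mean 1 and standev 1. The function name, seed, and sample sizes are arbitrary choices. The standev of the sample means should land near sigma/sq.rt.(n), so it shrinks as n grows.

```python
import random
import statistics

def simulate_sample_means(n, reps=2000, seed=1):
    """Draw `reps` samples of size n from a skewed population; return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

sd_small_n = statistics.stdev(simulate_sample_means(4))
sd_large_n = statistics.stdev(simulate_sample_means(25))
print(round(sd_small_n, 3), round(sd_large_n, 3))  # roughly 1/2 and 1/5
```

Plotting a histogram of each list would also show the shape getting closer to normal for the larger n.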
Intervals produced with this method will capture the true population proportion/mean of ______ in about ___% of all possible samples of this same size from this same population.
Conditions: inference for regression
linear, independent (10%), normal (residuals around the regression line), equal variance (around the regression line), random
LSRL slope 'b'
for every one unit increase in x, the y variable is predicted to increase/decrease by ____ units.
SRS, stratified, cluster, census, convenience, voluntary
two events are independent if...
P(B) = P(B|A) or P(B) = P(B|A^c)
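A sketch of checking this condition from a two-way table of counts. The counts and the function name are made up; `math.isclose` is used so tiny rounding differences do not matter.

```python
import math

def independent_from_counts(a_and_b, a_not_b, nota_and_b, nota_not_b):
    """Compare P(B) with P(B given A) using counts from a two-way table."""
    total = a_and_b + a_not_b + nota_and_b + nota_not_b
    p_b = (a_and_b + nota_and_b) / total
    p_b_given_a = a_and_b / (a_and_b + a_not_b)
    return math.isclose(p_b, p_b_given_a)

print(independent_from_counts(20, 30, 20, 30))  # P(B) = 0.4, P(B given A) = 0.4
print(independent_from_counts(40, 10, 20, 30))  # P(B) = 0.6, P(B given A) = 0.8
```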
why large samples give more trustworthy results
large samples yield more precise results than small samples because in a large sample the values of the sample statistic tend to be closer to the true population parameter
confidence interval conclusion
I am ___% confident that the interval from ___ to ___ captures the true _____.
Conditions: chi-squared tests
random, all exp. counts at least 5, independent (10%)