63 terms

outlier

any value that falls more than 1.5 IQR above Q3 or below Q1
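
A minimal sketch of the 1.5 x IQR rule, assuming Python's `statistics.quantiles` for the quartiles (a calculator or textbook may compute Q1 and Q3 slightly differently). The helper name `iqr_outliers` and the data are made up for illustration.

```python
import statistics

def iqr_outliers(data):
    """Return the values more than 1.5 * IQR below Q1 or above Q3."""
    # Assumption: quartiles come from statistics.quantiles (default
    # 'exclusive' method); other quartile rules can shift the fences a bit.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

scores = [2, 4, 5, 5, 6, 7, 8, 30]  # made-up data
print(iqr_outliers(scores))
```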

residual

y - y hat, the difference between the actual y-value in a scatterplot and the predicted y-value in the LSRL

P(at least 1)

= 1 - P(none)
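
A quick sketch of the complement rule, assuming the trials are independent with the same success probability p (an added assumption, so that P(none) = (1 - p)^n). The helper name and the die example are illustrative only.

```python
def p_at_least_one(p, n):
    """P(at least one success in n independent trials), via 1 - P(none)."""
    p_none = (1 - p) ** n  # every trial must fail
    return 1 - p_none

# Example: at least one six in 4 rolls of a fair die.
print(round(p_at_least_one(1 / 6, 4), 4))
```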

Mean and StanDev of a Binomial Random Variable

mean: mu of x = np

stan.dev.: sigma of x = sq.root (np(1-p))
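
The two formulas above can be sketched in a few lines. The function name and the values n = 20, p = 0.25 are made up for illustration.

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial random variable."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean, sd

mean, sd = binomial_mean_sd(20, 0.25)
print(mean, round(sd, 3))
```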

Significance Test 4-step process

STATE: hypotheses, significance level, parameters

PLAN: choose inference method, check conditions

DO: perform calculations, compute test stat, find P-value

CONCLUDE: interpret the result of your test in the context of your problem

Chi-Square Test df and expected counts

Goodness of Fit: df = #categories - 1, exp counts = sample size x hypo. prop. in each category

Homogeneity/Independence: df = (#columns - 1)(#rows - 1), exp counts: (row total)(column total)/table total
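
A rough sketch of the df and expected-count rules for both kinds of test. The function names and the observed counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def gof_expected(sample_size, hypothesized_props):
    """Goodness of fit: expected counts and df = #categories - 1."""
    expected = [sample_size * p for p in hypothesized_props]
    df = len(hypothesized_props) - 1
    return expected, df

def two_way_expected(table):
    """Homogeneity/independence: (row total)(column total)/table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, df

observed = [[20, 30], [30, 20]]  # made-up two-way table
print(two_way_expected(observed))
print(gof_expected(100, [0.5, 0.3, 0.2]))
```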

z-score equation

z = (value - mean) / stan.dev.

interpreting a residual plot

curved pattern = linear may not be appropriate

small residuals = predictions will be pretty precise

inc/dec spread = predictions for larger/smaller values of x will be more variable

complementary events

two mutually exclusive events whose union is the sample space (rain/not rain etc)

binomial distribution: calc usage

exactly x: binompdf(n, p, x)

at most x: binomcdf(n, p, x)

less than x: binomcdf(n, p, x-1)

at least x: 1 - binomcdf(n, p, x-1)

more than x: 1 - binomcdf(n, p, x)
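
For checking calculator answers, the same five cases can be sketched in Python. The functions `binompdf` and `binomcdf` below are stand-ins written here to mirror the calculator commands; they are not library functions.

```python
from math import comb

def binompdf(n, p, x):
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(X <= x): add up the pdf from 0 through x."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

n, p, x = 10, 0.5, 3  # made-up example values
print("exactly x: ", binompdf(n, p, x))
print("at most x: ", binomcdf(n, p, x))
print("less than x:", binomcdf(n, p, x - 1))
print("at least x:", 1 - binomcdf(n, p, x - 1))
print("more than x:", 1 - binomcdf(n, p, x))
```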

CI 4 step process

STATE parameter, confidence level

PLAN inference method, check conditions

DO calculations

CONCLUDE interpret interval in the context of the problem

types of chi-square tests

Goodness of fit: test the distribution of one group of sample as compared to a hypothesized distribution

Homogeneity: when you have a sample from 2 or more indep. pop.s or 2 or more groups in an experiment

Association: when you have a single sample from a single population with 2 variables

calc tips: normalcdf and invnorm

normalcdf: min, max, mean, standev

invnorm: area to the left as a decimal, mean, standev
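
A sketch of the same two commands using Python's `statistics.NormalDist`. The wrapper names `normalcdf` and `invnorm` are defined here to match the calculator's argument order; they are not built-in Python functions.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def invnorm(area_left, mean=0.0, sd=1.0):
    """Value with the given area (as a decimal) to its left."""
    return NormalDist(mean, sd).inv_cdf(area_left)

# Made-up example: scores with mean 100 and standev 15.
print(round(normalcdf(85, 115, 100, 15), 4))  # area within 1 standev
print(round(invnorm(0.90, 100, 15), 2))       # 90th percentile
```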

extrapolation

using an LSRL to predict outside the domain of the explanatory variable

use of a control group

gives the researchers a comparison group to be used to evaluate the effectiveness of the treatments

conditions for geometric distributions

Binomial - can be classified as success/failure

Independent

Trials - goal must be to count #trials until the first success

Success - prob of success must be the same for each trial

2 sided test from a confidence interval

we do/don't have enough evidence to reject Ho: mu = __ in favor of Ha: mu (does not equal) __ at the a = 0.05 level because __ falls outside/inside the 95% CI.

a = 1 - confidence level (as a decimal), so a 95% CI matches a = 0.05

Conditions: inference for proportions

Random, Normal (at least 10 succ/failures in both groups if it's 2sample), Independent (10% condition)

SOCS

Shape, Outliers, Center, Spread

LSRL y-hat

the estimated or predicted y-value (context) for a given x-value (context)

SRS

a sample taken in such a way that every set of n individuals has an equal chance to be the sample selected

Conditions: binomial distribution

B: binomial (success/failure)

I: independent

N: number (of trials must be fixed)

S: success (prob. of success must stay constant)

finding sample size given ME

one mean: m = z*(sigma/sq.rt.(n))

proportion: m = z*sq.rt.(p(1-p)/n)

solve for n
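
A sketch of "solve for n", assuming the margin of error for a mean is z* times sigma over the square root of n, and that the answer is always rounded up. The function names, the 95% default, and the p = 0.5 guess (a common conservative choice when no estimate is available) are assumptions added for illustration.

```python
import math
from statistics import NormalDist

def z_star(conf_level):
    """Critical value z* for a two-sided confidence level, e.g. 0.95."""
    return NormalDist().inv_cdf(1 - (1 - conf_level) / 2)

def n_for_mean(margin, sigma, conf_level=0.95):
    """Solve m = z* * sigma / sqrt(n) for n, rounding up."""
    return math.ceil((z_star(conf_level) * sigma / margin) ** 2)

def n_for_proportion(margin, p=0.5, conf_level=0.95):
    """Solve m = z* * sqrt(p(1-p)/n) for n, rounding up."""
    return math.ceil(z_star(conf_level) ** 2 * p * (1 - p) / margin ** 2)

print(n_for_mean(margin=2, sigma=10))
print(n_for_proportion(margin=0.03))
```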

Conditions: inference for means

Random (SRS), Normal (pop. dist. is normal or sample size is at least 30), Indep. (10%)

Describe/Compare Distributions

SOCS, 'is greater/less than' (compare)

LSRL

the equation for the line that minimizes the sum of the squared residuals
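
A minimal sketch of fitting an LSRL by hand, using the standard least-squares formulas for slope b and intercept a (the formulas are not on these cards). The function name and data are made up. The residuals are y minus y-hat, and for a least-squares line they sum to zero.

```python
def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b

xs = [1, 2, 3, 4]  # made-up data
ys = [2, 4, 5, 8]
a, b = lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # actual - predicted
print(round(a, 3), round(b, 3))
print([round(r, 3) for r in residuals])
```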

Does x CAUSE y?

association is not causation!

mean/standev of a sum of 2 random variables

mean: mu of x+y = mu x + mu y

standev: sigma of x+y = sq.rt. (sigma x^2 + sigma y^2), only if x and y are independent

can we generalize the results to the population of interest?

Yes, if a large random sample was taken from the same population we hope to draw conclusions about.

factors that affect power

sample size (larger sample size = more power), alpha (a 5% sig. test will have a greater chance of rejecting Ho than a 1% test), consider an alternative farther from mu 0 (values of mu that are in Ha but lie close to the hyp. value are harder to detect than ones farther away)

linear transformations

adding a to every member of a data set adds a to the measures of position but does not change the measures of spread and shape, multiplying every member by b multiplies the measures of position by b and multiplies most measures of spread by b but does not change the shape

LSRL "SE of b"

measures the standev of the estimated slope for predicting the y-variable from the x-variable/how far the estimated slope will be from the true slope on average

experiment vs obs. study

a study is an experiment ONLY if researchers impose a treatment on the exp. units

mean and standev of a difference of 2 random variables

mean: mu of x-y = mu x - mu y

standev: sigma of x-y = sq.rt. (sigma x^2 + sigma y^2), only if x and y are independent (variances add even for a difference)
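
A small sketch for combining two random variables, assuming X and Y are independent. Means add or subtract, but variances always add, so the sum and the difference have the same standev. The function name and numbers are illustrative.

```python
import math

def combine_independent(mu_x, sd_x, mu_y, sd_y):
    """Mean and standev of X + Y and X - Y for independent X and Y."""
    sd = math.sqrt(sd_x ** 2 + sd_y ** 2)  # variances add in both cases
    return {
        "sum": (mu_x + mu_y, sd),
        "difference": (mu_x - mu_y, sd),
    }

print(combine_independent(mu_x=10, sd_x=3, mu_y=4, sd_y=4))
```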

P-value

assuming that the null is true, the _______ measures the chance of observing a statistic (or difference) as extreme as or more extreme than the one observed

type 1 error

rejecting Ho when Ho is actually true

type 2 error

failing to reject Ho when Ho should be rejected

Power

the probability of rejecting Ho when Ho should be rejected (Ho is false)

r

correlation measures the strength and direction of the linear relationship between x and y

stratified random sample vs SRS

stratified guarantees that each of the strata will be represented

mean/standev of a discrete random variable

mean: mu x = sum (xp)

standev: sigma x = sq.rt. (sum ((x-mu)^2 p))
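
A sketch of both formulas for a discrete random variable, where each squared deviation from the mean is weighted by its probability. The function name and the example distribution (number of heads in two fair coin flips) are illustrative.

```python
import math

def discrete_mean_sd(values, probs):
    """Mean and standev of a discrete random variable."""
    mean = sum(x * p for x, p in zip(values, probs))
    variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
    return mean, math.sqrt(variance)

# Number of heads in 2 fair coin flips.
mean, sd = discrete_mean_sd([0, 1, 2], [0.25, 0.5, 0.25])
print(mean, round(sd, 4))
```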

bias

systematic favoring of certain outcomes due to flawed sample selection, poor wording, undercoverage, nonresponse, etc

2-sample t-test phrasing hints and Ho/Ha

key phrase: difference in the means!

Ho: mu1 - mu2 = 0 OR mu1 = mu2

Ha: mu1 - mu2 < 0, > 0, not equal to 0

2-sample t-test conclusion

we do/don't have enough evidence at the a = ? level to conclude that the difference between the mean ? for all ? and the mean ? for all ? is ?

standev

measures spread by giving the typical or average distance that the observations are away from their mean

r-squared

% of the variation in y that is explained by the LSRL of y on x
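
A sketch of computing r and r-squared by hand, using the standard correlation formula (not given on these cards). For a line fit with one explanatory variable, squaring r gives the r-squared value. The function name and data are made up.

```python
import math

def correlation(xs, ys):
    """Correlation r between two lists of equal length."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4]  # made-up data
ys = [2, 4, 5, 8]
r = correlation(xs, ys)
print(round(r, 3), round(r ** 2, 3))
```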

goal/benefit of blocking

goal: to create groups of homogeneous exp units

benefit: the reduction of the effect of variation within the exp units

expected value/mean

the long-run average outcome of a random phenomenon carried out a very large number of times

unbiased estimator

data is collected in such a way that there is no systematic tendency to over- or under-estimate the true value of the pop. parameter

paired t-test: phrasing hints and Ho/Ha

key phrase: mean difference

Ho: mu diff = 0

Ha: mu diff < 0, > 0, not equal to 0

paired t-test: conclusion

we do/don't have enough evidence at the ? a level to conclude that the mean difference in ? for all ? is ?

LSRL y-intercept 'a'

when the x variable is 0, the y variable is estimated to be ______.

experimental designs

Completely Randomized Design, Randomized Block Design (homogeneous blocks), Matched Pairs

probability

the proportion of times the outcome would occur in a long series of repetitions

Central Limit Theorem

if the pop. dist. is normal, the samp. dist. will also be normal. also, as n increases, the samp. dist.'s standev will decrease.

if the pop.dist. is not normal, the samp. dist. will become more and more normal as n increases.
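
A rough simulation sketch of the sampling distribution of the sample mean. The skewed population (exponential with mean 1 and standev 1), the seed, and the number of repetitions are all arbitrary choices made for illustration. The standev of the sample means should shrink toward sigma/sq.rt.(n) as n grows.

```python
import random
import statistics

def sample_mean_sd(n, reps=2000, seed=1):
    """Standev of many simulated sample means, each from a sample of size n."""
    rng = random.Random(seed)
    # Assumed population: exponential with rate 1 (skewed right, sd = 1).
    means = [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

for n in (4, 16, 64):
    print(n, round(sample_mean_sd(n), 3), "theory:", round(1 / n ** 0.5, 3))
```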

Confidence Interval

Intervals produced with this method will capture the true population/mean of ______ in about ___% of all possible samples of this same size from this same population.

Conditions: inference for regression

independent (10%), normal (regression), equal variance (around regression), random

LSRL slope 'b'

for every one unit change in x the y variable is predicted to increase/decrease by ____ units.

sampling techniques

SRS, stratified, cluster, census, convenience, voluntary

two events are independent if...

P(B) = P(B|A) or P(B) = P(B|A^c)
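
A sketch of checking independence from counts by comparing P(B) with P(B given A). The function name, the tolerance, and the card-deck examples are illustrative assumptions.

```python
def is_independent(n_a_and_b, n_a, n_b, n_total, tol=1e-9):
    """True if P(B) equals P(B given A), up to a small tolerance."""
    p_b = n_b / n_total
    p_b_given_a = n_a_and_b / n_a
    return abs(p_b - p_b_given_a) < tol

# Standard 52-card deck: A = heart, B = ace.
print(is_independent(n_a_and_b=1, n_a=13, n_b=4, n_total=52))
# A = face card, B = king: knowing A changes the chance of B.
print(is_independent(n_a_and_b=4, n_a=12, n_b=4, n_total=52))
```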

why large samples give more trustworthy results

large samples yield more precise results than small samples because in a large sample the values of the sample statistic tend to be closer to the true population parameter

confidence interval conclusion

I am ___% confident that the interval from ___ to ___ captures the true _____.

Conditions: chi-squared tests

random, all exp. counts at least 5, independent (10%)