190 terms

AP Statistics: Key Vocabulary Module 1

individuals
-the objects described by a set of data
-they may be people, animals, or things
variable
-any characteristic of an individual
-can take different values for different individuals
-example: a person's height, gender, or salary
categorical variable
-places an individual into one of several groups or categories
-example: male or female, different colors
quantitative variable
-takes numerical values for which it makes sense to find an average
-not every variable that takes number values is quantitative
-example: height in centimeters or salary in dollars
distribution
-tells us what values the variable takes and how often it takes these values
-pattern of variation of the variable
bar chart (bar graph)
-displays the distribution of a categorical variable more vividly
-represents each category as a bar
-bar heights show the category counts or percents
-easier to make and read than pie charts
-more flexible than pie charts
-also can compare any set of quantities that are measured in the same units
pie chart
-displays the distribution of a categorical variable more vividly
-shows the distribution of a categorical variable as a "pie," whose slices are sized by the counts or percents for the categories
-must include all the categories that make up the whole
-use only when you want to emphasize each category's relation to the whole
-they are awkward to make by hand, so technology will do the job for you
-each slice must be between 0% and 100%, and the slices together must add up to 100%
frequency
-the count of individuals that fall within each category
relative frequency
-the percent of individuals that fall within each category
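A minimal Python sketch of frequency and relative frequency. The eye-color data are made up for illustration and are not from the original notes.

```python
from collections import Counter

# Hypothetical data: eye color recorded for 8 individuals (made up for illustration).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]

# Frequency: the count of individuals that fall within each category.
frequency = Counter(eye_colors)

# Relative frequency: each count expressed as a percent of all individuals.
n = len(eye_colors)
relative_frequency = {category: 100 * count / n for category, count in frequency.items()}

print(dict(frequency))
print(relative_frequency)
```

The relative frequencies always add up to 100%, which is the same requirement a pie chart places on its slices.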
two-way table
-organizes data about two categorical variables measured for the same set of individuals
-often used to summarize large amounts of information by grouping outcomes into categories
marginal distribution
-the marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table
-each marginal distribution from a two-way table is a distribution for a single categorical variable (we can use a bar graph or a pie chart to display such distribution)
-tell us nothing about the relationship between two variables
conditional distribution
-a conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable
-there is a separate conditional distribution for each value of the other variable
-there are two sets of conditional distributions for any two-way table: the distributions of the row variable for each value of the column variable, and the distributions of the column variable for each value of the row variable
-example: we can study the regional preference of China alone by only looking at the "China" column in the two-way table
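A small Python sketch of marginal and conditional distributions computed from a two-way table of counts. The table, its categories, and the helper name `conditional_on_column` are hypothetical and chosen only for illustration.

```python
# Hypothetical two-way table of counts (made-up numbers):
# rows = opinion, columns = country.
table = {
    "yes": {"China": 30, "USA": 20},
    "no":  {"China": 10, "USA": 40},
}

total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of the row variable: row totals as a share of all individuals.
marginal_opinion = {opinion: sum(row.values()) / total for opinion, row in table.items()}

def conditional_on_column(table, column):
    """Distribution of the row variable among individuals in one column only."""
    column_total = sum(row[column] for row in table.values())
    return {opinion: row[column] / column_total for opinion, row in table.items()}

china = conditional_on_column(table, "China")
usa = conditional_on_column(table, "USA")

print(marginal_opinion)
print(china)
print(usa)
```

Here the two conditional distributions differ (75% "yes" in one column, about 33% in the other), so knowing the column helps predict the row, which is what an association means.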
segmented bar graph
-use to display conditional distributions
-can use to compare the distributions of categorical variables
-a side-by-side bar graph makes comparison easier, because in a segmented bar graph the "middle" segments start at different places on the vertical axis
side-by-side bar graph
-use to display conditional distributions
-makes comparison easier than a segmented bar graph
association
-we say that there is an association between two variables if knowing the value of one variable helps predict the value of the other
-if knowing the value of one variable does not help you predict the value of the other, then there is no association between the variables
-to see whether there is an association between two categorical variables, compare an appropriate set of conditional distributions
-caution: even a strong association between two categorical variables can be influenced by other variables lurking in the background
dotplot
-can use to show the distribution of a quantitative variable
-each data value is shown as a dot above its location on a number line (displays individual values on a number line)
how to examine the distribution of a quantitative variable
-in any graph, look for the overall pattern and for striking departures from that pattern
-you can describe the overall pattern of a distribution by its shape, center, and spread
-an important kind of departure is an outlier
shape
-concentrate on the main features
-look for major peaks, not for minor ups and downs in the graph
-look for clusters of values and obvious gaps
-look for potential outliers, not just for the smallest and largest observations
-look for rough symmetry or clear skewness
center
-the "midpoint"
spread
-spread is a way to measure the variability of the observations around the center
outlier
-an individual value that falls outside the overall pattern
-call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or 1.5 x IQR below the first quartile
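The 1.5 x IQR rule can be sketched in Python as follows. This assumes the quartile convention used elsewhere in these notes (Q1 and Q3 are the medians of the lower and upper halves of the ordered list, needing at least two observations); other software may compute quartiles slightly differently. The data and function names are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered list.
    When n is odd, the overall median is left out of both halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

def find_outliers(data):
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # anything below this is flagged
    high_fence = q3 + 1.5 * iqr   # anything above this is flagged
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical data set (made up for illustration).
values = [5, 7, 8, 9, 10, 11, 12, 13, 40]
print(quartiles(values))      # (7.5, 12.5)
print(find_outliers(values))  # [40]
```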
mode
-most common value
symmetric
-a distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other
skewed to the right; right-skewed; positively skewed; skewed towards positive values
-a distribution is skewed to the right if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side
skewed to the left; left-skewed; negatively skewed; skewed towards negative values
-if the left side of the graph is much longer than the right side
unimodal
-distribution is single-peaked
bimodal
-distribution is double-peaked
multimodal
-distribution has more than two peaks
stemplot
-separate each observation into a stem and a one-digit leaf
split stems
-use when values have many digits or when you have very few stems
-split stems to determine the shape of the distribution
-if you do split stems, be sure that each stem is assigned an equal number of possible leaf digits (two stems, each with five possible leaves; or five stems, each with two possible leaves)
back-to-back stemplot
-gives you the ability to take two separate data sets and put them on the same plot
-allows you to compare the different data sets on one plot
histogram
-plots the counts (frequencies) or percents (relative frequencies) of values in equal-width classes
histograms vs. bar graphs
histogram:
-displays the distribution of a quantitative variable
-the horizontal axis is marked in units of measurement for the variable
-draw with no space to show equal width classes

bar graph:
-used to display the distribution of a categorical variable or to compare the sizes of different quantities
-the horizontal axis identifies the categories or quantities being compared
-draw with blank space between the bars to separate the items being compared
time plot
-used for variables that are measured over time, such as the height of a growing child, seasonal variation, or the price of a stock
-time on horizontal axis
range
-the difference between the largest and smallest data values in a distribution
-measure of spread
mean
-the most common measurement of center
-to find the mean of a set of observations, add their values and divide by the number of observations
-tells us how large each data value would be if the total were split equally among all the observations
-"x-bar" is the mean of the sample
-"mu" is the mean of the population
-the mean is sensitive to the influence of extreme observations
-nonresistant measure of center
Σ (capital sigma)
-capital Greek letter sigma
-"add them all up"
median
-a measurement of center
-the midpoint of a distribution; the number such that about half the observations are smaller and about half are larger
-resistant measure of center

to find the median of a distribution:
1. arrange all observations in order of size, from smallest to largest
2. if the number of observations n is odd, the median is the center observation in the ordered list
3. if the number of observations n is even, the median is the average of the two center observations in the ordered list
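The three steps above can be sketched in Python. The function name `median_by_hand` and the data are hypothetical; the comparison with the mean illustrates why the median is called resistant.

```python
def median_by_hand(observations):
    ordered = sorted(observations)        # step 1: arrange from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                        # step 2: n odd, take the center observation
        return ordered[middle]
    # step 3: n even, average the two center observations
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical data with one extreme observation.
data = [1, 2, 3, 4, 100]
mean = sum(data) / len(data)

print(median_by_hand(data))  # 3
print(mean)                  # 22.0
```

The single extreme value 100 pulls the mean up to 22 but leaves the median at 3.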
resistant measure
-statistic that is not affected very much by extreme observations
nonresistant measure
-statistic that is affected by extreme observations
quartiles
-mark out the middle half of the distribution
-Q1 lies one-quarter of the way up the list; median of the observations that are to the left of the median in the ordered list
-Q2 is the median, which is halfway up the list
-Q3 lies three-quarters of the way up the list; median of the observations that are to the right of the median in the ordered list
-IQR = Q3 - Q1
five number summary
-consists of the smallest observation (minimum), the first quartile (Q1), the median, the third quartile (Q3), and the largest observation (maximum), written in order from smallest to largest
-MINIMUM, Q1, MEDIAN, Q3, MAXIMUM
-these five numbers divide each distribution roughly into quarters
-about 25% of the data values fall between the minimum and Q1
-about 25% of the data values fall between Q1 and the median
-about 25% of the data values fall between the median and Q3
-about 25% of the data values fall between Q3 and the maximum
-the five number summary of a distribution leads to a new graph, the boxplot (box-and-whisker plot)
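A short Python sketch of the five number summary, assuming the convention from the quartiles card (Q1 and Q3 are the medians of the lower and upper halves of the ordered list). The scores are made up, and the function name is hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the observations left of the median
    q3 = median(ordered[-half:])   # median of the observations right of the median
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Hypothetical test scores (made up for illustration).
scores = [62, 70, 74, 75, 80, 83, 85, 91, 98]
minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1

print(minimum, q1, med, q3, maximum)
print(iqr)
```

For these scores the summary is 62, 72, 80, 88, 98 and the IQR is 16.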
boxplot (box-and-whisker plot)
-are based on the five number summary
-are useful for comparing distributions

how to make a boxplot:
1. a central box is drawn from the first quartile (Q1) to the third quartile (Q3)
2. a line in the box marks the median
3. lines (called whiskers) extend from the box out to the smallest and largest observations that are not outliers
4. outliers are marked with a special symbol such as an asterisk (*)
modified boxplot
-a boxplot in which outliers are plotted as individual points rather than as part of the box and whiskers
standard deviation/variance
-measures spread by looking at how far the observations are from their mean
-standard deviation is calculated by finding an average of the squared deviations and then taking the square root
-the average squared deviation is called the variance

how to find the standard deviation of n observations:
1. find the distance of each observation from the mean and square each of these distances
2. average the squared distances by dividing their sum by n-1
3. the standard deviation sx is the square root of this average squared distance
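The three steps can be sketched in Python. The data are made up, and the function name is hypothetical; the result is checked against `statistics.stdev`, which also divides by n-1.

```python
from math import sqrt
from statistics import stdev

def sample_standard_deviation(observations):
    n = len(observations)
    mean = sum(observations) / n
    # step 1: squared distance of each observation from the mean
    squared_distances = [(x - mean) ** 2 for x in observations]
    # step 2: "average" by dividing the sum by n - 1 (this is the variance)
    variance = sum(squared_distances) / (n - 1)
    # step 3: square root of the variance
    return sqrt(variance)

# Hypothetical observations (made up for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]
s_x = sample_standard_deviation(data)

print(round(s_x, 3))           # 2.138
print(round(stdev(data), 3))   # same value from the standard library
```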
percentile
-the pth percentile of a distribution is the value with p percent of the observations less than it
cumulative relative frequency graph (ogive)
-can be used to describe the position of an individual within a distribution or to locate a specified percentile of the distribution
-begin by grouping the observations into equal-width classes (much like the process of making a histogram)
-the completed graph shows the accumulating percent of observations as you move through the classes in increasing order
standardized value
-tells us how many standard deviations the original observation falls away from the mean, and in which direction
-observations larger than the mean have positive standardized values
-observations smaller than the mean have negative standardized values
-often standardize to express the observations on a common scale
z-score
-z = (x - mean)/standard deviation
-a standardized score is often called a z-score
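A minimal Python sketch of the z-score formula. The test scores, mean, and standard deviation are hypothetical numbers chosen for illustration.

```python
def z_score(x, mean, standard_deviation):
    """How many standard deviations x falls from the mean, and in which direction."""
    return (x - mean) / standard_deviation

# Hypothetical class results: mean 80, standard deviation 6.
above = z_score(86, mean=80, standard_deviation=6)
below = z_score(71, mean=80, standard_deviation=6)

print(above)  # 1.0  (one standard deviation above the mean)
print(below)  # -1.5 (one and a half standard deviations below the mean)
```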
density curve
-a curve that is always on or above the horizontal axis and has area exactly 1 underneath it
-describes the overall pattern of a distribution
-the area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval
-come in many shapes
-outliers are not described by the curve
-no set of real data is exactly described by a density curve; the curve is an approximation that is easy to use and is accurate enough for practical use
-the median of a density curve is the equal-areas point, the point that divides the area under the curve in half
-the mean μ of a density curve is the balance point, at which the curve would balance if made of solid material
-the median and the mean are the same for a symmetric density curve; they both lie at the center of the curve
-the mean of a skewed curve is pulled away from the median in the direction of the long tail
normal curve
-all normal curves have the same overall shape: symmetric, single-peaked, and bell-shaped
-any specific normal curve is completely described by giving its mean μ and its standard deviation σ
-the mean is located at the center of the symmetric curve and is the same as the median; changing μ without changing σ moves the normal curve along the horizontal axis without changing its spread
-the standard deviation σ controls the spread of a normal curve; curves with larger standard deviations are more spread out
-the standard deviation is the distance from the center to the change-of-curvature points on either side
-special properties of normal curves: μ and σ completely specify the shape of the distribution
normal distribution
-the distributions that normal curves describe
-we abbreviate the normal distribution with mean μ and standard deviation σ as N(μ, σ)
inflection point
-located at μ + σ and μ - σ (one standard deviation on either side of the mean of a normal curve)
-these are the points where the curve changes concavity
68-95-99.7 rule (empirical rule)
in a normal distribution with mean μ and standard deviation σ:
-approximately 68% of the observations fall within σ of the mean μ
-approximately 95% of the observations fall within 2σ of the mean μ
-approximately 99.7% of the observations fall within 3σ of the mean μ
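The three percentages can be checked numerically with Python's standard library. This sketch uses `statistics.NormalDist` (available in Python 3.8 and later) on the standard normal curve; the same proportions hold for any μ and σ.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of observations within k standard deviations of the mean:
# area to the left of +k minus area to the left of -k.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"within {k} standard deviation(s): {proportion:.4f}")
```

The printed proportions are about 0.6827, 0.9545, and 0.9973, which round to 68%, 95%, and 99.7%.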
standard normal distribution
-the normal distribution with mean 0 and standard deviation 1
-if a variable x has any normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable z = (x - μ)/σ has the standard normal distribution N(0, 1)
standard normal table
-table A, the standard normal table, is a table of areas under the standard normal curve
-the table entry for each value z is the area under the curve to the left of z
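A table entry can be reproduced with `statistics.NormalDist` (Python 3.8 and later). The function name `table_a_entry` is hypothetical, and rounding to four decimal places is an assumption about how a printed table is laid out.

```python
from statistics import NormalDist

def table_a_entry(z):
    """Area under the standard normal curve to the left of z, rounded to 4 places."""
    return round(NormalDist().cdf(z), 4)

print(table_a_entry(0))      # 0.5
print(table_a_entry(1.00))   # 0.8413
print(table_a_entry(-1.96))  # 0.025

# Area between two z values: subtract the two left-hand areas.
between = NormalDist().cdf(1.00) - NormalDist().cdf(-1.00)
print(round(between, 4))     # 0.6827
```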
normal probability plot
-good method for assessing normality
-if the points on a normal probability plot lie close to a straight line, the data are approximately normally distributed
-systematic deviations from a straight line indicate a non-normal distribution
-outliers appear as points that are far away from the overall pattern of the plot
-in a right-skewed distribution, the largest observations fall distinctly to the right of a line drawn through the main body of points
-similarly, left skewness is evident when the smallest observations fall to the left of the line
response variable
-measures an outcome of a study
-dependent variable (we don't use this terminology in statistics)
-example: accident death rate and life expectancy
explanatory variable
-may help explain or predict changes in a response variable
-independent variable (we don't use this terminology in statistics)
-example: car weight and number of cigarettes smoked
scatterplot
-shows the relationship between two quantitative variables measured on the same individuals
-the values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis
-each individual in the data appears as a point in the graph

how to make a scatterplot:
1. decide which variable should go on each axis
2. label and scale your axes
3. plot individual data values

how to examine a scatterplot:
as in any graph of data, look for the overall pattern and for striking departures from that pattern
-you can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship
-an important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship
positive association
-describes the overall trend, not the relationship between each pair of points
-when above-average values of one tend to accompany the above-average values of the other and when below-average values also tend to occur together
negative association
-describes the overall trend, not the relationship between each pair of points
-when above-average values of one tend to accompany below-average values of the other
correlation coefficient r
-measures the strength and direction of the linear association between two quantitative variables x and y
-although you can calculate a correlation for any scatterplot, r measures strength for only straight-line relationships
-indicates the direction of a linear relationship by its sign: r > 0 for a positive association and r < 0 for a negative association
-always satisfies -1 ≤ r ≤ 1 and indicates the strength of a linear relationship by how close it is to -1 or 1
-perfect correlation, + or - 1, occurs only when the points on a scatterplot lie exactly on a straight line
-values near 0 indicate a very weak linear relationship
-values near 1 indicate a strong positive relationship
-values near -1 indicate a strong negative relationship
facts about correlation
1. correlation makes no distinction between explanatory and response variables
2. because r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y, or both
3. the correlation r itself has no unit of measurement
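A Python sketch that illustrates facts 1 and 2. It assumes the usual textbook formula, in which r is the sum of the products of the standardized x and y values divided by n-1; that formula is not written out in these notes. The height and weight data are made up.

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = (1 / (n - 1)) * sum of (z-score of x) * (z-score of y)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)

# Hypothetical data: heights in inches and weights in pounds.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 170]

r = correlation(heights_in, weights_lb)
r_swapped = correlation(weights_lb, heights_in)    # fact 1: roles do not matter
heights_cm = [2.54 * h for h in heights_in]
r_cm = correlation(heights_cm, weights_lb)         # fact 2: units do not matter

print(round(r, 4), round(r_swapped, 4), round(r_cm, 4))
```

All three printed values agree, because swapping the variables or changing units leaves the standardized values unchanged.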
cautions about correlation
1. correlation does not imply causation
2. correlation requires that both variables be quantitative
3. correlation does not describe curved relationships between variables, no matter how strong the relationship is
4. a value of r close to 1 or -1 does not guarantee a linear relationship between two variables
5. like the mean and the standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations
6. correlation is not a complete summary of two-variable data, even when the relationship between the variables is linear
direction
-if the relationship has a clear direction, we speak of either positive association or negative association
form
-linear relationships, where the points show a straight-line pattern, are an important form of relationship between two variables
-curved relationships and clusters are other forms to watch for
strength
-the strength of a relationship is determined by how close the points in the scatterplot lie to a simple form such as a line
regression line
-is a line that describes how a response variable y changes as an explanatory variable x changes
-we often use a regression line to predict the value of y for a given value of x
-summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other
-regression, unlike correlation, requires that we have an explanatory variable and a response variable
-is a model for data
-has the form y-hat = a + bx
-y-hat is the predicted value of the response variable y for a given value of the explanatory variable x
-b is the slope, the amount by which y is predicted to change when x increases by one unit
-a is the y-intercept, the predicted value of y when x = 0
extrapolation
-the use of the regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line; such predictions are often not accurate
-often, using the regression line to make a prediction for x = 0 is an extrapolation; that's why the y intercept isn't always statistically meaningful
-don't make predictions using values of x that are much larger or much smaller than those that actually appear in your data
residual
-the difference between an observed value of the response variable and the value predicted by the regression line
-residual = observed y - predicted y = y - y hat
least-squares regression line
-the LSRL of y on x is the line that makes the sum of the squared residuals as small as possible
-makes errors in predicting y as small as possible by minimizing the sum of the squares of the residuals
-residuals of the LSRL have a special property: the mean of the least-squares residuals is always zero
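A Python sketch of fitting y-hat = a + bx and checking that the residuals average to zero. The slope and intercept formulas used here are the standard least-squares results, which these notes do not state explicitly; the data and function name are hypothetical.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (a, b) for the line y-hat = a + b*x that minimizes the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x-bar, y-bar)
    return a, b

# Hypothetical data (made up for illustration).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

a, b = least_squares_line(x, y)

# residual = observed y - predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(round(a, 4), round(b, 4))    # 1.8 0.8
print(round(mean(residuals), 10))  # zero, up to rounding
```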
residual plot
-a scatterplot of the residuals against the explanatory variable
-help us assess whether a linear model is appropriate or not
-when an obvious curved pattern exists in a residual plot, the model we are using is not appropriate
-graphical tool for determining if a LSRL is an appropriate model for a relationship between two variables
standard deviation of the residuals
-if we use a LSRL to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by

s = square root of (sum of the squared residuals / (n - 2))

-this value gives the approximate size of a "typical" prediction error (residual)
coefficient of determination r squared
-is the fraction of the variation in the values of y that is accounted for by the LSRL of y on x
-usually expressed as a percentage from 0% to 100%
-we can calculate r squared using the following formula:

r squared = 1 - (sum of the squared residuals) / (sum of the squared values of (yi - y bar))

-if all the points fall directly on the LSRL, the sum of the squared residuals is 0 and r squared is 1; then all of the variation is accounted for by the linear relationship with x
-because the LSRL yields the smallest possible sum of squared prediction errors, the sum of the squared residuals can never be more than the sum of the squared deviations from the mean of y
-in the worst case scenario, the LSRL does no better at predicting y than y = y bar does; then the two sums of squares are the same and r squared = 0
-interpretation: __% of the variation in [response variable] is accounted for by the linear model relating [response variable] and [explanatory variable].
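A Python sketch that computes both the standard deviation of the residuals s and r squared for a small made-up data set. The line is fitted with the standard least-squares formulas, which are an assumption here since the notes do not write them out.

```python
from math import sqrt
from statistics import mean

# Hypothetical data (made up for illustration).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
n = len(x)

# Fit the LSRL y-hat = a + b*x.
x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sum_sq_residuals = sum(res ** 2 for res in residuals)
sum_sq_deviations = sum((yi - y_bar) ** 2 for yi in y)

# Standard deviation of the residuals: size of a "typical" prediction error.
s = sqrt(sum_sq_residuals / (n - 2))

# Coefficient of determination.
r_squared = 1 - sum_sq_residuals / sum_sq_deviations

print(round(s, 4))          # 0.8944
print(round(r_squared, 4))  # 0.7273
```

In the interpretation template, about 72.7% of the variation in y would be accounted for by the linear model for this made-up data set.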
outlier/influential observation
outlier:
-an outlier is an observation that lies outside the overall pattern of the other observations
-points that are outliers in the y direction but not the x direction of a scatterplot have large residuals
-other outliers may not have large residuals

influential observation:
-an observation is influential for a statistical calculation if removing it would markedly change the result of the calculation
-points that are outliers in the x direction of a scatterplot are often influential for the LSRL
-in the regression setting, not all outliers are influential
-the LSRL is most likely to be heavily influenced by observations that are outliers in the x direction
-influential points often have small residuals, because they pull the regression line toward themselves
-the best way to verify that a point is influential is to find the regression line both with and without the unusual point
-if the line moves more than a small amount when the point is deleted, the point is influential
correlation and regression wisdom
1. the distinction between explanatory and response variables is important in regression
-LSRL makes the distances of the data points from the line small only in the y direction
-if we reverse the roles of the two variables, we get a different LSRL
-this isn't true for correlation: switching x and y does not affect the value of r
2. correlation and regression lines describe only linear relationships
-you can calculate the correlation and LSRL for any relationship between two quantitative variables, but the results are useful only if the scatterplot shows a linear pattern
-always plot your data!
3. correlation and LSRL are not resistant
-one unusual point in a scatterplot can greatly change the value of r
-LSRL are also not resistant
4. association does not imply causation
-when we study the relationship between two variables, we often hope to show that changes in the explanatory variable cause changes in the response variable
-a strong association between two variables is not enough to draw conclusions about cause and effect
-sometimes an observed association really does not reflect cause and effect
-in other cases, an association is explained by other variables, and the conclusion that x causes y is not valid
-an association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y
-remember: it only makes sense to talk about the correlation between two quantitative variables. if one or both variables are categorical, you should refer to the association between two variables. to be safe, you can use the more general term "association" when describing the relationship between any two variables
population
-the population in a statistical study is the entire group of individuals we want information about
census
-collects data from every individual in the population
-this is ideal, but not always realistic
sample
-is a subset of individuals in the population from which we actually collect data
sample survey
-the first step in planning a sample survey is to say exactly what population we want to describe
-the second step is to say exactly what we want to measure, that is, to give exact definition to our variables
-the final step in planning a sample survey is to decide how to choose a sample from the population
-we reserve the term "sample survey" for studies that use an organized plan to choose a sample that represents some specific population
-some people use the terms "survey" or "sample survey" to refer only to studies in which people are asked one or more questions (we'll avoid this restrictive terminology)
convenience sampling
-choosing individuals from the population who are easy to reach results in a convenience sample
-convenience sampling often produces unrepresentative data
-produces bias: using a method that favors some outcomes over others
-BAD SAMPLING
bias
-the design of a statistical study shows bias if it would consistently underestimate or consistently overestimate the value you want to know
-bias: using a method that favors some outcomes over others
-bias is not just bad luck in one sample; it's the result of a bad study design that will consistently miss the truth about the population in the same way
-convenience and voluntary response samples are almost guaranteed to show bias
voluntary response sample
-consists of people who choose themselves by responding to a general invitation
-also known as "self-selected samples"
-call-in, text-in, write-in, and many internet polls rely on voluntary response samples
-people who choose to participate in such surveys are usually not representative of some larger population of interest
-they attract people who feel strongly about an issue, and who often share the same opinion; that leads to bias
-BAD SAMPLING
random sampling
-involves using a chance process to determine which members of a population are included in the sample
-the easiest way to choose a random sample of n people is to write their names on identical slips of paper, put the slips in a hat, mix them well, and pull out slips one at a time until you have n of them
-an alternative would be to give each member of the population a distinct number and to use the "hat method" with these numbers instead of people's names
-the resulting sample is called a simple random sample, or SRS for short
simple random sample (SRS)
-a simple random sample (SRS) of size n is chosen in such a way that every group of n individuals in the population has an equal chance to be selected as the sample
-an SRS gives every possible sample of the desired size an equal chance to be chosen
-it also gives each member of the population an equal chance to be included in the sample
-ways to choose an SRS: hat method, table of random digits, random number generator
-the use of impersonal chance avoids bias

how to choose a simple random sample:
1. the "hat method"
-write names of people or what you are trying to sample on a piece of paper and draw it out of a hat
-assign numbers to people, write those numbers on a piece of paper and draw it out of a hat
-the hat method won't work well if the population is large
2. with a random number generator on a graphing calculator
-math -> PROB -> randInt
3. choosing an SRS with technology
-label. give each individual in the population a distinct numerical label from 1 to N
-randomize. use a random number generator to obtain n different integers from 1 to N
(it is standard practice to use n for sample size and N for the population size)
4. with table D
-label. give each member of the population a numerical label with the same number of digits. use as few digits as possible
-randomize. read consecutive groups of digits of the appropriate length from left to right across a line in table D. ignore any group of digits that wasn't used as a label or that duplicates a label already in the sample. stop when you have chosen n different labels
(your sample contains the individuals whose labels you find)
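Methods 3 and 4 can be sketched in Python. The function names and the line of digits are hypothetical; the digits are made up and are not copied from table D. The second function assumes labels run from 1 to N written with the same number of digits (for example 01 to 30).

```python
import random

def srs_with_technology(population_size, sample_size, seed=None):
    """Label 1..N, then let a random number generator pick n different labels."""
    rng = random.Random(seed)
    labels = range(1, population_size + 1)
    return sorted(rng.sample(labels, sample_size))   # sample() never repeats a label

def srs_from_digit_line(digit_line, population_size, sample_size):
    """Read consecutive groups of digits left to right, skipping unused and repeated labels."""
    width = len(str(population_size))        # same number of digits for every label
    digits = digit_line.replace(" ", "")
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        label = int(digits[start:start + width])
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:       # stop once n different labels are chosen
            break
    return chosen

print(srs_with_technology(population_size=30, sample_size=4, seed=1))

# Made-up line of random digits, read in groups of two: 07 45 11 07 92 28 03 ...
print(srs_from_digit_line("07451 10792 28036 61530", population_size=30, sample_size=4))
# [7, 11, 28, 3]
```

In the digit-line example, 45 and 92 are skipped because they are not labels, and the second 07 is skipped because it is a repeat.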
design
-the design of a sample refers to the method used to choose the sample from the population
stratified random sample and strata
-to get a stratified random sample, start by classifying the population into groups of similar individuals, called strata
-then choose a separate SRS in each stratum and combine these SRSs to form the sample
-choose the strata based on facts known before the sample is taken
-for example, in a study of sleep habits on school nights, the population of students might be divided into freshman, sophomore, junior, and senior strata
-stratified random sampling works best when the individuals within each stratum are similar with respect to what is being measured and when there are large differences between strata
-when we can choose strata that are "similar within but different between," stratified random samples give more precise estimates than simple random samples of the same size
-however, both simple random sampling and stratified random sampling are hard to use when populations are large and spread out over a wide area
(stratum is singular and strata is plural)
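A Python sketch of a stratified random sample using the class-year strata from the sleep-habits example. The stratum sizes (100 students each), the sample size per stratum, and the function name are assumptions made for illustration.

```python
import random

def stratified_random_sample(strata, per_stratum, seed=None):
    """Take a separate SRS inside each stratum, then combine the SRSs."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

# Hypothetical population: 100 students in each class year, identified by (year, number).
years = ["freshman", "sophomore", "junior", "senior"]
strata = {year: [(year, i) for i in range(1, 101)] for year in years}

sample = stratified_random_sample(strata, per_stratum=5, seed=2)

counts = {year: sum(1 for y, _ in sample if y == year) for year in years}
print(counts)  # every class year contributes exactly 5 students
```

Unlike an SRS of 20 students, this design guarantees that every stratum is represented.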
cluster sample and clusters
-to get a cluster sample, start by classifying the population into groups of individuals that are located near each other, called clusters
-then choose an SRS of the clusters
-all the individuals in the chosen clusters are included in the sample
-in a cluster sample, some people take an SRS from each cluster rather than including all members of the cluster
-cluster samples are often used for practical reasons, like saving time and money
-cluster sampling works best when the clusters look just like the population but on a smaller scale
-cluster samples don't offer the statistical advantage of better information about the population that stratified random samples do; that's because clusters are often chosen for ease so they may have as much variability as the population itself

for example: imagine a large high school that assigns its students to homerooms alphabetically by last name. the school administration is considering a new schedule and would like student input. administrators decide to survey 200 randomly selected students. it would be difficult to track down an SRS of 200 students, so the administration opts for a cluster sample of homerooms. the principal (who knows some statistics) takes a simple random sample of 8 homerooms and gives the survey to all 25 students in each classroom
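The homeroom example can be sketched in Python. The total of 40 homerooms is an assumption (the example only says 8 homerooms of 25 students are chosen), and the function name is hypothetical.

```python
import random

def cluster_sample(clusters, clusters_to_choose, seed=None):
    """Choose an SRS of clusters, then include every individual in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), clusters_to_choose)
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical school: 40 homerooms with 25 students each.
homerooms = {
    f"homeroom {h:02d}": [f"homeroom {h:02d}, student {s}" for s in range(1, 26)]
    for h in range(1, 41)
}

chosen_homerooms, surveyed_students = cluster_sample(homerooms, clusters_to_choose=8, seed=3)

print(chosen_homerooms)
print(len(surveyed_students))  # 200
```

Only 8 random choices are needed to reach 200 students, which is the practical advantage of a cluster sample.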
the difference between strata and clusters
-be sure to know the difference between strata and clusters
-we want each stratum to contain similar individuals and for there to be large differences between strata
-for a cluster sample, we'd like each cluster to look just like the population, but on a smaller scale
-remember: strata are ideally "similar within but different between" while clusters are ideally "different within but similar between"
inference
-the purpose of a sample is to give us information about a larger population
-the process of drawing conclusions about a population on the basis of sample data is called inference because we infer information about the population from what we know in the sample
-inference from convenience or voluntary response samples would be misleading because these methods of choosing a sample are biased; we are almost certain that the sample does not fairly represent the population
-THE FIRST REASON TO RELY ON RANDOM SAMPLING IS TO AVOID BIAS IN CHOOSING A SAMPLE
-still, it is unlikely that results from a random sample are exactly the same as for the entire population
-the sample results will differ somewhat just by chance
undercoverage
-occurs when some members of the population cannot be chosen in a sample
-sampling is often done using a list of individuals in the population; such lists are seldom accurate or complete; the result is undercoverage
-most samples suffer from some degree of undercoverage
-the results of national sample surveys therefore have some bias due to undercoverage if the people not covered differ from the rest of the population
sampling frame
-the list of individuals from which a sample will be drawn
nonresponse
-occurs when an individual chosen for the sample can't be contacted or refuses to participate
-the real problems start after the sample is chosen
-nonresponse to surveys often exceeds 50%, even with careful planning and several follow-up calls
-if the people who respond differ from those who don't, in a way that is related to the response, bias results
-nonresponse can occur only after a sample has been selected
response bias
-another type of nonsampling problem occurs when people give inaccurate answers to survey questions
-people may lie about their age, income, or drug use
-they may misremember how many hours they spent on the internet last week
-or they might make up an answer to a question that they don't understand
-the gender, race, age, ethnicity, or behavior of the interviewer can also affect people's responses
-a systematic pattern of inaccurate answers in a survey leads to response bias
wording of questions
-the wording of questions is the most important influence on the answers given to a sample survey
-confusing or leading questions can introduce strong bias
-changes in wording can greatly affect a survey's outcome
-even the order in which the questions are asked matters
-don't trust the results of a sample survey until you have read the exact questions asked
observational study
-observes individuals and measures variables of interest but does not attempt to influence the responses
-goal can be to describe some group or situation, to compare groups, or to examine relationships between variables
-an observational study, even one based on a random sample, is a poor way to gauge the effect that changes in one variable have on another variable
experiment
-deliberately imposes some treatment on individuals to measure their responses
-the purpose of an experiment is to determine whether the treatment causes a change in the response
-when our goal is to understand cause and effect, experiments are the only source of fully convincing data
confounding
-occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other
-observational studies of the effect of an explanatory variable on a response variable often fail because of confounding between the explanatory variable and one or more other variables
-well-designed experiments take steps to prevent confounding
-beware of the influence of other variables!
-AP EXAM TIP:
if you are asked to identify a possible confounding variable in a given setting, you are expected to explain how the variable you choose (1) is associated with the explanatory variable and (2) affects the response variable
confounding variable
-variable that results in confounding
treatment
-a specific condition applied to the individuals in an experiment
-if an experiment has several explanatory variables, a treatment is a combination of specific values of these variables
experimental units
-are the smallest collection of individuals to which treatments are applied
subjects
-when the experimental units are human beings, they are often called subjects
factors
-sometimes, the explanatory variables in an experiment are called factors
-many experiments study the joint effects of several factors
-in such an experiment, each treatment is formed by combining a specific value (often called a level) of each of the factors
level
-a specific value
random assignment
-in an experiment, random assignment means that experimental units are assigned to treatments using a chance process
-random assignment does for experiments what random sampling does for surveys: it removes bias, here in deciding which units get which treatment
-creates roughly equivalent groups at the beginning of the experiment
-helps ensure that the effects of other variables are spread evenly among the treatment groups
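One way to carry out random assignment is to shuffle the experimental units and deal them out to the treatments in turn. The sketch below assumes 20 hypothetical subjects and two treatments; the function name `randomly_assign` is made up for the illustration. Dealing in turn happens to give equal group sizes, which is a convenience and not a requirement.

```python
import random

def randomly_assign(units, treatments, rng):
    """Shuffle the units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)  # the chance process
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

subjects = [f"subject{i}" for i in range(1, 21)]  # hypothetical subjects
groups = randomly_assign(subjects, ["treatment", "placebo"], random.Random(7))
print({name: len(members) for name, members in groups.items()})
```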
control
-keep other variables that might affect the response the same for all groups
control group
-the group of patients who receive the dummy treatment or placebo is the control group
-use of a control group enables us to control the effects of outside variables on the outcome
-control is the first basic principle of statistical design of experiments
-THE MAIN PURPOSE OF A CONTROL GROUP IS TO PROVIDE A BASELINE FOR COMPARING THE EFFECTS OF THE OTHER TREATMENTS
replication
-we would not trust an experiment with just one student in each group
-the results would depend too much on which group got lucky and received the stronger student
-if we assign many subjects to each group, however, the effects of chance will balance out, and there will be little difference in the average responses in the two groups unless the treatments themselves cause a difference
-this is the idea of replication: use enough experimental units to distinguish a difference in the effects of the treatments from chance variation due to random assignment
completely randomized design
-in a completely randomized design, the experimental units are assigned to the treatments completely by chance
-notice that the definition of a completely randomized design does not require that each treatment be assigned to an equal number of experimental units
-it does specify that the assignment of treatments must occur completely at random
placebo
-a dummy treatment
placebo effect
-many patients respond favorably to any treatment, even a placebo; this may be due to trust in the doctor and expectations of a cure, or simply due to the fact that medical conditions often improve without treatment
-favorable response to a dummy treatment is called the placebo effect
double-blind
-in a double-blind experiment, neither the subjects nor those who interact with them and measure the response variable know which treatment a subject received
-the idea behind a double-blind design is simple; until the experiment ends and the results are in, only the study's statistician knows for sure which treatment a subject is receiving
single-blind
-however, some experiments cannot be carried out in a double-blind manner
-if researchers are comparing the effects of exercise and dieting on weight loss, then subjects will know which treatment they are receiving
-such an experiment can still be single-blind if the individuals who are interacting with the subjects and measuring the response variable don't know who is dieting and who is exercising
-in other single-blind experiments, the subjects are unaware of which treatment they are receiving, but the people interacting with them and measuring the response variable do know
statistically significant
-in an experiment, researchers usually hope to see a difference in the responses so large that it is unlikely to happen just because of chance variation
-we can use the laws of probability, which describe chance behavior, to learn whether the treatment effects are larger than we would expect to see only if chance were operating
-if they are, we call them statistically significant
-an observed effect so large that it would rarely occur by chance is called statistically significant
-if we observe statistically significant differences among the groups in a randomized comparative experiment, we have good evidence that the treatments caused these differences
-the great advantage of randomized comparative experiments is that they can produce data that give good evidence for a cause-and-effect relationship between the explanatory and response variables
-in general, a strong association between two variables does not imply causation; however, a statistically significant association in data from a well-designed experiment does imply causation
block
-is a group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to the treatments
randomized block design
-in a randomized block design, the random assignment of experimental units to treatments is carried out separately within each block
-helps account for the variation in the response that is due to the blocking variable; with that variation accounted for, the remaining variation within each block is smaller
-when blocks are formed wisely, it is easier to find convincing evidence that one treatment is more effective than the other
-CONTROL WHAT YOU CAN, BLOCK ON WHAT YOU CAN'T CONTROL, AND RANDOMIZE TO CREATE COMPARABLE GROUPS
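A randomized block design repeats the random assignment separately inside each block. The sketch below assumes two hypothetical blocks (labelled "men" and "women") and two treatments "A" and "B"; none of these names come from the card.

```python
import random

def block_assign(units_by_block, treatments, rng):
    """Randomly assign units to treatments separately within each block."""
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # a fresh shuffle for every block
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical blocks; the unit labels are made up for illustration.
blocks = {
    "men": ["m1", "m2", "m3", "m4"],
    "women": ["w1", "w2", "w3", "w4", "w5", "w6"],
}
assignment = block_assign(blocks, ["A", "B"], random.Random(3))
for block, units in blocks.items():
    print(block, sorted((assignment[u], u) for u in units))
```

Because the shuffle is done block by block, each treatment gets the same number of men and the same number of women, so the blocking variable cannot be confounded with the treatment.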
matched-pairs design
-a common type of randomized block design for comparing two treatments is a matched pairs design
-the idea is to create blocks by matching pairs of similar experimental units (e.g., twins)
-then we can use chance to decide which member of a pair gets the first treatment
-the other subject in that pair receives the other treatment; that is, the random assignment of subjects to treatments is done within each matched pair
-just as with other forms of blocking, matching helps account for the variation among the experimental units

-sometimes each "pair" in a matched pairs design consists of just one experimental unit that gets both treatments one after the other
-in that case, each experimental unit serves as its own control
-the order of the treatments can influence the response, so we randomize the order for each experimental unit
law of large numbers
-if we observe more and more repetitions of any chance process, the proportion of times that a specific outcome occurs approaches a single value (we call this value probability)
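The law of large numbers can be watched in a quick simulation. This sketch assumes a fair coin (probability 0.5 of heads) and records the running proportion of heads at a few checkpoints; the seed and checkpoint values are arbitrary choices.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
heads = 0
checkpoints = {}
for n in range(1, 100_001):
    heads += rng.random() < 0.5  # True counts as 1, False as 0
    if n in (10, 100, 1_000, 10_000, 100_000):
        checkpoints[n] = heads / n  # proportion of heads so far

for n, proportion in checkpoints.items():
    print(n, round(proportion, 4))
```

The early proportions can wander well away from 0.5; the later ones settle close to it.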
random
-we call a phenomenon random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions
-"random" in statistics is not a synonym for "haphazard" but a description of a kind of order that emerges only in the long run
probability
-the probability of any outcome of a chance process is a number between 0 and 1 that describes the proportion of times the outcome would occur in a very long series of repetitions
-outcomes that never occur have a probability of 0
-an outcome that happens on every repetition has a probability of 1
-an outcome that happens half the time in a very long series has a probability of 0.5
-probabilities only describe what happens in the long run. short runs of random phenomena like tossing coins or shooting a basketball often don't look random to us because they do not show the regularity that emerges in very many repetitions
probability theory
-probability theory is the branch of mathematics that describes random behavior
simulation
-is the imitation of chance behavior, based on a model that accurately reflects the experiment under consideration
-most often carried out with random numbers (Table D or randInt on the calculator)

STEPS FOR CONDUCTING A SIMULATION:
1. state. ask a question of interest about some chance process
2. plan. describe how to use a chance device to imitate one repetition
3. do. perform many repetitions of the simulation
4. conclude. use the results of your simulation to answer the question of interest
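The four steps can be followed in code as well as with Table D or a calculator. The question below (what is the probability of getting at least one 6 in four rolls of a fair die?) is an assumed example, not one from the card, and the function names are made up for the sketch.

```python
import random

# State: what is the probability of at least one 6 in four rolls of a fair die?

def one_repetition(rng):
    """Plan: roll a fair die four times; success if any roll is a 6."""
    return any(rng.randint(1, 6) == 6 for _ in range(4))

def simulate(reps, seed=0):
    """Do: perform many repetitions and return the proportion of successes."""
    rng = random.Random(seed)
    return sum(one_repetition(rng) for _ in range(reps)) / reps

# Conclude: compare the simulated estimate with the exact answer.
estimate = simulate(20_000)
exact = 1 - (5 / 6) ** 4
print(round(estimate, 3), round(exact, 3))
```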
sample space
-the sample space S of a chance process is the set of all possible outcomes
-can be very simple or very complex
-ex: tossing a coin S = {H, T}
-ex: when Gallup draws a random sample of 1523 adults and asks a survey question, the sample space contains all possible sets of responses from 1523 of the 235 million adults in the country (this S is extremely large); each member of S lists the answers from one possible sample, which explains the term sample space
probability model
-a description of some chance process that consists of two parts: a sample space S and a probability for each outcome
-does more than just assign a probability to each outcome; it allows us to find the probability of any collection of outcomes, which we call an event
event
-an event is any collection of outcomes from some chance process
-that is, an event is a subset of the sample space
-events are usually designated by capital letters, like A, B, C and so on
-if A is any event, we write its probability P(A)
-the probability of any event is a number between 0 and 1
-the probability that an event does not occur is 1 minus the probability that the event does occur
complement
-the complement of any event A is the event that A does not occur, written as A^c
-the complement rule states that P(A^c) = 1 - P(A)
mutually exclusive or disjoint
-two events A and B are mutually exclusive (disjoint) if they have no outcomes in common and so can never occur together - that is, if P(A and B) = 0
-Addition Rule for Mutually Exclusive Events: P(A or B) = P(A) + P(B)
addition rule for mutually exclusive (disjoint) events
-P(A or B) = P(A) + P(B)
general addition rule
-P(A or B) = P(A) + P(B) - P(A and B)
intersection
-A ∩ B
-A and B
-the intersection of two or more events is the event that they all occur
union
-A ∪ B
-A or B
-the union of two or more events is the event that at least one of them occurs
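The complement and addition rules can be checked by listing a small sample space. The sketch below assumes two fair dice with 36 equally likely outcomes; the events `sum_is_7`, `doubles`, and `first_is_6` are chosen only for illustration. Python's set operators `&` and `|` play the roles of intersection and union.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # sample space for two fair dice

def P(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

sum_is_7 = {o for o in S if o[0] + o[1] == 7}
doubles = {o for o in S if o[0] == o[1]}
first_is_6 = {o for o in S if o[0] == 6}

# Complement rule: P(A^c) = 1 - P(A)
print(P(S - sum_is_7) == 1 - P(sum_is_7))  # True

# Mutually exclusive events (a sum of 7 can never be doubles)
print(P(sum_is_7 & doubles))  # 0
print(P(sum_is_7 | doubles) == P(sum_is_7) + P(doubles))  # True

# General addition rule (these two events share the outcome (6, 1))
print(P(sum_is_7 | first_is_6)
      == P(sum_is_7) + P(first_is_6) - P(sum_is_7 & first_is_6))  # True
```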
conditional probability
-the probability that one event happens given that another event is already known to have happened is called a conditional probability
-suppose we know that event A has happened; then the probability that event B happens is denoted by P(B|A)
-P(A|B) = P(A ∩ B)/P(B)
-P(B|A) = P(B ∩ A)/P(A)
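The conditional probability formula can be applied directly to a listed sample space. This sketch again assumes two fair dice; the events `doubles` and `high_sum` (a sum of 9 or more) are illustrative choices, and `P_given` is a made-up helper name.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes

def P(event):
    return Fraction(len(event), len(S))

def P_given(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return P(event & given) / P(given)

doubles = {o for o in S if o[0] == o[1]}
high_sum = {o for o in S if sum(o) >= 9}

print(P_given(doubles, high_sum))  # 1/5
print(P_given(high_sum, doubles))  # 1/3
```

The two answers differ, which is a reminder that P(A|B) and P(B|A) are not interchangeable.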
joint event
-the simultaneous occurrence of two events
joint probability
-the probability of a joint event
tree diagram
-when chance behavior involves a sequence of outcomes, a tree diagram can be used to describe the sample space
-tree diagrams can also help in finding the probability that two or more events occur together
-we simply multiply along the branches that correspond to the outcomes of interest
independent
-two events A and B are independent if the occurrence of one event does not change the probability that the other event will happen
-in other words, events A and B are independent if P(A|B) = P(A) and P(B|A) = P(B)
general multiplication rule
-P(A and B) = P(A ∩ B) = P(A)P(B|A)
multiplication rule for independent events
-P(A and B) = P(A ∩ B) = P(A)P(B)
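Independence and the two multiplication rules can be checked on the same kind of listed sample space. The sketch assumes two fair dice; the events are illustrative choices and `independent` is a made-up helper name.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes

def P(event):
    return Fraction(len(event), len(S))

def independent(A, B):
    """A and B are independent exactly when P(A and B) = P(A)P(B)."""
    return P(A & B) == P(A) * P(B)

sum_is_7 = {o for o in S if o[0] + o[1] == 7}
first_is_6 = {o for o in S if o[0] == 6}
doubles = {o for o in S if o[0] == o[1]}
high_sum = {o for o in S if sum(o) >= 9}

print(independent(sum_is_7, first_is_6))  # True
print(independent(doubles, high_sum))     # False

# The general multiplication rule holds even for dependent events.
P_high_given_doubles = P(doubles & high_sum) / P(doubles)
print(P(doubles & high_sum) == P(doubles) * P_high_given_doubles)  # True
```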
independent trial
-trials are independent if the outcome of one trial does not influence the outcome of any other
replacement
-for example, suppose you are selecting random digits by drawing numbered slips of paper from a hat and you want all ten digits to be equally likely on each draw; after you draw a digit and record it, you must put it back into the hat so that the second draw is exactly like the first (sampling with replacement)
-if you do not replace the slips you draw, however, there are only nine choices for the second slip picked, and eight for the third (sampling without replacement)
random variable
-takes numerical values that describe the outcomes of some chance process
-a variable whose value is a numerical outcome of a random phenomenon
-there are two main types of random variables, corresponding to two types of probability distributions: discrete and continuous random variables
probability distribution
-the probability distribution of a random variable gives its possible values and their probabilities
discrete random variable
-a discrete random variable X takes a fixed set of possible values with gaps in between

probability distribution of a discrete random variable X:
----------------------------------
value: x1, x2, x3, ...
probability: p1, p2, p3, ...
----------------------------------
probability histogram
-for displaying a probability distribution of a discrete random variable X
-probability vs. value
-the height of each bar represents the probability of the value at its base
-the heights of all the bars sum to 1 because the heights are probabilities
expected value of X
-the expected value (mean) of a discrete random variable X is a weighted average of all possible values of X, with each value weighted by its probability, since the values need not be equally likely
-μ = x1p1 + x2p2 + x3p3 + ...
-this expected value need not be a possible value for X
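The expected value is a probability-weighted sum, which takes one line to compute. The sketch below assumes a fair six-sided die as the random variable; `expected_value` is a made-up helper name.

```python
from fractions import Fraction

def expected_value(distribution):
    """Weighted average of the values, each weighted by its probability."""
    assert sum(distribution.values()) == 1  # probabilities must sum to 1
    return sum(x * p for x, p in distribution.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die (assumed example)
print(expected_value(die))  # 7/2
```

The result, 3.5, is not a face of the die, which illustrates that the expected value need not be a possible value of X.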
law of large numbers (with means of discrete random variables)
-the law of large numbers says that the average of the values of X observed in many trials must approach μ
continuous random variable
-a continuous random variable X takes all values in an interval of numbers
-the probability distribution of X is described by a density curve
-the probability of any event is the area under the density curve and above the values of X that make up the event
-exact calculation of the mean and standard deviation for most continuous random variables requires advanced mathematics
probability distribution of a continuous random variable X
-described by a density curve
-the probability of any event is the area under the density curve and above the values of X that make up the event
-the probability distribution for a continuous random variable assigns probabilities to intervals of outcomes rather than to individual outcomes
-in fact, all continuous probability models assign probability zero to every individual outcome
-only intervals of values have positive probability
-we can use any density curve to assign probabilities
-the density curves that are most familiar to us are Normal curves
-the mean of the distribution is the point at which the area under the density curve would balance if it were made out of solid material
-the mean lies at the center of symmetric density curves such as the Normal curves
-we can locate the standard deviation of a Normal distribution from its inflection points
independent random variables
-if knowing whether any event involving X alone has occurred tells us nothing about the occurrence of any event involving Y alone, and vice versa, then X and Y are independent random variables
uniform distribution
-a uniform distribution, also known as a rectangular distribution, spreads probability evenly: every possible value (discrete case) or every interval of the same length (continuous case) is equally likely
binomial setting
-a binomial setting arises when we perform several independent trials of the same chance process and record the number of times that a particular outcome occurs
THE FOUR CONDITIONS ARE:
1. the possible outcomes of each trial can be classified as a "success" or a "failure"
2. trials must be independent; that is, knowing the result of one trial must not tell us anything about the result of any other trial
3. the number of trials n must be fixed in advance
4. there is the same probability p of success on each trial
binomial random variable
-the count X of successes in a binomial setting is a binomial random variable
-a kind of discrete random variable
binomial distribution
-the probability distribution of X is a binomial distribution with parameters n and p
-n is the number of trials of the chance process
-p is the probability of a success on any one trial
-the possible values of X are the whole numbers from 0 to n
probability distribution function
-given a discrete random variable X, the probability distribution function assigns a probability to each value of X
cumulative distribution function
-the cumulative distribution function of X calculates the sum of the probabilities for 0, 1, 2 ... up to the value of X
-that is, it calculates the probability of obtaining at most X successes in n trials
binomial coefficient
-"n choose k"
-it is the number of ways of arranging k successes among n observations/trials

BINOMIAL COEFFICIENTS ON THE CALCULATOR:
math -> PRB -> nCr
"n choose r"
nCr = n! / (r! (n - r)!), where n! = n x (n - 1) x ... x 2 x 1
binomial probability
-if X has the binomial distribution with n trials and probability p of success on each trial, the possible values of X are 0, 1, 2...n.
-if x is any one of these values, then the binomial probability is P(X = x) = (n choose x) p^x (1 - p)^(n - x)

BINOMIAL PROBABILITIES ON THE CALCULATOR:
binompdf (n, p, k) computes P(X = k)
binomcdf (n, p, k) computes P(X ≤ k)
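For checking calculator answers, the two calculator commands can be imitated in a few lines. The functions below are a sketch that borrows the calculator's names and argument order; they are not the calculator's own implementation. `math.comb` is the "n choose k" binomial coefficient.

```python
from math import comb

def binompdf(n, p, k):
    """P(X = k) for a binomial random variable with n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(X <= k): add up P(X = 0), P(X = 1), ..., P(X = k)."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

print(comb(5, 2))           # 10
print(binompdf(5, 0.5, 2))  # 0.3125
print(binomcdf(5, 0.5, 2))  # 0.5
```

To get P(X ≥ k), use the complement rule: 1 - binomcdf(n, p, k - 1).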
geometric setting
-a geometric setting arises when we perform independent trials of the same chance process and record the number of trials it takes to get one success

examples:
-roll a pair of dice until you get doubles
-in basketball, attempt a three-point shot until you make one
-keep placing a $1 bet on the number 15 in roulette until you win

conditions for a geometric setting:
1. the possible outcomes of each trial can be classified as a "success" or a "failure"
2. trials must be independent; that is, knowing the result of one trial must not tell us anything about the result of any other trial
3. the variable of interest is the number of trials required to obtain the first success
4. there is the same probability p of success on each trial
geometric random variable
-the number of trials Y it takes to get a success in a geometric setting is a geometric random variable
geometric distribution
-the probability distribution of Y is a geometric distribution with parameter p, the probability of success on any trial
-the possible values of Y are 1, 2, 3...
geometric probability
-if Y has the geometric distribution with the probability p of success on each trial, the possible values of Y are 1, 2, 3, ... . if k is any one of these values, the geometric probability is P(Y = k) = (1 - p)^(k - 1) p

GEOMETRIC PROBABILITIES ON THE CALCULATOR:
geometpdf (p, k) computes P(Y = k)
geometcdf (p, k) computes P(Y ≤ k)
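The geometric commands can be imitated the same way. The sketch borrows the calculator's names and argument order and uses the rolling-doubles example from the geometric setting card (p = 1/6); exact fractions are used so the results are easy to verify by hand.

```python
from fractions import Fraction

def geometpdf(p, k):
    """P(Y = k): k - 1 failures followed by the first success."""
    return (1 - p) ** (k - 1) * p

def geometcdf(p, k):
    """P(Y <= k): 1 minus the probability that the first k trials all fail."""
    return 1 - (1 - p) ** k

p_doubles = Fraction(1, 6)  # chance of doubles on one roll of two dice
print(geometpdf(p_doubles, 3))  # 25/216
print(geometcdf(p_doubles, 3))  # 91/216
```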
parameter
-is a number that describes some characteristic of the population
-the value of the parameter is usually not known because we cannot examine the entire population
statistic
-is a number that describes some characteristic of a sample
-the value of the statistic can be computed directly from sample data
-we often use a statistic to estimate an unknown parameter
sampling variability
-the value of a statistic varies in repeated random sampling
sampling distribution
-the sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population
unbiased estimator
-a statistic used to estimate a parameter is an unbiased estimator if the mean of its sampling distribution is equal to the value of the parameter being estimated
-the sample range is a biased estimator of the population range
sampling distribution of p-hat
-as n increases, the sampling distribution of p-hat becomes approximately Normal
-before you perform Normal calculations, check that the Large Counts Condition is satisfied: np and nq are both greater than or equal to 10, where q = 1 - p
-the mean of the sampling distribution of p-hat is the true value of the population proportion p
-the standard deviation of the sampling distribution of p-hat is square root of pq/n as long as the 10% condition is satisfied (the population is at least ten times the sample size)
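The facts on this card can be bundled into one small helper. This is a sketch with made-up names (`p_hat_sampling_distribution` and its dictionary keys), and the numbers in the example (p = 0.5, n = 100, population of 5000) are assumed for illustration.

```python
from math import sqrt

def p_hat_sampling_distribution(p, n, population_size):
    """Mean, standard deviation, and condition checks for p-hat."""
    q = 1 - p
    return {
        "mean": p,
        "sd": sqrt(p * q / n),
        "large_counts_ok": n * p >= 10 and n * q >= 10,
        "ten_percent_ok": population_size >= 10 * n,
    }

summary = p_hat_sampling_distribution(p=0.5, n=100, population_size=5000)
print(summary)
```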
sampling distribution of x-bar
-the mean of the sampling distribution of x-bar is the population mean, mu
-the standard deviation of the sampling distribution of x-bar is sigma/square root of n
-to cut the standard deviation in half, you must use a sample that is 4 times as large
Central Limit Theorem
-draw an SRS of size n from any population with mean mu and finite standard deviation sigma
-the central limit theorem (CLT) says that when n is large, the sampling distribution of the sample mean x-bar is approximately Normal
-how large a sample size n is needed for the sampling distribution of x-bar to be close to Normal depends on the population distribution. more observations are required if the shape of the population distribution is far from normal. in that case, the sampling distribution of x-bar will also be very non-Normal if the sample size is small
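A simulation makes the central limit theorem concrete. The sketch below assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1), a sample size of 40, and 2000 repeated samples; all of these are illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable
n, reps = 40, 2000

def sample_mean(size):
    """x-bar for one sample from an exponential population (mean 1, sd 1)."""
    return sum(rng.expovariate(1.0) for _ in range(size)) / size

xbars = [sample_mean(n) for _ in range(reps)]
center = mean(xbars)   # should be near the population mean, 1
spread = stdev(xbars)  # should be near sigma / sqrt(n)
within_one_sd = sum(abs(x - center) <= spread for x in xbars) / reps

print(round(center, 3), round(spread, 3), round(1 / sqrt(n), 3))
print(round(within_one_sd, 3))  # roughly 0.68 if x-bar is close to Normal
```

Even though individual observations are far from Normal, the sample means pile up symmetrically around the population mean with spread close to sigma/square root of n.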
Normal/Large Sample Condition for Sample Means
-if the population distribution is Normal, then so is the sampling distribution of x-bar. this is true no matter what the sample size n is
-if the population distribution is not Normal, the CLT tells us that the sampling distribution of x-bar will be approximately Normal in most cases if n is greater than or equal to 30
point estimator
-is a statistic that provides an estimate of a population parameter
point estimate
-the value of that statistic from a sample is called a point estimate
C% confidence interval
-a C% confidence interval uses sample data to estimate an unknown population parameter with an indication of how precise the estimate is and of how confident we are that the result is correct
-point estimate ± margin of error
confidence level C
-the confidence level C gives the overall success rate of the method for calculating the confidence interval
-that is, in C% of all possible samples, the method would yield an interval that captures the true parameter value
upper p critical value
-the number z* with probability p lying to its right under the standard Normal curve
margin of error
-(critical value)(standard deviation of the statistic)
-other things being equal, the margin of error of a confidence interval gets smaller as the confidence level C decreases and the sample size n increases
-as the standard deviation decreases, the margin of error gets smaller
-the margin of error covers only chance variation due to random sampling or random assignment
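As a worked case of "point estimate ± margin of error," here is a sketch of a one-sample z interval for a proportion. The data (520 successes in 1000 trials) are hypothetical, the function name is made up, and the standard error of p-hat stands in for the unknown standard deviation. `NormalDist().inv_cdf` from the standard library supplies the critical value z*.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """One-sample z interval for a population proportion."""
    p_hat = successes / n                                    # point estimate
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)          # margin of error
    return p_hat - margin, p_hat + margin

low95, high95 = proportion_interval(520, 1000, confidence=0.95)
low99, high99 = proportion_interval(520, 1000, confidence=0.99)
print(round(low95, 3), round(high95, 3))
print(round(low99, 3), round(high99, 3))
```

The 99% interval comes out wider than the 95% interval, matching the card: lowering the confidence level shrinks the margin of error.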
null hypothesis
-the claim we weigh evidence against in a statistical test is called the null hypothesis
-often the null hypothesis is a statement of "no difference"
-Ho: parameter = value
alternative hypothesis
-the claim about the population that we are trying to find evidence for is the alternative hypothesis
-Ha: parameter < value
-Ha: parameter > value
-Ha: parameter does not equal value
one-sided alternative hypothesis and two-sided alternative hypothesis
-the alternative hypothesis is one-sided if it states that a parameter is larger than the null hypothesis value or if it states that the parameter is smaller than the null value
-it is two-sided if it states that the parameter is different from the null hypothesis value (it could be either larger or smaller)
p-value
-the probability, computed assuming Ho is true, that the statistic would take a value as extreme as or more extreme than the one actually observed, in the direction specified by the alternative hypothesis, is called the p-value of the test
-small values of p are evidence against the null hypothesis because they say that the observed result is unlikely to occur when the null hypothesis is true
-large values of p fail to give convincing evidence against the null hypothesis and in favor of the alternative hypothesis because they say that the observed result is likely to occur by chance alone when the null hypothesis is true
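A p-value can be computed directly from its definition once the sampling distribution under Ho is known. The sketch below assumes a one-sample z test for a proportion with hypothetical data (520 successes in 1000 trials, Ho: p = 0.5); the function name and the `alternative` labels are made up for the illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_p_value(successes, n, p0, alternative="greater"):
    """p-value for a one-sample z test of Ho: p = p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # standardized under Ho
    standard_normal = NormalDist()
    if alternative == "greater":
        return 1 - standard_normal.cdf(z)
    if alternative == "less":
        return standard_normal.cdf(z)
    return 2 * (1 - standard_normal.cdf(abs(z)))  # two-sided

one_sided = proportion_p_value(520, 1000, 0.5, "greater")
two_sided = proportion_p_value(520, 1000, 0.5, "two-sided")
print(round(one_sided, 3), round(two_sided, 3))
```

Both values are well above 0.05, so these hypothetical data would not give convincing evidence against Ho.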
statistically significant
-if the p-value is smaller than alpha, we say that the results of a study are statistically significant at the alpha level
-in that case, we reject the null hypothesis and conclude that there is convincing evidence in favor of the alternative hypothesis
significance level
-the significance level alpha is a fixed cutoff for the p-value, chosen before studying the data
-by selecting a significance level, we decide in advance how much evidence against the null hypothesis we require in order to reject the null hypothesis
-most common significance level is 0.05
Type I Error
-if we reject Ho when Ho is true, we have committed a Type I Error
-the data give convincing evidence for Ha when in reality Ho is correct
-the probability of a Type I Error is the significance level alpha
Type II Error
-if we fail to reject Ho when Ha is true, we have committed a Type II Error
-the data don't give convincing evidence for Ha when in reality Ha is correct
test statistic
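-a test statistic measures how far a sample statistic diverges from what we would expect if Ho were true, in standardized units
-test statistic = (statistic - parameter value under Ho)/(standard deviation of the statistic)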
power of a test
-the power of a significance test against a specific alternative is the probability that the test will reject Ho when the alternative is true
-measures the ability of a test to detect an alternative value of the parameter
-power = 1 - P(Type II Error)
--------------------------
How to Increase Power (Minimize P(Type II Error))
1. increase the alpha level
2. increase the sample size
3. decrease the population standard deviation
4. use a more extreme alternative value (farther away from the null value)
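The four ways to increase power can be checked numerically. The sketch below assumes the simplest setting, a one-sided z test of Ho: mu = mu0 against Ha: mu > mu0 with known sigma; the baseline numbers (mu0 = 100, alternative 103, sigma = 15, n = 25) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power(mu0, mu_alt, sigma, n, alpha=0.05):
    """Power of a one-sided z test (Ha: mu > mu0) against mu = mu_alt."""
    se = sigma / sqrt(n)
    # Reject Ho when x-bar exceeds this cutoff.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability of exceeding the cutoff when the true mean is mu_alt.
    return 1 - NormalDist(mu_alt, se).cdf(cutoff)

baseline = power(100, 103, 15, 25)
print(round(baseline, 3))
print(power(100, 103, 15, 25, alpha=0.10) > baseline)  # 1. larger alpha
print(power(100, 103, 15, 100) > baseline)             # 2. larger n
print(power(100, 103, 10, 25) > baseline)              # 3. smaller sigma
print(power(100, 106, 15, 25) > baseline)              # 4. farther alternative
```

Each comparison prints True, one for each item in the list above.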
standard error
-when the standard deviation of a statistic is estimated from data, the result is called the standard error of the statistic
t distribution
-the density curves of the t-distributions are similar in shape to the standard Normal curve
-they are symmetric about 0, single-peaked and bell shaped
---------------------
There is a different t-distribution for each sample size. We specify a particular t-distribution by giving its degrees of freedom. The spread of the t-distributions is a bit greater than that of the standard Normal distribution. The t-distributions have more probability in the tails and less in the center than does the standard Normal. As a result, a t density curve has a lower peak than the standard Normal curve.
----------------------
-if the SRS has size n, the t-statistic has the t-distribution with n-1 degrees of freedom
----------------------
As the degrees of freedom increase, the t density curve approaches the standard Normal density curve ever more closely. This happens because Sx estimates sigma more accurately as the sample size increases. So using Sx in place of sigma causes little extra variation when the sample size is large.
