73 terms

Individuals

objects described by a set of data. May be people, but they also may be animals or things.

Variable

Any characteristic of an individual. Can take different values for different individuals.

categorical variable

places an individual into one of several groups or categories

quantitative variable

takes numerical values for which arithmetic operations such as adding and averaging make sense

distribution

tells us what values the variable takes and how often it takes these values

spread

gives the lowest and highest values in the data set

outliers

are there any values that stand out as unusual

center

what is the approximate middle or typical value of the data (only an estimate)

shape

does the graph show symmetry, or is it skewed in one direction

time plot

plots each observation of a variable against the time at which it was measured

Mean

add the values of the observations and divide by the number of observations

Median

midpoint of the distribution
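
The two measures of center above can be sketched in a few lines of Python. The data values are made up for illustration, and the helper names mean and median are just labels chosen here.

```python
def mean(values):
    """Add the observations and divide by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Midpoint of the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle ones

data = [2, 4, 4, 5, 7, 9, 40]   # made-up values; 40 is a deliberately extreme observation
print(mean(data))     # 71 / 7, roughly 10.14
print(median(data))   # 5
```

The extreme value 40 pulls the mean well above the median, which is why the median is often preferred for skewed data.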

five number summary

consists of smallest observation, first quartile, the median, third quartile, and largest observation, written in order from smallest to largest

Quartiles

To calculate, arrange the observations in increasing order and locate the median in the ordered list. Q1 is the median of the observations below the overall median; Q3 is the median of the observations above the overall median

IQR

distance between first and third quartiles (Q3-Q1)

Outliers equation

an observation is flagged as an outlier if it is less than Q1 - 1.5 x IQR or greater than Q3 + 1.5 x IQR
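
As a rough sketch, the five number summary, the IQR, and the 1.5 x IQR rule can be computed as follows. This assumes the textbook quartile method (the median of each half, leaving out the overall median when the number of observations is odd); software packages often use slightly different quartile rules. The data and function names are illustrative only.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]         # observations below the median position
    upper = ordered[(n + 1) // 2:]    # observations above the median position
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def outliers(values):
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [2, 4, 4, 5, 7, 9, 40]          # made-up values
print(five_number_summary(data))       # (2, 4, 5, 9, 40)
print(outliers(data))                  # [40], since 40 > 9 + 1.5 * 5 = 16.5
```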

boxplot

graph of five number summary, with outliers plotted individually

Standard Deviation

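square root of the variance, where the variance is the average of the squared deviations of the observations from their mean (dividing by n - 1 for a sample). Measures spread about the mean.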
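
A minimal sketch of the calculation, assuming the sample version that divides by n - 1. The data values and function names are made up for illustration.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, squared deviations sum to 32
print(sample_variance(data))      # 32 / 7, roughly 4.57
print(sample_std_dev(data))       # roughly 2.14
```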

Density Curve

curve that is always on or above the horizontal axis and has area exactly 1 underneath it. Describes the overall pattern of a distribution.

normal distribution

mound-shaped and symmetric, based on continuous variable, adheres to 68-95-99.7 rule

z score

standardized value. z = (observed value - mean) / standard deviation. Tells how many standard deviations an observation lies above or below the mean.

standard normal distribution

the normal distribution N(0,1) with a mean 0 and standard deviation 1

Inverse normal calculations

working backwards from area, we find z, then x. Value of z is found in table A in reverse.
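
The forward and inverse calculations can be sketched with Python's standard statistics.NormalDist in place of Table A. The mean of 100 and standard deviation of 15 are hypothetical values picked for the example.

```python
from statistics import NormalDist

mu, sigma = 100, 15            # hypothetical mean and standard deviation
standard_normal = NormalDist(0, 1)

# Forward: standardize an observation, then find the area to its left.
x = 130
z = (x - mu) / sigma                    # 2.0
area_below = standard_normal.cdf(z)     # roughly 0.9772

# Inverse: start from an area, find z, then unstandardize to get x.
z_90 = standard_normal.inv_cdf(0.90)    # roughly 1.28
x_90 = mu + z_90 * sigma                # roughly 119.2

print(z, area_below, z_90, x_90)
```

Unstandardizing uses the z score formula solved for x: x = mean + z x standard deviation.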

response variable

measures an outcome of a study. Sometimes referred to as the dependent variable

explanatory variable

attempts to explain the observed outcomes. Sometimes referred to as the independent variable

scatterplot

shows the relationship between two quantitative variables measured on the same individuals.

form

linear or curved

direction

positive, negative, or neither

strength

weak, moderate, strong

positive association

when one variable increases, the other tends to increase

negative association

when one variable increases, the other tends to decrease

Correlation

measures strength and direction of the relationship between two quantitative variables. Usually represented by 'r'

regression line

straight line that describes how a response variable y changes as an explanatory variable x changes. Often used to predict values of y for given values of x

least squares regression line

line that makes the sum of squares of the vertical distances from the data points to the line as small as possible

r-squared

square of the correlation coefficient. Represents the fraction of the variation in the y-values that is explained by the least squares regression of y on x.

residual

the difference between an observed value of y and the value predicted by the regression line
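
A small sketch that ties together the correlation r, the least squares regression line, r-squared, and residuals. The five data points are made up, and least_squares is a helper name chosen for this example, not a library function.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least squares line of y on x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar      # the line passes through (xbar, ybar)
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

xs = [1, 2, 3, 4, 5]   # made-up explanatory values
ys = [2, 4, 5, 4, 5]   # made-up response values
slope, intercept, r = least_squares(xs, ys)

# residual = observed y - predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(slope, intercept)   # roughly 0.6 and 2.2
print(r ** 2)             # roughly 0.6
print(residuals)          # roughly [-0.8, 0.6, 1.0, -0.6, -0.2]
```

The residuals from a least squares line always sum to zero (up to rounding), which is a quick check on the arithmetic.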

residual plot

scatterplot of each x-value and its residual value. Used to determine whether a linear equation is a good model for a set of data. If it shows random scatter, then a line is a GOOD model for the data. If it shows a pattern, then a line is NOT a good model for the data

influential point

a point whose removal has a large effect on the correlation and/or the regression line

population

in a statistical study, the entire group of individuals we want information about

census

collects data from every individual in the population

sample

subset of individuals in the population from which we actually collect data

bias

if the design of a study consistently underestimates or overestimates the value you want to know

convenience sample

chooses individuals who are easiest to reach.

voluntary response sample

consists of people who choose themselves by responding to a general invitation. Shows bias because people with strong opinions are most likely to respond

simple random sample

of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected

stratified random sample

start by classifying the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the sample

cluster sample

start by classifying the population into groups of individuals that are located near each other, called clusters. Then choose an SRS of the clusters. All individuals in the chosen clusters are included in the sample
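
The three random sampling methods can be compared in a short sketch. The 12-person population, its group labels, and the function names are hypothetical; the same label is reused as both stratum and cluster only to keep the example short.

```python
import random

rng = random.Random(0)   # fixed seed so the sketch is repeatable

# Hypothetical population: 12 individuals, 3 in each of groups A-D.
population = [(f"person{i}", group) for i, group in enumerate(["A", "B", "C", "D"] * 3)]

def group_by(individuals):
    groups = {}
    for ind in individuals:
        groups.setdefault(ind[1], []).append(ind)
    return groups

def simple_random_sample(individuals, n):
    # Every set of n individuals is equally likely to be chosen.
    return rng.sample(individuals, n)

def stratified_sample(individuals, n_per_stratum):
    # A separate SRS inside each stratum, then combine.
    sample = []
    for members in group_by(individuals).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(individuals, n_clusters):
    # An SRS of whole clusters; everyone in a chosen cluster is included.
    clusters = group_by(individuals)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [ind for label in chosen for ind in clusters[label]]

srs = simple_random_sample(population, 4)
stratified = stratified_sample(population, 1)
clustered = cluster_sample(population, 2)
print(srs, stratified, clustered, sep="\n")
```

The stratified sample always contains someone from every group, while the cluster sample contains everyone from only some of the groups.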

undercoverage

occurs when some members of the population cannot be chosen in a sample

nonresponse

occurs when an individual chosen for the sample can't be contacted or refuses to participate

wording of questions

most important influence on the answers given to a survey

response bias

systematic pattern of incorrect responses in a sample survey

observational study

observes individuals and measures variable of interest but does not attempt to influence the responses

experiment

deliberately imposes some treatment on individuals to measure their responses. only source of fully convincing data to understand cause and effect

confounding

occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other.

treatment

specific condition applied to individuals in an experiment

experimental units

smallest collection of individuals to which treatments are applied

comparison

use a design that compares two or more treatments

random assignment

use chance to assign experimental units to treatments. Doing so helps create roughly equivalent groups of experimental units by balancing the effects of other variables among the treatment groups

control

keep other variables that might affect the response the same for all groups.

replication

use enough experimental units in each group so that any differences in the effects of the treatments can be distinguished from chance differences between the groups

statistically significant

an observed effect so large that it would rarely occur by chance

completely randomized design

treatments are assigned to all experimental units completely by chance

control group

receives an inactive treatment or existing baseline treatment

placebo effect

response to dummy treatment

double blind experiment

neither the subjects nor those who interact with them and measure the response variable know which treatment a subject received

block

group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to treatments

randomized block design

the random assignment of experimental units to treatments is carried out separately in each block
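
A rough sketch of the difference between a completely randomized design and a randomized block design. The unit labels, the age blocks, and the two treatment names are hypothetical.

```python
import random

rng = random.Random(1)   # fixed seed so the sketch is repeatable

def completely_randomized(units, treatments):
    """Shuffle the units, then deal them out to the treatments in turn."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

def randomized_block(blocks, treatments):
    """Carry out the random assignment separately inside each block."""
    assignment = {}
    for block_units in blocks.values():
        assignment.update(completely_randomized(block_units, treatments))
    return assignment

# Hypothetical blocks: units known in advance to be similar in age.
blocks = {"younger": ["A", "B", "C", "D"], "older": ["E", "F", "G", "H"]}
plan = randomized_block(blocks, ["drug", "placebo"])
print(plan)
```

Because assignment happens within each block, every block ends up with two units on each treatment, so age cannot be unevenly spread across the treatment groups.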

matched pairs design

randomized block experiment in which each block consists of a matching pair of similar experimental units. Chance is used to determine which unit in each pair gets which treatment

parameter

is a number that describes some characteristic of the population

statistic

a number that describes some characteristic of a sample

sampling distribution

distribution of all values taken by a statistic in all possible samples of the same size from the same population

unbiased estimator

a statistic whose sampling distribution has a mean equal to the parameter being estimated
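
For a tiny population, the sampling distribution can be listed in full, which makes the idea of an unbiased estimator concrete. The five-value population is made up, and samples are taken without replacement, as in an SRS.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]        # made-up population; mean is 6, maximum is 10

# All possible samples of size 2 from this population.
samples = list(combinations(population, 2))

# Sampling distribution of two different statistics.
sample_means = [mean(s) for s in samples]
sample_maxes = [max(s) for s in samples]

print(len(samples))          # 10
print(mean(sample_means))    # 6: equals the population mean, so unbiased
print(mean(sample_maxes))    # 8: below the population maximum of 10, so biased
```

The sample mean is an unbiased estimator of the population mean, while the sample maximum consistently underestimates the population maximum.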

point estimator

a statistic that provides an estimate of a population parameter