STATS 121 FINAL
Terms in this set (241)
alternative hypothesis
A statement about the value of a parameter that is either "less than," "greater than," or "not equal to" a hypothesized number or another parameter; the hypothesis that the researcher usually wants to prove or verify
ANOVA (Analysis of Variance)
A procedure used to test equality of three or more means.
ANOVA (Analysis of Variance) variables
Explanatory categorical, Response quantitative
association vs. causation
We can only argue causation from association if the results having significant association are from an experiment
bar graph
a graphical representation of categorical data. Names of each category are listed on the x axis and a bar that has the frequency (or percentage) in that category is placed over each category name
bias (sampling)
A condition that occurs when the design of a study systematically favors certain outcomes
bivariate data
two measurements are made on each unit
block
A group of experimental units sharing some common characteristic. In a randomized block design, random allocation of treatments is carried out separately within each group.
boxplot
A plot of data that incorporates the maximum observation, the minimum observation, the first quartile, the second quartile (median), and the third quartile
categorical (or qualitative) variable
A variable that can be classified into groups or categories such as gender and religion
causation
Changes in the explanatory variable directly affect the response variable. Experiments are needed to verify causation.
census
The enumeration of every unit in a population
center
A summary number about which observations tend to cluster.
Measures of center
mean and median
center line
The middle line on a control chart. Its value is the target value of the mean when the process is in control.
Central Limit Theorem (CLT)
The name of the theorem stating that the sampling distribution of a statistic (e.g. x̅) is approximately normal whenever the sample is large and random
Chi-square distribution
The theoretical distribution that models the test statistic for doing Chi-square tests
Chi-square test statistic
A test statistic computed from data that has an approximate Chi-square distribution
claimed parameter value
the value of the parameter as given in the null hypothesis
completely randomized design (CRD)
an experimental design where all experimental units are assigned at random to treatments
comparison study
A study that compares only active treatments to determine which works best
conditions
The basic premises that must be checked before using a statistical procedure
conditional distribution
The distribution of one variable restricted to a single row (or column) of another variable in a two way table. A conditional distribution is found by dividing the values in the row (or column) by the row (or column) total
conditional percentage
In a contingency table, the percentage of a category in a row (or column) found by dividing the appropriate cell count by the row (or column) total.
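As a rough illustration of the conditional distribution and conditional percentage cards, the sketch below divides each cell in a row by its row total. The table, its labels, and the function name are made up for the example.

```python
# Hypothetical two-way table of counts (rows: group, columns: outcome).
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "control": {"improved": 15, "not improved": 35},
}

def conditional_distribution(row):
    """Divide each count in a row by the row total."""
    total = sum(row.values())
    return {category: count / total for category, count in row.items()}

# One conditional distribution per row of the table.
row_conditionals = {name: conditional_distribution(row) for name, row in table.items()}
print(row_conditionals)
```

Multiplying each proportion by 100 gives the conditional percentages.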
confidence interval
An estimate of the value of a parameter in interval form with an associated level of confidence; it gives a list of plausible values for the parameter based on the value of the statistic
confidence level
The percentage of all possible samples for which the confidence intervals will contain the parameter being estimated; selected subjectively by the researcher
confounding
A situation where the effect of one variable on the response variable cannot be separated from the effect of another variable on the response variable
control treatment
A treatment where no experimental condition is applied to the units in order to determine whether the active treatments affect the response. This enables the researcher to "control" for lurking variables.
control chart
A chart plotting the means (xbars) of regular samples of size n against time. It has a center line and upper and lower control limits to determine whether a process is in control or out of control.
control limits
lines on either side of the center line computed using μ-3σ/√n and μ+3σ/√n. A sample mean outside of these bounds signals that the process is out of control.
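A minimal sketch of the control limit formula above, assuming the process mean μ and standard deviation σ are known. The function name and the numbers are illustrative only.

```python
import math

def control_limits(mu, sigma, n):
    """Return (lower, upper) limits: mu - 3*sigma/sqrt(n) and mu + 3*sigma/sqrt(n)."""
    half_width = 3 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width

# Hypothetical process: target mean 50, sigma 2, samples of size 4.
lower, upper = control_limits(mu=50.0, sigma=2.0, n=4)
print(lower, upper)
```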
convenience sample
A sample type where the researcher contacts those subjects who are readily available and does not use any random selection. The results are almost always biased.
correlation coefficient
A measure of the strength of the linear relationship between two quantitative variables, symbolized with the letter r
data
information collected on individuals
degrees of freedom
a characteristic of the t-distribution (and other distributions like F and χ^2) indicating the amount of information available in the data.
density curve
a mathematical model used to describe the overall pattern of the distribution of a random variable
deviation
The difference (distance) between an observation and the mean of all the observations in a data set, or the difference between an observation and the corresponding regression model estimate
direction of relationship
A characteristic of data in a scatterplot that is identified as either a positive or negative association
distribution
A list of all possible values of a variable together with the frequency (or probability) of each value
dotplot
A one dimensional plot of a quantitative data set where each value in the data set is represented by a dot above its corresponding location on the x axis.
double blind study
An experiment where neither the subjects nor the diagnosticians (e.g. doctor or nurse) know which treatment is administered to whom
equal variance (or equal standard deviation)
Variances (or standard deviations) for each of the treatment groups (or samples) in ANOVA are all equal. In regression, the variances of the y's at each x are all assumed to be equal
estimate of a parameter
A single value or a range of values used to estimate a parameter
expected count
An estimate of how many observations should be in a cell of a two way table if there were no association between the row and column variables
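The card does not state how an expected count is computed. The sketch below assumes the usual formula, (row total × column total) / table total, and the usual chi-square statistic, the sum of (observed - expected)^2 / expected over all cells. The counts are hypothetical.

```python
# Hypothetical observed counts in a 2 x 2 table.
observed = [
    [30, 20],
    [15, 35],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count under no association: row total * column total / table total.
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

# Chi-square test statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(expected, chi_square)
```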
experiment
A study where treatments are deliberately imposed on the individuals in the study before data is gathered in order to observe their response to the treatment.
explained variation
The amount of total variation in the y's that is accounted for by a regression model; it is equal to ∑(ŷ-y bar)^2
extrapolation
Using a model to predict a y value for an x value that is outside the range of observed x's
extrapolation problem
Dangerous and strongly discouraged because the relationship between x and y may be different outside the range of observed x's
factor
a term synonymous with explanatory variable
fail to reject Ho
The appropriate statistical conclusion in hypothesis testing when the P-value is greater than α; equivalently, conclude that "There is not enough evidence to believe Ha"
failure
Any category that is not of primary interest in a categorical data set
F distribution
The distribution that models the ratio of two variance estimates; used in ANOVA for obtaining the P-value for testing equality of three or more means
five number summary
minimum, Q1, median, Q3, maximum; preferred numerical summary when data are very skewed or outliers are present
follow-up analysis
The analysis performed on data after an overall test on the equality of multiple means or which proportions differ from which
form of relationship
A description of data in a scatterplot indicating whether the data have a linear relationship, a curved relationship or no relationship
F test statistic
A test statistic that has an F distribution
histogram
A graphical display of a quantitative data set; data are grouped into intervals (usually of equal width) and a bar is drawn over each interval having height proportional to the frequency (or percentage) of values in the interval. Values of the variable are given on the x axis and frequencies (or percentages) are given on the y axis. Examined to determine shape, center, and spread.
in control
A process functioning within acceptable limits
independent samples
SRS's collected separately from each of two (or more) disjoint populations; matched pairs data are considered to be dependent samples
individual
Each object or unit described or examined in a data set
inference
Using results from a sample statistic value to draw conclusions about the population parameter.
influential point
An observation that substantially alters the fitted regression equation
interquartile range (IQR)
The difference between Q3 and Q1 (Q3-Q1); the length of the box in a boxplot; contains the middle 50% of the data.
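The quartile-based summaries in this set can be sketched as follows. Quartile conventions differ between textbooks and software; this example assumes the "median of each half" convention (excluding the overall median when n is odd), and the data are invented.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    values = sorted(data)
    half = len(values) // 2
    lower_half = values[:half]    # values below the median position
    upper_half = values[-half:]   # values above the median position
    return (
        values[0],
        statistics.median(lower_half),
        statistics.median(values),
        statistics.median(upper_half),
        values[-1],
    )

data = [2, 4, 5, 7, 8, 10, 12, 15, 40]  # hypothetical data with one large value
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1
print(minimum, q1, median, q3, maximum, iqr)
```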
interviewer bias
bias introduced into survey results by body language, voice intonation, gender, race, etc. of an interviewer
lack of realism
A weakness in experiments where the setting of the experiment does not realistically duplicate the conditions we really want to study.
law of large numbers
The fact that the average of observed values in a sample (xbar) will tend to get closer and closer to μ as the sample size increases.
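A small simulation can make the law of large numbers concrete. The sketch assumes a fair six-sided die (so μ = 3.5) and uses an arbitrary seed so the run is repeatable.

```python
import random

rng = random.Random(121)  # arbitrary seed, chosen only for repeatability

def mean_of_rolls(n):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

small_sample_mean = mean_of_rolls(10)
large_sample_mean = mean_of_rolls(100_000)
print(small_sample_mean, large_sample_mean)
```

The mean of the large sample should typically land much closer to 3.5 than the mean of the small one.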
least squares regression line
The line that minimizes the sum of squared residuals
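For a single explanatory variable, the least squares slope and intercept have closed forms: b = ∑(x - x bar)(y - y bar) / ∑(x - x bar)^2 and a = y bar - b·x bar. The sketch below applies them to made-up data; the function name is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line yhat = a + b*x that minimizes the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical responses
a, b = least_squares_line(xs, ys)
print(a, b)
```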
left skewed
A density curve where the left side of the distribution extends in a long tail (mean<median)
left-tailed alternative hypothesis
An alternative hypothesis that states the parameter value is less than some number or the parameter from another treatment or population (e.g. Ha: μ<85)
lower tailed alternative hypothesis
Another name for a left-tailed alternative hypothesis
lurking variable
A variable that the researcher is not necessarily interested in studying but which affects the relationship between the explanatory variable and the response variable
mall-intercept sample
A sample where respondents are contacted in a shopping mall or similar location. Often the method of selection is haphazard although occasionally systematic
margin of error (for 95% confidence)
The maximum amount that a statistic value will differ from the parameter value for the middle 95% of statistics (Note: changing the level of confidence changes the percentage of interest, e.g. 95%)
marginal distribution
The distribution of only one variable in a two way table using counts found by summing over the categories of the other variable
marginal percentage
The percentage for a row (or column) total in a two-way table found by dividing the row (or column) total by the table total
matched pairs
A design of experiment that combines matching of subject or measurements with randomization. Either two measurements taken on each unit (such as pre and post) OR measurements taken on two individuals matched by some characteristics different from the explanatory variable and the response variable.
matched pairs t (procedure for mean)
The hypothesis testing method for matched pairs data. The standard null hypothesis is H0: μd = 0 where μd is the mean difference between treatments.
maximum
The largest value in a data set
mean
A measure of center of the data; a value that "balances" the data; found by summing all the data and dividing by the number of data points.
measurement
A recorded fact about an individual; may be either numerical (quantitative) or qualitative (categorical)
measurement bias
Bias introduced into survey results because of poorly worded questions, interviewer effects, measuring instrument difficulties, etc.
measurement variation
Differences in repeated measurements on the same object.
median
a measure of center of data; a value that splits the data in half; the "middle" number after the data have been sorted.
minimum
The smallest value in a data set
multiple analyses (i.e. multiple comparisons)
Performing two or more tests of significance on the same data set. This inflates the overall α (probability of a type 1 error) for the tests (i.e., the more analyses performed, the greater the chance of falsely rejecting at least one true null hypothesis)
multi-stage sample
A type of sample from a population that has groups and sub-groups. First, some groups are randomly selected, and then some sub-groups from within the selected groups are randomly sampled. Finally, individuals are randomly selected from within the sampled sub-groups. This can be extended to sub-sub-groups, etc.
natural variation
Variation from object to object within a population.
non-probability sample
A sample selected without randomization; hence, the probability of obtaining a particular sample cannot be computed.
non-response bias
Bias introduced into survey results because individuals refuse to participate.
normal distribution
A bell-shaped, symmetric density curve that is often used as a model for data or other random variables; specified by μ and σ
normality of Y at each X
The distribution of all the Y values at each possible value of X is normal.
null hypothesis
The hypothesis that the researcher assumes to be true until sample results indicate otherwise; usually the hypothesis that the researcher wants to disprove.
observational study
A study that merely observes conditions of individuals in a population and records information; the population is disturbed as little as possible. (Note: treatments are not imposed on units.)
observed count
The count of individuals actually observed in a given cell of a two-way table.
observed effect
The difference between the observed value of the statistic and the hypothesized value of the corresponding parameter (e.g. x̅ - μ0).
observed statistic
The value of the statistic computed from the data
one sample t (procedure for mean)
An inferential procedure using the mean from one sample to test or estimate the population mean; the test statistic follows a t distribution; used when σ is unknown.
one sample z (procedure for proportion)
An inferential procedure using the proportion from one sample to test or estimate the population proportion; the approximate distribution of the test statistic is z or standard normal.
one-sided (or one-tailed) test
An alternative hypothesis where the researcher is interested in deviations in only one direction (< or > is in Ha)
out of control
A process no longer functioning within accepted limits
out of control signal
One sample mean outside the control limits (more than three standard deviations of x̅ from the center line) or nine sample means in a row above (or below) the center line in a control chart.
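The two out-of-control signals can be checked mechanically. This is only a sketch: the function name, arguments, and sample means are invented, and the control limits are assumed to have been computed already.

```python
def out_of_control(sample_means, center, lower, upper, run_length=9):
    """Return True if any mean is outside the limits or run_length means in a row fall on one side of center."""
    # Signal 1: a single sample mean outside the control limits.
    if any(m < lower or m > upper for m in sample_means):
        return True
    # Signal 2: a run of consecutive means on the same side of the center line.
    run, side = 0, 0
    for m in sample_means:
        current = 1 if m > center else -1 if m < center else 0
        if current != 0 and current == side:
            run += 1
        else:
            side = current
            run = 1 if current != 0 else 0
        if run >= run_length:
            return True
    return False

stable = [50.5, 49.5, 50.2, 49.8, 50.1, 49.9, 50.3, 49.7, 50.4]
spike = stable + [53.6]   # one mean above the upper limit
drift = [50.2] * 9        # nine means in a row above the center line

results = [out_of_control(s, center=50.0, lower=47.0, upper=53.0) for s in (stable, spike, drift)]
print(results)
```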
(overall) type 1 error rate
The probability of falsely rejecting at least one true null hypothesis when multiple tests are being performed
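To see how quickly the overall type 1 error rate grows, the sketch below assumes k independent tests, each run at level α with a true null hypothesis, so the overall rate is 1 - (1 - α)^k. Independence is an assumption of the example, not something the card states.

```python
def overall_type1_error_rate(alpha, k):
    """Chance of at least one false rejection across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

rates = {k: overall_type1_error_rate(0.05, k) for k in (1, 3, 10)}
print(rates)
```

With α = 0.05, ten such tests give roughly a 40% chance of at least one false rejection.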
outlier
an observation that falls outside the pattern of the data set. Outliers pull the mean toward themselves and often prevent us from using statistical procedures like the one sample t
parameter
A characteristic of a population that is usually unknown; this characteristic could be the mean (μ), median, proportion, standard deviation (σ) computed on all the data from the population. Does not have variability
placebo
A fake imitation treatment that resembles the real treatment in all respects except for the active ingredient
placebo effect
The response of patients to a treatment even though it has no active ingredient
"playing the game"
simulating a game or process to estimate probabilities of possible outcomes
pooled sample proportion
The value for p̂ when computing the two sample proportion z test statistic. To compute, add the number of successes in both samples and divide by the sum of the two samples.
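A short sketch of the pooled proportion described above, with hypothetical counts. The z statistic shown uses the standard two sample formula, (p̂1 - p̂2) / √[p̂(1-p̂)(1/n1 + 1/n2)], which this set does not spell out.

```python
import math

def pooled_proportion(x1, n1, x2, n2):
    """Total successes in both samples divided by the total sample size."""
    return (x1 + x2) / (n1 + n2)

def two_sample_z(x1, n1, x2, n2):
    """Two sample z test statistic for proportions using the pooled proportion."""
    p_pool = pooled_proportion(x1, n1, x2, n2)
    standard_error = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / standard_error

# Hypothetical data: 45 successes out of 100 versus 30 out of 100.
p_pool = pooled_proportion(45, 100, 30, 100)
z = two_sample_z(45, 100, 30, 100)
print(p_pool, z)
```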
population
The entire group of individuals of interest in a study
population distribution
The distribution of all the observations in a population
population mean (μ)
Mean of all the observations in a population
population proportion (p)
Proportion (or percentage) of all the observations in the population having a certain characteristic.
population standard deviation (σ)
The standard deviation of all observations in a population; a measure of the variability of all the population values about their mean.
positive association
Large values of one variable tend to occur with large values of another variable and small values of one variable tend to occur with small values of the other.
power (1-β)
The probability of making a correct decision by rejecting a false null hypothesis; increases when α increases, or when n increases
practical significance
A difference between the observed statistic and the claimed parameter value that is large enough to be worth reporting. To assess, look at the numerator of the test statistic and ask 'Is the difference important?' If yes, then results are also practically significant. Note: Do not assess this unless results are statistically significant.
predicted y (symbolized by ŷ)
Value for y at a specified x as predicted by the regression equation; computed by plugging the value for x into the equation and solving for y.
prediction
using a regression equation to estimate a value of the response variable for a given value of the explanatory variable
prediction interval
an interval estimate of plausible values for a single observation of Y at a specified value of X
probability
A measure of the proportion of times an outcome occurs in a very long series of repetitions, indicating the likelihood of the outcome
probability sample
A sample chosen using some type of random device. The probability of any specific sample can be computed and is greater than zero.
proportion
the fraction of successes in either a sample (p hat) or a population (p)
P-value
the probability of getting a test statistic as extreme as or more extreme than the value actually observed assuming Ho is true
Q1 (first quartile)
A location measure of the data that has approximately one-fourth or 25% of the data below it
Q3 (third quartile)
A location measure of the data that has approximately three fourths or 75% of the data below it
quantitative variable
A variable with numerical values such as height and weight. This type of data is required for both variables in regression analysis
quantitative bivariate data
The type of data required for regression analysis where two quantitative variables are measured on each individual.
quartile
one of the three values that divide the ordered data set into quarters
question wording bias
sample results that differ from the truth because of the wording of the question used to obtain the information
quota sample
A sample selected to fill quotas for different population characteristics like gender, race, age, etc.
r^2
The percentage of total variation in y, the response variable, that is accounted for by the regression of y on x (or is explained by the explanatory variable)
r x c table
A two-way table with r rows and c columns.
randomization
A method of assigning experimental units to treatment groups that eliminates bias and gives each unit the same probability of being assigned to any treatment group
randomized block design (RBD)
An experimental design where treatments are randomly allocated within each block.
random number table
A table consisting of the digits 0 through 9 in equal proportions such that the digit in any position in the table is independent of the digits in neighboring positions (i.e., there is no pattern in the order of the digits.)
random outcome
an individual outcome from a random phenomenon
range
the maximum observation minus the minimum observation
regression
the mathematical modeling of relationships between numerical variables
regression equation
A mathematical formula for a straight line that models a linear relationship between two quantitative variables. (yhat=a+bx)
reject H0
The appropriate statistical conclusion when the P-value is less than or equal to α; conclude that "There is enough evidence to believe Ha."
replication
having more than one individual per treatment in an experiment (Note: replication is NOT same as reproducibility of results or repetition of an experiment)
residual (y-y hat)
The difference between the actual y and the predicted y
residual plot
A diagnostic plot of the residuals versus the explanatory variable used to assess how well the regression line fits the data; complete scatter with a shoe box shape is good; curvature indicates that a non-linear model would better fit the data, and a megaphone pattern indicates the standard deviation of y is not the same for all values of x.
resistant measure
A summary number that is not affected by outliers. The median is a resistant measure of center.
respondent bias
Bias resulting from respondents lying when asked about illegal or unpopular behavior, forgetting or confusing past behavior, having no knowledge about the question content and not wanting to appear stupid, etc.
response variable
the variable that will be measured in an experiment; also called the dependent variable
right skewed distribution
A density curve where the right side of the distribution extends in a long tail; (mean > median)
right tailed alternative hypothesis
An alternative hypothesis that states the parameter value of a treatment or population is greater than some number or the parameter from another treatment or population
sample mean (x̅)
average of data in a sample
sample proportion (p hat)
Proportion (or percentage) of successes in a sample; the number of individuals in a sample with a certain characteristic, divided by the sample size.
sample size (n)
# of observations in a sample
sample standard deviation (s)
a measure of the variability in a sample about x̅
sample variance (s^2)
The average of the squared deviations of the observations in a sample about the mean, x̅
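One detail worth making explicit: for the sample variance, the "average" conventionally divides the sum of squared deviations by n - 1 rather than n. The sketch below computes it by hand on made-up data and compares it with Python's `statistics.variance`, which uses the same n - 1 divisor.

```python
import statistics

def sample_variance(data):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

data = [4, 8, 6, 5, 7]  # hypothetical sample
s_squared = sample_variance(data)
s = s_squared ** 0.5    # sample standard deviation
print(s_squared, s, statistics.variance(data))
```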
sampling variability
the variability of a statistic from one sample to the next; measured by the standard deviation of the statistic's sampling distribution
sampling distribution of p̂
A distribution of the sample proportion; a list of all the possible values for p̂ together with the frequency (or probability) of each value
sampling distribution of x̅
A distribution of the sample mean; a list of all the possible values for x̅ together with the frequency (or probability) of each value
scatterplot
A two dimensional plot used to examine direction, form and strength of the relationship between two quantitative variables
selection bias
bias introduced into sample results due to how the units were selected for sampling
shape
description of the overall pattern of a histogram using terms including symmetric, right skewed, left skewed, flat (uniform), bell-shaped, etc.
significance level (α)
probability of a type I error, i.e. probability of rejecting a true null hypothesis, the largest risk of rejecting a true null hypothesis that a researcher is willing to take
significant result
A test of significance that yields a P-value less than α; an observed effect that is larger than could reasonably be expected due to chance alone
simple random sample
A sample of size n selected from the population in such a way that each possible sample of size n has an equal chance of being selected.
simulation
using random numbers to imitate chance behavior
slope (β=parameter symbol, b=statistic symbol)
A measure of the average rate of change in the response variable for every one unit increase in the explanatory or independent variable
spread
A summary number representing the variability of the observations. Measures include range, interquartile range, and standard deviation.
standard deviation
a measure of the "average" or typical deviation of the observations about the mean; measures variability of data about the mean.
standard deviation of p̂ (or standard deviation of the sampling distribution of p̂)
a measure of the variability of the sampling distribution of p̂; equals √[p(1-p) ÷ n]
standard deviation of x̅ (or standard deviation of the sampling distribution of x̅)
a measure of the variability of the sampling distribution of x̅; the "average" amount that the statistic, x̅, deviates from its associated parameter, μ; equals σ/√n
standard error
An estimate of the standard deviation of the sampling distribution of a statistic. It is a measure of the "average" amount that a statistic deviates from its associated parameter. Note: The denominator of many test statistics is the standard error of the statistic corresponding to the parameter being tested.
standard error of p̂ (or estimated standard deviation of the sampling distribution of p̂)
an estimate of the standard deviation of the sampling distribution of p̂; equals √[p̂(1-p̂) ÷ n]
standard error of x̅ (or standard error of the mean)
an estimate of the standard deviation (variability) of the sampling distribution of x̅; estimates variability of all the x̅ 's about μ; equals s/√n
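The two standard error formulas above, √[p̂(1-p̂) ÷ n] and s/√n, can be sketched directly. Function names and data are illustrative.

```python
import math
import statistics

def standard_error_of_proportion(p_hat, n):
    """Estimate of the standard deviation of the sampling distribution of p-hat."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

def standard_error_of_mean(data):
    """Estimate of the standard deviation of the sampling distribution of x-bar: s / sqrt(n)."""
    return statistics.stdev(data) / math.sqrt(len(data))

se_p = standard_error_of_proportion(p_hat=0.4, n=100)   # hypothetical sample proportion
se_mean = standard_error_of_mean([4, 8, 6, 5, 7])       # hypothetical sample
print(se_p, se_mean)
```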
standardize
convert an x-value to its corresponding z-score by subtracting the mean and dividing by the standard deviation.
standardized value
the z-score obtained from standardizing an x-value.
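Standardizing is a one-line computation, z = (x - mean) / standard deviation. The numbers below are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_above = z_score(85, mean=70, sd=10)   # a value above the mean
z_below = z_score(62, mean=70, sd=10)   # a value below the mean
print(z_above, z_below)
```

A positive z-score means the value is above the mean, and a negative one means it is below.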
standard normal distribution
a normal distribution with a mean of zero and standard deviation of one. probabilities are given in a table for values of the standard normal variable.
statistic
a number computed from sample data (without any knowledge of the value of a parameter) used to estimate the value of the parameter
statistics
the study of data analysis-collecting data, organizing and summarizing data, and drawing conclusions from sample data to answer research questions in the presence of variation.
statistical inference
drawing conclusions about an unknown parameter from a statistic
statistical process control
a procedure used to check a process at regular intervals to detect problems and correct them before they become serious.
statistical significance
the difference between the observed statistic and the claimed parameter value as given in Ho is too large to be due to chance alone. To assess, ask "is P-value<α?" If yes, then results are statistically significant
stem plot (also called stem and leaf plot)
a graphical representation of a quantitative data set. leading values of each data point are presented as stems and second digits are given as leaves.
stratified sample
a sampling scheme where the population is divided into strata according to some characteristic and a simple random sample is selected from each stratum
strength of linear relationship
an assessment of how closely clustered points are about a straight-line in a scatterplot. Very little scatter signifies a strong relationship, lots of scatter signifies a weak relationship. Measure using r, the correlation coefficient.
subject
an individual or unit in a study, usually a person.
success
category of interest in a qualitative data set.
sum of squared errors (or sum of squared residuals)
The total (sum) of the squared residuals for a regression
symmetric distribution
a distribution with a density curve where the right half is a mirror image of the left half. (mean=median)
target value
the desired mean of a process that is in control
t distribution
A distribution specified by degrees of freedom used to model test statistics for the sample mean, differences between sample means, etc. where σ (or the σ's) is (are) unknown.
test of independence
a chi-square test on data collected from a single SRS with two categorical measurements on each individual. The null hypothesis states that there is no relationship between the two categorical variables.
test of significance (also test of hypothesis)
A statistical procedure for making decisions about parameter values based on probabilities of the associated statistic(s).
test statistic
a numerical value calculated from the sample information assuming Ho is true; used to obtain P-value.
theoretical regression model
the regression equation for the population estimated by the least squares regression line whose slope and y-intercept are found using sample data. symbolized by μy=α+βx
total variation of y
the sum of squared deviations of the y's about y bar computed as ∑ (y-y bar)^2
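For a least squares line, the total variation splits into explained variation plus unexplained variation (the sum of squared residuals), and r^2 is the explained share. The sketch below checks this numerically on invented data, fitting the line with the standard closed-form slope and intercept.

```python
xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical responses

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope b and intercept a.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
y_hats = [a + b * x for x in xs]

total_variation = sum((y - y_bar) ** 2 for y in ys)
explained_variation = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
unexplained_variation = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))

r_squared = explained_variation / total_variation
print(total_variation, explained_variation, unexplained_variation, r_squared)
```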
treatment
The condition or conditions applied to a subject or individual in an experiment; a placebo or no treatment is often considered a treatment. The collection of treatments is the explanatory variable.
t-test
Any test of significance where the test statistic can be modeled with the t-distribution; used when σ is unknown
two sample t procedure for means
a procedure for comparing the means of two populations using the means from two independent samples, one from each population
two sample z procedure for proportions
a procedure for comparing the proportions of two populations using the proportions from two independent samples, one from each population
two-sided alternative
an alternative hypothesis where the researcher is interested in deviations in both directions ("≠" is in Ha). Remember to always double the table probability when computing the P-value for a two-sided alternative.
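The doubling rule for a two-sided alternative can be sketched for a z test statistic. The example uses the standard normal distribution from Python's `statistics.NormalDist` in place of a printed table; the function names and the z value are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def one_tail_probability(z):
    """Area in the tail beyond |z| under the standard normal curve."""
    return 1 - standard_normal.cdf(abs(z))

def two_sided_p_value(z):
    """Double the one-tail probability for a two-sided alternative."""
    return 2 * one_tail_probability(z)

tail = one_tail_probability(1.96)
p_two_sided = two_sided_p_value(1.96)
print(tail, p_two_sided)
```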
two-way table
a table of counts summarizing information for bivariate categorical data where the rows give categories of one variable (usually the explanatory variable) and the columns give categories of the other variable
type I error
the error made when rejecting a true null hypothesis; believing Ha when Ho is true
type II error
failing to reject a false null hypothesis; the error of believing Ho when Ha is true
unbiased
A condition where the mean of all possible statistics equals the parameter that the statistic estimates
under-coverage bias
Bias that occurs in sample results because a segment of the population with a certain characteristic is not sampled.
unexplained variation
the variation of the y's about the regression equation; equals the sum of squared residuals
univariate data
one measurement on each unit
upper-tailed alternative hypothesis
another name for right-tailed alternative hypothesis
variable
any characteristic of an individual or object; it may take on any number of values either categorical or numerical.
variance
a measure of the average squared deviation of the data about the mean
voluntary response
a method of sample selection that consists of people choosing themselves by responding to a general appeal
x̅-chart (x bar chart)
A chart used to monitor a process to determine whether it is in control or out of control
y intercept
the y value where the regression line intercepts (crosses) the y axis
z-score
The number of std. deviations a value or observation is from the mean; a standardized x-value.
α
level of significance or probability of a type I error
α
true population y-intercept in a regression equation
β
Probability of a type II error (probability of failing to reject a false null hypothesis)
β
true population slope in a regression equation
a
estimated (sample) y-intercept in a regression equation
b
estimated (sample) slope in a regression equation
C
Level of confidence
μ
mean of a population distribution
μ
mean of the sampling distribution of x̅
μx1-μx2
the difference between the means of two populations
n
sample size
N (μ, σ)
Normal distribution with mean, μ, and standard deviation, σ
p
proportion (or percentage) of a population
p
mean of the sampling distribution of p̂
p̂
proportion (or percentage) of a sample
p1-p2
difference between the proportions of two populations
p̂1-p̂2
difference between two sample proportions
r
sample correlation coefficient
s^2
sample variance
s
sample standard deviation
s/√n
standard error of x̅; estimates standard deviation of the sampling distribution of x̅
∑
summation symbol
σ
standard deviation of a population or distribution
σ^2
variance of a population or distribution
σ/√n
standard deviation of the sampling distribution of x̅
x̅
sample mean
x̅1-x̅2
the difference between two sample means
X
explanatory variable in regression analysis
Y
response variable in regression analysis
y hat
predicted y