STATS 121 FINAL
A statement about the value of a parameter that is either "less than," "greater than," or "not equal to" a hypothesized number or another parameter; the hypothesis that the researcher usually wants to prove or verify
ANOVA (Analysis of Variance)
A procedure used to test equality of three or more means.
ANOVA (Analysis of Variance) variables
Explanatory categorical, Response quantitative
association vs. causation
We can argue causation from association only if the significant association comes from an experiment
a graphical representation of categorical data. Names of each category are listed on the x axis and a bar that has the frequency (or percentage) in that category is placed over each category name
A condition that occurs when the design of a study systematically favors certain outcomes
two measurements are made on each unit
A group of experimental units sharing some common characteristic. In a randomized block design, random allocation of treatments is carried out separately within each group.
A plot of data that incorporates the maximum observation, the minimum observation, the first quartile, the second quartile (median), and the third quartile
categorical (or qualitative) variable
A variable that can be classified into groups or categories such as gender and religion
Changes in the explanatory variable directly affect the response variable. Experiments are needed to verify causation.
The enumeration of every unit in a population
A summary number about which observations tend to cluster.
Measures of center
mean and median
The middle line on a control chart. Its value is the target value of the mean when the process is in control.
Central Limit Theorem (CLT)
The name of the theorem stating that the sampling distribution of a statistic (e.g. x̅) is approximately normal whenever the sample is large and random
The theoretical distribution that models the test statistic for doing Chi-square tests
Chi-square test statistic
A test statistic computed from data that has an approximate Chi-square distribution
claimed parameter value
the value of the parameter as given in the null hypothesis
completely randomized design (CRD)
an experimental design where all experimental units are assigned at random to treatments
A study that compares only active treatments to determine which works best
The basic premises that must be checked before using a statistical procedure
The distribution of one variable restricted to a single row (or column) of another variable in a two way table. A conditional distribution is found by dividing the values in the row (or column) by the row (or column) total
In a contingency table, the percentage of a category in a row (or column) found by dividing the appropriate cell count by the row (or column) total.
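As a small illustration of the row-wise calculation described above, the sketch below divides each cell count by its row total. The table labels and counts are made up for illustration and do not come from the course.

```python
# Hypothetical two-way table of counts: rows = explanatory categories,
# columns = response categories. The numbers are invented for illustration.
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "control": {"improved": 15, "not improved": 35},
}

def row_conditional_percentages(counts):
    """Divide each cell count by its row total and express it as a percentage."""
    result = {}
    for row_name, row in counts.items():
        row_total = sum(row.values())
        result[row_name] = {col: 100 * n / row_total for col, n in row.items()}
    return result

conditional = row_conditional_percentages(table)
print(conditional["treatment"]["improved"])  # 60.0
```

Each row of conditional percentages sums to 100, which is a quick check that the right total was used as the divisor.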
An estimate of the value of a parameter in interval form with an associated level of confidence; it gives a list of plausible values for the parameter based on the value of the statistic
The percentage of all possible samples for which the confidence intervals will contain the parameter being estimated; selected subjectively by the researcher
A situation where the effect of one variable on the response variable cannot be separated from the effect of another variable on the response variable
A treatment where no experimental condition is applied to the units in order to determine whether the active treatments affect the response. This enables the researcher to "control" for lurking variables.
A chart plotting the means (xbars) of regular samples of size n against time. It has a center line and upper and lower control limits to determine whether a process is in control or out of control.
lines on either side of the center line computed using μ-3σ/√n and μ+3σ/√n. A sample mean outside of these bounds signals that the process is out of control.
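A minimal sketch of the control-limit formulas above, assuming μ and σ are known for the in-control process. The function names and the process values (target 50, σ = 4, n = 16) are hypothetical.

```python
import math

def control_limits(mu, sigma, n):
    """Return (lower, upper) limits: mu - 3*sigma/sqrt(n) and mu + 3*sigma/sqrt(n)."""
    half_width = 3 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width

def outside_limits(sample_means, mu, sigma, n):
    """Return the sample means that fall outside the control limits."""
    lower, upper = control_limits(mu, sigma, n)
    return [m for m in sample_means if m < lower or m > upper]

# Hypothetical process: target mean 50, sigma 4, samples of size 16.
lower, upper = control_limits(50, 4, 16)   # 3 * 4 / 4 = 3, so limits are 47.0 and 53.0
flagged = outside_limits([49.2, 51.0, 53.6, 50.4], 50, 4, 16)
print(lower, upper, flagged)               # 47.0 53.0 [53.6]
```

This sketch only checks for a single sample mean outside the limits; it does not look for runs of sample means on one side of the center line.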
A sample type where the researcher contacts those subjects who are readily available and does not use any random selection. The results are almost always biased.
A measure of the strength of the linear relationship between two quantitative variables, symbolized with the letter r
information collected on individuals
degrees of freedom
a characteristic of the t-distribution (and other distributions like F and χ²) indicating the amount of information available in the data.
a mathematical model used to describe the overall pattern of the distribution of a random variable
The difference (distance) between an observation and the mean of all the observations in a data set, or the difference between an observation and the corresponding regression model estimate
direction of relationship
A characteristic of data in a scatterplot that is identified as either a positive or negative association
A list of all possible values of a variable together with the frequency (or probability) of each value
A one dimensional plot of a quantitative data set where each value in the data set is represented by a dot above its corresponding location on the x axis.
double blind study
An experiment where neither the subjects nor the diagnosticians (e.g. doctor or nurse) know which treatment is administered to whom
equal variance (or equal standard deviation)
Variances (or standard deviations) for each of the treatment groups (or samples) in ANOVA are all equal. In regression, the variances of the y's at each x are all assumed to be equal
estimate of a parameter
A single value or a range of values used to estimate a parameter
An estimate of how many observations should be in a cell of a two way table if there were no association between the row and column variables
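The definition above does not state a formula; the sketch below assumes the usual one, (row total × column total) / table total, and also computes the chi-square statistic as the sum of (observed − expected)² / expected. The counts are made up for illustration.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total * column total) / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    return [[r * c / table_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[30, 20],
            [15, 35]]   # hypothetical 2 x 2 table of counts
expected = expected_counts(observed)
chi_square = chi_square_statistic(observed)
print(expected)  # [[22.5, 27.5], [22.5, 27.5]]
```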
A study where treatments are deliberately imposed on the individuals in the study before data is gathered in order to observe their response to the treatment.
The amount of total variation in the y's that is accounted for by a regression model; it is equal to ∑(ŷ-y bar)^2
Using a model to predict a y value for an x value that is outside the range of observed x's
Dangerous and strongly discouraged because the relationship between x and y may be different outside the range of observed x's
a term synonymous with explanatory variable
fail to reject Ho
The appropriate statistical conclusion in hypothesis testing when the P-value is greater than α; equivalently, conclude that "There is not enough evidence to believe Ha"
Any category that is not of primary interest in a categorical data set
The distribution that models the ratio of two variance estimates; used in ANOVA for obtaining the P-value for testing equality of three or more means
five number summary
minimum, Q1, median, Q3, maximum; preferred numerical summary when data are very skewed or outliers are present
The analysis performed on data after an overall test on the equality of multiple means or which proportions differ from which
form of relationship
A description of data in a scatterplot indicating whether the data have a linear relationship, a curved relationship or no relationship
F test statistic
A test statistic that has an F distribution
A graphical display of a quantitative data set; data are grouped into intervals (usually of equal width) and a bar is drawn over each interval having height proportional to the frequency (or percentage) of values in the interval. Values of the variable are given on the x axis and frequencies (or percentages) are given on the y axis. Examined to determine shape, center, and spread.
A process functioning within acceptable limits
SRS's collected separately from each of two (or more) disjoint populations; matched pairs data are considered to be dependent samples
Each object or unit described or examined in a data set
Using results from a sample statistic value to draw conclusions about the population parameter.
An observation that substantially alters the fitted regression equation
interquartile range (IQR)
The difference between Q3 and Q1 (Q3-Q1); the length of the box in a boxplot; contains the middle 50% of the data.
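Quartile conventions differ between textbooks and software. The sketch below assumes one common classroom convention: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the overall median when n is odd. The data values are made up.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]                                   # lower half of the data
    upper = values[half + 1:] if n % 2 else values[half:]   # skip the median if n is odd
    return (values[0], statistics.median(lower), statistics.median(values),
            statistics.median(upper), values[-1])

summary = five_number_summary([7, 1, 12, 3, 8, 15, 4, 10, 6])
iqr = summary[3] - summary[1]   # Q3 - Q1
print(summary, iqr)             # (1, 3.5, 7, 11.0, 15) 7.5
```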
bias introduced into survey results by body language, voice intonation, gender, race, etc. of an interviewer
lack of realism
A weakness in experiments where the setting of the experiment does not realistically duplicate the conditions we really want to study.
law of large numbers
The fact that the average of observed values in a sample (xbar) will tend to get closer and closer to μ as the sample size increases.
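One way to see this behavior is a small simulation. The sketch below assumes a fair six-sided die (so μ = 3.5) and a fixed seed chosen only for reproducibility; both are illustrative choices, not part of the definition.

```python
import random

def mean_of_die_rolls(n, seed=121):
    """Average of n simulated fair die rolls (population mean mu = 3.5)."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n)) / n

# The sample mean should tend to settle near 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```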
least squares regression line
The line that minimizes the sum of squared residuals
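A minimal sketch of fitting the line ŷ = a + bx by least squares, assuming the standard closed-form solution b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The function name and the five data points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x that minimizes the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)                       # a is about 2.2, b is 0.6
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # actual y minus predicted y
sse = sum(e ** 2 for e in residuals)                    # sum of squared residuals, about 2.4
```

The residuals of a least squares line with an intercept always sum to zero (up to rounding), which makes a handy sanity check.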
A density curve where the left side of the distribution extends in a long tail (mean<median)
left-tailed alternative hypothesis
An alternative hypothesis that states the parameter value is less than some number or the parameter from another treatment or population (e.g. Ha: μ<85)
lower tailed alternative hypothesis
Another name for a left-tailed alternative hypothesis
A variable that the researcher is not necessarily interested in studying but which affects the relationship between the explanatory variable and the response variable
A sample where respondents are contacted in a shopping mall or similar location. Often the method of selection is haphazard although occasionally systematic
margin of error (for 95% confidence)
The maximum amount that a statistic value will differ from the parameter value for the middle 95% of statistics (Note: changing the level of confidence changes the percentage of interest, e.g. 95%)
The distribution of only one variable in a two way table using counts found by summing over the categories of the other variable
The percentage for a row (or column) total in a two-way table found by dividing the row (or column) total by the table total
An experimental design that combines matching of subjects or measurements with randomization. Either two measurements are taken on each unit (such as pre and post) OR measurements are taken on two individuals matched by some characteristic different from the explanatory variable and the response variable.
matched pairs t (procedure for mean)
The hypothesis testing method for matched pairs data. The standard null hypothesis is H0: μd = 0 where μd is the mean difference between treatments.
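As a sketch of the matched pairs calculation, the code below reduces each pair to a difference and then computes t = d̄ / (s_d / √n) for H0: μd = 0. The before/after values and the function name are invented for illustration.

```python
import math
import statistics

def matched_pairs_t(before, after):
    """Return (t, df) for H0: mu_d = 0, where d = after - before for each unit."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)            # sample SD of the differences
    return d_bar / (s_d / math.sqrt(n)), n - 1

# Hypothetical pre/post measurements on four units.
t_stat, df = matched_pairs_t(before=[10, 12, 9, 11], after=[12, 15, 10, 13])
print(round(t_stat, 3), df)
```

The P-value would then come from a t distribution with n − 1 degrees of freedom, where n is the number of pairs.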
The largest value in a data set
A measure of center of the data; a value that "balances" the data; found by summing all the data and dividing by the number of data points.
A recorded fact about an individual; may be either numerical (quantitative) or qualitative (categorical)
Bias introduced into survey results because of poorly worded questions, interviewer effects, measuring instrument difficulties, etc.
Differences in repeated measurements on the same object.
a measure of center of data; a value that splits the data in half; the "middle" number after the data have been sorted.
The smallest value in a data set
multiple analyses (i.e. multiple comparisons)
Performing two or more tests of significance on the same data set. This inflates the overall α (probability of a type I error) for the tests (i.e., the more analyses performed, the greater the chance of falsely rejecting at least one true null hypothesis)
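To see how quickly the overall error rate grows, the sketch below assumes the k tests are independent and that every null hypothesis is true, so the chance of at least one false rejection is 1 − (1 − α)^k. Tests run on the same data set are usually not independent, so treat this as a rough guide only.

```python
def overall_type_1_error(alpha, k):
    """P(at least one false rejection) across k independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** k

# With alpha = 0.05 the overall rate climbs well above 0.05 as k grows.
for k in (1, 5, 20):
    print(k, round(overall_type_1_error(0.05, k), 3))
```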
A type of sample from a population that has groups and sub-groups. First, some groups are randomly selected, and then some sub-groups from within the selected groups are randomly sampled. Finally, individuals are randomly selected from within the sampled sub-groups. This can be extended to sub-sub-groups, etc.
Variation from object to object within a population.
A sample selected without randomization; hence, the probability of obtaining a particular sample cannot be computed.
Bias introduced into survey results because individuals refuse to participate.
A bell-shaped, symmetric density curve that is often used as a model for data or other random variables; specified by μ and σ
normality of Y at each X
The distribution of all the Y values at each possible value of X is normal.
The hypothesis that the researcher assumes to be true until sample results indicate otherwise; usually the hypothesis that the researcher wants to disprove.
A study that merely observes conditions of individuals in a population and records information; the population is disturbed as little as possible. (Note: treatments are not imposed on units.)
The count of individuals actually observed in a given cell of a two-way table.
The difference between the observed value of the statistic and the hypothesized value of the corresponding parameter (e.g. x̅ - μ0).
The value of the statistic computed from the data
one sample t (procedure for mean)
An inferential procedure using the mean from one sample to test or estimate the population mean; the test statistic follows a t distribution; used when σ is unknown.
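A minimal sketch of the one sample t statistic, t = (x̅ − μ0) / (s / √n) with n − 1 degrees of freedom. The data values and μ0 are made up for illustration.

```python
import math
import statistics

def one_sample_t(data, mu0):
    """Return (t, df) where t = (x_bar - mu0) / (s / sqrt(n)) and df = n - 1."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)            # sample standard deviation (n - 1 divisor)
    standard_error = s / math.sqrt(n)
    return (x_bar - mu0) / standard_error, n - 1

t_stat, df = one_sample_t([12, 15, 11, 14, 13], mu0=12)
print(round(t_stat, 3), df)
```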
one sample z (procedure for proportion)
An inferential procedure using the proportion from one sample to test or estimate the population proportion; the approximate distribution of the test statistic is z or standard normal.
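A sketch of the test version of this procedure, assuming the usual statistic z = (p̂ − p0) / √[p0(1 − p0) / n] and a large enough sample for the normal approximation. The counts and p0 are hypothetical.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, p0):
    """z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n), using p0 from the null hypothesis."""
    p_hat = successes / n
    sd = math.sqrt(p0 * (1 - p0) / n)   # standard deviation of p-hat if H0 is true
    return (p_hat - p0) / sd

z = one_proportion_z(successes=60, n=100, p0=0.5)          # (0.6 - 0.5) / 0.05, about 2
p_value_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))     # doubles the upper-tail area
print(round(z, 3), round(p_value_two_sided, 4))
```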
one-sided (or one-tailed) test
An alternative hypothesis where the researcher is interested in deviations in only one direction (< or > is in Ha)
out of control
A process no longer functioning within accepted limits
out of control signal
One sample mean outside the control limits (more than three standard deviations of x̅ from the center line), or nine sample means in a row above (or below) the center line in a control chart.
(overall) type 1 error rate
The probability of falsely rejecting at least one true null hypothesis when multiple tests are being performed
an observation that falls outside the pattern of the data set. Outliers inflate the mean and often prevent us from using statistical procedures like the one sample t
A characteristic of a population that is usually unknown; this characteristic could be the mean (μ), median, proportion, or standard deviation (σ) computed on all the data from the population. Does not have variability
A fake imitation treatment that resembles the real treatment in all respects except for the active ingredient
The response of patients to a treatment even though it has no active ingredient
"playing the game"
simulating a game or process to estimate probabilities of possible outcomes
pooled sample proportion
The value for p̂ when computing the two sample proportion z test statistic. To compute, add the number of successes in both samples and divide by the sum of the two sample sizes.
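As a sketch, the pooled proportion and the two sample z statistic can be computed as below. The standard error formula √[p̂(1 − p̂)(1/n1 + 1/n2)] is the usual one for this test and is assumed here; the counts are made up.

```python
import math

def pooled_proportion(x1, n1, x2, n2):
    """Total successes divided by the total of the two sample sizes."""
    return (x1 + x2) / (n1 + n2)

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion in the standard error."""
    p_pool = pooled_proportion(x1, n1, x2, n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

p_pool = pooled_proportion(30, 50, 15, 50)   # 45 / 100 = 0.45
z = two_proportion_z(30, 50, 15, 50)
print(p_pool, round(z, 3))
```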
The entire group of individuals of interest in a study
The distribution of all the observations in a population
population mean (μ)
Mean of all the observations in a population
population proportion (p)
Proportion (or percentage) of all the observations in the population having a certain characteristic.
population standard deviation (σ)
The standard deviation of all observations in a population; a measure of the variability of all the population values about their mean.
Large values of one variable tend to occur with large values of another variable and small values of one variable tend to occur with small values of the other.
The probability of making a correct decision by rejecting a false null hypothesis; increases when α increases, or when n increases
A difference between the observed statistic and the claimed parameter value that is large enough to be worth reporting. To assess, look at the numerator of the test statistic and ask 'Is the difference important?' If yes, then results are also this. Note: Do not assess this unless results are statistically significant.
predicted y (symbolized by ŷ)
Value for y at a specified x as predicted by the regression equation; computed by plugging the value for x into the equation and solving for y.
using a regression equation to estimate a value of the response variable for a given value of the explanatory variable
an interval estimate of plausible values for a single observation of Y at a specified value of X
A measure of the proportion of times an outcome occurs in a very long series of repetitions, indicating the likelihood of the outcome
A sample chosen using some type of random device. The probability of any specific sample can be computed and is greater than zero.
the fraction of successes in either a sample (p hat) or a population (p)
the probability of getting a test statistic as extreme as or more extreme than the value actually observed assuming Ho is true
Q1 (first quartile)
A location measure of the data that has approximately one-fourth or 25% of the data below it
Q3 (third quartile)
A location measure of the data that has approximately three fourths or 75% of the data below it
A variable with numerical values such as height and weight. This type of data is required for both variables in regression analysis
quantitative bivariate data
The type of data required for regression analysis where two quantitative variables are measured on each individual.
one of the three values that divide the ordered data set into quarters
question wording bias
sample results that differ from the truth because of the wording of the question used to obtain the information
A sample selected to fill quotas for different population characteristics like gender, race, age, etc.
The percentage of total variation in y, the response variable, that is accounted for by the regression of y on x (or is explained by the explanatory variable)
r x c table
A two-way table with r rows and c columns.
A method of assigning experimental units to treatment groups that eliminates bias and gives each unit the same probability of being assigned to any treatment group
randomized block design (RBD)
An experimental design where treatments are randomly allocated within each block.
random number table
A table consisting of the digits 0 through 9 in equal proportions such that the digit in any position in the table is independent of the digits in neighboring positions (i.e., there is no pattern in the order of the digits.)
an individual outcome from a random phenomenon
the maximum observation minus the minimum observation
the mathematical modeling of relationships between numerical variables
A mathematical formula for a straight line that models a linear relationship between two quantitative variables. (yhat=a+bx)
The appropriate statistical conclusion when the P-value is less than or equal to α; conclude that "There is enough evidence to believe Ha."
having more than one individual per treatment in an experiment (Note: replication is NOT the same as reproducibility of results or repetition of an experiment)
residual (y-y hat)
The difference between the actual y and the predicted y
A diagnostic plot of the residuals versus the explanatory variable used to assess how well the regression line fits the data; complete scatter with a shoe box shape is good; curvature indicates that a non-linear model would better fit the data, and a megaphone pattern indicates the standard deviation of y is not the same for all values of x.
A summary number that is not affected by outliers. The median is a resistant measure of center.
Bias resulting from respondents lying when asked about illegal or unpopular behavior, forgetting or confusing past behavior, having no knowledge about the question content and not wanting to appear stupid, etc.
the variable that will be measured in an experiment; also called the dependent variable
right skewed distribution
A density curve where the right side of the distribution extends in a long tail; (mean > median)
right tailed alternative hypothesis
An alternative hypothesis that states the parameter value of a treatment or population is greater than some number or the parameter from another treatment or population
sample mean (x̅)
average of data in a sample
sample proportion (p hat)
Proportion (or percentage) of successes in a sample; the number of individuals in a sample with a certain characteristic, divided by the sample size.
sample size (n)
number of observations in a sample
sample standard deviation (s)
a measure of the variability in a sample about x̅
sample variance (s^2)
The average of the squared deviations of the observations in a sample about the mean, x̅
the variability of a statistic from one sample to the next; a measure of sampling variability
sampling distribution of p̂
A distribution of the sample proportion; a list of all the possible values for p̂ together with the frequency (or probability) of each value
sampling distribution of x̅
A distribution of the sample mean; a list of all the possible values for x̅ together with the frequency (or probability) of each value
A two dimensional plot used to examine direction, form and strength of the relationship between two quantitative variables
bias introduced into sample results due to how the units were selected for sampling
description of the overall pattern of a histogram using terms including symmetric, right skewed, left skewed, flat (uniform), bell-shaped, etc.
significance level (α)
probability of a type I error, i.e. the probability of rejecting a true null hypothesis; the largest risk of rejecting a true null hypothesis that a researcher is willing to take
A test of significance that yields a P-value less than α; an observed effect that is larger than could reasonably be expected due to chance alone
simple random sample
A sample of size n selected from the population in such a way that each possible sample of size n has an equal chance of being selected.
using random numbers to imitate chance behavior
slope (β=parameter symbol, b=statistic symbol)
A measure of the average rate of change in the response variable for every one unit increase in the explanatory or independent variable
A summary number representing the variability of the observations. Measures include range, interquartile range, and standard deviation.
a measure of the "average" or typical deviation of the observations about the mean; measures variability of data about the mean.
standard deviation of p̂ (or standard deviation of the sampling distribution of p̂)
a measure of the variability of the sampling distribution of p̂; equals √[p(1-p) ÷ n]
standard deviation of x̅ (or standard deviation of the sampling distribution of x̅)
a measure of the variability of the sampling distribution of x̅; the "average" amount that the statistic, x̅, deviates from its associated parameter, μ; equals σ/√n
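The σ/√n formula can be checked with a small simulation. The sketch below assumes a normal population with made-up values (μ = 100, σ = 10), samples of size 25, and a fixed seed for reproducibility; the function name is hypothetical.

```python
import math
import random
import statistics

def simulated_sd_of_sample_means(mu, sigma, n, repetitions=5000, seed=121):
    """Draw many samples of size n from N(mu, sigma) and return the SD of their means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

theoretical = 10 / math.sqrt(25)                        # sigma / sqrt(n) = 2.0
simulated = simulated_sd_of_sample_means(100, 10, 25)   # should land close to 2.0
print(theoretical, simulated)
```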
An estimate of the standard deviation of the sampling distribution of a statistic. It is a measure of the "average" amount that a statistic deviates from its associated parameter. Note: The denominator of many test statistics is the standard error of the statistic corresponding to the parameter being tested.
standard error of p̂ (or estimated standard deviation of the sampling distribution of p̂)
an estimate of the standard deviation of the sampling distribution of p̂; equals √[p̂(1-p̂) ÷ n]
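The two formulas for p̂ differ only in which proportion is plugged in: the standard deviation uses the population value p, while the standard error uses the sample value p̂. A minimal sketch with made-up inputs:

```python
import math

def sd_p_hat(p, n):
    """Standard deviation of the sampling distribution of p-hat: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

def se_p_hat(p_hat, n):
    """Standard error of p-hat: the same formula with p-hat in place of the unknown p."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

print(sd_p_hat(0.5, 100))   # about 0.05, uses the population proportion
print(se_p_hat(0.6, 100))   # about 0.049, uses the sample proportion
```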
standard error of x̅ (or standard error of the mean)
an estimate of the standard deviation (variability) of the sampling distribution of x̅; estimates variability of all the x̅ 's about μ; equals s/√n
convert an x-value to its corresponding z-score by subtracting the mean and dividing by the standard deviation.
the z-score obtained from standardizing an x-value.
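A short sketch of standardizing, using Python's built-in standard normal distribution in place of a printed table. The values (x = 130, mean 100, standard deviation 15) are made up.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Standardize x: subtract the mean and divide by the standard deviation."""
    return (x - mean) / sd

z = z_score(130, mean=100, sd=15)     # (130 - 100) / 15 = 2.0
prob_below = NormalDist().cdf(z)      # area to the left of z under the standard normal curve
print(z, round(prob_below, 4))
```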
standard normal distribution
a normal distribution with a mean of zero and standard deviation of one. probabilities are given in a table for values of the standard normal variable.
a number computed from sample data (without any knowledge of the value of a parameter) used to estimate the value of the parameter
the study of data analysis-collecting data, organizing and summarizing data, and drawing conclusions from sample data to answer research questions in the presence of variation.
drawing conclusions about an unknown parameter from a statistic
statistical process control
a procedure used to check a process at regular intervals to detect problems and correct them before they become serious.
the difference between the observed statistic and the claimed parameter value as given in Ho is too large to be due to chance alone. To assess, ask "is P-value<α?" If yes, then results are statistically significant
stem plot (also called stem and leaf plot)
a graphical representation of a quantitative data set. leading values of each data point are presented as stems and second digits are given as leaves.
a sampling scheme where the population is divided into strata according to some characteristic and a simple random sample is selected from each stratum
strength of linear relationship
an assessment of how closely clustered points are about a straight-line in a scatterplot. Very little scatter signifies a strong relationship, lots of scatter signifies a weak relationship. Measure using r, the correlation coefficient.
an individual or unit in a study, usually a person.
category of interest in a qualitative data set.
sum of squared errors (or sum of squared residuals)
The total (sum) of the squared residuals for a regression
a distribution with a density curve where the right half is a mirror image of the left half. (mean=median)
the desired mean of a process that is in control
A distribution specified by degrees of freedom used to model test statistics for the sample mean, differences between sample means, etc. where σ (or the σ's) is (are) unknown.
test of independence
a chi-square test on data collected from a single SRS with two categorical measurements on each individual. The null hypothesis states that there is no relationship between the two categorical variables.
test of significance (also test of hypothesis)
A statistical procedure for making decisions about parameter values based on probabilities of the associated statistic(s).
a numerical value calculated from the sample information assuming Ho is true; used to obtain P-value.
theoretical regression model
the regression equation for the population estimated by the least squares regression line whose slope and y-intercept are found using sample data. symbolized by μy=α+βx
total variation of y
the sum of squared deviations of the y's about y bar computed as ∑ (y-y bar)^2
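For a least squares line, the total variation splits into an explained part, ∑(ŷ − ȳ)², and an unexplained part, ∑(y − ŷ)², and r² is the explained share. The sketch below checks this on a made-up data set; the function name is hypothetical.

```python
def variation_decomposition(xs, ys):
    """Return (total, explained, unexplained) variation for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    y_hats = [a + b * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
    unexplained = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    return total, explained, unexplained

total, explained, unexplained = variation_decomposition([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_squared = explained / total   # about 0.6, so about 60% of the variation in y is explained
print(total, explained, unexplained, r_squared)
```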
The condition or conditions applied to a subject or individual in an experiment; a placebo or no treatment is often considered a treatment. The collection of treatments is the explanatory variable.
Any test of significance where the test statistic can be modeled with the t-distribution; used when σ is unknown
two sample t procedure for means
a procedure for comparing the means of two populations using the means from two independent samples, one from each population
two sample z procedure for proportions
a procedure for comparing the proportions of two populations using the proportions from two independent samples, one from each population
two-sided alternative
an alternative hypothesis where the researcher is interested in deviations in both directions ("≠" is in Ha).
Remember to always double the table probability when computing the P-value for a two-sided alternative
a table of counts summarizing information for bivariate categorical data where the rows give categories of one variable (usually the explanatory variable) and the columns give categories of the other variable
type I error
the error made when rejecting a true null hypothesis; believing Ha when Ho is true
type II error
failing to reject a false null hypothesis; the error of believing Ho when Ha is true
A condition where the mean of all possible statistics equals the parameter that the statistic estimates
Bias that occurs in sample results because a segment of the population with a certain characteristic is not sampled.
the variation of the y's about the regression equation; equals the sum of squared residuals
one measurement on each unit
upper-tailed alternative hypothesis
another name for right-tailed alternative hypothesis
any characteristic of an individual or object; it may take on any number of values either categorical or numerical.
a measure of the average squared deviation of the data about the mean
a method of sample selection that consists of people choosing themselves by responding to a general appeal
x̅-chart (x bar chart)
A chart used to monitor a process to determine whether it is in control or out of control
the y value where the regression line intercepts (crosses) the y axis
The number of standard deviations a value or observation is from the mean; a standardized x-value.
level of significance or probability of a type I error
true population y-intercept in a regression equation
Probability of a type II error (probability of failing to reject a false null hypothesis)
true population slope in a regression equation
estimated (sample) y-intercept in a regression equation
estimated (sample) slope in a regression equation
Level of confidence
mean of a population distribution
mean of the sampling distribution of x̅
the difference between the means of two populations
N (μ, σ)
Normal distribution with mean, μ, and standard deviation, σ
proportion (or percentage) of a population
mean of the sampling distribution of p̂
proportion (or percentage) of a sample
difference between the proportions of two populations
difference between two sample proportions
sample correlation coefficient
sample standard deviation
standard error of x̅; estimates standard deviation of the sampling distribution of x̅
standard deviation of a population or distribution
variance of a population or distribution
standard deviation of the sampling distribution of x̅
the difference between two sample means
explanatory variable in regression analysis
response variable in regression analysis