AP Statistics Essential Terminology



Simple Random Sample
is a subset of a statistical population chosen so that every member of the population has an equal probability of being selected and every possible sample of the same size is equally likely. An example of a simple random sample would be the names of 25 employees being drawn out of a hat from a company of 250 employees.
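As a minimal sketch of the employee example, Python's standard `random.sample` draws without replacement, so every subset of the requested size is equally likely. The function name `simple_random_sample`, the `employee_i` labels, and the seed are illustrative assumptions, not part of the original card.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical company of 250 employees, 25 names drawn "out of a hat".
employees = [f"employee_{i}" for i in range(1, 251)]
chosen = simple_random_sample(employees, 25, seed=1)
```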
Systematic Random Sample
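is a random sampling technique in which the researcher picks a random starting point in an ordered list of the population and then selects every k-th member after it. It is frequently chosen by researchers for its simplicity and its periodic quality.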
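One common way to carry out a systematic sample is to choose a random start among the first k members of an ordered list and then take every k-th member. The sketch below assumes the population is already in a list; the name `systematic_sample` and the numbers are hypothetical.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start in the first k positions, then every k-th member."""
    rng = random.Random(seed)
    start = rng.randrange(k)   # random index from 0 to k - 1
    return population[start::k]

# Hypothetical ordered list of 100 ID numbers, sampling every 10th one.
ids = list(range(100))
picked = systematic_sample(ids, 10, seed=0)
```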
Stratified Random Sample
is a method of sampling that involves the division of a population into smaller groups known as strata. In stratified random sampling, the strata are formed based on members' shared attributes or characteristics. ... These subsets of the strata are then pooled to form a random sample.
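The stratified procedure can be sketched as: group the members by a shared attribute, draw a simple random sample inside each group, then pool the draws. The function name, the `stratum_of` callback, and the student data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(members, stratum_of, per_stratum, seed=None):
    """Take a simple random sample of per_stratum members from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in members:
        strata[stratum_of(member)].append(member)  # group by shared attribute
    sample = []
    for key in sorted(strata):                     # fixed order for reproducibility
        sample.extend(rng.sample(strata[key], per_stratum))
    return sample

# Hypothetical population: 40 students labeled junior or senior.
students = [(f"student_{i}", "junior" if i % 2 else "senior") for i in range(1, 41)]
pooled = stratified_sample(students, stratum_of=lambda s: s[1], per_stratum=5, seed=3)
```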
Cluster Sample
refers to a sampling method that has the following properties. The population is divided into N groups, called clusters. The researcher randomly selects n clusters to include in the sample.
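Bias
prejudice in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair. In sampling, bias is a systematic tendency of a sampling or measurement method to overestimate or underestimate the value being studied.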
Response bias
(also called survey bias) is the tendency of a person to answer questions on a survey untruthfully or misleadingly. For example, they may feel pressure to give answers that are socially acceptable.
Nonresponse bias
is the bias that results when respondents differ in meaningful ways from nonrespondents. Nonresponse is often a problem with mail surveys, where the response rate can be very low.
Voluntary response bias
occurs when sample members are self-selected volunteers, as in voluntary samples. An example would be call-in radio shows that solicit audience participation in surveys on controversial topics (abortion, affirmative action, gun control, etc.).
Undercoverage
In survey sampling, undercoverage is a type of selection bias. It occurs when some members of the population are inadequately represented in the sample.
convenience sample
is one of the main types of non-probability sampling methods. A convenience sample is made up of people who are easy to reach.
experimental unit
is the physical entity which can be assigned, at random, to a treatment. Commonly it is an individual animal. The experimental unit is also the unit of statistical analysis. However, any two experimental units must be capable of receiving different treatments.
factor
a circumstance, fact, or influence that contributes to a result or outcome.
In an experiment, the factor (also called an independent variable) is an explanatory variable manipulated by the experimenter. Each factor has two or more levels, i.e., different values of the factor. Combinations of factor levels are called treatments.
response variable
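is the variable about which a researcher is asking a specific question: the outcome that is measured to see how it responds to changes in the explanatory variable (also called the dependent variable).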
extraneous variable
are any variables that you are not intentionally studying in your experiment or test. When you run an experiment, you're looking to see if one variable (the independent variable) has an effect on another variable (the dependent variable). ... These undesirable variables are called extraneous variables.
confounding variable
is a variable in a quantitative research study that explains some or all of the correlation between the dependent variable and an independent variable.
randomization
refers to the practice of using chance methods (random number tables, flipping a coin, etc.) to assign subjects to treatments. In this way, the potential effects of lurking variables are distributed at chance levels (hopefully roughly evenly) across treatment conditions.
replication
the action of copying or reproducing something. In an experiment, replication means applying each treatment to enough experimental units that chance variation can be distinguished from treatment effects.
control group
a group or individual used as a standard of comparison for checking the results of a survey or experiment.
blinding
refers to the practice of keeping patients in the dark as to whether they are receiving a placebo or not. It can also refer to allocation concealment, which is used to avoid selection bias.
outlier
is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.
range
The difference between the lowest and highest values. In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9, so the range is 9 − 3 = 6. Range can also mean all the output values of a function.
interquartile range
also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1.
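The 1.5 × IQR outlier rule can be sketched with the quartile convention often taught in introductory courses, where Q1 and Q3 are the medians of the lower and upper halves of the ordered data. Other quartile conventions exist and give slightly different values; the function names and the data set below are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd count, leave the middle value out of both halves.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(upper)

def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [3, 4, 6, 7, 9, 30]   # hypothetical data with one extreme value
q1, q3 = quartiles(data)
flagged = outliers(data)
```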
skewness
skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined. The qualitative interpretation of the skew is complicated and unintuitive.
symmetric
made up of exactly similar parts facing each other or around an axis; showing symmetry.
mean
add up the values in the data set and then divide by the number of values that you added.
median
the middle value of an ordered data set. To find it, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values.
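As a quick check of the mean and median rules, the sketch below uses Python's standard `statistics` module on the set {4, 6, 9, 3, 7} from the range example, plus a hypothetical even-sized set.

```python
from statistics import mean, median

odd_data = [4, 6, 9, 3, 7]    # five values, so the median is the middle one
even_data = [3, 4, 6, 7]      # four values, so the median averages the middle two

odd_mean = mean(odd_data)       # (4 + 6 + 9 + 3 + 7) / 5
odd_median = median(odd_data)   # middle of 3, 4, 6, 7, 9
even_median = median(even_data) # average of 4 and 6
```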
standard deviation
a quantity calculated to indicate the extent of deviation for a group as a whole; it is the square root of the variance and measures the typical distance of the values from their mean.
variance
is the expectation of the squared deviation of a random variable from its mean, and it informally measures how far a set of (random) numbers are spread out from their mean.
z-score
indicates how many standard deviations an element is from the mean. A z-score can be calculated from the following formula: z = (X - μ) / σ, where z is the z-score, X is the value of the element, μ is the population mean, and σ is the population standard deviation.
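A short worked example, treating a small hypothetical data set as the whole population, shows how variance, standard deviation, and the z-score formula z = (X - μ) / σ fit together. The data values and the helper name `z_score` are assumptions.

```python
from statistics import fmean, pvariance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical population

mu = fmean(data)            # population mean, 5 for this data
variance = pvariance(data)  # mean squared deviation from mu, 4 for this data
sigma = pstdev(data)        # square root of the variance, 2 for this data

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

z_nine = z_score(9, mu, sigma)   # 9 is two standard deviations above the mean
```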
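percentile
each of the 100 equal groups into which a population can be divided according to the distribution of values of a particular variable. In practice, the p-th percentile is the value at or below which p percent of the observations fall.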
complement
The complement of an event is the event not occurring. Thus, the complement of Event A is Event A not occurring. The probability that Event A will not occur is denoted by P(A').
independent events
When two events are said to be independent of each other, what this means is that the probability that one event occurs in no way affects the probability of the other event occurring. An example of two independent events: you roll a die and flip a coin; the outcome of the roll does not affect the outcome of the flip.
Mutually exclusive
In logic and probability theory, two propositions (or events) are mutually exclusive or disjoint if they cannot both be true (occur). A clear example is the set of outcomes of a single coin toss, which can result in either heads or tails, but not both.
expected value
a predicted value of a variable, calculated as the sum of all possible values each multiplied by the probability of its occurrence.
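The sum-of-value-times-probability rule can be sketched for a fair six-sided die. Exact fractions avoid rounding; the `expected_value` helper and the die example are illustrative assumptions.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of each possible value multiplied by its probability."""
    return sum(value * prob for value, prob in distribution.items())

# Fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expected_value(die)   # (1 + 2 + 3 + 4 + 5 + 6) / 6
```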
binomial distribution
The binomial distribution gives the discrete probability distribution P_p(n|N) of obtaining exactly n successes out of N Bernoulli trials (where the result of each Bernoulli trial is true with probability p and false with probability q=1-p). The binomial distribution is therefore given by P_p(n|N) = C(N, n) p^n q^(N-n), where C(N, n) = N! / (n! (N-n)!) is the binomial coefficient.
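The standard binomial formula, P(exactly n successes in N trials) = C(N, n) p^n (1-p)^(N-n), can be evaluated directly with `math.comb`. The function name `binomial_pmf` and the coin-flip numbers are assumptions for illustration.

```python
from math import comb

def binomial_pmf(n, N, p):
    """Probability of exactly n successes in N independent trials."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

# Hypothetical example: exactly 2 heads in 4 flips of a fair coin.
two_heads = binomial_pmf(2, 4, 0.5)

# The probabilities over n = 0..N should sum to 1.
total = sum(binomial_pmf(n, 4, 0.5) for n in range(5))
```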
geometric distribution
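The geometric distribution is a discrete distribution for n = 0, 1, 2, ... having probability mass function P(n) = p q^n, where p is the probability of success on each trial, q = 1 - p, and n is the number of failures before the first success.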
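Under the convention that n = 0, 1, 2, ... counts failures before the first success, the geometric probability is p(1-p)^n. Many courses instead count the trial on which the first success occurs (n = 1, 2, ...), which shifts the exponent to n - 1. The sketch below uses the failures convention; the function name is hypothetical.

```python
def geometric_pmf(n, p):
    """Probability of exactly n failures before the first success."""
    return p * (1 - p)**n

# Hypothetical example: two tails before the first head with a fair coin.
two_failures = geometric_pmf(2, 0.5)

# Summing over many values of n approaches 1.
partial_total = sum(geometric_pmf(n, 0.5) for n in range(200))
```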
normal distribution
a function that represents the distribution of many random variables as a symmetrical bell-shaped graph.
discrete data
is information that can be categorized into a classification. Discrete data is based on counts. Only a finite number of values is possible, and the values cannot be subdivided meaningfully. For example, the number of parts damaged in shipment. ... It is typically things counted in whole numbers.
continuous data
is information that can be measured on a continuum or scale.
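statistic
a fact or piece of data from a study of a large quantity of numerical data; in inference, a numerical characteristic of a sample, used to estimate a population parameter.
parameter
a numerical characteristic of a population, as distinct from a statistic of a sample.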
sampling distribution
is a probability distribution of a statistic obtained through a large number of samples drawn from a specific population. It describes how the values of the statistic vary from sample to sample across all possible samples of the same size from that population.
standard error
a measure of the statistical accuracy of an estimate, equal to the standard deviation of the theoretical distribution of a large population of such estimates.
margin of error
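an amount (usually small) that is allowed for in case of miscalculation or change of circumstances. In statistics, it is the half-width of a confidence interval: the critical value multiplied by the standard error of the estimate.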
confidence level
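the long-run proportion of confidence intervals, built by the same method from repeated random samples, that would contain the true value of the parameter. A 95% confidence level does not mean there is a 95% probability that the parameter lies in any one computed interval.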
critical value (z*)
A critical value of z is sometimes written as z_α, where the alpha level, α, is the area in the tail. For example, z_0.10 = 1.28. A critical value of z (Z-score) is used when the sampling distribution is normal, or close to normal.
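Critical values of z can be looked up with the inverse normal CDF in Python's `statistics.NormalDist`. The sketch below reproduces the upper-tail value for α = 0.10 and, as an assumed illustration, uses z* to get the margin of error for a sample proportion (z* times the standard error). All function names and the poll numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z(alpha):
    """z value with area alpha in the upper tail of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha)

def z_star(confidence_level):
    """Critical value that leaves (1 - confidence_level) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

def proportion_margin_of_error(p_hat, n, confidence_level):
    """z* multiplied by the standard error of a sample proportion."""
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return z_star(confidence_level) * standard_error

z_tail = upper_tail_z(0.10)                          # about 1.28
z_95 = z_star(0.95)                                  # about 1.96
moe = proportion_margin_of_error(0.5, 100, 0.95)     # hypothetical poll of 100
```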
null hypothesis
(in a statistical test) the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error.
alternative hypothesis
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis. It is usually taken to be that the observations are the result of a real effect (with some amount of chance variation superposed).
test statistic
A test statistic is a standardized value that is calculated from sample data during a hypothesis test. You can use test statistics to determine whether to reject the null hypothesis. The test statistic compares your data with what is expected under the null hypothesis.
p-value
When you perform a hypothesis test in statistics, a p-value helps you determine the significance of your results. ... The p-value is a number between 0 and 1 and interpreted in the following way: A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
level of significance
The null hypothesis is rejected if the p-value is less than a predetermined level, α. α is called the significance level, and is the probability of rejecting the null hypothesis given that it is true (a type I error). It is usually set at or below 5%.
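The decision rule of comparing a p-value with α can be sketched for a two-sided one-sample z-test, assuming the population standard deviation is known. The test choice, the function name, and every number below are hypothetical and only illustrate the comparison.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return the z test statistic and the two-sided p-value."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Area in both tails beyond |z| under the standard normal curve.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: sample mean 52 from n = 100, testing mu0 = 50 with sigma = 10.
z, p_value = one_sample_z_test(sample_mean=52, mu0=50, sigma=10, n=100)

alpha = 0.05                   # significance level chosen before the test
reject_null = p_value < alpha  # reject H0 only if the p-value is below alpha
```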
type 1 error
In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a "false positive").
type 2 error
A type II error is incorrectly retaining a false null hypothesis (a "false negative").
power
The power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis. Statistical power is inversely related to beta or the probability of making a Type II error. In short, power = 1 - β.
central limit theorem
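The central limit theorem (CLT) is a statistical theory that states that, given a sufficiently large sample size from a population with a finite level of variance, the sampling distribution of the sample mean will be approximately normal, regardless of the shape of the population distribution. The mean of that sampling distribution equals the mean of the population, and its standard deviation equals the population standard deviation divided by the square root of the sample size.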
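A small simulation can illustrate what the central limit theorem says about sample means: their distribution centers on the population mean with spread close to σ/√n, even when the population is skewed. The sketch assumes an exponential population with mean 1 and standard deviation 1; the sample size, repetition count, and seed are arbitrary choices.

```python
import random
from statistics import fmean, stdev

def sample_means(n, reps, seed=0):
    """Means of reps independent samples of size n from a skewed population."""
    rng = random.Random(seed)
    return [fmean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]

n = 40
means = sample_means(n, reps=2000)

center = fmean(means)         # should be near the population mean, 1
spread = stdev(means)         # should be near sigma / sqrt(n)
predicted_spread = 1 / n**0.5
```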
t-distribution
In probability and statistics, Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
Critical value (t*)
The critical t statistic (t*) is the t statistic having degrees of freedom equal to df and a cumulative probability equal to the critical probability (p).
degrees of freedom
In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. By analogy, in mechanics, the number of independent ways by which a dynamic system can move, without violating any constraint imposed on it, is called the number of degrees of freedom.
dependent events
When two events are dependent, the occurrence of one event influences the probability of the other event.
chi square distribution
The Chi Square distribution is the distribution of the sum of squared standard normal deviates. The degrees of freedom of the distribution is equal to the number of standard normal deviates being summed.
chi square test statistic
The chi-square test statistic is an overall measure of how close the observed frequencies are to the expected frequencies.
expected counts
Expected counts are the projected frequencies in each cell if the null hypothesis is true (that is, if there is no association between the variables).
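For a two-way table, the usual expected count for a cell is (row total × column total) / grand total, and the chi-square statistic sums (observed − expected)² / expected over all cells. The sketch below applies both to a hypothetical 2 × 2 table; the function names and counts are assumptions.

```python
def expected_counts(observed):
    """Expected cell counts under no association between rows and columns."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

table = [[10, 20], [20, 10]]          # hypothetical observed counts
expected = expected_counts(table)     # every cell is 15 for this table
chi2 = chi_square_statistic(table)    # 4 cells, each contributing 25 / 15
```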
correlation coefficient
a number between −1 and +1 calculated so as to represent the linear dependence of two variables or sets of data.
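One standard way to compute the correlation coefficient is Pearson's r: the sum of products of deviations divided by the square root of the product of the summed squared deviations. The helper name `pearson_r` and the paired data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
perfect = pearson_r(xs, [2, 4, 6, 8, 10])   # points lie exactly on a rising line
moderate = pearson_r(xs, [2, 4, 5, 4, 5])   # positive but imperfect association
```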
slope
Slope is often denoted by the letter m; there is no clear answer to the question why the letter m is used for slope, but it might be from the "m for multiple" in the equation of a straight line "y = mx + b" or "y = mx + c". The direction of a line is either increasing, decreasing, horizontal or vertical.
y-intercept
The equation of any straight line, called a linear equation, can be written as: y = mx + b, where m is the slope of the line and b is the y-intercept. The y-intercept of this line is the value of y at the point where the line crosses the y axis.
extrapolation
the extension of a graph, curve, or range of values by inferring unknown values from trends in the known data.
residuals
The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual. For a least-squares regression line, both the sum and the mean of the residuals are equal to zero.
residual plot
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
coefficient of determination
The coefficient of determination (denoted by R^2) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable. ... An R^2 between 0 and 1 indicates the extent to which the dependent variable is predictable.
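The regression terms above (slope, y-intercept, residuals, and the coefficient of determination) can be tied together in a minimal least-squares sketch. It assumes simple linear regression with one independent variable; the function names and data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def residuals(xs, ys, slope, intercept):
    """Observed y minus predicted y for each data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def r_squared(ys, resid):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    y_bar = sum(ys) / len(ys)
    sse = sum(e ** 2 for e in resid)
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = least_squares_line(xs, ys)   # slope 0.6, intercept 2.2 for this data
resid = residuals(xs, ys, m, b)     # residuals sum to zero, up to rounding
r2 = r_squared(ys, resid)           # 0.6 for this data
```

If the residuals plotted against x show no pattern, the linear model is a reasonable fit, which matches the residual plot card.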