bell-shaped

the colloquial description given to a particular type of smooth, symmetric distribution, so called because it resembles a bell. The normal distribution is one example.

Many variables are normal or approximately normal, which is useful because of the bell curve's nice mathematical properties, such as the Empirical Rule.

bimodal

A distribution is bimodal if it has two modes.

A bimodal variable has two prominent peaks in its distribution when displayed in a histogram, dotplot, or stemplot. Number of modes is one of the descriptors of shape.

boxplot

visual representation of the five-number summary. The box extends from the lower quartile to the upper quartile and shows the middle 50% of the data, including the median. The whiskers of the boxplot may extend to the minimum and maximum.

More commonly, a boxplot identifies outliers using "fences" based on the IQR. A point is a univariate outlier if it is smaller in value than the lower fence or larger in value than the upper fence. It is displayed with a symbol like an asterisk or circle. The whiskers then extend to the largest and smallest non-outliers.

Lower fence = Q1 - 1.5(IQR)

Upper fence = Q3 + 1.5(IQR)

Sometimes a data point is called an extreme outlier if it is more than 3(IQR) below Q1 or above Q3.

Strengths: Because a boxplot displays the five-number summary, it directly shows center and spread. It identifies outliers using a fixed rule based on the IQR. Boxplots are very good for comparing two or more distributions.

Weaknesses: A boxplot can be used to judge symmetry versus skew, but not all aspects of shape, such as whether the distribution is bell-shaped or bimodal.
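
The fence rule can be sketched in a few lines of Python. This is only an illustration: the function names and the data are made up, and the quartiles are taken as medians of the lower and upper halves (statistical software may use slightly different quartile conventions).

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2          # the middle value is left out when n is odd
    return median(data[:half]), median(data[len(data) - half:])

def fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers(values, k=1.5):
    """Values below the lower fence or above the upper fence."""
    low, high = fences(values, k)
    return [x for x in values if x < low or x > high]

scores = [52, 55, 57, 58, 60, 61, 63, 64, 66, 95]  # made-up data
low, high = fences(scores)
inside = [x for x in scores if low <= x <= high]

print("fences:", low, high)
print("outliers:", outliers(scores))
print("whiskers end at:", min(inside), max(inside))
print("extreme outliers (3 IQR):", outliers(scores, k=3))
```

Passing k=3 applies the stricter "extreme outlier" rule with the same code.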

distribution

of a variable describes all of its possible categories or numerical values, along with how often they occur.

For a quantitative variable, the three characteristics that are typically of most interest are the shape, center, and spread of the distribution, as well as whether or not there are any outliers.

A categorical variable is typically summarized in a frequency table.

distribution, normal

is a smooth, continuous, bell-shaped distribution with many useful properties (e.g., the Empirical Rule).

Most real-life data do not exactly fit the mathematical formula for the normal distribution. For example, the data may be counted values (discrete) rather than a continuous variable. However, many real-world phenomena are approximately normally distributed.

The normal distribution plays a large role in inferential statistics.

dotplot

displays each observation for a given numerical variable as a dot on a number line.

If two or more observations share the same value, which is very common, the dots are stacked vertically.

Strengths: A dotplot is easy to create and displays all the individual data values. Dotplots can be useful for comparing two or more distributions when each sample size is small or moderate.

Weaknesses: A dotplot is too cluttered for a large sample size. In that case, a histogram or stem-and-leaf plot is better.
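
A rough text-only dotplot can show how repeated values stack up. The sketch below assumes small, single-digit integer data so the columns line up; the function name and the data are hypothetical.

```python
from collections import Counter

def dotplot(values):
    """Return the rows of a text dotplot, top row first, axis labels last."""
    counts = Counter(values)
    ticks = range(min(values), max(values) + 1)
    rows = []
    # Build from the tallest stack down so equal values pile up vertically.
    for level in range(max(counts.values()), 0, -1):
        row = " ".join("o" if counts[v] >= level else " " for v in ticks)
        rows.append(row.rstrip())
    rows.append(" ".join(str(v) for v in ticks))
    return rows

siblings = [1, 2, 2, 3, 3, 3, 5]  # made-up data
print("\n".join(dotplot(siblings)))
```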

Empirical Rule

states that for any bell-shaped curve:

approximately 68% of the values fall within one standard deviation of the mean (µ ± 1σ)

approximately 95% of the values fall within two standard deviations of the mean (µ ± 2σ)

approximately 99.7% of the values fall within three standard deviations of the mean (µ ± 3σ)

only about 0.3% of values are farther than three standard deviations from the mean

By extension, the range of a sample from a bell-shaped distribution equals about 4 to 6 standard deviations. For larger samples (n > 200 or so), SD ≈ range / 6.
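
For the normal distribution specifically, the percentages in the Empirical Rule can be checked numerically. The sketch below uses the standard identity that the proportion within k standard deviations of the mean equals erf(k / √2); the function name is an illustrative choice.

```python
from math import erf, sqrt

def normal_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {normal_within(k):.2%}")

print(f"beyond 3 SD: {1 - normal_within(3):.2%}")
```

The exact values are close to, but not identical to, the rounded 68%, 95%, and 99.7% figures.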

five-number summary

is a group of summary values that includes the median, the quartiles (lower quartile and upper quartile) and the extremes (minimum and maximum).

For a given variable, the five-number summary divides the observational units into four groups, where each group contains 25% of the units.

Related measures are the range and the interquartile range. The five-number summary is the foundation for a boxplot of the variable.
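
As an illustration, the five-number summary can be computed as follows. The sketch assumes the median-of-halves definition of the quartiles; the function name and the data are invented for the example.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    half = len(data) // 2                    # size of each half; middle value excluded if n is odd
    q1 = median(data[:half])                 # median of the lower half
    q3 = median(data[len(data) - half:])     # median of the upper half
    return data[0], q1, median(data), q3, data[-1]

ages = [23, 25, 26, 29, 31, 34, 38, 41, 47]  # made-up data
minimum, q1, med, q3, maximum = five_number_summary(ages)

print("five-number summary:", (minimum, q1, med, q3, maximum))
print("range:", maximum - minimum)
print("IQR:", q3 - q1)
```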

frequency table

is a method for summarizing the distribution of a categorical variable. Each row of the table lists one category along with its corresponding frequency and/or relative frequency.

The table rows may be ordered in different ways, including alphabetically or by increasing/decreasing frequency.
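
A frequency table is straightforward to build with Python's collections.Counter. In this sketch the rows are ordered by decreasing frequency; the function name and the data are hypothetical.

```python
from collections import Counter

def frequency_table(categories):
    """Return rows of (category, frequency, relative frequency)."""
    counts = Counter(categories)
    total = sum(counts.values())
    # most_common() lists categories from most to least frequent.
    return [(cat, n, n / total) for cat, n in counts.most_common()]

colors = ["red", "blue", "red", "green", "blue", "red", "red", "blue"]  # made-up data
for category, freq, rel in frequency_table(colors):
    print(f"{category:<6} {freq:>3} {rel:>7.1%}")
```

Sorting the returned rows by category name instead would give the alphabetical ordering.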

histogram

special type of bar graph for displaying the distribution of a numerical variable. The categories are continuous, equally-spaced intervals of the number line and the frequencies (or relative frequencies) are the counts (or percentages) of data points in each interval.

One rule of thumb for making histograms is that the number of intervals (sometimes called bins) should be approximately equal to the square root of the sample size, adjusting upward or downward as needed to accommodate outliers and create a smooth shape.

Strengths: A histogram is good for judging distribution shape when the sample size is moderate to large, and there is flexibility in the number and width of the bins.

Weaknesses: For a small sample size, a histogram is usually too sparse to show the true shape of a variable's distribution. Individual observation values are also not visible in a histogram.
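
The square-root rule of thumb can be sketched as below. The function name, the equal-width binning from minimum to maximum, and the data are assumptions made for illustration; in practice the bin count would still be adjusted by eye.

```python
from math import sqrt

def histogram_counts(values, bins=None):
    """Count values in equal-width intervals spanning the minimum to the maximum."""
    if bins is None:
        bins = max(1, round(sqrt(len(values))))   # square-root rule of thumb
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in values:
        # The maximum lies on the last edge, so clamp it into the final bin.
        i = bins - 1 if width == 0 else min(int((x - lo) / width), bins - 1)
        counts[i] += 1
    return lo, width, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 9]  # made-up data, n = 16
lo, width, counts = histogram_counts(data)
for i, c in enumerate(counts):
    left = lo + i * width
    print(f"{left:g} to {left + width:g}: {'#' * c}")
```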

location

of a distribution is its center or "average", and is most often quantified by the mean or median. Sometimes mode is used as a measure of center.

Location can be thought of as the "typical" value of a variable.

maximum

is the largest value for a given numerical variable; it is the upper extreme.

mean

is the arithmetic average of values for a given numerical variable, computed by taking the sum of the individual observations and dividing by the total number of observations.

The mean is the most commonly used measure of location. Though most people equate average with mean, there are many different kinds of averages. For example, a trimmed mean can be computed by deleting a fixed percentage of points on the extremes of the data set before taking the mean, which makes it more resistant to the effects of outliers.

A population mean is customarily symbolized by the Greek letter μ while a sample mean is usually represented by x̄ ("x-bar").
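
The difference between the mean and a trimmed mean can be seen in a small sketch. It assumes the trimming proportion is removed from each end of the sorted data; the function names and the data are invented for the example.

```python
def mean(values):
    """Arithmetic average: sum of the observations divided by their count."""
    return sum(values) / len(values)

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the given proportion of points from each extreme."""
    data = sorted(values)
    k = int(len(data) * proportion)   # number of points dropped at each end
    return mean(data[k:len(data) - k])

incomes = [28, 30, 31, 33, 35, 36, 38, 40, 42, 187]  # made-up data, in thousands
print("mean:", mean(incomes))
print("10% trimmed mean:", trimmed_mean(incomes))
```

Dropping the single smallest and largest values removes the influence of the one very large income.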

median

is the middle value for a given numerical variable, such that 50% of the values are above the median and 50% are below the median.

The median is a commonly used measure of location and often preferable to the mean when the distribution is left/right skewed or has outliers, because it is a resistant statistic.

minimum

is the smallest value for a given numerical variable; it is the lower extreme.

mode

is the most frequently occurring value for a given variable. It can be used with either categorical or numerical variables.
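
Python's statistics.multimode (Python 3.8 or later) returns every value tied for most frequent, and it works for both categorical and numerical data. The data below are made up.

```python
from statistics import multimode

pets = ["cat", "dog", "dog", "bird"]   # categorical, made-up data
sizes = [3, 4, 4, 5, 7, 7, 8]          # numerical, made-up data

print(multimode(pets))    # one most frequent category
print(multimode(sizes))   # two values tie for most frequent
```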

outlier, univariate

is a data point that is "far away" from the rest of the observed values for a given numerical variable.

The shape, center, and spread of the distribution all influence whether or not a data point is an outlier, though there is no single definition.

A boxplot can help identify outliers using a rule based on the IQR, while a standardized score defines outliers in terms of standard deviations. There are several possible reasons for outliers, which determine whether they may be deleted or ignored.

percentile

The kth percentile of a distribution is the value that has k% of the data values at or below it and (100-k)% of the data values above it.

The median is the 50th percentile, while Q1 and Q3 are the 25th and 75th percentiles. Percentiles are often used to make comparisons between distributions, such as the Verbal and Math subscales of the SAT.
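
One simple way to compute a percentile directly from this definition is the nearest-rank method, sketched below. This is only one of several conventions (software packages often interpolate between data values, so results can differ slightly); the function name and the data are hypothetical.

```python
from math import ceil

def percentile(values, k):
    """Smallest data value with at least k% of the values at or below it (0 < k <= 100)."""
    data = sorted(values)
    rank = ceil(k * len(data) / 100)   # 1-based position in the sorted data
    return data[max(rank, 1) - 1]

scores = [55, 61, 64, 68, 70, 72, 75, 79, 83, 91]  # made-up data
for k in (25, 50, 90):
    print(f"{k}th percentile:", percentile(scores, k))
```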

quartile, lower

(Q1) is the median of the lower half of a set of ordered data values.

quartile, upper

(Q3) is the median of the upper half of a set of ordered data values.

quartiles

divide the upper and lower halves of a variable's distribution in half, such that the four parts each contain 25% of the observations or values. The upper and lower halves are determined by the median.

Q1 = lower quartile

Q2 = median (Q2 is not commonly used)

Q3 = upper quartile

The quartiles and the extremes make up the five-number summary.

range

is a measure of variability that indicates the absolute distance between the largest and smallest value for a given variable.

range, interquartile

(IQR) is a measure of variability that indicates the absolute distance between the upper and lower quartiles for a given variable. It summarizes the spread of the middle half (50%) of the distribution.

sample size

is the total number of observational units in a study or dataset.

The most commonly used symbol for sample size is n.

shape

of a distribution refers to the way in which the values of a quantitative variable cluster along the number line; which values have a greater or lesser frequency of occurrence.

When assessing shape, we consider questions such as: are most of the frequently-occurring values clumped in the middle with other possible values becoming less frequent on either side (in the tails of the distribution) or are most of the frequently-occurring values clumped to one side with one long tail and one short one? Is there one clear peak, two, or several?

Terms used to describe shape include symmetric and skewed, as well as unimodal and bimodal.

skewed

A numerical variable's distribution has a ------- shape if the values are more spread out on one side of the center than the other.

A skewed distribution may be either left or right skewed.

Right skew means that the values on the right side of the center (higher values) are more spread out.

Left skew means that the values on the left side of the center (lower values) are more spread out.

spread

is a more colloquial term for variability.

standard deviation

is a measure of variability. It can be thought of as the average distance of all a variable's individual values from the mean of that variable.

Population and sample standard deviations are calculated slightly differently; the default in most statistical program calculations is sample standard deviation. The square of standard deviation is variance.

A population standard deviation is customarily symbolized by the Greek letter σ while a sample standard deviation is usually represented by s.
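
The two calculations differ only in the divisor: n for a population and n - 1 for a sample. A minimal sketch, with an illustrative function name and made-up data, compared against Python's statistics module:

```python
from math import sqrt
from statistics import pstdev, stdev

def standard_deviation(values, sample=True):
    """Square root of the variance; divide by n - 1 for a sample, n for a population."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)   # squared distances from the mean
    variance = sum_sq / (n - 1 if sample else n)
    return sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data
print("population SD:", standard_deviation(data, sample=False), pstdev(data))
print("sample SD:    ", standard_deviation(data), stdev(data))
```

The sample value is slightly larger because the same sum of squares is divided by a smaller number.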

standardized score

or z-score, expresses the distance of an individual data point from the variable's mean in terms of standard deviations.

For example, a z-score of 1.5 indicates that a point is one and a half standard deviations above the mean, while a z-score of -1 indicates that a point is 1 standard deviation below the mean.

z = (point - mean) / standard deviation

If the variable is normally distributed, the Empirical Rule may be used to further interpret the z-score.
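
The formula translates directly into code. The mean of 100 and standard deviation of 15 below are made-up values chosen to reproduce the two z-scores mentioned in this entry.

```python
def z_score(point, mean, sd):
    """Distance of a point from the mean, in standard deviations."""
    return (point - mean) / sd

mean, sd = 100, 15   # hypothetical values
print(z_score(122.5, mean, sd))   # above the mean, so positive
print(z_score(85, mean, sd))      # below the mean, so negative
```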

statistic, resistant

is a numerical summary measure that is not strongly influenced by outliers; its value will not change much if an outlier is added or removed from a dataset.

Resistant measures include the median, quartiles, IQR, and some percentiles.
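
A quick comparison illustrates resistance: adding one extreme value moves the mean a great deal but barely changes the median. The data are made up.

```python
from statistics import mean, median

data = [12, 14, 15, 15, 16, 18, 19]   # made-up data
with_outlier = data + [90]            # same data plus one extreme value

print("mean:  ", round(mean(data), 2), "->", mean(with_outlier))
print("median:", median(data), "->", median(with_outlier))
```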

stem-and-leaf plot

is a graphical display for numerical data similar to a histogram, except that all individual data values are shown. Each row in the plot begins with a "stem" (the leading digit or digits of the data value), and values within the rows are "leaves" (the remaining portions of the data values). The stems must be evenly spaced.

Strengths: A stemplot displays individual data values and is good for sorting data. It can be used to judge distribution shape when the sample size is moderately large, similar to a histogram.

Weaknesses: A stemplot may be too cluttered when the sample size is large, because it displays all the individual data values and the choice of intervals is more limited than for a histogram.
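
A basic text stemplot can be sketched as follows, assuming non-negative whole numbers with the tens as stems and the ones digit as leaves. The function name and the data are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stem-and-leaf plot (stem = tens, leaf = ones digit)."""
    leaves = defaultdict(list)
    for x in sorted(values):              # sorting puts the leaves in order
        leaves[x // 10].append(x % 10)
    rows = []
    # Include stems with no leaves so the stems stay evenly spaced.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}".rstrip())
    return rows

ages = [23, 12, 57, 21, 34, 15, 28, 51, 23]  # made-up data
print("\n".join(stemplot(ages)))
```

The empty row for stem 4 shows a gap in the data that would be hidden if empty stems were skipped.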

study

organized data collection activity that is designed and undertaken to answer a specific question (or questions) of interest.

Studies can be sample surveys, observational studies, or randomized experiments. In clinical psychology and the social sciences, case studies are frequently used, while the medical field makes extensive use of clinical trials.

symmetric

A numerical variable's distribution has a ------- shape if the values are spread out approximately the same on both sides of its center (i.e., the halves are roughly mirror images of one another).

The most well-known symmetric distribution is probably the bell-shaped curve, but there are many other common symmetric distributions, including the uniform, triangular, U-shaped, and Student's t distributions.

unimodal

A distribution is unimodal if it has only one mode.

A unimodal variable will have one prominent peak in its distribution when displayed in a histogram, dotplot, or stemplot. Number of modes is one of the descriptors of shape.

variability

refers to the dispersion (or "spread") of a numerical variable's distribution along a number line, often measured from its location.

Summary measures of variability include standard deviation, variance, IQR, and range, as well as mean absolute deviation and median absolute deviation.

For a categorical variable, we could say that it has a lot of "variability" if there are many different possible categories that occur frequently.

variance

is a measure of variability, equal to the square of standard deviation.

This summary measure is not often used for description, but it plays a role in many types of multivariate statistical analyses. Its calculation is also a necessary intermediate step in the calculation of standard deviation.

z-score

See standardized score.