Individuals

The objects described by a set of data. Individuals may be people, but they may also be animals or things.

Variable

Any characteristic of an individual. A variable can take different values for different individuals.

Exploratory Data Analysis

Examine data in order to describe their main features.

Categorical variable

Records which of several groups or categories an individual belongs to.

Quantitative variable

Takes numerical values for which it makes sense to do arithmetic operations like adding and averaging.

Distribution

(of a variable) Tells us what values the variable takes and how often it takes these values.
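
As a small illustration of "what values and how often", the sketch below tallies a made-up categorical variable. The data and the name `eye_colors` are hypothetical and are not from the source.

```python
from collections import Counter

# Hypothetical categorical data: eye color recorded for eight individuals.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "hazel"]

# The distribution: each value the variable takes and how often it takes it.
distribution = Counter(eye_colors)
for value, count in distribution.most_common():
    print(f"{value}: {count} ({count / len(eye_colors):.0%})")
```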

Pie chart

Helps us see what part of the whole each group forms.

Dotplot

A horizontal line representing the variable, marked with a number scale for its values. Each observation is plotted as a dot above its value on the scale.

Histogram

Most common graph of distributions with one quantitative variable.

Outlier

An individual observation that falls outside the overall pattern of the graph.

Center

A value that divides the observations so that about half take larger values and about half have smaller values.

Spread

The range of values.

Symmetric

Right and left sides of the histogram are approx. mirror images of each other.

Skewed to the right

Right side of the histogram extends much further out than the left side.

Skewed to the left

Left side of the histogram extends much farther out than the right side.

Stemplot

Each observation is separated into a stem consisting of all but the rightmost digit and a leaf, the final digit. Good for small data sets.
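
One possible way to build a stemplot in code is sketched below. It assumes non-negative whole-number observations, and the function name `stemplot` is a hypothetical helper, not something the source defines.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines for non-negative integers.

    Stem = all but the rightmost digit, leaf = the final digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines

# Hypothetical small data set.
for line in stemplot([12, 15, 21, 34, 30]):
    print(line)
```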

Split stems

Double each stem on a stemplot (0-4 on the first stem, 5-9 on the second stem).

Time plot

Plots each observation against the time at which it was measured. Time scale on the x-axis, variable of interest on the y-axis. If there are not too many values, a line graph works well.

Trend

Common, overall pattern.

Mean

x̄ = (x₁ + x₂ + ... + xₙ) / n, or x̄ = (1/n) ∑xᵢ

Nonresistant

Sensitive to the influence of extreme observations (may/may not be outliers).

Median

Arrange all observations in order of size, from smallest to largest. If n is odd, M is the center observation. If n is even, M is the mean of the two center observations in the ordered list.
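
The odd and even cases can be sketched directly from the definition. The function name `median` here is just an illustrative helper. Python's standard library also offers `statistics.median`.

```python
def median(observations):
    """Median M: center observation (n odd) or mean of the two center ones (n even)."""
    ordered = sorted(observations)      # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]             # n odd: the center observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: mean of the two center values

print(median([5, 1, 3]))     # odd n
print(median([4, 1, 3, 2]))  # even n
```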

Resistant

Not affected by extreme values.
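
A quick way to see the difference between a nonresistant and a resistant measure is to change one observation to an extreme value. The two data lists below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical data: identical except for one extreme observation.
typical = [2, 3, 4, 5, 6]
with_extreme = [2, 3, 4, 5, 60]

# The mean is pulled toward the extreme value; the median does not move.
print(mean(typical), median(typical))
print(mean(with_extreme), median(with_extreme))
```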

Range

Difference between largest and smallest observations.

Quartiles

Q1 = the median of the observations to the left of M. Q3 = the median of the observations to the right of M.

IQR

Interquartile range is the distance between the first and third quartiles.

Outlier

An observation that falls more than 1.5 x IQR below Q1 or more than 1.5 x IQR above Q3.
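
The quartile, IQR, and outlier definitions above can be combined in a short sketch. It follows the convention given here (the median itself is excluded from both halves when n is odd), whereas other software may use slightly different quartile rules. The helper names and the data are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3), taking Q1 and Q3 as medians of the halves left and right of M."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # When n is odd, the center observation belongs to neither half.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)

def outliers(data):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [1, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical observations
print(quartiles(data))
print(outliers(data))
```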

Boxplot

The central box in a boxplot spans the quartiles, Q1 to Q3, and so covers the middle half of the observations. A line inside the box marks the median M, and whiskers extend to the smallest and largest observations.

Modified boxplot

Outliers are plotted as isolated points; each whisker reaches only to the most extreme observation that is not an outlier.

Variance s²

s² = [(x₁ - x̄)² + (x₂ - x̄)² + ... + (xₙ - x̄)²] / (n-1), or s² = [1/(n-1)] ∑(xᵢ - x̄)²

Standard deviation

s = √[(1/(n-1)) ∑(xᵢ - x̄)²]. Use for normal distributions.

Degrees of freedom

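Because the deviations xᵢ - x̄ always sum to 0, only n-1 of the squared deviations can vary freely. The number n-1 is called the degrees of freedom of the variance or standard deviation.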
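
The variance and standard deviation formulas can be worked through by hand on a tiny data set. The sketch below uses made-up data and checks the result against Python's `statistics.stdev`, which also divides by n-1.

```python
from math import sqrt, isclose
import statistics

data = [1, 2, 3, 4, 5]          # hypothetical observations
n = len(data)
x_bar = sum(data) / n

# Deviations from the mean always sum to 0, which is why only n-1 vary freely.
deviations = [x - x_bar for x in data]

variance = sum(d ** 2 for d in deviations) / (n - 1)   # s squared
std_dev = sqrt(variance)                               # s

print(sum(deviations), variance, std_dev)
```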

Properties of Standard Deviation

1. s measures the spread about the mean and should be used only when the mean is chosen as the measure of center.

2. s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise s > 0. As the observations become more spread out about their mean, s gets larger.

3. s, like the mean x̄, is strongly influenced by extreme observations. A few outliers can make s very large.

Five number summary

Min. value, Q1, M, Q3, Max. value. Use for skewed distributions.

Density curve

Describes the overall pattern of a distribution. Total area = 1. Area beneath curve gives proportions of observations.

Density curve mean and median

The median of a density curve is the equal-areas point, the point that divides the area under the curve in half. The mean of a density curve is the balance point, at which the curve would balance if made of solid material. (For a symmetric density curve, the mean and median are equal and both lie at the center.)

Simulation

Pretending to conduct an experiment. Virtual experiment.

Normal distributions

Normal distributions are described by a special family of bell-shaped, symmetric density curves called normal curves. The mean μ and standard deviation σ completely specify a normal distribution N(μ, σ). The mean is the center of the curve, and σ is the distance from μ to the inflection points on either side.

68-95-99.7 Rule

All normal distributions satisfy the 68-95-99.7 rule, which describes what percent of observations lie within one, two, and three standard deviations of the mean: approximately 68% within σ of μ, 95% within 2σ, and 99.7% within 3σ.
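
The rule can be checked numerically with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). The helper name `within` and the example N(100, 15) are illustrative assumptions.

```python
from statistics import NormalDist

def within(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution lying within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(within(k), 4))

# The proportions do not depend on which normal distribution is used.
print(round(within(2, mu=100, sigma=15), 4))
```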

Percentile

An observation's percentile is the percent of the distribution that is at or to the left of the observation.

Standardized observations

If x is an observation from a distribution that has mean μ and SD σ, the standardized value of x is: z = (x - μ) / σ
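
As a worked instance of the formula, suppose (purely as an assumption for illustration) a distribution with μ = 64.5 and σ = 2.5 and an observation x = 68.

```python
def standardize(x, mu, sigma):
    """Standardized value z = (x - mu) / sigma."""
    return (x - mu) / sigma

# Hypothetical values: mean 64.5, SD 2.5, observation 68.
z = standardize(68, mu=64.5, sigma=2.5)
print(z)  # the observation lies z standard deviations above the mean
```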

Z-scores

Standardized observations are sometimes called z-scores.

Standard normal distribution

The SND is the normal distribution N(0,1) with mean 0 and SD 1. If a variable x has any normal distribution N(μ,σ) with mean μ and SD σ, then the standardized variable: z = (x - μ) / σ has the standard normal distribution.

The Standard Normal Table

Table A is a table of areas under the standard normal curve (SNC). The table entry for each value z is the area under the curve to the left of z.

Finding Normal Proportions

1. State the problem in terms of the observed variable x.

2. Standardize x to restate the problem in terms of a standard normal variable z. Draw a picture to show the area under the SNC.

3. Find the required area under the SNC, using Table A and the fact that the total area under the curve is 1.
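
The three steps above can be sketched in code, with `statistics.NormalDist` standing in for Table A (a substitution made here for illustration, since the source works from the printed table). The numbers are the same hypothetical N(64.5, 2.5) example, asking what proportion of observations exceed 68.

```python
from statistics import NormalDist

# Step 1: state the problem in terms of x. Here: proportion with x > 68.
mu, sigma = 64.5, 2.5   # hypothetical mean and SD
x = 68

# Step 2: standardize x to get z.
z = (x - mu) / sigma

# Step 3: area to the left of z (what Table A would list), then use total area = 1.
standard_normal = NormalDist(0, 1)
area_left = standard_normal.cdf(z)
area_right = 1 - area_left

print(round(z, 2), round(area_left, 4), round(area_right, 4))
```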

Assessing normality

One can examine the shape of histograms, stemplots, and boxplots and check how well the data fit the 68-95-99.7 rule for normal distributions. A good method for assessing normality is to construct a normal probability plot.
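
One rough check against the 68-95-99.7 rule is to count what fraction of the data falls within one, two, and three standard deviations of the mean. This is only a sketch of that check, not a normal probability plot. The helper name `fraction_within` and the data are hypothetical.

```python
from statistics import mean, stdev

def fraction_within(data, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    x_bar, s = mean(data), stdev(data)
    inside = [x for x in data if abs(x - x_bar) <= k * s]
    return len(inside) / len(data)

# Hypothetical evenly spaced data, which is clearly not bell-shaped.
data = list(range(1, 10))
for k in (1, 2, 3):
    print(k, round(fraction_within(data, k), 3))
```

Fractions far from 0.68, 0.95, and 0.997 suggest the data are not well described by a normal distribution, although small samples can deviate by chance.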

Response variable

Measures an outcome of a study.

Explanatory variable

Attempts to explain the observed outcomes.

Scatterplot

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual. The explanatory variable, if there is one, always goes on the x-axis (explanatory = x, response = y). If there is no explanatory-response relationship, either variable can go on the horizontal axis.

Overall pattern of scatterplots

In examining a scatterplot, look for an overall pattern showing the direction, form, and strength of the relationship and then for outliers or other deviations from this pattern.

Positive/Negative Association

If the relationship has a clear direction, we speak of either positive association (high values of the two variables tend to occur together) or negative association (high values of one variable tend to occur with low values of the other variable).

Outlier

An individual observation that falls outside the overall pattern of the graph.

Linear Relationships

Linear relationships, in which the points show a straight-line pattern, are an important form of relationship between two variables. Curved relationships and clusters are other forms to watch for.

Strength of a relationship

Determined by how close the points in the scatterplot lie to a simple form such as a line.

Categorical variable

A categorical variable can be added to a scatterplot by plotting points with a different color or symbol for each category.

Correlation

Correlation measures the strength and direction of the linear relationship between two quantitative variables. Correlation is usually written as r.

Suppose we have data on variables x and y for n individuals. The values for the first individual are x1 and y1, the values for the second individual are x2 and y2, and so on. The means and standard deviations of the two variables are xbar and Sx for the x-values and ybar and Sy for the y-values. The correlation r between x and y is:

r = [1/(n-1)] ∑ [((x - xbar) / Sx)((y - ybar) / Sy)]

r measures only straight-line relationships.
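
The correlation formula translates almost term for term into code. The sketch below uses invented data, and its last case shows a perfect curved relationship (y = x squared on symmetric x-values) for which r is essentially 0, consistent with r describing only straight-line relationships.

```python
from math import sqrt

def correlation(xs, ys):
    """r = [1/(n-1)] * sum of ((x - xbar)/Sx) * ((y - ybar)/Sy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))        # perfect positive line
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))        # perfect negative line
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # strong curve, no linear trend
```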

Regression Line

A straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.

Mathematical model (model)

The least-squares regression line (LSRL) is a model for the data. If the data seem to show a linear trend, it is appropriate to fit an LSRL to the data.

Predict

A regression line can be used to predict the value of y for a given value of x. Predictions are most reliable for x values within the range of the observed data.

Least-Squares Regression Line

The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Equation of the Least-Squares Regression Line

We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means x-bar and y-bar and the standard deviations Sx and Sy of the two variables, and their correlation r. The least-squares regression line is the line

y hat = a + bx

with slope

b = r (Sy/Sx)

and intercept

a = ybar - b(xbar)
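
A minimal sketch of these formulas, computing r, then the slope b = r (Sy/Sx), then the intercept a = ybar - b(xbar). The function name `least_squares_line` and the data are hypothetical.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for yhat = a + b*x using b = r * Sy / Sx and a = ybar - b * xbar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
a, b = least_squares_line(xs, ys)
print(round(a, 4), round(b, 4))
```

Because a = ybar - b(xbar), the fitted line always passes through the point (xbar, ybar).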

Plot the line

To plot the line by hand on a scatterplot, use the equation to find y hat for two values of x, one near each end of the range of x in the data. Plot each y hat above its x and draw the line through the two points.

Slope of the Least-Squares Regression Line

b = r (Sy/Sx)

This equation says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.

SSE

The sum of squares for error: the sum of the squares of the deviations of the points about the regression line.

SSE = ∑(y - yhat)^2

SSM

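The sum of squares of deviations about the mean ybar: SSM = ∑(y - ybar)^2. If x is a poor predictor of y, the sum of squares of deviations about the mean ybar and the sum of squares of deviations about the regression line yhat would be approximately the same.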

r^2 in Regression

The coefficient of determination, r^2, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

(SSM - SSE)/SSM
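
The ratio can be computed directly from its two sums of squares. In this sketch the line is fitted with the equivalent slope formula b = ∑(x - xbar)(y - ybar) / ∑(x - xbar)^2, an assumption made to keep the example short. The data are hypothetical.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar
predicted = [a + b * x for x in xs]

sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predicted))  # about the line
ssm = sum((y - y_bar) ** 2 for y in ys)                         # about the mean
r_squared = (ssm - sse) / ssm

print(round(sse, 4), round(ssm, 4), round(r_squared, 4))
```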

Residuals

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,

residual = observed y - predicted y

= y - yhat

Residual plot

Plots the residuals on the vertical axis against the explanatory variable on the horizontal axis. Magnifies the residuals and makes patterns easier to see.

Roundoff error

Residuals from least-squares regression have a special property: the mean of the residuals is always zero. When a calculator or software reports a sum such as -0.0002 instead of exactly 0, the discrepancy is roundoff error, arising because the software rounded to 4 decimal places.
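
The zero-sum property, and the small discrepancies that floating-point arithmetic can introduce, can be seen in a short sketch. The data are invented, and the exact size of the leftover sum depends on the machine arithmetic, so no specific value is claimed.

```python
xs = [1, 2, 3, 4, 5, 6]               # hypothetical explanatory values
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]   # hypothetical response values
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# In exact arithmetic this sum is 0; any tiny nonzero value is roundoff error.
print(sum(residuals))
```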

Outliers and Influential Observations in Regression

An outlier is an observation that lies outside the overall pattern of the other observations in a scatterplot. An observation can be an outlier in the x direction, the y direction, or in both directions.

An observation is influential if removing it would markedly change the position of the regression line. Points that are outliers in the x direction are often influential.
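
To illustrate influence, the sketch below fits the least-squares slope with and without a single point that is an outlier in the x direction. The data and the helper name `slope` are hypothetical.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Hypothetical data; the last point (20, 0) is far from the others in x.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 5, 4, 5, 0]

b_all = slope(xs, ys)
b_without = slope(xs[:-1], ys[:-1])   # remove the x-direction outlier

# Removing one point flips the sign of the slope, so that point is influential.
print(round(b_all, 3), round(b_without, 3))
```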

Intercept of a Regression Line

The intercept a of a regression line yhat = a + bx is the predicted value yhat when the explanatory variable x = 0. This prediction is of no statistical use unless x can actually take values near 0.

Relationship Between Correlation and Regression

The correlation r is the slope of the least-squares regression line when we measure both x and y in standardized units. The square of the correlation, r^2, is the fraction (often expressed as a percent) of the variation in one variable that is explained by least-squares regression on the other variable.
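
The first claim can be checked numerically: standardize both variables, fit the least-squares slope, and compare it with r. The helper names and data below are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert values to standardized units (z-scores)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def slope(xs, ys):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
zx, zy = standardize(xs), standardize(ys)

r = sum(p * q for p, q in zip(zx, zy)) / (len(xs) - 1)
slope_standardized = slope(zx, zy)

print(round(r, 4), round(slope_standardized, 4), round(r ** 2, 4))
```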