The objects described by a set of data. Individuals may be people, but they may also be animals or things.
Any characteristic of an individual. A variable can take different values for different individuals.
Exploratory Data Analysis
Examine data in order to describe their main features.
Records which of several groups or categories an individual belongs to.
Takes numerical values for which it makes sense to do arithmetic operations like adding and averaging.
(of a variable) Tells us what values the variable takes and how often it takes these values.
Helps us see what part of the whole each group forms.
Horizontal line representing a variable and a number scale imposed for the values of the variable.
Most common graph of distributions with one quantitative variable.
A value that divides the observations so that about half take larger values and about half have smaller values.
The range of values.
Right and left sides of the histogram are approx. mirror images of each other.
Skewed to the right
Right side of the histogram extends much further out than the left side.
Skewed to the left
Left side of the histogram extends much farther out than the right side.
Each observation is separated into a stem consisting of all but the rightmost digit and a leaf, the final digit. Good for small data sets.
Double each stem on a stemplot (leaves 0-4 go on the first stem, leaves 5-9 on the second stem).
Plots each observation against the time at which it was measured. Time scale on the x-axis, variable of interest on the y-axis. If there are not too many values, a line graph works well.
Common, overall pattern.
x̄ = (x₁ + x₂ + ... + xₙ) / n, or x̄ = (1/n) ∑xᵢ
Sensitive to the influence of extreme observations (may/may not be outliers).
Arrange all observations in order of size, from smallest to largest. If n is odd, M is the center observation. If n is even, M is the mean of the two center observations in the ordered list.
Not affected by extreme values.
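The mean and median definitions above, and the mean's sensitivity to extreme values, can be sketched in a few lines of Python (the data here are made-up example values):

```python
# Mean and median from the definitions above (hypothetical data).
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)              # arrange observations in order of size
    n = len(s)
    m = n // 2
    if n % 2 == 1:
        return s[m]             # n odd: M is the center observation
    return (s[m - 1] + s[m]) / 2  # n even: mean of the two center observations

data = [2, 3, 4, 5, 6]
print(mean(data), median(data))      # 4.0 4

# One extreme value pulls the mean but not the median:
data_out = [2, 3, 4, 5, 60]
print(mean(data_out))                # 14.8
print(median(data_out))              # 4
```
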
Difference between largest and smallest observations.
Q1 = the median of the observations to the left of M. Q3 = the median of the observations to the right of M.
Interquartile range is the distance between the first and third quartiles.
An observation is a suspected outlier if it falls more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.
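A sketch of the quartile, IQR, and 1.5 × IQR outlier rules on hypothetical data (note that quartile conventions vary slightly between textbooks and software; this follows the "median of each half" rule above):

```python
# Quartiles as medians of each half of the ordered data, then the
# 1.5 x IQR rule for flagging suspected outliers. Data are made up.
def median(xs):
    s = sorted(xs)
    n = len(s)
    m = n // 2
    return s[m] if n % 2 else (s[m - 1] + s[m]) / 2

def quartiles(xs):
    s = sorted(xs)
    half = len(s) // 2
    q1 = median(s[:half])    # median of observations to the left of M
    q3 = median(s[-half:])   # median of observations to the right of M
    return q1, q3

data = [5, 7, 8, 9, 11, 13, 14, 15, 40]
q1, q3 = quartiles(data)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print(q1, q3, iqr, outliers)   # 7.5 14.5 7.0 [40]
```
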
The central box in a boxplot ends at the quartiles and spans the middle half of the observations. The line inside the box marks the median M; whiskers extend to the largest and smallest observations.
In a modified boxplot, outliers are plotted as isolated points and the whiskers extend only to the largest and smallest observations that are not outliers.
s² = [(x₁ - x̄)² + (x₂ - x̄)² + ... + (xₙ - x̄)²] / (n-1), or s² = [1/(n-1)] ∑(xᵢ - x̄)²
s = √[(1/(n-1)) ∑(xᵢ - x̄)²] Use for normal distributions.
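A minimal implementation of the variance and standard deviation formulas above, showing the n-1 divisor (data are hypothetical):

```python
import math

# Sample variance s^2 and standard deviation s with the n-1 divisor,
# exactly as in the formulas above. Data values are made up.
def variance(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def stdev(xs):
    return math.sqrt(variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, squared deviations sum to 32
print(variance(data))              # 32/7
print(stdev(data))
```
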
Degrees of freedom
n-1 of the squared deviations can vary freely.
Properties of Standard Deviation
1. s measures the spread about the mean and should be used only when the mean is chosen as the measure of center.
2. s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise s > 0. As the observations become more spread out about their mean, s gets larger.
3. s, like the mean x̄, is strongly influenced by extreme observations. A few outliers can make s very large.
Five number summary
Min. value, Q1, M, Q3, Max. value. Use for skewed distributions.
Describes the overall pattern of a distribution. Total area = 1. Area beneath curve gives proportions of observations.
Density curve mean and median
The median of a density curve is the equal-areas point, the point that divides the area under the curve in half. The mean of a density curve is the balance point, at which the curve would balance if made of solid material. (For a symmetric density curve, the mean and median are equal and both lie at the center.)
Imitating a real experiment without actually conducting it; a virtual experiment.
Normal distributions are described by a special family of bell-shaped symmetric density curves, called normal curves. The mean μ and standard deviation σ completely specify a normal distribution N(μ, σ). The mean is the center of the curve and σ is the distance from μ to the inflection points on either side.
All normal distributions satisfy the 68-95-99.7 rule, which describes what percent of observations lie within one, two, and three standard deviations of the mean.
An observation's percentile is the percent of the distribution that is at or to the left of the observation.
If x is an observation from a distribution that has mean μ and SD σ, the standardized value of x is: z = (x - μ) / σ
Standardized observations are sometimes called z-scores.
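The standardizing formula is a one-liner; the mean 70 and SD 10 below are assumed example values, not from the text:

```python
# z-score: how many standard deviations x lies from the mean mu.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# Hypothetical exam scores with mean 70 and SD 10:
print(z_score(85, 70, 10))   # 1.5  (1.5 SDs above the mean)
print(z_score(60, 70, 10))   # -1.0 (1 SD below the mean)
```
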
Standard normal distribution
The SND is the normal distribution N(0,1) with mean 0 and SD 1. If a variable x has any normal distribution N(μ,σ) with mean μ and SD σ, then the standardized variable: z = (x - μ) / σ has the standard normal distribution.
The Standard Normal Table
Table A is a table of areas under the SNC. The table entry for each value z is the area under the curve to the left of z.
Finding Normal Proportions
1. State the problem in terms of the observed variable x.
2. Standardize x to restate the problem in terms of a standard normal variable z. Draw a picture to show the area under the SNC.
3. Find the required area under the SNC, using Table A and the fact that the total area under the curve is 1.
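The three steps above can be sketched in Python, with the standard library error function standing in for Table A. The mean 64.5 and SD 2.5 are assumed example values:

```python
import math
from statistics import NormalDist

# Area to the left of z under the standard normal curve (replaces Table A).
def phi(z):
    return (1 + math.erf(z / math.sqrt(2))) / 2

# Step 1 (assumed problem): x ~ N(64.5, 2.5); find the proportion with x < 67.
# Step 2: standardize.
z = (67 - 64.5) / 2.5          # z = 1.0
# Step 3: find the area to the left of z.
print(phi(z))                  # ~0.8413, matching Table A for z = 1.00
# The stdlib NormalDist gives the same answer without standardizing by hand:
print(NormalDist(64.5, 2.5).cdf(67))
```
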
One can observe the shape of histograms, stemplots, and boxplots and see how well the data fit the 68-95-99.7 rule for normal distributions. A good method for assessing normality is to construct a normal probability plot.
Measures an outcome of a study.
Attempts to explain the observed outcomes.
A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point in the plot fixed by the values of both variables for that individual. Always plot the explanatory variable (x) on the horizontal axis and the response variable (y) on the vertical axis. If there is no explanatory-response relationship, either variable can go on the horizontal axis.
Overall pattern of scatterplots
In examining a scatterplot, look for an overall pattern showing the direction, form, and strength of the relationship and then for outliers or other deviations from this pattern.
If the relationship has a clear direction, we speak of either positive association (high values of the two variables tend to occur together) or negative association (high values of one variable tend to occur with low values of the other variable).
Linear relationships, where the points show a straight-line pattern, are an important form of relationship between two variables. Curved relationships and clusters are other forms to watch for.
Strength of a relationship
Determined by how close the points in the scatterplot lie to a simple form such as a line.
Plotting points on the scatterplot with different colors or symbols.
Correlation measures the strength and direction of the linear relationship between two quantitative variables. Correlation is usually written as r.
Suppose we have data on variables x and y for n individuals. The values for the first individual are x₁ and y₁, for the second individual x₂ and y₂, and so on. The means and standard deviations of the two variables are x̄ and Sx for the x-values and ȳ and Sy for the y-values. The correlation r between x and y is:
r = [1/(n-1)] ∑[((xᵢ - x̄) / Sx)((yᵢ - ȳ) / Sy)]
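The correlation formula can be computed directly from its definition, as an average product of standardized values (data below are hypothetical):

```python
import math

# Correlation r as the average product of standardized x and y values,
# with the n-1 divisor, per the formula above. Data are made up.
def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # perfectly linear, positive direction
print(correlation(x, y))      # 1.0
print(correlation(x, [5, 4, 3, 2, 1]))  # -1.0: perfect negative association
```
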
r only measures straight-line relationships.
A straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.
Mathematical model (model)
The LSRL is a model for the data. If the data seems to show a linear trend, then it would be appropriate to try and fit an LSRL to the data.
Regression lines can be used to approximate values of the response variable for given values of x.
Least-Squares Regression Line
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Equation of the Least-Squares Regression Line
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means x-bar and y-bar and the standard deviations Sx and Sy of the two variables, and their correlation r. The least-squares regression line is the line
y hat = a + bx
b = r (Sy/Sx)
a = ybar - b(xbar)
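A sketch of computing the slope b and intercept a from the formulas above, using hypothetical data that lie exactly on y = 1 + 2x so the answer is easy to check:

```python
import math

# Least-squares line yhat = a + bx via b = r(Sy/Sx), a = ybar - b*xbar.
def lsrl(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    r = sum((x - xbar) * (y - ybar)
            for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = ybar - b * xbar
    return a, b

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]          # exactly y = 1 + 2x
a, b = lsrl(x, y)
print(a, b)               # 1.0 2.0
```
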
Plot the line
To plot the line by hand on a scatterplot, use the equation to find y hat for two values of x, one near each end of the range of x in the data. Plot each y hat above its x and draw the line through the two points.
Slope of the Least-Squares Regression Line
b = r (Sy/Sx)
This equation says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.
The sum of the squares of the deviations of the points about this regression line.
Sum of squares of error.
SSE = ∑(y - yhat)^2
If r is near 0, the sum of squares of deviations about the mean ybar and the sum of squares of deviations about the regression line yhat would be approximately the same.
r^2 in Regression
The coefficient of determination, r^2, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
r^2 = (SSM - SSE)/SSM, where SSM = ∑(y - ybar)^2 is the sum of squares of deviations about the mean.
A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,
residual = observed y - predicted y
= y - yhat
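Residuals and the r^2 fraction can be checked numerically. The data below are hypothetical, and the coefficients a = 1.3, b = 0.9 happen to be the least-squares fit for them, so the residuals sum to (essentially) zero:

```python
# residual = observed y - predicted y = y - yhat, and
# r^2 = (SSM - SSE)/SSM. Data and coefficients are hypothetical;
# a = 1.3, b = 0.9 is the least-squares fit for this data.
def residuals(xs, ys, a, b):
    return [y - (a + b * x) for x, y in zip(xs, ys)]

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
a, b = 1.3, 0.9
res = residuals(x, y, a, b)

ybar = sum(y) / len(y)
ssm = sum((yi - ybar) ** 2 for yi in y)   # squared deviations about the mean
sse = sum(ri ** 2 for ri in res)          # squared deviations about the line
r_squared = (ssm - sse) / ssm
print(sum(res))      # essentially 0 (floating-point noise)
print(r_squared)     # ~0.81: about 81% of the variation in y is explained
```
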
Plots the residuals on the vertical axis against the explanatory variable on the horizontal axis. A residual plot magnifies the residuals and makes patterns easier to see.
Residuals from least-squares regression have a special property: the mean of the residuals is always zero. A calculator may report a sum such as -0.0002 because the software rounds to 4 decimal places.
Outliers and Influential Observations in Regression
An outlier is an observation that lies outside the overall pattern of the other observations in a scatterplot. An observation can be an outlier in the x direction, the y direction, or in both directions.
An observation is influential if removing it would markedly change the position of the regression line. Points that are outliers in the x direction are often influential.
Intercept of a Regression Line
The intercept a of a regression line yhat = a + bx is the predicted value yhat when the explanatory variable x = 0. This prediction is of no statistical use unless x can actually take values near 0.
Relationship Between Correlation and Regression
The correlation r is the slope of the least-squares regression line when we measure both x and y in standardized units. The square of the correlation, r^2, gives the percent of the variation in one variable that is explained by least-squares regression on the other variable.