The objects described by a set of data. Individuals may be people, but they may also be animals or things.
Any characteristic of an individual. A variable can take different values for different individuals.
Takes numerical values for which it makes sense to do arithmetic operations like adding and averaging.
(of a variable) Tells us what values the variable takes and how often it takes these values.
Horizontal line representing a variable and a number scale imposed for the values of the variable.
A value that divides the observations so that about half take larger values and about half have smaller values.
Each observation is separated into a stem consisting of all but the rightmost digit and a leaf, the final digit. Good for small data sets.
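The stem-and-leaf split described above can be sketched in a few lines of Python. This is a minimal illustration that assumes non-negative whole-number observations; the function name `stemplot` and the scores are made up for the example.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative integers."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # stem: all but the last digit; leaf: the last digit
        rows[stem].append(leaf)
    lines = []
    # Walk every stem from smallest to largest so that gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

scores = [72, 85, 91, 68, 77, 85, 93, 60, 79, 88]   # hypothetical data
for line in stemplot(scores):
    print(line)
```

Empty stems are kept on purpose: dropping them would hide gaps in the distribution.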
Plots each observation against the time at which it was measured. Time scale on the x-axis, variable of interest on the y-axis. If there are not too many values, a line graph works well.
Arrange all observations in order of size, from smallest to largest. If n is odd, the median M is the center observation. If n is even, M is the mean of the two center observations in the ordered list.
Q1 = the median of the observations to the left of M. Q3 = the median of the observations to the right of M.
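The median and quartile rules above can be sketched as follows. This is an illustrative version that assumes at least two observations; the function names and the data are made up. When n is odd, M itself is left out of both halves.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # n odd: the center observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: mean of the two center values

def quartiles(data):
    """Return (Q1, M, Q3) using the 'median of each half' rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                    # observations to the left of M
    upper = ordered[half + 1:] if n % 2 else ordered[half:]   # observations to the right of M
    return median(lower), median(ordered), median(upper)

print(quartiles([7, 15, 36, 39, 40, 41]))   # even n
print(quartiles([1, 3, 5, 7, 9]))           # odd n
```

Statistical software often uses slightly different quartile conventions, so results from a calculator or library may not match this rule exactly.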
The central box in a boxplot has its ends at the quartiles and spans the middle half of the observations. The vertical line inside the box marks the median M, and whiskers extend to the largest and smallest observations.
Outliers are plotted as isolated points; the whiskers reach only to the most extreme observations that are not outliers.
s² = [(x₁ - x̄)² + (x₂ - x̄)² + ... + (xₙ - x̄)²] / (n - 1), or more compactly s² = [1/(n - 1)] ∑(xᵢ - x̄)². The standard deviation s is the square root of the variance s².
Properties of Standard Deviation
1. s measures the spread about the mean and should be used only when the mean is chosen as the measure of center.
2. s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise s > 0. As the observations become more spread out about their mean, s gets larger.
3. s, like the mean x̄, is strongly influenced by extreme observations. A few outliers can make s very large.
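The variance formula and the properties above can be checked numerically. The sketch below uses made-up data and hypothetical function names; it divides by n - 1, as in the formula.

```python
import math

def sample_variance(xs):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    """s: the square root of the variance."""
    return math.sqrt(sample_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical observations with mean 5

print(sample_variance(data))         # squared deviations sum to 32, divided by 7
print(sample_sd([3, 3, 3, 3]))       # all values equal, so there is no spread
print(sample_sd(data), sample_sd(data + [50]))   # a single outlier inflates s
```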
Describes the overall pattern of a distribution. Total area = 1. Area beneath curve gives proportions of observations.
Density curve mean and median
The median of a density curve is the equal-areas point, the point that divides the area under the curve in half. The mean of a density curve is the balance point, at which the curve would balance if made of solid material. (For a symmetric density curve, the mean and median are equal and both lie at the center.)
Normal distributions are described by a special family of bell-shaped, symmetric density curves called normal curves. The mean μ and standard deviation σ completely specify a normal distribution N(μ, σ). The mean is the center of the curve and σ is the distance from μ to the inflection points on either side.
All normal distributions satisfy the 68-95-99.7 rule, which describes what percent of observations lie within one, two, and three standard deviations of the mean (approximately 68%, 95%, and 99.7%, respectively).
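The rule can be verified with the `NormalDist` class from Python's standard `statistics` module (Python 3.8 or later). The helper `within_k_sd` is a made-up name for this sketch.

```python
from statistics import NormalDist

def within_k_sd(k, mu=0.0, sigma=1.0):
    """Proportion of N(mu, sigma) lying within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    # The percentages come out close to 68, 95, and 99.7.
    print(k, 100 * within_k_sd(k))
```

The result does not depend on the particular μ and σ, which is why the rule holds for every normal distribution.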
An observation's percentile is the percent of the distribution that is at or to the left of the observation.
If x is an observation from a distribution that has mean μ and SD σ, the standardized value of x is: z = (x - μ) / σ
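As a small illustration, the standardized value can be computed directly from the formula. The distribution (mean 100, SD 15) and the function name are assumptions for this example only.

```python
def standardize(x, mu, sigma):
    """Return the standardized value z = (x - mu) / sigma."""
    return (x - mu) / sigma

# Hypothetical distribution with mean 100 and standard deviation 15.
for x in (85, 100, 130):
    print(x, standardize(x, 100, 15))
```

A z-score says how many standard deviations an observation lies from the mean, and in which direction: negative below the mean, positive above it.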
Standard normal distribution
The SND is the normal distribution N(0,1) with mean 0 and SD 1. If a variable x has any normal distribution N(μ,σ) with mean μ and SD σ, then the standardized variable: z = (x - μ) / σ has the standard normal distribution.
The Standard Normal Table
Table A is a table of areas under the standard normal curve (SNC). The table entry for each value z is the area under the curve to the left of z.
Finding Normal Proportions
1. State the problem in terms of the observed variable x.
2. Standardize x to restate the problem in terms of a standard normal variable z. Draw a picture to show the area under the SNC.
3. Find the required area under the SNC, using Table A and the fact that the total area under the curve is 1.
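The three steps above can be sketched in Python, with `NormalDist().cdf` from the standard `statistics` module standing in for Table A (both give the area to the left of z). The heights example (mean 64.5, SD 2.5) and the function names are hypothetical.

```python
from statistics import NormalDist

SND = NormalDist()   # the standard normal distribution N(0, 1)

def proportion_below(x, mu, sigma):
    z = (x - mu) / sigma          # step 2: standardize x
    return SND.cdf(z)             # step 3: area to the left of z

def proportion_above(x, mu, sigma):
    # Total area under the curve is 1, so the right tail is 1 minus the left area.
    return 1 - proportion_below(x, mu, sigma)

def proportion_between(lo, hi, mu, sigma):
    return proportion_below(hi, mu, sigma) - proportion_below(lo, mu, sigma)

# Step 1: state the problem in terms of x, e.g. "what proportion of heights exceed 68?"
print(proportion_above(68, 64.5, 2.5))        # z = 1.4
print(proportion_between(62, 67, 64.5, 2.5))  # within one SD of the mean
```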
One can observe the shape of histograms, stemplots, and boxplots and see how well the data fit the 68-95-99.7 rule for normal distributions. A good method for assessing normality is to construct a normal probability plot.
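One way to build the points of a normal probability plot is sketched below. It pairs each ordered observation with the z-score expected at its position in a normal sample. The plotting position (i - 0.5)/n is an assumed convention (software packages differ), and `normal_scores` is a made-up name.

```python
from statistics import NormalDist

def normal_scores(data):
    """Return (expected z, observed value) pairs for a normal probability plot."""
    ordered = sorted(data)
    n = len(ordered)
    snd = NormalDist()
    # Assumed plotting position: the i-th smallest value sits at percentile (i - 0.5) / n.
    return [(snd.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(ordered, start=1)]

for z, x in normal_scores([5, 1, 3, 2, 4]):   # hypothetical data
    print(round(z, 3), x)
```

If the plotted points lie close to a straight line, the data are approximately normal; systematic curvature suggests skewness or heavy tails.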
A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual. The explanatory variable always goes on the x-axis: explanatory variable = x, response variable = y. If there is no explanatory-response relationship, either variable can go on the horizontal axis.
Overall pattern of scatterplots
In examining a scatterplot, look for an overall pattern showing the direction, form, and strength of the relationship and then for outliers or other deviations from this pattern.
If the relationship has a clear direction, we speak of either positive association (high values of the two variables tend to occur together) or negative association (high values of one variable tend to occur with low values of the other variable).
Linear relationships, where the points show a straight-line pattern, are an important form of relationship between two variables. Curved relationships and clusters are other forms to watch for.
Strength of a relationship
Determined by how close the points in the scatterplot lie to a simple form such as a line.
Correlation measures the strength and direction of the linear relationship between two quantitative variables. Correlation is usually written as r.
Suppose we have data on variables x and y for n individuals. The values for the first individual are x1 and y1, the values for the second individual are x2 and y2, and so on. The means and standard deviations of the two variables are xbar and Sx for the x-values and ybar and Sy for the y-values. The correlation r between x and y is:
r = [1/(n-1)] ∑ [((x - xbar) / Sx)((y - ybar) / Sy)]
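The formula can be translated term by term: standardize each x and each y, multiply the pairs, add them up, and divide by n - 1. The sketch below uses made-up data and hypothetical helper names.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

def correlation(xs, ys):
    """r = [1/(n-1)] * sum of (standardized x) * (standardized y)."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = sd(xs), sd(ys)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]                       # hypothetical data
print(correlation(xs, [2, 4, 5, 4, 5]))    # moderately strong positive association
print(correlation(xs, [2, 4, 6, 8, 10]))   # points exactly on a rising line
print(correlation(xs, [10, 8, 6, 4, 2]))   # points exactly on a falling line
```

Because both variables are standardized, r has no units and does not change when the units of x or y change.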
r only measures straight-line relationships.
A straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.
Mathematical model (model)
The LSRL is a model for the data. If the data seems to show a linear trend, then it would be appropriate to try and fit an LSRL to the data.
Least-Squares Regression Line
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Equation of the Least-Squares Regression Line
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means x-bar and y-bar and the standard deviations Sx and Sy of the two variables, and their correlation r. The least-squares regression line is the line
y hat = a + bx
b = r (Sy/Sx)
a = ybar - b(xbar)
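As a sketch, the slope and intercept can be computed from the five summary quantities exactly as the equations state. The data and the function name `lsrl` are assumptions for illustration.

```python
import math

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    r = sum(((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)) / (n - 1)
    b = r * sy / sx          # slope
    a = ybar - b * xbar      # intercept
    return a, b

xs = [1, 2, 3, 4, 5]         # hypothetical data
ys = [2, 4, 5, 4, 5]
a, b = lsrl(xs, ys)
print(a, b)                  # close to 2.2 and 0.6
print(a + b * 3)             # predicted y at x = 3
```

The intercept formula forces the line through the point (xbar, ybar): predicting at x = 3 (the mean of these x-values) returns the mean of the y-values, up to floating-point rounding.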
Plot the line
To plot the line by hand on a scatterplot, use the equation to find y hat for two values of x, one near each end of the range of x in the data. Plot each y hat above its x and draw the line through the two points.
Slope of the Least-Squares Regression Line
b = r (Sy/Sx)
This equation says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.
The sum of the squares of the deviations of the points about this regression line.
Sum of squares for error.
SSE = ∑(y - yhat)^2
When x is a poor predictor of y, the sum of squares of deviations about the mean ybar and the sum of squares of deviations about the regression line yhat would be approximately the same.
r^2 in Regression
The coefficient of determination, r^2, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
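r^2 = (SSM - SSE)/SSM, where SSM = ∑(y - ybar)^2 is the sum of squares of deviations about the mean and SSE = ∑(y - yhat)^2 is the sum of squares of deviations about the regression line.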
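A short numerical check of r^2 as a fraction of explained variation is sketched below. SSM here means the sum of squared deviations of y about its mean ybar, and SSE the sum of squared deviations about the fitted line. The slope is computed as ∑(x - xbar)(y - ybar) / ∑(x - xbar)^2, which is algebraically the same as r(Sy/Sx). Data and names are made up.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def r_squared(xs, ys):
    a, b = fit_line(xs, ys)
    ybar = sum(ys) / len(ys)
    ssm = sum((y - ybar) ** 2 for y in ys)                         # about the mean
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))      # about the line
    return (ssm - sse) / ssm

xs = [1, 2, 3, 4, 5]          # hypothetical data
ys = [2, 4, 5, 4, 5]
print(r_squared(xs, ys))      # about 0.6: the line explains roughly 60% of the variation
```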
A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,
residual = observed y - predicted y
= y - yhat
Plots the residuals on the vertical axis against the explanatory variable on the horizontal axis. Magnifies the residuals and makes patterns easier to see.
Residuals from least-squares regression have a special property: the mean of the residuals is always zero. If the calculator reports the sum as a small nonzero value such as -0.0002, that is because the software rounded to 4 decimal places.
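The zero-mean property can be checked on a small example. The sketch below fits the least-squares line to made-up data and sums the residuals; in floating-point arithmetic the sum may come out as a tiny nonzero number rather than exactly 0, for the same kind of roundoff reason.

```python
def residuals(xs, ys):
    """Residuals (observed y minus predicted y) from the least-squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5, 6]                  # hypothetical data
ys = [1.1, 2.3, 2.9, 4.2, 4.8, 6.4]
res = residuals(xs, ys)
print(res)
print(sum(res))                          # zero, up to roundoff
```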
Outliers and Influential Observations in Regression
An outlier is an observation that lies outside the overall pattern of the other observations in a scatterplot. An observation can be an outlier in the x direction, the y direction, or in both directions.
An observation is influential if removing it would markedly change the position of the regression line. Points that are outliers in the x direction are often influential.
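Influence can be illustrated by fitting the line with and without a suspect point. In the made-up data below, the last observation is an outlier in the x direction, and removing it changes the slope markedly.

```python
def slope(xs, ys):
    """Slope of the least-squares regression line of y on x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))

xs = [1, 2, 3, 4, 5, 20]     # hypothetical data; x = 20 is far from the other x-values
ys = [2, 4, 5, 4, 5, 2]

slope_with = slope(xs, ys)
slope_without = slope(xs[:-1], ys[:-1])
print(slope_with, slope_without)   # the slope even changes sign here
```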
Intercept of a Regression Line
The intercept a of a regression line yhat = a + bx is the predicted value yhat when the explanatory variable x = 0. This prediction is of no statistical use unless x can actually take values near 0.