Statistics Chapter 3
Procedure with 2 variable statistics
1.) Plot data and calculate numerical summaries
2.) Look for overall patterns and deviations from those patterns
3.) When there's a regular overall pattern, use a simplified model to describe it
Measures an outcome of a study
May help explain or influence changes in a response variable
Specific values of variables
It is easiest to identify explanatory and response variables when we actually specify values of one variable to see how it affects another variable.
Often we want to know whether changes in the explanatory variable cause changes in the response variable. Remember, correlation does NOT imply causation.
Graph for displaying relationship between two quantitative variables
Shows the relationship between two quantitative variables measured on the same individuals. The values of one variable (explanatory variable) appear on the horizontal axis and the values of the other variable (response variable) appear on the vertical axis. Each individual in the data appears as a point in the graph.
Always plot the explanatory variable, if there is one, on the horizontal axis (x axis) of the scatterplot. We usually call the explanatory variable x and the response variable y. If there is no explanatory-response distinction, either variable can go on the horizontal axis.
How to make a scatterplot
1.) Decide which variable should go on each axis
2.) Label and scale your axes
- Don't start at (0,0)
- Start scale to highlight main body of points
3.) Title your plot
4.) Plot individual data values
How to examine scatterplots
Look for the overall pattern and striking departures from that pattern.
1.) To describe the OVERALL PATTERN of a scatterplot, discuss the direction/trend, the form/shape, clusters and the strength of the relationship
2.) To describe DEPARTURES from the OVERALL PATTERN discuss outliers (an individual that falls outside the overall pattern of the relationship)
The general shape of the graph
Ex: linear relationships/curved relationships/outliers/clusters
Draw oval around data and find the slope of the major axis: negative slope means negative trend while positive slope means positive trend
If the relationship has a clear direction, we speak of positive association (high values of the two variables tend to occur together) or negative association (high values of one variable tend to occur with low values of the other variable)
How scattered is the data (based on the oval)
How close the points in a scatterplot lie to a simple form such as a line
- Thin hot dog shape = strong
- Football shape = moderate
- Basketball shape = weak
- Fan out = differs for different values of explanatory variable
There are a bunch of data points together
- Name ranges of each variable where cluster appears
There's a lot of white space around the data point
- Outlier in response variable (y)
- Outlier in explanatory variable
- Outlier in both
- Outlier b/c doesn't follow the overall pattern/trend
Positive association, negative association
Two variables have a positive association when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together.
Two variables have a negative association when above-average values of one tend to accompany below-average values of the other
Problems w/ positive and negative association
Not all relationships have a clear direction that we can describe as a positive association or negative association
Caution w/ scatterplots
Association does not imply causation because there may be other variables lurking in the background that contribute to the relationship between two variables
Linear relationships are important because a straight line is a simple pattern that is quite common; a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line.
Problem with judging linear relationships
Our eyes are not a good judge of the strength of a linear relationship. It is easy to be fooled by different scales or by the amount of space around the cloud of points. We need to use a numerical measure to supplement the graph. Correlation is the measure we use.
The correlation r measures the direction and strength of the linear relationship between two quantitative variables.
- The correlation r is always a number b/w -1 and 1.
- Correlation indicates the direction of a linear relationship by its sign: r > 0 for a positive association and r <0 for a negative association
- Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward -1 or 1.
- The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship, when the points lie exactly along a straight line
How to calculate the correlation
Suppose that we have data on variables x and y for n individuals. The means and the standard deviations of the two variables are xbar and sx for the x-values and ybar and sy for the y values. The correlation r between x and y is 1/(n-1) times the sum of the products of zx and zy
(Use calculator: 6 1 4)
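A minimal Python sketch of the same calculation by hand, assuming made-up data (the x and y values and the function name `correlation` are hypothetical, not from the chapter). It standardizes each value, multiplies the z-scores pairwise, and divides the sum by n-1:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of the products of the z-scores."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (n-1 divisor)
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(f"r = {r:.4f}")  # a moderately strong positive value, about 0.77
```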
Another meaning of correlation
The average of the products of the standardized scores
Problems with correlation calculation
A value of r close to 1 or -1 does not guarantee a linear relationship between two variables. A scatterplot with a clear curved form can have a correlation near 1 or -1. Always plot your data.
Facts about correlation: #1
Correlation makes no distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y in calculating correlation (you can multiply in any order)
Facts about correlation: #2
Because r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y, or both. Changing units (a linear transformation with a positive multiplier, such as inches to centimeters) does not affect r.
Facts about correlation: #3
The correlation r itself has no units of measurement. It is just a number.
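The first two facts can be checked numerically. The sketch below assumes made-up data and hypothetical unit conversions (multiplying x by 2.54, and rescaling y as 32 + 1.8y); it is only an illustration, not a proof:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of the products of the z-scores."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_original = correlation(x, y)

# Fact 1: swapping the roles of x and y gives the same r
r_swapped = correlation(y, x)

# Fact 2: changing units (assumed conversions) gives the same r
x_rescaled = [2.54 * v for v in x]
y_rescaled = [32 + 1.8 * v for v in y]
r_rescaled = correlation(x_rescaled, y_rescaled)

print(r_original, r_swapped, r_rescaled)  # equal up to rounding error
```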
Cautions: describing the relationship between two variables is more complex than describing the distribution of one variable
1.) Correlation requires that both variables be quantitative so that it makes sense to do the arithmetic indicated by the formula for r.
2.) Correlation measures the strength of only the linear relationship between two variables. Correlation does not describe curved relationships between variables no matter how strong the relationship is. A correlation of 0 doesn't guarantee that there's no relationship between two variables, just that there's no linear relationship.
3.) Like the mean and standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot.
4.) Correlation is not a complete summary of two-variable data, even when the relationship b/w the variables is linear. You should give the means and standard deviations of both x and y along with the correlation.
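Cautions 2 and 3 can be illustrated with a short Python sketch. All data here are made up: a perfect curved relationship (y = x squared) that gives r = 0, and a single hypothetical outlier (10, 0) that flips the sign of r:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of the products of the z-scores."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Caution 2: a perfect curved relationship with no linear relationship
x_curve = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x_curve]
r_curve = correlation(x_curve, y_curve)

# Caution 3: r is not resistant; one outlier changes it dramatically
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_clean = correlation(x, y)
r_outlier = correlation(x + [10], y + [0])  # add the hypothetical point (10, 0)

print(f"curved data: r = {r_curve:.2f}")
print(f"without outlier: r = {r_clean:.2f}, with outlier: r = {r_outlier:.2f}")
```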
General idea of regression lines
Model for the data: the equation of a regression line gives compact mathematical description of what the model tells us about the relationship between the response variable y and the explanatory variable x
Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form: yhat = a + bx
Y hat is the predicted value of the response variable y for a given value of the explanatory variable x
B is the slope; the amount by which y is predicted to change when x increases by one unit
*Coefficient of x is always the slope no matter what symbol is used
A is the y-intercept, the predicted value of y when x = 0
The importance of slope vs. y-intercept
The slope of a regression line is an important numerical description of the relationship between the two variables. Although we need the value of the y intercept to draw the line, it is statistically meaningful only when the explanatory variable can actually take values close to zero.
The size of slope
Small slope does not mean there is no relationship. The size of the slope depends on units in which we measure the two variables. You can't say how important a relationship is by looking at the size of the slope of the regression line (unlike correlation).
We can use a regression line to help predict the response (y hat) for a specific value of the explanatory variable x
Accuracy of predictions
The accuracy of predictions from a regression line depends on how much the data scatter about the line.
The use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.
*Few relationships are linear for all values of the explanatory variable. Don't make predictions using values of x that are much larger or much smaller than those that actually appear in your data.
X = 0 as an extrapolation
Often using the regression line to make a prediction for x = 0 is an extrapolation. That's why the y-intercept isn't always statistically meaningful.
Problems with regression lines
In most cases, no line passes exactly thru all the points in a scatterplot. Because we use the line to predict y from x, the prediction errors we make are errors in y, the vertical direction in the scatterplot. A good regression line makes vertical distances (residuals) of the points from the line as small as possible.
Difference between an observed value of the response variable and the value predicted by the regression line.
Residual = observed y - predicted y
= y - yhat
*Represents the leftover variation in the response variable after fitting the regression line
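A small Python sketch of the residual calculation, assuming made-up data and an assumed line yhat = 2.2 + 0.6x (both hypothetical, chosen only for illustration):

```python
# Hypothetical data and line, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # assumed y intercept and slope

def predict(x_value):
    """Predicted response: yhat = a + b*x."""
    return a + b * x_value

# residual = observed y - predicted y
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

for xi, yi, res in zip(x, y, residuals):
    side = "above" if res > 0 else "below"
    print(f"x={xi}: observed {yi}, predicted {predict(xi):.1f}, "
          f"residual {res:+.1f} (point is {side} the line)")
```

A positive residual means the observed value is higher than predicted; a negative residual means it is lower than predicted.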
Least-squares regression line
Least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible
(Squares b/c positive and negative residuals would otherwise cancel out)
Equation for the least-squares regression line
We have data on an explanatory variable x and a response variable y for n individuals. From the data we calculate the means xbar and ybar and the standard deviations sx and sy of the two variables and their correlation r. The least-squares regression line is the line yhat = a + bx with slope b = r(sy/sx) and y intercept a = ybar - b(xbar)
(xbar, y bar)
The least-squares regression line for any data set passes through the point (x bar, y bar)
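The slope and intercept formulas can be sketched in Python as follows, assuming made-up data (the data and the function names are hypothetical). The last lines check that the fitted line passes through (xbar, ybar):

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of the products of the z-scores."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for yhat = a + b*x, with b = r*(sy/sx) and a = ybar - b*xbar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
print(f"yhat = {a:.2f} + {b:.2f}x")

# The line passes through (xbar, ybar): predicting at x = xbar gives ybar
prediction_at_xbar = a + b * mean(x)
print(prediction_at_xbar, mean(y))
```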
Distance and standard deviations
For an increase of one standard deviation (sx) in the value of the explanatory variable x, the least-squares regression line predicts a change of r standard deviations (r times sy) in the response variable y
There is a close connection between correlation and the slope of the least-squares regression line
The slope equation says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. When the variables are perfectly correlated (r = 1 or r = -1) the change in the predicted response is the same (in standard deviations units) as the change in x. Otherwise, because -1 ≤ r ≤ 1, the change in y hat is less than the change in x. As the correlation grows less strong, the prediction moves less in response to changes in x.
When doing calculations, don't round until the end of the problem. Use as many decimal places as your calculator stores to get accurate values of the slope and y intercept.
What happens to the least squares regression line if we standardize both variables?
Standardizing a variable converts its mean to 0 and its standard deviation to 1. So (xbar, ybar) is transformed to (0,0), and the least-squares line for the standardized values will pass through (0,0). Since sx = sy = 1, the slope is equal to the correlation.
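A quick numerical check of this claim, as a Python sketch on made-up data (the data and helper names are hypothetical): standardize both variables, fit the least-squares line to the z-scores, and compare the slope with r.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of the products of the z-scores."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for yhat = a + b*x, with b = r*(sy/sx) and a = ybar - b*xbar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

def standardize(values):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
a_std, b_std = least_squares_line(standardize(x), standardize(y))
print(f"r = {r:.4f}, standardized slope = {b_std:.4f}, intercept = {a_std:.4f}")
```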
Overall pattern vs. departures
One common method of data analysis is looking for an overall pattern and for striking departures from the pattern. A regression line describes the overall pattern of a linear relationship between an explanatory variable and a response variable. We see departures from this pattern by looking at the residuals.
Benefits of residuals
Residuals show how far the data fall from the regression line and thus help us assess how well the line fits/describes the data. Residuals can be calculated from any model fitted to data. However, residuals from the least-squares line have a special property: the mean of the least-squares residuals is always zero.
A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
Interpret a residual
The residual says ___ than predicted by the least squares regression line
Examining residual plots
A residual plot turns the regression line horizontal. It magnifies deviations of the points from the line, making it easier to see unusual observations and patterns. If the regression line captures the overall pattern of the data, there should be no pattern in the residuals.
Important things to look for when you examine a residual plot: #1
The residual plot should show no obvious patterns. Ideally, the graph shows an unstructured (Random) scatter of points in a horizontal band centered at zero.
A curved pattern in a residual plot shows that the relationship is not linear.
If the spread about the regression line increases for larger/smaller values of x, predictions of y using this line will be less accurate for these values of x.
Important things to look for when you examine a residual plot: #2
Residuals should be relatively small in size. A regression line that fits the data well should come "close" to most of the points. How do we decide whether residuals are small enough? We consider the size of a "typical" prediction error.
Another name for residual
S (standard deviation of the residuals)
We know the average prediction error (mean of residuals) is 0 when using a least-squares regression line since positive and negative residuals cancel. That's why we use standard deviation to find the approximate size of a "typical" or "average" prediction error (residual).
If we use a least-squares line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by s = square root of [(sum of squared residuals)/(n-2)]
s = square root of [(sum of (y - yhat) squared)/(n-2)]
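A Python sketch of the calculation of s, assuming made-up data and the line yhat = 2.2 + 0.6x, which is the least-squares line for these hypothetical points. It also checks that the mean residual is zero, which is why s is used as the measure of a typical prediction error:

```python
from math import sqrt
from statistics import mean

# Hypothetical data, with its least-squares line yhat = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
n = len(x)

# Mean residual is zero for a least-squares line (up to rounding error)
mean_residual = mean(residuals)

# s = square root of (sum of squared residuals) / (n - 2)
s = sqrt(sum(res ** 2 for res in residuals) / (n - 2))

print(f"mean residual = {mean_residual:.4f}")
print(f"s = {s:.4f}")  # typical prediction error, in the units of y
```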
Interpret s in context
The average residual/prediction error for predicting the response variable is __ using the least squares line
What is another way to see how well a least squares line fits our data
r2 (the coefficient of determination) tells us how well the least-squares line predicts the values of the response variable
How can we predict y if we don't know x
Use the mean of the response variable
Measures the total variation in the y values (the sum of squared deviations of y from ybar). It is a constant multiple of the variance.
Measures the sum of squared errors
The ratio SSE/SST
Tells us what proportion of the total variation in y still remains after using the regression line to predict values of the response variable (interpret: ___ of the variation in __response variable___ is unaccounted for by the linear model relating y to x)
The coefficient of determination
The fraction of the variation in the values of y that is accounted for by the least squares regression line of y on x. We can calculate r2:
r2 = 1 - SSE/SST
where SSE = sum of the squared residuals, (y - yhat) squared, and SST = sum of the squared deviations of the observations from their mean, (y - ybar) squared
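The r2 formula can be sketched in Python as below, assuming made-up data and the line yhat = 2.2 + 0.6x (the least-squares line for these hypothetical points). SST is the squared error from predicting every y with ybar; SSE is the squared error from predicting with the line:

```python
from statistics import mean

# Hypothetical data, with its least-squares line yhat = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

ybar = mean(y)

# SSE: sum of squared residuals from the regression line
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# SST: sum of squared deviations from the mean (predicting with y = ybar)
sst = sum((yi - ybar) ** 2 for yi in y)

r2 = 1 - sse / sst
print(f"SSE = {sse:.2f}, SST = {sst:.2f}, r2 = {r2:.2f}")
```

With these numbers, about 60% of the variation in y is accounted for by the line and about 40% (SSE/SST) is left over.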
When is r2 = 1
If all the points fall directly on the least-squares line, SSE = 0 and r2 = 1. Then all of the variation in y is accounted for by the linear relationship with x.
SSE > SST
Since the least-squares line yields the smallest possible sum of squared prediction errors, SSE can never be more than SST, which is based on the line y = ybar. In the worst-case scenario, the least-squares line does no better at predicting y than y = ybar does. Then SSE = SST and r2 = 0
Correlation coefficient and determination coefficient
The determination coefficient is the correlation coefficient squared! There is a relationship between correlation and regression. When reporting regression, find r to note strength of linear relationship. When reporting correlation, find r2 to note how successful the regression was in explaining the response.
Interpreting computer regression output
Constant coefficient: y intercept
Variable coefficient: slope
Variable: explanatory variable name
S: standard deviation of residuals
R-sq: determination coefficient
Interpret regression line
_ % of the variation in the (response variable) is accounted for by the regression line
How to determine if a line is an appropriate model to use for the data
1.) Residual plot (scattered/no pattern)
2.) Small residuals
3.) Find S
4.) Find r2
2 tools to describe the relationship between variables
Limitations of regression and correlation: 1
For regression, the distinction b/w explanatory and response variables is important. Least-squares regression makes distance of data points from line small only in y direction. If we reverse role of two variables we get a different line. This is not true for correlation.
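A Python sketch of this limitation, assuming made-up data (the data and function names are hypothetical). It fits y on x, then x on y, and compares the two lines on the same axes; the correlation is the same either way:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of the products of the z-scores."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for predicting ys from xs: yhat = a + b*x."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_yx, b_yx = least_squares_line(x, y)  # predict y from x
a_xy, b_xy = least_squares_line(y, x)  # predict x from y (roles reversed)

# Drawn on the same axes (y vertical), the reversed line has slope 1 / b_xy
slope_reversed_on_same_axes = 1 / b_xy

print(f"y on x: yhat = {a_yx:.2f} + {b_yx:.2f}x")
print(f"x on y: xhat = {a_xy:.2f} + {b_xy:.2f}y")
print(f"slopes on the same axes: {b_yx:.2f} vs {slope_reversed_on_same_axes:.2f}")
```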
Limitations of regression and correlation: 2
Correlation and regression lines describe only linear relationship. You can calculate correlation and least-squares line for any relationship b/w 2 quantitative variables, but the results are only useful if the scatterplot shows a linear pattern (always plot your data!)
Limitations of regression and correlation: 3
Correlation and least-squares regression lines are not resistant. They are affected by outliers.
How to determine if relationship b/w explanatory and response variable
- Make scatterplot and look for overall pattern; if linear, find regression line and plot it
- Look at size of residuals
- Look at residual plot
- Find r2 and s to determine how well the line describes data and how large our prediction errors will be
An outlier is an observation that lies outside the overall pattern of other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers (large in x direction but not y direction) may not have large residuals.
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation; points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line
Are all outliers influential?
The least-squares line is most likely to be heavily influenced by observations that are outliers in x. Influential points often have small residuals because they pull the regression line toward themselves. The scatterplot alerts you to these (don't rely on the residual plot alone b/c you may miss influential points)
How to verify if a point is influential
Find the regression line both with and without the unusual point. If the line moves more than a small amount when point is deleted, the point is influential
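This check can be sketched in Python, assuming made-up data with one hypothetical unusual point at (12, 1), an outlier in the x direction. The line is fitted with and without that point and the slopes are compared:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of the products of the z-scores."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for yhat = a + b*x, with b = r*(sy/sx) and a = ybar - b*xbar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical data; the last point (12, 1) is the unusual one
x = [1, 2, 3, 4, 5, 12]
y = [2, 4, 5, 4, 5, 1]

a_with, b_with = least_squares_line(x, y)
a_without, b_without = least_squares_line(x[:-1], y[:-1])  # drop the unusual point

print(f"with point:    yhat = {a_with:.2f} + {b_with:.2f}x")
print(f"without point: yhat = {a_without:.2f} + {b_without:.2f}x")
```

Here the slope changes sign when the point is removed, so under these assumed numbers the point is clearly influential.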
Association does not imply causation
An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
*Sometimes association is due to cause and effect but other times it is due to lurking variables
The correlation is real but the conclusion that changing one variable causes a change in the other variable is nonsense.
Correlation vs. association
It only makes sense to talk about correlation between two quantitative variables. If one or both variables are categorical, you should refer to the association b/w them. To be safe, use "association" when describing relationship b/w 2 variables.