independent/input variable, is plotted on the x-axis
dependent/output, is plotted on the y-axis
straight, curved, no pattern, other
positive or negative slope
how much scatter, how closely points follow the form
outliers, clusters, subgroups
increases in one variable generally correspond to decreases in the other
increases in one variable generally correspond to increases in the other
the strength and direction of the linear relationship between two quantitative variables, without significant outliers
Correlation value range
-1 to +1
Units of Correlation
Correlation is immune to changes of...
scale and order
3 conditions needed for Correlation
Quantitative Variables, Straight Enough, No Outliers
Correlation coefficient is found by
finding the average product of the z-scores
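As a minimal sketch of the "average product of the z-scores" calculation, the Python below computes r by hand. The function name `correlation` and the sample data are illustrative assumptions. The average divides by n - 1, matching the sample standard deviation used for the z-scores.

```python
import math

def correlation(xs, ys):
    """Return r as the (n - 1)-averaged product of paired z-scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    zx = [(x - mean_x) / sx for x in xs]
    zy = [(y - mean_y) / sy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = correlation(hours, scores)
r_swapped = correlation(scores, hours)                      # x and y exchanged
r_rescaled = correlation([60 * h for h in hours], scores)   # hours -> minutes
```

All three values should agree up to floating-point rounding, which illustrates that r is unaffected by swapping the variables or changing units.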
a deliberately vague term describing the relationship between two variables
To describe association on a scatterplot, tell the...
Form, Direction, Strength, Unusual Features
Shows the relationship between two quantitative variables on the same cases (individuals)
In perfect correlation, r =
+1 or -1
Perfect correlation only occurs when...
the points lie exactly on a straight line (you can perfectly predict one variable knowing the other)
In no correlation, r =
0
No correlation means...
knowing about one variable will give no information about the other variable
These should be given with the correlation
Mean and Standard deviation of both x and y
Mean and standard deviation of x and y must be given with correlation because...
Correlation is not a complete description of two-variable data and its formula uses means and standard deviations in the z-scores
Scatterplots and correlation coefficients never prove...
causation
A variable other than x and y that simultaneously affects both variables, accounting for the correlation between the two
To add a categorical variable to an existing scatterplot...
use a different plot color or symbol for each category
Observed value - predicted value (y - ŷ)
Positive residual means...
the model makes an underestimate
Negative residual means...
the model makes an overestimate
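The sign convention for residuals can be sketched in a few lines of Python. The helper names `residual` and `describe` and the numbers are hypothetical, chosen only to illustrate the definition.

```python
def residual(observed_y, predicted_y):
    """Residual = observed value minus predicted value (y - y-hat)."""
    return observed_y - predicted_y

def describe(res):
    """Say whether the model under- or overestimated the observed value."""
    if res > 0:
        return "underestimate"
    if res < 0:
        return "overestimate"
    return "exact"

# Hypothetical case: the model predicts 70, the observed value is 76.
e = residual(76, 70)
label = describe(e)
```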
Regression to the mean occurs because...
the correlation is less than 1.0 in magnitude unless the fit is perfect, so each predicted ŷ tends to be fewer standard deviations from its mean than its corresponding x was from its mean
Regression line (Line of best fit)
The unique line that minimizes the variance of the residuals (sum of the squared residuals)
For standardized values of regression line use...
(predicted Zy) = rZx [predicted z-score of y = correlation * z-score of x]
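A small sketch of the standardized form, assuming a correlation of 0.5 and an x-value two standard deviations above its mean (both numbers are made up for illustration). The function name `predicted_zy` is hypothetical.

```python
def predicted_zy(r, zx):
    """Predicted z-score of y equals the correlation times the z-score of x."""
    return r * zx

r = 0.5    # assumed correlation
zx = 2.0   # x is 2 standard deviations above its mean

zy_hat = predicted_zy(r, zx)
# Whenever |r| < 1, the prediction sits closer to the mean
# (in standard deviations) than x did.
closer_to_mean = abs(zy_hat) < abs(zx)
```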
For actual x and y values of regression line use...
ŷ = b0 + b1x
To calculate the regression line in real units (actual x and y values)...
1) Find slope, b1 = r * Sy / Sx
2) Find y-intercept, plug b1 and point (x, y) [usually (x̄, ȳ)] into ŷ = b0 + b1x and solve for b0
3) Plug in slope, b1, and y-intercept, b0, into ŷ = b0 + b1x
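The steps above can be sketched in Python as follows. The function name `regression_line` and the data are assumptions for illustration. The intercept is found by plugging in the point of means, as the second step suggests.

```python
import statistics

def regression_line(xs, ys):
    """Return (b0, b1) for y-hat = b0 + b1 * x using b1 = r * sy / sx."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Correlation as the (n - 1)-averaged product of z-scores.
    r = sum((x - mean_x) / sx * (y - mean_y) / sy
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * sy / sx              # step 1: slope
    b0 = mean_y - b1 * mean_x     # step 2: line passes through the means
    return b0, b1

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = regression_line(xs, ys)
y_hat_at_3 = b0 + b1 * 3          # step 3: use the fitted equation
```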
The square of the correlation, r, between x and y; The success of the regression model in terms of the fraction of the variation of y accounted for by the model (R^2 is a percent)
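To illustrate the two readings of R^2, the sketch below computes it once as the square of r and once as 1 minus the ratio of residual variation to total variation in y. The function name `r_squared_two_ways` and the data are hypothetical.

```python
def r_squared_two_ways(xs, ys):
    """Return (r**2, 1 - SSE/SST) for the least squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in y (SST)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Variation left over in the residuals (SSE).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5
    return r ** 2, 1 - sse / syy

# Hypothetical data.
from_r, from_variation = r_squared_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

For a least squares line the two numbers should match up to rounding.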
3 conditions needed for Linear Regression Models
1. Quantitative Variables
2. Straight Enough - check original scatterplot & residual scatterplot
3. Outlier Condition (also watch for clusters) - no points with large residuals and/or high leverage
A high R^2 does not mean...
regression is appropriate
To check the straight enough condition, look at the...
scatterplot of the residuals vs. the x-values
For a regression line to be appropriate, the residual plot should be...
boring; uniform scatter with no direction, shape, or outliers
The key to assessing how well the model fits is...
the variation in the residuals
Standard deviation of the residuals, Se
Gives a measure of how much the points spread around the regression line
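A minimal sketch of Se in Python, assuming the common convention of dividing the sum of squared residuals by n - 2. The function name `residual_sd` and the data are illustrative.

```python
import math

def residual_sd(xs, ys):
    """Standard deviation of the residuals around the least squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    # Assumed convention: divide by n - 2 (two estimated coefficients).
    return math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Hypothetical data.
se = residual_sd([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```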
1 - R^2 is...
the fraction of the original variation left in the residuals (the percentage of variability not explained by the regression line)
Dubious predictions of y-values based on x-values outside the range of the original data
Some regression problems are
1. Inferring Causation
3. Outliers and Influential Points
4. Change in Scatterplot Pattern
5. Using means (or other summaries) rather than actual data
High leverage points
Have x-values far from x̄ ((x̄, ȳ) is the fulcrum) and pull more strongly on the regression line
The residuals for high leverage points are...
points do not influence model but do inflate R^2
points might not influence the model much but are not consistent with the overall form
points that distort the model
Three kinds of outliers, by leverage and residual
1) Extreme Conformers
2) Large Residuals
3) Influential Points
Omitting an influential point from the data...
results in a very different regression model
Influential points are hard to detect because...
They distort the model so much that their residual becomes very small
Best way to verify an outlier and its effect is to...
Calculate the regression line with and without the suspect point
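The with-and-without comparison might look like this in Python. The data are made up: five points on a perfect line plus one suspect point with a far-out x-value. The helper name `fit_line` is hypothetical.

```python
def fit_line(points):
    """Least squares fit; returns (intercept, slope) for a list of (x, y)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: y = 2x for the first five cases.
data = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
suspect = (20, 5)   # high leverage and far off the pattern

b0_with, b1_with = fit_line(data + [suspect])
b0_without, b1_without = fit_line(data)
slope_change = b1_without - b1_with
```

A large change in slope or intercept between the two fits is the sign that the suspect point is influential.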
A histogram of the residuals...
Complements a scatterplot of the residuals in the search for conditions that may compromise the effectiveness of the regression model
Consider comparing two+ regressions if you find...
1) Points with large residuals and/or high leverage.
2) Change in Scatterplot Pattern as a result of changes over time or subsets that behave differently.
Regressions of summaries of the data tend to look...
stronger than the regression on the original data
Regression is stronger when using summaries of data because...
Summary statistics are less variable than the underlying data
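To see why summaries look stronger, the sketch below compares the correlation on individual cases with the correlation on group means. The data are invented so that each x-value has three scattered y-values. The function name `correlation` is an assumption.

```python
import math
from collections import defaultdict

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical raw data: three cases at each x-value.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ys = [1, 5, 3, 2, 6, 4, 3, 7, 5]

# Summarize: replace each group of cases by its mean y.
groups = defaultdict(list)
for x, y in zip(xs, ys):
    groups[x].append(y)
group_xs = sorted(groups)
group_means = [sum(groups[x]) / len(groups[x]) for x in group_xs]

r_raw = correlation(xs, ys)
r_means = correlation(group_xs, group_means)
```

Averaging removes the scatter within each group, so the group means line up far more tightly than the individual cases do.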
A means of altering the data to achieve the conditions/ structure necessary to utilize particular summaries or models
Ladder of Powers
Orders the effects that the re-expressions have on the data
The reasons to consider a re-expression are...
1. Make the form of a scatterplot straighter
2. Make the scatter in a scatterplot more consistent (not fan shaped)
3. Make the distribution of a variable (histogram) more symmetric.
4. Make the spread across different groups (box plots) more similar
A good starting point for the Ladder of Powers is...
If all else fails for the Ladder of Powers...
Use 2 logs (log x and log y)
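As a hedged illustration of the "two logs" re-expression, the sketch below uses invented data that follow a power curve (y = 3x^2) and compares the correlation before and after taking base-10 logs of both variables. The function name `correlation` is an assumption.

```python
import math

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical curved data: y = 3 * x**2.
xs = list(range(1, 11))
ys = [3 * x ** 2 for x in xs]

# Re-express both variables with base-10 logs.
log_xs = [math.log10(x) for x in xs]
log_ys = [math.log10(y) for y in ys]

r_raw = correlation(xs, ys)
r_log = correlation(log_xs, log_ys)
```

For power-law data like this, log y is a straight-line function of log x, so the re-expressed scatterplot is straight even though the original bends.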
Base 10 logs are roughly...
One less than the number of digits needed to write the number
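The digit-counting rule can be checked quickly in Python for positive whole numbers. The sample values and the name `digits_minus_one` are arbitrary choices for illustration.

```python
import math

def digits_minus_one(n):
    """One less than the number of digits needed to write a positive integer."""
    return len(str(n)) - 1

# Arbitrary sample values.
samples = [7, 42, 5280, 123456]

# Pair each value with (whole part of its base-10 log, digits minus one).
comparison = [(n, math.floor(math.log10(n)), digits_minus_one(n))
              for n in samples]
```

In each tuple the whole-number part of the log equals the digit count minus one. The log itself also carries a fractional part, which is why the rule is only approximate.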
1. Can't straighten scatterplots that turn around
2. Can't re-express negative data values (workaround: add a constant to shift all values above 0)
3. Minimal effect on data values far from the range 1-100 (workaround: subtract a constant to shift the values closer to that range)
4. Can't unify multiple modes