# Statistics Vocab ch7-10

## 64 terms

### Explanatory variable

independent/input variable, is plotted on the x-axis

### Response variable

dependent/output, is plotted on the y-axis

### Form

straight, curved, no pattern, other

### Direction

positive or negative slope

### Strength

how much scatter, how closely points follow the form

### Unusual Features

outliers, clusters, subgroups

### Negative Association

increases in one variable generally correspond to decreases in the other

### Positive Association

increases in one variable generally correspond to increases in the other

### Correlation describes

the strength and direction of the linear relationship between two quantitative variables, without significant outliers

### Correlation ranges from

-1 to +1

### Units of correlation

None

### Correlation is not affected by changes in

scale and order

### 3 conditions needed for Correlation

Quantitative Variables, Straight Enough, No Outliers

### Correlation coefficient is found by

finding the average product of the z-scores
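A minimal sketch of this calculation in Python, assuming the usual textbook convention that the "average" divides by n - 1 and that z-scores use the sample standard deviation. The function name `correlation` and the data values are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the (n - 1)-averaged product of paired z-scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 cases")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    # Multiply each case's z-score for x by its z-score for y
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up illustrative data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r = correlation(hours, scores)
print(round(r, 3))
```

Points that lie exactly on a rising line, such as `correlation([1, 2, 3], [2, 4, 6])`, give a value of 1 up to rounding error.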

### Association

a deliberately vague term describing the relationship between two variables

### To describe association on a scatterplot, tell the...

Form, Direction, Strength, Unusual Features

### Scatterplot

Shows the relationship between two quantitative variables measured on the same cases (individuals)

+1 or -1

### Perfect correlation only occurs when...

the points lie exactly on a straight line (you can perfectly predict one variable knowing the other)

0

### No correlation means...

knowing about one variable will give no information about the other variable

### These should be given with the correlation

Mean and Standard deviation of both x and y

### Mean and standard deviation of x and y must be given with correlation because...

Correlation is not a complete description of two-variable data and its formula uses means and standard deviations in the z-scores

### Correlation does not imply

Causation

### Lurking variable

A variable other than x and y that simultaneously affects both variables, accounting for the correlation between the two

### To add a categorical variable to an existing scatterplot...

use a different plot color or symbol for each category

### Residual

Observed value - predicted value (y - ŷ)

### Positive residual means...

the model makes an underestimate

### Negative residual means...

the model makes an overestimate
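The sign convention for residuals can be sketched in a few lines of Python. The model ŷ = 2 + 3x and the two data points are hypothetical, chosen only to show one residual of each sign.

```python
def predict(x):
    """Hypothetical fitted model, y-hat = 2 + 3x (illustration only)."""
    return 2 + 3 * x

def residual(x, observed_y):
    """Residual = observed value - predicted value."""
    return observed_y - predict(x)

# Observed 6 where the model predicts 5: positive residual, an underestimate
under = residual(1, 6)
# Observed 7 where the model predicts 8: negative residual, an overestimate
over = residual(2, 7)
print(under, over)
```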

### Regression to the mean happens because...

the correlation is less than 1.0 in magnitude unless the fit is perfect, so each predicted ŷ tends to be fewer standard deviations from its mean than its corresponding x was from its mean

### Regression line (Line of best fit)

The unique line that minimizes the sum of the squared residuals (and therefore the variance of the residuals)

### For standardized values of regression line use...

(predicted Zy) = rZx [predicted z-score of y = correlation * z-score of x]
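As a small illustration of the standardized form, the sketch below applies it to assumed values of r and z-scores. The function name `predicted_z_y` is hypothetical.

```python
def predicted_z_y(r, z_x):
    """Standardized regression line: predicted z-score of y = r * z-score of x."""
    if not -1 <= r <= 1:
        raise ValueError("correlation must be between -1 and +1")
    return r * z_x

# Assumed values: r = 0.5, and an x-value 2 standard deviations above its mean.
# The prediction for y is only 1 standard deviation above its mean.
z_hat = predicted_z_y(0.5, 2.0)
print(z_hat)
```

Because |r| is at most 1, the predicted z-score of y is never farther from 0 than the z-score of x.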

### Regression line equation

ŷ = b0 + b1x

### To calculate the regression line in real units (actual x and y values)...

1) Find slope, b1 = r * Sy / Sx
2) Find y-intercept: plug b1 and a point (x, y) [usually (x̄, ȳ)] into ŷ = b0 + b1x and solve for b0
3) Plug slope, b1, and y-intercept, b0, into ŷ = b0 + b1x
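The three steps above can be sketched in Python. The summary statistics passed in are made-up values and the function name `regression_line` is hypothetical. The intercept uses the point (x̄, ȳ), which always lies on the least squares line.

```python
def regression_line(r, s_x, s_y, x_bar, y_bar):
    """Return (b0, b1) for y-hat = b0 + b1 * x from summary statistics."""
    b1 = r * s_y / s_x         # step 1: slope
    b0 = y_bar - b1 * x_bar    # step 2: solve y_bar = b0 + b1 * x_bar for b0
    return b0, b1

# Made-up summary statistics for illustration
b0, b1 = regression_line(r=0.5, s_x=2.0, s_y=8.0, x_bar=10.0, y_bar=50.0)

def predict(x):
    """Step 3: the fitted line in real units."""
    return b0 + b1 * x

print(b0, b1, predict(12.0))
```

With these inputs the slope is 0.5 * 8 / 2 = 2 and the intercept is 50 - 2 * 10 = 30, so the line is ŷ = 30 + 2x.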

### R^2

The square of the correlation, r, between x and y. It measures the success of the regression model as the fraction of the variation in y accounted for by the model (R^2 is usually reported as a percent)
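A short sketch of R^2 computed directly as the fraction of variation accounted for, 1 - (sum of squared residuals) / (total variation in y). For a least squares line this equals the square of r. The data values are made up, and the function assumes the y-values are not all identical.

```python
from statistics import mean

def r_squared(xs, ys):
    """Fraction of the variation in y accounted for by the least squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Variation left over in the residuals
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Made-up illustrative data
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
print(f"R^2 = {r_squared(hours, scores):.1%}")
```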

### 3 conditions needed for Linear Regression Models

1. Quantitative Variables
2. Straight Enough - check original scatterplot & residual scatterplot
3. Outlier Condition - no points with large residuals and/or high leverage (also watch for clusters)

### A high R^2 does not mean...

regression is appropriate

### To check the straight enough condition, look at the...

scatterplot of the residuals vs. the x-values

### For a regression line to be appropriate, the residual plot should be...

boring; uniform scatter with no direction, shape, or outliers

### The key to assessing how well the model fits is...

the variation in the residuals

### Standard deviation of the residuals, Se

Gives a measure of how much the points spread around the regression line
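A minimal sketch of the calculation, assuming the common textbook formula Se = sqrt(sum of squared residuals / (n - 2)). The n - 2 divisor and the function name `residual_sd` are assumptions, and the residuals are made up.

```python
from math import sqrt

def residual_sd(residuals):
    """Standard deviation of the residuals, using an n - 2 divisor."""
    n = len(residuals)
    if n < 3:
        raise ValueError("need at least 3 residuals for an n - 2 divisor")
    return sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Made-up residuals from a hypothetical fitted line
s_e = residual_sd([1.0, -1.0, 1.0, -1.0])
print(round(s_e, 3))
```

The result is in the same units as y, so it can be read as a typical distance of the points from the line.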

### 1 - R^2 is...

the fraction of the original variation left in the residuals (the percentage of variability not explained by the regression line)

### Extrapolations

Dubious predictions of y-values based on x-values outside the range of the original data

### Some regression problems are

1. Inferring Causation
2. Extrapolation
3. Outliers and Influential Points
4. Change in Scatterplot Pattern
5. Using means (or other summaries) rather than actual data

### High leverage points

Have x-values far from x̄ ((x̄, ȳ) is the fulcrum) and pull more strongly on the regression line

small

### Extreme Conformers

points that do not influence the model but do inflate R^2

### Large Residuals

points that might not influence the model much but are not consistent with the overall form

### Influential Points

points that distort the model

### Three kinds of outliers, classified by leverage and residual

1) Extreme Conformers
2) Large Residuals
3) Influential Points

### Omitting an influential point from the data...

results in a very different regression model

### Influential points are hard to detect because...

They distort the model so much that their residuals become very small

### Best way to verify an outlier and its effect is to...

Calculate the regression line with and without the suspect point
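A rough sketch of this comparison in Python. The data are made up: four points lie exactly on y = 2x and one suspect point has a far-away x-value. The helper name `fit_line` is hypothetical.

```python
from statistics import mean

def fit_line(points):
    """Least squares fit; returns (b0, b1) for y-hat = b0 + b1 * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up data: the last point has high leverage (x far from the others)
points = [(1, 2), (2, 4), (3, 6), (4, 8), (20, 5)]
suspect = (20, 5)

with_point = fit_line(points)
without_point = fit_line([p for p in points if p != suspect])

print("with suspect point:    b0 = %.2f, b1 = %.2f" % with_point)
print("without suspect point: b0 = %.2f, b1 = %.2f" % without_point)
```

In this made-up data set the slope is 2 without the suspect point and about 0.04 with it, which marks the point as influential.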

### A histogram of the residuals...

Complements a scatterplot of the residuals in the search for conditions that may compromise the effectiveness of the regression model

### Consider comparing two+ regressions if you find...

1) Points with large residuals and/or high leverage.
2) Change in Scatterplot Pattern as a result of changes over time or subsets that behave differently.

### Regressions of summaries of the data tend to look...

stronger than the regression on the original data

### Regression is stronger when using summaries of data because...

Summary statistics are less variable than the underlying data

### Re-expression

A means of altering the data to achieve the conditions or structure needed to use particular summaries or models

### Ladder of Powers

Orders the effects that the re-expressions have on the data

### The reasons to consider a re-expression are...

1. Make the form of a scatterplot straighter
2. Make the scatter in a scatterplot more consistent (not fan shaped)
3. Make the distribution of a variable (histogram) more symmetric.
4. Make the spread across different groups (box plots) more similar

taking logs

### If all else fails for the Ladder of Powers...

Take logs of both variables (log x and log y)
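To see why logging both variables can help, here is a small sketch with made-up data that follow a power curve, y = 3x^2. On the original scale the plot bends, but log y plotted against log x is a straight line whose slope is the power. The helper name `slope` is hypothetical.

```python
from math import log10
from statistics import mean

def slope(xs, ys):
    """Least squares slope of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Made-up curved data: y = 3 * x^2
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 2 for x in xs]

# Re-express both variables with base 10 logs
log_xs = [log10(x) for x in xs]
log_ys = [log10(y) for y in ys]

log_log_slope = slope(log_xs, log_ys)
print(round(log_log_slope, 3))
```

Since log y = log 3 + 2 log x for these data, the fitted slope on the log-log scale comes out at 2 up to rounding error.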

### Base 10 logs are roughly...

One less than the number of digits needed to write the number
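A quick check of this rule of thumb in Python, assuming positive whole numbers. Rounding the base 10 log down gives the digit count minus one. The function name `digits_minus_one` and the sample values are made up.

```python
from math import floor, log10

def digits_minus_one(n):
    """Number of digits needed to write a positive whole number, minus one."""
    return len(str(n)) - 1

for n in (7, 42, 1234, 98765):
    # The whole-number part of log10(n) matches the digit count minus one
    print(n, round(log10(n), 2), floor(log10(n)), digits_minus_one(n))
```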

### Re-expression limitations

1. Can't straighten scatterplots that turn around
2. Can't re-express negative data values (add a constant to shift all values above 0)
3. Minimal effect on data values far from the range 1 to 100 (subtract a constant to shift them closer)
4. Can't unify multiple modes