# Ch. 5 AP Statistics (Summarizing Bivariate Data)

## 23 terms

### Pearson Correlation Coefficient

The most popular measure of correlation. Indicates the magnitude and direction of a linear relationship between two variables, on a scale from -1 to 1.
It uses the mean and standard deviation to transform the original scores into the number of standard deviations from the mean (z-scores), and is computed as r = Σ(ZxZy)/(n-1)
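
A minimal sketch of this z-score formula in Python (the function name and data are illustrative, not from the card set):

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation: r = sum(z_x * z_y) / (n - 1),
    using the sample standard deviation (n - 1 in the denominator)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    zx = [(xi - mx) / sx for xi in x]       # z-scores for x
    zy = [(yi - my) / sy for yi in y]       # z-scores for y
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # ≈ 0.7746
```

A perfectly linear relationship (e.g., y = 2x) gives r = 1 exactly, matching the -1 to 1 scale described above.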

### Negative Relationship

A relationship in which the values of one variable increase as the values of another variable decrease (inverse)

### Correlation coefficient

A numerical assessment of the strength of the relationship between the x and y values in a set of (x, y) pairs; a statistical index of the relationship between two variables (from -1 to +1).

### Positive relationship

A relationship in which increases in the values of the first variable are accompanied by increases in the values of the second variable.

### Population Correlation Coefficient

An analogous measure of how strongly x and y are related in the entire population of pairs from which the sample was obtained. Represented by ρ (rho). Parallels r.

### Regression analysis

A mathematical approach for fitting an equation to a set of data to make quantitative predictions of one variable from the values of another. Think Sir Francis Galton.

### Least Squares Line

Also known as the sample regression line or line of best fit, it is the line that minimizes the sum of the squares of the vertical distances from the actual points to the line. Represented by the symbol ŷ.
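
A minimal sketch of fitting this line, assuming the standard least-squares formulas b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄ (function name and data are illustrative):

```python
from statistics import mean

def least_squares(x, y):
    """Fit y-hat = a + b*x by minimizing the sum of the squared
    vertical distances from the points to the line."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))   # slope
    a = my - b * mx                            # y-intercept
    return a, b

a, b = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # a = 2.2, b = 0.6
```

Predictions are then ŷ = a + b·x for any x inside the observed range (recall the danger of extrapolation below).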

### Sum of squared deviations

The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data (x1, y1), ..., (xn, yn). It is the sum, over all points, of the squared vertical deviations about the line: Σ[y - (a + bx)]^2.

### Danger of extrapolation

Calculating the value of a function outside the range of the observed data, especially with a regression line. Dangerous because the fitted relationship may not hold beyond the range of x values used to build the model.

### Sample Regression Line

The estimate of the true (population) regression line; gives the "best fit" of the sample data, estimated using the method of least squares. This terminology is frequently used because of the relationship between the least squares line and Pearson's correlation coefficient.

### Sir Francis Galton

English scientist (cousin of Charles Darwin) who explored many fields: heredity, meteorology, statistics, psychology, and anthropology. Believed in the inheritance of mental ability and coined the terms eugenics and "nature versus nurture." Pioneered the use of questionnaire data analysis, correlational data, and psychometrics.

### Residuals

The difference between the observed value of the response variable and the value predicted by the regression line; can be positive or negative.

### Residual Plot

Scatterplot of the (x, residual) pairs. Isolated points or a pattern of points in the residual plot are indicative of potential problems. Helps assess the appropriateness of the regression line; ideally the plot shows NO curvature.

### Influential Observation

An observation that substantially alters the values of slope and y-intercept in the regression equation when it is included in the computations. Does NOT have to be the observation with the largest residual.

### Coefficient of Determination

Denoted by r^2; gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y, so 100r^2 is the percentage of variation attributable to that relationship. Larger when the residuals are small. r^2 = 1 - (SSResid/SSTo)

### Total sum of squares

Denoted by SSTo; defined as Σ(y - ȳ)^2, the sum of the squared deviations of the y observations from their mean. Usually larger than SSResid.

### Residual sum of squares

Also known as the error sum of squares; denoted by SSResid. It is the sum of the squared residuals and measures the variation in y that cannot be attributed to an approximate linear relationship (unexplained variation). Usually less than SSTo.

### Standard Deviation about the least-squares line

The size of a "typical" deviation from the least-squares line. Represented by se = √(SSResid/(n - 2)).
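
The four quantities on the cards above (SSTo, SSResid, r^2, and se) all come from the same fitted line, so they can be sketched together; the helper name and data here are illustrative:

```python
from statistics import mean

def fit_and_summarize(x, y):
    """Fit y-hat = a + b*x by least squares, then compute
    SSTo, SSResid, r^2 = 1 - SSResid/SSTo, and se (a sketch)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # observed - predicted
    ss_resid = sum(e ** 2 for e in resid)   # unexplained variation
    ss_to = sum((yi - my) ** 2 for yi in y)  # total variation
    r_sq = 1 - ss_resid / ss_to              # coefficient of determination
    s_e = (ss_resid / (n - 2)) ** 0.5        # "typical" deviation from the line
    return r_sq, s_e

r_sq, s_e = fit_and_summarize([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # r_sq = 0.6
```

Note that r_sq here (0.6) equals the square of the Pearson r for the same data (≈ 0.7746), as the coefficient-of-determination card implies.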

### Polynomial Regression

A variation of multiple regression that describes curvilinear relationships. Uses R^2 = 1 - (SSResid/SSTo), where SSResid is the sum of the squared differences between the observed y values and the values predicted by the fitted curve.

### Transformation

Also called a re-expression; a method that involves using a function of a variable in place of the variable itself. May involve taking square roots, logarithms, or reciprocals of x and relating the transformed values to y.

### Power Transformation

A transformation in which a power/exponent is chosen, and then each original value is raised to that power to obtain the corresponding transformed value. Do NOT pick 0 as the exponent as that would make every value 1, and an exponent of 1 is NOT a transformation either.
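
A minimal sketch of this idea (the function name is illustrative); the exponent 0.5 gives the square-root transformation mentioned on the transformation card:

```python
def power_transform(values, p):
    """Raise each original value to the power p to get the
    transformed values. Per the card above, p = 0 would map every
    value to 1, and p = 1 would change nothing, so both are excluded."""
    assert p not in (0, 1), "p must not be 0 or 1"
    return [v ** p for v in values]

power_transform([1, 4, 9, 16], 0.5)  # → [1.0, 2.0, 3.0, 4.0]
```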

### Logistic Regression

Special form of regression in which the dependent variable is a nonmetric, dichotomous (binary) variable. Although some differences exist, the general manner of interpretation is quite similar to linear regression.

### Logistic Regression equation

The graph of this equation is an S-shaped curve. Describes the relationship between the probability of success and a numerical predictor variable: p = e^(a+bx) / (1 + e^(a+bx)), where a and b are constants. The further b is from 0, the steeper the curve; in other words, b determines the steepness. Taking the natural log (ln) of p/(1-p) transforms the equation into a linear one.
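
A minimal sketch of this equation and its ln transform (function name and the values of a and b are illustrative):

```python
from math import exp, log

def logistic_p(x, a, b):
    """p = e^(a + b*x) / (1 + e^(a + b*x)), where a and b are constants."""
    t = exp(a + b * x)
    return t / (1 + t)

# At x = 2 with a = -1 and b = 0.5, the exponent a + b*x is 0, so p = 0.5.
p = logistic_p(2.0, a=-1.0, b=0.5)

# The ln transform linearizes the curve: ln(p / (1 - p)) = a + b*x.
assert abs(log(p / (1 - p)) - (-1.0 + 0.5 * 2.0)) < 1e-12
```

Because the output is always squeezed between 0 and 1, the curve is S-shaped rather than a straight line.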
