ISQM Chapter 4a
|Collinearity||Relationship between two (collinearity) or more (multicollinearity) variables. Variables exhibit complete collinearity if their correlation coefficient is 1 and a complete lack of collinearity if their correlation coefficient is 0.|
|Condition index||Measure of the relative amount of variance associated with an eigenvalue so that a large condition index indicates a high degree of collinearity.|
|Cook's distance (Di)||Summary measure of the influence of a single case (observation) based on the total changes in all other residuals when the case is deleted from the estimation process. Large values (usually greater than 1) indicate substantial influence by the case in affecting the estimated regression coefficients.|
|COVRATIO||Measure of the influence of a single observation on the entire set of estimated regression coefficients. A value close to 1 indicates little influence. If the absolute value of COVRATIO minus 1 exceeds 3p/n (where p is the number of independent variables + 1, and n is the sample size), the observation is deemed to be influential based on this measure.|
|Deleted residual||Process of calculating residuals in which the influence of each observation is removed when calculating its residual. This is accomplished by omitting the ith observation from the regression equation used to calculate its predicted value.|
|DFBETA||Measure of the change in a regression coefficient when an observation is omitted from the regression analysis. The value of DFBETA is in terms of the coefficient itself; a standardized form (SDFBETA) is also available. No threshold limit can be established for DFBETA, although the researcher can look for values substantially different from the remaining observations to assess potential influence. The SDFBETA values are scaled by their standard errors, thus supporting the rationale for cutoffs of 1 or 2, corresponding to confidence levels of .10 or .05, respectively.|
|DFFIT||Measure of an observation's impact on the overall model fit, which also has a standardized version (SDFFIT). The best rule of thumb is to classify as influential any standardized values (SDFFIT) that exceed 2√(p/n), where p is the number of independent variables + 1 and n is the sample size. There is no threshold value for the DFFIT measure.|
|Eigenvalue||Measure of the amount of variance contained in the correlation matrix so that the sum of the eigenvalues is equal to the number of variables. Also known as the latent root or characteristic root.|
|Hat matrix||Matrix that contains values for each observation on the diagonal, known as hat values, which represent the impact of the observed dependent variable on its predicted value. If all cases have equal influence, each would have a value of p/n, where p equals the number of independent variables + 1, and n is the number of cases. If a case has no influence, its value would be 1/n, whereas total domination by a single case would result in a value of (n − 1)/n. Values exceeding 2p/n for larger samples, or 3p/n for smaller samples (n ≤ 30), are candidates for classification as influential observations.|
|Influential observation||Observation with a disproportionate influence on one or more aspects of the regression estimates. This influence may have as its basis (1) substantial differences from other cases on the set of independent variables, (2) extreme (either high or low) observed values for the criterion variables, or (3) a combination of these effects. Influential observations can either be "good," by reinforcing the pattern of the remaining data, or "bad," when a single or small set of cases unduly affects (biases) the regression estimates.|
|Leverage point||An observation that has substantial impact on the regression results due to its differences from other observations on one or more of the independent variables. The most common measure of a leverage point is the hat value, contained in the hat matrix.|
|Mahalanobis distance (D²)||Measure of the uniqueness of a single observation based on differences between the observation's values and the mean values for all other cases across all independent variables. The source of influence on regression results is for the case to be quite different on one or more predictor variables, thus causing a shift of the entire regression equation.|
|Outlier||In strict terms, an observation that has a substantial difference between its actual and predicted values of the dependent variable (a large residual) or between its independent variable values and those of other observations. The objective of denoting outliers is to identify observations that are inappropriate representations of the population from which the sample is drawn, so that they may be discounted or even eliminated from the analysis as unrepresentative.|
|Regression coefficient variance-decomposition matrix||Method of determining the relative contribution of each eigenvalue to each estimated coefficient. If two or more coefficients are highly associated with a single eigenvalue (condition index), an unacceptable level of multicollinearity is indicated.|
|Residual||Measure of the predictive fit for a single observation, calculated as the difference between the actual and predicted values of the dependent variable. Residuals are assumed to have a mean of zero and a constant variance. They not only play a key role in determining if the underlying assumptions of regression have been met, but also serve as a diagnostic tool in identifying outliers and influential observations.|
|Standardized residual||Rescaling of the residual to a common basis by dividing each residual by the standard deviation of the residuals. Thus, standardized residuals have a mean of 0 and standard deviation of 1. Each standardized residual value can now be viewed in terms of standard errors in middle to large sample sizes. This provides a direct means of identifying outliers as those with values above 1 or 2 for confidence levels of .10 and .05, respectively.|
|Studentized residual||Most commonly used form of standardized residual. It differs from other standardization methods in calculating the standard deviation employed. To minimize the effect of a single outlier, the standard deviation of residuals used to standardize the ith residual is computed from regression estimates omitting the ith observation. This is done repeatedly for each observation, each time omitting that observation from the calculations. This approach is similar to the deleted residual, although in this situation the observation is omitted from the calculation of the standard deviation.|
|Tolerance||Commonly used measure of collinearity and multicollinearity. The tolerance of variable i (TOLi) is 1 − Ri², where Ri² is the coefficient of determination for the prediction of variable i by the other predictor variables. Tolerance values approaching zero indicate that the variable is highly predicted (collinear) with the other predictor variables.|
|Variance inflation factor (VIF)||Measure of the effect of other predictor variables on a regression coefficient. VIF is inversely related to the tolerance value (VIFi = 1 ÷ TOLi). The VIF reflects the extent to which the standard error of the regression coefficient is increased due to multicollinearity. Large VIF values (a usual threshold is 10.0, which corresponds to a tolerance of .10) indicate a high degree of collinearity or multicollinearity among the independent variables, although values as low as 4 have also been considered problematic.|
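
Several of the measures defined in this glossary follow directly from their formulas, so a short NumPy sketch may help make them concrete. It is an illustration only, not the method of any particular statistics package: the function names `tolerance_and_vif` and `influence_measures` are hypothetical, an intercept is assumed in every regression, and Cook's distance and SDFFIT are computed with the usual textbook closed forms based on hat values rather than by refitting the model n times.

```python
import numpy as np


def tolerance_and_vif(X):
    """Tolerance (1 - R_i^2) and VIF (1 / tolerance) for each predictor.

    X is an (n, k) array of independent variables without an intercept column.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tol = np.empty(k)
    for i in range(k):
        target = X[:, i]
        # Regress predictor i on the remaining predictors plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        tol[i] = 1.0 - r2
    return tol, 1.0 / tol


def influence_measures(X, y):
    """Hat values, Cook's distance, and SDFFIT for an OLS fit of y on X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept
    p = A.shape[1]                         # independent variables + 1
    H = A @ np.linalg.pinv(A)              # hat matrix
    h = np.diag(H)                         # hat values (leverage)
    e = y - H @ y                          # ordinary residuals
    s2 = (e @ e) / (n - p)                 # residual variance, all cases
    cooks = e**2 * h / (p * s2 * (1.0 - h) ** 2)
    # Residual variance with case i omitted, then the studentized residual.
    s2_deleted = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_deleted * (1.0 - h))
    sdffit = t * np.sqrt(h / (1.0 - h))
    return h, cooks, sdffit
```

The results can be compared with the rules of thumb given in the entries above: hat values against 2p/n (or 3p/n for n ≤ 30), Cook's distance against 1, the absolute SDFFIT against 2√(p/n), and VIF against 10 (equivalently, tolerance against .10).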