Regressions

Dependent versus Independent Variables
-Dependent variable (Y): the variable that is predicted or caused. Also called response Y.
-Independent variable (X): the variable that is used to predict, or is the cause of, change in another variable. Also called factor or regressor X.
-NOTE: when graphing two variables, the independent variable (X) is always graphed along the horizontal axis and the dependent variable (Y) is always graphed along the vertical axis.
Relationship between two variables
The relationship between two variables can be classified in three ways:
1. Causal/predictive: dependent and independent variables
2. Functional
3. Statistical: knowing the value of the independent variable lets us estimate a value for the dependent variable, but the estimate is not exact. One process of determining the exact nature of a statistical relationship is called regression.
How do we determine if a relationship exists between two variables?
1. The first step is to plot the data on a graph.
-When an analyst has only a few data points, the relationship between two variables can be eye-balled (determined visually).
-When the data sets become fairly large, however, eyeballing a relationship is extremely inaccurate. Statistics are needed to summarize the relationship between two variables.
2. The relationship between two variables can be summarized by a line, and any line can be fully described by its slope and its intercept.
-Slope: ratio of the change in Y to a given change in X.
-Intercept: the point at which the line intersects the Y-axis.
Scatterplots
-Used to graph the relationship between two variables
-Scatterplots provide the basis through which we can draw a regression line.
-Scatterplots also help us understand whether our data are linear
Regressions
-Regression is a technique used to make sense of scatterplot data by finding the line that best fits the data.
-The regression line tells us the relationship between two variables (x and y).
-Are X and Y correlated?
-Regression allows us to estimate the coefficients for these lines.
Inference and Regressions
Inference helps us determine whether our regression findings (i.e., the equation of the regression line) are statistically valid.
Randomized experiments versus observational studies
-We can make experiments fair by randomly assigning treatment and control.
-In regression, we can similarly eliminate bias by randomly assigning the various levels of X.
-If the values of X are assigned at random, then we can make a stronger statement about causality between X and Y.
-If, however, the relation of Y to X occurs in an uncontrolled observational study, then we cannot necessarily conclude anything about causation. In that case, the increase in Y that accompanies a unit change in X would include not only the effect of X but also the effect of any confounding variables that might be changing simultaneously.
Correlation Coefficient
The correlation coefficient (r) ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).
-1 ≤ r ≤ 1
What does the correlation coefficient tell us?
-The strength of the association between two variables (I.E., how tightly clustered are the data).
-The closer r is to -1 or 1, the more helpful one variable is in predicting the other.
-The sign of the slope of a regression line (- or +). BUT, it does NOT tell us the slope.
Sign and Magnitude of r
-The sign of r is the same as the sign of the slope of the line drawn through a scatterplot.
-The magnitude (absolute value) of r measures the degree to which points lie close to the line.
What does r=0 mean?
When r=0, the correlation coefficient tells us that there is no linear relationship between X and Y.
BUT, that does NOT mean that X and Y have no relationship.
When are correlation coefficients not helpful?
Correlation coefficients will be misleading in the presence of:
- Outliers
- Nonlinear associations. The correlation coefficient is only useful when X and Y have a LINEAR relationship.
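As a small illustration of why r=0 does not rule out a relationship, the sketch below computes r by hand for two made-up data sets: one where Y is an exact U-shaped function of X, and one where Y is an exact straight line. The data and the helper name pearson_r are invented for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only.
x = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x]     # Y is fully determined by X, but not linearly
y_line = [3 + 2 * v for v in x]   # Y is an exact straight line in X

r_curve = pearson_r(x, y_curve)
r_line = pearson_r(x, y_line)
print(r_curve, r_line)
```

Here r is zero for the U-shaped data even though X determines Y exactly, which is why a scatterplot should be checked before trusting r.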
Simple Linear Regression
-Are X and Y correlated?
-The regression equation is basically the equation of a line:
Y=a+bX
yi = a + βxi + ei
DV=constant+(slope × IV)+error

-DV (yi): Dependent Variable (what we are trying to predict)
-Constant (a): Y-intercept
-Slope (β)
-IV (xi): Independent Variable
-Error (ei): residual
Estimated/Predicted value
For any given xi, the regression line gives a value that is close to yi, but not exactly yi. We call this estimated or predicted value "y-hat" (ŷi).
Error
-The distance of a point from the regression line is referred to as an error (y-ŷ). Recall that the regression line gives the value of ŷ, whereas the data point represents y.
-Error in the context of regression analysis means unexplained variance. I.E., spread or dispersion in the values of Y that cannot be accounted for or explained by changes in X.
-A regression equation will almost always have some error.
Residuals
-We predict pairs of data: (xi, ŷi)
-We observe pairs of data: (xi,yi)
-For each observation, the difference between the observed and predicted y-value represents the "error" in prediction for that observation.
-The residuals are the vertical distances between the points and the line. Residuals are often referred to as errors.

residual=observed value - predicted value
residual=error=ei=(yi-ŷi)

Thus, our yi can be split into
-Explained (predicted) part
-Unexplained (residual or error) part
Low Residual
-Low residual is good, high residual is bad.
-Low residual means less error, high residual means more error.
Method of Least Squares
-In a regression, our goal is to find the line that comes as close to the points as possible, because that will make the residuals as small as possible. Our objective is to fit a line whose equation is of the form ŷ=a+bx
-The question is: how do we select a and b so that we minimize the pattern of vertical Y deviations (prediction errors/residuals)?
-Minimizing residuals would be easy if there were just two points. You would simply pass a line through them. However, there are usually many points.
-The solution is to square each deviation and then minimize the sum of all of these. Squaring keeps positive and negative deviations from canceling each other out.
Min ∑(yi-ŷi)^2
-Gauss's method of least squares is the most common way to fit a line to many data points: it picks the line that minimizes the sum of squared residuals.
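A minimal sketch of the least-squares idea, assuming the standard closed-form solution for one X (slope = ∑(xi-x̄)(yi-ȳ) / ∑(xi-x̄)^2, intercept = ȳ - slope × x̄). The five data points and the function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of (observed - predicted)^2 for the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
best = sum_squared_residuals(a, b, x, y)
worse = sum_squared_residuals(a, b + 0.1, x, y)  # nudge the slope away from the fit

print(a, b)          # intercept and slope
print(best, worse)   # the fitted line has the smaller sum of squares
```

For these points the fitted line is roughly ŷ = 2.2 + 0.6x, the residuals sum to zero, and nudging the slope makes the sum of squared residuals larger.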
Ordinary Least Squares (OLS)
-The technique of linear regression uses these concepts to find the best line to describe a relationship, which statisticians have agreed is the line that minimizes the squared errors about it.
-This form of regression is called ordinary least squares, or simply linear regression.
Regression equations have two pieces of information we care about
Y=a+bX
-Intercept coefficient (a): the predicted value of Y when X = 0.
-Slope coefficient (b): a one-unit change in X is associated with a change of b in Y.
Estimating the slope and the intercept
-The true population regression is usually unknown to the statistician, who must estimate it by observing X and Y.
-The estimated line will not exactly coincide with the true population line.
Normal approximation rule for regression
The slope estimate b is approximately normally distributed, with:
-Expected value of b=β
-Standard error of b=σ/√(∑x^2), where x is measured as a deviation from its mean and σ is the standard deviation of the errors
Goodness of Fit
-Once the regression line has been found, we usually want to see how well that line summarizes the data. To do so, we use what statisticians call measures of goodness of fit.
-Any relationship between two variables can be summarized by linear regression. A regression line per se, however, does not tell us how well the regression line summarizes the data.
-For example, two scatterplots can have the same regression line, yet the line has more goodness of fit in the one whose data points are clustered closer to the line.
Measures of goodness of fit
-Standard error of the estimate (Sy|x)
-Standard error of the slope
-Coefficient of determination (r^2)
-Multiple coefficient of determination (R^2)
-F-tests
Standard error of the estimate
(Sy|x)
An estimate of the variation of the observed y values around ŷ (the predicted value of y), i.e., the typical size of a residual. This can be used to place confidence intervals around an estimate that is based on a regression equation.
Standard error of the slope
If we took several samples with an independent and a dependent variable and calculated a regression slope for each sample, the sample slopes would vary somewhat. The standard deviation of these slope estimates is called the standard error of the slope. It can also be used to test the statistical significance of the slope (using a t-test).
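A hedged sketch of how these two standard errors can be computed for a simple regression, assuming the usual textbook formulas: Sy|x = √(∑(yi-ŷi)^2 / (n-2)) and standard error of the slope = Sy|x / √(∑(xi-x̄)^2). The data are made up for illustration.

```python
import math

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx              # slope
a = mean_y - b * mean_x    # intercept

# Residual sum of squares around the fitted line.
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

s_yx = math.sqrt(rss / (n - 2))   # standard error of the estimate
se_b = s_yx / math.sqrt(sxx)      # standard error of the slope
t_stat = b / se_b                 # t statistic for H0: slope = 0

print(s_yx, se_b, t_stat)
```

With only five observations the t statistic (about 2.12) should be compared with a critical value for n-2 = 3 degrees of freedom, which is larger than the large-sample value of 1.96.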
Coefficient of determination
-The proportion of variation in Y that is explained by X.
-It is also the ratio of the explained variation to the total variation in Y.
-Measures goodness of fit, i.e., how well our estimated regression line fits the data and whether our variables are improvements on doing nothing.
-It ranges between 0 (no fit) and 1 (perfect fit).
goodness of fit=r^2

r^2=(explained SS)/(total SS)
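The ratio above can be checked numerically. The sketch below, on made-up data, computes explained SS / total SS and compares it with the square of the correlation coefficient and with 1 - (residual SS / total SS). All variable names are chosen for this example.

```python
import math

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx
a = mean_y - b * mean_x
predicted = [a + b * xi for xi in x]

total_ss = syy
explained_ss = sum((p - mean_y) ** 2 for p in predicted)
residual_ss = sum((yi - p) ** 2 for yi, p in zip(y, predicted))

r = sxy / math.sqrt(sxx * syy)
r_squared = explained_ss / total_ss

print(r_squared, r ** 2, 1 - residual_ss / total_ss)
```

All three printed values agree (0.6 here, up to floating-point rounding), so about 60% of the variation in this toy Y is explained by X.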
Caution about extrapolation
-We should not extrapolate and make predictions outside the range of data values used in our sample.
-In general, even when a linear regression does a good job of summarizing a relationship between y and x within the observed range of data, it is dangerous to extrapolate this relationship beyond the range of the data.
-We can only predict what happens in the range of data that we have in our sample.
We have produced the following regression equation to examine the relationship between prestige and education: prestige-hat = _cons + b*educat, estimated as ŷ = -10.732 + 5.361x. Interpret b = 5.361:
-The "substantive" interpretation:
• There is a positive relationship between education and prestige: on average, more education goes with more prestige.
• Remember, a one-unit change in X is associated with a change of b in Y. Here, each additional unit of education is associated with a 5.361-point increase in predicted prestige.

-The "statistical" interpretation:
• Our null hypothesis is that the population slope is zero. So, H0: β = 0.
• b is statistically significantly different from 0 because the absolute t value is more than 1.96 and the p-value is less than 0.05. So we reject the null hypothesis.
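To make the substantive interpretation concrete, this sketch plugs two education values into the estimated equation ŷ = -10.732 + 5.361x. The values 12 and 13, and the reading of education as years, are assumptions made for the example; the notes do not state the units or the range of the data.

```python
INTERCEPT = -10.732   # _cons from the estimated equation
SLOPE = 5.361         # b, the coefficient on educat

def predict_prestige(education):
    """Predicted prestige score for a given education value."""
    return INTERCEPT + SLOPE * education

# Hypothetical education values (assumed to be within the range of the data).
p12 = predict_prestige(12)
p13 = predict_prestige(13)

print(round(p12, 3), round(p13, 3), round(p13 - p12, 3))
```

The difference between the two predictions equals the slope, 5.361, whichever pair of adjacent education values is chosen.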
Significance of b coefficients
In interpreting b coefficients, the word "significant" can be thought of as "Reject H0"
How can we tell whether to reject H0?
-T-tests: reject the null if |t*|>tc (the critical value of t)
-P-values: reject the null if the p-value is less than the chosen significance level (α)
-Confidence interval: reject the null if the confidence interval does NOT include zero. Confidence intervals are valuable for substantive interpretations as well. A regression using our sample is an estimate of the regression in the underlying population. If we drew many new samples and built a 95% confidence interval from each, about 95% of those intervals would contain the true slope.
Confidence Level
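-To compute a 90% confidence interval, t= 1.65
-To compute a 95% confidence interval, t= 1.96
-To compute a 99% confidence interval, t= 2.58
-These are large-sample values. With small samples, the critical t is larger and depends on the degrees of freedom.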
Significance Level
Also called the alpha value.
α=1-confidence level

-At a 10% significance level, α=0.10: reject the null if p<0.10
-At a 5% significance level, α=0.05: reject the null if p<0.05
-At a 1% significance level, α=0.01: reject the null if p<0.01
Education and Prestige example: Relationship between r^2 and R^2
-Regression of y on x--->
ŷ = -10.732+5.361x
-Regression of x on y--->
x̂ = 4.424 + 0.135y
-Different slopes but same r-coefficient (r = 0.8502). Thus the regression accounts for r^2 = 0.72, or about 72% of the variation in the prestige scores.
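One way to see how the two regressions relate: in simple regression, the slope of y on x multiplied by the slope of x on y equals r^2. The quick check below uses only the rounded numbers reported in this example, so the match is approximate.

```python
b_y_on_x = 5.361   # slope from regressing prestige (y) on education (x)
b_x_on_y = 0.135   # slope from regressing education (x) on prestige (y)
r = 0.8502         # reported correlation coefficient

product_of_slopes = b_y_on_x * b_x_on_y
r_squared = r ** 2

print(round(product_of_slopes, 4), round(r_squared, 4))
```

Both values come out near 0.72; the small gap is due to rounding in the reported coefficients.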
Regression Assumptions
These assumptions allow for statistical inference.
(1) Errors are normally distributed with mean zero
(2) Homoscedasticity (constant error variance)
(3) Errors are independent (not auto-correlated)
(4) Independent and dependent variables are interval-level
(5) Relationship between independent and dependent variables is linear
(1) Errors are normally distributed with mean zero
-For any value of X, the errors in predicting Y are normally distributed with a mean of zero.
-Whenever e (error) has a mean of zero and is normally distributed, statisticians have found that sample slopes (b) have a mean equal to the population slope (β) and are distributed as a t distribution with a standard deviation (sb).
-If mean of errors is not zero, the intercept will be biased (add or subtract from the intercept)
-If errors are not normal and the sample size is small, F-tests and t-tests are not valid
-Most researchers don't worry much since the intercept coefficient is rarely substantively important
-After regression: Stata can store residuals. Take the mean of stored residuals; examine histogram
(2) Homoscedasticity
-The variance of the errors is constant regardless of the value of X
-This assumption is called "homoscedasticity," and its violation (nonconstant error) is called "heteroskedasticity."
-I.E., errors should not get larger as the value of X gets larger.
-The opposite situation is just as severe. If error decreases as X increases, the data are still heteroskedastic and violate this regression assumption.
-The same problem can affect the dependent variable; that is, the variance of the error term can increase as the values of Y increase.
-If this assumption of linear regression is violated, then both the standard error and t statistic associated with the slope coefficient will be inaccurate. This violation can be serious because the standard error and the t statistic are used for testing the statistical significance of the slope, that is, whether there is a relationship between the independent variable X and the dependent variable Y.
-Heteroskedasticity can occur for various reasons: Outliers, measurement error, exclusion of one or more relevant X variables from the regression equation.
-After regression: plot residuals against x (or against predicted y) and look for the "football shape." A variety of test statistics (e.g., Breusch-Pagan / Cook-Weisberg) and corrections are available: transform variables, address outliers, or use robust standard errors.
(3) Errors are independent (not auto-correlated)
-I.E., the size of one error is not a function of the size of any previous errors.
-We can test for non-independent errors by examining the residuals. If they appear to be random with respect to each other, then the errors are independent, and we need not worry.
-Autocorrelation: when our residuals are not random with respect to each other. Auto-correlated errors result in underestimated standard errors. I.E., autocorrelation can result in slopes that appear to be significant when, in fact, they are not.
-Durbin-Watson statistic: a value close to 2 indicates NO autocorrelation, a value equal to zero indicates perfect positive autocorrelation, and a value equal to 4 indicates perfect negative autocorrelation.
-Use Prais-Winsten transformation to deal with autocorrelation.
-Before regression: Use intuition/common sense. Common time series problem.
-After regression: check scatter plots of residuals by independent variable.
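A minimal sketch of the Durbin-Watson statistic, assuming the standard formula DW = ∑(e_t - e_(t-1))^2 / ∑e_t^2 applied to residuals in time order. The three residual series are invented to show the low, middle, and high ends of the scale.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residual series for illustration only.
runs = [1, 1, 1, -1, -1, -1]              # neighbors tend to share a sign
mixed = [1, -1, -1, 1, 1, -1, -1, 1]      # no clear pattern between neighbors
alternating = [1, -1, 1, -1, 1, -1]       # neighbors tend to flip sign

dw_runs = durbin_watson(runs)
dw_mixed = durbin_watson(mixed)
dw_alternating = durbin_watson(alternating)

print(dw_runs, dw_mixed, dw_alternating)
```

The series with long runs of same-signed residuals gives a value well below 2 (positive autocorrelation), the mixed series gives exactly 2, and the alternating series gives a value well above 2 (negative autocorrelation).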
(4) Independent and dependent variables are interval-level
-The purist position is that regression cannot be performed with nominal or ordinal data.
-In practice, however, regression with nominal or ordinal variables is possible.
-Nominal variables with values of 1 or 0 are called dummy variables.
-Using regression with dummy dependent variables often results in probabilities greater than 1 or less than 0. Special types of analysis called probit and logit analysis can be used to restrict probabilities to values between 0 and 1.
(5) Relationship between independent and dependent variables is linear
-Linear relationships are those that can be summarized by a straight line.
-If linear regression is used to summarize a nonlinear relationship, the regression equation will be inaccurate.
-Statistical computer programs cannot distinguish linear from nonlinear relationships.
-Before regression: Check scatter plots for nonlinear patterns. Transform x, y or both to create a linear relationship
-After regression: check scatterplots between predicted values and residuals (rvfplot), which can amplify nonlinearities
Non-linear transformations
-If a scatter of x and y is not linear, we can transform x and/or y in order to better describe the scatter plot
-The "best" transformation depends on the shape of the relationship
-The transformation influences our interpretation of the results. Often prefer to transform x, for ease of interpretation.
-For example, taking the log of X.
-With a logged X, a one-unit increase in logged X is associated with a b-unit increase in Y.
Non-linear transformations: Steps
-Scatter diagrams to diagnose
-Compute transformation
-Produce scatter with new variable(s)
-Try alternative transformation if necessary
Outliers
-Our least squares line can sometimes be markedly affected by outlying data
-Data pairs (x,y) can be influential
-Just as mean, standard deviation, and correlation can be influenced by outliers, so can regressions
-We want to check for influential observations
>Look at scatterplot and summary statistics to identify possible influential observations
>When in doubt, run a regression with the observations of concern and without
>Do the results change much?
>There are other techniques to diagnose influence, such as "leverage tests."
Simple/Bivariate Regression Output
-The coefficients section of the output displays the values for the intercept and slope.
-Constant is the same thing as intercept.
-A negative sign in front of the slope coefficient indicates a negative relationship.
Simple (bivariate) versus Multivariate Regression
In many situations, a dependent variable will have more than one cause. Under such circumstances, simple regression is an inadequate technique. Sometimes statistically significant relationships from bivariate models "wash out" in the presence of other explanatory variables. Bivariate regression is a very useful tool, but one should be careful before making definitive statements about the existence of causal relationships using only a bivariate regression.
Multivariate Regression
-Multiple regression is a technique used for interval-level data when the analyst has more than one independent variable and wants to explain or predict scores on the dependent variable.
-The real power of regression is in the extension to two or more explanatory variables
-In multivariate regression, we are still interested in relationship between x and y but we want to look at this by adjusting for other linear associations in the data
Why introduce more variables to a regression model?
-To remove variables from residuals, e, so that the regression equation is correctly specified. This makes predictions of y more accurate.
-To hold other possible causes of y constant, thus producing a more accurate assessment of the role x plays in the values of y. If x is related to other variables that affect y, omitting these lurking or confounding variables would lead us to (wrongly) ascribe part of their effects to x.
Multivariate Regression Equation
yi = a + (β1X1) + (β2X2) + (β3X3) + (β4X4) + ei

-Graphically, adding each variable adds an additional dimension
-We still want to find the coefficient values that minimize the sum of the squared residuals.
Two important concepts in multivariate regression
(1) Partial Slopes
(2) Multiple Coefficient of Determination (R^2)
Each is analogous to its counterpart in simple regression: the bivariate slope (b) and the coefficient of determination (r^2).
Partial Slopes
-β1, β2, β3, β4, etc.
-When a partial slope equals 0, it means that the independent variable in question (e.g., X1) is unrelated to Y when we control for the other independent variables (e.g., X2, X3, etc.).
Standard Error
-The true relation of Y to any X is measured by the unknown population slope.
-We estimate it with the sample slope, which varies randomly from sample to sample, fluctuating around its target with an approximately normal distribution.
-The estimated standard error (SE) of the sample slope forms the basis for confidence intervals and t tests.
Multiple Coefficient of Determination
-R^2 is the proportion of variation in Y that is explained by the independent variables (X1, X2, X3, etc.).
-Measures goodness of fit.
-By definition, R^2 is bounded between 0 and 1.
-We can decompose the total variation of y around its mean into explained and residual parts:

R^2=ESS/TSS = 1-(RSS/TSS)

-ESS=the explained sum of squares
-RSS = the residual sum of squares
-TSS = the total sum of squares
Relationship between r^2 and R^2
-In simple regression, squaring the correlation between x and y (r^2) is equivalent to R^2.
-In multivariate regression, squaring the correlation between x and y (r^2) is NOT equivalent to R^2 because other variables are now accounting for some of the relationship in the regression.
-R^2 is usually larger than r^2 because R^2 never decreases, and almost always increases, as we add more independent variables to our regression model.
Problem with R^2
-R^2 goes up (or, at worst, stays the same) every time you add independent variables to your regression equation.
-Because of the way it is calculated, R^2 goes up even if we add independent variables that have little or no explanatory power.
-To remedy this problem, statistical programs generate a statistic called "Adjusted R^2."
Adjusted R^2
-Measures the goodness of fit of your theory (your regression equation).
-The adjusted R^2 saves the day by penalizing each additional independent variable. So the adjusted R^2 will go down if an added independent variable is not useful for predicting Y (dependent variable).
-The Adjusted R^2 provides a more accurate picture of the explanatory power of a model because it adjusts for the number of independent variables, including those whose partial slopes have insignificant t-values.
-If a model includes both statistically significant and insignificant partial slopes, the Adjusted R^2 will be noticeably lower than the R^2.
-If a model includes only partial slope coefficients with significant t values, the Adjusted R^2 and R^2 values will be quite similar.
-There are valid reasons for including partial slopes with insignificant t values in your regression model, so you should not automatically remove independent variables from the model simply because they lower the Adjusted R^2.
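A sketch of the adjustment, assuming the common formula Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The sample size and R^2 values below are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers: adding a weak second variable nudges R^2 up slightly.
n = 102
one_variable = adjusted_r2(0.720, n, k=1)
two_variables = adjusted_r2(0.721, n, k=2)

print(round(one_variable, 4), round(two_variables, 4))
```

In this made-up case R^2 rises from 0.720 to 0.721, yet the Adjusted R^2 falls, which signals that the extra variable adds little.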
Spurious Relationship
Whenever a relationship between a dependent and an independent variable is spurious, the regression slope between the two variables will fall to zero if we introduce a third variable in the regression model that causes the other two.
Specified Relationship
A relationship is specified if two variables appear unrelated but become related in the presence of a third variable.
Checking Multivariate Regression
-As in simple regression, it's a good idea to check whether the linear multiple regression equation does a good job of summarizing the data.
-It isn't as easy as looking at the data directly when there are more than two independent variables.
-Fancier statistical tests are useful to check multivariate regressions
-It also helps to plot residuals against fitted values of ŷ and against each of the x's.
-If the regression is not linear, use transformations.
Scalar Transformations
You increase the value of X in each pair of values (Xi, Yi) by adding or multiplying by the same value. So the scale changes but the relationship does NOT change.
(Xi, Yi)-->(2, 4); (4,6)
(100Xi, Yi)-->(200, 4); (400, 6)
Non-linear Transformations
You take the log of X in each pair of values (Xi, Yi). With nonlinear transformations, the relationship does change; the goal is to turn a curved relationship into a linear one.
(Xi, Yi)-->(100, 10); (200, 15)
(logXi, Yi)-->(2, 10); (2.3, 15), using base-10 logs
Why use scalar versus non-linear transformations?
-Use scalar transformations for convenience, to make it easier to see the relationship. The relationship itself does not change as a result of scalar transformations.
-Use nonlinear transformations when you are running regressions with nonlinear relationships to transform them into linear relationships. The method of ordinary least squares REQUIRES a linear relationship between X and Y.
What happens to the coefficients in your model (your regression equation) when you apply each type of transformation?
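-Scalar transformation: the coefficients are simply rescaled (multiplying X by a constant c divides the slope by c; adding a constant to X shifts the intercept), but the fit does not change: r, r^2, and the t statistic of the slope stay the same.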
-Nonlinear transformation: all the coefficients change because we've changed the underlying relationship between X and Y
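The effect of a scalar transformation can be checked directly. This sketch refits the same made-up data after multiplying X by 100 and after adding 10 to X, and compares intercept, slope, and r. The data and helper name are invented for this example.

```python
import math

def fit(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a0, b0, r0 = fit(x, y)                      # original scale
a1, b1, r1 = fit([100 * v for v in x], y)   # X multiplied by 100
a2, b2, r2 = fit([v + 10 for v in x], y)    # 10 added to X

print(a0, b0, r0)
print(a1, b1, r1)
print(a2, b2, r2)
```

Multiplying X by 100 divides the slope by 100 and leaves the intercept alone; adding 10 to X leaves the slope alone and shifts the intercept. In both cases r is unchanged, so the strength of the relationship is the same.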
Dummy Variables
ŷi = a + (β1X1) + (β2

-When the numerical outcome of an economic process depends in part on some categorical characteristic of the observation, this information must be brought into the regression specification somehow.
-The technique for doing this involves constructing new regressors known as dummy variables.
-A dummy variable is a two-category variable that is usually coded 1 if a condition is met versus 0 if the condition is not met.
-One group is called the excluded group and the other is called the included group.
Dummy Variable Regression Output
-When working with dummy variables in statistical software, remember that the slope coefficient will always express the effect of the variable in terms of the category that is coded as 1.
-For purposes of consistency and interpretation, it is recommended that the values for the 1 category in a dummy variable be reserved for cases that possess a particular attribute, or where a condition does exist. The values for the 0 category should be reserved for cases that do not possess a particular attribute, or where a condition does not exist. An inconsistent definition of the 1s and 0s can make the interpretation of results very confusing.
-You should not interpret or report standardized regression coefficients (beta weights) for regression equations containing dummy variables. Instead, always report unstandardized coefficients when using a regression model that contains dummy variables.
Standardized Coefficients
-Also known as beta weights, standardized coefficients are useful for assessing the relative impact of independent variables in a regression equation.
-However, analysts often report unstandardized coefficients because they are easier to interpret.
-Even when the beta weight for a particular variable is larger than the beta weights for all of the other variables in an equation, you have not necessarily discovered the independent variable that will always be most influential in explaining variation in the dependent variable.
-The magnitude of beta weights can change when new variables are added or existing variables are removed from an equation.
-Values for beta weights can also change when the number of observations in a data set changes.
Parallel Slopes Model
The parallel slopes model means that the slope is identical across the groups defined by a dummy variable, but the intercepts vary.
Interpretation of Dummy Variables
Example: black-white wage gap among men over their life course.
ŷi = a + (β1Agei) + (β2
-Race: 1 for whites and 0 for blacks
-Conceptually, β2 can be interpreted in two ways:
1. Constant separation between the two regression lines.
2. Difference in intercepts for blacks and whites.
-Both yield a similar substantive interpretation: the average difference in y for blacks and whites.
Two categories: Mirer Earnings Example
-Compute the intercepts for whites and blacks in equation
-Earnings (in $1000s) = -0.778 + 0.762EDU - 1.926RACE + ei
-There are two intercepts that we can compute:
1. White (RACE=0)-->-0.778
2. Black (RACE=1)-->-0.778-1.926=-2.704
-The difference between the intercepts is 1.926. The substantive interpretation of this is that, on average, black people earn about $1,926 less than white people with the same number of years of education.
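The two intercepts and the constant gap can be reproduced from the equation above. The sketch leaves out the error term and uses a few hypothetical education values to show that the lines are parallel.

```python
def predicted_earnings(edu, race):
    """Predicted earnings in $1000s; race is 1 for black, 0 for white."""
    return -0.778 + 0.762 * edu - 1.926 * race

intercept_white = predicted_earnings(0, 0)
intercept_black = predicted_earnings(0, 1)

# Hypothetical years of education, chosen only to illustrate the constant gap.
gaps = [
    predicted_earnings(edu, 0) - predicted_earnings(edu, 1)
    for edu in (8, 12, 16)
]

print(round(intercept_white, 3), round(intercept_black, 3))
print([round(g, 3) for g in gaps])
```

The gap is 1.926 at every education level because RACE only shifts the intercept; the slope on EDU is shared by both groups.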
Extending idea of Parallel Slopes to data with more than two categories
-What do we do when a categorical variable has more than 2 values? We can construct dummy variables to represent each category.
-For example, we can construct four dummy variables to represent level of education (less than HS, HS grad, some college, college grad).
-We might be tempted to write a regression equation to represent the four education categories.
-BUT, there is a problem: this model is not identified. We don't have a "base."
-If all four dummy variables were included, there would be one more coefficient than we could interpret logically.
-Solution: in formulating the regression model, one of the dummy variables must be excluded and wrapped up into the constant. The intercept for the excluded category becomes the constant (a or β0).
-Then, the coefficients of the included dummy variables represent the difference between the intercepts of that category and the excluded category.
-Take-away: we always interpret the dummy variable(s) relative to the base
More than two categories: Mirer Earnings Example
-How does where someone lives shape her earnings?
-From the region variable (REG) we can create four dummy variables: DNEAST, DNCENT, DSOUTH, and DWEST.
-Each dummy variable serves to identify each observation as being in one of two groups (i.e., in the specified region or not).
-Northeast: base (constant). Stata omitted this one and wrapped it up in the constant because you always need a baseline to compare against--> -0.803
-North Central--> 0.288+(-0.803)=-0.515
-South-->-0.828+(-0.803)=-1.631
-West-->-1.992+(-0.803)=-2.795
-The substantive interpretation of this is that, on average, people living in the West start off with a lower income than people living in the Northeast.
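The regional intercepts above follow one rule: base constant plus the coefficient of the included dummy. A short sketch using the reported numbers (the dictionary layout is just one convenient way to organize them):

```python
BASE_INTERCEPT = -0.803   # constant: the omitted Northeast category

# Coefficients on the included regional dummies, as reported above.
REGION_COEFFICIENTS = {
    "DNCENT": 0.288,
    "DSOUTH": -0.828,
    "DWEST": -1.992,
}

intercepts = {"DNEAST": BASE_INTERCEPT}
for name, coefficient in REGION_COEFFICIENTS.items():
    # Each included category's intercept is the base plus its own coefficient.
    intercepts[name] = round(BASE_INTERCEPT + coefficient, 3)

for name, value in intercepts.items():
    print(name, value)
```

Each dummy coefficient is read relative to the Northeast, so a negative coefficient means a lower starting point than the base region.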
Regression with multiple dummy variables
-Imagine a case where you want to know the effects of education, gender, and race on earnings. How to predict earnings for different categories of people?
-Gender: female=1
-Race: black=1
-What is the equation?
Earnings=a+β1 Women+β2 Black+β3 Education
Regression with multiple dummy variables: What is the constant measuring?
The categories White and Men are omitted and absorbed into the constant (i.e., our "base"). The constant therefore measures predicted earnings for white men with zero education.
Regression with multiple dummy variables: How do we predict earnings for white women?
Earnings=a+β1(1)+β2(0)+β3 Education
Regression with multiple dummy variables: How do we predict earnings for black men?
Earnings=a+β1(0)+β2(1)+β3 Education
Multiple Regression Assumptions
Multiple regression differs from bivariate regression in that two additional assumptions must be considered when developing regression equations:
(6) Correct Specification
(7) No Multicollinearity
(6) Correct Specification
-A well-specified regression equation includes all or most of the independent variables known to be relevant predictors of the dependent variable.
-As a starting point in specifying a model, the analyst should research the topic to see what others examining the same question have used as explanatory variables. The more substantive knowledge we possess about the topic in question, the better our ability to select relevant independent variables.
-Keep substantively important variables, even if they are not statistically significant
Correct Specification: Selecting Explanatory Variables
Think about the following questions when selecting explanatory variables for a multiple regression equation:
• Have key variables been omitted from the equation?
• Can I explain why I selected each independent variable?
• Are we controlling for all relevant variables? Are there relevant variables being excluded?
• Does the dependent variable cause the independent variable?
• Are the variables well measured?
• Is the sample size too small?
• Is the sample size too big? With a very large sample, even substantively trivial effects can come out statistically significant.
• Do some independent variables mediate others?
• Are the independent variables too correlated?
• Is the sample biased?
Omitted variable bias
• Omitted variable bias occurs when an omitted variable is relevant for the dependent variable and correlated with at least one independent variable
• The more confounding variables we omit from an observational study (even unintentionally), the more bias we risk, with the riskiest case being simple regression, which omits them all.
• "Lurking Variables"
• Solution: do not exclude relevant variables
How do we check whether an omitted variable is relevant?
Run the regression with and without it, then compare and see if there is a significant difference.
When should a variable be dropped from an equation?
-Removing a variable with an insignificant slope coefficient and thus low explanatory power may help improve the Adjusted R^2 of the model and make it more parsimonious.
-However, for policy analysis, knowing what is NOT statistically significant is often as important as knowing what is statistically significant.
-If you drop insignificant variables from a model, the audience might ask why certain variables were left out or excluded.
-The multivariate solution for each coefficient depends on all the variables in the equation (e.g., b2 depends on y, x2, and x1).
-For this reason, we cannot just remove a variable because we don't like it or because it's not statistically significant. If we remove it then we'd be creating a NEW equation for which we would have to run a NEW regression.
-If we drop a regressor, we may seriously bias the remaining coefficients. If we have strong prior grounds for believing that X is related positively to Y, X generally should not be dropped from the regression equation if it has the right sign. Instead, it should be retained along with the information in its confidence interval and p-value.
(7) No Multicollinearity
-Multicollinearity refers to high inter-correlation among two or more independent variables.
-Multicollinearity makes it difficult for the regression equation to estimate unique partial slopes for each independent variable.
-Partial slope estimates and the associated t values can be misleading if one independent variable is highly correlated with another.
-Not only is it difficult to distinguish the effect of one independent variable from another, but high multicollinearity also typically results in partial slope coefficients with inflated standard errors, thus making it hard to obtain statistically significant results.
Diagnosing Multicollinearity
Multicollinearity can be diagnosed in a regression equation by looking for two things:
1. The equation may produce a high Adjusted R^2 but slope coefficients that are not statistically significant.
2. The value of the coefficients may change when independent variables are added to or subtracted from the equation.
Dealing with Multicollinearity
-One of the simplest ways to avoid or address multicollinearity is to assess logically whether each independent variable included in a model is really measuring something different from the others.
-You can test whether one independent variable might be measuring the same thing as another by calculating a correlation coefficient for the variables in question.
-Also look at scatter plots.
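A rough sketch of the correlation check, assuming NumPy and simulated data. The names x2_similar and x2_distinct and the helper ols_with_se are hypothetical. It computes the correlation coefficient between two independent variables and shows how the standard error of a partial slope is inflated when the two are nearly the same measure.

```python
import numpy as np

def ols_with_se(y, *columns):
    """Return (coefficients, standard errors) for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = X.shape[0] - X.shape[1]             # n minus number of coefficients
    sigma2 = resid @ resid / dof              # estimated error variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2_similar = x1 + 0.05 * rng.normal(size=n)   # nearly the same thing as x1
x2_distinct = rng.normal(size=n)              # measures something different
noise = rng.normal(size=n)

r_similar = np.corrcoef(x1, x2_similar)[0, 1]
r_distinct = np.corrcoef(x1, x2_distinct)[0, 1]

_, se_similar = ols_with_se(1 + x1 + x2_similar + noise, x1, x2_similar)
_, se_distinct = ols_with_se(1 + x1 + x2_distinct + noise, x1, x2_distinct)

print("correlation (similar, distinct):", round(r_similar, 3), round(r_distinct, 3))
print("SE of slope on x1 (similar, distinct):", se_similar[1], se_distinct[1])
```

A correlation close to 1 or -1 between two independent variables suggests they may be measuring the same thing.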
Restricted versus Unrestricted Models
-Unrestricted model: the full or big model
ŷi = a + (β1X1) + (β2
-Restricted model: a more parsimonious model. In the extreme case it has no independent variables, and its prediction is essentially the mean.
ŷi = a + ei --> equal to the mean
Restricted versus Unrestricted Models: Mirer Earnings Example
-We have three independent variables: ED, EXP, and EXPSQ.
-The underlying regression model is:
Earnings=β0+β1 ED+β2 EXP+β3 EXPSQ+ei
-Suppose our null hypothesis is that experience (EXP and EXPSQ) has no effect on earnings.
-H0: β2=β3=0
-H1: β2≠0 and/or β3≠0
-If the null hypothesis that H0: β2=β3=0 is true, the underlying regression model can be restated validly as:
Earnings=β0+β1 ED+ei
-In other words, the null hypothesis serves to restrict some of the coefficients in the underlying regression model.
-The second equation is referred to as the restricted form of the model, and the first equation is the unrestricted form.
-Unrestricted-->Earnings=β0+β1 ED+β2 EXP+β3 EXPSQ+ei
-Restricted-->Earnings=β0+β1 ED+ei
F-tests
-Standard t tests are by far the most commonly examined hypothesis tests in regression analysis.
-Sometimes, however, we are concerned with more than one coefficient, and the t tests are inadequate for our needs.
-F-tests measure goodness of fit.
-Involve the null and alternate hypotheses.
-Do the independent variables reliably predict the dependent variable?
Two types of F-tests
1. Null hypothesis: all the coefficients in the regression model (except the intercept) taken together as a group are equal to zero, meaning that the independent variables DO NOT reliably predict the dependent variable. The alternate hypothesis in this case is that the independent variables DO, in fact, reliably predict the dependent variable.
2. Null hypothesis: some coefficients are equal to zero. In other words, which model is better: the unrestricted model or the more parsimonious restricted model?
Interpreting the F-statistic
Type 1:
-If all of the partial slope coefficients in a multiple regression lack explanatory power, the F statistic will be very low.
-The larger the F statistic, the more likely it is that at least one of the partial slope coefficients is statistically significant (i.e., not equal to 0).
-The F statistic is useful for assessing whether multicollinearity is a problem. If the F statistic indicates a low probability that all partial slopes are equal to zero but none of the partial slope coefficients is statistically significant, this combination often indicates the presence of high multicollinearity between two or more of the independent variables.

Type 2:
-The value of F reflects the increase in the residual sum of squares (RSS) resulting from estimating the restricted form of the model rather than the unrestricted form.
-The logic is that if the restrictions inherent in the null hypothesis are wrong, F will tend to have a high value. Hence, large values of F cast doubt on the null hypothesis.
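-In brief, reject H0 if F*≥Fc, where F* is the computed test statistic and Fc is the critical value at the chosen level of significance.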
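The two F statistics can be written out as follows. This is the standard textbook form rather than something stated in the notes above. The symbols q (number of restrictions under H0), k (number of independent variables in the unrestricted model), and n (sample size) are introduced here for illustration.

```latex
% Type 2: restricted (R) versus unrestricted (U) model
F^{*} = \frac{(RSS_{R} - RSS_{U}) / q}{RSS_{U} / (n - k - 1)}

% Type 1: special case in which the restricted model has only an intercept,
% so RSS_R equals the total sum of squares and q = k
F^{*} = \frac{R^{2} / k}{(1 - R^{2}) / (n - k - 1)}
```

In the Mirer earnings example, H0: β2=β3=0 imposes q = 2 restrictions on an unrestricted model with k = 3 independent variables.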
Five steps of F-tests
1. State the hypothesis clearly
2. Choose the level of significance
3. Construct the decision rule
4. Determine the value of the test statistic F*
5. State and interpret the conclusion of the test
How to construct F-tests
1. Identify an unrestricted model and a restricted model to compare against
2. Compare the sum of squared residuals from the restricted model (the null) with that from the unrestricted model, which has more coefficients. The comparison takes into account how many variables are included in each, so it is a way to test whether a kitchen-sink style model is better than a more parsimonious model.
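The two steps above can be sketched in code, assuming NumPy and simulated data shaped like the earnings example. The numbers are made up and are not the Mirer data. The helpers rss and f_statistic are hypothetical names, and the formula is the standard restricted-versus-unrestricted F statistic with q restrictions and k independent variables in the unrestricted model.

```python
import numpy as np

def rss(y, *columns):
    """Residual sum of squares from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def f_statistic(rss_restricted, rss_unrestricted, q, n, k):
    """F* for q restrictions; k = regressors in the unrestricted model."""
    numerator = (rss_restricted - rss_unrestricted) / q
    denominator = rss_unrestricted / (n - k - 1)
    return numerator / denominator

# Simulated stand-in data (hypothetical values, for illustration only)
rng = np.random.default_rng(2)
n = 100
ed = rng.uniform(8, 18, size=n)
exper = rng.uniform(0, 40, size=n)
expsq = exper ** 2
earnings = 5 + 1.2 * ed + 0.6 * exper - 0.01 * expsq + rng.normal(scale=2.0, size=n)

rss_u = rss(earnings, ed, exper, expsq)   # unrestricted: ED, EXP, EXPSQ
rss_r = rss(earnings, ed)                 # restricted: ED only (H0: b2 = b3 = 0)
f_star = f_statistic(rss_r, rss_u, q=2, n=n, k=3)

print("RSS restricted:", rss_r, "RSS unrestricted:", rss_u, "F*:", f_star)
```

F* would then be compared with the critical value Fc from an F table with q and n - k - 1 degrees of freedom.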
F-tests versus T-tests
-F-tests determine whether the set of independent variables reliably predicts the dependent variable as a group
-T-tests determine whether a specific coefficient is different from zero
Regression: words of caution
-Association is not causation
-Be aware of underlying OLS regression assumptions
-Beware of extrapolation
-Specification matters: "Omitted variable bias" or "lurking variables."
-Beware of outliers
Final thoughts on regression: Step 1
Before starting: think about your sample
-Is it a random or biased sample? If it is biased, we should NOT run a regression or do any inferential statistics.
-Is the sample size appropriate? A common rule of thumb is a minimum of 30 observations, but be wary of using a sample size that is too large.
Final thoughts on regression: Step 2
Choose independent variables that are:
-Relevant to your theory or research question
-NOT caused by Y. When Y causes X, the problem is called endogeneity, and it can seriously bias the regression coefficients.
-Well measured
Also, do not exclude relevant variables.
Final thoughts on regression: Step 3
-Once you trust your data enough, check scatter plots and descriptive statistics
-If the relationship between X and Y is nonlinear, transform the variables before running a regression.
-Look for potential outliers
Final thoughts on regression: Step 4
Run regression(s). You might have to run several regressions, especially if there are multiple outliers and/or intervening variables.
-Run regression(s) with and without potential outliers
-Run regression(s) with and without intervening variables
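A small sketch of the with-and-without-outliers comparison, assuming NumPy. The data points are made up, and the suspected outlier is flagged by hand rather than by any formal rule.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)   # polyfit lists the slope first
    return intercept, slope

x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x                   # ten points that lie exactly on a line
x_all = np.append(x, 10.0)
y_all = np.append(y, 100.0)         # one hand-made potential outlier

keep = np.ones(len(x_all), dtype=bool)
keep[-1] = False                    # flag the suspected outlier

_, slope_with = fit_line(x_all, y_all)
_, slope_without = fit_line(x_all[keep], y_all[keep])

print("slope with outlier:   ", slope_with)
print("slope without outlier:", slope_without)
```

If the two slopes differ a lot, report both regressions and explain how the outlier was identified.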
Final thoughts on regression: Step 6
Run T-tests and F-tests. Check for statistical significance.
Final thoughts on regression: Step 7-9
7. Check remaining assumptions. Tweak and re-run regressions.
8. Report, report, report everything: how and why you chose your variables, which regressions you chose to run, etc.
9. Aim for "replicability"
What can go wrong with multiple regression?
1. Are important independent variables left out of the model?
-Leaving important variables out of a regression model can bias the coefficients of other variables and lead to spurious conclusions.
-Important variables are those that affect the dependent variable and are correlated with the variables that are the focus of the study.

2. Does the dependent variable affect any of the independent variables?
-If the dependent variable in a regression model has an effect on one or more independent variables, any or all of the regression coefficients may be seriously biased.
-Non-experimental data rarely tell us anything about the direction of a causal relationship. You must decide the direction based on your prior knowledge of the phenomenon you're studying.
-Time ordering usually gives us the most important clues about the direction of causality.

3. How well are the independent variables measured?
-Measurement error in independent variables leads to bias in the coefficients. Variables with more measurement error tend to have coefficients that are biased toward 0.
-The degree of measurement error in a variable is usually quantified by an estimate of its reliability, a number between 0 and 1. A reliability of 1 indicates that the variable is perfectly measured, whereas a reliability of 0 indicates that the variation in the variable is pure measurement error.
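A minimal simulation of bias toward 0, assuming NumPy, a single independent variable, and purely random measurement error that is unrelated to the true values. All names and numbers are hypothetical. Under those assumptions the slope on the noisy variable is roughly the true slope multiplied by the reliability.

```python
import numpy as np

def slope(y, x):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(4)
n = 20000
x_true = rng.normal(size=n)                    # perfectly measured variable
y = 2.0 * x_true + rng.normal(size=n)          # true slope is 2
x_noisy = x_true + rng.normal(size=n)          # same variable with random error

# Reliability: share of the observed variance that is true variance
reliability = np.var(x_true) / np.var(x_noisy)

slope_true = slope(y, x_true)
slope_noisy = slope(y, x_noisy)

print("reliability:", round(reliability, 2))
print("slope with true x: ", round(slope_true, 2))
print("slope with noisy x:", round(slope_noisy, 2))
```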

4. Is the sample large enough to detect important effects? In small samples, the approximations used to calculate p values may not be very accurate, so be cautious in interpreting them.

5. Is the sample so large that trivial effects are statistically significant?
-In large samples, even trivial effects may be statistically significant.
-You need to look carefully at the magnitude of each coefficient to determine whether it is large enough to be substantively interesting.
-When the measurement scale of the variable is unfamiliar, standardized coefficients can be helpful in evaluating the substantive significance of a regression coefficient.
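A short sketch of a standardized coefficient, assuming NumPy and made-up data. The standardized slope is the raw slope multiplied by the standard deviation of X and divided by the standard deviation of Y. It reads as the change in Y, in standard deviations, per one standard deviation change in X. In a regression with a single independent variable it equals the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.normal(loc=50, scale=10, size=n)            # variable on an unfamiliar scale
y = 3.0 + 0.2 * x + rng.normal(scale=4.0, size=n)   # hypothetical relationship

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b = coef[1]                                          # raw (unstandardized) slope

# Standardized slope: rescale by the standard deviations of x and y
b_std = b * np.std(x, ddof=1) / np.std(y, ddof=1)
r = np.corrcoef(x, y)[0, 1]

print("raw slope:", b)
print("standardized slope:", b_std, "correlation:", r)
```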

6. Do some variables mediate the effects of other variables?
-If you're interested in the effect of x on y, but the regression model also includes intervening variables w and z, the coefficient for x may be misleadingly small.
-You have estimated the direct effect of x on y, but you have missed the indirect effects through w and z.
-If intervening variables w and z are removed from the regression model, the coefficient for x represents its total effect on y. The total effect is the sum of the direct and indirect effects.
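The direct, indirect, and total effects can be sketched as follows, assuming NumPy, simulated data, and a single intervening variable w for simplicity (the card above uses two, w and z). All coefficients are hypothetical. With OLS, the total effect equals the direct effect plus the indirect effect, where the indirect effect is the slope of w on x multiplied by the slope of y on w.

```python
import numpy as np

def ols(y, *columns):
    """Return OLS coefficients [intercept, slope_1, ...] by least squares."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
w = 0.5 * x + rng.normal(size=n)              # x affects the intervening variable w
y = 1.0 * x + 2.0 * w + rng.normal(size=n)    # both x and w affect y

total = ols(y, x)[1]                # w left out: total effect of x
_, direct, b_w = ols(y, x, w)       # w included: direct effect of x, slope on w
a = ols(w, x)[1]                    # effect of x on w
indirect = a * b_w                  # effect of x on y that runs through w

print("direct:", direct, "indirect:", indirect, "total:", total)
```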

7. Are some independent variables too highly correlated?
-If two or more independent variables are highly correlated, it's difficult to get good estimates of the effect of each variable controlling for the others. This problem is known as multicollinearity.
-When two independent variables are highly collinear, it's easy to incorrectly conclude that neither has an effect on the dependent variable.

8. Is the sample biased?
-As with any statistical analysis, it's important to consider whether the sample is representative of the intended population.
-A probability sample is the best way to get a representative sample.
-If a substantial portion of the intended sample refuses to participate in the study, regression analysis may produce biased estimates.