Level 2 - Quantitative Methods
Session 3 and 4
Cov(X,Y) = Σ(Xi - Xmean)(Yi - Ymean) / (N - 1)
Statistical measure of the degree to which two variables move together. The actual value is hard to interpret because it is very sensitive to the scale of the two variables.
Cor(X,Y) = Cov(X,Y) / (St Dev X * St Dev Y)
Bounded by -1 and +1.
Limitations of Correlation analysis
1.) Impact of outliers
2.) Potential for spurious correlation (not causal - it appears by chance)
3.) Does not capture non linear relationships
Test of the hypothesis that r = 0.
Purpose - to test whether the population correlation between two variables is zero.
Use a t test, with calculated test stat:
T = r * Sqrt(n - 2) / Sqrt(1 - r^2), with n - 2 degrees of freedom
*Reject H0 if T > t critical, or T < negative t critical
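A minimal Python sketch of the test above, using only the standard library. The inputs r = 0.6 and n = 27, and the critical value 2.060 (two tailed, 5%, 25 df, read from a t table), are illustrative assumptions and not figures from these notes.

```python
import math

def correlation_t_stat(r, n):
    """t statistic for H0: population correlation = 0, with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical inputs: sample correlation and number of observations.
r, n = 0.6, 27
t_stat = correlation_t_stat(r, n)

# Two-tailed 5% critical value for 25 df, taken from a t table (assumption).
t_critical = 2.060
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 2), reject_h0)
```

With these made-up inputs the test stat lies outside the critical values, so the null of zero correlation would be rejected.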
Dependent Variable - other terms for it
AKA the explained, endogenous, or predicted variable
Independent Variable - other terms for it
AKA explanatory, exogenous, or predicting variable
List the assumptions underlying linear regression
- Linear relationship exists btw dependent and independent variables
- Variance of residual term is constant for all observations
- Independent variable is uncorrelated with the residuals
- Expected value of the residual term is zero
- Residual term is independently distributed, ie residuals are uncorrelated with each other
- Residual term is normally distributed
Interpret regression coefficients
For the model Yi = b0 + b1*Xi + ei:
b0 = regression intercept
b1 = regression slope coef
ei = residual for the ith observation, AKA the "error term"
Estimated slope coef?
For the regression line, describes the change in Y (dependent) for a one unit change in X (independent). This slope term is calculated as b1(est) = Cov(X,Y) / S^2(X)
where S^2(X) = variance of X
Calculate the intercept term b0
b0 = Y(mean) - b1*X(mean)
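The slope and intercept formulas can be sketched in Python as follows. The five data points are invented purely for illustration, and the function name is hypothetical.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1) using b1 = Cov(X,Y) / Var(X) and b0 = Ymean - b1 * Xmean."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sample covariance and sample variance both divide by n - 1.
    cov_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_mean) ** 2 for xi in x) / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Toy data (assumption, not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = fit_simple_regression(x, y)
print(round(b0, 2), round(b1, 2))
```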
Standard Error Estimate (SEE)
AKA "standard error of the residual" or "standard error of the regression"
SEE measures the degree of variability of the actual Y values relative to the estimated Y values from a regression equation. *The smaller the SEE, the better the fit of the regression line.
Coefficient of Determination (R^2)
The % of the total variation in the dependent variable explained by the independent variable.
*For simple linear regression (with a single independent variable), simply square the correlation coefficient r.
R^2 = r^2 for regression with one independent variable
Hypothesis test for regression coefficient
ie - is a slope coefficient different from zero?
If the confidence interval at the desired level of significance does not include zero, the null is rejected and the coef is said to be statistically different from zero.
Use the t table with N - 2 degrees of freedom.
H0: b1 = 0 vs Ha: b1 not = 0.
Confidence interval: b1(est) +- (tc * sb1), ie from b1(est) - (tc * sb1) to b1(est) + (tc * sb1)
where tc = two tailed critical t value and sb1 = standard error of the slope coef - this will probably be given on the exam.
Estimated slope coef is 0.64 with a standard error of 0.26. Assuming N = 36 observations, determine if the estimated slope coef is significantly different from zero at a 5% level of significance.
Calculated test stat = (b1(est) - 0) / sb1 = 0.64 / 0.26 = 2.46
Find the critical two tailed values from the t table with N - 2 = 34 degrees of freedom (about +-2.03). Since 2.46 > 2.03, reject the null: the slope coef is significantly different from zero.
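A short Python sketch of this example, showing both the t test and the equivalent confidence interval check. The slope, standard error and N come from the card above; the critical value 2.032 (two tailed, 5%, 34 df) is assumed to be read from a t table.

```python
b1_hat = 0.64   # estimated slope (from the example)
se_b1 = 0.26    # standard error of the slope (from the example)
n = 36
df = n - 2      # 34 degrees of freedom

# Two-tailed 5% critical t for 34 df, taken from a t table (assumption).
t_critical = 2.032

t_stat = (b1_hat - 0) / se_b1
ci_low = b1_hat - t_critical * se_b1
ci_high = b1_hat + t_critical * se_b1
significant = abs(t_stat) > t_critical

print(f"t = {t_stat:.2f}, CI = ({ci_low:.3f}, {ci_high:.3f}), significant = {significant}")
```

Both routes agree: the test stat exceeds the critical value, and the interval does not contain zero.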
Confidence intervals for predicted value (predicted by linear function).
Y(predicted) +- (tc*sf)
where tc = two tailed critical t value at desired level of sig, with N-2 df.
sf = standard error of the forecast
Total sum of squares (SST)
Total variation in the dependent variable. SST is equal to the sum of squared differences btw the actual values and the mean of Y.
SST = Σ(Yi - Ymean)^2, which decomposes as:
SST = RSS + SSE
Regression sum of squares (RSS)
Variation in the dependent variable that is explained by the independent variable. Sum of squared distances btw reg line predicted Y values and the mean of Y.
*ANOVA - Mean regression sum of squares MSR = RSS/K, where K = number of slope parameters. DF = K = 1 for simple regression.
Sum of Squared Errors ( SSE)
The unexplained variation in the dependent variable - not explained by the regression line. Sum of squared distances btw actual Y values and predicted Y values.
*ANOVA - Mean Sum of Squares MSE = SSE/(N-2)
R^2 = (SST - SSE) / SST
= (Total variation - Unexplained variation) / Total variation
= Explained variation / Total variation
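The decomposition SST = RSS + SSE and the resulting R^2 can be checked numerically. This is a sketch on invented toy data; the coefficients b0 = 2.2 and b1 = 0.6 are assumed to be the least squares estimates for that data.

```python
# Toy data and its least-squares line (assumptions for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = 2.2, 0.6

y_mean = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]  # predicted Y values

sst = sum((yi - y_mean) ** 2 for yi in y)             # total variation
rss = sum((yh - y_mean) ** 2 for yh in y_hat)         # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) # unexplained variation

r_squared = (sst - sse) / sst
print(round(sst, 2), round(rss, 2), round(sse, 2), round(r_squared, 2))
```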
Standard error of estimate (SEE)
The standard deviation of the regression error terms, equal to the square root of the mean squared error. Measures the degree of variability of actual Y values relative to those estimated by the regression equation.
SEE = Sqrt(MSE) = Sqrt(SSE/(N-2)), since MSE = SSE/(N-2)
*Refer to the error term row in the ANOVA table
What is the difference btw the sum of squared errors (SSE) and the standard error estimate (SEE)?
SSE is the sum of squared residuals, while SEE is the standard deviation of the residuals. SEE gauges the fit of the regression line.
F statistic (F test)
*Always a ONE TAILED TEST. Assesses how well a set of independent variables, as a group, explains the variation in the dependent variable. This is more useful in multiple regression.
F calc = MSR/MSE
where mean regression sum of squares, MSR = RSS/K
and mean squared error, MSE = SSE/(N-K-1)
Numerator DF = K (= 1 for simple regression)
Denominator DF = N - K - 1
Reject H0 if F calc > F critical (from table with num/denom degrees of freedom)
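A minimal Python sketch of the F test arithmetic, with SEE computed from the same ANOVA figures. The RSS, SSE and N values are hypothetical, and the critical F of 10.13 (5%, 1 and 3 df) is assumed to come from an F table.

```python
import math

# Hypothetical ANOVA inputs for a simple regression (K = 1).
rss, sse = 3.6, 2.4
n, k = 5, 1

msr = rss / k             # mean regression sum of squares
mse = sse / (n - k - 1)   # mean squared error
f_calc = msr / mse
see = math.sqrt(mse)      # standard error of estimate

# One-tailed 5% critical F for (1, 3) df, taken from an F table (assumption).
f_critical = 10.13
reject_h0 = f_calc > f_critical
print(round(f_calc, 2), round(see, 3), reject_h0)
```

Here F calc is below the critical value, so the null would not be rejected for this tiny made-up sample.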
Limitations of regression analysis
1.) Linear relationships may change over time
2.) Usefulness in investment analysis will be limited if other market participants are aware of and act on the same evidence
3.) Not useful if assumptions don't hold - ie the error terms are heteroskedastic (non constant variance of error terms) or autocorrelated (error terms are not independent).
When to use a T test
To determine if a correlation coefficient, r, is statistically significant.
*Significance is supported if the test stat is outside of the critical t values with N-2 degrees of freedom. The higher the absolute value of r, the greater the likelihood of significance.
What is serial correlation (autocorrelation)?
When error terms are correlated with one another, or in other words are not independent.
What is heteroskedasticity and how is it detected?
A non constant variance of the error terms (ie actual return - model predicted return)
Detected by examining a scatter plot of the residuals or using a Breusch-Pagan (BP) chi-square test
assumptions of linear regression
1) linear relationship exists
2) independent variable is uncorrelated with the residuals
3) expected value of the residual term is zero
4) variance of residual term is constant for all observations
5) the residual term is independently distributed, ie the residual for one obs is not correlated with that of another observation
t test for hypothesis test of estimated regression parameters
t = (b1(est) - b1) / sb1
where b1 is the hypothesized value and sb1 is the standard error of the estimated coefficient
to test significance of regression coef
H0: bj = 0 vs Ha: bj not = 0
t test - reject the null of no stat significance if the absolute value of t > t critical.
p value
the smallest level of significance for which the null hypothesis can be rejected. An alternative to the t test is to compare the p value to the significance level.
p < significance level: reject the null, stat significance exists.
p > significance level: fail to reject the null.
list the assumptions of a multiple regression model
1.) linear relationship btw independent and dependent variables.
2.) independent variables are not random, and there is no exact linear relationship btw them
3.) expected value of the error term is zero
4.) the variance of the error terms is constant for all observations
5.) error term for one obs is not correlated with that of another obs.
6.) error term is normally distributed
what is a limitation of r squared?
it almost always increases as variables are added to the model, even when they add little explanatory power, so it can overstate the fit (aka "overestimating the regression").
adjusted R squared
Adjusted R^2 = 1 - [((n - 1)/(n - k - 1)) * (1 - R^2)]
where, n =number of obs and k= number independent variables
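A quick Python sketch of the adjusted R^2 formula. The inputs (R^2 = 0.60, n = 30, and k = 3 versus k = 5) are made up to show that, for the same R^2, more independent variables mean a lower adjusted R^2.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

# Hypothetical inputs: same unadjusted R^2, different numbers of variables.
adj_k3 = adjusted_r_squared(0.60, n=30, k=3)
adj_k5 = adjusted_r_squared(0.60, n=30, k=5)
print(round(adj_k3, 4), round(adj_k5, 4))
```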
3 common multiple regression assumption violations
1.) heteroskedasticity 2.) serial correlation (autocorrelation) 3.) multicollinearity
what is heteroskedasticity, how is it detected, and how is it corrected?
this is when the variance of the residuals is not the same across all observations in a sample, ie when some sub samples are more spread out than the rest of the sample.
unconditional - not related to the level of the independent variables; not as problematic.
conditional - related to the level of the independent variables, eg variance increases as the value of the independent variable increases. this creates problems for stat inference.
what are the effects of heteroskedasticity
1.) standard errors are unreliable
2.) coef estimates (b1) are not affected
3.) if the standard errors are too small, t stats will be too large and the null hypothesis of no stat significance will be rejected too often. opposite effect if too large.
4.) F test is unreliable
t test of a correlation coefficient
used to determine if r is statistically significant.
t = r * sqrt(n-2) / sqrt(1 - r^2)
cor(a,b) = cov(a,b) / (sa * sb)
how to calculate a slope coefficient and intercept term
b1 = cov(x,y) / variance(x)
b0 = mean(y) - b1*(mean x)
SEE = sqrt(MSE) = sqrt(SSE/(n-2))
the standard deviation of the residuals.
what is the difference between the SSE and the SEE?
SSE is the sum of the squared residuals and SEE (standard error estimate) is the standard deviation of the residuals.
Breusch-Pagan (BP) Chi Square Test
Calls for a regression of the squared residuals on the independent variables. If there is conditional heteroskedasticity, the independent variables will significantly contribute to the explanation of the squared residuals.
Test Stat = n * R^2 (R^2 from the second regression, of the squared residuals on the independent variables), with K degrees of freedom.
**Always a one tailed test since there is only a problem if the R squared and Test statistic are too large. If Test stat > critical chi square value, reject the null and conclude there is a problem with conditional heteroskedasticity.
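A small Python sketch of the BP decision rule. All inputs are hypothetical: n = 60 observations, K = 2 independent variables, and R^2 = 0.08 from the regression of squared residuals. The 5% chi square critical value for 2 df (5.991) is assumed to come from a table.

```python
def breusch_pagan_stat(n, r2_resid):
    """BP statistic: n times the R^2 of the squared-residual regression."""
    return n * r2_resid

# Hypothetical inputs (not from the notes).
n, k = 60, 2
r2_resid = 0.08
bp = breusch_pagan_stat(n, r2_resid)

# One-tailed 5% chi-square critical value with k = 2 df (table value, assumption).
chi2_critical = 5.991
conditional_hetero = bp > chi2_critical
print(round(bp, 2), conditional_hetero)
```

With these inputs the test stat is below the critical value, so there is no evidence of conditional heteroskedasticity.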
How to correct conditional heteroskedasticity?
Most commonly (and CFA recommended), calculate robust standard errors, AKA "White corrected standard errors" or "heteroskedasticity consistent standard errors". These are then used to recalculate the t stats using the original regression coefficients.
Alternative method to correct - use generalized least squares, which modifies the original equation.
Serial correlation
AKA "autocorrelation". Refers to the situation when residual terms are correlated with one another.
Positive SC - when a positive regression error in one time period increases the probability of observing a positive regression error in the next time period. Standard errors are too small, which leads to too many type 1 errors (rejecting a true null).
Negative SC - when a positive regression error in one time period increases the probability of observing a negative regression error in the next time period.
How to detect Serial Correlation
1.) Residual Plots (positive looks like a wave ~~)
2.) Durbin Watson Statistic
DW is approximately 2(1 - r) when the sample size is large, where r is the correlation btw residuals from one period and the previous period
**DW > 2 if error terms are negatively correlated (r < 0)
**DW < 2 if error terms are positively correlated (r > 0)
H0: Regression has no positive serial correlation
Ha: Positive serial correlation is present
DW < dL (lower critical value from table): error terms are positively serially correlated. Reject the null.
dL < DW < dU: inconclusive test
DW > dU: no evidence of positive serial correlation, fail to reject the null.
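A Python sketch of the DW calculation and decision rule. It uses the standard definition of the statistic (sum of squared changes in the residuals divided by the sum of squared residuals), which these notes only approximate as 2(1 - r). The residuals and the dL / dU values are hypothetical placeholders, not real table values.

```python
def durbin_watson(residuals):
    """DW = sum of squared changes in residuals / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def dw_decision(dw, d_lower, d_upper):
    """Apply the test for positive serial correlation described above."""
    if dw < d_lower:
        return "positive serial correlation"
    if dw > d_upper:
        return "no evidence of positive serial correlation"
    return "inconclusive"

# Hypothetical residuals that drift in waves (positive serial correlation).
residuals = [1.0, 2.0, 1.0, -1.0, -2.0, -1.0, 1.0, 2.0]
dw = durbin_watson(residuals)

# Placeholder critical values; real ones come from a DW table for given n and k.
decision = dw_decision(dw, d_lower=1.10, d_upper=1.37)
print(round(dw, 3), decision)
```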
How to correct for serial correlation?
Adjust the coefficient standard errors (CFA recommended) using the Hansen method, which also corrects for conditional heteroskedasticity. If there is only heteroskedasticity without serial correlation, use White corrected standard errors.
Improve the specification of the model - ie, explicitly incorporate the time series nature, include a seasonal term.
Multicollinearity
The condition when two or more of the independent variables in a multiple regression are highly correlated with each other. This distorts the standard error of estimate and the coefficient standard errors, leading to problems with t tests for stat significance. Slope coefficient estimates remain consistent, but they become unreliable.
Standard errors of slope coefficients are inflated, which lowers the calculated t stats and leads to a greater probability of falsely concluding that a variable is not statistically significant (type 2 error).
How to detect multicollinearity
The classic sign is when t tests show that none of the individual coefficients is significantly different from zero, while the F test is significant and R^2 is high.
**This can only happen when the independent variables are highly correlated with each other
**General rule: if the absolute value of the sample correlation btw any two independent variables is > 0.7, multicollinearity is a potential problem. (This rule is only reliable when there are exactly 2 independent variables.)
How to correct multicollinearity
Most common way - omit one or more of the correlated independent variables.
Define regression model specification
The selection of the explanatory (independent) variables to be included in the regression and the transformations, if any, of those explanatory variables.
* when we change the specification of a model, regression parameters (b0,b1, b2,etc) will change.
3 categories of model mis-specification
1.) Functional form - important variables are omitted, variables that should be transformed are not, or data is pooled improperly.
2.) Explanatory variables are correlated with the error term - a lagged dependent variable is used as an independent variable, a function of the dependent variable is used as an independent variable (ie forecasting the past), or independent variables are measured with error.
3.) Other time series misspecifications that result in nonstationarity.
**As a result, regression coefs are often biased and/or inconsistent, so we can have no confidence in our hypothesis tests.
Review - Unbiased Estimator vs Biased
Unbiased - expected value of the estimator is equal to the parameter you are trying to estimate (ie, expected value of sample mean = population mean).
Consistent Estimator - accuracy increases as the sample size increases. (standard error falls - as size increases, standard error approaches zero).
Qualitative Dependent Variables
A dummy variable that takes on a value of either zero or one. ie - if default, 1, else 0.
Probit and Logit models VS Discriminant
Probit models are based on the normal distribution; logit models are based on the logistic distribution. Coefficients relate the independent variables to the likelihood of an event occurring, such as a merger, bankruptcy, or default.
Discriminant models - generate an overall score, or ranking, for an observation. Used with financial ratios to generate a bankruptcy score (likelihood of going under).
Putting it all together with multiple regression. The steps to decide if a model can be used accurately.
Are the coefficients statistically significant? (t test)
Is the overall model statistically significant? (F test)
Is there conditional heteroskedasticity? (BP Chi Square test) If so, use White corrected standard errors.
Is serial correlation present? If so, use the Hansen method to adjust standard errors.
Does the model have significant multicollinearity? If so, drop a correlated variable.
Confidence interval for a regression coefficient?
Estimated regression coefficient +- (Critical T)*(Coef Standard Error)
How to correct Serial Correlation Vs Conditional Heteroskedasticity?
1.) Serial correlation - use the Hansen method to adjust standard errors.
2.) Conditional heteroskedasticity - use White corrected standard errors.
Log linear trend model
Used when a time series displays exponential growth. Random variable tends to increase at some constant growth rate g. Data plots a convex curve.
ln(yt) = b0 + b1(t), so yt = e^(b0 + b1(t))
eg if b0 + b1(t) = 8.41, then yt = e^8.41. On the calculator: 8.41 [2nd] [e^x].
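The same calculation in Python, as a sketch. The coefficients b0 = 8.0 and b1 = 0.01 and the period t = 41 are hypothetical values chosen only so that b0 + b1(t) = 8.41 as in the card above.

```python
import math

def log_linear_forecast(b0, b1, t):
    """Forecast y_t from the log-linear trend ln(y_t) = b0 + b1 * t."""
    return math.exp(b0 + b1 * t)

# Hypothetical coefficients chosen so that b0 + b1 * t = 8.41.
b0, b1, t = 8.0, 0.01, 41
forecast = log_linear_forecast(b0, b1, t)

# Implied constant growth rate per period: e^b1 - 1.
growth_rate = math.exp(b1) - 1
print(round(forecast, 1), round(growth_rate, 4))
```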
When to use a Log Linear Model?
When the data plots with a curved shape rather than a straight line, or when the residuals from a linear trend model are persistently positive or negative for a period of time. Good for financial data.
For a time series model without serial autocorrelation, what should the DW stat be?
Approximately = 2.
What is an AR model?
When the dependent variable is regressed against one or more lagged values of itself, the model is called an auto regressive (AR) model.
ie Xt = b0 + b1X(t-1) + e
Three conditions for a time series model to be considered "covariance stationary" - a requirement for statistical inferences in AR model to be valid.
1.) Constant and finite expected value (AKA has a mean reverting level).
2.) Constant and finite variance. Time series volatility around its mean doesn't change over time.
3.) Constant and finite covariance btw values at any given lag.
Forecasting with an AR model (using the "chain rule of forecasting")
For an AR(2) model, the one step ahead forecast must be calculated first, because it is an input to the two step ahead forecast.
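A sketch of the chain rule in Python, assuming an AR(2) model of the form X(t+1) = b0 + b1*X(t) + b2*X(t-1). The coefficients and the last two observed values are hypothetical.

```python
def ar2_forecasts(b0, b1, b2, x_t, x_t_minus_1):
    """Return the one-step and two-step ahead forecasts from an AR(2) model."""
    one_step = b0 + b1 * x_t + b2 * x_t_minus_1
    # The two-step forecast feeds the one-step forecast back in as the latest value.
    two_step = b0 + b1 * one_step + b2 * x_t
    return one_step, two_step

# Hypothetical coefficients and last two observations.
one_step, two_step = ar2_forecasts(b0=1.0, b1=0.5, b2=0.2, x_t=10.0, x_t_minus_1=8.0)
print(round(one_step, 2), round(two_step, 2))
```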
Explain how auto correlations of the residuals can be used to test whether the autoregressive (AR) model fits the time series. (is it specified correctly)
1.) Calculate the autocorrelations of the model's residuals (correlation btw forecast errors from one period to the next).
2.) Test whether the autocorrelations are statistically different from zero.
Modified t stat = Correlation(e(t), e(t-k)) / (1/Sqrt(T)), with T-2 DF.
T is the number of observations
*The Durbin Watson test cannot be used to test for serial correlation of the error term in an autoregressive model.
*If the absolute value of the modified t stat > critical t (2 tailed, with T-2 df), the autocorrelation is significant: the model is mis specified and the AR model doesn't fit the time series.
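A Python sketch of the residual autocorrelation test, where each autocorrelation is divided by its standard error 1/Sqrt(T). The autocorrelations, T = 100, and the critical value 1.984 (two tailed, 5%, 98 df, approximate table value) are all assumptions for illustration.

```python
import math

def autocorr_t_stats(autocorrs, n_obs):
    """t stat for each residual autocorrelation: rho / (1 / sqrt(T))."""
    std_err = 1 / math.sqrt(n_obs)
    return [rho / std_err for rho in autocorrs]

# Hypothetical residual autocorrelations for lags 1 to 4.
residual_autocorrs = [0.05, -0.12, 0.25, 0.02]
n_obs = 100
t_critical = 1.984  # approximate two-tailed 5% value for 98 df (assumption)

t_stats = autocorr_t_stats(residual_autocorrs, n_obs)
significant_lags = [lag for lag, t in enumerate(t_stats, start=1) if abs(t) > t_critical]
print([round(t, 2) for t in t_stats], significant_lags)
```

Any lag that shows up as significant suggests the AR model is not correctly specified.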
Explain "Mean Reversion"
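A time series is mean reverting if it tends to fall when it is above its mean and rise when it is below its mean.
For the AR(1) model X(t+1) = b0 + b1*X(t):
** Mean reverting level is expressed: X = b0/(1-b1)
** If X(t) > b0/(1-b1), model says X(t+1) will be lower
** If X(t) < b0/(1-b1), model says X(t+1) will be higher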
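A small Python sketch of the mean reverting level, assuming an AR(1) model of the form X(t+1) = b0 + b1*X(t). The coefficients b0 = 1.2 and b1 = 0.4 are hypothetical.

```python
def mean_reverting_level(b0, b1):
    """Mean reverting level b0 / (1 - b1) of an AR(1) model."""
    if b1 == 1:
        raise ValueError("unit root: mean reverting level is undefined")
    return b0 / (1 - b1)

def next_value(b0, b1, x_t):
    """One-step ahead AR(1) forecast."""
    return b0 + b1 * x_t

# Hypothetical coefficients.
b0, b1 = 1.2, 0.4
level = mean_reverting_level(b0, b1)

above = next_value(b0, b1, 3.0)  # starts above the level, forecast moves down
below = next_value(b0, b1, 1.0)  # starts below the level, forecast moves up
print(round(level, 2), round(above, 2), round(below, 2))
```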
Do all covariance stationary time series have a finite mean reverting level?
Yes. A time series must have a finite mean reverting level to be covariance stationary.
Contrast in sample vs out of sample forecasts
1.) In sample forecasts are within the range of data used to estimate the model.
2.) Out of sample forecasts are made outside the sampling period. More important in analysis - tells us: does the model have real world predictive power?
What is the RMSE?
Root Mean Squared Error - used to compare the accuracy of the AR models in forecasting out of sample values.
The square root of the average of the squared errors.
** The model with the lower RMSE for out of sample data will have lower forecast error and is expected to have higher predictive power in future.
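A Python sketch of comparing two models by out of sample RMSE. The actual values and both sets of forecasts are invented for illustration.

```python
import math

def rmse(actual, predicted):
    """Square root of the average squared forecast error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical out-of-sample data and forecasts from two competing models.
actual = [10.0, 12.0, 11.0, 13.0]
model_a = [11.0, 11.0, 12.0, 12.0]
model_b = [10.0, 12.0, 11.0, 16.0]

rmse_a = rmse(actual, model_a)
rmse_b = rmse(actual, model_b)
better = "A" if rmse_a < rmse_b else "B"
print(rmse_a, rmse_b, better)
```

Model B is exact three times but has one large miss, and squaring the errors penalizes that miss heavily.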
Instability of coefficients of a time series model
Financial and economic conditions are dynamic, so the estimated regression coefs in one period can differ from those in another period. Things change, so the model may no longer apply.
**Shorter time series are usually more stable, since over a longer period of time conditions can change (regulatory changes, economic environment, etc).
Random Walk vs Covariance Stationary Process
1.) Random Walk - Predicted value of the series (dependent variable) in one period is equal to the value of the series in the previous period plus a random error term.
*Xt = X(t-1) + et
Expected value of each error term is zero, the variance of the error terms is constant, and there is no serial correlation in the error terms.
1-b.) Random Walk with a Drift - Intercept term is not = 0: Xt = b0 + X(t-1) + et
For all random walks (b1 = 1, "unit root"), the mean reverting level is undefined, since b0/(1-b1) = b0/0. Need to transform the data.
2.)Co Variance stationary - Time series must have a finite mean reverting level to be co variance stationary.
Is a Random Walk process co variance stationary?
No - it has a Unit Root. b1 = 1.
How to test for non stationarity ?
1.) Run an AR model and examine autocorrelations - A stationary process will typically have residual autocorrelations that are not significantly different from zero at all lags or decay to zero as number of lags increases.
2.) Perform a Dickey Fuller test - Transform the model and test whether the transformed coefficient g = (b1 - 1) is different from zero with a modified t test.
If g = 0 cannot be rejected, then the time series has a unit root.
First differencing
If we believe a time series is a random walk, has a unit root, and is therefore not covariance stationary, we can transform the data to a covariance stationary time series using the first differencing procedure. Subtract the value of the dependent variable in the immediately preceding period from the current value to define a new dependent variable (y).
**Model the change in the dependent variable rather than the value of the variable.
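First differencing is simple to sketch in Python. The series values below are hypothetical.

```python
def first_difference(series):
    """Return y(t) = x(t) - x(t-1) for each period after the first."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

# Hypothetical price-like series that wanders without reverting to a mean.
series = [100.0, 102.0, 101.0, 105.0]
changes = first_difference(series)
print(changes)
```

The differenced series has one fewer observation than the original, and it is this new series that gets modeled.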
How to test and correct for seasonality in a time series model.
Observe the residual autocorrelations for the various time lags and compare their t values to the critical t to determine if any of the autocorrelations are statistically different from zero. If the autocorrelation at a seasonal lag (eg lag 4 for quarterly data) is significant, there is seasonality: the model is mis specified and needs to include a seasonality term for greater accuracy.
How to adjust for seasonality in an AR model?
Add an additional lag of the dependent variable (corresponding to the same period in the previous year) to the original model as another independent variable.
Explain ARCH and describe how ARCH models can be applied to predict the variance of a time series.
Autoregressive conditional heteroskedasticity - when the variance of the residuals in one period is dependent on the variance of the residuals in a previous period. When this happens, the standard errors of the regression coefficients and the hypothesis tests are invalid.
What is the ARCH regression model
e(t)^2 = a0 + a1*e(t-1)^2 + m(t)
a0 is the constant and m(t) is the error term
**If the coefficient a1 is statistically different from zero, the time series is ARCH. Use a t test, or check that the p value < significance level, to confirm.
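If a1 is significant, the fitted ARCH equation can be used to predict next period's residual variance from the current squared residual. A minimal Python sketch, where the coefficients a0 = 0.5 and a1 = 0.3 and the latest residual of 2.0 are hypothetical, and the forecast form (predicted variance = a0 + a1 * squared residual) is the usual ARCH(1) forecast rather than something stated in these notes:

```python
def arch1_variance_forecast(a0, a1, last_residual):
    """Predicted variance for next period: a0 + a1 * (current residual)^2."""
    return a0 + a1 * last_residual ** 2

# Hypothetical fitted coefficients and most recent residual.
predicted_variance = arch1_variance_forecast(a0=0.5, a1=0.3, last_residual=2.0)
print(round(predicted_variance, 2))
```

A larger residual this period implies a higher predicted variance next period, which is the volatility clustering that ARCH describes.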
Cointegration
Two time series are economically linked (related to the same macro variables) or follow the same trend. If they are cointegrated, the error term from regressing one on the other is covariance stationary and the t tests are reliable.
what are the three conditions for covariance stationary?
1) constant and finite expected value, has a mean reverting level
2) constant and finite variance
3) constant and finite covariance
**random walks are not cov stationary bc b1 = 1,
so the mean reverting level b0/(1-b1) is undefined. need to first difference the data so that b1 is no longer equal to 1
what is first differencing and when is it used?
models the change in the value of a variable (current value minus the previous period's value) rather than the value of the variable itself. this is used when the series is a random walk, ie it has a unit root and is not cov stationary.
what is cointegration?
means that two time series are economically linked (related to the same macro variables) or follow the same trend, and this is not expected to change. if so, the error term from regressing one on the other is cov stationary and t tests are reliable.
dickey fuller test with engle granger (DF-EG) critical values
used to test the residuals for a unit root to see if we have cointegration. if there is a unit root, cointegration does not exist and the tests are not reliable.