Thursday, 1 December 2011

Visit the PPT on mean, median and mode (measures of central tendency).

All M.Sc. first-year students: download the file for Chapter 2, mean, median and mode.
Please visit my department: Zoology Department.
See the mark list.
Additional notes on regression analysis


To include or not to include the CONSTANT?
Most multiple regression models include a constant term, since this ensures that the model will be "unbiased"--i.e., the mean of the residuals will be exactly zero. (The coefficients in a regression model are estimated by "least squares"--i.e., minimizing the mean squared error. Now, the mean squared error is equal to the variance of the errors plus the square of their mean: this is a mathematical identity. Changing the value of the constant in the model changes the mean of the errors but doesn't affect the variance. Hence, if the sum of squared errors is to be minimized, the constant must be chosen such that the mean of the errors is zero.) In a simple regression model, the constant represents the Y-intercept of the regression line, in unstandardized form. In a multiple regression model, the constant represents the value that would be predicted for the dependent variable if all the independent variables were simultaneously equal to zero--a situation which may not be physically or economically meaningful. If you are not particularly interested in what would happen if all the independent variables were simultaneously zero, then you normally leave the constant in the model regardless of its statistical significance. In addition to ensuring that the in-sample errors are unbiased, the presence of the constant allows the regression line to "seek its own level" and provide the best fit to data which may only be "locally" linear.
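For readers who want to verify this outside Statgraphics, here is a minimal Python sketch (using the statsmodels package on made-up data; all variable names and numbers are purely illustrative, not from any example in these notes). It simply shows that the fit with a constant has residuals whose mean is essentially zero, while the fit without a constant generally does not:

# Illustrative only: effect of the constant term on the mean of the residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(10, 2, size=100)               # hypothetical independent variable
y = 50 + 3 * x + rng.normal(0, 1, size=100)   # hypothetical dependent variable

with_const = sm.OLS(y, sm.add_constant(x)).fit()   # constant included
no_const = sm.OLS(y, x).fit()                      # constant suppressed

print(with_const.resid.mean())   # essentially zero (up to rounding error)
print(no_const.resid.mean())     # generally not zero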
In rare cases you may wish to exclude the constant from the model. (This is one of the "Analysis Options" in the regression procedure.) Usually, this will be done only if (i) it is possible to imagine the independent variables all assuming the value zero simultaneously, and you feel that in this case it should logically follow that the dependent variable will also be equal to zero; or else (ii) the constant is redundant with the set of independent variables you wish to use. An example of case (i) would be a model in which all variables--dependent and independent--represented first differences of other time series. E.g., if you are regressing DIFF(Y) on DIFF(X), you are essentially predicting changes in Y as a linear function of changes in X. In this case it might be reasonable to assume that Y should be unchanged, on the average, whenever X is unchanged--i.e., that Y should not have an upward or downward trend in the absence of any change in the level of X. An example of case (ii) would be a situation in which you wish to use a full set of seasonal indicator variables--e.g., you are using quarterly data, and you wish to include variables Q1, Q2, Q3, and Q4 representing additive seasonal effects. Thus, Q1 might look like 1 0 0 0 1 0 0 0 ..., Q2 would look like 0 1 0 0 0 1 0 0 ..., and so on. You could not use all four of these and a constant in the same model, since Q1+Q2+Q3+Q4 = 1 1 1 1 1 1 1 1 . . . . , which is the same as a constant term. I.e., the five variables Q1, Q2, Q3, Q4, and CONSTANT are not linearly independent: any one of them can be expressed as a linear combination of the other four. A technical prerequisite for fitting a linear regression model is that the independent variables must be linearly independent; otherwise the least-squares coefficients cannot be determined uniquely, and we say the regression "fails."
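The redundancy of a constant with a full set of seasonal dummies is easy to check numerically. Here is a small illustrative sketch in Python (numpy), not a Statgraphics procedure; the three-year quarterly layout is invented purely for the demonstration:

# Illustrative only: why CONSTANT plus Q1, Q2, Q3, Q4 is redundant.
import numpy as np

n_years = 3
Q1 = np.tile([1, 0, 0, 0], n_years)   # 1 0 0 0 1 0 0 0 ...
Q2 = np.tile([0, 1, 0, 0], n_years)   # 0 1 0 0 0 1 0 0 ...
Q3 = np.tile([0, 0, 1, 0], n_years)
Q4 = np.tile([0, 0, 0, 1], n_years)

print(Q1 + Q2 + Q3 + Q4)              # 1 1 1 1 ... = the constant column

X = np.column_stack([np.ones(4 * n_years), Q1, Q2, Q3, Q4])
print(np.linalg.matrix_rank(X))       # 4, not 5: the five columns are not
                                      # linearly independent, so least-squares
                                      # coefficients cannot be determined uniquely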
Note that the term "independent" is used in (at least) three different ways in regression jargon: any single variable may be called an independent variable if it is being used as a predictor, rather than as the predictee. A group of variables is linearly independent if no one of them can be expressed exactly as a linear combination of the others. A pair of variables is said to be statistically independent if they are not only linearly independent but also utterly uninformative with respect to each other. In a regression model, you want your dependent variable to be statistically dependent on the independent variables, which must be linearly (but not necessarily statistically) independent among themselves. Got it?

Interpreting STANDARD ERRORS, "t" STATISTICS, AND SIGNIFICANCE LEVELS OF COEFFICIENTS
Your regression output not only gives point estimates of the coefficients of the variables in the regression equation, it also gives information about the precision of these estimates. Under the assumption that your regression model is correct--i.e., that the dependent variable really is a linear function of the independent variables, with independent and identically normally distributed errors--the coefficient estimates are expected to be unbiased, and their errors (i.e., the discrepancies between the true values and the estimated values) are normally distributed. The standard errors of the estimated coefficients are the estimated standard deviations of the errors in the coefficient estimates. In general, the standard error of the coefficient estimate for variable X is equal to the SEE (the standard error of the estimate) of the model times a factor that depends only on the values of X and the other independent variables (not on Y), and which is roughly inversely proportional to the standard deviation of X. Now, the SEE may be considered to measure the overall amount of "noise" in the data, whereas the standard deviation of X measures the strength of the "signal" in X. Hence, you can think of the standard error of the estimated coefficient of X as the reciprocal of the "signal-to-noise ratio" for observing the effect of X on Y. The larger the standard error of the coefficient estimate, the worse the signal-to-noise ratio--i.e., the less precise the measurement of the coefficient.
The t-statistics for the independent variables are equal to their coefficient estimates divided by their respective standard errors. In theory, the t-statistic of any one variable may be used to test the hypothesis that the true value of the coefficient is zero (which is to say, the variable should not be included in the model). If the regression model is correct (i.e., satisfies "the 4 assumptions"), then the estimated values of the coefficients should be normally distributed around the true values. In particular, if the true value of a coefficient is zero, then its estimated coefficient should be normally distributed with mean zero. If the standard deviation of this normal distribution were exactly known, then the coefficient estimate divided by the (known) standard deviation would have a standard normal distribution, with a mean of 0 and a standard deviation of 1. But the standard deviation is not exactly known; instead, we have only an estimate of it, namely the standard error of the coefficient estimate. Now, the coefficient estimate divided by its standard error does not have the standard normal distribution, but instead something closely related: the "Student's t" distribution with n - p degrees of freedom, where n is the number of observations fitted and p is the number of coefficients estimated, including the constant. The t distribution resembles the standard normal distribution, but has somewhat "fatter tails"--i.e., relatively more extreme values. However, the difference between the t and the standard normal is negligible if the number of degrees of freedom is more than about 30.
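If you want to reproduce the arithmetic behind the reported t-statistic and significance level, the following Python sketch (scipy) shows the calculation; the coefficient estimate, standard error, and sample sizes below are invented numbers, not output from any particular model:

# Illustrative only: t-statistic and its two-sided exceedance probability.
from scipy import stats

coef, se = 1.84, 0.62    # hypothetical coefficient estimate and its standard error
n, p = 40, 3             # hypothetical: 40 observations, 3 coefficients incl. constant

t_stat = coef / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)   # two-tailed "significance level"

print(t_stat, p_value)   # |t| > 2 and p < .05 suggest the coefficient
                         # is significantly different from zero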
In a standard normal distribution, only 5% of the values fall outside the range plus-or-minus 2. Hence, as a rough rule of thumb, a t-statistic larger than 2 in absolute value would have a 5% or smaller probability of occurring "by chance" if the true coefficient were zero. Most stat packages (including Statgraphics) will compute for you the exact probability of exceeding the observed t-value by chance if the true coefficient were zero. (This is labeled as "significance level" on your model-fitting results report.) A low value for this probability indicates that the coefficient is significantly different from zero, i.e., it seems to contribute something to the model.
Usually you are on the lookout for variables that could be removed without seriously affecting the standard error of the estimate (SEE). A low t-statistic (or equivalently, a moderate-to-large exceedance probability) for a variable suggests that the SEE would not be adversely affected by its removal. The commonest rule-of-thumb in this regard is to remove the least important variable if its t-statistic is less than 2 in absolute value, and/or the exceedance probability is greater than .05. Of course, the proof of the pudding is still in the eating: if you remove a variable with a low t-statistic and this leads to an undesirable increase in the standard error (or deterioration of some other statistic, such as Durbin-Watson), then you should probably put it back in.
Generally you should only add or remove variables one at a time, in "stepwise" fashion, since when one variable is added or removed, the other variables may increase or decrease in significance. For example, if X1 (say) is the least significant variable in the original regression, but X2 is almost equally insignificant, then you should try removing X1 first and see what happens to X2: the latter may remain insignificant after X1 is removed, in which case you might try removing it as well, or it may rise in significance, in which case you may wish to leave it in.
Note: the t-statistic is usually not used as a basis for deciding whether or not to include the constant term. Usually the decision to include or exclude the constant is based on a priori reasoning, as noted above. If it is included, it may not have direct economic significance, and you generally don't scrutinize its t-statistic too closely.

Interpreting the F-RATIO
The F-ratio and its exceedance probability provide a test of the significance of all the independent variables (other than the constant term) taken together. The variance of the dependent variable may be considered to initially have n-1 degrees of freedom, since n observations are initially available (each including an error component that is "free" from all the others in the sense of statistical independence); but one degree of freedom is used up in computing the sample mean around which to measure the variance--i.e., in estimating the constant term alone. As noted above, the effect of fitting a regression model with p coefficients including the constant is to decompose this variance into an "explained" part and an "unexplained" part. The explained part may be considered to have used up p-1 degrees of freedom (since this is the number of coefficients estimated besides the constant), and the unexplained part has the remaining "unused" n - p degrees of freedom.
The F-ratio is the ratio of the explained-variance-per-degree-of-freedom-used to the unexplained-variance-per-degree-of-freedom-unused, i.e.:
F = [(Explained variance)/(p - 1)] / [(Unexplained variance)/(n - p)]
Now, a set of n observations could in principle be perfectly fitted by a model with a constant and any n - 1 linearly independent other variables--i.e., n total variables--even if the independent variables had no predictive power in a statistical sense. This suggests that any irrelevant variable added to the model will, on the average, account for a fraction 1/(n-1) of the original variance. Thus, if the true values of the coefficients are all equal to zero (i.e., if all the independent variables are in fact irrelevant), then each coefficient estimated might be expected to merely soak up a fraction 1/(n - 1) of the original variance. In this case, the numerator and the denominator of the F-ratio should both have approximately the same expected value; i.e., the F-ratio should be roughly equal to 1. On the other hand, if the coefficients are really not all zero, then they should soak up more than their share of the variance, in which case the F-ratio should be significantly larger than 1. Statgraphics (and most other stat packages) will compute the F-ratio and also its exceedance probability--i.e., the probability of getting as large or larger a value merely "by chance" if the true coefficients were all zero. (This is shown in the ANOVA table obtained by selecting "ANOVA" from the tabular options menu that appears after fitting the model.) As with the exceedance probabilities for the t-statistics, smaller is better. A low exceedance probability (say, less than .05) for the F-ratio suggests that at least some of the variables are significant.
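As a purely illustrative calculation (Python with scipy; the sums of squares and sample sizes below are invented), the F-ratio and its exceedance probability can be computed as follows:

# Illustrative only: the F-ratio and its exceedance probability.
from scipy import stats

n, p = 40, 3                                   # observations; coefficients incl. constant
explained_ss, unexplained_ss = 120.0, 260.0    # hypothetical sums of squares

F = (explained_ss / (p - 1)) / (unexplained_ss / (n - p))
p_value = stats.f.sf(F, dfn=p - 1, dfd=n - p)  # probability of a value this large
                                               # "by chance" if all true coefficients
                                               # were zero
print(F, p_value)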
In a simple regression model, the F-ratio is simply the square of the t-statistic of the (single) independent variable, and the exceedance probability for F is the same as that for t. In a multiple regression model, the exceedance probability for F will generally be smaller than the lowest exceedance probability of the t-statistics of the independent variables (other than the constant). Hence, if at least one variable is known to be significant in the model, as judged by its t-statistic, then there is really no need to look at the F-ratio. The F-ratio is useful primarily in cases where each of the independent variables is only marginally significant by itself (e.g., has a t-statistic between 1 and 2 in absolute value, and an exceedance probability greater than .05), and you're wondering whether or not to junk the whole model.

Interpreting PLOTS OF PREDICTED VALUES AND RESIDUALS
Plots of fitted (predicted) values, observed values, and residuals are essential in determining whether the fitted model satisfies the "four assumptions of linear regression," namely:
(i) linearity of the relationship between the dependent and independent variables;
(ii) independence (i.e., non-autocorrelation) of the errors;
(iii) homoscedasticity (i.e., constant variance) of the errors;
(iv) normality of the error distribution.
In a simple regression model, linearity is best revealed by a plot of the observed and predicted values versus the independent variable. This is simply a scatter plot of Y versus X with the regression line superimposed. (This is obtainable via the "Interval plots" option on the "Graphical Options" menu that appears after fitting a model.) If the relationship is not linear, the scatter of points will show a systematic deviation from the regression line. For a multiple regression model, rather than plotting the observed and predicted values versus each of the independent variables in turn, it is simpler to generate a single plot of observed values versus the predicted values, in order to verify linearity. (This is another one of the "graphical options" for a regression analysis.) Ideally, on a plot of observed vs. predicted, the points should be scattered around a diagonal straight line. A systematic deviation from this line may indicate, for example, that larger errors tend to accompany larger predictions, suggesting non-linearity in one or more variables. If this is observed, you may then wish to plot observed values or residuals versus individual independent variables. For example, in a regression of MPG on HORSEPOWER in the CARDATA file, such a plot is suggestive of a slightly nonlinear relationship, with a slight bowing upward of the data at either end.
The plot of residuals versus predicted values is essentially the same plot, except the reference line is horizontal rather than at 45 degrees, so departures from linearity may show up more clearly there.
Plots of residuals vs. time (i.e., versus row number) are useful in verifying independence (i.e., lack of serial correlation) and homoscedasticity (constant variance) of the errors. If the independence assumption is badly violated, this may already have been revealed by a bad Durbin-Watson statistic. In particular, a D-W stat much less than 2 represents the case of positive serial correlation. In this case, the residuals-vs-time plot will probably have a "wave," "trend," or "random walk" pattern, with long runs of values having the same sign. Strong positive correlation often suggests a need for incorporating lagged variables and/or differences into the model. A D-W stat much larger than 2 would mean negative serial correlation, i.e., too much alternation of signs (plus-minus-plus-minus... etc.) between consecutive residuals. This may indicate that too much differencing has been used. (As noted earlier, more subtle violations of the independence assumption may be revealed by a plot of the autocorrelation function of the residuals across many lags.) Violations of the homoscedasticity assumption are most often manifested in a tendency for the errors to get larger in magnitude over time; this may indicate the need for some sort of non-linear transformation of the dependent variable (e.g., logging, deflating) and/or the independent variables.
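If you have saved the residuals, the Durbin-Watson statistic is also easy to compute directly. A small Python sketch, using an artificially autocorrelated residual series purely for illustration:

# Illustrative only: Durbin-Watson statistic from a residual series.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
e = np.zeros(100)                     # artificial residuals with positive
for t in range(1, 100):               # serial correlation (a "wave" pattern)
    e[t] = 0.8 * e[t - 1] + rng.normal()

dw = durbin_watson(e)                                   # library calculation
dw_by_hand = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # same formula directly
print(dw, dw_by_hand)                 # both well below 2 for this series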
A normal probability plot of the residuals is the most straightforward test of the normality assumption behind the regression model. (To draw this plot for a regression analysis in Statgraphics, you must first save the RESIDUALS to the data spreadsheet and then use the Plot/Exploratory_Plots/Probability_Plot procedure to plot them.) The normal probability plot shows the actual percentiles of the residuals vs. the theoretical percentiles of a normal distribution with the same mean and variance. Ideally, this plot should be a diagonal straight line. If, instead, it is bowed upwards or downwards, this indicates a skewed (asymmetric) distribution. If it is S-shaped or reverse-S-shaped, this indicates that the distribution is too fat-tailed (has too many outliers, like a "t" distribution with few degrees of freedom) or skinny-tailed (has too few outliers, like a uniform distribution). Bad-looking probability plots are often associated with (i) the presence of a few bad outliers, (ii) non-linear patterns in the data, and/or (iii) heteroscedasticity. An example would be a normal probability plot showing a slight departure from normality in the right tail of the distribution--i.e., a few too many large positive outliers. (That degree of non-normality is probably not serious enough to warrant a radical change in the model.)
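Outside Statgraphics, essentially the same plot can be produced with scipy and matplotlib. A minimal sketch, using artificial fat-tailed residuals purely for illustration:

# Illustrative only: normal probability plot of residuals.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
resid = rng.standard_t(df=4, size=200)         # artificial fat-tailed residuals

stats.probplot(resid, dist="norm", plot=plt)   # points should hug the reference line
plt.title("Normal probability plot of residuals")
plt.show()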

Interpreting the CORRELATION MATRIX OF THE COEFFICIENTS
The correlation matrix of the estimated coefficients (one of the tabular options that appears after fitting a regression model) provides an a posteriori indication of the relative independence of the variables in the context of the fitted model. That is, it shows the extent to which pairs of variables provide independent information for purposes of predicting the dependent variable, given the presence of other variables in the model. Extremely high values here (say, much above 0.9 in absolute value) suggest that some pairs of variables are not providing independent information. In this case, either (i) both variables are providing the same information--i.e., they are redundant; or (ii) there is some linear function of the two variables (e.g., their sum or difference) that summarizes the information they carry.
In case (i)--i.e., redundancy--it is usually desirable to try removing one of the variables. The estimated coefficients of redundant variables are often extremely large and utterly lacking in economic interpretation. This condition is referred to as multicollinearity. In the most extreme cases of multicollinearity--e.g., when one of the independent variables is an exact linear combination of some of the others--the regression calculation will fail. (Statgraphics usually detects this condition and tells you which variable is found to be a linear combination of the others.)
In case (ii), it may be possible to replace the two variables by the appropriate linear function (e.g., their sum or difference) if you can identify it, but this is not strictly necessary. This case often arises when two or more different lags of the same variable are used as independent variables in a time series regression model.
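If your package does not print this matrix, it can be recovered from the estimated covariance matrix of the coefficients. Here is a small illustrative Python sketch (statsmodels, with two deliberately near-redundant made-up regressors) showing how the correlation matrix of the coefficient estimates is obtained and what extreme values in it look like:

# Illustrative only: correlation matrix of the estimated coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly redundant with x1
y = 2 + x1 + x2 + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

cov = fit.cov_params()                      # estimated covariance of the coefficients
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)               # correlation of the coefficient estimates
print(np.round(corr, 2))                    # the x1 and x2 entries are close to -1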

Interpreting CONFIDENCE INTERVALS
Suppose that you fit a regression model to a certain time series--say, some sales data--and the fitted model predicts that sales in the next period will be $83.421M. Does this mean you should expect sales to be exactly $83.421M? Of course not. This is merely what we would call a "point estimate" or "point prediction." It should really be considered as an average taken over some range of likely values. For a point estimate to be really useful, it should be accompanied by information concerning its degree of precision--i.e., the width of the range of likely values. We would like to be able to state how confident we are that actual sales will fall within a given distance--say, $5M or $10M--of the predicted value of $83.421M.
In "classical" statistical methods such as linear regression, information about the precision of point estimates is usually expressed in the form of confidence intervals. For example, the regression model above might yield the additional information that "the 95% confidence interval for next period's sales is $75.910M to $90.932M." Does this mean that, based on all the available data, we should conclude that there is a 95% probability of next period's sales falling in the interval from $75.910M to $90.932M? That is, should we consider it a "19-to-1 long shot" that sales would fall outside this interval, for purposes of betting? The answer to this is:
  • No, strictly speaking, a confidence interval is not a probability interval for purposes of betting.
Rather, a 95% confidence interval is an interval calculated by a certain rule having the property that, in the long run, it will cover the true value in 95% of the cases in which the correct model has been fitted. (Alas, you never know for sure whether you have identified the "correct" model, although residual diagnostics help you rule out obviously incorrect ones.) In other words, if everybody all over the world used this rule (on correct models) on his or her data, year in and year out, then you would expect an overall average "hit rate" of 95%. But on your data today there is no guarantee that 95% of the computed confidence intervals will cover the true values, nor that a single confidence interval has, based on the available data, a 95% chance of covering the true value. The "95%" figure refers to a long-run property of the rule used to calculate the interval, rather than a property of the particular set of data you are analyzing. This is not to say that a confidence interval cannot be meaningfully interpreted, but merely that it shouldn't be taken too literally in any single case.
In fitting a model to a given data set, you are often simultaneously estimating many things: e.g., coefficients of different variables, predictions for different future observations, etc. Thus, a model for a given data set may yield many different sets of confidence intervals. You may wonder whether it is valid to take the long-run view here: e.g., if I calculate 95% confidence intervals for "enough different things" from the same data, can I expect about 95% of them to cover the true values? The answer to this is:
  • No, multiple confidence intervals calculated from a single model fitted to a single data set are not independent with respect to their chances of covering the true values.
For example, if the confidence interval for one period's forecast fails to cover the true value, it is likely that the confidence interval for a neighboring period's forecast will also fail to cover the true value.
Ideally, you would like your confidence intervals to be as narrow as possible: more precision is preferred to less. Does this mean that, when comparing alternative forecasting models for the same time series, you should pick the one that yields the narrowest confidence intervals around forecasts? That is, should narrow confidence intervals for forecasts be considered as a sign of a "good fit?" The answer, alas, is:
  • No, the best model does not necessarily yield the narrowest confidence intervals around forecasts.
In theory, there is only one "correct" model for forecasting a given time series from a given body of historical data. The confidence intervals yielded by the correct model will be realistic guides to the precision with which future observations can be predicted. If an incorrect model is fitted, it may yield confidence intervals that are all unrealistically wide or all unrealistically narrow. (Unfortunately, you never know for sure whether you have identified the correct model, although you look to the residual plots and statistics, etc., for assurance in this regard.) That is to say, a bad model does not necessarily know it is a bad model, and warn you by giving extra-wide confidence intervals.
Notwithstanding these caveats, confidence intervals are indispensable, since they are usually the only estimates of the degree of precision in your coefficient estimates and forecasts that are provided by most stat packages (including Statgraphics). And, if (i) your data set is sufficiently large, and your model passes the diagnostic tests concerning the "4 assumptions of regression analysis," and (ii) you don't have strong prior feelings about what the coefficients of the variables in the model should be, then you can treat a 95% confidence interval as an approximate 95% probability interval. (In the long run.)

TYPES OF CONFIDENCE INTERVALS
In regression forecasting, you may be concerned with point estimates and confidence intervals for some or all of the following:
  • The coefficients of the independent variables
  • The height of the regression line at different points in time (i.e., for given values of the independent variables)
  • The predictions based on the regression line at different points in time
In all cases, there is a simple relationship between the point estimate and its surrounding confidence interval:
(Conf. Int.) = (Point Est.) ± (Critical t-value) * (Std. Deviation or Std. Error)
For a 95% confidence interval, the "critical t value" is the value that is exceeded with probability 0.025 (one-tailed) in a t distribution with n-p degrees of freedom, where p is the number of coefficients in the model--including the constant term if any. (In general, for a 100*(1-x) percent confidence interval, you would use the t value exceeded with probability x/2.) If the number of degrees of freedom is large--say, more than 30--the t distribution closely resembles the standard normal distribution, and the relevant critical t value for a 95% confidence interval is approximately equal to 2. (More precisely, it is 1.96.) In this case, therefore, the 95% confidence interval is roughly equal to the point estimate "plus or minus two standard deviations." Here is a selection of critical t values to use for different confidence intervals and different numbers of degrees of freedom, taken from a standard table of the t distribution:
Degrees of       t-value for confidence interval
Freedom (n-p)      50%     80%     90%    95%
--------------   ------  ------  ------  ------
10               0.700   1.372   1.812   2.228
20               0.687   1.325   1.725   2.086
30               0.683   1.310   1.697   2.042
60               0.679   1.296   1.671   2.000
Infinite         0.674   1.282   1.645   1.960
A t-distribution with "infinite" degrees of freedom is a standard normal distribution.
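To make the formula concrete, here is a small Python sketch (scipy) of a 95% confidence interval built from the critical t value; the point estimate, standard error, and sample sizes are invented numbers:

# Illustrative only: confidence interval = point estimate +/- critical t * std. error.
from scipy import stats

coef, se = 1.84, 0.62    # hypothetical point estimate and its standard error
n, p = 40, 3             # hypothetical sample size and number of coefficients

conf = 0.95
t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - p)   # about 2.03 for 37 d.f.
lower, upper = coef - t_crit * se, coef + t_crit * se
print(t_crit, (lower, upper))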
The "standard error or standard deviation" in the above equation depends on the nature of the thing for which you are computing the confidence interval. For the confidence interval around a coefficient estimate, this is simply the "standard error of the coefficient estimate" that appears beside the point estimate on the "Analysis Summary" report in Statgraphics. (Recall that this is proportional to the standard error of the estimate, and inversely proportional to the standard deviation of the independent variable.)
For a confidence interval around the height of the regression line at some point (also called a "confidence interval for the mean"), the relevant standard deviation is referred to as the "standard deviation of the mean" at that point. This quantity depends on the following factors:
  • the standard errors of all the coefficient estimates
  • the correlation matrix of the coefficient estimates
  • the values of the independent variables at that point
In general, the standard deviation of the mean--and hence the width of the confidence interval around the regression line--increases with the standard errors of the coefficient estimates, increases with the distances of the independent variables from their respective means, and decreases with the degree of correlation between the coefficient estimates. (Note: a model characterized by "multicollinearity" will have exaggerated standard errors for the coefficient estimates, but this is essentially canceled out, for purposes of calculating confidence intervals, by a high degree of correlation between the coefficient estimates.)
When confidence intervals for the mean are plotted vs. the independent variable in a simple regression model, the resulting graph has a characteristic wasp-waisted shape: in a plot of the regression of MPG on HORSEPOWER, for example, the confidence intervals for means are narrowest at the center of the range of the independent variable.
When fitting a simple regression model in Statgraphics, you can obtain a plot like this by selecting "Interval Plots" from the graphical options menu, choosing the independent variable as the variable to plot against, and selecting "Confidence interval for means" from the "Pane Options" menu.
For a confidence interval around a prediction based on the regression line at some point, the relevant standard deviation is called the "standard deviation of the prediction." It reflects the error in the estimated height of the regression line plus the true error, or "noise," that is hypothesized in the basic model:
DATA = SIGNAL + NOISE
In this case, the regression line represents your best estimate of the true signal, and the standard error of the estimate (SEE, introduced earlier) is your best estimate of the standard deviation of the true noise. Now (trust me), for essentially the same reason that the fitted values are uncorrelated with the residuals, it is also true that the errors in estimating the height of the regression line are uncorrelated with the true errors. Therefore, the variances of these two components of error in each prediction are additive. Since variances are the squares of standard deviations, this means:
(Std. dev. of pred.)^2 = (Std. dev. of mean)^2 + (SEE)^2
Note that, whereas the SEE is a fixed number, the standard deviations of the predictions and the standard deviations of the means will usually vary from point to point in time, since they depend on the values of the independent variables. However, the SEE is typically much larger than the standard errors of the means at most points, hence the standard deviations of the predictions will often not vary by much from point to point, and will be only slightly larger than the SEE. In the plot of confidence intervals for predictions in the regression of MPG on HORSEPOWER, for example, the intervals for predictions are very much wider than the confidence intervals for means, and their width appears almost constant.
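As a worked illustration of the formula above (Python with scipy; every number below is hypothetical), the standard deviation of a prediction and the corresponding 95% interval would be computed like this:

# Illustrative only: std. dev. of a prediction and a 95% interval around it.
import math
from scipy import stats

see = 4.1          # hypothetical standard error of the estimate
sd_mean = 0.9      # hypothetical std. deviation of the mean at this point
point_pred = 83.4  # hypothetical point prediction
n, p = 40, 3       # hypothetical sample size and number of coefficients

sd_pred = math.sqrt(sd_mean**2 + see**2)    # only slightly larger than the SEE
t_crit = stats.t.ppf(0.975, df=n - p)
print(point_pred - t_crit * sd_pred, point_pred + t_crit * sd_pred)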
It is possible to compute confidence intervals for either means or predictions around the fitted values and/or around any true forecasts which may have been generated. Statgraphics will automatically generate forecasts rather than fitted values wherever the dependent variable is "missing" but the independent variables are not, or wherever the WEIGHTS variable--if any--has a value of zero. Confidence intervals are automatically calculated for forecasts, and can be plotted via the "Interval plots" option on the graphical options menu in the regression procedure, or printed via the "Reports" option. The forecasts and confidence intervals can also be saved to the spreadsheet via the "Save Results" button on the Analysis Window Toolbar.
The interpretation of a confidence interval around a fitted value is a bit tricky: it is the confidence interval that would be computed for that value as a forecast if the same conjunction of values of the independent variables that occurred at that point were to occur again.

DEALING WITH OUTLIERS
One of the underlying assumptions of linear regression analysis is that the distribution of the errors is approximately normal with a mean of zero. A normal distribution has the property that about 68% of the values will fall within +/- 1 standard deviation from the mean, 95% will fall within +/- 2 standard deviations, and 99.7% will fall within +/- 3 standard deviations. Hence, a value more than 3 standard deviations from the mean will occur only very rarely: about once in every 370 observations on the average. Now, the residuals from fitting a model may be considered as estimates of the true errors that occurred at different points in time, and the standard error of the estimate (SEE) is the estimated standard deviation of their distribution. Hence, if the normality assumption is satisfied, you should rarely encounter a residual whose absolute value is greater than 3 times the SEE. An observation whose residual is much greater than 3 times the SEE in absolute value is therefore usually called an "outlier." In the "Reports" option in the Statgraphics regression procedure, residuals whose absolute values are greater than 3 times the SEE are marked with an asterisk (*). Outliers are also readily spotted on time-plots and normal probability plots of the residuals.
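If your package does not flag outliers for you, the rule is simple to apply to a saved residual series. A minimal Python sketch, using an artificial residual series with one wild value planted at row 23:

# Illustrative only: flagging residuals larger than 3 times the SEE in absolute value.
import numpy as np

rng = np.random.default_rng(6)
resid = rng.normal(0, 1, size=60)           # hypothetical residual series
resid[23] = 9.0                             # one wild observation planted at row 23
n, p = len(resid), 3
see = np.sqrt(np.sum(resid**2) / (n - p))   # estimated std. dev. of the errors

outliers = np.flatnonzero(np.abs(resid) > 3 * see)
print(see, outliers)                        # should flag row 23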
If your data set contains hundreds of observations, an outlier or two may not be cause for alarm. But outliers can spell trouble for models fitted to small data sets: since the sum of squares of the residuals is the basis for estimating parameters and calculating error statistics and confidence intervals, one or two bad outliers in a small data set can badly skew the results. When outliers are found, two questions should be asked: (i) are they merely "flukes" of some kind (e.g., data entry errors, or the result of exceptional conditions that are not expected to recur), or do they represent a real effect that you might wish to include in your model; and (ii) how much have the coefficients, error statistics, and predictions, etc., been affected?
An outlier may or may not have a dramatic effect on a model, depending on the amount of "leverage" that it has. Its leverage depends on the values of the independent variables at the point where it occurred: if the independent variables were all relatively close to their mean values, then the outlier has little leverage and will mainly affect the value of the estimated CONSTANT term and the SEE. However, if one or more of the independent variables had relatively extreme values at that point, the outlier may have a large influence on the estimates of the corresponding coefficients: e.g., it may cause an otherwise insignificant variable to appear significant, or vice versa.
The best way to determine how much leverage an outlier (or group of outliers) has is to exclude it from fitting the model and compare the results with those originally obtained. You can do this in Statgraphics by using the WEIGHTS option: e.g., if outliers occur at observations 23 and 59, and you have already created a time-index variable called INDEX, you could type:
INDEX <> 23 & INDEX <> 59
in the WEIGHTS field on the input panel, and then re-fit the model. The two observations would then be excluded from fitting, and forecasts would be generated for them instead. The discrepancies between the forecasts and the actual observations, measured in terms of the corresponding standard-deviations-of- predictions (as displayed via the F9 key), would provide a good guide to how "surprising" these observations really were.
An alternative method, which is often used in stat packages lacking a WEIGHTS option, is to "dummy out" the outliers: i.e., add a dummy variable for each outlier to the set of independent variables. These observations will then be fitted with zero error independently of everything else, and the same coefficient estimates, predictions, and confidence intervals will be obtained as if they had been excluded outright. (However, statistics such as R-squared, MAE, and Durbin-Watson will be somewhat different, since they depend on the sum-of-squares of the original observations as well as the sum of squared residuals, and/or they fail to correct for the number of coefficients estimated.) In Statgraphics, to dummy-out the observations at periods 23 and 59, you could add the two variables:
INDEX = 23
INDEX = 59
to the set of independent variables on the model-definition panel. The estimated coefficients for these variables would exactly equal the difference between the offending observations and the predictions generated for them by the model. (I do not recommend this, however, since it doesn't yield an honest R-squared, MAE, and D-W.)
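The same dummying-out trick can be reproduced outside Statgraphics. Here is an illustrative Python sketch (statsmodels, on simulated data with two artificial outliers planted at rows 23 and 59); the two indicator columns play the role of the expressions INDEX = 23 and INDEX = 59 above:

# Illustrative only: "dummying out" two outlying observations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 80
index = np.arange(n)
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)
y[[23, 59]] += 15                      # two artificial outliers

d23 = (index == 23).astype(float)      # analogue of the expression INDEX = 23
d59 = (index == 59).astype(float)      # analogue of the expression INDEX = 59

X = sm.add_constant(np.column_stack([x, d23, d59]))
fit = sm.OLS(y, X).fit()
print(fit.params)                      # the dummy coefficients absorb the two
                                       # offending observations exactly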
If it turns out the outlier (or group thereof) does have a significant effect on the model, then you must ask whether there is justification for throwing it out. Go back and look at your original data and see if you can think of any explanations for outliers occurring where they did. Sometimes you will discover data entry errors: e.g., "2138" might have been punched instead of "3128." You may discover some other reason: e.g., a strike or stock split occurred, a regulation or accounting method was changed, the company treasurer ran off to Panama, etc. In this case, you must use your own judgment as to whether to merely throw the observations out, or leave them in, or perhaps alter the model to account for additional effects. A compromise solution might be to use the WEIGHTS feature to put a small (but still positive) weight on the dubious observations. For example, typing:
1 + (INDEX <> 23 & INDEX <> 59)
in the WEIGHTS field would put a weight of 1 on the observations at periods 23 and 59, and a weight of 2 on all other observations.

MULTIPLICATIVE REGRESSION MODELS AND THE LOGARITHM TRANSFORMATION
The basic linear regression model assumes that the contributions of the different independent variables to the prediction of the dependent variable are additive. For example, if X and Z are assumed to contribute additively to Y, the regression model is:
Y(t) = a + bX(t) + cZ(t)
This is fitted in Statgraphics by simply specifying Y as the dependent variable and X and Z as independent variables. Here, if X increases by one unit, other things being equal, then Y is expected to increase by b units. That is, the absolute change in Y is proportional to the absolute change in X, with the coefficient b representing the constant of proportionality. Similarly, if Z increases by 1 unit, other things equal, Y will increase by c units. And if both X and Z increase by 1 unit, then Y will change by b+c units. That is, the total change in Y is determined by adding the effects of the separate changes in X and Z.
In some situations, though, it may be felt that the dependent variable is affected multiplicatively by the independent variables. In other words, it might be felt that the percentage change in Y should be proportional to the percentage change in X, and similarly for Z. And further, if X and Z both changed, then total percentage change in Y should be roughly the sum of the percentage changes that would have resulted separately. The appropriate model for this situation is the "multiplicative regression model":
Y(t) = a (X(t)^b ) (Z(t)^c)
Here, Y is proportional to the product of X and Z, each raised to some power, whose value we can try to estimate from the data. The coefficients b and c (i.e., the powers of X and Z in the model) are referred to as the elasticities of Y with respect to X and Z, respectively. If either of these is equal to 1, we say that the response of Y to that variable has unitary elasticity--i.e., the percentage change in Y is exactly the same as the percentage change in the independent variable. If the coefficient is less than 1, the response is said to be inelastic--i.e., the percentage change in Y will be somewhat less than the percentage change in the independent variable.
The multiplicative model, in its "raw" form above, cannot be fitted using linear regression techniques. However, it can be converted into an equivalent linear model via the logarithm transformation. The logarithm function (LOG, in Statgraphics) has the property that it converts products into sums: LOG(XY) = LOG(X) + LOG(Y), for any (positive) X and Y. That is, the logarithm of a product is the sum of the separate logarithms. Also, it converts powers into multipliers: LOG(X^b) = b(LOG(X)). Using these rules, we can apply the logarithm transformation to both sides of the above equation:
LOG(Y(t)) = LOG(a (X(t)^b) (Z(t)^c))
          = LOG(a) + b(LOG(X(t))) + c(LOG(Z(t)))
Thus, LOG(Y) is a linear function of LOG(X) and LOG(Z). This can be fitted in the Statgraphics multiple-regression procedure by specifying LOG(Y) as the dependent variable and LOG(X) and LOG(Z) as the independent variables. (Of course, you could also have more or fewer than 2 independent variables in such a model.) The estimated coefficients of LOG(X) and LOG(Z) will represent estimates of the powers of X and Z in the original multiplicative form of the model, i.e., the estimated elasticities of Y with respect to X and Z. The estimated CONSTANT term will represent the logarithm of the multiplicative constant (a) in the original multiplicative model.
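As an end-to-end illustration (Python with statsmodels; the data below are simulated from a multiplicative model with known elasticities, purely to show that the log-log regression recovers them):

# Illustrative only: fitting Y = a * X^b * Z^c by regressing LOG(Y) on LOG(X), LOG(Z).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
X = rng.uniform(1, 10, n)
Z = rng.uniform(1, 10, n)
Y = 2.0 * X**0.8 * Z**1.5 * np.exp(rng.normal(0, 0.05, n))   # multiplicative noise

design = sm.add_constant(np.column_stack([np.log(X), np.log(Z)]))
fit = sm.OLS(np.log(Y), design).fit()

b, c = fit.params[1], fit.params[2]   # estimated elasticities (close to 0.8 and 1.5)
a = np.exp(fit.params[0])             # estimated multiplicative constant (close to 2.0)
print(a, b, c)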
Another situation in which the logarithm transformation may be used is in "normalizing" the distribution of one or more of the variables, even if a priori the relationships are not known to be multiplicative. It is technically not necessary for the dependent or independent variables to be normally distributed--only the errors in the predictions are assumed to be normal. However, the assumption of normally distributed errors is often more plausible in models in which both the dependent variable and the predictions for it are themselves roughly normally distributed. If some of the variables have highly skewed distributions (e.g., runs of small values with occasional large "spikes" in a single direction), it may be difficult to fit them into a linear model yielding normally distributed errors. Scatter plots involving such variables will be very strange-looking: the points will clump along one edge or both edges of the graph. And, if a regression model is fitted using the skewed variables in their raw form, the distribution of the predictions and/or the dependent variable will also be skewed, which may lead to non-normal errors. In this case, if the values of the skewed variables are all positive, it may be possible to make their distributions more normal-looking by applying the logarithm transformation--i.e., using LOG(X) instead of X, or LOG(Y) instead of Y.