Forecast Calibration and the Importance of Out-of-Sample Diagnostics in Predictive Modeling

Several months ago, when I posted my predictions for U.S. economic growth for the rest of 2015, I mentioned that I would come back to take a more critical look at the accuracy of my results and challenge some of the assumptions my predictions made. The time for that critical eye has come; today I am evaluating the predictive capacity of my GDP growth forecast models from a historical standpoint. This process helps maximize forecasting accuracy, and it is a good idea in general when producing forecasts because it gives us actual evidence about the quality of the predictions we can make.

One of the problems with out of sample modeling is that, since we don’t have a reference point to measure our predicted results against, we don’t have a lot of options for analyzing the quality of our predictive output. When performing in sample modeling in the form of regression analysis, we can compare the predicted output with the actual sample used to generate the model and get an idea of how well the model describes the data. When the data we want to describe isn’t available yet, our options for evaluating how well we can describe it are much more limited, and it isn’t always a good idea to assume that a model generated from a statistical sample will be good at describing data outside of that sample. One of my professors once told my advanced statistical methods class that using a model to evaluate out of sample data should make us feel uncomfortable and a little bit dirty. Obviously he was a member of the math department and not an economist, but his point is well taken. Without performing any sort of diagnostic tests to provide evidence of how well we can describe future data, we have a very weak argument for the accuracy, and therefore the usefulness, of any estimates we generate.

Fortunately we aren’t completely helpless; there are a few diagnostic options available to assess how good our forecasts are going to be, and one of these is the calibration test. This is a test we can use to assess the accuracy of a model that gives us an interval prediction, that is, a prediction that the actual value will fall somewhere within a predicted range. Since most statistics software will provide a confidence interval estimate with a forecast even in instances when a point forecast would be more useful, this is usually a viable diagnostic tool. Because of the uncertainty of using a statistical sample, this interval is generally reported as being accurate 95% of the time; a calibration test is a technique we can use, when we have records of past predictions, to assess whether the actual values we are trying to predict fall within the 95% prediction interval 95% of the time. If the actual values fall within the predicted range 95% of the time, then our model is correctly calibrated; if not, then we know that our model needs work, or at the very least that its output should be taken with a grain of salt.

In a nutshell, a calibration test tells us whether our model is as good at making predictions as our statistical software reports it is. This still doesn’t tell us exactly how accurate our next prediction will be, because the future isn’t usually exactly the same as the past, but it paints a better picture of how good we can expect it to be.
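To make the mechanics concrete, here is a minimal sketch in Python of what that check amounts to once you have a record of past interval predictions. The column names and the toy numbers are hypothetical; this isn't the code behind my models, just an illustration of the coverage calculation.

```python
import pandas as pd

# Hypothetical record of past interval forecasts: each row holds the
# 95% prediction interval that was issued and the GDP growth value
# that was eventually observed (all numbers are made up).
records = pd.DataFrame({
    "lower_95": [-0.2, -0.1,  0.0, -0.4, -0.1],
    "upper_95": [ 0.4,  0.5,  0.6,  0.2,  0.5],
    "actual":   [ 0.3,  0.6,  0.2, -0.6,  0.1],
})

# A forecast is "covered" when the realized value lands inside the interval.
covered = records["actual"].between(records["lower_95"], records["upper_95"])

coverage = covered.mean()  # share of intervals that contained the actual value
print(f"Empirical coverage: {coverage:.1%} (nominal target: 95%)")
```

If that empirical coverage sits well below 95%, the intervals are too narrow; well above, and they are wider than they need to be.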

In order to perform a calibration test on my U.S. GDP growth models, I have generated monthly predictions using both of my predictive models for the past three and a half years, essentially pretending that I didn’t know what I know now. For example, to generate a forecast for January 2014, I remove all of the actual data I have after December 2013, run my model to generate a prediction based only on data available prior to that month, and record that prediction. I have repeated this process forty times to record predictions dating back to the beginning of 2012. Each prediction spans a six month period, so I can assess the accuracy of my predictions for up to half a year out. An additional benefit of this process is that it provides a further point of comparison between my two predictive models and better informs my decision on how to weight their output in the future to maximize the accuracy of my forecasts.
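For readers who want to try something similar, a rough sketch of this kind of rolling pseudo out-of-sample exercise might look like the following. The model specification (a simple SARIMAX), the file name, and the column layout are stand-ins, since the post doesn't spell out my actual models; the looping scheme itself is the point.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical monthly GDP-growth series indexed by month-end dates;
# replace with however the actual series is stored.
gdp = pd.read_csv("gdp_growth_monthly.csv", index_col=0, parse_dates=True).squeeze("columns")

HORIZON = 6       # each forecast spans six months
N_BACKTESTS = 40  # forty pseudo out-of-sample runs, stepping back one month each time

results = []
cutoffs = gdp.index[-N_BACKTESTS - HORIZON:-HORIZON]  # the last forty usable "as of" dates

for cutoff in cutoffs:
    history = gdp[:cutoff]                     # pretend nothing after the cutoff is known
    model = SARIMAX(history, order=(1, 0, 1))  # placeholder spec, not the post's actual model
    fit = model.fit(disp=False)
    forecast = fit.get_forecast(steps=HORIZON)
    summary = forecast.summary_frame(alpha=0.05)  # mean plus lower and upper 95% bounds
    summary["made_at"] = cutoff
    summary["horizon"] = range(1, HORIZON + 1)
    results.append(summary)

backtest = pd.concat(results)  # forty runs of six-month-ahead interval forecasts
```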

When evaluating the historical output of my predictive models, there are two primary measurements I use to assess their predictive accuracy. The first is calibration: do the models live up to their reported 95% accuracy? The answer is: not at first. When predicting one month in advance, neither model has a track record of encompassing the actual values within the 95% interval even 80% of the time. Looking at the data, the majority of the predictive failures cluster around the GDP drop to below -0.6% that occurred around the beginning of 2014 and the high growth period in the summer of 2014 that reached over 0.5% at its highest point. Since these two values could be considered statistical outliers, I am not too concerned by this predictive failure. The further the prediction month gets from the month in which the prediction is made, the better calibrated my models become. This is partly because the intervals become wider the further out the forecast goes, and partly because the forecast becomes less volatile the further it gets from the forecasting period.

The other measurement I look at to assess the predictive accuracy of my models is the correlation between the predicted values and the actual historical GDP values. This measure is more a means of comparing the accuracy of my two models to one another than an individual assessment technique; by itself a correlation coefficient of .9864 doesn’t tell me much other than that I’m close to the mark. When I look at both of these measurements together, I get a good idea of which model is better at doing what.
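Continuing the hypothetical backtest sketch from above, tabulating those two diagnostics by forecast horizon could look something like this; the column names follow the earlier sketch and are assumptions, not my actual code.

```python
import numpy as np
import pandas as pd

# Continuing the hypothetical `backtest` frame and `gdp` series from the sketch
# above: attach the realized value for each forecast row. The target month is
# the "as of" date shifted forward by the horizon; exact index alignment with
# the gdp series is assumed here.
backtest["target"] = [
    made_at + pd.DateOffset(months=int(h))
    for made_at, h in zip(backtest["made_at"], backtest["horizon"])
]
backtest["actual"] = backtest["target"].map(gdp)

# For each horizon (1 to 6 months out), compute the two diagnostics:
# correlation between predicted and actual values, and 95% interval coverage.
by_horizon = backtest.groupby("horizon").apply(
    lambda g: pd.Series({
        "correlation": np.corrcoef(g["mean"], g["actual"])[0, 1],
        "coverage": g["actual"].between(g["mean_ci_lower"], g["mean_ci_upper"]).mean(),
    })
)
by_horizon["gap_from_95"] = 0.95 - by_horizon["coverage"]
print(by_horizon)
```

Running this once per model gives exactly the kind of comparison table reported below.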

[Table: ForecastComparison, showing each model's correlation with actual GDP, the percentage of actual values falling within the 95% prediction range, and the gap from 95%, by forecast horizon]

The above table reports each model’s correlation between predicted values and actual GDP, the percentage of actual values that fell within the 95% prediction range, and how far that percentage is from 95%. What this tells us is that while the prediction-as-a-whole model is better at predicting what will happen next month, and slightly better at the month after that, the model that forecasts GDP by its constituent parts is generally more accurate beyond that point. The by-parts model also has a much better track record of predicting the range the actual number will fall into once we move past the first month.

Another thing I look at when evaluating my forecast results is whether my models might have a timing problem: whether they predict changes when they occur, or whether they tend to consistently predict changes occurring before they actually do. This is a theory I developed after comparing my forecasts to the professional forecasts published by the U.S. government and finding that my estimates for changes in growth seemed similar to the published estimates but fell a month behind in key places. Upon testing this hypothesis by looking at correlations between actual GDP and one month lagged versions of my predictions, I find little evidence to support it.
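The lag check itself is easy to sketch: shift the prediction series by a month or two in each direction and see where its correlation with the actual series peaks. The function below is a generic illustration, not the exact calculation I ran, and the series names in the usage comment are hypothetical.

```python
import pandas as pd

def lag_correlations(actual: pd.Series, predicted: pd.Series, max_lag: int = 2) -> pd.Series:
    """Correlation between the actual series and the predictions shifted by k months.

    A correlation that peaks at k = 1 rather than k = 0 would suggest the
    predicted series leads the actual series by a month; a peak at k = -1
    would suggest it lags by a month.
    """
    out = {}
    for k in range(-max_lag, max_lag + 1):
        shifted = predicted.shift(k)
        pair = pd.concat([actual, shifted], axis=1).dropna()
        out[k] = pair.iloc[:, 0].corr(pair.iloc[:, 1])
    return pd.Series(out, name="correlation")

# Hypothetical usage, with both series indexed by the month being predicted:
# print(lag_correlations(gdp_actual, one_month_ahead_predictions))
```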

Lastly in my evaluation process, I compare the error of the one month out predictions of each of my models to the first difference of the actual GDP values. The idea here is that a good forecast model should give you a prediction that is more accurate than simply assuming next month will be the same as this month. What I find is that my as-a-whole model, the more accurate of the two when predicting one period out, passes this test, but the by-parts model does not. This supports my other comparisons, reinforcing the conclusion that the as-a-whole model is better at predicting GDP one month out. This method really only tells us about one month out predictions, since we would need knowledge of the future to use it to evaluate predictions made further out, and if we knew the future, forecasting would be kind of a pointless exercise.
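In code, this benchmark comparison is just a matter of stacking the model's one month out errors against the absolute first differences of the actual series. The sketch below is illustrative: the series names are placeholders, and mean absolute error is used as the error measure, which may not match exactly what I computed.

```python
import pandas as pd

def beats_no_change(actual: pd.Series, predicted: pd.Series) -> bool:
    """Compare one-month-ahead forecast errors against a 'no change' benchmark.

    The benchmark predicts that next month equals this month, so its error is
    simply the first difference of the actual series. For simplicity the naive
    errors here are taken over the full sample rather than only the backtest months.
    """
    aligned = pd.concat([actual, predicted], axis=1, keys=["actual", "pred"]).dropna()
    model_mae = (aligned["actual"] - aligned["pred"]).abs().mean()
    naive_mae = actual.diff().abs().mean()
    return model_mae < naive_mae

# Hypothetical usage, with both series indexed by the month being predicted:
# print(beats_no_change(gdp_actual, as_a_whole_one_month_predictions))
```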

What all of this means going forward is that I will weight my as-a-whole prediction heavily for the first month or two and then reverse course, weighting my prediction-by-parts model more heavily for predictions further out. While my calibration is a little off the mark, my models don’t do too badly, and knowing how far off they have been historically may allow me to adjust the interval forecasts I provide with future predictions according to how far off each period generally tends to be. Revised estimates of U.S. economic growth for the rest of the year, as well as a look at how well my predictions for the past several months stood up in the face of reality, are coming soon.
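As a purely illustrative note on what that weighting could look like in practice, a horizon-dependent blend of the two models' forecasts might be as simple as the following; the specific weights here are made up and would ultimately come from the backtest comparison above.

```python
import pandas as pd

# Illustrative weights on the as-a-whole model by horizon in months
# (1 = all as-a-whole, 0 = all by-parts); these numbers are hypothetical.
AS_A_WHOLE_WEIGHT = {1: 0.8, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.3, 6: 0.3}

def combine(as_a_whole: pd.Series, by_parts: pd.Series) -> pd.Series:
    """Blend the two models' point forecasts, each indexed by horizon 1..6."""
    w = pd.Series(AS_A_WHOLE_WEIGHT)
    return w * as_a_whole + (1 - w) * by_parts
```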
