- Simple Seasonal Averages
- Moving Averages and Centered Moving Averages
- Linear Regression with Coded Vectors
- Simple Seasonal Exponential Smoothing
- Holt-Winters Models
Moving Averages and Centered Moving Averages
A couple of points about seasonality in a time series bear repeating, even if they seem obvious. One is that the term “season” does not necessarily refer to the four seasons of the year that result from the tilting of the Earth’s axis. In predictive analytics, “season” often means precisely that, because many of the phenomena that we study do vary along with the progression of spring through winter: sales of winter or summer gear, incidence of certain widespread diseases, weather events caused by the location of the jet stream and changes in the temperature of the water in the eastern Pacific ocean, and so on.
Equally, events that occur regularly can act like meteorological seasons, even though they have only a tenuous connection to the solstices and equinoxes. Eight-hour shifts in hospitals and factories often get expressed in the incidence of intakes and expenditures of energy; there, a season is eight hours long and the seasons cycle every day, not every year. Due dates for taxes signal the beginning of a flood of dollars into municipal, state, and federal treasuries; there, the season might be one year long (personal income taxes), six months (property taxes in many states), quarterly (many corporate taxes), and so on.
This isn’t just terminological musing. The ways we identify seasons and the period of time during which the seasons turn have real, if often minor, implications for how we measure their effects. The following sections discuss how some analysts vary the way they calculate moving averages according to whether the number of seasons is odd or even.
Using Moving Averages Instead of Simple Averages
Suppose that a large city is considering the reallocation of its traffic police to better address the incidence of driving while impaired, which the city believes has been increasing. Four weeks ago, new legislation went into effect, legalizing the possession and recreational use of marijuana. Since then, the daily number of traffic arrests for DWI seems to be trending up. Complicating matters is the fact that the number of arrests appears to spike on Fridays and Saturdays. To help plan for manpower requirements into the future, you’d like to forecast any underlying trend that’s being established. You’d also like to time the deployment of your resources to take account of any weekend-related seasonality that’s taking place.
Figure 5.9 has the relevant data you have to work with.
Figure 5.9 With this data set, each day of the week constitutes a season.
Even by just eyeballing the chart in Figure 5.9, you can tell that the trend of the number of daily arrests is up. You’ll have to plan to expand the number of traffic officers, and hope that the trend levels off soon. Further, the data bears out the notion that more arrests occur routinely on Fridays and Saturdays, so your resource allocation needs to address those spikes.
But you need to quantify the underlying trend, to determine how many additional police you’ll have to bring on. You also need to quantify the expected size of the weekend spikes, to determine how many additional police you need watching for erratic drivers on those days.
The problem is that as of yet you don’t know how much of the daily increase is due to trend and how much is due to that weekend effect. You can start off by detrending the time series. Earlier in this chapter, in “Simple Seasonal Averages,” you saw an example of how to detrend a time series in order to isolate the seasonal effects using the method of simple averages. In this section you’ll see how to do so using moving averages—very likely, the moving-averages approach is used more often in predictive analytics than is the simple-averages approach.
There are various reasons for the greater popularity of moving averages, among them, that the moving-averages approach does not ask you to collapse your data in the process of quantifying a trend. Recall that the earlier example made it necessary to collapse quarterly averages to annual averages, calculate an annual trend, and then distribute one-fourth of the annual trend across each quarter in the year. That step was needed in order to remove trend from the seasonal effects. In contrast, the moving-averages approach enables you to detrend the time series without resorting to that sort of machination.
Figure 5.10 shows how the moving-averages approach works in the present example.
Figure 5.10 The moving average in the second chart clarifies the underlying trend.
Figure 5.10 adds a moving average column, and a column for specific seasonals, to the data set in Figure 5.9. Both additions require some discussion.
The spikes in arrests that take place on weekends gives you reason to believe that you’re working with seasons that repeat once each week. Therefore, start by getting the average for the encompassing period—that is, the first seven seasons, Monday through Sunday. The formula for the average in cell D5, the first available moving average, is as follows:
=AVERAGE (C2:C8)
That formula is copied and pasted down through cell D29, so you have 25 moving averages based on 25 runs of seven consecutive days. Notice that in order to show both the first and the last few observations in the time series, I have hidden rows 10 through 17. You can unhide them, if you want, in this chapter’s workbook, available from the publisher’s website. Make a multiple selection of visible rows 9 and 18, right-click one of their row headers, and choose Unhide from the shortcut menu.
Because each moving average in Figure 5.10 encompasses seven days, no moving average is paired with the first three or final three actual observations. Copying and pasting the formula in cell D5 up one day to cell D4 runs you out of observations—there is no observation recorded in cell C1. Similarly, there is no moving average recorded below cell D29. Copying and pasting the formula in D29 into D30 would require an observation in cell C33, and no observation is available for the day that cell would represent.
It would be possible, of course, to shorten the length of the moving average to, say, five instead of seven. So doing would mean that the moving-average formulas in Figure 5.10 could start in cell D4 instead of D5. However, in this sort of analysis, you want the length of the moving average to equal the number of seasons: seven days in a week for events that recur weekly implies a moving average of length seven, and four quarters in a year for events that recur annually implies a moving average of length four.
Along similar lines, we generally quantify seasonal effects in such a way that they total to zero within the encompassing time period. As you saw in this chapter’s first section, on simple averages, this is done by calculating the average of (say) the four quarters in a year, and then subtracting the average for the year from each quarterly figure. So doing ensures that the total of the seasonal effects is zero. In turn, that’s useful because it puts the seasonal effects on a common footing—a summer effect of +11 is as far from the mean as a winter effect of –11. If you want to average five seasons instead of seven to get your moving average, you’re better off finding a phenomenon that repeats every five seasons instead of every seven.
Understanding Specific Seasonals
Figure 5.10 also shows what are called specific seasonals in column E. They are what’s left after subtracting the moving average from the actual observation. To get a sense of what the specific seasonals represent, consider the moving average in cell D5. It is the average of the observations in C2:C8. The deviations of each observation from the moving average (for example, C2 – D5) are guaranteed to sum to zero—that’s a characteristic of an average. Therefore, each deviation expresses the effect of being associated with that particular day in that particular week. It’s a specific seasonal, then—specific because the deviation applies to that particular Monday or Tuesday and so on, and seasonal because in this example we’re treating each day as though it were a season in the encompassing period of a week.
Because each specific seasonal measures the effect of being in that season vis-à-vis the moving average for that group of (here) seven seasons, you can subsequently average the specific seasonals for a particular season (for example, all the Fridays in your time series) to estimate that season’s general, rather than specific, effect. That average is not confounded by an underlying trend in the time series, because each specific seasonal expresses a deviation from its own particular moving average.
Aligning the Moving Averages
There’s also the question of aligning the moving averages with the original data set. In Figure 5.10, I have aligned each moving average with the midpoint of the range of observations that it includes. So, for example, the formula in cell D5 averages the observations in C2:C8, and I have aligned it with the fourth observation, the midpoint of the averaged range, by placing it in row 5. This arrangement is termed a centered moving average, and many analysts prefer to align each moving average with the midpoint of the observations that it averages.
Another approach is the trailing moving average. In that case, each moving average is aligned with the final observation that it averages—and therefore it trails behind its arguments. This is often the preferred arrangement if you want to use a moving average as a forecast, as is done with exponential smoothing, because your final moving average occurs coincident with the final available observation.
Centered Moving Averages with Even Numbers of Seasons
We usually adopt a special procedure when the number of seasons is even rather than odd. That’s the typical state of affairs: There tend to be even numbers of seasons in the encompassing period for typical seasons such as months, quarters, and quadrennial periods (for elections).
The difficulty with an even number of seasons is that there is no midpoint. Two is not the midpoint of a range starting at 1 and ending at 4, and neither is 3; if it can be said to have one, its midpoint is 2.5. Six is not the midpoint of 1 to 12, and neither is 7; its purely theoretical midpoint is 6.5.
To act as though a midpoint exists, you need to add a layer of averaging atop the moving averages. See Figure 5.11.
Figure 5.11 Excel offers several ways to calculate a centered moving average.
The idea behind this approach to getting a moving average that’s centered on an existing midpoint, when there’s an even number of seasons, is to pull that midpoint forward by half a season. You calculate a moving average that would be centered at, say, the third point in time if five seasons instead of four constituted one full turn of the calendar.
That’s done by taking two consecutive moving averages and averaging them. So in Figure 5.11, there’s a moving average in cell E6 that averages the values in D3:D9. Because there are four seasonal values in D3:D9, the moving average in E6 is thought of as centered at the imaginary season 2.5, half a point short of the first available candidate season, 3. (Seasons 1 and 2 are unavailable as midpoints for lack of data to average prior to Season 1.)
Note, though, that the moving average in cell E8 averages the values in D5:D11, the second through the fifth in the time series. That average is centered at (imaginary) point 3.5, a full period ahead of the average centered at 2.5. By averaging the two moving averages, so the thinking goes, you can pull the center point of the first moving average forward by half a point, from 2.5 to 3.
That’s what the averages in column F of Figure 5.11 do. Cell F7 provides the average of the moving averages in E6 and E8. And the average in F7 is aligned with the third data point in the original time series, in cell D7, to emphasize that the average is centered on that season.
If you expand the formula in cell F7 as well as the moving averages in cells E6 and E8, you’ll see that it turns out to be a weighted average of the first five values in the time series, with the first and fifth value given a weight of 1, and the second through fourth values given a weight of 2. That leads us to a quicker and simpler way to calculate a centered moving average with an even number of seasons.
Still in Figure 5.11, the weights are stored in the range H3:H11. This formula returns the first centered moving average, in cell I7:
=SUMPRODUCT(D3:D11,$H$3:$H$11)/8
That formula returns 13.75, which is identical to the value calculated by the double-average formula in cell F7. Making the reference to the weights absolute, by means of the dollar signs in $H$3:$H$11, you can copy the formula and paste it down as far as necessary to get the rest of the centered moving averages.
Detrending the Series with Moving Averages
When you have subtracted the moving averages from the original observations to get the specific seasonals, you have removed the underlying trend from the series. What’s left in the specific seasonals is normally a stationary, horizontal series with two effects that cause the specific seasonals to depart from an absolutely straight line: the seasonal effects and random error in the original observations. Figure 5.12 shows the results for this example.
Figure 5.12 The specific seasonal effects for Friday and Saturday remain clear in the detrended series.
The upper chart in Figure 5.12 shows the original daily observations. Both the general upward trend and the weekend seasonal spikes are clear.
The lower chart shows the specific seasonals: the result of detrending the original series with a moving-average filter, as described earlier in “Understanding Specific Seasonals.” You can see that the detrended series is now virtually horizontal (a linear trendline for the specific seasonals has a slight downward drift), but the seasonal Friday and Saturday spikes are still in place.
The next step is to move beyond the specific seasonals to the seasonal indexes. See Figure 5.13.
Figure 5.13 The specific seasonals effects are first averaged and then normalized to reach the seasonal indexes.
In Figure 5.13, the specific seasonals in column E are rearranged in the tabular form shown in the range H4:N7. The purpose is simply to make it easier to calculate the seasonal averages. Those averages are shown in H11:N11.
However, the figures in H11:N11 are averages, not deviations from an average, and therefore we can’t expect them to sum to zero. We still need to adjust them so that they express deviations from a grand mean. That grand mean appears in cell N13, and is the average of the seasonal averages.
We can arrive at the seasonal indexes by subtracting the grand mean in N13 from each of the seasonal averages. The result is in the range H17:N17. These seasonal indexes are no longer specific to a particular moving average, as is the case with the specific seasonals in column E. Because they’re based on an average of each instance of a given season, they express the average effect of a given season across the four weeks in the time series.
Furthermore, they are measures of a season’s—here, a day’s—effect on traffic arrests vis-à-vis the average for a seven-day period.
We can now use those seasonal indexes to deseasonalize the series. We’ll use the deseasonalized series to get forecasts by way of linear regression or Holt’s method of smoothing trended series (discussed in Chapter 4). Then we simply add the seasonal indexes back into the forecasts to reseasonalize them. All this appears in Figure 5.14.
Figure 5.14 After you have the seasonal indexes, the finishing touches as applied here are the same as in the method of simple averages.
The steps illustrated in Figure 5.14 are largely the same as those in Figures 5.6 and 5.7, discussed in the following sections.
Deseasonalizing the Observations
Subtract the seasonal indexes from the original observations to deseasonalize the data. You can do this as shown in Figure 5.14, in which the original observations and the seasonal indexes are arranged as two lists beginning in the same row, columns C and F. This arrangement makes it a little easier to structure the calculations.
You can also do the subtraction as shown in Figure 5.6, in which the original quarterly observations (C12:F16), the quarterly indexes (C8:F8), and the deseasonalized results (C20:F24) are shown in a tabular format. That arrangement makes it a little easier to focus on the seasonal indexes and the deseasoned quarterlies.
Forecast from the Deseasonalized Observations
In Figure 5.14, the deseasonalized observations are in column H, and in Figure 5.7 they’re in column C. Regardless of whether you want to use a regression approach or a smoothing approach to the forecast, it’s best to arrange the deseasonalized observations in a single-column list.
In Figure 5.14, the forecasts are in column J. The following array formula is entered in the range J2:J32.
=TREND(H2:H32)
Earlier in this chapter, I pointed out that if you omit the x-values argument from the TREND() function’s arguments, Excel supplies the default values 1, 2, ..., n, where n is the number of y-values. In the formula just given, H2:H32 contains 31 y-values. Because the argument normally containing the x-values is missing, Excel supplies the default values 1, 2, ..., 31. Those are the values we would want to use anyway, in column B, so the formula as given is equivalent to =TREND(H2:H32,B2:B32). And that’s the structure used in D5:D24 of Figure 5.7:
=TREND(C5:C24,B5:B24)
Making the One-Step-Ahead Forecast
So far you have arranged for forecasts of the deseasonalized time series from t = 1 through t = 31 in Figure 5.14, and from t = 1 through t = 20 in Figure 5.7. These forecasts constitute useful information for various purposes, including assessing the accuracy of the forecasts by means of an RMSE analysis.
But your main purpose is forecasting at least the next, as yet unobserved time period. To get that, you could first forecast from the TREND() or LINEST() function if you’re using regression, or from the exponential smoothing formula if you’re using Holt’s method. Then you can add the associated seasonal index to the regression or smoothing forecast, to get a forecast that includes both the trend and the seasonal effect.
In Figure 5.14, you get the regression forecast in cell J33 with this formula:
=TREND(H2:H32,,B33)
In this formula, the y-values in H2:H32 are the same as in the other TREND() formulas in column J. So are the (default) x-values of 1 through 32. Now, though, you supply a new x-value as the function’s third argument, which you tell TREND() to look for in cell B33. It’s 32, the next value of t. And Excel returns the value 156.3 in cell J33. The TREND() function in cell J33 is telling Excel, in effect, “Calculate the regression equation for the values in H2:H32 regressed on the t values 1 through 31. Apply that regression equation to the new x-value of 32 and return the result.”
You’ll find the same approach taken in cell D25 of Figure 5.7, where the formula to get the one-step-ahead forecast is this:
=TREND(C5:C24,B5:B24,B25)
Adding the Seasonal Indexes Back In
The final step is to reseasonalize the forecasts by adding the seasonal indexes to the trend forecasts, reversing what you did four steps back when you subtracted the indexes from the original observations. This is done in column F in Figure 5.7 and column K in Figure 5.14.
Don’t forget to add the appropriate seasonal index for the one-step-ahead forecast, with the results shown in cell F25 in Figure 5.7 and in cell K33 in Figure 5.14. (I’ve shaded the one-step-ahead cells in both Figure 5.7 and Figure 5.14 to highlight the forecasts.)
You can find charts of three representations of the traffic arrest data in Figure 5.15: the deseasonalized series, the linear forecast from the deseasonalized data, and the reseasonalized forecasts. Note that the forecasts incorporate both the general trend of the original data and its Friday/Saturday spikes.
Figure 5.15 Charting the forecasts.