Applied Time Series Analysis SS 2014
Dr. Marcel Dettling, Institute for Data Analysis and Process Design

> plot(AirPassengers, main="Passenger Bookings")

The result is displayed on the next page. There are a number of features in the plot which are common to many time series. For example, it is apparent that the number of passengers travelling on the airline is increasing with time. In general, a systematic change in the mean level of a time series that does not appear to be periodic is known as a trend. The simplest model for a trend is a linear increase or decrease, which is often an adequate approximation. We will discuss how to estimate trends, and how to decompose time series into trend and other components, in section 4.3.

The next example deals with the lynx trappings data:

> plot(lynx, main="Lynx Trappings")

The plot on the next page shows that the number of trapped lynx reaches high and low values roughly every 10 years, with even larger peaks roughly every 40 years. To our knowledge, there is no fixed natural period behind these cycles. Thus, we will attribute this behavior not to a deterministic periodicity, but to a random, stochastic one.
[Figure: time series plot "Lynx Trappings"; y-axis "# of Lynx Trapped" (0–6000), x-axis "Time" (1820–1920)]
This leads us to the heart of time series analysis: while understanding and modeling trend and seasonal variation is a very important aspect, much of the time series methodology is aimed at stationary series, i.e. series whose mean, variance and dependency structure do not change over time. As a further example, we consider measurements of the luteinizing hormone in blood samples:

> plot(lh, main="Luteinizing Hormone")
[Figure: time series plot "Luteinizing Hormone"; y-axis "LH level" (1.5–3.5), x-axis "Time" (0–40)]
For this series, given the way the measurements were made (i.e. at 10-minute intervals), we can almost certainly exclude any deterministic seasonal variation. But is there any stochastic cyclic behavior? This question is more difficult to answer. Normally, one resorts to the simpler question of analyzing the correlation of subsequent records, called autocorrelation. The autocorrelation for lag 1 can be visualized by producing a scatterplot of adjacent observations:
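The lag-1 autocorrelation behind such a scatterplot can also be computed directly. The sketch below uses the built-in lh dataset (48 observations) and compares the plain correlation of adjacent observations with the value that acf() reports at lag 1; the two differ slightly because acf() uses the overall mean and normalizes by n.

```r
## Correlation of adjacent observations of the lh series
r1 <- cor(lh[1:47], lh[2:48])

## Lag-1 value from the sample autocorrelation function
r1.acf <- acf(lh, plot=FALSE)$acf[2]
```

Both numbers are clearly positive, confirming the visual impression of dependency between subsequent records.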
ATSA
1 Introduction
> plot(lh[1:47], lh[2:48], pch=20)
> title("Scatterplot of LH with Lag 1")

The next example deals with the SMI daily closing values. Because subsetting from a multiple time series object results in a vector, but not a time series object, we need to regenerate the latter, sharing the arguments of the original series. In the plot we clearly observe that the series has a trend, i.e. the mean is obviously non-constant over time. This is typical for financial time series.
[Figure: time series plot "SMI Daily Closing Value"; y-axis "smi" (2000–8000), x-axis "Time" (1992–1998)]
Such trends in financial time series are nearly impossible to predict, and difficult to characterize mathematically. We will not embark on this, but instead analyze the so-called log-returns, i.e. the logarithm of today’s value divided by yesterday’s:
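In R, log-returns are obtained with diff(log(...)). The toy numbers below are illustrative, not SMI values; they show that the log-return is a close approximation to the relative change when the changes are small.

```r
## Log-returns of a (hypothetical) price series
p <- c(100, 101, 99.5, 102)
lret <- diff(log(p))

## Compare with the exact relative changes
rel <- diff(p) / p[-length(p)]
round(cbind(lret, rel), 5)
```

For a 1% price move, log-return and relative change agree to about four decimal places; the approximation deteriorates only for large moves.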
> lret.smi <- diff(log(smi))
> plot(lret.smi, main="SMI Log-Returns")
[Figure: time series plot "SMI Log-Returns"; y-axis "lret.smi" (-0.08–0.04), x-axis "Time" (1992–1998)]
The SMI log-returns are a close approximation to the relative change (in percent) with respect to the previous day. As can be seen above, they no longer exhibit a trend, but show some of the stylized facts that most log-returns of financial time series share. Using lagged scatterplots or the correlogram (to be discussed later in section 4.4), you can convince yourself that there is no serial correlation. Thus, there is no direct dependency which could be exploited to predict tomorrow’s return based on the one of today and/or previous days. However, it is visible that large changes, i.e. log-returns with high absolute values, tend to be followed by further large changes. This feature is known as volatility clustering, and financial service providers are trying their best to exploit this property to make profit. Again, you can convince yourself of the volatility clustering effect by taking the squared log-returns and analyzing their serial correlation, which is different from zero.
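The volatility clustering effect can be reproduced with a small simulation. The sketch below generates an ARCH(1)-type series (a simple model with this property; the parameter values are illustrative, not estimates from the SMI data): the series itself is serially uncorrelated, while its squares are not.

```r
set.seed(42)
n <- 2000
eps <- rnorm(n)
x <- numeric(n)
sig2 <- numeric(n)
sig2[1] <- 1; x[1] <- eps[1]
for (t in 2:n) {
  sig2[t] <- 0.2 + 0.5 * x[t-1]^2    # conditional variance
  x[t] <- sqrt(sig2[t]) * eps[t]
}

a.x  <- acf(x,   plot=FALSE)$acf[2]  # lag-1 ACF of the series
a.x2 <- acf(x^2, plot=FALSE)$acf[2]  # lag-1 ACF of the squares
```

The lag-1 autocorrelation of the series is close to zero, whereas that of the squared series is clearly positive.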
1.3
Goals in Time Series Analysis
A first impression of the purpose and goals in time series analysis could be gained from the previous examples. We conclude this introductory section by explicitly summarizing the most important goals.
1.3.1
Exploratory Analysis
Exploratory analysis for time series mainly involves visualization with time series plots, decomposition of the series into deterministic and stochastic parts, and studying the dependency structure in the data. As an example, the plot below shows the yearly number of days with traffic holdups.
3 Time Series in R
[Figure: time series plot "Traffic Holdups"; y-axis "# of Days" (80–140), x-axis "Time" (2004–2010)]
3.1.2
Other Classes
Besides the basic ts class, there are several more which offer a variety of additional options, but they will rarely if ever be required during our course. Most prominent is the zoo package, which provides infrastructure for both regularly and irregularly spaced time series using arbitrary classes for the time stamps. It is designed to be as consistent as possible with the ts class, and coercion from and to zoo is readily available. Some further packages which contain classes and methods for time series include xts, its, tseries, fts, timeSeries and tis. Additional information on their content and philosophy can be found on CRAN.
3.2
Dates and Times in R
While for the ts class, the handling of times has been solved very simply by enumeration, doing time series analysis in R may sometimes also require working explicitly with dates and times. There are several options for dealing with date and date/time information in R:

> as.Date("2012-01-27")
[1] "2012-01-27"
> as.Date("14. Februar, 2012", format="%d. %B, %Y")
[1] "2012-02-14"

Internally, Date objects are stored as the number of days passed since the 1st of January 1970. Earlier dates receive negative numbers. By using the as.numeric() function, we can easily find out how many days have passed since the reference date. Back-conversion from a number of days to a date is straightforward as well:

> mydat <- as.Date("14. Februar, 2012", format="%d. %B, %Y")
> ndays <- as.numeric(mydat)
> ndays
[1] 15384
> tdays <- as.Date(10000, origin="1970-01-01")
> class(tdays)
[1] "Date"
> tdays
[1] "1997-05-19"

A very useful feature is the possibility of extracting weekdays, months and quarters from Date objects, see the examples below. This information can be converted to factors. In this form, it serves for purposes such as visualization, decomposition, or time series regression.

> weekdays(mydat)
[1] "Dienstag"
> months(mydat)
[1] "Februar"
> quarters(mydat)
[1] "Q1"

Furthermore, some very useful summary statistics can be generated from Date objects: median, mean, min, max, range, ... are all available. We can even subtract two dates, which results in a difftime object, i.e. the time difference in days.

> dat <- as.Date(c("2000-01-01", "2004-04-04", "2007-08-09"))
> dat
[1] "2000-01-01" "2004-04-04" "2007-08-09"
> min(dat)
[1] "2000-01-01"
> max(dat)
[1] "2007-08-09"
> mean(dat)
[1] "2003-12-15"
> median(dat)
[1] "2004-04-04"
> dat[3]-dat[1]
Time difference of 2777 days

Another option is generating time sequences. For example, to generate a vector of 12 dates, starting on August 3, 1985, with an interval of one single day between them, we simply type:

> seq(as.Date("1985-08-03"), by="days", length=12)
 [1] "1985-08-03" "1985-08-04" "1985-08-05" "1985-08-06"
 [5] "1985-08-07" "1985-08-08" "1985-08-09" "1985-08-10"
 [9] "1985-08-11" "1985-08-12" "1985-08-13" "1985-08-14"

The by argument proves to be very useful. We can supply various units of time, and even place an integer in front of them. This allows creating a sequence of dates separated by two weeks:
> seq(as.Date("1992-04-17"), by="2 weeks", length=12) [1] "1992-04-17" "1992-05-01" "1992-05-15" "1992-05-29" [5] "1992-06-12" "1992-06-26" "1992-07-10" "1992-07-24" [9] "1992-08-07" "1992-08-21" "1992-09-04" "1992-09-18"
3.2.2
The chron Package
The chron() function converts dates and times to chron objects. The dates and times are provided separately to the chron() function, which may well require some initial pre-processing. For such parsing, R functions such as substr() and strsplit() can be of great use. In the chron package, there is no support for time zones and daylight saving time, and chron objects are internally stored as fractional days since the reference date of January 1st, 1970. By using the function as.numeric(), these internal values can be accessed. The following example illustrates the use of chron, where dts, tme and fmt hold the date strings, the time strings and the format specification:

> library(chron)
> cdt <- chron(dates=dts, times=tme, format=fmt)
> cdt
[1] (07-06-09 16:43:20) (07-08-29 07:22:40)
[3] (07-10-21 16:48:40) (07-12-17 11:18:50)

As before, we can again use the entire palette of summary statistic functions. Of special interest are time differences, which can now be obtained as either a fraction of days, or in weeks, hours, minutes, seconds, etc.:

> cdt[2]-cdt[1]
Time in days:
[1] 80.61065
> difftime(cdt[2], cdt[1], units="secs")
Time difference of 6964760 secs
3.2.3
POSIX Classes
The two classes POSIXct and POSIXlt implement date/time information, and in contrast to the chron package, also support time zones and daylight savings time. We recommend utilizing this functionality only when urgently needed, because the handling requires quite some care, and may on top of that be system dependent. Further details on the use of the POSIX classes can be found on CRAN. As explained above, the POSIXct class also stores dates/times with respect to the internal reference, whereas the POSIXlt class stores them as a list of components (hour, min, sec, mon, etc.), making it easy to extract these parts.
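A minimal illustration of the two classes (the date/time value is arbitrary): POSIXct stores the number of seconds since the reference, while POSIXlt exposes the components, with the month counted from 0 and the year from 1900.

```r
ct <- as.POSIXct("2012-02-14 16:30:00", tz="UTC")
as.numeric(ct)        # seconds since 1970-01-01 00:00:00 UTC

lt <- as.POSIXlt(ct)
lt$hour               # 16
lt$mon                # 1, i.e. February (months start at 0)
lt$year               # 112, i.e. 2012 (years since 1900)
```

The component access on the POSIXlt object is what makes it convenient for extracting parts of a timestamp, at the price of a more complex internal representation.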
The next example deals with parametric trend estimation for the Maine unemployment series. Polynomial trends were fitted and superimposed on the data; for the order-4 fit, with time vector tr and adjusted trend estimate t.04.adj, the overlay is produced with:

> lines(tr, t.04.adj)

The time series plot below is enhanced with polynomial trend lines of order 4 (blue), 5 (red) and 6 (green). From this visualization, it is hard to decide which of the polynomials is most appropriate as a trend estimate. Because there are some boundary effects for orders 5 and 6, we might guess that their additional flexibility is not required. As we will see below, this conclusion is treacherous.
[Figure: time series plot "Unemployment in Maine"; y-axis "(%)" (3–6), x-axis "Time" (1996–2006); legend: O(4), O(5), O(6) trend lines]
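The mechanics of such polynomial trend fits can be reproduced with lm() and poly(). The series below is synthetic (the Maine data are not reconstructed here); the code only sketches fitting and overlaying trends of order 4, 5 and 6.

```r
set.seed(1)
tr  <- seq(1996, 2006, length=128)                      # time vector
tsd <- 5 + sin(2*pi*(tr-1996)/8) + rnorm(128, sd=0.3)   # synthetic series

fit4 <- lm(tsd ~ poly(tr, 4))
fit5 <- lm(tsd ~ poly(tr, 5))
fit6 <- lm(tsd ~ poly(tr, 6))

plot(tr, tsd, type="l", xlab="Time", ylab="(%)")
lines(tr, fitted(fit4), col="blue")
lines(tr, fitted(fit5), col="red")
lines(tr, fitted(fit6), col="green")
```

Using poly() rather than raw powers of tr gives orthogonal polynomials, which keeps the fits numerically stable for higher orders.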
A better way for judging the fit of a parametric model is residual analysis. We plot the remainder term Ê_t versus time and add a LOESS smoother. The following bit of code shows this for the order-4 polynomial:

> re.est <- tsd - t.04.adj
> plot(re.est, ylab="Residuals")
> lines(loess.smooth(tr, re.est), col="red")
[Figure: time series plot of the residuals with LOESS smoother; x-axis "Time" (0–500)]
5 Stationary Time Series Models
The theoretical ACF and PACF of the candidate model (acf.true, pacf.true) can be computed with ARMAacf(). We then fit an MA(1) to the differenced AT&T bond data:

> arima(diff(attbond), order=c(0,0,1))

Call:
arima(x = diff(attbond), order = c(0, 0, 1))

Coefficients:
          ma1  intercept
      -0.2865    -0.0247
s.e.   0.0671     0.0426

sigma^2 estimated as 0.6795:  log likelihood = -234.16,  aic = 474.31

Please note that the application of the three estimation procedures here was just for illustrative purposes, and to show the (slight) differences that manifest themselves when different estimators are employed. In any practical work, you can easily restrict yourself to the application of the arima() procedure with the default fitting by method="CSS-ML". For verifying the quality of the fit, a residual analysis is mandatory. The residuals of the MA(1) are estimates of the innovations E_t. The model can be seen as adequate if the residuals reflect the main properties of the innovations: they should be stationary and free of any dependency, as well as approximately Gaussian. We can verify this by producing a time series plot of the residuals, along with their ACF and PACF, and a Normal QQ-plot. Sometimes, it is also instructive to plot the fitted values into the original series.

With method="CSS-ML", the CSS estimates are only used as starting values for a MLE. If Gaussian innovations are assumed, then any ARMA(p,q) process vector has a multivariate normal distribution:

X = (X_1, ..., X_n) ~ N(0, V), resp. Y = (Y_1, ..., Y_n) ~ N(m·1, V).

MLE then relies on determining the parameters m (if a shifted ARMA(p,q) is estimated), φ_1, ..., φ_p; θ_1, ..., θ_q and σ_E² simultaneously, by maximizing the probability density function of the above multivariate Gaussian.

The next example deals with the monthly price for a crude oil barrel, which we analyze on the log scale (series lop):

> plot(lop, ylab="log(Price)")
> title("Logged Monthly Price for a Crude Oil Barrel")
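The difference between pure CSS and the default CSS-ML fitting can be explored on simulated data. The sketch below simulates an MA(1) (the coefficient is chosen arbitrarily, not taken from the bond data) and fits it with both methods; the estimates typically agree closely.

```r
set.seed(21)
x <- arima.sim(model=list(ma=-0.3), n=200)

fit.ml  <- arima(x, order=c(0,0,1))                 # default: method="CSS-ML"
fit.css <- arima(x, order=c(0,0,1), method="CSS")

coef(fit.ml)
coef(fit.css)
```

For series of moderate length, the two fits usually differ only in the third decimal of the coefficients, which is why the default can safely be used in practice.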
6 SARIMA and GARCH Models
[Figure: time series plot "Logged Monthly Price for a Crude Oil Barrel"; y-axis "log(Price)" (2.5–4.0), x-axis "Time" (1990–2005)]
The series does not exhibit any apparent seasonality, but there is a clear trend, so that it is non-stationary. We try first-order (i.e. d = 1) differencing, and then check whether the result is stationary.

> dlop <- diff(lop)
> plot(dlop, ylab="Differences")
> title("Differences of Logged Monthly Crude Oil Prices")
[Figure: time series plot "Differences of Logged Monthly Crude Oil Prices"; y-axis "Differences" (-0.4–0.4), x-axis "Time" (1990–2005)]
The trend was successfully removed by taking differences. ACF and PACF show that the result is serially correlated. There may be a drop-off in the ACF at lag 1, and in the PACF at either lag 1 or 2, suggesting an ARIMA(1,1,1) or an ARIMA(2,1,1) for the logged oil prices. However, the best performing model is, somewhat surprisingly, an ARIMA(1,1,2), for which we show the results.
> par(mfrow=c(1,2))
> acf(dlop, main="ACF", ylim=c(-1,1), lag.max=24)
> pacf(dlop, main="PACF", ylim=c(-1,1), lag.max=24)
[Figure: correlograms "ACF" and "PACF" of the differenced logged oil prices; y-axes on (-1,1), x-axis "Lag" (0–2 years)]
The fitting can be done with the arima() procedure that (by default) estimates the coefficients using Maximum Likelihood with starting values obtained from the Conditional Sum of Squares method. We can either let the procedure do the differencing:

> arima(lop, order=c(1,1,2))

Call:
arima(x = lop, order = c(1, 1, 2))

Coefficients:
         ar1      ma1      ma2
      0.8429  -0.5730  -0.3104
s.e.  0.1548   0.1594   0.0675

sigma^2 estimated as 0.006598:  log likelihood = 261.88,  aic = -515.75
Or, we can use the differenced series dlop as input and fit an ARMA(1,2). However, we need to tell R not to include an intercept – this is not necessary when the trend was removed by taking differences. The command is:

> arima(dlop, order=c(1,0,2), include.mean=FALSE)

The output from this is exactly the same as above. The next step is to perform residual analysis – if the model is appropriate, the residuals must look like White Noise.

The next example deals with a seasonal series, Australian beer production. After taking logarithms and applying both a seasonal difference at lag 12 and an ordinary difference, we inspect the correlograms of the remainder d.d12.lbeer:

> par(mfrow=c(1,2))
> acf(d.d12.lbeer, ylim=c(-1,1), main="ACF")
> pacf(d.d12.lbeer, ylim=c(-1,1), main="PACF")
[Figure: correlograms "ACF" and "PACF" of d.d12.lbeer; y-axes on (-1,1), x-axis "Lag" (0–2 years)]
There is very clear evidence that series Z_t is serially dependent, and we could try an ARMA(p,q) to model this dependence. As for the choice of the order, this is not simple on the basis of the above correlograms. They suggest that high values for p and q are required, and model fitting with subsequent residual analysis and AIC inspection confirm this: p = 14 and q = 11 yield a good result.

The next example deals with monthly global temperature anomalies; the series is displayed below.
7 Time Series Regression
[Figure: time series plot "Global Temperature Anomalies"; y-axis "anomaly" (-0.4–0.8), x-axis "Time" (1970–2005)]
There is a clear trend which seems to be linear. Despite the monthly measurements, seasonality is hard to discern by eye; a seasonal factor will nevertheless be included in the regression model below.
[Figure: pairs plot of the air pollution data with the variables Wind, Temp and Oxidant]
[Figure: plot "Residuals of the lm() Function"; y-axis "resid(fit.lm)" (-0.2–0.4), x-axis "dat$time" (1970–2005)]
It is fairly obvious from the time series plot that the residuals are correlated. However, our main tools for describing the dependency structure are the ACF and PACF plots:

> par(mfrow=c(1,2))
> acf(resid(fit.lm), main="ACF of Residuals")
> pacf(resid(fit.lm), main="PACF of Residuals")
[Figure: correlograms "ACF of Residuals" and "PACF of Residuals"; y-axes on (-1,1), x-axis "Lag" (0–25)]
The ACF shows a rather slow exponential decay, whereas the PACF shows a clear cut-off at lag 2. With these stylized facts, it might well be that an AR(2)
model is a good description for the dependency among the residuals. We verify this by fitting an AR(2) to the residuals, e.g.

> fit.ar2 <- ar(resid(fit.lm), order.max=2, aic=FALSE)

and then re-fit the regression by GLS with an AR(2) correlation structure (see below). The GLS residuals are inspected in the same manner:

> par(mfrow=c(1,2))
> acf(resid(fit.gls), main="ACF of GLS-Residuals")
> pacf(resid(fit.gls), main="PACF of GLS-Residuals")
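Whether an AR(2) is a sensible description can also be checked by letting ar() choose the order via AIC. The sketch below does this for simulated AR(2) data (coefficients are illustrative) rather than the actual regression residuals:

```r
set.seed(3)
r <- arima.sim(model=list(ar=c(0.5, 0.3)), n=500)

fit <- ar(r, order.max=8)   # AIC-based order selection
fit$order                   # the selected order
fit$ar                      # the estimated AR coefficients
```

With a clear cut-off in the PACF at lag 2, the AIC-based choice and the visual diagnosis will usually agree.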
[Figure: correlograms "ACF of GLS-Residuals" and "PACF of GLS-Residuals"; y-axes on (-1,1), x-axis "Lag" (0–25)]
The plots look similar to the ACF/PACF plots of the OLS residuals. This is often the case in practice; only in more complex situations can there be a bigger discrepancy. And because we observe an exponential decay in the ACF, and a clear cut-off at lag 2 in the PACF, we conjecture that the GLS residuals meet the properties of the correlation structure we hypothesized, i.e. of an AR(2) model. Thus, we can now use the GLS fit for drawing inference. We first compare the OLS and GLS point estimates for the trend, along with their confidence intervals:

> coef(fit.lm)["time"]
      time
0.01822374
> confint(fit.lm, "time")
          2.5 %    97.5 %
time 0.01702668 0.0194208
> coef(fit.gls)["time"]
      time
0.02017553
> confint(fit.gls, "time")
          2.5 %     97.5 %
time 0.01562994 0.02472112

We obtain a temperature increase of 0.0182 degrees per year with the OLS, and of 0.0202 with the GLS. While this may seem academic, the difference between the point estimates is around 10%, and theory tells us that the GLS result is more reliable. Moreover, the length of the confidence interval is 0.0024 with the OLS,
and 0.0091 with the GLS, i.e. almost four times as large. Thus, without accounting for the dependency among the errors, the precision of the trend estimate is by far overestimated. Nevertheless, the confidence interval obtained from GLS regression also does not contain the value 0, and thus, the null hypothesis of no global warming trend is rejected (with margin!).

Finally, we investigate the significance of the seasonal effect. Because this is a factor variable, i.e. a set of dummy variables, we cannot just produce a confidence interval. Instead, we have to rely on a significance test, i.e. a partial F-test. Again, we compare what is obtained from OLS and GLS:

> drop1(fit.lm, test="F")
Single term deletions

Model: temp ~ time + season
       Df Sum of Sq     RSS     AIC  F value Pr(F)
<none>                6.4654 -1727.0
time    1   14.2274 20.6928 -1240.4 895.6210

> anova(fit.gls)
Denom. DF: 407
            numDF  F-value p-value
(Intercept)     1 78.40801

The GLS fit itself had been obtained with the gls() procedure from the nlme package, using the AR(2) correlation structure:

> library(nlme)
> corStruct <- corARMA(form=~time, p=2)
> model <- formula(temp ~ time + season)
> fit.gls <- gls(model, data=dat, correlation=corStruct)
> plot(fit)
> lines(predict(fit, n.ahead=29), col="blue", lty=3)
[Figure: plot "Holt-Winters filtering"; y-axis "Observed / Fitted" (4.5–6.5), x-axis "Time" (1980–1995)]
8 Forecasting
It is also very instructive to plot how level, trend and seasonality evolved over time. This can be done very simply in R: > plot(fit$fitted, main="Holt-Winters-Fit")
[Figure: plot "Holt-Winters-Fit" with panels xhat, level, trend and season versus Time (1985–1995)]
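The three panels evolve according to the additive Holt-Winters recursions, which HoltWinters() applies internally with optimized smoothing parameters alpha, beta and gamma. A base-R sketch of one update step (the function name and the toy values are illustrative, not taken from the wine example):

```r
## One update step of the additive Holt-Winters recursions:
## a = level, b = trend, s = vector of the p most recent seasonal
## terms (s[1] is the one applying to the current observation x).
hw.step <- function(x, a, b, s, alpha, beta, gamma) {
  a.new <- alpha * (x - s[1]) + (1 - alpha) * (a + b)   # level update
  b.new <- beta * (a.new - a) + (1 - beta) * b          # trend update
  s.new <- c(s[-1], gamma * (x - a.new) + (1 - gamma) * s[1])  # season
  list(a = a.new, b = b.new, s = s.new)
}
```

The 1-step forecast from the current state is then a + b + s[1], i.e. level plus trend plus the applicable seasonal term.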
Since we are usually more interested in the prediction on the original scale, i.e. in liters rather than log-liters of wine, we just re-exponentiate the values. Please note that the result is an estimate of the median rather than the mean of the series. There are methods for correction, but the difference is usually only small.

> plot(aww, xlim=c(1980, 1998))
> lines(exp(fit$fitted[,1]), col="red")
> lines(exp(predict(fit, n.ahead=29)), col="blue", lty=3)
[Figure: plot "Holt-Winters-Forecast for the Original Series"; y-axis "aww" (100–600), x-axis "Time" (1980–1995)]
Also, we note that the (in-sample) 1-step prediction error is equal to 50.04, which is quite a reduction when compared to the series’ standard deviation of 121.4. Thus, the Holt-Winters fit has substantial explanatory power. Of course, it would now be interesting to test the accuracy of the predictions. We recommend that you, as an exercise, put aside the last 24 observations of the Australian white wine series, fit the model on the shortened data, and compare the forecasts to the observed values.

A later chapter deals with spectral analysis; there, the raw periodogram of a wave tank series is obtained with:

> spec.pgram(wave, log="no", type="h")
Kalman filtering in R requires specifying the state space model first. We need to supply the argument y, which stands for the observed time series. After running the filter, the state estimates can be visualized:

> plot(fit$m[,1], type="l", xlab="Time", ylab="")
> title("Kalman Filtered Intercept")
> plot(fit$m[,2], type="l", xlab="Time", ylab="")
> title("Kalman Filtered Slope")
11 State Space Models
[Figure: plots "Kalman Filtered Intercept" (values around 4.0–5.2) and "Kalman Filtered Slope" (values around -1–2) versus Time (0–30)]
The plots show the Kalman filter output for intercept and slope. The estimates pick up the true values very quickly, even after the change in the regime. It is worth noting that in this example, we had a very clear signal with relatively little noise, and we favored recovering the truth by specifying the state space formulation with the true error variances, which are generally unknown in practice.

Example: Batmobile

We here consider another regression problem where time-varying coefficients may be necessary. The description of the practical situation is as follows: in April 1979, the Albuquerque Police Department began a special enforcement program aimed at countering driving-while-intoxicated accidents. The program was composed of a squad of police officers, breath alcohol testing (BAT) devices, and a van named batmobile, which housed a BAT device and was used as a mobile station. A multiple linear regression of the quarterly accident numbers on fuel consumption and seasonal dummy variables was fitted, and we inspect its residuals:
The time series plot of the residuals shows very clear evidence of temporal dependence. In contrast to what the regression model with constant coefficients suggests, the level of accidents seems to rise in the control period, and then drops markedly after the BAT program was introduced. The conclusion is that our above regression model is not adequate and needs to be enhanced. However, just adding an indicator variable that codes for the times before and after the introduction of the BAT program will not solve the issue: the level of the residuals is evidently not constant before the program started, and it does not suddenly drop to a constant lower level thereafter.
[Figure: plot "Time Plot of Residuals"; y-axis "ts(resid(fit))" (-60–40), x-axis "Time" (0–50)]
The alternative is to formulate a state space model and estimate it with the Kalman filter. We (conceptually) assume that all regression parameters are time dependent, and rewrite the model as:

ACC_t = L_t + β_{1t}·FUEL_t + β_{2t}·1_{[Q2(t)]} + β_{3t}·1_{[Q3(t)]} + β_{4t}·1_{[Q4(t)]} + E_t
Our main interest lies in the estimation of the modified intercept term L_t, which we now call the level. We expect it to drop after the introduction of the BAT program, but let’s see if this materializes. The state vector X_t we are using contains the regression coefficients, and the state equation which describes their evolution over time is as follows:
Page 170
ATSA
11 State Space Models
X_t = G·X_{t-1} + W_t,  where  X_t = (L_t, β_{1t}, β_{2t}, β_{3t}, β_{4t})^T,  G = I_5 (the 5×5 identity matrix), and W_t ~ N(0, W) with

W = diag(10, 0, 0, 0, 0),

i.e. the only non-zero entry of the innovation covariance matrix is the variance of the level L_t.
As we can see, we only allow for some small, random perturbations in the level L_t, but not in the other regression coefficients.
The observation equation then describes the regression problem, i.e.

Y_t = F_t·X_t + V_t,  where  Y_t = ACC_t,  F_t = (1, FUEL_t, 1_{[Q2(t)]}, 1_{[Q3(t)]}, 1_{[Q4(t)]}),  V_t ~ N(0, σ_V²)

The variance of the noise term, σ_V², is set to 600, which is the error variance in the canonical regression fit. Also the starting values for Kalman filtering, as well as the variances of these initial states, are taken from there. Hence, the code for the state space formulation, as well as for Kalman filtering, is as follows:
y.mat
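The original code listing is truncated at this point. As a self-contained illustration of what the filter computes for the model above (random-walk states, G equal to the identity), the recursions can be sketched in base R; all names (kalman.filter, Fmat, ...) are hypothetical and not taken from the course script, which uses a dedicated state space formulation instead.

```r
## Kalman filter sketch: states follow a random walk (G = identity),
## observation y[t] = Fmat[t, ] %*% x[t] + noise with variance v2.
kalman.filter <- function(y, Fmat, W, v2, m0, C0) {
  n <- length(y); p <- length(m0)
  m <- matrix(0, n, p)                  # filtered state means
  mm <- matrix(m0, p, 1); C <- C0
  for (t in 1:n) {
    Ft <- Fmat[t, , drop=FALSE]         # 1 x p observation matrix
    R  <- C + W                         # prior state variance
    f  <- Ft %*% mm                     # 1-step forecast of y[t]
    Q  <- Ft %*% R %*% t(Ft) + v2       # forecast variance
    K  <- R %*% t(Ft) / as.numeric(Q)   # Kalman gain (p x 1)
    mm <- mm + K %*% (y[t] - f)         # filtered state mean
    C  <- R - K %*% Ft %*% R            # filtered state variance
    m[t, ] <- mm
  }
  m
}

## Small check on simulated data with constant true coefficients
set.seed(7)
n <- 50
Fmat <- cbind(1, rnorm(n))
y <- Fmat %*% c(2, -1) + rnorm(n, sd=0.1)
out <- kalman.filter(y, Fmat, W=diag(1e-6, 2), v2=0.01,
                     m0=c(0, 0), C0=diag(10, 2))
```

With a tiny state innovation variance W, the filter behaves like a recursively computed regression, and the last row of out approaches the true coefficients; allowing a non-zero innovation variance for the level, as in the batmobile example, lets that coefficient drift over time.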