Abstract

In this report we pursue two goals: (1) explain how the gold fixing price varies, and (2) forecast future gold fixing prices. For the first goal, we build a regression model with autocorrelated errors that explains the log-transformed gold fixing price using the log-transformed volatility index and its quadratic term, with the errors following SARIMA(0,1,0)(0,0,1)_6. For the second goal, we build a SARIMA(1,1,0)(1,0,0)_6 model and provide five-step-ahead forecasts with standard errors.

Introduction

The gold price is a good leading indicator of the current state of the economy and of where the economy is heading. The ability to explain and predict the gold price can aid governments in planning fiscal and monetary policy, and the private sector in planning investments.

We often observe a positive relationship between the price of gold and the rate of inflation: when inflation rises, the gold price tends to rise as well. This stems from the fact that gold is one of the most basic and natural stores of value. Currently, due to the unprecedented economic hardship brought on by Covid-19, many quantitative easing policies have been put in place, and in turn inflation has been rising for a prolonged period. As a result, gold prices have risen, albeit along a volatile path. Questions are frequently raised about whether inflation is running out of control and whether quantitative easing should continue. In this report we therefore try to explain and predict the gold fixing price, with the intention that the results may also help explain how inflation will vary.

The price of gold used in this analysis is “The London Bullion Market Association (LBMA) Gold Price”. In essence, this is the benchmark for gold prices around the world. The LBMA operates the largest OTC gold market, and its participants include the dominant banks around the world; prices set here have ripple effects on various smaller gold markets. The price is recorded in U.S. dollars per troy ounce and is set twice per day (10:30 am and 3:00 pm London time). We use the daily 10:30 am price, and the series has over 1,000 data points. The data are recorded by ICE Benchmark Administration and provided by the Federal Reserve Bank of St. Louis.

The predictor variable we choose is the gold ETF volatility index. The gold price itself is influenced by many competing factors, ranging from tangible economic activities, such as the making of ornamental objects and industrial components, to intangible activities, such as hedging against inflation. Attempting to incorporate as many predictors as possible would often create multicollinearity issues. The gold ETF volatility index measures the volatility of an exchange-traded fund designed to closely track the price performance and yield of gold, so it can be thought of as a pool of information directly related to the gold price. In a sense, this single predictor captures many aspects of the variation in the gold price without the risk of multicollinearity or of building too complicated a model.

Analysis

Generalized Least Squares

# packages: dplyr/tidyr for data wrangling, astsa for the time series tools used throughout
library(dplyr)
library(tidyr)
library(astsa)

# data preparation: read the gold fixing price and the volatility index,
# join them by date, drop missing values (coded "."), and take logs
Gold<- read.table("C:\\Users\\absur\\Desktop\\Pis\\[2] UIUC\\[1] Courses\\Statistics 429\\Final Project\\GOLDAMGBD228NLBM.csv"
                   , header = T, sep = ",")
Volatility<- read.table("C:\\Users\\absur\\Desktop\\Pis\\[2] UIUC\\[1] Courses\\Statistics 429\\Final Project\\GVZCLS.csv"
                        , header = T, sep = ",")
data1<- left_join(Gold,Volatility,by = "DATE")
data1<- data1 %>% rename(gold=GOLDAMGBD228NLBM,volatility=GVZCLS)
data1<- data1 %>% filter(gold != ".", volatility!=".")
data1<- as_tibble(data1)
data1<- data1 %>% mutate_at(c("gold","volatility"),as.numeric)
data1<- data1 %>% mutate(gold_log=log(gold),vol_log=log(volatility))
data1<- data1 %>% separate(DATE, into=c("Year","Month","Date"),sep="-")
counting<- data1 %>% group_by(Year) %>% summarize(n=n())
# build a decimal time stamp for each observation; the divisor in each year is
# the number of matched trading days that year (from `counting`)
time.2016<- seq(from=2016, to=2017,by=1/246)[222:246]
time.2017<- seq(from=2017,to=2018,by=1/246)[1:246]
time.2018<- seq(from=2018,to=2019,by=1/247)[1:247]
time.2019<- seq(from=2019,to=2020,by=1/247)[1:247]
time.2020<- seq(from=2020,to=2021,by=1/248)[1:248]
time.2021<- seq(from=2021,to=2022,by=1/246)[1:223]
time.stamp<- c(time.2016,time.2017,time.2018,time.2019,time.2020,time.2021)
data1<- data1 %>% mutate(time=time.stamp)

1. Preliminary Data Analysis

(1). The time series plot below shows a clear upward trend. Visual inspection and numerical evidence confirm that the variance differs before and after 2020 (8397 and 11907, respectively). This unequal variance can be addressed with a log transformation. More importantly, the log transformation makes sense here because we are interested in the relative growth and decline of the gold price, not necessarily the absolute price. We therefore proceed with the log-transformed gold price.

par(mfrow=c(1,1))
tsplot(data1$time,data1$gold,type = "l",xlab = "Time",ylab = "",col=astsa.col(4, .8),gg=TRUE,lwd=1,main = "Gold Price Time Series")

after.2020<- data1 %>% slice(766:dim(data1)[1])
before.2020<- data1 %>% slice(1:765)

text(2018.5,1800, paste("Variance :" ))
text(2018.5,1700,paste("Before 2020 :",round(var(before.2020$gold))))
text(2018.5,1600,paste("After 2020 :",round(var(after.2020$gold))))


tsplot(data1$time,data1$gold_log,type = "l",xlab = "Time",ylab = "",col=astsa.col(4, .8),gg=TRUE,lwd=1,main = "Log Gold Price Time Series")

text(2018.5,7.5, paste("Variance :" ))
text(2018.5,7.45, paste("Before 2020 :",round(var(before.2020$gold_log),4)))
text(2018.5,7.40, paste("After 2020 :",round(var(after.2020$gold_log),4)))

par(mfrow=c(1,1))

(2). From the plot below left, we can see that this series is highly autocorrelated: the ACF decays very slowly. Overall, this suggests the series is not stationary.

(3). From the plot below right, the log-transformed volatility index has a stronger linear relationship with the log gold price (correlation 0.71) than the untransformed index does (0.64), suggesting it will provide more explanatory power in a regression model. We therefore use the log-transformed volatility index as the predictor variable; since these are economic data, the log transformation is also natural in this context.
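
As a quick numerical check of the two correlations cited above, a minimal sketch (assuming data1 from the preparation chunk):

cor(data1$gold_log, data1$volatility)  # raw volatility index: about 0.64
cor(data1$gold_log, data1$vol_log)     # log volatility index: about 0.71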

(4). We also observe a clear curvilinear relationship, so we should consider a quadratic predictor. Lagging the volatility variable appears to have no effect on the relationship. In addition, the time variable shows a strong linear relationship with the log gold price. In summary, we propose log volatility, squared log volatility, and time as predictor variables.

acf1(data1$gold_log,main = "Dependence of Log Gold Price",gg=TRUE,lwd=2)
##  [1] 1.00 0.99 0.99 0.99 0.99 0.98 0.98 0.98 0.98 0.97 0.97 0.97 0.97 0.96 0.96
## [16] 0.96 0.96 0.95 0.95 0.95 0.95 0.94 0.94 0.94 0.94 0.93 0.93 0.93 0.93 0.93
## [31] 0.92 0.92 0.92 0.92 0.91 0.91 0.91 0.91 0.91 0.90 0.90 0.90 0.90 0.89 0.89
## [46] 0.89
volatility<- data1$volatility
log_gold_price<- data1$gold_log
log_volatility_index<- data1$vol_log
time<- data1$time

lag2.plot(volatility,log_gold_price,max.lag = 0,lwl = 2,gg=TRUE,col=astsa.col(4, .3), pch=20, cex=1)
lag2.plot(log_volatility_index,log_gold_price,max.lag = 0,lwl = 2,gg=TRUE,col=astsa.col(4, .3), pch=20, cex=1)
lag2.plot(time,log_gold_price,max.lag = 0,lwl = 2,gg=TRUE,col=astsa.col(4, .3), pch=20, cex=1)

2. Regression Model

(1). Regression model: log(Gold_t) = B0 + B1*log(volatility_t) + B2*[log(volatility_t)]^2 + B3*t + e_t. The estimated coefficients are all highly significant, and the fitted model explains the variation in the data reasonably well (adjusted R-squared of about 0.90).

fit1<- lm(gold_log~vol_log+I(vol_log^2)+time,data = data1)
fit1 %>% summary() %>% coefficients()
##                   Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)  -160.69442685 2.486088499 -64.63745 0.000000e+00
## vol_log         1.12912794 0.075989921  14.85892 4.607224e-46
## I(vol_log^2)   -0.16712997 0.013461680 -12.41524 1.968359e-33
## time            0.08228955 0.001242939  66.20561 0.000000e+00
results<- fit1 %>% summary() 
paste("Adjusted R Square",":",round(results$adj.r.squared,4))
## [1] "Adjusted R Square : 0.9005"

(2). We check whether the residuals satisfy the normal iid assumptions. The time series plot below shows that they are not white noise and exhibit clear cyclical behavior, and the ACF shows autocorrelation over many lags. Therefore, we should fit a regression with autocorrelated errors.

fit1.residuals<- fit1 %>% resid()
tsplot(fit1.residuals,col=astsa.col(4, .8),gg=TRUE,lwd=1,main = "Residuals",ylab = "",xlab = "Series")
acf1(fit1.residuals,main = "Dependence of Residuals",gg=TRUE,lwd=2)
##  [1] 0.96 0.94 0.92 0.90 0.88 0.85 0.84 0.83 0.82 0.81 0.79 0.78 0.77 0.76 0.75
## [16] 0.74 0.72 0.71 0.70 0.69 0.68 0.67 0.65 0.64 0.63 0.62 0.61 0.60 0.60 0.58
## [31] 0.57 0.56 0.55 0.54 0.53 0.52 0.51 0.50 0.48 0.47 0.46 0.45 0.44 0.43 0.42
## [46] 0.42

(3). The previous ACF plot shows that the residuals are not a stationary series. After differencing, the series below looks stable. From the ACF and PACF of the differenced residuals, the structure might be ARIMA(0,1,1), because the ACF cuts off after lag 1 while the PACF tails off. There also appears to be seasonality at lag 6, so we should consider SARIMA(0,1,1)(0,0,1)_6, since the seasonal ACF cuts off and the seasonal PACF tails off.

fit1.residuals.diff<- fit1.residuals %>% diff()
tsplot(fit1.residuals.diff,col=astsa.col(4, .8),gg=TRUE,lwd=1,main = "Residuals After Differencing",ylab = "",xlab = "Series")
acf2(fit1.residuals.diff,main = "ACF and PACF of the Differenced Residuals")

##       [,1]  [,2]  [,3]  [,4] [,5]  [,6]  [,7]  [,8] [,9] [,10] [,11] [,12]
## ACF  -0.16 -0.01 -0.06 -0.01 0.03 -0.11 -0.02 -0.05 0.04  0.00 -0.02 -0.02
## PACF -0.16 -0.03 -0.06 -0.03 0.02 -0.11 -0.06 -0.07 0.00 -0.01 -0.02 -0.04
##      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## ACF   0.01  0.02  0.02     0 -0.01  0.00     0 -0.02 -0.03  0.04 -0.01  0.01
## PACF -0.01  0.00  0.01     0 -0.01 -0.01     0 -0.02 -0.03  0.03  0.00  0.00
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## ACF  -0.03 -0.01  0.06 -0.06  0.05  0.01 -0.01 -0.01  0.01     0  0.02 -0.04
## PACF -0.03 -0.03  0.05 -0.05  0.03  0.03 -0.02 -0.02  0.01     0  0.03 -0.03
##      [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## ACF   0.02  0.01     0 -0.02  0.00 -0.03  0.03 -0.04 -0.01  0.01
## PACF  0.01  0.01     0 -0.02  0.01 -0.03  0.03 -0.04 -0.02  0.00

(4). For ARIMA(0,1,1), the t-table below shows that the MA term is significant. However, the Ljung-Box statistics in the diagnostic plots indicate that the serial correlation has not been resolved.

resid1<- sarima(fit1.residuals,0,1,1,no.constant = T,details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -4.268481 
## iter   2 value -4.281574
## iter   3 value -4.281639
## iter   3 value -4.281639
## iter   3 value -4.281639
## final  value -4.281639 
## converged
## initial  value -4.281651 
## iter   1 value -4.281651
## final  value -4.281651 
## converged

resid1$ttable
##     Estimate     SE t.value p.value
## ma1  -0.1682 0.0292 -5.7547       0
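
The Ljung-Box results are read from sarima's diagnostic panel; an equivalent explicit check is sketched below, assuming (as astsa does) that the fitted arima object is stored in resid1$fit, with fitdf = 1 for the one estimated MA coefficient:

Box.test(resid(resid1$fit), lag = 12, type = "Ljung-Box", fitdf = 1)
# a small p-value indicates serial correlation remaining in the residuals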

(5). For SARIMA(0,1,1)(0,0,1)_6, the t-table below shows that the seasonal MA term is significant as well, and the serial correlation seems to improve. We first employ this residual structure in a regression with autocorrelated errors.

resid2<- sarima(fit1.residuals,0,1,1,0,0,1,6,no.constant = T,details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -4.268481 
## iter   2 value -4.288393
## iter   3 value -4.288637
## iter   3 value -4.288637
## iter   3 value -4.288637
## final  value -4.288637 
## converged
## initial  value -4.288646 
## iter   1 value -4.288646
## final  value -4.288646 
## converged

resid2$ttable
##      Estimate     SE t.value p.value
## ma1   -0.1755 0.0296 -5.9280       0
## sma1  -0.1216 0.0289 -4.2002       0

(6). From the Ljung-Box statistics below, the serial correlation problems in the residuals are solved. However, a serious issue appears: the time predictor becomes insignificant. We suspect that the time trend is largely absorbed by the differenced error structure (differencing removes a linear trend), so once the errors are modeled the variable loses significance. We therefore exclude it and rebuild the model. A small illustration of this point follows the t-table below.

fit2<- sarima(data1$gold_log,0,1,1,0,0,1,6
              ,xreg = cbind(data1$vol_log,I((data1$vol_log)^2),data1$time),details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -4.771799 
## iter   2 value -4.775492
## iter   3 value -4.775527
## iter   4 value -4.775528
## iter   5 value -4.775535
## iter   5 value -4.775535
## iter   5 value -4.775535
## final  value -4.775535 
## converged
## initial  value -4.775553 
## iter   2 value -4.775553
## iter   2 value -4.775553
## iter   2 value -4.775553
## final  value -4.775553 
## converged

fit2$ttable
##       Estimate     SE t.value p.value
## ma1     0.0380 0.0287  1.3214  0.1866
## sma1   -0.0739 0.0276 -2.6794  0.0075
## xreg1   0.0995 0.0369  2.6995  0.0070
## xreg2  -0.0158 0.0065 -2.4196  0.0157
## xreg3   0.0785 0.0570  1.3788  0.1682
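
One way to see why the time trend loses significance: the d = 1 differencing in the error model reduces a linear trend to a constant, leaving no time variation for the coefficient to explain. A minimal illustration:

# differencing a linear trend beta*t leaves only the constant slope beta
beta<- 0.08
trend<- beta*(1:10)
diff(trend)  # every entry equals beta; the trend carries no information after differencing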

(7). We rebuild the model without the time variable and inspect the residuals again. After differencing, we recognize a similar residual structure below, namely SARIMA(0,1,1)(0,0,1)_6.

fit3<- lm(gold_log~vol_log+I(vol_log^2),data = data1)
fit3.residuals<- fit3 %>% resid()
fit3.residuals.diff<- fit3.residuals %>% diff()
acf2(fit3.residuals.diff,main = "ACF and PACF of the Differenced Residuals")

##       [,1]  [,2]  [,3]  [,4] [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
## ACF  -0.17 -0.01 -0.06  0.01 0.04 -0.11 -0.03 -0.06  0.03 -0.01 -0.02 -0.04
## PACF -0.17 -0.04 -0.06 -0.01 0.03 -0.10 -0.07 -0.09 -0.01 -0.02 -0.03 -0.06
##      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## ACF   0.04  0.03  0.03  0.01 -0.01  0.02  0.00 -0.04 -0.05  0.03  0.01     0
## PACF  0.02  0.01  0.03  0.02 -0.01  0.01  0.01 -0.03 -0.05  0.01  0.01     0
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## ACF  -0.03 -0.03  0.05 -0.06  0.05  0.03 -0.03  0.00  0.01     0  0.03 -0.04
## PACF -0.03 -0.05  0.02 -0.07  0.02  0.05 -0.03 -0.02  0.01     0  0.04 -0.03
##      [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## ACF   0.03  0.00     0 -0.01 -0.01 -0.01  0.05 -0.05 -0.01  0.03
## PACF  0.02  0.01     0 -0.01  0.00 -0.02  0.05 -0.04 -0.03  0.02

(8). From the t-table below, both terms are significant, but the serial correlation in the residuals appears unresolved. Still, we try this new structure in a regression with autocorrelated errors.

resid3<- sarima(fit3.residuals,0,1,1,0,0,1,6,no.constant = T,details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -3.671804 
## iter   2 value -3.694550
## iter   3 value -3.694901
## iter   3 value -3.694901
## iter   3 value -3.694901
## final  value -3.694901 
## converged
## initial  value -3.694878 
## iter   1 value -3.694878
## final  value -3.694878 
## converged

resid3$ttable
##      Estimate     SE t.value p.value
## ma1   -0.1924 0.0294 -6.5547       0
## sma1  -0.1245 0.0291 -4.2825       0

(9). This time the predictor variables all remain significant. However, the nonseasonal MA term becomes insignificant, so we exclude it and see how the model responds.

fit4<- sarima(data1$gold_log,0,1,1,0,0,1,6
              ,xreg = cbind(data1$vol_log,I((data1$vol_log)^2)),details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -4.771094 
## iter   2 value -4.774712
## iter   3 value -4.774724
## iter   4 value -4.774728
## iter   5 value -4.774749
## iter   6 value -4.774750
## iter   7 value -4.774756
## iter   8 value -4.774757
## iter   8 value -4.774757
## iter   8 value -4.774757
## final  value -4.774757 
## converged
## initial  value -4.774772 
## iter   1 value -4.774772
## final  value -4.774772 
## converged

fit4$ttable
##       Estimate     SE t.value p.value
## ma1     0.0397 0.0287  1.3861  0.1660
## sma1   -0.0720 0.0275 -2.6167  0.0090
## xreg1   0.0990 0.0369  2.6858  0.0073
## xreg2  -0.0157 0.0065 -2.4083  0.0162

(10). After eliminating the nonseasonal MA term, the remaining seasonal term is still significant. More importantly, the residuals are now mostly uncorrelated, with only one significant point, and they are approximately normally distributed. Therefore, we select the model with predictors log(volatility) and [log(volatility)]^2 and autocorrelated errors following SARIMA(0,1,0)(0,0,1)_6.

fit5<- sarima(data1$gold_log,0,1,0,0,0,1,6
              ,xreg = cbind(data1$vol_log,I((data1$vol_log)^2)),details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -4.771094 
## iter   2 value -4.773971
## iter   3 value -4.773977
## iter   4 value -4.773984
## iter   5 value -4.773987
## iter   6 value -4.773987
## iter   7 value -4.773988
## iter   7 value -4.773988
## iter   7 value -4.773988
## final  value -4.773988 
## converged
## initial  value -4.773998 
## iter   1 value -4.773999
## final  value -4.773999 
## converged

fit5$ttable
##       Estimate     SE t.value p.value
## sma1   -0.0738 0.0275 -2.6842  0.0074
## xreg1   0.1065 0.0366  2.9090  0.0037
## xreg2  -0.0169 0.0065 -2.6072  0.0092
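
Putting the pieces together, the selected model can be written out as follows, a sketch based on fit5's estimates, with x_t the regression error, B the backshift operator, and w_t white noise:

log(Gold_t) = 0.1065*log(vol_t) - 0.0169*[log(vol_t)]^2 + x_t
(1 - B) x_t = (1 - 0.0738 B^6) w_t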

Time Series Model - Seasonal ARIMA

# for the pure time series model, use the full gold series (no join with volatility)
data2<- Gold %>% as_tibble() %>% rename(gold=GOLDAMGBD228NLBM)
data2<- data2 %>% filter(gold != ".") %>% mutate_at("gold",as.numeric)
data2<- data2 %>% mutate(gold_log=log(gold))
data2<- data2 %>% separate(DATE, into=c("Year","Month","Date"),sep="-")
time.2016<- seq(from=2016, to=2017,by=1/251)[226:251]
time.2017<- seq(from=2017,to=2018,by=1/251)[1:251]
time.2018<- seq(from=2018,to=2019,by=1/253)[1:253]
time.2019<- seq(from=2019,to=2020,by=1/252)[1:252]
time.2020<- seq(from=2020,to=2021,by=1/253)[1:253]
time.2021<- seq(from=2021,to=2022,by=1/227)[1:227]
time.stamp<- c(time.2016,time.2017,time.2018,time.2019,time.2020,time.2021)
data2<- data2 %>% mutate(time=time.stamp)

(1). From the preliminary data analysis, we already know that the series is not stationary even after the log transformation. Thus, we first difference the data, after which the series appears stationary.

diff.log.gold<- diff(data2$gold_log)
tsplot(diff.log.gold,main = "Log Gold Price After Differencing",gg=T,ylab = "",xlab = "Series",col = astsa.col(4, .8))

(2). From the graph below, the nonseasonal part suggests ARIMA(1,1,0) or ARIMA(0,1,1); both are acceptable and give close results. At the seasonal lag 6, both the ACF and PACF show significant values, so seasonal terms are warranted. We compare SARIMA(1,1,0)(1,0,0)_6, SARIMA(0,1,1)(1,0,0)_6, and SARIMA(0,1,1)(1,0,1)_6.
acf2(diff.log.gold,main = "ACF and PACF of the Differenced Log Gold Price")

##      [,1] [,2]  [,3]  [,4]  [,5]  [,6] [,7]  [,8] [,9] [,10] [,11] [,12] [,13]
## ACF  0.06 0.01 -0.01 -0.05 -0.05 -0.09 0.00 -0.03 0.03  0.02  0.00  0.06  0.01
## PACF 0.06 0.01 -0.01 -0.05 -0.04 -0.09 0.01 -0.03 0.03  0.01 -0.01  0.05  0.01
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF      0 -0.01 -0.03 -0.04 -0.04     0 -0.02  0.01  0.05 -0.02 -0.04  0.00
## PACF     0  0.00 -0.02 -0.03 -0.02     0 -0.02  0.00  0.04 -0.04 -0.05  0.01
##      [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
## ACF  -0.02  0.03 -0.03  0.08 -0.07  0.03  0.03  0.02  0.01  0.00  0.02 -0.01
## PACF -0.02  0.04 -0.03  0.08 -0.08  0.04  0.03  0.03 -0.01  0.02  0.01  0.00
##      [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## ACF   0.02  0.01 -0.02  0.02 -0.06 -0.04 -0.05  0.01  0.04
## PACF  0.02  0.02 -0.02  0.01 -0.06 -0.04 -0.05  0.02  0.03
(3). From the t-tables and residual analyses below, all the estimated terms are significant, so all three models deserve consideration. The Ljung-Box statistics are not significant for any of the three models, and the residuals depart from normality only in the tails.
ts1<- sarima(data2$gold_log,1,1,0,1,0,0,6,no.constant = T,details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -4.777614 
## iter   2 value -4.783772
## iter   3 value -4.783777
## iter   3 value -4.783777
## iter   3 value -4.783777
## final  value -4.783777 
## converged
## initial  value -4.782425 
## iter   2 value -4.782426
## iter   2 value -4.782426
## iter   2 value -4.782426
## final  value -4.782426 
## converged

ts1$ttable
##      Estimate     SE t.value p.value
## ar1    0.0594 0.0283  2.1005  0.0359
## sar1  -0.0918 0.0282 -3.2521  0.0012
ts2<- sarima(data2$gold_log,0,1,1,1,0,0,6,no.constant = T,details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -4.777977 
## iter   2 value -4.784162
## iter   3 value -4.784169
## iter   3 value -4.784169
## iter   3 value -4.784169
## final  value -4.784169 
## converged
## initial  value -4.782416 
## iter   2 value -4.782417
## iter   2 value -4.782417
## iter   2 value -4.782417
## final  value -4.782417 
## converged

ts2$ttable
##      Estimate     SE t.value p.value
## ma1    0.0591 0.0281  2.0996  0.0360
## sar1  -0.0921 0.0282 -3.2626  0.0011
ts3<- sarima(data2$gold_log,0,1,1,1,0,1,6,no.constant = T,details = T,gg=TRUE, col=astsa.col(4, .8))
## initial  value -4.777977 
## iter   2 value -4.778773
## iter   3 value -4.783984
## iter   4 value -4.783996
## iter   5 value -4.784020
## iter   6 value -4.784106
## iter   7 value -4.784286
## iter   8 value -4.784559
## iter   9 value -4.784875
## iter  10 value -4.784890
## iter  11 value -4.784912
## iter  12 value -4.784914
## iter  13 value -4.784961
## iter  14 value -4.785000
## iter  15 value -4.785021
## iter  16 value -4.785024
## iter  17 value -4.785024
## iter  17 value -4.785024
## iter  17 value -4.785024
## final  value -4.785024 
## converged
## initial  value -4.783831 
## iter   2 value -4.783835
## iter   3 value -4.783840
## iter   4 value -4.783872
## iter   5 value -4.783929
## iter   6 value -4.784003
## iter   7 value -4.784101
## iter   8 value -4.784265
## iter   9 value -4.784265
## iter  10 value -4.784282
## iter  11 value -4.784283
## iter  12 value -4.784307
## iter  13 value -4.784328
## iter  14 value -4.784368
## iter  15 value -4.784445
## iter  16 value -4.784472
## iter  17 value -4.784485
## iter  18 value -4.784486
## iter  19 value -4.784501
## iter  20 value -4.784546
## iter  21 value -4.784558
## iter  22 value -4.784564
## iter  23 value -4.784565
## iter  24 value -4.784592
## iter  25 value -4.784612
## iter  26 value -4.784673
## iter  27 value -4.784767
## iter  28 value -4.784776
## iter  29 value -4.784780
## iter  30 value -4.784781
## iter  31 value -4.784790
## iter  32 value -4.784813
## iter  33 value -4.784874
## iter  34 value -4.784895
## iter  35 value -4.784900
## iter  36 value -4.784901
## iter  37 value -4.784924
## iter  38 value -4.784944
## iter  39 value -4.784989
## iter  40 value -4.785058
## iter  41 value -4.785074
## iter  42 value -4.785076
## iter  43 value -4.785076
## iter  44 value -4.785085
## iter  45 value -4.785100
## iter  46 value -4.785118
## iter  47 value -4.785128
## iter  48 value -4.785129
## iter  49 value -4.785130
## iter  50 value -4.785130
## iter  50 value -4.785130
## iter  50 value -4.785130
## final  value -4.785130 
## converged

ts3$ttable
##      Estimate     SE  t.value p.value
## ma1    0.0650 0.0282   2.2995  0.0216
## sar1  -0.9379 0.0501 -18.7337  0.0000
## sma1   0.8968 0.0649  13.8205  0.0000

(4). The AIC and BIC values of the three models, tabulated below, are close; AIC favors the third model while BIC favors the first. We propose the first model for its simplicity.

## [1] "AIC : -6.7222 -6.7222 -6.726"
## [1] "BIC -6.70999 -6.70997 -6.70974"

(5). With the established SARIMA(1,1,0)(1,0,0)_6 model, we provide five-step-ahead forecasts and their corresponding standard errors. Note that we forecast on the log scale.

Log_Gold_Price<- data2$gold_log
sarima.for(Log_Gold_Price,1,1,0,1,0,0,6,n.ahead = 5
           ,no.constant = T,plot.all = F,xlab="Series",col=astsa.col(4, .8),lty=1,pch="",main="5-step Ahead Forecast for Log Gold Price")

## $pred
## Time Series:
## Start = 1263 
## End = 1267 
## Frequency = 1 
## [1] 7.492236 7.492833 7.492726 7.492682 7.493688
## 
## $se
## Time Series:
## Start = 1263 
## End = 1267 
## Frequency = 1 
## [1] 0.008375478 0.012201556 0.015104056 0.017533349 0.019664842
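
Since the forecasts are on the log scale, they can be mapped back to dollars per troy ounce. A minimal sketch, assuming approximate log-normality (exp of the point forecast gives the median, and adding se^2/2 before exponentiating approximates the mean):

fc<- sarima.for(Log_Gold_Price,1,1,0,1,0,0,6,n.ahead = 5,no.constant = T,plot = F)
exp(fc$pred)              # median forecasts in USD per troy ounce (about 1794 to 1797)
exp(fc$pred + fc$se^2/2)  # mean forecasts with the log-normal correction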

Comparison - Prediction RMSE

# split the matched series: first 1231 observations for training, final 5 for testing
training_gold_log<- data1$gold_log[1:1231]
training_vol_log<- data1$vol_log[1:1231]
training_vol_log_square<- training_vol_log^2

testing_vol_log<- data1$vol_log[1232:1236]
testing_gold<- data1$gold[1232:1236]

GLS<- sarima(data1$gold_log,0,1,0,0,0,1,6
              ,xreg = cbind(data1$vol_log,I((data1$vol_log)^2)),details = F,gg=TRUE, col=astsa.col(4, .8))

Time_series<- sarima(data1$gold_log,1,1,0,1,0,0,6,no.constant = T,details = F,gg=TRUE, col=astsa.col(4, .8))



GLS_forecast<- sarima.for(training_gold_log,0,1,0,0,0,1,6,n.ahead = 5
              ,xreg = cbind(training_vol_log,I((training_vol_log)^2)),newxreg = cbind(testing_vol_log,I((testing_vol_log)^2)),plot = F)


Time_series_forcast<- sarima.for(training_gold_log,1,1,0,1,0,0,6,n.ahead = 5
           ,no.constant = T,plot = F)

GLS_prediction<- exp(GLS_forecast$pred)
Time_series_prediction<- exp(Time_series_forcast$pred)

# rmse
rmse_GLS<- sqrt(mean((GLS_prediction-testing_gold)^2))
rmse_time<- sqrt(mean((Time_series_prediction-testing_gold)^2))

results<- data.frame(series=1:5,True_value=testing_gold,GLS=GLS_prediction,time=Time_series_prediction)

results<-as_tibble(results)

colnames(results)<-c("Series","True Value","GLS","Time Series")
results
results_prediction<- data.frame(GLS=rmse_GLS,Time_Series=rmse_time)
rownames(results_prediction)<-"RMSE"
colnames(results_prediction)<- c("GLS","Time Series")
results_prediction
# prediction RMSE is roughly 1.8% of the gold price level for both models

Results

  1. For the regression model, we have two explanatory variables: the log-transformed volatility index and its quadratic term. The first coefficient is positive (0.106) and the second negative (-0.017). We should interpret these results in growth terms: a 100% increase in the volatility index is associated, on average, with roughly a 10% increase in the gold fixing price, though real-world movements are far more subtle than this scenario. The negative quadratic term means that as volatility increases, its effect slowly diminishes, what economists call a decreasing marginal effect (see the sketch after this list). Note that we used a regression with autocorrelated errors to correct the residuals, suggesting that something beyond the predictor variable remains unaccounted for. Since lag 6 corresponds to roughly a week in this data set (five trading days per week), we may suspect a weekly association in the data.

  2. For the SARIMA model, we have SARIMA(1,1,0)(1,0,0)_6 with an AR coefficient of 0.0594 and a seasonal AR coefficient of -0.0918. Because of the differencing, this model also explains and predicts in growth terms, like the regression model above. The results suggest that the current change in the gold fixing price is positively correlated with the previous day's change and negatively correlated with the change six days earlier, with the closer lag having the stronger effect. Unlike in the regression results, the suspected “weekly” effect is visible more directly in the time series model.

  3. For predicting the gold fixing price, the time series model is more advantageous than the regression model. One reason is ease of data collection: we can use the gold fixing price itself without having to collect and match the volatility data. Second, some financial theory suggests that, if the market is efficient, the best information about the price is the price itself; the time series model fits this assumption better.
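
To make the decreasing marginal effect in point 1 concrete: differentiating the fitted regression gives an implied elasticity d log(Gold)/d log(vol) = B1 + 2*B2*log(vol). A minimal sketch using fit5's estimates (the evaluation points are illustrative only):

# implied elasticity of the gold price with respect to the volatility index,
# evaluated at index level v, using the fit5 estimates
elasticity<- function(v, b1 = 0.1065, b2 = -0.0169) b1 + 2*b2*log(v)
elasticity(c(15, 20, 30))  # about 0.015, 0.005, -0.009: the effect shrinks as v rises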

Discussion

  1. Though the volatility index is closely related to the gold fixing price, the connection is somewhat indirect, because the index traces a pool of gold-related assets that are themselves complex. It might be better to find a more direct source of data.

  2. The regression model only estimates an association; the underlying connection between volatility and the fixing price can be complicated. To establish other associations or explain more variation, we should incorporate more fundamental macroeconomic data. With many related economic series, we may consider principal component regression (PCR) to avoid multicollinearity issues (see the sketch at the end of this section).

  3. In predicting the gold fixing price we use a time series model, but we did not perform any cross-validation to tune the model orders. Thus, beyond the single five-point holdout above, we do not know how effective this model is.
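
A hypothetical sketch of the PCR idea in point 2, using the pls package; macro is an assumed data frame of standardized macroeconomic predictors alongside the log gold price, not a data set used in this report:

library(pls)
# regress log gold price on principal components of the macro predictors;
# cross-validation helps choose how many components to keep
pcr_fit<- pcr(gold_log ~ ., data = macro, scale = TRUE, validation = "CV")
summary(pcr_fit)  # variance explained and CV error by number of components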