Chapter5.rmd

# Chapter 5

```{r echo=FALSE, message=FALSE, warning=FALSE, Load_packages}

library(fpp2)

```

1. Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.

daily20 <- head(elecdaily,20)

```{r echo=FALSE, message=FALSE, warning=FALSE, Question1}

# I needed to make elecdaily data because there weren't elecdaily data when I solved this question at the first time. 
# I could get only elecdemand data, which have half-hourly electricity demand data for Victoria, Australia.
# Even if I could get elecdaily data from the development version in github(https://github.com/robjhyndman/forecast), I left the part of codes for making elecdaily data.

# Aggregate demand column by sum and workday and temperature columns by mean. Set nfrequency as 365 to aggregate data for each day.
elecdaily <- ts.union(
  aggregate(elecdemand[, "Demand"], nfrequency = 365, FUN = sum),
  aggregate(elecdemand[, !colnames(elecdemand) %in% c("Demand")], nfrequency = 365, FUN = mean)
  )
# Need to change the names of columns after aggregating.
colnames(elecdaily) <- colnames(elecdemand)

# It will be easier to aggregate if I know the index of the columns that I want to remove.
#elecdemand[, -1]
#elecdemand[, -c(2,3)]

daily20 <- head(elecdaily, 20)
daily20

# a. Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?
autoplot(daily20)

# Use tslm function to find the regression model
tslm_Dem_Temp <- tslm(Demand ~ Temperature, data = daily20)
tslm_Dem_Temp

# There is a positive relationship between the two variables. It looked like it happened because of air conditioner and fan. It's likely that as temperature increased, people wanted to run them and it increased the demand of electricity

# A scatter plot of Demand against Temperature is shown below with the estimated regression line. This graph shows the positive relationship a lot more clearly
daily20 %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
    ylab("Electricity Demand") +
    xlab("Temperature") +
    geom_point() +
    geom_smooth(method="lm", se=FALSE)

# b. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(tslm_Dem_Temp$residuals)
# I think that this model is adequate because residuals aren't correlated with each other. But there was an outlier.

# c. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15 and compare it with the forecast if the maximum temperature was 35. Do you believe these forecasts?
fc_Dem_Temp <- forecast(tslm_Dem_Temp, 
  newdata=data.frame(Temperature=c(15,35)))
fc_Dem_Temp
# I think that the model forecasted rightly because the forecasted temperature values were near to the range of temperatures in the data

# d. Give prediction intervals for your forecasts. The following R code will get you started:
# 80% intervals
fc_Dem_Temp$upper[, 1]
fc_Dem_Temp$lower[, 1]
# 95% intervals
fc_Dem_Temp$upper[, 2]
fc_Dem_Temp$lower[, 2]

# e. Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?
elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
    ylab("Electricity Demand") +
    xlab("Temperature") +
    geom_point() +
    geom_smooth(method="lm", se=FALSE)
# The result plot indicates that the model was made by few data points. It could've explained the data of the first 20 days well, but it wasn't right model for total data points

```


2. Data set mens400 contains the winning times (in seconds) for the men's 400 meters final in each Olympic Games from 1896 to 2016.

```{r echo=FALSE, message=FALSE, warning=FALSE, Question2}

# a. Plot the winning time against the year. Describe the main features of the plot.
autoplot(mens400)
# Feature1. Winning times in Olympic men's 400m track final had the trend of decreasing as time went on.
# Feature2. There are missing values.

# b. Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?
# Extract time part from mens400 time series to do linear modeling.
time_mens400 <- time(mens400)
tslm_mens400 <- tslm(mens400 ~ time_mens400, 
                     data = mens400)

# Show data with regression line
autoplot(mens400) +
  geom_abline(slope = tslm_mens400$coefficients[2],
              intercept = tslm_mens400$coefficients[1],
              colour = "red")

# Get the winning time decreasing rate
tslm_mens400$coefficients[2]
# The winning times have been decreasing at average rate of 0.06457 second per year.

# c. Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
cbind(Time = time_mens400, 
      Residuals = tslm_mens400$residuals) %>%
  as.data.frame() %>%
  ggplot(aes(x = Time, y = Residuals)) +
    geom_point() +
    ylab("Residuals of Regression Line(Unit:s)")
# The residual plot shows that the regression model generally fitted the data well. I can check it using checkresiduals function, too.
checkresiduals(tslm_mens400)

# d. Predict the winning time for the men's 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
# I made linear model with na.action argument as na.exclude to exclude missing values.
# And then I used the linear model in forecast function to get prediction interval.
# Forecast function can't calculate prediction interval when there is any missing values in the data that I excluded them fitting linear model.
lm_mens400 <- lm(
  mens400 ~ time_mens400, 
  data = mens400,
  na.action = na.exclude
  )

fc_mens400 <- forecast(
  lm_mens400, 
  newdata = data.frame(time_mens400 = 2020)
  )

autoplot(mens400) +
  autolayer(fc_mens400, PI = TRUE)

# Get 80% and 95% prediction intervals
fc_mens400$upper
fc_mens400$lower

# 80% interval is from 40.45 to 43.63
# 95% interval is from 39.55 to 44.53
# But we need to consider that they were calculated from the assumption that the model's residuals were normally distributed. But we saw from the result of checkresiduals function that it isn't true.

```


3. Type easter(ausbeer) and interpret what you see.

```{r echo=FALSE, message=FALSE, warning=FALSE, Question3}

help("ausbeer")
head(ausbeer)
str(ausbeer)
# Quarterly Australian beer production data. There are 218 data points.

time(ausbeer)[c(1, length(ausbeer))]
# start is 1st quarter of 1956 and the last is the 2nd quarter of 2010.

easter(ausbeer)
# easter function returns a vector of 0's or 1's or fractional parts in the observed time period. If full Easter holidays are in a time period, it returns 1, and returns 0 if there isn't any. If the holidays are extended from a period to the other, easter function returns fractional portions to each of them.

```


5. The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

```{r echo=FALSE, message=FALSE, warning=FALSE, Question5}

# a. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
autoplot(fancy)
head(fancy, 50)
# Sales generally increased from January to December. Sales increased dramatically in Decembers. Sales in Decembers increased as time went on, but in 1991, sales decreased. In most years, there was also unexpected increase every March, but the increases were a lot smaller than the increases in Decembers.

# b. Explain why it is necessary to take logarithms of these data before fitting a model.
# The size of the seasonal variations should be almost same across the whole series to be fitted well to a model. Fancy data shows that seasonal variations increased exponentially. Therefore it is necessary to take logarithms of the data.

# c. Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a "surfing festival" dummy variable.

# make "surfing_festival" dummy variable using time index of fancy. The value is 1 if the year is equal to or above 1988 and the month is March.
Time <- time(fancy)
surfing_festival <- c()
for(i in 1:length(Time)){
  month <- round(12*(Time[i] - floor(Time[i]))) + 1
  year <- floor(Time[i])
  if(year >= 1988 & month == 3){
    surfing_festival[i] <- 1
  } else {
    surfing_festival[i] <- 0
  }
}
# If I had made surfing_festival as a list, I should've needed to use unlist function to make it as atomic vector, not nested list. tslm function can get vector or factor type data, but it cannot get nested list.

tslm_log_fancy <- tslm(
  BoxCox(fancy, 0) ~ trend + season + surfing_festival
  )

# d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
autoplot(tslm_log_fancy$residuals)
# The residuals have pattern against time. It means that there is correlation between residuals and time.

cbind(Residuals = tslm_log_fancy$residuals,
      Fitted_values = tslm_log_fancy$fitted.values) %>%
  as.data.frame() %>%
  ggplot(aes(x = Fitted_values,
             y = Residuals)) +
    geom_point()
# The size of the residuals changed as we move along the x-axis(fitted values). It means that even after log transformation, there are still heteroscedacity in the errors or that the variance of the residuals aren't still constant. 

# e. Do boxplots of the residuals for each month. Does this reveal any problems with the model?
cbind.data.frame(
    Month = factor(
      month.abb[round(12*(Time - floor(Time)) + 1)],
      labels = month.abb,
      ordered = TRUE
    ),
    Residuals = tslm_log_fancy$residuals
    ) %>%
  ggplot(aes(x = Month,
             y = Residuals)) +
    geom_boxplot()
# If vectors are combined by cbind function, the class of the result is matrix, which should hold one type of data. If I want to make the columns to have different data types, I need to use cbind.data.frame function instead. Instead, if I still want to use cbind function, I need to use as.numeric function in mapping of ggplot.
# If the mapping of boxplot is (factor x factor), it would be difficult to see any box because boxplot function can't aggregate factor type data. The result would be looked like a scatter plot.

# To see the change of the residuals for each month, I used ggsubseriesplot function.
ggsubseriesplot(tslm_log_fancy$residuals)

# The distribution of the residuals was unsymetrical for some months. And for some months, the median of the residuals wasn't 0(residuals' mean should be 0 for all months because getting the minimum SSE means getting mean). Residuals with such properties can't have normal distribution, which will make it difficult to get prediction interval.

# f. What do the values of the coefficients tell you about each variable?
tslm_log_fancy$coefficients
# The model has positive trend. It means that as time goes on, the sales amount generally increases. 
# And all seasonal variables are positive. It means that the sales amount was minimum on January and the sales of the other months were greater than January for most of years. 
# Finally, surfing_festival variable's coefficient is 0.501 and it isn't small compared to the others. It means that there were increased sales in Marchs when surfing festival happened.

# g. What does the Breusch-Godfrey test tell you about your model?
checkresiduals(tslm_log_fancy)
# The p value of the test is less than 0.05. It means that the residuals can be distinguished from white noise. The residuals can be correlated with each other.

# h. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
# make surfing festival data for the months of 1994 through 1996.
future_fancy <- rep(0, 36)
for(i in 1:36){
  if(i %% 12 == 3){
    future_fancy[i] <- 1
  }
}
# make future data as time series.
future_fancy <- ts(data = future_fancy,
                   start = 1994,
                   end = c(1996, 12),
                   frequency = 12)

# forecast
fc_tslm_log_fancy <- forecast(
  tslm_log_fancy,
  newdata = data.frame(Time = time(future_fancy),
                       surfing_festival = future_fancy)
)

# plot the forecast
autoplot(fc_tslm_log_fancy)

# show prediction interval
fc_tslm_log_fancy$upper
fc_tslm_log_fancy$lower
# The intervals on Decembers were especially large.

# i. Transform your predictions and intervals to obtain predictions and intervals for the raw data.

# make fc_tslm_fancy object, which are inverse log transformed version of fc_tslm_log_fancy.
fc_tslm_fancy <- fc_tslm_log_fancy

# members which should be inverse log transformed.
members_inv.log <- c('x', 'mean', 'lower', 'upper', 'residuals', 'fitted')

# apply inverse log transform to the members.
fc_tslm_fancy[members_inv.log] <- lapply(
  fc_tslm_log_fancy[members_inv.log],
  InvBoxCox,
  lambda = 0
)

# apply inverse log transform to 'BoxCox(fancy, 0)' member in model's model.
fc_tslm_fancy[['model']][['model']][1] <- lapply(
  fc_tslm_log_fancy[['model']][['model']][1],
  InvBoxCox,
  lambda = 0
)

autoplot(fc_tslm_fancy)
# Even if I transformed the data, fitted values and forecasts, the name of predicted values is still 'BoxCox(fancy, 0)'. 
# I can't change it because it came from the formula in tslm function. Changing it means making new model, not just changing the variable's name.
# I think that it is better to set lambda = 0 in forecast function from the very first to forecast using log transformation.

fc_tslm_fancy$upper
fc_tslm_fancy$lower
# The range of prediction intervals became a lot bigger after inverse log transformation.

# j. How could you improve these predictions by modifying the model?
# The predictions when I don't use log transformation.
tslm_fancy <- tslm(
  fancy ~ trend + season + surfing_festival
  )

fc_tslm_fancy2 <- forecast(
  tslm_fancy, 
  newdata = data.frame(Time = time(future_fancy),
                       surfing_festival = future_fancy)
)

autoplot(fc_tslm_fancy2)
# The result shows that the forecasts couldn't reflect the exponential growth trend.

# I could've improved the predictions by using log transformation. By using the transformation, the predictions could reflect the sales' exponential growth trend better.

```


6. The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in "thousand barrels per day". Consider only the data to the end of 2004.

```{r echo=FALSE, message=FALSE, warning=FALSE, Question7}

# a. Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.
str(gasoline)
head(gasoline)
# They are weekly data and it would be useful to make model with harmonic regression.

# extract gasoline data up to the end of 2004 and plot it
gasoline_until_2004 <- window(gasoline, end = 2005)
autoplot(gasoline_until_2004, xlab = "Year")

# make tslm model
for(num in c(1, 2, 3, 5, 10, 20)){
  #make variable names for each model using pair number.
  var_name <- paste("tslm_ft",
                    as.character(num),
                    "_gasoline_until_2004",
                    sep = "")
  
  #assign ts linear model to each variable name.
  assign(var_name,
         tslm(gasoline_until_2004 ~ trend + fourier(
           gasoline_until_2004,
           K = num
           ))
         )
  
  #plot the data and fitted values
  print(
    autoplot(gasoline_until_2004) +
      autolayer(get(var_name)$fitted.values,
                series = as.character(num)) +
      ggtitle(var_name) +
      ylab("gasoline") +
      guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
  )
}

autoplot(gasoline_until_2004) +
  autolayer(tslm_ft1_gasoline_until_2004$fitted.values, series = "1") +
  autolayer(tslm_ft5_gasoline_until_2004$fitted.values, series = "2") +
  autolayer(tslm_ft10_gasoline_until_2004$fitted.values, series = "3") +
  autolayer(tslm_ft10_gasoline_until_2004$fitted.values, series = "5") +
  autolayer(tslm_ft20_gasoline_until_2004$fitted.values, series = "10") +
  autolayer(tslm_ft20_gasoline_until_2004$fitted.values, series = "20") +
  guides(colour = guide_legend(title = "Fourier Transform pairs")) +
  scale_color_discrete(breaks = c(1, 2, 3, 5, 10, 20))
# as more number of Fourier pairs used, the fitted line looks more like the original data. But the fitted lines didn't follow the trend well.

# b. Select the appropriate number of Fourier terms to include by minimizing the AICc or CV value.
for(i in c(1, 2, 3, 5, 10, 20)){
  tslm_ft_gasoline_until_2004.name <- paste(
    "tslm_ft", as.character(i), "_gasoline_until_2004",
    sep = ""
  )
  writeLines(
    paste(
    "\n", tslm_ft_gasoline_until_2004.name, "\n"
    )
  )
  print(CV(get(tslm_ft_gasoline_until_2004.name)))
}

# In the above 6 K values, 10 minimized the AICc and CV value.

# Get exact number of Fourier pairs to minimize AICc or CV
min_AICc <- Inf
min_K_by_AICc <- 0
min_CV <- Inf
min_K_by_CV <- 0
AICc_K <- 0
CV_K <- 0

# maximum number of pairs is 26 because the frequency of gasoline data is about 52.18
for(num in 1:26){
  AICc_K <- CV(
    tslm(
      gasoline_until_2004 ~ trend + fourier(gasoline_until_2004, K = num)
    )
  )[["AICc"]]
  print(AICc_K)
  CV_K <- CV(
    tslm(
      gasoline_until_2004 ~ trend + fourier(gasoline_until_2004, K = num)
    )
  )[["CV"]]
  print(CV_K)
  
  # If the minimum AICc and CV values are found, the loop don't need to run anymore. Therefore print the result number of pairs and break the loop.
  # If num = 1, don't run below codes and move to num = 2. With just the result of num = 1, I cannot know whether the AICc and CV values are minimum.
  if(num != 1){
    if(AICc_K >= min_AICc & CV_K >= min_CV){
      writeLines(
        paste("The number of Fourier Transform pairs to minimize AICc",
              "\n",
              as.character(min_K_by_AICc)
        )
      )
      writeLines(
        paste("The number of Fourier Transform pairs to minimize CV",
              "\n",
              as.character(min_K_by_CV)
        )
      )
      break
    }
  }
  
  # find the minimum AICc and CV and the number of pairs at the state.
  if(AICc_K < min_AICc){
    min_AICc <- AICc_K
    min_K_by_AICc <- num
  }
  if(CV_K < min_CV){
    min_CV <- CV_K
    min_K_by_CV <- num
  }
}
# To get minimum AICc or CV, I need 7 pairs.

# c. Check the residuals of the final model using the checkresiduals() function. Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)
tslm_ft7_gasoline_until_2004 <- tslm(
  gasoline_until_2004 ~ trend + fourier(
    gasoline_until_2004, 
    K = 7
    )
  )

checkresiduals(tslm_ft7_gasoline_until_2004)

# d. To forecast using harmonic regression, you will need to generate the future values of the Fourier terms. This can be done as follows.

fc_gasoline_2005 <- forecast(
  tslm_ft7_gasoline_until_2004,
  newdata=data.frame(fourier(
    gasoline_until_2004, K = 7, h = 52)
    )
  )

# where tslm_ft7_gasoline_until_2004 is the fitted model using tslm, K is the number of Fourier terms used in creating fit, and h is the forecast horizon required. Got the next year's forecasts.

# e. Plot the forecasts along with the actual data for 2005. What do you find?
autoplot(fc_gasoline_2005) +
  autolayer(window(
    gasoline,
    start = 2004,
    end = 2006
    )
  ) +
  scale_x_continuous(limits = c(2004, 2006))

# Almost all of actual data were in the 80% prediction interval. But the model couldn't predict the sudden fall in the fall. The drop was a lot bigger than expected.

```


7. Data set huron gives the water level of Lake Huron in feet from 1875-1972.

```{r echo=FALSE, message=FALSE, warning=FALSE, Question7}

# a. Plot the data and comment on its features.
autoplot(huron)
str(huron)
head(huron)
# There are lots of fluctuations that show some cyclic behaviour. And up to about the year 1930 there were downward trend, but after that the trend disappeared.

# b. Fit a linear regression and compare this to a piecewise linear trend model with a knot at 1915.
h <- 8
# simple linear regression
tslm_huron <- tslm(huron ~ trend)
fc_tslm_huron <- forecast(tslm_huron, h=h)

# linear regression with log transformation
tslm_log_huron <- tslm(huron ~ trend, 
                          lambda = 0)
fc_tslm_log_huron <- forecast(tslm_log_huron, h=h)

# piecewise linear regression
t <- time(huron)
t.break <- 1915
t_piece <- ts(pmax(0,t-t.break), start=1875)

tslm_pw_huron <- tslm(huron ~ t + t_piece)
  t_new <- t[length(t)]+seq(h)
  t_piece_new <- t_piece[length(t_piece)]+seq(h)
  
newdata <- cbind(t=t_new,
                 t_piece=t_piece_new) %>%
  as.data.frame()

fc_tslm_pw_huron <- forecast(
  tslm_pw_huron,
  newdata = newdata
  )

# cubic spline regression
tslm_spline_huron <- splinef(huron, lambda = 0)
fc_tslm_spline_huron <- forecast(
  tslm_spline_huron,
  newdata = newdata
  )

# plot the results
autoplot(huron) +
  autolayer(fitted(tslm_huron), series = "Linear") +
  autolayer(fitted(tslm_log_huron), series="Logarithm") +
  autolayer(fitted(tslm_pw_huron), series = "Piecewise") +
  autolayer(fitted(tslm_spline_huron), series = "Cubic Spline") +
  autolayer(fc_tslm_pw_huron, series="Piecewise") +
  autolayer(fc_tslm_huron$mean, series = "Linear") +
  autolayer(fc_tslm_log_huron$mean, series="Logarithm") +
  autolayer(fc_tslm_spline_huron$mean, series="Cubic Spline") +
  xlab("Year") +  ylab("Water level") +
  ggtitle("Lake Huron water level change") +
  guides(colour=guide_legend(title=" "))
# It looked like spline model didn't catch the trend well. Linear model and log transformed linear model couldn't reflect the trend change around the year 1915.

# c. Generate forecasts from these two models for the period upto 1980 and comment on these.
autoplot(huron) +
  autolayer(fitted(tslm_huron), series = "Linear") +
  autolayer(fc_tslm_huron, series = "Linear") +
  xlab("Year") +  ylab("Water level") +
  ggtitle("Lake Huron water level change",
          subtitle = "using Linear Regression model") +
  guides(colour=guide_legend(title=" ")) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

autoplot(huron) +
  autolayer(fitted(tslm_pw_huron), series = "Piecewise") +
  autolayer(fc_tslm_pw_huron, series="Piecewise") +
  xlab("Year") +  ylab("Water level") +
  ggtitle("Lake Huron water level change",
          subtitle = "using piecewise linear model") +
  guides(colour=guide_legend(title=" ")) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

# Linear regression model shows that the point forecasts and the upper and the lower bounds of prediction intervals decrease as time goes on. It didn't reflect the trend change around the year 1915.
# Piecewise linear regression model shows that the point forecasts and the prediction intervals are almost same over time. It reflected the trend change.

```


### Question 4, 8 are related with math, not related with coding that I didn't include them in here.