PresentationSP23.qmd

---
title: "Using Gradient-Boosted Machine Learning Models to Forecast Lake Water Quality"
author: "Greg Harrison"
format: revealjs
bibliography: references.bib
editor: visual
---

## NEON Forecasting Challenge

-   Ecological forecasting could improve natural resource management and advance predictive theory.
-   The Ecological Forecasting Initiative is hosting a data science challenge to accurately forecast observations at NEON sites across the U.S.
-   Focused on aquatics theme, generating 1- to 30-day ahead forecasts of water temperature, dissolved oxygen, and chlorophyll-a at 24 lakes and streams.
-   Contributes to a growing set of automated forecasting models that will be used to quantify the boundaries prediction in ecology.


## NEON Workflow

[![](Images/NeonWorkflow.jpeg){width=30%, fig-align="center"}](https://essopenarchive.org/doi/full/10.22541/essoar.167079499.99891914/v1) 

Figure from [@thomas2023]

## What is XGBoost?

-   **XGBoost** (Extreme Gradient Boosting): Ensemble model based on decisions trees that utilizes Gradient Boosting and is optimized to run fast
-   **Ensemble**: Combines multiple models together for enhanced performance
-   **Gradient Boosting**: Each new model learns to correct for previous models shortcomings
-   **Extreme**: Can use parallelization and GPUs to speed up training

## XGBoost Diagram

[![](Images/xgboostpicture.webp){width=80%, height=80%,fig-align=\"center\"}](https://medium.com/swlh/gradient-boosting-trees-for-classification-a-beginners-guide-596b594a14ea)

Figure from [@xgboost]

## Where has XGBoost been used before?

-   Predicting groundwater levels in Selangor, Malaysia, outperforming both ANNs and SVMs [@ibrahemahmedosman2021]

-   Predicting changes in ICU Covid-19 patients' sequential organ failure assessment scores [@montomoli2021]

-   Predicting heart disease in clinic patients with around 91.8% accuracy [@budholiya2022]


## Questions

-   How well does XGBoost perform in forecasting temperature, oxygen, and chlorophyll-a compare to baseline models such as Climatology and Persistence?
-   Do oxygen and chlorophyll-a forecasts benefit from using forecasted water temperature?

## Methods

-   R script to train XGBoost models on historical data, used day of year and forecasted air temperatures as inputs.
-   Reserved random 20% of data to evaluate model performance.
-   Forecast distribution due to combination of ensemble variance and random noise added based on model performance on testing data.
-   Two main families of models, parallel where each variable was predicted independently of the others, and sequential, where predicted water temperature was used as an input for Dissolved Oxygen and Chlorophyll-a

## Parallel Model

![](Images/ParallelModel.png){width=80%, fig-align="center"}

## Sequential Model

![](Images/SequentialModel.png){width=80%, fig-align="center"}

## Methods

-   Using Github actions produce a reproducible workflow that produces daily forecasts.
-   Every day a new XGBoost model is trained for each NEON site and target variable, and then uses NOAA forecasts to predict values 30 days into the future and submits.
-   Forecasting since February 14, 2023 (65 Days)

## Focal Analysis Sites

<!--Our results will focus on Lake Suggs and Lake Barco, two similar lakes in northern Florida.-->

![](Images/SUGG_BARC.png){width=80%, fig-align="center"}

```{r}
library(ggplot2)
library(dplyr)
library(gridExtra)
library(ggpubr)
```

```{r}
scores <- neon4cast::combined_scores(theme="aquatics", collect=FALSE)

model_ids <- c("xgboost_temp_oxygen_parallel", "xgboost_temp_oxygen_sequential", "xgboost_temp_oxygen_chla_parallel", "xgboost_temp_oxygen_chla_sequential", "climatology", "flareGLM", "persistenceRW")

df <- scores %>% 
  dplyr::filter(model_id %in% model_ids) %>% 
  dplyr::filter(reference_datetime > "2023-02-01") %>% 
  dplyr::collect()
```

```{r}
df <- df %>% dplyr::mutate(horizon = datetime-lubridate::as_datetime(reference_datetime)) %>%
    dplyr::mutate(horizon = as.numeric(lubridate::as.duration(horizon),
                                units = "seconds"),
           horizon = horizon / 86400)

```

## Example Forecast

```{r}
df %>% filter(site_id=="SUGG",
              reference_datetime=="2023-02-21",
              model_id=="xgboost_temp_oxygen_chla_parallel") %>%
  mutate(variable = ifelse(variable == "chla", "chla (ug/L)", variable),
         variable = ifelse(variable == "oxygen", "oxygen (mg/L)", variable),
         variable = ifelse(variable == "temperature", "temperature (C)", variable)) %>% 
ggplot(aes(x = horizon)) +
  geom_ribbon(aes(ymin=quantile02.5, ymax=quantile97.5) , alpha=0.5) + 
  geom_line(aes(y=median)) + 
  geom_point(aes(y=observation), color="blue") + 
  facet_wrap(~variable, scales = "free") + 
  ggtitle("Forecast for Lake Suggs generated Feb 21") + 
  labs(y = "Value", x = "Horizon", color="Model") 
  #scale_fill_manual(name="", values = c("95% Confidence" = "black")) + 
  #scale_color_manual(name="", values = c("Forecast Median" = "black")) 
  #scale_sh_manual(name="", values = c("Observed" = "black"))
```

# Results

```{r}
combined_df <- df
combined_df$model_id[combined_df$model_id == "xgboost_temp_oxygen_chla_parallel"] <- "xgboost_temp_oxygen_parallel"
combined_df$model_id[combined_df$model_id == "xgboost_temp_oxygen_chla_sequential"] <- "xgboost_temp_oxygen_sequential"
combined_df$site_id[combined_df$site_id == "SUGG"] <- "BARC"
```

```{r}
model_means <- combined_df %>%   
   group_by(model_id, horizon, variable, site_id) %>% 
   summarise_at(c("crps","logs"), mean, na.rm = TRUE) 

```

## Temperature

```{r}
ggplot(model_means %>% filter(variable=="temperature") %>% filter(site_id=="BARC"), aes(x=horizon, y=logs,
      color=model_id)) + geom_line() + labs(x="Horizon", y="Log Score", color="Model")
```

## Oxygen

```{r}
ggplot(model_means %>% filter(variable=="oxygen") %>% filter(site_id=="BARC"), aes(x=horizon, y=logs, 
      color=model_id)) + geom_line() + labs(x="Horizon", y="Log Score", color="Model")
```

## Chlorphyll-a

```{r}
# library(ggpubr)
ggplot(model_means %>% filter(variable=="chla") %>% filter(site_id=="BARC"), aes(x=horizon, y=logs, color=model_id)) + geom_line() + labs(x="Horizon", y="Log Score", color="Model") #  + facet_wrap(~variable)
```

## Performance of Parallel XGBoost Relative to Climatology

<!--y axis: delta log off of climatology x axis: horizon, $delta$ log $2^{\Delta \log differnce}$ log probability with climatology add negative if its worse and positive if its better-->

```{r}
#model_means <- combined_df %>%   
#   group_by(model_id, horizon, variable, site_id) %>% 
#   summarise_at(c("crps","logs"), mean, na.rm = TRUE) 

temperature <- model_means %>% filter(variable=="temperature") %>% filter(model_id=="xgboost_temp_oxygen_parallel") %>% filter(site_id=="BARC")
oxygen <- model_means %>% filter(variable=="oxygen") %>% filter(model_id=="xgboost_temp_oxygen_parallel") %>% filter(site_id=="BARC")
chla <- model_means %>% filter(variable=="chla") %>% filter(model_id=="xgboost_temp_oxygen_parallel") %>% filter(site_id=="BARC")
ctemperature <- model_means %>% filter(variable=="temperature") %>% filter(model_id=="climatology") %>% filter(site_id=="BARC")
coxygen <- model_means %>% filter(variable=="oxygen") %>% filter(model_id=="climatology") %>% filter(site_id=="BARC")
cchla <- model_means %>% filter(variable=="chla") %>% filter(model_id=="climatology") %>% filter(site_id=="BARC")

xgb <- model_means %>% filter(model_id=="xgboost_temp_oxygen_parallel") %>% filter(site_id=="BARC")
clim <- model_means %>% filter(model_id=="climatology") %>% filter(site_id=="BARC")
xgb["diff"] = 2^(-(xgb$logs-clim$logs))

```

```{r}
moneyplot <- ggplot(xgb) + geom_line(aes(xgb$horizon, xgb$diff, color=variable)) + geom_hline(yintercept = 1) + labs(y = "Relative Density of Prediction to Climatology", x = "Horizon", color="Variable") 
moneyplot + annotate("text", y = 1.85, x = 5, label =  "Better than Climatology") + annotate("text", y = 0.15, x = 5, label = "Worse than Climatology")   + coord_cartesian(ylim=c(0,2), xlim= c(0,34))
```

# Other Themes

## Mountain Lake Biological Station


```{r}
mlscores <- neon4cast::combined_scores(theme="terrestrial_daily", collect=FALSE)

mlmodel_ids <- c("climatology", "xgboost_parallel")

mldf <- mlscores %>% 
  dplyr::filter(model_id %in% mlmodel_ids) %>% 
  dplyr::filter(reference_datetime > "2023-03-28") %>% 
  dplyr::collect()
```

```{r}
mldf <- mldf %>% dplyr::mutate(horizon = datetime-lubridate::as_datetime(reference_datetime)) %>%
    dplyr::mutate(horizon = as.numeric(lubridate::as.duration(horizon),
                                units = "seconds"),
           horizon = horizon / 86400)
mlmodel_means <- mldf %>%   
   group_by(model_id, horizon, variable, site_id) %>% 
   summarise_at(c("crps","logs"), mean, na.rm = TRUE) 

```

```{r}
ggplot(mlmodel_means %>% filter(site_id=="MLBS")%>% filter(variable=="nee")) + 
  geom_line(aes(horizon, logs, color=model_id)) + 
  ggtitle("Mountain Lake Biological Station (NEE)") + 
  scale_fill_discrete(name="Experimental\nCondition", breaks=c("climatology", "xgboost_parallel"), labels=c("Climatology", "XGBoost")) + 
  labs(y = "Log Score", x = "Horizon", color="Model") 
```


## Conclusion

- XGBoost showed potential for strong forecasting capabilities with limited model inputs
- On average outperformed climatology on Temperature and Oxygen variables across all horizons and better than climatology on the first 18 days
- Water temperature as a predictor for oxygen and chlorophyll-a did not improve predictions as currently designed 
- Further exploration into refining our application of this model to ecological forecasting could be beneficial

## References