forked from Grepath/XGBoostNeon4Casts
-
Notifications
You must be signed in to change notification settings - Fork 1
/
PresentationSP23.qmd
232 lines (164 loc) · 9.53 KB
/
PresentationSP23.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
---
title: "Using Gradient-Boosted Machine Learning Models to Forecast Lake Water Quality"
author: "Greg Harrison"
format: revealjs
bibliography: references.bib
editor: visual
---
## NEON Forecasting Challenge
- Ecological forecasting could improve natural resource management and advance predictive theory.
- The Ecological Forecasting Initiative is hosting a data science challenge to accurately forecast observations at NEON sites across the U.S.
- Focused on aquatics theme, generating 1- to 30-day ahead forecasts of water temperature, dissolved oxygen, and chlorophyll-a at 24 lakes and streams.
- Contributes to a growing set of automated forecasting models that will be used to quantify the boundaries prediction in ecology.
## NEON Workflow
[![](Images/NeonWorkflow.jpeg){width=30%, fig-align="center"}](https://essopenarchive.org/doi/full/10.22541/essoar.167079499.99891914/v1)
Figure from [@thomas2023]
## What is XGBoost?
- **XGBoost** (Extreme Gradient Boosting): Ensemble model based on decisions trees that utilizes Gradient Boosting and is optimized to run fast
- **Ensemble**: Combines multiple models together for enhanced performance
- **Gradient Boosting**: Each new model learns to correct for previous models shortcomings
- **Extreme**: Can use parallelization and GPUs to speed up training
## XGBoost Diagram
[![](Images/xgboostpicture.webp){width=80%, height=80%,fig-align=\"center\"}](https://medium.com/swlh/gradient-boosting-trees-for-classification-a-beginners-guide-596b594a14ea)
Figure from [@xgboost]
## Where has XGBoost been used before?
- Predicting groundwater levels in Selangor, Malaysia, outperforming both ANNs and SVMs [@ibrahemahmedosman2021]
- Predicting changes in ICU Covid-19 patients' sequential organ failure assessment scores [@montomoli2021]
- Predicting heart disease in clinic patients with around 91.8% accuracy [@budholiya2022]
## Questions
- How well does XGBoost perform in forecasting temperature, oxygen, and chlorophyll-a compare to baseline models such as Climatology and Persistence?
- Do oxygen and chlorophyll-a forecasts benefit from using forecasted water temperature?
## Methods
- R script to train XGBoost models on historical data, used day of year and forecasted air temperatures as inputs.
- Reserved random 20% of data to evaluate model performance.
- Forecast distribution due to combination of ensemble variance and random noise added based on model performance on testing data.
- Two main families of models, parallel where each variable was predicted independently of the others, and sequential, where predicted water temperature was used as an input for Dissolved Oxygen and Chlorophyll-a
## Parallel Model
![](Images/ParallelModel.png){width=80%, fig-align="center"}
## Sequential Model
![](Images/SequentialModel.png){width=80%, fig-align="center"}
## Methods
- Using Github actions produce a reproducible workflow that produces daily forecasts.
- Every day a new XGBoost model is trained for each NEON site and target variable, and then uses NOAA forecasts to predict values 30 days into the future and submits.
- Forecasting since February 14, 2023 (65 Days)
## Focal Analysis Sites
<!--Our results will focus on Lake Suggs and Lake Barco, two similar lakes in northern Florida.-->
![](Images/SUGG_BARC.png){width=80%, fig-align="center"}
```{r}
library(ggplot2)
library(dplyr)
library(gridExtra)
library(ggpubr)
```
```{r}
scores <- neon4cast::combined_scores(theme="aquatics", collect=FALSE)
model_ids <- c("xgboost_temp_oxygen_parallel", "xgboost_temp_oxygen_sequential", "xgboost_temp_oxygen_chla_parallel", "xgboost_temp_oxygen_chla_sequential", "climatology", "flareGLM", "persistenceRW")
df <- scores %>%
dplyr::filter(model_id %in% model_ids) %>%
dplyr::filter(reference_datetime > "2023-02-01") %>%
dplyr::collect()
```
```{r}
df <- df %>% dplyr::mutate(horizon = datetime-lubridate::as_datetime(reference_datetime)) %>%
dplyr::mutate(horizon = as.numeric(lubridate::as.duration(horizon),
units = "seconds"),
horizon = horizon / 86400)
```
## Example Forecast
```{r}
df %>% filter(site_id=="SUGG",
reference_datetime=="2023-02-21",
model_id=="xgboost_temp_oxygen_chla_parallel") %>%
mutate(variable = ifelse(variable == "chla", "chla (ug/L)", variable),
variable = ifelse(variable == "oxygen", "oxygen (mg/L)", variable),
variable = ifelse(variable == "temperature", "temperature (C)", variable)) %>%
ggplot(aes(x = horizon)) +
geom_ribbon(aes(ymin=quantile02.5, ymax=quantile97.5) , alpha=0.5) +
geom_line(aes(y=median)) +
geom_point(aes(y=observation), color="blue") +
facet_wrap(~variable, scales = "free") +
ggtitle("Forecast for Lake Suggs generated Feb 21") +
labs(y = "Value", x = "Horizon", color="Model")
#scale_fill_manual(name="", values = c("95% Confidence" = "black")) +
#scale_color_manual(name="", values = c("Forecast Median" = "black"))
#scale_sh_manual(name="", values = c("Observed" = "black"))
```
# Results
```{r}
combined_df <- df
combined_df$model_id[combined_df$model_id == "xgboost_temp_oxygen_chla_parallel"] <- "xgboost_temp_oxygen_parallel"
combined_df$model_id[combined_df$model_id == "xgboost_temp_oxygen_chla_sequential"] <- "xgboost_temp_oxygen_sequential"
combined_df$site_id[combined_df$site_id == "SUGG"] <- "BARC"
```
```{r}
model_means <- combined_df %>%
group_by(model_id, horizon, variable, site_id) %>%
summarise_at(c("crps","logs"), mean, na.rm = TRUE)
```
## Temperature
```{r}
ggplot(model_means %>% filter(variable=="temperature") %>% filter(site_id=="BARC"), aes(x=horizon, y=logs,
color=model_id)) + geom_line() + labs(x="Horizon", y="Log Score", color="Model")
```
## Oxygen
```{r}
ggplot(model_means %>% filter(variable=="oxygen") %>% filter(site_id=="BARC"), aes(x=horizon, y=logs,
color=model_id)) + geom_line() + labs(x="Horizon", y="Log Score", color="Model")
```
## Chlorphyll-a
```{r}
# library(ggpubr)
ggplot(model_means %>% filter(variable=="chla") %>% filter(site_id=="BARC"), aes(x=horizon, y=logs, color=model_id)) + geom_line() + labs(x="Horizon", y="Log Score", color="Model") # + facet_wrap(~variable)
```
## Performance of Parallel XGBoost Relative to Climatology
<!--y axis: delta log off of climatology x axis: horizon, $delta$ log $2^{\Delta \log differnce}$ log probability with climatology add negative if its worse and positive if its better-->
```{r}
#model_means <- combined_df %>%
# group_by(model_id, horizon, variable, site_id) %>%
# summarise_at(c("crps","logs"), mean, na.rm = TRUE)
temperature <- model_means %>% filter(variable=="temperature") %>% filter(model_id=="xgboost_temp_oxygen_parallel") %>% filter(site_id=="BARC")
oxygen <- model_means %>% filter(variable=="oxygen") %>% filter(model_id=="xgboost_temp_oxygen_parallel") %>% filter(site_id=="BARC")
chla <- model_means %>% filter(variable=="chla") %>% filter(model_id=="xgboost_temp_oxygen_parallel") %>% filter(site_id=="BARC")
ctemperature <- model_means %>% filter(variable=="temperature") %>% filter(model_id=="climatology") %>% filter(site_id=="BARC")
coxygen <- model_means %>% filter(variable=="oxygen") %>% filter(model_id=="climatology") %>% filter(site_id=="BARC")
cchla <- model_means %>% filter(variable=="chla") %>% filter(model_id=="climatology") %>% filter(site_id=="BARC")
xgb <- model_means %>% filter(model_id=="xgboost_temp_oxygen_parallel") %>% filter(site_id=="BARC")
clim <- model_means %>% filter(model_id=="climatology") %>% filter(site_id=="BARC")
xgb["diff"] = 2^(-(xgb$logs-clim$logs))
```
```{r}
moneyplot <- ggplot(xgb) + geom_line(aes(xgb$horizon, xgb$diff, color=variable)) + geom_hline(yintercept = 1) + labs(y = "Relative Density of Prediction to Climatology", x = "Horizon", color="Variable")
moneyplot + annotate("text", y = 1.85, x = 5, label = "Better than Climatology") + annotate("text", y = 0.15, x = 5, label = "Worse than Climatology") + coord_cartesian(ylim=c(0,2), xlim= c(0,34))
```
# Other Themes
## Mountain Lake Biological Station
```{r}
mlscores <- neon4cast::combined_scores(theme="terrestrial_daily", collect=FALSE)
mlmodel_ids <- c("climatology", "xgboost_parallel")
mldf <- mlscores %>%
dplyr::filter(model_id %in% mlmodel_ids) %>%
dplyr::filter(reference_datetime > "2023-03-28") %>%
dplyr::collect()
```
```{r}
mldf <- mldf %>% dplyr::mutate(horizon = datetime-lubridate::as_datetime(reference_datetime)) %>%
dplyr::mutate(horizon = as.numeric(lubridate::as.duration(horizon),
units = "seconds"),
horizon = horizon / 86400)
mlmodel_means <- mldf %>%
group_by(model_id, horizon, variable, site_id) %>%
summarise_at(c("crps","logs"), mean, na.rm = TRUE)
```
```{r}
ggplot(mlmodel_means %>% filter(site_id=="MLBS")%>% filter(variable=="nee")) +
geom_line(aes(horizon, logs, color=model_id)) +
ggtitle("Mountain Lake Biological Station (NEE)") +
scale_fill_discrete(name="Experimental\nCondition", breaks=c("climatology", "xgboost_parallel"), labels=c("Climatology", "XGBoost")) +
labs(y = "Log Score", x = "Horizon", color="Model")
```
## Conclusion
- XGBoost showed potential for strong forecasting capabilities with limited model inputs
- On average outperformed climatology on Temperature and Oxygen variables across all horizons and better than climatology on the first 18 days
- Water temperature as a predictor for oxygen and chlorophyll-a did not improve predictions as currently designed
- Further exploration into refining our application of this model to ecological forecasting could be beneficial
## References