Explore and analyze the continuous chlorophyll data to be included in the NDFS synthesis manuscript. We will attempt to fit multiple models to predict weekly average chlorophyll fluorescence values. All models will only include representative stations for 4 habitat types - upstream (RD22), lower Yolo Bypass (STTD), Cache Slough complex (LIB), and downstream (RVB). At a minimum, the models will contain the two categorical variables - Year and Station - as predictor variables. In some of the models, we will add weekly average flow as a continuous predictor which replaces the categorical predictor - flow action period - in the original analysis. Additionally, we’ll add a GAM smooth for Week number term to account for seasonality in some of the models. After fitting multiple models, we’ll use a model selection process to determine the best one.
# Load packages
library(tidyverse)
library(scales)
library(knitr)
library(mgcv)
library(car)
library(gratia)
library(here)
library(conflicted)
# Source functions
source(here("manuscript_synthesis/src/global_functions.R"))
# Declare package conflict preferences
conflicts_prefer(dplyr::filter(), dplyr::lag())
Display current versions of R and packages used for this analysis:
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.3 (2023-03-15 ucrt)
## os Windows 10 x64 (build 19045)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.utf8
## ctype English_United States.utf8
## tz America/Los_Angeles
## date 2024-02-29
## pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
## bslib 0.4.2 2022-12-16 [1] CRAN (R 4.2.2)
## cachem 1.0.8 2023-05-01 [1] CRAN (R 4.2.3)
## callr 3.7.3 2022-11-02 [1] CRAN (R 4.2.2)
## car * 3.1-2 2023-03-30 [1] CRAN (R 4.2.3)
## carData * 3.0-5 2022-01-06 [1] CRAN (R 4.2.1)
## cli 3.6.2 2023-12-11 [1] CRAN (R 4.2.3)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.2)
## conflicted * 1.2.0 2023-02-01 [1] CRAN (R 4.2.2)
## crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.1)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.1)
## digest 0.6.33 2023-07-07 [1] CRAN (R 4.2.3)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.2.3)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.1)
## evaluate 0.21 2023-05-05 [1] CRAN (R 4.2.3)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.2.3)
## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.2)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.2.2)
## fs * 1.6.3 2023-07-20 [1] CRAN (R 4.2.3)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
## ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.2.3)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.2.3)
## gratia * 0.8.1 2023-02-02 [1] CRAN (R 4.2.2)
## gtable 0.3.4 2023-08-21 [1] CRAN (R 4.2.3)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.2.1)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.2.3)
## htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.2.3)
## htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.2.3)
## httpuv 1.6.9 2023-02-14 [1] CRAN (R 4.2.2)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.1)
## jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.2.3)
## knitr * 1.42 2023-01-25 [1] CRAN (R 4.2.2)
## later 1.3.0 2021-08-18 [1] CRAN (R 4.2.1)
## lattice 0.21-8 2023-04-05 [1] CRAN (R 4.2.3)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.2.3)
## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.2.3)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.1)
## Matrix 1.5-4 2023-04-04 [1] CRAN (R 4.2.3)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.1)
## mgcv * 1.8-42 2023-03-02 [1] CRAN (R 4.2.2)
## mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.1)
## mvnfast 0.2.8 2023-02-23 [1] CRAN (R 4.2.2)
## nlme * 3.1-162 2023-01-31 [2] CRAN (R 4.2.3)
## patchwork * 1.1.2 2022-08-19 [1] CRAN (R 4.2.1)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.3)
## pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.2.3)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.1)
## pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.2.3)
## prettyunits 1.2.0 2023-09-24 [1] CRAN (R 4.2.3)
## processx 3.8.2 2023-06-30 [1] CRAN (R 4.2.3)
## profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.1)
## promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
## ps 1.7.5 2023-04-18 [1] CRAN (R 4.2.3)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.2.3)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.1)
## Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.2.3)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.2.3)
## remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.1)
## rlang * 1.1.3 2024-01-10 [1] CRAN (R 4.2.3)
## rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.2.3)
## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.1)
## rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
## sass 0.4.6 2023-05-03 [1] CRAN (R 4.2.3)
## scales * 1.2.1 2022-08-20 [1] CRAN (R 4.2.1)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
## shiny 1.7.4 2022-12-15 [1] CRAN (R 4.2.2)
## stringi 1.8.3 2023-12-11 [1] CRAN (R 4.2.3)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.2.3)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.2.3)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.2.3)
## tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.1)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.2.2)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.2.3)
## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.2.3)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.1)
## usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.1)
## utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.2)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.2.3)
## withr 3.0.0 2024-01-16 [1] CRAN (R 4.2.3)
## xfun 0.39 2023-04-20 [1] CRAN (R 4.2.3)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.1)
## yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.2)
##
## [1] C:/R/win-library/4.2
## [2] C:/Program Files/R/R-4.2.3/library
##
## ──────────────────────────────────────────────────────────────────────────────
# Define file path for processed data
fp_data <- here("manuscript_synthesis/data/processed")
# Import weekly average water quality data
df_wq <- readRDS(file.path(fp_data, "wq_week_avg_2013-2019.rds"))
# Import weekly average flow data
df_flow <- readRDS(file.path(fp_data, "flow_week_avg_2013-2019.rds"))
# Create a vector for the factor order of StationCode
sta_order <- c("RD22", "STTD", "LIB", "RVB")
# We will use LIS flow data as a proxy for STTD
df_flow_c <- df_flow %>% mutate(StationCode = if_else(StationCode == "LIS", "STTD", StationCode))
# Prepare chlorophyll and flow data for exploration and analysis
df_chla_c1 <- df_wq %>%
select(StationCode, Year, Week, Chla) %>%
drop_na(Chla) %>%
# Filter to only include representative stations for 4 habitat types - RD22, STTD, LIB, RVB
filter(StationCode %in% sta_order) %>%
# Join flow data to chlorophyll data
left_join(df_flow_c, by = join_by(StationCode, Year, Week)) %>%
# Remove all NA flow values
drop_na(Flow) %>%
mutate(
# Scale and log transform chlorophyll values
Chla_log = log(Chla * 1000),
# Apply factor order to StationCode
StationCode = factor(StationCode, levels = sta_order),
# Add a column for Year as a factor for the model
Year_fct = factor(Year)
) %>%
arrange(StationCode, Year, Week)
df_chla_c1 %>%
summarize(
min_week = min(Week),
max_week = max(Week),
num_samples = n(),
.by = c(StationCode, Year)
) %>%
arrange(StationCode, Year) %>%
kable()
StationCode | Year | min_week | max_week | num_samples |
---|---|---|---|---|
RD22 | 2014 | 39 | 45 | 7 |
RD22 | 2015 | 30 | 45 | 16 |
RD22 | 2016 | 25 | 38 | 14 |
RD22 | 2017 | 28 | 44 | 17 |
RD22 | 2018 | 28 | 45 | 18 |
RD22 | 2019 | 28 | 45 | 18 |
STTD | 2013 | 33 | 44 | 12 |
STTD | 2014 | 30 | 45 | 14 |
STTD | 2015 | 30 | 46 | 17 |
STTD | 2016 | 25 | 38 | 14 |
STTD | 2017 | 28 | 39 | 10 |
STTD | 2018 | 29 | 42 | 14 |
STTD | 2019 | 30 | 45 | 16 |
LIB | 2014 | 40 | 45 | 6 |
LIB | 2015 | 27 | 46 | 20 |
LIB | 2016 | 22 | 38 | 17 |
LIB | 2017 | 28 | 44 | 17 |
LIB | 2018 | 33 | 44 | 10 |
LIB | 2019 | 28 | 43 | 8 |
RVB | 2013 | 27 | 46 | 20 |
RVB | 2014 | 30 | 45 | 16 |
RVB | 2015 | 27 | 46 | 20 |
RVB | 2016 | 22 | 38 | 15 |
RVB | 2017 | 28 | 44 | 17 |
RVB | 2018 | 28 | 45 | 18 |
RVB | 2019 | 28 | 45 | 18 |
Looking at the sample counts and date ranges, we’ll only include years 2015-2019 for the analysis.
df_chla_c2 <- df_chla_c1 %>%
filter(Year %in% 2015:2019) %>%
mutate(Year_fct = fct_drop(Year_fct))
We’ll create another dataframe that has up to 2 lag variables for chlorophyll to be used in the models to help with serial autocorrelation.
df_chla_c2_lag <- df_chla_c2 %>%
# Fill in missing weeks for each StationCode-Year combination
group_by(StationCode, Year) %>%
# Create lag variables of scaled log transformed chlorophyll values
mutate(
lag1 = lag(Chla_log),
lag2 = lag(Chla_log, 2)
) %>%
ungroup()
Let’s explore the data with some plots. First, lets plot the data in scatter plots of chlorophyll and flow faceted by Station and grouping all years together.
df_chla_c2 %>%
ggplot(aes(x = Flow, y = Chla)) +
geom_point() +
geom_smooth(formula = "y ~ x") +
facet_wrap(vars(StationCode), scales = "free") +
theme_bw()
At first glance, I’m not sure how well flow is going to be able to predict chlorophyll concentrations. At the furthest upstream station - RD22 - chlorophyll appears to be highest at the lowest flows, but the variation is at its maximum at the lowest flows. There may be some dilution effect going on here at the higher flows. At STTD, there does seem to be a modest increase in chlorophyll concentrations at the mid-range flows. This pattern is even more obvious at LIB. There appears to be no effect of flow on chlorophyll at RVB, but the range of chlorophyll concentrations is narrow at this station (between 0 and 5).
Let’s break these scatterplots apart by year to see how these patterns vary annually.
df_chla_c2 %>%
ggplot(aes(x = Flow, y = Chla)) +
geom_point() +
geom_smooth(formula = "y ~ x") +
facet_wrap(
vars(StationCode, Year),
ncol = 5,
scales = "free",
labeller = labeller(.multi_line = FALSE)
) +
theme_bw()
The patterns appear to vary annually at each station, which may justify using a 3-way interaction.
First, we will attempt to fit a generalized additive model (GAM) to the data set to help account for seasonality in the data. We’ll try running a GAM using a three-way interaction between Year, Weekly Average Flow, and Station, and a cyclic penalized cubic regression spline smooth term for week number to account for seasonality (restricting the k-value to 5 to reduce overfitting). Initially, we’ll run the GAM without accounting for serial autocorrelation.
m_gam_flow3 <- gam(
Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc", k = 5),
data = df_chla_c2,
method = "REML",
knots = list(week = c(0, 52))
)
Lets look at the model summary and diagnostics:
summary(m_gam_flow3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc",
## k = 5)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.776e+00 1.065e-01 91.828 < 2e-16 ***
## Year_fct2016 -8.399e-01 1.543e-01 -5.444 1.16e-07 ***
## Year_fct2017 -9.692e-01 2.377e-01 -4.078 5.97e-05 ***
## Year_fct2018 -4.060e-01 1.421e-01 -2.858 0.004600 **
## Year_fct2019 -1.337e-01 1.396e-01 -0.958 0.338960
## Flow -1.964e-03 4.744e-04 -4.140 4.63e-05 ***
## StationCodeSTTD -1.147e+00 1.339e-01 -8.568 8.07e-16 ***
## StationCodeLIB -2.358e+00 4.002e-01 -5.893 1.12e-08 ***
## StationCodeRVB -2.720e+00 5.987e-01 -4.543 8.37e-06 ***
## Year_fct2016:Flow 2.327e-03 7.570e-04 3.074 0.002324 **
## Year_fct2017:Flow 3.152e-02 1.225e-02 2.574 0.010594 *
## Year_fct2018:Flow 5.652e-04 6.522e-04 0.867 0.386873
## Year_fct2019:Flow -5.601e-05 5.740e-04 -0.098 0.922341
## Year_fct2016:StationCodeSTTD 1.225e+00 1.969e-01 6.220 1.87e-09 ***
## Year_fct2017:StationCodeSTTD 1.205e+00 3.243e-01 3.716 0.000246 ***
## Year_fct2018:StationCodeSTTD -2.392e-01 1.879e-01 -1.273 0.204262
## Year_fct2019:StationCodeSTTD -9.220e-01 1.848e-01 -4.990 1.08e-06 ***
## Year_fct2016:StationCodeLIB 2.302e+00 8.067e-01 2.853 0.004658 **
## Year_fct2017:StationCodeLIB 1.216e+00 5.196e-01 2.339 0.020037 *
## Year_fct2018:StationCodeLIB -2.069e+00 5.209e-01 -3.972 9.12e-05 ***
## Year_fct2019:StationCodeLIB -1.666e-01 5.159e-01 -0.323 0.746950
## Year_fct2016:StationCodeRVB 2.489e+00 8.278e-01 3.006 0.002895 **
## Year_fct2017:StationCodeRVB 1.520e+00 7.592e-01 2.002 0.046309 *
## Year_fct2018:StationCodeRVB 5.291e-01 6.760e-01 0.783 0.434481
## Year_fct2019:StationCodeRVB -6.025e-01 6.932e-01 -0.869 0.385527
## Flow:StationCodeSTTD 5.176e-03 7.181e-04 7.207 5.66e-12 ***
## Flow:StationCodeLIB 1.783e-03 5.383e-04 3.312 0.001051 **
## Flow:StationCodeRVB 2.103e-03 4.964e-04 4.237 3.10e-05 ***
## Year_fct2016:Flow:StationCodeSTTD -4.440e-03 1.097e-03 -4.048 6.73e-05 ***
## Year_fct2017:Flow:StationCodeSTTD -2.326e-02 1.366e-02 -1.703 0.089738 .
## Year_fct2018:Flow:StationCodeSTTD -9.688e-04 9.659e-04 -1.003 0.316769
## Year_fct2019:Flow:StationCodeSTTD -1.569e-03 8.547e-04 -1.836 0.067435 .
## Year_fct2016:Flow:StationCodeLIB -1.936e-03 9.423e-04 -2.055 0.040881 *
## Year_fct2017:Flow:StationCodeLIB -3.143e-02 1.225e-02 -2.566 0.010814 *
## Year_fct2018:Flow:StationCodeLIB -1.440e-03 8.404e-04 -1.713 0.087812 .
## Year_fct2019:Flow:StationCodeLIB 2.452e-04 7.403e-04 0.331 0.740800
## Year_fct2016:Flow:StationCodeRVB -2.582e-03 7.770e-04 -3.322 0.001015 **
## Year_fct2017:Flow:StationCodeRVB -3.168e-02 1.225e-02 -2.586 0.010237 *
## Year_fct2018:Flow:StationCodeRVB -6.853e-04 6.678e-04 -1.026 0.305729
## Year_fct2019:Flow:StationCodeRVB -1.390e-05 5.930e-04 -0.023 0.981311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.722 3 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.898 Deviance explained = 91.1%
## -REML = 255 Scale est. = 0.10912 n = 314
appraise(m_gam_flow3)
shapiro.test(residuals(m_gam_flow3))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_flow3)
## W = 0.97687, p-value = 5.975e-05
k.check(m_gam_flow3)
## k' edf k-index p-value
## s(Week) 3 2.72227 0.9693372 0.29
draw(m_gam_flow3, select = 1, residuals = TRUE, rug = FALSE)
plot(m_gam_flow3, pages = 1, all.terms = TRUE)
acf(residuals(m_gam_flow3))
Box.test(residuals(m_gam_flow3), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_flow3)
## X-squared = 105.31, df = 20, p-value = 1.398e-13
Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look pretty good. However, the residuals are autocorrelated.
Now, we’ll try to deal with the residual autocorrelation. We’ll run a series of models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.
m_gam_flow3_lag1 <- gam(
Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc", k = 5) + lag1,
data = df_chla_c2_lag,
method = "REML",
knots = list(week = c(0, 52))
)
acf(residuals(m_gam_flow3_lag1))
Box.test(residuals(m_gam_flow3_lag1), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_flow3_lag1)
## X-squared = 28.061, df = 20, p-value = 0.108
m_gam_flow3_lag2 <- gam(
Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc", k = 5) + lag1 + lag2,
data = df_chla_c2_lag,
method = "REML",
knots = list(week = c(0, 52))
)
acf(residuals(m_gam_flow3_lag2))
Box.test(residuals(m_gam_flow3_lag2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_flow3_lag2)
## X-squared = 15.686, df = 20, p-value = 0.7359
The model with 1 lag term already seems to address the serial autocorrelation. Let’s use AIC to see how they compare.
AIC(m_gam_flow3, m_gam_flow3_lag1, m_gam_flow3_lag2)
## df AIC
## m_gam_flow3 43.94624 237.4562
## m_gam_flow3_lag1 44.81188 158.7898
## m_gam_flow3_lag2 45.81545 154.3799
It looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.
summary(m_gam_flow3_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc",
## k = 5) + lag1 + lag2
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.942e+00 5.626e-01 10.563 < 2e-16 ***
## Year_fct2016 -5.724e-01 1.590e-01 -3.599 0.000391 ***
## Year_fct2017 -6.505e-01 2.400e-01 -2.711 0.007223 **
## Year_fct2018 -2.955e-01 1.412e-01 -2.093 0.037481 *
## Year_fct2019 -7.407e-02 1.378e-01 -0.538 0.591405
## Flow -1.364e-03 4.548e-04 -2.999 0.003008 **
## StationCodeSTTD -6.939e-01 1.437e-01 -4.830 2.49e-06 ***
## StationCodeLIB -1.555e+00 3.969e-01 -3.918 0.000118 ***
## StationCodeRVB -1.970e+00 6.566e-01 -3.001 0.002988 **
## lag1 5.316e-01 6.573e-02 8.088 3.49e-14 ***
## lag2 -1.386e-01 6.071e-02 -2.282 0.023376 *
## Year_fct2016:Flow 1.446e-03 7.135e-04 2.027 0.043828 *
## Year_fct2017:Flow 2.220e-02 1.352e-02 1.642 0.101936
## Year_fct2018:Flow 5.982e-04 6.055e-04 0.988 0.324231
## Year_fct2019:Flow 1.557e-04 5.359e-04 0.291 0.771614
## Year_fct2016:StationCodeSTTD 8.113e-01 2.031e-01 3.995 8.73e-05 ***
## Year_fct2017:StationCodeSTTD 8.102e-01 3.275e-01 2.474 0.014073 *
## Year_fct2018:StationCodeSTTD -6.404e-02 1.869e-01 -0.343 0.732109
## Year_fct2019:StationCodeSTTD -5.797e-01 1.899e-01 -3.052 0.002538 **
## Year_fct2016:StationCodeLIB 2.058e+00 8.967e-01 2.295 0.022639 *
## Year_fct2017:StationCodeLIB 8.948e-01 4.964e-01 1.803 0.072770 .
## Year_fct2018:StationCodeLIB -8.208e-01 5.527e-01 -1.485 0.138923
## Year_fct2019:StationCodeLIB -4.930e-02 4.829e-01 -0.102 0.918769
## Year_fct2016:StationCodeRVB 1.952e+00 9.201e-01 2.121 0.034971 *
## Year_fct2017:StationCodeRVB 1.055e+00 7.978e-01 1.323 0.187191
## Year_fct2018:StationCodeRVB 6.910e-01 7.092e-01 0.974 0.330883
## Year_fct2019:StationCodeRVB -1.381e-01 7.264e-01 -0.190 0.849409
## Flow:StationCodeSTTD 3.192e-03 7.093e-04 4.499 1.08e-05 ***
## Flow:StationCodeLIB 1.191e-03 5.202e-04 2.289 0.022987 *
## Flow:StationCodeRVB 1.517e-03 4.813e-04 3.151 0.001842 **
## Year_fct2016:Flow:StationCodeSTTD -2.740e-03 1.034e-03 -2.649 0.008627 **
## Year_fct2017:Flow:StationCodeSTTD -1.654e-02 1.472e-02 -1.124 0.262253
## Year_fct2018:Flow:StationCodeSTTD -8.351e-04 9.053e-04 -0.923 0.357214
## Year_fct2019:Flow:StationCodeSTTD -9.759e-04 7.996e-04 -1.221 0.223506
## Year_fct2016:Flow:StationCodeLIB -7.654e-04 9.668e-04 -0.792 0.429389
## Year_fct2017:Flow:StationCodeLIB -2.206e-02 1.352e-02 -1.632 0.104110
## Year_fct2018:Flow:StationCodeLIB -4.712e-04 8.824e-04 -0.534 0.593834
## Year_fct2019:Flow:StationCodeLIB -5.711e-05 6.916e-04 -0.083 0.934265
## Year_fct2016:Flow:StationCodeRVB -1.681e-03 7.394e-04 -2.274 0.023898 *
## Year_fct2017:Flow:StationCodeRVB -2.235e-02 1.352e-02 -1.653 0.099754 .
## Year_fct2018:Flow:StationCodeRVB -7.416e-04 6.256e-04 -1.186 0.237042
## Year_fct2019:Flow:StationCodeRVB -2.587e-04 5.598e-04 -0.462 0.644462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.461 3 7.763 1.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.918 Deviance explained = 93.1%
## -REML = 215.73 Scale est. = 0.087877 n = 274
appraise(m_gam_flow3_lag2)
shapiro.test(residuals(m_gam_flow3_lag2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_flow3_lag2)
## W = 0.96317, p-value = 1.857e-06
k.check(m_gam_flow3_lag2)
## k' edf k-index p-value
## s(Week) 3 2.460695 0.9467043 0.1675
draw(m_gam_flow3_lag2, select = 1, residuals = TRUE, rug = FALSE)
plot(m_gam_flow3_lag2, pages = 1, all.terms = TRUE)
anova(m_gam_flow3_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc",
## k = 5) + lag1 + lag2
##
## Parametric Terms:
## df F p-value
## Year_fct 4 4.954 0.000755
## Flow 1 8.993 0.003008
## StationCode 3 11.462 4.96e-07
## lag1 1 65.416 3.49e-14
## lag2 1 5.210 0.023376
## Year_fct:Flow 4 1.873 0.116090
## Year_fct:StationCode 12 4.300 3.77e-06
## Flow:StationCode 3 7.237 0.000116
## Year_fct:Flow:StationCode 12 1.241 0.256118
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.461 3.000 7.763 1.33e-05
The model diagnostics look pretty good. Note that the 3-way
interaction between Year, Station, and Flow isn’t significant. We’ll use
m_gam_flow3_lag2
in the model selection process.
rm(m_gam_flow3, m_gam_flow3_lag1)
Now we’ll try running a GAM using all two-way interactions between Year, Flow, and Station.
m_gam_flow2 <- gam(
Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", k = 5),
data = df_chla_c2,
method = "REML",
knots = list(week = c(0, 52))
)
Lets look at the model summary and diagnostics:
summary(m_gam_flow2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc",
## k = 5)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.711e+00 9.377e-02 103.564 < 2e-16 ***
## Year_fct2016 -5.547e-01 1.298e-01 -4.273 2.64e-05 ***
## Year_fct2017 -3.981e-01 1.251e-01 -3.183 0.001620 **
## Year_fct2018 -3.184e-01 1.205e-01 -2.641 0.008722 **
## Year_fct2019 -1.354e-01 1.206e-01 -1.123 0.262566
## Flow -1.452e-03 2.513e-04 -5.780 1.97e-08 ***
## StationCodeSTTD -1.041e+00 1.256e-01 -8.287 4.70e-15 ***
## StationCodeLIB -2.074e+00 3.037e-01 -6.831 5.13e-11 ***
## StationCodeRVB -2.597e+00 5.281e-01 -4.918 1.49e-06 ***
## Year_fct2016:Flow -2.286e-04 1.412e-04 -1.619 0.106640
## Year_fct2017:Flow -1.430e-04 1.304e-04 -1.096 0.274109
## Year_fct2018:Flow -1.149e-04 1.307e-04 -0.879 0.380090
## Year_fct2019:Flow -7.474e-05 1.310e-04 -0.571 0.568768
## Year_fct2016:StationCodeSTTD 8.459e-01 1.788e-01 4.732 3.52e-06 ***
## Year_fct2017:StationCodeSTTD 3.354e-01 1.870e-01 1.794 0.073912 .
## Year_fct2018:StationCodeSTTD -3.324e-01 1.738e-01 -1.913 0.056752 .
## Year_fct2019:StationCodeSTTD -1.007e+00 1.709e-01 -5.895 1.07e-08 ***
## Year_fct2016:StationCodeLIB 1.112e+00 2.863e-01 3.884 0.000128 ***
## Year_fct2017:StationCodeLIB 3.338e-01 2.693e-01 1.240 0.216084
## Year_fct2018:StationCodeLIB -1.770e+00 2.879e-01 -6.149 2.64e-09 ***
## Year_fct2019:StationCodeLIB -4.850e-01 2.927e-01 -1.657 0.098653 .
## Year_fct2016:StationCodeRVB 2.036e+00 7.646e-01 2.663 0.008182 **
## Year_fct2017:StationCodeRVB 8.638e-01 6.689e-01 1.291 0.197600
## Year_fct2018:StationCodeRVB 4.469e-01 6.078e-01 0.735 0.462836
## Year_fct2019:StationCodeRVB -5.091e-01 6.217e-01 -0.819 0.413560
## Flow:StationCodeSTTD 3.620e-03 3.234e-04 11.195 < 2e-16 ***
## Flow:StationCodeLIB 1.409e-03 2.834e-04 4.970 1.16e-06 ***
## Flow:StationCodeRVB 1.578e-03 2.342e-04 6.739 8.92e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.687 3 22.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.888 Deviance explained = 89.8%
## -REML = 200.62 Scale est. = 0.12004 n = 314
appraise(m_gam_flow2)
shapiro.test(residuals(m_gam_flow2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_flow2)
## W = 0.98376, p-value = 0.001287
k.check(m_gam_flow2)
## k' edf k-index p-value
## s(Week) 3 2.686871 0.9690676 0.275
draw(m_gam_flow2, select = 1, residuals = TRUE, rug = FALSE)
plot(m_gam_flow2, pages = 1, all.terms = TRUE)
acf(residuals(m_gam_flow2))
Box.test(residuals(m_gam_flow2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_flow2)
## X-squared = 144.04, df = 20, p-value < 2.2e-16
Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look really good. However, the residuals are autocorrelated.
Now, we’ll try to deal with the residual autocorrelation. We’ll run a series of models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.
m_gam_flow2_lag1 <- gam(
Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", k = 5) + lag1,
data = df_chla_c2_lag,
method = "REML",
knots = list(week = c(0, 52))
)
acf(residuals(m_gam_flow2_lag1))
Box.test(residuals(m_gam_flow2_lag1), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_flow2_lag1)
## X-squared = 30.611, df = 20, p-value = 0.06054
m_gam_flow2_lag2 <- gam(
Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", k = 5) + lag1 + lag2,
data = df_chla_c2_lag,
method = "REML",
knots = list(week = c(0, 52))
)
acf(residuals(m_gam_flow2_lag2))
Box.test(residuals(m_gam_flow2_lag2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_flow2_lag2)
## X-squared = 18.837, df = 20, p-value = 0.5324
The model with 1 lag term already seems to address the serial autocorrelation, but the lag2 model is even better. Let’s use AIC to see how they compare.
AIC(m_gam_flow2, m_gam_flow2_lag1, m_gam_flow2_lag2)
## df AIC
## m_gam_flow2 31.93061 256.9933
## m_gam_flow2_lag1 32.78679 153.1460
## m_gam_flow2_lag2 33.79579 148.0033
Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.
summary(m_gam_flow2_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc",
## k = 5) + lag1 + lag2
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.338e+00 5.236e-01 10.194 < 2e-16 ***
## Year_fct2016 -3.493e-01 1.243e-01 -2.811 0.005349 **
## Year_fct2017 -2.403e-01 1.175e-01 -2.045 0.041952 *
## Year_fct2018 -1.804e-01 1.128e-01 -1.600 0.110883
## Year_fct2019 -3.798e-02 1.118e-01 -0.340 0.734386
## Flow -8.481e-04 2.374e-04 -3.573 0.000426 ***
## StationCodeSTTD -5.400e-01 1.261e-01 -4.281 2.69e-05 ***
## StationCodeLIB -1.093e+00 3.071e-01 -3.559 0.000448 ***
## StationCodeRVB -1.434e+00 5.446e-01 -2.634 0.008992 **
## lag1 5.876e-01 6.315e-02 9.305 < 2e-16 ***
## lag2 -1.389e-01 5.918e-02 -2.346 0.019763 *
## Year_fct2016:Flow -1.337e-04 1.446e-04 -0.924 0.356166
## Year_fct2017:Flow -6.283e-05 1.314e-04 -0.478 0.633076
## Year_fct2018:Flow -6.137e-05 1.311e-04 -0.468 0.640055
## Year_fct2019:Flow -2.909e-05 1.311e-04 -0.222 0.824550
## Year_fct2016:StationCodeSTTD 4.972e-01 1.701e-01 2.923 0.003798 **
## Year_fct2017:StationCodeSTTD 1.932e-01 1.764e-01 1.095 0.274447
## Year_fct2018:StationCodeSTTD -1.479e-01 1.610e-01 -0.919 0.359077
## Year_fct2019:StationCodeSTTD -6.087e-01 1.647e-01 -3.697 0.000270 ***
## Year_fct2016:StationCodeLIB 6.306e-01 2.856e-01 2.208 0.028198 *
## Year_fct2017:StationCodeLIB 1.881e-01 2.575e-01 0.730 0.465824
## Year_fct2018:StationCodeLIB -1.098e+00 2.886e-01 -3.804 0.000181 ***
## Year_fct2019:StationCodeLIB -3.000e-01 2.815e-01 -1.066 0.287667
## Year_fct2016:StationCodeRVB 1.217e+00 8.017e-01 1.519 0.130194
## Year_fct2017:StationCodeRVB 3.038e-01 6.574e-01 0.462 0.644412
## Year_fct2018:StationCodeRVB 2.598e-01 5.860e-01 0.443 0.657883
## Year_fct2019:StationCodeRVB -3.972e-01 5.960e-01 -0.666 0.505767
## Flow:StationCodeSTTD 2.007e-03 3.206e-04 6.262 1.72e-09 ***
## Flow:StationCodeLIB 8.623e-04 2.659e-04 3.242 0.001352 **
## Flow:StationCodeRVB 9.164e-04 2.177e-04 4.210 3.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.435 3 8.747 3.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.917 Deviance explained = 92.7%
## -REML = 149.4 Scale est. = 0.089063 n = 274
appraise(m_gam_flow2_lag2)
shapiro.test(residuals(m_gam_flow2_lag2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_flow2_lag2)
## W = 0.96772, p-value = 7.839e-06
k.check(m_gam_flow2_lag2)
## k' edf k-index p-value
## s(Week) 3 2.435016 0.9459394 0.155
draw(m_gam_flow2_lag2, select = 1, residuals = TRUE, rug = FALSE)
plot(m_gam_flow2_lag2, pages = 1, all.terms = TRUE)
anova(m_gam_flow2_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc",
## k = 5) + lag1 + lag2
##
## Parametric Terms:
## df F p-value
## Year_fct 4 2.789 0.027126
## Flow 1 12.765 0.000426
## StationCode 3 10.800 1.10e-06
## lag1 1 86.583 < 2e-16
## lag2 1 5.506 0.019763
## Year_fct:Flow 4 0.551 0.698106
## Year_fct:StationCode 12 5.788 8.65e-09
## Flow:StationCode 3 13.123 5.74e-08
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.435 3.000 8.747 3.04e-06
The model diagnostics look pretty good. Note that the 2-way
interaction between Year and Flow isn’t significant. We’ll use
m_gam_flow2_lag2
in the model selection process.
rm(m_gam_flow2, m_gam_flow2_lag1)
Next we’ll try running a GAM using a two-way interaction between Year and Station but not including flow as a predictor.
m_gam_cat2 <- gam(
Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5),
data = df_chla_c2,
method = "REML",
knots = list(week = c(0, 52))
)
Lets look at the model summary and diagnostics:
summary(m_gam_cat2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.50498 0.10404 91.360 < 2e-16 ***
## Year_fct2016 -0.54029 0.15349 -3.520 0.000501 ***
## Year_fct2017 -0.21984 0.14466 -1.520 0.129675
## Year_fct2018 -0.29821 0.14264 -2.091 0.037427 *
## Year_fct2019 -0.14474 0.14264 -1.015 0.311065
## StationCodeSTTD -0.76699 0.14452 -5.307 2.21e-07 ***
## StationCodeLIB -1.79937 0.13941 -12.907 < 2e-16 ***
## StationCodeRVB -1.85932 0.13941 -13.337 < 2e-16 ***
## Year_fct2016:StationCodeSTTD 0.85363 0.21322 4.003 7.93e-05 ***
## Year_fct2017:StationCodeSTTD 0.03494 0.22023 0.159 0.874055
## Year_fct2018:StationCodeSTTD -0.32131 0.20713 -1.551 0.121925
## Year_fct2019:StationCodeSTTD -0.84627 0.20306 -4.168 4.06e-05 ***
## Year_fct2016:StationCodeLIB 1.41654 0.20470 6.920 2.87e-11 ***
## Year_fct2017:StationCodeLIB 0.26245 0.19919 1.318 0.188682
## Year_fct2018:StationCodeLIB -1.75609 0.21578 -8.138 1.17e-14 ***
## Year_fct2019:StationCodeLIB -0.46106 0.22475 -2.051 0.041121 *
## Year_fct2016:StationCodeRVB 0.63920 0.20807 3.072 0.002327 **
## Year_fct2017:StationCodeRVB -0.02780 0.19919 -0.140 0.889118
## Year_fct2018:StationCodeRVB -0.01988 0.19635 -0.101 0.919435
## Year_fct2019:StationCodeRVB -0.60985 0.19635 -3.106 0.002083 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.568 3 15.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.839 Deviance explained = 85%
## -REML = 189.6 Scale est. = 0.17204 n = 314
appraise(m_gam_cat2)
shapiro.test(residuals(m_gam_cat2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_cat2)
## W = 0.98769, p-value = 0.009076
k.check(m_gam_cat2)
## k' edf k-index p-value
## s(Week) 3 2.56794 0.9387348 0.1675
draw(m_gam_cat2, select = 1, residuals = TRUE, rug = FALSE)
plot(m_gam_cat2, pages = 1, all.terms = TRUE)
acf(residuals(m_gam_cat2))
Box.test(residuals(m_gam_cat2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_cat2)
## X-squared = 278.51, df = 20, p-value < 2.2e-16
Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look really good. However, the residuals are autocorrelated.
Now, we’ll try to deal with the residual autocorrelation. We’ll run a series of models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.
m_gam_cat2_lag1 <- gam(
Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1,
data = df_chla_c2_lag,
method = "REML",
knots = list(week = c(0, 52))
)
acf(residuals(m_gam_cat2_lag1))
Box.test(residuals(m_gam_cat2_lag1), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_cat2_lag1)
## X-squared = 26.832, df = 20, p-value = 0.1401
m_gam_cat2_lag2 <- gam(
Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 + lag2,
data = df_chla_c2_lag,
method = "REML",
knots = list(week = c(0, 52))
)
acf(residuals(m_gam_cat2_lag2))
Box.test(residuals(m_gam_cat2_lag2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_cat2_lag2)
## X-squared = 15.258, df = 20, p-value = 0.7615
The model with 1 lag term already seems to address the serial autocorrelation, but the lag2 model is even better. Let’s use AIC to see how they compare.
AIC(m_gam_cat2, m_gam_cat2_lag1, m_gam_cat2_lag2)
## df AIC
## m_gam_cat2 23.86710 362.7642
## m_gam_cat2_lag1 24.59559 189.2854
## m_gam_cat2_lag2 25.62646 176.4000
Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.
summary(m_gam_cat2_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5) +
## lag1 + lag2
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.968075 0.509024 7.795 1.73e-13 ***
## Year_fct2016 -0.264678 0.129335 -2.046 0.041759 *
## Year_fct2017 -0.094817 0.118855 -0.798 0.425768
## Year_fct2018 -0.120328 0.117920 -1.020 0.308514
## Year_fct2019 -0.020108 0.117202 -0.172 0.863919
## StationCodeSTTD -0.256423 0.124046 -2.067 0.039750 *
## StationCodeLIB -0.744025 0.146809 -5.068 7.83e-07 ***
## StationCodeRVB -0.768135 0.150118 -5.117 6.20e-07 ***
## lag1 0.757746 0.061040 12.414 < 2e-16 ***
## lag2 -0.178844 0.062076 -2.881 0.004308 **
## Year_fct2016:StationCodeSTTD 0.383395 0.179813 2.132 0.033967 *
## Year_fct2017:StationCodeSTTD -0.023914 0.183626 -0.130 0.896489
## Year_fct2018:StationCodeSTTD -0.111566 0.170698 -0.654 0.513979
## Year_fct2019:StationCodeSTTD -0.412364 0.171467 -2.405 0.016904 *
## Year_fct2016:StationCodeLIB 0.616167 0.183385 3.360 0.000902 ***
## Year_fct2017:StationCodeLIB 0.103366 0.163030 0.634 0.526644
## Year_fct2018:StationCodeLIB -0.840384 0.200085 -4.200 3.71e-05 ***
## Year_fct2019:StationCodeLIB -0.212731 0.191415 -1.111 0.267482
## Year_fct2016:StationCodeRVB 0.272513 0.173941 1.567 0.118452
## Year_fct2017:StationCodeRVB -0.037916 0.162719 -0.233 0.815938
## Year_fct2018:StationCodeRVB 0.003196 0.160067 0.020 0.984088
## Year_fct2019:StationCodeRVB -0.286425 0.162510 -1.763 0.079207 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.214 3 5.061 0.000402 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.906 Deviance explained = 91.4%
## -REML = 101.27 Scale est. = 0.10141 n = 274
appraise(m_gam_cat2_lag2)
shapiro.test(residuals(m_gam_cat2_lag2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_cat2_lag2)
## W = 0.9529, p-value = 9.872e-08
k.check(m_gam_cat2_lag2)
## k' edf k-index p-value
## s(Week) 3 2.214206 0.9270206 0.09
draw(m_gam_cat2_lag2, select = 1, residuals = TRUE, rug = FALSE)
plot(m_gam_cat2_lag2, pages = 1, all.terms = TRUE)
anova(m_gam_cat2_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5) +
## lag1 + lag2
##
## Parametric Terms:
## df F p-value
## Year_fct 4 1.324 0.26146
## StationCode 3 10.923 9.14e-07
## lag1 1 154.104 < 2e-16
## lag2 1 8.300 0.00431
## Year_fct:StationCode 12 3.571 6.34e-05
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Week) 2.214 3.000 5.061 0.000402
The model diagnostics look pretty good but not quite as good as with
the initial model. We’ll use m_gam_cat2_lag2
in the model
selection process.
rm(m_gam_cat2, m_gam_cat2_lag1)
Let’s try the weekly average model as a linear model with a three-way interaction between Year, Weekly Average Flow, and Station but without the smooth term for week number. Initially, we’ll run the model without accounting for serial autocorrelation.
m_lm_flow3 <- lm(Chla_log ~ Year_fct * Flow * StationCode, data = df_chla_c2)
Lets look at the model summary and diagnostics:
summary(m_lm_flow3)
##
## Call:
## lm(formula = Chla_log ~ Year_fct * Flow * StationCode, data = df_chla_c2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98399 -0.20320 -0.01209 0.17753 1.31988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.762e+00 1.160e-01 84.185 < 2e-16 ***
## Year_fct2016 -7.803e-01 1.678e-01 -4.651 5.13e-06 ***
## Year_fct2017 -1.275e+00 2.557e-01 -4.987 1.09e-06 ***
## Year_fct2018 -3.717e-01 1.557e-01 -2.387 0.017642 *
## Year_fct2019 -1.064e-01 1.529e-01 -0.696 0.486980
## Flow -2.168e-03 5.114e-04 -4.239 3.07e-05 ***
## StationCodeSTTD -1.148e+00 1.467e-01 -7.830 1.07e-13 ***
## StationCodeLIB -2.487e+00 4.255e-01 -5.846 1.43e-08 ***
## StationCodeRVB -3.270e+00 6.506e-01 -5.027 9.03e-07 ***
## Year_fct2016:Flow 2.807e-03 8.177e-04 3.432 0.000691 ***
## Year_fct2017:Flow 4.980e-02 1.296e-02 3.842 0.000152 ***
## Year_fct2018:Flow 4.445e-04 7.150e-04 0.622 0.534642
## Year_fct2019:Flow -7.480e-05 6.283e-04 -0.119 0.905318
## Year_fct2016:StationCodeSTTD 1.250e+00 2.156e-01 5.798 1.84e-08 ***
## Year_fct2017:StationCodeSTTD 1.716e+00 3.439e-01 4.989 1.08e-06 ***
## Year_fct2018:StationCodeSTTD -2.869e-01 2.057e-01 -1.395 0.164109
## Year_fct2019:StationCodeSTTD -9.572e-01 2.025e-01 -4.726 3.67e-06 ***
## Year_fct2016:StationCodeLIB 2.235e+00 8.545e-01 2.615 0.009414 **
## Year_fct2017:StationCodeLIB 1.584e+00 5.556e-01 2.850 0.004698 **
## Year_fct2018:StationCodeLIB -2.048e+00 5.678e-01 -3.606 0.000369 ***
## Year_fct2019:StationCodeLIB -2.341e-01 5.550e-01 -0.422 0.673534
## Year_fct2016:StationCodeRVB 3.488e+00 8.735e-01 3.993 8.39e-05 ***
## Year_fct2017:StationCodeRVB 2.697e+00 8.098e-01 3.330 0.000986 ***
## Year_fct2018:StationCodeRVB 9.703e-01 7.291e-01 1.331 0.184338
## Year_fct2019:StationCodeRVB 2.551e-02 7.444e-01 0.034 0.972687
## Flow:StationCodeSTTD 4.930e-03 7.866e-04 6.267 1.42e-09 ***
## Flow:StationCodeLIB 1.893e-03 5.719e-04 3.309 0.001061 **
## Flow:StationCodeRVB 2.442e-03 5.331e-04 4.580 7.07e-06 ***
## Year_fct2016:Flow:StationCodeSTTD -4.311e-03 1.203e-03 -3.585 0.000399 ***
## Year_fct2017:Flow:StationCodeSTTD -3.529e-02 1.471e-02 -2.399 0.017117 *
## Year_fct2018:Flow:StationCodeSTTD -7.185e-04 1.057e-03 -0.680 0.497159
## Year_fct2019:Flow:StationCodeSTTD -1.288e-03 9.364e-04 -1.375 0.170214
## Year_fct2016:Flow:StationCodeLIB -2.501e-03 9.998e-04 -2.502 0.012938 *
## Year_fct2017:Flow:StationCodeLIB -4.968e-02 1.297e-02 -3.830 0.000159 ***
## Year_fct2018:Flow:StationCodeLIB -1.229e-03 9.207e-04 -1.335 0.182876
## Year_fct2019:Flow:StationCodeLIB 8.259e-05 8.062e-04 0.102 0.918482
## Year_fct2016:Flow:StationCodeRVB -3.242e-03 8.342e-04 -3.887 0.000127 ***
## Year_fct2017:Flow:StationCodeRVB -5.013e-02 1.296e-02 -3.867 0.000138 ***
## Year_fct2018:Flow:StationCodeRVB -6.903e-04 7.317e-04 -0.943 0.346284
## Year_fct2019:Flow:StationCodeRVB -1.432e-04 6.474e-04 -0.221 0.825074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3624 on 274 degrees of freedom
## Multiple R-squared: 0.8923, Adjusted R-squared: 0.877
## F-statistic: 58.23 on 39 and 274 DF, p-value: < 2.2e-16
df_chla_c2 %>% plot_lm_diag(Chla_log, m_lm_flow3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
shapiro.test(residuals(m_lm_flow3))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_lm_flow3)
## W = 0.97352, p-value = 1.555e-05
acf(residuals(m_lm_flow3))
Box.test(residuals(m_lm_flow3), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_flow3)
## X-squared = 137.15, df = 20, p-value < 2.2e-16
The residuals deviate from a normal distribution according to visual inspection and the Shapiro-Wilk normality test. Also, model definitely has residual autocorrelation as indicated by the ACF plot and the Box-Ljung test.
Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.
m_lm_flow3_lag1 <- df_chla_c2_lag %>%
drop_na(Chla_log, lag1) %>%
lm(Chla_log ~ Year_fct * Flow * StationCode + lag1, data = .)
acf(residuals(m_lm_flow3_lag1))
Box.test(residuals(m_lm_flow3_lag1), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_flow3_lag1)
## X-squared = 26.352, df = 20, p-value = 0.1545
m_lm_flow3_lag2 <- df_chla_c2_lag %>%
drop_na(Chla_log, lag1, lag2) %>%
lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2, data = .)
acf(residuals(m_lm_flow3_lag2))
Box.test(residuals(m_lm_flow3_lag2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_flow3_lag2)
## X-squared = 13.106, df = 20, p-value = 0.8728
The model with 1 lag term already has better ACF and Box-Ljung test results than the initial model. Let’s use AIC to see how they compare.
AIC(m_lm_flow3, m_lm_flow3_lag1, m_lm_flow3_lag2)
## df AIC
## m_lm_flow3 41 292.7994
## m_lm_flow3_lag1 42 183.8907
## m_lm_flow3_lag2 43 177.8827
Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.
summary(m_lm_flow3_lag2)
##
## Call:
## lm(formula = Chla_log ~ Year_fct * Flow * StationCode + lag1 +
## lag2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10604 -0.15916 -0.00246 0.11706 0.97717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.151e+00 5.577e-01 9.236 < 2e-16 ***
## Year_fct2016 -4.942e-01 1.616e-01 -3.057 0.002495 **
## Year_fct2017 -7.132e-01 2.504e-01 -2.848 0.004796 **
## Year_fct2018 -2.494e-01 1.473e-01 -1.693 0.091852 .
## Year_fct2019 -5.465e-02 1.440e-01 -0.380 0.704640
## Flow -1.413e-03 4.645e-04 -3.043 0.002615 **
## StationCodeSTTD -6.074e-01 1.488e-01 -4.081 6.16e-05 ***
## StationCodeLIB -1.326e+00 4.060e-01 -3.266 0.001256 **
## StationCodeRVB -2.238e+00 6.850e-01 -3.268 0.001249 **
## lag1 6.009e-01 6.712e-02 8.954 < 2e-16 ***
## lag2 -1.262e-01 6.347e-02 -1.989 0.047871 *
## Year_fct2016:Flow 1.648e-03 7.277e-04 2.265 0.024448 *
## Year_fct2017:Flow 2.732e-02 1.386e-02 1.971 0.049931 *
## Year_fct2018:Flow 5.512e-04 6.344e-04 0.869 0.385818
## Year_fct2019:Flow 2.174e-04 5.600e-04 0.388 0.698236
## Year_fct2016:StationCodeSTTD 7.355e-01 2.116e-01 3.476 0.000607 ***
## Year_fct2017:StationCodeSTTD 9.581e-01 3.364e-01 2.848 0.004790 **
## Year_fct2018:StationCodeSTTD -9.220e-02 1.956e-01 -0.471 0.637855
## Year_fct2019:StationCodeSTTD -5.315e-01 1.989e-01 -2.672 0.008076 **
## Year_fct2016:StationCodeLIB 1.796e+00 9.278e-01 1.935 0.054173 .
## Year_fct2017:StationCodeLIB 8.549e-01 5.126e-01 1.668 0.096688 .
## Year_fct2018:StationCodeLIB -5.611e-01 5.719e-01 -0.981 0.327564
## Year_fct2019:StationCodeLIB -1.799e-01 4.976e-01 -0.362 0.718000
## Year_fct2016:StationCodeRVB 2.675e+00 9.350e-01 2.861 0.004608 **
## Year_fct2017:StationCodeRVB 1.758e+00 8.198e-01 2.144 0.033071 *
## Year_fct2018:StationCodeRVB 1.155e+00 7.307e-01 1.581 0.115196
## Year_fct2019:StationCodeRVB 4.856e-01 7.436e-01 0.653 0.514402
## Flow:StationCodeSTTD 2.734e-03 7.374e-04 3.707 0.000262 ***
## Flow:StationCodeLIB 1.279e-03 5.221e-04 2.450 0.015009 *
## Flow:StationCodeRVB 1.673e-03 4.912e-04 3.406 0.000778 ***
## Year_fct2016:Flow:StationCodeSTTD -2.337e-03 1.081e-03 -2.161 0.031691 *
## Year_fct2017:Flow:StationCodeSTTD -1.781e-02 1.525e-02 -1.168 0.244031
## Year_fct2018:Flow:StationCodeSTTD -5.762e-04 9.461e-04 -0.609 0.543101
## Year_fct2019:Flow:StationCodeSTTD -7.217e-04 8.369e-04 -0.862 0.389363
## Year_fct2016:Flow:StationCodeLIB -1.086e-03 9.786e-04 -1.110 0.268090
## Year_fct2017:Flow:StationCodeLIB -2.723e-02 1.387e-02 -1.964 0.050720 .
## Year_fct2018:Flow:StationCodeLIB -2.727e-05 9.181e-04 -0.030 0.976333
## Year_fct2019:Flow:StationCodeLIB -2.769e-04 7.181e-04 -0.386 0.700161
## Year_fct2016:Flow:StationCodeRVB -2.024e-03 7.498e-04 -2.700 0.007442 **
## Year_fct2017:Flow:StationCodeRVB -2.759e-02 1.386e-02 -1.990 0.047745 *
## Year_fct2018:Flow:StationCodeRVB -8.088e-04 6.547e-04 -1.235 0.217949
## Year_fct2019:Flow:StationCodeRVB -4.458e-04 5.826e-04 -0.765 0.444982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.311 on 232 degrees of freedom
## Multiple R-squared: 0.9237, Adjusted R-squared: 0.9102
## F-statistic: 68.51 on 41 and 232 DF, p-value: < 2.2e-16
df_chla_c2_lag %>%
drop_na(Chla_log, lag1, lag2) %>%
plot_lm_diag(Chla_log, m_lm_flow3_lag2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
shapiro.test(residuals(m_lm_flow3_lag2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_lm_flow3_lag2)
## W = 0.95757, p-value = 3.559e-07
Anova(m_lm_flow3_lag2, type = 3, contrasts = list(topic = contr.sum, sys = contr.sum))
## Anova Table (Type III tests)
##
## Response: Chla_log
## Sum Sq Df F value Pr(>F)
## (Intercept) 8.2491 1 85.3064 < 2.2e-16 ***
## Year_fct 1.5818 4 4.0896 0.0031915 **
## Flow 0.8952 1 9.2578 0.0026149 **
## StationCode 2.6225 3 9.0401 1.100e-05 ***
## lag1 7.7523 1 80.1687 < 2.2e-16 ***
## lag2 0.3826 1 3.9563 0.0478710 *
## Year_fct:Flow 0.8906 4 2.3024 0.0593683 .
## Year_fct:StationCode 4.2199 12 3.6366 5.227e-05 ***
## Flow:StationCode 1.6269 3 5.6082 0.0009948 ***
## Year_fct:Flow:StationCode 1.6343 12 1.4084 0.1627989
## Residuals 22.4343 232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model diagnostics look okay, but not as good as with the GAM
models. Note that the 3-way interaction between Year, Station, and Flow
isn’t significant in the ANOVA table. We’ll use
m_lm_flow3_lag2
in the model selection process.
rm(m_lm_flow3, m_lm_flow3_lag1)
Let’s try a linear model using all two-way interactions between Year, Weekly Average Flow, and Station. Initially, we’ll run the model without accounting for serial autocorrelation.
m_lm_flow2 <- lm(Chla_log ~ (Year_fct + Flow + StationCode)^2, data = df_chla_c2)
Lets look at the model summary and diagnostics:
summary(m_lm_flow2)
##
## Call:
## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2, data = df_chla_c2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02115 -0.21555 -0.02562 0.20756 1.30970
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6720421 0.1037997 93.180 < 2e-16 ***
## Year_fct2016 -0.4032148 0.1425002 -2.830 0.00499 **
## Year_fct2017 -0.3833291 0.1386668 -2.764 0.00607 **
## Year_fct2018 -0.2825822 0.1338134 -2.112 0.03557 *
## Year_fct2019 -0.0926685 0.1338116 -0.693 0.48917
## Flow -0.0015353 0.0002749 -5.585 5.43e-08 ***
## StationCodeSTTD -1.0366460 0.1394711 -7.433 1.24e-12 ***
## StationCodeLIB -2.1586815 0.3294116 -6.553 2.62e-10 ***
## StationCodeRVB -2.9189896 0.5835889 -5.002 9.93e-07 ***
## Year_fct2016:Flow -0.0003420 0.0001541 -2.220 0.02721 *
## Year_fct2017:Flow -0.0002571 0.0001435 -1.791 0.07431 .
## Year_fct2018:Flow -0.0001841 0.0001438 -1.280 0.20143
## Year_fct2019:Flow -0.0001688 0.0001439 -1.173 0.24161
## Year_fct2016:StationCodeSTTD 0.8302909 0.1985714 4.181 3.86e-05 ***
## Year_fct2017:StationCodeSTTD 0.4036756 0.2070786 1.949 0.05223 .
## Year_fct2018:StationCodeSTTD -0.3665744 0.1925111 -1.904 0.05789 .
## Year_fct2019:StationCodeSTTD -1.0430308 0.1897975 -5.495 8.63e-08 ***
## Year_fct2016:StationCodeLIB 0.9218534 0.3150367 2.926 0.00371 **
## Year_fct2017:StationCodeLIB 0.2307078 0.2962035 0.779 0.43669
## Year_fct2018:StationCodeLIB -1.8797945 0.3149611 -5.968 7.08e-09 ***
## Year_fct2019:StationCodeLIB -0.5044002 0.3185041 -1.584 0.11438
## Year_fct2016:StationCodeRVB 2.5842980 0.8235230 3.138 0.00188 **
## Year_fct2017:StationCodeRVB 1.4826319 0.7325138 2.024 0.04390 *
## Year_fct2018:StationCodeRVB 0.6227655 0.6661707 0.935 0.35066
## Year_fct2019:StationCodeRVB -0.1434418 0.6796939 -0.211 0.83301
## Flow:StationCodeSTTD 0.0035789 0.0003591 9.966 < 2e-16 ***
## Flow:StationCodeLIB 0.0014129 0.0003025 4.670 4.64e-06 ***
## Flow:StationCodeRVB 0.0017470 0.0002547 6.860 4.26e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3849 on 286 degrees of freedom
## Multiple R-squared: 0.8732, Adjusted R-squared: 0.8612
## F-statistic: 72.93 on 27 and 286 DF, p-value: < 2.2e-16
df_chla_c2 %>% plot_lm_diag(Chla_log, m_lm_flow2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
shapiro.test(residuals(m_lm_flow2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_lm_flow2)
## W = 0.97497, p-value = 2.752e-05
acf(residuals(m_lm_flow2))
Box.test(residuals(m_lm_flow2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_flow2)
## X-squared = 214.85, df = 20, p-value < 2.2e-16
The residuals deviate from a normal distribution according to visual inspection and the Shapiro-Wilk normality test. Also, model definitely has residual autocorrelation as indicated by the ACF plot and the Box-Ljung test.
Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.
m_lm_flow2_lag1 <- df_chla_c2_lag %>%
drop_na(Chla_log, lag1) %>%
lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1, data = .)
acf(residuals(m_lm_flow2_lag1))
Box.test(residuals(m_lm_flow2_lag1), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_flow2_lag1)
## X-squared = 33.835, df = 20, p-value = 0.02727
m_lm_flow2_lag2 <- df_chla_c2_lag %>%
drop_na(Chla_log, lag1, lag2) %>%
lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2, data = .)
acf(residuals(m_lm_flow2_lag2))
Box.test(residuals(m_lm_flow2_lag2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_flow2_lag2)
## X-squared = 16.487, df = 20, p-value = 0.686
The model with 2 lag terms seems to be okay in terms of serial autocorrelation. Let’s use AIC to see how they compare.
AIC(m_lm_flow2, m_lm_flow2_lag1, m_lm_flow2_lag2)
## df AIC
## m_lm_flow2 29 320.2073
## m_lm_flow2_lag1 30 179.7601
## m_lm_flow2_lag2 31 173.1492
Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.
summary(m_lm_flow2_lag2)
##
## Call:
## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 +
## lag2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24138 -0.14868 -0.01613 0.13053 0.87049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5102528 0.5172489 8.720 4.39e-16 ***
## Year_fct2016 -0.2225024 0.1268556 -1.754 0.080689 .
## Year_fct2017 -0.2049353 0.1228178 -1.669 0.096477 .
## Year_fct2018 -0.1265295 0.1180393 -1.072 0.284812
## Year_fct2019 0.0042951 0.1172864 0.037 0.970817
## Flow -0.0007791 0.0002437 -3.197 0.001573 **
## StationCodeSTTD -0.4510406 0.1311688 -3.439 0.000687 ***
## StationCodeLIB -0.8412407 0.3153205 -2.668 0.008144 **
## StationCodeRVB -1.4624572 0.5715023 -2.559 0.011103 *
## lag1 0.6683254 0.0642124 10.408 < 2e-16 ***
## lag2 -0.1363594 0.0621237 -2.195 0.029109 *
## Year_fct2016:Flow -0.0002104 0.0001502 -1.400 0.162647
## Year_fct2017:Flow -0.0001343 0.0001373 -0.978 0.328924
## Year_fct2018:Flow -0.0001202 0.0001369 -0.878 0.380766
## Year_fct2019:Flow -0.0001003 0.0001367 -0.733 0.464045
## Year_fct2016:StationCodeSTTD 0.4173963 0.1782275 2.342 0.019989 *
## Year_fct2017:StationCodeSTTD 0.1784208 0.1852472 0.963 0.336426
## Year_fct2018:StationCodeSTTD -0.1664026 0.1689473 -0.985 0.325630
## Year_fct2019:StationCodeSTTD -0.5590542 0.1729367 -3.233 0.001395 **
## Year_fct2016:StationCodeLIB 0.3796717 0.2958825 1.283 0.200645
## Year_fct2017:StationCodeLIB 0.0333266 0.2679057 0.124 0.901104
## Year_fct2018:StationCodeLIB -1.1124628 0.2993880 -3.716 0.000251 ***
## Year_fct2019:StationCodeLIB -0.3875763 0.2907185 -1.333 0.183720
## Year_fct2016:StationCodeRVB 1.5642012 0.8259346 1.894 0.059427 .
## Year_fct2017:StationCodeRVB 0.6738241 0.6836514 0.986 0.325293
## Year_fct2018:StationCodeRVB 0.4731218 0.6102908 0.775 0.438949
## Year_fct2019:StationCodeRVB -0.0331205 0.6185916 -0.054 0.957344
## Flow:StationCodeSTTD 0.0017758 0.0003334 5.327 2.27e-07 ***
## Flow:StationCodeLIB 0.0008313 0.0002653 3.133 0.001941 **
## Flow:StationCodeRVB 0.0009021 0.0002239 4.030 7.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3141 on 244 degrees of freedom
## Multiple R-squared: 0.9182, Adjusted R-squared: 0.9084
## F-statistic: 94.38 on 29 and 244 DF, p-value: < 2.2e-16
df_chla_c2_lag %>%
drop_na(Chla_log, lag1, lag2) %>%
plot_lm_diag(Chla_log, m_lm_flow2_lag2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
shapiro.test(residuals(m_lm_flow2_lag2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_lm_flow2_lag2)
## W = 0.95878, p-value = 5.036e-07
Anova(m_lm_flow2_lag2, type = 3, contrasts = list(topic = contr.sum, sys = contr.sum))
## Anova Table (Type III tests)
##
## Response: Chla_log
## Sum Sq Df F value Pr(>F)
## (Intercept) 7.5000 1 76.0331 4.387e-16 ***
## Year_fct 0.6129 4 1.5533 0.187469
## Flow 1.0081 1 10.2202 0.001573 **
## StationCode 2.2025 3 7.4427 8.661e-05 ***
## lag1 10.6856 1 108.3274 < 2.2e-16 ***
## lag2 0.4752 1 4.8179 0.029109 *
## Year_fct:Flow 0.2845 4 0.7209 0.578345
## Year_fct:StationCode 5.2495 12 4.4349 2.004e-06 ***
## Flow:StationCode 2.8070 3 9.4856 5.971e-06 ***
## Residuals 24.0685 244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model diagnostics look somewhat worse than those for the 3-way
interaction model. Note that the 2-way interaction between Year and Flow
isn’t significant. We’ll use m_lm_flow2_lag2
in the model
selection process.
rm(m_lm_flow2, m_lm_flow2_lag1)
We’ll try running a linear model using a two-way interaction between Year and Station but not including flow as a predictor. Initially, we’ll run the model without accounting for serial autocorrelation.
m_lm_cat2 <- lm(Chla_log ~ Year_fct * StationCode, data = df_chla_c2)
Lets look at the model summary and diagnostics:
summary(m_lm_cat2)
##
## Call:
## lm(formula = Chla_log ~ Year_fct * StationCode, data = df_chla_c2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35952 -0.25359 -0.05221 0.26722 1.31786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.45473 0.11172 84.627 < 2e-16 ***
## Year_fct2016 -0.40045 0.16354 -2.449 0.014927 *
## Year_fct2017 -0.19510 0.15566 -1.253 0.211051
## Year_fct2018 -0.26862 0.15355 -1.749 0.081263 .
## Year_fct2019 -0.11515 0.15355 -0.750 0.453881
## StationCodeSTTD -0.75644 0.15566 -4.860 1.92e-06 ***
## StationCodeLIB -1.74964 0.14989 -11.673 < 2e-16 ***
## StationCodeRVB -1.80959 0.14989 -12.073 < 2e-16 ***
## Year_fct2016:StationCodeSTTD 0.84307 0.22969 3.670 0.000287 ***
## Year_fct2017:StationCodeSTTD 0.10665 0.23653 0.451 0.652400
## Year_fct2018:StationCodeSTTD -0.35014 0.22269 -1.572 0.116942
## Year_fct2019:StationCodeSTTD -0.88642 0.21865 -4.054 6.45e-05 ***
## Year_fct2016:StationCodeLIB 1.38154 0.22018 6.275 1.25e-09 ***
## Year_fct2017:StationCodeLIB 0.21272 0.21439 0.992 0.321898
## Year_fct2018:StationCodeLIB -1.88368 0.23137 -8.141 1.12e-14 ***
## Year_fct2019:StationCodeLIB -0.46150 0.24192 -1.908 0.057411 .
## Year_fct2016:StationCodeRVB 0.60290 0.22371 2.695 0.007444 **
## Year_fct2017:StationCodeRVB -0.07752 0.21439 -0.362 0.717928
## Year_fct2018:StationCodeRVB -0.06960 0.21132 -0.329 0.742124
## Year_fct2019:StationCodeRVB -0.65957 0.21132 -3.121 0.001980 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4469 on 294 degrees of freedom
## Multiple R-squared: 0.8243, Adjusted R-squared: 0.8129
## F-statistic: 72.59 on 19 and 294 DF, p-value: < 2.2e-16
df_chla_c2 %>% plot_lm_diag(Chla_log, m_lm_cat2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
shapiro.test(residuals(m_lm_cat2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_lm_cat2)
## W = 0.98856, p-value = 0.01425
acf(residuals(m_lm_cat2))
Box.test(residuals(m_lm_cat2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_cat2)
## X-squared = 331.73, df = 20, p-value < 2.2e-16
Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look pretty good. However, the residuals are autocorrelated.
Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.
m_lm_cat2_lag1 <- df_chla_c2_lag %>%
drop_na(Chla_log, lag1) %>%
lm(Chla_log ~ Year_fct * StationCode + lag1, data = .)
acf(residuals(m_lm_cat2_lag1))
Box.test(residuals(m_lm_cat2_lag1), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_cat2_lag1)
## X-squared = 29.858, df = 20, p-value = 0.07219
m_lm_cat2_lag2 <- df_chla_c2_lag %>%
drop_na(Chla_log, lag1, lag2) %>%
lm(Chla_log ~ Year_fct * StationCode + lag1 + lag2, data = .)
acf(residuals(m_lm_cat2_lag2))
Box.test(residuals(m_lm_cat2_lag2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_lm_cat2_lag2)
## X-squared = 16.063, df = 20, p-value = 0.7127
The model with 2 lag terms seems to be okay in terms of serial autocorrelation. Let’s use AIC to see how they compare.
AIC(m_lm_cat2, m_lm_cat2_lag1, m_lm_cat2_lag2)
## df AIC
## m_lm_cat2 21 406.6058
## m_lm_cat2_lag1 22 202.1676
## m_lm_cat2_lag2 23 189.5969
Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.
summary(m_lm_cat2_lag2)
##
## Call:
## lm(formula = Chla_log ~ Year_fct * StationCode + lag1 + lag2,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32999 -0.16419 -0.02045 0.15089 0.85556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.459253 0.493928 7.004 2.26e-11 ***
## Year_fct2016 -0.182532 0.129908 -1.405 0.161226
## Year_fct2017 -0.078006 0.122065 -0.639 0.523366
## Year_fct2018 -0.090019 0.120957 -0.744 0.457437
## Year_fct2019 0.002086 0.120402 0.017 0.986191
## StationCodeSTTD -0.211284 0.127028 -1.663 0.097499 .
## StationCodeLIB -0.625336 0.146282 -4.275 2.72e-05 ***
## StationCodeRVB -0.644967 0.149378 -4.318 2.27e-05 ***
## lag1 0.806179 0.061397 13.131 < 2e-16 ***
## lag2 -0.175878 0.063486 -2.770 0.006017 **
## Year_fct2016:StationCodeSTTD 0.336363 0.184635 1.822 0.069675 .
## Year_fct2017:StationCodeSTTD -0.015832 0.188606 -0.084 0.933169
## Year_fct2018:StationCodeSTTD -0.128766 0.175323 -0.734 0.463356
## Year_fct2019:StationCodeSTTD -0.393012 0.176281 -2.229 0.026665 *
## Year_fct2016:StationCodeLIB 0.535094 0.185883 2.879 0.004337 **
## Year_fct2017:StationCodeLIB 0.064763 0.167490 0.387 0.699326
## Year_fct2018:StationCodeLIB -0.811907 0.205745 -3.946 0.000103 ***
## Year_fct2019:StationCodeLIB -0.214839 0.196894 -1.091 0.276254
## Year_fct2016:StationCodeRVB 0.233510 0.177836 1.313 0.190356
## Year_fct2017:StationCodeRVB -0.061986 0.167397 -0.370 0.711475
## Year_fct2018:StationCodeRVB -0.025504 0.164606 -0.155 0.876994
## Year_fct2019:StationCodeRVB -0.285560 0.167246 -1.707 0.088974 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3279 on 252 degrees of freedom
## Multiple R-squared: 0.9079, Adjusted R-squared: 0.9002
## F-statistic: 118.2 on 21 and 252 DF, p-value: < 2.2e-16
df_chla_c2_lag %>%
drop_na(Chla_log, lag1, lag2) %>%
plot_lm_diag(Chla_log, m_lm_cat2_lag2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
shapiro.test(residuals(m_lm_cat2_lag2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_lm_cat2_lag2)
## W = 0.94723, p-value = 2.28e-08
Anova(m_lm_cat2_lag2, type = 3, contrasts = list(topic = contr.sum, sys = contr.sum))
## Anova Table (Type III tests)
##
## Response: Chla_log
## Sum Sq Df F value Pr(>F)
## (Intercept) 5.2737 1 49.0497 2.262e-11 ***
## Year_fct 0.3050 4 0.7092 0.5862914
## StationCode 2.5688 3 7.9641 4.295e-05 ***
## lag1 18.5375 1 172.4135 < 2.2e-16 ***
## lag2 0.8252 1 7.6747 0.0060171 **
## Year_fct:StationCode 3.8050 12 2.9492 0.0007352 ***
## Residuals 27.0944 252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model diagnostics don’t look that great. However, we’ll use
m_lm_cat2_lag2
in the model selection process.
rm(m_lm_cat2, m_lm_cat2_lag1)
Finally, we’ll try running a GAM model using smooths for weekly average flow by Station and Year and a smooth term for week number to account for seasonality. Initially, we’ll run the model without accounting for serial autocorrelation.
m_gam_sflow <- gam(
Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + Year_fct * StationCode + s(Week, bs = "cc", k = 5),
data = df_chla_c2,
method = "REML",
knots = list(week = c(0, 52))
)
Lets look at the model summary and diagnostics:
summary(m_gam_sflow)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) +
## Year_fct * StationCode + s(Week, bs = "cc", k = 5)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.9415 0.4617 15.035 < 2e-16 ***
## Year_fct2016 -0.8407 0.3009 -2.794 0.005564 **
## Year_fct2017 -0.5569 0.2631 -2.117 0.035152 *
## Year_fct2018 -0.4401 0.2562 -1.718 0.086918 .
## Year_fct2019 -0.2726 0.2859 -0.953 0.341160
## StationCodeSTTD -8.1285 37.3586 -0.218 0.827913
## StationCodeLIB 0.4849 0.6923 0.700 0.484250
## StationCodeRVB 0.5447 0.6180 0.881 0.378860
## Year_fct2016:StationCodeSTTD 0.8281 0.1736 4.771 2.95e-06 ***
## Year_fct2017:StationCodeSTTD 0.3340 0.1816 1.839 0.066926 .
## Year_fct2018:StationCodeSTTD -0.2712 0.1689 -1.606 0.109483
## Year_fct2019:StationCodeSTTD -0.9023 0.1695 -5.324 2.08e-07 ***
## Year_fct2016:StationCodeLIB 1.3519 0.3581 3.776 0.000195 ***
## Year_fct2017:StationCodeLIB 0.4174 0.2613 1.598 0.111254
## Year_fct2018:StationCodeLIB -1.6855 0.2793 -6.034 5.05e-09 ***
## Year_fct2019:StationCodeLIB -0.4878 0.3338 -1.461 0.145045
## Year_fct2016:StationCodeRVB 1.4699 0.9618 1.528 0.127582
## Year_fct2017:StationCodeRVB 0.5908 0.6516 0.907 0.365409
## Year_fct2018:StationCodeRVB 0.2254 0.5915 0.381 0.703400
## Year_fct2019:StationCodeRVB -0.2519 0.8749 -0.288 0.773628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Flow):StationCodeRD22 1.000 1.000 0.079 0.779
## s(Flow):StationCodeSTTD 1.120 1.315 0.375 0.691
## s(Flow):StationCodeLIB 1.000 1.000 0.052 0.821
## s(Flow):StationCodeRVB 1.000 1.000 0.049 0.825
## s(Flow):Year_fct2015 1.000 1.000 0.050 0.823
## s(Flow):Year_fct2016 1.309 1.567 0.101 0.834
## s(Flow):Year_fct2017 1.000 1.000 0.049 0.826
## s(Flow):Year_fct2018 1.000 1.000 0.049 0.825
## s(Flow):Year_fct2019 2.159 2.768 0.276 0.767
## s(Week) 2.719 3.000 25.774 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 103/104
## R-sq.(adj) = 0.895 Deviance explained = 90.6%
## -REML = 127.47 Scale est. = 0.11201 n = 314
appraise(m_gam_sflow)
shapiro.test(residuals(m_gam_sflow))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_sflow)
## W = 0.98185, p-value = 0.0005262
k.check(m_gam_sflow)
## k' edf k-index p-value
## s(Flow):StationCodeRD22 9 1.000011 1.0644839 0.8800
## s(Flow):StationCodeSTTD 9 1.120242 1.0644839 0.8675
## s(Flow):StationCodeLIB 9 1.000027 1.0644839 0.8675
## s(Flow):StationCodeRVB 9 1.000030 1.0644839 0.8925
## s(Flow):Year_fct2015 9 1.000027 1.0644839 0.8250
## s(Flow):Year_fct2016 9 1.309135 1.0644839 0.8525
## s(Flow):Year_fct2017 9 1.000037 1.0644839 0.8575
## s(Flow):Year_fct2018 9 1.000033 1.0644839 0.8600
## s(Flow):Year_fct2019 9 2.159088 1.0644839 0.8675
## s(Week) 3 2.719405 0.9426987 0.1700
concurvity(m_gam_sflow, full = FALSE)$worst
## para s(Flow):StationCodeRD22
## para 1.000000e+00 2.643357e-01
## s(Flow):StationCodeRD22 2.643355e-01 1.000000e+00
## s(Flow):StationCodeSTTD 2.261153e-01 2.168304e-06
## s(Flow):StationCodeLIB 2.292994e-01 6.512119e-07
## s(Flow):StationCodeRVB 2.802548e-01 6.855557e-06
## s(Flow):Year_fct2015 2.324841e-01 3.248981e-01
## s(Flow):Year_fct2016 1.907664e-01 2.857509e-01
## s(Flow):Year_fct2017 1.940114e-01 2.319783e-01
## s(Flow):Year_fct2018 1.910646e-01 2.226717e-01
## s(Flow):Year_fct2019 1.880875e-01 3.424910e-01
## s(Week) 2.016117e-31 1.574039e-01
## s(Flow):StationCodeSTTD s(Flow):StationCodeLIB
## para 2.261152e-01 2.292994e-01
## s(Flow):StationCodeRD22 1.559760e-06 5.495882e-07
## s(Flow):StationCodeSTTD 1.000000e+00 1.640025e-07
## s(Flow):StationCodeLIB 1.868875e-07 1.000000e+00
## s(Flow):StationCodeRVB 2.987217e-07 8.290068e-12
## s(Flow):Year_fct2015 2.212644e-01 9.174394e-01
## s(Flow):Year_fct2016 1.576949e-01 5.098824e-01
## s(Flow):Year_fct2017 1.064741e-01 3.817659e-01
## s(Flow):Year_fct2018 1.427419e-01 3.281524e-01
## s(Flow):Year_fct2019 3.311852e-01 2.162509e-01
## s(Week) 5.820713e-02 6.532370e-02
## s(Flow):StationCodeRVB s(Flow):Year_fct2015
## para 2.802548e-01 2.324841e-01
## s(Flow):StationCodeRD22 5.589096e-06 3.248782e-01
## s(Flow):StationCodeSTTD 2.393704e-07 2.212752e-01
## s(Flow):StationCodeLIB 1.131060e-12 9.174394e-01
## s(Flow):StationCodeRVB 1.000000e+00 9.295169e-01
## s(Flow):Year_fct2015 9.295169e-01 1.000000e+00
## s(Flow):Year_fct2016 4.867459e-01 2.029570e-20
## s(Flow):Year_fct2017 9.999383e-01 2.318509e-21
## s(Flow):Year_fct2018 6.431556e-01 1.766783e-20
## s(Flow):Year_fct2019 4.786662e-01 2.002231e-20
## s(Week) 1.477669e-01 1.014688e-01
## s(Flow):Year_fct2016 s(Flow):Year_fct2017
## para 1.907664e-01 1.940114e-01
## s(Flow):StationCodeRD22 2.857273e-01 2.319049e-01
## s(Flow):StationCodeSTTD 1.576851e-01 1.064663e-01
## s(Flow):StationCodeLIB 5.098824e-01 3.817659e-01
## s(Flow):StationCodeRVB 4.867459e-01 9.999383e-01
## s(Flow):Year_fct2015 1.272877e-20 1.803852e-21
## s(Flow):Year_fct2016 1.000000e+00 6.705339e-26
## s(Flow):Year_fct2017 6.212374e-26 1.000000e+00
## s(Flow):Year_fct2018 1.806426e-25 2.485377e-25
## s(Flow):Year_fct2019 4.870358e-26 4.060936e-26
## s(Week) 1.081710e-01 6.038884e-02
## s(Flow):Year_fct2018 s(Flow):Year_fct2019 s(Week)
## para 1.910646e-01 1.880875e-01 2.183715e-31
## s(Flow):StationCodeRD22 2.226813e-01 3.424794e-01 1.574335e-01
## s(Flow):StationCodeSTTD 1.427468e-01 3.311809e-01 5.820459e-02
## s(Flow):StationCodeLIB 3.281525e-01 2.162509e-01 6.532372e-02
## s(Flow):StationCodeRVB 6.431556e-01 4.786662e-01 1.477669e-01
## s(Flow):Year_fct2015 1.311535e-20 1.908661e-20 1.014688e-01
## s(Flow):Year_fct2016 1.666665e-25 6.434476e-26 1.081710e-01
## s(Flow):Year_fct2017 2.663491e-25 3.679162e-26 6.038884e-02
## s(Flow):Year_fct2018 1.000000e+00 8.502842e-26 7.451261e-02
## s(Flow):Year_fct2019 2.091614e-25 1.000000e+00 7.697321e-02
## s(Week) 7.451261e-02 7.697321e-02 1.000000e+00
draw(m_gam_sflow, select = 10, residuals = TRUE, rug = FALSE)
plot(m_gam_sflow, pages = 1, all.terms = TRUE)
acf(residuals(m_gam_sflow))
Box.test(residuals(m_gam_sflow), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_sflow)
## X-squared = 139.11, df = 20, p-value < 2.2e-16
The diagnostic plots look really good. However, the residuals are autocorrelated.
Now, we’ll try to deal with the residual autocorrelation. We’ll run a series of models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.
m_gam_sflow_lag1 <- gam(
Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1,
data = df_chla_c2_lag,
method = "REML",
knots = list(week = c(0, 52))
)
acf(residuals(m_gam_sflow_lag1))
Box.test(residuals(m_gam_sflow_lag1), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_sflow_lag1)
## X-squared = 31.058, df = 20, p-value = 0.05443
m_gam_sflow_lag2 <- gam(
Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 + lag2,
data = df_chla_c2_lag,
method = "REML",
knots = list(week = c(0, 52))
)
acf(residuals(m_gam_sflow_lag2))
Box.test(residuals(m_gam_sflow_lag2), lag = 20, type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: residuals(m_gam_sflow_lag2)
## X-squared = 20.838, df = 20, p-value = 0.4067
The model with 2 lag terms seems to be okay in terms of serial autocorrelation. Let’s use AIC to see how they compare.
AIC(m_gam_sflow, m_gam_sflow_lag1, m_gam_sflow_lag2)
## df AIC
## m_gam_sflow 35.59213 239.6761
## m_gam_sflow_lag1 35.54780 146.3048
## m_gam_sflow_lag2 35.57415 141.1547
It looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.
summary(m_gam_sflow_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) +
## Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 +
## lag2
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.78540 0.62649 6.042 5.75e-09 ***
## Year_fct2016 -0.59159 0.31742 -1.864 0.06358 .
## Year_fct2017 -0.33673 0.29720 -1.133 0.25834
## Year_fct2018 -0.27953 0.29071 -0.962 0.33725
## Year_fct2019 -0.05803 0.29006 -0.200 0.84160
## StationCodeSTTD -9.99171 22.15069 -0.451 0.65234
## StationCodeLIB 0.49217 0.67725 0.727 0.46811
## StationCodeRVB 0.55317 0.64109 0.863 0.38907
## lag1 0.56833 0.06260 9.078 < 2e-16 ***
## lag2 -0.13552 0.05828 -2.325 0.02089 *
## Year_fct2016:StationCodeSTTD 0.52390 0.16786 3.121 0.00202 **
## Year_fct2017:StationCodeSTTD 0.23333 0.17447 1.337 0.18237
## Year_fct2018:StationCodeSTTD -0.10863 0.15940 -0.681 0.49622
## Year_fct2019:StationCodeSTTD -0.52711 0.16512 -3.192 0.00160 **
## Year_fct2016:StationCodeLIB 0.83072 0.32873 2.527 0.01214 *
## Year_fct2017:StationCodeLIB 0.35730 0.29392 1.216 0.22532
## Year_fct2018:StationCodeLIB -0.96976 0.31585 -3.070 0.00238 **
## Year_fct2019:StationCodeLIB -0.14044 0.31112 -0.451 0.65210
## Year_fct2016:StationCodeRVB 0.99033 0.82897 1.195 0.23340
## Year_fct2017:StationCodeRVB 0.08019 0.68487 0.117 0.90689
## Year_fct2018:StationCodeRVB 0.04818 0.61823 0.078 0.93795
## Year_fct2019:StationCodeRVB -0.67188 0.62938 -1.068 0.28680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Flow):StationCodeRD22 1.0000 1.000 0.003 0.953
## s(Flow):StationCodeSTTD 0.9546 1.149 0.440 0.640
## s(Flow):StationCodeLIB 1.0000 1.000 0.000 0.999
## s(Flow):StationCodeRVB 1.0000 1.000 0.000 0.998
## s(Flow):Year_fct2015 1.3234 1.564 0.065 0.926
## s(Flow):Year_fct2016 1.0095 1.019 0.000 0.999
## s(Flow):Year_fct2017 1.0000 1.000 0.000 0.998
## s(Flow):Year_fct2018 1.0000 1.000 0.000 0.998
## s(Flow):Year_fct2019 1.0001 1.000 0.000 1.000
## s(Week) 2.5102 3.000 10.067 2.17e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 105/106
## R-sq.(adj) = 0.92 Deviance explained = 93%
## -REML = 80.961 Scale est. = 0.086231 n = 274
appraise(m_gam_sflow_lag2)
shapiro.test(residuals(m_gam_sflow_lag2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_sflow_lag2)
## W = 0.96744, p-value = 7.155e-06
k.check(m_gam_sflow_lag2)
## k' edf k-index p-value
## s(Flow):StationCodeRD22 9 1.0000160 1.1504105 0.9950
## s(Flow):StationCodeSTTD 9 0.9545833 1.1504105 0.9850
## s(Flow):StationCodeLIB 9 1.0000160 1.1504105 0.9875
## s(Flow):StationCodeRVB 9 1.0000142 1.1504105 0.9950
## s(Flow):Year_fct2015 9 1.3234330 1.1504105 0.9925
## s(Flow):Year_fct2016 9 1.0095499 1.1504105 0.9925
## s(Flow):Year_fct2017 9 1.0000069 1.1504105 0.9975
## s(Flow):Year_fct2018 9 1.0000193 1.1504105 0.9950
## s(Flow):Year_fct2019 9 1.0001430 1.1504105 0.9950
## s(Week) 3 2.5102477 0.9407824 0.1375
concurvity(m_gam_sflow_lag2, full = FALSE)$worst
## para s(Flow):StationCodeRD22
## para 1.000000e+00 2.664237e-01
## s(Flow):StationCodeRD22 2.664234e-01 1.000000e+00
## s(Flow):StationCodeSTTD 2.226279e-01 1.972005e-06
## s(Flow):StationCodeLIB 2.262774e-01 4.363332e-07
## s(Flow):StationCodeRVB 2.846715e-01 2.345258e-07
## s(Flow):Year_fct2015 2.372263e-01 3.305668e-01
## s(Flow):Year_fct2016 1.896821e-01 3.062400e-01
## s(Flow):Year_fct2017 1.932414e-01 2.557964e-01
## s(Flow):Year_fct2018 1.897747e-01 2.485280e-01
## s(Flow):Year_fct2019 1.885812e-01 3.588214e-01
## s(Week) 2.319438e-31 1.678239e-01
## s(Flow):StationCodeSTTD s(Flow):StationCodeLIB
## para 2.226279e-01 2.262774e-01
## s(Flow):StationCodeRD22 3.943426e-07 1.849880e-07
## s(Flow):StationCodeSTTD 1.000000e+00 4.597192e-08
## s(Flow):StationCodeLIB 9.044911e-08 1.000000e+00
## s(Flow):StationCodeRVB 2.039630e-07 3.049802e-12
## s(Flow):Year_fct2015 2.169149e-01 9.582905e-01
## s(Flow):Year_fct2016 1.409583e-01 5.240970e-01
## s(Flow):Year_fct2017 9.560632e-02 4.111982e-01
## s(Flow):Year_fct2018 1.610334e-01 4.775919e-01
## s(Flow):Year_fct2019 3.478116e-01 1.863801e-01
## s(Week) 8.184854e-02 8.951237e-02
## s(Flow):StationCodeRVB s(Flow):Year_fct2015
## para 2.846715e-01 2.372263e-01
## s(Flow):StationCodeRD22 1.000659e-07 3.305499e-01
## s(Flow):StationCodeSTTD 8.347470e-08 2.169225e-01
## s(Flow):StationCodeLIB 1.536946e-12 9.582905e-01
## s(Flow):StationCodeRVB 1.000000e+00 9.204540e-01
## s(Flow):Year_fct2015 9.204540e-01 1.000000e+00
## s(Flow):Year_fct2016 4.552008e-01 4.285171e-21
## s(Flow):Year_fct2017 9.999652e-01 1.040580e-21
## s(Flow):Year_fct2018 6.738619e-01 1.666663e-21
## s(Flow):Year_fct2019 5.049372e-01 3.001059e-21
## s(Week) 1.440000e-01 1.209400e-01
## s(Flow):Year_fct2016 s(Flow):Year_fct2017
## para 1.896821e-01 1.932414e-01
## s(Flow):StationCodeRD22 3.062229e-01 2.557911e-01
## s(Flow):StationCodeSTTD 1.409587e-01 9.560097e-02
## s(Flow):StationCodeLIB 5.240970e-01 4.111985e-01
## s(Flow):StationCodeRVB 4.552008e-01 9.999652e-01
## s(Flow):Year_fct2015 1.500348e-21 9.988296e-22
## s(Flow):Year_fct2016 1.000000e+00 5.444336e-25
## s(Flow):Year_fct2017 4.175246e-25 1.000000e+00
## s(Flow):Year_fct2018 4.100016e-25 7.496398e-26
## s(Flow):Year_fct2019 2.719358e-25 2.189423e-26
## s(Week) 1.207275e-01 6.868218e-02
## s(Flow):Year_fct2018 s(Flow):Year_fct2019 s(Week)
## para 1.897747e-01 1.885812e-01 3.616743e-32
## s(Flow):StationCodeRD22 2.485501e-01 3.588190e-01 1.678192e-01
## s(Flow):StationCodeSTTD 1.610323e-01 3.478052e-01 8.184793e-02
## s(Flow):StationCodeLIB 4.775918e-01 1.863800e-01 8.951239e-02
## s(Flow):StationCodeRVB 6.738619e-01 5.049372e-01 1.440000e-01
## s(Flow):Year_fct2015 1.694295e-21 2.390613e-21 1.209400e-01
## s(Flow):Year_fct2016 2.850699e-25 3.735339e-25 1.207275e-01
## s(Flow):Year_fct2017 1.874783e-25 3.013873e-26 6.868218e-02
## s(Flow):Year_fct2018 1.000000e+00 9.162466e-26 8.522678e-02
## s(Flow):Year_fct2019 7.298452e-26 1.000000e+00 9.372902e-02
## s(Week) 8.522678e-02 9.372902e-02 1.000000e+00
draw(m_gam_sflow_lag2, select = 1:4, residuals = TRUE, rug = FALSE)
draw(m_gam_sflow_lag2, select = 5:9, residuals = TRUE, rug = FALSE)
draw(m_gam_sflow_lag2, select = 10, residuals = TRUE, rug = FALSE)
anova(m_gam_sflow_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) +
## Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 +
## lag2
##
## Parametric Terms:
## df F p-value
## Year_fct 4 2.543 0.0403
## StationCode 3 0.406 0.7490
## lag1 1 82.411 < 2e-16
## lag2 1 5.407 0.0209
## Year_fct:StationCode 12 5.961 4.38e-09
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Flow):StationCodeRD22 1.0000 1.0000 0.003 0.953
## s(Flow):StationCodeSTTD 0.9546 1.1487 0.440 0.640
## s(Flow):StationCodeLIB 1.0000 1.0000 0.000 0.999
## s(Flow):StationCodeRVB 1.0000 1.0000 0.000 0.998
## s(Flow):Year_fct2015 1.3234 1.5644 0.065 0.926
## s(Flow):Year_fct2016 1.0095 1.0189 0.000 0.999
## s(Flow):Year_fct2017 1.0000 1.0000 0.000 0.998
## s(Flow):Year_fct2018 1.0000 1.0000 0.000 0.998
## s(Flow):Year_fct2019 1.0001 1.0003 0.000 1.000
## s(Week) 2.5102 3.0000 10.067 2.17e-07
The model diagnostics look a little worse than with the initial model
but they still look pretty good. Note that the approximate significance
of all smooth terms are greater than 0.05 except for the s(Week) term.
We’ll use m_gam_sflow_lag2
in the model selection
process.
rm(m_gam_sflow, m_gam_sflow_lag1)
Now we’ll compare the seven candidate models with AIC to select the one with the best fit. As a summary, here are the 7 models we are comparing:
m_gam_flow3_lag2
- GAM 3-way interactions
with s(Week) m_gam_flow2_lag2
- GAM 2-way interactions
with s(Week) m_gam_cat2_lag2
- GAM 2-way interaction
between Station and Year with s(Week) but without Flow m_lm_flow3_lag2
- LM 3-way interactions
without seasonal term m_lm_flow2_lag2
- LM 2-way interactions
without seasonal term m_lm_cat2_lag2
- LM 2-way interaction between
Station and Year but without Flow and seasonal term m_gam_sflow_lag2
- GAM using smooths for Flow
with s(Week) # AIC values
df_m_aic <-
AIC(
m_gam_flow3_lag2,
m_gam_flow2_lag2,
m_gam_cat2_lag2,
m_lm_flow3_lag2,
m_lm_flow2_lag2,
m_lm_cat2_lag2,
m_gam_sflow_lag2
) %>%
as_tibble(rownames = "Model")
# BIC values
df_m_bic <-
BIC(
m_gam_flow3_lag2,
m_gam_flow2_lag2,
m_gam_cat2_lag2,
m_lm_flow3_lag2,
m_lm_flow2_lag2,
m_lm_cat2_lag2,
m_gam_sflow_lag2
) %>%
as_tibble(rownames = "Model")
# Combine AIC and BIC and calculate differences from lowest value
df_m_aic_bic <-
left_join(df_m_aic, df_m_bic, by = join_by(Model, df)) %>%
mutate(across(c(AIC, BIC), ~ .x - min(.x), .names = "{.col}_delta")) %>%
select(Model, df, starts_with("AIC"), starts_with("BIC"))
# Sort by AIC
df_m_aic_bic %>% arrange(AIC)
## # A tibble: 7 × 6
## Model df AIC AIC_delta BIC BIC_delta
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m_gam_sflow_lag2 35.6 141. 0 270. 0.697
## 2 m_gam_flow2_lag2 33.8 148. 6.85 270. 1.12
## 3 m_gam_flow3_lag2 45.8 154. 13.2 320. 50.9
## 4 m_lm_flow2_lag2 31 173. 32.0 285. 16.2
## 5 m_gam_cat2_lag2 25.6 176. 35.2 269. 0
## 6 m_lm_flow3_lag2 43 178. 36.7 333. 64.3
## 7 m_lm_cat2_lag2 23 190. 48.4 273. 3.71
According to AIC, model 7 (GAM using smooths for Flow with s(Week)) was the model with the best fit. BIC preferred model 3 (GAM 2-way interaction between Station and Year with s(Week) but without Flow) with model 7 coming in close second place. Before we proceed with model 7, let’s revisit the model diagnostics and take a closer look at how the back-transformed fitted values from the model match the observed values.
appraise(m_gam_sflow_lag2)
shapiro.test(residuals(m_gam_sflow_lag2))
##
## Shapiro-Wilk normality test
##
## data: residuals(m_gam_sflow_lag2)
## W = 0.96744, p-value = 7.155e-06
k.check(m_gam_sflow_lag2)
## k' edf k-index p-value
## s(Flow):StationCodeRD22 9 1.0000160 1.1504105 0.9925
## s(Flow):StationCodeSTTD 9 0.9545833 1.1504105 0.9950
## s(Flow):StationCodeLIB 9 1.0000160 1.1504105 0.9925
## s(Flow):StationCodeRVB 9 1.0000142 1.1504105 0.9875
## s(Flow):Year_fct2015 9 1.3234330 1.1504105 0.9850
## s(Flow):Year_fct2016 9 1.0095499 1.1504105 0.9900
## s(Flow):Year_fct2017 9 1.0000069 1.1504105 0.9925
## s(Flow):Year_fct2018 9 1.0000193 1.1504105 0.9925
## s(Flow):Year_fct2019 9 1.0001430 1.1504105 0.9925
## s(Week) 3 2.5102477 0.9407824 0.1025
draw(m_gam_sflow_lag2, select = 1:4, residuals = TRUE, rug = FALSE)
draw(m_gam_sflow_lag2, select = 5:9, residuals = TRUE, rug = FALSE)
draw(m_gam_sflow_lag2, select = 10, residuals = TRUE, rug = FALSE)
anova(m_gam_sflow_lag2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) +
## Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 +
## lag2
##
## Parametric Terms:
## df F p-value
## Year_fct 4 2.543 0.0403
## StationCode 3 0.406 0.7490
## lag1 1 82.411 < 2e-16
## lag2 1 5.407 0.0209
## Year_fct:StationCode 12 5.961 4.38e-09
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Flow):StationCodeRD22 1.0000 1.0000 0.003 0.953
## s(Flow):StationCodeSTTD 0.9546 1.1487 0.440 0.640
## s(Flow):StationCodeLIB 1.0000 1.0000 0.000 0.999
## s(Flow):StationCodeRVB 1.0000 1.0000 0.000 0.998
## s(Flow):Year_fct2015 1.3234 1.5644 0.065 0.926
## s(Flow):Year_fct2016 1.0095 1.0189 0.000 0.999
## s(Flow):Year_fct2017 1.0000 1.0000 0.000 0.998
## s(Flow):Year_fct2018 1.0000 1.0000 0.000 0.998
## s(Flow):Year_fct2019 1.0001 1.0003 0.000 1.000
## s(Week) 2.5102 3.0000 10.067 2.17e-07
df_m_gam_sflow_lag2_fit <- df_chla_c2_lag %>%
drop_na(lag1, lag2) %>%
fitted_values(m_gam_sflow_lag2, data = .) %>%
mutate(fitted_bt = exp(fitted) / 1000)
plt_m_gam_sflow_lag2_fit <- df_m_gam_sflow_lag2_fit %>%
ggplot(aes(x = fitted_bt, y = Chla)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
theme_bw() +
labs(
x = "Back-transformed Fitted Values",
y = "Observed Values"
)
plt_m_gam_sflow_lag2_fit
Let’s group by station.
plt_m_gam_sflow_lag2_fit + facet_wrap(vars(StationCode), scales = "free")
Now, group by year.
plt_m_gam_sflow_lag2_fit + facet_wrap(vars(Year_fct), scales = "free")
Everything looks pretty decent with this model. Not perfect, but pretty good given the number of data points. Note that variability does increase as the chlorophyll values increase. We’ll select model 7 as our final model for our analysis.
df_m_aic_bic %>%
arrange(AIC) %>%
mutate(across(where(is.numeric), ~ paste0(formatC(.x, digits = 1, format = "f"), "##"))) %>%
write_csv(here("manuscript_synthesis/results/tables/chl_aic_weekly_models.csv"))