Purpose

Explore and analyze the continuous chlorophyll data to be included in the NDFS synthesis manuscript. We will attempt to fit multiple models to predict weekly average chlorophyll fluorescence values. All models will only include representative stations for 4 habitat types - upstream (RD22), lower Yolo Bypass (STTD), Cache Slough complex (LIB), and downstream (RVB). At a minimum, the models will contain the two categorical variables - Year and Station - as predictor variables. In some of the models, we will add weekly average flow as a continuous predictor which replaces the categorical predictor - flow action period - in the original analysis. Additionally, we’ll add a GAM smooth for Week number term to account for seasonality in some of the models. After fitting multiple models, we’ll use a model selection process to determine the best one.

Global code and functions

# Load packages
library(tidyverse)
library(scales)
library(knitr)
library(mgcv)
library(car)
library(gratia)
library(ggeffects)
library(emmeans)
library(multcomp)
library(here)
library(conflicted)

# Source functions
source(here("manuscript_synthesis/src/global_functions.R"))

# Declare package conflict preferences 
conflicts_prefer(dplyr::filter(), dplyr::lag(), dplyr::select())

Display current versions of R and packages used for this analysis:

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.1 (2024-06-14 ucrt)
##  os       Windows 11 x64 (build 26100)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United States.utf8
##  ctype    English_United States.utf8
##  tz       America/Los_Angeles
##  date     2024-12-19
##  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version  date (UTC) lib source
##  abind          1.4-5    2016-07-21 [1] CRAN (R 4.4.0)
##  bslib          0.8.0    2024-07-29 [1] CRAN (R 4.4.1)
##  cachem         1.1.0    2024-05-16 [1] CRAN (R 4.4.1)
##  car          * 3.1-2    2023-03-30 [1] CRAN (R 4.4.0)
##  carData      * 3.0-5    2022-01-06 [1] CRAN (R 4.4.0)
##  cli            3.6.3    2024-06-21 [1] CRAN (R 4.4.1)
##  coda           0.19-4.1 2024-01-31 [1] CRAN (R 4.4.0)
##  codetools      0.2-20   2024-03-31 [2] CRAN (R 4.4.1)
##  colorspace     2.1-1    2024-07-26 [1] CRAN (R 4.4.1)
##  conflicted   * 1.2.0    2023-02-01 [1] CRAN (R 4.4.0)
##  devtools       2.4.5    2022-10-11 [1] CRAN (R 4.4.0)
##  digest         0.6.37   2024-08-19 [1] CRAN (R 4.4.1)
##  dplyr        * 1.1.4    2023-11-17 [1] CRAN (R 4.4.0)
##  ellipsis       0.3.2    2021-04-29 [1] CRAN (R 4.4.0)
##  emmeans      * 1.10.4   2024-08-21 [1] CRAN (R 4.4.1)
##  estimability   1.5.1    2024-05-12 [1] CRAN (R 4.4.1)
##  evaluate       0.24.0   2024-06-10 [1] CRAN (R 4.4.1)
##  fansi          1.0.6    2023-12-08 [1] CRAN (R 4.4.0)
##  farver         2.1.2    2024-05-13 [1] CRAN (R 4.4.1)
##  fastmap        1.2.0    2024-05-15 [1] CRAN (R 4.4.1)
##  forcats      * 1.0.0    2023-01-29 [1] CRAN (R 4.4.0)
##  fs           * 1.6.4    2024-04-25 [1] CRAN (R 4.4.0)
##  generics       0.1.3    2022-07-05 [1] CRAN (R 4.4.0)
##  ggeffects    * 1.7.1    2024-09-01 [1] CRAN (R 4.4.1)
##  ggokabeito     0.1.0    2021-10-18 [1] CRAN (R 4.4.0)
##  ggplot2      * 3.5.1    2024-04-23 [1] CRAN (R 4.4.0)
##  glue           1.7.0    2024-01-09 [1] CRAN (R 4.4.0)
##  gratia       * 0.9.2    2024-06-25 [1] CRAN (R 4.4.1)
##  gtable         0.3.5    2024-04-22 [1] CRAN (R 4.4.0)
##  here         * 1.0.1    2020-12-13 [1] CRAN (R 4.4.0)
##  hms            1.1.3    2023-03-21 [1] CRAN (R 4.4.0)
##  htmltools      0.5.8.1  2024-04-04 [1] CRAN (R 4.4.0)
##  htmlwidgets    1.6.4    2023-12-06 [1] CRAN (R 4.4.0)
##  httpuv         1.6.15   2024-03-26 [1] CRAN (R 4.4.0)
##  insight        0.20.5   2024-10-02 [1] CRAN (R 4.4.2)
##  jquerylib      0.1.4    2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite       1.8.8    2023-12-04 [1] CRAN (R 4.4.0)
##  knitr        * 1.48     2024-07-07 [1] CRAN (R 4.4.1)
##  later          1.3.2    2023-12-06 [1] CRAN (R 4.4.0)
##  lattice        0.22-6   2024-03-20 [1] CRAN (R 4.4.0)
##  lifecycle      1.0.4    2023-11-07 [1] CRAN (R 4.4.0)
##  lubridate    * 1.9.3    2023-09-27 [1] CRAN (R 4.4.0)
##  magrittr       2.0.3    2022-03-30 [1] CRAN (R 4.4.0)
##  MASS         * 7.3-61   2024-06-13 [1] CRAN (R 4.4.1)
##  Matrix         1.7-0    2024-03-22 [1] CRAN (R 4.4.0)
##  memoise        2.0.1    2021-11-26 [1] CRAN (R 4.4.0)
##  mgcv         * 1.9-1    2023-12-21 [1] CRAN (R 4.4.0)
##  mime           0.12     2021-09-28 [1] CRAN (R 4.4.0)
##  miniUI         0.1.1.1  2018-05-18 [1] CRAN (R 4.4.0)
##  multcomp     * 1.4-26   2024-07-18 [1] CRAN (R 4.4.1)
##  munsell        0.5.1    2024-04-01 [1] CRAN (R 4.4.0)
##  mvnfast        0.2.8    2023-02-23 [1] CRAN (R 4.4.0)
##  mvtnorm      * 1.3-1    2024-09-03 [1] CRAN (R 4.4.1)
##  nlme         * 3.1-166  2024-08-14 [1] CRAN (R 4.4.1)
##  patchwork    * 1.3.0    2024-09-16 [1] CRAN (R 4.4.2)
##  pillar         1.9.0    2023-03-22 [1] CRAN (R 4.4.0)
##  pkgbuild       1.4.4    2024-03-17 [1] CRAN (R 4.4.0)
##  pkgconfig      2.0.3    2019-09-22 [1] CRAN (R 4.4.0)
##  pkgload        1.4.0    2024-06-28 [1] CRAN (R 4.4.1)
##  profvis        0.3.8    2023-05-02 [1] CRAN (R 4.4.0)
##  promises       1.3.0    2024-04-05 [1] CRAN (R 4.4.0)
##  purrr        * 1.0.2    2023-08-10 [1] CRAN (R 4.4.0)
##  R6             2.5.1    2021-08-19 [1] CRAN (R 4.4.0)
##  Rcpp           1.0.13   2024-07-17 [1] CRAN (R 4.4.1)
##  readr        * 2.1.5    2024-01-10 [1] CRAN (R 4.4.0)
##  remotes        2.5.0    2024-03-17 [1] CRAN (R 4.4.0)
##  rlang        * 1.1.4    2024-06-04 [1] CRAN (R 4.4.1)
##  rmarkdown      2.28     2024-08-17 [1] CRAN (R 4.4.1)
##  rprojroot      2.0.4    2023-11-05 [1] CRAN (R 4.4.0)
##  rstudioapi     0.16.0   2024-03-24 [1] CRAN (R 4.4.0)
##  sandwich       3.1-0    2023-12-11 [1] CRAN (R 4.4.0)
##  sass           0.4.9    2024-03-15 [1] CRAN (R 4.4.0)
##  scales       * 1.3.0    2023-11-28 [1] CRAN (R 4.4.0)
##  sessioninfo    1.2.2    2021-12-06 [1] CRAN (R 4.4.0)
##  shiny          1.9.1    2024-08-01 [1] CRAN (R 4.4.1)
##  stringi        1.8.4    2024-05-06 [1] CRAN (R 4.4.0)
##  stringr      * 1.5.1    2023-11-14 [1] CRAN (R 4.4.0)
##  survival     * 3.7-0    2024-06-05 [1] CRAN (R 4.4.1)
##  TH.data      * 1.1-2    2023-04-17 [1] CRAN (R 4.4.0)
##  tibble       * 3.2.1    2023-03-20 [1] CRAN (R 4.4.0)
##  tidyr        * 1.3.1    2024-01-24 [1] CRAN (R 4.4.0)
##  tidyselect     1.2.1    2024-03-11 [1] CRAN (R 4.4.0)
##  tidyverse    * 2.0.0    2023-02-22 [1] CRAN (R 4.4.0)
##  timechange     0.3.0    2024-01-18 [1] CRAN (R 4.4.0)
##  tzdb           0.4.0    2023-05-12 [1] CRAN (R 4.4.0)
##  urlchecker     1.0.1    2021-11-30 [1] CRAN (R 4.4.0)
##  usethis        3.0.0    2024-07-29 [1] CRAN (R 4.4.1)
##  utf8           1.2.4    2023-10-22 [1] CRAN (R 4.4.0)
##  vctrs          0.6.5    2023-12-01 [1] CRAN (R 4.4.0)
##  withr          3.0.1    2024-07-31 [1] CRAN (R 4.4.1)
##  xfun           0.47     2024-08-17 [1] CRAN (R 4.4.1)
##  xtable         1.8-4    2019-04-21 [1] CRAN (R 4.4.0)
##  yaml           2.3.10   2024-07-26 [1] CRAN (R 4.4.1)
##  zoo            1.8-12   2023-04-13 [1] CRAN (R 4.4.0)
## 
##  [1] C:/R/win-library/4.4
##  [2] C:/Program Files/R/R-4.4.1/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Import Data

# Define file path for processed data
fp_data <- here("manuscript_synthesis/data/processed")

# Import weekly average water quality data
df_wq <- readRDS(file.path(fp_data, "wq_week_avg_2013-2019.rds"))

# Import weekly average flow data
df_flow <- readRDS(file.path(fp_data, "flow_week_avg_2013-2019.rds"))

Prepare Data

# Create a vector for the factor order of StationCode
sta_order <- c("RD22", "STTD", "LIB", "RVB")

# We will use LIS flow data as a proxy for STTD
df_flow_c <- df_flow %>% mutate(StationCode = if_else(StationCode == "LIS", "STTD", StationCode))

# Prepare chlorophyll and flow data for exploration and analysis
df_chla_c1 <- df_wq %>% 
  select(StationCode, Year, Week, Chla) %>% 
  drop_na(Chla) %>% 
  # Filter to only include representative stations for 4 habitat types - RD22, STTD, LIB, RVB
  filter(StationCode %in% sta_order) %>% 
  # Join flow data to chlorophyll data
  left_join(df_flow_c, by = join_by(StationCode, Year, Week)) %>% 
  # Remove all NA flow values
  drop_na(Flow) %>%
  mutate(
    # Scale and log transform chlorophyll values
    Chla_log = log(Chla * 1000),
    # Apply factor order to StationCode
    StationCode = factor(StationCode, levels = sta_order),
    # Add a column for Year as a factor for the model
    Year_fct = factor(Year)
  ) %>% 
  arrange(StationCode, Year, Week)

Explore sample counts by Station

df_chla_c1 %>% 
  summarize(
    min_week = min(Week),
    max_week = max(Week),
    num_samples = n(),
    .by = c(StationCode, Year)
  ) %>% 
  arrange(StationCode, Year) %>% 
  kable()
StationCode Year min_week max_week num_samples
RD22 2014 39 45 7
RD22 2015 30 45 16
RD22 2016 25 38 14
RD22 2017 28 44 17
RD22 2018 28 45 18
RD22 2019 28 45 18
STTD 2013 33 44 12
STTD 2014 30 45 14
STTD 2015 30 46 17
STTD 2016 25 38 14
STTD 2017 28 39 10
STTD 2018 29 42 14
STTD 2019 30 45 16
LIB 2014 40 45 6
LIB 2015 27 46 20
LIB 2016 22 38 17
LIB 2017 28 44 17
LIB 2018 33 44 10
LIB 2019 28 43 8
RVB 2013 27 46 20
RVB 2014 30 45 16
RVB 2015 27 46 20
RVB 2016 22 38 15
RVB 2017 28 44 17
RVB 2018 28 45 18
RVB 2019 28 45 18

Looking at the sample counts and date ranges, we’ll only include years 2015-2019 for the analysis.

df_chla_c2 <- df_chla_c1 %>% 
  filter(Year %in% 2015:2019) %>% 
  mutate(Year_fct = fct_drop(Year_fct))

We’ll create another dataframe that has up to 2 lag variables for chlorophyll to be used in the models to help with serial autocorrelation.

df_chla_c2_lag <- df_chla_c2 %>% 
  # Fill in missing weeks for each StationCode-Year combination
  group_by(StationCode, Year) %>% 
  # Create lag variables of scaled log transformed chlorophyll values
  mutate(
    lag1 = lag(Chla_log),
    lag2 = lag(Chla_log, 2)
  ) %>% 
  ungroup()

Plots

Let’s explore the data with some plots. First, lets plot the data in scatter plots of chlorophyll and flow faceted by Station and grouping all years together.

df_chla_c2 %>% 
  ggplot(aes(x = Flow, y = Chla)) +
  geom_point() +
  geom_smooth(formula = "y ~ x") +
  facet_wrap(vars(StationCode), scales = "free") +
  theme_bw()

At first glance, I’m not sure how well flow is going to be able to predict chlorophyll concentrations. At the furthest upstream station - RD22 - chlorophyll appears to be highest at the lowest flows, but the variation is at its maximum at the lowest flows. There may be some dilution effect going on here at the higher flows. At STTD, there does seem to be a modest increase in chlorophyll concentrations at the mid-range flows. This pattern is even more obvious at LIB. There appears to be no effect of flow on chlorophyll at RVB, but the range of chlorophyll concentrations is narrow at this station (between 0 and 5).

Let’s break these scatterplots apart by year to see how these patterns vary annually.

df_chla_c2 %>% 
  ggplot(aes(x = Flow, y = Chla)) +
  geom_point() +
  geom_smooth(formula = "y ~ x") +
  facet_wrap(
    vars(StationCode, Year), 
    ncol = 5, 
    scales = "free", 
    labeller = labeller(.multi_line = FALSE)
  ) +
  theme_bw()

The patterns appear to vary annually at each station, which may justify using a 3-way interaction.

Model 1: GAM Model with Flow and 3-way interactions

First, we will attempt to fit a generalized additive model (GAM) to the data set to help account for seasonality in the data. We’ll try running a GAM using a three-way interaction between Year, Weekly Average Flow, and Station, and a cyclic penalized cubic regression spline smooth term for week number to account for seasonality (restricting the k-value to 5 to reduce overfitting). Initially, we’ll run the GAM without accounting for serial autocorrelation.

Initial Model

m_gam_flow3 <- gam(
  Chla_log ~ Year_fct * Flow * StationCode + s(Week,  bs = "cc", k = 5), 
  data = df_chla_c2,
  method = "REML",
  knots = list(week = c(0, 52))
)

Lets look at the model summary and diagnostics:

summary(m_gam_flow3)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc", 
##     k = 5)
## 
## Parametric coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.776e+00  1.065e-01  91.828  < 2e-16 ***
## Year_fct2016                      -8.399e-01  1.543e-01  -5.444 1.16e-07 ***
## Year_fct2017                      -9.692e-01  2.377e-01  -4.078 5.97e-05 ***
## Year_fct2018                      -4.060e-01  1.421e-01  -2.858 0.004600 ** 
## Year_fct2019                      -1.337e-01  1.396e-01  -0.958 0.338960    
## Flow                              -1.964e-03  4.744e-04  -4.140 4.63e-05 ***
## StationCodeSTTD                   -1.147e+00  1.339e-01  -8.568 8.07e-16 ***
## StationCodeLIB                    -2.358e+00  4.002e-01  -5.893 1.12e-08 ***
## StationCodeRVB                    -2.720e+00  5.987e-01  -4.543 8.37e-06 ***
## Year_fct2016:Flow                  2.327e-03  7.570e-04   3.074 0.002324 ** 
## Year_fct2017:Flow                  3.152e-02  1.225e-02   2.574 0.010594 *  
## Year_fct2018:Flow                  5.652e-04  6.522e-04   0.867 0.386873    
## Year_fct2019:Flow                 -5.601e-05  5.740e-04  -0.098 0.922341    
## Year_fct2016:StationCodeSTTD       1.225e+00  1.969e-01   6.220 1.87e-09 ***
## Year_fct2017:StationCodeSTTD       1.205e+00  3.243e-01   3.716 0.000246 ***
## Year_fct2018:StationCodeSTTD      -2.392e-01  1.879e-01  -1.273 0.204262    
## Year_fct2019:StationCodeSTTD      -9.220e-01  1.848e-01  -4.990 1.08e-06 ***
## Year_fct2016:StationCodeLIB        2.302e+00  8.067e-01   2.853 0.004658 ** 
## Year_fct2017:StationCodeLIB        1.216e+00  5.196e-01   2.339 0.020037 *  
## Year_fct2018:StationCodeLIB       -2.069e+00  5.209e-01  -3.972 9.12e-05 ***
## Year_fct2019:StationCodeLIB       -1.666e-01  5.159e-01  -0.323 0.746950    
## Year_fct2016:StationCodeRVB        2.489e+00  8.278e-01   3.006 0.002895 ** 
## Year_fct2017:StationCodeRVB        1.520e+00  7.592e-01   2.002 0.046309 *  
## Year_fct2018:StationCodeRVB        5.291e-01  6.760e-01   0.783 0.434481    
## Year_fct2019:StationCodeRVB       -6.025e-01  6.932e-01  -0.869 0.385527    
## Flow:StationCodeSTTD               5.176e-03  7.181e-04   7.207 5.66e-12 ***
## Flow:StationCodeLIB                1.783e-03  5.383e-04   3.312 0.001051 ** 
## Flow:StationCodeRVB                2.103e-03  4.964e-04   4.237 3.10e-05 ***
## Year_fct2016:Flow:StationCodeSTTD -4.440e-03  1.097e-03  -4.048 6.73e-05 ***
## Year_fct2017:Flow:StationCodeSTTD -2.326e-02  1.366e-02  -1.703 0.089738 .  
## Year_fct2018:Flow:StationCodeSTTD -9.688e-04  9.659e-04  -1.003 0.316769    
## Year_fct2019:Flow:StationCodeSTTD -1.569e-03  8.547e-04  -1.836 0.067435 .  
## Year_fct2016:Flow:StationCodeLIB  -1.936e-03  9.423e-04  -2.055 0.040881 *  
## Year_fct2017:Flow:StationCodeLIB  -3.143e-02  1.225e-02  -2.566 0.010814 *  
## Year_fct2018:Flow:StationCodeLIB  -1.440e-03  8.404e-04  -1.713 0.087812 .  
## Year_fct2019:Flow:StationCodeLIB   2.452e-04  7.403e-04   0.331 0.740800    
## Year_fct2016:Flow:StationCodeRVB  -2.582e-03  7.770e-04  -3.322 0.001015 ** 
## Year_fct2017:Flow:StationCodeRVB  -3.168e-02  1.225e-02  -2.586 0.010237 *  
## Year_fct2018:Flow:StationCodeRVB  -6.853e-04  6.678e-04  -1.026 0.305729    
## Year_fct2019:Flow:StationCodeRVB  -1.390e-05  5.930e-04  -0.023 0.981311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Week) 2.722      3 18.56  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.898   Deviance explained = 91.1%
## -REML =    255  Scale est. = 0.10912   n = 314
appraise(m_gam_flow3)

shapiro.test(residuals(m_gam_flow3))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow3)
## W = 0.97687, p-value = 5.975e-05
k.check(m_gam_flow3)
##         k'     edf   k-index p-value
## s(Week)  3 2.72227 0.9693372   0.265
draw(m_gam_flow3, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow3, pages = 1, all.terms = TRUE)

acf(residuals(m_gam_flow3))

Box.test(residuals(m_gam_flow3), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow3)
## X-squared = 105.31, df = 20, p-value = 1.398e-13

Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look pretty good. However, the residuals are autocorrelated.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation. We’ll run a series of models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_gam_flow3_lag1 <- gam(
  Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc", k = 5) + lag1, 
  data = df_chla_c2_lag,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_flow3_lag1))

Box.test(residuals(m_gam_flow3_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow3_lag1)
## X-squared = 28.061, df = 20, p-value = 0.108

Lag 2

m_gam_flow3_lag2 <- gam(
  Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc", k = 5) + lag1 + lag2, 
  data = df_chla_c2_lag,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_flow3_lag2))

Box.test(residuals(m_gam_flow3_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow3_lag2)
## X-squared = 15.686, df = 20, p-value = 0.7359

The model with 1 lag term already seems to address the serial autocorrelation. Let’s use AIC to see how they compare.

Compare Models

AIC(m_gam_flow3, m_gam_flow3_lag1, m_gam_flow3_lag2)
##                        df      AIC
## m_gam_flow3      43.94624 237.4562
## m_gam_flow3_lag1 44.81188 158.7898
## m_gam_flow3_lag2 45.81545 154.3799

It looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_gam_flow3_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc", 
##     k = 5) + lag1 + lag2
## 
## Parametric coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        5.942e+00  5.626e-01  10.563  < 2e-16 ***
## Year_fct2016                      -5.724e-01  1.590e-01  -3.599 0.000391 ***
## Year_fct2017                      -6.505e-01  2.400e-01  -2.711 0.007223 ** 
## Year_fct2018                      -2.955e-01  1.412e-01  -2.093 0.037481 *  
## Year_fct2019                      -7.407e-02  1.378e-01  -0.538 0.591405    
## Flow                              -1.364e-03  4.548e-04  -2.999 0.003008 ** 
## StationCodeSTTD                   -6.939e-01  1.437e-01  -4.830 2.49e-06 ***
## StationCodeLIB                    -1.555e+00  3.969e-01  -3.918 0.000118 ***
## StationCodeRVB                    -1.970e+00  6.566e-01  -3.001 0.002988 ** 
## lag1                               5.316e-01  6.573e-02   8.088 3.49e-14 ***
## lag2                              -1.386e-01  6.071e-02  -2.282 0.023376 *  
## Year_fct2016:Flow                  1.446e-03  7.135e-04   2.027 0.043828 *  
## Year_fct2017:Flow                  2.220e-02  1.352e-02   1.642 0.101936    
## Year_fct2018:Flow                  5.982e-04  6.055e-04   0.988 0.324231    
## Year_fct2019:Flow                  1.557e-04  5.359e-04   0.291 0.771614    
## Year_fct2016:StationCodeSTTD       8.113e-01  2.031e-01   3.995 8.73e-05 ***
## Year_fct2017:StationCodeSTTD       8.102e-01  3.275e-01   2.474 0.014073 *  
## Year_fct2018:StationCodeSTTD      -6.404e-02  1.869e-01  -0.343 0.732109    
## Year_fct2019:StationCodeSTTD      -5.797e-01  1.899e-01  -3.052 0.002538 ** 
## Year_fct2016:StationCodeLIB        2.058e+00  8.967e-01   2.295 0.022639 *  
## Year_fct2017:StationCodeLIB        8.948e-01  4.964e-01   1.803 0.072770 .  
## Year_fct2018:StationCodeLIB       -8.208e-01  5.527e-01  -1.485 0.138923    
## Year_fct2019:StationCodeLIB       -4.930e-02  4.829e-01  -0.102 0.918769    
## Year_fct2016:StationCodeRVB        1.952e+00  9.201e-01   2.121 0.034971 *  
## Year_fct2017:StationCodeRVB        1.055e+00  7.978e-01   1.323 0.187191    
## Year_fct2018:StationCodeRVB        6.910e-01  7.092e-01   0.974 0.330883    
## Year_fct2019:StationCodeRVB       -1.381e-01  7.264e-01  -0.190 0.849409    
## Flow:StationCodeSTTD               3.192e-03  7.093e-04   4.499 1.08e-05 ***
## Flow:StationCodeLIB                1.191e-03  5.202e-04   2.289 0.022987 *  
## Flow:StationCodeRVB                1.517e-03  4.813e-04   3.151 0.001842 ** 
## Year_fct2016:Flow:StationCodeSTTD -2.740e-03  1.034e-03  -2.649 0.008627 ** 
## Year_fct2017:Flow:StationCodeSTTD -1.654e-02  1.472e-02  -1.124 0.262253    
## Year_fct2018:Flow:StationCodeSTTD -8.351e-04  9.053e-04  -0.923 0.357214    
## Year_fct2019:Flow:StationCodeSTTD -9.759e-04  7.996e-04  -1.221 0.223506    
## Year_fct2016:Flow:StationCodeLIB  -7.654e-04  9.668e-04  -0.792 0.429389    
## Year_fct2017:Flow:StationCodeLIB  -2.206e-02  1.352e-02  -1.632 0.104110    
## Year_fct2018:Flow:StationCodeLIB  -4.712e-04  8.824e-04  -0.534 0.593834    
## Year_fct2019:Flow:StationCodeLIB  -5.711e-05  6.916e-04  -0.083 0.934265    
## Year_fct2016:Flow:StationCodeRVB  -1.681e-03  7.394e-04  -2.274 0.023898 *  
## Year_fct2017:Flow:StationCodeRVB  -2.235e-02  1.352e-02  -1.653 0.099754 .  
## Year_fct2018:Flow:StationCodeRVB  -7.416e-04  6.256e-04  -1.186 0.237042    
## Year_fct2019:Flow:StationCodeRVB  -2.587e-04  5.598e-04  -0.462 0.644462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F  p-value    
## s(Week) 2.461      3 7.763 1.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.918   Deviance explained = 93.1%
## -REML = 215.73  Scale est. = 0.087877  n = 274
appraise(m_gam_flow3_lag2)

shapiro.test(residuals(m_gam_flow3_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow3_lag2)
## W = 0.96317, p-value = 1.857e-06
k.check(m_gam_flow3_lag2)
##         k'      edf   k-index p-value
## s(Week)  3 2.460695 0.9467043  0.1725
draw(m_gam_flow3_lag2, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow3_lag2, pages = 1, all.terms = TRUE)

anova(m_gam_flow3_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = "cc", 
##     k = 5) + lag1 + lag2
## 
## Parametric Terms:
##                           df      F  p-value
## Year_fct                   4  4.954 0.000755
## Flow                       1  8.993 0.003008
## StationCode                3 11.462 4.96e-07
## lag1                       1 65.416 3.49e-14
## lag2                       1  5.210 0.023376
## Year_fct:Flow              4  1.873 0.116090
## Year_fct:StationCode      12  4.300 3.77e-06
## Flow:StationCode           3  7.237 0.000116
## Year_fct:Flow:StationCode 12  1.241 0.256118
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F  p-value
## s(Week) 2.461  3.000 7.763 1.33e-05

The model diagnostics look pretty good. Note that the 3-way interaction between Year, Station, and Flow isn’t significant. We’ll use m_gam_flow3_lag2 in the model selection process.

rm(m_gam_flow3, m_gam_flow3_lag1)

Model 2: GAM Model with Flow and 2-way interactions

Now we’ll try running a GAM using all two-way interactions between Year, Flow, and Station.

Initial Model

m_gam_flow2 <- gam(
  Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", k = 5), 
  data = df_chla_c2,
  method = "REML", 
  knots = list(week = c(0, 52))
)

Lets look at the model summary and diagnostics:

summary(m_gam_flow2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", 
##     k = 5)
## 
## Parametric coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   9.711e+00  9.377e-02 103.564  < 2e-16 ***
## Year_fct2016                 -5.547e-01  1.298e-01  -4.273 2.64e-05 ***
## Year_fct2017                 -3.981e-01  1.251e-01  -3.183 0.001620 ** 
## Year_fct2018                 -3.184e-01  1.205e-01  -2.641 0.008722 ** 
## Year_fct2019                 -1.354e-01  1.206e-01  -1.123 0.262566    
## Flow                         -1.452e-03  2.513e-04  -5.780 1.97e-08 ***
## StationCodeSTTD              -1.041e+00  1.256e-01  -8.287 4.70e-15 ***
## StationCodeLIB               -2.074e+00  3.037e-01  -6.831 5.13e-11 ***
## StationCodeRVB               -2.597e+00  5.281e-01  -4.918 1.49e-06 ***
## Year_fct2016:Flow            -2.286e-04  1.412e-04  -1.619 0.106640    
## Year_fct2017:Flow            -1.430e-04  1.304e-04  -1.096 0.274109    
## Year_fct2018:Flow            -1.149e-04  1.307e-04  -0.879 0.380090    
## Year_fct2019:Flow            -7.474e-05  1.310e-04  -0.571 0.568768    
## Year_fct2016:StationCodeSTTD  8.459e-01  1.788e-01   4.732 3.52e-06 ***
## Year_fct2017:StationCodeSTTD  3.354e-01  1.870e-01   1.794 0.073912 .  
## Year_fct2018:StationCodeSTTD -3.324e-01  1.738e-01  -1.913 0.056752 .  
## Year_fct2019:StationCodeSTTD -1.007e+00  1.709e-01  -5.895 1.07e-08 ***
## Year_fct2016:StationCodeLIB   1.112e+00  2.863e-01   3.884 0.000128 ***
## Year_fct2017:StationCodeLIB   3.338e-01  2.693e-01   1.240 0.216084    
## Year_fct2018:StationCodeLIB  -1.770e+00  2.879e-01  -6.149 2.64e-09 ***
## Year_fct2019:StationCodeLIB  -4.850e-01  2.927e-01  -1.657 0.098653 .  
## Year_fct2016:StationCodeRVB   2.036e+00  7.646e-01   2.663 0.008182 ** 
## Year_fct2017:StationCodeRVB   8.638e-01  6.689e-01   1.291 0.197600    
## Year_fct2018:StationCodeRVB   4.469e-01  6.078e-01   0.735 0.462836    
## Year_fct2019:StationCodeRVB  -5.091e-01  6.217e-01  -0.819 0.413560    
## Flow:StationCodeSTTD          3.620e-03  3.234e-04  11.195  < 2e-16 ***
## Flow:StationCodeLIB           1.409e-03  2.834e-04   4.970 1.16e-06 ***
## Flow:StationCodeRVB           1.578e-03  2.342e-04   6.739 8.92e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Week) 2.687      3 22.34  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.888   Deviance explained = 89.8%
## -REML = 200.62  Scale est. = 0.12004   n = 314
appraise(m_gam_flow2)

shapiro.test(residuals(m_gam_flow2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow2)
## W = 0.98376, p-value = 0.001287
k.check(m_gam_flow2)
##         k'      edf   k-index p-value
## s(Week)  3 2.686871 0.9690676  0.2475
draw(m_gam_flow2, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow2, pages = 1, all.terms = TRUE)

acf(residuals(m_gam_flow2))

Box.test(residuals(m_gam_flow2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow2)
## X-squared = 144.04, df = 20, p-value < 2.2e-16

Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look really good. However, the residuals are autocorrelated.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation. We’ll run a series of models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_gam_flow2_lag1 <- gam(
  Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", k = 5) + lag1, 
  data = df_chla_c2_lag,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_flow2_lag1))

Box.test(residuals(m_gam_flow2_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow2_lag1)
## X-squared = 30.611, df = 20, p-value = 0.06054

Lag 2

m_gam_flow2_lag2 <- gam(
  Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", k = 5) + lag1 + lag2, 
  data = df_chla_c2_lag,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_flow2_lag2))

Box.test(residuals(m_gam_flow2_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow2_lag2)
## X-squared = 18.837, df = 20, p-value = 0.5324

The model with 1 lag term already seems to address the serial autocorrelation, but the lag2 model is even better. Let’s use AIC to see how they compare.

Compare Models

AIC(m_gam_flow2, m_gam_flow2_lag1, m_gam_flow2_lag2)
##                        df      AIC
## m_gam_flow2      31.93061 256.9933
## m_gam_flow2_lag1 32.78679 153.1460
## m_gam_flow2_lag2 33.79579 148.0033

Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_gam_flow2_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", 
##     k = 5) + lag1 + lag2
## 
## Parametric coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   5.338e+00  5.236e-01  10.194  < 2e-16 ***
## Year_fct2016                 -3.493e-01  1.243e-01  -2.811 0.005349 ** 
## Year_fct2017                 -2.403e-01  1.175e-01  -2.045 0.041952 *  
## Year_fct2018                 -1.804e-01  1.128e-01  -1.600 0.110883    
## Year_fct2019                 -3.798e-02  1.118e-01  -0.340 0.734386    
## Flow                         -8.481e-04  2.374e-04  -3.573 0.000426 ***
## StationCodeSTTD              -5.400e-01  1.261e-01  -4.281 2.69e-05 ***
## StationCodeLIB               -1.093e+00  3.071e-01  -3.559 0.000448 ***
## StationCodeRVB               -1.434e+00  5.446e-01  -2.634 0.008992 ** 
## lag1                          5.876e-01  6.315e-02   9.305  < 2e-16 ***
## lag2                         -1.389e-01  5.918e-02  -2.346 0.019763 *  
## Year_fct2016:Flow            -1.337e-04  1.446e-04  -0.924 0.356166    
## Year_fct2017:Flow            -6.283e-05  1.314e-04  -0.478 0.633076    
## Year_fct2018:Flow            -6.137e-05  1.311e-04  -0.468 0.640055    
## Year_fct2019:Flow            -2.909e-05  1.311e-04  -0.222 0.824550    
## Year_fct2016:StationCodeSTTD  4.972e-01  1.701e-01   2.923 0.003798 ** 
## Year_fct2017:StationCodeSTTD  1.932e-01  1.764e-01   1.095 0.274447    
## Year_fct2018:StationCodeSTTD -1.479e-01  1.610e-01  -0.919 0.359077    
## Year_fct2019:StationCodeSTTD -6.087e-01  1.647e-01  -3.697 0.000270 ***
## Year_fct2016:StationCodeLIB   6.306e-01  2.856e-01   2.208 0.028198 *  
## Year_fct2017:StationCodeLIB   1.881e-01  2.575e-01   0.730 0.465824    
## Year_fct2018:StationCodeLIB  -1.098e+00  2.886e-01  -3.804 0.000181 ***
## Year_fct2019:StationCodeLIB  -3.000e-01  2.815e-01  -1.066 0.287667    
## Year_fct2016:StationCodeRVB   1.217e+00  8.017e-01   1.519 0.130194    
## Year_fct2017:StationCodeRVB   3.038e-01  6.574e-01   0.462 0.644412    
## Year_fct2018:StationCodeRVB   2.598e-01  5.860e-01   0.443 0.657883    
## Year_fct2019:StationCodeRVB  -3.972e-01  5.960e-01  -0.666 0.505767    
## Flow:StationCodeSTTD          2.007e-03  3.206e-04   6.262 1.72e-09 ***
## Flow:StationCodeLIB           8.623e-04  2.659e-04   3.242 0.001352 ** 
## Flow:StationCodeRVB           9.164e-04  2.177e-04   4.210 3.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F  p-value    
## s(Week) 2.435      3 8.747 3.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.917   Deviance explained = 92.7%
## -REML =  149.4  Scale est. = 0.089063  n = 274
appraise(m_gam_flow2_lag2)

shapiro.test(residuals(m_gam_flow2_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow2_lag2)
## W = 0.96772, p-value = 7.839e-06
k.check(m_gam_flow2_lag2)
##         k'      edf   k-index p-value
## s(Week)  3 2.435016 0.9459394  0.1775
draw(m_gam_flow2_lag2, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow2_lag2, pages = 1, all.terms = TRUE)

anova(m_gam_flow2_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = "cc", 
##     k = 5) + lag1 + lag2
## 
## Parametric Terms:
##                      df      F  p-value
## Year_fct              4  2.789 0.027126
## Flow                  1 12.765 0.000426
## StationCode           3 10.800 1.10e-06
## lag1                  1 86.583  < 2e-16
## lag2                  1  5.506 0.019763
## Year_fct:Flow         4  0.551 0.698106
## Year_fct:StationCode 12  5.788 8.65e-09
## Flow:StationCode      3 13.123 5.74e-08
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F  p-value
## s(Week) 2.435  3.000 8.747 3.04e-06

The model diagnostics look pretty good. Note that the 2-way interaction between Year and Flow isn’t significant. We’ll use m_gam_flow2_lag2 in the model selection process.

rm(m_gam_flow2, m_gam_flow2_lag1)

Model 3: GAM Model with 2-way interaction between Station and Year but without Flow

Next we’ll try running a GAM using a two-way interaction between Year and Station but not including flow as a predictor.

Initial Model

m_gam_cat2 <- gam(
  Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5), 
  data = df_chla_c2,
  method = "REML", 
  knots = list(week = c(0, 52))
)

Lets look at the model summary and diagnostics:

summary(m_gam_cat2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5)
## 
## Parametric coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   9.50498    0.10404  91.360  < 2e-16 ***
## Year_fct2016                 -0.54029    0.15349  -3.520 0.000501 ***
## Year_fct2017                 -0.21984    0.14466  -1.520 0.129675    
## Year_fct2018                 -0.29821    0.14264  -2.091 0.037427 *  
## Year_fct2019                 -0.14474    0.14264  -1.015 0.311065    
## StationCodeSTTD              -0.76699    0.14452  -5.307 2.21e-07 ***
## StationCodeLIB               -1.79937    0.13941 -12.907  < 2e-16 ***
## StationCodeRVB               -1.85932    0.13941 -13.337  < 2e-16 ***
## Year_fct2016:StationCodeSTTD  0.85363    0.21322   4.003 7.93e-05 ***
## Year_fct2017:StationCodeSTTD  0.03494    0.22023   0.159 0.874055    
## Year_fct2018:StationCodeSTTD -0.32131    0.20713  -1.551 0.121925    
## Year_fct2019:StationCodeSTTD -0.84627    0.20306  -4.168 4.06e-05 ***
## Year_fct2016:StationCodeLIB   1.41654    0.20470   6.920 2.87e-11 ***
## Year_fct2017:StationCodeLIB   0.26245    0.19919   1.318 0.188682    
## Year_fct2018:StationCodeLIB  -1.75609    0.21578  -8.138 1.17e-14 ***
## Year_fct2019:StationCodeLIB  -0.46106    0.22475  -2.051 0.041121 *  
## Year_fct2016:StationCodeRVB   0.63920    0.20807   3.072 0.002327 ** 
## Year_fct2017:StationCodeRVB  -0.02780    0.19919  -0.140 0.889118    
## Year_fct2018:StationCodeRVB  -0.01988    0.19635  -0.101 0.919435    
## Year_fct2019:StationCodeRVB  -0.60985    0.19635  -3.106 0.002083 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Week) 2.568      3 15.76  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.839   Deviance explained =   85%
## -REML =  189.6  Scale est. = 0.17204   n = 314
appraise(m_gam_cat2)

shapiro.test(residuals(m_gam_cat2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_cat2)
## W = 0.98769, p-value = 0.009076
k.check(m_gam_cat2)
##         k'     edf   k-index p-value
## s(Week)  3 2.56794 0.9387348    0.12
draw(m_gam_cat2, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_cat2, pages = 1, all.terms = TRUE)

acf(residuals(m_gam_cat2))

Box.test(residuals(m_gam_cat2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_cat2)
## X-squared = 278.51, df = 20, p-value < 2.2e-16

Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look really good. However, the residuals are autocorrelated.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation. We’ll run a series of models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_gam_cat2_lag1 <- gam(
  Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1, 
  data = df_chla_c2_lag,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_cat2_lag1))

Box.test(residuals(m_gam_cat2_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_cat2_lag1)
## X-squared = 26.832, df = 20, p-value = 0.1401

Lag 2

m_gam_cat2_lag2 <- gam(
  Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 + lag2, 
  data = df_chla_c2_lag,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_cat2_lag2))

Box.test(residuals(m_gam_cat2_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_cat2_lag2)
## X-squared = 15.258, df = 20, p-value = 0.7615

The model with 1 lag term already seems to address the serial autocorrelation, but the lag2 model is even better. Let’s use AIC to see how they compare.

Compare Models

AIC(m_gam_cat2, m_gam_cat2_lag1, m_gam_cat2_lag2)
##                       df      AIC
## m_gam_cat2      23.86710 362.7642
## m_gam_cat2_lag1 24.59559 189.2854
## m_gam_cat2_lag2 25.62646 176.4000

Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_gam_cat2_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5) + 
##     lag1 + lag2
## 
## Parametric coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   3.968075   0.509024   7.795 1.73e-13 ***
## Year_fct2016                 -0.264678   0.129335  -2.046 0.041759 *  
## Year_fct2017                 -0.094817   0.118855  -0.798 0.425768    
## Year_fct2018                 -0.120328   0.117920  -1.020 0.308514    
## Year_fct2019                 -0.020108   0.117202  -0.172 0.863919    
## StationCodeSTTD              -0.256423   0.124046  -2.067 0.039750 *  
## StationCodeLIB               -0.744025   0.146809  -5.068 7.83e-07 ***
## StationCodeRVB               -0.768135   0.150118  -5.117 6.20e-07 ***
## lag1                          0.757746   0.061040  12.414  < 2e-16 ***
## lag2                         -0.178844   0.062076  -2.881 0.004308 ** 
## Year_fct2016:StationCodeSTTD  0.383395   0.179813   2.132 0.033967 *  
## Year_fct2017:StationCodeSTTD -0.023914   0.183626  -0.130 0.896489    
## Year_fct2018:StationCodeSTTD -0.111566   0.170698  -0.654 0.513979    
## Year_fct2019:StationCodeSTTD -0.412364   0.171467  -2.405 0.016904 *  
## Year_fct2016:StationCodeLIB   0.616167   0.183385   3.360 0.000902 ***
## Year_fct2017:StationCodeLIB   0.103366   0.163030   0.634 0.526644    
## Year_fct2018:StationCodeLIB  -0.840384   0.200085  -4.200 3.71e-05 ***
## Year_fct2019:StationCodeLIB  -0.212731   0.191415  -1.111 0.267482    
## Year_fct2016:StationCodeRVB   0.272513   0.173941   1.567 0.118452    
## Year_fct2017:StationCodeRVB  -0.037916   0.162719  -0.233 0.815938    
## Year_fct2018:StationCodeRVB   0.003196   0.160067   0.020 0.984088    
## Year_fct2019:StationCodeRVB  -0.286425   0.162510  -1.763 0.079207 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F  p-value    
## s(Week) 2.214      3 5.061 0.000402 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.906   Deviance explained = 91.4%
## -REML = 101.27  Scale est. = 0.10141   n = 274
appraise(m_gam_cat2_lag2)

shapiro.test(residuals(m_gam_cat2_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_cat2_lag2)
## W = 0.9529, p-value = 9.872e-08
k.check(m_gam_cat2_lag2)
##         k'      edf   k-index p-value
## s(Week)  3 2.214206 0.9270206   0.135
draw(m_gam_cat2_lag2, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_cat2_lag2, pages = 1, all.terms = TRUE)

anova(m_gam_cat2_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * StationCode + s(Week, bs = "cc", k = 5) + 
##     lag1 + lag2
## 
## Parametric Terms:
##                      df       F  p-value
## Year_fct              4   1.324  0.26146
## StationCode           3  10.923 9.14e-07
## lag1                  1 154.104  < 2e-16
## lag2                  1   8.300  0.00431
## Year_fct:StationCode 12   3.571 6.34e-05
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F  p-value
## s(Week) 2.214  3.000 5.061 0.000402

The model diagnostics look pretty good but not quite as good as with the initial model. We’ll use m_gam_cat2_lag2 in the model selection process.

rm(m_gam_cat2, m_gam_cat2_lag1)

Model 4: Linear Model with Flow and 3-way interactions

Let’s try the weekly average model as a linear model with a three-way interaction between Year, Weekly Average Flow, and Station but without the smooth term for week number. Initially, we’ll run the model without accounting for serial autocorrelation.

Initial Model

m_lm_flow3 <- lm(Chla_log ~ Year_fct * Flow * StationCode, data = df_chla_c2)

Lets look at the model summary and diagnostics:

summary(m_lm_flow3)
## 
## Call:
## lm(formula = Chla_log ~ Year_fct * Flow * StationCode, data = df_chla_c2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98399 -0.20320 -0.01209  0.17753  1.31988 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.762e+00  1.160e-01  84.185  < 2e-16 ***
## Year_fct2016                      -7.803e-01  1.678e-01  -4.651 5.13e-06 ***
## Year_fct2017                      -1.275e+00  2.557e-01  -4.987 1.09e-06 ***
## Year_fct2018                      -3.717e-01  1.557e-01  -2.387 0.017642 *  
## Year_fct2019                      -1.064e-01  1.529e-01  -0.696 0.486980    
## Flow                              -2.168e-03  5.114e-04  -4.239 3.07e-05 ***
## StationCodeSTTD                   -1.148e+00  1.467e-01  -7.830 1.07e-13 ***
## StationCodeLIB                    -2.487e+00  4.255e-01  -5.846 1.43e-08 ***
## StationCodeRVB                    -3.270e+00  6.506e-01  -5.027 9.03e-07 ***
## Year_fct2016:Flow                  2.807e-03  8.177e-04   3.432 0.000691 ***
## Year_fct2017:Flow                  4.980e-02  1.296e-02   3.842 0.000152 ***
## Year_fct2018:Flow                  4.445e-04  7.150e-04   0.622 0.534642    
## Year_fct2019:Flow                 -7.480e-05  6.283e-04  -0.119 0.905318    
## Year_fct2016:StationCodeSTTD       1.250e+00  2.156e-01   5.798 1.84e-08 ***
## Year_fct2017:StationCodeSTTD       1.716e+00  3.439e-01   4.989 1.08e-06 ***
## Year_fct2018:StationCodeSTTD      -2.869e-01  2.057e-01  -1.395 0.164109    
## Year_fct2019:StationCodeSTTD      -9.572e-01  2.025e-01  -4.726 3.67e-06 ***
## Year_fct2016:StationCodeLIB        2.235e+00  8.545e-01   2.615 0.009414 ** 
## Year_fct2017:StationCodeLIB        1.584e+00  5.556e-01   2.850 0.004698 ** 
## Year_fct2018:StationCodeLIB       -2.048e+00  5.678e-01  -3.606 0.000369 ***
## Year_fct2019:StationCodeLIB       -2.341e-01  5.550e-01  -0.422 0.673534    
## Year_fct2016:StationCodeRVB        3.488e+00  8.735e-01   3.993 8.39e-05 ***
## Year_fct2017:StationCodeRVB        2.697e+00  8.098e-01   3.330 0.000986 ***
## Year_fct2018:StationCodeRVB        9.703e-01  7.291e-01   1.331 0.184338    
## Year_fct2019:StationCodeRVB        2.551e-02  7.444e-01   0.034 0.972687    
## Flow:StationCodeSTTD               4.930e-03  7.866e-04   6.267 1.42e-09 ***
## Flow:StationCodeLIB                1.893e-03  5.719e-04   3.309 0.001061 ** 
## Flow:StationCodeRVB                2.442e-03  5.331e-04   4.580 7.07e-06 ***
## Year_fct2016:Flow:StationCodeSTTD -4.311e-03  1.203e-03  -3.585 0.000399 ***
## Year_fct2017:Flow:StationCodeSTTD -3.529e-02  1.471e-02  -2.399 0.017117 *  
## Year_fct2018:Flow:StationCodeSTTD -7.185e-04  1.057e-03  -0.680 0.497159    
## Year_fct2019:Flow:StationCodeSTTD -1.288e-03  9.364e-04  -1.375 0.170214    
## Year_fct2016:Flow:StationCodeLIB  -2.501e-03  9.998e-04  -2.502 0.012938 *  
## Year_fct2017:Flow:StationCodeLIB  -4.968e-02  1.297e-02  -3.830 0.000159 ***
## Year_fct2018:Flow:StationCodeLIB  -1.229e-03  9.207e-04  -1.335 0.182876    
## Year_fct2019:Flow:StationCodeLIB   8.259e-05  8.062e-04   0.102 0.918482    
## Year_fct2016:Flow:StationCodeRVB  -3.242e-03  8.342e-04  -3.887 0.000127 ***
## Year_fct2017:Flow:StationCodeRVB  -5.013e-02  1.296e-02  -3.867 0.000138 ***
## Year_fct2018:Flow:StationCodeRVB  -6.903e-04  7.317e-04  -0.943 0.346284    
## Year_fct2019:Flow:StationCodeRVB  -1.432e-04  6.474e-04  -0.221 0.825074    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3624 on 274 degrees of freedom
## Multiple R-squared:  0.8923, Adjusted R-squared:  0.877 
## F-statistic: 58.23 on 39 and 274 DF,  p-value: < 2.2e-16
df_chla_c2 %>% plot_lm_diag(Chla_log, m_lm_flow3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm_flow3))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm_flow3)
## W = 0.97352, p-value = 1.555e-05
acf(residuals(m_lm_flow3))

Box.test(residuals(m_lm_flow3), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_flow3)
## X-squared = 137.15, df = 20, p-value < 2.2e-16

The residuals deviate from a normal distribution according to visual inspection and the Shapiro-Wilk normality test. Also, model definitely has residual autocorrelation as indicated by the ACF plot and the Box-Ljung test.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_lm_flow3_lag1 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1) %>% 
  lm(Chla_log ~ Year_fct * Flow * StationCode + lag1, data = .)

acf(residuals(m_lm_flow3_lag1))

Box.test(residuals(m_lm_flow3_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_flow3_lag1)
## X-squared = 26.352, df = 20, p-value = 0.1545

Lag 2

m_lm_flow3_lag2 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2, data = .)

acf(residuals(m_lm_flow3_lag2))

Box.test(residuals(m_lm_flow3_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_flow3_lag2)
## X-squared = 13.106, df = 20, p-value = 0.8728

The model with 1 lag term already has better ACF and Box-Ljung test results than the initial model. Let’s use AIC to see how they compare.

Compare Models

AIC(m_lm_flow3, m_lm_flow3_lag1, m_lm_flow3_lag2)
##                 df      AIC
## m_lm_flow3      41 292.7994
## m_lm_flow3_lag1 42 183.8907
## m_lm_flow3_lag2 43 177.8827

Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_lm_flow3_lag2)
## 
## Call:
## lm(formula = Chla_log ~ Year_fct * Flow * StationCode + lag1 + 
##     lag2, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10604 -0.15916 -0.00246  0.11706  0.97717 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        5.151e+00  5.577e-01   9.236  < 2e-16 ***
## Year_fct2016                      -4.942e-01  1.616e-01  -3.057 0.002495 ** 
## Year_fct2017                      -7.132e-01  2.504e-01  -2.848 0.004796 ** 
## Year_fct2018                      -2.494e-01  1.473e-01  -1.693 0.091852 .  
## Year_fct2019                      -5.465e-02  1.440e-01  -0.380 0.704640    
## Flow                              -1.413e-03  4.645e-04  -3.043 0.002615 ** 
## StationCodeSTTD                   -6.074e-01  1.488e-01  -4.081 6.16e-05 ***
## StationCodeLIB                    -1.326e+00  4.060e-01  -3.266 0.001256 ** 
## StationCodeRVB                    -2.238e+00  6.850e-01  -3.268 0.001249 ** 
## lag1                               6.009e-01  6.712e-02   8.954  < 2e-16 ***
## lag2                              -1.262e-01  6.347e-02  -1.989 0.047871 *  
## Year_fct2016:Flow                  1.648e-03  7.277e-04   2.265 0.024448 *  
## Year_fct2017:Flow                  2.732e-02  1.386e-02   1.971 0.049931 *  
## Year_fct2018:Flow                  5.512e-04  6.344e-04   0.869 0.385818    
## Year_fct2019:Flow                  2.174e-04  5.600e-04   0.388 0.698236    
## Year_fct2016:StationCodeSTTD       7.355e-01  2.116e-01   3.476 0.000607 ***
## Year_fct2017:StationCodeSTTD       9.581e-01  3.364e-01   2.848 0.004790 ** 
## Year_fct2018:StationCodeSTTD      -9.220e-02  1.956e-01  -0.471 0.637855    
## Year_fct2019:StationCodeSTTD      -5.315e-01  1.989e-01  -2.672 0.008076 ** 
## Year_fct2016:StationCodeLIB        1.796e+00  9.278e-01   1.935 0.054173 .  
## Year_fct2017:StationCodeLIB        8.549e-01  5.126e-01   1.668 0.096688 .  
## Year_fct2018:StationCodeLIB       -5.611e-01  5.719e-01  -0.981 0.327564    
## Year_fct2019:StationCodeLIB       -1.799e-01  4.976e-01  -0.362 0.718000    
## Year_fct2016:StationCodeRVB        2.675e+00  9.350e-01   2.861 0.004608 ** 
## Year_fct2017:StationCodeRVB        1.758e+00  8.198e-01   2.144 0.033071 *  
## Year_fct2018:StationCodeRVB        1.155e+00  7.307e-01   1.581 0.115196    
## Year_fct2019:StationCodeRVB        4.856e-01  7.436e-01   0.653 0.514402    
## Flow:StationCodeSTTD               2.734e-03  7.374e-04   3.707 0.000262 ***
## Flow:StationCodeLIB                1.279e-03  5.221e-04   2.450 0.015009 *  
## Flow:StationCodeRVB                1.673e-03  4.912e-04   3.406 0.000778 ***
## Year_fct2016:Flow:StationCodeSTTD -2.337e-03  1.081e-03  -2.161 0.031691 *  
## Year_fct2017:Flow:StationCodeSTTD -1.781e-02  1.525e-02  -1.168 0.244031    
## Year_fct2018:Flow:StationCodeSTTD -5.762e-04  9.461e-04  -0.609 0.543101    
## Year_fct2019:Flow:StationCodeSTTD -7.217e-04  8.369e-04  -0.862 0.389363    
## Year_fct2016:Flow:StationCodeLIB  -1.086e-03  9.786e-04  -1.110 0.268090    
## Year_fct2017:Flow:StationCodeLIB  -2.723e-02  1.387e-02  -1.964 0.050720 .  
## Year_fct2018:Flow:StationCodeLIB  -2.727e-05  9.181e-04  -0.030 0.976333    
## Year_fct2019:Flow:StationCodeLIB  -2.769e-04  7.181e-04  -0.386 0.700161    
## Year_fct2016:Flow:StationCodeRVB  -2.024e-03  7.498e-04  -2.700 0.007442 ** 
## Year_fct2017:Flow:StationCodeRVB  -2.759e-02  1.386e-02  -1.990 0.047745 *  
## Year_fct2018:Flow:StationCodeRVB  -8.088e-04  6.547e-04  -1.235 0.217949    
## Year_fct2019:Flow:StationCodeRVB  -4.458e-04  5.826e-04  -0.765 0.444982    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.311 on 232 degrees of freedom
## Multiple R-squared:  0.9237, Adjusted R-squared:  0.9102 
## F-statistic: 68.51 on 41 and 232 DF,  p-value: < 2.2e-16
df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  plot_lm_diag(Chla_log, m_lm_flow3_lag2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm_flow3_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm_flow3_lag2)
## W = 0.95757, p-value = 3.559e-07
Anova(m_lm_flow3_lag2, type = 3, contrasts = list(topic = contr.sum, sys = contr.sum))
## Anova Table (Type III tests)
## 
## Response: Chla_log
##                            Sum Sq  Df F value    Pr(>F)    
## (Intercept)                8.2491   1 85.3064 < 2.2e-16 ***
## Year_fct                   1.5818   4  4.0896 0.0031915 ** 
## Flow                       0.8952   1  9.2578 0.0026149 ** 
## StationCode                2.6225   3  9.0401 1.100e-05 ***
## lag1                       7.7523   1 80.1687 < 2.2e-16 ***
## lag2                       0.3826   1  3.9563 0.0478710 *  
## Year_fct:Flow              0.8906   4  2.3024 0.0593683 .  
## Year_fct:StationCode       4.2199  12  3.6366 5.227e-05 ***
## Flow:StationCode           1.6269   3  5.6082 0.0009948 ***
## Year_fct:Flow:StationCode  1.6343  12  1.4084 0.1627989    
## Residuals                 22.4343 232                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model diagnostics look okay, but not as good as with the GAM models. Note that the 3-way interaction between Year, Station, and Flow isn’t significant in the ANOVA table. We’ll use m_lm_flow3_lag2 in the model selection process.

rm(m_lm_flow3, m_lm_flow3_lag1)

Model 5: Linear Model with Flow and 2-way interactions

Let’s try a linear model using all two-way interactions between Year, Weekly Average Flow, and Station. Initially, we’ll run the model without accounting for serial autocorrelation.

Initial Model

m_lm_flow2 <- lm(Chla_log ~ (Year_fct + Flow + StationCode)^2, data = df_chla_c2)

Lets look at the model summary and diagnostics:

summary(m_lm_flow2)
## 
## Call:
## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2, data = df_chla_c2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02115 -0.21555 -0.02562  0.20756  1.30970 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   9.6720421  0.1037997  93.180  < 2e-16 ***
## Year_fct2016                 -0.4032148  0.1425002  -2.830  0.00499 ** 
## Year_fct2017                 -0.3833291  0.1386668  -2.764  0.00607 ** 
## Year_fct2018                 -0.2825822  0.1338134  -2.112  0.03557 *  
## Year_fct2019                 -0.0926685  0.1338116  -0.693  0.48917    
## Flow                         -0.0015353  0.0002749  -5.585 5.43e-08 ***
## StationCodeSTTD              -1.0366460  0.1394711  -7.433 1.24e-12 ***
## StationCodeLIB               -2.1586815  0.3294116  -6.553 2.62e-10 ***
## StationCodeRVB               -2.9189896  0.5835889  -5.002 9.93e-07 ***
## Year_fct2016:Flow            -0.0003420  0.0001541  -2.220  0.02721 *  
## Year_fct2017:Flow            -0.0002571  0.0001435  -1.791  0.07431 .  
## Year_fct2018:Flow            -0.0001841  0.0001438  -1.280  0.20143    
## Year_fct2019:Flow            -0.0001688  0.0001439  -1.173  0.24161    
## Year_fct2016:StationCodeSTTD  0.8302909  0.1985714   4.181 3.86e-05 ***
## Year_fct2017:StationCodeSTTD  0.4036756  0.2070786   1.949  0.05223 .  
## Year_fct2018:StationCodeSTTD -0.3665744  0.1925111  -1.904  0.05789 .  
## Year_fct2019:StationCodeSTTD -1.0430308  0.1897975  -5.495 8.63e-08 ***
## Year_fct2016:StationCodeLIB   0.9218534  0.3150367   2.926  0.00371 ** 
## Year_fct2017:StationCodeLIB   0.2307078  0.2962035   0.779  0.43669    
## Year_fct2018:StationCodeLIB  -1.8797945  0.3149611  -5.968 7.08e-09 ***
## Year_fct2019:StationCodeLIB  -0.5044002  0.3185041  -1.584  0.11438    
## Year_fct2016:StationCodeRVB   2.5842980  0.8235230   3.138  0.00188 ** 
## Year_fct2017:StationCodeRVB   1.4826319  0.7325138   2.024  0.04390 *  
## Year_fct2018:StationCodeRVB   0.6227655  0.6661707   0.935  0.35066    
## Year_fct2019:StationCodeRVB  -0.1434418  0.6796939  -0.211  0.83301    
## Flow:StationCodeSTTD          0.0035789  0.0003591   9.966  < 2e-16 ***
## Flow:StationCodeLIB           0.0014129  0.0003025   4.670 4.64e-06 ***
## Flow:StationCodeRVB           0.0017470  0.0002547   6.860 4.26e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3849 on 286 degrees of freedom
## Multiple R-squared:  0.8732, Adjusted R-squared:  0.8612 
## F-statistic: 72.93 on 27 and 286 DF,  p-value: < 2.2e-16
df_chla_c2 %>% plot_lm_diag(Chla_log, m_lm_flow2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm_flow2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm_flow2)
## W = 0.97497, p-value = 2.752e-05
acf(residuals(m_lm_flow2))

Box.test(residuals(m_lm_flow2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_flow2)
## X-squared = 214.85, df = 20, p-value < 2.2e-16

The residuals deviate from a normal distribution according to visual inspection and the Shapiro-Wilk normality test. Also, model definitely has residual autocorrelation as indicated by the ACF plot and the Box-Ljung test.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_lm_flow2_lag1 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1) %>% 
  lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1, data = .)

acf(residuals(m_lm_flow2_lag1))

Box.test(residuals(m_lm_flow2_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_flow2_lag1)
## X-squared = 33.835, df = 20, p-value = 0.02727

Lag 2

m_lm_flow2_lag2 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2, data = .)

acf(residuals(m_lm_flow2_lag2))

Box.test(residuals(m_lm_flow2_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_flow2_lag2)
## X-squared = 16.487, df = 20, p-value = 0.686

The model with 2 lag terms seems to be okay in terms of serial autocorrelation. Let’s use AIC to see how they compare.

Compare Models

AIC(m_lm_flow2, m_lm_flow2_lag1, m_lm_flow2_lag2)
##                 df      AIC
## m_lm_flow2      29 320.2073
## m_lm_flow2_lag1 30 179.7601
## m_lm_flow2_lag2 31 173.1492

Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_lm_flow2_lag2)
## 
## Call:
## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + 
##     lag2, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24138 -0.14868 -0.01613  0.13053  0.87049 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   4.5102528  0.5172489   8.720 4.39e-16 ***
## Year_fct2016                 -0.2225024  0.1268556  -1.754 0.080689 .  
## Year_fct2017                 -0.2049353  0.1228178  -1.669 0.096477 .  
## Year_fct2018                 -0.1265295  0.1180393  -1.072 0.284812    
## Year_fct2019                  0.0042951  0.1172864   0.037 0.970817    
## Flow                         -0.0007791  0.0002437  -3.197 0.001573 ** 
## StationCodeSTTD              -0.4510406  0.1311688  -3.439 0.000687 ***
## StationCodeLIB               -0.8412407  0.3153205  -2.668 0.008144 ** 
## StationCodeRVB               -1.4624572  0.5715023  -2.559 0.011103 *  
## lag1                          0.6683254  0.0642124  10.408  < 2e-16 ***
## lag2                         -0.1363594  0.0621237  -2.195 0.029109 *  
## Year_fct2016:Flow            -0.0002104  0.0001502  -1.400 0.162647    
## Year_fct2017:Flow            -0.0001343  0.0001373  -0.978 0.328924    
## Year_fct2018:Flow            -0.0001202  0.0001369  -0.878 0.380766    
## Year_fct2019:Flow            -0.0001003  0.0001367  -0.733 0.464045    
## Year_fct2016:StationCodeSTTD  0.4173963  0.1782275   2.342 0.019989 *  
## Year_fct2017:StationCodeSTTD  0.1784208  0.1852472   0.963 0.336426    
## Year_fct2018:StationCodeSTTD -0.1664026  0.1689473  -0.985 0.325630    
## Year_fct2019:StationCodeSTTD -0.5590542  0.1729367  -3.233 0.001395 ** 
## Year_fct2016:StationCodeLIB   0.3796717  0.2958825   1.283 0.200645    
## Year_fct2017:StationCodeLIB   0.0333266  0.2679057   0.124 0.901104    
## Year_fct2018:StationCodeLIB  -1.1124628  0.2993880  -3.716 0.000251 ***
## Year_fct2019:StationCodeLIB  -0.3875763  0.2907185  -1.333 0.183720    
## Year_fct2016:StationCodeRVB   1.5642012  0.8259346   1.894 0.059427 .  
## Year_fct2017:StationCodeRVB   0.6738241  0.6836514   0.986 0.325293    
## Year_fct2018:StationCodeRVB   0.4731218  0.6102908   0.775 0.438949    
## Year_fct2019:StationCodeRVB  -0.0331205  0.6185916  -0.054 0.957344    
## Flow:StationCodeSTTD          0.0017758  0.0003334   5.327 2.27e-07 ***
## Flow:StationCodeLIB           0.0008313  0.0002653   3.133 0.001941 ** 
## Flow:StationCodeRVB           0.0009021  0.0002239   4.030 7.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3141 on 244 degrees of freedom
## Multiple R-squared:  0.9182, Adjusted R-squared:  0.9084 
## F-statistic: 94.38 on 29 and 244 DF,  p-value: < 2.2e-16
df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  plot_lm_diag(Chla_log, m_lm_flow2_lag2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm_flow2_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm_flow2_lag2)
## W = 0.95878, p-value = 5.036e-07
Anova(m_lm_flow2_lag2, type = 3, contrasts = list(topic = contr.sum, sys = contr.sum))
## Anova Table (Type III tests)
## 
## Response: Chla_log
##                       Sum Sq  Df  F value    Pr(>F)    
## (Intercept)           7.5000   1  76.0331 4.387e-16 ***
## Year_fct              0.6129   4   1.5533  0.187469    
## Flow                  1.0081   1  10.2202  0.001573 ** 
## StationCode           2.2025   3   7.4427 8.661e-05 ***
## lag1                 10.6856   1 108.3274 < 2.2e-16 ***
## lag2                  0.4752   1   4.8179  0.029109 *  
## Year_fct:Flow         0.2845   4   0.7209  0.578345    
## Year_fct:StationCode  5.2495  12   4.4349 2.004e-06 ***
## Flow:StationCode      2.8070   3   9.4856 5.971e-06 ***
## Residuals            24.0685 244                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model diagnostics look somewhat worse than those for the 3-way interaction model. Note that the 2-way interaction between Year and Flow isn’t significant. We’ll use m_lm_flow2_lag2 in the model selection process.

rm(m_lm_flow2, m_lm_flow2_lag1)

Model 6: Linear Model with 2-way interaction between Station and Year but without Flow

We’ll try running a linear model using a two-way interaction between Year and Station but not including flow as a predictor. Initially, we’ll run the model without accounting for serial autocorrelation.

Initial Model

m_lm_cat2 <- lm(Chla_log ~ Year_fct * StationCode, data = df_chla_c2)

Lets look at the model summary and diagnostics:

summary(m_lm_cat2)
## 
## Call:
## lm(formula = Chla_log ~ Year_fct * StationCode, data = df_chla_c2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35952 -0.25359 -0.05221  0.26722  1.31786 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   9.45473    0.11172  84.627  < 2e-16 ***
## Year_fct2016                 -0.40045    0.16354  -2.449 0.014927 *  
## Year_fct2017                 -0.19510    0.15566  -1.253 0.211051    
## Year_fct2018                 -0.26862    0.15355  -1.749 0.081263 .  
## Year_fct2019                 -0.11515    0.15355  -0.750 0.453881    
## StationCodeSTTD              -0.75644    0.15566  -4.860 1.92e-06 ***
## StationCodeLIB               -1.74964    0.14989 -11.673  < 2e-16 ***
## StationCodeRVB               -1.80959    0.14989 -12.073  < 2e-16 ***
## Year_fct2016:StationCodeSTTD  0.84307    0.22969   3.670 0.000287 ***
## Year_fct2017:StationCodeSTTD  0.10665    0.23653   0.451 0.652400    
## Year_fct2018:StationCodeSTTD -0.35014    0.22269  -1.572 0.116942    
## Year_fct2019:StationCodeSTTD -0.88642    0.21865  -4.054 6.45e-05 ***
## Year_fct2016:StationCodeLIB   1.38154    0.22018   6.275 1.25e-09 ***
## Year_fct2017:StationCodeLIB   0.21272    0.21439   0.992 0.321898    
## Year_fct2018:StationCodeLIB  -1.88368    0.23137  -8.141 1.12e-14 ***
## Year_fct2019:StationCodeLIB  -0.46150    0.24192  -1.908 0.057411 .  
## Year_fct2016:StationCodeRVB   0.60290    0.22371   2.695 0.007444 ** 
## Year_fct2017:StationCodeRVB  -0.07752    0.21439  -0.362 0.717928    
## Year_fct2018:StationCodeRVB  -0.06960    0.21132  -0.329 0.742124    
## Year_fct2019:StationCodeRVB  -0.65957    0.21132  -3.121 0.001980 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4469 on 294 degrees of freedom
## Multiple R-squared:  0.8243, Adjusted R-squared:  0.8129 
## F-statistic: 72.59 on 19 and 294 DF,  p-value: < 2.2e-16
df_chla_c2 %>% plot_lm_diag(Chla_log, m_lm_cat2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm_cat2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm_cat2)
## W = 0.98856, p-value = 0.01425
acf(residuals(m_lm_cat2))

Box.test(residuals(m_lm_cat2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_cat2)
## X-squared = 331.73, df = 20, p-value < 2.2e-16

Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look pretty good. However, the residuals are autocorrelated.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_lm_cat2_lag1 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1) %>% 
  lm(Chla_log ~ Year_fct * StationCode + lag1, data = .)

acf(residuals(m_lm_cat2_lag1))

Box.test(residuals(m_lm_cat2_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_cat2_lag1)
## X-squared = 29.858, df = 20, p-value = 0.07219

Lag 2

m_lm_cat2_lag2 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  lm(Chla_log ~ Year_fct * StationCode + lag1 + lag2, data = .)

acf(residuals(m_lm_cat2_lag2))

Box.test(residuals(m_lm_cat2_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm_cat2_lag2)
## X-squared = 16.063, df = 20, p-value = 0.7127

The model with 2 lag terms seems to be okay in terms of serial autocorrelation. Let’s use AIC to see how they compare.

Compare Models

AIC(m_lm_cat2, m_lm_cat2_lag1, m_lm_cat2_lag2)
##                df      AIC
## m_lm_cat2      21 406.6058
## m_lm_cat2_lag1 22 202.1676
## m_lm_cat2_lag2 23 189.5969

Again, it looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_lm_cat2_lag2)
## 
## Call:
## lm(formula = Chla_log ~ Year_fct * StationCode + lag1 + lag2, 
##     data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32999 -0.16419 -0.02045  0.15089  0.85556 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   3.459253   0.493928   7.004 2.26e-11 ***
## Year_fct2016                 -0.182532   0.129908  -1.405 0.161226    
## Year_fct2017                 -0.078006   0.122065  -0.639 0.523366    
## Year_fct2018                 -0.090019   0.120957  -0.744 0.457437    
## Year_fct2019                  0.002086   0.120402   0.017 0.986191    
## StationCodeSTTD              -0.211284   0.127028  -1.663 0.097499 .  
## StationCodeLIB               -0.625336   0.146282  -4.275 2.72e-05 ***
## StationCodeRVB               -0.644967   0.149378  -4.318 2.27e-05 ***
## lag1                          0.806179   0.061397  13.131  < 2e-16 ***
## lag2                         -0.175878   0.063486  -2.770 0.006017 ** 
## Year_fct2016:StationCodeSTTD  0.336363   0.184635   1.822 0.069675 .  
## Year_fct2017:StationCodeSTTD -0.015832   0.188606  -0.084 0.933169    
## Year_fct2018:StationCodeSTTD -0.128766   0.175323  -0.734 0.463356    
## Year_fct2019:StationCodeSTTD -0.393012   0.176281  -2.229 0.026665 *  
## Year_fct2016:StationCodeLIB   0.535094   0.185883   2.879 0.004337 ** 
## Year_fct2017:StationCodeLIB   0.064763   0.167490   0.387 0.699326    
## Year_fct2018:StationCodeLIB  -0.811907   0.205745  -3.946 0.000103 ***
## Year_fct2019:StationCodeLIB  -0.214839   0.196894  -1.091 0.276254    
## Year_fct2016:StationCodeRVB   0.233510   0.177836   1.313 0.190356    
## Year_fct2017:StationCodeRVB  -0.061986   0.167397  -0.370 0.711475    
## Year_fct2018:StationCodeRVB  -0.025504   0.164606  -0.155 0.876994    
## Year_fct2019:StationCodeRVB  -0.285560   0.167246  -1.707 0.088974 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3279 on 252 degrees of freedom
## Multiple R-squared:  0.9079, Adjusted R-squared:  0.9002 
## F-statistic: 118.2 on 21 and 252 DF,  p-value: < 2.2e-16
df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  plot_lm_diag(Chla_log, m_lm_cat2_lag2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm_cat2_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm_cat2_lag2)
## W = 0.94723, p-value = 2.28e-08
Anova(m_lm_cat2_lag2, type = 3, contrasts = list(topic = contr.sum, sys = contr.sum))
## Anova Table (Type III tests)
## 
## Response: Chla_log
##                       Sum Sq  Df  F value    Pr(>F)    
## (Intercept)           5.2737   1  49.0497 2.262e-11 ***
## Year_fct              0.3050   4   0.7092 0.5862914    
## StationCode           2.5688   3   7.9641 4.295e-05 ***
## lag1                 18.5375   1 172.4135 < 2.2e-16 ***
## lag2                  0.8252   1   7.6747 0.0060171 ** 
## Year_fct:StationCode  3.8050  12   2.9492 0.0007352 ***
## Residuals            27.0944 252                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model diagnostics don’t look that great. However, we’ll use m_lm_cat2_lag2 in the model selection process.

rm(m_lm_cat2, m_lm_cat2_lag1)

Model 7: GAM model using smooths for Flow

Finally, we’ll try running a GAM model using smooths for weekly average flow by Station and Year and a smooth term for week number to account for seasonality. Initially, we’ll run the model without accounting for serial autocorrelation.

Initial Model

m_gam_sflow <- gam(
  Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + Year_fct * StationCode + s(Week, bs = "cc", k = 5),
  data = df_chla_c2,
  method = "REML",
  knots = list(week = c(0, 52))
)

Lets look at the model summary and diagnostics:

summary(m_gam_sflow)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + 
##     Year_fct * StationCode + s(Week, bs = "cc", k = 5)
## 
## Parametric coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.9415     0.4617  15.035  < 2e-16 ***
## Year_fct2016                  -0.8407     0.3009  -2.794 0.005564 ** 
## Year_fct2017                  -0.5569     0.2631  -2.117 0.035152 *  
## Year_fct2018                  -0.4401     0.2562  -1.718 0.086918 .  
## Year_fct2019                  -0.2726     0.2859  -0.953 0.341160    
## StationCodeSTTD               -8.1285    37.3586  -0.218 0.827913    
## StationCodeLIB                 0.4849     0.6923   0.700 0.484250    
## StationCodeRVB                 0.5447     0.6180   0.881 0.378860    
## Year_fct2016:StationCodeSTTD   0.8281     0.1736   4.771 2.95e-06 ***
## Year_fct2017:StationCodeSTTD   0.3340     0.1816   1.839 0.066926 .  
## Year_fct2018:StationCodeSTTD  -0.2712     0.1689  -1.606 0.109483    
## Year_fct2019:StationCodeSTTD  -0.9023     0.1695  -5.324 2.08e-07 ***
## Year_fct2016:StationCodeLIB    1.3519     0.3581   3.776 0.000195 ***
## Year_fct2017:StationCodeLIB    0.4174     0.2613   1.598 0.111254    
## Year_fct2018:StationCodeLIB   -1.6855     0.2793  -6.034 5.05e-09 ***
## Year_fct2019:StationCodeLIB   -0.4878     0.3338  -1.461 0.145045    
## Year_fct2016:StationCodeRVB    1.4699     0.9618   1.528 0.127582    
## Year_fct2017:StationCodeRVB    0.5908     0.6516   0.907 0.365409    
## Year_fct2018:StationCodeRVB    0.2254     0.5915   0.381 0.703400    
## Year_fct2019:StationCodeRVB   -0.2519     0.8749  -0.288 0.773628    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf    Ref.df      F  p-value    
## s(Flow):StationCodeRD22 1.137e-05 2.237e-05  0.071    0.999    
## s(Flow):StationCodeSTTD 2.120e+00 2.315e+00 72.452  < 2e-16 ***
## s(Flow):StationCodeLIB  1.000e+00 1.000e+00 23.524 2.46e-06 ***
## s(Flow):StationCodeRVB  1.000e+00 1.000e+00 38.494  < 2e-16 ***
## s(Flow):Year_fct2015    1.000e+00 1.000e+00 31.991  < 2e-16 ***
## s(Flow):Year_fct2016    1.309e+00 1.567e+00 29.104 3.49e-07 ***
## s(Flow):Year_fct2017    1.000e+00 1.000e+00 38.052  < 2e-16 ***
## s(Flow):Year_fct2018    1.000e+00 1.000e+00 37.232  < 2e-16 ***
## s(Flow):Year_fct2019    2.159e+00 2.768e+00 15.315  < 2e-16 ***
## s(Week)                 2.719e+00 3.000e+00 25.866  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 103/104
## R-sq.(adj) =  0.895   Deviance explained = 90.6%
## -REML = 127.47  Scale est. = 0.11201   n = 314
appraise(m_gam_sflow)

shapiro.test(residuals(m_gam_sflow))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_sflow)
## W = 0.98185, p-value = 0.0005262
k.check(m_gam_sflow)
##                         k'          edf   k-index p-value
## s(Flow):StationCodeRD22  9 1.137053e-05 1.0644839  0.8375
## s(Flow):StationCodeSTTD  9 2.120242e+00 1.0644839  0.8475
## s(Flow):StationCodeLIB   9 1.000027e+00 1.0644839  0.8500
## s(Flow):StationCodeRVB   9 1.000030e+00 1.0644839  0.8650
## s(Flow):Year_fct2015     9 1.000027e+00 1.0644839  0.8475
## s(Flow):Year_fct2016     9 1.309135e+00 1.0644839  0.8550
## s(Flow):Year_fct2017     9 1.000037e+00 1.0644839  0.8500
## s(Flow):Year_fct2018     9 1.000033e+00 1.0644839  0.8675
## s(Flow):Year_fct2019     9 2.159088e+00 1.0644839  0.8625
## s(Week)                  3 2.719405e+00 0.9426987  0.1400
concurvity(m_gam_sflow, full = FALSE)$worst
##                                 para s(Flow):StationCodeRD22
## para                    1.000000e+00            2.643318e-01
## s(Flow):StationCodeRD22 2.643318e-01            1.000000e+00
## s(Flow):StationCodeSTTD 2.261153e-01            1.761594e-06
## s(Flow):StationCodeLIB  2.292994e-01            7.714390e-07
## s(Flow):StationCodeRVB  2.802548e-01            1.376958e-06
## s(Flow):Year_fct2015    2.324841e-01            3.251023e-01
## s(Flow):Year_fct2016    1.907664e-01            2.856469e-01
## s(Flow):Year_fct2017    1.940114e-01            2.319233e-01
## s(Flow):Year_fct2018    1.910646e-01            2.226980e-01
## s(Flow):Year_fct2019    1.880875e-01            3.424988e-01
## s(Week)                 1.733484e-31            1.574509e-01
##                         s(Flow):StationCodeSTTD s(Flow):StationCodeLIB
## para                               2.261152e-01           2.292994e-01
## s(Flow):StationCodeRD22            1.418504e-06           3.146502e-07
## s(Flow):StationCodeSTTD            1.000000e+00           1.787493e-07
## s(Flow):StationCodeLIB             1.140567e-07           1.000000e+00
## s(Flow):StationCodeRVB             1.777024e-07           1.962974e-11
## s(Flow):Year_fct2015               2.212699e-01           9.174395e-01
## s(Flow):Year_fct2016               1.576930e-01           5.098821e-01
## s(Flow):Year_fct2017               1.064600e-01           3.817659e-01
## s(Flow):Year_fct2018               1.427320e-01           3.281525e-01
## s(Flow):Year_fct2019               3.311807e-01           2.162509e-01
## s(Week)                            5.820526e-02           6.532376e-02
##                         s(Flow):StationCodeRVB s(Flow):Year_fct2015
## para                              2.802548e-01         2.324841e-01
## s(Flow):StationCodeRD22           1.232514e-06         3.250229e-01
## s(Flow):StationCodeSTTD           2.346023e-07         2.212756e-01
## s(Flow):StationCodeLIB            1.618413e-11         9.174394e-01
## s(Flow):StationCodeRVB            1.000000e+00         9.295169e-01
## s(Flow):Year_fct2015              9.295169e-01         1.000000e+00
## s(Flow):Year_fct2016              4.867459e-01         3.209744e-20
## s(Flow):Year_fct2017              9.999383e-01         1.418664e-21
## s(Flow):Year_fct2018              6.431556e-01         7.900257e-21
## s(Flow):Year_fct2019              4.786662e-01         1.803157e-20
## s(Week)                           1.477669e-01         1.014688e-01
##                         s(Flow):Year_fct2016 s(Flow):Year_fct2017
## para                            1.907664e-01         1.940114e-01
## s(Flow):StationCodeRD22         2.856517e-01         2.318611e-01
## s(Flow):StationCodeSTTD         1.576967e-01         1.064645e-01
## s(Flow):StationCodeLIB          5.098822e-01         3.817659e-01
## s(Flow):StationCodeRVB          4.867459e-01         9.999383e-01
## s(Flow):Year_fct2015            3.417427e-20         1.074026e-21
## s(Flow):Year_fct2016            1.000000e+00         8.226970e-26
## s(Flow):Year_fct2017            1.099840e-25         1.000000e+00
## s(Flow):Year_fct2018            1.086091e-25         1.775469e-25
## s(Flow):Year_fct2019            1.823744e-25         6.782206e-27
## s(Week)                         1.081710e-01         6.038884e-02
##                         s(Flow):Year_fct2018 s(Flow):Year_fct2019      s(Week)
## para                            1.910646e-01         1.880875e-01 2.183715e-31
## s(Flow):StationCodeRD22         2.227087e-01         3.424802e-01 1.574083e-01
## s(Flow):StationCodeSTTD         1.427425e-01         3.311824e-01 5.820393e-02
## s(Flow):StationCodeLIB          3.281525e-01         2.162509e-01 6.532370e-02
## s(Flow):StationCodeRVB          6.431556e-01         4.786662e-01 1.477669e-01
## s(Flow):Year_fct2015            8.133545e-21         1.870799e-20 1.014688e-01
## s(Flow):Year_fct2016            1.363686e-25         6.362499e-26 1.081710e-01
## s(Flow):Year_fct2017            1.645895e-25         7.481050e-27 6.038884e-02
## s(Flow):Year_fct2018            1.000000e+00         1.794354e-25 7.451261e-02
## s(Flow):Year_fct2019            1.956784e-25         1.000000e+00 7.697321e-02
## s(Week)                         7.451261e-02         7.697321e-02 1.000000e+00
draw(m_gam_sflow, select = 10, residuals = TRUE, rug = FALSE)

plot(m_gam_sflow, pages = 1, all.terms = TRUE)

acf(residuals(m_gam_sflow))

Box.test(residuals(m_gam_sflow), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_sflow)
## X-squared = 139.11, df = 20, p-value < 2.2e-16

The diagnostic plots look really good. However, the residuals are autocorrelated.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation. We’ll run a series of models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_gam_sflow_lag1 <- gam(
  Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1, 
  data = df_chla_c2_lag,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_sflow_lag1))

Box.test(residuals(m_gam_sflow_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_sflow_lag1)
## X-squared = 31.058, df = 20, p-value = 0.05443

Lag 2

m_gam_sflow_lag2 <- gam(
  Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 + lag2, 
  data = df_chla_c2_lag,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_sflow_lag2))

Box.test(residuals(m_gam_sflow_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_sflow_lag2)
## X-squared = 20.838, df = 20, p-value = 0.4067

The model with 2 lag terms seems to be okay in terms of serial autocorrelation. Let’s use AIC to see how they compare.

Compare Models

AIC(m_gam_sflow, m_gam_sflow_lag1, m_gam_sflow_lag2)
##                        df      AIC
## m_gam_sflow      35.59213 239.6761
## m_gam_sflow_lag1 35.54780 146.3048
## m_gam_sflow_lag2 35.57415 141.1547

It looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_gam_sflow_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + 
##     Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 + 
##     lag2
## 
## Parametric coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   3.78540    0.62649   6.042 5.75e-09 ***
## Year_fct2016                 -0.59159    0.31742  -1.864  0.06358 .  
## Year_fct2017                 -0.33673    0.29720  -1.133  0.25834    
## Year_fct2018                 -0.27953    0.29071  -0.962  0.33725    
## Year_fct2019                 -0.05803    0.29006  -0.200  0.84160    
## StationCodeSTTD              -9.99171   22.15069  -0.451  0.65234    
## StationCodeLIB                0.49217    0.67725   0.727  0.46811    
## StationCodeRVB                0.55317    0.64109   0.863  0.38907    
## lag1                          0.56833    0.06260   9.078  < 2e-16 ***
## lag2                         -0.13552    0.05828  -2.325  0.02089 *  
## Year_fct2016:StationCodeSTTD  0.52390    0.16786   3.121  0.00202 ** 
## Year_fct2017:StationCodeSTTD  0.23333    0.17447   1.337  0.18237    
## Year_fct2018:StationCodeSTTD -0.10863    0.15940  -0.681  0.49622    
## Year_fct2019:StationCodeSTTD -0.52711    0.16512  -3.192  0.00160 ** 
## Year_fct2016:StationCodeLIB   0.83072    0.32873   2.527  0.01214 *  
## Year_fct2017:StationCodeLIB   0.35730    0.29392   1.216  0.22532    
## Year_fct2018:StationCodeLIB  -0.96976    0.31585  -3.070  0.00238 ** 
## Year_fct2019:StationCodeLIB  -0.14044    0.31112  -0.451  0.65210    
## Year_fct2016:StationCodeRVB   0.99033    0.82897   1.195  0.23340    
## Year_fct2017:StationCodeRVB   0.08019    0.68487   0.117  0.90689    
## Year_fct2018:StationCodeRVB   0.04818    0.61823   0.078  0.93795    
## Year_fct2019:StationCodeRVB  -0.67188    0.62938  -1.068  0.28680    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf    Ref.df      F  p-value    
## s(Flow):StationCodeRD22 1.602e-05 3.192e-05  0.001 0.999860    
## s(Flow):StationCodeSTTD 1.955e+00 2.149e+00 23.362 2.24e-07 ***
## s(Flow):StationCodeLIB  1.000e+00 1.000e+00 10.957 0.001076 ** 
## s(Flow):StationCodeRVB  1.000e+00 1.000e+00 17.521 4.04e-05 ***
## s(Flow):Year_fct2015    1.323e+00 1.564e+00  8.370 0.000908 ***
## s(Flow):Year_fct2016    1.010e+00 1.019e+00 18.117 3.07e-05 ***
## s(Flow):Year_fct2017    1.000e+00 1.000e+00 16.652 6.13e-05 ***
## s(Flow):Year_fct2018    1.000e+00 1.000e+00 16.804 5.69e-05 ***
## s(Flow):Year_fct2019    1.000e+00 1.000e+00 15.502 0.000108 ***
## s(Week)                 2.510e+00 3.000e+00 10.050 1.08e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 105/106
## R-sq.(adj) =   0.92   Deviance explained =   93%
## -REML = 80.961  Scale est. = 0.086231  n = 274
appraise(m_gam_sflow_lag2)

shapiro.test(residuals(m_gam_sflow_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_sflow_lag2)
## W = 0.96744, p-value = 7.155e-06
k.check(m_gam_sflow_lag2)
##                         k'          edf   k-index p-value
## s(Flow):StationCodeRD22  9 1.601793e-05 1.1504105  1.0000
## s(Flow):StationCodeSTTD  9 1.954583e+00 1.1504105  0.9950
## s(Flow):StationCodeLIB   9 1.000016e+00 1.1504105  0.9975
## s(Flow):StationCodeRVB   9 1.000014e+00 1.1504105  0.9925
## s(Flow):Year_fct2015     9 1.323433e+00 1.1504105  0.9900
## s(Flow):Year_fct2016     9 1.009550e+00 1.1504105  0.9925
## s(Flow):Year_fct2017     9 1.000007e+00 1.1504105  0.9950
## s(Flow):Year_fct2018     9 1.000019e+00 1.1504105  0.9950
## s(Flow):Year_fct2019     9 1.000143e+00 1.1504105  0.9900
## s(Week)                  3 2.510248e+00 0.9407824  0.1625
concurvity(m_gam_sflow_lag2, full = FALSE)$worst
##                                 para s(Flow):StationCodeRD22
## para                    1.000000e+00            2.664237e-01
## s(Flow):StationCodeRD22 2.664234e-01            1.000000e+00
## s(Flow):StationCodeSTTD 2.226278e-01            1.638181e-06
## s(Flow):StationCodeLIB  2.262774e-01            4.733228e-07
## s(Flow):StationCodeRVB  2.846715e-01            1.811609e-06
## s(Flow):Year_fct2015    2.372263e-01            3.306389e-01
## s(Flow):Year_fct2016    1.896821e-01            3.062741e-01
## s(Flow):Year_fct2017    1.932414e-01            2.559599e-01
## s(Flow):Year_fct2018    1.897747e-01            2.486411e-01
## s(Flow):Year_fct2019    1.885812e-01            3.588243e-01
## s(Week)                 1.158819e-31            1.678606e-01
##                         s(Flow):StationCodeSTTD s(Flow):StationCodeLIB
## para                               2.226279e-01           2.262774e-01
## s(Flow):StationCodeRD22            3.662718e-06           5.732367e-08
## s(Flow):StationCodeSTTD            1.000000e+00           2.603309e-08
## s(Flow):StationCodeLIB             2.505693e-07           1.000000e+00
## s(Flow):StationCodeRVB             5.268892e-08           1.954079e-11
## s(Flow):Year_fct2015               2.169073e-01           9.582905e-01
## s(Flow):Year_fct2016               1.409557e-01           5.240970e-01
## s(Flow):Year_fct2017               9.562028e-02           4.111986e-01
## s(Flow):Year_fct2018               1.610252e-01           4.775918e-01
## s(Flow):Year_fct2019               3.478030e-01           1.863800e-01
## s(Week)                            8.184826e-02           8.951237e-02
##                         s(Flow):StationCodeRVB s(Flow):Year_fct2015
## para                              2.846715e-01         2.372263e-01
## s(Flow):StationCodeRD22           1.299236e-06         3.306299e-01
## s(Flow):StationCodeSTTD           2.670124e-08         2.169129e-01
## s(Flow):StationCodeLIB            1.082367e-11         9.582905e-01
## s(Flow):StationCodeRVB            1.000000e+00         9.204540e-01
## s(Flow):Year_fct2015              9.204540e-01         1.000000e+00
## s(Flow):Year_fct2016              4.552008e-01         1.152022e-21
## s(Flow):Year_fct2017              9.999652e-01         2.445211e-21
## s(Flow):Year_fct2018              6.738619e-01         1.662580e-21
## s(Flow):Year_fct2019              5.049372e-01         3.243414e-21
## s(Week)                           1.440000e-01         1.209400e-01
##                         s(Flow):Year_fct2016 s(Flow):Year_fct2017
## para                            1.896821e-01         1.932414e-01
## s(Flow):StationCodeRD22         3.062808e-01         2.560147e-01
## s(Flow):StationCodeSTTD         1.409590e-01         9.560616e-02
## s(Flow):StationCodeLIB          5.240970e-01         4.111984e-01
## s(Flow):StationCodeRVB          4.552008e-01         9.999652e-01
## s(Flow):Year_fct2015            2.908201e-21         1.553210e-21
## s(Flow):Year_fct2016            1.000000e+00         9.868853e-26
## s(Flow):Year_fct2017            2.030612e-25         1.000000e+00
## s(Flow):Year_fct2018            4.726143e-25         1.671147e-25
## s(Flow):Year_fct2019            7.395786e-26         2.911772e-26
## s(Week)                         1.207275e-01         6.868218e-02
##                         s(Flow):Year_fct2018 s(Flow):Year_fct2019      s(Week)
## para                            1.897747e-01         1.885812e-01 3.616743e-32
## s(Flow):StationCodeRD22         2.486213e-01         3.587966e-01 1.678500e-01
## s(Flow):StationCodeSTTD         1.610299e-01         3.478107e-01 8.184815e-02
## s(Flow):StationCodeLIB          4.775917e-01         1.863799e-01 8.951238e-02
## s(Flow):StationCodeRVB          6.738619e-01         5.049372e-01 1.440000e-01
## s(Flow):Year_fct2015            1.393761e-21         2.220073e-21 1.209400e-01
## s(Flow):Year_fct2016            1.159538e-25         9.089073e-26 1.207275e-01
## s(Flow):Year_fct2017            2.404126e-25         2.579184e-26 6.868218e-02
## s(Flow):Year_fct2018            1.000000e+00         9.247148e-26 8.522678e-02
## s(Flow):Year_fct2019            9.564460e-26         1.000000e+00 9.372902e-02
## s(Week)                         8.522678e-02         9.372902e-02 1.000000e+00
draw(m_gam_sflow_lag2, select = 1:4, residuals = TRUE, rug = FALSE)

draw(m_gam_sflow_lag2, select = 5:9, residuals = TRUE, rug = FALSE)

draw(m_gam_sflow_lag2, select = 10, residuals = TRUE, rug = FALSE)

anova(m_gam_sflow_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + 
##     Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 + 
##     lag2
## 
## Parametric Terms:
##                      df      F  p-value
## Year_fct              4  2.543   0.0403
## StationCode           3  0.406   0.7490
## lag1                  1 82.411  < 2e-16
## lag2                  1  5.407   0.0209
## Year_fct:StationCode 12  5.961 4.38e-09
## 
## Approximate significance of smooth terms:
##                               edf    Ref.df      F  p-value
## s(Flow):StationCodeRD22 1.602e-05 3.192e-05  0.001 0.999860
## s(Flow):StationCodeSTTD 1.955e+00 2.149e+00 23.362 2.24e-07
## s(Flow):StationCodeLIB  1.000e+00 1.000e+00 10.957 0.001076
## s(Flow):StationCodeRVB  1.000e+00 1.000e+00 17.521 4.04e-05
## s(Flow):Year_fct2015    1.323e+00 1.564e+00  8.370 0.000908
## s(Flow):Year_fct2016    1.010e+00 1.019e+00 18.117 3.07e-05
## s(Flow):Year_fct2017    1.000e+00 1.000e+00 16.652 6.13e-05
## s(Flow):Year_fct2018    1.000e+00 1.000e+00 16.804 5.69e-05
## s(Flow):Year_fct2019    1.000e+00 1.000e+00 15.502 0.000108
## s(Week)                 2.510e+00 3.000e+00 10.050 1.08e-06

The model diagnostics look a little worse than with the initial model but they still look pretty good. Note that the approximate significance of all smooth terms are greater than 0.05 except for the s(Week) term. We’ll use m_gam_sflow_lag2 in the model selection process.

rm(m_gam_sflow, m_gam_sflow_lag1)

Model selection

Now we’ll compare the seven candidate models with AIC to select the one with the best fit. As a summary, here are the 7 models we are comparing:

  • Model 1 - m_gam_flow3_lag2 - GAM 3-way interactions with s(Week)
    Formula: Chla_log ~ Year_fct * Flow * StationCode + s(Week, bs = “cc”, k = 5) + lag1 + lag2

  • Model 2 - m_gam_flow2_lag2 - GAM 2-way interactions with s(Week)
    Formula: Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(Week, bs = “cc”, k = 5) + lag1 + lag2

  • Model 3 - m_gam_cat2_lag2 - GAM 2-way interaction between Station and Year with s(Week) but without Flow
    Formula: Chla_log ~ Year_fct * StationCode + s(Week, bs = “cc”, k = 5) + lag1 + lag2

  • Model 4 - m_lm_flow3_lag2 - LM 3-way interactions without seasonal term
    Formula: Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2

  • Model 5 - m_lm_flow2_lag2 - LM 2-way interactions without seasonal term
    Formula: Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2

  • Model 6 - m_lm_cat2_lag2 - LM 2-way interaction between Station and Year but without Flow and seasonal term
    Formula: Chla_log ~ Year_fct * StationCode + lag1 + lag2

  • Model 7 - m_gam_sflow_lag2 - GAM using smooths for Flow with s(Week)
    Formula: Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + Year_fct * StationCode + s(Week, bs = “cc”, k = 5) + lag1 + lag2
# AIC values
df_m_aic <- 
  AIC(
    m_gam_flow3_lag2,
    m_gam_flow2_lag2,
    m_gam_cat2_lag2,
    m_lm_flow3_lag2,
    m_lm_flow2_lag2,
    m_lm_cat2_lag2,
    m_gam_sflow_lag2
  ) %>% 
  as_tibble(rownames = "Model") %>% 
  mutate(Model_Number = 1:7, .before = Model)

# BIC values
df_m_bic <- 
  BIC(
    m_gam_flow3_lag2,
    m_gam_flow2_lag2,
    m_gam_cat2_lag2,
    m_lm_flow3_lag2,
    m_lm_flow2_lag2,
    m_lm_cat2_lag2,
    m_gam_sflow_lag2
  ) %>% 
  as_tibble(rownames = "Model")

# Combine AIC and BIC and calculate differences from lowest value
df_m_aic_bic <- 
  left_join(df_m_aic, df_m_bic, by = join_by(Model, df)) %>% 
  mutate(across(c(AIC, BIC), ~ .x - min(.x), .names = "{.col}_delta")) %>% 
  select(starts_with("Model"), df, starts_with("AIC"), starts_with("BIC"))

# Sort by AIC
df_m_aic_bic %>% arrange(AIC)
## # A tibble: 7 × 7
##   Model_Number Model               df   AIC AIC_delta   BIC BIC_delta
##          <int> <chr>            <dbl> <dbl>     <dbl> <dbl>     <dbl>
## 1            7 m_gam_sflow_lag2  35.6  141.      0     270.     0.697
## 2            2 m_gam_flow2_lag2  33.8  148.      6.85  270.     1.12 
## 3            1 m_gam_flow3_lag2  45.8  154.     13.2   320.    50.9  
## 4            5 m_lm_flow2_lag2   31    173.     32.0   285.    16.2  
## 5            3 m_gam_cat2_lag2   25.6  176.     35.2   269.     0    
## 6            4 m_lm_flow3_lag2   43    178.     36.7   333.    64.3  
## 7            6 m_lm_cat2_lag2    23    190.     48.4   273.     3.71

Export the AIC/BIC table.

df_m_aic_bic %>% 
  arrange(AIC) %>% 
  mutate(across(where(is.numeric), ~ paste0(formatC(.x, digits = 1, format = "f"), "##"))) %>% 
  write_csv(here("manuscript_synthesis/results/tables/chl_aic_weekly_models.csv"))

According to AIC, model 7 (GAM using smooths for Flow with s(Week)) was the model with the best fit. BIC preferred model 3 (GAM 2-way interaction between Station and Year with s(Week) but without Flow) with model 7 coming in close second place. Before we proceed with model 7, let’s revisit the model diagnostics and take a closer look at how the back-transformed fitted values from the model match the observed values.

Model 7 diagnostics

appraise(m_gam_sflow_lag2)

shapiro.test(residuals(m_gam_sflow_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_sflow_lag2)
## W = 0.96744, p-value = 7.155e-06
k.check(m_gam_sflow_lag2)
##                         k'          edf   k-index p-value
## s(Flow):StationCodeRD22  9 1.601793e-05 1.1504105  0.9875
## s(Flow):StationCodeSTTD  9 1.954583e+00 1.1504105  0.9900
## s(Flow):StationCodeLIB   9 1.000016e+00 1.1504105  0.9925
## s(Flow):StationCodeRVB   9 1.000014e+00 1.1504105  0.9950
## s(Flow):Year_fct2015     9 1.323433e+00 1.1504105  0.9875
## s(Flow):Year_fct2016     9 1.009550e+00 1.1504105  0.9900
## s(Flow):Year_fct2017     9 1.000007e+00 1.1504105  0.9950
## s(Flow):Year_fct2018     9 1.000019e+00 1.1504105  0.9950
## s(Flow):Year_fct2019     9 1.000143e+00 1.1504105  0.9975
## s(Week)                  3 2.510248e+00 0.9407824  0.1600
draw(m_gam_sflow_lag2, select = 1:4, residuals = TRUE, rug = FALSE)

draw(m_gam_sflow_lag2, select = 5:9, residuals = TRUE, rug = FALSE)

draw(m_gam_sflow_lag2, select = 10, residuals = TRUE, rug = FALSE)

anova(m_gam_sflow_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ s(Flow, by = StationCode) + s(Flow, by = Year_fct) + 
##     Year_fct * StationCode + s(Week, bs = "cc", k = 5) + lag1 + 
##     lag2
## 
## Parametric Terms:
##                      df      F  p-value
## Year_fct              4  2.543   0.0403
## StationCode           3  0.406   0.7490
## lag1                  1 82.411  < 2e-16
## lag2                  1  5.407   0.0209
## Year_fct:StationCode 12  5.961 4.38e-09
## 
## Approximate significance of smooth terms:
##                               edf    Ref.df      F  p-value
## s(Flow):StationCodeRD22 1.602e-05 3.192e-05  0.001 0.999860
## s(Flow):StationCodeSTTD 1.955e+00 2.149e+00 23.362 2.24e-07
## s(Flow):StationCodeLIB  1.000e+00 1.000e+00 10.957 0.001076
## s(Flow):StationCodeRVB  1.000e+00 1.000e+00 17.521 4.04e-05
## s(Flow):Year_fct2015    1.323e+00 1.564e+00  8.370 0.000908
## s(Flow):Year_fct2016    1.010e+00 1.019e+00 18.117 3.07e-05
## s(Flow):Year_fct2017    1.000e+00 1.000e+00 16.652 6.13e-05
## s(Flow):Year_fct2018    1.000e+00 1.000e+00 16.804 5.69e-05
## s(Flow):Year_fct2019    1.000e+00 1.000e+00 15.502 0.000108
## s(Week)                 2.510e+00 3.000e+00 10.050 1.08e-06

Model 7 observed vs fitted values

df_chla_c2_lag2 <- df_chla_c2_lag %>% drop_na(lag1, lag2)

df_m_gam_sflow_lag2_fit <- df_chla_c2_lag2 %>% 
  fitted_values(m_gam_sflow_lag2, data = .) %>% 
  mutate(fitted_bt = exp(.fitted) / 1000)

plt_m_gam_sflow_lag2_fit <- df_m_gam_sflow_lag2_fit %>% 
  ggplot(aes(x = fitted_bt, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_m_gam_sflow_lag2_fit

Let’s group by station.

plt_m_gam_sflow_lag2_fit + facet_wrap(vars(StationCode), scales = "free")

Now, group by year.

plt_m_gam_sflow_lag2_fit + facet_wrap(vars(Year_fct), scales = "free")

Everything looks pretty decent with this model. Not perfect, but pretty good given the number of data points. Note that variability does increase as the chlorophyll values increase. Before proceeding with model 7, let’s look more closely at its results.

Model 7 Results

Effect of Flow by Station

# Calculate min and max flows for each station to narrow down x-axis in the plot
df_chla_flow_sta_summ <- df_chla_c2_lag2 %>% 
  summarize(
    Flow_min = min(Flow),
    Flow_max = max(Flow),
    .by = c(StationCode)
  ) %>% 
  mutate(
    Flow_buffer = (Flow_max - Flow_min) * 0.05,
    Flow_min = Flow_min - Flow_buffer,
    Flow_max = Flow_max + Flow_buffer
  )

# Calculate effects of flow on chlorophyll for each station holding the
  # non-focal variables constant - marginal effects/adjusted predictions
df_gam_flow_sta_eff <- 
  as.data.frame(
    predict_response(
      m_gam_sflow_lag2, 
      terms = c("Flow", "StationCode"),
      margin = "marginalmeans"
    ),
    terms_to_colnames = TRUE
  ) %>% 
  as_tibble() %>%
  # Narrow down range of flow values for each station
  left_join(df_chla_flow_sta_summ, by = join_by(StationCode)) %>% 
  filter(Flow >= Flow_min & Flow <= Flow_max) %>% 
  transmute(
    StationCode,
    Flow,
    # Back calculate model fits and confidence levels
    across(c(predicted, conf.low, conf.high), ~ exp(.x) / 1000)
  )

# Create effects plot
plt_gam_flow_sta_eff <- df_gam_flow_sta_eff %>% 
  ggplot(aes(x = Flow, y = predicted)) +
  geom_point(
    data = df_chla_c2_lag2,
    aes(y = Chla, color = Year_fct),
    alpha = 0.6
  ) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.25) +
  facet_wrap(vars(StationCode), scales = "free") +
  theme_bw() +
  labs(
    x = "Flow (cfs)",
    y = expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))
  ) +
  scale_x_continuous(breaks = breaks_extended(6)) +
  scale_color_viridis_d(name = "Year", option = "C")

plt_gam_flow_sta_eff

These results look reasonable.

Effect of Flow by Year

# Calculate min and max flows for each station to narrow down x-axis in the plot
df_chla_flow_yr_summ <- df_chla_c2_lag2 %>% 
  summarize(
    Flow_min = min(Flow),
    Flow_max = max(Flow),
    .by = c(Year_fct)
  ) %>% 
  mutate(
    Flow_buffer = (Flow_max - Flow_min) * 0.05,
    Flow_min = Flow_min - Flow_buffer,
    Flow_max = Flow_max + Flow_buffer
  )

# Calculate effects of flow on chlorophyll for each year holding the
  # non-focal variables constant - marginal effects/adjusted predictions
df_gam_flow_yr_eff <- 
  as.data.frame(
    predict_response(
      m_gam_sflow_lag2, 
      terms = c("Flow", "Year_fct"),
      margin = "marginalmeans"
    ),
    terms_to_colnames = TRUE
  ) %>% 
  as_tibble() %>% 
  # Narrow down range of flow values for each station
  left_join(df_chla_flow_yr_summ, by = join_by(Year_fct)) %>% 
  filter(Flow >= Flow_min & Flow <= Flow_max) %>% 
  transmute(
    Year_fct,
    Flow,
    # Back calculate model fits and confidence levels
    across(c(predicted, conf.low, conf.high), ~ exp(.x) / 1000)
  )

# Create effects plot
plt_gam_flow_yr_eff <- df_gam_flow_yr_eff %>% 
  ggplot(aes(x = Flow, y = predicted)) +
  geom_point(
    data = df_chla_c2_lag2,
    aes(y = Chla, color = StationCode),
    alpha = 0.6
  ) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.25) +
  facet_wrap(vars(Year_fct), scales = "free") +
  theme_bw() +
  labs(
    x = "Flow (cfs)",
    y = expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))
  ) +
  scale_x_continuous(breaks = breaks_extended(6)) +
  scale_color_viridis_d(name = "Station", option = "C") +
  theme(
    legend.margin = margin(0, 0, 0, 0),
    legend.position = "inside",
    legend.position.inside = c(0.8, 0.3)
  )

plt_gam_flow_yr_eff

There is a lot of uncertainty in the model results at the highest flows. This seems problematic.

Station by Year Contrasts

# Estimated marginal means for station by year
em_gam_sta_yr <- emmeans(m_gam_sflow_lag2, ~ StationCode | Year_fct)

# Tukey post-hoc contrasts
pairs(em_gam_sta_yr)
## Year_fct = 2015:
##  contrast    estimate    SE  df t.ratio p.value
##  RD22 - STTD    3.922 4.145 240   0.946  0.7799
##  RD22 - LIB    -0.492 0.676 240  -0.729  0.8856
##  RD22 - RVB    -0.553 0.639 240  -0.865  0.8228
##  STTD - LIB    -4.414 4.145 240  -1.065  0.7113
##  STTD - RVB    -4.475 4.193 240  -1.067  0.7098
##  LIB - RVB     -0.061 0.798 240  -0.076  0.9998
## 
## Year_fct = 2016:
##  contrast    estimate    SE  df t.ratio p.value
##  RD22 - STTD    3.398 4.140 240   0.821  0.8446
##  RD22 - LIB    -1.323 0.679 240  -1.949  0.2104
##  RD22 - RVB    -1.544 0.721 240  -2.139  0.1437
##  STTD - LIB    -4.721 4.160 240  -1.135  0.6684
##  STTD - RVB    -4.941 4.163 240  -1.187  0.6356
##  LIB - RVB     -0.221 0.857 240  -0.257  0.9940
## 
## Year_fct = 2017:
##  contrast    estimate    SE  df t.ratio p.value
##  RD22 - STTD    3.689 4.136 240   0.892  0.8092
##  RD22 - LIB    -0.849 0.635 240  -1.337  0.5404
##  RD22 - RVB    -0.633 0.587 240  -1.080  0.7023
##  STTD - LIB    -4.538 4.147 240  -1.094  0.6933
##  STTD - RVB    -4.322 4.128 240  -1.047  0.7220
##  LIB - RVB      0.216 0.600 240   0.360  0.9840
## 
## Year_fct = 2018:
##  contrast    estimate    SE  df t.ratio p.value
##  RD22 - STTD    4.030 4.141 240   0.973  0.7648
##  RD22 - LIB     0.478 0.598 240   0.798  0.8553
##  RD22 - RVB    -0.601 0.480 240  -1.252  0.5942
##  STTD - LIB    -3.553 4.147 240  -0.857  0.8270
##  STTD - RVB    -4.632 4.127 240  -1.122  0.6761
##  LIB - RVB     -1.079 0.504 240  -2.143  0.1427
## 
## Year_fct = 2019:
##  contrast    estimate    SE  df t.ratio p.value
##  RD22 - STTD    4.449 4.118 240   1.080  0.7018
##  RD22 - LIB    -0.352 0.594 240  -0.592  0.9345
##  RD22 - RVB     0.119 0.488 240   0.243  0.9949
##  STTD - LIB    -4.801 4.127 240  -1.163  0.6505
##  STTD - RVB    -4.330 4.090 240  -1.059  0.7150
##  LIB - RVB      0.470 0.521 240   0.904  0.8029
## 
## P value adjustment: tukey method for comparing a family of 4 estimates
# Create table of contrasts and convert it to a tibble for plot
df_gam_sta_yr <- em_gam_sta_yr %>% 
  cld(sort = FALSE, Letters = letters) %>% 
  as_tibble() %>% 
  mutate(
    group = str_remove_all(.group, fixed(" ")),
    # back transform log-transformed results
    across(c(emmean, lower.CL, upper.CL), ~ exp(.x) / 1000)
  ) %>% 
  # Add min and max values of observed data to the Tukey post-hoc results and
    # calculate vertical positioning of letters
  left_join(
    df_chla_c2_lag2 %>% 
      summarize(
        max_val = max(Chla),
        min_val = min(Chla),
        .by = Year_fct
    ), 
    by = join_by(Year_fct)
  ) %>% 
  mutate(max_val = if_else(upper.CL > max_val, upper.CL, max_val)) %>% 
  group_by(Year_fct) %>% 
  mutate(max_val = max(max_val)) %>% 
  ungroup() %>% 
  mutate(y_pos = max_val + (max_val - min_val) / 10) %>% 
  select(
    StationCode,
    Year_fct,
    emmean,
    lower.CL,
    upper.CL,
    group,
    y_pos
  )

# Create boxplot showing Tukey post-hoc results
plt_gam_sta_yr <- df_gam_sta_yr %>% 
  ggplot(
    aes(
      x = StationCode,
      y = emmean,
      ymin = lower.CL,
      ymax = upper.CL
    )
  ) +
  geom_boxplot(
    data = df_chla_c2_lag2,
    aes(x = StationCode, y = Chla),
    inherit.aes = FALSE
  ) +
  geom_crossbar(color = "grey82", fill = "grey", alpha = 0.7, linewidth = 0.1) +
  geom_point(color = "red") +
  geom_text(aes(y = y_pos, label = group), size = 3.5) +
  facet_wrap(vars(Year_fct), scales = "free_y") +
  xlab("Station") +
  ylab(expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))) +
  theme_bw()

plt_gam_sta_yr

None of the contrasts are significant. Also, the model under predicts RD22, and STTD has a lot of uncertainty.

Year by Station Contrasts

# Estimated marginal means for year by station
em_gam_yr_sta <- emmeans(m_gam_sflow_lag2, ~ Year_fct | StationCode)

# Tukey post-hoc contrasts
pairs(em_gam_yr_sta)
## StationCode = RD22:
##  contrast                    estimate    SE  df t.ratio p.value
##  Year_fct2015 - Year_fct2016   0.4746 0.333 240   1.427  0.6108
##  Year_fct2015 - Year_fct2017   0.2222 0.312 240   0.713  0.9533
##  Year_fct2015 - Year_fct2018   0.1650 0.305 240   0.541  0.9829
##  Year_fct2015 - Year_fct2019  -0.0565 0.303 240  -0.186  0.9997
##  Year_fct2016 - Year_fct2017  -0.2524 0.193 240  -1.306  0.6875
##  Year_fct2016 - Year_fct2018  -0.3096 0.186 240  -1.661  0.4604
##  Year_fct2016 - Year_fct2019  -0.5311 0.188 240  -2.820  0.0412
##  Year_fct2017 - Year_fct2018  -0.0572 0.142 240  -0.404  0.9944
##  Year_fct2017 - Year_fct2019  -0.2787 0.144 240  -1.935  0.3017
##  Year_fct2018 - Year_fct2019  -0.2215 0.134 240  -1.648  0.4685
## 
## StationCode = STTD:
##  contrast                    estimate    SE  df t.ratio p.value
##  Year_fct2015 - Year_fct2016  -0.0493 0.345 240  -0.143  0.9999
##  Year_fct2015 - Year_fct2017  -0.0111 0.332 240  -0.034  1.0000
##  Year_fct2015 - Year_fct2018   0.2736 0.326 240   0.839  0.9181
##  Year_fct2015 - Year_fct2019   0.4706 0.331 240   1.424  0.6129
##  Year_fct2016 - Year_fct2017   0.0382 0.207 240   0.184  0.9997
##  Year_fct2016 - Year_fct2018   0.3229 0.198 240   1.632  0.4786
##  Year_fct2016 - Year_fct2019   0.5199 0.205 240   2.536  0.0861
##  Year_fct2017 - Year_fct2018   0.2848 0.168 240   1.692  0.4411
##  Year_fct2017 - Year_fct2019   0.4817 0.176 240   2.743  0.0507
##  Year_fct2018 - Year_fct2019   0.1970 0.150 240   1.316  0.6813
## 
## StationCode = LIB:
##  contrast                    estimate    SE  df t.ratio p.value
##  Year_fct2015 - Year_fct2016  -0.3561 0.585 240  -0.609  0.9737
##  Year_fct2015 - Year_fct2017  -0.1351 0.529 240  -0.256  0.9991
##  Year_fct2015 - Year_fct2018   1.1348 0.534 240   2.126  0.2124
##  Year_fct2015 - Year_fct2019   0.0840 0.530 240   0.158  0.9999
##  Year_fct2016 - Year_fct2017   0.2210 0.293 240   0.755  0.9429
##  Year_fct2016 - Year_fct2018   1.4909 0.333 240   4.481  0.0001
##  Year_fct2016 - Year_fct2019   0.4401 0.314 240   1.400  0.6284
##  Year_fct2017 - Year_fct2018   1.2699 0.215 240   5.914  <.0001
##  Year_fct2017 - Year_fct2019   0.2190 0.202 240   1.084  0.8148
##  Year_fct2018 - Year_fct2019  -1.0508 0.213 240  -4.923  <.0001
## 
## StationCode = RVB:
##  contrast                    estimate    SE  df t.ratio p.value
##  Year_fct2015 - Year_fct2016  -0.5157 0.559 240  -0.922  0.8882
##  Year_fct2015 - Year_fct2017   0.1420 0.437 240   0.325  0.9976
##  Year_fct2015 - Year_fct2018   0.1168 0.359 240   0.325  0.9976
##  Year_fct2015 - Year_fct2019   0.6154 0.376 240   1.635  0.4763
##  Year_fct2016 - Year_fct2017   0.6577 0.563 240   1.168  0.7694
##  Year_fct2016 - Year_fct2018   0.6325 0.506 240   1.250  0.7220
##  Year_fct2016 - Year_fct2019   1.1311 0.518 240   2.185  0.1890
##  Year_fct2017 - Year_fct2018  -0.0252 0.372 240  -0.068  1.0000
##  Year_fct2017 - Year_fct2019   0.4734 0.387 240   1.225  0.7370
##  Year_fct2018 - Year_fct2019   0.4986 0.293 240   1.704  0.4332
## 
## P value adjustment: tukey method for comparing a family of 5 estimates
# Create table of contrasts and convert it to a tibble for plot
df_gam_yr_sta <- em_gam_yr_sta %>% 
  cld(sort = FALSE, Letters = letters) %>% 
  as_tibble() %>% 
  mutate(
    group = str_remove_all(.group, fixed(" ")),
    # back transform log-transformed results
    across(c(emmean, lower.CL, upper.CL), ~ exp(.x) / 1000)
  ) %>% 
  # Add min and max values of observed data to the Tukey post-hoc results and
    # calculate vertical positioning of letters
  left_join(
    df_chla_c2_lag2 %>% 
      summarize(
        max_val = max(Chla),
        min_val = min(Chla),
        .by = StationCode
    ), 
    by = join_by(StationCode)
  ) %>% 
  mutate(max_val = if_else(upper.CL > max_val, upper.CL, max_val)) %>% 
  group_by(StationCode) %>% 
  mutate(max_val = max(max_val)) %>% 
  ungroup() %>% 
  mutate(y_pos = max_val + (max_val - min_val) / 10) %>% 
  select(
    StationCode,
    Year_fct,
    emmean,
    lower.CL,
    upper.CL,
    group,
    y_pos
  )

# Create boxplot showing Tukey post-hoc results
plt_gam_yr_sta <- df_gam_yr_sta %>% 
  ggplot(
    aes(
      x = Year_fct,
      y = emmean,
      ymin = lower.CL,
      ymax = upper.CL
    )
  ) +
  geom_boxplot(
    data = df_chla_c2_lag2,
    aes(x = Year_fct, y = Chla),
    inherit.aes = FALSE
  ) +
  geom_crossbar(color = "grey82", fill = "grey", alpha = 0.7, linewidth = 0.1) +
  geom_point(color = "red") +
  geom_text(aes(y = y_pos, label = group), size = 3.5) +
  facet_wrap(vars(StationCode), scales = "free_y") +
  xlab("Year") +
  ylab(expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))) +
  theme_bw()

plt_gam_yr_sta

Again, the model under predicts RD22, and both STTD and RVB have a lot of uncertainty.

After looking at the results more closely from Model 7, they appear questionable. I think we should consider using another possibly less complicated model. Model 2 (GAM 2-way interactions with s(Week)) was in close second place behind Model 7 for AIC and in close third place for BIC. Let’s revisit this model, but we’ll break it into 4 separate models for each station to make it less complicated.

Model 2 broken up by station

ls_chla_c2_lag <- split(df_chla_c2_lag, ~ StationCode)

Model 2 - RD22

Initial Model

m_gam_flow_yr_rd22 <- gam(
  Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5), 
  data = ls_chla_c2_lag$RD22,
  method = "REML", 
  knots = list(week = c(0, 52))
)

Lets look at the model summary and diagnostics:

summary(m_gam_flow_yr_rd22)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5)
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.6958227  0.1132442  85.619  < 2e-16 ***
## Year_fct2016      -0.7142917  0.1660087  -4.303 5.32e-05 ***
## Year_fct2017      -0.8940651  0.2624785  -3.406  0.00109 ** 
## Year_fct2018      -0.4058687  0.1484528  -2.734  0.00790 ** 
## Year_fct2019      -0.1231646  0.1458255  -0.845  0.40118    
## Flow              -0.0013920  0.0005276  -2.639  0.01024 *  
## Year_fct2016:Flow  0.0013735  0.0008616   1.594  0.11536    
## Year_fct2017:Flow  0.0307527  0.0137880   2.230  0.02890 *  
## Year_fct2018:Flow  0.0005201  0.0006813   0.763  0.44779    
## Year_fct2019:Flow -0.0002592  0.0006023  -0.430  0.66821    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df    F p-value    
## s(Week) 2.333      3 8.92 8.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.595   Deviance explained = 65.1%
## -REML = 65.495  Scale est. = 0.11878   n = 83
appraise(m_gam_flow_yr_rd22)

shapiro.test(residuals(m_gam_flow_yr_rd22))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow_yr_rd22)
## W = 0.98163, p-value = 0.2826
k.check(m_gam_flow_yr_rd22)
##         k'      edf   k-index p-value
## s(Week)  3 2.333212 0.9492174    0.28
draw(m_gam_flow_yr_rd22, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow_yr_rd22, pages = 1, all.terms = TRUE)

acf(residuals(m_gam_flow_yr_rd22))

Box.test(residuals(m_gam_flow_yr_rd22), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow_yr_rd22)
## X-squared = 26.689, df = 20, p-value = 0.1442
anova(m_gam_flow_yr_rd22)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5)
## 
## Parametric Terms:
##               df     F  p-value
## Year_fct       4 7.366 5.12e-05
## Flow           1 6.962   0.0102
## Year_fct:Flow  4 2.383   0.0595
## 
## Approximate significance of smooth terms:
##           edf Ref.df    F p-value
## s(Week) 2.333  3.000 8.92 8.3e-06

Shapiro-Wilk normality test shows that the residuals are normal, the diagnostic plots look really good. The ACF plot and the Box-Ljung test indicate little to no serial autocorrelation. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

Observed vs Fitted Values

df_m_gam_flow_yr_rd22_fit <- ls_chla_c2_lag$RD22 %>% 
  fitted_values(m_gam_flow_yr_rd22, data = .) %>% 
  mutate(fitted_bt = exp(.fitted) / 1000)

plt_m_gam_flow_yr_rd22_fit <- df_m_gam_flow_yr_rd22_fit %>% 
  ggplot(aes(x = fitted_bt, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_m_gam_flow_yr_rd22_fit

plt_m_gam_flow_yr_rd22_fit + facet_wrap(vars(Year_fct), scales = "free")

This model looks pretty good. Note that variability does increase as the chlorophyll values increase. Before proceeding with this model for RD22, let’s look more closely at its results.

Effect of Flow by Year

# Calculate min and max flows for each station to narrow down x-axis in the plot
df_chla_flow_yr_rd22_summ <- ls_chla_c2_lag$RD22 %>% 
  summarize(
    Flow_min = min(Flow),
    Flow_max = max(Flow),
    .by = c(Year_fct)
  ) %>% 
  mutate(
    Flow_buffer = (Flow_max - Flow_min) * 0.05,
    Flow_min = Flow_min - Flow_buffer,
    Flow_max = Flow_max + Flow_buffer
  )

# Calculate effects of flow on chlorophyll for each year holding the
  # non-focal variables constant - marginal effects/adjusted predictions
df_gam_flow_yr_rd22_eff <- 
  as.data.frame(
    predict_response(
      m_gam_flow_yr_rd22, 
      terms = c("Flow", "Year_fct"),
      margin = "marginalmeans"
    ),
    terms_to_colnames = TRUE
  ) %>% 
  as_tibble() %>% 
  # Narrow down range of flow values for each year
  left_join(df_chla_flow_yr_rd22_summ, by = join_by(Year_fct)) %>% 
  filter(Flow >= Flow_min & Flow <= Flow_max) %>% 
  transmute(
    Year_fct,
    Flow,
    # Back calculate model fits and confidence levels
    across(c(predicted, conf.low, conf.high), ~ exp(.x) / 1000)
  )
  
# Create effects plot
plt_gam_flow_yr_rd22_eff <- df_gam_flow_yr_rd22_eff %>% 
  ggplot(aes(x = Flow, y = predicted)) +
  geom_point(
    data = ls_chla_c2_lag$RD22,
    aes(y = Chla),
    alpha = 0.6
  ) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.25) +
  facet_wrap(vars(Year_fct), scales = "free") +
  theme_bw() +
  labs(
    x = "Flow (cfs)",
    y = expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))
  ) +
  scale_x_continuous(breaks = breaks_extended(6))

plt_gam_flow_yr_rd22_eff

Year Contrasts

# Estimated marginal means for year
em_gam_yr_rd22 <- emmeans(m_gam_flow_yr_rd22, ~ Year_fct)

# Tukey post-hoc contrasts
pairs(em_gam_yr_rd22)
##  contrast                    estimate    SE   df t.ratio p.value
##  Year_fct2015 - Year_fct2016    0.569 0.131 70.7   4.349  0.0004
##  Year_fct2015 - Year_fct2017   -2.367 1.246 70.7  -1.901  0.3266
##  Year_fct2015 - Year_fct2018    0.351 0.120 70.7   2.922  0.0364
##  Year_fct2015 - Year_fct2019    0.151 0.121 70.7   1.250  0.7225
##  Year_fct2016 - Year_fct2017   -2.936 1.234 70.7  -2.379  0.1332
##  Year_fct2016 - Year_fct2018   -0.218 0.125 70.7  -1.740  0.4166
##  Year_fct2016 - Year_fct2019   -0.418 0.126 70.7  -3.325  0.0119
##  Year_fct2017 - Year_fct2018    2.718 1.242 70.7   2.188  0.1961
##  Year_fct2017 - Year_fct2019    2.518 1.242 70.7   2.028  0.2637
##  Year_fct2018 - Year_fct2019   -0.200 0.116 70.7  -1.729  0.4229
## 
## P value adjustment: tukey method for comparing a family of 5 estimates
# Create table of contrasts and convert it to a tibble for plot
df_gam_yr_rd22 <- em_gam_yr_rd22 %>% 
  cld(sort = FALSE, Letters = letters) %>% 
  as_tibble() %>% 
  mutate(
    group = str_remove_all(.group, fixed(" ")),
    # back transform log-transformed results
    across(c(emmean, lower.CL, upper.CL), ~ exp(.x) / 1000)
  ) %>% 
  # Add min and max values of observed data to the Tukey post-hoc results and
    # calculate vertical positioning of letters
  left_join(
    ls_chla_c2_lag$RD22 %>% 
      summarize(
        max_val = max(Chla),
        min_val = min(Chla),
        .by = Year_fct
    ), 
    by = join_by(Year_fct)
  ) %>% 
  mutate(
    max_val = if_else(upper.CL > max_val, upper.CL, max_val),
    y_pos = max_val + (max_val - min_val) / 10,
    # Make all post-hoc contrast letters at same height equal to max of all
    y_pos = max(y_pos)
  ) %>% 
  select(
    Year_fct,
    emmean,
    lower.CL,
    upper.CL,
    group,
    y_pos
  )

# Create boxplot showing Tukey post-hoc results
plt_gam_yr_rd22 <- df_gam_yr_rd22 %>% 
  ggplot(
    aes(
      x = Year_fct,
      y = emmean,
      ymin = lower.CL,
      ymax = upper.CL
    )
  ) +
  geom_boxplot(
    data = ls_chla_c2_lag$RD22,
    aes(x = Year_fct, y = Chla),
    inherit.aes = FALSE
  ) +
  geom_crossbar(color = "grey82", fill = "grey", alpha = 0.7, linewidth = 0.1) +
  geom_point(color = "red") +
  geom_text(aes(y = y_pos, label = group), size = 3.5) +
  xlab("Year") +
  ylab(expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))) +
  theme_bw()

plt_gam_yr_rd22

The model predictions and confidence intervals look really off for 2017. Let’s look at what value emmeans is using for Flow and Week number when calculating these model predictions and how they compare to the range for each year.

ref_grid(m_gam_flow_yr_rd22)
## 'emmGrid' object with variables:
##     Year_fct = 2015, 2016, 2017, 2018, 2019
##     Flow = 106.05
##     Week = 35.747
ls_chla_c2_lag$RD22 %>% 
  summarize(
    across(c(Flow, Week), list(min = min, max = max)),
    .by = Year_fct
  )
## # A tibble: 5 × 5
##   Year_fct Flow_min Flow_max Week_min Week_max
##   <fct>       <dbl>    <dbl>    <dbl>    <dbl>
## 1 2015         4.88    509.        30       45
## 2 2016        31.6     559.        25       38
## 3 2017         5.05     26.9       28       44
## 4 2018         5.56    522.        28       45
## 5 2019         9.52    690.        28       45

emmeans is using 106 for Flow and 35.7 for Week when calculating predictions and confidence intervals for this model. Week is within the range of values for each year, but Flow is outside of the range of values for 2017. This may explain why the results look really off for 2017.

Model 2 - STTD

Initial Model

m_gam_flow_yr_sttd <- gam(
  Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5), 
  data = ls_chla_c2_lag$STTD,
  method = "REML", 
  knots = list(week = c(0, 52))
)

Lets look at the model summary and diagnostics:

summary(m_gam_flow_yr_sttd)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5)
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.6170057  0.0666832 129.223  < 2e-16 ***
## Year_fct2016       0.4636900  0.1006718   4.606 2.16e-05 ***
## Year_fct2017       0.4266768  0.1721946   2.478  0.01602 *  
## Year_fct2018      -0.6626033  0.0996955  -6.646 9.57e-09 ***
## Year_fct2019      -1.0635357  0.0982515 -10.825 8.48e-16 ***
## Flow               0.0027340  0.0004493   6.085 8.57e-08 ***
## Year_fct2016:Flow -0.0014831  0.0006608  -2.245  0.02846 *  
## Year_fct2017:Flow  0.0143166  0.0051790   2.764  0.00754 ** 
## Year_fct2018:Flow -0.0002519  0.0005772  -0.436  0.66414    
## Year_fct2019:Flow -0.0013482  0.0005153  -2.616  0.01120 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value
## s(Week) 0.3421      3 0.137   0.303
## 
## R-sq.(adj) =  0.847   Deviance explained = 86.7%
## -REML = 43.221  Scale est. = 0.071816  n = 71
appraise(m_gam_flow_yr_sttd)

shapiro.test(residuals(m_gam_flow_yr_sttd))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow_yr_sttd)
## W = 0.97466, p-value = 0.1605
k.check(m_gam_flow_yr_sttd)
##         k'       edf   k-index p-value
## s(Week)  3 0.3421119 0.8134544    0.03
draw(m_gam_flow_yr_sttd, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow_yr_sttd, pages = 1, all.terms = TRUE)

acf(residuals(m_gam_flow_yr_sttd))

Box.test(residuals(m_gam_flow_yr_sttd), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow_yr_sttd)
## X-squared = 16.412, df = 20, p-value = 0.6907
anova(m_gam_flow_yr_sttd)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5)
## 
## Parametric Terms:
##               df      F  p-value
## Year_fct       4 69.283  < 2e-16
## Flow           1 37.022 8.57e-08
## Year_fct:Flow  4  5.092  0.00133
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value
## s(Week) 0.3421 3.0000 0.137   0.303

Shapiro-Wilk normality test shows that the residuals are normal, the diagnostic plots look really good. The ACF plot and the Box-Ljung test indicate no serial autocorrelation. Note that the smooth term is not significant. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

Observed vs Fitted Values

df_m_gam_flow_yr_sttd_fit <- ls_chla_c2_lag$STTD %>% 
  fitted_values(m_gam_flow_yr_sttd, data = .) %>% 
  mutate(fitted_bt = exp(.fitted) / 1000)

plt_m_gam_flow_yr_sttd_fit <- df_m_gam_flow_yr_sttd_fit %>% 
  ggplot(aes(x = fitted_bt, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_m_gam_flow_yr_sttd_fit

plt_m_gam_flow_yr_sttd_fit + facet_wrap(vars(Year_fct), scales = "free")

This model looks pretty good. Note that variability does increase as the chlorophyll values increase. Before proceeding with this model for STTD, let’s look more closely at its results.

Effect of Flow by Year

# Calculate min and max flows for each station to narrow down x-axis in the plot
df_chla_flow_yr_sttd_summ <- ls_chla_c2_lag$STTD %>% 
  summarize(
    Flow_min = min(Flow),
    Flow_max = max(Flow),
    .by = c(Year_fct)
  ) %>% 
  mutate(
    Flow_buffer = (Flow_max - Flow_min) * 0.05,
    Flow_min = Flow_min - Flow_buffer,
    Flow_max = Flow_max + Flow_buffer
  )

# Calculate effects of flow on chlorophyll for each year holding the
  # non-focal variables constant - marginal effects/adjusted predictions
df_gam_flow_yr_sttd_eff <- 
  as.data.frame(
    predict_response(
      m_gam_flow_yr_sttd, 
      terms = c("Flow", "Year_fct"),
      margin = "marginalmeans"
    ),
    terms_to_colnames = TRUE
  ) %>% 
  as_tibble() %>% 
  # Narrow down range of flow values for each year
  left_join(df_chla_flow_yr_sttd_summ, by = join_by(Year_fct)) %>% 
  filter(Flow >= Flow_min & Flow <= Flow_max) %>% 
  transmute(
    Year_fct,
    Flow,
    # Back calculate model fits and confidence levels
    across(c(predicted, conf.low, conf.high), ~ exp(.x) / 1000)
  )
  
# Create effects plot
plt_gam_flow_yr_sttd_eff <- df_gam_flow_yr_sttd_eff %>% 
  ggplot(aes(x = Flow, y = predicted)) +
  geom_point(
    data = ls_chla_c2_lag$STTD,
    aes(y = Chla),
    alpha = 0.6
  ) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.25) +
  facet_wrap(vars(Year_fct), scales = "free") +
  theme_bw() +
  labs(
    x = "Flow (cfs)",
    y = expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))
  ) +
  scale_x_continuous(breaks = breaks_extended(6))

plt_gam_flow_yr_sttd_eff

Year Contrasts

# Estimated marginal means for year
em_gam_yr_sttd <- emmeans(m_gam_flow_yr_sttd, ~ Year_fct)

# Tukey post-hoc contrasts
pairs(em_gam_yr_sttd)
##  contrast                    estimate     SE   df t.ratio p.value
##  Year_fct2015 - Year_fct2016   -0.395 0.0977 60.7  -4.041  0.0014
##  Year_fct2015 - Year_fct2017   -1.091 0.3884 60.7  -2.809  0.0503
##  Year_fct2015 - Year_fct2018    0.674 0.0971 60.7   6.941  <.0001
##  Year_fct2015 - Year_fct2019    1.126 0.0949 60.7  11.870  <.0001
##  Year_fct2016 - Year_fct2017   -0.696 0.3881 60.7  -1.794  0.3863
##  Year_fct2016 - Year_fct2018    1.069 0.1018 60.7  10.498  <.0001
##  Year_fct2016 - Year_fct2019    1.521 0.1000 60.7  15.217  <.0001
##  Year_fct2017 - Year_fct2018    1.765 0.3891 60.7   4.537  0.0003
##  Year_fct2017 - Year_fct2019    2.217 0.3889 60.7   5.701  <.0001
##  Year_fct2018 - Year_fct2019    0.452 0.0995 60.7   4.541  0.0003
## 
## P value adjustment: tukey method for comparing a family of 5 estimates
# Create table of contrasts and convert it to a tibble for plot
df_gam_yr_sttd <- em_gam_yr_sttd %>% 
  cld(sort = FALSE, Letters = letters) %>% 
  as_tibble() %>% 
  mutate(
    group = str_remove_all(.group, fixed(" ")),
    # back transform log-transformed results
    across(c(emmean, lower.CL, upper.CL), ~ exp(.x) / 1000)
  ) %>% 
  # Add min and max values of observed data to the Tukey post-hoc results and
    # calculate vertical positioning of letters
  left_join(
    ls_chla_c2_lag$STTD %>% 
      summarize(
        max_val = max(Chla),
        min_val = min(Chla),
        .by = Year_fct
    ), 
    by = join_by(Year_fct)
  ) %>% 
  mutate(
    max_val = if_else(upper.CL > max_val, upper.CL, max_val),
    y_pos = max_val + (max_val - min_val) / 10,
    # Make all post-hoc contrast letters at same height equal to max of all
    y_pos = max(y_pos)
  ) %>% 
  select(
    Year_fct,
    emmean,
    lower.CL,
    upper.CL,
    group,
    y_pos
  )

# Create boxplot showing Tukey post-hoc results
plt_gam_yr_sttd <- df_gam_yr_sttd %>% 
  ggplot(
    aes(
      x = Year_fct,
      y = emmean,
      ymin = lower.CL,
      ymax = upper.CL
    )
  ) +
  geom_boxplot(
    data = ls_chla_c2_lag$STTD,
    aes(x = Year_fct, y = Chla),
    inherit.aes = FALSE
  ) +
  geom_crossbar(color = "grey82", fill = "grey", alpha = 0.7, linewidth = 0.1) +
  geom_point(color = "red") +
  geom_text(aes(y = y_pos, label = group), size = 3.5) +
  xlab("Year") +
  ylab(expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))) +
  theme_bw()

plt_gam_yr_sttd

The model predictions and confidence intervals look better than those for RD22, but 2017 still seems somewhat off. Let’s look at what value emmeans is using for Flow and Week number when calculating these model predictions and how they compare to the range for each year.

ref_grid(m_gam_flow_yr_sttd)
## 'emmGrid' object with variables:
##     Year_fct = 2015, 2016, 2017, 2018, 2019
##     Flow = 46.414
##     Week = 35.394
ls_chla_c2_lag$STTD %>% 
  summarize(
    across(c(Flow, Week), list(min = min, max = max)),
    .by = Year_fct
  )
## # A tibble: 5 × 5
##   Year_fct Flow_min Flow_max Week_min Week_max
##   <fct>       <dbl>    <dbl>    <dbl>    <dbl>
## 1 2015       -103.    358.         30       46
## 2 2016        -82.0   476.         25       38
## 3 2017        -61.9    -1.39       28       39
## 4 2018       -101.    445.         29       42
## 5 2019        -64.8   658.         30       45

emmeans is using 46.4 for Flow and 35.4 for Week when calculating predictions and confidence intervals for this model. Week is within the range of values for each year, but Flow is outside of the range of values for 2017. This may explain why the results look somewhat off for 2017.

Model 2 - LIB

Initial Model

m_gam_flow_yr_lib <- gam(
  Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5), 
  data = ls_chla_c2_lag$LIB,
  method = "REML", 
  knots = list(week = c(0, 52))
)

Lets look at the model summary and diagnostics:

summary(m_gam_flow_yr_lib)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5)
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.794e+00  5.533e-01  14.087  < 2e-16 ***
## Year_fct2016       7.190e-01  1.141e+00   0.630 0.530883    
## Year_fct2017      -1.842e-02  6.142e-01  -0.030 0.976175    
## Year_fct2018      -2.685e+00  6.676e-01  -4.022 0.000163 ***
## Year_fct2019      -6.665e-01  6.794e-01  -0.981 0.330470    
## Flow               4.752e-05  3.456e-04   0.137 0.891109    
## Year_fct2016:Flow -1.057e-04  7.775e-04  -0.136 0.892309    
## Year_fct2017:Flow -1.905e-05  4.570e-04  -0.042 0.966895    
## Year_fct2018:Flow -9.012e-04  6.981e-04  -1.291 0.201667    
## Year_fct2019:Flow  6.819e-06  6.155e-04   0.011 0.991198    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value   
## s(Week) 1.84      3 3.443 0.00353 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.83   Deviance explained = 85.6%
## -REML =  78.76  Scale est. = 0.18821   n = 72
appraise(m_gam_flow_yr_lib)

shapiro.test(residuals(m_gam_flow_yr_lib))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow_yr_lib)
## W = 0.96604, p-value = 0.04888
k.check(m_gam_flow_yr_lib)
##         k'      edf  k-index p-value
## s(Week)  3 1.839936 1.010445    0.51
draw(m_gam_flow_yr_lib, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow_yr_lib, pages = 1, all.terms = TRUE)

acf(residuals(m_gam_flow_yr_lib))

Box.test(residuals(m_gam_flow_yr_lib), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow_yr_lib)
## X-squared = 55.221, df = 20, p-value = 3.805e-05
anova(m_gam_flow_yr_lib)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5)
## 
## Parametric Terms:
##               df     F  p-value
## Year_fct       4 7.442 6.14e-05
## Flow           1 0.019    0.891
## Year_fct:Flow  4 0.462    0.763
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value
## s(Week) 1.84   3.00 3.443 0.00353

Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look pretty good. However, the residuals are autocorrelated.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_gam_flow_yr_lib_lag1 <- gam(
  Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5) + lag1, 
  data = ls_chla_c2_lag$LIB,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_flow_yr_lib_lag1))

Box.test(residuals(m_gam_flow_yr_lib_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow_yr_lib_lag1)
## X-squared = 24.824, df = 20, p-value = 0.2083

Lag 2

m_gam_flow_yr_lib_lag2 <- gam(
  Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5) + lag1 + lag2, 
  data = ls_chla_c2_lag$LIB,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_flow_yr_lib_lag2))

Box.test(residuals(m_gam_flow_yr_lib_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow_yr_lib_lag2)
## X-squared = 18.158, df = 20, p-value = 0.577

The model with 1 lag term already seems to address the serial autocorrelation, but the lag2 model is even better. Let’s use AIC to see how they compare.

Compare Models

AIC(m_gam_flow_yr_lib, m_gam_flow_yr_lib_lag1, m_gam_flow_yr_lib_lag2)
##                              df      AIC
## m_gam_flow_yr_lib      13.26481 97.66666
## m_gam_flow_yr_lib_lag1 13.46163 60.08330
## m_gam_flow_yr_lib_lag2 13.25738 57.16020

It looks like the lag2 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_gam_flow_yr_lib_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5) + lag1 + 
##     lag2
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.591e+00  9.906e-01   2.616   0.0118 *  
## Year_fct2016       1.335e+00  1.037e+00   1.287   0.2040    
## Year_fct2017       1.598e-01  4.981e-01   0.321   0.7496    
## Year_fct2018      -3.512e-01  6.883e-01  -0.510   0.6121    
## Year_fct2019      -1.534e-01  5.348e-01  -0.287   0.7755    
## Flow              -1.107e-04  2.686e-04  -0.412   0.6819    
## lag1               8.803e-01  1.357e-01   6.485 3.88e-08 ***
## lag2              -2.401e-01  1.529e-01  -1.570   0.1227    
## Year_fct2016:Flow  7.132e-04  7.314e-04   0.975   0.3342    
## Year_fct2017:Flow  1.135e-04  3.791e-04   0.299   0.7659    
## Year_fct2018:Flow  7.046e-04  7.707e-04   0.914   0.3650    
## Year_fct2019:Flow -5.475e-05  5.027e-04  -0.109   0.9137    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value
## s(Week) 0.1328      3 0.047   0.351
## 
## R-sq.(adj) =  0.897   Deviance explained = 91.6%
## -REML = 59.479  Scale est. = 0.11933   n = 62
appraise(m_gam_flow_yr_lib_lag2)

shapiro.test(residuals(m_gam_flow_yr_lib_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow_yr_lib_lag2)
## W = 0.9032, p-value = 0.0001349
k.check(m_gam_flow_yr_lib_lag2)
##         k'       edf   k-index p-value
## s(Week)  3 0.1328294 0.9798788  0.4175
draw(m_gam_flow_yr_lib_lag2, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow_yr_lib_lag2, pages = 1, all.terms = TRUE)

anova(m_gam_flow_yr_lib_lag2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5) + lag1 + 
##     lag2
## 
## Parametric Terms:
##               df      F  p-value
## Year_fct       4  0.648    0.631
## Flow           1  0.170    0.682
## lag1           1 42.054 3.88e-08
## lag2           1  2.465    0.123
## Year_fct:Flow  4  0.454    0.769
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value
## s(Week) 0.1328 3.0000 0.047   0.351

The model diagnostics look okay - serial autocorrelation has been accounted for, but the Shapiro-Wilk normality test and diagnostic plots show that the residuals aren’t normal. Also, note that all terms besides lag1 are not significant. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

Observed vs Fitted Values

df_chla_lib_lag2 <- ls_chla_c2_lag$LIB %>% drop_na(lag1, lag2)

df_m_gam_flow_yr_lib_lag2_fit <- df_chla_lib_lag2 %>% 
  fitted_values(m_gam_flow_yr_lib_lag2, data = .) %>% 
  mutate(fitted_bt = exp(.fitted) / 1000)

plt_m_gam_flow_yr_lib_lag2_fit <- df_m_gam_flow_yr_lib_lag2_fit %>% 
  ggplot(aes(x = fitted_bt, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_m_gam_flow_yr_lib_lag2_fit

plt_m_gam_flow_yr_lib_lag2_fit + facet_wrap(vars(Year_fct), scales = "free")

This model looks pretty good. Note that variability does increase as the chlorophyll values increase. Before proceeding with this model for LIB, let’s look more closely at its results.

Effect of Flow by Year

# Calculate min and max flows for each station to narrow down x-axis in the plot
df_chla_flow_yr_lib_summ <- df_chla_lib_lag2 %>% 
  summarize(
    Flow_min = min(Flow),
    Flow_max = max(Flow),
    .by = c(Year_fct)
  ) %>% 
  mutate(
    Flow_buffer = (Flow_max - Flow_min) * 0.05,
    Flow_min = Flow_min - Flow_buffer,
    Flow_max = Flow_max + Flow_buffer
  )

# Calculate effects of flow on chlorophyll for each year holding the
  # non-focal variables constant - marginal effects/adjusted predictions
df_gam_flow_yr_lib_eff <- 
  as.data.frame(
    predict_response(
      m_gam_flow_yr_lib_lag2, 
      terms = c("Flow", "Year_fct"),
      margin = "marginalmeans"
    ),
    terms_to_colnames = TRUE
  ) %>% 
  as_tibble() %>% 
  # Narrow down range of flow values for each year
  left_join(df_chla_flow_yr_lib_summ, by = join_by(Year_fct)) %>% 
  filter(Flow >= Flow_min & Flow <= Flow_max) %>% 
  transmute(
    Year_fct,
    Flow,
    # Back calculate model fits and confidence levels
    across(c(predicted, conf.low, conf.high), ~ exp(.x) / 1000)
  )
  
# Create effects plot
plt_gam_flow_yr_lib_eff <- df_gam_flow_yr_lib_eff %>% 
  ggplot(aes(x = Flow, y = predicted)) +
  geom_point(
    data = df_chla_lib_lag2,
    aes(y = Chla),
    alpha = 0.6
  ) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.25) +
  facet_wrap(vars(Year_fct), scales = "free") +
  theme_bw() +
  labs(
    x = "Flow (cfs)",
    y = expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))
  ) +
  scale_x_continuous(breaks = breaks_extended(6))

plt_gam_flow_yr_lib_eff

Year Contrasts

# Estimated marginal means for year
em_gam_yr_lib <- emmeans(m_gam_flow_yr_lib_lag2, ~ Year_fct)

# Tukey post-hoc contrasts
pairs(em_gam_yr_lib)
##  contrast                    estimate    SE   df t.ratio p.value
##  Year_fct2015 - Year_fct2016  -0.5125 0.255 49.9  -2.007  0.2775
##  Year_fct2015 - Year_fct2017  -0.0289 0.166 49.9  -0.174  0.9998
##  Year_fct2015 - Year_fct2018   1.1640 0.444 49.9   2.619  0.0821
##  Year_fct2015 - Year_fct2019   0.0902 0.275 49.9   0.328  0.9974
##  Year_fct2016 - Year_fct2017   0.4836 0.230 49.9   2.098  0.2370
##  Year_fct2016 - Year_fct2018   1.6765 0.505 49.9   3.321  0.0139
##  Year_fct2016 - Year_fct2019   0.6027 0.341 49.9   1.766  0.4046
##  Year_fct2017 - Year_fct2018   1.1929 0.442 49.9   2.699  0.0683
##  Year_fct2017 - Year_fct2019   0.1191 0.266 49.9   0.448  0.9914
##  Year_fct2018 - Year_fct2019  -1.0738 0.480 49.9  -2.238  0.1828
## 
## P value adjustment: tukey method for comparing a family of 5 estimates
# Create table of contrasts and convert it to a tibble for plot
df_gam_yr_lib <- em_gam_yr_lib %>% 
  cld(sort = FALSE, Letters = letters) %>% 
  as_tibble() %>% 
  mutate(
    group = str_remove_all(.group, fixed(" ")),
    # back transform log-transformed results
    across(c(emmean, lower.CL, upper.CL), ~ exp(.x) / 1000)
  ) %>% 
  # Add min and max values of observed data to the Tukey post-hoc results and
    # calculate vertical positioning of letters
  left_join(
    df_chla_lib_lag2 %>% 
      summarize(
        max_val = max(Chla),
        min_val = min(Chla),
        .by = Year_fct
    ), 
    by = join_by(Year_fct)
  ) %>% 
  mutate(
    max_val = if_else(upper.CL > max_val, upper.CL, max_val),
    y_pos = max_val + (max_val - min_val) / 10,
    # Make all post-hoc contrast letters at same height equal to max of all
    y_pos = max(y_pos)
  ) %>% 
  select(
    Year_fct,
    emmean,
    lower.CL,
    upper.CL,
    group,
    y_pos
  )

# Create boxplot showing Tukey post-hoc results
plt_gam_yr_lib <- df_gam_yr_lib %>% 
  ggplot(
    aes(
      x = Year_fct,
      y = emmean,
      ymin = lower.CL,
      ymax = upper.CL
    )
  ) +
  geom_boxplot(
    data = df_chla_lib_lag2,
    aes(x = Year_fct, y = Chla),
    inherit.aes = FALSE
  ) +
  geom_crossbar(color = "grey82", fill = "grey", alpha = 0.7, linewidth = 0.1) +
  geom_point(color = "red") +
  geom_text(aes(y = y_pos, label = group), size = 3.5) +
  xlab("Year") +
  ylab(expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))) +
  theme_bw()

plt_gam_yr_lib

The model predictions and confidence intervals look decent, but 2016 seems somewhat off. Let’s look at what value emmeans is using for Flow and Week number when calculating these model predictions and how they compare to the range for each year.

ref_grid(m_gam_flow_yr_lib_lag2)
## 'emmGrid' object with variables:
##     Year_fct = 2015, 2016, 2017, 2018, 2019
##     Flow = -1153.5
##     lag1 = 7.6183
##     lag2 = 7.6844
##     Week = 35.968
df_chla_lib_lag2 %>% 
  summarize(
    across(c(Flow, Week), list(min = min, max = max)),
    .by = Year_fct
  )
## # A tibble: 5 × 5
##   Year_fct Flow_min Flow_max Week_min Week_max
##   <fct>       <dbl>    <dbl>    <dbl>    <dbl>
## 1 2015       -1946.   -802.        29       46
## 2 2016       -1646.  -1171.        24       38
## 3 2017       -1488.   -221.        30       44
## 4 2018        -767.   -162         35       44
## 5 2019       -1102.    -52.5       30       43

emmeans is using -1154 for Flow and 36 for Week when calculating predictions and confidence intervals for this model. Week is within the range of values for each year, but Flow is outside of the range of values for 2016, 2018, and 2019. This may explain why the results look somewhat off for 2016 and the generally wide confidence intervals for all years.

Model 2 - RVB

Initial Model

m_gam_flow_yr_rvb <- gam(
  Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5), 
  data = ls_chla_c2_lag$RVB,
  method = "REML", 
  knots = list(week = c(0, 52))
)

Lets look at the model summary and diagnostics:

summary(m_gam_flow_yr_rvb)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5)
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.058e+00  3.989e-01  17.694   <2e-16 ***
## Year_fct2016       1.917e+00  5.989e-01   3.202   0.0020 ** 
## Year_fct2017       6.736e-01  5.037e-01   1.337   0.1851    
## Year_fct2018       3.069e-01  4.629e-01   0.663   0.5094    
## Year_fct2019      -5.541e-01  4.808e-01  -1.152   0.2528    
## Flow               1.411e-04  9.403e-05   1.501   0.1376    
## Year_fct2016:Flow -2.880e-04  1.095e-04  -2.632   0.0103 *  
## Year_fct2017:Flow -1.724e-04  9.976e-05  -1.728   0.0881 .  
## Year_fct2018:Flow -1.455e-04  9.932e-05  -1.465   0.1472    
## Year_fct2019:Flow -9.370e-05  1.005e-04  -0.932   0.3542    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Week) 2.682      3 16.69  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.76   Deviance explained = 79.2%
## -REML = 47.018  Scale est. = 0.047662  n = 88
appraise(m_gam_flow_yr_rvb)

shapiro.test(residuals(m_gam_flow_yr_rvb))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow_yr_rvb)
## W = 0.96779, p-value = 0.02739
k.check(m_gam_flow_yr_rvb)
##         k'      edf  k-index p-value
## s(Week)  3 2.681987 1.156251  0.9325
draw(m_gam_flow_yr_rvb, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow_yr_rvb, pages = 1, all.terms = TRUE)

acf(residuals(m_gam_flow_yr_rvb))

Box.test(residuals(m_gam_flow_yr_rvb), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow_yr_rvb)
## X-squared = 31.666, df = 20, p-value = 0.04699
anova(m_gam_flow_yr_rvb)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5)
## 
## Parametric Terms:
##               df     F  p-value
## Year_fct       4 9.736 2.11e-06
## Flow           1 2.252    0.138
## Year_fct:Flow  4 4.675    0.002
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value
## s(Week) 2.682  3.000 16.69  <2e-16

Besides the Shapiro-Wilk normality test showing that the residuals aren’t normal, the diagnostic plots look pretty good. However, the residuals are autocorrelated.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1 and 2 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_gam_flow_yr_rvb_lag1 <- gam(
  Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5) + lag1, 
  data = ls_chla_c2_lag$RVB,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_flow_yr_rvb_lag1))

Box.test(residuals(m_gam_flow_yr_rvb_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow_yr_rvb_lag1)
## X-squared = 23.848, df = 20, p-value = 0.2491

Lag 2

m_gam_flow_yr_rvb_lag2 <- gam(
  Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5) + lag1 + lag2, 
  data = ls_chla_c2_lag$RVB,
  method = "REML", 
  knots = list(week = c(0, 52))
)

acf(residuals(m_gam_flow_yr_rvb_lag2))

Box.test(residuals(m_gam_flow_yr_rvb_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam_flow_yr_rvb_lag2)
## X-squared = 23.027, df = 20, p-value = 0.2875

The model with 1 lag term already seems to address the serial autocorrelation. Let’s use AIC to see how they compare.

Compare Models

AIC(m_gam_flow_yr_rvb, m_gam_flow_yr_rvb_lag1, m_gam_flow_yr_rvb_lag2)
##                              df        AIC
## m_gam_flow_yr_rvb      13.93575  -3.928139
## m_gam_flow_yr_rvb_lag1 14.69086 -15.227178
## m_gam_flow_yr_rvb_lag2 15.69375  -8.788796

It looks like the lag1 model has the best fit according to the AIC values. Let’s take a closer look at that one.

Lag 1 model summary

summary(m_gam_flow_yr_rvb_lag1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5) + lag1
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.808e+00  8.641e-01   4.407 3.71e-05 ***
## Year_fct2016       1.626e+00  6.187e-01   2.628   0.0105 *  
## Year_fct2017       7.505e-01  5.026e-01   1.493   0.1399    
## Year_fct2018       6.441e-01  4.720e-01   1.365   0.1768    
## Year_fct2019       4.841e-02  4.970e-01   0.097   0.9227    
## Flow               1.726e-04  9.998e-05   1.727   0.0887 .  
## lag1               4.064e-01  9.548e-02   4.256 6.36e-05 ***
## Year_fct2016:Flow -2.752e-04  1.138e-04  -2.418   0.0182 *  
## Year_fct2017:Flow -1.925e-04  1.040e-04  -1.851   0.0684 .  
## Year_fct2018:Flow -1.865e-04  1.040e-04  -1.793   0.0774 .  
## Year_fct2019:Flow -1.450e-04  1.049e-04  -1.383   0.1712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value   
## s(Week) 2.244      3 3.694 0.00466 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.783   Deviance explained = 81.5%
## -REML = 40.814  Scale est. = 0.040701  n = 83
appraise(m_gam_flow_yr_rvb_lag1)

shapiro.test(residuals(m_gam_flow_yr_rvb_lag1))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam_flow_yr_rvb_lag1)
## W = 0.91616, p-value = 4.636e-05
k.check(m_gam_flow_yr_rvb_lag1)
##         k'      edf  k-index p-value
## s(Week)  3 2.243627 1.169088   0.935
draw(m_gam_flow_yr_rvb_lag1, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam_flow_yr_rvb_lag1, pages = 1, all.terms = TRUE)

anova(m_gam_flow_yr_rvb_lag1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow + s(Week, bs = "cc", k = 5) + lag1
## 
## Parametric Terms:
##               df      F  p-value
## Year_fct       4  3.522   0.0112
## Flow           1  2.981   0.0887
## lag1           1 18.118 6.36e-05
## Year_fct:Flow  4  2.442   0.0547
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value
## s(Week) 2.244  3.000 3.694 0.00466

The model diagnostics look okay - serial autocorrelation has been accounted for, but the Shapiro-Wilk normality test and diagnostic plots show that the residuals aren’t normal. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

Observed vs Fitted Values

df_chla_rvb_lag1 <- ls_chla_c2_lag$RVB %>% drop_na(lag1)

df_m_gam_flow_yr_rvb_lag1_fit <- df_chla_rvb_lag1 %>% 
  fitted_values(m_gam_flow_yr_rvb_lag1, data = .) %>% 
  mutate(fitted_bt = exp(.fitted) / 1000)

plt_m_gam_flow_yr_rvb_lag1_fit <- df_m_gam_flow_yr_rvb_lag1_fit %>% 
  ggplot(aes(x = fitted_bt, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_m_gam_flow_yr_rvb_lag1_fit

plt_m_gam_flow_yr_rvb_lag1_fit + facet_wrap(vars(Year_fct), scales = "free")

This model looks pretty good. Before proceeding with this model for RVB, let’s look more closely at its results.

Effect of Flow by Year

# Calculate min and max flows for each station to narrow down x-axis in the plot
df_chla_flow_yr_rvb_summ <- df_chla_rvb_lag1 %>% 
  summarize(
    Flow_min = min(Flow),
    Flow_max = max(Flow),
    .by = c(Year_fct)
  ) %>% 
  mutate(
    Flow_buffer = (Flow_max - Flow_min) * 0.05,
    Flow_min = Flow_min - Flow_buffer,
    Flow_max = Flow_max + Flow_buffer
  )

# Calculate effects of flow on chlorophyll for each year holding the
  # non-focal variables constant - marginal effects/adjusted predictions
df_gam_flow_yr_rvb_eff <- 
  as.data.frame(
    predict_response(
      m_gam_flow_yr_rvb_lag1, 
      terms = c("Flow", "Year_fct"),
      margin = "marginalmeans"
    ),
    terms_to_colnames = TRUE
  ) %>% 
  as_tibble() %>% 
  # Narrow down range of flow values for each year
  left_join(df_chla_flow_yr_rvb_summ, by = join_by(Year_fct)) %>% 
  filter(Flow >= Flow_min & Flow <= Flow_max) %>% 
  transmute(
    Year_fct,
    Flow,
    # Back calculate model fits and confidence levels
    across(c(predicted, conf.low, conf.high), ~ exp(.x) / 1000)
  )
  
# Create effects plot
plt_gam_flow_yr_rvb_eff <- df_gam_flow_yr_rvb_eff %>% 
  ggplot(aes(x = Flow, y = predicted)) +
  geom_point(
    data = df_chla_rvb_lag1,
    aes(y = Chla),
    alpha = 0.6
  ) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.25) +
  facet_wrap(vars(Year_fct), scales = "free") +
  theme_bw() +
  labs(
    x = "Flow (cfs)",
    y = expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))
  ) +
  scale_x_continuous(breaks = breaks_extended(6))

plt_gam_flow_yr_rvb_eff

Year Contrasts

# Estimated marginal means for year
em_gam_yr_rvb <- emmeans(m_gam_flow_yr_rvb_lag1, ~ Year_fct)

# Tukey post-hoc contrasts
pairs(em_gam_yr_rvb)
##  contrast                    estimate     SE   df t.ratio p.value
##  Year_fct2015 - Year_fct2016   0.4983 0.3637 69.8   1.370  0.6486
##  Year_fct2015 - Year_fct2017   0.7354 0.3670 69.8   2.004  0.2750
##  Year_fct2015 - Year_fct2018   0.7951 0.3634 69.8   2.188  0.1964
##  Year_fct2015 - Year_fct2019   1.0711 0.3662 69.8   2.925  0.0363
##  Year_fct2016 - Year_fct2017   0.2372 0.1126 69.8   2.105  0.2295
##  Year_fct2016 - Year_fct2018   0.2969 0.1011 69.8   2.936  0.0353
##  Year_fct2016 - Year_fct2019   0.5729 0.1269 69.8   4.514  0.0002
##  Year_fct2017 - Year_fct2018   0.0597 0.1006 69.8   0.594  0.9756
##  Year_fct2017 - Year_fct2019   0.3357 0.1134 69.8   2.961  0.0330
##  Year_fct2018 - Year_fct2019   0.2760 0.0825 69.8   3.347  0.0112
## 
## P value adjustment: tukey method for comparing a family of 5 estimates
# Create table of contrasts and convert it to a tibble for plot
df_gam_yr_rvb <- em_gam_yr_rvb %>% 
  cld(sort = FALSE, Letters = letters) %>% 
  as_tibble() %>% 
  mutate(
    group = str_remove_all(.group, fixed(" ")),
    # back transform log-transformed results
    across(c(emmean, lower.CL, upper.CL), ~ exp(.x) / 1000)
  ) %>% 
  # Add min and max values of observed data to the Tukey post-hoc results and
    # calculate vertical positioning of letters
  left_join(
    df_chla_rvb_lag1 %>% 
      summarize(
        max_val = max(Chla),
        min_val = min(Chla),
        .by = Year_fct
    ), 
    by = join_by(Year_fct)
  ) %>% 
  mutate(
    max_val = if_else(upper.CL > max_val, upper.CL, max_val),
    y_pos = max_val + (max_val - min_val) / 10,
    # Make all post-hoc contrast letters at same height equal to max of all
    y_pos = max(y_pos)
  ) %>% 
  select(
    Year_fct,
    emmean,
    lower.CL,
    upper.CL,
    group,
    y_pos
  )

# Create boxplot showing Tukey post-hoc results
plt_gam_yr_rvb <- df_gam_yr_rvb %>% 
  ggplot(
    aes(
      x = Year_fct,
      y = emmean,
      ymin = lower.CL,
      ymax = upper.CL
    )
  ) +
  geom_boxplot(
    data = df_chla_rvb_lag1,
    aes(x = Year_fct, y = Chla),
    inherit.aes = FALSE
  ) +
  geom_crossbar(color = "grey82", fill = "grey", alpha = 0.7, linewidth = 0.1) +
  geom_point(color = "red") +
  geom_text(aes(y = y_pos, label = group), size = 3.5) +
  xlab("Year") +
  ylab(expression(Chlorophyll~Fluoresence~(mu*g~L^{-1}))) +
  theme_bw()

plt_gam_yr_rvb

The model predictions and confidence intervals look decent, but 2015 seems somewhat off. Let’s look at what value emmeans is using for Flow and Week number when calculating these model predictions and how they compare to the range for each year.

ref_grid(m_gam_flow_yr_rvb_lag1)
## 'emmGrid' object with variables:
##     Year_fct = 2015, 2016, 2017, 2018, 2019
##     Flow = 7719.3
##     lag1 = 7.4032
##     Week = 35.771
df_chla_rvb_lag1 %>% 
  summarize(
    across(c(Flow, Week), list(min = min, max = max)),
    .by = Year_fct
  )
## # A tibble: 5 × 5
##   Year_fct Flow_min Flow_max Week_min Week_max
##   <fct>       <dbl>    <dbl>    <dbl>    <dbl>
## 1 2015        3307.    5171.       28       46
## 2 2016        6827.   10622.       23       38
## 3 2017        7095.   14402.       29       44
## 4 2018        2975.    9887.       29       45
## 5 2019        4431.   10978.       29       45

emmeans is using 7719 for Flow and 35.8 for Week when calculating predictions and confidence intervals for this model. Week is within the range of values for each year, but Flow is outside of the range of values for 2015. This may explain why the results look somewhat off for 2015.

Conclusions

Model 2 broken up by station seems to be a good way to look at how flow affects chlorophyll for each year; however, the models for LIB and RVB had trouble with normality of their residuals and needed adjustment for serial autocorrelation. Chlorophyll values at STTD in particular show a strong positive response to flow for all years.

In contrast, model 2 broken up by station does not appear to a sufficient way to look at year contrasts. In most cases, the model predictions did not match the observed data well and there were wide confidence intervals. This may be because emmeans used values for Flow that were outside of the range of values for some of the years when calculating predictions and confidence intervals.