Purpose

Explore and analyze the continuous chlorophyll data to be included in the NDFS synthesis manuscript. We will attempt to fit various models to the data set using daily average flow as a continuous predictor which replaces the categorical predictor - flow action period - in the original analysis. These models will only include representative stations for 4 habitat types - upstream (RD22), lower Yolo Bypass (STTD), Cache Slough complex (LIB), and downstream (RVB).

Global code and functions

# Load packages
library(tidyverse)
library(scales)
library(knitr)
library(mgcv)
library(car)
library(gratia)
library(here)
library(conflicted)

# Source functions
source(here("manuscript_synthesis/src/global_functions.R"))

# Declare package conflict preferences 
conflicts_prefer(dplyr::filter(), dplyr::lag())

Display current versions of R and packages used for this analysis:

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.3 (2023-03-15 ucrt)
##  os       Windows 10 x64 (build 19045)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United States.utf8
##  ctype    English_United States.utf8
##  tz       America/Los_Angeles
##  date     2024-02-29
##  pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  abind         1.4-5   2016-07-21 [1] CRAN (R 4.2.0)
##  bslib         0.4.2   2022-12-16 [1] CRAN (R 4.2.2)
##  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.2.3)
##  callr         3.7.3   2022-11-02 [1] CRAN (R 4.2.2)
##  car         * 3.1-2   2023-03-30 [1] CRAN (R 4.2.3)
##  carData     * 3.0-5   2022-01-06 [1] CRAN (R 4.2.1)
##  cli           3.6.2   2023-12-11 [1] CRAN (R 4.2.3)
##  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.2.2)
##  conflicted  * 1.2.0   2023-02-01 [1] CRAN (R 4.2.2)
##  crayon        1.5.2   2022-09-29 [1] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.2.1)
##  digest        0.6.33  2023-07-07 [1] CRAN (R 4.2.3)
##  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.2.3)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.1)
##  evaluate      0.21    2023-05-05 [1] CRAN (R 4.2.3)
##  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.2.3)
##  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.2.2)
##  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.2.2)
##  fs          * 1.6.3   2023-07-20 [1] CRAN (R 4.2.3)
##  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
##  ggplot2     * 3.4.3   2023-08-14 [1] CRAN (R 4.2.3)
##  glue          1.7.0   2024-01-09 [1] CRAN (R 4.2.3)
##  gratia      * 0.8.1   2023-02-02 [1] CRAN (R 4.2.2)
##  gtable        0.3.4   2023-08-21 [1] CRAN (R 4.2.3)
##  here        * 1.0.1   2020-12-13 [1] CRAN (R 4.2.1)
##  hms           1.1.3   2023-03-21 [1] CRAN (R 4.2.3)
##  htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.2.3)
##  htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.2.3)
##  httpuv        1.6.9   2023-02-14 [1] CRAN (R 4.2.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.2.1)
##  jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.2.3)
##  knitr       * 1.42    2023-01-25 [1] CRAN (R 4.2.2)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.2.1)
##  lattice       0.21-8  2023-04-05 [1] CRAN (R 4.2.3)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.2.3)
##  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.2.3)
##  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
##  Matrix        1.5-4   2023-04-04 [1] CRAN (R 4.2.3)
##  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.2.1)
##  mgcv        * 1.8-42  2023-03-02 [1] CRAN (R 4.2.2)
##  mime          0.12    2021-09-28 [1] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.1)
##  mvnfast       0.2.8   2023-02-23 [1] CRAN (R 4.2.2)
##  nlme        * 3.1-162 2023-01-31 [2] CRAN (R 4.2.3)
##  patchwork   * 1.1.2   2022-08-19 [1] CRAN (R 4.2.1)
##  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.2.3)
##  pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.2.3)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
##  pkgload       1.3.2.1 2023-07-08 [1] CRAN (R 4.2.3)
##  prettyunits   1.2.0   2023-09-24 [1] CRAN (R 4.2.3)
##  processx      3.8.2   2023-06-30 [1] CRAN (R 4.2.3)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.2.1)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
##  ps            1.7.5   2023-04-18 [1] CRAN (R 4.2.3)
##  purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.2.3)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
##  Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.2.3)
##  readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.2.3)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.2.1)
##  rlang       * 1.1.3   2024-01-10 [1] CRAN (R 4.2.3)
##  rmarkdown     2.21    2023-03-26 [1] CRAN (R 4.2.3)
##  rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.2.1)
##  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
##  sass          0.4.6   2023-05-03 [1] CRAN (R 4.2.3)
##  scales      * 1.2.1   2022-08-20 [1] CRAN (R 4.2.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
##  shiny         1.7.4   2022-12-15 [1] CRAN (R 4.2.2)
##  stringi       1.8.3   2023-12-11 [1] CRAN (R 4.2.3)
##  stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.2.3)
##  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.2.3)
##  tidyr       * 1.3.1   2024-01-24 [1] CRAN (R 4.2.3)
##  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.1)
##  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.2.2)
##  timechange    0.3.0   2024-01-18 [1] CRAN (R 4.2.3)
##  tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.2.3)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.2.1)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.2.1)
##  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.2.2)
##  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.2.3)
##  withr         3.0.0   2024-01-16 [1] CRAN (R 4.2.3)
##  xfun          0.39    2023-04-20 [1] CRAN (R 4.2.3)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.2.1)
##  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.2)
## 
##  [1] C:/R/win-library/4.2
##  [2] C:/Program Files/R/R-4.2.3/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Import Data

# Define file path for processed data
fp_data <- here("manuscript_synthesis/data/processed")

# Import daily average water quality data
df_wq <- readRDS(file.path(fp_data, "wq_daily_avg_2013-2019.rds"))

# Import daily average flow data
df_flow <- readRDS(file.path(fp_data, "flow_daily_avg_2013-2019.rds"))

Prepare Data

# Create a vector for the factor order of StationCode
sta_order <- c("RD22", "STTD", "LIB", "RVB")

# We will use LIS flow data as a proxy for STTD
df_flow_c <- df_flow %>% mutate(StationCode = if_else(StationCode == "LIS", "STTD", StationCode))

# Prepare chlorophyll and flow data for exploration and analysis
df_chla_c1 <- df_wq %>% 
  select(StationCode, Date, Chla) %>% 
  drop_na(Chla) %>% 
  # Filter to only include representative stations for 4 habitat types - RD22, STTD, LIB, RVB
  filter(StationCode %in% sta_order) %>% 
  # Join flow data to chlorophyll data
  left_join(df_flow_c, by = join_by(StationCode, Date)) %>% 
  # Remove all NA flow values
  drop_na(Flow) %>% 
  mutate(
    # Scale and log transform chlorophyll values
    Chla_log = log(Chla * 1000),
    # Add Year and DOY variables
    Year = year(Date),
    DOY = yday(Date),
    # Apply factor order to StationCode
    StationCode = factor(StationCode, levels = sta_order),
    # Add a column for Year as a factor for the model
    Year_fct = factor(Year)
  ) %>% 
  arrange(StationCode, Date)

Explore sample counts by Station

df_chla_c1 %>% 
  summarize(
    min_date = min(Date),
    max_date = max(Date),
    num_samples = n(),
    .by = c(StationCode, Year)
  ) %>% 
  arrange(StationCode, Year) %>% 
  kable()
StationCode Year min_date max_date num_samples
RD22 2014 2014-10-01 2014-11-08 39
RD22 2015 2015-07-24 2015-11-10 110
RD22 2016 2016-06-23 2016-09-16 86
RD22 2017 2017-07-14 2017-11-03 101
RD22 2018 2018-07-13 2018-11-11 122
RD22 2019 2019-07-11 2019-11-06 118
STTD 2013 2013-08-15 2013-11-04 78
STTD 2014 2014-07-25 2014-10-19 87
STTD 2015 2015-07-27 2015-11-16 108
STTD 2016 2016-06-23 2016-09-16 83
STTD 2017 2017-07-14 2017-09-25 56
STTD 2018 2018-07-20 2018-10-11 84
STTD 2019 2019-07-26 2019-11-06 100
LIB 2014 2014-10-01 2014-11-08 35
LIB 2015 2015-07-06 2015-11-16 123
LIB 2016 2016-05-29 2016-09-16 99
LIB 2017 2017-07-14 2017-11-01 89
LIB 2018 2018-08-22 2018-10-22 26
LIB 2019 2019-07-11 2019-09-21 21
RVB 2013 2013-07-07 2013-11-17 134
RVB 2014 2014-07-25 2014-11-08 105
RVB 2015 2015-07-06 2015-11-16 134
RVB 2016 2016-05-29 2016-09-16 80
RVB 2017 2017-07-14 2017-11-03 113
RVB 2018 2018-07-13 2018-11-11 122
RVB 2019 2019-07-11 2019-11-06 99

Looking at the sample counts and date ranges, we’ll only include years 2015-2019 for the analysis.

df_chla_c2 <- df_chla_c1 %>% 
  filter(Year %in% 2015:2019) %>% 
  mutate(Year_fct = fct_drop(Year_fct))

# Create another dataframe that has up to 3 lag variables for chlorophyll to be
  # used in the linear models
df_chla_c2_lag <- df_chla_c2 %>% 
  # Fill in missing days for each StationCode-Year combination
  group_by(StationCode, Year) %>% 
  complete(Date = seq.Date(min(Date), max(Date), by = "1 day")) %>% 
  # Create lag variables of scaled log transformed chlorophyll values
  mutate(
    lag1 = lag(Chla_log),
    lag2 = lag(Chla_log, 2),
    lag3 = lag(Chla_log, 3)
  ) %>% 
  ungroup()

Plots

Let’s explore the data with some plots. First, lets plot the data in scatter plots of chlorophyll and flow faceted by Station and grouping all years together.

df_chla_c2 %>% 
  ggplot(aes(x = Flow, y = Chla)) +
  geom_point() +
  geom_smooth(formula = "y ~ x") +
  facet_wrap(vars(StationCode), scales = "free") +
  theme_bw()

At first glance, I’m not sure how well flow is going to be able to predict chlorophyll concentrations. At the furthest upstream station - RD22 - chlorophyll appears to be highest at the lowest flows, but the variation is at its maximum at the lowest flows. There may be some dilution effect going on here at the higher flows. At STTD, there does seem to be a modest increase in chlorophyll concentrations at the mid-range flows. This pattern is even more obvious at LIB. There appears to be no effect of flow on chlorophyll at RVB, but the range of chlorophyll concentrations is narrow at this station (between 0 and 5).

Let’s break these scatterplots apart by year to see how these patterns vary annually.

df_chla_c2 %>% 
  ggplot(aes(x = Flow, y = Chla)) +
  geom_point() +
  geom_smooth(formula = "y ~ x") +
  facet_wrap(
    vars(StationCode, Year), 
    ncol = 5, 
    scales = "free", 
    labeller = labeller(.multi_line = FALSE)
  ) +
  theme_bw()

The patterns appear to vary annually at each station, which may justify using a 3-way interaction.

GAM Model with 3-way interactions

First, we will attempt to fit a generalized additive model (GAM) to the data set to help account for seasonality in the data. We’ll try running a GAM using a three-way interaction between Year, Daily Average Flow, and Station, and a smooth term for day of year to account for seasonality. Initially, we’ll run the GAM without accounting for serial autocorrelation.

Initial Model

m_gam3 <- gam(
  Chla_log ~ Year_fct * Flow * StationCode + s(DOY), 
  data = df_chla_c2,
  method = "REML"
)

Lets look at the model summary and diagnostics:

summary(m_gam3)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(DOY)
## 
## Parametric coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.734e+00  4.242e-02 229.451  < 2e-16 ***
## Year_fct2016                      -8.100e-01  6.510e-02 -12.442  < 2e-16 ***
## Year_fct2017                      -9.446e-01  9.333e-02 -10.120  < 2e-16 ***
## Year_fct2018                      -3.798e-01  5.586e-02  -6.800 1.42e-11 ***
## Year_fct2019                      -1.569e-01  5.549e-02  -2.827 0.004748 ** 
## Flow                              -1.795e-03  1.810e-04  -9.916  < 2e-16 ***
## StationCodeSTTD                   -1.068e+00  5.338e-02 -20.010  < 2e-16 ***
## StationCodeLIB                    -2.164e+00  1.427e-01 -15.166  < 2e-16 ***
## StationCodeRVB                    -2.181e+00  9.945e-02 -21.927  < 2e-16 ***
## Year_fct2016:Flow                  2.101e-03  2.965e-04   7.087 1.95e-12 ***
## Year_fct2017:Flow                  2.980e-02  4.868e-03   6.122 1.13e-09 ***
## Year_fct2018:Flow                  4.690e-04  2.456e-04   1.909 0.056383 .  
## Year_fct2019:Flow                 -8.769e-05  2.155e-04  -0.407 0.684079    
## Year_fct2016:StationCodeSTTD       1.073e+00  8.125e-02  13.208  < 2e-16 ***
## Year_fct2017:StationCodeSTTD       1.077e+00  1.234e-01   8.724  < 2e-16 ***
## Year_fct2018:StationCodeSTTD      -2.829e-01  7.662e-02  -3.692 0.000229 ***
## Year_fct2019:StationCodeSTTD      -8.837e-01  7.505e-02 -11.774  < 2e-16 ***
## Year_fct2016:StationCodeLIB        2.103e+00  2.559e-01   8.217 3.92e-16 ***
## Year_fct2017:StationCodeLIB        9.232e-01  2.051e-01   4.501 7.20e-06 ***
## Year_fct2018:StationCodeLIB       -8.344e-01  2.888e-01  -2.889 0.003910 ** 
## Year_fct2019:StationCodeLIB       -2.348e-01  2.275e-01  -1.032 0.302235    
## Year_fct2016:StationCodeRVB        1.200e+00  2.292e-01   5.235 1.84e-07 ***
## Year_fct2017:StationCodeRVB        9.410e-01  1.984e-01   4.743 2.27e-06 ***
## Year_fct2018:StationCodeRVB        8.428e-02  1.543e-01   0.546 0.584870    
## Year_fct2019:StationCodeRVB       -8.860e-01  1.847e-01  -4.797 1.74e-06 ***
## Flow:StationCodeSTTD               4.763e-03  2.905e-04  16.396  < 2e-16 ***
## Flow:StationCodeLIB                1.721e-03  2.024e-04   8.501  < 2e-16 ***
## Flow:StationCodeRVB                1.817e-03  1.823e-04   9.964  < 2e-16 ***
## Year_fct2016:Flow:StationCodeSTTD -3.755e-03  4.317e-04  -8.699  < 2e-16 ***
## Year_fct2017:Flow:StationCodeSTTD -2.776e-02  5.251e-03  -5.285 1.40e-07 ***
## Year_fct2018:Flow:StationCodeSTTD -8.361e-04  3.842e-04  -2.176 0.029658 *  
## Year_fct2019:Flow:StationCodeSTTD -1.479e-03  3.405e-04  -4.345 1.47e-05 ***
## Year_fct2016:Flow:StationCodeLIB  -1.778e-03  3.405e-04  -5.222 1.97e-07 ***
## Year_fct2017:Flow:StationCodeLIB  -2.991e-02  4.870e-03  -6.143 9.93e-10 ***
## Year_fct2018:Flow:StationCodeLIB   2.326e-04  4.499e-04   0.517 0.605220    
## Year_fct2019:Flow:StationCodeLIB   3.155e-04  3.227e-04   0.978 0.328296    
## Year_fct2016:Flow:StationCodeRVB  -2.159e-03  2.977e-04  -7.252 6.04e-13 ***
## Year_fct2017:Flow:StationCodeRVB  -2.983e-02  4.868e-03  -6.129 1.08e-09 ***
## Year_fct2018:Flow:StationCodeRVB  -4.803e-04  2.469e-04  -1.945 0.051881 .  
## Year_fct2019:Flow:StationCodeRVB   1.161e-04  2.175e-04   0.534 0.593549    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(DOY) 7.386  8.342 40.88  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.872   Deviance explained = 87.5%
## -REML = 860.35  Scale est. = 0.11819   n = 1874
appraise(m_gam3)

shapiro.test(residuals(m_gam3))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam3)
## W = 0.96852, p-value < 2.2e-16
k.check(m_gam3)
##        k'      edf  k-index p-value
## s(DOY)  9 7.385659 0.952863  0.0175
draw(m_gam3, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam3, pages = 1, all.terms = TRUE)

acf(residuals(m_gam3))

Box.test(residuals(m_gam3), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam3)
## X-squared = 3426.2, df = 20, p-value < 2.2e-16

Model with k = 5

The model definitely has residual autocorrelation as indicated by the ACF plot and the Box-Ljung test. Even though the p-value for the k-check is less than 0.05, the smooth term for day of year appears to be overfitted. Let’s try a smaller k-value for the smooth first, then lets try to address the residual autocorrelation.

m_gam3_k5 <- gam(
  Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5), 
  data = df_chla_c2,
  method = "REML"
)

summary(m_gam3_k5)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)
## 
## Parametric coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.7278768  0.0421697 230.684  < 2e-16 ***
## Year_fct2016                      -0.7776765  0.0639758 -12.156  < 2e-16 ***
## Year_fct2017                      -0.9217677  0.0936371  -9.844  < 2e-16 ***
## Year_fct2018                      -0.3780679  0.0561613  -6.732 2.23e-11 ***
## Year_fct2019                      -0.1554310  0.0557794  -2.787 0.005383 ** 
## Flow                              -0.0017682  0.0001785  -9.904  < 2e-16 ***
## StationCodeSTTD                   -1.0661259  0.0535105 -19.924  < 2e-16 ***
## StationCodeLIB                    -2.1691650  0.1428317 -15.187  < 2e-16 ***
## StationCodeRVB                    -2.1846232  0.0998032 -21.889  < 2e-16 ***
## Year_fct2016:Flow                  0.0019713  0.0002923   6.743 2.07e-11 ***
## Year_fct2017:Flow                  0.0286811  0.0048738   5.885 4.73e-09 ***
## Year_fct2018:Flow                  0.0004480  0.0002470   1.814 0.069834 .  
## Year_fct2019:Flow                 -0.0000991  0.0002165  -0.458 0.647254    
## Year_fct2016:StationCodeSTTD       1.0643578  0.0815638  13.049  < 2e-16 ***
## Year_fct2017:StationCodeSTTD       1.0619864  0.1239959   8.565  < 2e-16 ***
## Year_fct2018:StationCodeSTTD      -0.2896681  0.0770322  -3.760 0.000175 ***
## Year_fct2019:StationCodeSTTD      -0.8811363  0.0753268 -11.698  < 2e-16 ***
## Year_fct2016:StationCodeLIB        2.1057016  0.2567401   8.202 4.41e-16 ***
## Year_fct2017:StationCodeLIB        0.9481491  0.2051177   4.622 4.06e-06 ***
## Year_fct2018:StationCodeLIB       -0.7855803  0.2899198  -2.710 0.006798 ** 
## Year_fct2019:StationCodeLIB       -0.2662514  0.2284866  -1.165 0.244056    
## Year_fct2016:StationCodeRVB        1.1701753  0.2287826   5.115 3.47e-07 ***
## Year_fct2017:StationCodeRVB        0.9672348  0.1981715   4.881 1.15e-06 ***
## Year_fct2018:StationCodeRVB        0.0946581  0.1548241   0.611 0.541017    
## Year_fct2019:StationCodeRVB       -0.8809824  0.1850921  -4.760 2.09e-06 ***
## Flow:StationCodeSTTD               0.0047745  0.0002921  16.346  < 2e-16 ***
## Flow:StationCodeLIB                0.0016888  0.0002007   8.416  < 2e-16 ***
## Flow:StationCodeRVB                0.0017912  0.0001799   9.959  < 2e-16 ***
## Year_fct2016:Flow:StationCodeSTTD -0.0037733  0.0004341  -8.692  < 2e-16 ***
## Year_fct2017:Flow:StationCodeSTTD -0.0262643  0.0052481  -5.005 6.14e-07 ***
## Year_fct2018:Flow:StationCodeSTTD -0.0008384  0.0003865  -2.169 0.030185 *  
## Year_fct2019:Flow:StationCodeSTTD -0.0015024  0.0003424  -4.388 1.21e-05 ***
## Year_fct2016:Flow:StationCodeLIB  -0.0016335  0.0003378  -4.836 1.44e-06 ***
## Year_fct2017:Flow:StationCodeLIB  -0.0287590  0.0048744  -5.900 4.32e-09 ***
## Year_fct2018:Flow:StationCodeLIB   0.0002943  0.0004521   0.651 0.515145    
## Year_fct2019:Flow:StationCodeLIB   0.0002993  0.0003244   0.923 0.356341    
## Year_fct2016:Flow:StationCodeRVB  -0.0020283  0.0002934  -6.914 6.48e-12 ***
## Year_fct2017:Flow:StationCodeRVB  -0.0287229  0.0048739  -5.893 4.50e-09 ***
## Year_fct2018:Flow:StationCodeRVB  -0.0004615  0.0002483  -1.859 0.063211 .  
## Year_fct2019:Flow:StationCodeRVB   0.0001259  0.0002185   0.576 0.564661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(DOY) 3.892  3.993 81.76  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.87   Deviance explained = 87.3%
## -REML = 867.13  Scale est. = 0.11965   n = 1874
appraise(m_gam3_k5)

shapiro.test(residuals(m_gam3_k5))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam3_k5)
## W = 0.96615, p-value < 2.2e-16
k.check(m_gam3_k5)
##        k'      edf   k-index p-value
## s(DOY)  4 3.892262 0.9395093  0.0025
draw(m_gam3_k5, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam3_k5, pages = 1, all.terms = TRUE)

acf(residuals(m_gam3_k5))

Box.test(residuals(m_gam3_k5), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam3_k5)
## X-squared = 3449.2, df = 20, p-value < 2.2e-16

Changing the k-value to 5 seems to help reduce the “wiggliness” of the smooth term for DOY, but the p-value for the k-check is still less than 0.05. Despite this, lets proceed with a k-value of 5.

Model with AR() correlation structure

Now, we’ll try to deal with the residual autocorrelation. We’ll run AR(1), AR(2), AR(3), and AR(4) models and compare them using AIC.

# Define model formula as an object
f_gam3 <- as.formula("Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)")

# Fit original model with k = 5 and three successive AR(p) models
m_gamm3 <- gamm(
  f_gam3, 
  data = df_chla_c2,
  method = "REML"
)

m_gamm3_ar1 <- gamm(
  f_gam3, 
  data = df_chla_c2, 
  correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 1), # grouped by Year_fct and StationCode
  method = "REML"
)

m_gamm3_ar2 <- gamm(
  f_gam3, 
  data = df_chla_c2, 
  correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 2),
  method = "REML"
)

m_gamm3_ar3 <- gamm(
  f_gam3, 
  data = df_chla_c2, 
  correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 3),
  method = "REML"
)

m_gamm3_ar4 <- gamm(
  f_gam3, 
  data = df_chla_c2, 
  correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 4),
  method = "REML"
)

# Compare models
anova(
  m_gamm3$lme, 
  m_gamm3_ar1$lme, 
  m_gamm3_ar2$lme, 
  m_gamm3_ar3$lme,
  m_gamm3_ar4$lme
)
##                 Model df       AIC       BIC    logLik   Test   L.Ratio p-value
## m_gamm3$lme         1 43 1820.2656 2057.3551 -867.1328                         
## m_gamm3_ar1$lme     2 44 -725.2482 -482.6450  406.6241 1 vs 2 2547.5138  <.0001
## m_gamm3_ar2$lme     3 45 -723.7345 -475.6176  406.8672 2 vs 3    0.4862  0.4856
## m_gamm3_ar3$lme     4 46 -724.9241 -471.2935  408.4621 3 vs 4    3.1896  0.0741
## m_gamm3_ar4$lme     5 47 -723.7127 -464.5684  408.8564 4 vs 5    0.7886  0.3745

It looks like both AIC and BIC prefer the AR(1) model. Let’s take a closer look at the AR(1) model.

AR(1) Model

# Pull out the residuals and the GAM model
resid_gamm3_ar1 <- residuals(m_gamm3_ar1$lme, type = "normalized")
m_gamm3_ar1_gam <- m_gamm3_ar1$gam

summary(m_gamm3_ar1_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ Year_fct * Flow * StationCode + s(DOY, k = 5)
## 
## Parametric coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.525e+00  3.065e-01  31.077  < 2e-16 ***
## Year_fct2016                      -6.067e-01  4.585e-01  -1.323 0.185863    
## Year_fct2017                      -4.935e-01  4.661e-01  -1.059 0.289818    
## Year_fct2018                      -2.740e-01  4.248e-01  -0.645 0.519064    
## Year_fct2019                       1.705e-02  4.273e-01   0.040 0.968183    
## Flow                              -6.516e-04  2.937e-04  -2.219 0.026619 *  
## StationCodeSTTD                   -7.842e-01  4.322e-01  -1.815 0.069757 .  
## StationCodeLIB                    -1.671e+00  4.339e-01  -3.852 0.000121 ***
## StationCodeRVB                    -1.779e+00  4.201e-01  -4.235  2.4e-05 ***
## Year_fct2016:Flow                  1.011e-03  5.967e-04   1.694 0.090507 .  
## Year_fct2017:Flow                  1.339e-02  8.870e-03   1.510 0.131306    
## Year_fct2018:Flow                  5.601e-05  5.438e-04   0.103 0.917981    
## Year_fct2019:Flow                 -3.069e-04  4.586e-04  -0.669 0.503503    
## Year_fct2016:StationCodeSTTD       9.683e-01  6.423e-01   1.508 0.131843    
## Year_fct2017:StationCodeSTTD       2.722e-01  6.782e-01   0.401 0.688202    
## Year_fct2018:StationCodeSTTD      -5.477e-01  6.227e-01  -0.880 0.379244    
## Year_fct2019:StationCodeSTTD      -1.146e+00  6.127e-01  -1.870 0.061617 .  
## Year_fct2016:StationCodeLIB        1.710e+00  6.473e-01   2.641 0.008333 ** 
## Year_fct2017:StationCodeLIB        5.028e-01  6.538e-01   0.769 0.441911    
## Year_fct2018:StationCodeLIB       -1.599e+00  7.123e-01  -2.244 0.024924 *  
## Year_fct2019:StationCodeLIB       -7.410e-01  7.159e-01  -1.035 0.300801    
## Year_fct2016:StationCodeRVB        1.035e+00  6.537e-01   1.584 0.113479    
## Year_fct2017:StationCodeRVB        2.833e-01  6.466e-01   0.438 0.661318    
## Year_fct2018:StationCodeRVB       -3.375e-02  6.055e-01  -0.056 0.955553    
## Year_fct2019:StationCodeRVB       -6.441e-01  6.186e-01  -1.041 0.297865    
## Flow:StationCodeSTTD               1.370e-03  5.784e-04   2.369 0.017938 *  
## Flow:StationCodeLIB                7.042e-04  3.021e-04   2.331 0.019857 *  
## Flow:StationCodeRVB                6.539e-04  2.939e-04   2.225 0.026221 *  
## Year_fct2016:Flow:StationCodeSTTD -1.540e-03  9.500e-04  -1.621 0.105090    
## Year_fct2017:Flow:StationCodeSTTD -7.559e-03  9.022e-03  -0.838 0.402250    
## Year_fct2018:Flow:StationCodeSTTD  4.260e-04  9.270e-04   0.460 0.645894    
## Year_fct2019:Flow:StationCodeSTTD  8.869e-04  7.767e-04   1.142 0.253651    
## Year_fct2016:Flow:StationCodeLIB  -8.492e-04  6.085e-04  -1.396 0.162993    
## Year_fct2017:Flow:StationCodeLIB  -1.333e-02  8.871e-03  -1.502 0.133191    
## Year_fct2018:Flow:StationCodeLIB   5.668e-04  5.819e-04   0.974 0.330165    
## Year_fct2019:Flow:StationCodeLIB   4.780e-04  4.908e-04   0.974 0.330269    
## Year_fct2016:Flow:StationCodeRVB  -1.030e-03  5.971e-04  -1.724 0.084817 .  
## Year_fct2017:Flow:StationCodeRVB  -1.340e-02  8.870e-03  -1.511 0.131071    
## Year_fct2018:Flow:StationCodeRVB  -6.689e-05  5.443e-04  -0.123 0.902196    
## Year_fct2019:Flow:StationCodeRVB   2.798e-04  4.591e-04   0.609 0.542337    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(DOY) 3.806  3.806 14.49  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.748   
##   Scale est. = 0.3387    n = 1874
appraise(m_gamm3_ar1_gam)

shapiro.test(resid_gamm3_ar1)
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_gamm3_ar1
## W = 0.81636, p-value < 2.2e-16
k.check(m_gamm3_ar1_gam)
##        k'      edf   k-index p-value
## s(DOY)  4 3.806167 0.6413984       0
draw(m_gamm3_ar1_gam, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gamm3_ar1_gam, pages = 1, all.terms = TRUE)

acf(resid_gamm3_ar1)

Box.test(resid_gamm3_ar1, lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  resid_gamm3_ar1
## X-squared = 52.893, df = 20, p-value = 8.426e-05

The AR(1) model has much less residual autocorrelation, and the diagnostics plots look okay. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

df_gamm3_ar1_fit <- df_chla_c2 %>% 
  fitted_values(m_gamm3_ar1_gam, data = .) %>% 
  mutate(fitted_bt = exp(fitted) / 1000)

plt_gamm3_ar1_obs_fit <- df_gamm3_ar1_fit %>% 
  ggplot(aes(x = fitted_bt, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_gamm3_ar1_obs_fit

Let’s look at each Station-Year combination separately.

plt_gamm3_ar1_obs_fit +
  facet_wrap(
    vars(StationCode, Year_fct),
    ncol = 5,
    scales = "free",
    labeller = labeller(.multi_line = FALSE)
  )

The red lines are the 1:1 ratio between fitted and observed values, and we would expect for most of the points to be near this line if the model has a good fit to the observed data. However, the fitted values from the model don’t appear to match the observed values that well, and there are some unusual patterns when we look at each Station-Year combination separately. I don’t think this GAM model is a good candidate to use for our analysis.

Linear Model with 3-way interactions

Since the GAM model didn’t seem to fit the observed data that well, let’s try a linear model using a three-way interaction between Year, Daily Average Flow, and Station. Initially, we’ll run the model without accounting for serial autocorrelation.

Initial Model

m_lm3 <- df_chla_c2_lag %>% 
  drop_na(Chla_log) %>% 
  lm(Chla_log ~ Year_fct * Flow * StationCode, data = .)

Lets look at the model summary and diagnostics:

summary(m_lm3)
## 
## Call:
## lm(formula = Chla_log ~ Year_fct * Flow * StationCode, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63942 -0.22564 -0.03365  0.19208  1.75991 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.6844243  0.0448372 215.991  < 2e-16 ***
## Year_fct2016                      -0.6624673  0.0673312  -9.839  < 2e-16 ***
## Year_fct2017                      -1.1888491  0.0996337 -11.932  < 2e-16 ***
## Year_fct2018                      -0.3418296  0.0607458  -5.627 2.11e-08 ***
## Year_fct2019                      -0.1257666  0.0602074  -2.089 0.036855 *  
## Flow                              -0.0018894  0.0001884 -10.027  < 2e-16 ***
## StationCodeSTTD                   -1.0664657  0.0578059 -18.449  < 2e-16 ***
## StationCodeLIB                    -2.2872235  0.1491890 -15.331  < 2e-16 ***
## StationCodeRVB                    -2.2211596  0.1077729 -20.610  < 2e-16 ***
## Year_fct2016:Flow                  0.0023191  0.0003084   7.520 8.51e-14 ***
## Year_fct2017:Flow                  0.0462944  0.0050980   9.081  < 2e-16 ***
## Year_fct2018:Flow                  0.0003299  0.0002673   1.234 0.217261    
## Year_fct2019:Flow                 -0.0001397  0.0002337  -0.598 0.549941    
## Year_fct2016:StationCodeSTTD       1.0883529  0.0880500  12.361  < 2e-16 ***
## Year_fct2017:StationCodeSTTD       1.4991116  0.1299212  11.539  < 2e-16 ***
## Year_fct2018:StationCodeSTTD      -0.3045441  0.0828996  -3.674 0.000246 ***
## Year_fct2019:StationCodeSTTD      -0.9349727  0.0814723 -11.476  < 2e-16 ***
## Year_fct2016:StationCodeLIB        1.9616137  0.2700048   7.265 5.49e-13 ***
## Year_fct2017:StationCodeLIB        1.3396481  0.2174563   6.161 8.89e-10 ***
## Year_fct2018:StationCodeLIB       -0.7670441  0.3102772  -2.472 0.013521 *  
## Year_fct2019:StationCodeLIB       -0.2261882  0.2427049  -0.932 0.351487    
## Year_fct2016:StationCodeRVB        1.9708585  0.2207415   8.928  < 2e-16 ***
## Year_fct2017:StationCodeRVB        1.4768042  0.2100894   7.029 2.91e-12 ***
## Year_fct2018:StationCodeRVB       -0.0019108  0.1628429  -0.012 0.990639    
## Year_fct2019:StationCodeRVB       -0.7680363  0.1962837  -3.913 9.45e-05 ***
## Flow:StationCodeSTTD               0.0046042  0.0003152  14.606  < 2e-16 ***
## Flow:StationCodeLIB                0.0017082  0.0002082   8.206 4.25e-16 ***
## Flow:StationCodeRVB                0.0019282  0.0001897  10.163  < 2e-16 ***
## Year_fct2016:Flow:StationCodeSTTD -0.0037128  0.0004693  -7.911 4.39e-15 ***
## Year_fct2017:Flow:StationCodeSTTD -0.0411458  0.0055472  -7.417 1.82e-13 ***
## Year_fct2018:Flow:StationCodeSTTD -0.0007371  0.0004159  -1.772 0.076508 .  
## Year_fct2019:Flow:StationCodeSTTD -0.0012635  0.0003698  -3.417 0.000648 ***
## Year_fct2016:Flow:StationCodeLIB  -0.0021272  0.0003563  -5.970 2.83e-09 ***
## Year_fct2017:Flow:StationCodeLIB  -0.0463035  0.0051002  -9.079  < 2e-16 ***
## Year_fct2018:Flow:StationCodeLIB   0.0005066  0.0004884   1.037 0.299825    
## Year_fct2019:Flow:StationCodeLIB   0.0001596  0.0003482   0.458 0.646694    
## Year_fct2016:Flow:StationCodeRVB  -0.0024657  0.0003099  -7.957 3.07e-15 ***
## Year_fct2017:Flow:StationCodeRVB  -0.0463714  0.0050981  -9.096  < 2e-16 ***
## Year_fct2018:Flow:StationCodeRVB  -0.0003456  0.0002686  -1.287 0.198288    
## Year_fct2019:Flow:StationCodeRVB   0.0001395  0.0002354   0.593 0.553553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3745 on 1834 degrees of freedom
## Multiple R-squared:  0.8512, Adjusted R-squared:  0.848 
## F-statistic: 268.9 on 39 and 1834 DF,  p-value: < 2.2e-16
df_chla_c2_lag %>% 
  drop_na(Chla_log) %>% 
  plot_lm_diag(Chla_log, m_lm3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm3))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm3)
## W = 0.96914, p-value < 2.2e-16
acf(residuals(m_lm3))

Box.test(residuals(m_lm3), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm3)
## X-squared = 3897.6, df = 20, p-value < 2.2e-16

The residuals deviate from a normal distribution according to visual inspection and the Shapiro-Wilk normality test. Also, model definitely has residual autocorrelation as indicated by the ACF plot and the Box-Ljung test.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1, 2, and 3 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_lm3_lag1 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1) %>% 
  lm(Chla_log ~ Year_fct * Flow * StationCode + lag1, data = .)

acf(residuals(m_lm3_lag1))

Box.test(residuals(m_lm3_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm3_lag1)
## X-squared = 59.819, df = 20, p-value = 7.595e-06

Lag 2

m_lm3_lag2 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2, data = .)

acf(residuals(m_lm3_lag2))

Box.test(residuals(m_lm3_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm3_lag2)
## X-squared = 46.699, df = 20, p-value = 0.0006458

Lag 3

m_lm3_lag3 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2, lag3) %>% 
  lm(Chla_log ~ Year_fct * Flow * StationCode + lag1 + lag2 + lag3, data = .)

acf(residuals(m_lm3_lag3))

Box.test(residuals(m_lm3_lag3), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm3_lag3)
## X-squared = 26.866, df = 20, p-value = 0.1391

The model with 3 lag terms looks like it has better ACF and Box-Ljung test results than the other models. Let’s compare the 3 models using AIC and BIC to see which model those prefer.

Compare models

AIC(m_lm3_lag1, m_lm3_lag2, m_lm3_lag3)
##            df       AIC
## m_lm3_lag1 42 -1343.149
## m_lm3_lag2 43 -1328.212
## m_lm3_lag3 44 -1352.918
BIC(m_lm3_lag1, m_lm3_lag2, m_lm3_lag3)
##            df       BIC
## m_lm3_lag1 42 -1111.573
## m_lm3_lag2 43 -1092.047
## m_lm3_lag3 44 -1112.153

Both AIC and BIC prefer the model with 3 lag terms. Let’s take a closer look at that one.

Lag 3 model summary

summary(m_lm3_lag3)
## 
## Call:
## lm(formula = Chla_log ~ Year_fct * Flow * StationCode + lag1 + 
##     lag2 + lag3, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89652 -0.05739 -0.00662  0.05039  1.11198 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1.191e+00  1.103e-01  10.799  < 2e-16 ***
## Year_fct2016                      -9.858e-02  3.059e-02  -3.223 0.001294 ** 
## Year_fct2017                      -1.589e-01  4.601e-02  -3.454 0.000566 ***
## Year_fct2018                      -5.669e-02  2.710e-02  -2.092 0.036584 *  
## Year_fct2019                      -4.812e-03  2.687e-02  -0.179 0.857864    
## Flow                              -3.297e-04  8.442e-05  -3.905 9.78e-05 ***
## StationCodeSTTD                   -1.253e-01  2.819e-02  -4.443 9.44e-06 ***
## StationCodeLIB                    -2.304e-01  7.029e-02  -3.277 0.001069 ** 
## StationCodeRVB                    -2.481e-01  5.311e-02  -4.672 3.21e-06 ***
## lag1                               9.650e-01  2.412e-02  40.016  < 2e-16 ***
## lag2                              -3.050e-02  3.321e-02  -0.919 0.358431    
## lag3                              -5.694e-02  2.389e-02  -2.383 0.017260 *  
## Year_fct2016:Flow                  3.732e-04  1.365e-04   2.735 0.006310 ** 
## Year_fct2017:Flow                  6.200e-03  2.353e-03   2.635 0.008500 ** 
## Year_fct2018:Flow                  1.793e-04  1.167e-04   1.536 0.124654    
## Year_fct2019:Flow                  7.924e-05  1.024e-04   0.774 0.439015    
## Year_fct2016:StationCodeSTTD       1.428e-01  4.096e-02   3.486 0.000503 ***
## Year_fct2017:StationCodeSTTD       2.660e-01  6.001e-02   4.433 9.90e-06 ***
## Year_fct2018:StationCodeSTTD      -3.452e-02  3.698e-02  -0.933 0.350725    
## Year_fct2019:StationCodeSTTD      -1.405e-01  3.789e-02  -3.708 0.000215 ***
## Year_fct2016:StationCodeLIB        2.629e-01  1.208e-01   2.175 0.029743 *  
## Year_fct2017:StationCodeLIB        1.362e-01  9.987e-02   1.364 0.172741    
## Year_fct2018:StationCodeLIB        1.543e-02  1.486e-01   0.104 0.917311    
## Year_fct2019:StationCodeLIB       -6.836e-02  2.610e-01  -0.262 0.793414    
## Year_fct2016:StationCodeRVB        2.661e-01  1.077e-01   2.472 0.013541 *  
## Year_fct2017:StationCodeRVB        1.889e-01  9.380e-02   2.014 0.044143 *  
## Year_fct2018:StationCodeRVB        1.962e-02  7.109e-02   0.276 0.782619    
## Year_fct2019:StationCodeRVB       -1.364e-01  9.245e-02  -1.476 0.140206    
## Flow:StationCodeSTTD               6.714e-04  1.514e-04   4.434 9.84e-06 ***
## Flow:StationCodeLIB                3.450e-04  9.266e-05   3.724 0.000203 ***
## Flow:StationCodeRVB                3.269e-04  8.507e-05   3.843 0.000126 ***
## Year_fct2016:Flow:StationCodeSTTD -5.817e-04  2.122e-04  -2.741 0.006181 ** 
## Year_fct2017:Flow:StationCodeSTTD -3.665e-03  2.588e-03  -1.416 0.156920    
## Year_fct2018:Flow:StationCodeSTTD -2.008e-04  1.860e-04  -1.079 0.280640    
## Year_fct2019:Flow:StationCodeSTTD -2.288e-04  1.667e-04  -1.373 0.170039    
## Year_fct2016:Flow:StationCodeLIB  -3.224e-04  1.577e-04  -2.044 0.041063 *  
## Year_fct2017:Flow:StationCodeLIB  -6.223e-03  2.355e-03  -2.643 0.008297 ** 
## Year_fct2018:Flow:StationCodeLIB   2.158e-05  2.375e-04   0.091 0.927610    
## Year_fct2019:Flow:StationCodeLIB  -1.214e-04  3.530e-04  -0.344 0.730859    
## Year_fct2016:Flow:StationCodeRVB  -3.897e-04  1.375e-04  -2.835 0.004629 ** 
## Year_fct2017:Flow:StationCodeRVB  -6.205e-03  2.354e-03  -2.636 0.008455 ** 
## Year_fct2018:Flow:StationCodeRVB  -1.783e-04  1.173e-04  -1.520 0.128795    
## Year_fct2019:Flow:StationCodeRVB  -7.338e-05  1.032e-04  -0.711 0.477174    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1626 on 1715 degrees of freedom
## Multiple R-squared:  0.9709, Adjusted R-squared:  0.9702 
## F-statistic:  1361 on 42 and 1715 DF,  p-value: < 2.2e-16
df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2, lag3) %>% 
  plot_lm_diag(Chla_log, m_lm3_lag3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm3_lag3))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm3_lag3)
## W = 0.85329, p-value < 2.2e-16
Anova(m_lm3_lag3, type = 3, contrasts = list(topic = contr.sum, sys = contr.sum))
## Anova Table (Type III tests)
## 
## Response: Chla_log
##                           Sum Sq   Df   F value    Pr(>F)    
## (Intercept)                3.084    1  116.6079 < 2.2e-16 ***
## Year_fct                   0.580    4    5.4833  0.000219 ***
## Flow                       0.403    1   15.2507 9.779e-05 ***
## StationCode                0.812    3   10.2328 1.117e-06 ***
## lag1                      42.343    1 1601.2836 < 2.2e-16 ***
## lag2                       0.022    1    0.8438  0.358431    
## lag3                       0.150    1    5.6808  0.017260 *  
## Year_fct:Flow              0.384    4    3.6262  0.005983 ** 
## Year_fct:StationCode       1.631   12    5.1388 1.641e-08 ***
## Flow:StationCode           0.587    3    7.4025 6.303e-05 ***
## Year_fct:Flow:StationCode  0.672   12    2.1187  0.013414 *  
## Residuals                 45.350 1715                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residuals deviate from a normal distribution particularly at the both tails of the distribution. I’m not so sure about using this model. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

df_lm3_lag3_fit <- df_chla_c2_lag %>% 
  drop_na(Chla, lag1, lag2, lag3) %>% 
  mutate(Fitted = as_tibble(predict(m_lm3_lag3, interval = "confidence"))) %>% 
  unnest_wider(Fitted) %>% 
  mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))

plt_lm3_lag3_obs_fit <- df_lm3_lag3_fit %>% 
  ggplot(aes(x = fit, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_lm3_lag3_obs_fit

Let’s look at each Station-Year combination separately.

plt_lm3_lag3_obs_fit +
  facet_wrap(
    vars(StationCode, Year_fct),
    ncol = 5,
    scales = "free",
    labeller = labeller(.multi_line = FALSE)
  )

The red lines are the 1:1 ratio between fitted and observed values, and we would expect for most of the points to be near this line if the model has a good fit to the observed data. The fitted and observed values appear to match pretty well at the lower end of the range of values, but this deteriorates at the mid and higher range values. This pattern holds for some of the separate Station-Year combinations. I’m not sure this linear model is a good candidate to use for our analysis.

GAM Model with 2-way interactions

Because the models using 3-way interactions don’t seem to fit the data well, we will attempt to fit a generalized additive model (GAM) to the data set using all two-way interactions between Year, Daily Average Flow, and Station, with a smooth term for day of year to account for seasonality. Initially, we’ll run the GAM without accounting for serial autocorrelation.

Initial Model

m_gam2 <- gam(
  Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5), 
  data = df_chla_c2,
  method = "REML"
)

Lets look at the model summary and diagnostics:

summary(m_gam2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)
## 
## Parametric coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   9.687e+00  3.720e-02 260.381  < 2e-16 ***
## Year_fct2016                 -5.422e-01  5.369e-02 -10.098  < 2e-16 ***
## Year_fct2017                 -4.293e-01  5.124e-02  -8.377  < 2e-16 ***
## Year_fct2018                 -3.173e-01  4.772e-02  -6.650 3.86e-11 ***
## Year_fct2019                 -1.779e-01  4.815e-02  -3.694 0.000227 ***
## Flow                         -1.438e-03  9.087e-05 -15.824  < 2e-16 ***
## StationCodeSTTD              -9.908e-01  5.067e-02 -19.553  < 2e-16 ***
## StationCodeLIB               -1.959e+00  1.108e-01 -17.684  < 2e-16 ***
## StationCodeRVB               -2.136e+00  9.771e-02 -21.860  < 2e-16 ***
## Year_fct2016:Flow            -4.743e-05  3.105e-05  -1.528 0.126764    
## Year_fct2017:Flow            -4.247e-05  2.509e-05  -1.693 0.090640 .  
## Year_fct2018:Flow            -1.286e-05  2.536e-05  -0.507 0.612267    
## Year_fct2019:Flow             1.000e-05  2.711e-05   0.369 0.712110    
## Year_fct2016:StationCodeSTTD  7.459e-01  7.431e-02  10.038  < 2e-16 ***
## Year_fct2017:StationCodeSTTD  4.306e-01  7.903e-02   5.449 5.76e-08 ***
## Year_fct2018:StationCodeSTTD -3.358e-01  7.154e-02  -4.694 2.88e-06 ***
## Year_fct2019:StationCodeSTTD -9.471e-01  7.007e-02 -13.516  < 2e-16 ***
## Year_fct2016:StationCodeLIB   1.303e+00  8.601e-02  15.152  < 2e-16 ***
## Year_fct2017:StationCodeLIB   4.316e-01  8.611e-02   5.012 5.92e-07 ***
## Year_fct2018:StationCodeLIB  -1.431e+00  1.135e-01 -12.609  < 2e-16 ***
## Year_fct2019:StationCodeLIB  -4.886e-01  1.182e-01  -4.135 3.71e-05 ***
## Year_fct2016:StationCodeRVB   8.502e-01  2.321e-01   3.663 0.000257 ***
## Year_fct2017:StationCodeRVB   4.864e-01  1.847e-01   2.633 0.008531 ** 
## Year_fct2018:StationCodeRVB   3.244e-02  1.527e-01   0.212 0.831772    
## Year_fct2019:StationCodeRVB  -7.110e-01  1.831e-01  -3.884 0.000106 ***
## Flow:StationCodeSTTD          3.329e-03  1.253e-04  26.579  < 2e-16 ***
## Flow:StationCodeLIB           1.465e-03  1.114e-04  13.151  < 2e-16 ***
## Flow:StationCodeRVB           1.460e-03  8.962e-05  16.292  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df     F p-value    
## s(DOY) 3.88  3.992 93.39  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.858   Deviance explained = 86.1%
## -REML = 869.89  Scale est. = 0.13071   n = 1874
appraise(m_gam2)

shapiro.test(residuals(m_gam2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_gam2)
## W = 0.96845, p-value < 2.2e-16
k.check(m_gam2)
##        k'      edf   k-index p-value
## s(DOY)  4 3.880408 0.9386901       0
draw(m_gam2, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gam2, pages = 1, all.terms = TRUE)

acf(residuals(m_gam2))

Box.test(residuals(m_gam2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_gam2)
## X-squared = 3957.6, df = 20, p-value < 2.2e-16

Model with AR() correlation structure

Now, we’ll try to deal with the residual autocorrelation. We’ll run AR(1), AR(2), AR(3), and AR(4) models and compare them using AIC.

# Define model formula as an object
f_gam2 <- as.formula("Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)")

# Fit original model with k = 5 and three successive AR(p) models
m_gamm2 <- gamm(
  f_gam2, 
  data = df_chla_c2,
  method = "REML"
)

m_gamm2_ar1 <- gamm(
  f_gam2, 
  data = df_chla_c2, 
  correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 1), # grouped by Year_fct and StationCode
  method = "REML"
)

m_gamm2_ar2 <- gamm(
  f_gam2, 
  data = df_chla_c2, 
  correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 2),
  method = "REML"
)

m_gamm2_ar3 <- gamm(
  f_gam2, 
  data = df_chla_c2, 
  correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 3),
  method = "REML"
)

m_gamm2_ar4 <- gamm(
  f_gam2, 
  data = df_chla_c2, 
  correlation = corARMA(form = ~ 1|Year_fct/StationCode, p = 4),
  method = "REML"
)

# Compare models
anova(
  m_gamm2$lme, 
  m_gamm2_ar1$lme, 
  m_gamm2_ar2$lme, 
  m_gamm2_ar3$lme,
  m_gamm2_ar4$lme
)
##                 Model df       AIC       BIC    logLik   Test   L.Ratio p-value
## m_gamm2$lme         1 31 1801.7891 1972.9163 -869.8945                         
## m_gamm2_ar1$lme     2 32 -880.0401 -703.3926  472.0201 1 vs 2 2683.8292  <.0001
## m_gamm2_ar2$lme     3 33 -878.0514 -695.8836  472.0257 2 vs 3    0.0112  0.9156
## m_gamm2_ar3$lme     4 34 -878.0388 -690.3508  473.0194 3 vs 4    1.9874  0.1586
## m_gamm2_ar4$lme     5 35 -877.7177 -684.5095  473.8589 4 vs 5    1.6789  0.1951

It looks like the AR(1) model has the best fit according to the AIC and BIC values. Let’s take a closer look at the AR(1) model.

AR(1) Model

# Pull out the residuals and the GAM model
resid_gamm2_ar1 <- residuals(m_gamm2_ar1$lme, type = "normalized")
m_gamm2_ar1_gam <- m_gamm2_ar1$gam

summary(m_gamm2_ar1_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Chla_log ~ (Year_fct + Flow + StationCode)^2 + s(DOY, k = 5)
## 
## Parametric coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   9.524e+00  4.705e-01  20.245  < 2e-16 ***
## Year_fct2016                 -4.745e-01  6.930e-01  -0.685 0.493589    
## Year_fct2017                 -1.946e-01  6.718e-01  -0.290 0.772141    
## Year_fct2018                 -2.427e-01  6.521e-01  -0.372 0.709814    
## Year_fct2019                  7.054e-02  6.557e-01   0.108 0.914341    
## Flow                         -5.558e-04  1.911e-04  -2.908 0.003687 ** 
## StationCodeSTTD              -7.488e-01  6.649e-01  -1.126 0.260262    
## StationCodeLIB               -1.444e+00  6.549e-01  -2.204 0.027614 *  
## StationCodeRVB               -1.664e+00  6.463e-01  -2.575 0.010103 *  
## Year_fct2016:Flow            -1.617e-05  1.991e-05  -0.812 0.416926    
## Year_fct2017:Flow            -4.077e-06  2.028e-05  -0.201 0.840700    
## Year_fct2018:Flow            -2.836e-06  2.191e-05  -0.129 0.897009    
## Year_fct2019:Flow            -2.267e-05  2.043e-05  -1.110 0.267297    
## Year_fct2016:StationCodeSTTD  8.515e-01  9.770e-01   0.872 0.383585    
## Year_fct2017:StationCodeSTTD -2.845e-01  9.961e-01  -0.286 0.775213    
## Year_fct2018:StationCodeSTTD -6.349e-01  9.530e-01  -0.666 0.505336    
## Year_fct2019:StationCodeSTTD -1.204e+00  9.406e-01  -1.280 0.200857    
## Year_fct2016:StationCodeLIB   1.355e+00  9.566e-01   1.417 0.156753    
## Year_fct2017:StationCodeLIB   6.824e-02  9.518e-01   0.072 0.942850    
## Year_fct2018:StationCodeLIB  -2.291e+00  1.025e+00  -2.234 0.025580 *  
## Year_fct2019:StationCodeLIB  -1.047e+00  1.039e+00  -1.008 0.313616    
## Year_fct2016:StationCodeRVB   9.658e-01  9.787e-01   0.987 0.323854    
## Year_fct2017:StationCodeRVB  -5.525e-02  9.433e-01  -0.059 0.953298    
## Year_fct2018:StationCodeRVB  -1.545e-01  9.189e-01  -0.168 0.866506    
## Year_fct2019:StationCodeRVB  -7.854e-01  9.363e-01  -0.839 0.401700    
## Flow:StationCodeSTTD          1.554e-03  3.081e-04   5.044 5.01e-07 ***
## Flow:StationCodeLIB           7.168e-04  1.967e-04   3.645 0.000275 ***
## Flow:StationCodeRVB           5.545e-04  1.911e-04   2.902 0.003758 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(DOY) 3.864  3.864 27.52  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.643   
##   Scale est. = 0.56753   n = 1874
appraise(m_gamm2_ar1_gam)

shapiro.test(resid_gamm2_ar1)
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_gamm2_ar1
## W = 0.80125, p-value < 2.2e-16
k.check(m_gamm2_ar1_gam)
##        k'      edf   k-index p-value
## s(DOY)  4 3.864181 0.4865552       0
draw(m_gamm2_ar1_gam, select = 1, residuals = TRUE, rug = FALSE)

plot(m_gamm2_ar1_gam, pages = 1, all.terms = TRUE)

acf(resid_gamm2_ar1)

Box.test(resid_gamm2_ar1, lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  resid_gamm2_ar1
## X-squared = 47.503, df = 20, p-value = 0.0004993

The AR(1) model has much less residual autocorrelation, and the diagnostics plots look okay. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

df_gamm2_ar1_fit <- df_chla_c2 %>% 
  fitted_values(m_gamm2_ar1_gam, data = .) %>% 
  mutate(fitted_bt = exp(fitted) / 1000)

plt_gamm2_ar1_obs_fit <- df_gamm2_ar1_fit %>% 
  ggplot(aes(x = fitted_bt, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_gamm2_ar1_obs_fit

Let’s look at each Station and Year separately.

plt_gamm2_ar1_obs_fit + facet_wrap(vars(StationCode), scales = "free")

plt_gamm2_ar1_obs_fit + facet_wrap(vars(Year_fct), scales = "free")

The red lines are the 1:1 ratio between fitted and observed values, and we would expect for most of the points to be near this line if the model has a good fit to the observed data. However, the fitted values from the model don’t appear to match the observed values that well. Again, I don’t think this GAM model is a good candidate to use for our analysis.

Linear Model with 2-way interactions

Since the models using 3-way interactions don’t seem to fit the observed data that well, let’s try a linear model using all two-way interactions between Year, Daily Average Flow, and Station. Initially, we’ll run the model without accounting for serial autocorrelation.

Initial Model

m_lm2 <- df_chla_c2_lag %>% 
  drop_na(Chla_log) %>% 
  lm(Chla_log ~ (Year_fct + Flow + StationCode)^2, data = .)

Lets look at the model summary and diagnostics:

summary(m_lm2)
## 
## Call:
## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65329 -0.23920 -0.03958  0.20123  1.96767 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   9.632e+00  4.015e-02 239.900  < 2e-16 ***
## Year_fct2016                 -3.555e-01  5.707e-02  -6.228 5.83e-10 ***
## Year_fct2017                 -4.068e-01  5.586e-02  -7.284 4.79e-13 ***
## Year_fct2018                 -2.925e-01  5.213e-02  -5.611 2.31e-08 ***
## Year_fct2019                 -1.467e-01  5.255e-02  -2.791 0.005313 ** 
## Flow                         -1.525e-03  9.636e-05 -15.827  < 2e-16 ***
## StationCodeSTTD              -9.929e-01  5.532e-02 -17.948  < 2e-16 ***
## StationCodeLIB               -2.094e+00  1.172e-01 -17.863  < 2e-16 ***
## StationCodeRVB               -2.154e+00  1.067e-01 -20.194  < 2e-16 ***
## Year_fct2016:Flow            -1.258e-04  3.148e-05  -3.996 6.70e-05 ***
## Year_fct2017:Flow            -7.326e-05  2.717e-05  -2.696 0.007076 ** 
## Year_fct2018:Flow            -8.557e-06  2.722e-05  -0.314 0.753278    
## Year_fct2019:Flow            -1.301e-05  2.923e-05  -0.445 0.656205    
## Year_fct2016:StationCodeSTTD  7.413e-01  8.117e-02   9.133  < 2e-16 ***
## Year_fct2017:StationCodeSTTD  5.377e-01  8.588e-02   6.261 4.75e-10 ***
## Year_fct2018:StationCodeSTTD -3.377e-01  7.771e-02  -4.345 1.47e-05 ***
## Year_fct2019:StationCodeSTTD -9.913e-01  7.655e-02 -12.950  < 2e-16 ***
## Year_fct2016:StationCodeLIB   1.196e+00  9.326e-02  12.829  < 2e-16 ***
## Year_fct2017:StationCodeLIB   4.439e-01  9.307e-02   4.769 1.99e-06 ***
## Year_fct2018:StationCodeLIB  -1.446e+00  1.205e-01 -12.006  < 2e-16 ***
## Year_fct2019:StationCodeLIB  -3.061e-01  1.257e-01  -2.436 0.014943 *  
## Year_fct2016:StationCodeRVB   1.500e+00  2.213e-01   6.779 1.62e-11 ***
## Year_fct2017:StationCodeRVB   6.775e-01  1.990e-01   3.404 0.000679 ***
## Year_fct2018:StationCodeRVB  -9.430e-02  1.626e-01  -0.580 0.561995    
## Year_fct2019:StationCodeRVB  -6.203e-01  1.969e-01  -3.151 0.001653 ** 
## Flow:StationCodeSTTD          3.304e-03  1.368e-04  24.145  < 2e-16 ***
## Flow:StationCodeLIB           1.434e-03  1.151e-04  12.458  < 2e-16 ***
## Flow:StationCodeRVB           1.560e-03  9.489e-05  16.443  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3952 on 1846 degrees of freedom
## Multiple R-squared:  0.8332, Adjusted R-squared:  0.8308 
## F-statistic: 341.5 on 27 and 1846 DF,  p-value: < 2.2e-16
df_chla_c2_lag %>% 
  drop_na(Chla_log) %>% 
  plot_lm_diag(Chla_log, m_lm2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm2)
## W = 0.96829, p-value < 2.2e-16
acf(residuals(m_lm2))

Box.test(residuals(m_lm2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm2)
## X-squared = 4598.1, df = 20, p-value < 2.2e-16

The residuals deviate from a normal distribution according to visual inspection and the Shapiro-Wilk normality test. Also, model definitely has residual autocorrelation as indicated by the ACF plot and the Box-Ljung test.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1, 2, and 3 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_lm2_lag1 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1) %>% 
  lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1, data = .)

acf(residuals(m_lm2_lag1))

Box.test(residuals(m_lm2_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm2_lag1)
## X-squared = 59.714, df = 20, p-value = 7.884e-06

Lag 2

m_lm2_lag2 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2, data = .)

acf(residuals(m_lm2_lag2))

Box.test(residuals(m_lm2_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm2_lag2)
## X-squared = 45.379, df = 20, p-value = 0.00098

Lag 3

m_lm2_lag3 <- df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2, lag3) %>% 
  lm(Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + lag2 + lag3, data = .)

acf(residuals(m_lm2_lag3))

Box.test(residuals(m_lm2_lag3), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm2_lag3)
## X-squared = 25.678, df = 20, p-value = 0.1767

The model with 3 lag terms looks like it has better ACF and Box-Ljung test results than the other models. Let’s compare the 3 models using AIC and BIC to see which model those prefer.

Compare models

AIC(m_lm2_lag1, m_lm2_lag2, m_lm2_lag3)
##            df       AIC
## m_lm2_lag1 30 -1341.836
## m_lm2_lag2 31 -1328.270
## m_lm2_lag3 32 -1351.047
BIC(m_lm2_lag1, m_lm2_lag2, m_lm2_lag3)
##            df       BIC
## m_lm2_lag1 30 -1176.425
## m_lm2_lag2 31 -1158.012
## m_lm2_lag3 32 -1175.945

AIC prefers the model with 3 lag terms while BIC slightly prefers the model with 1 lag term. Let’s take a closer look at the model with 3 lag terms since it had better ACF and Box-Ljung test results.

Lag 3 model summary

summary(m_lm2_lag3)
## 
## Call:
## lm(formula = Chla_log ~ (Year_fct + Flow + StationCode)^2 + lag1 + 
##     lag2 + lag3, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88352 -0.05588 -0.00615  0.05095  1.13186 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.021e+00  1.030e-01   9.917  < 2e-16 ***
## Year_fct2016                 -4.245e-02  2.424e-02  -1.751 0.080137 .  
## Year_fct2017                 -3.954e-02  2.399e-02  -1.648 0.099540 .  
## Year_fct2018                 -2.729e-02  2.205e-02  -1.237 0.216141    
## Year_fct2019                  8.278e-03  2.223e-02   0.372 0.709665    
## Flow                         -1.826e-04  4.293e-05  -4.254 2.21e-05 ***
## StationCodeSTTD              -8.930e-02  2.560e-02  -3.489 0.000497 ***
## StationCodeLIB               -1.633e-01  5.564e-02  -2.936 0.003371 ** 
## StationCodeRVB               -1.892e-01  4.989e-02  -3.793 0.000154 ***
## lag1                          9.766e-01  2.404e-02  40.624  < 2e-16 ***
## lag2                         -3.105e-02  3.327e-02  -0.933 0.350706    
## lag3                         -5.219e-02  2.385e-02  -2.189 0.028766 *  
## Year_fct2016:Flow            -1.045e-05  1.389e-05  -0.752 0.452144    
## Year_fct2017:Flow            -2.358e-06  1.139e-05  -0.207 0.836010    
## Year_fct2018:Flow             3.587e-06  1.136e-05   0.316 0.752176    
## Year_fct2019:Flow             5.525e-06  1.265e-05   0.437 0.662376    
## Year_fct2016:StationCodeSTTD  7.295e-02  3.526e-02   2.069 0.038693 *  
## Year_fct2017:StationCodeSTTD  7.972e-02  3.727e-02   2.139 0.032553 *  
## Year_fct2018:StationCodeSTTD -5.016e-02  3.289e-02  -1.525 0.127423    
## Year_fct2019:StationCodeSTTD -1.443e-01  3.397e-02  -4.249 2.26e-05 ***
## Year_fct2016:StationCodeLIB   1.040e-01  4.189e-02   2.482 0.013175 *  
## Year_fct2017:StationCodeLIB   3.142e-02  3.948e-02   0.796 0.426219    
## Year_fct2018:StationCodeLIB  -1.231e-01  5.723e-02  -2.152 0.031553 *  
## Year_fct2019:StationCodeLIB  -4.936e-02  6.070e-02  -0.813 0.416244    
## Year_fct2016:StationCodeRVB   1.633e-01  1.019e-01   1.602 0.109254    
## Year_fct2017:StationCodeRVB   6.267e-02  8.373e-02   0.748 0.454265    
## Year_fct2018:StationCodeRVB  -1.640e-02  6.745e-02  -0.243 0.807882    
## Year_fct2019:StationCodeRVB  -1.252e-01  8.814e-02  -1.420 0.155702    
## Flow:StationCodeSTTD          3.908e-04  6.614e-05   5.909 4.14e-09 ***
## Flow:StationCodeLIB           2.095e-04  5.088e-05   4.118 4.00e-05 ***
## Flow:StationCodeRVB           1.777e-04  4.250e-05   4.181 3.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1632 on 1727 degrees of freedom
## Multiple R-squared:  0.9704, Adjusted R-squared:  0.9699 
## F-statistic:  1890 on 30 and 1727 DF,  p-value: < 2.2e-16
df_chla_c2_lag %>% 
  drop_na(Chla_log, lag1, lag2, lag3) %>% 
  plot_lm_diag(Chla_log, m_lm2_lag3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm2_lag3))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm2_lag3)
## W = 0.85565, p-value < 2.2e-16
Anova(m_lm2_lag3, type = 3, contrasts = list(topic = contr.sum, sys = contr.sum))
## Anova Table (Type III tests)
## 
## Response: Chla_log
##                      Sum Sq   Df   F value    Pr(>F)    
## (Intercept)           2.621    1   98.3416 < 2.2e-16 ***
## Year_fct              0.202    4    1.8943   0.10888    
## Flow                  0.482    1   18.0954 2.214e-05 ***
## StationCode           0.586    3    7.3292 6.992e-05 ***
## lag1                 43.978    1 1650.2819 < 2.2e-16 ***
## lag2                  0.023    1    0.8714   0.35071    
## lag3                  0.128    1    4.7895   0.02877 *  
## Year_fct:Flow         0.052    4    0.4873   0.74508    
## Year_fct:StationCode  1.309   12    4.0938 2.477e-06 ***
## Flow:StationCode      0.975    3   12.1909 6.790e-08 ***
## Residuals            46.023 1727                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As with the model with 3-way interactions, the residuals deviate from a normal distribution particularly at the both tails of the distribution. I’m not so sure about using this model. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

df_lm2_lag3_fit <- df_chla_c2_lag %>% 
  drop_na(Chla, lag1, lag2, lag3) %>% 
  mutate(Fitted = as_tibble(predict(m_lm2_lag3, interval = "confidence"))) %>% 
  unnest_wider(Fitted) %>% 
  mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))

plt_lm2_lag3_obs_fit <- df_lm2_lag3_fit %>% 
  ggplot(aes(x = fit, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_lm2_lag3_obs_fit

Let’s look at each Station and Year separately.

plt_lm2_lag3_obs_fit + facet_wrap(vars(StationCode), scales = "free")

plt_lm2_lag3_obs_fit + facet_wrap(vars(Year_fct), scales = "free")

The red lines are the 1:1 ratio between fitted and observed values, and we would expect for most of the points to be near this line if the model has a good fit to the observed data. The fitted and observed values appear to match pretty well at the lower end of the range of values, but this deteriorates at the mid and higher range values. This pattern holds for some of the individual Stations and Years. I’m not sure this linear model is a good candidate to use for our analysis.

Linear Model with single station - STTD

Since none of the models we tried so far seemed to fit the observed data very well, it may be interesting to run separate models for each of the four stations - RD22, STTD, LIB, RVB - to see if that helps improve model fit. We’ll try STTD as a test case.

Initial Model

Let’s try a linear model using a two-way interaction between Year and Daily Average Flow. Initially, we’ll run the model without accounting for serial autocorrelation.

df_sttd <- df_chla_c2_lag %>% filter(StationCode == "STTD")

m_lm2_sttd <- df_sttd %>% 
  drop_na(Chla_log) %>% 
  lm(Chla_log ~ Year_fct * Flow, data = .)

Lets look at the model summary and diagnostics:

summary(m_lm2_sttd)
## 
## Call:
## lm(formula = Chla_log ~ Year_fct * Flow, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7854 -0.1724 -0.0296  0.1377  1.5439 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.6179586  0.0295402 291.736  < 2e-16 ***
## Year_fct2016       0.4258856  0.0459390   9.271  < 2e-16 ***
## Year_fct2017       0.3102626  0.0675111   4.596 5.71e-06 ***
## Year_fct2018      -0.6463737  0.0456741 -14.152  < 2e-16 ***
## Year_fct2019      -1.0607392  0.0444413 -23.868  < 2e-16 ***
## Flow               0.0027149  0.0002046  13.269  < 2e-16 ***
## Year_fct2016:Flow -0.0013937  0.0002865  -4.865 1.62e-06 ***
## Year_fct2017:Flow  0.0051487  0.0017705   2.908  0.00383 ** 
## Year_fct2018:Flow -0.0004072  0.0002580  -1.578  0.11523    
## Year_fct2019:Flow -0.0014032  0.0002321  -6.047 3.26e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3032 on 421 degrees of freedom
## Multiple R-squared:  0.809,  Adjusted R-squared:  0.805 
## F-statistic: 198.2 on 9 and 421 DF,  p-value: < 2.2e-16
df_sttd %>% 
  drop_na(Chla_log) %>% 
  plot_lm_diag(Chla_log, m_lm2_sttd)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm2_sttd))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm2_sttd)
## W = 0.94651, p-value = 2.328e-11
acf(residuals(m_lm2_sttd))

Box.test(residuals(m_lm2_sttd), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm2_sttd)
## X-squared = 463.9, df = 20, p-value < 2.2e-16

The residuals deviate from a normal distribution according to visual inspection and the Shapiro-Wilk normality test. Also, model definitely has residual autocorrelation as indicated by the ACF plot and the Box-Ljung test.

Model with lag terms

Now, we’ll try to deal with the residual autocorrelation and the non-normal residuals. We’ll run a series of linear models adding 1, 2, and 3 lag terms and compare how well they correct for autocorrelation.

Lag 1

m_lm2_sttd_lag1 <- df_sttd %>% 
  drop_na(Chla_log, lag1) %>% 
  lm(Chla_log ~ Year_fct * Flow + lag1, data = .)

acf(residuals(m_lm2_sttd_lag1))

Box.test(residuals(m_lm2_sttd_lag1), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm2_sttd_lag1)
## X-squared = 37.18, df = 20, p-value = 0.01113

Lag 2

m_lm2_sttd_lag2 <- df_sttd %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  lm(Chla_log ~ Year_fct * Flow + lag1 + lag2, data = .)

acf(residuals(m_lm2_sttd_lag2))

Box.test(residuals(m_lm2_sttd_lag2), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm2_sttd_lag2)
## X-squared = 18.734, df = 20, p-value = 0.5392

Lag 3

m_lm2_sttd_lag3 <- df_sttd %>% 
  drop_na(Chla_log, lag1, lag2, lag3) %>% 
  lm(Chla_log ~ Year_fct * Flow + lag1 + lag2 + lag3, data = .)

acf(residuals(m_lm2_sttd_lag3))

Box.test(residuals(m_lm2_sttd_lag3), lag = 20, type = 'Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  residuals(m_lm2_sttd_lag3)
## X-squared = 10.813, df = 20, p-value = 0.9509

The models with 2 and 3 lag terms seem to take care of the autocorrelation. Let’s compare the 3 models using AIC and BIC to see which model those prefer.

Compare models

AIC(m_lm2_sttd_lag1, m_lm2_sttd_lag2, m_lm2_sttd_lag3)
##                 df       AIC
## m_lm2_sttd_lag1 12 -328.2932
## m_lm2_sttd_lag2 13 -350.6079
## m_lm2_sttd_lag3 14 -340.5105
BIC(m_lm2_sttd_lag1, m_lm2_sttd_lag2, m_lm2_sttd_lag3)
##                 df       BIC
## m_lm2_sttd_lag1 12 -279.7532
## m_lm2_sttd_lag2 13 -298.3031
## m_lm2_sttd_lag3 14 -284.4561

Both AIC and BIC prefer the model with 2 lag terms. Let’s take a closer look at that one.

Lag 2 model summary

summary(m_lm2_sttd_lag2)
## 
## Call:
## lm(formula = Chla_log ~ Year_fct * Flow + lag1 + lag2, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88426 -0.06597 -0.00901  0.05116  0.81176 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.612e+00  2.284e-01   7.058 7.50e-12 ***
## Year_fct2016       6.863e-02  2.667e-02   2.573 0.010442 *  
## Year_fct2017       1.177e-01  3.644e-02   3.230 0.001339 ** 
## Year_fct2018      -1.292e-01  2.907e-02  -4.445 1.14e-05 ***
## Year_fct2019      -2.085e-01  3.616e-02  -5.766 1.62e-08 ***
## Flow               5.131e-04  1.324e-04   3.875 0.000124 ***
## lag1               1.109e+00  4.925e-02  22.527  < 2e-16 ***
## lag2              -2.956e-01  4.819e-02  -6.133 2.07e-09 ***
## Year_fct2016:Flow -2.822e-04  1.555e-04  -1.816 0.070186 .  
## Year_fct2017:Flow  2.695e-03  1.001e-03   2.693 0.007380 ** 
## Year_fct2018:Flow -5.917e-05  1.375e-04  -0.430 0.667197    
## Year_fct2019:Flow -2.500e-04  1.302e-04  -1.920 0.055609 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1557 on 401 degrees of freedom
## Multiple R-squared:  0.9497, Adjusted R-squared:  0.9483 
## F-statistic: 687.9 on 11 and 401 DF,  p-value: < 2.2e-16
df_sttd %>% 
  drop_na(Chla_log, lag1, lag2) %>% 
  plot_lm_diag(Chla_log, m_lm2_sttd_lag2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

shapiro.test(residuals(m_lm2_sttd_lag2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m_lm2_sttd_lag2)
## W = 0.86609, p-value < 2.2e-16

As with all the other linear models, the residuals deviate from a normal distribution particularly at the both tails of the distribution. I’m not so sure about using this model. Let’s take a closer look at how the back-transformed fitted values from the model match the observed values.

df_lm2_sttd_lag2_fit <- df_sttd %>% 
  drop_na(Chla, lag1, lag2) %>% 
  mutate(Fitted = as_tibble(predict(m_lm2_sttd_lag2, interval = "confidence"))) %>% 
  unnest_wider(Fitted) %>% 
  mutate(across(c("fit", "lwr", "upr"), ~ exp(.x) / 1000))

plt_lm2_sttd_lag2_obs_fit <- df_lm2_sttd_lag2_fit %>% 
  ggplot(aes(x = fit, y = Chla)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  theme_bw() +
  labs(
    x = "Back-transformed Fitted Values",
    y = "Observed Values"
  )

plt_lm2_sttd_lag2_obs_fit

Let’s look at each Year separately.

plt_lm2_sttd_lag2_obs_fit + facet_wrap(vars(Year_fct), scales = "free")

The red lines are the 1:1 ratio between fitted and observed values, and we would expect for most of the points to be near this line if the model has a good fit to the observed data. The fitted and observed values appear to match pretty well at the lower end of the range of values, but this deteriorates at the higher values. This pattern holds for some of the individual Years. I’m not sure this linear model is a good candidate to use for our analysis.

Conclusion

None of the models we tried in this analysis seemed to fit the observed data very well. It looks like daily average flow has an effect on chlorophyll concentrations at some of the stations, but flow doesn’t seem to be a good predictor overall. Unfortunately, I don’t think we should use any of these models in our analysis.