Sometimes you have data that is categorized as “low”, “medium”, and “high”. You can turn these into factors for use in a model, but the model will treat them as unrelated, whereas there is probably an underlying distribution behind those values. The probability of “high” is probably closer to the probability of “medium” than to “low”. Or something like that. In order to include this information in the model, you need an ordinal regression.
This figure, and much of the material in this tutorial comes from:
Bürkner, P.-C., and M. Vuorre. 2019. Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science 2(1):77-101. doi:https://doi.org/10.1177/2515245918823199
This is another good tutorial: https://www.r-bloggers.com/2019/06/how-to-perform-ordinal-logistic-regression-in-r/
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(devtools)
## Loading required package: usethis
#install_github("sbashevkin/discretewq")
library(discretewq)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(brms)
## Loading required package: Rcpp
## Loading 'brms' package (version 2.16.3). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').
##
## Attaching package: 'brms'
## The following object is masked from 'package:stats':
##
## ar
library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ordinal)
## Warning: package 'ordinal' was built under R version 4.1.3
##
## Attaching package: 'ordinal'
## The following objects are masked from 'package:brms':
##
## ranef, VarCorr
## The following object is masked from 'package:dplyr':
##
## slice
library(emmeans)
##
## Attaching package: 'emmeans'
## The following object is masked from 'package:devtools':
##
## test
Oridinal regressions are basically extentions of logistic regressions. In simple logistic regression, log of odds that an event occurs is modeled as a linear combination of the independent variables. Ordinal regression uses cumulative events for the log of the odds computation. It means that unlike simple logistic regression, ordinal logistic models consider the probability of an event and all the events that are below the focal event in the ordered hierarchy.
There are a number of different types of distributions that can be used with ordinal data.
For this example, we are toing to be using visual Microcystis scores collected from various IEP surveys.
Flynn, T., P. Lehman, S. Lesmeister, and S. Waller. 2022. A Visual Scale for Microcystis Bloom Severity. figshare Figure. doi:https://doi.org/10.6084/m9.figshare.19239882.v1
In this case, the cummulative distribution is probably what we want, since “low”, “medium”, and “high” are mutually exclusive.
Now let’s get the data. We can use Sam B’s fantastic integrated water quality dataset to automatically download all the survyes the collect Microcystis observations. We’ll just do the past five years of data, and limit it to observations in summer and fall, when microcystis is most common.
WQ = wq(Sources = c("FMWT", "STN", "EMP"), Start_year = 2015, End_year = 2021) %>%
dplyr::select(Source, Station, Latitude, Longitude, Date, Depth, Microcystis,
Secchi, Temperature, Salinity, Conductivity, Month, Season) %>%
filter(!is.na(Microcystis), Season %in% c("Summer", "Fall"))
View(WQ)
ggplot(WQ, aes(x = as.factor(Microcystis), y = Temperature)) + geom_boxplot()
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).
Cool! So it looks as though there is probably a trend between Microcystis observations and water temperature. However, there is some ammount of variation in how people judge microcystis levels. Let’s convert from 5 categories to 3 categories, to try and reduce variability.
WQ = mutate(WQ, Mic = case_when(Microcystis == 1 ~ "Absent",
Microcystis %in% c(2,3) ~ "Low",
Microcystis %in% c(4,5)~ "High"),
Mic = factor(Mic, levels = c("Absent", "Low", "High"), ordered = T))
ggplot(WQ, aes(x = Mic, y = Temperature)) + geom_boxplot()
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).
OK, now we can set up our model of temperature versus Microcystis. If we have a simple model, we can use the ‘polr’ function in the MASS package, but the package ‘ordinal’ allows for mixed models and more complicated controls.
mod1 = clm(Mic~Temperature, data = WQ)
summary(mod1)
## formula: Mic ~ Temperature
## data: WQ
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 4812 -3065.90 6137.79 6(0) 3.00e-11 7.8e+04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Temperature 0.34248 0.01437 23.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Absent|Low 7.9649 0.3054 26.08
## Low|High 10.4682 0.3242 32.29
## (9 observations deleted due to missingness)
The table displays the value of coefficients and intercepts, and corresponding standard errors z values, and p-values. Holding everything else constant, an increase in Temperature by one degree increases the log odds of Microcystis by 0.3.
THere are multiple values of intercepts depending on the levels of intercept. The intercepts can be interpreted as the expected odds when others variables assume a value of zero. So the Absent|Low intercept takes value of 7.9, indicating that the expected odds of absence when temperature is zero is 7.9 times greater than for Low.
We can plot the effect of temperature:
#Plotting the effects
Effect(focal.predictors = "Temperature",mod1)
##
## Temperature effect (probability) for Absent
## Temperature
## 11 16 20 24 28
## 0.9851913 0.9230985 0.7531129 0.4366813 0.1645759
##
## Temperature effect (probability) for Low
## Temperature
## 11 16 20 24 28
## 0.01358052 0.07013209 0.22076788 0.46785638 0.54199068
##
## Temperature effect (probability) for High
## Temperature
## 11 16 20 24 28
## 0.001228230 0.006769439 0.026119191 0.095462303 0.293433450
plot(Effect(focal.predictors ="Temperature",mod1))
This really lets us see how as temperature increases, probability of High microcystis increases, probability of Low microcystis increases, then levels off, and probability of Absence decreases.
However, there area probably some issues with this model. Temperature is highly correlated with day of the year, and there is probably some spatial autocorrelation going on too. A better model would include multiple predictors and random effects.
#first let's calculate day of year and year to include in the model as a random effect
WQ = mutate(WQ, DOY = yday(Date), Year = as.factor(year(Date)))
mod2 = clmm(Mic ~ Temperature + Secchi + Year + (1|DOY), data = WQ)
## Warning in update.uC(rho): step factor reduced below minimum when updating the random effects
## at iteration 52
summary(mod2)
## Cumulative Link Mixed Model fitted with the Laplace approximation
##
## formula: Mic ~ Temperature + Secchi + Year + (1 | DOY)
## data: WQ
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 4718 -2425.35 4872.71 665(3495) 1.30e-02 2.5e+06
##
## Random effects:
## Groups Name Variance Std.Dev.
## DOY (Intercept) 2.042 1.429
## Number of groups: DOY 168
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Temperature 0.5616482 0.0286895 19.577 < 2e-16 ***
## Secchi 0.0097468 0.0007285 13.378 < 2e-16 ***
## Year2016 0.1793758 0.1489293 1.204 0.2284
## Year2017 -2.7175744 0.2269768 -11.973 < 2e-16 ***
## Year2018 0.1836779 0.1742576 1.054 0.2919
## Year2019 -0.7997375 0.1855894 -4.309 1.64e-05 ***
## Year2020 0.3454841 0.1458239 2.369 0.0178 *
## Year2021 1.5046456 0.2830319 5.316 1.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Absent|Low 13.5555 0.6443 21.04
## Low|High 16.9222 0.6841 24.74
## (103 observations deleted due to missingness)
Now let’s plot the effects, and do pairwise comparisons between years
allEffects(mod2)
## model: Mic ~ Temperature + Secchi + Year
##
## Temperature effect (probability) for Absent
## Temperature
## 11 16 20 24 28
## 0.99892823 0.98252120 0.85601027 0.38602496 0.06234814
##
## Temperature effect (probability) for Low
## Temperature
## 11 16 20 24 28
## 0.001034751 0.016865393 0.138219589 0.561953798 0.596032742
##
## Temperature effect (probability) for High
## Temperature
## 11 16 20 24 28
## 3.701664e-05 6.134089e-04 5.770137e-03 5.202125e-02 3.416191e-01
##
## Secchi effect (probability) for Absent
## Secchi
## 8 100 200 400 500
## 0.9310822 0.8464117 0.6752557 0.2284119 0.1004733
##
## Secchi effect (probability) for Low
## Secchi
## 8 100 200 400 500
## 0.06637049 0.14736654 0.30842229 0.66720378 0.66353015
##
## Secchi effect (probability) for High
## Secchi
## 8 100 200 400 500
## 0.002547314 0.006221753 0.016321987 0.104384335 0.235996597
##
## Year effect (probability) for Absent
## Year
## 2015 2016 2017 2018 2019 2020 2021
## 0.7998936 0.7696354 0.9837488 0.7688718 0.8989277 0.7388781 0.4702810
##
## Year effect (probability) for Low
## Year
## 2015 2016 2017 2018 2019 2020 2021
## 0.19154898 0.22014307 0.01568156 0.22086308 0.09720795 0.24907563 0.49230991
##
## Year effect (probability) for High
## Year
## 2015 2016 2017 2018 2019 2020
## 0.0085574258 0.0102215208 0.0005696401 0.0102651367 0.0038643149 0.0120463090
## 2021
## 0.0374090453
plot(allEffects(mod2))
emmeans(mod2, pairwise ~ Year)
## $emmeans
## Year emmean SE df asymp.LCL asymp.UCL
## 2015 -3.07 0.183 Inf -3.43 -2.71
## 2016 -2.89 0.175 Inf -3.23 -2.55
## 2017 -5.79 0.247 Inf -6.27 -5.30
## 2018 -2.89 0.171 Inf -3.22 -2.55
## 2019 -3.87 0.192 Inf -4.24 -3.49
## 2020 -2.72 0.172 Inf -3.06 -2.39
## 2021 -1.56 0.257 Inf -2.07 -1.06
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df z.ratio p.value
## 2015 - 2016 -0.1794 0.149 Inf -1.204 0.8927
## 2015 - 2017 2.7176 0.227 Inf 11.973 <.0001
## 2015 - 2018 -0.1837 0.174 Inf -1.054 0.9412
## 2015 - 2019 0.7997 0.186 Inf 4.309 0.0003
## 2015 - 2020 -0.3455 0.146 Inf -2.369 0.2117
## 2015 - 2021 -1.5046 0.283 Inf -5.316 <.0001
## 2016 - 2017 2.8969 0.225 Inf 12.868 <.0001
## 2016 - 2018 -0.0043 0.162 Inf -0.027 1.0000
## 2016 - 2019 0.9791 0.184 Inf 5.327 <.0001
## 2016 - 2020 -0.1661 0.151 Inf -1.103 0.9275
## 2016 - 2021 -1.3253 0.280 Inf -4.727 <.0001
## 2017 - 2018 -2.9013 0.215 Inf -13.469 <.0001
## 2017 - 2019 -1.9178 0.225 Inf -8.542 <.0001
## 2017 - 2020 -3.0631 0.220 Inf -13.931 <.0001
## 2017 - 2021 -4.2222 0.322 Inf -13.110 <.0001
## 2018 - 2019 0.9834 0.163 Inf 6.034 <.0001
## 2018 - 2020 -0.1618 0.161 Inf -1.006 0.9529
## 2018 - 2021 -1.3210 0.267 Inf -4.954 <.0001
## 2019 - 2020 -1.1452 0.176 Inf -6.490 <.0001
## 2019 - 2021 -2.3044 0.272 Inf -8.470 <.0001
## 2020 - 2021 -1.1592 0.275 Inf -4.213 0.0005
##
## P value adjustment: tukey method for comparing a family of 7 estimates
plot(emmeans(mod2, pairwise ~ Year))
We can also use a Bayesian approach.
We will be using the ‘brms’ package, which uses stan on the back end.
There are a lot of great vinettes to help you set it up. The “troubleshooting” one is my favorite.
#Running these models takes FOREVER, so we're going to limit it to just 2020, to make life easier
WQ2020 = filter(WQ, Year == "2020")
M5.1 = brm(Mic ~ Temperature + Secchi + (1|DOY), data = WQ2020, family = cumulative,
iter = 1000, backend = "cmdstanr",
control = list(max_treedepth = 15),
chains = 2)
## Warning: Rows containing NAs were excluded from the model.
## Start sampling
## Running MCMC with 2 sequential chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 282.7 seconds.
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 349.3 seconds.
##
## Both chains finished successfully.
## Mean chain execution time: 316.0 seconds.
## Total execution time: 634.3 seconds.
save(M5.1, file = "M5.1.RData")
So, it worked, but it takes a while. The more complicated the model, the longer it takes.
summary(M5.1)
## Family: cumulative
## Links: mu = logit; disc = identity
## Formula: Mic ~ Temperature + Secchi + (1 | DOY)
## Data: WQ2020 (Number of observations: 742)
## Draws: 2 chains, each with iter = 500; warmup = 0; thin = 1;
## total post-warmup draws = 1000
##
## Group-Level Effects:
## ~DOY (Number of levels: 68)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 2.11 0.27 1.62 2.69 1.01 242 460
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] 19.01 1.82 15.64 22.72 1.00 379 639
## Intercept[2] 23.31 1.96 19.71 27.35 1.00 387 570
## Temperature 0.86 0.09 0.70 1.03 1.00 380 654
## Secchi 0.01 0.00 0.00 0.01 1.00 898 943
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## disc 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
cex = conditional_effects(M5.1, categorical = TRUE)
conditional_effects(M5.1, categorical = TRUE)