An Introduction to Orginal Regression

Sometimes you have data that is categorized as “low”, “medium”, and “high”. You can turn these into factors for use in a model, but the model will treat them as unrelated, whereas there is probably an underlying distribution behind those values. The probability of “high” is probably closer to the probability of “medium” than to “low”. Or something like that. In order to include this information in the model, you need an ordinal regression.

This figure, and much of the material in this tutorial comes from:

Bürkner, P.-C., and M. Vuorre. 2019. Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science 2(1):77-101. doi:https://doi.org/10.1177/2515245918823199

This is another good tutorial: https://www.r-bloggers.com/2019/06/how-to-perform-ordinal-logistic-regression-in-r/

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(devtools)
## Loading required package: usethis
#install_github("sbashevkin/discretewq")
library(discretewq)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(brms)
## Loading required package: Rcpp
## Loading 'brms' package (version 2.16.3). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').
## 
## Attaching package: 'brms'
## The following object is masked from 'package:stats':
## 
##     ar
library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ordinal)
## Warning: package 'ordinal' was built under R version 4.1.3
## 
## Attaching package: 'ordinal'
## The following objects are masked from 'package:brms':
## 
##     ranef, VarCorr
## The following object is masked from 'package:dplyr':
## 
##     slice
library(emmeans)
## 
## Attaching package: 'emmeans'
## The following object is masked from 'package:devtools':
## 
##     test

Oridinal regressions are basically extentions of logistic regressions. In simple logistic regression, log of odds that an event occurs is modeled as a linear combination of the independent variables. Ordinal regression uses cumulative events for the log of the odds computation. It means that unlike simple logistic regression, ordinal logistic models consider the probability of an event and all the events that are below the focal event in the ordered hierarchy.

There are a number of different types of distributions that can be used with ordinal data.

For this example, we are toing to be using visual Microcystis scores collected from various IEP surveys.

Flynn, T., P. Lehman, S. Lesmeister, and S. Waller. 2022. A Visual Scale for Microcystis Bloom Severity. figshare Figure. doi:https://doi.org/10.6084/m9.figshare.19239882.v1

In this case, the cummulative distribution is probably what we want, since “low”, “medium”, and “high” are mutually exclusive.

Now let’s get the data. We can use Sam B’s fantastic integrated water quality dataset to automatically download all the survyes the collect Microcystis observations. We’ll just do the past five years of data, and limit it to observations in summer and fall, when microcystis is most common.

WQ = wq(Sources = c("FMWT", "STN", "EMP"), Start_year = 2015, End_year = 2021) %>%
  dplyr::select(Source, Station, Latitude, Longitude, Date, Depth, Microcystis, 
                Secchi, Temperature, Salinity, Conductivity, Month, Season) %>%
  filter(!is.na(Microcystis), Season %in% c("Summer", "Fall"))
View(WQ)

ggplot(WQ, aes(x = as.factor(Microcystis), y = Temperature)) + geom_boxplot()
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).

Cool! So it looks as though there is probably a trend between Microcystis observations and water temperature. However, there is some ammount of variation in how people judge microcystis levels. Let’s convert from 5 categories to 3 categories, to try and reduce variability.

WQ = mutate(WQ, Mic = case_when(Microcystis == 1 ~ "Absent",
                                Microcystis %in% c(2,3) ~ "Low",
                                Microcystis %in% c(4,5)~ "High"),
            Mic = factor(Mic, levels = c("Absent", "Low", "High"), ordered = T))

ggplot(WQ, aes(x = Mic, y = Temperature)) + geom_boxplot()
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).

OK, now we can set up our model of temperature versus Microcystis. If we have a simple model, we can use the ‘polr’ function in the MASS package, but the package ‘ordinal’ allows for mixed models and more complicated controls.

mod1 = clm(Mic~Temperature, data = WQ)
summary(mod1)
## formula: Mic ~ Temperature
## data:    WQ
## 
##  link  threshold nobs logLik   AIC     niter max.grad cond.H 
##  logit flexible  4812 -3065.90 6137.79 6(0)  3.00e-11 7.8e+04
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## Temperature  0.34248    0.01437   23.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##            Estimate Std. Error z value
## Absent|Low   7.9649     0.3054   26.08
## Low|High    10.4682     0.3242   32.29
## (9 observations deleted due to missingness)

The table displays the value of coefficients and intercepts, and corresponding standard errors z values, and p-values. Holding everything else constant, an increase in Temperature by one degree increases the log odds of Microcystis by 0.3.

THere are multiple values of intercepts depending on the levels of intercept. The intercepts can be interpreted as the expected odds when others variables assume a value of zero. So the Absent|Low intercept takes value of 7.9, indicating that the expected odds of absence when temperature is zero is 7.9 times greater than for Low.

We can plot the effect of temperature:

#Plotting the effects 

Effect(focal.predictors = "Temperature",mod1)
## 
## Temperature effect (probability) for Absent
## Temperature
##        11        16        20        24        28 
## 0.9851913 0.9230985 0.7531129 0.4366813 0.1645759 
## 
## Temperature effect (probability) for Low
## Temperature
##         11         16         20         24         28 
## 0.01358052 0.07013209 0.22076788 0.46785638 0.54199068 
## 
## Temperature effect (probability) for High
## Temperature
##          11          16          20          24          28 
## 0.001228230 0.006769439 0.026119191 0.095462303 0.293433450
plot(Effect(focal.predictors ="Temperature",mod1))

This really lets us see how as temperature increases, probability of High microcystis increases, probability of Low microcystis increases, then levels off, and probability of Absence decreases.

However, there area probably some issues with this model. Temperature is highly correlated with day of the year, and there is probably some spatial autocorrelation going on too. A better model would include multiple predictors and random effects.

#first let's calculate day of year and year to include in the model as a random effect

WQ = mutate(WQ, DOY = yday(Date), Year = as.factor(year(Date)))

mod2 = clmm(Mic ~ Temperature + Secchi + Year + (1|DOY), data = WQ)
## Warning in update.uC(rho): step factor reduced below minimum when updating the random effects
##   at iteration 52
summary(mod2)
## Cumulative Link Mixed Model fitted with the Laplace approximation
## 
## formula: Mic ~ Temperature + Secchi + Year + (1 | DOY)
## data:    WQ
## 
##  link  threshold nobs logLik   AIC     niter     max.grad cond.H 
##  logit flexible  4718 -2425.35 4872.71 665(3495) 1.30e-02 2.5e+06
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  DOY    (Intercept) 2.042    1.429   
## Number of groups:  DOY 168 
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## Temperature  0.5616482  0.0286895  19.577  < 2e-16 ***
## Secchi       0.0097468  0.0007285  13.378  < 2e-16 ***
## Year2016     0.1793758  0.1489293   1.204   0.2284    
## Year2017    -2.7175744  0.2269768 -11.973  < 2e-16 ***
## Year2018     0.1836779  0.1742576   1.054   0.2919    
## Year2019    -0.7997375  0.1855894  -4.309 1.64e-05 ***
## Year2020     0.3454841  0.1458239   2.369   0.0178 *  
## Year2021     1.5046456  0.2830319   5.316 1.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##            Estimate Std. Error z value
## Absent|Low  13.5555     0.6443   21.04
## Low|High    16.9222     0.6841   24.74
## (103 observations deleted due to missingness)

Now let’s plot the effects, and do pairwise comparisons between years

allEffects(mod2)
##  model: Mic ~ Temperature + Secchi + Year
## 
## Temperature effect (probability) for Absent
## Temperature
##         11         16         20         24         28 
## 0.99892823 0.98252120 0.85601027 0.38602496 0.06234814 
## 
## Temperature effect (probability) for Low
## Temperature
##          11          16          20          24          28 
## 0.001034751 0.016865393 0.138219589 0.561953798 0.596032742 
## 
## Temperature effect (probability) for High
## Temperature
##           11           16           20           24           28 
## 3.701664e-05 6.134089e-04 5.770137e-03 5.202125e-02 3.416191e-01 
## 
## Secchi effect (probability) for Absent
## Secchi
##         8       100       200       400       500 
## 0.9310822 0.8464117 0.6752557 0.2284119 0.1004733 
## 
## Secchi effect (probability) for Low
## Secchi
##          8        100        200        400        500 
## 0.06637049 0.14736654 0.30842229 0.66720378 0.66353015 
## 
## Secchi effect (probability) for High
## Secchi
##           8         100         200         400         500 
## 0.002547314 0.006221753 0.016321987 0.104384335 0.235996597 
## 
## Year effect (probability) for Absent
## Year
##      2015      2016      2017      2018      2019      2020      2021 
## 0.7998936 0.7696354 0.9837488 0.7688718 0.8989277 0.7388781 0.4702810 
## 
## Year effect (probability) for Low
## Year
##       2015       2016       2017       2018       2019       2020       2021 
## 0.19154898 0.22014307 0.01568156 0.22086308 0.09720795 0.24907563 0.49230991 
## 
## Year effect (probability) for High
## Year
##         2015         2016         2017         2018         2019         2020 
## 0.0085574258 0.0102215208 0.0005696401 0.0102651367 0.0038643149 0.0120463090 
##         2021 
## 0.0374090453
plot(allEffects(mod2))

emmeans(mod2, pairwise ~ Year)
## $emmeans
##  Year emmean    SE  df asymp.LCL asymp.UCL
##  2015  -3.07 0.183 Inf     -3.43     -2.71
##  2016  -2.89 0.175 Inf     -3.23     -2.55
##  2017  -5.79 0.247 Inf     -6.27     -5.30
##  2018  -2.89 0.171 Inf     -3.22     -2.55
##  2019  -3.87 0.192 Inf     -4.24     -3.49
##  2020  -2.72 0.172 Inf     -3.06     -2.39
##  2021  -1.56 0.257 Inf     -2.07     -1.06
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast    estimate    SE  df z.ratio p.value
##  2015 - 2016  -0.1794 0.149 Inf  -1.204  0.8927
##  2015 - 2017   2.7176 0.227 Inf  11.973  <.0001
##  2015 - 2018  -0.1837 0.174 Inf  -1.054  0.9412
##  2015 - 2019   0.7997 0.186 Inf   4.309  0.0003
##  2015 - 2020  -0.3455 0.146 Inf  -2.369  0.2117
##  2015 - 2021  -1.5046 0.283 Inf  -5.316  <.0001
##  2016 - 2017   2.8969 0.225 Inf  12.868  <.0001
##  2016 - 2018  -0.0043 0.162 Inf  -0.027  1.0000
##  2016 - 2019   0.9791 0.184 Inf   5.327  <.0001
##  2016 - 2020  -0.1661 0.151 Inf  -1.103  0.9275
##  2016 - 2021  -1.3253 0.280 Inf  -4.727  <.0001
##  2017 - 2018  -2.9013 0.215 Inf -13.469  <.0001
##  2017 - 2019  -1.9178 0.225 Inf  -8.542  <.0001
##  2017 - 2020  -3.0631 0.220 Inf -13.931  <.0001
##  2017 - 2021  -4.2222 0.322 Inf -13.110  <.0001
##  2018 - 2019   0.9834 0.163 Inf   6.034  <.0001
##  2018 - 2020  -0.1618 0.161 Inf  -1.006  0.9529
##  2018 - 2021  -1.3210 0.267 Inf  -4.954  <.0001
##  2019 - 2020  -1.1452 0.176 Inf  -6.490  <.0001
##  2019 - 2021  -2.3044 0.272 Inf  -8.470  <.0001
##  2020 - 2021  -1.1592 0.275 Inf  -4.213  0.0005
## 
## P value adjustment: tukey method for comparing a family of 7 estimates
plot(emmeans(mod2, pairwise ~ Year))

Bayesian Approach

We can also use a Bayesian approach.

We will be using the ‘brms’ package, which uses stan on the back end.

There are a lot of great vinettes to help you set it up. The “troubleshooting” one is my favorite.

#Running these models takes FOREVER, so we're going to limit it to just 2020, to make life easier

WQ2020 = filter(WQ, Year == "2020")

M5.1 = brm(Mic ~ Temperature + Secchi +  (1|DOY), data = WQ2020, family = cumulative,
           iter = 1000, backend = "cmdstanr",  
           control = list(max_treedepth = 15),
           chains = 2)
## Warning: Rows containing NAs were excluded from the model.
## Start sampling
## Running MCMC with 2 sequential chains...
## 
## Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 1 finished in 282.7 seconds.
## Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 finished in 349.3 seconds.
## 
## Both chains finished successfully.
## Mean chain execution time: 316.0 seconds.
## Total execution time: 634.3 seconds.
save(M5.1, file = "M5.1.RData")

So, it worked, but it takes a while. The more complicated the model, the longer it takes.

summary(M5.1)
##  Family: cumulative 
##   Links: mu = logit; disc = identity 
## Formula: Mic ~ Temperature + Secchi + (1 | DOY) 
##    Data: WQ2020 (Number of observations: 742) 
##   Draws: 2 chains, each with iter = 500; warmup = 0; thin = 1;
##          total post-warmup draws = 1000
## 
## Group-Level Effects: 
## ~DOY (Number of levels: 68) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     2.11      0.27     1.62     2.69 1.01      242      460
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1]    19.01      1.82    15.64    22.72 1.00      379      639
## Intercept[2]    23.31      1.96    19.71    27.35 1.00      387      570
## Temperature      0.86      0.09     0.70     1.03 1.00      380      654
## Secchi           0.01      0.00     0.00     0.01 1.00      898      943
## 
## Family Specific Parameters: 
##      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## disc     1.00      0.00     1.00     1.00   NA       NA       NA
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
cex = conditional_effects(M5.1, categorical = TRUE)
 conditional_effects(M5.1, categorical = TRUE)