A few basic statistical tests

Now that you have a dataset and some comfort in manipulating it, you probably want to run some statistics. In this tutorial I will cover a little bit about linear models and ANOVAs. To analyze complicated ecological data, you will probably need more advanced methods, but this is a start. This tutorial does not cover statistics in detail; instead, it assumes some knowledge of statistics, and I will show how to run a few basic tests in R and evaluate the results.

Instead of using the FMWT catch matrix we worked with before, we are just going to use the annual abundance indices for Longfin Smelt.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidylog)
## 
## Attaching package: 'tidylog'
## 
## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct, distinct_all,
##     distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
##     relocate, rename, rename_all, rename_at, rename_if, rename_with,
##     right_join, sample_frac, sample_n, select, select_all, select_at,
##     select_if, semi_join, slice, slice_head, slice_max, slice_min,
##     slice_sample, slice_tail, summarise, summarise_all, summarise_at,
##     summarise_if, summarize, summarize_all, summarize_at, summarize_if,
##     tally, top_frac, top_n, transmute, transmute_all, transmute_at,
##     transmute_if, ungroup
## 
## The following objects are masked from 'package:tidyr':
## 
##     drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
##     spread, uncount
## 
## The following object is masked from 'package:stats':
## 
##     filter
library(lubridate)

FMWTindex = read_csv("https://filelib.wildlife.ca.gov/Public/TownetFallMidwaterTrawl/FMWT%20Data/FMWTindices.csv")
## Rows: 56 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): Year, Threadfin Shad, American Shad, Delta Smelt, Longfin Smelt, St...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
LFS = select(FMWTindex, Year, `Longfin Smelt`)
## select: dropped 5 variables (Threadfin Shad, American Shad, Delta Smelt, Striped Bass Age0, Splittail)

Before we do anything with this data, we should make a few basic plots to get a feel for it. Let's make a histogram to see the distribution of the data, and let's plot the index over time.

Histograms are very easy.

hist(LFS$`Longfin Smelt`)

So we can see that we have a lot of small values and very few high values. This is not normally distributed. That could be a problem for later statistical tests that have assumptions of normality.

What does it look like if we log-transform it?

hist(log(LFS$`Longfin Smelt`))

That looks a lot more like a bell curve! So we will probably have to log-transform our data to use regular linear models or ANOVAs.
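
One caution before transforming: log(0) is -Inf, so a plain log transformation only works because every index value here is positive. If your data contained zeros, a common (if imperfect) workaround is to add a constant first:

#log(x + 1) tolerates zeros, at the cost of shifting the values slightly
hist(log(LFS$`Longfin Smelt` + 1))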

For our next step, let's plot the Longfin Smelt index over time.

plot(x = LFS$Year, y = LFS$`Longfin Smelt`)

Some very high catches at the beginning, then not a lot. Does log-transforming it help see patterns?

plot(x = LFS$Year, y = log(LFS$`Longfin Smelt`))

The downward trend over time is easier to see now. Let’s see if that trend is statistically significant.
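
As an aside, since we already have the tidyverse loaded, here is the same plot sketched in ggplot2, with a linear trend line overlaid to preview the model we are about to fit (expect a warning about the two missing index values):

#scatterplot of the log index with a fitted line and confidence ribbon
ggplot(LFS, aes(x = Year, y = log(`Longfin Smelt`))) +
  geom_point() +
  geom_smooth(method = "lm")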

For this, we will use a basic linear model.

#log-transform the index to make it easier to write in the model
LFS = mutate(LFS, logIndex = log(`Longfin Smelt`))
## mutate: new variable 'logIndex' (double) with 54 unique values and 4% NA
mod1 = lm(logIndex ~ Year, data = LFS)
summary(mod1)
## 
## Call:
## lm(formula = logIndex ~ Year, data = LFS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1122 -0.7238 -0.2002  1.2051  3.1151 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 216.03333   27.09004   7.975 1.40e-10 ***
## Year         -0.10499    0.01358  -7.733 3.37e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.603 on 52 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5349, Adjusted R-squared:  0.5259 
## F-statistic:  59.8 on 1 and 52 DF,  p-value: 3.37e-10

Looking at the output: the Year coefficient is negative (about -0.10) and highly significant (p < 0.001), so the log-transformed index has been declining over time. The multiple R-squared of 0.53 means Year alone explains about half the variation in the index. Also note the message that two observations were deleted due to missingness; those are years with no index value.
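
Because the response is log-transformed, the Year coefficient is on the log scale. As a rough sketch, you can back-transform it into an approximate percent change per year:

#back-transform the slope to a percent change per year
slope = coef(mod1)["Year"]
(exp(slope) - 1) * 100 #roughly a 10% decline per year

#confidence intervals for the coefficients
confint(mod1)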

But how do we know that this model is appropriate? The plot.lm function can give you diagnostic plots to test the assumptions. Remember that a linear model assumes:

- The relationship between the predictors and the response is linear
- The residuals are normally distributed
- The residuals have constant variance (homogeneity of variance)
- The observations are independent of each other

For more information on interpreting these plots, see: https://data.library.virginia.edu/diagnostic-plots/

plot(mod1)
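
By default, calling plot() on a model shows these four diagnostic plots one at a time (you hit Enter to page through them). If you would rather see them all at once, one option is to set up a 2x2 plotting grid first:

#show all four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(mod1)
par(mfrow = c(1, 1)) #reset to one plot per page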

Residuals versus fitted looks good. We expect no pattern in the residuals. The red line is close to zero, with no clear patterns.

QQ Plot looks good. Most of the points are close to the 1:1 line.

The scale-location plot also looks good. This is how we test homogeneity of variance. If the variance increases with increasing fitted values, we have a problem.

The residuals versus leverage plot is how we tell if outliers are skewing our results. If any points are outside the ‘Cook’s Distance’ line, we have a problem. In this case, Cook’s distance isn’t even on the plot.
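
If you want the numbers behind that plot, base R can calculate the Cook's distances directly. A quick sketch, using the common 4/n rule of thumb to flag potentially influential points:

#calculate Cook's distance for each observation
cooksD = cooks.distance(mod1)
which(cooksD > 4/length(cooksD)) #flag potentially influential points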

If any of these plots indicated issues with the assumptions, we would have to think about choosing another distribution, transforming our data, or choosing a non-parametric test.
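
For example, here is a minimal sketch of the 'another distribution' option: a generalized linear model with a Gamma distribution and a log link, which handles positive, right-skewed data without transforming the response (this assumes the index contains no zeros; modgamma is just an illustrative name):

#Gamma GLM with a log link as an alternative to log-transforming
modgamma = glm(`Longfin Smelt` ~ Year, family = Gamma(link = "log"), data = LFS)
summary(modgamma)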

As a counter-example, let’s see what the plots would look like if we hadn’t log-transformed the data.

mod2 = lm(`Longfin Smelt`~ Year, data = LFS)
summary(mod2)
## 
## Call:
## lm(formula = `Longfin Smelt` ~ Year, data = LFS)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16123  -8004  -1586   1900  62579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 914216.2   244277.5   3.743 0.000456 ***
## Year          -455.0      122.4  -3.717 0.000495 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14460 on 52 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2099, Adjusted R-squared:  0.1947 
## F-statistic: 13.81 on 1 and 52 DF,  p-value: 0.0004947
plot(mod2)

Now we see a linear trend in the residuals versus fitted plot, a lot of points are way off the 1:1 line for the normal QQ plot, variance increases with increasing values in the scale-location plot, and we have one point outside the ‘Cook’s Distance’.

Those were diagnostic plots. What if we want to plot the effect of the terms in our model? We can use the effects package for that.

library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
plot(allEffects(mod1, residuals = T))

That’s a pretty simple example. What if we want to model multiple terms? For example, water year index might be important.

We can join water year type to the indices, similar to the earlier demo.

library(waterYearType)
WYs = water_year_indices %>%
  filter(location == "Sacramento Valley") %>%
  select(WY, Index, Yr_type)
## filter: removed 116 rows (50%), 116 rows remaining
## select: dropped 4 variables (Oct_Mar, Apr_Jul, WYsum, location)
LFS2 = left_join(LFS, WYs, by = c("Year" = "WY"))
## left_join: added 2 columns (Index, Yr_type)
##            > rows only in x    1
##            > rows only in y  (61)
##            > matched rows     55
##            >                 ====
##            > rows total       56

Now let’s model the impact of both Year and Water Year Index.

mod3 = lm(logIndex~ Year + Index, data = LFS2)
summary(mod3)
## 
## Call:
## lm(formula = logIndex ~ Year + Index, data = LFS2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1936 -0.8512 -0.0397  0.6006  3.3999 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 198.040839  19.886456   9.959 1.85e-13 ***
## Year         -0.097539   0.009931  -9.822 2.93e-13 ***
## Index         0.384834   0.053718   7.164 3.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125 on 50 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7797, Adjusted R-squared:  0.7709 
## F-statistic: 88.48 on 2 and 50 DF,  p-value: < 2.2e-16
plot(mod3)

Now the ‘effects’ plots will show us the effect of year with the effect of ‘Index’ removed, and vice versa.

plot(allEffects(mod3))

What if we think there is an interaction between year and Index?

mod4 = lm(logIndex ~ Year*Index, data = LFS2)
summary(mod4)
## 
## Call:
## lm(formula = logIndex ~ Year * Index, data = LFS2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33431 -0.55318 -0.09661  0.62493  2.68610 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 82.320389  58.903956   1.398   0.1685  
## Year        -0.039588   0.029489  -1.342   0.1856  
## Index       14.712447   6.892481   2.135   0.0378 *
## Year:Index  -0.007178   0.003453  -2.079   0.0429 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.089 on 49 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7976, Adjusted R-squared:  0.7852 
## F-statistic: 64.35 on 3 and 49 DF,  p-value: < 2.2e-16
plot(allEffects(mod4))

From this we see that the decrease in the index is more dramatic in wetter years than in drier years.
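
If we weren't sure the interaction was worth keeping, we could compare the models with and without it directly. A quick sketch using a nested F-test and AIC:

#does adding the interaction significantly improve the fit?
anova(mod3, mod4)
#lower AIC indicates a better trade-off of fit and complexity
AIC(mod3, mod4)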

ANOVAs

I also said we would look at ANOVAs. Most people teach ANOVAs first, then linear models. I started with linear models because ANOVAs are actually just a special case of linear models with categorical predictors.

We can use the aov function as a shortcut, or we can run a linear model and follow up with anova. Let’s see whether the Longfin Smelt index differs by water year type.

a1 = aov(logIndex ~ Yr_type, data = LFS2)
summary(a1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Yr_type      4  119.0  29.750   8.495 2.89e-05 ***
## Residuals   48  168.1   3.502                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 3 observations deleted due to missingness

So, yes! Looks good. How about following up with a Tukey post-hoc test?

TukeyHSD(a1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = logIndex ~ Yr_type, data = LFS2)
## 
## $Yr_type
##                                  diff        lwr      upr     p adj
## Dry-Critical               0.58303202 -1.7344003 2.900464 0.9525132
## Below Normal-Critical      0.06670281 -2.4976906 2.631096 0.9999930
## Above Normal-Critical      2.71969679  0.1553034 5.284090 0.0326370
## Wet-Critical               3.35242289  1.3225906 5.382255 0.0002218
## Below Normal-Dry          -0.51632921 -3.1301102 2.097452 0.9801613
## Above Normal-Dry           2.13666477 -0.4771162 4.750446 0.1573204
## Wet-Dry                    2.76939087  0.6775121 4.861270 0.0041275
## Above Normal-Below Normal  2.65299398 -0.1820496 5.488038 0.0767412
## Wet-Below Normal           3.28572008  0.9231838 5.648256 0.0023326
## Wet-Above Normal           0.63272610 -1.7298102 2.995262 0.9409646

If we want to do this as a linear model with an ANOVA follow up, that looks like this:

a2 = lm(logIndex ~ Yr_type, data = LFS2)
anova(a2)
## Analysis of Variance Table
## 
## Response: logIndex
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Yr_type    4  119.0 29.7496  8.4946 2.886e-05 ***
## Residuals 48  168.1  3.5022                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, the results are pretty much the same!
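
One difference in presentation: the coefficients from summary(a2) are differences from the reference level of Yr_type (the first factor level), not group means. If you wanted comparisons against a different baseline, here is a sketch using relevel() (the object names LFS2wet and a2wet are just for illustration):

#refit with 'Wet' as the reference level
LFS2wet = mutate(LFS2, Yr_type = relevel(factor(as.character(Yr_type)), ref = "Wet"))
a2wet = lm(logIndex ~ Yr_type, data = LFS2wet)
summary(a2wet) #coefficients are now differences from wet years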

We can also do more complicated models with both continuous and categorical variables. However, when we include interactions we should be doing a type-III ANOVA. For that we need Anova (capital-A) from the car package.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
a3 = lm(logIndex ~ Yr_type*Year, data = LFS2)
Anova(a3, type = "III")
## Anova Table (Type III tests)
## 
## Response: logIndex
##              Sum Sq Df F value  Pr(>F)  
## (Intercept)   9.256  1  6.6812 0.01322 *
## Yr_type       4.246  4  0.7663 0.55307  
## Year          8.540  1  6.1646 0.01701 *
## Yr_type:Year  4.047  4  0.7302 0.57629  
## Residuals    59.571 43                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(allEffects(a3))
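
One caveat worth flagging: type-III tests are sensitive to how factor contrasts are coded, and R's default treatment contrasts can make the main-effect tests hard to interpret. Many references recommend sum-to-zero contrasts for type-III ANOVAs; here is a sketch of how you might refit that way (a3sum is just an illustrative name):

#use sum-to-zero contrasts for the type-III tests
options(contrasts = c("contr.sum", "contr.poly"))
a3sum = lm(logIndex ~ Yr_type*Year, data = LFS2)
Anova(a3sum, type = "III")
#restore the default treatment contrasts
options(contrasts = c("contr.treatment", "contr.poly"))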

The best way to follow up on these more complicated models is with the emmeans package. You can use it for Tukey post-hoc or other types of post-hoc tests.

library(emmeans)
a3e = emmeans(a3, pairwise ~ Yr_type|Year, at = list(Year = c(1980, 2000, 2020)))

a3e
## $emmeans
## Year = 1980:
##  Yr_type      emmean    SE df lower.CL upper.CL
##  Critical       5.96 0.543 43     4.87     7.06
##  Dry            7.65 0.685 43     6.26     9.03
##  Below Normal   6.69 0.638 43     5.40     7.98
##  Above Normal   8.83 0.583 43     7.66    10.01
##  Wet            9.36 0.329 43     8.69    10.02
## 
## Year = 2000:
##  Yr_type      emmean    SE df lower.CL upper.CL
##  Critical       4.76 0.363 43     4.03     5.49
##  Dry            5.46 0.373 43     4.71     6.21
##  Below Normal   5.01 0.445 43     4.12     5.91
##  Above Normal   6.56 0.569 43     5.41     7.71
##  Wet            7.30 0.324 43     6.64     7.95
## 
## Year = 2020:
##  Yr_type      emmean    SE df lower.CL upper.CL
##  Critical       3.55 0.663 43     2.22     4.89
##  Dry            3.28 0.715 43     1.84     4.72
##  Below Normal   3.33 0.638 43     2.05     4.62
##  Above Normal   4.29 1.174 43     1.92     6.66
##  Wet            5.24 0.584 43     4.06     6.42
## 
## Confidence level used: 0.95 
## 
## $contrasts
## Year = 1980:
##  contrast                    estimate    SE df t.ratio p.value
##  Critical - Dry               -1.6814 0.874 43  -1.924  0.3205
##  Critical - Below Normal      -0.7253 0.838 43  -0.866  0.9077
##  Critical - Above Normal      -2.8698 0.796 43  -3.605  0.0069
##  Critical - Wet               -3.3914 0.635 43  -5.343  <.0001
##  Dry - Below Normal            0.9561 0.936 43   1.021  0.8442
##  Dry - Above Normal           -1.1884 0.899 43  -1.322  0.6795
##  Dry - Wet                    -1.7100 0.760 43  -2.250  0.1814
##  Below Normal - Above Normal  -2.1445 0.864 43  -2.482  0.1138
##  Below Normal - Wet           -2.6661 0.718 43  -3.713  0.0050
##  Above Normal - Wet           -0.5216 0.669 43  -0.779  0.9352
## 
## Year = 2000:
##  contrast                    estimate    SE df t.ratio p.value
##  Critical - Dry               -0.7038 0.520 43  -1.353  0.6599
##  Critical - Below Normal      -0.2530 0.574 43  -0.441  0.9919
##  Critical - Above Normal      -1.8017 0.675 43  -2.669  0.0756
##  Critical - Wet               -2.5378 0.487 43  -5.216  <.0001
##  Dry - Below Normal            0.4508 0.580 43   0.777  0.9359
##  Dry - Above Normal           -1.0979 0.680 43  -1.613  0.4972
##  Dry - Wet                    -1.8339 0.494 43  -3.712  0.0050
##  Below Normal - Above Normal  -1.5487 0.723 43  -2.143  0.2210
##  Below Normal - Wet           -2.2848 0.551 43  -4.150  0.0014
##  Above Normal - Wet           -0.7361 0.655 43  -1.123  0.7934
## 
## Year = 2020:
##  contrast                    estimate    SE df t.ratio p.value
##  Critical - Dry                0.2737 0.976 43   0.281  0.9986
##  Critical - Below Normal       0.2193 0.921 43   0.238  0.9993
##  Critical - Above Normal      -0.7336 1.349 43  -0.544  0.9821
##  Critical - Wet               -1.6841 0.884 43  -1.906  0.3297
##  Dry - Below Normal           -0.0544 0.958 43  -0.057  1.0000
##  Dry - Above Normal           -1.0073 1.375 43  -0.733  0.9477
##  Dry - Wet                    -1.9578 0.923 43  -2.121  0.2301
##  Below Normal - Above Normal  -0.9529 1.337 43  -0.713  0.9524
##  Below Normal - Wet           -1.9034 0.865 43  -2.201  0.1988
##  Above Normal - Wet           -0.9505 1.311 43  -0.725  0.9496
## 
## P value adjustment: tukey method for comparing a family of 5 estimates
plot(a3e, comparisons = T)
## I bet you wanted to call this with just object[[1]] - use '[[]]' or which' if I'm wrong.
## See '? emm_list' for more information
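
That message is emmeans pointing out that a3e contains both the means and the contrasts, so plot() had to guess which one we wanted. To be explicit, we can plot just the estimated marginal means:

#plot only the estimated marginal means, with comparison arrows
plot(a3e$emmeans, comparisons = TRUE)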

That’s just a quick intro for some of the available statistical tools.