Loops

Our zooplankton data set is now in a format we can work with, so the next step is to do some calculations on it. What if we want to calculate the annual average catch?

We can do it one year at a time:

mean(filter(CBlong, Year == 1984)$CPUE)

mean(filter(CBlong, Year == 1985)$CPUE)

mean(filter(CBlong, Year == 1986)$CPUE)

But we don’t want to keep copying and pasting! Instead, we can use a loop: for each item in a sequence, do something.

for Loop

for (items in sequence) {
  # do something
}

…an example

for (i in 1:5) {
  print(i)
}

To calculate the means of our CB data set by year:

#first we set up an empty data frame to accept our output
output = data.frame(year = min(CBlong$Year):max(CBlong$Year), Mean = NA)

for (i in seq_len(nrow(output))) {   # sequence
  # body: mean CPUE for the year in row i, ignoring NA values
  output$Mean[i] = mean(filter(CBlong, Year == output$year[i])$CPUE,
                        na.rm = TRUE)
}

That took a few seconds. You may have heard “don’t use for loops in R - they are slow!” That is true when you are dealing with very large data sets, but remember: If it works, it’s good code! However, you might want to try some different approaches to speed up your data processing, or make your code more readable.
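
If you want to try the preallocate-then-fill pattern without the zooplankton data, here is a minimal sketch in base R. The `toy` data frame and its values are invented purely for illustration; they only stand in for CBlong.

```r
# Hypothetical toy data standing in for CBlong (values are made up)
toy = data.frame(
  Year = c(1984, 1984, 1985, 1985, 1986),
  CPUE = c(10, 20, NA, 30, 50)
)

# One row per year, with an empty Mean column for the loop to fill in
toy_output = data.frame(year = min(toy$Year):max(toy$Year), Mean = NA_real_)

for (i in seq_len(nrow(toy_output))) {
  # pick out the CPUE values for the year in row i
  this_year = toy$CPUE[toy$Year == toy_output$year[i]]
  # na.rm = TRUE drops the missing value in 1985 before averaging
  toy_output$Mean[i] = mean(this_year, na.rm = TRUE)
}

toy_output
```

Using seq_len(nrow(...)) instead of 1:nrow(...) is a small safety habit: if the data frame ever has zero rows, the loop simply does not run.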

Apply

The apply family of functions includes a number of “wrappers” for loops that do the same job with less code, and sometimes run faster. Let’s start with a quick look at the help documentation for apply:

# `<_>apply` Family Functions
?apply

There are some shortcut functions for using apply on certain data types. For example, lapply works on lists or vectors and always returns a list, while sapply and vapply return a vector where possible. The basic syntax is:

lapply(list_or_vector, FUN = function(x) {
  # do something
})

Here in lapply, the function passed to FUN is the ‘do something’ from the body of the for loop. It acts on every item in the list or vector.
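
To see how the members of the family differ in what they return, the sketch below applies the same squaring function with lapply, sapply, and vapply. The variable names are made up for this example.

```r
# lapply always returns a list, one element per input item
squares_list = lapply(1:3, FUN = function(x) x^2)

# sapply tries to simplify the result, here to a numeric vector
squares_vec = sapply(1:3, FUN = function(x) x^2)

# vapply does the same, but you state the expected type of each result,
# so it fails loudly if the function returns something unexpected
squares_safe = vapply(1:3, FUN = function(x) x^2, FUN.VALUE = numeric(1))

class(squares_list)
class(squares_vec)
```

In scripts, vapply is often the safer choice because the declared FUN.VALUE acts as a check on the output.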

lapply First Example

Here we supply the vector 1:5. We wish to add 5 to each number in that vector.

lapply(1:5, FUN = `+`, 5)

With this example, though, we don’t actually need lapply. Because R is vectorized, we can simply write the expression below, and the results stay neatly in a vector.

1:5 + 5
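
A quick way to convince yourself that the two approaches agree is to flatten the lapply result with unlist and compare it to the vectorized version. This is only a sketch; the variable names are invented here.

```r
# lapply returns a list, so flatten it to a plain vector first
via_lapply = unlist(lapply(1:5, FUN = `+`, 5))

# the vectorized version needs no loop or wrapper at all
via_vector = 1:5 + 5

identical(via_lapply, via_vector)
```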

Clean Data

Now let’s take a look at some of the zooplankton data! The first thing we want to know about our data is: is it clean? Where are the NA values (in which fields)? What is the data type of each field? Are these the correct data types?

The easiest way to start is with the “summary” function:

summary(CBdata)

This is great for all of our numeric data, and lets us know the data types, but there might be some problems in our character variables too.

To check those, we can use the “unique” function. First we select just the columns with characters, then apply the “unique” function to each column.

lapply(select(CBlong, where(is.character)), unique)

We notice something interesting about the fields EZStation and crackers: we have some quoted NAs. That is, the entries in the Excel file are the text “NA”, not just blank cells (which R would read as NA). This poses a problem for data summaries and other analyses.

The function “na_if” replaces a given value with NA:

CBlong = mutate(CBlong, EZStation = na_if(EZStation, "NA"),
                crackers = na_if(crackers, "NA"))
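
To see what na_if does in isolation, here is a small sketch on a made-up character vector (the `stations` values are invented and are not real EZStation entries). It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical values; "NA" here is text, not a true missing value
stations = c("A", "NA", "B")

is.na(stations)            # the text "NA" is not recognized as missing

# Replace the text "NA" with a real NA
cleaned = na_if(stations, "NA")

is.na(cleaned)             # now the second entry counts as missing
```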

Split-apply-combine

The package dplyr has some nifty tools to create summaries of your data based on particular groups. In this case, we want to group our data frame by year and apply the mean function to CPUE.

CBmeans = CBlong %>%
  group_by(Year) %>% #group it by Year
  summarize(MeanCPUE = mean(CPUE)) #Calculate Mean CPUE

CBmeans
#much faster than the loop!


#we can get really complicated if we want
CBmeans = CBlong %>%
  group_by(Year) %>% #group it by Year
  summarize(MeanCPUE = mean(CPUE), #Calculate Mean CPUE
            sdCPUE = sd(CPUE), #calculate standard deviation
            nobs = length(CPUE), #number of observations
            seCPUE = sdCPUE/sqrt(nobs)) #calculate standard error
View(CBmeans)

#note that you can use your new variables immediately!
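
If you would like to check a grouped summary by hand, the sketch below runs the same group_by and summarize pattern on a tiny made-up data frame, using the usual definition of standard error as the standard deviation divided by the square root of the number of observations. The `toy` data and its values are invented for illustration, and the example assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for CBlong (values are made up)
toy = data.frame(
  Year = c(1984, 1984, 1985, 1985),
  CPUE = c(10, 20, 30, 50)
)

toy_means = toy %>%
  group_by(Year) %>%                      # split: one group per year
  summarize(MeanCPUE = mean(CPUE),        # apply: mean within each year
            sdCPUE = sd(CPUE),            # standard deviation within each year
            nobs = n(),                   # number of rows in the group
            seCPUE = sdCPUE / sqrt(nobs)) # standard error = sd / sqrt(n)

toy_means                                 # combine: one row per year
```

With only two values per year, the numbers are small enough to verify with a calculator.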

Now it’s your turn

  1. What is the average CPUE of each species?

  2. What is the maximum CPUE of Eurytemora in Suisun Bay by year?

  3. Calculate the relative % composition of each species by year (this is a hard one).