Our zooplankton data set is now in a format we can work with, so the next step is to do some calculations on it. What if we want to calculate the annual average catch?
We can do it one year at a time:
mean(filter(CBlong, Year == 1984)$CPUE)
mean(filter(CBlong, Year == 1985)$CPUE)
mean(filter(CBlong, Year == 1986)$CPUE)
But we don’t want to keep copying and pasting! Instead, we can use a loop: for each item in a sequence, do something.
for Loop
for (item in sequence) {
# do something
}
…an example
for (i in 1:5) {
print(i)
}
To calculate the means of our CB data set by year:
#first we set up an empty data frame to accept our output
output = data.frame(year = min(CBlong$Year):max(CBlong$Year), Mean = NA)
for (i in 1:nrow(output)) { # sequence
output$Mean[i] = mean(filter(CBlong,
Year == output$year[i])$CPUE, na.rm = TRUE) #body
}
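To try the same pattern without the zooplankton file, here is a minimal sketch on a small made-up data frame. The `toy` object and its values are invented purely for illustration, and base-R subsetting stands in for `filter`.

```r
# Toy stand-in for CBlong (values invented for illustration only)
toy <- data.frame(
  Year = c(1984, 1984, 1985, 1985, 1986, 1986),
  CPUE = c(10, 20, 5, NA, 8, 12)
)

# Set up one row per year to accept the output
output <- data.frame(year = min(toy$Year):max(toy$Year), Mean = NA_real_)

# Loop over the rows of the output and fill in each year's mean
for (i in seq_len(nrow(output))) {
  this_year <- toy$CPUE[toy$Year == output$year[i]]
  output$Mean[i] <- mean(this_year, na.rm = TRUE)  # ignore missing catches
}

output
```

Using `seq_len(nrow(output))` instead of `1:nrow(output)` is a small safeguard: if the output had zero rows, the loop would simply not run.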
That took a few seconds. You may have heard “don’t use for loops in R - they are slow!” That is true when you are dealing with very large data sets, but remember: If it works, it’s good code! However, you might want to try some different approaches to speed up your data processing, or make your code more readable.
The apply family of functions includes a number of “wrappers” for loops that usually need less code and can sometimes run faster. Let’s start with a quick look at the help documentation for apply.
# `<_>apply` Family Functions
?apply
There are some shortcut functions in the apply family for particular kinds of input and output. For example, lapply works on a list or vector and always returns a list, while vapply returns a vector of a type you specify. The basic syntax is:
lapply(list or vector, FUN = function(x) # do something)
Here, the function passed to FUN is the ‘do something’ from the for loop. It acts on every item in the list or vector.
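As a quick illustration of that syntax, here is a minimal sketch using an anonymous function. The squaring function is just a made-up example. vapply is shown alongside lapply to highlight that it returns a plain vector of the type given in `FUN.VALUE`.

```r
# lapply applies the function to each element and returns a list
squares_list <- lapply(1:3, FUN = function(x) x^2)

# vapply does the same, but returns a vector and checks that
# each result matches the template given in FUN.VALUE
squares_vec <- vapply(1:3, FUN = function(x) x^2, FUN.VALUE = numeric(1))

str(squares_list)
squares_vec
```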
lapply First Example
Here we supply the vector 1:5 and add 5 to each number in it.
lapply(1:5, FUN = `+`, 5)
In this example, though, we don’t need lapply at all. Because R is vectorized, we can simply write the expression below, and the result stays neatly in a vector rather than a list.
1:5 + 5
Now let’s take a look at some of the zooplankton data! The first thing we want to know about our data is…is it clean? Where are the NA values (in which fields)? What are field data types? Are these the correct data types?
The easiest way to start is with the “summary” function
summary(CBdata)
This is great for all of our numeric data, and lets us know the data types, but there might be some problems in our character variables too.
To check those, we can use the “unique” function. First we select just the columns that contain characters, then apply “unique” to each column.
lapply(select(CBlong, where(is.character)), unique)
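The same kind of check can be sketched in base R on a made-up data frame. The `toy` object, its column names, and its values are assumptions for illustration only. The first call counts true NA values per column, and the second lists the unique values in each character column, which is where a quoted "NA" would show up.

```r
# Toy data frame (invented values): note the text "NA" in Station
toy <- data.frame(
  Station = c("A", "B", "NA", "A"),
  Taxon   = c("Eurytemora", "Eurytemora", "Other", NA),
  CPUE    = c(1.5, NA, 3.0, 2.0),
  stringsAsFactors = FALSE
)

# How many true missing values are in each column?
na_counts <- sapply(toy, function(col) sum(is.na(col)))
na_counts

# What are the unique values in each character column?
char_values <- lapply(toy[sapply(toy, is.character)], unique)
char_values
```

Here the text "NA" in `Station` is not counted as missing, but it does appear in the list of unique values.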
We notice something interesting about the fields EZStation and crackers: they contain some quoted NAs. That is, those entries in the Excel file hold the text “NA” rather than being blank (a blank would be read by R as a true NA). This poses a problem for data summaries and other analyses, because R treats the text “NA” as an ordinary value.
The function “na_if” automatically replaces a certain value with an NA
CBlong = mutate(CBlong, EZStation = na_if(EZStation, "NA"),
crackers = na_if(crackers, "NA"))
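To see what this replacement amounts to, here is a rough base-R sketch on a made-up character vector (the values are invented for illustration). It is not how na_if is implemented, just the same idea written out by hand.

```r
# Toy vector: two entries hold the text "NA", one is a true missing value
station <- c("A", "NA", "B", NA, "NA")

n_before <- sum(is.na(station))   # only the true NA is counted

# Replace the text "NA" with a real missing value
station[station %in% "NA"] <- NA

n_after <- sum(is.na(station))    # now all three are counted
c(before = n_before, after = n_after)
```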
The package dplyr has some nifty tools to create summaries of your data based on particular groups. In this case, we want to group our data frame by year and apply the mean function to CPUE.
CBmeans = CBlong %>%
group_by(Year) %>% #group it by Year
summarize(MeanCPUE = mean(CPUE)) #Calculate Mean CPUE
CBmeans
#much faster than the loop!
#we can get really complicated if we want
CBmeans = CBlong %>%
group_by(Year) %>% #group it by Year
summarize(MeanCPUE = mean(CPUE), #Calculate Mean CPUE
sdCPUE = sd(CPUE), #calculate standard deviation
nobs = length(CPUE), #number of observations
seCPUE = sdCPUE/sqrt(nobs)) #calculate standard error
View(CBmeans)
#note that you can use your new variables immediately!
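For a version you can run without the zooplankton file, here is a minimal sketch on a made-up data frame (the `toy` object and its values are invented for illustration). It assumes dplyr is installed, uses the usual definition of standard error, sd / sqrt(n), and counts only non-missing observations so that the NA does not inflate n.

```r
library(dplyr)

# Toy stand-in for CBlong (values invented for illustration only)
toy <- data.frame(
  Year = c(1984, 1984, 1984, 1985, 1985, 1985),
  CPUE = c(2, 4, 6, 10, NA, 14)
)

toy_means <- toy %>%
  group_by(Year) %>%
  summarize(MeanCPUE = mean(CPUE, na.rm = TRUE),  # mean, ignoring NAs
            sdCPUE   = sd(CPUE, na.rm = TRUE),    # standard deviation
            nobs     = sum(!is.na(CPUE)),         # non-missing observations
            seCPUE   = sdCPUE / sqrt(nobs))       # standard error

toy_means
```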
What is the average CPUE of each species?
What is the maximum CPUE of Eurytemora in Suisun Bay by year?
Calculate the relative % composition of each species by year (this is a hard one).