Rosie
There are multiple different ways of getting your data into R. RStudio has a nice little point=and-click interface that lets you import your dataset, however using the GUI makes it more difficult to replicate your work later. Therefore, it’s better to use the comand-line to import your data.
We are going to use the Zooplankton Survey data for this demonstration. It’s available On the CDFW Website
If you have a .csv file, you can use base R. So I went into the excel file and saved one of the worksheets as a .csv
If you’ve having trouble saving as a .csv, you can download the .csv version from the workshop GitHub Site
?read.csv
CBdata = read.csv("CBdata.csv")
head(CBdata)
Now, I just had to give read.csv the name of the file, and it found the file because it was saved in my project’s working directory. if you have it saved somewhere else, it might be a little harder. If you type read.csv("") and hit the tab key inside the quotes, RStudio will help you navigate (quick demo here).
But what if we want to import the original excel file? For that, we need the “readxl” package
library(readxl)
?read_excel
#CBdata = read_xlsx("1972-2019CBMatrix.xlsx", sheet = "CB CPUE Matrix 1972-2019")
#Yuck. If you have too many blanks in the begining of your sheet, it thinks it is
#a logical vector. Let's fix that.
#You can adjust the `guessmax` argument to skip all the beginning NAs
CBdata = read_excel("1972-2019CBMatrix.xlsx",
sheet = "CB CPUE Matrix 1972-2019", guess_max = 100000,
na = "NA")
Now we have our data imported, but it is usually not in the format we want. Often the column names are long and unwieldy, data got imported in the wrong format, or we want to re-order the data.
A lot of these skills can be done with either Base R syntax, or “tidyverse” syntax. Most people find “tidyverse” a little more intuitive, but you can use whichever you are more comfortable with.
One of the useful thing about the tidyverse functions is the fact that you can use “pipes” %>% to string together several functions instead of nesting the functions. We will introduce you to pipes, but I’ll show you how to nest the function too, so you can use whichever is easier for you to understand.
When the EMP zooplankton folks were getting data ready to share, they wanted to use nice long descriptive column names. But some of them my be long and annoying if we are going to use them for coding. So we want to rename some of them. We have several options.
library(tidyverse)
names(CBdata)
#several options for how to rename your columns
#Use the square brackets to select values from the vector of column names
names(CBdata)[names(CBdata)== "SurveyCode"] = "cheese"
names(CBdata)
#you could also just make a new column and delete the old one
CBdata$crackers = CBdata$cheese
CBdata$cheese = NULL
names(CBdata)
#using "rename" (from the package "dplyr", one of the tidyverse packages)
# is a little more intuitive
CBdata = rename(CBdata, SurveyCode = crackers)
names(CBdata)
#this is what it looks like with a pipe instead of nesting
CBdata = CBdata %>%
rename(crackers = DWRStationNo)
names(CBdata)
The zooplankton data set has one row for every sample, and one column for every species. It also has columns with the sums of groups of species (e.g. “All Cladocera”). We want to get rid of those summed columns because they make the data messy and harder to work with.
#Here is how you drop a column in base R
CBdata2 = subset(CBdata, select = -ALLCYCADULTS)
#but we can do more complicated selections with the "select" function from dplyr
?select
CBdata2 = select(CBdata, -starts_with("ALL"))
But now our “SurveyCode” column is at the end. We want to put it back in the beginning.
#here's how to do it in base R
CBdata2 = CBdata2[,c(68, 1:67)]
names(CBdata2)
#you can also use the "select" function in dplyr to do the same thing, but you
#can list the columns by name. (note that the `:` specifies the list of all values in between two values)
CBdata2v = select(CBdata2, Survey, Year, SurveyCode, SurveyRep:BARNNAUP)
names(CBdata2v)
#I just learned this cool thing when you have a lot of columns!
CBdatatest = select(CBdata2, DAPHNIA, everything())
names(CBdatatest)
We frequently need to transform data (especially log transformations). We might also need to calculate CPUE, or do other stuff, and add our calculated variables to our data set.
#In base R, we can just define a new variable and assign its value with the old variable
CBdata$ln_cals = log(CBdata$ALLCALADULTS + 1)
#or we can use the "mutate" function from dplyr. This is especially
#helpful if you want to make a bunch of variables at once
CBdata = CBdata %>%
mutate(ln_clads = log(ALLCLADOCERA + 1),
ln_cyc = log(ALLCYCADULTS +1),
ln_all = ln_cals + ln_cyc + ln_clads)
View(CBdata)
Even though we read in all our columns based on particular classes, sometimes we want to change the data class. Sometimes that’s super easy, sometimes it’s more difficult. Almost all data classes have an “as” function. as.data.frame, as.character, as.numeric, as.factor, etc.
Other useful functions: length(object) # number of elements or components str(object) # structure of an object class(object) # class or type of an object names(object) # names
Factors are particularly useful for creating models with categorical variables. Factors are actually stored as integers, but have helpful labels so they look like categories.
str(CBdata2)
#changing things from characters to factors and back again is really easy
#the "Region" variable would make a good factor
CBdata2$Region = as.factor(CBdata2$Region)
#or we can do it with "mutate" again
CBdata2 = CBdata2 %>%
mutate(Station = as.factor(Station))
str(CBdata2)
levels(CBdata2$Region)
The levels are automatically in alphabetical order. If we want to put them in order from west to east we can change the levels using the “factor” function. Just put the levels in the order you want. You can also use this function to change the labels of the factors (useful when plotting)
?factor
CBdata2$Region = factor(CBdata2$Region,
levels = c("NapaR", "SPBay",
"CarStrait", "SuiBay", "SuiMar", "EZ",
"WestDel", "SacR","SJR",
"NEDel","NorDel","SoDel","EastDel"),
labels = c("Napa River", "San Pablo Bay",
"Carquinez", "Suisun Bay", "Suisun Marsh",
"EZ", "West Delta", "Sacramento River",
"San Joaquin River", "North East Delta",
"North Delta", "South Delta", "East Delta")
)
levels(CBdata2$Region)
You can also change from factor to character, factor to numeric, or numeric to character, or logical values.
For example, if we wanted to look at crabzoea presence/absence instead of abundance, we can use “as.logical”
CBdata2 = mutate(CBdata2, CRABZOEA_yn = as.logical(CRABZOEA))
View(CBdata2)
CBdata2 = select(CBdata2, -CRABZOEA_yn)
Take 20-30 minutes to work through these examples. Feel free to raise your hand or use the chat to ask for help. Use the “thumbs up” reaction when you are done.
How many samples does the Zooplankton survey have from the North Delta that contain Harpacticoids (HARPACT)?
Make a new variable for the total catch of all zooplankton.
Create a new data frame that only contains samples with positive occurrences of Eurytemora (EURYTEM).