Outline

  1. Importing data
  2. Renaming columns and restructuring data frames
  3. Converting between data classes
  4. Adding new columns

Importing data

Import from your desktop

Rosie

There are multiple different ways of getting your data into R. RStudio has a nice little point=and-click interface that lets you import your dataset, however using the GUI makes it more difficult to replicate your work later. Therefore, it’s better to use the comand-line to import your data.

We are going to use the Zooplankton Survey data for this demonstration. It’s available On the CDFW Website

If you have a .csv file, you can use base R. So I went into the excel file and saved one of the worksheets as a .csv

If you’ve having trouble saving as a .csv, you can download the .csv version from the workshop GitHub Site

?read.csv

CBdata = read.csv("CBdata.csv")
head(CBdata)

Now, I just had to give read.csv the name of the file, and it found the file because it was saved in my project’s working directory. if you have it saved somewhere else, it might be a little harder. If you type read.csv("") and hit the tab key inside the quotes, RStudio will help you navigate (quick demo here).

But what if we want to import the original excel file? For that, we need the “readxl” package

library(readxl)

?read_excel

#CBdata = read_xlsx("1972-2019CBMatrix.xlsx", sheet = "CB CPUE Matrix 1972-2019")
#Yuck. If you have too many blanks in the begining of your sheet, it thinks it is
#a logical vector. Let's fix that.

#You can adjust the `guessmax` argument to skip all the beginning NAs
CBdata = read_excel("1972-2019CBMatrix.xlsx", 
    sheet = "CB CPUE Matrix 1972-2019", guess_max = 100000,
    na = "NA")

Dealing with data frames

Now we have our data imported, but it is usually not in the format we want. Often the column names are long and unwieldy, data got imported in the wrong format, or we want to re-order the data.

A lot of these skills can be done with either Base R syntax, or “tidyverse” syntax. Most people find “tidyverse” a little more intuitive, but you can use whichever you are more comfortable with.

One of the useful thing about the tidyverse functions is the fact that you can use “pipes” %>% to string together several functions instead of nesting the functions. We will introduce you to pipes, but I’ll show you how to nest the function too, so you can use whichever is easier for you to understand.

Renaming columns

When the EMP zooplankton folks were getting data ready to share, they wanted to use nice long descriptive column names. But some of them my be long and annoying if we are going to use them for coding. So we want to rename some of them. We have several options.

library(tidyverse)

names(CBdata)

#several options for how to rename your columns

#Use the square brackets to select values from the vector of column names
names(CBdata)[names(CBdata)== "SurveyCode"] = "cheese"

names(CBdata)

#you could also just make a new column and delete the old one
CBdata$crackers = CBdata$cheese
CBdata$cheese = NULL

names(CBdata)

#using "rename" (from the package "dplyr", one of the tidyverse packages) 
# is a little more intuitive
CBdata = rename(CBdata, SurveyCode = crackers)

names(CBdata)

#this is what it looks like with a pipe instead of nesting

CBdata = CBdata %>%
  rename(crackers = DWRStationNo)

names(CBdata)

Dropping columns

The zooplankton data set has one row for every sample, and one column for every species. It also has columns with the sums of groups of species (e.g. “All Cladocera”). We want to get rid of those summed columns because they make the data messy and harder to work with.

#Here is how you drop a column in base R

CBdata2 = subset(CBdata, select = -ALLCYCADULTS)

#but we can do more complicated selections with the "select" function from dplyr
?select

CBdata2 = select(CBdata, -starts_with("ALL"))

Reordering columns

But now our “SurveyCode” column is at the end. We want to put it back in the beginning.

#here's how to do it in base R
CBdata2 = CBdata2[,c(68, 1:67)]

names(CBdata2)

#you can also use the "select" function in dplyr to do the same thing, but you 
#can list the columns by name. (note that the `:` specifies the list of all values in between two values)
CBdata2v = select(CBdata2, Survey, Year, SurveyCode, SurveyRep:BARNNAUP)

names(CBdata2v)

#I just learned this cool thing when you have a lot of columns!
CBdatatest = select(CBdata2, DAPHNIA, everything())

names(CBdatatest)

Adding new mutated data

We frequently need to transform data (especially log transformations). We might also need to calculate CPUE, or do other stuff, and add our calculated variables to our data set.

#In base R, we can just define a new variable and assign its value with the old variable
CBdata$ln_cals = log(CBdata$ALLCALADULTS + 1)

#or we can use the "mutate" function from dplyr. This is especially
#helpful if you want to make a bunch of variables at once

CBdata = CBdata %>%
  mutate(ln_clads = log(ALLCLADOCERA + 1), 
         ln_cyc = log(ALLCYCADULTS +1), 
         ln_all = ln_cals + ln_cyc + ln_clads)

View(CBdata)

converting among data classes

Even though we read in all our columns based on particular classes, sometimes we want to change the data class. Sometimes that’s super easy, sometimes it’s more difficult. Almost all data classes have an “as” function. as.data.frame, as.character, as.numeric, as.factor, etc.

Other useful functions: length(object) # number of elements or components str(object) # structure of an object class(object) # class or type of an object names(object) # names

Factors are particularly useful for creating models with categorical variables. Factors are actually stored as integers, but have helpful labels so they look like categories.

str(CBdata2)

#changing things from characters to factors and back again is really easy
#the "Region" variable would make a good factor

CBdata2$Region = as.factor(CBdata2$Region)

#or we can do it with "mutate" again
CBdata2 = CBdata2 %>%
  mutate(Station = as.factor(Station)) 

str(CBdata2)

levels(CBdata2$Region)

The levels are automatically in alphabetical order. If we want to put them in order from west to east we can change the levels using the “factor” function. Just put the levels in the order you want. You can also use this function to change the labels of the factors (useful when plotting)

?factor

CBdata2$Region = factor(CBdata2$Region, 
                        levels = c("NapaR", "SPBay", 
                                   "CarStrait",  "SuiBay", "SuiMar", "EZ",
                                   "WestDel", "SacR","SJR",
                                   "NEDel","NorDel","SoDel","EastDel"),
                        labels = c("Napa River", "San Pablo Bay",
                                   "Carquinez", "Suisun Bay", "Suisun Marsh",
                                   "EZ", "West Delta", "Sacramento River",
                                   "San Joaquin River", "North East Delta",
                                   "North Delta", "South Delta", "East Delta")
                        )

levels(CBdata2$Region)

You can also change from factor to character, factor to numeric, or numeric to character, or logical values.

For example, if we wanted to look at crabzoea presence/absence instead of abundance, we can use “as.logical”

CBdata2 = mutate(CBdata2, CRABZOEA_yn = as.logical(CRABZOEA))

View(CBdata2)

CBdata2 = select(CBdata2, -CRABZOEA_yn)

Now its your turn

Take 20-30 minutes to work through these examples. Feel free to raise your hand or use the chat to ask for help. Use the “thumbs up” reaction when you are done.

  1. How many samples does the Zooplankton survey have from the North Delta that contain Harpacticoids (HARPACT)?

  2. Make a new variable for the total catch of all zooplankton.

  3. Create a new data frame that only contains samples with positive occurrences of Eurytemora (EURYTEM).