A few data manipulation tricks

I have taught a 6-hour class on data manipulation. There is a lot to cover! But for today, I want to focus on a few basic topics: reading in data, filtering rows and selecting columns, pivoting from wide to long, summarizing, pipes, and joins.

We will be moving fast, but hopefully this gives you a taste of what is possible. Please feel free to reach out for more!

The tidyverse

If you are new to R, you may be confused or nervous about loading ‘a bunch of packages’. However, packages are ways to make your life easier, so just embrace them. You can always download one, try it out, and then delete it if you don’t like it. Often there are several packages that all do the same thing. In particular, a set of packages known as “the tidyverse” is a popular alternative to many of the base-R functions. I like them, and I think they are easier to learn, so while I’ll show you a few Base R tools, mostly I’ll use the tidyverse.

These packages are:

  • ggplot2
  • tibble
  • tidyr
  • readr
  • purrr
  • dplyr
  • stringr
  • forcats

The tidyverse is a collection of packages that can easily be installed with a single “meta”-package, which is called “tidyverse”. This provides a convenient way of downloading and installing all tidyverse packages with a single R command:

#install.packages("tidyverse")

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We’ll also want tidylog, a really nifty package that shows us exactly what our data manipulation commands are doing.

#install.packages("tidylog")
library(tidylog)
## 
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct, distinct_all,
##     distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
##     relocate, rename, rename_all, rename_at, rename_if, rename_with,
##     right_join, sample_frac, sample_n, select, select_all, select_at,
##     select_if, semi_join, slice, slice_head, slice_max, slice_min,
##     slice_sample, slice_tail, summarise, summarise_all, summarise_at,
##     summarise_if, summarize, summarize_all, summarize_at, summarize_if,
##     tally, top_frac, top_n, transmute, transmute_all, transmute_at,
##     transmute_if, ungroup
## The following objects are masked from 'package:tidyr':
## 
##     drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
##     spread, uncount
## The following object is masked from 'package:stats':
## 
##     filter

For more on how these packages work together, see the tidyverse paper (Wickham et al. 2019, “Welcome to the Tidyverse”), which ships as a package vignette: vignette("paper", package = "tidyverse")

But right now we will just use these packages and worry about philosophy later.

Another of my favorite packages for data manipulation is lubridate. It makes working with dates and times a lot easier.

library(lubridate)
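
Just to give you a flavor, here are a couple of things lubridate can do (a quick sketch, not something we need right away):

#parse a date from text, then pull pieces out of it
d = ymd("2022-09-15")
month(d) #9
year(d)  #2022
wday(d, label = TRUE) #day of the week, as a labelled factor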

Step one - get the data

We will be using the FMWT catch matrices for this demo. By the end of the day we will be able to calculate the index from the raw catch data.

First we will go to the CDFW FTP site and download the data

https://filelib.wildlife.ca.gov/Public/TownetFallMidwaterTrawl/FMWT%20Data/

The file we want is “FMWT 1967-2022 Catch Matrix_updated.zip”

https://filelib.wildlife.ca.gov/Public/TownetFallMidwaterTrawl/FMWT%20Data/FMWT%201967-2022%20Catch%20Matrix_updated.zip

Then we will unzip the file and save it in the working directory for this project.

getwd()
## [1] "C:/Users/rhartman/OneDrive - California Department of Water Resources/Office meetins and programs/mentorship"

I hope you are already working in a project. If not, let’s set one up now!

Create a folder called ‘data’ and move your file there.

If you are feeling fancy, you can do this in R

#download the zip to a temporary file, then read the csv straight out of it
temp <- tempfile()
download.file("https://filelib.wildlife.ca.gov/Public/TownetFallMidwaterTrawl/FMWT%20Data/FMWT%201967-2022%20Catch%20Matrix_updated.zip",temp)
FMWT <- read_csv(unz(temp, "FMWT 1967-2022 Catch Matrix_updated.csv"))
## Rows: 29252 Columns: 142
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (140): Year, SurveyNumber, StationCode, index, StationLat, StationLong,...
## date   (1): SampleDate
## time   (1): SampleTimeStart
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unlink(temp)
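
Or, if you would rather keep a permanent copy in your data folder (so the read_csv call below works), something along these lines should do it. This is just a sketch; the zip file name "FMWT_catch_matrix.zip" is one I made up.

#create the data folder if it doesn't already exist
if (!dir.exists("data")) dir.create("data")

#download the zip into the data folder, then extract just the csv we want
zipfile = file.path("data", "FMWT_catch_matrix.zip")
download.file("https://filelib.wildlife.ca.gov/Public/TownetFallMidwaterTrawl/FMWT%20Data/FMWT%201967-2022%20Catch%20Matrix_updated.zip",
              zipfile, mode = "wb")
unzip(zipfile, files = "FMWT 1967-2022 Catch Matrix_updated.csv", exdir = "data")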

But if you downloaded it and put it in your folder, you can use read_csv to bring it in.

read_csv is very similar to read.csv, but it’s a little better at guessing data types.

FMWT = read_csv("data/FMWT 1967-2022 Catch Matrix_updated.csv")
## Rows: 29252 Columns: 142
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (140): Year, SurveyNumber, StationCode, index, StationLat, StationLong,...
## date   (1): SampleDate
## time   (1): SampleTimeStart
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
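
Just to illustrate that difference, here is what base R’s read.csv does with the same file (a quick sketch; note that read.csv also replaces the spaces in column names with dots):

#read.csv leaves SampleDate as plain text, while read_csv parsed it as a Date
FMWTbase = read.csv("data/FMWT 1967-2022 Catch Matrix_updated.csv")
class(FMWTbase$SampleDate) #"character"
class(FMWT$SampleDate)     #"Date"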

OK, now that we have our data read in, let’s try some manipulation.

First, let’s filter out just the last 20 years of data, and just the index stations.

#for this, we can use the 'filter' command
?filter
## starting httpd help server ... done
#filter uses your basic comparison operators. For example, keep only years after 2000

FMWT2 = filter(FMWT, Year >2000)
## filter: removed 18,493 rows (63%), 10,759 rows remaining
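
There is also an index column in the dataset. I haven’t double-checked the coding here, but assuming index stations are flagged with a 1, a filter for just the index stations might look like this (we’ll keep using FMWT2 for the rest of the demo):

#keep only the index stations (assuming index == 1 flags them)
FMWTindex = filter(FMWT2, index == 1)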

Now remove all surveys that weren’t in the ‘normal’ (Sept-Dec) time period

#First create a 'month' variable using the 'mutate' command, and the 'month' command from lubridate.

FMWT3 = mutate(FMWT2, Month = month(SampleDate))
## mutate: new variable 'Month' (double) with 8 unique values and 0% NA
#see my last presentation on lubridate for more on that.

Now we can use %in% to find all rows that match a set of values

FMWT4 = filter(FMWT3, Month %in% c(9:12))
## filter: removed 292 rows (3%), 10,467 rows remaining

Selecting particular columns

The filter command allows you to select rows. The select command allows you to select columns. You can also use select to reorder columns.

So, if we wanted just the environmental data and not the fish data:

#(note that the `:` specifies the list of all values in between two values)
FMWTenv = select(FMWT4, Year:WaveCode)
## select: dropped 115 variables (Aequorea spp, American Shad, Arrow Goby, Bat Ray, Bay Goby, …)

If we want to put Delta Smelt first in our dataset (or drop it entirely):

#I just learned this cool thing when you have a lot of columns!
FMWTtest = select(FMWT4, `Delta Smelt`,  everything())
## select: columns reordered (Delta Smelt, Year, SampleDate, SurveyNumber, StationCode, …)
FMWTtest2 = select(FMWT4, -`Delta Smelt`)
## select: dropped one variable (Delta Smelt)

Wide to long

Let’s now use pivot_longer to switch this from ‘wide’ to ‘long’ format.

The PI for FMWT was actually very nice, and the file in the zipped folder you downloaded called FMWT 1967-2022 Catch Matrix_updated_tidy.csv has it already in long format, but I’ll demo this anyway.

pivot_longer comes with another command, pivot_wider that reverses the process.

?pivot_longer
?pivot_wider

#last_col() is another useful way of selecting a column without having to write out its name.
?tidyselect

FMWTlong = pivot_longer(FMWT4, cols = c(`Aequorea spp`:last_col()),
                        names_to = "Species",
                        values_to = "Catch")
## pivot_longer: reorganized (Aequorea spp, American Shad, Arrow Goby, Bat Ray, Bay Goby, …) into (Species, Catch) [was 10467x143, now 1203705x30]
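
And just to show that pivot_wider really does reverse the process, here is a quick sketch going back the other way:

#spread the species back out into their own columns
FMWTwide = pivot_wider(FMWTlong, names_from = Species, values_from = Catch)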

Let’s do a little more filtering. A few important things to know:

  • == The logical equals sign. It means “Is this equal?” whereas a single = means “This is equal”.
  • | Or
  • & And
  • %in% Contained within
  • ! Not

#let's get a dataset that is all the Delta Smelt
Deltasmelt = filter(FMWTlong, Species == "Delta Smelt")
## filter: removed 1,193,238 rows (99%), 10,467 rows remaining
#how about Delta Smelt and longfin smelt?

smelt = filter(FMWTlong, Species == "Delta Smelt" | Species == "Longfin Smelt")
## filter: removed 1,182,771 rows (98%), 20,934 rows remaining
#you could also write this:

smelt = filter(FMWTlong, Species %in% c("Delta Smelt","Longfin Smelt"))
## filter: removed 1,182,771 rows (98%), 20,934 rows remaining
#everything except Delta Smelt or Longfin Smelt

notsmelt = filter(FMWTlong, !Species %in% c("Delta Smelt","Longfin Smelt"))
## filter: removed 20,934 rows (2%), 1,182,771 rows remaining
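
We haven’t used & yet, so here is a quick sketch combining two conditions (a row is only kept if both are true):

#Delta Smelt catches greater than 100 fish
bigsmelt = filter(FMWTlong, Species == "Delta Smelt" & Catch > 100)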

Summarizing data

One of the things I do most frequently in R is make summaries of datasets or perform the same operation on different groups of data in my dataset. group_by and summarize are gold.

#calculate annual average catch by species
FMWTann =   summarize(group_by(FMWTlong, Year, Species), Catch = mean(Catch, na.rm =T))
## group_by: 2 grouping variables (Year, Species)
## summarize: now 2,530 rows and 3 columns, one group variable remaining (Year)
#calculate annual total catch of Delta Smelt 
DSann =   summarize(group_by(Deltasmelt, Year), Catch = sum(Catch, na.rm =T))
## group_by: one grouping variable (Year)
## summarize: now 22 rows and 2 columns, ungrouped
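
You can also calculate several summary statistics at once. A quick sketch (n() counts the rows in each group):

#mean catch, maximum catch, and number of tows per year for Delta Smelt
DSsummary = summarize(group_by(Deltasmelt, Year),
                      MeanCatch = mean(Catch, na.rm = T),
                      MaxCatch = max(Catch, na.rm = T),
                      Ntows = n())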

Pipes

One of the nifty things about the tidyverse family of functions is you can do a lot of operations together. Traditionally, if you want to do a bunch of things to a dataset, you end up with a bunch of nested parentheses that are hard to keep track of. Pipes make the steps easier to see.

%>% This is a pipe. It feeds a dataset into a function. It’s part of all the tidyverse packages. RStudio keyboard shortcut: Ctrl+Shift+M

It was so popular that R now has its own ‘native’ pipe: |>

I still use %>% because I’m used to it, but feel free to try out both.

For example, what if we want to filter out the Delta Smelt and calculate monthly total catch? We can do it all in one step.

If we used nested parentheses, it would look like this:

DSann = summarize(group_by(mutate(filter(FMWTlong, Species == "Delta Smelt"), Month = month(SampleDate)), Year, Month), Catch = sum(Catch))
## filter: removed 1,193,238 rows (99%), 10,467 rows remaining
## mutate: new variable 'Month' (double) with 4 unique values and 0% NA
## group_by: 2 grouping variables (Year, Month)
## summarize: now 88 rows and 3 columns, one group variable remaining (Year)

With pipes, it’s much clearer which arguments go with which function.

DSann = FMWTlong %>%
  filter(Species == "Delta Smelt") %>%
  mutate(Month = month(SampleDate)) %>%
  group_by(Year, Month) %>%
  summarize(Catch = sum(Catch, na.rm = T))
## filter: removed 1,193,238 rows (99%), 10,467 rows remaining
## mutate: new variable 'Month' (double) with 4 unique values and 0% NA
## group_by: 2 grouping variables (Year, Month)
## summarize: now 88 rows and 3 columns, one group variable remaining (Year)
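
If you want to try the native pipe, the same chain looks nearly identical (just a sketch):

DSann2 = FMWTlong |>
  filter(Species == "Delta Smelt") |>
  mutate(Month = month(SampleDate)) |>
  group_by(Year, Month) |>
  summarize(Catch = sum(Catch, na.rm = T))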

Joins

For the last data manipulation topic, I’m going to cover joining multiple datasets together in different ways.

This is very useful if you have information in a lookup table (like scientific names, or station GPS coordinates, or whatever).

For example, we might be interested in associating the fish catch with water year type. Fortunately, FlowWest (a consulting firm we work with regularly), has a data package with all the water year types cleaned up and ready to go.

#devtools::install_github("FlowWest/waterYearType")
library(waterYearType)
View(water_year_indices)

Let’s filter out just the Sacramento Valley index and select the columns we want (water year, Index, year type).

WYs = water_year_indices %>%
  filter(location == "Sacramento Valley") %>%
  select(WY, Index, Yr_type)
## filter: removed 116 rows (50%), 116 rows remaining
## select: dropped 4 variables (Oct_Mar, Apr_Jul, WYsum, location)

Now we can use left_join to join these two datasets together. This is one of a set of functions in the dplyr package that are very similar to the joins in sql queries.

It will automatically join on any columns shared between the two dataframes. If they have different names, we just need to tell it what the names are using the ‘by’ argument.

?left_join

FMWTlong2 = left_join(FMWTlong, WYs, by = c("Year" = "WY"))
## left_join: added 2 columns (Index, Yr_type)
##            > rows only in x      56,120
##            > rows only in y  (       95)
##            > matched rows     1,147,585
##            >                 ===========
##            > rows total       1,203,705

This dropped all the years in the WYs data frame that don’t have FMWT data associated with them, but it kept the FMWT data that doesn’t have a water year index (2022). If we wanted to keep all the data, we’d use full_join. If we only wanted to keep data that matches in both data frames, we’d use inner_join.
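
For reference, those alternatives look like this (same ‘by’ argument; just a sketch, we won’t use them today):

#keep everything from both data frames
FMWTall = full_join(FMWTlong, WYs, by = c("Year" = "WY"))

#keep only the rows that match in both data frames
FMWTmatch = inner_join(FMWTlong, WYs, by = c("Year" = "WY"))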