I have taught a 6-hour class on data manipulation. There is a lot to cover! But for today, I want to focus on a few basic topics:
We will be moving fast, but hopefully this gives you a taste of what is possible. Please feel free to reach out for more!
If you are new to R, you may be confused or nervous about loading ‘a bunch of packages’. However, packages are ways to make your life easier, so just embrace them. You can always download one, try it out, and then delete it if you don’t like it. Often there are several packages that all do the same thing. In particular, a set of packages known as “the tidyverse” is a popular alternative to many of the base-R functions. I like them, and I think they are easier to learn, so while I’ll show you a few Base R tools, mostly I’ll use the tidyverse.
These packages are:
The tidyverse is a collection of packages that can easily be installed with a single “meta”-package, which is called “tidyverse”. This provides a convenient way of downloading and installing all tidyverse packages with a single R command:
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
We’ll also want a really nifty package that helps us see what our
data manipulation commands are doing called tidylog
#install.packages("tidylog")
library(tidylog)
##
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
##
## add_count, add_tally, anti_join, count, distinct, distinct_all,
## distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
## full_join, group_by, group_by_all, group_by_at, group_by_if,
## inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
## relocate, rename, rename_all, rename_at, rename_if, rename_with,
## right_join, sample_frac, sample_n, select, select_all, select_at,
## select_if, semi_join, slice, slice_head, slice_max, slice_min,
## slice_sample, slice_tail, summarise, summarise_all, summarise_at,
## summarise_if, summarize, summarize_all, summarize_at, summarize_if,
## tally, top_frac, top_n, transmute, transmute_all, transmute_at,
## transmute_if, ungroup
## The following objects are masked from 'package:tidyr':
##
## drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
## spread, uncount
## The following object is masked from 'package:stats':
##
## filter
For more on how these packages work together see
tidyverse::paper
But rigth now we will just use these packages and worry about phylosophy later.
Another of my favorite packages for data manipulation is
lubridate
. It makes working with dates and times a lot
easier.
library(lubridate)
We will be using the FMWT catch matrices for this demo. By the end of the day we will be able to calculate the index from teh raw catch data.
First we will go to the CDFW FTP site and download the data
https://filelib.wildlife.ca.gov/Public/TownetFallMidwaterTrawl/FMWT%20Data/
The file we want is “FMWT 1967-2022 Catch Matrix_updated.zip”
Then we will unzip the file and save it in the working directory for this project.
getwd()
## [1] "C:/Users/rhartman/OneDrive - California Department of Water Resources/Office meetins and programs/mentorship"
I hope you are already working in a project. If not, let’s set one up now!
Create a folder called ‘data’ and move your file there.
If you are feeling fancy, you can do this in R
temp <- tempfile()
download.file("https://filelib.wildlife.ca.gov/Public/TownetFallMidwaterTrawl/FMWT%20Data/FMWT%201967-2022%20Catch%20Matrix_updated.zip",temp)
FMWT <- read_csv(unz(temp, "FMWT 1967-2022 Catch Matrix_updated.csv"))
## Rows: 29252 Columns: 142
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (140): Year, SurveyNumber, StationCode, index, StationLat, StationLong,...
## date (1): SampleDate
## time (1): SampleTimeStart
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unlink(temp)
But if you downloaded it and put it in your folder, you can use
read_csv
to bring it in.
read_csv
is very similar to read.csv
, but
it’s a little better at guessing data types.
FMWT = read_csv("data/FMWT 1967-2022 Catch Matrix_updated.csv")
## Rows: 29252 Columns: 142
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (140): Year, SurveyNumber, StationCode, index, StationLat, StationLong,...
## date (1): SampleDate
## time (1): SampleTimeStart
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
OK, now that we have our data read in, let’s try some manipulation.
First, let’s filter out just the last 20 years of data, and just the index stations.
#for this, we can use the 'filter' command
?filter
## starting httpd help server ... done
#filter uses your basic comparison commands. For example, remove all years after 2000
FMWT2 = filter(FMWT, Year >2000)
## filter: removed 18,493 rows (63%), 10,759 rows remaining
Now remove all surveys that weren’t in the ‘normal’ (Sept-Dec) time period
#First create a 'month' variable using the 'mutate' command, and the 'month' command from lubridate.
FMWT3 = mutate(FMWT2, Month = month(SampleDate))
## mutate: new variable 'Month' (double) with 8 unique values and 0% NA
#see my last presention on Lubridate for more on that.
Now we can use %in%
to find all rows that match a set of
values
FMWT4 = filter(FMWT3, Month %in% c(9:12))
## filter: removed 292 rows (3%), 10,467 rows remaining
The filter
command allows you to select rows. The
select
command allows you to select columns. You can also
use select
to reorder columns
So, if we wanted just the environmental data and not the fish data:
#(note that the `:` specifies the list of all values in between two values)
FMWTenv = select(FMWT4, Year:WaveCode)
## select: dropped 115 variables (Aequorea spp, American Shad, Arrow Goby, Bat Ray, Bay Goby, …)
If we want to put Delta Smelt first in our dataset
#I just learned this cool thing when you have a lot of columns!
FMWTtest = select(FMWT4, `Delta Smelt`, everything())
## select: columns reordered (Delta Smelt, Year, SampleDate, SurveyNumber, StationCode, …)
FMWTtest2 = select(FMWT4, -`Delta Smelt`)
## select: dropped one variable (Delta Smelt)
Let’s now use pivot_longer
to switch this from ‘wide’ to
‘long’ format.
The PI for FMWT was actually very nice, and the file in the zipped
folder you downloaded called
FMWT 1967-2022 Catch Matrix_updated_tidy.csv
has it already
in long format, but i’ll demo this anyway.
pivot_longer
comes with another command,
pivot_wider
that reverses the process.
?pivot_longer
?pivot_wider
#last_col() is another useful way of selecting a column without haveing to writeout it's name.
?tidyselect
FMWTlong = pivot_longer(FMWT4, cols = c(`Aequorea spp`:last_col()),
names_to = "Species",
values_to = "Catch")
## pivot_longer: reorganized (Aequorea spp, American Shad, Arrow Goby, Bat Ray, Bay Goby, …) into (Species, Catch) [was 10467x143, now 1203705x30]
Let’s do a little more filtering. A few important things to know:
==
This is the logical equals sign. It means “Is this
equal?” whereas a single =
means “This is equal”.
|
Or &
And %in%
Contained
within !
Not
#let's get a dataset that is all the Delta Smelt
Deltasmelt = filter(FMWTlong, Species == "Delta Smelt")
## filter: removed 1,193,238 rows (99%), 10,467 rows remaining
#how about Delta Smelt and longfin smelt?
smelt = filter(FMWTlong, Species == "Delta Smelt" | Species == "Longfin Smelt")
## filter: removed 1,182,771 rows (98%), 20,934 rows remaining
#you could also write this:
smelt = filter(FMWTlong, Species %in% c("Delta Smelt","Longfin Smelt"))
## filter: removed 1,182,771 rows (98%), 20,934 rows remaining
#everything except Delta Smelt or Longfin Smelt
notsmelt = filter(FMWTlong, !Species %in% c("Delta Smelt","Longfin Smelt"))
## filter: removed 20,934 rows (2%), 1,182,771 rows remaining
One of the things I do most frequently in R is make summaries of
datasets or perform the same operation on different groups of data in my
dataset. group_by
and summarize
are gold.
#calculate annual average catch by species
FMWTann = summarize(group_by(FMWTlong, Year, Species), Catch = mean(Catch, na.rm =T))
## group_by: 2 grouping variables (Year, Species)
## summarize: now 2,530 rows and 3 columns, one group variable remaining (Year)
#calculate total catch of Delta Smelt
DSann = summarize(group_by(Deltasmelt, Year), Catch = sum(Catch, na.rm =T))
## group_by: one grouping variable (Year)
## summarize: now 22 rows and 2 columns, ungrouped
One of the nifty things about the tidyverse family of functions is you can do a lot of operations together. Traditionally, if you want to do a bunch of things to a dataset, you end up with a bunch of nested parentheses that are hard to keep track of. Pipes make the steps easier to see.
%>%
This is a pipe. IT feeds a dataset into a
function. It’s part of all the tidyverse packages. shortcut: ctrl shift
m
It was so popular, R now has it’s own ‘native’ pipe:
|>
I still use %>% because I’m used to it, but feel free to try out both.
For example, what if we want to filter out the Delta smelt, and calculate monthly total catch we can do it all in one step
If we used parentheses, it would look like this;
DSann = summarize(group_by(mutate(filter(FMWTlong, Species == "Delta Smelt"), Month = month(SampleDate)), Year, Month), Catch = sum(Catch))
## filter: removed 1,193,238 rows (99%), 10,467 rows remaining
## mutate: new variable 'Month' (double) with 4 unique values and 0% NA
## group_by: 2 grouping variables (Year, Month)
## summarize: now 88 rows and 3 columns, one group variable remaining (Year)
With pipes, it’s now more clear which attributes come with which funciotn.
DSann = FMWTlong %>%
filter(Species == "Delta Smelt") %>%
mutate(Month = month(SampleDate)) %>%
group_by(Year, Month) %>%
summarize(Catch = sum(Catch, na.rm = T))
## filter: removed 1,193,238 rows (99%), 10,467 rows remaining
## mutate: new variable 'Month' (double) with 4 unique values and 0% NA
## group_by: 2 grouping variables (Year, Month)
## summarize: now 88 rows and 3 columns, one group variable remaining (Year)
For the last data manipulation topic, I’m going to cover joining multiple datasets together in different ways.
This is very useful if you have information in a look up table (like scientific names, or station GPS coordinate, or whatever)
For example, we might be interested in associating the fish catch with water year type. Fortunately, FlowWest (a consulting firm we work with regularly), has a data package with all the water year types cleaned up and ready to go.
#devtools::install_github("FlowWest/waterYearType")
library(waterYearType)
View(water_year_indices)
Let’s filter out the Sacramento valley index and select the columns we want (water year, Index, year type)
WYs = water_year_indices %>%
filter(location == "Sacramento Valley") %>%
select(WY, Index, Yr_type)
## filter: removed 116 rows (50%), 116 rows remaining
## select: dropped 4 variables (Oct_Mar, Apr_Jul, WYsum, location)
Now we can use left_join
to join these two datasets
together. This is one of a set of functions in the dplyr package that
are very similar to the joins in sql queries.
It will automatically join on any columns shared between the two dataframes. If they have different names, we just need to tell it what the names are using the ‘by’ argument.
?left_join
FMWTlong2 = left_join(FMWTlong, WYs, by = c("Year" = "WY"))
## left_join: added 2 columns (Index, Yr_type)
## > rows only in x 56,120
## > rows only in y ( 95)
## > matched rows 1,147,585
## > ===========
## > rows total 1,203,705
This dropped all the years in the WYs
data frame that
don’t have FMWT data associated with it, but it kept the FMWT data that
doesn’t have a water year index (2022). If we wanted to keep all the
data, we’d use full_join
. If we only want to keep data that
matches in both data frames, we’d use inner_join
.