Dataset Restructuring

Wide to long

The data is never in the format you want it. Usually, R likes things to be in “long” format instead of “wide” format. However, data is often published in “wide” format, and community analyses like things in “wide” format, so you need to know how to switch between the two.

The “Long” format preferred because it usually makes the data “tidy”. Tidy data is a method of data structuring analagous to the ‘normal forms’ of database structure. We don’t have time to go into detail here, but in brief, we want data where:

Every column is a variable.
Every row is an observation.
Every cell is a single value.

You can read more about tidy data in the R for Data Science Web Book

To start tidying our zooplankton data, we need to transpose our data so we have one species per row. You can easily transpose the rows and columns of a matrix using t

#we can set up a quick data frame
X <- data.frame(
  Item1 = 1:5,
  Item2 = 6:10,
  Item3 = 11:20
)

X

#Then Transpose it
Y = t(X)
Y

However, sometimes you want to transpose some of the columns and leave the other columns the way they are. For example, we want to keep the environmental data for each row in the zooplankton data set, but transpose the species names.

The functions pivot_wider and pivot_longer are probably the easiest way to do this. You may have heard of other ways to do it, including the reshape package (melt and cast), and the earlier tidyr functions spread and gather. They all work, it’s just a matter of which works for you.

library(tidyverse)
?pivot_longer

CBlong = CBdata2 %>% #specify the data set you want to pivot
  pivot_longer(cols = ACARTELA:CRABZOEA, #specify the colomns to pivot
                      names_to = "TaxonCode", #give the name of the new column you want to create
                      values_to = "CPUE" #name of the column for the values
                      )

View(CBlong)

If we want to put it back in “wide” format, we use pivot_wider

?pivot_wider

CBwide = CBlong %>% #specify the data set you want to pivot
  pivot_wider(id_cols = c(Survey, Year, Date, Station), #Identifier columns (ones you don't want to pivot)
                      names_from = TaxonCode, #give the name of the new column you want to use for the names of the new columns
                      values_from = CPUE #name of the column for the values
                      )

View(CBwide)

We can also use this to summarize values as we pivot. For example, if we wanted one row for each year instead of for each sample, we can do that.

CBwideY = CBlong %>% #specify the data set you want to pivot
  pivot_wider( id_cols = Year, #specify the column with the unique identifier
                      names_from = TaxonCode, #give the name of the new column you want to use for the names of the new columns
                      values_from = CPUE, #name of the column for the values
                      values_fn = list(CPUE = sum) #function to use to combine values
                      )

View(CBwideY)

Merging data

Frequently, we also need to combine several tables. For example, we might want the GPS coordinates for the stations to be included in our zooplankton table. The excel file had those coordinates, but they were in another spread sheet. To put those tables together, we can use the “merge” function.

#first let's import the station lookup table from the excel file

taxalookup <- read_excel("1972-2018CBMatrix.xlsx", 
    sheet = "CB Taxa Lookup")
View(taxalookup)

#apparently the first row wasn't the column names! Oh no!

#let's rename that
names(taxalookup) =  c("TaxonCode", "TaxonName", "StartYear", "EndYear")
View(taxalookup)

#get rid of the first and last row
taxalookup = taxalookup[-c(1,58),]
View(taxalookup)

Now we can attach the scientific names to the main “CBlong” data frame

?merge

CBlong2 = merge(CBlong, taxalookup, by = "TaxonCode")

#It automatically merges by any column with the same name (you don't really need the "by" argument), but
#sometimes our datasets don't have the same column names, even if they contain the same data.

taxalookup2 = taxalookup %>%
  rename(taxoncode = TaxonCode)

CBlongtest = merge(CBlong[1:200,], taxalookup2, by.x = "TaxonCode", by.y = "taxoncode")

#if we didn't specify, it would be a mess
CBlongtest2 = merge(CBlong[1:200,], taxalookup2)

#it tried to pair up every combination of rows!

This automatically only keeps rows that are present in both data sets (equivalent to an “inner join” in SQL). If we want to do a left join or a right join you can set “all.x = T” or “all.y = T”.

#for example, if we only had scientific names for some of our taxa
taxalookupshort = taxalookup[1:20,]

CBlong3 = merge(CBlong, taxalookupshort, by = "TaxonCode")
nrow(CBlong)
nrow(CBlong3)
#all the rows that weren't in the first 20 taxa have been dropped


CBlong4 = merge(CBlong, taxalookupshort, by = "TaxonCode", all.x = T)
nrow(CBlong4)
View(CBlong4)
#Empty values are now "NA"

That’s how it works in base R. The “tidyverse” has a set of functions for different types of joins that are very similar to the syntax in SQL or an Access database, so some people find them easier to deal with.

?join

#for a basic join, the syntax is similar
CBlong2 = inner_join(CBlong, taxalookup, by = "TaxonCode")

#but when things get more complicated, it's a little easier to see what you are doing
taxalookupshort = taxalookup[1:20,]

CBlong3 = inner_join(CBlong, taxalookupshort, by = "TaxonCode")
CBlong4 = left_join(CBlong, taxalookupshort, by = "TaxonCode")
nrow(CBlong)
nrow(CBlong3)
nrow(CBlong4)
View(CBlong4)
#Empty values are now "NA"

Exporting data

Now that we have data in a format we like, we can export the data to share with our colleagues or use later. We can export it as a .csv file, or as an .Rdata file. The .Rdata files are smaller, so easier to send around, but are less widely used and not as stable for long-term storage.

write.csv(CBlong[1:200,], "CBdata_longformat.csv")

#remember that this will automatically save to our working directory. If you want to save somewhere else 
#you will need to specify the whole path

#a subfolder within your directory
write.csv(CBlong[1:200,], "data/CBlonformat.csv")

#a totally different folder
write.csv(CBlong[1:200,], "C:/Users/rhartman/Desktop/CBlong.csv")

#hint: hit 'tab' inside the quotes to help you with the file path

If you open one of the .csv’s you just made, you will notice that there is a new column of numbers at the start of the file. These were the row names from the original data frame. These get annoying if you need to read the file back into R. Get rid of them with row.names = FALSE

write.csv(CBlong[1:200,], "CBdata_longformat.csv", row.names = FALSE)

If you are saving the data to put back into R later, you can save it as an Rdata file. These files take up less storage space than .csvs, and it saves the row names and data classes as well as the data themselves.

#for an .Rdata file (also .rda)
save(CBlong, file = "cblong.rda")

Now its your turn

Take 20-30 minutes to work through these examples. Feel free to raise your hand or use the chat to ask for help. Use the “thumbs up” reaction when you are done.

Import the “Station_GPS.csv” into your environment and merge the latitude and longitude to the “CBlong” data frame.
Convert your new data frame (with lats and longs) back to “wide” format.
Save you new data frame as a .csv.

Did you remember to comment your code?

Dataset restructuring

Rosemary Hartman

6/15/2021

Dataset Restructuring

Wide to long

Merging data

Exporting data

Now its your turn