Raw data

Fiddler crabs

The raw data that you’ve downloaded to data-raw/ consist of six data files. These data were generated by two teams of marine biologists, one working at Ria Formosa and another at Ria de Alvor.

These two teams defined two quadrats in each of the Rias. They sampled fiddler crabs from the two quadrats in each Ria, at two times of the year: Summer and Winter.

They took note of some demographic variables, e.g. sex and developmental stage, as well as some morphometric variables. The remainder of the data set should be self-explanatory :)

First acquaintances

Start by loading the following required packages:

library(tidyverse)
library(here)
library(tools)
library(readxl)

Get a variable with the path to the directory containing the raw data files:

data_raw_path <- here("data-raw")
data_raw_path

[1] "/home/rmagno/sci/code/R/web/tdvr.oct.22/data-raw"

Note that, on your computer, the path up to data-raw/ will be different from the one shown here. This is the job of the function here(): it automatically determines the path of your data-raw directory. Now list the files in data-raw/:

list.files(data_raw_path)

[1] "data-raw.zip"      "quadrats.xlsx"     "rf_s_q1.csv"      
[4] "rf_s_q2.csv"       "rf_w_q1.csv"       "rf_w_q2.csv"      
[7] "Ria de Alvor.xlsx"

There are four CSV files (rf_s_q1.csv, rf_s_q2.csv, rf_w_q1.csv, rf_w_q2.csv) and two Excel files (quadrats.xlsx, Ria de Alvor.xlsx). The CSV files contain data on the fiddler crabs sampled during the four field trips to Ria Formosa. The analogous data for Ria de Alvor are provided in the Excel file Ria de Alvor.xlsx. The file quadrats.xlsx contains the area of each of the quadrats defined in both Ria Formosa and Ria de Alvor.
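The names of the four CSV files appear to follow a pattern. The sketch below rebuilds them from the sampling design, assuming that rf stands for Ria Formosa, s and w for Summer and Winter, and q1 and q2 for the two quadrats. That reading of the names is an assumption, not something stated in the files themselves. Only base R is used.

```r
# Sampling design at Ria Formosa: two seasons x two quadrats
design <- expand.grid(
  quadrat = 1:2,
  season = c("s", "w"),
  stringsAsFactors = FALSE
)

# Assumed naming scheme: rf_<season>_q<quadrat>.csv
expected_csv <- paste0("rf_", design$season, "_q", design$quadrat, ".csv")
expected_csv

# File names copied by hand from the listing of data-raw/
listed <- c(
  "data-raw.zip", "quadrats.xlsx", "rf_s_q1.csv", "rf_s_q2.csv",
  "rf_w_q1.csv", "rf_w_q2.csv", "Ria de Alvor.xlsx"
)

# Is every expected CSV file present in the listing?
all(expected_csv %in% listed)
```

If the last line returns TRUE, every file name predicted by the assumed scheme is present in the directory listing.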

Use Excel to visually inspect the data in Ria de Alvor.xlsx, in one of the CSV files of your choice, and in quadrats.xlsx. Take some quick notes on:

How many data sets are there?
What is the observational unit in each data set?
What are the variables? Are the variable names used consistently across the data sets from Ria Formosa (CSV files) and Ria de Alvor (Ria de Alvor.xlsx)?
By looking at the values of the variables, can you tell the type of each variable? I.e., is it categorical/nominal, ordinal, binary, or continuous? Are there invalid or unexpected values?

Exercise 1.1

Collect the previous commands into your R script data-tidying.R.

Solution to Exercise 1.1

library(tidyverse)
library(here)
library(tools)
library(readxl)

# Define the path to the raw data
data_raw_path <- here("data-raw")

# List the raw data files
list.files(data_raw_path)
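As a side note, here() accepts several path components at once, so the path to a single file can be built in one call. The sketch below is only an illustration: the file name is taken from the listing of data-raw/, and the full path returned depends on where your project lives.

```r
library(here)

# here() joins its arguments into one path below the project root
csv_path <- here("data-raw", "rf_s_q1.csv")

# The same file, with the path built in two steps
csv_path_2 <- file.path(here("data-raw"), "rf_s_q1.csv")

# Inspect the pieces of the path
basename(csv_path)           # the file name
basename(dirname(csv_path))  # the directory that contains it
```

Both variants should point to the same file. Which one to use is mostly a matter of taste.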

Programmatic acquaintances

We will now read the data into R, and try to answer the same questions but using R code. Here’s how you may read one of the CSV files using the read_csv() function:

rf_s_q1 <- readr::read_csv(file.path(data_raw_path, "rf_s_q1.csv"))

And here is how you read quadrats.xlsx into R:

quadrats01 <- readxl::read_excel(file.path(data_raw_path, "quadrats.xlsx"))

Exercise 1.2

Following those examples, can you read all six files into R? Note that Ria de Alvor.xlsx will be trickier because it contains several sheets.
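As a hint for workbooks with several sheets: readxl::excel_sheets() returns the sheet names of an Excel file, and read_excel() takes a sheet argument. The sketch below uses an example workbook that ships with the readxl package, not the course data, so that it runs anywhere.

```r
library(readxl)

# Example workbook bundled with readxl (not part of the crab data)
example_path <- readxl_example("datasets.xlsx")

# Names of the sheets in the workbook, in order
sheet_names <- excel_sheets(example_path)
sheet_names

# Read one sheet by name
first_sheet <- read_excel(example_path, sheet = sheet_names[1])
dplyr::glimpse(first_sheet)
```

Pointing excel_sheets() at the path to Ria de Alvor.xlsx would list the sheet names to use in that workbook.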

Solution to Exercise 1.2

# Reading the four CSV files (Ria Formosa)
rf_s_q1 <- readr::read_csv(file.path(data_raw_path, "rf_s_q1.csv"))
rf_s_q2 <- readr::read_csv(file.path(data_raw_path, "rf_s_q2.csv"))
rf_w_q1 <- readr::read_csv(file.path(data_raw_path, "rf_w_q1.csv"))
rf_w_q2 <- readr::read_csv(file.path(data_raw_path, "rf_w_q2.csv"))

# Now reading the four sheets inside "Ria de Alvor.xlsx"
ra_path <- file.path(data_raw_path, "Ria de Alvor.xlsx")
ra_s_q1 <- readxl::read_excel(ra_path, sheet = "summer-q1")
ra_s_q2 <- readxl::read_excel(ra_path, sheet = "summer-q2")
ra_w_q1 <- readxl::read_excel(ra_path, sheet = "winter-q1")
ra_w_q2 <- readxl::read_excel(ra_path, sheet = "winter-q2")

# Finally, reading the details about the quadrats
quadrats01 <- readxl::read_excel(file.path(data_raw_path, "quadrats.xlsx"))
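Reading files one by one works well for a handful of files. As an alternative, the sketch below reads every CSV file in a directory in one go and stores the results in a named list. To stay self-contained it first writes two tiny CSV files with made-up values to a temporary directory, and it uses base R's read.csv() instead of readr::read_csv(). The column names id and sex are placeholders, not necessarily those of the real files.

```r
# Throwaway directory with two tiny CSV files (made-up values)
tmp_dir <- file.path(tempdir(), "data-raw-demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, sex = c("female", "male")),
          file.path(tmp_dir, "rf_s_q1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:5, sex = c("male", "male", "female")),
          file.path(tmp_dir, "rf_s_q2.csv"), row.names = FALSE)

# Find all files whose name ends in ".csv"
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a data frame; the result is a list
crabs <- lapply(csv_files, read.csv)

# Name each element after its file, without the extension
names(crabs) <- tools::file_path_sans_ext(basename(csv_files))

names(crabs)
sapply(crabs, nrow)
```

With the real data, data_raw_path would take the place of tmp_dir, and read.csv could be swapped for readr::read_csv.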

Exercise 1.3

To inspect the data just loaded into R we may try the following functions on those objects:

View()
dplyr::glimpse()
colnames()
nrow() and ncol()
head() and tail()
summary() and table()
unique()

Here are a few examples:

# Print each column as a row and indicate the column name,
# followed by its type, and the first values:
dplyr::glimpse(rf_s_q1)

# Column names of the data frame `rf_s_q1`:
colnames(rf_s_q1)

# Create a frequency table (counts per value) of the column `stage`:
table(rf_s_q1$stage)

# Show me the unique values present in the `sex` column:
unique(rf_s_q1$sex)
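To check whether variable names are used consistently across data sets, the column names of two data frames can be compared directly. The sketch below uses two toy data frames with invented names and values, standing in for one CSV file and one Excel sheet. The real data may differ in other ways.

```r
# Toy stand-ins for two data sets (invented column names and values)
rf_demo <- data.frame(
  sex = c("female", "male", NA),
  stage = c("adult", "juvenile", "adult")
)
ra_demo <- data.frame(
  Sex = c("F", "M", "M"),
  stage = c("adult", "adult", "adult")
)

# Do both data sets use exactly the same column names?
identical(colnames(rf_demo), colnames(ra_demo))

# Names present in one data set but not in the other
setdiff(colnames(rf_demo), colnames(ra_demo))
setdiff(colnames(ra_demo), colnames(rf_demo))

# Count values, including missing ones, to spot unexpected entries
table(rf_demo$sex, useNA = "ifany")
```

The useNA = "ifany" argument makes table() report missing values as well, which it omits by default.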

In your script file data-tidying.R, write down as comments the insights you gained about the data.

Solution to Exercise 1.3

# Insights gained: to be discussed with trainers.