Data tidying

Introduction

We will now embark on a series of data transformations to convert our nine data sets into only two tidy data sets:

  • crabs
  • quadrats

So bear these two data sets in mind as these are our end goal for the Data Tidying workflow of today. As we proceed transforming our data frames we will keep variables that hold the intermediate steps. For example, early on in our workflow today, we will have a crabs01 data set. This is a messy data set. Then you will transform it a bit into crabs02, etc.. At the very end we will assign the last stage of our crabs data set to the variable crabs. The same pattern will be followed for quadrats, although this data set involves fewer steps.

From the previous section Raw Data you should have generated these nine data sets that are now our starting point:

  • Ria Formosa
    • rf_s_q1
    • rf_s_q2
    • rf_w_q1
    • rf_w_q2
  • Ria de Alvor
    • ra_s_q1
    • ra_s_q2
    • ra_w_q1
    • ra_w_q2
  • Quadrats
    • quadrats01

Okay, so now we are ready to start fixing the various aspects of messiness associated with our data sets.

Add missing columns

Currently, the fiddler crabs data is missing three variables:

  • region
  • season
  • quadrat

These variables were surreptitiously present in the names of the CSV files and in the name of the XLSX and included sheets.

An an example, the file rf_s_q1.csv, whose data has been read into rf_s_q1, indicates in its name that crab sampling was performed in Ria Formosa ("rf"), in the Summer ("s"), in quadrat 1 ("q1").

Exercise 2.1

Use the function dplyr::mutate() to add the three missing columns, and respective values, to each of the eight data frames:

  • Ria Formosa
    • rf_s_q1
    • rf_s_q2
    • rf_w_q1
    • rf_w_q2
  • Ria de Alvor
    • ra_s_q1
    • ra_s_q2
    • ra_w_q1
    • ra_w_q2

Here is an example for the data set rf_s_q1:

# Add the columns `region`, `season` and `quadrat`.
rf_s_q1_01 <- dplyr::mutate(rf_s_q1, region = "Ria Formosa", season = "summer", quadrat = "q1")

# Check the result.
dplyr::glimpse(rf_s_q1_01)
Rows: 1,243
Columns: 15
$ cl        <dbl> 21, 21, 15, 21, 19, 17, 19, 14, 19, 19, 17, 20, 22, 15, 17, …
$ lcl       <dbl> 16, 13, 9, 16, 8, 7, 14, 13, 10, 12, 8, 16, 13, 9, 8, 19, 13…
$ fw        <dbl> 15, 16, 10, 19, 14, 15, 15, 8, 11, 19, 8, 18, 15, 11, 13, 15…
$ species   <chr> "A. farensis", "A. farensis", "A. farensis", "A. farensis", …
$ longitude <dbl> -8.01873, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ is_gravid <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ rcl       <dbl> 10, 15, 5, 14, 7, 6, 9, 5, 10, 12, 13, 15, 11, 10, 6, 16, 12…
$ stage     <chr> "juvenile", "juvenile", "juvenile", "juvenile", "juvenile", …
$ sex       <chr> "?", "?", "female", "male or female", "female", "female", "f…
$ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ cw        <dbl> 27, 27, 19, 26, 24, 22, 25, 18, 24, 24, 21, 25, 27, 19, 22, …
$ latitude  <dbl> 37.02606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ region    <chr> "Ria Formosa", "Ria Formosa", "Ria Formosa", "Ria Formosa", …
$ season    <chr> "summer", "summer", "summer", "summer", "summer", "summer", …
$ quadrat   <chr> "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", …
Solution to Exercise 2.1
# Ria Formosa data sets
rf_s_q1_01 <- dplyr::mutate(rf_s_q1, region = "Ria Formosa", season = "summer", quadrat = "q1")
rf_s_q2_01 <- dplyr::mutate(rf_s_q2, region = "Ria Formosa", season = "summer", quadrat = "q2")
rf_w_q1_01 <- dplyr::mutate(rf_w_q1, region = "Ria Formosa", season = "winter", quadrat = "q1")
rf_w_q2_01 <- dplyr::mutate(rf_w_q2, region = "Ria Formosa", season = "winter", quadrat = "q2")

# Ria de Alvor data sets
ra_s_q1_01 <- dplyr::mutate(ra_s_q1, region = "Ria de Alvor", season = "summer", quadrat = "q1")
ra_s_q2_01 <- dplyr::mutate(ra_s_q2, region = "Ria de Alvor", season = "summer", quadrat = "q2")
ra_w_q1_01 <- dplyr::mutate(ra_w_q1, region = "Ria de Alvor", season = "winter", quadrat = "q1")
ra_w_q2_01 <- dplyr::mutate(ra_w_q2, region = "Ria de Alvor", season = "winter", quadrat = "q2")

Make variable (column) names consistent across data sets

Okay, now that we have all variables in our data sets, we will need to bind all data sets into one. But before we do that we will need to make sure that the variables (i.e. columns) across the eight data sets are all named in the same way. Otherwise, the step of combining the data sets will be cumbersome. We want it to be smooth. You’ll see in a bit what we mean.

We can tell you that there are discrepancies between the data sets coming from Ria Formosa and Ria de Alvor. This is because the two teams of marine biologists had different ideas on how to name two of the variables.

Also, the two teams had abbreviated the name of the morphological variables, and we find them hard to read, so we will rename them to new names as indicated in this table:

Current name New name Meaning
cw carapace_width Carapace width
cl carapace_length Carapace length
fw front_width Front width (distance between the crab eye stalks)
lcl left_claw_length Left claw length
rcl right_claw_length Right claw length

Exercise 2.2

  • Identify the two columns that have slightly different names between the Ria Formosa and the Ria de Alvor data sets.
  • Rename those two columns in the Ria de Alvor data sets (ra_s_q1_01, ra_s_q2_01, ra_w_q1_01 and ra_w_q2_01) to match the names used in Ria Formosa.
  • Rename the abbreviated column names to new, more explicit, names, as indicated in the table above. Do this for all data sets.
  • Name these new data sets as: rf_s_q1_02, rf_s_q2_02, etc. and ra_s_q1_02, ra_s_q2_02, etc..

Use the function dplyr::rename() for renaming. Here is an example:

rf_s_q1_02 <- dplyr::rename(rf_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)

dplyr::glimpse(rf_s_q1_02)
Rows: 1,243
Columns: 15
$ carapace_length   <dbl> 21, 21, 15, 21, 19, 17, 19, 14, 19, 19, 17, 20, 22, …
$ left_claw_length  <dbl> 16, 13, 9, 16, 8, 7, 14, 13, 10, 12, 8, 16, 13, 9, 8…
$ front_width       <dbl> 15, 16, 10, 19, 14, 15, 15, 8, 11, 19, 8, 18, 15, 11…
$ species           <chr> "A. farensis", "A. farensis", "A. farensis", "A. far…
$ longitude         <dbl> -8.01873, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ is_gravid         <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
$ right_claw_length <dbl> 10, 15, 5, 14, 7, 6, 9, 5, 10, 12, 13, 15, 11, 10, 6…
$ stage             <chr> "juvenile", "juvenile", "juvenile", "juvenile", "juv…
$ sex               <chr> "?", "?", "female", "male or female", "female", "fem…
$ id                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ carapace_width    <dbl> 27, 27, 19, 26, 24, 22, 25, 18, 24, 24, 21, 25, 27, …
$ latitude          <dbl> 37.02606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ region            <chr> "Ria Formosa", "Ria Formosa", "Ria Formosa", "Ria Fo…
$ season            <chr> "summer", "summer", "summer", "summer", "summer", "s…
$ quadrat           <chr> "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1"…
Solution to Exercise 2.2
# What column names are there in Ria de Alvor and not in Ria Formosa?
setdiff(colnames(ra_s_q1_01), colnames(rf_s_q1_01))

# And now the reverse question.
setdiff(colnames(rf_s_q1_01), colnames(ra_s_q1_01))

# So it seems Ria de Alvor data sets have named the `species` column as
# `species_name`, and `is_gravid?` contains a question mark. 
# Let us fix by mapping:
#  - `species_name` to `species`
#  - `is_gravid?` to `is_gravid`
# And also, do not forget to map the abbreviated column names to their full
# names.


# Note how the mapping of the names is indicated in `rename()`:
# New names go in the left hand side of the equal sign and old names in
# the right hand side.
# 
# Ria Formosa data sets
rf_s_q1_02 <- dplyr::rename(rf_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_s_q2_02 <- dplyr::rename(rf_s_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_w_q1_02 <- dplyr::rename(rf_w_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_w_q2_02 <- dplyr::rename(rf_w_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)

# Ria de Alvor data sets
ra_s_q1_02 <- dplyr::rename(ra_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_s_q2_02 <- dplyr::rename(ra_s_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_w_q1_02 <- dplyr::rename(ra_w_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_w_q2_02 <- dplyr::rename(ra_w_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)

Combining the crabs data sets into one

Now we will combine all the eight data sets into one single data frame. To do that we use the function dplyr::bind_rows(). And although the columns are not by the same order in all data sets, bind_rows() is smart enough to match each column of one data set to that of another. It does this by using the column names. And that is why it was so important harmonise the column names across data sets.

Exercise 2.3

Bind all the eight data sets into one named crabs01. Note the number of rows and columns of this new data set.

Solution to Exercise 2.3
crabs01 <-
  dplyr::bind_rows(
    rf_s_q1_02,
    rf_s_q2_02,
    rf_w_q1_02,
    rf_w_q2_02,
    ra_s_q1_02,
    ra_s_q2_02,
    ra_w_q1_02,
    ra_w_q2_02
  )

Here is how the crabs01 should look like:

crabs01
# A tibble: 7,326 × 15
   carapace_…¹ left_…² front…³ species longi…⁴ is_gr…⁵ right…⁶ stage sex      id
         <dbl>   <dbl>   <dbl> <chr>     <dbl> <lgl>     <dbl> <chr> <chr> <dbl>
 1          21      16      15 A. far…   -8.02 FALSE        10 juve… ?         1
 2          21      13      16 A. far…   NA    FALSE        15 juve… ?         2
 3          15       9      10 A. far…   NA    TRUE          5 juve… fema…     3
 4          21      16      19 A. far…   NA    FALSE        14 juve… male…     4
 5          19       8      14 A. far…   NA    FALSE         7 juve… fema…     5
 6          17       7      15 A. far…   NA    FALSE         6 juve… fema…     6
 7          19      14      15 A. far…   NA    FALSE         9 juve… fema…     7
 8          14      13       8 A. far…   NA    FALSE         5 juve… fema…     8
 9          19      10      11 A. far…   NA    TRUE         10 juve… fema…     9
10          19      12      19 A. far…   NA    FALSE        12 juve… fema…    10
# … with 7,316 more rows, 5 more variables: carapace_width <dbl>,
#   latitude <dbl>, region <chr>, season <chr>, quadrat <chr>, and abbreviated
#   variable names ¹​carapace_length, ²​left_claw_length, ³​front_width,
#   ⁴​longitude, ⁵​is_gravid, ⁶​right_claw_length
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Organising variables (columns)

An important part of the tidying up process is getting the data in a format that is as intuitive as possible for us. With regards to the existing columns in the crabs01 data set, there are two things we can further improve:

  • Remove spurious columns, i.e. columns that do not contribute with useful information for our analysis.
  • Reorder the columns, such that we start with slow-varying columns and end with rapid-varying columns.

Exercise 2.4

  • Remove the column id from the crabs01 data set. This identifier was created by the the two teams to keep track of each of crab in each quadrat. But it adds no extra information for us, so we can safely drop it. You can use the function dplyr::select() to keep, but also, to drop columns.

  • Recognise that there are three groups of variables (columns):

    • Contextual:
      • region
      • season
      • quadrat
      • latitude
      • longitude
    • Demographics:
      • species
      • sex
      • stage
      • is_gravid
    • Morphological:
      • carapace_width
      • carapace_length
      • front_width
      • left_claw_length
      • right_claw_length

    And re-arrange these columns in this order. You can use the function dplyr::relocate(). Name the resulting data set crabs02.

Solution to Exercise 2.4
crabs02 <-
  crabs01 %>%
  dplyr::select(-"id") %>%
  dplyr::relocate(
    region,
    season,
    quadrat,
    latitude,
    longitude,
    species,
    sex,
    stage,
    is_gravid,
    carapace_width,
    carapace_length,
    front_width,
    left_claw_length,
    right_claw_length
  )

Here is how the crabs02 data set should look like:

dplyr::glimpse(crabs02)
Rows: 7,326
Columns: 14
$ region            <chr> "Ria Formosa", "Ria Formosa", "Ria Formosa", "Ria Fo…
$ season            <chr> "summer", "summer", "summer", "summer", "summer", "s…
$ quadrat           <chr> "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1"…
$ latitude          <dbl> 37.02606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ longitude         <dbl> -8.01873, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ species           <chr> "A. farensis", "A. farensis", "A. farensis", "A. far…
$ sex               <chr> "?", "?", "female", "male or female", "female", "fem…
$ stage             <chr> "juvenile", "juvenile", "juvenile", "juvenile", "juv…
$ is_gravid         <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
$ carapace_width    <dbl> 27, 27, 19, 26, 24, 22, 25, 18, 24, 24, 21, 25, 27, …
$ carapace_length   <dbl> 21, 21, 15, 21, 19, 17, 19, 14, 19, 19, 17, 20, 22, …
$ front_width       <dbl> 15, 16, 10, 19, 14, 15, 15, 8, 11, 19, 8, 18, 15, 11…
$ left_claw_length  <dbl> 16, 13, 9, 16, 8, 7, 14, 13, 10, 12, 8, 16, 13, 9, 8…
$ right_claw_length <dbl> 10, 15, 5, 14, 7, 6, 9, 5, 10, 12, 13, 15, 11, 10, 6…

Filling in fake missing values

Now let us start looking into the values of variables, and try to spot strange patterns or weird values.

Look at the columns latitude and longitude. It seems that there are a lot of missing values (NA) and they all seem to be associated with Ria Formosa.

Let us use the function dplyr::count() to count how many observations are there for a specific combination of variables. Let’s try the columns region, quadrat, and latitude and longitude:

count(crabs02, region, quadrat, latitude, longitude)
# A tibble: 6 × 5
  region       quadrat latitude longitude     n
  <chr>        <chr>      <dbl>     <dbl> <int>
1 Ria de Alvor q1          37.1     -8.60  2077
2 Ria de Alvor q2          37.1     -8.62  1618
3 Ria Formosa  q1          37.0     -8.02     2
4 Ria Formosa  q1          NA       NA     1483
5 Ria Formosa  q2          37.0     -7.99     2
6 Ria Formosa  q2          NA       NA     2144

Let us look again at the first observations of crabs02. We will select a few columns, and slice a few rows for readability:

crabs02 %>%
  dplyr::select(region, quadrat, latitude, longitude) %>%
  dplyr::slice(1:5)
# A tibble: 5 × 4
  region      quadrat latitude longitude
  <chr>       <chr>      <dbl>     <dbl>
1 Ria Formosa q1          37.0     -8.02
2 Ria Formosa q1          NA       NA   
3 Ria Formosa q1          NA       NA   
4 Ria Formosa q1          NA       NA   
5 Ria Formosa q1          NA       NA   

It seems that the team of biologists at Ria Formosa did not record the latitude and longitude for all the observations except the first. Let’s now see if this is happening for each group of observations defined by the combination of region-quadrat:

crabs02 %>%
  dplyr::filter(region == "Ria Formosa") %>%
  dplyr::select(region, quadrat, latitude, longitude) %>%
  dplyr::group_by(region, quadrat) %>%
  dplyr::slice(1:3)
# A tibble: 6 × 4
# Groups:   region, quadrat [2]
  region      quadrat latitude longitude
  <chr>       <chr>      <dbl>     <dbl>
1 Ria Formosa q1          37.0     -8.02
2 Ria Formosa q1          NA       NA   
3 Ria Formosa q1          NA       NA   
4 Ria Formosa q2          37.0     -7.99
5 Ria Formosa q2          NA       NA   
6 Ria Formosa q2          NA       NA   

So, these are not truly missing values. These are just values that are absent because the team decided not to record them as they are all the same until the quadrat changes. To fix this, we will need to fill in the values.

Exercise 2.5

Use the function tidyr::fill() to fill in the values. Look at the documentation of this function or in the {tidyr} cheatsheet for how to use it.

Solution to Exercise 2.5
crabs03 <- tidyr::fill(data = crabs02, latitude, longitude, .direction = "down")

Recode NA aliases

Sometimes the values used to represent missing values may take many different forms. Take a close look at the column sex, and check if there are any other values than "female" or "male".

Exercise 2.6

Detect NA aliases used by the researchers to mean missingness or uncertainty regarding the sex of the crabs, and recode them to R’s appropriate value, i.e. the NA value.

Tip: use the unique(crabs03$sex) to see all unique values in that column. Then use a combination of dplyr::mutate(), dplyr::if_else() and the operator %in% to re-create the sex column with all missing values recoded to NA.

Solution to Exercise 2.6
# These `sex_na_aliases` were inferred after running `unique(crabs03$sex)`
sex_na_aliases <- c("?", "male or female", "-", " ", "N/R")
crabs04 <-
  crabs03 %>%
  mutate(sex = if_else(sex %in% sex_na_aliases, NA_character_, sex))

Compare the values in the sex column before and after the recoding:

# Before the recoding
unique(crabs03$sex)
[1] "?"              "female"         "male or female" NA              
[5] "-"              "N/R"            "male"          
# After the recoding
unique(crabs04$sex)
[1] NA       "female" "male"  

General recoding of values

As it turns out, the differences between the data sets generated at Ria Formosa and at Ria de Alvor are not exhausted yet… It seems that the variables species and stage are also not agreeing between the two Rias. Let us use dplyr::count() to see that:

dplyr::count(crabs04, region, species, stage)
# A tibble: 20 × 4
   region       species           stage           n
   <chr>        <chr>             <chr>       <int>
 1 Ria de Alvor Afruca farensis   a             464
 2 Ria de Alvor Afruca farensis   j             130
 3 Ria de Alvor Afruca farensis   p             130
 4 Ria de Alvor Afruca farensis   s             346
 5 Ria de Alvor Uca olhanensis    a             641
 6 Ria de Alvor Uca olhanensis    j             228
 7 Ria de Alvor Uca olhanensis    p             252
 8 Ria de Alvor Uca olhanensis    s             438
 9 Ria de Alvor Ucides alvorensis a             466
10 Ria de Alvor Ucides alvorensis j             128
11 Ria de Alvor Ucides alvorensis p             128
12 Ria de Alvor Ucides alvorensis s             344
13 Ria Formosa  A. farensis       adult         624
14 Ria Formosa  A. farensis       juvenile      178
15 Ria Formosa  A. farensis       pre_puberty   178
16 Ria Formosa  A. farensis       sub_adult     470
17 Ria Formosa  U. olhanensis     adult         881
18 Ria Formosa  U. olhanensis     juvenile      330
19 Ria Formosa  U. olhanensis     pre_puberty   354
20 Ria Formosa  U. olhanensis     sub_adult     616

The team at Ria de Alvor abbreviated the names of the developmental stages, and the team at Ria Formosa seems to have abbreviated the species names. This problematic, we need to recode them.

Exercise 2.7

Let us convert everything to full names:

Old value New value
j juvenile
p pre_puberty
s sub_adult
a adult
A. farensis Afruca farensis
U. olhanensis Uca olhanensis

A combination of dplyr::mutate() and dplyr::recode() will be helpful.

Solution to Exercise 2.7
crabs05 <-
  crabs04 %>%
  dplyr::mutate(
    species = dplyr::recode(
      species,
      `A. farensis` = "Afruca farensis",
      `U. olhanensis` = "Uca olhanensis"),
    stage = dplyr::recode(
      stage,
      `j` = "juvenile",
      `p` = "pre_puberty",
      `s` = "sub_adult",
      `a` = "adult"
    ),
    quadrat = substr(quadrat, 2, 2)
  )

The new data set should now show:

dplyr::count(crabs05, region, species, stage)
# A tibble: 20 × 4
   region       species           stage           n
   <chr>        <chr>             <chr>       <int>
 1 Ria de Alvor Afruca farensis   adult         464
 2 Ria de Alvor Afruca farensis   juvenile      130
 3 Ria de Alvor Afruca farensis   pre_puberty   130
 4 Ria de Alvor Afruca farensis   sub_adult     346
 5 Ria de Alvor Uca olhanensis    adult         641
 6 Ria de Alvor Uca olhanensis    juvenile      228
 7 Ria de Alvor Uca olhanensis    pre_puberty   252
 8 Ria de Alvor Uca olhanensis    sub_adult     438
 9 Ria de Alvor Ucides alvorensis adult         466
10 Ria de Alvor Ucides alvorensis juvenile      128
11 Ria de Alvor Ucides alvorensis pre_puberty   128
12 Ria de Alvor Ucides alvorensis sub_adult     344
13 Ria Formosa  Afruca farensis   adult         624
14 Ria Formosa  Afruca farensis   juvenile      178
15 Ria Formosa  Afruca farensis   pre_puberty   178
16 Ria Formosa  Afruca farensis   sub_adult     470
17 Ria Formosa  Uca olhanensis    adult         881
18 Ria Formosa  Uca olhanensis    juvenile      330
19 Ria Formosa  Uca olhanensis    pre_puberty   354
20 Ria Formosa  Uca olhanensis    sub_adult     616

Fix scale of quantitative variables

The Ria de Alvor team recorded the morphological measurements in centimetres, whereas the Ria Formosa team recorded them in millimetres. We need to fix this.

Exercise 2.8

Multiply the values of the morphological variables of Ria de Alvor by 10 to transform from centimetres to millimetres, and hence harmonise them with Ria Formosa’s values.

Solution to Exercise 2.8
crabs06 <-
  crabs05 %>%
  dplyr::mutate(
    carapace_width = if_else(region == "Ria de Alvor", carapace_width * 10, carapace_width),
    carapace_length = if_else(region == "Ria de Alvor", carapace_length * 10, carapace_length),
    front_width = if_else(region == "Ria de Alvor", front_width * 10, front_width),
    left_claw_length = if_else(region == "Ria de Alvor", left_claw_length * 10, left_claw_length),
    right_claw_length = if_else(region == "Ria de Alvor", right_claw_length * 10, right_claw_length)
  )

Compare the summary statistics before and after the transformation. Here is an example with carapace_width:

# Before multiplication by 10
summary(crabs05$carapace_width)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    4.20   10.80   22.81   36.00  142.00 
# After multiplication by 10
summary(crabs06$carapace_width)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   26.00   40.00   43.81   56.00  142.00 

Ensure appropriate variable types

Currently our data set crabs06 has a range of variables whose type is character, i.e. the individual values are strings. Examples are: region, species, stage, etc.. These are essentially nominal (or categorical), or ordered categorical (or ordinal) variables. In R these variable types are called factors.

Although it is possible to leave these variables as the character type, converting to factors has a few advantages:

  • Memory efficiency: factors take up less space than character variables, because they are effectively stored as integers.
  • Factors are useful when you know the set of possible values but they’re not all present in a given data set. In contrast to a character vector, when you tabulate a factor you’ll get counts of all categories, even unobserved ones.

Exercise 2.9

Use a combination of dplyr::mutate() and factor() to convert the following variables to factors:

  • region
  • season
  • species
  • sex
  • stage

Also, and because we do not have an integer variable in our crabs06 data set, make the quadrat an integer (for illustrative purposes). Tip: use as.integer() for that.

Solution to Exercise 2.9
crabs07 <-
  crabs06 %>%
  dplyr::mutate(
    region = factor(region),
    season = factor(season),
    quadrat = as.integer(quadrat),
    species = factor(species),
    sex = factor(sex, levels = c("female", "male")),
    stage = factor(stage, levels = c("juvenile", "pre_puberty", "sub_adult", "adult"), ordered = TRUE)
  )

Column names are values, not variable names

Remember the quadrats01 data set about the quadrats? It’s time to update it too. Look at the column names, they’re values! Also, the first column has an automatic name ...1 because originally in the Excel file it was empty.

quadrats01
# A tibble: 1 × 5
  ...1      `Ria Formosa-q1` `Ria Formosa-q2` `Ria de Alvor-q1` Ria de Alvor-q…¹
  <chr>                <dbl>            <dbl>             <dbl>            <dbl>
1 Area (m2)              250              360               260              220
# … with abbreviated variable name ¹​`Ria de Alvor-q2`

To fix this data set we will use two fancy functions that we haven’t used yet: tidyr::pivot_longer() and tidyr::separate().

Exercise 2.10

Convert the quadrats01 data set into a tidy data set by moving the values in the column names to table elements, and create new columns to accomodate them.

Solution to Exercise 2.10
quadrats02 <-
  quadrats01 %>%
  dplyr::select(-1) %>%
  tidyr::pivot_longer(cols = everything(), names_to = "region_quadrat", values_to = "area") %>%
  tidyr::separate(col = "region_quadrat", into = c("region", "quadrat"), sep = "-")

Exercise 2.11

And just like we did before, convert the region variable to factor and quadrat to integer and save the result to quadrats03.

Solution to Exercise 2.11
quadrats03 <-
  quadrats02 %>%
  dplyr::mutate(
    region = factor(region),
    quadrat = as.integer(substr(quadrat, 2, 2))
  )

Multiple observational unit types in one table

Datasets often involve values collected at multiple levels, on different types of observational units. During tidying, each type of observational unit should be stored in its own table. This is closely related to the idea of database normalisation, where each fact is expressed in only one place. It’s important because otherwise inconsistencies can arise. From {tidyr}vignette on multiple types

There are two variables (columns) in our crabs07 data set that characterise the quadrats, not the crabs individually. These are latitude and longitude. Yet, they’ve been recorded in the crabs data sets but belong in the quadrats data set.

Exercise 2.12

To fix this, we will need to get the unique latitude and longitude values for each quadrat. There is a nifty function that allow us to that: dplyr::distinct(). And combined with dplyr::left_join() will get us there…

Solution to Exercise 2.12
quadrats_coord <- dplyr::distinct(crabs07, region, quadrat, latitude, longitude)
quadrats04 <- dplyr::left_join(quadrats03, quadrats_coord)

# And now we can drop latitude and longitude from the `crabs07` data set.
# Note the minus sign behind the vector of column names.
crabs08 <- select(crabs07, -c("latitude", "longitude"))

Here’s how it should like:

quadrats04
# A tibble: 4 × 5
  region       quadrat  area latitude longitude
  <fct>          <int> <dbl>    <dbl>     <dbl>
1 Ria Formosa        1   250     37.0     -8.02
2 Ria Formosa        2   360     37.0     -7.99
3 Ria de Alvor       1   260     37.1     -8.60
4 Ria de Alvor       2   220     37.1     -8.62
crabs08
# A tibble: 7,326 × 12
   region     season quadrat species sex   stage is_gr…¹ carap…² carap…³ front…⁴
   <fct>      <fct>    <int> <fct>   <fct> <ord> <lgl>     <dbl>   <dbl>   <dbl>
 1 Ria Formo… summer       1 Afruca… <NA>  juve… FALSE        27      21      15
 2 Ria Formo… summer       1 Afruca… <NA>  juve… FALSE        27      21      16
 3 Ria Formo… summer       1 Afruca… fema… juve… TRUE         19      15      10
 4 Ria Formo… summer       1 Afruca… <NA>  juve… FALSE        26      21      19
 5 Ria Formo… summer       1 Afruca… fema… juve… FALSE        24      19      14
 6 Ria Formo… summer       1 Afruca… fema… juve… FALSE        22      17      15
 7 Ria Formo… summer       1 Afruca… fema… juve… FALSE        25      19      15
 8 Ria Formo… summer       1 Afruca… fema… juve… FALSE        18      14       8
 9 Ria Formo… summer       1 Afruca… fema… juve… TRUE         24      19      11
10 Ria Formo… summer       1 Afruca… fema… juve… FALSE        24      19      19
# … with 7,316 more rows, 2 more variables: left_claw_length <dbl>,
#   right_claw_length <dbl>, and abbreviated variable names ¹​is_gravid,
#   ²​carapace_width, ³​carapace_length, ⁴​front_width
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Wrapping up

Finally we will give simple names to our two data sets and save them to disk to mark this milestone of Data Tidying. These exported files will be the starting point of tomorrow’s workflow about Data Visualisation with {ggplot2}.

Exercise 2.13

  • Export the quadrats data frame to CSV (data-tidy/quadrats.csv)
  • Export the crabs data frame to CSV (data-tidy/crabs.csv)

Although CSV files are a good way of storing our tidy data sets and of sharing it with collaborators, they do not store information about the variables (columns) types.

So please export also to RDS format which is a special format that only R understands but that keeps the data exactly as we have it now, thus making it easier to read in this files and continue with our workflow about Data Visualisation tomorrow.

  • Export the quadrats data frame to RDS (data-tidy/quadrats.rds)
  • Export the crabs data frame to RDS (data-tidy/crabs.rds)
Solution to Exercise 2.13
# Assign to new variables with simpler names
quadrats <- quadrats04
crabs <- crabs08

# Export to disk
data_tidy_path <- here::here("data-tidy")
readr::write_csv(quadrats, file = file.path(data_tidy_path, "quadrats.csv"))
readr::write_csv(crabs, file = file.path(data_tidy_path, "crabs.csv"))

saveRDS(quadrats, file = file.path(data_tidy_path, "quadrats.rds"))
saveRDS(crabs, file = file.path(data_tidy_path, "crabs.rds"))

Data tidying script

Here is the entire script containing all code from the exercises present in section Raw Data and Data Tidying.

#
# Raw Data
#

library(tidyverse)
library(here)
library(tools)
library(readxl)

# Define the path to the raw data
data_raw_path <- here("data-raw")

# List the raw data files
list.files(data_raw_path)

# Reading the four CSV files (Ria Formosa)
rf_s_q1 <- readr::read_csv(file.path(data_raw_path, "rf_s_q1.csv"))
rf_s_q2 <- readr::read_csv(file.path(data_raw_path, "rf_s_q2.csv"))
rf_w_q1 <- readr::read_csv(file.path(data_raw_path, "rf_w_q1.csv"))
rf_w_q2 <- readr::read_csv(file.path(data_raw_path, "rf_w_q2.csv"))

# Reading now the four sheets inside of "Ria de Alvor.xlsx"
ra_path <- file.path(data_raw_path, "Ria de Alvor.xlsx")
ra_s_q1 <- readxl::read_excel(ra_path, sheet = "summer-q1")
ra_s_q2 <- readxl::read_excel(ra_path, sheet = "summer-q2")
ra_w_q1 <- readxl::read_excel(ra_path, sheet = "winter-q1")
ra_w_q2 <- readxl::read_excel(ra_path, sheet = "winter-q2")

# Finally, reading the details about the quadrats
quadrats01 <- readxl::read_excel(file.path(data_raw_path, "quadrats.xlsx"))

# Insights gained: to be discussed with trainers.



#
# Data Tidying
#

# EXERCISE 2.1
# Ria Formosa data sets
rf_s_q1_01 <- dplyr::mutate(rf_s_q1, region = "Ria Formosa", season = "summer", quadrat = "q1")
rf_s_q2_01 <- dplyr::mutate(rf_s_q2, region = "Ria Formosa", season = "summer", quadrat = "q2")
rf_w_q1_01 <- dplyr::mutate(rf_w_q1, region = "Ria Formosa", season = "winter", quadrat = "q1")
rf_w_q2_01 <- dplyr::mutate(rf_w_q2, region = "Ria Formosa", season = "winter", quadrat = "q2")

# Ria de Alvor data sets
ra_s_q1_01 <- dplyr::mutate(ra_s_q1, region = "Ria de Alvor", season = "summer", quadrat = "q1")
ra_s_q2_01 <- dplyr::mutate(ra_s_q2, region = "Ria de Alvor", season = "summer", quadrat = "q2")
ra_w_q1_01 <- dplyr::mutate(ra_w_q1, region = "Ria de Alvor", season = "winter", quadrat = "q1")
ra_w_q2_01 <- dplyr::mutate(ra_w_q2, region = "Ria de Alvor", season = "winter", quadrat = "q2")

# EXERCISE 2.2
# What column names are there in Ria de Alvor and not in Ria Formosa?
setdiff(colnames(ra_s_q1_01), colnames(rf_s_q1_01))

# And now the reverse question.
setdiff(colnames(rf_s_q1_01), colnames(ra_s_q1_01))

# So it seems Ria de Alvor data sets have named the `species` column as
# `species_name`, and `is_gravid?` contains a question mark. 
# Let us fix by mapping:
#  - `species_name` to `species`
#  - `is_gravid?` to `is_gravid`
# And also, do not forget to map the abbreviated column names to their full
# names.


# Note how the mapping of the names is indicated in `rename()`:
# New names go in the left hand side of the equal sign and old names in
# the right hand side.
# 
# Ria Formosa data sets
rf_s_q1_02 <- dplyr::rename(rf_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_s_q2_02 <- dplyr::rename(rf_s_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_w_q1_02 <- dplyr::rename(rf_w_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_w_q2_02 <- dplyr::rename(rf_w_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)

# Ria de Alvor data sets
ra_s_q1_02 <- dplyr::rename(ra_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_s_q2_02 <- dplyr::rename(ra_s_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_w_q1_02 <- dplyr::rename(ra_w_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_w_q2_02 <- dplyr::rename(ra_w_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)


# EXERCISE 2.3
crabs01 <-
  dplyr::bind_rows(
    rf_s_q1_02,
    rf_s_q2_02,
    rf_w_q1_02,
    rf_w_q2_02,
    ra_s_q1_02,
    ra_s_q2_02,
    ra_w_q1_02,
    ra_w_q2_02
  )


# EXERCISE 2.4
crabs02 <-
  crabs01 %>%
  dplyr::select(-"id") %>%
  dplyr::relocate(
    region,
    season,
    quadrat,
    latitude,
    longitude,
    species,
    sex,
    stage,
    is_gravid,
    carapace_width,
    carapace_length,
    front_width,
    left_claw_length,
    right_claw_length
  )

# EXERCISE 2.5
crabs03 <- tidyr::fill(data = crabs02, latitude, longitude, .direction = "down")

# EXERCISE 2.6
# These `sex_na_aliases` were inferred after running `unique(crabs03$sex)`
sex_na_aliases <- c("?", "male or female", "-", " ", "N/R")
crabs04 <-
  crabs03 %>%
  mutate(sex = if_else(sex %in% sex_na_aliases, NA_character_, sex))

# EXERCISE 2.7
crabs05 <-
  crabs04 %>%
  dplyr::mutate(
    species = dplyr::recode(
      species,
      `A. farensis` = "Afruca farensis",
      `U. olhanensis` = "Uca olhanensis"),
    stage = dplyr::recode(
      stage,
      `j` = "juvenile",
      `p` = "pre_puberty",
      `s` = "sub_adult",
      `a` = "adult"
    ),
    quadrat = substr(quadrat, 2, 2)
  )

# EXERCISE 2.8
crabs06 <-
  crabs05 %>%
  dplyr::mutate(
    carapace_width = if_else(region == "Ria de Alvor", carapace_width * 10, carapace_width),
    carapace_length = if_else(region == "Ria de Alvor", carapace_length * 10, carapace_length),
    front_width = if_else(region == "Ria de Alvor", front_width * 10, front_width),
    left_claw_length = if_else(region == "Ria de Alvor", left_claw_length * 10, left_claw_length),
    right_claw_length = if_else(region == "Ria de Alvor", right_claw_length * 10, right_claw_length)
  )

# EXERCISE 2.9
crabs07 <-
  crabs06 %>%
  dplyr::mutate(
    region = factor(region),
    season = factor(season),
    quadrat = as.integer(quadrat),
    species = factor(species),
    sex = factor(sex, levels = c("female", "male")),
    stage = factor(stage, levels = c("juvenile", "pre_puberty", "sub_adult", "adult"), ordered = TRUE)
  )

# EXERCISE 2.10
quadrats02 <-
  quadrats01 %>%
  dplyr::select(-1) %>%
  tidyr::pivot_longer(cols = everything(), names_to = "region_quadrat", values_to = "area") %>%
  tidyr::separate(col = "region_quadrat", into = c("region", "quadrat"), sep = "-")

# EXERCISE 2.11
quadrats03 <-
  quadrats02 %>%
  dplyr::mutate(
    region = factor(region),
    quadrat = as.integer(substr(quadrat, 2, 2))
  )

# EXERCISE 2.12
quadrats_coord <- dplyr::distinct(crabs07, region, quadrat, latitude, longitude)
quadrats04 <- dplyr::left_join(quadrats03, quadrats_coord)

# And now we can drop latitude and longitude from the `crabs07` data set.
# Note the minus sign behind the vector of column names.
crabs08 <- select(crabs07, -c("latitude", "longitude"))

# EXERCISE 2.13
# Assign to new variables with simpler names
quadrats <- quadrats04
crabs <- crabs08

# Export to disk
data_tidy_path <- here::here("data-tidy")
readr::write_csv(quadrats, file = file.path(data_tidy_path, "quadrats.csv"))
readr::write_csv(crabs, file = file.path(data_tidy_path, "crabs.csv"))

saveRDS(quadrats, file = file.path(data_tidy_path, "quadrats.rds"))
saveRDS(crabs, file = file.path(data_tidy_path, "crabs.rds"))