Data tidying
Introduction
We will now embark on a series of data transformations to convert our nine data sets into only two tidy data sets:
crabs
quadrats
So bear these two data sets in mind as these are our end goal for the Data Tidying workflow of today. As we proceed transforming our data frames we will keep variables that hold the intermediate steps. For example, early on in our workflow today, we will have a crabs01
data set. This is a messy data set. Then you will transform it a bit into crabs02
, etc.. At the very end we will assign the last stage of our crabs data set to the variable crabs
. The same pattern will be followed for quadrats
, although this data set involves fewer steps.
From the previous section Raw Data you should have generated these nine data sets that are now our starting point:
- Ria Formosa
rf_s_q1
rf_s_q2
rf_w_q1
rf_w_q2
- Ria de Alvor
ra_s_q1
ra_s_q2
ra_w_q1
ra_w_q2
- Quadrats
quadrats01
Okay, so now we are ready to start fixing the various aspects of messiness associated with our data sets.
Add missing columns
Currently, the fiddler crabs data is missing three variables:
region
season
quadrat
These variables were surreptitiously present in the names of the CSV files and in the name of the XLSX and included sheets.
An an example, the file
rf_s_q1.csv
, whose data has been read intorf_s_q1
, indicates in its name that crab sampling was performed in Ria Formosa ("rf"
), in the Summer ("s"
), in quadrat 1 ("q1"
).
Exercise 2.1
Use the function dplyr::mutate()
to add the three missing columns, and respective values, to each of the eight data frames:
- Ria Formosa
rf_s_q1
rf_s_q2
rf_w_q1
rf_w_q2
- Ria de Alvor
ra_s_q1
ra_s_q2
ra_w_q1
ra_w_q2
Here is an example for the data set rf_s_q1
:
# Add the columns `region`, `season` and `quadrat`.
<- dplyr::mutate(rf_s_q1, region = "Ria Formosa", season = "summer", quadrat = "q1")
rf_s_q1_01
# Check the result.
::glimpse(rf_s_q1_01) dplyr
Rows: 1,243
Columns: 15
$ cl <dbl> 21, 21, 15, 21, 19, 17, 19, 14, 19, 19, 17, 20, 22, 15, 17, …
$ lcl <dbl> 16, 13, 9, 16, 8, 7, 14, 13, 10, 12, 8, 16, 13, 9, 8, 19, 13…
$ fw <dbl> 15, 16, 10, 19, 14, 15, 15, 8, 11, 19, 8, 18, 15, 11, 13, 15…
$ species <chr> "A. farensis", "A. farensis", "A. farensis", "A. farensis", …
$ longitude <dbl> -8.01873, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ is_gravid <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ rcl <dbl> 10, 15, 5, 14, 7, 6, 9, 5, 10, 12, 13, 15, 11, 10, 6, 16, 12…
$ stage <chr> "juvenile", "juvenile", "juvenile", "juvenile", "juvenile", …
$ sex <chr> "?", "?", "female", "male or female", "female", "female", "f…
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ cw <dbl> 27, 27, 19, 26, 24, 22, 25, 18, 24, 24, 21, 25, 27, 19, 22, …
$ latitude <dbl> 37.02606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ region <chr> "Ria Formosa", "Ria Formosa", "Ria Formosa", "Ria Formosa", …
$ season <chr> "summer", "summer", "summer", "summer", "summer", "summer", …
$ quadrat <chr> "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", …
Solution to Exercise 2.1
# Ria Formosa data sets
<- dplyr::mutate(rf_s_q1, region = "Ria Formosa", season = "summer", quadrat = "q1")
rf_s_q1_01 <- dplyr::mutate(rf_s_q2, region = "Ria Formosa", season = "summer", quadrat = "q2")
rf_s_q2_01 <- dplyr::mutate(rf_w_q1, region = "Ria Formosa", season = "winter", quadrat = "q1")
rf_w_q1_01 <- dplyr::mutate(rf_w_q2, region = "Ria Formosa", season = "winter", quadrat = "q2")
rf_w_q2_01
# Ria de Alvor data sets
<- dplyr::mutate(ra_s_q1, region = "Ria de Alvor", season = "summer", quadrat = "q1")
ra_s_q1_01 <- dplyr::mutate(ra_s_q2, region = "Ria de Alvor", season = "summer", quadrat = "q2")
ra_s_q2_01 <- dplyr::mutate(ra_w_q1, region = "Ria de Alvor", season = "winter", quadrat = "q1")
ra_w_q1_01 <- dplyr::mutate(ra_w_q2, region = "Ria de Alvor", season = "winter", quadrat = "q2") ra_w_q2_01
Make variable (column) names consistent across data sets
Okay, now that we have all variables in our data sets, we will need to bind all data sets into one. But before we do that we will need to make sure that the variables (i.e. columns) across the eight data sets are all named in the same way. Otherwise, the step of combining the data sets will be cumbersome. We want it to be smooth. You’ll see in a bit what we mean.
We can tell you that there are discrepancies between the data sets coming from Ria Formosa and Ria de Alvor. This is because the two teams of marine biologists had different ideas on how to name two of the variables.
Also, the two teams had abbreviated the name of the morphological variables, and we find them hard to read, so we will rename them to new names as indicated in this table:
Current name | New name | Meaning |
---|---|---|
cw |
carapace_width |
Carapace width |
cl |
carapace_length |
Carapace length |
fw |
front_width |
Front width (distance between the crab eye stalks) |
lcl |
left_claw_length |
Left claw length |
rcl |
right_claw_length |
Right claw length |
Exercise 2.2
- Identify the two columns that have slightly different names between the Ria Formosa and the Ria de Alvor data sets.
- Rename those two columns in the Ria de Alvor data sets (
ra_s_q1_01
,ra_s_q2_01
,ra_w_q1_01
andra_w_q2_01
) to match the names used in Ria Formosa. - Rename the abbreviated column names to new, more explicit, names, as indicated in the table above. Do this for all data sets.
- Name these new data sets as:
rf_s_q1_02
,rf_s_q2_02
, etc. andra_s_q1_02
,ra_s_q2_02
, etc..
Use the function dplyr::rename()
for renaming. Here is an example:
<- dplyr::rename(rf_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_s_q1_02
::glimpse(rf_s_q1_02) dplyr
Rows: 1,243
Columns: 15
$ carapace_length <dbl> 21, 21, 15, 21, 19, 17, 19, 14, 19, 19, 17, 20, 22, …
$ left_claw_length <dbl> 16, 13, 9, 16, 8, 7, 14, 13, 10, 12, 8, 16, 13, 9, 8…
$ front_width <dbl> 15, 16, 10, 19, 14, 15, 15, 8, 11, 19, 8, 18, 15, 11…
$ species <chr> "A. farensis", "A. farensis", "A. farensis", "A. far…
$ longitude <dbl> -8.01873, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ is_gravid <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
$ right_claw_length <dbl> 10, 15, 5, 14, 7, 6, 9, 5, 10, 12, 13, 15, 11, 10, 6…
$ stage <chr> "juvenile", "juvenile", "juvenile", "juvenile", "juv…
$ sex <chr> "?", "?", "female", "male or female", "female", "fem…
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ carapace_width <dbl> 27, 27, 19, 26, 24, 22, 25, 18, 24, 24, 21, 25, 27, …
$ latitude <dbl> 37.02606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ region <chr> "Ria Formosa", "Ria Formosa", "Ria Formosa", "Ria Fo…
$ season <chr> "summer", "summer", "summer", "summer", "summer", "s…
$ quadrat <chr> "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1"…
Solution to Exercise 2.2
# What column names are there in Ria de Alvor and not in Ria Formosa?
setdiff(colnames(ra_s_q1_01), colnames(rf_s_q1_01))
# And now the reverse question.
setdiff(colnames(rf_s_q1_01), colnames(ra_s_q1_01))
# So it seems Ria de Alvor data sets have named the `species` column as
# `species_name`, and `is_gravid?` contains a question mark.
# Let us fix by mapping:
# - `species_name` to `species`
# - `is_gravid?` to `is_gravid`
# And also, do not forget to map the abbreviated column names to their full
# names.
# Note how the mapping of the names is indicated in `rename()`:
# New names go in the left hand side of the equal sign and old names in
# the right hand side.
#
# Ria Formosa data sets
<- dplyr::rename(rf_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_s_q1_02 <- dplyr::rename(rf_s_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_s_q2_02 <- dplyr::rename(rf_w_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_w_q1_02 <- dplyr::rename(rf_w_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_w_q2_02
# Ria de Alvor data sets
<- dplyr::rename(ra_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_s_q1_02 <- dplyr::rename(ra_s_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_s_q2_02 <- dplyr::rename(ra_w_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_w_q1_02 <- dplyr::rename(ra_w_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`) ra_w_q2_02
Combining the crabs data sets into one
Now we will combine all the eight data sets into one single data frame. To do that we use the function dplyr::bind_rows()
. And although the columns are not by the same order in all data sets, bind_rows()
is smart enough to match each column of one data set to that of another. It does this by using the column names. And that is why it was so important harmonise the column names across data sets.
Exercise 2.3
Bind all the eight data sets into one named crabs01
. Note the number of rows and columns of this new data set.
Solution to Exercise 2.3
<-
crabs01 ::bind_rows(
dplyr
rf_s_q1_02,
rf_s_q2_02,
rf_w_q1_02,
rf_w_q2_02,
ra_s_q1_02,
ra_s_q2_02,
ra_w_q1_02,
ra_w_q2_02 )
Here is how the crabs01
should look like:
crabs01
# A tibble: 7,326 × 15
carapace_…¹ left_…² front…³ species longi…⁴ is_gr…⁵ right…⁶ stage sex id
<dbl> <dbl> <dbl> <chr> <dbl> <lgl> <dbl> <chr> <chr> <dbl>
1 21 16 15 A. far… -8.02 FALSE 10 juve… ? 1
2 21 13 16 A. far… NA FALSE 15 juve… ? 2
3 15 9 10 A. far… NA TRUE 5 juve… fema… 3
4 21 16 19 A. far… NA FALSE 14 juve… male… 4
5 19 8 14 A. far… NA FALSE 7 juve… fema… 5
6 17 7 15 A. far… NA FALSE 6 juve… fema… 6
7 19 14 15 A. far… NA FALSE 9 juve… fema… 7
8 14 13 8 A. far… NA FALSE 5 juve… fema… 8
9 19 10 11 A. far… NA TRUE 10 juve… fema… 9
10 19 12 19 A. far… NA FALSE 12 juve… fema… 10
# … with 7,316 more rows, 5 more variables: carapace_width <dbl>,
# latitude <dbl>, region <chr>, season <chr>, quadrat <chr>, and abbreviated
# variable names ¹carapace_length, ²left_claw_length, ³front_width,
# ⁴longitude, ⁵is_gravid, ⁶right_claw_length
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Organising variables (columns)
An important part of the tidying up process is getting the data in a format that is as intuitive as possible for us. With regards to the existing columns in the crabs01
data set, there are two things we can further improve:
- Remove spurious columns, i.e. columns that do not contribute with useful information for our analysis.
- Reorder the columns, such that we start with slow-varying columns and end with rapid-varying columns.
Exercise 2.4
Remove the column
id
from thecrabs01
data set. This identifier was created by the the two teams to keep track of each of crab in each quadrat. But it adds no extra information for us, so we can safely drop it. You can use the functiondplyr::select()
to keep, but also, to drop columns.Recognise that there are three groups of variables (columns):
- Contextual:
region
season
quadrat
latitude
longitude
- Demographics:
species
sex
stage
is_gravid
- Morphological:
carapace_width
carapace_length
front_width
left_claw_length
right_claw_length
And re-arrange these columns in this order. You can use the function
dplyr::relocate()
. Name the resulting data setcrabs02
.- Contextual:
Solution to Exercise 2.4
<-
crabs02 %>%
crabs01 ::select(-"id") %>%
dplyr::relocate(
dplyr
region,
season,
quadrat,
latitude,
longitude,
species,
sex,
stage,
is_gravid,
carapace_width,
carapace_length,
front_width,
left_claw_length,
right_claw_length )
Here is how the crabs02
data set should look like:
::glimpse(crabs02) dplyr
Rows: 7,326
Columns: 14
$ region <chr> "Ria Formosa", "Ria Formosa", "Ria Formosa", "Ria Fo…
$ season <chr> "summer", "summer", "summer", "summer", "summer", "s…
$ quadrat <chr> "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1", "q1"…
$ latitude <dbl> 37.02606, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ longitude <dbl> -8.01873, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ species <chr> "A. farensis", "A. farensis", "A. farensis", "A. far…
$ sex <chr> "?", "?", "female", "male or female", "female", "fem…
$ stage <chr> "juvenile", "juvenile", "juvenile", "juvenile", "juv…
$ is_gravid <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
$ carapace_width <dbl> 27, 27, 19, 26, 24, 22, 25, 18, 24, 24, 21, 25, 27, …
$ carapace_length <dbl> 21, 21, 15, 21, 19, 17, 19, 14, 19, 19, 17, 20, 22, …
$ front_width <dbl> 15, 16, 10, 19, 14, 15, 15, 8, 11, 19, 8, 18, 15, 11…
$ left_claw_length <dbl> 16, 13, 9, 16, 8, 7, 14, 13, 10, 12, 8, 16, 13, 9, 8…
$ right_claw_length <dbl> 10, 15, 5, 14, 7, 6, 9, 5, 10, 12, 13, 15, 11, 10, 6…
Filling in fake missing values
Now let us start looking into the values of variables, and try to spot strange patterns or weird values.
Look at the columns latitude
and longitude
. It seems that there are a lot of missing values (NA
) and they all seem to be associated with Ria Formosa.
Let us use the function dplyr::count()
to count how many observations are there for a specific combination of variables. Let’s try the columns region
, quadrat
, and latitude
and longitude
:
count(crabs02, region, quadrat, latitude, longitude)
# A tibble: 6 × 5
region quadrat latitude longitude n
<chr> <chr> <dbl> <dbl> <int>
1 Ria de Alvor q1 37.1 -8.60 2077
2 Ria de Alvor q2 37.1 -8.62 1618
3 Ria Formosa q1 37.0 -8.02 2
4 Ria Formosa q1 NA NA 1483
5 Ria Formosa q2 37.0 -7.99 2
6 Ria Formosa q2 NA NA 2144
Let us look again at the first observations of crabs02
. We will select a few columns, and slice a few rows for readability:
%>%
crabs02 ::select(region, quadrat, latitude, longitude) %>%
dplyr::slice(1:5) dplyr
# A tibble: 5 × 4
region quadrat latitude longitude
<chr> <chr> <dbl> <dbl>
1 Ria Formosa q1 37.0 -8.02
2 Ria Formosa q1 NA NA
3 Ria Formosa q1 NA NA
4 Ria Formosa q1 NA NA
5 Ria Formosa q1 NA NA
It seems that the team of biologists at Ria Formosa did not record the latitude and longitude for all the observations except the first. Let’s now see if this is happening for each group of observations defined by the combination of region-quadrat:
%>%
crabs02 ::filter(region == "Ria Formosa") %>%
dplyr::select(region, quadrat, latitude, longitude) %>%
dplyr::group_by(region, quadrat) %>%
dplyr::slice(1:3) dplyr
# A tibble: 6 × 4
# Groups: region, quadrat [2]
region quadrat latitude longitude
<chr> <chr> <dbl> <dbl>
1 Ria Formosa q1 37.0 -8.02
2 Ria Formosa q1 NA NA
3 Ria Formosa q1 NA NA
4 Ria Formosa q2 37.0 -7.99
5 Ria Formosa q2 NA NA
6 Ria Formosa q2 NA NA
So, these are not truly missing values. These are just values that are absent because the team decided not to record them as they are all the same until the quadrat changes. To fix this, we will need to fill in the values.
Exercise 2.5
Use the function tidyr::fill()
to fill in the values. Look at the documentation of this function or in the {tidyr}
cheatsheet for how to use it.
Solution to Exercise 2.5
<- tidyr::fill(data = crabs02, latitude, longitude, .direction = "down") crabs03
Recode NA
aliases
Sometimes the values used to represent missing values may take many different forms. Take a close look at the column sex
, and check if there are any other values than "female"
or "male"
.
Exercise 2.6
Detect NA
aliases used by the researchers to mean missingness or uncertainty regarding the sex of the crabs, and recode them to R’s appropriate value, i.e. the NA
value.
Tip: use the
unique(crabs03$sex)
to see all unique values in that column. Then use a combination ofdplyr::mutate()
,dplyr::if_else()
and the operator%in%
to re-create thesex
column with all missing values recoded toNA
.
Solution to Exercise 2.6
# These `sex_na_aliases` were inferred after running `unique(crabs03$sex)`
<- c("?", "male or female", "-", " ", "N/R")
sex_na_aliases <-
crabs04 %>%
crabs03 mutate(sex = if_else(sex %in% sex_na_aliases, NA_character_, sex))
Compare the values in the sex
column before and after the recoding:
# Before the recoding
unique(crabs03$sex)
[1] "?" "female" "male or female" NA
[5] "-" "N/R" "male"
# After the recoding
unique(crabs04$sex)
[1] NA "female" "male"
General recoding of values
As it turns out, the differences between the data sets generated at Ria Formosa and at Ria de Alvor are not exhausted yet… It seems that the variables species
and stage
are also not agreeing between the two Rias. Let us use dplyr::count()
to see that:
::count(crabs04, region, species, stage) dplyr
# A tibble: 20 × 4
region species stage n
<chr> <chr> <chr> <int>
1 Ria de Alvor Afruca farensis a 464
2 Ria de Alvor Afruca farensis j 130
3 Ria de Alvor Afruca farensis p 130
4 Ria de Alvor Afruca farensis s 346
5 Ria de Alvor Uca olhanensis a 641
6 Ria de Alvor Uca olhanensis j 228
7 Ria de Alvor Uca olhanensis p 252
8 Ria de Alvor Uca olhanensis s 438
9 Ria de Alvor Ucides alvorensis a 466
10 Ria de Alvor Ucides alvorensis j 128
11 Ria de Alvor Ucides alvorensis p 128
12 Ria de Alvor Ucides alvorensis s 344
13 Ria Formosa A. farensis adult 624
14 Ria Formosa A. farensis juvenile 178
15 Ria Formosa A. farensis pre_puberty 178
16 Ria Formosa A. farensis sub_adult 470
17 Ria Formosa U. olhanensis adult 881
18 Ria Formosa U. olhanensis juvenile 330
19 Ria Formosa U. olhanensis pre_puberty 354
20 Ria Formosa U. olhanensis sub_adult 616
The team at Ria de Alvor abbreviated the names of the developmental stages, and the team at Ria Formosa seems to have abbreviated the species names. This problematic, we need to recode them.
Exercise 2.7
Let us convert everything to full names:
Old value | New value |
---|---|
j |
juvenile |
p |
pre_puberty |
s |
sub_adult |
a |
adult |
A. farensis |
Afruca farensis |
U. olhanensis |
Uca olhanensis |
A combination of
dplyr::mutate()
anddplyr::recode()
will be helpful.
Solution to Exercise 2.7
<-
crabs05 %>%
crabs04 ::mutate(
dplyrspecies = dplyr::recode(
species,`A. farensis` = "Afruca farensis",
`U. olhanensis` = "Uca olhanensis"),
stage = dplyr::recode(
stage,`j` = "juvenile",
`p` = "pre_puberty",
`s` = "sub_adult",
`a` = "adult"
),quadrat = substr(quadrat, 2, 2)
)
The new data set should now show:
::count(crabs05, region, species, stage) dplyr
# A tibble: 20 × 4
region species stage n
<chr> <chr> <chr> <int>
1 Ria de Alvor Afruca farensis adult 464
2 Ria de Alvor Afruca farensis juvenile 130
3 Ria de Alvor Afruca farensis pre_puberty 130
4 Ria de Alvor Afruca farensis sub_adult 346
5 Ria de Alvor Uca olhanensis adult 641
6 Ria de Alvor Uca olhanensis juvenile 228
7 Ria de Alvor Uca olhanensis pre_puberty 252
8 Ria de Alvor Uca olhanensis sub_adult 438
9 Ria de Alvor Ucides alvorensis adult 466
10 Ria de Alvor Ucides alvorensis juvenile 128
11 Ria de Alvor Ucides alvorensis pre_puberty 128
12 Ria de Alvor Ucides alvorensis sub_adult 344
13 Ria Formosa Afruca farensis adult 624
14 Ria Formosa Afruca farensis juvenile 178
15 Ria Formosa Afruca farensis pre_puberty 178
16 Ria Formosa Afruca farensis sub_adult 470
17 Ria Formosa Uca olhanensis adult 881
18 Ria Formosa Uca olhanensis juvenile 330
19 Ria Formosa Uca olhanensis pre_puberty 354
20 Ria Formosa Uca olhanensis sub_adult 616
Fix scale of quantitative variables
The Ria de Alvor team recorded the morphological measurements in centimetres, whereas the Ria Formosa team recorded them in millimetres. We need to fix this.
Exercise 2.8
Multiply the values of the morphological variables of Ria de Alvor by 10 to transform from centimetres to millimetres, and hence harmonise them with Ria Formosa’s values.
Solution to Exercise 2.8
<-
crabs06 %>%
crabs05 ::mutate(
dplyrcarapace_width = if_else(region == "Ria de Alvor", carapace_width * 10, carapace_width),
carapace_length = if_else(region == "Ria de Alvor", carapace_length * 10, carapace_length),
front_width = if_else(region == "Ria de Alvor", front_width * 10, front_width),
left_claw_length = if_else(region == "Ria de Alvor", left_claw_length * 10, left_claw_length),
right_claw_length = if_else(region == "Ria de Alvor", right_claw_length * 10, right_claw_length)
)
Compare the summary statistics before and after the transformation. Here is an example with carapace_width
:
# Before multiplication by 10
summary(crabs05$carapace_width)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 4.20 10.80 22.81 36.00 142.00
# After multiplication by 10
summary(crabs06$carapace_width)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.00 26.00 40.00 43.81 56.00 142.00
Ensure appropriate variable types
Currently our data set crabs06
has a range of variables whose type is character
, i.e. the individual values are strings. Examples are: region
, species
, stage
, etc.. These are essentially nominal (or categorical), or ordered categorical (or ordinal) variables. In R these variable types are called factor
s.
Although it is possible to leave these variables as the character
type, converting to factors has a few advantages:
- Memory efficiency:
factor
s take up less space thancharacter
variables, because they are effectively stored as integers. - Factors are useful when you know the set of possible values but they’re not all present in a given data set. In contrast to a character vector, when you tabulate a factor you’ll get counts of all categories, even unobserved ones.
Exercise 2.9
Use a combination of dplyr::mutate()
and factor()
to convert the following variables to factors:
region
season
species
sex
stage
Also, and because we do not have an integer variable in our crabs06
data set, make the quadrat an integer (for illustrative purposes). Tip: use as.integer()
for that.
Solution to Exercise 2.9
<-
crabs07 %>%
crabs06 ::mutate(
dplyrregion = factor(region),
season = factor(season),
quadrat = as.integer(quadrat),
species = factor(species),
sex = factor(sex, levels = c("female", "male")),
stage = factor(stage, levels = c("juvenile", "pre_puberty", "sub_adult", "adult"), ordered = TRUE)
)
Column names are values, not variable names
Remember the quadrats01
data set about the quadrats? It’s time to update it too. Look at the column names, they’re values! Also, the first column has an automatic name ...1
because originally in the Excel file it was empty.
quadrats01
# A tibble: 1 × 5
...1 `Ria Formosa-q1` `Ria Formosa-q2` `Ria de Alvor-q1` Ria de Alvor-q…¹
<chr> <dbl> <dbl> <dbl> <dbl>
1 Area (m2) 250 360 260 220
# … with abbreviated variable name ¹`Ria de Alvor-q2`
To fix this data set we will use two fancy functions that we haven’t used yet: tidyr::pivot_longer()
and tidyr::separate()
.
Exercise 2.10
Convert the quadrats01
data set into a tidy data set by moving the values in the column names to table elements, and create new columns to accomodate them.
Solution to Exercise 2.10
<-
quadrats02 %>%
quadrats01 ::select(-1) %>%
dplyr::pivot_longer(cols = everything(), names_to = "region_quadrat", values_to = "area") %>%
tidyr::separate(col = "region_quadrat", into = c("region", "quadrat"), sep = "-") tidyr
Exercise 2.11
And just like we did before, convert the region
variable to factor and quadrat
to integer and save the result to quadrats03
.
Solution to Exercise 2.11
<-
quadrats03 %>%
quadrats02 ::mutate(
dplyrregion = factor(region),
quadrat = as.integer(substr(quadrat, 2, 2))
)
Multiple observational unit types in one table
Datasets often involve values collected at multiple levels, on different types of observational units. During tidying, each type of observational unit should be stored in its own table. This is closely related to the idea of database normalisation, where each fact is expressed in only one place. It’s important because otherwise inconsistencies can arise. From
{tidyr}
vignette on multiple types
There are two variables (columns) in our crabs07
data set that characterise the quadrats, not the crabs individually. These are latitude
and longitude
. Yet, they’ve been recorded in the crabs data sets but belong in the quadrats data set.
Exercise 2.12
To fix this, we will need to get the unique latitude and longitude values for each quadrat. There is a nifty function that allow us to that: dplyr::distinct()
. And combined with dplyr::left_join()
will get us there…
Solution to Exercise 2.12
<- dplyr::distinct(crabs07, region, quadrat, latitude, longitude)
quadrats_coord <- dplyr::left_join(quadrats03, quadrats_coord)
quadrats04
# And now we can drop latitude and longitude from the `crabs07` data set.
# Note the minus sign behind the vector of column names.
<- select(crabs07, -c("latitude", "longitude")) crabs08
Here’s how it should like:
quadrats04
# A tibble: 4 × 5
region quadrat area latitude longitude
<fct> <int> <dbl> <dbl> <dbl>
1 Ria Formosa 1 250 37.0 -8.02
2 Ria Formosa 2 360 37.0 -7.99
3 Ria de Alvor 1 260 37.1 -8.60
4 Ria de Alvor 2 220 37.1 -8.62
crabs08
# A tibble: 7,326 × 12
region season quadrat species sex stage is_gr…¹ carap…² carap…³ front…⁴
<fct> <fct> <int> <fct> <fct> <ord> <lgl> <dbl> <dbl> <dbl>
1 Ria Formo… summer 1 Afruca… <NA> juve… FALSE 27 21 15
2 Ria Formo… summer 1 Afruca… <NA> juve… FALSE 27 21 16
3 Ria Formo… summer 1 Afruca… fema… juve… TRUE 19 15 10
4 Ria Formo… summer 1 Afruca… <NA> juve… FALSE 26 21 19
5 Ria Formo… summer 1 Afruca… fema… juve… FALSE 24 19 14
6 Ria Formo… summer 1 Afruca… fema… juve… FALSE 22 17 15
7 Ria Formo… summer 1 Afruca… fema… juve… FALSE 25 19 15
8 Ria Formo… summer 1 Afruca… fema… juve… FALSE 18 14 8
9 Ria Formo… summer 1 Afruca… fema… juve… TRUE 24 19 11
10 Ria Formo… summer 1 Afruca… fema… juve… FALSE 24 19 19
# … with 7,316 more rows, 2 more variables: left_claw_length <dbl>,
# right_claw_length <dbl>, and abbreviated variable names ¹is_gravid,
# ²carapace_width, ³carapace_length, ⁴front_width
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Wrapping up
Finally we will give simple names to our two data sets and save them to disk to mark this milestone of Data Tidying. These exported files will be the starting point of tomorrow’s workflow about Data Visualisation with {ggplot2}
.
Exercise 2.13
- Export the
quadrats
data frame to CSV (data-tidy/quadrats.csv
) - Export the
crabs
data frame to CSV (data-tidy/crabs.csv
)
Although CSV files are a good way of storing our tidy data sets and of sharing it with collaborators, they do not store information about the variables (columns) types.
So please export also to RDS format which is a special format that only R understands but that keeps the data exactly as we have it now, thus making it easier to read in this files and continue with our workflow about Data Visualisation tomorrow.
- Export the
quadrats
data frame to RDS (data-tidy/quadrats.rds
) - Export the
crabs
data frame to RDS (data-tidy/crabs.rds
)
Solution to Exercise 2.13
# Assign to new variables with simpler names
<- quadrats04
quadrats <- crabs08
crabs
# Export to disk
<- here::here("data-tidy")
data_tidy_path ::write_csv(quadrats, file = file.path(data_tidy_path, "quadrats.csv"))
readr::write_csv(crabs, file = file.path(data_tidy_path, "crabs.csv"))
readr
saveRDS(quadrats, file = file.path(data_tidy_path, "quadrats.rds"))
saveRDS(crabs, file = file.path(data_tidy_path, "crabs.rds"))
Data tidying script
Here is the entire script containing all code from the exercises present in section Raw Data and Data Tidying.
#
# Raw Data
#
library(tidyverse)
library(here)
library(tools)
library(readxl)
# Define the path to the raw data
<- here("data-raw")
data_raw_path
# List the raw data files
list.files(data_raw_path)
# Reading the four CSV files (Ria Formosa)
<- readr::read_csv(file.path(data_raw_path, "rf_s_q1.csv"))
rf_s_q1 <- readr::read_csv(file.path(data_raw_path, "rf_s_q2.csv"))
rf_s_q2 <- readr::read_csv(file.path(data_raw_path, "rf_w_q1.csv"))
rf_w_q1 <- readr::read_csv(file.path(data_raw_path, "rf_w_q2.csv"))
rf_w_q2
# Reading now the four sheets inside of "Ria de Alvor.xlsx"
<- file.path(data_raw_path, "Ria de Alvor.xlsx")
ra_path <- readxl::read_excel(ra_path, sheet = "summer-q1")
ra_s_q1 <- readxl::read_excel(ra_path, sheet = "summer-q2")
ra_s_q2 <- readxl::read_excel(ra_path, sheet = "winter-q1")
ra_w_q1 <- readxl::read_excel(ra_path, sheet = "winter-q2")
ra_w_q2
# Finally, reading the details about the quadrats
<- readxl::read_excel(file.path(data_raw_path, "quadrats.xlsx"))
quadrats01
# Insights gained: to be discussed with trainers.
#
# Data Tidying
#
# EXERCISE 2.1
# Ria Formosa data sets
<- dplyr::mutate(rf_s_q1, region = "Ria Formosa", season = "summer", quadrat = "q1")
rf_s_q1_01 <- dplyr::mutate(rf_s_q2, region = "Ria Formosa", season = "summer", quadrat = "q2")
rf_s_q2_01 <- dplyr::mutate(rf_w_q1, region = "Ria Formosa", season = "winter", quadrat = "q1")
rf_w_q1_01 <- dplyr::mutate(rf_w_q2, region = "Ria Formosa", season = "winter", quadrat = "q2")
rf_w_q2_01
# Ria de Alvor data sets
<- dplyr::mutate(ra_s_q1, region = "Ria de Alvor", season = "summer", quadrat = "q1")
ra_s_q1_01 <- dplyr::mutate(ra_s_q2, region = "Ria de Alvor", season = "summer", quadrat = "q2")
ra_s_q2_01 <- dplyr::mutate(ra_w_q1, region = "Ria de Alvor", season = "winter", quadrat = "q1")
ra_w_q1_01 <- dplyr::mutate(ra_w_q2, region = "Ria de Alvor", season = "winter", quadrat = "q2")
ra_w_q2_01
# EXERCISE 2.2
# What column names are there in Ria de Alvor and not in Ria Formosa?
setdiff(colnames(ra_s_q1_01), colnames(rf_s_q1_01))
# And now the reverse question.
setdiff(colnames(rf_s_q1_01), colnames(ra_s_q1_01))
# So it seems Ria de Alvor data sets have named the `species` column as
# `species_name`, and `is_gravid?` contains a question mark.
# Let us fix by mapping:
# - `species_name` to `species`
# - `is_gravid?` to `is_gravid`
# And also, do not forget to map the abbreviated column names to their full
# names.
# Note how the mapping of the names is indicated in `rename()`:
# New names go in the left hand side of the equal sign and old names in
# the right hand side.
#
# Ria Formosa data sets
<- dplyr::rename(rf_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_s_q1_02 <- dplyr::rename(rf_s_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_s_q2_02 <- dplyr::rename(rf_w_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_w_q1_02 <- dplyr::rename(rf_w_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw)
rf_w_q2_02
# Ria de Alvor data sets
<- dplyr::rename(ra_s_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_s_q1_02 <- dplyr::rename(ra_s_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_s_q2_02 <- dplyr::rename(ra_w_q1_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_w_q1_02 <- dplyr::rename(ra_w_q2_01, carapace_width = cw, carapace_length = cl, left_claw_length = lcl, right_claw_length = rcl, front_width = fw, species = species_name, is_gravid = `is_gravid?`)
ra_w_q2_02
# EXERCISE 2.3
<-
crabs01 ::bind_rows(
dplyr
rf_s_q1_02,
rf_s_q2_02,
rf_w_q1_02,
rf_w_q2_02,
ra_s_q1_02,
ra_s_q2_02,
ra_w_q1_02,
ra_w_q2_02
)
# EXERCISE 2.4
<-
crabs02 %>%
crabs01 ::select(-"id") %>%
dplyr::relocate(
dplyr
region,
season,
quadrat,
latitude,
longitude,
species,
sex,
stage,
is_gravid,
carapace_width,
carapace_length,
front_width,
left_claw_length,
right_claw_length
)
# EXERCISE 2.5
<- tidyr::fill(data = crabs02, latitude, longitude, .direction = "down")
crabs03
# EXERCISE 2.6
# These `sex_na_aliases` were inferred after running `unique(crabs03$sex)`
<- c("?", "male or female", "-", " ", "N/R")
sex_na_aliases <-
crabs04 %>%
crabs03 mutate(sex = if_else(sex %in% sex_na_aliases, NA_character_, sex))
# EXERCISE 2.7
<-
crabs05 %>%
crabs04 ::mutate(
dplyrspecies = dplyr::recode(
species,`A. farensis` = "Afruca farensis",
`U. olhanensis` = "Uca olhanensis"),
stage = dplyr::recode(
stage,`j` = "juvenile",
`p` = "pre_puberty",
`s` = "sub_adult",
`a` = "adult"
),quadrat = substr(quadrat, 2, 2)
)
# EXERCISE 2.8
<-
crabs06 %>%
crabs05 ::mutate(
dplyrcarapace_width = if_else(region == "Ria de Alvor", carapace_width * 10, carapace_width),
carapace_length = if_else(region == "Ria de Alvor", carapace_length * 10, carapace_length),
front_width = if_else(region == "Ria de Alvor", front_width * 10, front_width),
left_claw_length = if_else(region == "Ria de Alvor", left_claw_length * 10, left_claw_length),
right_claw_length = if_else(region == "Ria de Alvor", right_claw_length * 10, right_claw_length)
)
# EXERCISE 2.9
<-
crabs07 %>%
crabs06 ::mutate(
dplyrregion = factor(region),
season = factor(season),
quadrat = as.integer(quadrat),
species = factor(species),
sex = factor(sex, levels = c("female", "male")),
stage = factor(stage, levels = c("juvenile", "pre_puberty", "sub_adult", "adult"), ordered = TRUE)
)
# EXERCISE 2.10
<-
quadrats02 %>%
quadrats01 ::select(-1) %>%
dplyr::pivot_longer(cols = everything(), names_to = "region_quadrat", values_to = "area") %>%
tidyr::separate(col = "region_quadrat", into = c("region", "quadrat"), sep = "-")
tidyr
# EXERCISE 2.11
<-
quadrats03 %>%
quadrats02 ::mutate(
dplyrregion = factor(region),
quadrat = as.integer(substr(quadrat, 2, 2))
)
# EXERCISE 2.12
<- dplyr::distinct(crabs07, region, quadrat, latitude, longitude)
quadrats_coord <- dplyr::left_join(quadrats03, quadrats_coord)
quadrats04
# And now we can drop latitude and longitude from the `crabs07` data set.
# Note the minus sign behind the vector of column names.
<- select(crabs07, -c("latitude", "longitude"))
crabs08
# EXERCISE 2.13
# Assign to new variables with simpler names
<- quadrats04
quadrats <- crabs08
crabs
# Export to disk
<- here::here("data-tidy")
data_tidy_path ::write_csv(quadrats, file = file.path(data_tidy_path, "quadrats.csv"))
readr::write_csv(crabs, file = file.path(data_tidy_path, "crabs.csv"))
readr
saveRDS(quadrats, file = file.path(data_tidy_path, "quadrats.rds"))
saveRDS(crabs, file = file.path(data_tidy_path, "crabs.rds"))