mutate() function to add variables to a data frame
group_by() and summarise() functions to generate summary statistics
case_when() function to create conditional factor labels
This week we’re going to use two “untidy” data sets provided by Joseph Casillas, with the goal of wrangling the data in order to create some appropriate plots. Let’s first load the data, which will introduce two data tibbles into the workspace: spanish_vowels and vot:
load("/shared/groups/jrole001/pals0047/data/untidy_data.Rda")
Let’s also load the whole tidyverse, because we’ll be using three packages from it: dplyr, stringr, and ggplot2:
library(tidyverse)
The first untidy data set we’ll be working with includes simulated F1 and F2 measurements from male and female Spanish speakers. Let’s say we want to create a formant space, with observations separated by vowel and male vs. female.
Let’s take a look at what we have:
head(spanish_vowels)
## # A tibble: 6 × 4
## label rep f1 f2
## <chr> <int> <dbl> <dbl>
## 1 p01-male-a 1 615. 1231.
## 2 p01-male-a 2 645. 1282.
## 3 p01-male-a 3 608. 1248.
## 4 p01-male-e 1 477. 1612.
## 5 p01-male-e 2 457. 1839.
## 6 p01-male-e 3 445. 1849.
Here is a problem: there are no column variables for vowel and speaker sex! However, these variables seem to be embedded in the character string that is stored in the label column. So how do we retrieve these variables?
We can take advantage of the fact that the variables of interest are each separated by the character - in the string, which means that we can split the string at this character and extract the sub-strings that the - character delimits. To do this, we can use the str_split() function from the stringr package.
Let’s take a look at an example by creating a vector of strings that are concatenated with the underscore _:
mystrings <- paste0(letters,"_",1:26,"_",rev(LETTERS))
So now we have 26 strings, each of which is composed of lowercase letters (in ascending order), then an underscore, then integers (in ascending order), then an underscore, then uppercase letters (in descending order). What if we want to separate the lowercase letters, the numbers, and the uppercase letters into observations of three separate variables?
We can split the strings at the _ character using the str_split() function:
str_split(mystrings, "_")
## [[1]]
## [1] "a" "1" "Z"
##
## [[2]]
## [1] "b" "2" "Y"
##
## [[3]]
## [1] "c" "3" "X"
##
## [[4]]
## [1] "d" "4" "W"
##
## [[5]]
## [1] "e" "5" "V"
##
## [[6]]
## [1] "f" "6" "U"
##
## [[7]]
## [1] "g" "7" "T"
##
## [[8]]
## [1] "h" "8" "S"
##
## [[9]]
## [1] "i" "9" "R"
##
## [[10]]
## [1] "j" "10" "Q"
##
## [[11]]
## [1] "k" "11" "P"
##
## [[12]]
## [1] "l" "12" "O"
##
## [[13]]
## [1] "m" "13" "N"
##
## [[14]]
## [1] "n" "14" "M"
##
## [[15]]
## [1] "o" "15" "L"
##
## [[16]]
## [1] "p" "16" "K"
##
## [[17]]
## [1] "q" "17" "J"
##
## [[18]]
## [1] "r" "18" "I"
##
## [[19]]
## [1] "s" "19" "H"
##
## [[20]]
## [1] "t" "20" "G"
##
## [[21]]
## [1] "u" "21" "F"
##
## [[22]]
## [1] "v" "22" "E"
##
## [[23]]
## [1] "w" "23" "D"
##
## [[24]]
## [1] "x" "24" "C"
##
## [[25]]
## [1] "y" "25" "B"
##
## [[26]]
## [1] "z" "26" "A"
This will create a list by default, but we can override this behaviour and make a matrix instead by setting the simplify argument to TRUE:
str_split(mystrings, "_", simplify = TRUE)
## [,1] [,2] [,3]
## [1,] "a" "1" "Z"
## [2,] "b" "2" "Y"
## [3,] "c" "3" "X"
## [4,] "d" "4" "W"
## [5,] "e" "5" "V"
## [6,] "f" "6" "U"
## [7,] "g" "7" "T"
## [8,] "h" "8" "S"
## [9,] "i" "9" "R"
## [10,] "j" "10" "Q"
## [11,] "k" "11" "P"
## [12,] "l" "12" "O"
## [13,] "m" "13" "N"
## [14,] "n" "14" "M"
## [15,] "o" "15" "L"
## [16,] "p" "16" "K"
## [17,] "q" "17" "J"
## [18,] "r" "18" "I"
## [19,] "s" "19" "H"
## [20,] "t" "20" "G"
## [21,] "u" "21" "F"
## [22,] "v" "22" "E"
## [23,] "w" "23" "D"
## [24,] "x" "24" "C"
## [25,] "y" "25" "B"
## [26,] "z" "26" "A"
If we assign the resulting matrix to a variable we can access the columns as variables!
split_strings <- str_split(mystrings, "_", simplify = TRUE)
head(split_strings[,1])
## [1] "a" "b" "c" "d" "e" "f"
head(split_strings[,2])
## [1] "1" "2" "3" "4" "5" "6"
head(split_strings[,3])
## [1] "Z" "Y" "X" "W" "V" "U"
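As a sketch of where this is heading, the matrix columns can be collected into a table with one named column per variable. The object name toy and the column names lower, number, and upper are made up for illustration, and the block rebuilds mystrings so that it runs on its own.

```r
library(stringr)
library(tibble)

# Rebuild the example strings so the block is self-contained
mystrings <- paste0(letters, "_", 1:26, "_", rev(LETTERS))
split_strings <- str_split(mystrings, "_", simplify = TRUE)

# Collect the three matrix columns into named tibble columns
# (toy, lower, number, and upper are illustrative names only)
toy <- tibble(
  lower  = split_strings[, 1],
  number = as.integer(split_strings[, 2]),  # str_split() returns characters
  upper  = split_strings[, 3]
)

head(toy, 3)
```

Note that every column of the split matrix is a character vector, so numeric pieces need an explicit conversion such as as.integer() if they are to be treated as numbers.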
Your first goal in this exercise is to apply str_split() to the label column to extract three columns. Assign the result to a variable called sp_labels; it should look like this:
head(sp_labels)
## [,1] [,2] [,3]
## [1,] "p01" "male" "a"
## [2,] "p01" "male" "a"
## [3,] "p01" "male" "a"
## [4,] "p01" "male" "e"
## [5,] "p01" "male" "e"
## [6,] "p01" "male" "e"
Once you have this matrix, you can use it to add three new column variables (speaker, sex, vowel) to the spanish_vowels tibble using the mutate() function. Here is a hint for how to add the vowel column:
spanish_vowels %>%
  mutate(vowel = sp_labels[,3])
## # A tibble: 750 × 5
## label rep f1 f2 vowel
## <chr> <int> <dbl> <dbl> <chr>
## 1 p01-male-a 1 615. 1231. a
## 2 p01-male-a 2 645. 1282. a
## 3 p01-male-a 3 608. 1248. a
## 4 p01-male-e 1 477. 1612. e
## 5 p01-male-e 2 457. 1839. e
## 6 p01-male-e 3 445. 1849. e
## 7 p01-male-i 1 309. 2153. i
## 8 p01-male-i 2 259. 2176. i
## 9 p01-male-i 3 337. 2015. i
## 10 p01-male-o 1 478. 865. o
## # … with 740 more rows
nb #1: you only need to use the mutate function once… all three columns can be added in the same instance of the function!
nb #2: remember that the new columns are not automatically saved to the original data… you need to do this yourself!
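To illustrate both notes on a toy table (not the exercise data), the sketch below adds two columns in a single mutate() call and assigns the result back to the same name. The table toy and its column names are made up for this example.

```r
library(dplyr)
library(tibble)

# A small made-up table; none of these names come from the exercise data
toy <- tibble(id = c("x1", "x2", "x3"), score = c(10, 20, 30))

# Without assignment, the new column is only printed, not stored
toy %>%
  mutate(double = score * 2)
ncol(toy)  # toy itself is unchanged

# One mutate() call can add several columns; assign the result to keep them
toy <- toy %>%
  mutate(
    double = score * 2,
    half   = score / 2
  )
ncol(toy)  # toy now includes the two new columns
```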
The final table should look like this:
spanish_vowels
## # A tibble: 750 × 7
## label rep f1 f2 speaker sex vowel
## <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
## 1 p01-male-a 1 615. 1231. p01 male a
## 2 p01-male-a 2 645. 1282. p01 male a
## 3 p01-male-a 3 608. 1248. p01 male a
## 4 p01-male-e 1 477. 1612. p01 male e
## 5 p01-male-e 2 457. 1839. p01 male e
## 6 p01-male-e 3 445. 1849. p01 male e
## 7 p01-male-i 1 309. 2153. p01 male i
## 8 p01-male-i 2 259. 2176. p01 male i
## 9 p01-male-i 3 337. 2015. p01 male i
## 10 p01-male-o 1 478. 865. p01 male o
## # … with 740 more rows
Your next goal is to use ggplot() to create some nice-looking plots that show differences in formant values across both vowel category and speaker sex. To help separate the different vowel categories visually, you will first create a table of average formant values using the group_by() and summarise() functions in the same way as shown in this week’s lecture. Assign the results to a variable called means, which should look like the following when you have finished:
means
## # A tibble: 5 × 3
## vowel f1 f2
## <chr> <dbl> <dbl>
## 1 a 689. 1503.
## 2 e 510. 1963.
## 3 i 336. 2284.
## 4 o 510. 1180.
## 5 u 370. 1141.
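For reference, the general group_by() and summarise() pattern looks like the sketch below, shown on R's built-in iris data rather than on spanish_vowels. The result name iris_means and the choice of columns are assumptions made for illustration; this is not necessarily the exact code from the lecture.

```r
library(dplyr)

# Average two measurements within each level of a grouping variable
iris_means <- iris %>%
  group_by(Species) %>%
  summarise(
    Sepal.Length = mean(Sepal.Length),
    Sepal.Width  = mean(Sepal.Width)
  )

# One row per group, one column per summary
iris_means
```

Keeping the summary columns under the same names as the original measurements makes it easy to reuse the same aesthetic mappings when the table of means is added to a plot of the raw data.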
You can now add the values from this table of means onto a scatter plot, with the vowel category separated by color, like so:
spanish_vowels %>%
  ggplot(aes(x = f2, y = f1)) +
  geom_point(aes(col = vowel)) +
  geom_label(data = means, aes(x = f2, y = f1, label = vowel)) +
  scale_x_reverse() +
  scale_y_reverse()
You can use the color aesthetic to separate by speaker sex instead of vowel category. Try to recreate the following figure:
Your final goal for this data set is to display two different factor aesthetics by separating vowel category by color and speaker sex by shape. Try to recreate the following figure:
time to think: what is the likely cause of the formant differences between male and female speakers displayed here?
For the second part of the tutorial, we’re going to switch to the vot data set. The goal is to compare the voice onset time (VOT) values of voiced vs. voiceless consonants, in English vs. Spanish.
Let’s take a look at what we have to work with:
vot
## # A tibble: 720 × 5
## participant language item repetition vot
## <chr> <chr> <chr> <int> <dbl>
## 1 monoSp00 spanish da 1 -75.8
## 2 monoSp00 spanish da 2 -81.1
## 3 monoSp00 spanish da 3 -56.5
## 4 monoSp00 spanish de 1 -55.6
## 5 monoSp00 spanish de 2 -62.9
## 6 monoSp00 spanish de 3 -43.7
## 7 monoSp00 spanish di 1 -54.0
## 8 monoSp00 spanish di 2 -72.4
## 9 monoSp00 spanish di 3 -58.5
## 10 monoSp00 spanish te 1 18.6
## # … with 710 more rows
Well, we have a column for language, which is good! But what about consonant voicing? How do we get that information? We can extract it from the strings in the item column!
table(vot$item)
##
## da de di dig dog dug tag te ti tog tu tug
## 60 60 60 60 60 60 60 60 60 60 60 60
We can see that all of the items begin with either /d/ (voiced) or /t/ (voiceless), so we simply need to extract the first character of each string. You can do this using a different function from stringr called str_sub().
Let’s practice using another base R constant, the English names for the months of the year:
month.name
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
If we want to extract, say, the 2nd through the 5th characters, we can specify the start and end indices of the string character extraction:
str_sub(month.name, 2, 5)
## [1] "anua" "ebru" "arch" "pril" "ay" "une" "uly" "ugus" "epte" "ctob"
## [11] "ovem" "ecem"
We can see that months that don’t have 5 characters stop short and simply give fewer characters (e.g. “ay” for May, “une” for June).
The minus sign can be used to count characters right-to-left instead of left-to-right. So the 2nd character through the 2nd to last character would be:
str_sub(month.name, 2, -2)
## [1] "anuar" "ebruar" "arc" "pri" "a" "un" "ul"
## [8] "ugus" "eptembe" "ctobe" "ovembe" "ecembe"
Using these principles, the last character of each month name can be extracted using:
str_sub(month.name, -1, -1)
## [1] "y" "y" "h" "l" "y" "e" "y" "t" "r" "r" "r" "r"
Likewise, the first character of each month name can be extracted using:
str_sub(month.name, 1, 1)
## [1] "J" "F" "M" "A" "M" "J" "J" "A" "S" "O" "N" "D"
Using this knowledge, you should be able to extract the first character of each string in the item column, which will look like this:
## [1] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [19] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [37] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [55] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [73] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [91] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [109] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [127] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [145] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [163] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [181] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [199] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [217] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [235] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [253] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [271] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [289] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [307] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [325] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [343] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [361] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [379] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [397] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [415] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [433] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [451] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [469] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [487] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [505] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [523] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [541] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [559] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [577] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [595] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [613] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [631] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [649] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [667] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [685] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [703] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
You’ve already learned about conditional cases in base R, but there is also a way to implement conditional cases using the case_when() function from dplyr, which is a more readable way of writing multiple if else statements. These statements are implemented using a two-sided formula, where the output value on the right is returned when the condition on the left is TRUE:
condition ~ output value
Let’s practice by using an example from week 5:
mynums <- -2:2
for (x in mynums) {
  if (x < 0) {
    print("negative")
  } else if (x > 0) {
    print("positive")
  } else {
    print("neither")
  }
}
## [1] "negative"
## [1] "negative"
## [1] "neither"
## [1] "positive"
## [1] "positive"
This same set of conditional statements can also be written in the following way, where the last line TRUE ~ "neither" is the fallback case, used when none of the other conditions are met:
case_when(
  mynums < 0 ~ "negative",
  mynums > 0 ~ "positive",
  TRUE ~ "neither"
)
## [1] "negative" "negative" "neither" "positive" "positive"
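Two details of case_when() are worth keeping in mind: conditions are checked in order and the first match wins, and any element that matches no condition becomes NA. The sketch below shows both; the object names in_order and no_fallback are made up for illustration.

```r
library(dplyr)

mynums <- -2:2

# Order matters: 2 satisfies both conditions, but only the first match is used
in_order <- case_when(
  mynums > 1 ~ "large",
  mynums > 0 ~ "positive",
  TRUE       ~ "other"
)
in_order

# Without a fallback line, elements that match no condition are NA
no_fallback <- case_when(
  mynums < 0 ~ "negative",
  mynums > 0 ~ "positive"
)
no_fallback
```

A fallback line is therefore optional, but leaving it out is only safe when the listed conditions cover every value that can occur.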
If the variable being queried in the conditional statements is a column in a data table, then the column name can be referenced in the formula instead, and the result can be added to the table using the mutate() function:
mydat <- data.frame(
  val = mynums,
  label = letters[1:5]
)
mydat %>%
  mutate(
    result = case_when(
      val < 0 ~ "negative",
      val > 0 ~ "positive",
      TRUE ~ "neither"
    )
  )
## val label result
## 1 -2 a negative
## 2 -1 b negative
## 3 0 c neither
## 4 1 d positive
## 5 2 e positive
Using all of this new knowledge, add two new columns named onset and voicing to the tibble vot by filling in the missing information below:
vot <- vot %>%
  mutate(
    onset = ???,
    voicing = case_when(
      onset == ??? ~ ???,
      onset == ??? ~ ???
    )
  )
The result should look like the following:
vot
## # A tibble: 720 × 7
## participant language item repetition vot onset voicing
## <chr> <chr> <chr> <int> <dbl> <chr> <chr>
## 1 monoSp00 spanish da 1 -75.8 d voiced
## 2 monoSp00 spanish da 2 -81.1 d voiced
## 3 monoSp00 spanish da 3 -56.5 d voiced
## 4 monoSp00 spanish de 1 -55.6 d voiced
## 5 monoSp00 spanish de 2 -62.9 d voiced
## 6 monoSp00 spanish de 3 -43.7 d voiced
## 7 monoSp00 spanish di 1 -54.0 d voiced
## 8 monoSp00 spanish di 2 -72.4 d voiced
## 9 monoSp00 spanish di 3 -58.5 d voiced
## 10 monoSp00 spanish te 1 18.6 t voiceless
## # … with 710 more rows
Your final goal is to create plots to compare the VOT distributions for the interaction between two variables: voicing (voiced vs. voiceless) and language (English vs. Spanish).
There are three variables that we want to represent in these plots: vot, voicing, and language. Using this knowledge, try to recreate the following plot by filling in the missing information:
vot %>%
  ggplot(aes(x=???, y=???, fill=???)) +
  geom_boxplot()
A more informative view of the distribution is a “violin” plot, which is a mirrored probability density. You can create one by using geom_violin() instead of geom_boxplot():
The best of both worlds, however, would be to display the probability density along with the 25%, 50%, and 75% quantiles shown in the box plots by passing the vector c(0.25,0.5,0.75) to the draw_quantiles argument:
time to think: how does VOT characterise the consonant voicing contrast in these two languages? In what cases might this cause confusion for an English speaker learning Spanish, or a Spanish speaker learning English?