Objectives

  • Manipulate strings by splitting by characters
  • Manipulate strings by extracting a range of character values
  • Use the mutate() function to add variables to a data frame
  • Use the group_by() and summarise() functions to generate summary statistics
  • Use the case_when() function to create conditional factor labels

Preliminaries

This week we’re going to use two “untidy” data sets provided by Joseph Casillas, with the goal of wrangling the data in order to create some appropriate plots. Let’s first load the data, which will introduce two data tibbles into the workspace: spanish_vowels and vot:

load("/shared/groups/jrole001/pals0047/data/untidy_data.Rda")

Let’s also load the whole tidyverse, because we’ll be using three packages from it: dplyr, stringr, and ggplot2:

library(tidyverse)

Spanish vowels

Exercise #1: splitting strings

The first untidy data set we’ll be working with includes simulated F1 and F2 measurements from male and female Spanish speakers. Let’s say we want to create a formant space, with observations separated by vowel and male vs. female.

Let’s take a look at what we have:

head(spanish_vowels)
## # A tibble: 6 × 4
##   label        rep    f1    f2
##   <chr>      <int> <dbl> <dbl>
## 1 p01-male-a     1  615. 1231.
## 2 p01-male-a     2  645. 1282.
## 3 p01-male-a     3  608. 1248.
## 4 p01-male-e     1  477. 1612.
## 5 p01-male-e     2  457. 1839.
## 6 p01-male-e     3  445. 1849.

Here is a problem: there are no column variables for vowel and speaker sex! However… these variables seem to be embedded in the character string that is stored in the label column… so how do we retrieve these variables?

We can take advantage of the fact that the variables of interest are each separated by the character - in the string, which means that we can split the string by this character and extract the sub-strings that the - character delineates. To do this, we can use the str_split() function from the stringr package.

Let’s take a look at an example by creating a vector of strings that are concatenated with the underscore _:

mystrings <- paste0(letters,"_",1:26,"_",rev(LETTERS))

So now we have 26 strings, each of which is composed of lowercase letters (in ascending order), then an underscore, then integers (in ascending order), then an underscore, then uppercase letters (in descending order). What if we want to separate the lowercase letters, the numbers, and the uppercase letters into observations of three separate variables?

We can split the strings by the _ character by using the str_split() function:

str_split(mystrings, "_")
## [[1]]
## [1] "a" "1" "Z"
## 
## [[2]]
## [1] "b" "2" "Y"
## 
## [[3]]
## [1] "c" "3" "X"
## 
## [[4]]
## [1] "d" "4" "W"
## 
## [[5]]
## [1] "e" "5" "V"
## 
## [[6]]
## [1] "f" "6" "U"
## 
## [[7]]
## [1] "g" "7" "T"
## 
## [[8]]
## [1] "h" "8" "S"
## 
## [[9]]
## [1] "i" "9" "R"
## 
## [[10]]
## [1] "j"  "10" "Q" 
## 
## [[11]]
## [1] "k"  "11" "P" 
## 
## [[12]]
## [1] "l"  "12" "O" 
## 
## [[13]]
## [1] "m"  "13" "N" 
## 
## [[14]]
## [1] "n"  "14" "M" 
## 
## [[15]]
## [1] "o"  "15" "L" 
## 
## [[16]]
## [1] "p"  "16" "K" 
## 
## [[17]]
## [1] "q"  "17" "J" 
## 
## [[18]]
## [1] "r"  "18" "I" 
## 
## [[19]]
## [1] "s"  "19" "H" 
## 
## [[20]]
## [1] "t"  "20" "G" 
## 
## [[21]]
## [1] "u"  "21" "F" 
## 
## [[22]]
## [1] "v"  "22" "E" 
## 
## [[23]]
## [1] "w"  "23" "D" 
## 
## [[24]]
## [1] "x"  "24" "C" 
## 
## [[25]]
## [1] "y"  "25" "B" 
## 
## [[26]]
## [1] "z"  "26" "A"

This will create a list by default, but we can override this behaviour and make a matrix instead by employing the simplify argument:

str_split(mystrings, "_", simplify = TRUE)
##       [,1] [,2] [,3]
##  [1,] "a"  "1"  "Z" 
##  [2,] "b"  "2"  "Y" 
##  [3,] "c"  "3"  "X" 
##  [4,] "d"  "4"  "W" 
##  [5,] "e"  "5"  "V" 
##  [6,] "f"  "6"  "U" 
##  [7,] "g"  "7"  "T" 
##  [8,] "h"  "8"  "S" 
##  [9,] "i"  "9"  "R" 
## [10,] "j"  "10" "Q" 
## [11,] "k"  "11" "P" 
## [12,] "l"  "12" "O" 
## [13,] "m"  "13" "N" 
## [14,] "n"  "14" "M" 
## [15,] "o"  "15" "L" 
## [16,] "p"  "16" "K" 
## [17,] "q"  "17" "J" 
## [18,] "r"  "18" "I" 
## [19,] "s"  "19" "H" 
## [20,] "t"  "20" "G" 
## [21,] "u"  "21" "F" 
## [22,] "v"  "22" "E" 
## [23,] "w"  "23" "D" 
## [24,] "x"  "24" "C" 
## [25,] "y"  "25" "B" 
## [26,] "z"  "26" "A"

If we assign the resulting matrix to a variable we can access the columns as variables!

split_strings <- str_split(mystrings, "_", simplify = TRUE)

head(split_strings[,1])
## [1] "a" "b" "c" "d" "e" "f"
head(split_strings[,2])
## [1] "1" "2" "3" "4" "5" "6"
head(split_strings[,3])
## [1] "Z" "Y" "X" "W" "V" "U"
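
One detail worth noting: every piece returned by str_split() is a character string, so the second column above holds "1", "2", and so on as text rather than as numbers. The sketch below simply rebuilds the example strings and shows one way to check this and to convert the column with base R's as.integer(). The name nums is only for illustration.

```r
library(stringr)

# Rebuild the example strings and the split matrix from above
mystrings <- paste0(letters, "_", 1:26, "_", rev(LETTERS))
split_strings <- str_split(mystrings, "_", simplify = TRUE)

# The middle column is stored as text, not as numbers
class(split_strings[, 2])

# Convert it to integers before doing arithmetic with it
nums <- as.integer(split_strings[, 2])
sum(nums)
```

The same applies to any column you extract from a split label: if it represents a number, it needs converting before you can compute with it.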

Your first goal in this exercise is to apply str_split() to the label column to extract three columns. Assign this to a variable called sp_labels; the result should look like this:

head(sp_labels)
##      [,1]  [,2]   [,3]
## [1,] "p01" "male" "a" 
## [2,] "p01" "male" "a" 
## [3,] "p01" "male" "a" 
## [4,] "p01" "male" "e" 
## [5,] "p01" "male" "e" 
## [6,] "p01" "male" "e"

Once you have this matrix, you can use it to add three new column variables (speaker, sex, vowel) to the spanish_vowels tibble using the mutate() function. Here is a hint for how to add the vowel column:

spanish_vowels %>% 
  mutate(vowel = sp_labels[,3])
## # A tibble: 750 × 5
##    label        rep    f1    f2 vowel
##    <chr>      <int> <dbl> <dbl> <chr>
##  1 p01-male-a     1  615. 1231. a    
##  2 p01-male-a     2  645. 1282. a    
##  3 p01-male-a     3  608. 1248. a    
##  4 p01-male-e     1  477. 1612. e    
##  5 p01-male-e     2  457. 1839. e    
##  6 p01-male-e     3  445. 1849. e    
##  7 p01-male-i     1  309. 2153. i    
##  8 p01-male-i     2  259. 2176. i    
##  9 p01-male-i     3  337. 2015. i    
## 10 p01-male-o     1  478.  865. o    
## # … with 740 more rows

nb #1: you only need to use the mutate function once… all three columns can be added in the same instance of the function!

nb #2: remember that the new columns are not automatically saved to the original data… you need to do this yourself!
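
To make the two notes above concrete, here is a minimal sketch on a made-up table. The tibble toy, its code column, and the new column names are all hypothetical and are not part of the course data. It shows that a mutate() call without assignment leaves the original table unchanged, and that one mutate() call can add several columns at once.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, not the course data set
toy <- tibble(code = c("x1_red_10", "x2_blue_20"))

# Split once, then reuse the columns of the resulting matrix
parts <- str_split(toy$code, "_", simplify = TRUE)

# Without assignment the result is only printed; toy itself is unchanged
toy %>% mutate(id = parts[, 1])
ncol(toy)

# One mutate() call can add several columns; assign the result to keep them
toy <- toy %>%
  mutate(
    id = parts[, 1],
    colour = parts[, 2],
    amount = as.integer(parts[, 3])
  )
ncol(toy)
```

The first ncol() call reports one column and the second reports four, because only the assigned result is stored.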

The final table should look like this:

spanish_vowels
## # A tibble: 750 × 7
##    label        rep    f1    f2 speaker sex   vowel
##    <chr>      <int> <dbl> <dbl> <chr>   <chr> <chr>
##  1 p01-male-a     1  615. 1231. p01     male  a    
##  2 p01-male-a     2  645. 1282. p01     male  a    
##  3 p01-male-a     3  608. 1248. p01     male  a    
##  4 p01-male-e     1  477. 1612. p01     male  e    
##  5 p01-male-e     2  457. 1839. p01     male  e    
##  6 p01-male-e     3  445. 1849. p01     male  e    
##  7 p01-male-i     1  309. 2153. p01     male  i    
##  8 p01-male-i     2  259. 2176. p01     male  i    
##  9 p01-male-i     3  337. 2015. p01     male  i    
## 10 p01-male-o     1  478.  865. p01     male  o    
## # … with 740 more rows

Exercise #2: plotting formants, means, and factor aesthetics

Your next goal is to use ggplot() to create some nice looking plots that show differences in formant values between both vowel categories and speaker sex. To help separate the different vowel categories visually, you will first create a table of average formant values using the group_by() and summarise() functions in the same way as shown in this week’s lecture.

Assign the results to a variable called means, which should look like the following when you have finished:

means
## # A tibble: 5 × 3
##   vowel    f1    f2
##   <chr> <dbl> <dbl>
## 1 a      689. 1503.
## 2 e      510. 1963.
## 3 i      336. 2284.
## 4 o      510. 1180.
## 5 u      370. 1141.
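
If the group_by() and summarise() pattern is not fresh in your mind, the following sketch shows it on a small made-up table. The table toy and its columns category, height, and width are hypothetical, and this is the usual dplyr idiom rather than necessarily the exact code from the lecture.

```r
library(dplyr)

# Hypothetical toy measurements, not the course data set
toy <- tibble(
  category = c("a", "a", "b", "b"),
  height = c(1, 3, 10, 20),
  width = c(2, 4, 30, 50)
)

# One row per category, with one mean per measurement column
toy_means <- toy %>%
  group_by(category) %>%
  summarise(height = mean(height), width = mean(width))

toy_means
```

Each distinct value of the grouping column becomes one row of the result, which is why the table of vowel means has five rows.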

You can now add the values from this table of means onto a scatter plot, with the vowel category separated by color, like so:

spanish_vowels %>%
  ggplot(aes(x=f2,y=f1)) + geom_point(aes(col=vowel)) +
  geom_label(data=means, aes(x=f2, y=f1, label=vowel)) +
  scale_x_reverse() + scale_y_reverse()

You can use the color aesthetic to separate by speaker sex instead of vowel category. Try to recreate the following figure:

Your final goal for this data set is to display two different factor aesthetics by separating vowel category by color and speaker sex by shape. Try to recreate the following figure:

time to think: what is the likely cause of the formant differences between male and female speakers displayed here?

Spanish and English VOT

Exercise #3: extracting strings

For the second part of the tutorial, we’re going to switch to the vot data set. The goal is to compare the voice onset time (VOT) values of voiced vs. voiceless consonants, in English vs. Spanish.

Let’s take a look at what we have to work with:

vot
## # A tibble: 720 × 5
##    participant language item  repetition   vot
##    <chr>       <chr>    <chr>      <int> <dbl>
##  1 monoSp00    spanish  da             1 -75.8
##  2 monoSp00    spanish  da             2 -81.1
##  3 monoSp00    spanish  da             3 -56.5
##  4 monoSp00    spanish  de             1 -55.6
##  5 monoSp00    spanish  de             2 -62.9
##  6 monoSp00    spanish  de             3 -43.7
##  7 monoSp00    spanish  di             1 -54.0
##  8 monoSp00    spanish  di             2 -72.4
##  9 monoSp00    spanish  di             3 -58.5
## 10 monoSp00    spanish  te             1  18.6
## # … with 710 more rows

Well, we have a column for language, which is good! But what about consonant voicing? How do we get that information? We can extract it from the strings in the item column!

table(vot$item)
## 
##  da  de  di dig dog dug tag  te  ti tog  tu tug 
##  60  60  60  60  60  60  60  60  60  60  60  60

We can see that all of the items begin with either /d/ (voiced) or /t/ (voiceless), so we simply need to extract the first character of each string. You can do this using a different function from stringr called str_sub().

Let’s practice using another base R constant, the English names for the months of the year:

month.name
##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"

If we want to extract, say, the 2nd through the 5th characters, we can specify the start and end indices of the string character extraction:

str_sub(month.name, 2, 5)
##  [1] "anua" "ebru" "arch" "pril" "ay"   "une"  "uly"  "ugus" "epte" "ctob"
## [11] "ovem" "ecem"

We can see that months that don’t have 5 characters stop short and simply give fewer characters (e.g. “ay” for May, “une” for June).

The minus sign can be used to count characters right-to-left instead of left-to-right. So the 2nd character through the 2nd to last character would be:

str_sub(month.name, 2, -2)
##  [1] "anuar"   "ebruar"  "arc"     "pri"     "a"       "un"      "ul"     
##  [8] "ugus"    "eptembe" "ctobe"   "ovembe"  "ecembe"

Using these principles, the last character of each month name can be extracted using:

str_sub(month.name, -1, -1)
##  [1] "y" "y" "h" "l" "y" "e" "y" "t" "r" "r" "r" "r"

Likewise, the first character of each month name can be extracted using:

str_sub(month.name, 1, 1)
##  [1] "J" "F" "M" "A" "M" "J" "J" "A" "S" "O" "N" "D"
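
Because str_sub() takes a vector and returns a vector of the same length, it can also be used inside mutate() to create new columns. Here is a minimal sketch using the month names again. The tibble month_tbl and the column names initial and last_three are made up for illustration.

```r
library(dplyr)
library(stringr)

# Put the month names into a one-column tibble
month_tbl <- tibble(month = month.name)

# Add the first character and the last three characters as new columns
month_tbl <- month_tbl %>%
  mutate(
    initial = str_sub(month, 1, 1),
    last_three = str_sub(month, -3, -1)
  )

head(month_tbl, 3)
```

For a short name such as May, the last three characters are simply the whole string.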

Using this knowledge, you should be able to extract the first character of each string in the item column, which will look like this:

##   [1] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
##  [19] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
##  [37] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
##  [55] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
##  [73] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
##  [91] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [109] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [127] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [145] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [163] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [181] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [199] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [217] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [235] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [253] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [271] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [289] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [307] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [325] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [343] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [361] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [379] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [397] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [415] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [433] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [451] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [469] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [487] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [505] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [523] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [541] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [559] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [577] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [595] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [613] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [631] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [649] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [667] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [685] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"
## [703] "d" "d" "d" "d" "d" "d" "d" "d" "d" "t" "t" "t" "t" "t" "t" "t" "t" "t"

Exercise #4: conditional cases

You’ve already learned about conditional cases in base R, but there is also a way to implement conditional cases using the case_when() function from dplyr, which is a more readable way of using multiple if else statements. These statements are implemented using a two-sided formula:

condition TRUE ~ output value

Let’s practice by using an example from week 5:

mynums <- -2:2

for (x in mynums) {
  if (x < 0) {
    print("negative")
  } else if (x > 0) {
    print("positive")
  } else {
    print("neither")
  }
}
## [1] "negative"
## [1] "negative"
## [1] "neither"
## [1] "positive"
## [1] "positive"

This same set of conditional statements can also be written in the following way, where the last line TRUE ~ "neither" is the case when all other conditions are not met:

case_when(
  mynums < 0 ~ "negative",
  mynums > 0 ~ "positive",
  TRUE ~ "neither"
)
## [1] "negative" "negative" "neither"  "positive" "positive"
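
It can help to see what happens when the fallback line is left out. In the sketch below, which reuses mynums, the value 0 matches neither condition, so case_when() returns NA for it. The name labels_no_fallback is only for illustration.

```r
library(dplyr)

mynums <- -2:2

# No TRUE ~ ... fallback: 0 matches neither condition and becomes NA
labels_no_fallback <- case_when(
  mynums < 0 ~ "negative",
  mynums > 0 ~ "positive"
)

labels_no_fallback
```

Leaving out the fallback is fine when your conditions cover every possible value, but an unexpected NA in the result is a sign that some case was missed.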

If the variable being queried in the conditional statements is a column in a data table, then the column name can be referenced in the formula instead, and the result can be added to the table using the mutate() function:

mydat <- data.frame(
  val = mynums,
  label = letters[1:5]
)

mydat %>%
  mutate(
    result = case_when(
      val < 0 ~ "negative",
      val > 0 ~ "positive",
      TRUE ~ "neither"
    )
  )
##   val label   result
## 1  -2     a negative
## 2  -1     b negative
## 3   0     c  neither
## 4   1     d positive
## 5   2     e positive

Using all of this new knowledge, add two new columns named onset and voicing to the tibble vot by filling in the missing information below:

vot <- vot %>%
  mutate(
    onset = ???,
    voicing = case_when(
      onset == ??? ~ ???,
      onset == ??? ~ ???
    )
  )

The result should look like the following:

vot
## # A tibble: 720 × 7
##    participant language item  repetition   vot onset voicing  
##    <chr>       <chr>    <chr>      <int> <dbl> <chr> <chr>    
##  1 monoSp00    spanish  da             1 -75.8 d     voiced   
##  2 monoSp00    spanish  da             2 -81.1 d     voiced   
##  3 monoSp00    spanish  da             3 -56.5 d     voiced   
##  4 monoSp00    spanish  de             1 -55.6 d     voiced   
##  5 monoSp00    spanish  de             2 -62.9 d     voiced   
##  6 monoSp00    spanish  de             3 -43.7 d     voiced   
##  7 monoSp00    spanish  di             1 -54.0 d     voiced   
##  8 monoSp00    spanish  di             2 -72.4 d     voiced   
##  9 monoSp00    spanish  di             3 -58.5 d     voiced   
## 10 monoSp00    spanish  te             1  18.6 t     voiceless
## # … with 710 more rows

Exercise #5: two variable distributions

Your final goal is to create plots to compare the VOT distributions for the interaction between two variables: voicing (voiced vs. voiceless) and language (English vs. Spanish).

There are three variables that we want to represent in these plots: vot, voicing, and language. Using this knowledge, try to recreate the following plot by filling in the missing information:

vot %>% 
  ggplot(aes(x=???, y=???, fill=???)) + 
  geom_boxplot()

A more informative view of the distribution is to use a “violin” plot, which is a mirrored probability density. You can do this by using geom_violin() instead of geom_boxplot():

The best of both worlds, however, would be to display the probability density along with the 25%, 50%, and 75% quantiles shown in the box plots by using the vector c(0.25,0.5,0.75) as the value for the draw_quantiles argument:
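
As a rough sketch of the draw_quantiles argument, here is a violin plot built from mpg, an example data set that ships with ggplot2 (not the course data). The choice of drv, hwy, and year is arbitrary and only illustrates the pattern of one variable on each axis and a second grouping variable as the fill. Depending on your ggplot2 version, draw_quantiles may print a deprecation message.

```r
library(ggplot2)

# mpg is an example data set shipped with ggplot2 (not the course data)
p <- ggplot(mpg, aes(x = drv, y = hwy, fill = factor(year))) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))

p
```

The three horizontal lines inside each violin mark the 25%, 50%, and 75% quantiles.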

time to think: how does VOT characterise the consonant voicing contrast in these two languages? In what cases might this cause confusion for an English speaker learning Spanish, or a Spanish speaker learning English?