Objectives

  • Learn about within- and between-speaker formant variation
  • Replace missing data by comparing variable values
  • Normalize using z-score transformation
  • Rescale from z-scores to Hertz
  • Debug bad code!

Preliminaries

During the past two weeks, we’ve been working with formant data combined from multiple data sets available in the phonTools package…

load("/shared/groups/jrole001/pals0047/data/vowel_formants.Rda")

…and we saw that the result looked like this:

library(ggplot2)

ggplot(alldat, aes(x=f2, y=f1, col=vowel)) + geom_point() + 
  scale_x_reverse() + scale_y_reverse()

Although there is certainly a pattern here in which we can identify separate vowel categories, there is still quite a lot of variation here, which makes the resulting vowel space a bit messy.

In this final practical session of PALS0047, we will learn where this variation comes from, how to rectify it, and how to generate a new version of the vowel space that looks like this:

alldat$sex[alldat$type=="w"] <- "f"
alldat$sex[alldat$type=="m"] <- "m"

alldat$sex[alldat$gender=="f"] <- "f"
alldat$sex[alldat$gender=="m"] <- "m"

alldat$f1norm[alldat$sex=="f"] <- scale(alldat$f1[alldat$sex=="f"])
alldat$f2norm[alldat$sex=="f"] <- scale(alldat$f2[alldat$sex=="f"])

alldat$f1norm[alldat$sex=="m"] <- scale(alldat$f1[alldat$sex=="m"])
alldat$f2norm[alldat$sex=="m"] <- scale(alldat$f2[alldat$sex=="m"])

alldat$f1norm <- alldat$f1norm * mean(c(
  sd(alldat$f1[alldat$sex=="f"]),
  sd(alldat$f1[alldat$sex=="m"]) )) + 
  mean(alldat$f1)

alldat$f2norm <- alldat$f2norm * mean(c(
  sd(alldat$f2[alldat$sex=="f"]),
  sd(alldat$f2[alldat$sex=="m"]) )) + 
  mean(alldat$f2)

ggplot(alldat, aes(x=f2norm,y=f1norm,col=vowel)) + geom_point() +
  scale_x_reverse() + scale_y_reverse()

Steps to replicate

There are a number of complicated steps that will be used in the code that you will debug for this exercise. Before tackling the debugging itself, I highly recommend first working through the logic of these steps to understand the goals of the code.

Data replacement

load("../data/vowel_formants.Rda")

We saw in the Week 8 lecture that children have higher formant values than adults, because children have shorter vocal tracts (and, hence, the resonant frequencies of their vocal tracts are higher). However, there is also a general pattern of speaker sex that shares the same explanation: women generally have higher formant values than men because their vocal tracts are generally shorter. To account for this difference, we will normalize the formant values for men and women speakers separately.

There is currently a problem in our data set, however. We ideally want to plot all 2418 observations in the data set:

nrow(alldat)
## [1] 2418

But if we look at a breakdown of the observations that have values for the variable sex

table(alldat$sex)
## 
##   f   m 
## 575 707

…we can see that nearly half of the data isn’t included here!

sum(table(alldat$sex))
## [1] 1282

So where is this “missing” data? Let’s remind ourselves which variables we have in this combined data set:

head(alldat)
##      sex vowel  f1   f2   f3   f4 type speaker dur f0 repetition gender
## ...1   m     i 300 2670 3320 4015 <NA>      NA  NA NA         NA   <NA>
## ...2   m     e 470 2185 2470 4350 <NA>      NA  NA NA         NA   <NA>
## ...3   m     a 710 1232 2720 3650 <NA>      NA  NA NA         NA   <NA>
## ...4   m     o 442  866 2268 3252 <NA>      NA  NA NA         NA   <NA>
## ...5   m     u 320  784 2576 3615 <NA>      NA  NA NA         NA   <NA>
## ...6   f     i 325 2715 3130 4170 <NA>      NA  NA NA         NA   <NA>

In the Week 9 lecture, we saw that codes were provided in the type column of the h95 data set, which separated adult from child data. So let’s compare these two variables:

table(alldat$type)
## 
##    b    g    m    w    c 
##    0    0 1200 1136    0
table(alldat$sex, alldat$type)
##    
##       b   g   m   w   c
##   f   0   0   0 560   0
##   m   0   0 660   0   0

So we have a mismatch here: 1220 observations have the value “m” (man) for type, but only 660 of these have a corresponding value of “m” (male) for sex; and 1136 observations have the value “w” (woman) for type, but only 560 of these have a corresponding value of “f” (female) for `sex’.

step #1: include values in the sex column for these missing observations by matching corresponding indices from the type column

alldat$sex[alldat$type=="w"] <- "f"
alldat$sex[alldat$type=="m"] <- "m"

The next mismatch comes not from the type column, but from the gender column:

table(alldat$gender)
## 
##  f  m 
## 10 10
table(alldat$sex, alldat$gender)
##    
##     f m
##   f 0 0
##   m 0 0

10 observations have the value “f” and 10 observations have the value “m” for gender, but none of these have a corresponding value for sex. Although sex and gender are not the same thing, many studies have used one label to refer to the other. Since we don’t have many observations that are affected here, we will assume a correspondence for these observations.

step #2: include values in the sex column for these missing observations by matching corresponding indices from the gender column

alldat$sex[alldat$gender=="f"] <- "f"
alldat$sex[alldat$gender=="m"] <- "m"

After replacing data for these two types of variable mismatches, we can now see that all observations are accounted for and have a relevant value for the variable sex:

nrow(alldat)
## [1] 2418
sum(table(alldat$sex))
## [1] 2418

Normalization

Now that we have incorporated all of the data, what do the formant differences look like for male and female speakers?

ggplot(alldat, aes(x=f2,y=f1,col=sex)) + geom_point() + 
  scale_x_reverse() + scale_y_reverse()

We can see that there is a very clear (and predicted) pattern here: the male speakers have lower F1 and F2 values than the female speakers, resulting in a smaller vowel space for the male speakers.

To account for this difference, we can ask the following question: “What range of formant values is produced by the male speakers, and what range of formant values is produced by the female speakers?” By scaling these usage-based ranges in the same way for both sets of speakers, we can normalize (or standardize) the formant measurements.

To do this, we will use z-score normalization, defined as \(z={x-\mu \over \sigma }\), where \(\mu\) is the mean of the population and \(\sigma\) is the standard deviation of the population. Since we don’t have measurements for all male and female speakers on the planet, we can approximate this by using the mean and standard devation of the data that we do have: \(z={x-{\bar {x}} \over S}\).

Z-score normalization can be performed very simply in R using the scale() function, e.g.:

scale(1:10)
##             [,1]
##  [1,] -1.4863011
##  [2,] -1.1560120
##  [3,] -0.8257228
##  [4,] -0.4954337
##  [5,] -0.1651446
##  [6,]  0.1651446
##  [7,]  0.4954337
##  [8,]  0.8257228
##  [9,]  1.1560120
## [10,]  1.4863011
## attr(,"scaled:center")
## [1] 5.5
## attr(,"scaled:scale")
## [1] 3.02765

step #3: normalize F1 values for all female speakers, added to new column f1norm

step #4: normalize F2 values for all female speakers, added to new column f2norm

step #5: normalize F1 values for all male speakers, added to new column f1norm

step #6: normalize F2 values for all male speakers, added to new column f2norm

alldat$f1norm[alldat$sex=="f"] <- scale(alldat$f1[alldat$sex=="f"])
alldat$f2norm[alldat$sex=="f"] <- scale(alldat$f2[alldat$sex=="f"])

alldat$f1norm[alldat$sex=="m"] <- scale(alldat$f1[alldat$sex=="m"])
alldat$f2norm[alldat$sex=="m"] <- scale(alldat$f2[alldat$sex=="m"])

After normalization, we can see that the vowel spaces are now quite similar for the male and female speakers!

ggplot(alldat, aes(x=f2norm, y=f1norm, col=sex)) + geom_point() + 
  scale_x_reverse() + scale_y_reverse()

Rescaling

If we plot the new normalized formant values, we can see that the vowel categories are now much more distinct:

ggplot(alldat, aes(x=f2norm, y=f1norm, col=vowel)) + geom_point() + 
  scale_x_reverse() + scale_y_reverse()

However, the values are now z-scores (i.e. standard deviations), which isn’t very meaningful for understanding the range of formant values themselves. The last thing we need to do is rescale these z-scores by transforming them back to the Hertz scale. If the original z-score transformation was performed as \(z={x-{\bar {x}} \over S}\), then we can transform back to Hertz using \(x={z*S - {\bar {x}}}\). However, the trick here is to use the mean of the male standard devation and the female standard deviation, rather than the overall standard deviation.

step #7: rescale F1 values for all speakers

step #8: rescale F2 values for all speakers

alldat$f1norm <- alldat$f1norm * 
  mean(unlist(lapply(split(alldat$f1, alldat$sex), sd))) + 
  mean(alldat$f1)

alldat$f2norm <- alldat$f2norm * 
  mean(unlist(lapply(split(alldat$f2, alldat$sex), sd))) + 
  mean(alldat$f2)

After rescaling, we can see that the pattern of variation is the exact same as before, but the F1 and F2 scales are now in Hertz (frequency):

ggplot(alldat, aes(x=f2norm, y=f1norm, col=vowel)) + geom_point() +
  scale_x_reverse() + scale_y_reverse()

Debugging exercise

Now that you understand the steps that need to be accomplished, the goal for today’s exercise is to debug the following code, which contains several errors. First, copy the code into a new R script and save the file. Then, remember to source the file before trying to debug.

Use all of the debugging tips you learned during this week’s lecture:

  • Search error messages online
  • Use the traceback() function
  • Use breakpoints
  • Use the print() function to print values

There are many different types of problems in the code… and not all of them will generate error messages! So make sure you fully understand what the code should be doing and compare it with what the code is actually doing.

A few things to note about this function:

  • The input to the data argument is the entire data frame that we’ve been working with
  • The input to the variable argument is a string matching a column name, in our case: “sex”
  • The rescale argument is used to (optionally) rescale the data from z-scores to Hertz
normalize_formants <- function(data, variable, rescale = FALSE) {
  
  # include missing data from "type" column
  data[variable][data$type=="W"] <- "f"
  data[variable][data$type=="M"] <- "m"
  
  # include missing data from "gender" column
  data[variable][data$gender=="F"] <- "f"
  data[variable][data$gender=="M"] <- "m"
  
  # for each unique category in the user-selected variable, normalize F1 and F2 values
  for (cat in unique(data[variable]) ) {
    data$f1norm[data[variable]==cat] <- scale(data$f1[data[variable]==cat])
    data$f2norm[data[variable]==cat] <- scale(data$f2[data[variable]==cat])
  }
  
  # check if user wants to rescale data
  if (rescale = TRUE) {
    # rescale F1 values
    data$f1norm <- data$f1norm * 
      # apply sd() function over the list of variable categories, separated using split()
      mean( lapply( split(data$f1, data[[variable]]), sd) ) +
      mean(data$f1)
    
    # rescale F2 values
    data$f2norm <- data$f2norm * 
      # apply sd() function over the list of variable categories, separated using split()
      mean( lapply( split(data$f2, data[[variable]]), sd) ) + 
      mean(data$f2)
  }
  
  # return normalized (and/or rescaled) data
  return(dat)
}
# debugged code
normalize_formants <- function(data, variable, rescale = FALSE) {
  
  data[[variable]][data$type=="w"] <- "f"
  data[[variable]][data$type=="m"] <- "m"
  
  data[[variable]][data$gender=="f"] <- "f"
  data[[variable]][data$gender=="m"] <- "m"
  
  for (cat in unique(data[[variable]]) ) {
    data$f1norm[data[[variable]]==cat] <- scale(data$f1[data[[variable]]==cat])
    data$f2norm[data[[variable]]==cat] <- scale(data$f2[data[[variable]]==cat])
  }
  
  if (rescale == TRUE) {
    data$f1norm <- data$f1norm * 
      mean( unlist( lapply( split(data$f1, data[[variable]]), sd) ) ) + 
      mean(data$f1)
    data$f2norm <- data$f2norm * 
      mean( unlist( lapply( split(data$f2, data[[variable]]), sd) ) ) + 
      mean(data$f2)
  }
  
  return(data)
}

When finished, your debugged code should be able to produce the following when the data are not rescaled:

normdat <- normalize_formants(alldat, "sex")

ggplot(normdat, aes(x=f2norm, y=f1norm, col=vowel)) + geom_point() +
  scale_x_reverse() + scale_y_reverse()

…but also the following when the data are rescaled:

normdat <- normalize_formants(alldat, "sex", rescale=T)

ggplot(normdat, aes(x=f2norm, y=f1norm, col=vowel)) + geom_point() +
  scale_x_reverse() + scale_y_reverse()