During the past two weeks, we’ve been working with formant data
combined from multiple data sets available in the phonTools
package…
load("/shared/groups/jrole001/pals0047/data/vowel_formants.Rda")
…and we saw that the result looked like this:
library(ggplot2)
ggplot(alldat, aes(x=f2, y=f1, col=vowel)) + geom_point() +
scale_x_reverse() + scale_y_reverse()
Although there is certainly a pattern here in which we can identify separate vowel categories, there is still quite a lot of variation here, which makes the resulting vowel space a bit messy.
In this final practical session of PALS0047, we will learn where this variation comes from, how to rectify it, and how to generate a new version of the vowel space that looks like this:
alldat$sex[alldat$type=="w"] <- "f"
alldat$sex[alldat$type=="m"] <- "m"
alldat$sex[alldat$gender=="f"] <- "f"
alldat$sex[alldat$gender=="m"] <- "m"
alldat$f1norm[alldat$sex=="f"] <- scale(alldat$f1[alldat$sex=="f"])
alldat$f2norm[alldat$sex=="f"] <- scale(alldat$f2[alldat$sex=="f"])
alldat$f1norm[alldat$sex=="m"] <- scale(alldat$f1[alldat$sex=="m"])
alldat$f2norm[alldat$sex=="m"] <- scale(alldat$f2[alldat$sex=="m"])
alldat$f1norm <- alldat$f1norm * mean(c(
sd(alldat$f1[alldat$sex=="f"]),
sd(alldat$f1[alldat$sex=="m"]) )) +
mean(alldat$f1)
alldat$f2norm <- alldat$f2norm * mean(c(
sd(alldat$f2[alldat$sex=="f"]),
sd(alldat$f2[alldat$sex=="m"]) )) +
mean(alldat$f2)
ggplot(alldat, aes(x=f2norm,y=f1norm,col=vowel)) + geom_point() +
scale_x_reverse() + scale_y_reverse()
There are a number of complicated steps that will be used in the code that you will debug for this exercise. Before tackling the debugging itself, I highly recommend first working through the logic of these steps to understand the goals of the code.
load("../data/vowel_formants.Rda")
We saw in the Week 8 lecture that children have higher formant values than adults, because children have shorter vocal tracts (and, hence, the resonant frequencies of their vocal tracts are higher). However, there is also a general pattern of speaker sex that shares the same explanation: women generally have higher formant values than men because their vocal tracts are generally shorter. To account for this difference, we will normalize the formant values for men and women speakers separately.
There is currently a problem in our data set, however. We ideally want to plot all 2418 observations in the data set:
nrow(alldat)
## [1] 2418
But if we look at a breakdown of the observations that have values
for the variable sex
…
table(alldat$sex)
##
## f m
## 575 707
…we can see that nearly half of the data isn’t included here!
sum(table(alldat$sex))
## [1] 1282
So where is this “missing” data? Let’s remind ourselves which variables we have in this combined data set:
head(alldat)
## sex vowel f1 f2 f3 f4 type speaker dur f0 repetition gender
## ...1 m i 300 2670 3320 4015 <NA> NA NA NA NA <NA>
## ...2 m e 470 2185 2470 4350 <NA> NA NA NA NA <NA>
## ...3 m a 710 1232 2720 3650 <NA> NA NA NA NA <NA>
## ...4 m o 442 866 2268 3252 <NA> NA NA NA NA <NA>
## ...5 m u 320 784 2576 3615 <NA> NA NA NA NA <NA>
## ...6 f i 325 2715 3130 4170 <NA> NA NA NA NA <NA>
In the Week 9 lecture, we saw that codes were provided in the
type
column of the h95
data set, which
separated adult from child data. So let’s compare these two
variables:
table(alldat$type)
##
## b g m w c
## 0 0 1200 1136 0
table(alldat$sex, alldat$type)
##
## b g m w c
## f 0 0 0 560 0
## m 0 0 660 0 0
So we have a mismatch here: 1220 observations have the value “m”
(man) for type
, but only 660 of these have a corresponding
value of “m” (male) for sex
; and 1136 observations have the
value “w” (woman) for type
, but only 560 of these have a
corresponding value of “f” (female) for `sex’.
step #1: include values in the sex
column for these missing
observations by matching corresponding indices from the
type
column
alldat$sex[alldat$type=="w"] <- "f"
alldat$sex[alldat$type=="m"] <- "m"
The next mismatch comes not from the type
column, but
from the gender
column:
table(alldat$gender)
##
## f m
## 10 10
table(alldat$sex, alldat$gender)
##
## f m
## f 0 0
## m 0 0
10 observations have the value “f” and 10 observations have the value
“m” for gender
, but none of these have a corresponding
value for sex
. Although sex and gender are not the same
thing, many studies have used one label to refer to the other. Since we
don’t have many observations that are affected here, we will assume a
correspondence for these observations.
step #2: include values in the sex
column for these missing
observations by matching corresponding indices from the
gender
column
alldat$sex[alldat$gender=="f"] <- "f"
alldat$sex[alldat$gender=="m"] <- "m"
After replacing data for these two types of variable mismatches, we
can now see that all observations are accounted for and have a relevant
value for the variable sex
:
nrow(alldat)
## [1] 2418
sum(table(alldat$sex))
## [1] 2418
Now that we have incorporated all of the data, what do the formant differences look like for male and female speakers?
ggplot(alldat, aes(x=f2,y=f1,col=sex)) + geom_point() +
scale_x_reverse() + scale_y_reverse()
We can see that there is a very clear (and predicted) pattern here: the male speakers have lower F1 and F2 values than the female speakers, resulting in a smaller vowel space for the male speakers.
To account for this difference, we can ask the following question: “What range of formant values is produced by the male speakers, and what range of formant values is produced by the female speakers?” By scaling these usage-based ranges in the same way for both sets of speakers, we can normalize (or standardize) the formant measurements.
To do this, we will use z-score normalization, defined as \(z={x-\mu \over \sigma }\), where \(\mu\) is the mean of the population and \(\sigma\) is the standard deviation of the population. Since we don’t have measurements for all male and female speakers on the planet, we can approximate this by using the mean and standard devation of the data that we do have: \(z={x-{\bar {x}} \over S}\).
Z-score normalization can be performed very simply in R using the
scale()
function, e.g.:
scale(1:10)
## [,1]
## [1,] -1.4863011
## [2,] -1.1560120
## [3,] -0.8257228
## [4,] -0.4954337
## [5,] -0.1651446
## [6,] 0.1651446
## [7,] 0.4954337
## [8,] 0.8257228
## [9,] 1.1560120
## [10,] 1.4863011
## attr(,"scaled:center")
## [1] 5.5
## attr(,"scaled:scale")
## [1] 3.02765
step #3: normalize F1 values for all female speakers, added to new
column f1norm
step #4: normalize F2 values for all female speakers, added to new
column f2norm
step #5: normalize F1 values for all male speakers, added to new column
f1norm
step #6: normalize F2 values for all male speakers, added to new column
f2norm
alldat$f1norm[alldat$sex=="f"] <- scale(alldat$f1[alldat$sex=="f"])
alldat$f2norm[alldat$sex=="f"] <- scale(alldat$f2[alldat$sex=="f"])
alldat$f1norm[alldat$sex=="m"] <- scale(alldat$f1[alldat$sex=="m"])
alldat$f2norm[alldat$sex=="m"] <- scale(alldat$f2[alldat$sex=="m"])
After normalization, we can see that the vowel spaces are now quite similar for the male and female speakers!
ggplot(alldat, aes(x=f2norm, y=f1norm, col=sex)) + geom_point() +
scale_x_reverse() + scale_y_reverse()
If we plot the new normalized formant values, we can see that the vowel categories are now much more distinct:
ggplot(alldat, aes(x=f2norm, y=f1norm, col=vowel)) + geom_point() +
scale_x_reverse() + scale_y_reverse()
However, the values are now z-scores (i.e. standard deviations), which isn’t very meaningful for understanding the range of formant values themselves. The last thing we need to do is rescale these z-scores by transforming them back to the Hertz scale. If the original z-score transformation was performed as \(z={x-{\bar {x}} \over S}\), then we can transform back to Hertz using \(x={z*S - {\bar {x}}}\). However, the trick here is to use the mean of the male standard devation and the female standard deviation, rather than the overall standard deviation.
step #7: rescale F1 values for all speakers
step #8: rescale F2 values for all speakers
alldat$f1norm <- alldat$f1norm *
mean(unlist(lapply(split(alldat$f1, alldat$sex), sd))) +
mean(alldat$f1)
alldat$f2norm <- alldat$f2norm *
mean(unlist(lapply(split(alldat$f2, alldat$sex), sd))) +
mean(alldat$f2)
After rescaling, we can see that the pattern of variation is the exact same as before, but the F1 and F2 scales are now in Hertz (frequency):
ggplot(alldat, aes(x=f2norm, y=f1norm, col=vowel)) + geom_point() +
scale_x_reverse() + scale_y_reverse()
Now that you understand the steps that need to be accomplished, the goal for today’s exercise is to debug the following code, which contains several errors. First, copy the code into a new R script and save the file. Then, remember to source the file before trying to debug.
Use all of the debugging tips you learned during this week’s lecture:
traceback()
functionprint()
function to print valuesThere are many different types of problems in the code… and not all of them will generate error messages! So make sure you fully understand what the code should be doing and compare it with what the code is actually doing.
A few things to note about this function:
data
argument is the entire data frame
that we’ve been working withvariable
argument is a string matching
a column name, in our case: “sex”rescale
argument is used to (optionally) rescale
the data from z-scores to Hertznormalize_formants <- function(data, variable, rescale = FALSE) {
# include missing data from "type" column
data[variable][data$type=="W"] <- "f"
data[variable][data$type=="M"] <- "m"
# include missing data from "gender" column
data[variable][data$gender=="F"] <- "f"
data[variable][data$gender=="M"] <- "m"
# for each unique category in the user-selected variable, normalize F1 and F2 values
for (cat in unique(data[variable]) ) {
data$f1norm[data[variable]==cat] <- scale(data$f1[data[variable]==cat])
data$f2norm[data[variable]==cat] <- scale(data$f2[data[variable]==cat])
}
# check if user wants to rescale data
if (rescale = TRUE) {
# rescale F1 values
data$f1norm <- data$f1norm *
# apply sd() function over the list of variable categories, separated using split()
mean( lapply( split(data$f1, data[[variable]]), sd) ) +
mean(data$f1)
# rescale F2 values
data$f2norm <- data$f2norm *
# apply sd() function over the list of variable categories, separated using split()
mean( lapply( split(data$f2, data[[variable]]), sd) ) +
mean(data$f2)
}
# return normalized (and/or rescaled) data
return(dat)
}
# debugged code
normalize_formants <- function(data, variable, rescale = FALSE) {
data[[variable]][data$type=="w"] <- "f"
data[[variable]][data$type=="m"] <- "m"
data[[variable]][data$gender=="f"] <- "f"
data[[variable]][data$gender=="m"] <- "m"
for (cat in unique(data[[variable]]) ) {
data$f1norm[data[[variable]]==cat] <- scale(data$f1[data[[variable]]==cat])
data$f2norm[data[[variable]]==cat] <- scale(data$f2[data[[variable]]==cat])
}
if (rescale == TRUE) {
data$f1norm <- data$f1norm *
mean( unlist( lapply( split(data$f1, data[[variable]]), sd) ) ) +
mean(data$f1)
data$f2norm <- data$f2norm *
mean( unlist( lapply( split(data$f2, data[[variable]]), sd) ) ) +
mean(data$f2)
}
return(data)
}
When finished, your debugged code should be able to produce the following when the data are not rescaled:
normdat <- normalize_formants(alldat, "sex")
ggplot(normdat, aes(x=f2norm, y=f1norm, col=vowel)) + geom_point() +
scale_x_reverse() + scale_y_reverse()
…but also the following when the data are rescaled:
normdat <- normalize_formants(alldat, "sex", rescale=T)
ggplot(normdat, aes(x=f2norm, y=f1norm, col=vowel)) + geom_point() +
scale_x_reverse() + scale_y_reverse()