install.packages(c('rPraat', 'phonTools', 'seewave', 'multitaper', 'emuR',
'praatpicture'))
Generating and analyzing multitaper spectra in R
Brief introduction
This is a tutorial showing how to generate multitaper spectra in R and how to compute spectral moments and DCT coefficients from multitaper spectra. I will not go into a lot of technical detail about how multitaper spectra are generated; for more on this, see Reidy (2013) and Reidy (2015).
By way of introduction, when linguists analyze speech data, we are often interested in how much energy is found at different frequencies. This is not easy to spot from the waveform, so the signal is converted into spectrograms or spectral slices, usually generated using the fast Fourier transform (FFT). As the name suggests, this method is blazing fast.
FFT spectra are suitable for analyzing voiced portions of speech, but in both theory and practice they can be less suitable for analyzing voiceless portions of speech. This is because the Fourier basis is periodic, making the FFT inherently more suitable for periodic signals, such as voiced portions of speech, and less suitable for aperiodic signals, such as voiceless portions of speech.
Variance is reduced in multitaper spectral estimation, making multitaper spectra theoretically and practically more suitable for voiceless portions of speech. Theoretically, because there is no strong assumption of periodicity in the underlying signal; and practically, because FFT spectra of voiceless speech are packed with unwanted noise, which is a major disadvantage if you want to use spectra (or numbers computed from spectra) as dependent variables in a statistical model.
We’ll be using the packages rPraat, phonTools, seewave, multitaper, and emuR, as well as praatpicture for plotting, so make sure those are installed.
Loading and preprocessing sound files
There are several functions for loading sound files into R, resulting in objects with different formats. I’ll use snd.read() from the package rPraat here, but you can use whichever method you prefer. The file contains a female Kmhmu’ speaker saying a single syllable [siːm]. It comes from this OSF repository (see Kirby, Pittayaporn, and Brunelle (2022)).
library(rPraat)
snd_obj <- snd.read('snd/1.wav')
class(snd_obj)

           type    name 
 "list" "Sound" "1.wav" 
snd_obj is an object of the class Sound, which R treats as a list containing a bunch of information about the sound file. The actual sound signal is stored in snd_obj$sig, the time (in seconds) of each sample is given in snd_obj$t, and the sample rate of the file is given in snd_obj$fs. If we use the base R plotting function to produce a simple line plot of the signal, we get a waveform. I’ve added red lines indicating the location of the sibilant.
sib_loc <- c(0.075, 0.21)
plot(x=snd_obj$t, y=snd_obj$sig, type='l',
     xlab='Time (s)',
     ylab='Amplitude')
abline(v=sib_loc, col='red', lwd=2.5)
Here’s a spectrogram of the sound, generated using the spectrogram() function from the phonTools package:
library(phonTools)
spectrogram(snd_obj$sig[,1], fs=snd_obj$fs, maxfreq=12000,
            colors=FALSE, dynamicrange=60)
abline(v=sib_loc*1000, col='red', lwd=2.5)
This sound file has a sample rate of 44.1 kHz, which is much more than we need for our purposes, so I’ll downsample it to 24 kHz using the resamp() function from the seewave package. This function takes the arguments f (the original sample rate) and g (the desired sample rate). I choose 24 kHz because sibilants tend to vary in the very high frequencies, which probably makes something like 16 kHz problematic: since the highest frequency we can analyze is the Nyquist frequency, i.e. half the sample rate, we would then limit ourselves to analyzing frequencies below 8 kHz.
library(seewave)
sr <- 24000
snd_24kHz <- resamp(snd_obj$sig, f=snd_obj$fs, g=sr)
Next, I’ll pull out just those sound samples that belong to the sibilant, and extract a 10 ms snippet from the middle of that.
snd_sib <- snd_24kHz[(sib_loc[1]*sr):(sib_loc[2]*sr)]
mid <- length(snd_sib)/2
snd <- snd_sib[(mid-(0.005*sr)):(mid+(0.005*sr))]
Our 10 ms snippet snd looks like this:
plot(snd, type='l', xlab='Time (samples)', ylab='Amplitude')
Generating spectra
First of all, for expository purposes, I’ll generate an FFT spectrum of our sound using the spectralslice() function in phonTools. The fs argument is the signal’s sample rate.
fft <- spectralslice(snd, fs=sr)
As you can see, it’s very noisy and jagged. I saved it in an object fft, which is a matrix containing information about the energy distribution in dB by frequency:
head(fft)
hz dB
[1,] 0.00000 -11.315922
[2,] 33.19502 -10.944350
[3,] 66.39004 -10.128381
[4,] 99.58506 -9.370342
[5,] 132.78008 -8.981041
[6,] 165.97510 -9.095506
I’ll go ahead and convert that to a data frame, which will make our lives easier down the line.
fft_df <- as.data.frame(fft)
Multitaper spectra can be generated using the spec.mtm() function from the multitaper package. I set the arguments nw, a frequency bandwidth parameter, and k, the number of eigenspectra used to compute the final spectrum, following the suggestions by Reidy (2013). The spec.mtm() defaults are nw=4, which is also what Reidy uses, and k=7, where Reidy uses k=8. The deltat argument is the duration of each sample in the signal, i.e. 1 divided by our sample rate.
library(multitaper)
mts <- spec.mtm(snd, nw=4, k=8, deltat=1/sr,
                xlab='Frequency (Hz)', ylab='', main='')
The multitaper spectrum has a much more stable shape than the FFT spectrum.
Since we already converted the FFT spectrum to a data frame, it is in the format we need for computing things like spectral moments and DCT coefficients. The R object containing the multitaper spectrum is rather more complicated:
summary(mts)
Length Class Mode
origin.n 1 -none- numeric
method 1 -none- character
pad 1 -none- numeric
spec 257 -none- numeric
freq 257 -none- numeric
series 1 -none- character
adaptive 1 -none- logical
mtm 14 -none- list
The information that we have in the corresponding FFT spectrum data frame fft_df is what’s stored in mts$spec and mts$freq. Be aware, however, that we can’t just grab those from the mts object and assume all is well. If we plot them directly, we get this:
plot(x=mts$freq, y=mts$spec, type='l',
xlab='Frequency (Hz)', ylab="What's this?")
It doesn’t look at all like what we saw above. This is because energy is stored on a linear scale in the mts object; log-transforming it gives us the familiar spectral shape, like the plot that the multitaper package produces.
plot(x=mts$freq, y=log(mts$spec), type='l',
xlab='Frequency (Hz)', ylab='')
I usually standardize spectra before analyzing them statistically, so the exact scale is much less important than the curve shape. For now, let’s save a logged version of the multitaper spectrum in a data frame like we did with the FFT spectrum.
mts_df <- data.frame(hz = mts$freq,
                     energy = log(mts$spec))
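As an aside, standardizing could be as simple as z-scoring the energy values. The following is a minimal sketch rather than part of the pipeline in this tutorial, and you’d normally standardize after any frequency filtering:

#z-score the energy values: curve shape is retained, but the scale
#becomes comparable across tokens
mts_df$energy_z <- (mts_df$energy - mean(mts_df$energy)) / sd(mts_df$energy)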
When analyzing the behavior of the high-frequency range of spectra, it’s customary to filter away lower frequencies to remove any intrusive influence of voicing or other low-frequency rumble, which can have a huge influence on measures that summarize the overall shape and energy distribution of the spectrum. This can be done with a bandpass filter before processing the sound, but here we just remove information from frequencies below 500 Hz from the spectrum.
fft_df <- fft_df[which(fft_df$hz > 500),]
mts_df <- mts_df[which(mts_df$hz > 500),]
Peak frequency
A straightforward way to summarize a spectrum is to get its peak frequency, i.e. the frequency bin with the highest energy. This can be achieved very simply with the base R function which.max().
fft_df$hz[which.max(fft_df$dB)]

[1] 5244.813

mts_df$hz[which.max(mts_df$energy)]

[1] 5671.875
There’s a difference of a few hundred Hz between the two spectral estimation methods. It is fairly typical to restrict this measure to the mid-frequency region, i.e. the region between 3,000 and 7,000 Hz (see e.g. Koenig et al. 2013). Since both spectra happen to have their peaks in the mid-frequency region, the results here would be identical. If that hadn’t been the case, though, restricting the search to the mid-frequency region could be achieved like this:
fft_midFreq <- fft_df[which(fft_df$hz > 3000 & fft_df$hz < 7000),]
mts_midFreq <- mts_df[which(mts_df$hz > 3000 & mts_df$hz < 7000),]
fft_midFreq$hz[which.max(fft_midFreq$dB)]

[1] 5244.813

mts_midFreq$hz[which.max(mts_midFreq$energy)]

[1] 5671.875
Computing spectral moments
We calculate spectral moments using the function moments() from the package emuR. It takes the arguments count, which is our energy dimension (in whatever scale), and x, which is our frequency dimension (in whatever scale). moments() simply returns a vector with four numbers, corresponding to the first four spectral moments, i.e. mean (AKA center of gravity), variance, skew, and kurtosis. Phonetics studies usually report standard deviation rather than variance, because standard deviation is on the well-known Hz scale while variance is on the squared Hz scale. We also set minval=TRUE because our energy scale contains negative values; if we don’t do this, the results would essentially be as if we had flipped the spectrum upside down.
Let’s try to do this for our FFT spectrum first.
library(emuR)
fft_moments <- moments(fft_df$dB, fft_df$hz, minval=TRUE)
fft_moments[2] <- sqrt(fft_moments[2])
And we’ll repeat that for our multitaper spectrum.
mts_moments <- moments(mts_df$energy, mts_df$hz, minval=TRUE)
mts_moments[2] <- sqrt(mts_moments[2])
Let’s compare the two results.
fft_moments
[1] 6527.47064546 3029.16900616 -0.03016832 -0.95492682
mts_moments
[1] 6713.23831437 2617.28439091 0.01304411 -0.48917071
There are notable differences in all spectral moments. The take-home message is that the method of spectral estimation matters.
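To make explicit what moments() computes, here is a sketch of the first moment (center of gravity) calculated by hand. I’m assuming that minval=TRUE simply shifts the energy values so that the minimum is zero before they are used as weights, which is my reading of the emuR documentation:

#shift energies so the minimum is zero (what minval=TRUE should do)
w <- mts_df$energy - min(mts_df$energy)
#normalize the weights to sum to 1, like a probability distribution
p <- w / sum(w)
#the first moment is the energy-weighted mean frequency
sum(mts_df$hz * p)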
Computing DCT coefficients
Another way of summarizing spectral shape is using the coefficients of a discrete cosine transformation of the spectrum. Usually the first four coefficients are reported. This is also implemented in the emuR package, in the dct() function. dct() takes the arguments data, which is our energy dimension, and m, which is the number of coefficients to return. Since DCT coefficients only say something about the shape of a curve, the function doesn’t actually care about the frequency dimension. If we set m=3, the function returns a vector of four numbers, corresponding to k0, k1, k2, and k3, which reflect mean amplitude, linear slope, curvature, and strength at higher frequencies, respectively.
For our FFT spectrum, it looks like this.
fft_dct <- dct(fft_df$dB, m=3)
And for our multitaper spectrum, it looks like this.
mts_dct <- dct(mts_df$energy, m=3)
These are the results.
fft_dct
[1] -24.575433 -2.643086 -6.839182 -2.032708
mts_dct
[1] -29.3925342 -0.4104879 -1.4238712 -0.2278941
Here, as well, different methods of spectral estimation have serious implications for the results.
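One way to build intuition for what these coefficients capture is to plot the smooth curve they encode on top of the raw spectrum. If I read the emuR documentation correctly, dct() has a fit argument that returns the fitted curve instead of the coefficients; treat the following as a sketch:

#plot the raw multitaper spectrum in grey
plot(mts_df$hz, mts_df$energy, type='l', col='grey',
     xlab='Frequency (Hz)', ylab='Log energy')
#overlay the smooth shape encoded by the first four DCT coefficients
#(assumes that fit=TRUE returns fitted values)
lines(mts_df$hz, dct(mts_df$energy, m=3, fit=TRUE), col='red', lwd=2)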
Bulk processing
Usually we won’t want to compute and analyze a single spectrum. If we have a research question, we’ll probably want to do this in bulk by looking at how the spectrum changes over time, or we’ll have multiple tokens of some consonant, multiple sound files, etc. Here I’ll just give two examples of how spectra can be computed and analyzed in bulk. Hopefully these examples can also be helpful for other use cases.
If we wanted to look at spectral dynamics over time in our fricative snd_sib by computing, say, spectral moments from 10 equidistant 10 ms multitaper spectra, we could do the following.
#number of samples in 10 ms
cs <- sr/100
#number of samples in the fricative
n_samp <- length(snd_sib)
#our number of equidistant spectra
steps <- 10
#create vector of start times for the spectra
t1 <- seq(1, n_samp-cs, length.out=steps)
#create vector of end times for the spectra
t2 <- seq(cs, n_samp, length.out=steps)
#create empty data frame with nrow=steps
mom <- data.frame(step = rep(NA, steps),
                  cog = rep(NA, steps),
                  sd = rep(NA, steps),
                  skew = rep(NA, steps),
                  kurtosis = rep(NA, steps))

for (i in 1:steps) {
  #create ith 10 ms snippet
  tmp <- snd_sib[t1[i]:t2[i]]
  #compute multitaper spectrum without plotting
  tmp_mts <- spec.mtm(tmp, nw=4, k=8, deltat=1/sr, plot=FALSE)
  tmp_mts <- data.frame(hz = tmp_mts$freq,
                        energy = log(tmp_mts$spec))
  tmp_mts <- tmp_mts[which(tmp_mts$hz > 500),]
  #compute spectral moments
  tmp_mom <- moments(count=tmp_mts$energy, x=tmp_mts$hz, minval=TRUE)
  mom$step[i] <- i
  mom$cog[i] <- tmp_mom[1]
  #compute standard deviation from variance
  mom$sd[i] <- sqrt(tmp_mom[2])
  mom$skew[i] <- tmp_mom[3]
  mom$kurtosis[i] <- tmp_mom[4]
}
This results in a data frame with a time series for each spectral moment. It looks like this:
mom
step cog sd skew kurtosis
1 1 5988.986 2881.961 0.18437379 -0.6530865
2 2 6299.798 2915.016 0.01133856 -0.7698554
3 3 6525.229 2586.294 0.20834472 -0.4085182
4 4 6822.153 2448.914 0.03136296 -0.3386035
5 5 6993.307 2568.983 -0.10483378 -0.4386231
6 6 6727.912 2865.474 -0.11990408 -0.6666871
7 7 7250.932 2500.938 -0.19279606 -0.2802282
8 8 6478.217 2872.782 -0.14174052 -0.4770977
9 9 6620.092 2701.421 -0.19581113 -0.3652390
10 10 5857.860 2772.067 -0.02317965 -0.5592586
(Hint: If you were to repeat this with FFT spectra, you’d see that the variation between adjacent spectra would be much larger).
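A quick way to eyeball such a trajectory is a simple base R plot, for example for the center of gravity:

#plot center of gravity across the 10 equidistant spectra
plot(mom$step, mom$cog, type='b',
     xlab='Step', ylab='Center of gravity (Hz)')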
Imagine that we had a sound file with a bunch of sibilants, and for each of them we want to compute the first four DCT coefficients of a 10 ms multitaper spectrum computed around the midpoint. Here’s an example of a sound file with 5 [s]es and a TextGrid that marks their locations. (This is plotted using the praatpicture library – much more information about how this works can be found here).
library(praatpicture)
praatpicture('snd/5.wav', spec_freqRange = c(0,8000),
tg_focusTierLineType = 'solid',
tg_focusTierColor = 'blue')
In order to do this, we’d first load in the TextGrid using the function tg.read() from the rPraat package.

tg <- tg.read('snd/5.TextGrid')
This TextGrid has just one tier, sib, and the resulting R object is a list containing information about the start and end times of each interval and the labels in them.
The rest of the operation could look like this:
#find the intervals with label 's'
s_int <- which(tg$sib$label == 's')
#how many intervals?
ns <- length(s_int)
#create empty data frame with nrow=ns
dct_coef <- data.frame(id = rep(NA, ns),
                       t1 = rep(NA, ns),
                       t2 = rep(NA, ns),
                       k0 = rep(NA, ns),
                       k1 = rep(NA, ns),
                       k2 = rep(NA, ns),
                       k3 = rep(NA, ns))

for (i in 1:ns) {
  #number of the ith interval with label 's'
  int <- s_int[i]
  #start time of the ith interval with label 's'
  t1 <- tg$sib$t1[int]
  #end time of the ith interval with label 's'
  t2 <- tg$sib$t2[int]
  #mid point of the interval
  midpoint <- t1 + (t2-t1) / 2

  #read in 10 ms snippet from sound file around the midpoint
  #of the ith interval with label 's'
  tmp <- snd.read('snd/5.wav',
                  from=midpoint-0.005, to=midpoint+0.005,
                  units='seconds')
  #downsample to 24 kHz as shown above
  tmp_24kHz <- as.vector(resamp(tmp$sig, f=tmp$fs, g=sr))

  #compute multitaper spectrum without plotting
  tmp_mts <- spec.mtm(tmp_24kHz, nw=4, k=8, deltat=1/sr, plot=FALSE)
  tmp_mts <- data.frame(hz = tmp_mts$freq,
                        energy = log(tmp_mts$spec))
  tmp_mts <- tmp_mts[which(tmp_mts$hz > 500),]
  #get 4 DCT coefficients as shown above
  tmp_dct <- dct(tmp_mts$energy, m=3)
  #fill in data frame
  dct_coef$id[i] <- i
  dct_coef$t1[i] <- t1
  dct_coef$t2[i] <- t2
  dct_coef$k0[i] <- tmp_dct[1]
  dct_coef$k1[i] <- tmp_dct[2]
  dct_coef$k2[i] <- tmp_dct[3]
  dct_coef$k3[i] <- tmp_dct[4]
}
We now have a data frame dct_coef with spectral DCT coefficients computed from the midpoint of each sibilant:
dct_coef
id t1 t2 k0 k1 k2 k3
1 1 0.1142821 0.2950639 -27.22306 -1.674568 -0.44917079 0.6753327
2 2 0.8182295 0.9670684 -26.86305 -1.926533 -0.10199791 0.5209929
3 3 1.7678831 1.8943365 -27.11824 -1.864177 -0.60113986 0.8022334
4 4 2.5671831 2.7023670 -26.85385 -1.826578 -0.39936280 0.8335437
5 5 3.2914200 3.4253128 -28.32103 -1.821612 -0.03566683 0.5755081
These are both relatively simple cases, involving multiple ‘snippets’ of a single token or multiple tokens from a single sound file. Often, we’ll be interested in analyzing multiple tokens from multiple sound files. To enable such a workflow, we’re probably better off turning parts of the analysis into a function. Here I define a function getSpecPeak for grabbing the mid-frequency peak from the midpoint of each token with a certain TextGrid label, allowing users to define the analysis window around the midpoint, the resampling rate, and the mid-frequency region.
getSpecPeak <- function(filename, tierName, tgMatch, resamp = 24000,
                        analysisWindow = 0.01, midFreqRegion = c(3000, 7000)) {
  # Read in TextGrid with flexible filename
  tg <- tg.read(paste0(filename, '.TextGrid'), encoding = 'auto')
  # Find matches in TextGrid
  match_int <- which(tg[[tierName]]$label == tgMatch)
  # How many matches in TextGrid?
  n_match <- length(match_int)
  # Create empty data frame
  spec_peak <- data.frame(id = rep(NA, n_match),
                          t1 = rep(NA, n_match),
                          t2 = rep(NA, n_match),
                          following = rep(NA, n_match),
                          peak = rep(NA, n_match))
  # Loop through TextGrid matches
  for (i in 1:n_match) {
    # Keep identifier
    int <- match_int[i]
    # Keep start and end times
    t1 <- tg[[tierName]]$t1[int]
    t2 <- tg[[tierName]]$t2[int]
    # Get the following TextGrid label (probably of interest)
    following <- tg[[tierName]]$label[int + 1]
    # Get midpoint timestamp
    midpoint <- t1 + (t2-t1) / 2

    # Read in sound analysis window of sound file
    tmp <- snd.read(paste0(filename, '.wav'),
                    from=midpoint-(analysisWindow/2),
                    to=midpoint+(analysisWindow/2),
                    units='seconds')
    # Resample using user input
    sr <- tmp$fs
    tmp <- as.vector(resamp(tmp$sig, f=tmp$fs, g=resamp))

    # Generate multitaper spectrum and convert to data frame
    tmp_mts <- spec.mtm(tmp, nw=4, k=8, deltat=1/resamp, plot=FALSE)
    tmp_mts <- data.frame(hz = tmp_mts$freq,
                          energy = log(tmp_mts$spec))
    # Filter spectrum to mid frequency region
    tmp_mts <- tmp_mts[which(tmp_mts$hz > midFreqRegion[1] &
                             tmp_mts$hz < midFreqRegion[2]),]
    # Find peak
    tmp_peak <- tmp_mts[which.max(tmp_mts$energy),'hz']
    # Populate dummy data frame
    spec_peak$id[i] <- i
    spec_peak$t1[i] <- t1
    spec_peak$t2[i] <- t2
    spec_peak$following[i] <- following
    spec_peak$peak[i] <- tmp_peak
  }
  return(spec_peak)
}
This function can now be used on a single file:
getSpecPeak(filename = 'snd/1', tierName = 'sib', tgMatch = 's')
id t1 t2 following peak
1 1 0.07544516 0.2164158 5109.375
Very simple output as there’s only one /s/ in this sound file! But having a function also makes it easier to batch process multiple sound files. We just need a list of files that we want to process, and then we can loop over them, like so:
# list all files with .wav extension in the snd folder
wavFiles <- list.files(path = 'snd', pattern = '.wav', full.names = TRUE)

# loop over them
for (i in 1:length(wavFiles)) {
  # get versions of file names without .wav extension
  bare_filename <- gsub('.wav', '', wavFiles[i])

  # if this is the first run through the loop, initiate data frame
  if (i == 1) {
    specPeakBulk <- getSpecPeak(bare_filename, tierName = 'sib', tgMatch = 's')
    specPeakBulk$file <- bare_filename
  # if this is not the first run through the loop, create new temporary
  # data frame and combine at the end
  } else {
    tmp <- getSpecPeak(bare_filename, tierName = 'sib', tgMatch = 's')
    tmp$file <- bare_filename
    specPeakBulk <- rbind(specPeakBulk, tmp)
  }
}
These are the results:
specPeakBulk
id t1 t2 following peak file
1 1 0.07544516 0.2164158 5109.375 snd/1
2 1 0.11428206 0.2950639 6890.625 snd/5
3 2 0.81822953 0.9670684 6890.625 snd/5
4 3 1.76788311 1.8943365 6656.250 snd/5
5 4 2.56718306 2.7023670 6984.375 snd/5
6 5 3.29142004 3.4253128 6750.000 snd/5
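As a side note, growing a data frame with rbind() inside a loop is fine for a handful of files, but an alternative is to collect the per-file results in a list and bind them once at the end. Here’s a sketch of the same bulk analysis using lapply():

#collect per-file results in a list
peak_list <- lapply(wavFiles, function(wf) {
  bare_filename <- gsub('.wav', '', wf)
  out <- getSpecPeak(bare_filename, tierName = 'sib', tgMatch = 's')
  out$file <- bare_filename
  out
})
#combine the list into a single data frame
specPeakBulk <- do.call(rbind, peak_list)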
More complex methods and workflows
When we generate spectra in R and store them in R objects, it means we also have access to all the complex statistical methods that are available in R. In addition to parameterizing the spectra with one or a few numbers as we’ve done here, we can use them as inputs or dependent variables in advanced statistical methods. I’ll briefly mention a few, but won’t demonstrate them here. One such approach is functional principal component analysis (FPCA), a dimensionality reduction method that takes as input a potentially large number of functions or curves, and can be used to summarize their principal modes of variation. The output would then be 1) the average spectrum, 2) a small number of shapes (principal components) that summarize ways in which the input spectra differ from the average, and 3) for each input spectrum, scores indicating how closely it matches each of the principal components. Puggaard-Rode (2023) uses this approach to analyze regional variation in /t/ midpoint spectra in Danish (the code is laid out here).
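Purely as a hedged sketch of what this could look like in practice (none of this is demonstrated here, and the preprocessing details matter a lot), an FPCA could be run with the fda package, assuming a hypothetical frequency-by-token matrix spec_mat of log energies sampled at the frequencies in mts_df$hz:

library(fda)
#represent the spectra as smooth functions using a B-spline basis
basis <- create.bspline.basis(range(mts_df$hz), nbasis=20)
#spec_mat is a hypothetical matrix: rows = frequency bins, columns = tokens
spec_fd <- Data2fd(argvals=mts_df$hz, y=spec_mat, basisobj=basis)
#extract the first three principal modes of variation
spec_pca <- pca.fd(spec_fd, nharm=3)
plot(spec_pca$harmonics) #the principal component curves
spec_pca$scores          #per-token scores on each component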
An approach for integrating both the frequency and time domains using function-on-scalar regression is demonstrated by Puggaard-Rode (2022). This involves smoothing over both the time and frequency domain to estimate differences in time-varying spectral shape in different conditions.