Generating and analyzing multitaper spectra in R

Author
Affiliation

Institute for Phonetics and Speech Processing, Ludwig Maximilian University of Munich

Published

May 8, 2025

Brief introduction

This is a tutorial showing how to generate multitaper spectra in R and how to compute spectral moments and DCT coefficients from multitaper spectra. I will not go into a lot of technical detail about how multitaper spectra are generated; for more on this, see Reidy () and Reidy ().

By way of introduction, when linguists analyze speech data, we are often interested in how much energy is found at different frequencies. This is not easy to spot from the waveform, so the signal is converted into spectrograms or spectral slices, usually generated using the fast Fourier transformation (FFT). As the name suggests, this method is blazing fast.

FFT spectra are suitable for analyzing voiced portions of speech, but in both theory and practice they can be less suitable for analyzing voiceless portions of speech. This is because the Fourier basis is periodic, making the FFT inherently more suitable for periodic signals, such as voiced portions of speech, and less suitable for aperiodic signals, such as voiceless portions of speech.

Variance is reduced in multitaper spectral estimation, making multitaper spectra theoretically and practically and more suitable for voiceless portions of speech. Theoretically, because there is no strong assumption of periodicity in the underlying signal, and practically, because FFT spectra of voiceless speech are packed with unwanted noise, which is a major disadvantage if you want to use spectra (or numbers computed from spectra) as dependent variables in a statistical model.

We’ll be using the packages rPraat, phonTools, seewave, multitaper, and emuR, as well as praatpicture for plotting, so make sure those are installed.

install.packages(c('rPraat', 'phonTools', 'seewave', 'multitaper', 'emuR',
                   'praatpicture'))

Loading and preprocessing sound files

There are several functions for loading sound files into R, resulting in objects with different formats. I’ll use snd.read() from the package rPraat here, but you can use whichever method you prefer. The file contains a female Kmhmu’ speaker saying a single syllable [siːm]. It comes from this OSF repository (see Kirby, Pittayaporn, and Brunelle ()).

library(rPraat)
snd_obj <- snd.read('snd/1.wav')
class(snd_obj)
           type    name 
 "list" "Sound" "1.wav" 

snd_obj is an object of the class Sound, which R treats as a list containing a bunch of information about the sound file. The actual sound signal is stored in snd_obj$sig, time (in seconds) of each sample is given in snd_obj$t, and the sample rate of the file is given in snd_obj$fs. If we use the base R plotting function to produce a simple line plot of the signal, we get a waveform. I’ve added red lines indicating the location of the sibilant.

sib_loc <- c(0.075, 0.21)
plot(x=snd_obj$t, y=snd_obj$sig, type='l',
     xlab='Time (s)',
     ylab='Amplitude')
abline(v=sib_loc, col='red', lwd=2.5)

Here’s a spectrogram of the sound, generated using the spectrogram() function from the phonTools package:

library(phonTools)
spectrogram(snd_obj$sig[,1], fs=snd_obj$fs, maxfreq=12000, 
            colors=FALSE, dynamicrange=60)
abline(v=sib_loc*1000, col='red', lwd=2.5)

This sound file has a sample rate of 44.1 kHz, which is much more than we need for our purposes, so I’ll downsample it to 24 kHz using the resamp() function from the seewave package. This function takes the arguments f (the original sample rate) and g (the desired sample rate). I choose 24 kHz because sibilants do have a tendency to vary in very high frequencies, which probably makes something like 16 kHz problematic as we would then limit ourselves to analyzing frequencies below 8 kHz.

library(seewave)
sr <- 24000
snd_24kHz <- resamp(snd_obj$sig, f=snd_obj$fs, g=sr)

Next, I’ll extract just those sound samples that belong to the sibilant in the object, and extract a 10 ms snippet from the middle of that.

snd_sib <- snd_24kHz[(sib_loc[1]*sr):(sib_loc[2]*sr)]
mid <- length(snd_sib)/2
snd <- snd_sib[(mid-(0.005*sr)):(mid+(0.005*sr))]

Our 10 ms snippet snd looks like this:

plot(snd, type='l', xlab='Time (samples)', ylab='Amplitude')

Generating spectra

First of all, for expository purposes, I’ll generate an FFT spectrum of our sound using the spectralslice() function in phonTools. The fs argument is the signal’s sample rate.

fft <- spectralslice(snd, fs=sr)

As you can see, it’s very noisy and jagged. I saved it in an object fft which is a matrix containing information about the energy distribution in dB by frequency:

head(fft)
            hz         dB
[1,]   0.00000 -11.315922
[2,]  33.19502 -10.944350
[3,]  66.39004 -10.128381
[4,]  99.58506  -9.370342
[5,] 132.78008  -8.981041
[6,] 165.97510  -9.095506

I’ll go ahead and convert that to a data frame, which will make our lives easier down the line.

fft_df <- as.data.frame(fft)

Multitaper spectra can be generated using the spec.mtm() function from the multitaper package. I set the arguments nw, a frequency bandwidth parameter, and k, the number of eigenspectra used to compute the final spectrum, following the suggestions by Reidy (). The spec.mtm() defaults are nw=4, which is also what Reidy uses, and k=7, where Reidy uses K=2nW, i.e. k=8. The deltat argument is the duration of each sample in the signal, i.e. 1 divided by our sample rate.

library(multitaper)
mts <- spec.mtm(snd, nw=4, k=8, deltat=1/sr, 
                xlab='Frequency (Hz)', ylab='', main='')

The multitaper spectrum has a much more stable shape than the FFT spectrum.

Since we already converted the FFT spectrum to a data frame, that already has the format we need for computing things like spectral moments and DCT coefficients. The R object containing the multitaper spectrum is rather more complicated:

summary(mts)
         Length Class  Mode     
origin.n   1    -none- numeric  
method     1    -none- character
pad        1    -none- numeric  
spec     257    -none- numeric  
freq     257    -none- numeric  
series     1    -none- character
adaptive   1    -none- logical  
mtm       14    -none- list     

The information that we have in the corresponding FFT spectrum data frame fft_df is what’s stored in mts$spec and mts$freq. Be aware, however, that we can’t just grab those from the mts object and assume all is well. If we plot them directly, we get this:

plot(x=mts$freq, y=mts$spec, type='l',
     xlab='Frequency (Hz)', ylab="What's this?")

It doesn’t look at all like what we saw above. This is because energy is in the W/m2 scale and not the decibel scale. Taking the natural log of this number will give us a spectrum that’s visually identical to what the plotting function of multitaper produces.

plot(x=mts$freq, y=log(mts$spec), type='l',
     xlab='Frequency (Hz)', ylab='')

I usually standardize spectra before analyzing them statistically, so the exact scale is much less important than the curve shape. For now, let’s save a logged version of the multitaper spectrum in a data frame like we did with the FFT spectrum.

mts_df <- data.frame(hz = mts$freq,
                     energy = log(mts$spec))

When analyzing the behavior of the high energy range of spectra, it’s customary to filter away lower frequencies to remove any intrusive influence of voicing or other low frequency rumbling, which can have a huge influence on measures that summarize overall shape and energy distribution of the spectrum. This can be done with a bandpass filter before processing the sound, but here we just remove information from frequencies below 500 Hz from the spectrum.

fft_df <- fft_df[which(fft_df$hz > 500),]
mts_df <- mts_df[which(mts_df$hz > 500),]

Peak frequency

A straightforward way to summarize a spectrum is to get its peak frequency, i.e. the frequency bin with the highest energy. This can be achieved very simply with the base R function which.max().

fft_df$hz[which.max(fft_df$dB)]
[1] 5244.813
mts_df$hz[which.max(mts_df$energy)]
[1] 5671.875

There’s a difference of a few hundred Hz between the two spectral estimation methods. It is fairly typical to restrict this measure to the mid-frequency region, i.e. in the region between 3,000–7,000 Hz (see e.g. ). Since peak finding is very straightforward, we can already clearly see that the results would be identical; these spectra have their peak in the mid-frequency. If that hadn’t been the case though, looking in the mid-frequency could be achieved like this:

fft_midFreq <- fft_df[which(fft_df$hz > 3000 & fft_df$hz < 7000),]
mts_midFreq <- mts_df[which(mts_df$hz > 3000 & mts_df$hz < 7000),]
fft_midFreq$hz[which.max(fft_midFreq$dB)]
[1] 5244.813
mts_midFreq$hz[which.max(mts_midFreq$energy)]
[1] 5671.875

Computing spectral moments

We calculate spectral moments using the function moments() from the package emuR. It takes the arguments count which is our energy dimension (in whatever scale) and x which is our frequency dimension (in whatever scale). moments() simply returns a vector with four numbers, corresponding to the first four spectral moments, i.e. mean (AKA center of gravity), variance, skew, and kurtosis. Phonetics studies usually report standard deviation rather than variance, because standard deviation is on the well-known Hz scale while variance is on the Hz2 scale. For this reason, we replace the second number of the vector with its square root. We also have to set minval=TRUE because our energy scale contains negative values; if we don’t do this, the results would essentially be as if we had flipped the spectrum upside down.

Let’s try to do this for our FFT spectrum first.

library(emuR)
fft_moments <- moments(fft_df$dB, fft_df$hz, minval=TRUE)
fft_moments[2] <- sqrt(fft_moments[2])

And we’ll repeat that for our multitaper spectrum.

mts_moments <- moments(mts_df$energy, mts_df$hz, minval=TRUE)
mts_moments[2] <- sqrt(mts_moments[2])

Let’s compare the two results.

fft_moments
[1] 6527.47064546 3029.16900616   -0.03016832   -0.95492682
mts_moments
[1] 6713.23831437 2617.28439091    0.01304411   -0.48917071

There are notable differences in all spectral moments. The take home message is that the method of spectral estimation matters.

Computing DCT coefficients

Another way of summarizing spectral shape is using the coefficients of a discrete cosine transformation of the spectrum. Usually the first four coefficients are reported. This is also implemented in the emuR package, in the dct() function. dct() takes the arguments data which is our energy dimension, and m which is the number of coefficients to return. Since DCT coefficients only say something about the shape of a curve, the function doesn’t actually care about the frequency dimension. If we set m=3, the function returns a vector of four numbers, corresponding to k0, k1, k2, and k3, which reflect mean amplitude, linear slope, curvature, and strength at higher frequencies, respectively.

For our FFT spectrum, it looks like this.

fft_dct <- dct(fft_df$dB, m=3)

And for our multitaper spectrum, it looks like this.

mts_dct <- dct(mts_df$energy, m=3)

These are the results.

fft_dct
[1] -24.575433  -2.643086  -6.839182  -2.032708
mts_dct
[1] -29.3925342  -0.4104879  -1.4238712  -0.2278941

Here, as well, different methods of spectral estimation have serious implications for the results.

Bulk processing

Usually we won’t want to compute and analyze a single spectrum. If we have a research question, we’ll probably want to do this in bulk by looking at how the spectrum changes over time, or we’ll have multiple tokens of some consonant, multiple sound files, etc. Here I’ll just give two examples of how spectra can be computed and analyzed in bulk. Hopefully these examples can also be helpful for other use cases.

If we wanted to look at spectral dynamics over time in our fricative snd_sib by computing, say, spectral moments from 10 equidistant multitaper spectra of 10 ms, we could then do the following.

#number of samples in 10 ms
cs <- sr/100
#number of samples in the fricative
n_samp <- length(snd_sib)
#our number of equidistant spectra
steps <- 10
#create vector of start times for the spectra
t1 <- seq(1, n_samp-cs, length.out=steps)
#create vector of end times for the spectra
t2 <- seq(cs, n_samp, length.out=steps)
#create empty data frame with nrow=steps
mom <- data.frame(step = rep(NA, steps),
                  cog = rep(NA, steps),
                  sd = rep(NA, steps),
                  skew = rep(NA, steps),
                  kurtosis = rep(NA, steps))

for (i in 1:steps) {
  #create ith 10 ms snippet
  tmp <- snd_sib[t1[i]:t2[i]]
  #compute multitaper spectrum without plotting
  tmp_mts <- spec.mtm(tmp, nw=4, k=8, deltat=1/sr, plot=FALSE)
  tmp_mts <- data.frame(hz = tmp_mts$freq,
                        energy = log(tmp_mts$spec))
  tmp_mts <- tmp_mts[which(tmp_mts$hz > 500),]
  #compute spectral moments
  tmp_mom <- moments(count=tmp_mts$energy, x=tmp_mts$hz, minval=TRUE)
  mom$step[i] <- i
  mom$cog[i] <- tmp_mom[1]
  #compute standard deviation from variance
  mom$sd[i] <- sqrt(tmp_mom[2])
  mom$skew[i] <- tmp_mom[3]
  mom$kurtosis[i] <- tmp_mom[4]
}

This results in a data frame with time series for each spectral moments. It looks like this:

mom
   step      cog       sd        skew   kurtosis
1     1 5988.986 2881.961  0.18437379 -0.6530865
2     2 6299.798 2915.016  0.01133856 -0.7698554
3     3 6525.229 2586.294  0.20834472 -0.4085182
4     4 6822.153 2448.914  0.03136296 -0.3386035
5     5 6993.307 2568.983 -0.10483378 -0.4386231
6     6 6727.912 2865.474 -0.11990408 -0.6666871
7     7 7250.932 2500.938 -0.19279606 -0.2802282
8     8 6478.217 2872.782 -0.14174052 -0.4770977
9     9 6620.092 2701.421 -0.19581113 -0.3652390
10   10 5857.860 2772.067 -0.02317965 -0.5592586

(Hint: If you were to repeat this with FFT spectra, you’d see that the variation between adjacent spectra would be much larger).

Imagine that we had a sound file with a bunch of sibilants and for each of them we want to compute the first four DCT coefficients of multitaper spectra computed around the 10 ms midpoint. Here’s an example of a sound file with 5 [s]es and a TextGrid that marks their locations. (This is plotted using the praatpicture librar – much more information about how this works can be found here).

library(praatpicture)
praatpicture('snd/5.wav', spec_freqRange = c(0,8000),
             tg_focusTierLineType = 'solid', 
             tg_focusTierColor = 'blue')

In order to do this, we’d first load in the TextGrid using the function tg.read() from the rPraat package.

tg <- tg.read('snd/5.TextGrid')

This TextGrid has just one tier sib, and the resulting R object is a list containing information about the start and end times of each interval and the labels in them.

The rest of the operation could look like this:

#find the intervals with label 's'
s_int <- which(tg$sib$label == 's')
#how many intervals?
ns <- length(s_int)
#create empty data frame with nrow=ns
dct_coef <- data.frame(id = rep(NA, ns),
                       t1 = rep(NA, ns),
                       t2 = rep(NA, ns),
                       k0 = rep(NA, ns),
                       k1 = rep(NA, ns),
                       k2 = rep(NA, ns),
                       k3 = rep(NA, ns))

for (i in 1:ns) {
  #number of the ith interval with label 's'
  int <- s_int[i]
  #start time of the ith interval with label 's'
  t1 <- tg$sib$t1[int]
  #end time of the ith interval with label 's'
  t2 <- tg$sib$t2[int]
  #mid point of the interval
  midpoint <- t1 + (t2-t1) / 2
  
  #read in 10 ms snippet from sound file around the midpoint
  #of the ith interval with label 's'
  tmp <- snd.read('snd/5.wav', 
                  from=midpoint-0.005, to=midpoint+0.005, 
                  units='seconds')
  #downsample to 24 kHz as shown above
  tmp_24kHz <- as.vector(resamp(tmp$sig, f=tmp$fs, g=sr))
  
  #compute multitaper spectrum without plotting
  tmp_mts <- spec.mtm(tmp_24kHz, nw=4, k=8, deltat=1/sr, plot=FALSE)
  tmp_mts <- data.frame(hz = tmp_mts$freq,
                        energy = log(tmp_mts$spec))
  tmp_mts <- tmp_mts[which(tmp_mts$hz > 500),]
  #get 4 DCT coefficients as shown above
  tmp_dct <- dct(tmp_mts$energy, m=3)
  #fill in data frame
  dct_coef$id[i] <- i
  dct_coef$t1[i] <- t1
  dct_coef$t2[i] <- t2
  dct_coef$k0[i] <- tmp_dct[1]
  dct_coef$k1[i] <- tmp_dct[2]
  dct_coef$k2[i] <- tmp_dct[3]
  dct_coef$k3[i] <- tmp_dct[4]
}

We now have a data frame dct_coef with spectral DCT coefficients computed from the midpoint of each sibilant:

dct_coef
  id        t1        t2        k0        k1          k2        k3
1  1 0.1142821 0.2950639 -27.22306 -1.674568 -0.44917079 0.6753327
2  2 0.8182295 0.9670684 -26.86305 -1.926533 -0.10199791 0.5209929
3  3 1.7678831 1.8943365 -27.11824 -1.864177 -0.60113986 0.8022334
4  4 2.5671831 2.7023670 -26.85385 -1.826578 -0.39936280 0.8335437
5  5 3.2914200 3.4253128 -28.32103 -1.821612 -0.03566683 0.5755081

These are both relatively simple cases, involving multiple tokens of multiple ‘snippets’ from a single sound file. Often, we’ll be interested in analyzing potentially multiple tokens from multiple sound files. To enable such a workflow, we’re probably better off turning parts of the analysis into a function. Here I define a function getSpecPeak for grabbing the mid-frequency peak from the mid-point of each token with a certain TextGrid label, allowing users to define the analysis window around the midpoint, how to resample, and how to define mid-frequency regions.

getSpecPeak <- function(filename, tierName, tgMatch, resamp = 24000,
                        analysisWindow = 0.01, midFreqRegion = c(3000, 7000)) {
  
  # Read in TextGrid with flexible filename
  tg <- tg.read(paste0(filename, '.TextGrid'), encoding = 'auto')
  # Find matches in TextGrid
  match_int <- which(tg[[tierName]]$label == tgMatch)
  # How many matches in TextGrid?
  n_match <- length(match_int)
  # Create empty data frame
  spec_peak <- data.frame(id = rep(NA, n_match),
                          t1 = rep(NA, n_match),
                          t2 = rep(NA, n_match),
                          following = rep(NA, n_match),
                          peak = rep(NA, n_match))
  # Loop through TextGrid matches
  for (i in 1:n_match) {
    # Keep identifier
    int <- match_int[i]
    # Keep start and end times
    t1 <- tg[[tierName]]$t1[int]
    t2 <- tg[[tierName]]$t2[int]
    # Get the following TextGrid label (probably of interest)
    following <- tg[[tierName]]$label[int + 1]
    # Get midpoint timestamp
    midpoint <- t1 + (t2-t1) / 2
    
    # Read in sound analysis window of sound file
    tmp <- snd.read(paste0(filename, '.wav'), 
                    from=midpoint-(analysisWindow/2), 
                    to=midpoint+(analysisWindow/2), 
                    units='seconds')
    # Resample using user input
    sr <- tmp$fs
    tmp <- as.vector(resamp(tmp$sig, f=tmp$fs, g=resamp))
    
    # Generate multitaper spectrum and convert to data frame
    tmp_mts <- spec.mtm(tmp, nw=4, k=8, deltat=1/resamp, plot=FALSE)
    tmp_mts <- data.frame(hz = tmp_mts$freq,
                          energy = log(tmp_mts$spec))
    # Filter spectrum to mid frequency region
    tmp_mts <- tmp_mts[which(tmp_mts$hz > midFreqRegion[1] &
                               tmp_mts$hz < midFreqRegion[2]),]
    # Find peak
    tmp_peak <- tmp_mts[which.max(tmp_mts$energy),'hz']
    # Populate dummy data frame
    spec_peak$id[i] <- i
    spec_peak$t1[i] <- t1
    spec_peak$t2[i] <- t2
    spec_peak$following[i] <- following
    spec_peak$peak[i] <- tmp_peak
  }
  return(spec_peak)
}

This function can now be used on a single file:

getSpecPeak(filename = 'snd/1', tierName = 'sib', tgMatch = 's')
  id         t1        t2 following     peak
1  1 0.07544516 0.2164158           5109.375

Very simple output as there’s only one /s/ in this sound file! But having a function also makes it easier to batch process multiple sound files. We just need a list of files that we want to process, and then we can loop over them, like so:

# list all files with .wav extension in the snd folder
wavFiles <- list.files(path = 'snd', pattern = '.wav', full.names = TRUE)

# loop over them
for (i in 1:length(wavFiles)) {
  # get versions of file names without .wav extension
  bare_filename <- gsub('.wav', '', wavFiles[i])
  
  # if this is the first run through the loop, initiate data frame
  if (i == 1) {
    specPeakBulk <- getSpecPeak(bare_filename, tierName = 'sib', tgMatch = 's')
    specPeakBulk$file <- bare_filename
  # if this is not the first run through the loop, create new temporary 
  # data frame and combine at the end
  } else {
    tmp <- getSpecPeak(bare_filename, tierName = 'sib', tgMatch = 's')
    tmp$file <- bare_filename
    specPeakBulk <- rbind(specPeakBulk, tmp)
  }
}

These are the results:

specPeakBulk
  id         t1        t2 following     peak  file
1  1 0.07544516 0.2164158           5109.375 snd/1
2  1 0.11428206 0.2950639           6890.625 snd/5
3  2 0.81822953 0.9670684           6890.625 snd/5
4  3 1.76788311 1.8943365           6656.250 snd/5
5  4 2.56718306 2.7023670           6984.375 snd/5
6  5 3.29142004 3.4253128           6750.000 snd/5

More complex methods and workflows

When we generate spectra in R and store them in R objects, it means we also have access to all the complex statistical methods that are available in R. In addition to parameterizing the spectra with one or a few numbers as we’ve done here, we can use them as inputs or dependent variables in advanced statistical methods. I’ll briefly mention a few here, but won’t demonstrate the methods here. One such approach is functional principal component analysis (FPCA), a dimensionality reduction method that takes as input a potentially large number of functions or curves, and can be used to summarize their principal modes of variation. The output would then be 1) the average spectrum, 2) a small number of shapes (principal components) that summarize ways in which the input spectra differ from the average, and 3) for each input spectrum, scores indicating how much it matches each of the principal components. Puggaard-Rode () uses this approach to analyze regional variation in /t/ midpoint spectra in Danish (the code is laid out here).

An approach for integrating both the frequency and time domains using function-on-scalar regression is demonstrated by Puggaard-Rode (). This involves smoothing over both the time and frequency domain to estimate differences in time-varying spectral shape in different conditions.

References

Kirby, James, Pittayawat Pittayaporn, and Marc Brunelle. 2022. “Transphonologization of Onset Voicing: Revisiting Northern and Eastern Kmhmu.” Phonetica 79 (6): 591–629. https://doi.org/10.1515/phon-2022-0029.
Koenig, Laura L., Christine H. Shadle, Jonathan L. Preston, and Christine R. Mooshammer. 2013. “Toward Improved Spectral Measures of /s/. Results from Adolescents.” Journal of Speech, Language, and Hearing Research 56 (4): 1175–89. https://doi.org/10.1044/1092-4388(2012/12-0038).
Puggaard-Rode, Rasmus. 2022. “Analyzing Time-Varying Spectral Characteristics of Speech with Function-on-Scalar Regression.” Journal of Phonetics 95. https://doi.org/10.1016/j.wocn.2022.101191.
———. 2023. “The /t/ Release in Jutland Danish. Decomposing the Spectrum with Functional PCA.” In Proceedings of the 20th International Congress of Phonetic Sciences, edited by Radek Skarnitzl and Jan Volín, 3262–66. Prague: Guarant.
Reidy, Patrick. 2013. “An Introduction to Random Processes for the Spectral Analysis of Speech Data.” Ohio State University Working Papers in Linguistics 60: 67–116.
———. 2015. “A Comparison of Spectral Estimation Methods for the Analysis of Sibilant Fricatives.” JASA Express Letters 137 (4): 248–54. https://doi.org/10.1121/1.4915064.

Reuse

CC-BY-SA 4.0