PhonPhon, Aarhus University
8/6/2023

What is voice onset time?

  • The relative time between the release burst of a stop \(rel_t\) and the onset of vocal fold vibration \(vo_t\).
  • Calculated as \(vo_t - rel_t\).
  • When stops are voiceless, \(rel_t\) precedes \(vo_t\) and this results in a positive number.
  • When stops are pre-voiced, \(vo_t\) precedes \(rel_t\) and this results in a negative number.
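  • As a minimal sketch of the arithmetic in R, using the hand-measured landmark times from the examples on the next slides:
rel_t <- 0.0696; vo_t <- 0.1201   # voiceless stop: release precedes voicing
(vo_t - rel_t) * 1000             # positive VOT, ca. 50.5 ms
rel_t <- 0.1094; vo_t <- 0.0465   # pre-voiced stop: voicing precedes release
(vo_t - rel_t) * 1000             # negative VOT, ca. -62.9 ms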

What is voice onset time?

What is voice onset time?

  • \(rel_t\) precedes \(vo_t\).
  • \(vo_t=\) 0.12

  • \(vo_t\) precedes \(rel_t\).
  • \(vo_t=\) 0.047

What is voice onset time?

  • \(rel_t\) precedes \(vo_t\).
  • \(vo_t=\) 0.12
  • \(rel_t=\) 0.07
  • 120.1 \(-\) 69.6 \(=\) 50.5 ms.

  • \(vo_t\) precedes \(rel_t\).
  • \(vo_t=\) 0.047
  • \(rel_t=\) 0.109
  • 46.5 \(-\) 109.4 \(=\) -62.9 ms.

Why is VOT interesting?

Measuring VOT

  • VOT is relatively easy to measure.
    • It is usually ‘hand-measured’ in e.g. Praat by finding and marking landmarks in the waveform or spectrogram.
  • This is a tedious task.
  • It involves repeating the same relatively clearly defined task ad nauseam.

Measuring VOT

  • VOT is relatively easy to measure.
    • It is usually ‘hand-measured’ in e.g. Praat by finding and marking landmarks in the waveform or spectrogram.
  • This is a tedious task.
  • It involves repeating the same relatively clearly defined task ad nauseam.
    • Academics are generally clever but slow.
    • This is a task for someone who’s stupid and fast.
    • This is what computers are for.

Predicting VOT automatically

  • There are already programs on the market that implement automatic VOT prediction.
  • AutoVOT predicts positive VOT using a neural network that mimics human annotators.
    • Cannot be used with Windows.
    • Has a nice Praat plugin.
    • Scales best with a lot of representative training data.

Predicting VOT automatically

  • There are already programs on the market that implement automatic VOT prediction.
  • AutoVOT predicts positive VOT using a neural network that mimics human annotators.
    • Cannot be used with Windows.
    • Has a nice Praat plugin.
    • Scales best with a lot of representative training data.
  • Dr.VOT predicts positive and negative VOT in a similar way.
    • Cannot be used with Windows.
    • Can only be used from the command line.
    • Not clear how to add new training data.
    • Sound files can only contain a single syllable.

Predicting VOT automatically

  • getVOT primarily uses base R but optionally uses functionality from the acoustic analysis library phonTools.
  • It should run on all operating systems.
  • Only runs in R, but should be relatively straightforward to use.
  • It is not clever – there’s no underlying neural network.
    • This may result in decreased performance.
    • It also means that there’s no black box – we know exactly what’s going on under the hood.
  • Training data helps, but getVOT usually scales quite well even with minimal training data.
  • A bit of an intellectual exercise / coding exercise.

Using the squiggly line…

  • How I used to think of annotating VOT:
    • I need to locate the burst or pre-voicing in the squiggly line.

…which is just a vector of numbers!

  • That squiggly line is just a bunch of amplitude measures.
snd_vl <- rPraat::snd.read('vl/firkant.wav', 
                           from=0.23, to=0.43, units='seconds')
length(snd_vl$sig)
## [1] 9600
head(snd_vl$sig[,1], 20)
##  [1]  0.059907834  0.054414502  0.052522355  0.051759392  0.051026948
##  [6]  0.050019837  0.047059542  0.040803247  0.038636433  0.038666952
## [11]  0.036530656  0.031434065  0.025788141  0.020477920  0.014435255
## [16]  0.009979553  0.004791406 -0.001007111 -0.004425184 -0.005798517

…which is just a vector of numbers!

  • That squiggly line is just a bunch of amplitude measures.
  • With a sound file already loaded in R, plotting a waveform is remarkably simple.
plot(snd_vl$sig, type='l')

…which is just a vector of numbers!

  • That squiggly line is just a bunch of amplitude measures.
  • With a sound file already loaded in R, plotting a waveform is remarkably simple.
  • If we can find and annotate VOT relatively easily in the squiggly line…
    • …it should be possible to locate it in the vector of numbers.
  • And if we find VOT using consistent landmarks…
    • …it should be possible to find it in a vector of numbers using a simple procedure.

The procedure: positive VOT

  • First: find the closure.
  • Look for the interval of \(d_{clo}\) ms (e.g. 10) in the first half of the sound file with the lowest mean absolute amplitude.
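  • A minimal base R sketch of this search (find_closure is a hypothetical helper, and the non-overlapping step size is an assumption; getVOT’s actual implementation may differ):
# sig: vector of amplitude measures; sr: sample rate; d_clo: window size in ms
find_closure <- function(sig, sr, d_clo=10) {
  win <- round(sr * d_clo / 1000)                # window length in samples
  half <- floor(length(sig) / 2)                 # search the first half only
  starts <- seq(1, half - win, by=win)
  means <- sapply(starts, function(s) mean(abs(sig[s:(s + win - 1)])))
  starts[which.min(means)]                       # start of the quietest window
}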

The procedure: positive VOT

  • That’s this interval.
  • We look for a release after this interval.

The procedure: positive VOT

  • Look for the first interval of 1 ms after the closure where the maximum amplitude is higher than \(1/p_{rel}\) (e.g. \(p_{rel}=\) 15) of the overall maximum amplitude in the sound.
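  • A corresponding sketch (find_release is hypothetical; clo_end stands for the sample where the closure window ends):
find_release <- function(sig, sr, clo_end, p_rel=15) {
  win <- round(sr / 1000)                        # 1 ms in samples
  thresh <- max(abs(sig)) / p_rel                # amplitude threshold
  starts <- seq(clo_end, length(sig) - win, by=win)
  peaks <- sapply(starts, function(s) max(abs(sig[s:(s + win - 1)])))
  starts[which(peaks > thresh)[1]]               # first window above threshold
}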

The procedure: positive VOT

The highest amplitude starts here:

The procedure: positive VOT

  • The first interval to have maximum amplitude above \(1/p_{rel}\) of the overall maximum amplitude is this one:

The procedure: positive VOT

  • Locating the onset of voicing is less straightforward.
  • By default, we use the autocorrelation function of measures within intervals of the wave of length \(d_{vo}\) ms (e.g. 1).
    • I.e., within a 1 ms sound snippet, how correlated is each measure on average with the one that comes after?
    • Stop releases are noisy, so we do not expect consecutive measures to be highly correlated.
    • Voiced sounds such as vowels are periodic, so we expect consecutive measures to be highly correlated.
    • The voicing onset is estimated to be the first interval where autocorrelation is on average above \(p_{vo}\) % of the most autocorrelated interval in the sound.
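  • A sketch of this search (find_voicing is hypothetical, and the exact way the mean ACF is computed is an assumption):
find_voicing <- function(sig, sr, from, d_vo=1, p_vo=0.85) {
  win <- round(sr * d_vo / 1000)                 # window length in samples
  starts <- seq(from, length(sig) - win, by=win)
  mean_acf <- sapply(starts, function(s) {
    a <- acf(sig[s:(s + win - 1)], plot=FALSE)   # autocorrelation function
    mean(a$acf[-1])                              # mean ACF, excluding lag 0
  })
  starts[which(mean_acf > p_vo * max(mean_acf))[1]]
}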

The procedure: positive VOT

Our intervals look like this:

The procedure: positive VOT

  • The interval with the highest average ACF is this one:

The procedure: positive VOT

  • If we zoom way in, that interval looks like this.
    • It’s quite a smooth line!

The procedure: positive VOT

  • If we plot the autocorrelation of that smooth interval, it looks like this:

The procedure: positive VOT

  • The interval with the lowest average ACF is this one:

The procedure: positive VOT

  • If we zoom way in, that interval looks like this.
    • It’s very jagged!

The procedure: positive VOT

  • If we plot the autocorrelation of that jagged interval, it looks like this:

The procedure: positive VOT

  • The first interval where mean ACF comes within 85% of the most autocorrelated interval is this one:

The procedure: positive VOT

  • This was our hand-measured VOT:

The procedure: positive VOT

  • And this is the VOT that we just predicted:

The procedure: positive VOT

  • Pretty good, right?
  • Using autocorrelation to predict voicing onset usually gives the best results, but it rarely works for lower-quality audio.
  • An alternative is using the pitch tracking algorithm from the phonTools package.
    • In this case, voicing onset is simply predicted as the first tracked pitch after the release.
  • Using amplitude to predict burst location is usually effective, but also won’t work if there’s too much background noise.
    • In this case, burst location can be predicted with the help of phonTools::pitchtrack.
    • We first find a stretch of consistently tracked pitch and then search for a burst only in the 200 ms preceding it.
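  • A sketch of the pitch-tracking fallback for voicing onset (the time column of the pitchtrack output and the variable rel_ms are assumptions on my part):
pt <- phonTools::pitchtrack(sig, fs=sr, windowlength=30,
                            minacf=0.5, show=FALSE)
vo_ms <- pt$time[pt$time > rel_ms][1]   # first tracked pitch after the release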

The procedure: negative VOT

  • When searching for negative VOT, we start as before:
    • In the first half of the sound, look for the most silent interval of duration \(d_{clo}\) (here, 10 ms).

The procedure: negative VOT

  • When searching for negative VOT, we start as before:
    • In the first half of the sound, look for the most silent interval of duration \(d_{clo}\) (here, 10 ms).

The procedure: negative VOT

  • Next, we partially invert the process.
  • First, we look for voicing using the ACF (or, alternatively, pitch tracking).
    • Which interval of duration \(d_{vo}\) (here, 1.2 ms) is the first to rise above
      \(p_{vo}\) % (here, 90%) of the most correlated interval?

The procedure: negative VOT

  • Next, we partially invert the process.
  • First, we look for voicing using the ACF (or, alternatively, pitch tracking).
    • Which interval of duration \(d_{vo}\) (here, 1.2 ms) is the first to rise above
      \(p_{vo}\) % (here, 90%) of the most correlated interval?
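  • Reusing the hypothetical helpers sketched for the positive procedure, the first two steps might look like this:
clo <- find_closure(sig, sr, d_clo=10)     # most silent 10 ms window
vo  <- find_voicing(sig, sr, from=clo,     # voicing onset after the closure
                    d_vo=1.2, p_vo=0.9)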

The procedure: negative VOT

  • Finding the release of a pre-voiced stop is rather more difficult.
  • The most precise method (but also the most fickle) is searching for the transient phase.
    • Immediately at the time of a stop release, the compressed air in the oral tract is discharged.
    • This results in a fairly even distribution of energy throughout the spectrum (see e.g. Stevens 1993).
    • Such smooth, even spectra are otherwise not found in speech.
    • They’re tricky to find automatically as they’re very short!

The procedure: negative VOT

  • Using phonTools, we generate spectra over 1 ms windows in steps of 10 samples.
  • Spectral smoothness is calculated by taking the standard deviation of the differences between adjacent amplitude measures in each spectrum.
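  • As a sketch, the smoothness measure itself is a single line of base R (spec_amp stands for the amplitude values of one spectrum):
spectral_smoothness <- function(spec_amp) {
  sd(diff(spec_amp))   # low SD of adjacent differences = smooth spectrum
}
  • The predicted release is then the location of the smoothest spectrum.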

The procedure: negative VOT

  • For reference, this is an example of a rather jagged spectrum.

The procedure: negative VOT

  • In general, the difference between adjacent amplitude measures is relatively high.
  • Let’s plot this!

The procedure: negative VOT

  • On the other hand, this is the smoothest spectrum:

The procedure: negative VOT

  • This is immediately clear when plotting the difference between adjacent measures.

The procedure: negative VOT

  • This is the location of that very smooth spectrum:

The procedure: negative VOT

  • When it works, this method for finding the release of voiced stops is very precise.
  • It’s fickle though, and only works when the sound quality is rather good.
  • An alternative method is looking for sudden amplitude increases.

The procedure: negative VOT

  • We note the maximum amplitude in 1 ms windows.

The procedure: negative VOT

  • This results in a time series that looks like this:

The procedure: negative VOT

  • This is actually really difficult to use, because it’s too jagged.
    • So we smooth this time series using the discrete cosine transform (DCT).

The procedure: negative VOT

  • The predicted stop release is when we see the highest velocity in this time series.
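  • Pulling the last three slides together as a base R sketch (the plain DCT smoother below is one possible implementation, not necessarily getVOT’s):
win <- round(sr / 1000)                          # 1 ms windows
starts <- seq(1, length(sig) - win, by=win)
peaks <- sapply(starts, function(s) max(abs(sig[s:(s + win - 1)])))
dct_smooth <- function(x, k=10) {                # fit with the first k DCT-II terms
  N <- length(x); n <- seq_len(N) - 0.5
  basis <- sapply(0:(k - 1), function(j) cos(pi * j * n / N))
  coef <- as.vector(t(basis) %*% x) * (2 / N)
  coef[1] <- coef[1] / 2                         # scale the DC term
  as.vector(basis %*% coef)                      # smoothed time series
}
rel_win <- which.max(diff(dct_smooth(peaks)))    # steepest rise = predicted release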

The procedure: negative VOT

  • In our sound wave, it looks like this:

The procedure: negative VOT

  • This was our hand-measured VOT:

The procedure: negative VOT

  • This is the VOT we predicted using the transient method:

The procedure: negative VOT

  • And this is the VOT we predicted using the amplitude method:

Parameter settings

  • This is implemented in the getVOT functions positiveVOT and negativeVOT.
  • These take the rather obscure arguments we’ve discussed throughout, and some we haven’t.
args(positiveVOT)
## function (sound, sr, closure_interval = 10, release_param = 15, 
##     vo_method = "acf", vo_granularity = 1, vo_param = 0.85, f0_wl = 30, 
##     f0_minacf = 0.5, burst_only = FALSE, f0_first = FALSE, plot = TRUE, 
##     params_list = NULL) 
## NULL
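  • For example, reusing the clip loaded earlier (48000 Hz follows from its 9600 samples over 0.2 seconds):
pos <- positiveVOT(sound=snd_vl$sig[,1], sr=48000, plot=FALSE)
# presumably returns predicted landmarks, cf. the negativeVOT() output later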

Parameter settings

  • To help set these, getVOT has the functions neg_setParams and pos_setParams.
  • Let’s try this out on our voiced stop.
negParams <- neg_setParams(directory='voi', plot=F)

[1] "Testing voicing onset parameter settings for baa.wav"
[1] "On average, the selected parameters for voicing onset after a first pass agree with the training data within a margin of 0.285 ms"
[1] "Finetuning voicing onset parameter settings for baa.wav"
[1] "On average, the selected voicing onset parameters after finetuning agree with the training data within a margin of 0.092 ms"
[1] "Testing accuracy of different burst detection methods baa.wav"
[1] "On average, the select burst detection method agreewith the training data within a margin of 0.507 ms"

Parameter settings

  • We could’ve gotten slightly better results for this one stop with different parameters.
  • Usually the setParams functions should be used with 5-10 files.
    • The returned list of parameters then minimizes the difference between hand-annotation and prediction.
  • plot_training_diff will visualize the comparison between hand-annotated and predicted VOT.

Parameter settings

plot_training_diff(directory='pos_training', params_list=plotParams)

Parameter settings and cats

plot_training_diff(directory='pos_training', params_list=plotParams, 
                   cat=TRUE)

From R to Praat

  • Okay, so the functions positiveVOT and negativeVOT can predict VOT for individual sound files, but how is this useful?
negativeVOT(sound[,1], 44100, params_list=negParams, plot=FALSE)
## $vo
## [1] 2050.95
## 
## $rel
## [1] 4849.35
## 
## $vot
## [1] -63.5
## 
## $voi_int
## [1] 252

From R to Praat

  • Probably the two most important functions of getVOT are:
  • VOT2newTG
    • For sound files containing a single word, will generate new TextGrids with predicted VOT.
  • addVOT2TG
    • For longer sound files with existing TextGrids, will add a new tier with predicted VOT.

From R to Praat

VOT2newTG(directory='voi',
          sign='negative',
          neg_params_list=negParams)
addVOT2TG(directory='DanPASS_mono',
          seg_list=c('p', 't', 'k'),
          tg_tier='lydskrift',
          sign='positive')

Let’s try it out!

devtools::install_github('rpuggaardrode/getVOT')
VOT2newTG(directory='YOUR DIRECTORY NAME HERE', sign='positive')
  • If results are poor, try the following:
MY_PARAMS <- pos_setParams(directory='YOUR DIRECTORY NAME HERE')
VOT2newTG(directory='YOUR DIRECTORY NAME HERE', sign='positive',
          pos_params_list=MY_PARAMS)  # pos_params_list assumed by analogy with neg_params_list