PhonPhon, Aarhus University
8/6/2023

What is voice onset time?

  • The relative time between the release burst of a stop \(rel_t\) and the onset of vocal fold vibration \(vo_t\).
  • Calculated as \(vo_t - rel_t\).
  • When stops are voiceless, \(rel_t\) precedes \(vo_t\) and this results in a positive number.
  • When stops are pre-voiced, \(vo_t\) precedes \(rel_t\) and this results in a negative number.
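  • As a minimal sketch of the arithmetic in R, using the hand-measured landmark times from the examples on the next slides:
rel_t <- 0.0696; vo_t <- 0.1201   # voiceless stop: release precedes voicing
(vo_t - rel_t) * 1000             # positive VOT, ca. 50.5 ms
rel_t <- 0.1094; vo_t <- 0.0465   # pre-voiced stop: voicing precedes release
(vo_t - rel_t) * 1000             # negative VOT, ca. -62.9 ms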

What is voice onset time?

What is voice onset time?

  • \(rel_t\) precedes \(vo_t\).
  • \(vo_t=\) 0.12

  • \(vo_t\) precedes \(rel_t\).
  • \(vo_t=\) 0.047

What is voice onset time?

  • \(rel_t\) precedes \(vo_t\).
  • \(vo_t=\) 0.12
  • \(rel_t=\) 0.07
  • 120.1 \(-\) 69.6 \(=\) 50.5 ms.

  • \(vo_t\) precedes \(rel_t\).
  • \(vo_t=\) 0.047
  • \(rel_t=\) 0.109
  • 46.5 \(-\) 109.4 \(=\) -62.9 ms.

Why is VOT interesting?

Measuring VOT

  • VOT is relatively easy to measure.
    • It is usually ‘hand-measured’ in e.g. Praat by finding and marking landmarks in the waveform or spectrogram.
  • This is a tedious task.
  • It involves repeating the same relatively clearly defined task ad nauseam.

Measuring VOT

  • VOT is relatively easy to measure.
    • It is usually ‘hand-measured’ in e.g. Praat by finding and marking landmarks in the waveform or spectrogram.
  • This is a tedious task.
  • It involves repeating the same relatively clearly defined task ad nauseam.
    • Academics are generally clever but slow.
    • This is a task for someone who’s stupid and fast.
    • This is what computers are for.

Predicting VOT automatically

  • There are already programs on the market that implement automatic VOT prediction.
  • AutoVOT predicts positive VOT using a neural network that mimics human annotators.
    • Cannot be used with Windows.
    • Has a nice Praat plugin.
    • Scales best with a lot of representative training data.

Predicting VOT automatically

  • There are already programs on the market that implement automatic VOT prediction.
  • AutoVOT predicts positive VOT using a neural network that mimics human annotators.
    • Cannot be used with Windows.
    • Has a nice Praat plugin.
    • Scales best with a lot of representative training data.
  • Dr.VOT predicts positive and negative VOT in a similar way.
    • Cannot be used with Windows.
    • Can only be used from the command line.
    • Not clear how to add new training data.
    • Sound files can only contain a single syllable.

Predicting VOT automatically

  • getVOT primarily uses base R but optionally uses functionality from the acoustic analysis library phonTools.
  • It should run on all operating systems.
  • Only runs in R, but should be relatively straightforward to use.
  • It is not clever – there’s no underlying neural network.
    • This may result in decreased performance.
    • It also means that there’s no black box – we know exactly what’s going on under the hood.
  • Training data helps, but getVOT usually scales quite well even with minimal training data.
  • A bit of an intellectual exercise / coding exercise.

Using the squiggly line…

  • How I used to think of annotating VOT:
    • I need to locate the burst or pre-voicing in the squiggly line.

…which is just a vector of numbers!

  • That squiggly line is just a bunch of amplitude measures.
snd_vl <- rPraat::snd.read('vl/firkant.wav', 
                           from=0.23, to=0.43, units='seconds')
length(snd_vl$sig)
## [1] 9600
head(snd_vl$sig[,1], 20)
##  [1]  0.059907834  0.054414502  0.052522355  0.051759392  0.051026948
##  [6]  0.050019837  0.047059542  0.040803247  0.038636433  0.038666952
## [11]  0.036530656  0.031434065  0.025788141  0.020477920  0.014435255
## [16]  0.009979553  0.004791406 -0.001007111 -0.004425184 -0.005798517

…which is just a vector of numbers!

  • That squiggly line is just a bunch of amplitude measures.
  • With a sound file already loaded in R, plotting a waveform is remarkably simple.
plot(snd_vl$sig, type='l')

…which is just a vector of numbers!

  • That squiggly line is just a bunch of amplitude measures.
  • With a sound file already loaded in R, plotting a waveform is remarkably simple.
  • If we can find and annotate VOT relatively easily in the squiggly line…
    • …it should be possible to locate it in the vector of numbers.
  • And if we find VOT using consistent landmarks…
    • …it should be possible to find it in a vector of numbers using a simple procedure.

The procedure: positive VOT

  • First: find the closure.
  • Look for the interval of \(d_{clo}\) ms (e.g. 10) in the first half of the sound file with the lowest mean absolute amplitude.
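  • A minimal base R sketch of this search (find_closure is a hypothetical helper, and the non-overlapping step size is an assumption; getVOT’s actual implementation may differ):
# sig: vector of amplitude measures; sr: sample rate; d_clo: window size in ms
find_closure <- function(sig, sr, d_clo=10) {
  win <- round(sr * d_clo / 1000)                # window length in samples
  half <- floor(length(sig) / 2)                 # search the first half only
  starts <- seq(1, half - win, by=win)
  means <- sapply(starts, function(s) mean(abs(sig[s:(s + win - 1)])))
  starts[which.min(means)]                       # start of the quietest window
}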

The procedure: positive VOT

  • That’s this interval.
  • We look for a release after this interval.

The procedure: positive VOT

  • Look for the first interval of 1 ms after the closure where the maximum amplitude is higher than \(1/p_{rel}\) (e.g. \(p_{rel}=\) 15) of the overall maximum amplitude in the sound.
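  • A corresponding sketch (find_release is hypothetical; clo_end stands for the sample where the closure window ends):
find_release <- function(sig, sr, clo_end, p_rel=15) {
  win <- round(sr / 1000)                        # 1 ms in samples
  thresh <- max(abs(sig)) / p_rel                # amplitude threshold
  starts <- seq(clo_end, length(sig) - win, by=win)
  peaks <- sapply(starts, function(s) max(abs(sig[s:(s + win - 1)])))
  starts[which(peaks > thresh)[1]]               # first window above threshold
}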

The procedure: positive VOT

The highest amplitude starts here:

The procedure: positive VOT

  • The first interval to have maximum amplitude above \(1/p_{rel}\) of the overall maximum amplitude is this one:

The procedure: positive VOT

  • Locating the onset of voicing is less straightforward.
  • By default, we use the autocorrelation function of measures within intervals of the wave of length \(d_{vo}\) ms (e.g. 1).
    • I.e., within a 1 ms sound snippet, how correlated is each measure on average with the one that comes after?
    • Stop releases are noisy, so we do not expect consecutive measures to be highly correlated.
    • Voiced sounds such as vowels are periodic, so we expect consecutive measures to be highly correlated.
    • The voicing onset is estimated to be the first interval where autocorrelation is on average above \(p_{vo}\) % of the most autocorrelated interval in the sound.
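  • A sketch of this search (find_voicing is hypothetical, and the exact way the mean ACF is computed is an assumption):
find_voicing <- function(sig, sr, from, d_vo=1, p_vo=0.85) {
  win <- round(sr * d_vo / 1000)                 # window length in samples
  starts <- seq(from, length(sig) - win, by=win)
  mean_acf <- sapply(starts, function(s) {
    a <- acf(sig[s:(s + win - 1)], plot=FALSE)   # autocorrelation function
    mean(a$acf[-1])                              # mean ACF, excluding lag 0
  })
  starts[which(mean_acf > p_vo * max(mean_acf))[1]]
}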

The procedure: positive VOT

Our intervals look like this:

The procedure: positive VOT

  • The interval with the highest average ACF is this one:

The procedure: positive VOT

  • If we zoom way in, that interval looks like this.
    • It’s quite a smooth line!

The procedure: positive VOT

  • If we plot the autocorrelation of that smooth interval, it looks like this:

The procedure: positive VOT

  • The interval with the lowest average ACF is this one:

The procedure: positive VOT

  • If we zoom way in, that interval looks like this.
    • It’s very jagged!

The procedure: positive VOT

  • If we plot the autocorrelation of that jagged interval, it looks like this:

The procedure: positive VOT

  • The first interval where mean ACF comes within 85% of the most autocorrelated interval is this one:

The procedure: positive VOT

  • This was our hand-measured VOT:

The procedure: positive VOT

  • And this is the VOT that we just predicted:

The procedure: positive VOT

  • Pretty good, right?
  • Using autocorrelation to predict voicing onset usually gives the best results, but it rarely works for lower-quality audio.
  • An alternative is using the pitch tracking algorithm from the phonTools package.
    • In this case, voicing onset is simply predicted as the first tracked pitch after the release.
  • Using amplitude to predict burst location is usually effective, but also won’t work if there’s too much background noise.
    • In this case, burst location can be predicted with the help of phonTools::pitchtrack.
    • We first find a stretch of consistently tracked pitch and then search for a burst only in the 200 ms preceding it.
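  • A sketch of the pitch-tracking fallback for voicing onset (the time column of the pitchtrack output and the variable rel_ms are assumptions on my part):
pt <- phonTools::pitchtrack(sig, fs=sr, windowlength=30,
                            minacf=0.5, show=FALSE)
vo_ms <- pt$time[pt$time > rel_ms][1]   # first tracked pitch after the release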

The procedure: negative VOT

  • When searching for negative VOT, we start as before:
    • In the first half of the sound, look for the most silent interval of duration \(d_{clo}\) (here, 10 ms).

The procedure: negative VOT

  • When searching for negative VOT, we start as before:
    • In the first half of the sound, look for the most silent interval of duration \(d_{clo}\) (here, 10 ms).

The procedure: negative VOT

  • Next, we partially invert the process.
  • First, we look for voicing using the ACF (or, alternatively, pitch tracking).
    • Which interval of duration \(d_{vo}\) (here, 1.2 ms) is the first to rise above
      \(p_{vo}\) % (here, 90%) of the most correlated interval?

The procedure: negative VOT

  • Next, we partially invert the process.
  • First, we look for voicing using the ACF (or, alternatively, pitch tracking).
    • Which interval of duration \(d_{vo}\) (here, 1.2 ms) is the first to rise above
      \(p_{vo}\) % (here, 90%) of the most correlated interval?
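  • Reusing the hypothetical helpers sketched for the positive procedure, the first two steps might look like this:
clo <- find_closure(sig, sr, d_clo=10)     # most silent 10 ms window
vo  <- find_voicing(sig, sr, from=clo,     # voicing onset after the closure
                    d_vo=1.2, p_vo=0.9)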

The procedure: negative VOT

  • Finding the release of a pre-voiced stop is rather more difficult.
  • The most precise method (but also the most fickle) is searching for the transient phase.
    • Immediately at the time of a stop release, the compressed air in the oral tract is discharged.
    • This results in a fairly even distribution of energy throughout the spectrum (see e.g. Stevens 1993).
    • Such smooth, even spectra are otherwise not found in speech.
    • They’re tricky to find automatically as they’re very short!

The procedure: negative VOT

  • Using phonTools, we generate spectra over 1 ms windows in steps of 10 samples.
  • Spectral smoothness is calculated by taking the standard deviation of the differences between adjacent amplitude measures in each spectrum.
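  • As a sketch, the smoothness measure itself is a single line of base R (spec_amp stands for the amplitude values of one spectrum):
spectral_smoothness <- function(spec_amp) {
  sd(diff(spec_amp))   # low SD of adjacent differences = smooth spectrum
}
  • The predicted release is then the location of the smoothest spectrum.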

The procedure: negative VOT

  • For reference, this is an example of a rather jagged spectrum.

The procedure: negative VOT

  • In general, the difference between adjacent amplitude measures is relatively high.
  • Let’s plot this!

The procedure: negative VOT

  • On the other hand, this is the smoothest spectrum:

The procedure: negative VOT

  • This is immediately clear when plotting the difference between adjacent measures.

The procedure: negative VOT

  • This is the location of that very smooth spectrum:

The procedure: negative VOT

  • When it works, this method for finding the release of voiced stops is very precise.
  • It’s fickle though, and only works when the sound quality is rather good.
  • An alternative method is looking for sudden amplitude increases.

The procedure: negative VOT

  • We note the maximum amplitude in 1 ms windows.

The procedure: negative VOT

  • This results in a time series that looks like this:

The procedure: negative VOT

  • This is actually really difficult to use, because it’s too jagged.
    • So we smooth this time series using the discrete cosine transform (DCT).

The procedure: negative VOT

  • The predicted stop release is when we see the highest velocity in this time series.
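  • Pulling the last three slides together as a base R sketch (the plain DCT smoother below is one possible implementation, not necessarily getVOT’s):
win <- round(sr / 1000)                          # 1 ms windows
starts <- seq(1, length(sig) - win, by=win)
peaks <- sapply(starts, function(s) max(abs(sig[s:(s + win - 1)])))
dct_smooth <- function(x, k=10) {                # fit with the first k DCT-II terms
  N <- length(x); n <- seq_len(N) - 0.5
  basis <- sapply(0:(k - 1), function(j) cos(pi * j * n / N))
  coef <- as.vector(t(basis) %*% x) * (2 / N)
  coef[1] <- coef[1] / 2                         # scale the DC term
  as.vector(basis %*% coef)                      # smoothed time series
}
rel_win <- which.max(diff(dct_smooth(peaks)))    # steepest rise = predicted release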

The procedure: negative VOT

  • In our sound wave, it looks like this:

The procedure: negative VOT

  • This was our hand-measured VOT:

The procedure: negative VOT

  • This is the VOT we predicted using the transient method:

The procedure: negative VOT

  • And this is the VOT we predicted using the amplitude method:

Parameter settings

  • This is implemented in the getVOT functions positiveVOT and negativeVOT.
  • These take the rather obscure arguments we’ve discussed throughout, and some we haven’t.
args(positiveVOT)
## function (sound, sr, closure_interval = 10, release_param = 15, 
##     vo_method = "acf", vo_granularity = 1, vo_param = 0.85, f0_wl = 30, 
##     f0_minacf = 0.5, burst_only = FALSE, f0_first = FALSE, plot = TRUE, 
##     params_list = NULL) 
## NULL
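  • For example, reusing the clip loaded earlier (48000 Hz follows from its 9600 samples over 0.2 seconds):
pos <- positiveVOT(sound=snd_vl$sig[,1], sr=48000, plot=FALSE)
# presumably returns predicted landmarks, cf. the negativeVOT() output later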

Parameter settings

  • To help set these, getVOT has the functions neg_setParams and pos_setParams.
  • Let’s try this out on our voiced stop.
negParams <- neg_setParams(directory='voi', plot=F)

[1] "Testing voicing onset parameter settings for baa.wav"
[1] "On average, the selected parameters for voicing onset after a first pass agree with the training data within a margin of 0.285 ms"
[1] "Finetuning voicing onset parameter settings for baa.wav"
[1] "On average, the selected voicing onset parameters after finetuning agree with the training data within a margin of 0.092 ms"
[1] "Testing accuracy of different burst detection methods baa.wav"
[1] "On average, the select burst detection method agreewith the training data within a margin of 0.507 ms"

Parameter settings

  • We could’ve gotten slightly better results for this one stop with different parameters.
  • Usually the setParams functions should be used with 5-10 files.
    • The returned list of parameters then minimizes the difference between hand-annotation and prediction.
  • plot_training_diff will visualize the comparison between hand-annotated and predicted VOT.

Parameter settings

plot_training_diff(directory='pos_training', params_list=plotParams)

Parameter settings and cats

plot_training_diff(directory='pos_training', params_list=plotParams, 
                   cat=TRUE)

From R to Praat

  • Okay, so the functions positiveVOT and negativeVOT can predict VOT for individual sound files, but how is this useful?
negativeVOT(sound[,1], 44100, params_list=negParams, plot=FALSE)
## $vo
## [1] 2050.95
## 
## $rel
## [1] 4849.35
## 
## $vot
## [1] -63.5
## 
## $voi_int
## [1] 252

From R to Praat

  • Probably the two most important functions of getVOT are:
  • VOT2newTG
    • For sound files containing a single word, will generate new TextGrids with predicted VOT.
  • addVOT2TG
    • For longer sound files with existing TextGrids, will add a new tier with predicted VOT.

From R to Praat

VOT2newTG(directory='voi',
          sign='negative',
          neg_params_list=negParams)
addVOT2TG(directory='DanPASS_mono',
          seg_list=c('p', 't', 'k'),
          tg_tier='lydskrift',
          sign='positive')

Let’s try it out!

devtools::install_github('rpuggaardrode/getVOT')
VOT2newTG(directory='YOUR DIRECTORY NAME HERE', sign='positive')
  • If results are poor, try the following:
MY_PARAMS <- pos_setParams(directory='YOUR DIRECTORY NAME HERE')
VOT2newTG(directory='YOUR DIRECTORY NAME HERE', sign='positive',
          pos_params_list=MY_PARAMS)  # pos_params_list assumed by analogy with neg_params_list