5 Forced alignment in `emuR`

5.1 Objective and preliminaries

The objective of this chapter is to show how to go from a directory with .wav files that are orthographically transcribed in simple .txt files to a phonetically annotated and force-aligned Emu database.

The assumption is that you already have an R project called ipsR and that it contains the directories emu_databases and testsample. If this is not the case, please go back and follow the preliminaries chapter.

Start up R in the project you are using for this tutorial and load the usual packages:

library(tidyverse)
library(emuR)
library(wrassp)

In R, store the path to the directory testsample as sourceDir in the following way:

sourceDir <- "./testsample"

And also store in R the path to emu_databases as targetDir:

targetDir <- "./emu_databases"

5.2 Converting a text collection into an Emu database

The directory ./testsample/german on your computer contains .wav files and .txt files. Define the path to this directory in R and check that you can see these files with the list.files() function:

path.german <- file.path(sourceDir, "german")
list.files(path.german)

[1] "K01BE001.txt" "K01BE001.wav" "K01BE002.txt" "K01BE002.wav"

The above is an example of what we’ll call a text collection: the directory contains matching .wav and .txt files, and for each .wav file, the .txt file with the same base name contains the corresponding orthographic transcription. We can confirm this by using the function read_file() to read the content of these .txt files:

read_file(file.path(path.german, 'K01BE001.txt'))

[1] "heute ist schönes Frühlingswetter"

read_file(file.path(path.german, 'K01BE002.txt'))

[1] "die Sonne lacht"
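
Before converting a larger text collection, it can help to confirm that every .wav file really has a .txt partner. The following is a minimal sketch in base R; the helper unmatched_files() is hypothetical and not part of emuR, and it is demonstrated on a temporary directory with empty dummy files so that it runs independently of the tutorial data.

```r
# Hypothetical helper (not part of emuR): report files in a text collection
# that lack a partner with the same base name.
unmatched_files <- function(dir) {
  wav <- tools::file_path_sans_ext(list.files(dir, pattern = "\\.wav$"))
  txt <- tools::file_path_sans_ext(list.files(dir, pattern = "\\.txt$"))
  list(wav_without_txt = setdiff(wav, txt),
       txt_without_wav = setdiff(txt, wav))
}

# Demonstration on a temporary directory with empty dummy files
demo_dir <- file.path(tempdir(), "txt_collection_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.wav", "a.txt", "b.wav")))

# b.wav has no matching b.txt, so "b" is reported under wav_without_txt
unmatched_files(demo_dir)
```

On the tutorial data, unmatched_files(path.german) should return two empty vectors if the collection is complete.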

The command convert_txtCollection() is used to convert a text collection into an Emu database. Below we make an Emu database called ger2, which we’ll store in targetDir:

convert_txtCollection(dbName = "ger2",
                      sourceDir = path.german,
                      targetDir = targetDir,
                      verbose=FALSE)

You can now load the database into R with load_emuDB():

ger2_DB <- load_emuDB(file.path(targetDir, "ger2_emuDB"), verbose=FALSE)
summary(ger2_DB)

── Summary of emuDB ────────────────────────────────────────────────────────────

Name:    ger2 
UUID:    5b818e3b-1d82-4471-985f-2ea5e5c8e2e4 
Directory:   C:\Users\rasmu\surfdrive\emuintro\emuintro\emu_databases\ger2_emuDB 
Session count: 1 
Bundle count: 2 
Annotation item count:  2 
Label count:  4 
Link count:  0

── Database configuration ──────────────────────────────────────────────────────

── SSFF track definitions ──

data frame with 0 columns and 0 rows

── Level definitions ──

 name   type nrOfAttrDefs attrDefNames          
 bundle ITEM 2            bundle; transcription;

── Link definitions ──

data frame with 0 columns and 0 rows

Try to serve() the database and have a look at it:

serve(ger2_DB, useViewer = FALSE)

If you switch to hierarchy view, you should see that the words from each .txt file are stored as a single item on the bundle level, under the attribute named transcription, as shown in the image below:

If we query() the database to find all annotations at the transcription level, it’s clear that the words are all stored in this way. Note that we need to include the argument calcTimes = FALSE here, because the annotation level transcription is of type ITEM and has no time information.

query(ger2_DB, "transcription =~ .*", calcTimes = FALSE)
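
The query result is a data frame, so it can be trimmed down with the usual tidyverse verbs. The sketch below assumes that ger2_DB has been loaded as above; it keeps only the bundle name and the label, and counts the whitespace-separated words in each transcription. The column name n_words is chosen here for illustration.

```r
library(tidyverse)
library(emuR)

# Assumes ger2_DB has been loaded as above
transcriptions <- query(ger2_DB, "transcription =~ .*", calcTimes = FALSE)

# Keep the bundle name and its label, and count whitespace-separated words
transcriptions %>%
  select(bundle, labels) %>%
  mutate(n_words = str_count(labels, "\\S+"))
```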

5.3 Forced alignment

We are now going to run the Munich Automatic Segmentation (MAUS) pipeline over the database. We do this with the function runBASwebservice_all(), which combines a number of online processing tools. The obligatory arguments of this function are transcriptionAttributeDefinitionName, which is the name of the existing attribute that holds the orthographic transcription (here transcription), and language, which in this case we set to deu-DE. We also set the argument runMINNI to FALSE; this option can be used to force-align data which has no annotations at all by first running the MINNI automatic speech recognition (ASR) routine. Note that runBASwebservice_all() can only be used if you have an active internet connection.

runBASwebservice_all(ger2_DB,
  transcriptionAttributeDefinitionName = "transcription",
  language = "deu-DE", 
  runMINNI = FALSE,
  verbose = FALSE)
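
Since the call depends on a remote service, it can be convenient to guard it. The following is only a sketch under two assumptions: that the curl package is installed (it provides has_internet()), and that you do not want to re-run the pipeline on a database that already has a MAU level.

```r
library(emuR)

# Assumes ger2_DB has been loaded as above and that the curl package is installed
already_aligned <- "MAU" %in% list_levelDefinitions(ger2_DB)$name

if (already_aligned) {
  message("A MAU level already exists; skipping forced alignment.")
} else if (!curl::has_internet()) {
  message("No internet connection; the BAS web services cannot be reached.")
} else {
  runBASwebservice_all(ger2_DB,
                       transcriptionAttributeDefinitionName = "transcription",
                       language = "deu-DE",
                       runMINNI = FALSE,
                       verbose = FALSE)
}
```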

The language setting in MAUS

There are quite a lot of languages available in MAUS. As of this writing, you can force-align Afrikaans, Albanian, Arabic, Basque, Catalan, Dutch, English, Estonian, Finnish, French, Georgian, German, Hungarian, Icelandic, Italian, Japanese, Luxembourgish, Maltese, Min Nan, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, and Thai. There’s also a language-independent mode which expects files to be phonetically transcribed in X-SAMPA, and a special mode for Australian Aboriginal languages. Additionally, many of these languages have modes for several different dialects. More information can be found in the MAUS help files and in the MAUS web interface.

Let’s have a look at the database summary again:

summary(ger2_DB)

── Summary of emuDB ────────────────────────────────────────────────────────────

Name:    ger2 
UUID:    5b818e3b-1d82-4471-985f-2ea5e5c8e2e4 
Directory:   C:\Users\rasmu\surfdrive\emuintro\emuintro\emu_databases\ger2_emuDB 
Session count: 1 
Bundle count: 2 
Annotation item count:  58 
Label count:  74 
Link count:  52

── Database configuration ──────────────────────────────────────────────────────

── SSFF track definitions ──

data frame with 0 columns and 0 rows

── Level definitions ──

 name   type    nrOfAttrDefs attrDefNames          
 bundle ITEM    2            bundle; transcription;
 ORT    ITEM    3            ORT; KAN; KAS;        
 MAU    SEGMENT 1            MAU;                  
 MAS    ITEM    1            MAS;

── Link definitions ──

 type        superlevelName sublevelName
 ONE_TO_MANY bundle         ORT         
 ONE_TO_MANY ORT            MAS         
 ONE_TO_MANY MAS            MAU

Note that multiple extra annotation levels and attributes have been created (ORT, KAN, KAS, MAU, and MAS), as well as links between them.
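
The same information can be retrieved piece by piece instead of reading it off the summary. The sketch below assumes that ger2_DB has been force-aligned as above and uses the emuR listing functions for levels, attributes, and links.

```r
library(emuR)

# Assumes ger2_DB has been force-aligned as above
list_levelDefinitions(ger2_DB)            # levels and their types
list_attributeDefinitions(ger2_DB, "ORT") # attributes of the ORT level
list_linkDefinitions(ger2_DB)             # links between levels
```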

Let’s serve() the database.

serve(ger2_DB, useViewer = FALSE)

We immediately see that phone-level annotations have been added in the MAU level. Have a look at the hierarchy view and try to identify the levels, links, and attributes.

This shows that the phone-level annotations are linked to syllable-level annotations in the MAS level and to word-level orthographic annotations in the ORT level. The ORT level further has the attributes KAN and KAS. These contain canonical representations of the word, i.e. phonetic annotations corresponding to the canonical pronunciations of these words. The KAS attribute is additionally chunked into syllables.
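
To compare the orthographic and canonical forms side by side, the attributes of the ORT level can be queried separately and combined. This is a sketch, assuming that ger2_DB has been force-aligned as above and that the three queries return the same items in the same order; the code checks that assumption before combining the labels.

```r
library(tidyverse)
library(emuR)

# Assumes ger2_DB has been force-aligned as above
ort <- query(ger2_DB, "ORT =~ .*")
kan <- query(ger2_DB, "KAN =~ .*")
kas <- query(ger2_DB, "KAS =~ .*")

# The three queries address the same items, so the rows should line up;
# stop early if that assumption does not hold
stopifnot(nrow(ort) == nrow(kan),
          nrow(ort) == nrow(kas),
          all(ort$start_item_id == kan$start_item_id),
          all(ort$start_item_id == kas$start_item_id))

tibble(bundle = ort$bundle,
       ORT = ort$labels,
       KAN = kan$labels,
       KAS = kas$labels)
```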

Given this complex information, more complex queries are now also possible. We can find the word-initial MAU segments of all polysyllabic words like this:

mau.s <- query(ger2_DB,
               "[[MAU =~.* & Start(ORT, MAU)=1] ^ Num(ORT, MAS) > 1]")
#deselect the db_uuid column for readability
mau.s %>% select(-db_uuid)

This data frame can then be passed to requery_hier() so we can see the labels in the ORT level associated with these words, like so:

requery_hier(ger2_DB, mau.s, "ORT") %>% select(-db_uuid)
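
The two results can also be combined into one table. The sketch below assumes that mau.s has been created by the query above and that requery_hier() returns one row per input segment, which the code checks; the column names word and dur_ms are chosen here for illustration.

```r
library(tidyverse)
library(emuR)

# Assumes ger2_DB and mau.s exist as above
words <- requery_hier(ger2_DB, mau.s, "ORT")

# requery_hier() is expected to return one row per input segment;
# stop early if that assumption does not hold
stopifnot(nrow(words) == nrow(mau.s))

mau.s %>%
  mutate(word = words$labels,
         dur_ms = end - start) %>%  # start and end are given in milliseconds
  select(bundle, labels, word, start, end, dur_ms)
```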

This may all seem rather opaque, but we’ll go into much more detail with how the querying language works in Chapter 7.

5.4 Forced alignment: Albanian

Next we’ll try out forced alignment for a different language (Albanian) and we will show how forced alignment can be done from a canonical phonemic transcription instead of from text.

5.4.1 From a text collection

First we’ll use a text collection like we saw for German previously. This text collection is in our sourceDir in a folder called albanian:

path.albanian <- file.path(sourceDir, "albanian")

We start by converting the text collection into an Emu database using convert_txtCollection(), as above, and then load the result.

convert_txtCollection(dbName = "alb",
                      sourceDir = path.albanian,
                      targetDir = targetDir,
                      verbose=FALSE)

alb_DB <- load_emuDB(file.path(targetDir, "alb_emuDB"), verbose = FALSE)
summary(alb_DB)

── Summary of emuDB ────────────────────────────────────────────────────────────

Name:    alb 
UUID:    38874968-8ad6-48cf-8992-5fa6098e30d7 
Directory:   C:\Users\rasmu\surfdrive\emuintro\emuintro\emu_databases\alb_emuDB 
Session count: 1 
Bundle count: 4 
Annotation item count:  4 
Label count:  8 
Link count:  0

── Database configuration ──────────────────────────────────────────────────────

── SSFF track definitions ──

data frame with 0 columns and 0 rows

── Level definitions ──

 name   type nrOfAttrDefs attrDefNames          
 bundle ITEM 2            bundle; transcription;

── Link definitions ──

data frame with 0 columns and 0 rows

Have a look at the database, switch to hierarchy view, and verify that the words have been located at bundle -> transcription as for the German database above.

serve(alb_DB, useViewer = FALSE)

Now run MAUS, just as before. The language code for Albanian is sqi-AL. Note that this will take longer than for German, possibly a couple of minutes.

runBASwebservice_all(alb_DB, 
                     transcriptionAttributeDefinitionName = "transcription",
                     language = "sqi-AL", 
                     runMINNI = FALSE,
                     verbose = FALSE)

summary(alb_DB)

── Summary of emuDB ────────────────────────────────────────────────────────────

Name:    alb 
UUID:    38874968-8ad6-48cf-8992-5fa6098e30d7 
Directory:   C:\Users\rasmu\surfdrive\emuintro\emuintro\emu_databases\alb_emuDB 
Session count: 1 
Bundle count: 4 
Annotation item count:  138 
Label count:  176 
Link count:  125

── Database configuration ──────────────────────────────────────────────────────

── SSFF track definitions ──

data frame with 0 columns and 0 rows

── Level definitions ──

 name   type    nrOfAttrDefs attrDefNames          
 bundle ITEM    2            bundle; transcription;
 ORT    ITEM    3            ORT; KAN; KAS;        
 MAU    SEGMENT 1            MAU;                  
 MAS    ITEM    1            MAS;

── Link definitions ──

 type        superlevelName sublevelName
 ONE_TO_MANY bundle         ORT         
 ONE_TO_MANY ORT            MAS         
 ONE_TO_MANY MAS            MAU

Look at the database and verify that the same kind of information has been automatically derived, as for the German database earlier.

serve(alb_DB, useViewer = FALSE)
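
The German and Albanian examples follow the same three steps: convert, load, align. These could be wrapped in a small function. The helper align_txt_collection() below is hypothetical and not part of emuR; it is a sketch that assumes an existing database with the same name should not be overwritten, and it simply stops in that case.

```r
library(emuR)

# Hypothetical helper (not part of emuR): convert a text collection,
# load the resulting database, and run the full BAS pipeline on it.
align_txt_collection <- function(dbName, sourceDir, targetDir, language) {
  dbPath <- file.path(targetDir, paste0(dbName, "_emuDB"))
  # Assumption: an existing database of the same name should not be overwritten
  if (dir.exists(dbPath)) {
    stop("Database already exists: ", dbPath)
  }
  convert_txtCollection(dbName = dbName,
                        sourceDir = sourceDir,
                        targetDir = targetDir,
                        verbose = FALSE)
  db <- load_emuDB(dbPath, verbose = FALSE)
  runBASwebservice_all(db,
                       transcriptionAttributeDefinitionName = "transcription",
                       language = language,
                       runMINNI = FALSE,
                       verbose = FALSE)
  db
}
```

With a database name that is not yet in use, a call such as align_txt_collection("alb_new", path.albanian, targetDir, "sqi-AL") would then return the handle of the aligned database; the name alb_new is only an example.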

5.4.2 From a canonical representation

MAUS also makes it possible to segment sound files based on the canonical representation that we saw in the KAN attribute above. This can be useful when the canonical representation provided by MAUS deviates considerably from what was actually said. For one of the words in 0001BF_1syll_1, the canonical representation is J E but what was actually said is closer to n J E.

To see how this works, open the hierarchy view in the web app and switch the attribute displayed for the ORT level from ORT to KAN. Then change the node J E to n J E for file 0001BF_1syll_1 (see figure fig-albhier). Remember to save the annotation after editing.

We can now run MAUS directly on this canonical representation using the runBASwebservice_maus() function. Here we again pass the language, and we pass the name of the existing attribute with the canonical representation, KAN, to the argument canoAttributeDefinitionName. We also specify the name of the newly created force-aligned level with mausAttributeDefinitionName, which we call MAU2 in order to differentiate it from the MAU level that we already created.

runBASwebservice_maus(alb_DB,
                      canoAttributeDefinitionName = "KAN",
                      mausAttributeDefinitionName = "MAU2",
                      language = "sqi-AL",
                      verbose = FALSE)

Inspect the database again. There should now be a new annotation level called MAU2.

summary(alb_DB)

── Summary of emuDB ────────────────────────────────────────────────────────────

Name:    alb 
UUID:    38874968-8ad6-48cf-8992-5fa6098e30d7 
Directory:   C:\Users\rasmu\surfdrive\emuintro\emuintro\emu_databases\alb_emuDB 
Session count: 1 
Bundle count: 4 
Annotation item count:  219 
Label count:  257 
Link count:  197

── Database configuration ──────────────────────────────────────────────────────

── SSFF track definitions ──

data frame with 0 columns and 0 rows

── Level definitions ──

 name   type    nrOfAttrDefs attrDefNames          
 bundle ITEM    2            bundle; transcription;
 ORT    ITEM    3            ORT; KAN; KAS;        
 MAU    SEGMENT 1            MAU;                  
 MAS    ITEM    1            MAS;                  
 MAU2   SEGMENT 1            MAU2;

── Link definitions ──

 type        superlevelName sublevelName
 ONE_TO_MANY bundle         ORT         
 ONE_TO_MANY ORT            MAS         
 ONE_TO_MANY MAS            MAU         
 ONE_TO_MANY ORT            MAU2
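
To see what the manual correction changed, the two segmentations of the edited bundle can be printed one after the other. This is a sketch, assuming that alb_DB contains both the MAU and the MAU2 level as above; note that bundlePattern is interpreted as a regular expression.

```r
library(tidyverse)
library(emuR)

# Assumes alb_DB contains both the MAU and the MAU2 level, as above
mau  <- query(alb_DB, "MAU =~ .*",  bundlePattern = "0001BF_1syll_1")
mau2 <- query(alb_DB, "MAU2 =~ .*", bundlePattern = "0001BF_1syll_1")

# Segments from the text-based alignment
mau %>% select(level, labels, start, end)

# Segments from the alignment based on the corrected KAN labels
mau2 %>% select(level, labels, start, end)
```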

If you have reason to suspect that the canonical representation will differ from what is actually said in the recording, you can use the function runBASwebservice_g2pForPronunciation() as the first step. This only generates the canonical representations without doing forced alignment, so they can be corrected before running the MAUS forced alignment routine. If you have much more data, this may speed up your process.
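
A step-by-step version of this workflow might look as follows. This is a sketch under several assumptions: the database name alb2 is hypothetical and must not exist yet; path.albanian and targetDir are defined as above; and the grapheme-to-phoneme step is assumed to need word tokens in an ORT attribute, which are created here with runBASwebservice_g2pForTokenization(), a function not otherwise covered in this chapter.

```r
library(emuR)

# Assumes a freshly converted database without ORT, KAN or MAU levels.
# "alb2" is a hypothetical name chosen for illustration.
convert_txtCollection(dbName = "alb2",
                      sourceDir = path.albanian,
                      targetDir = targetDir,
                      verbose = FALSE)
alb2_DB <- load_emuDB(file.path(targetDir, "alb2_emuDB"), verbose = FALSE)

# Step 1: split the transcription into word tokens (creates ORT)
runBASwebservice_g2pForTokenization(alb2_DB,
                                    transcriptionAttributeDefinitionName = "transcription",
                                    language = "sqi-AL",
                                    orthoAttributeDefinitionName = "ORT",
                                    verbose = FALSE)

# Step 2: generate canonical pronunciations for the tokens (creates KAN)
runBASwebservice_g2pForPronunciation(alb2_DB,
                                     orthoAttributeDefinitionName = "ORT",
                                     language = "sqi-AL",
                                     canoAttributeDefinitionName = "KAN",
                                     verbose = FALSE)

# Step 3: correct the KAN labels by hand, e.g. via serve(alb2_DB, useViewer = FALSE)

# Step 4: force-align on the corrected canonical forms (creates MAU)
runBASwebservice_maus(alb2_DB,
                      canoAttributeDefinitionName = "KAN",
                      language = "sqi-AL",
                      mausAttributeDefinitionName = "MAU",
                      verbose = FALSE)
```

Compared with runBASwebservice_all(), this leaves out the syllabification steps, so no KAS or MAS annotations are created.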

5.5 Functions introduced in this chapter

read_file(): reads the content of a .txt file into R.
convert_txtCollection(): converts pairs of .wav and .txt files into an Emu database.
runBASwebservice_all(): performs all the steps needed to get from an Emu database with orthographic transcriptions that aren’t time-aligned to a multi-level force-aligned phonetic annotation.
runBASwebservice_maus(): performs forced alignment on an Emu database which already has (non-time-aligned) canonical phonetic annotations.
runBASwebservice_g2pForPronunciation(): performs grapheme-to-phoneme (G2P) conversion of an orthographic transcription to a canonical phonetic transcription on an Emu database.