Forced Alignment of Speech and Text: Building Acoustic Models using Kaldi Voxforge recipe to obtain word level transcripts for long video files

Hi all,

This is the second post in the series and deals with building acoustic models for speech recognition using Kaldi recipes.

In this post, I'm going to cover the procedure for three languages (German, French and Spanish) using the data from VoxForge.

BRIEF WORKFLOW:

Toolkit

KALDI :

Kaldi is an open-source toolkit for speech recognition, written in C++ and licensed under the Apache License v2.0. I've included the installation instructions in the readme file in the git repo.

Downloading the data

Step 1 :

Set the DATA_ROOT variable in the shell script path.sh to point to the location on disk where you intend to store the data.

I've stored it in /home/$USER/database/voxforge.

Step 2:

Execute getdata.sh. This step:

a) Downloads the language-specific speech data from the VoxForge portal and saves it in a folder called tgz.

b) Extracts the files into a folder called extracted.

For example, mine are in

/home/$USER/database/voxforge/$LANGUAGE/tgz and

/home/$USER/database/voxforge/$LANGUAGE/extracted
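The extraction half of this step can be sketched in Python as follows. This is only an illustration of the tgz/extracted layout described above, not the contents of getdata.sh; the function name extract_archives is hypothetical, and I'm assuming the downloaded archives end in .tgz.

```python
import tarfile
from pathlib import Path


def extract_archives(tgz_dir, extracted_dir):
    """Unpack every .tgz archive in tgz_dir into extracted_dir.

    Returns the sorted list of archive file names that were unpacked.
    (Hypothetical helper; getdata.sh may do this differently.)
    """
    tgz_dir = Path(tgz_dir)
    extracted_dir = Path(extracted_dir)
    # Create the target folder if it does not exist yet.
    extracted_dir.mkdir(parents=True, exist_ok=True)

    unpacked = []
    for archive in sorted(tgz_dir.glob("*.tgz")):
        # Each archive is a gzip-compressed tarball.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(extracted_dir)
        unpacked.append(archive.name)
    return unpacked
```

With the layout above, the call would take the tgz folder as the first argument and the extracted folder as the second.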

Details of the data files:

French - 55 hours, 1600 speakers

Spanish - 53 hours, 1712 speakers

German - 50 hours, 1600 speakers

Mapping anonymous speakers to unique IDs
We'd like to be able to perform various speaker-dependent transforms, so avoiding a single catch-all "anonymous" speaker is an important step in building the models. The "anonymous" speech is recorded under different environment and channel conditions, by speakers who may be male or female and who have different accents, so it cannot be treated as one speaker. Following the VoxForge recipe, I've given each such speaker a unique identity.
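A minimal sketch of the idea, assuming recording folders are named like speaker-date-suffix (for example anonymous-20080101-abc). The anonNNNN naming scheme and the function name are my own illustration, not necessarily what the recipe script produces.

```python
def map_anonymous_speakers(recording_dirs):
    """Give each 'anonymous-*' recording its own speaker ID.

    recording_dirs: iterable of folder names such as
    'anonymous-20080101-abc' (assumed naming pattern).
    Returns a dict mapping each original name to its new name.
    """
    mapping = {}
    counter = 0
    # Sort so that the numbering is reproducible between runs.
    for name in sorted(recording_dirs):
        speaker, sep, rest = name.partition("-")
        if speaker.lower() == "anonymous":
            counter += 1
            speaker = f"anon{counter:04d}"  # illustrative ID scheme
        mapping[name] = speaker + sep + rest
    return mapping
```

Named speakers keep their folder names, while every anonymous recording is treated as a separate speaker.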

Train/test set splitting and normalizing the metadata

The next step is to split the data into train and test sets and to produce the relevant transcription and speaker-dependent information. I've held out 20 speakers for the test set.
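The split is done by speaker, so that no test speaker is seen during training. Here is a rough sketch of that idea; the utt2spk dictionary (utterance ID to speaker ID), the random selection and the fixed seed are assumptions for illustration and may differ from what the recipe does.

```python
import random


def split_by_speaker(utt2spk, n_test_speakers=20, seed=0):
    """Hold out all utterances of n_test_speakers speakers for testing.

    utt2spk: dict mapping utterance ID -> speaker ID.
    Returns (train_utts, test_utts, test_speakers).
    """
    speakers = sorted(set(utt2spk.values()))
    if n_test_speakers > len(speakers):
        raise ValueError("not enough speakers for the requested test set")

    # A seeded generator keeps the split reproducible.
    rng = random.Random(seed)
    test_speakers = set(rng.sample(speakers, n_test_speakers))

    train_utts = sorted(u for u, s in utt2spk.items() if s not in test_speakers)
    test_utts = sorted(u for u, s in utt2spk.items() if s in test_speakers)
    return train_utts, test_utts, test_speakers
```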

Building the language model

I decided to use the SRILM toolkit to estimate the test-time language model. SRILM is not installed by default under $KALDI_ROOT/tools by the Kaldi installation scripts; it needs to be installed manually. The installation assumes you have GNU autotools, C++ and Fortran compilers, as well as the Boost C++ libraries installed.
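To give a feel for what "estimating a language model" means, here is a toy bigram model built from transcript lines. This is not SRILM and not what the recipe runs: it uses plain maximum-likelihood counts with no smoothing or backoff, and the bigram order is my assumption for the example.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Estimate maximum-likelihood bigram probabilities P(w2 | w1).

    sentences: iterable of whitespace-tokenised transcript lines.
    Returns a dict mapping (w1, w2) -> probability.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Sentence boundary markers, as commonly used in n-gram models.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {
        (w1, w2): count / context_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }
```

A real toolkit additionally assigns some probability to unseen word pairs, which this sketch leaves out.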

Preparing the dictionary
The script used for this task is called local/voxforge_prepare_dict.sh. It first downloads CMU's pronunciation dictionary and prepares a list of the words that are found in the train set but not in cmudict. Pronunciations for these words are generated automatically using Sequitur G2P, which is installed under tools/g2p. The installation assumes you have NumPy, SWIG and a C++ compiler on your system. Because training Sequitur models takes a long time, the script downloads and uses a pre-built model trained on cmudict instead.
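The "words in the train set but not in the dictionary" step boils down to a set difference. A small sketch, assuming transcripts are whitespace-separated and that matching is case-insensitive (both are my assumptions; the function name is hypothetical):

```python
def find_oov_words(transcripts, lexicon_words):
    """Return the sorted list of transcript words missing from the lexicon.

    transcripts: iterable of transcript lines.
    lexicon_words: iterable of words that already have a pronunciation.
    """
    known = {word.upper() for word in lexicon_words}
    oov = set()
    for line in transcripts:
        for word in line.split():
            if word.upper() not in known:
                oov.add(word.upper())
    return sorted(oov)
```

The resulting list is what would be handed to the G2P model to get pronunciations.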

Feature Extraction:

I've used MFCCs (Mel Frequency Cepstral Coefficients) as features to train the acoustic models. The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

MFCC Extraction Procedure

MFCCs are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for a better representation of sound.
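To make "equally spaced on the mel scale" concrete, here is a small sketch using one common form of the mel formula, mel = 2595 * log10(1 + f / 700). The helper names and the choice of 10 bands in the usage note are only for illustration; Kaldi's own feature extraction has its own configuration.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (common 2595 * log10 form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centers(low_hz, high_hz, n_bands):
    """Return n_bands centre frequencies (Hz), equally spaced in mels."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_bands)]
```

Calling mel_spaced_centers(0, 8000, 10) gives centres that sit close together at low frequencies and spread out at high frequencies, which is the warping described above.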

MODEL BUILDING:

I've built the models using triphone context. The following improvements have been tried:

Improvement using Minimum Phone Error (MPE) training.

Improvement using Maximum Likelihood Training.

Improvement using SGMMs.

Improvements using Hyperparameter tuning of DNNs.

I'll explain each of these in detail in my next post.

The next step is decoding the test files using the models built. In my previous post I described the procedure for preprocessing the test files so that they can be decoded with models built using Kaldi.

The code for both posts has been updated. Please feel free to have a look and suggest changes. The code for each post includes a readme file and a run.sh script to help you navigate it.

*******************************************************************************************

Progress so far:

Model Building:
Built Acoustic Models for the following languages:
English

Spanish

German

French

*********************************************************************************************

What's next:

Model Building:

Danish

Swedish

Norwegian

Model Deployment:

English

Wrapper:

Python wrapper to cover the decoding process.

Documentation

**********************************************************************************************

What will be in the next post:

a) Improving the transcripts using different techniques

I'll be explaining the variants of the training procedures and the ways to fine-tune them to improve accuracy. Basically, the following:
Monophone vs triphone training.

Multipass + Speaker Adaptation.

Improvement using Minimum Phone Error (MPE) training.

Improvement using Maximum Likelihood Training.

Improvement using SGMMs.

Improvements using Hyperparameter tuning of DNNs.

b) Model building :
How to build models when Voxforge data is not available.

Forced Alignment of Speech and Text

Sunday, 21 June 2015
