Building Acoustic Models using Kaldi Voxforge recipe to obtain word level transcripts for long video files
Hi all! This is the second post in the series, and it deals with building acoustic models for speech recognition using Kaldi recipes.
In this post, I'm going to cover the procedure for three languages: German, French, and Spanish, using data from VoxForge.
Kaldi is an open-source toolkit for speech recognition, written in C++ and licensed under the Apache License v2.0. I've included the installation instructions in the README file in the Git repo.
Downloading the data
Step 1 :
Set the DATA_ROOT variable in the shell script path.sh to point to the location on disk where you intend to store the data.
I've stored it in /home/$USER/database/voxforge.
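In path.sh this amounts to exporting a single variable; a minimal sketch, assuming the layout used in this post:

```shell
# path.sh (sketch): point DATA_ROOT at wherever you want the VoxForge data stored
export DATA_ROOT=/home/$USER/database/voxforge
```

Every later script reads DATA_ROOT from path.sh, so this is the only place the storage location needs to change.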
Execute getdata.sh. This step:
a) downloads the language-specific speech data from the VoxForge portal and saves it in a folder called tgz, and
b) extracts the files into a folder called extracted.
For example, mine are in the extracted folder under DATA_ROOT.
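The download-and-extract step boils down to unpacking each archive from tgz into extracted. A sketch of that loop, using a locally created toy archive in place of the downloaded speech data (the file names here are illustrative, not the portal's actual archives):

```shell
# Stand-in for a downloaded VoxForge archive, so the loop below has input
mkdir -p tgz extracted srcdir
echo sample > srcdir/a.txt
tar -czf tgz/demo.tgz -C srcdir a.txt

# What getdata.sh's step (b) amounts to: unpack every archive into extracted/
for f in tgz/*.tgz; do
  tar -xzf "$f" -C extracted
done
```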
Details of the data files:
French - 55 hours , 1600 speakers
Spanish - 53 hours, 1712 speakers
German - 50 hours, 1600 speakers
Mapping anonymous speakers to unique IDs

We'd like to be able to perform various speaker-dependent transforms, so avoiding anonymous speakers is an important step in building the models. The "anonymous" speech is also recorded under different environment/channel conditions, by speakers who may be male or female and may have different accents. Following the VoxForge recipe, I've given each such speaker a unique identity.
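One simple way to do this is to derive the ID from the recording-session directory, so every anonymous session becomes its own "speaker". A sketch with a made-up directory listing (not the exact VoxForge layout or the recipe's exact script):

```shell
# Sketch: named speakers keep their name; each "anonymous" session gets a
# unique ID taken from its session directory (assumes one speaker per session)
mapped=$(printf '%s\n' \
    'anonymous-20080505-xyz/wav/a0001.wav' \
    'anonymous-20080506-abc/wav/a0001.wav' \
    'ralfherzog-20080101-qrs/wav/a0001.wav' |
  awk -F/ '{
    dir = $1; split(dir, parts, "-"); spk = parts[1]
    if (spk == "anonymous") spk = dir   # unique per session directory
    print spk, $0
  }')
echo "$mapped"
```

With this mapping, the two anonymous sessions are treated as two distinct speakers rather than being lumped together.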
Train/test set splitting and normalizing the metadata
The next step is to split the data into train and test sets and to produce the relevant transcription and speaker-dependent information. I've used 20 speakers for the test set.
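Since the split is by speaker, it reduces to partitioning the speaker list. A sketch with made-up speaker IDs (the recipe works from the real speaker list produced in the previous step):

```shell
# Sketch: hold out 20 speakers for test, the rest go to train
printf 'spk%03d\n' $(seq 1 100) > all_spk   # hypothetical 100-speaker list
head -n 20 all_spk > test_spk               # 20 test speakers, as in the post
tail -n +21 all_spk > train_spk             # remaining speakers for training
```

Splitting by speaker (rather than by utterance) keeps test speakers completely unseen during training, which gives a more honest estimate of recognition accuracy.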
Building the language model
I decided to use the SRILM toolkit to estimate the test-time language model. SRILM is not installed under $KALDI_ROOT/tools by the Kaldi installation scripts and needs to be installed manually. The installation assumes you have GNU autotools, C++ and Fortran compilers, and the Boost C++ libraries installed.
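For reference, estimating an n-gram model with SRILM is a single command. A sketch of a typical invocation; the input file name and the 3-gram order with Kneser-Ney smoothing are assumptions on my part, not necessarily the recipe's exact settings:

```shell
# Sketch: estimate an interpolated 3-gram LM with modified Kneser-Ney
# smoothing from the training transcripts (train_text is a placeholder name)
ngram-count -order 3 -interpolate -kndiscount \
  -text train_text -lm lm.arpa
```

The resulting ARPA-format lm.arpa is what gets compiled into the decoding graph.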
Preparing the dictionary

The script used for this task is local/voxforge_prepare_dict.sh. It first downloads CMU's pronunciation dictionary and then prepares a list of the words that are found in the train set but not in cmudict. Pronunciations for these words are automatically generated using Sequitur G2P, which is installed under tools/g2p. The installation assumes you have NumPy, SWIG, and a C++ compiler on your system. Because training Sequitur models takes a long time, the script downloads and uses a pre-built model trained on cmudict instead.
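Collecting the out-of-vocabulary words is essentially a set difference between the corpus vocabulary and the dictionary. A sketch with toy word lists (the recipe compares the real train-set vocabulary against cmudict):

```shell
# Sketch: words present in the training transcripts but absent from the
# dictionary; comm requires both inputs to be sorted
printf 'hello\nkaldi\nworld\n' | sort > corpus_words
printf 'hello\nworld\n'        | sort > dict_words
comm -23 corpus_words dict_words > oov_words   # lines only in corpus_words
```

Here oov_words would contain just "kaldi", and it is this list that gets fed to the G2P model for automatic pronunciation generation.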
I've used MFCCs (Mel-Frequency Cepstral Coefficients) as features to train the acoustic models. The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
MFCC Extraction Procedure
They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound.
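The frequency warping mentioned above is commonly computed with the formula mel(f) = 2595 * log10(1 + f/700). A quick sanity check of the scale:

```shell
# Mel-scale warping used when spacing the MFCC filterbank:
# mel(f) = 2595 * log10(1 + f/700)
mel() { awk -v f="$1" 'BEGIN { printf "%.0f", 2595 * log(1 + f/700) / log(10) }'; }
mel 1000   # roughly 1000 mel: the scale is near-linear below 1 kHz
```

Above 1 kHz the mel scale grows logarithmically, which is why the filterbank bands get progressively wider at higher frequencies.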
I've built the models using triphone context. The following improvements have been tried:
Improvement using Minimum Phone Error (MPE) training.
Improvement using Maximum Likelihood Training.
Improvement using SGMMs.
Improvements using Hyperparameter tuning of DNNs.
I'll explain each of these in detail in my next post.
The next step is decoding the test files using the models built. In my previous post, I described the procedure to preprocess the test files so they can be decoded using the models built with Kaldi.
The code for both posts has been updated. Please feel free to have a look and suggest changes. Both posts contain README files and a run.sh to help you navigate the code.