## Friday, 26 June 2015

### Pipeline for the Word Level Alignment using Kaldi in English using Trained Acoustic Models

Hi,

This week marks the beginning of the Mid Term Evaluations of GSoC, and this post describes the pipeline that will be deployed to decode the video files.

The picture below depicts the basic steps involved in the task:

Audio Extraction from the Video Files:

For this module, I've used ffmpeg, a cross-platform solution for processing video and audio streams. This step produces 2-channel wave files with a sampling rate of 44.1 kHz.

I've downsampled the files to 8000 Hz and a single (mono) channel using sox, a command-line utility for processing audio files.

The downsampled wave files are then split into smaller chunks based on silence regions, as mentioned in my first post.

As I plan to decode the files using Kaldi, I've written the scripts in such a way that the smaller chunks are saved in a format compatible with the Kaldi recipe that I've built for English (as explained in my second post).

**********************************
Code :

ffmpeg -i video_file audio_file.wav
sox audio_file.wav -c 1 -r 8000 wave_downsampled.wav
sox wave_downsampled.wav chunk.wav silence 1 0.1 1% 1 0.7 1% : newfile : restart

***********************************

Decoding Using Kaldi Trained Models:

The files necessary for decoding are the decoding graphs, which are present in the exp folder.

Once this is done, adjust the paths in the Kaldi recipe to point to the test files and run the decoding step. The predictions are stored in the exp folder.

The utils folder has the script int2sym.pl which is required to generate the symbols corresponding to the phones/words.
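To make the role of int2sym.pl concrete, here is a minimal Python sketch of what it does: it reads the words.txt symbol table and maps the integer IDs in Kaldi's output back to word symbols. The file names and the sample line are illustrative, not taken from the recipe.

```python
# Minimal sketch of what utils/int2sym.pl does: map integer IDs in a
# Kaldi output line back to their word symbols using words.txt.

def load_symbol_table(path):
    """Parse a Kaldi words.txt file: each line is '<symbol> <integer-id>'."""
    table = {}
    with open(path) as f:
        for line in f:
            symbol, idx = line.split()
            table[int(idx)] = symbol
    return table

def int2sym(table, line):
    """Convert 'utt-id 4 2 7 ...' into 'utt-id word word word ...'."""
    utt_id, *ids = line.split()
    return " ".join([utt_id] + [table[int(i)] for i in ids])
```

The real Perl script handles a few more corner cases, but the mapping itself is exactly this table lookup.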

Brief Sequence of Steps:

(1) Prepares data.
(2) Prepares the Language Model.
(3) Extracts MFCC (Mel Frequency Cepstral Coefficients) features as mentioned in the previous post.
(4) Decodes the data using acoustic model trained on 100 hrs of clean Librispeech data.
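The four steps above map onto scripts in the recipe. As a dry-run sketch of the planned Python wrapper, the sequence can be expressed as an ordered command list; the script names and arguments below are illustrative, not the recipe's exact ones.

```python
# Dry-run sketch of the decoding pipeline's step sequence. Script names
# and arguments are illustrative placeholders, not the recipe's exact ones.

def build_pipeline(test_dir, exp_dir):
    """Return the ordered list of commands the pipeline would run."""
    return [
        ["local/prepare_data.sh", test_dir],            # (1) data preparation
        ["local/prepare_lm.sh"],                        # (2) language model
        ["steps/make_mfcc.sh", test_dir],               # (3) MFCC extraction
        ["steps/decode.sh", exp_dir + "/graph",         # (4) decoding with the
         test_dir, exp_dir + "/decode"],                #     trained models
    ]
```

Building the command list separately from executing it makes it easy to log or inspect the pipeline before running the (slow) decoding step.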

The code is updated here. Feel free to have a look and suggest changes.

## Sunday, 21 June 2015

### Building Acoustic Models using Kaldi Voxforge recipe to obtain word level transcripts for long video files

Hi all,

This is the second post in the series and deals with building acoustic models for speech recognition using Kaldi recipes.

In this post, I'm going to cover the procedure for three languages: German, French and Spanish, using the data from VoxForge.

BRIEF WORKFLOW:

Toolkit

Kaldi is an open-source toolkit for speech recognition, written in C++ and licensed under the Apache License v2.0. I've included the installation instructions in the readme file in the git repo.

Step 1:
Set the DATA_ROOT variable in the shell script path.sh to point to the location on disk where you intend to store the data. I've stored it in /home/$USER/database/voxforge.

Step 2:
Execute getdata.sh. This step
a) Downloads the language-specific speech data from the VoxForge portal and saves it in a folder called tgz.
b) Extracts the files into a folder called extracted.
Ex: I keep them in /home/$USER/database/voxforge/$LANGUAGE/tgz and /home/$USER/database/voxforge/$LANGUAGE/extracted

Details of the data files:
French - 55 hours, 1600 speakers
Spanish - 53 hours, 1712 speakers
German - 50 hours, 1600 speakers

Mapping anonymous speakers to unique IDs

We'd like to be able to perform various speaker-dependent transforms, so avoiding anonymous speakers is an important step in building the models. Also, the "anonymous" speech is recorded under different environment/channel conditions, by speakers who may be male or female and have different accents. I've given each speaker a unique identity following the VoxForge recipe.

Train/test set splitting and normalizing the metadata

The next step is to split the data into train and test sets and to produce the relevant transcription and speaker-dependent information. I've used 20 speakers for the test set.

Building the language model

I decided to use the SRILM toolkit to estimate the test-time language model. SRILM is not installed by default under $KALDI_ROOT/tools by the Kaldi installation scripts, so it needs to be installed manually. The installation assumes you have GNU autotools, C++ and Fortran compilers, as well as the Boost C++ libraries installed.
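The anonymous-speaker mapping can be sketched in a few lines of Python. The directory naming convention assumed below (speaker-date-hash) matches typical VoxForge submissions, but the exact rule the recipe uses may differ.

```python
# Sketch of mapping "anonymous" VoxForge submissions to unique speaker IDs.
# Assumption (not necessarily the recipe's exact rule): each submission
# directory is named like 'anonymous-20110429-abc', and every anonymous
# submission is treated as a distinct speaker.

def assign_speaker_ids(submission_dirs):
    ids = {}
    counter = 0
    for d in sorted(submission_dirs):
        speaker = d.split("-")[0]
        if speaker == "anonymous":
            # Give each anonymous submission its own synthetic speaker ID.
            counter += 1
            ids[d] = "anon%04d" % counter
        else:
            # Named speakers keep their VoxForge user name.
            ids[d] = speaker
    return ids
```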
Preparing the dictionary
The script used for this task is local/voxforge_prepare_dict.sh. It first downloads the CMU pronunciation dictionary (cmudict) and prepares a list of the words that are found in the train set but not in cmudict. Pronunciations for these words are automatically generated using Sequitur G2P, which is installed under tools/g2p. The installation assumes you have NumPy, SWIG and a C++ compiler on your system. Because training Sequitur models takes a lot of time, the script instead downloads and uses a pre-built model trained on cmudict.
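The OOV-collection part of this step boils down to a set difference between the training vocabulary and the dictionary. A minimal sketch (the data here is illustrative; the real script of course operates on Kaldi-formatted files):

```python
# Sketch of the OOV step in local/voxforge_prepare_dict.sh: collect words
# that appear in the training transcripts but are missing from cmudict,
# so they can be handed to the Sequitur G2P model.

def find_oov_words(transcripts, cmudict_words):
    seen = set()
    for line in transcripts:
        # cmudict entries are upper-case, so normalize before comparing.
        seen.update(line.upper().split())
    return sorted(seen - set(cmudict_words))
```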
Feature Extraction:
I've used MFCCs (Mel Frequency Cepstral Coefficients) as features to train the acoustic models. The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

MFCC Extraction Procedure

MFCCs are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for a better representation of sound.
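To make the extraction procedure concrete, here is a condensed numpy sketch of the MFCC pipeline: framing, windowed power spectrum, mel filterbank, log, and DCT. The parameter values are typical defaults, not necessarily the ones Kaldi uses internally.

```python
# Condensed numpy sketch of MFCC extraction. Parameters (frame size, hop,
# filter count, cepstral count) are illustrative defaults.
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc(signal, sample_rate=8000, frame_len=200, hop=80,
         n_filters=23, n_ceps=13):
    # 1) Slice the signal into overlapping frames; apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 2) Power spectrum of each frame.
    n_fft = 256
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3) Triangular filters equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 4) Log filterbank energies, then DCT-II to decorrelate.
    log_energies = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * n + 1) / (2.0 * n_filters)))
    return log_energies @ dct.T
```

In practice the recipe simply calls steps/make_mfcc.sh, which wraps Kaldi's own (far more optimized) feature extraction binaries.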

MODEL BUILDING:

I've built the models using triphone context. The following improvements have been tried:
Improvement using Minimum Phone Error (MPE) training.
Improvement using Maximum Likelihood training.
Improvement using SGMMs.
Improvements using hyperparameter tuning of DNNs.

I'll explain each of these in detail in my next post.

The next step is decoding the test files using the models built. In my previous post I've mentioned the procedure to preprocess the test files so they can be decoded using the models built with Kaldi.
The code for both posts is updated. Please feel free to have a look and suggest changes. Both posts contain readme files and a run.sh to help navigate through the code.

*******************************************************************************************

Progress so far:

Model Building:
Built Acoustic Models for  the following languages:
English
Spanish
German
French

*********************************************************************************************
What's next:

Model Building:
Danish
Swedish
Norwegian

Model Deployment:
English

Wrapper:
Python wrapper to cover the decoding process.

Documentation

**********************************************************************************************

What will be in the next post:

a) Improving the transcripts using different techniques
I'll be explaining the variants of the training procedures and the ways to fine-tune them to improve the accuracy. Basically the following:
Monophone vs triphone training.
Improvement using Minimum Phone Error (MPE) training.
Improvement using Maximum Likelihood training.
Improvement using SGMMs.
Improvements using hyperparameter tuning of DNNs.

b) Model building :
How to build models when Voxforge data is not available.

## Friday, 5 June 2015

### An Introduction to Speech Processing and Forced Alignment of Speech and Text

This is the first in a series of posts explaining the techniques used and procedures followed to achieve forced alignment of speech and text as a part of Google Summer of Code 2015.

Let's think about it. What is this forced alignment, and why is it important? From the title it seems very intuitive, right? We have speech and we have text... what's so big about aligning the two?

Here are the questions:

Why even bother about different techniques to achieve this?
What's so special about speech processing as a whole?
Speech is just a signal, right? Apply DSP techniques and get over it.

What's in Speech Processing ?

Speech processing is the study of speech signals and the methods used to process them. The specific and interesting properties of speech (the natural variations, quasi-stationarity, aperiodicity, uff...) can be a bit mind-boggling once we start learning a bit more about speech, and definitely intimidating after a point.

" Lets write rules and capture the variations" ......someone thought.

Fair enough... but how many? Each and every phoneme (the basic unit of speech... much like the atom in the physical world) has its own specific properties, and what's cooler, the rules for individual phonemes modify effortlessly when they combine with other phonemes. Consonant clusters are a headache. Coarticulation is a curse in disguise if not taken care of. What's more, the rules depend upon the language. Too many things to consider. Lol !! And there are shameless exceptions.

"Machine Learning" .........A light bulb blew in the engineer's mind. Lets make machines learn. Its difficult to design a machine that can defeat human in chess but really easy and straight forward to design a machine which can understand and interpret humans was the idea. How cute !!!!!. Fast forward ten years and the exact opposite happened. We have chess as the basic game on our PC and the natural language processing is still a distant dream !!!!

So, the way out... at least for now... seems to be to take the best of both worlds: apply machine learning and cover the exceptions using rules (sort of). This is what I'll be doing. Push the baseline, see the results, cover the exceptions, push version 1... see exceptions... and somewhere around version 5 we're gonna get near-100% accuracy.

Forced Alignment:
Forced alignment is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment.

Large recordings like radio broadcasts or audio books are an interesting additional resource for speech research but often have nothing but an orthographic transcription available in terms of annotation. The alignment between text and speech is an important first step for further processing. Conventionally, this is done by using automatic speech recognition (ASR) on the speech corpus and then aligning recognition result and transcription.
The most common approach to this text-to-speech alignment is to use a standard commercial speech recognizer to produce a time-aligned transcription of the recording, which is then aligned with the original text transcript at the graphemic level. The idea was introduced by Robert-Ribes (the system was called FRANK).

Approach:

Now that we know what alignment is, it seems almost effortless to apply a speech recognizer on the audio file and get the transcriptions. Or is it? The process is straightforward for smaller audio files. Processing a huge audio file, however, is a computationally intensive process.

An intelligent and logical approach to solve the same problem would be to break the huge audio file into smaller chunks.

Cool. But where do we chunk the audio? Obviously we don't want to chunk in the middle of speech. So... cut it at the silence. Maybe? Here is the basic formulation of the audio chunking module that I'm gonna use:

Silence Detection:
Speech signals usually contain many regions of silence or noise. In case the speech is free from noise, a threshold based on signal energy usually suffices for detecting silence regions. I've described the process of obtaining signal energy in detail in one of my earlier posts.

In the current scenario, though, I'm expecting the presence of environmental noise (traffic, birds, etc.) along with the speech. Therefore, I'm trying out another feature in addition to the signal energy. I'm looking for a feature which emphasizes the brighter sounds, per se. As of now, something close to the desired function is the spectral centroid. Roughly speaking, it divides the entire spectrum into two parts: a high-frequency region and a low-frequency one.
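The spectral centroid is just the "center of mass" of a frame's magnitude spectrum; brighter sounds with more high-frequency energy push it up. A minimal numpy sketch:

```python
# Spectral centroid of a single audio frame: the magnitude-weighted mean
# frequency of its spectrum. Higher values = "brighter" frames.
import numpy as np

def spectral_centroid(frame, sample_rate):
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # Small epsilon guards against division by zero on an all-zero frame.
    return float((freqs * spectrum).sum() / (spectrum.sum() + 1e-10))
```

For a pure 300 Hz tone the centroid sits at about 300 Hz, while a 2000 Hz tone pushes it to about 2000 Hz, which is exactly the separation behaviour we want between dull and bright frames.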

Feature Calculation:

An excerpt from a post of one of my joint blogs:

Speech Segments Detection:

Here's the basic algo to get the speech segments:
1) Compute the histogram of the feature sequence's values.
2) Apply a smoothing filter on the histogram.
3) Detect the histogram's local maxima.
4) Let M1 and M2 be the positions of the first and second local maxima respectively. The threshold value is computed as T = (W·M1 + M2)/(W + 1).
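The steps above can be implemented directly. In this sketch, the histogram bin count, the smoothing width, and the weight W are illustrative choices, not values fixed by the algorithm:

```python
# Direct implementation of the thresholding steps above. n_bins, smooth
# and W are illustrative parameter choices.
import numpy as np

def compute_threshold(values, n_bins=30, smooth=3, W=5.0):
    # 1) Histogram of the feature sequence's values.
    counts, edges = np.histogram(values, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    # 2) Smooth the histogram with a moving average.
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(counts, kernel, mode="same")
    # 3) Local maxima of the smoothed histogram.
    maxima = [i for i in range(1, n_bins - 1)
              if smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]]
    if len(maxima) < 2:
        # Degenerate case: fall back to the single dominant mode.
        return float(centers[int(np.argmax(smoothed))])
    # 4) T = (W*M1 + M2) / (W + 1), with M1, M2 the first two maxima positions.
    M1, M2 = centers[maxima[0]], centers[maxima[1]]
    return float((W * M1 + M2) / (W + 1.0))
```

With W > 1 the threshold is pulled toward M1 (typically the silence/noise mode), so quiet frames below T are treated as candidate cut points.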

Speech Recognition:

As of now I'm using the speech recognition module given by the Kaldi recipe, with the acoustic model trained on Librispeech data. I'll be explaining the fundamentals of recognition in the next post.