Friday, 5 June 2015

An Introduction to Speech Processing and Forced Alignment of Speech and Text

This is the first in a series of posts explaining the techniques and procedures I'm following to achieve forced alignment of speech and text as part of Google Summer of Code 2015.

Let's think about it. What is forced alignment, and why is it important? From the title it seems very intuitive, right? We have speech and we have text... what's so big about aligning the two?

Here are the questions:

Why even bother with different techniques to achieve this?
What's so special about speech processing as a whole?
Speech is just a signal, right? Apply DSP techniques and be done with it.

What's in Speech Processing?

Speech processing is the study of speech signals and of the methods used to process them. The specific and interesting properties of speech (its natural variations, quasi-stationarity, aperiodicity, phew...) can be a bit boggling once we start learning more about speech, and definitely intimidating after a point.

"Let's write rules and capture the variations"... someone thought.

Fair enough... but how many? Each and every phoneme (the basic unit of speech, much like atoms in the physical world) has its own specific properties, and what's even cooler, the rules for individual phonemes change effortlessly when they combine with other phonemes. Consonant clusters are a headache. Coarticulation is a curse in disguise if not taken care of. What's more, the rules depend on the language. Too many things to consider. Lol!! And there are shameless exceptions.

"Machine Learning"... A light bulb went on in the engineer's mind. Let's make machines learn. The idea was that it's difficult to design a machine that can defeat a human at chess, but really easy and straightforward to design one that can understand and interpret humans. How cute!!! Fast forward ten years and the exact opposite happened. We have chess as a basic game on our PCs, and natural language processing is still a distant dream!

So the way forward, at least for now, seems to be to take the best of both worlds: apply machine learning and cover the exceptions using rules (sort of). This is what I'll be doing. Push the baseline, see the results, cover the exceptions, push version 1, see the exceptions... and somewhere down the line, version 5 is gonna give us near 100% accuracy.

Forced Alignment:
Forced alignment is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment.

Large recordings like radio broadcasts or audio books are an interesting additional resource for speech research, but often the only annotation available for them is an orthographic transcription. The alignment between text and speech is an important first step for further processing. Conventionally, this is done by running automatic speech recognition (ASR) on the speech corpus and then aligning the recognition result with the transcription.
The most common approach to this text-to-speech alignment is to use a standard commercial speech recognizer to produce a time-aligned transcription of the recording, which is then aligned with the original text transcript on a graphemic level. The idea was introduced by Robert-Ribes (the system was called FRANK).
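To make the "align the recognition result with the transcript" step a bit more concrete, here is a minimal sketch. It is not the FRANK system or any particular recognizer's API: the function name `align_words`, the tuple layout and the timestamps are all made up for illustration, and it matches whole words with Python's `difflib` instead of working on the graphemic level.

```python
from difflib import SequenceMatcher


def align_words(recognized, transcript):
    """Copy timestamps from recognizer output onto the reference transcript.

    recognized: list of (word, start_sec, end_sec) tuples (hypothetical ASR output).
    transcript: list of reference words, i.e. the known text.
    Returns a list of (word, start_sec, end_sec); the times are None for
    transcript words that the recognizer output did not match.
    """
    ref_words = [w.lower() for w in transcript]
    hyp_words = [w.lower() for w, _, _ in recognized]
    aligned = [(w, None, None) for w in transcript]

    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Each block says: ref_words[a:a+size] == hyp_words[b:b+size]
        for k in range(block.size):
            _, start, end = recognized[block.b + k]
            aligned[block.a + k] = (transcript[block.a + k], start, end)
    return aligned


# Made-up example: the recognizer misheard "brown" as "crown".
recognized = [("the", 0.0, 0.2), ("quick", 0.2, 0.5),
              ("crown", 0.5, 0.9), ("fox", 0.9, 1.3)]
transcript = ["The", "quick", "brown", "fox"]
for word, start, end in align_words(recognized, transcript):
    print(word, start, end)
```

Words the recognizer got wrong end up without timestamps; a real system would still have to fill those gaps, for example from the neighbouring matched words.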


Now that we know what alignment is, it's almost effortless to run a speech recognizer on the audio file and get the transcription. Or is it? The process is straightforward for smaller audio files. That said, it goes without saying that processing a huge audio file is computationally intensive.

An intelligent and logical approach to solve the same problem would be to break the huge audio file into smaller chunks. 

Cool. But where to chunk the audio? Obviously we don't want to cut in the middle of speech. So... cut it at the silences, maybe? Here is the basic formulation of the audio chunking module that I'm gonna use:
Silence Detection:
Speech signals usually contain many regions of silence or noise. If the speech is free from noise, a threshold on the signal energy usually suffices for detecting the silent regions. I've described the process of obtaining signal energy in detail in one of my other blog posts.
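As a rough illustration of the energy-threshold idea (not necessarily the formulation from that other post), here is a small sketch in plain Python. The frame length, hop size and threshold value are arbitrary assumptions chosen for the toy signal, and `short_time_energy` and `silent_frames` are just illustrative names.

```python
def short_time_energy(samples, frame_len, hop):
    """Return the mean squared amplitude of each frame of the signal."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def silent_frames(energies, threshold):
    """Mark a frame as silent when its energy falls below the threshold."""
    return [e < threshold for e in energies]


# Toy signal: silence, a burst of "speech", then silence again.
samples = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
energies = short_time_energy(samples, frame_len=4, hop=4)
print(silent_frames(energies, threshold=0.01))
```

On clean audio a fixed threshold like this can be enough; with background noise the "silent" frames no longer have near-zero energy, so the threshold has to be chosen more carefully.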

In the current scenario, though, I'm expecting environmental noise (traffic, birds, etc.) along with the speech. Therefore, I'm trying out another feature in addition to signal energy. I'm looking for a feature that emphasizes the brighter sounds, so to speak. As of now, the closest thing to the desired function is the spectral centroid. Roughly speaking, it is the centre of gravity of the spectrum, so it divides the spectrum into two parts: a low-frequency region and a high-frequency one.
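For reference, the spectral centroid of a frame is usually computed as the magnitude-weighted mean frequency, C = sum(f_k * |X_k|) / sum(|X_k|). The sketch below uses a naive DFT only to stay dependency-free (an FFT library would be the practical choice); the function names, the frame size and the sample rate are assumptions for the example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), fine for a small demo)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: no spectrum to weigh
    n = len(frame)
    return sum((k * sample_rate / n) * m for k, m in enumerate(mags)) / total


# Two pure tones: the "brighter" one should get the higher centroid.
rate, n = 8000, 64
low = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(n)]
high = [math.sin(2 * math.pi * 3000 * t / rate) for t in range(n)]
print(spectral_centroid(low, rate), spectral_centroid(high, rate))
```

Frames dominated by low-frequency rumble pull the centroid down, while brighter sounds push it up, which is why it can complement energy as a feature.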

Feature Calculation:

An excerpt from a post of one of my joint blogs:

Speech Segments Detection:

Here's the basic algo to get the speech segments:
1) Compute the histogram of the feature sequence’s values.
2) Apply a smoothing filter on the histogram.
3) Detect the histogram’s local maxima.
4) Let M1 and M2 be the positions of the first and second local maxima respectively. The threshold value is computed using T = (W·M1 + M2)/(W+1), where W is a weight: the larger W is, the closer T sits to M1.
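The four steps above can be sketched roughly as follows. This is only one possible reading: the bin count, the moving-average smoother, the way peaks are picked, and taking "first and second" to mean the two lowest-valued peaks are all assumptions, and the value of W is left to the caller.

```python
def histogram(values, n_bins):
    """Equal-width histogram; returns (counts, bin_centers)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: avoid division by zero
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return counts, centers


def smooth(counts, radius=1):
    """Zero-padded moving average over 2*radius + 1 bins."""
    size = 2 * radius + 1
    padded = [0] * radius + list(counts) + [0] * radius
    return [sum(padded[i:i + size]) / size for i in range(len(counts))]


def local_maxima(seq):
    """Indices strictly above the left neighbour and not below the right one."""
    peaks = []
    for i, v in enumerate(seq):
        left = seq[i - 1] if i > 0 else float("-inf")
        right = seq[i + 1] if i < len(seq) - 1 else float("-inf")
        if v > left and v >= right:
            peaks.append(i)
    return peaks


def estimate_threshold(values, weight, n_bins=10):
    """T = (W*M1 + M2) / (W + 1), with M1, M2 the first two histogram peaks."""
    counts, centers = histogram(values, n_bins)
    peaks = local_maxima(smooth(counts))
    if len(peaks) < 2:
        raise ValueError("need at least two local maxima in the histogram")
    m1, m2 = centers[peaks[0]], centers[peaks[1]]
    return (weight * m1 + m2) / (weight + 1)


# Toy feature values: a "silence" cluster near 0.15 and a "speech" cluster near 0.85.
values = [0.0] + [0.15] * 6 + [0.25] * 2 + [0.55] + [0.75] * 2 + [0.85] * 6 + [1.0]
print(estimate_threshold(values, weight=5.0))
```

With W = 1 the threshold lands midway between the two peaks; a larger W moves it towards M1, so more frames end up above the threshold.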

Speech Recognition:

As of now I'm using the speech recognition module from a Kaldi recipe, with the acoustic model built on the LibriSpeech data. I'll be explaining the fundamentals of recognition in the next post.

