Featured Research

from universities, journals, and other organizations

Automatic speaker tracking in audio recordings

Date: October 18, 2013
Source: Massachusetts Institute of Technology
Summary: A new system dispenses with the human annotation of training data required by its predecessors but achieves comparable results.

A central topic in spoken-language-systems research is what's called speaker diarization, or computationally determining how many speakers feature in a recording and which of them speaks when. Speaker diarization would be an essential function of any program that automatically annotated audio or video recordings.

To date, the best diarization systems have used what's called supervised machine learning: They're trained on sample recordings that a human has indexed, indicating which speaker enters when. In the October issue of IEEE Transactions on Audio, Speech, and Language Processing, however, MIT researchers describe a new speaker-diarization system that achieves comparable results without supervision: No prior indexing is necessary.

Moreover, one of the MIT researchers' innovations was a new, compact way to represent the differences between individual speakers' voices, which could be of use in other spoken-language computational tasks.

"You can know something about the identity of a person from the sound of their voice, so this technology is keying in to that type of information," says Jim Glass, a senior research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and head of its Spoken Language Systems Group. "In fact, this technology could work in any language. It's insensitive to that."

To create a sonic portrait of a single speaker, Glass explains, a computer system will generally have to analyze more than 2,000 different acoustic features; many of those may correspond to familiar consonants and vowels, but many may not. To characterize each of those features, the system might need about 60 variables, which describe properties such as the strength of the acoustic signal in different frequency bands.

E pluribus tres

With some 2,000 features at about 60 variables each, a diarization system would have to search a space with 120,000 dimensions for every second of a recording, which would be prohibitively time-consuming. In prior work, Najim Dehak, a research scientist in the Spoken Language Systems Group and one of the new paper's co-authors, had demonstrated a technique for reducing the number of variables required to describe the acoustic signature of a particular speaker. The resulting compact representation is dubbed the i-vector.

To get a sense of how the technique works, imagine a graph that plotted, say, hours worked by an hourly worker against money earned. The graph would be a diagonal line in a two-dimensional space. Now imagine rotating the axes of the graph so that the x-axis is parallel to the line. All of a sudden, the y-axis becomes irrelevant: All the variation in the graph is captured by the x-axis alone.
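
The rotation in this analogy can be checked numerically. The sketch below is only an illustration: the hourly rate of $20 and the five sample points are assumptions, not figures from the researchers. It re-expresses each point in axes rotated to lie along the line and shows that the second coordinate vanishes.

```python
import math

# Hypothetical data: hours worked and money earned at an assumed $20 per hour.
RATE = 20.0
points = [(float(h), RATE * h) for h in range(1, 6)]

# Angle between the line earnings = RATE * hours and the hours axis.
theta = math.atan2(RATE, 1.0)


def rotate(point, angle):
    """Express a point in axes rotated counterclockwise by `angle`."""
    x, y = point
    new_x = x * math.cos(angle) + y * math.sin(angle)
    new_y = -x * math.sin(angle) + y * math.cos(angle)
    return new_x, new_y


rotated = [rotate(p, theta) for p in points]
for (x, y), (rx, ry) in zip(points, rotated):
    # After rotation the second coordinate is zero up to rounding error.
    print(f"({x:4.1f}, {y:6.1f}) -> ({rx:8.3f}, {ry:8.3f})")
```

Every rotated point keeps a distinct first coordinate, so nothing is lost by discarding the second axis.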

Similarly, i-vectors find new axes for describing the information that characterizes speech sounds in the 120,000-dimension space. The technique first finds the axis that captures most of the variation in the information, then the axis that captures the next-most variation, and so on. So the information added by each new axis steadily decreases.
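
The idea of ranking axes by how much variation they capture can be sketched with principal component analysis on toy data. This is a stand-in for illustration, not the i-vector extraction itself, which the article describes only by analogy. The synthetic three-dimensional data, the function names, and the use of power iteration are all assumptions made for the example.

```python
import random


def covariance(data):
    """Sample covariance matrix of the rows in `data`."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_axis(matrix, iterations=500):
    """Power iteration: dominant axis of a symmetric matrix and its variance."""
    d = len(matrix)
    vec = [float(i + 1) for i in range(d)]  # arbitrary non-symmetric start
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    norm = sum(v * v for v in vec) ** 0.5
    vec = [v / norm for v in vec]
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(d) for j in range(d))
    return vec, value


def principal_variances(data, k):
    """Variance captured by each of the first k axes, largest first."""
    cov = covariance(data)
    d = len(cov)
    variances = []
    for _ in range(k):
        vec, value = top_axis(cov)
        variances.append(value)
        # Deflate: remove the variation already explained by this axis.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return variances


# Synthetic data: three hidden factors of very different strength, mixed
# so that no original axis lines up with any of them.
rng = random.Random(0)
data = []
for _ in range(500):
    a, b, c = rng.gauss(0, 10), rng.gauss(0, 3), rng.gauss(0, 0.5)
    data.append([a + b, a - b + c, 0.5 * a + c])

variances = principal_variances(data, 3)
total = sum(variances)
for rank, v in enumerate(variances, start=1):
    print(f"axis {rank}: {100 * v / total:.1f}% of the variation")
```

Each successive axis accounts for a smaller share of the total, which is why truncating to the first few axes loses little information.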

Stephen Shum, a graduate student in MIT's Department of Electrical Engineering and Computer Science and lead author on the new paper, found that a 100-variable i-vector -- a 100-dimension approximation of the 120,000-dimension space -- was an adequate starting point for a diarization system. Since i-vectors are intended to describe every possible combination of sounds that a speaker might emit over any span of time, and since a diarization system needs to classify only the sounds on a single recording, Shum was able to use similar techniques to reduce the number of variables even further, to only three.
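
The sizes quoted in the article give a sense of how aggressive this reduction is. The arithmetic below uses only the round numbers reported above (2,000 features, 60 variables, 100 and 3 dimensions), which the article itself presents as approximate. The variable names are chosen for the example.

```python
features = 2_000            # acoustic features per speaker portrait
variables_per_feature = 60  # variables describing each feature

full_dims = features * variables_per_feature
ivector_dims = 100          # Shum's starting i-vector size
final_dims = 3              # size after the per-recording reduction

print(full_dims)                  # 120000
print(full_dims / ivector_dims)   # 1200.0
print(full_dims / final_dims)     # 40000.0
```

In other words, the i-vector is about 1,200 times smaller than the full space, and the three-variable representation is 40,000 times smaller.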

Birds of a feather

For every second of sound in a recording, Shum thus ends up with a single point in a three-dimensional space. The next step is to identify the bounds of the clusters of points that correspond to the individual speakers. For that, Shum used an iterative process. The system begins with an artificially high estimate of the number of speakers -- say, 15 -- and finds a cluster of points that corresponds to each one.

Clusters that are very close to each other then coalesce to form new clusters, until the distances between them grow too large to be plausibly bridged. The process then repeats, beginning each time with the same number of clusters that it ended with on the previous iteration. Finally, it reaches a point at which it begins and ends with the same number of clusters, and the system associates each cluster with a single speaker.
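
The overall loop can be sketched in a few lines. This is a simplified illustration, not the researchers' implementation: it substitutes plain k-means and a fixed distance threshold for whatever statistical model the actual system uses. The function names, the `max_gap` threshold, the evenly spaced initial centers, and the synthetic three-speaker data are all assumptions.

```python
import math
import random


def mean(points):
    """Coordinate-wise average of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit_clusters(points, centers, rounds=20):
    """Plain k-means refinement; clusters left with no points are dropped."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        centers = [mean(g) for g in groups if g]
    return centers


def merge_close(centers, max_gap):
    """Merge the closest pair of centers until every gap exceeds max_gap."""
    centers = list(centers)
    while len(centers) > 1:
        gap, i, j = min((math.dist(centers[i], centers[j]), i, j)
                        for i in range(len(centers))
                        for j in range(i + 1, len(centers)))
        if gap > max_gap:
            break
        merged = mean([centers[i], centers[j]])
        centers = [c for k, c in enumerate(centers) if k not in (i, j)]
        centers.append(merged)
    return centers


def estimate_speakers(points, start=15, max_gap=3.0):
    """Repeat fit-and-merge until a pass begins and ends with the same count."""
    step = max(1, len(points) // start)
    centers = points[::step][:start]  # deliberately too many clusters
    while True:
        before = len(centers)
        centers = merge_close(fit_clusters(points, centers), max_gap)
        if len(centers) == before:
            return centers


# Synthetic recording: three well-separated speakers in a 3-D feature space.
rng = random.Random(1)
true_centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 10.0)]
points = [tuple(c + rng.gauss(0, 0.5) for c in center)
          for center in true_centers for _ in range(60)]

found = estimate_speakers(points)
print(f"estimated number of speakers: {len(found)}")
```

Because the cluster count can only shrink from one pass to the next, the loop is guaranteed to stop. How many clusters survive depends on the threshold, which in this sketch is simply chosen to be smaller than the spacing between the synthetic speakers.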

"What was completely not obvious, what was surprising, was that this i-vector representation could be used on this very, very different scale, that you could use this method of extracting features on very, very short speech segments, perhaps one second long, corresponding to a speaker turn in a telephone conversation," Kenny adds. "I think that was the significant contribution of Stephen's work."


Story Source:

The above story is based on materials provided by Massachusetts Institute of Technology. The original article was written by Larry Hardesty. Note: Materials may be edited for content and length.


Journal Reference:

  1. Stephen H. Shum, Najim Dehak, Reda Dehak, James R. Glass. Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach. IEEE Transactions on Audio, Speech, and Language Processing, 2013; 21 (10): 2015 DOI: 10.1109/TASL.2013.2264673

Cite This Page:

Massachusetts Institute of Technology. "Automatic speaker tracking in audio recordings." ScienceDaily. ScienceDaily, 18 October 2013. <www.sciencedaily.com/releases/2013/10/131018132246.htm>.
Massachusetts Institute of Technology. (2013, October 18). Automatic speaker tracking in audio recordings. ScienceDaily. Retrieved October 25, 2014 from www.sciencedaily.com/releases/2013/10/131018132246.htm
Massachusetts Institute of Technology. "Automatic speaker tracking in audio recordings." ScienceDaily. www.sciencedaily.com/releases/2013/10/131018132246.htm (accessed October 25, 2014).
