Building "Zombie" Voice Models
Audio-Mining the Past Using Voice Recognition for Transcription of Audio Artifacts
A poster I created about the project for my UT Austin iSchool research conference presentation.
For my capstone graduation project at UT Austin’s School of Information, I applied DocSoft:AV audio mining software to contemporary and archival audio recordings to establish a workflow and general guidelines for transcription of audio artifacts. DocSoft:AV is a speech-to-text transcription program that can be trained to the nuances of particular voices, creating a voice model for each speaker who uses the software. The software is conventionally used to create computer-generated transcripts of recorded lectures, readings, and other contemporary performances. I used it with both contemporary recordings made under ideal conditions and archival recordings of highly variable quality.
My project had three main goals:
- Build a voice model of monologuist Spalding Gray using archival recordings of Gray's theatrical performances, and then use this voice model to transcribe Gray's recordings to make them available for research use.
- Document best practices for automated audio transcription.
- Create video tutorials and documentation on DocSoft:AV for use by UT Libraries staff.
Conclusions
From this project, I concluded that the DocSoft:AV software is most responsive to studio-quality recordings with a clear speaker and minimal noise. High-resolution audio files are no guarantee of good transcription results, and the "ideal" amount of training needed to generate a reliable voice model varies depending on the quality of the recordings. While some archival audio recordings can be successfully transcribed using this software, there are many variables to consider when selecting recordings for automated transcription, so use of archival audio recordings with DocSoft:AV should be considered on a case-by-case basis.
More Information
For my project, I created three voice models (unique speaker profiles in DocSoft:AV) to test the transcription results under different recording conditions. For my first voice model, of Spalding Gray, the resulting transcripts were full of errors. The live recordings were of variable quality and featured crowd noise, background music, and ambient noise. Spalding also spoke very quickly with a heavy Rhode Island accent and used an unconventional vocabulary in his monologues. All these factors likely contributed to DocSoft's substantial transcription errors.
DocSoft's first attempt to transcribe a Spalding recording (left), and second attempt, after more training (right). You can see how the software eventually learns some words over time, but also makes new mistakes.
As a second test case, I created a voice model for Cecil Baldwin, the monologuist of the popular podcast Welcome to Night Vale. After training on a few recordings, DocSoft:AV performed very well when generating transcripts of Cecil's voice. Like the Spalding recordings, these recordings contained background music and unusual vocabulary. Unlike the Spalding recordings, however, the Cecil recordings were made in a studio, and the speaker spoke slowly and clearly. These factors seemed to make a tremendous difference in DocSoft's success.
As a third test, I created a voice model of my own voice, recorded in a studio. Because I knew the resulting recording would be transcribed, I made sure to speak slowly and clearly, which greatly improved DocSoft:AV's accuracy. Even after training on just one file, the transcript results were nearly flawless. This suggests that recording with the intent to transcribe and speaking in a way that caters to the software can also improve transcription accuracy.
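The three test cases above are described in qualitative terms ("full of errors," "nearly flawless"). A common way to put a number on transcript accuracy is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the machine transcript into a hand-corrected reference, divided by the length of the reference. The Python sketch below is illustrative only. It is not a feature of DocSoft:AV, the function name `word_error_rate` is hypothetical, and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; not part of DocSoft:AV.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up sample data: one substituted word and one dropped word
# out of a four-word reference.
reference = "the quick brown fox"
machine_transcript = "the quack brown"
print(word_error_rate(reference, machine_transcript))  # 0.5
```

Scoring a transcript after each round of training against the same hand-corrected reference would show whether additional training lowers the error rate or, as in the Spalding example, trades old mistakes for new ones.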
DocSoft Video Tutorials
Using Camtasia Studio, I created two video tutorials to teach University of Texas staff how to use DocSoft:AV and its corresponding transcript editor, DocSoft:TE. These videos are also currently hosted on the UT Austin iSchool Glifos page.