Model of a Male Vocal Tract (with processor showing scale), c. 1966. © Noriko Umeda Archive, Tokyo, Japan

Project (2019-2020)

From Text to Speech: Re-Embodying Voice and Disembodying Difference

Historians of science have narrated the development of sonic technology as a process of cleaving sounds from the humans who produce them. Biological sounds (speech, grunts, and gasps) all originate within the body, whereas most mechanical sounds do not. A phonograph or a tape recording might recreate a person’s utterances, but it doesn’t work the way a vocal tract does. This holds for a wide range of sonic technologies, but not all: in the mid-twentieth century, the speech synthesis community began fashioning its devices after an understanding of human anatomy. With artificial speech, the body once again became vital.

In 1968, Japanese researcher Noriko Umeda finished the first English text-to-speech system, generating a computer voice from plain text. For Umeda, modeling speech through the body promised to allow for the easy implementation of difference. Speech synthesis had struggled to create a naturalistic female voice, but researchers hoped that restoring the body to this technology, literally “re-embodying” it, would infuse language with specificity, gendered and otherwise. Generalizable variations in the sizes of the vocal cavities between men, women, and children could now be algorithmically programmed.

I examine how Umeda and others charted an increasingly precise and complicated linguistic map to render written English into ever more natural-sounding speech. As text-to-speech reinscribed constructed corporeal and social categories, it seemed that the human voice and its constituent elements (phonological diversity, gender, age, and national origin) might be generated from a set of easily replicable formulas. Synthetic speech, as a performance of identity through sound, thus began to pose the question: Is representation reducible to algorithmic description?