Deepfake audio has a tell

Imagine the following scenario. A phone rings. An office worker answers it and hears his boss, in a panic, telling him that she forgot to transfer money to the new contractor before leaving for the day and needs him to do it. She gives him the bank transfer information, and with the money transferred, the crisis is averted.

The worker sits back in his chair, takes a deep breath and watches his boss walk through the door. The voice on the other end of the call was not his boss. In fact, it wasn’t even human. The voice he heard was that of an audio deepfake, a machine-generated audio sample designed to sound exactly like his boss.

Attacks like this using recorded audio have happened before, and conversational audio deepfakes may not be far off.

Deepfakes, both audio and video, have become possible only with the recent development of sophisticated machine learning technologies. Deepfakes have brought with them a new level of uncertainty around digital media. To detect deepfakes, many researchers have turned to analyzing visual artifacts – small glitches and inconsistencies – found in video deepfakes.

Audio deepfakes potentially pose an even greater threat, as people often communicate verbally without video – for example, via phone calls, radio and voice recordings. These voice communications greatly expand the possibilities for attackers to use deepfakes.

To detect audio deepfakes, we and our fellow researchers at the University of Florida have developed a technique that measures the acoustic and fluid dynamic differences between voice samples created organically by human speakers and those generated synthetically by computers.

Organic voices vs synthetic voices

Humans vocalize by forcing air over the various structures of the vocal tract, including the vocal cords, tongue, and lips. By repositioning these structures, a speaker alters the acoustic properties of the vocal tract, making it possible to produce more than 200 distinct sounds, or phonemes. However, human anatomy fundamentally limits the acoustic behavior of these phonemes, resulting in a relatively narrow range of correct sounds for each.

In contrast, audio deepfakes are created by first allowing a computer to listen to audio recordings of a targeted victim speaker. Depending on the exact techniques used, the computer may need to listen to as little as 10-20 seconds of audio. This audio is used to extract key information about the unique aspects of the victim’s voice.

The attacker selects a phrase for the deepfake to speak, then, using a modified speech synthesis algorithm, generates an audio sample that sounds like the victim speaking the selected phrase. This process of creating a single deepfake audio sample can be accomplished in seconds, potentially giving attackers enough flexibility to use the deepfake voice in a conversation.
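
In outline, the attack workflow looks like the sketch below. Both stages are represented by hypothetical placeholder functions that return silence – the point is the shape of the two-step pipeline described above, not working synthesis code, and every name and parameter in it is invented for illustration.

```python
# A structural sketch of the two-stage voice-cloning workflow described above.
# The encoder and synthesizer are hypothetical placeholders returning silence;
# they only illustrate the shape of the pipeline, not real synthesis.
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate for the placeholder arrays


def extract_voice_profile(reference_audio):
    """Stage 1: distill the key characteristics of the victim's voice from a
    short reference recording (placeholder: a real model learns an embedding)."""
    return np.zeros(256)


def synthesize_phrase(text, voice_profile):
    """Stage 2: generate audio of the chosen phrase in the victim's voice
    (placeholder: a real model outputs a waveform)."""
    return np.zeros(SAMPLE_RATE)  # one second of silence stands in for speech


# Roughly 10-20 seconds of the victim's voice can be enough reference material.
reference = np.zeros(15 * SAMPLE_RATE)
profile = extract_voice_profile(reference)
fake_audio = synthesize_phrase("Please send the wire transfer today.", profile)
```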

Detection of audio deepfakes

The first step in differentiating human-produced speech from deepfake-generated speech is to understand how to acoustically model the vocal tract. Fortunately, scientists have techniques for estimating what someone – or some being such as a dinosaur – would sound like based on anatomical measurements of its vocal tract.

We did the reverse. By inverting several of these same techniques, we were able to extract an approximation of a speaker’s vocal tract from a segment of speech. This allowed us to effectively peer into the anatomy of the speaker who created the audio sample.
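
For a concrete sense of what such an estimate can involve, the sketch below uses a classical textbook approach: fit a linear predictive coding (LPC) model to a short speech frame, then map its reflection coefficients onto the relative cross-sectional areas of a lossless-tube model of the vocal tract. This illustrates the general idea rather than our full estimation pipeline, and the frame length, model order and tube convention are assumptions chosen for the example.

```python
# A minimal sketch: approximate a relative vocal tract area function from one
# speech frame via LPC and a lossless-tube (Kelly-Lochbaum style) model.
# Illustrative only – not the estimation pipeline used in the study.
import numpy as np


def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns (a, k): prediction coefficients a[0..order] (with a[0] == 1) and
    the reflection coefficients k[0..order-1] produced along the way.
    """
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # r[0..order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k_i = -acc / err
        k[i - 1] = k_i
        a[1:i] = a[1:i] + k_i * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k_i
        err *= 1.0 - k_i ** 2
    return a, k


def area_function(k):
    """Map reflection coefficients to relative tube cross-sectional areas.

    Uses A[i+1] = A[i] * (1 - k[i]) / (1 + k[i]); sign and direction
    conventions vary in the literature, so treat the result as relative shape.
    """
    areas = [1.0]
    for k_i in k:
        areas.append(areas[-1] * (1.0 - k_i) / (1.0 + k_i))
    return np.array(areas)


if __name__ == "__main__":
    # Stand-in for a 30 ms voiced frame at 16 kHz: harmonics plus a little noise.
    fs = 16000
    t = np.arange(int(0.03 * fs)) / fs
    frame = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
    frame += 0.01 * np.random.default_rng(0).standard_normal(t.size)
    frame *= np.hamming(t.size)

    _, refl = lpc_coefficients(frame, order=18)
    print(area_function(refl))  # relative areas along the modeled tube
```

Plotting the resulting areas against position along the tube gives a rough silhouette of the tract that produced the frame.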

Deepfake audio often results in vocal tract reconstructions that resemble drinking straws rather than biological vocal tracts. (Logan Blue et al./CC BY-ND)

From there, we hypothesized that deepfake audio samples would not be constrained by the same anatomical limitations as humans. In other words, we expected that analyzing deepfake audio samples would yield estimated vocal tract shapes that do not exist in people.

The results of our tests not only confirmed our hypothesis, but revealed something interesting. When extracting vocal tract estimates from deepfake audio, we found that the estimates were often comically incorrect. For example, it was common for deepfake audio to result in vocal tracts having the same relative diameter and consistency as a drinking straw, unlike human vocal tracts, which are much wider and more variable in shape.

This realization demonstrates that deepfake audio, even when compelling to human listeners, is far from indistinguishable from human-generated speech. By estimating the anatomy responsible for creating the observed speech, it is possible to identify whether the audio was generated by a person or a computer.
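
As a toy illustration of that final check, the snippet below flags an estimated area function that is implausibly narrow and uniform – more drinking straw than human vocal tract. The thresholds are placeholders invented for this example; a real detector would compare estimates against ranges measured from organic speech across many phonemes, rather than fixed hand-picked cutoffs.

```python
# A toy plausibility check on a relative vocal tract area estimate.
# Thresholds are illustrative placeholders, not values from the study.
import numpy as np


def looks_straw_like(relative_areas, min_variation=0.5, min_range_ratio=2.0):
    """Return True if the estimated tube is suspiciously uniform in width.

    relative_areas: estimated cross-sectional areas along the tube.
    min_variation: smallest coefficient of variation expected of a human tract.
    min_range_ratio: smallest widest-to-narrowest ratio expected of a human tract.
    """
    areas = np.abs(np.asarray(relative_areas, dtype=float))
    variation = areas.std() / areas.mean()
    range_ratio = areas.max() / areas.min()
    return variation < min_variation and range_ratio < min_range_ratio


print(looks_straw_like([1.0, 1.1, 0.9, 1.0, 1.05]))  # uniform tube -> True
print(looks_straw_like([0.4, 1.0, 2.5, 3.0, 0.8]))   # varied tube  -> False
```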

Why it matters

Today’s world is defined by the digital exchange of media and information. Everything from news to entertainment to conversations with loved ones usually happens through digital exchanges. Even in their infancy, deepfake video and audio undermine people’s trust in these exchanges, limiting their usefulness.

If the digital world is to remain a vital resource for information in people’s lives, effective and secure techniques for determining the source of an audio sample are crucial.

This article is republished from The Conversation under a Creative Commons license. Read the original article.
