Deepfake audio has a tell

Imagine the following scenario: a phone rings. An office worker answers it and hears his boss, in a panic, tell him that she forgot to transfer money to the new contractor before she left for the day and needs him to do it. She gives him the wire transfer information, and with the money transferred, the crisis has been averted.

The worker sits back in his chair, takes a deep breath, and watches as his boss walks in the door. The voice on the other end of the call was not his boss. In fact, it wasn’t even a human. The voice he heard was an audio deepfake, a machine-generated audio sample designed to sound exactly like his boss.

Attacks like this using recorded voices have already happened, and conversational audio deepfakes might not be far off.

Deepfakes, both audio and video, have become possible only with the development of sophisticated machine learning technologies in recent years. Deepfakes have brought with them a new level of uncertainty about digital media. To detect deepfakes, many researchers have turned to analyzing visual artifacts – minute glitches and inconsistencies – found in video deepfakes.

Audio deepfakes potentially pose an even greater threat, because people often communicate verbally without video – for example, via phone calls, radio, and voice recordings. These voice-only communications greatly expand the possibilities for attackers to use deepfakes.

To detect audio deepfakes, we and our research colleagues at the University of Florida have developed a technique that measures the acoustic and fluid-dynamic differences between voice samples created organically by human speakers and those generated synthetically by computers.

Organic vs. synthetic voices

Humans vocalize by forcing air over the various structures of the vocal tract, including the vocal folds, tongue, and lips. By rearranging these structures, you alter the acoustic properties of your vocal tract, allowing you to create more than 200 distinct sounds, or phonemes. However, human anatomy fundamentally limits the acoustic behavior of these different phonemes, resulting in a relatively small range of correct sounds for each.
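These anatomical constraints can be illustrated with the classic uniform-tube approximation from acoustic phonetics – a textbook model, not part of the research described here. Treating the vocal tract as a tube of length L, closed at the glottis and open at the lips, its resonances (formants) fall at F_n = (2n − 1)·c / 4L. A minimal sketch:

```python
# Quarter-wavelength (uniform tube) model of the vocal tract:
# closed at the glottis, open at the lips. A textbook approximation
# for illustration only, not the authors' detection technique.

SPEED_OF_SOUND = 350.0  # m/s, in warm, moist air

def tube_formants(length_m, n=3):
    """First n resonant frequencies F_k = (2k - 1) * c / (4L), in Hz."""
    return [(2 * k - 1) * SPEED_OF_SOUND / (4 * length_m) for k in range(1, n + 1)]

# A typical adult male vocal tract is roughly 17.5 cm long:
print(tube_formants(0.175))  # → [500.0, 1500.0, 2500.0]
```

These values are close to the measured formants of a neutral schwa vowel, which is why the uniform tube is the usual starting point before adding the shape variations that distinguish the other phonemes.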

In contrast, audio deepfakes are created by first letting a computer listen to audio recordings of the targeted victim. Depending on the exact techniques used, the computer might need as little as 10 to 20 seconds of audio. This audio is used to extract key information about the unique aspects of the victim’s voice.

The attacker then selects a phrase for the deepfake to speak and, using a modified text-to-speech algorithm, generates an audio sample that sounds as if the victim is saying the selected phrase. This process of creating a single deepfaked audio sample can be accomplished in a matter of seconds, potentially giving attackers enough flexibility to use a deepfaked voice in a live conversation.

Detecting audio deepfakes

The first step in differentiating human-produced speech from deepfaked speech is understanding how to acoustically model the vocal tract. Fortunately, scientists have techniques for estimating what someone – or some being such as a dinosaur – would sound like based on anatomical measurements of its vocal tract.

We did the opposite. By inverting many of these same techniques, we were able to extract an approximation of a speaker’s vocal tract during a segment of speech. This effectively allowed us to peer into the anatomy of the speaker who created the audio sample.
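One standard way to perform this kind of inversion – an assumption on my part, since the article does not specify the authors’ exact method – is linear predictive coding (LPC): each speech frame is modeled as a lossless tube of concatenated cylindrical sections, the reflection coefficients at the section junctions fall out of the Levinson-Durbin recursion, and those coefficients map to relative cross-sectional areas. A minimal sketch:

```python
import numpy as np

def reflection_coeffs(frame, order=10):
    """LPC reflection coefficients of a speech frame via Levinson-Durbin."""
    frame = frame * np.hamming(len(frame))
    # Autocorrelation at lags 0..order
    full = np.correlate(frame, frame, mode="full")
    r = full[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        ki = -acc / err
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + ki * a[i - j]
        new_a[i] = ki
        a = new_a
        k[i - 1] = ki
        err *= 1.0 - ki * ki
    return k

def area_function(k, lip_area=1.0):
    """Relative cross-sectional areas of the lossless-tube vocal tract model.
    Sign conventions for k vary between texts; areas here are relative, not absolute."""
    areas = [lip_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return np.array(areas)

# Demo on a synthetic "vowel-like" frame (a stable two-pole resonance),
# since no real recording is bundled with this sketch.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(400)
signal = np.zeros_like(excitation)
for n in range(2, len(signal)):
    signal[n] = excitation[n] + 1.3 * signal[n - 1] - 0.8 * signal[n - 2]
frame = signal[100:340]

k = reflection_coeffs(frame, order=8)
areas = area_function(k)
# A physically plausible tract gives |k| < 1 and strictly positive areas;
# a grossly implausible (e.g., uniformly straw-thin) area profile is the
# kind of anomaly the researchers describe as a deepfake tell.
```

The per-frame area profiles can then be checked against the range of shapes a human vocal tract can actually assume.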

Deepfaked audio often results in vocal tract reconstructions that resemble drinking straws rather than biological vocal tracts. (Logan Blue et al. / CC BY-ND)

From here, we hypothesized that deepfake audio samples would fail to be constrained by the same anatomical limitations humans have. In other words, analyzing deepfaked audio samples would yield simulated vocal tract shapes that do not exist in people.

Our test results not only confirmed our hypothesis but revealed something interesting. When extracting vocal tract estimates from deepfake audio, we found that the estimates were often comically incorrect. For instance, it was common for deepfaked audio to produce vocal tracts with the same relative diameter and consistency as a drinking straw, in contrast to human vocal tracts, which are much wider and more variable in shape.

This realization demonstrates that deepfake audio, even when convincing to human listeners, is far from indistinguishable from human-generated speech. By estimating the anatomy responsible for creating the observed speech, it is possible to identify whether the audio was generated by a person or a computer.

Why this matters

Today’s world is defined by the digital exchange of media and information. Everything from news to entertainment to conversations with loved ones typically happens via digital exchanges. Even in their infancy, deepfake video and audio undermine the confidence people have in these exchanges, effectively limiting their usefulness.

If the digital world is to remain a vital source of information in people’s lives, effective and secure techniques for determining the source of an audio sample are crucial.

This article is republished from The Conversation under a Creative Commons license. Read the original article.
