Creepy New AI Can Simulate Your Voice Perfectly After Hearing It for 3 Seconds

It's so good that its creators admit it "may carry potential risks in misuse."

Modern technology has revolutionized the way we get things done. Even the most basic smartphones in most people's pockets and smart home devices in our living rooms have an impressive number of capabilities, especially when you consider that you can control them simply by talking, thanks to artificial intelligence (AI). But even as computers have progressed to make our lives easier, they're also entering new territory as they become able to mimic human behavior and even appear to think for themselves. And now, one new creepy form of AI can simulate your voice perfectly after hearing it for just three seconds. Read on to learn more about the groundbreaking technology.

Microsoft has developed a new type of AI that can flawlessly simulate your voice.

We've all relied on machines to make our daily lives easier in one way or another. But what if a computer could step in and mimic the way you speak without others even noticing?

Last week, researchers at Microsoft announced they had developed a new form of text-to-speech AI they've dubbed VALL-E, Ars Technica reports. The technology can simulate a person's voice by using a three-second audio clip, even picking up and preserving the original speaker's emotional tone and the acoustic sounds of the environment in which they're recording. The team says the model could be handy for creating automatic vocalizations of text—even though it comes with potential risks of highly sophisticated dupes similar to deepfake videos.

The company says the new tech is based on a "neural codec language model."

In its paper discussing the new tech, Microsoft dubs VALL-E a "neural codec language model." Traditional text-to-speech (TTS) software takes written words and manipulates waveforms to generate vocalizations. VALL-E instead picks up subtle elements of a voice from a short audio prompt and uses them to recreate that person speaking any sentence that's fed to it, according to the website Interesting Engineering.

"To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively," the team explains in their paper. "Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder."

The team used about 60,000 hours of recorded speech to train the new AI.

To develop the new model, the team says it used about 60,000 hours of recorded English speech from more than 7,000 individual speakers, drawn from LibriLight, an audio library assembled by Meta. In most cases, recordings were pulled from readings of public-domain audiobooks stored on LibriVox, Ars Technica reports. In its trials, the team said that VALL-E needs the voice in the three-second sample to closely resemble one of the voices from its training data to produce a convincing result.

The team is now showcasing its work by posting specific examples of the software in action on a GitHub page. Each example includes a three-second clip of a speaker's voice reading random text and a "ground truth," a recording of the same speaker reading a sentence, to be used for comparison. It also includes a "baseline" recording showing how typical TTS software would generate the spoken audio, and a "VALL-E" version to compare against the other two.

While the results aren't entirely perfect, they do include some very convincing examples where the machine-generated speech sounds shockingly human. The researchers add that besides mimicking inflection and emotion, the software can replicate the environment in which the base audio is recorded, making it sound like someone is speaking outdoors, in an echoing room, or on a phone call.

So far, Microsoft hasn't released the program for others to test or experiment with.

The research team concludes its paper by saying it plans to increase the amount of training data to help the model improve its speaking styles and become better at mimicking the human voice. But for the time being, Microsoft has held back from making the new software available for developers or the general public to test, potentially because of its ability to trick people or be used for nefarious purposes.

"Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker," the authors wrote in their conclusion. "To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."

Zachary Mack
Zach is a freelance writer specializing in beer, wine, food, spirits, and travel. He is based in Manhattan.