Microsoft AI Can Imitate Any Voice Based on Only 3 Seconds of Actual Recording

This sure is moving fast!

I remember Jordan Peterson having a total breakdown because an AI was able to fake his voice after being trained on six hours of him talking.

Well, most people who are not public figures do not have six hours of recordings of themselves talking available.

Everyone has 3 seconds.

Fox News:

Microsoft’s new language model Vall-E is reportedly able to imitate any voice using just a three-second sample recording.

The recently released AI tool was tested on 60,000 hours of English speech data. Researchers said in a paper out of Cornell University that it could replicate the emotions and tone of a speaker.

Those findings were apparently true even when creating a recording of words that the original speaker never actually said.

“Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text to speech] system in terms of speech naturalness and speaker similarity,” the authors wrote. “In addition, we find Vall-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

The Vall-E samples shared on GitHub are eerily similar to the speaker prompts, although they range in quality.

Speaking of Jordan Peterson deepfakes – there is going to have to be a round 2 of Jordan Peterson deepfakes.

I don’t want to get sued by the guy, but the way he whined about that was just too much of an outrage. You can’t just let a public figure get away with that kind of public whining.

Someone is going to have to build an army of Petersonbots.