VALL-E: Microsoft’s new text-to-speech AI that can mimic anyone’s voice

Microsoft has showcased ‘VALL-E’, a new text-to-speech AI tool that can mimic someone’s voice from just a three-second audio sample. It can also reproduce the speaker’s emotional tone and the acoustic environment of the sample. Microsoft calls the tool a “neural codec language model.”

VALL-E was trained on 60,000 hours of English speech. In a paper posted to arXiv, the preprint server hosted by Cornell University, the developers explain that the recordings come from more than 7,000 unique speakers in Meta’s LibriLight audio library. To mimic a voice, the tool looks for a close match in the training data and uses that data to infer what the target speaker would sound like.

VALL-E’s text-to-speech (TTS) system used hundreds of times more data than existing TTS systems, the team says. “Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis,” the researchers say.

The team also presented a demo on the VALL-E GitHub page. For each sentence, the demo provides a three-second prompt from the speaker to imitate, a “ground truth” recording of the same speaker saying another phrase for comparison, a “baseline” produced by conventional text-to-speech synthesis, and finally the VALL-E sample.

To further improve the tool, Microsoft is looking to scale up its training data “to improve the model performance across prosody, speaking style, and speaker similarity perspectives.”

However, concerns about the ethical implications of this new technology have also emerged. Such technologies could be misused, paving the way for realistic spam calls that imitate the voices of real people a potential victim knows. “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. When the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model,” says the company.

VALL-E is not yet available to the general public.

Anubha Pandey
