Microsoft has showcased ‘VALL-E’, a new text-to-speech AI tool that can mimic someone’s voice from just a three-second audio sample. It can also simulate the speaker’s emotions and the acoustic environment. Microsoft calls the tool a “neural codec language model.”
VALL-E was trained on 60,000 hours of English speech data. In a paper published on arXiv, the preprint server hosted by Cornell University, the developers explained that the recordings came from more than 7,000 unique speakers in Meta’s LibriLight audio library. To mimic a voice, the tool finds close matches in its training data and uses them to infer what the target speaker would sound like.
VALL-E’s text-to-speech (TTS) system used hundreds of times more data than existing TTS systems, says the team. “Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis,” the researchers say.
The team also presented a demo on the VALL-E GitHub page. For each sentence, the demo includes a three-second prompt from the speaker to be imitated, a “ground truth” recording of the same speaker saying another phrase for comparison, a “baseline” sample from conventional text-to-speech synthesis, and finally the VALL-E sample.
To further improve the tool, Microsoft is looking to scale up its training data “to improve the model performance across prosody, speaking style, and speaker similarity perspectives.”
However, concerns about the ethical implications of this new technology have also emerged. Similar technologies could be misused and could pave the way for realistic spam calls that imitate the voices of real people a potential victim knows. “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. When the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model,” says the company.
VALL-E is not yet available for the general public to use.