Amazon.com Inc. has come up with a new artificial intelligence system that can train digital voice assistants such as Alexa to learn new speaking styles, such as that of a newsreader, in a matter of hours.
In a blog post today, Trevor Wood, Amazon’s applied science manager, said the new text-to-speech system could replace traditional methods of voice training that typically require actors to speak in the target style for tens of hours in order to train models.
“Speech produced by neural networks sounds much more natural than speech produced through concatenative methods, which string together short speech snippets stored in an audio database,” Wood wrote. “With the increased flexibility provided by [our system], we can easily vary the speaking style of synthesized speech.”
Amazon, which refers to its new model as “neural text-to-speech,” or NTTS, said there are two key components to it. One is a “generative neural network” that works by converting sequences of phonemes, which are distinct units of sound that distinguish one word from another, into sequences of spectrograms. Those, in turn, are visual representations of the spectrum of frequencies in those sounds as they vary over time. The spectrograms “emphasize features that the human brain uses when processing speech,” Wood said.
The other component is known as a “vocoder,” which helps to convert those spectrograms into a continuous audio signal used to train the text-to-speech model.
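To make the idea of a spectrogram more concrete, here is a minimal sketch in plain Python. It is not Amazon's code and it runs in the opposite direction from the NTTS pipeline: rather than generating spectrograms from phonemes, it computes one from an existing audio signal, which is the simplest way to see what the representation contains. The function names, frame size and toy two-tone signal are all assumptions made for illustration.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Return the magnitudes of the non-negative-frequency DFT bins of one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        mags.append(abs(total))
    return mags


def spectrogram(signal, frame_size=64, hop=32):
    """Slice the signal into overlapping frames and take the DFT magnitude of each.

    Each row of the result describes which frequencies are present
    during one short window of time.
    """
    rows = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        rows.append(dft_magnitudes(signal[start:start + frame_size]))
    return rows


# Toy signal (an assumption, not real speech): a tone that sits in
# frequency bin 4 for the first half and jumps to bin 12 for the second.
FRAME_SIZE = 64
low = [math.sin(2 * math.pi * 4 * t / FRAME_SIZE) for t in range(256)]
high = [math.sin(2 * math.pi * 12 * t / FRAME_SIZE) for t in range(256)]

spec = spectrogram(low + high, frame_size=FRAME_SIZE, hop=32)

# The strongest frequency bin in each time frame.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(peak_bins)
```

In this toy case the strongest bin moves from 4 in the early frames to 12 in the late ones, which is the "frequencies varying over time" picture that a spectrogram captures. A vocoder has the harder job of going the other way, from rows like these back to a continuous waveform.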
The complex technical processes are detailed in Wood’s blog post, but the most important thing is that it seems to work just fine. The new training method can combine neutral-style speech data with just a few hours of supplementary data in the target style to produce a model that can distinguish between elements of speech both unique to, and independent of, a particular speaking style.
“When presented with a speaking-style code during operation, the network predicts the prosodic pattern suitable for that style and applies it to a separately generated, style-agnostic representation,” Wood wrote. “The high quality achieved with relatively little additional training data allows for rapid expansion of speaking styles.”
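The idea of a speaking-style code can be illustrated with a toy Python sketch. This is not Amazon's network: the style inventory, the hand-picked prosody numbers and every function name are hypothetical, and a simple lookup table stands in for what NTTS learns from data. It shows only the general pattern Wood describes, in which a style code selects a prosodic pattern that is then applied to a representation produced without any knowledge of the style.

```python
# Hypothetical style inventory; the real system's styles are not listed here.
STYLES = ["neutral", "newscaster"]

# Hypothetical per-style prosody parameters: (pitch scale, duration scale).
# A real model would learn these from training data rather than hard-code them.
STYLE_PROSODY = [
    [1.0, 1.0],  # neutral: leave pitch and timing unchanged
    [1.1, 0.9],  # newscaster: slightly higher pitch, slightly faster
]


def one_hot(style):
    """Encode a style name as a one-hot speaking-style code."""
    return [1.0 if s == style else 0.0 for s in STYLES]


def predict_prosody(style_code):
    """Map a style code to prosody parameters.

    With a one-hot code, this weighted sum simply picks out one row
    of the table above.
    """
    n_params = len(STYLE_PROSODY[0])
    return [
        sum(w * row[j] for w, row in zip(style_code, STYLE_PROSODY))
        for j in range(n_params)
    ]


def apply_style(units, style_code):
    """Apply the predicted prosody to a style-agnostic sequence.

    Each unit is (phoneme, pitch in Hz, duration in seconds).
    """
    pitch_scale, duration_scale = predict_prosody(style_code)
    return [
        (phoneme, pitch * pitch_scale, duration * duration_scale)
        for phoneme, pitch, duration in units
    ]


# Style-agnostic representation of a short utterance (made-up values).
units = [("HH", 100.0, 0.08), ("AH", 120.0, 0.10), ("L", 110.0, 0.07)]

neutral = apply_style(units, one_hot("neutral"))
newscaster = apply_style(units, one_hot("newscaster"))
print(newscaster)
```

Because the style enters only through the code, supporting a new style in this sketch means adding one row of parameters rather than rebuilding the rest, which loosely mirrors why a few hours of extra data can be enough in the system described.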
Wood said Amazon’s research shows that listeners have a strong preference for voices created by the neural text-to-speech method over traditional concatenative synthesis. In fact, the NTTS method was rated almost as high as normal human speech itself.
“The preference for the neutral-style NTTS reflects the increase in general speech synthesis quality due to neural generative methods,” Wood said. “The further improvement for the NTTS newscaster voice reflects our system’s ability to capture a style relevant to the text.”