
How Alexa learned to speak with an Irish accent


In the last five years, speech synthesis technology has moved to all-neural models, which allow the individual elements of speech — prosody, accent, language, and speaker identity (voice) — to be controlled independently. It's the technology that enabled the Amazon Text-to-Speech group to teach the feminine-sounding, English-language Alexa voice to speak in perfectly accented U.S. Spanish and the masculine-sounding U.S. voice to speak with a British accent.


In both of those cases, however, we had two advantages: (1) abundant annotated speech samples with the target accent, which the existing voice model could learn from, and (2) a set of rules for mapping graphemes — sequences of characters — to the phonemes — the minimal units of phonetic information, and the input to our text-to-speech models — of the target accent.

In the case of the Irish-accented, female-sounding English Alexa voice, which launched late last year, we had neither of those advantages — no grapheme-to-phoneme rules and a dataset that was an order of magnitude smaller than those for British English and U.S. Spanish. When we tried using the same approach to accent transfer that had worked in the previous cases, the results were poor.

So instead of taking an existing voice and teaching it a new accent, we took recordings of accented speech and changed their speaker ID. This provided us with additional training data for our Irish-accent text-to-speech model in the target voice, which greatly improved the accent quality.

To address the problem of sparse Irish-accented training data, Amazon researchers used a voice conversion model to produce additional Irish-accented training data in the target voice, which greatly improved the accent quality.

More precisely, to train a multispeaker, multiaccent text-to-speech (TTS) model, we first synthesized training data using a separate voice conversion (VC) model.

The inputs to the voice conversion model include a speaker embedding, which is a vector representation of the acoustic characteristics of a given speaker’s voice; a mel-spectrogram, which is a snapshot of the frequency spectrum of the speech signal at short intervals; and the phoneme sequence associated with the spectrogram.
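The three inputs described above can be pictured as a simple data bundle. The sketch below is illustrative, not Amazon's implementation: the embedding size, frame parameters, and phoneme labels are assumptions, and the spectrogram function is a toy log-magnitude stand-in for a true mel-spectrogram (a real pipeline would also apply a mel filterbank, as in librosa's `melspectrogram`).

```python
import numpy as np

def log_mel_like_spectrogram(signal, frame_len=400, hop=160, n_bins=80):
    """Toy stand-in for a mel-spectrogram: frame the waveform, window
    each frame, and take log-magnitude FFT bins at short intervals."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame))[:n_bins]
        frames.append(np.log(mag + 1e-6))
    return np.stack(frames)  # shape: (num_frames, n_bins)

# Hypothetical bundle of voice-conversion inputs, mirroring the description:
sr = 16000
waveform = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # 1 s of a 220 Hz tone
vc_inputs = {
    "speaker_embedding": np.random.randn(256),  # target speaker's voice vector
    "mel_spectrogram": log_mel_like_spectrogram(waveform),
    "phonemes": ["k", "ae", "n", "t"],          # phoneme sequence for the clip
}
```

At inference, swapping in a different speaker embedding while keeping the spectrogram and phonemes is what changes the speaker ID of the accented recording.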


During training, the TTS model, too, receives a speaker embedding, mel-spectrograms, and phoneme sequences, but at inference time, it does not receive the spectrograms. It's a multiaccent, multispeaker model, so at training time, it also receives an accent ID, a simple ordinal indicator of the input speech's accent. At inference time, the accent ID controls the accent of the output speech.
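One common way to realize this kind of conditioning is to map the ordinal accent ID to a learned accent embedding and concatenate it, along with the speaker embedding, onto each phoneme embedding. The sketch below assumes that design; the table sizes, accent numbering, and phoneme labels are illustrative, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lookup tables; dimensions are illustrative.
PHONEME_TABLE = {p: rng.standard_normal(64) for p in ["k", "ae", "n", "t"]}
ACCENT_TABLE = {
    0: rng.standard_normal(16),  # e.g. 0 = U.S. English
    1: rng.standard_normal(16),  # e.g. 1 = British English
    2: rng.standard_normal(16),  # e.g. 2 = Irish English
}

def condition_inputs(phonemes, speaker_embedding, accent_id):
    """Concatenate each phoneme embedding with the speaker embedding and
    the accent embedding selected by the ordinal accent ID. The same
    conditioning applies at training and inference; only the spectrogram
    is absent at inference time."""
    accent_vec = ACCENT_TABLE[accent_id]
    rows = [np.concatenate([PHONEME_TABLE[p], speaker_embedding, accent_vec])
            for p in phonemes]
    return np.stack(rows)  # shape: (num_phonemes, 64 + speaker_dim + 16)

spk = rng.standard_normal(32)
encoded = condition_inputs(["k", "ae", "n", "t"], spk, accent_id=2)
```

Because the accent is a separate input, the same speaker embedding can be paired with any accent ID at inference time — which is exactly the decoupling of voice and accent the approach relies on.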

Using a multiaccent model is not essential to our approach, but at Alexa AI, we’ve found, empirically, that multiaccent models tend to yield more-natural-sounding synthetic speech than single-accent models.

The TTS model’s inputs also include information, extracted from the input speech signal, about the duration of the individual input phonemes, which gives the model better control of the accent rhythm. Again, at inference time, there is no input speech signal; instead, the durations of the phonemes are predicted by a separate duration model, which is trained in parallel with the TTS model.
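A standard way to feed explicit durations to a TTS decoder is "length regulation": each phoneme embedding is repeated for its predicted number of spectrogram frames. The sketch below assumes that mechanism; the duration values and embedding sizes are made up for illustration, and the fixed lookup stands in for the separately trained neural duration model.

```python
import numpy as np

def predict_durations(phonemes):
    """Stand-in for the duration model: in reality a neural regressor,
    trained in parallel with the TTS model, predicts frames per phoneme.
    The values here are illustrative, not measured."""
    table = {"k": 3, "ae": 8, "n": 5, "t": 2}
    return [table[p] for p in phonemes]

def length_regulate(phoneme_embeddings, durations):
    """Expand each phoneme embedding to its predicted number of frames,
    so the decoder sees a frame-rate sequence — this is how explicit
    durations give the model control over the accent's rhythm."""
    return np.concatenate([np.repeat(emb[None, :], d, axis=0)
                           for emb, d in zip(phoneme_embeddings, durations)])

phons = ["k", "ae", "n", "t"]
embs = np.eye(4)                      # toy 4-dim embeddings, one per phoneme
durs = predict_durations(phons)       # [3, 8, 5, 2]
frames = length_regulate(embs, durs)  # 3 + 8 + 5 + 2 = 18 frames
```

Stretching or shrinking these per-phoneme durations is what lets the model reproduce the timing patterns of the target accent rather than those of the source voice.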

During training (top), synthesized speech from the voice conversion model is used to simultaneously train a text-to-speech (TTS) model and a phoneme duration model. At inference time (bottom), the duration model’s predictions serve as an input to the TTS model.

Although we had no grapheme-to-phoneme (G2P) rules for Irish-accented English, we still had to generate the input phonemes for our TTS model somehow, so we experimented with the G2P rules for both British English and American English. Neither is entirely accurate: for instance, the vowel sound of the word “can’t” — and thus the associated phoneme — is different in Irish English than in either of the other two accent groups. But we were able to get credible results with both British English and American English G2P rules.

American English worked slightly better, however, and this is probably because of rhoticity: American English speakers, like Irish English speakers, pronounce their r’s; British English speakers usually drop them.


To evaluate our method, we asked reviewers to compare Irish English speech synthesized by our method to recordings of four different Irish English speakers, one of whom was our source speaker — the one who provided the speech that was the basis of our augmented data. In terms of accent, reviewers rated recordings of the source speaker as 72.56% similar to other recordings of the same speaker; they rated our synthesized speech (in a different voice) 61.4% similar to recordings of the source speaker.

When reviewers were asked to compare the accent of the source speaker to those of the other three Irish English speakers, however, the similarity score fell to 53%; when asked to do the same with our synthesized speech, the similarity score was 51%. In other words, reviewers thought that our synthesized speech approximated the “average” Irish accent about as well as the source speaker did. That the agreement is so low — for both real and synthetic speech — is a testimony to the diversity of accents in Irish English (sometimes called the language of a million accents).

To establish a baseline, we also asked the reviewers to compare speech generated through our approach to speech generated through the leading prior approach. Overall, they found that our approach offered a 50% improvement in accent similarity over the prior approach.

Acknowledgements: We would like to acknowledge Andre Canelas for identifying the opportunity and driving the project and Dennis Stansbury, Seán Mac Aodha, Laura Teefy, and Rikki Price for their support in making the experience authentic.
