Highlights:
- Nvidia asserts that linguistic inclusion for voice AI has numerous data health advantages, including aiding AI models in comprehending speaker variation and noise characteristics.
- Nvidia wants to incorporate recent advancements in AST and next-generation voice AI into use cases for the real-time metaverse.
At its recent Speech AI Summit, Nvidia unveiled a new speech artificial intelligence (AI) ecosystem, created in collaboration with Mozilla Common Voice. The ecosystem focuses on building open-source pre-trained models and crowdsourced multilingual speech corpora. The goal of Nvidia and Mozilla Common Voice is to accelerate the development of automatic speech recognition systems that work for speakers of every language across the world.
Nvidia found that popular voice assistants, such as Amazon Alexa and Google Assistant, support less than one percent of the world's spoken languages. To address this, the company aims to improve linguistic inclusion in speech AI and make speech data more accessible for low-resource languages.
Nvidia now joins Meta and Google, both of which recently unveiled speech AI models aimed at enabling communication between people who speak different languages. Google's Translation Hub, an AI-powered document translation service, can translate large volumes of documents into numerous languages. The tech giant also announced that it is developing a universal speech translator trained on more than 400 languages, claiming that it is the “largest language model coverage seen in a speech model recently.”
Likewise, Meta AI’s Universal Speech Translator (UST) project works toward AI systems that enable real-time speech-to-speech translation across all languages, including those that are spoken but not commonly written.
A system for users of different languages
Nvidia asserts that linguistic inclusion for voice AI has numerous data health advantages, including aiding AI models in comprehending speaker variation and noise characteristics. The new speech AI ecosystem lets developers build, maintain, and improve speech AI models and datasets for linguistic inclusion, usability, and experience. Users can train their models on Mozilla Common Voice datasets and then offer those pre-trained models as high-quality automatic speech recognition architectures. Other companies and individuals worldwide can then adapt those architectures to build their own speech AI applications.
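To give a concrete sense of that workflow, here is a minimal sketch of how a developer might pull Common Voice speech data from the Hugging Face Hub and run a community pre-trained ASR model against it. The dataset version, model name, and column names are assumptions for illustration; they are not details Nvidia or Mozilla have specified for the new ecosystem.

```python
# Hypothetical sketch: pairing crowdsourced Common Voice data with a pre-trained
# ASR model. Dataset id, model name, and column names are assumptions.
from datasets import load_dataset          # pip install datasets
import nemo.collections.asr as nemo_asr    # pip install "nemo_toolkit[asr]"

# Pull a small slice of crowdsourced speech (accessing this dataset requires
# accepting its terms on the Hugging Face Hub and an access token).
common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0",  # assumed dataset id/version
    "en",
    split="validation[:16]",
)

# Load a publicly available pre-trained ASR architecture instead of training
# from scratch.
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_small")

# Transcribe a few clips and compare against the crowdsourced transcripts.
samples = common_voice.select(range(4))
predictions = asr_model.transcribe([s["path"] for s in samples])
for sample, hypothesis in zip(samples, predictions):
    print(f"reference:  {sample['sentence']}")
    print(f"hypothesis: {hypothesis}\n")
```

In practice, a team would fine-tune such a checkpoint on the Common Voice split for its target language before publishing it back for others to reuse.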
Caroline de Brito Gottlieb, product manager at Nvidia, said, “Demographic diversity is key to capturing language diversity. Several vital factors impact speech variation, such as underserved dialects, sociolects, pidgins, and accents. Through this partnership, we aim to create a dataset ecosystem that helps communities build speech datasets and models for any language or context.”
Currently, the Mozilla Common Voice platform supports 100 languages, with 24,000 hours of speech data from 500,000 contributors worldwide. The most recent edition of the Common Voice dataset also includes more speech data from female speakers, along with six new languages: Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona, and Cantonese.
Through the Mozilla Common Voice platform, users can donate their audio by recording sentences as short voice clips, which Mozilla validates after submission to ensure dataset quality.
Siddharth Sharma, head of product marketing for AI and deep learning at Nvidia, said, “The speech AI ecosystem extensively focuses on not only the diversity of languages but also on accents and noise profiles that different language speakers across the globe have. This has been our unique focus at Nvidia, and we created a solution that can be customized for every aspect of the speech AI model pipeline.”
Current speech AI implementations from Nvidia
The company is building speech AI for various applications, including automatic speech recognition (ASR), automatic speech translation (AST), and text-to-speech. Nvidia Riva, a component of the Nvidia AI platform, offers state-of-the-art, GPU-optimized workflows for building and deploying fully customizable, real-time AI pipelines for applications such as contact center agent assists, virtual assistants, digital avatars, brand voices, and video conferencing transcription. Applications built with Riva can be deployed in any cloud, in any data center, at the edge, or on embedded hardware.
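As a rough illustration of what consuming such a pipeline looks like, the sketch below calls a running Riva server through Nvidia's Python client for offline speech recognition. It assumes a Riva server is already deployed at localhost:50051 and that a 16 kHz mono WAV file is on hand; the endpoint, file name, and configuration values are placeholders rather than a recommended setup.

```python
# Minimal sketch of offline ASR against a running Riva server; the server
# address, audio file, and config values are assumptions for illustration.
import riva.client  # pip install nvidia-riva-client

auth = riva.client.Auth(uri="localhost:50051")  # assumed local Riva deployment
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,       # must match the audio file
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# Send one pre-recorded clip for offline (non-streaming) recognition.
with open("agent_call.wav", "rb") as fh:  # hypothetical audio file
    audio_bytes = fh.read()

response = asr_service.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)
```

The same client also exposes streaming recognition and speech synthesis services, which is what real-time use cases such as agent assist and digital avatars would rely on.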
NCS, the Singapore government's transportation technology partner, adapted Nvidia's Riva FastPitch model to build its own Singaporean English text-to-speech engine using voice data from local speakers. NCS's recently launched app, Breeze, translates languages including Mandarin, Hokkien, Malay, and Tamil into Singaporean English for local drivers, with the same expressiveness and clarity a native Singaporean speaker would have.
T-Mobile, a multinational mobile communications provider, also collaborated with Nvidia to create AI-based software for its customer experience centers that transcribes customer conversations in real time and makes recommendations to thousands of front-line employees. To build the software, T-Mobile used Riva and Nvidia NeMo, an open-source framework for state-of-the-art conversational AI models. With these Nvidia tools, T-Mobile engineers were able to fine-tune ASR models on the company's own datasets and accurately interpret customer jargon in noisy conditions.
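For readers curious what that fine-tuning step looks like in code, below is a hedged sketch using NeMo's PyTorch Lightning-based training loop. The checkpoint name and manifest paths are placeholders standing in for a company's proprietary in-domain data, which is not public; NeMo manifests are JSON-lines files with "audio_filepath", "duration", and "text" fields.

```python
# Hypothetical sketch of fine-tuning a pre-trained NeMo ASR model on custom,
# in-domain audio; checkpoint name and manifest paths are placeholders.
import pytorch_lightning as pl
from omegaconf import DictConfig
import nemo.collections.asr as nemo_asr

# Start from a publicly available English Conformer-CTC checkpoint.
model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_small")

# Point the model at in-domain training and validation manifests.
model.setup_training_data(DictConfig({
    "manifest_filepath": "train_manifest.json",  # assumed path to custom data
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}))
model.setup_validation_data(DictConfig({
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
}))

# A short fine-tuning run; epochs, learning rate, and augmentation (for noisy
# audio) would be tuned to the target domain in practice.
trainer = pl.Trainer(max_epochs=5, accelerator="gpu", devices=1)
trainer.fit(model)
```

Fine-tuning on domain-specific vocabulary and acoustic conditions is what lets a general-purpose model handle customer jargon and background noise.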
Nvidia's future focus is voice AI
According to Sharma, Nvidia wants to incorporate recent advancements in AST and next-generation voice AI into use cases for the real-time metaverse.
He said, “Today, we’re limited to only offering slow translation from one language to the other, and those translations have to go through text. But the future is where you can have people in the metaverse across so many different languages all being able to have instant translation with each other.”
He added, “The next step is developing systems that will enable fluid interactions with people across the globe through speech recognition for all languages and real-time text-to-speech.”