Meta’s new AI models can recognize and produce speech for over 1,000 languages

The researchers trained the models on two new datasets: one containing New Testament audio recordings and corresponding text collected from the internet in 1,107 languages, and another containing unlabeled New Testament audio recordings in 3,809 languages. The team processed the speech audio and text data to improve its quality before running an algorithm designed to align the audio recordings with the accompanying text. They then repeated this process with a second algorithm trained on the newly aligned data. With this method, the model was able to learn new languages more easily, even without accompanying text.
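The two-stage process described above resembles a standard self-training loop: a seed model is trained on aligned audio–text pairs, used to pseudo-label unaligned audio, and then retrained on the enlarged dataset. The sketch below is purely illustrative and is not Meta's actual pipeline; it substitutes a toy nearest-centroid classifier for a speech model, and all names are hypothetical.

```python
# Illustrative sketch of a self-training loop (NOT Meta's actual method):
# stage 1 trains a seed model on labeled data, stage 2 pseudo-labels
# unlabeled data and retrains on the combined set.

def train(labeled):
    # "Training" here just computes one centroid per class label.
    groups = {}
    for features, label in labeled:
        groups.setdefault(label, []).append(features)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def predict(model, features):
    # Assign the class whose centroid is closest — a stand-in for
    # aligning audio with text.
    return min(model, key=lambda label: abs(model[label] - features))

# Stage 1: seed model from a small labeled set (audio aligned with text).
labeled = [(1.0, "a"), (1.2, "a"), (9.0, "b"), (9.4, "b")]
model = train(labeled)

# Stage 2: pseudo-label the unlabeled data, then retrain on everything.
unlabeled = [0.8, 1.1, 9.2, 8.9]
pseudo_labeled = [(x, predict(model, x)) for x in unlabeled]
model = train(labeled + pseudo_labeled)

print(predict(model, 1.05))  # -> "a"
print(predict(model, 9.1))   # -> "b"
```

The design point is that the second round of training sees far more data than the first, which is why the approach helps for languages with very little labeled audio.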
“We can use what that model learned to quickly build speech systems with very little data,” says Michael Auli, a researcher at Meta who worked on the project.
“For English, we have a lot of good datasets, and we have them for some other languages, but we just don’t have them for languages spoken by, say, 1,000 people.”
The researchers say their models can produce speech in more than 1,000 languages and recognize more than 4,000.
They compared the models with those from rival companies, including OpenAI’s Whisper, and say theirs had half the error rate despite covering 11 times as many languages.
However, the team cautions that the model is still at risk of misspelling certain words or phrases, which could result in inaccurate or potentially offensive labels. They also acknowledge that their speech recognition models produced more distorted words than other models, albeit only 0.7% more.
While the scope of the research is impressive, using religious texts to train AI models can be controversial, says Chris Emezue, a researcher at Masakhane, an organization working on natural language processing for African languages, who was not involved in the project.
“The Bible contains many prejudices and misrepresentations,” he says.