The Microsoft Research (MSR) lab in India is working on creating digital ecosystems for Indian languages that have a limited online presence. These efforts are part of Project ELLORA, launched in 2015, which aims to bring rare Indian languages to the digital world and preserve them for future generations.
The team is building language datasets to train AI models by mapping out available resources, including printed literature. It is also collaborating with the language communities to ensure the datasets are culturally relevant and accurate. Microsoft is currently working with the Mundari community, whose members are concerned about the longevity of the Mundari language because teaching of it in schools is limited.
English has been the dominant language of the internet since its inception, and only 8 of nearly 6,000 languages are preferred online. About 88% of the world’s languages don’t have enough of an online presence, which affects 1.2 billion people who can’t use their own language to navigate the digital world.
Microsoft’s research team is working on a Hindi-to-Mundari text translation and speech recognition model to give the community access to more content in its language. The team has also developed a technology called Interactive Neural Machine Translation (INMT) to speed up the translation process. Apart from Mundari, Microsoft is working with the Gondi and Idu Mishmi communities.
Meta, the parent company of Facebook, is working on a similar project. It has developed an AI translation tool that can convert speech in unwritten languages, such as Hokkien, into spoken English. Both companies' efforts aim to bring underrepresented languages to the digital world and preserve them for future generations.