Marathi at home. Hindi at work. And English online. More than a quarter of Indians speak two or three of the country’s 22 official languages and 250 distinct dialects. Yet even as more Indians come online than ever before, English remains the lingua franca of the internet.
Tech companies and the Government of India want to change that with AI. But doing so will require a lot more than a declaration and wishful thinking. It will require meaningful consultation with the public, something tech companies and the Government have often fallen short of in the past.
Last month, the Government of India launched BharatGen to make generative AI work in Indian languages. Prime Minister Narendra Modi’s call was echoed by Google CEO Sundar Pichai and Nvidia CEO Jensen Huang, who shared Modi’s enthusiasm for bringing AI to India. India already boasts many homegrown AI initiatives. Efforts to bring AI to India should work alongside these local projects, and learn from them.
India is a hot market with over a billion potential users, and services have long tried to court them with slipshod products and little to no consultation with users. Take, for example, Free Basics. Facebook launched the Free Basics programme in the mid-2010s as a well-meaning effort to bridge the digital divide and expand access to the internet by providing low-cost data access. A closer look at the offering revealed that recipients of the service only got access to Facebook and a handful of other Western online services such as AccuWeather, BBC News, and ESPN. Users were prevented from downloading additional online services onto their devices, and could not read the terms of service in any language other than English. It was internet access in name only, with terms dictated by and for Facebook, which many termed a form of “digital colonialism”. Ultimately, the Telecom Regulatory Authority of India (TRAI) banned Free Basics in February 2016 in a ruling grounded in net neutrality principles, to ensure that users had equal access to all internet content.
AI companies are on track to make a similar mistake. Western tech companies have already boasted about multilingual systems that work in languages ranging from Assamese to Bengali. Yet a closer look at these systems reveals that they have been neither built robustly nor tested rigorously. One study found that chatbots were unable to answer health-related questions in Hindi despite already being used in various health-care settings. On 60 Minutes, Sundar Pichai promised that AI could help people access the world’s information in their own language, affirming that Bard, Google’s AI chatbot, had taught itself Bengali. In reality, the model had been trained in the language, but not in any meaningful way: only 1.4% of its training data was bilingual, and only 0.026% was in Bengali, sourced from Google translations of English text. Computer science experts on Twitter called out the company for perpetuating AI “hype” rather than investing in building and testing services that could actually benefit Bengali speakers.
When AI companies do try to build systems that work in multiple languages, they often ignore local expertise. Companies building large multilingual models have largely opted for an Anglocentric approach: one model for English, and another for all other languages, no matter how distant those languages are from one another. They also rely on machine-translated text and on underpaid data workers with limited discretion and power to apply their expertise, and they lack the processes, or the will, to incorporate local feedback. For example, Microsoft’s chatbot Sydney was initially tested in India, where many users reported disturbing conversations. Yet the company reportedly ignored the feedback until the New York Times covered the issue.
If Big Tech companies want to better serve Indian users, they can and should start by consulting and collaborating with groups that have expertise in, and understanding of, the needs of Indian users. Collaboration with local groups can help bridge the “resourcedness gap”, the dearth of high-quality training datasets in many Indian languages.
My colleague Gabriel Nicholas and I have written in the past that this dearth of high-quality training and testing data often impedes companies from building and testing models in non-English languages, and incentivises them to cut corners, such as scraping non-English text from social media sites that is often replete with obscenities and typos. Local groups can provide high-quality training data and testing prompts to ensure that models are built in a way that is more language-specific. As my colleagues write in a new brief, by engaging with language experts and local research groups, companies can ensure that communities both meaningfully contribute to and benefit from NLP tools developed in their languages.
Engaging a diverse group of Indian language AI consortia will also help tech companies better represent the country’s language diversity in the development and testing of AI systems for Indian users. Currently, Western tech companies operate as though other languages work the way English does, or as though Indian users are all the same. However, many languages do not follow the semantic logic of English, or even of other Indo-European languages. English, the Romance languages, and some Indo-European languages like Hindi use gendered pronouns, whereas Bengali, Uralic and Turkic languages such as Hungarian and Turkish, and many others lack gendered pronouns altogether. What’s more, companies assume that users speak one language or another at any given time, whereas linguists (and most Indians) know that we tend to “code-switch” – that is, mix two languages in the same utterance, as in Hinglish or Tamlish. Few training datasets or chatbots reflect this reality.
By failing to engage with language speakers and local experts, companies risk replicating, and even scaling, Anglocentric assumptions. Some argue that this failure to properly validate a model’s language accuracy could have major ramifications for the world’s linguistic diversity and cultural uniqueness, with AI researchers at Cornell specifically claiming that it could lead Indian users to hew more closely to American or Western norms when they write, at the expense of their own style and agency. Regional and language-specific research consortia, such as the IndicLLM Suite led by AI4Bharat and the Center for Tamil Natural Language Processing Research, are examples of groups that can offer the expertise and resources to shape and test multilingual systems in ways that better reflect the Indian context.
Tech companies can also ensure that their systems are fit for purpose by working with domain-specific experts. For building technologies in rural languages and contexts, organisations like Karya are essential. Karya works with language speakers to record, document, and digitize languages spoken in rural India, including Odia and Mundari, for domain-specific systems such as call centre support or agricultural data systems. Where models need to be tested for fairness, or are built to detect and act on harmful conduct, Tattle is creating high-quality training datasets and systems to combat gender-based violence and other forms of harassment in Tamil, Hindi, and Indian English. Its training datasets are developed in consultation with language and subject-matter experts, including organisations working on gender-based violence and other affected communities, who are often on the front lines of new vectors of abuse and can help ensure automated models detect these evolving threats.
Finally, consulting with local groups will also help companies obtain consent from individuals over the collection and use of language data. Local groups and language experts are more likely to have the consent and consensus of speakers, and to be able to represent whether a community wants its languages digitized and handed over to Big Tech companies or not. Te Hiku Media in New Zealand, for example, has recorded and digitized Māori, one of the 3,000 Indigenous languages under threat of extinction, but retains the rights to distribute or license the training data to AI systems, to ensure that the benefits of the AI revolution flow to language speakers and not just to tech companies’ profits. Nvidia helped Te Hiku Media create its own Māori language models through a crowd-sourced labelling campaign, after receiving the consent of the elders who have stewarded the language.
Meaningful engagement with local experts will require a paradigm shift on the part of companies: investing time and resources to consult external experts and solicit input into product development roadmaps. In doing so, companies stand to gain locally made datasets and benchmarks to train and test their models. Companies should consider remunerating such experts for their expertise and ensuring local expertise is solicited and incorporated at each stage of the product development lifecycle. Initiatives such as Meta’s No Language Left Behind and Microsoft’s ELLORA are examples of efforts that fund research to advance the performance of systems in non-English languages. This investment will also make all users safer by ensuring that safeguards cannot be circumvented in languages other than English, as they currently can.
It’s in the Government of India’s best interest to play a leading role here. Companies seem encouraged by the prime minister’s tech savviness, and Indians want to be technologically innovative. Prime Minister Modi can leverage that by ensuring that foreign investment by tech companies bolsters local groups working on language access in tech and ultimately benefits Indian languages, rather than endangering them. The Government of India’s relevant agencies, such as MeitY, should also establish a meaningful consultation process with the public to understand the potential opportunities and risks that AI systems, including multilingual ones, pose, rather than offering Western tech companies a blank cheque of access.
Ensuring technologies work in the world’s languages, or even just in Indian languages, is a mighty task. By engaging with language and subject-matter experts, companies can move the needle on actually serving the plethora of Indian users and businesses who seek information about farming and agriculture policy, healthcare, cricket history, and more.
Aliya Bhatia is a policy analyst at the Center for Democracy & Technology and the co-author of Lost in Translation: Large Language Models in Non-English Content Analysis.