AI's Linguistic Diversity Battle: Mind Your Language

The United Nations

By Fabrice Robinet

For two years, one international organization under the umbrella of the UN has been leading a relentless campaign in the corridors of global digital diplomacy. Its mission? To bring linguistic diversity to English-dominated artificial intelligence.

With his signature geeky glasses and TED-Talk-style headset, Sundar Pichai looked straight out of a Silicon Valley incubator.

That Monday, February 10, Google's chief executive took the stage at the Artificial Intelligence Action Summit in Paris. From the Grand Palais podium, he heralded a new golden age of innovation.

"Using AI techniques, we added over 110 new languages to Google Translate last year, spoken by half a billion people around the world," said the tech mogul, his eyes fixed on his notes. "That brings our total to 249 languages, including 60 African languages - more to come."

Delivered in a monotone, his statement barely registered among the summit's attendees - an assembly of world leaders, researchers, NGOs, and tech executives.

But for advocates of linguistic diversity in artificial intelligence, Mr. Pichai's words marked a quiet victory - one achieved after two years of intense, behind-the-scenes negotiations in the arcane world of digital diplomacy.

"It shows the message is getting through and tech companies are listening," said Joseph Nkalwo Ngoula, digital policy advisor at the UN mission of the International Organisation of La Francophonie, in New York.

Linguistic divide

Mr. Pichai's speech was a far cry from the linguistic missteps of early generative AI - a branch of artificial intelligence capable of creating original content, from text to images, music and animation.

When OpenAI launched ChatGPT in 2022, non-English speakers quickly discovered its limitations.

A query in English would generate a detailed, informative response. The same prompt in French? Two paragraphs, followed by a sheepish apology: "Sorry, I haven't been trained on that," or, "my model isn't updated beyond this date."

The gap stems from how AI tools work: they rely on so-called large language models (LLMs) like GPT-4, Meta's Llama, or Google's Gemini, which digest vast troves of internet data in order to understand and generate text.

But the internet itself is overwhelmingly Anglophone. While only 20 per cent of the world's population speaks English at home, nearly half of the training data for major AI models is in English.

ChatGPT's responses in French, Portuguese, or Spanish have since improved, but even today they remain less illuminating than their English counterparts.

The UN Global Digital Compact aims to bring together governments and industry to ensure that technology, like AI, works for all humanity.

Sharper focus

"The volume of available information in English is much greater, but it's also more up to date," said Mr. Nkalwo Ngoula. By default, AI models are conceived, trained, and deployed in English, leaving other languages struggling to catch up.

The divide isn't just quantitative. AI, when deprived of robust training in any given language, starts to "hallucinate" - generating incorrect or absurd answers with unsettling authority - much like an overconfident friend bluffing his way through trivia night.

A classic AI hallucination consists of responding to a request for biographical details about a famous person by inventing a Nobel Prize or coming up with an odd parallel career, as in this example generated by ChatGPT, at the behest of UN News:

UN News: 'Who is Victor Hugo?'

Hallucinating AI: "Victor Hugo, the 19th-century French writer, was also a passionate astronaut who contributed to the early design of the International Space Station." 🚀😆

Black box

"It's a black box absorbing data," Mr. Nkalwo Ngoula explained. "The results might be formally coherent and logically structured, but factually, they can be wildly inaccurate."

Beyond factual errors, AI tends to flatten linguistic richness. Chatbots struggle with regional accents and language variations, such as Quebecois French or Creole languages spoken in Haiti and the French Caribbean.

AI-generated French often feels sanitized, stripped of its stylistic nuances.

"Molière, Léopold Sédar Senghor, Aimé Césaire, Mongo Beti - they'd all be turning in their graves if they saw how AI writes French today," joked Mr. Nkalwo Ngoula.

The issue runs deeper in multilingual countries such as the diplomat's native Cameroon, where young people commonly speak Camfranglais - a hybrid of French, English, Pidgin, and local languages.

"I doubt young people could ask an AI something in Camfranglais and get a meaningful response," he said. Expressions like "Je yamo ce pays" (I love this country) or "Réponds-moi sharp-sharp" (Answer me quickly) would likely leave AI models bewildered.

Philemon Yang (at podium and on screens), President of the seventy-ninth session of the United Nations General Assembly, addresses the opening of the Summit of the Future on 22 September 2024.

Shadow Campaign of La Francophonie

Mr. Nkalwo Ngoula's organization, La Francophonie - which brings together 93 states and governments around the use of French, representing more than 320 million people worldwide - has made this linguistic gap a centerpiece of its digital strategy.

The group's efforts culminated in last year's UN Global Digital Compact, a framework for AI governance adopted by the Member States. From 2023 onward, La Francophonie leveraged its diplomatic network - including the influential Francophone Ambassadors' Group at the UN - to ensure linguistic diversity became a core principle in AI policymaking.

Along the way, unexpected allies emerged. Lusophone and Hispanic advocacy groups joined the fight, and even Washington sided with their cause. "The US defended language inclusion in AI development," Mr. Nkalwo Ngoula noted.

Their push paid off. The final Global Digital Compact explicitly recognizes cultural and linguistic diversity - an issue that had initially been buried under broader discussions on accessibility. "Our goal was to bring it to the forefront," he said.

The movement even reached Silicon Valley. At the UN Summit of the Future in September 2024, where the Compact was officially adopted, Sundar Pichai, Google's CEO, surprised many by emphasizing the need for AI to provide access to global knowledge in multiple languages.

"We're working toward 1,000 of the world's most spoken languages," he pledged - a commitment he reaffirmed in Paris months later.

Limits of the Global Digital Compact

Despite these gains, challenges remain. Chief among them is visibility. "Francophone content is often buried by platform algorithms," Mr. Nkalwo Ngoula warned.

Streaming giants like Netflix, YouTube, and Spotify prioritize popularity, meaning English-language content dominates search results.

"If linguistic diversity were truly considered, a French-speaking user should see French-language films at the top of their recommendations," he argued.

The overwhelming dominance of English in AI training data is another hurdle sidestepped by the Compact, which also omits any reference to UNESCO's Convention on Cultural Diversity - an oversight that, according to Mr. Nkalwo Ngoula, should be rectified.

"Linguistic diversity must be the backbone of digital advocacy for La Francophonie," Mr. Nkalwo Ngoula insisted.

Given the pace of AI development, those changes can't come a moment too soon.

/UN News Release. This material from the originating organization/author(s) might be of the point-in-time nature, and edited for clarity, style and length. Mirage.News does not take institutional positions or sides, and all views, positions, and conclusions expressed herein are solely those of the author(s).
