Don't Let English Dominate! How AI Protects Linguistic Diversity - Learning Cycle Collective: Global Voices on DEI

As of 2021, over half of all web content (60.4%) is in English. Therefore, when AI companies scrape the web to develop their large language models (LLMs) and training datasets, English is going to be the most represented language in both amount of data and frequency of use. An over-reliance on English language materials will lead the systems to be biased in favor of English-language perspectives, particularly those of the US, the UK, Australia, and Canada. This will very likely have a compounding effect, continually boosting English-speaking countries’ soft power as their texts and viewpoints on history, language, and culture become favored by humans and, increasingly, by generative AI technologies.

Digital Language Divide

It is not necessarily a bad thing to have a de facto lingua franca that enables greater cross-cultural and international communication. However, it is also fair for speakers of less commonly spoken languages to be concerned about the “digital language divide,” or the notion that their languages and, by extension, their communities are potentially being excluded from future technological advances due to the over-reliance on English and other globally dominant languages.

Self-Perpetuating Cycle of English Dominance

The dominance of English can very quickly become a self-perpetuating cycle. People may be choosing to communicate in English, or to develop new tools and digital ecosystems in English, not because it is the best language for the purpose, but simply because that is what everyone else is doing. If this were permitted to continue unquestioned, the dominance of English could only be expected to keep growing.

Efforts to Promote Linguistic Diversity in AI

In an effort to start staving off that possibility, Silo AI, a Finnish startup, has developed Viking, a large language model for certain Nordic languages. This act of purposeful inclusion is intended to increase data inclusivity and ensure continued representation in AI for speakers of Finnish (5.8M), Danish (5.6M), Icelandic (358,000), Norwegian (5M), and Swedish (9.2M). Even combined, at roughly 26M speakers, they are dwarfed by the 1.4B speakers of English, which makes this effort to ensure their languages are considered and incorporated in the development of new AI technologies even more noteworthy.

None of these Nordic languages appears on the list of languages most frequently used in internet content (and therefore in AI training data). Dutch, a Germanic neighbor that is not itself a Nordic language, does appear, standing at 0.6% of internet content. Japanese, with its 125.4M speakers, is reported to make up 4.3% of internet content. Japanese is thus slightly better positioned to be incorporated in LLMs, but not by much.

Infographic illustrating disconnect between online language usage and real-world linguistic diversity

It is wonderful to see an active effort to increase the prominence of some smaller language groups from the perspectives of both accessibility and linguistic diversity. With all the recent reporting about the inherent dangers and risks of AI, it is nice to be able to celebrate an undeniably and unambiguously beneficial use case for the technology. It would not do to create a world where such high-value, capable technology is locked behind a language barrier. After all, people worldwide who do not speak English vastly outnumber those who do, and as English speakers, native or acquired, we should want to bring everyone in the world along with us as we move into a more high-tech future!

