The language ‘engines’ that power generative artificial intelligence (AI) are plagued by a wide range of issues that can hurt society, most notably through the spread of misinformation and discriminatory content, including racist and sexist stereotypes.
These failings of popular AI systems such as ChatGPT are due in large part to shortcomings in the language data on which they are trained.
To address these issues, researchers from the University of Birmingham have developed a novel framework for better understanding large language models (LLMs) by integrating principles from sociolinguistics – the study of language variation and change.
Publishing their research in Frontiers in Artificial Intelligence, the researchers argue that accurately representing different ‘varieties of language’ could significantly improve the performance of AI systems, addressing critical challenges in AI including social bias, misinformation, domain adaptation, and alignment with societal values.
The researchers emphasise the importance of using sociolinguistic principles to train LLMs to better represent the diverse dialects, registers, and time periods that make up any language – opening new avenues for developing AI systems that are more accurate and reliable, as well as more ethical and socially aware.
Lead author Professor Jack Grieve commented: “When prompted, generative AIs such as ChatGPT may be more likely to produce negative portrayals of certain ethnicities and genders, but our research offers solutions for how LLMs can be trained in a more principled manner to mitigate social biases.
“These types of issues can generally be traced back to the data that the LLM was trained on. If the training corpus contains relatively frequent expressions of harmful or inaccurate ideas about certain social groups, LLMs will inevitably reproduce those biases, resulting in potentially racist or sexist content.”
The study suggests that fine-tuning LLMs on datasets designed to represent the target language in all its diversity – as decades of research in sociolinguistics have described in detail – can generally enhance the societal value of these AI systems. The researchers also believe that balancing training data across different social groups and contexts could reduce the amount of data required to train these systems.
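To illustrate the kind of balancing described here, the sketch below shows one simple approach: stratified down-sampling so that each sociolinguistic variety contributes equally to a fine-tuning corpus. This is an illustrative example only, not the authors' method, and the dialect and register metadata fields are hypothetical.

```python
import random
from collections import defaultdict

def balance_corpus(documents, key=lambda d: (d["dialect"], d["register"]), seed=0):
    """Stratified down-sampling: return a corpus with an equal number of
    documents per sociolinguistic variety (here, dialect x register)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in documents:
        strata[key(doc)].append(doc)
    # Down-sample every stratum to the size of the smallest one,
    # so that no single variety dominates fine-tuning.
    n = min(len(docs) for docs in strata.values())
    balanced = []
    for docs in strata.values():
        balanced.extend(rng.sample(docs, n))
    rng.shuffle(balanced)
    return balanced

# Hypothetical usage: each document carries metadata labels for its variety.
corpus = [
    {"text": "...", "dialect": "Scottish English", "register": "news"},
    {"text": "...", "dialect": "Indian English", "register": "conversation"},
]
balanced = balance_corpus(corpus)
```

In practice such sampling would operate over many varieties and much larger strata; the point of the sketch is simply that diversity is enforced by construction rather than left to whatever mix happens to dominate a web crawl.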
“We propose that increasing the sociolinguistic diversity of training data is far more important than merely expanding its scale,” added Professor Grieve. “For all these reasons, we believe there is a clear and urgent need for sociolinguistic insight in LLM design and evaluation.
“Understanding the structure of society, and how this structure is reflected in patterns of language use, is critical to maximising the benefits of LLMs for the societies in which they are increasingly being embedded. More generally, incorporating insights from the humanities and the social sciences is crucial for developing AI systems that better serve humanity.”
ENDS
For more information, please contact the Press Office, University of Birmingham, tel: +44 (0)121 414 2772; email: pressoffice@contacts.bham.ac.uk
Notes to editor:
- The University of Birmingham is ranked amongst the world’s top 100 institutions. Its work brings people from across the world to Birmingham, including researchers, teachers and more than 8,000 international students from over 150 countries.
- ‘The Sociolinguistic Foundations of Language Modelling’ by Jack Grieve, Sara Bartl, Matteo Fuoli, Jason Grafmiller, Weihang Huang, Alejandro Jawerbaum, Akira Murakami, Marcus Perlman, Dana Roemling, and Bodo Winter is published in Frontiers in Artificial Intelligence.
Journal: Frontiers in Artificial Intelligence
Method of Research: Data/statistical analysis
Subject of Research: People
Article Title: The Sociolinguistic Foundations of Language Modelling
Article Publication Date: 13-Jan-2025