An English literature graduate turned data scientist has developed a new method that enables the large language models (LLMs) used by AI chatbots to understand and analyse small chunks of text, such as social media profiles, online customer responses or posts responding to disaster events.
In today’s digital world, such use of short text has become central to online communication. However, analysing these snippets is challenging because they often lack shared words or context, which makes it difficult for AI to find patterns or group similar texts.
The new research addresses the problem by using large language models (LLMs) to group large datasets of short text into clusters. These clusters condense potentially millions of tweets or comments into easy-to-understand groups generated by the model.
PhD student Justin Miller developed the method, and AI programs using it successfully produced coherent categories after analysing nearly 40,000 Twitter (X) user biographies from accounts tweeting about US President Donald Trump over two days in September 2020.
The language model developed by Mr Miller, an English literature graduate, clustered the biographies into 10 categories, and allocated scores within each of these categories to assist in analysing the likely occupation of the tweeters, their political leaning, or even their use of emojis.
The study is published in the Royal Society Open Science journal.
Mr Miller said: “What makes this study stand out is its focus on human-centred design. The clusters created by the large language models are not only computationally effective but also make sense to people.
“For instance, texts about family, work, or politics are grouped in ways that humans can intuitively name and understand. Furthermore, the research shows that generative AI, such as ChatGPT, can mimic how humans interpret these clusters.
“In some cases, the AI provided clearer and more consistent cluster names than human reviewers, particularly when distinguishing meaningful patterns from background noise.”
Mr Miller, a doctoral candidate in the School of Physics and a member of the Computational Social Sciences lab, said the tool he has developed could be used to simplify large datasets, gain insights for decision-making and improve search and organisation.
The authors used LLMs together with a methodology known as “Gaussian mixture modelling” to create clusters that capture the essence of the text and are easier for humans to understand. They validated these clusters by comparing human interpretations with those from a generative LLM, which closely matched human reviews.
This approach not only improved clustering quality but also suggests that human reviews, while valuable, might not be the only standard for cluster validation.
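For readers who want a feel for the clustering step, the sketch below fits a small Gaussian mixture model in plain Python. It is an illustration only, not the authors' implementation: it assumes each short text has already been converted into a numeric vector (for example by an LLM), a step this release does not detail. The 2-D toy vectors, the group names in the comments, the spherical-covariance simplification and the helper names `responsibilities`, `farthest_point_init` and `fit_gmm` are all hypothetical choices made for the example.

```python
import math

def responsibilities(points, weights, means, variances):
    """E-step: probability that each point belongs to each mixture component."""
    dim = len(points[0])
    rows = []
    for p in points:
        log_probs = []
        for w, m, v in zip(weights, means, variances):
            sq_dist = sum((x - c) ** 2 for x, c in zip(p, m))
            log_probs.append(
                math.log(w) - 0.5 * dim * math.log(2 * math.pi * v) - sq_dist / (2 * v)
            )
        top = max(log_probs)  # subtract the max for numerical stability
        exps = [math.exp(lp - top) for lp in log_probs]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def farthest_point_init(points, k):
    """Deterministic start: each new mean is the point farthest from those chosen so far."""
    means = [list(points[0])]
    while len(means) < k:
        def gap(p):
            return min(sum((x - c) ** 2 for x, c in zip(p, m)) for m in means)
        means.append(list(max(points, key=gap)))
    return means

def fit_gmm(points, k, iterations=25):
    """Fit a spherical Gaussian mixture with expectation-maximisation (EM)."""
    n, dim = len(points), len(points[0])
    weights = [1.0 / k] * k
    means = farthest_point_init(points, k)
    variances = [1.0] * k
    for _ in range(iterations):
        resp = responsibilities(points, weights, means, variances)
        # M-step: re-estimate each component from its soft share of the points.
        for j in range(k):
            mass = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = mass / n
            means[j] = [
                sum(r[j] * p[d] for r, p in zip(resp, points)) / mass
                for d in range(dim)
            ]
            spread = sum(
                r[j] * sum((x - c) ** 2 for x, c in zip(p, means[j]))
                for r, p in zip(resp, points)
            )
            variances[j] = max(spread / (mass * dim), 1e-6)
    return weights, means, variances

# Hypothetical 2-D stand-ins for text vectors: three loose groups of four points.
embeddings = [
    (0.1, 0.2), (-0.2, 0.1), (0.3, -0.1), (0.0, -0.3),   # e.g. "family" bios
    (5.1, 5.2), (4.8, 5.1), (5.3, 4.9), (5.0, 4.7),      # e.g. "work" bios
    (0.1, 5.2), (-0.2, 5.1), (0.3, 4.9), (0.0, 4.7),     # e.g. "politics" bios
]

weights, means, variances = fit_gmm(embeddings, k=3)
resp = responsibilities(embeddings, weights, means, variances)
labels = [row.index(max(row)) for row in resp]  # hard label = most likely cluster
print(labels)
```

Each row of `resp` is a set of soft membership probabilities, so a text can sit partly in more than one cluster, while `labels` keeps only the most likely cluster for each text. On real data the vectors would have many more dimensions, and a tested library implementation of Gaussian mixture models would be a more sensible choice than this hand-written loop.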
Mr Miller said: “Large datasets, which would be impossible to manually read, can be reduced into meaningful, manageable groups.”
Applications include:
- Simplify Large Datasets: Mr Miller applied the same methods from this paper to another project on the Russia-Ukraine war. By clustering more than 1 million social media posts, he identified 10 distinct topics, including Russian disinformation campaigns, the use of animals as symbols in humanitarian relief, and Azerbaijan’s attempts to showcase its support for Ukraine.
- Gain Insights for Decision-Making: Clusters provide actionable insights for organisations, governments and businesses. A business might use clustering to identify what customers like or dislike about its product, while governments could use it to condense wide-ranging public sentiment into a few topics.
- Improve Search and Organisation: For platforms handling large volumes of user-generated content, clustering makes it easier to organise, filter and retrieve relevant information. This method can help users quickly find what they’re looking for and improve overall content management.
Mr Miller said: “This dual use of AI for clustering and interpretation opens up significant possibilities. By reducing reliance on costly and subjective human reviews, it offers a scalable way to make sense of massive amounts of text data. From social media trend analysis to crisis monitoring or customer insights, this approach combines machine efficiency with human understanding to organise and explain data effectively.”
-ENDS-
Interviews
Justin Miller | justin.k.miller@sydney.edu.au
Media enquiries
Marcus Strom | marcus.strom@sydney.edu.au | +61 474 269 459
Research
Miller, J. and Alexander, T. ‘Human-interpretable clustering of short text using large language models’ (Royal Society Open Science 2025) DOI: 10.1098/rsos.241692
Declaration: The researchers declare no conflicts.
Outside of work hours, please call +61 2 8627 0246 (this directs to a mobile number) or email media.office@sydney.edu.au.
Journal
Royal Society Open Science
Method of Research
Data/statistical analysis
Subject of Research
Not applicable
Article Title
Human-interpretable clustering of short text using large language models
Article Publication Date
21-Jan-2025