By Alistair Jones
SMU Office of Research – Our embrace of digital technology continues to bloom and grow. It has become integral to the way we communicate, access services and transact business.
Forbes magazine estimated that at the end of 2023, there were 15 billion connected digital devices in the world, almost twice as many as there were people. And as the number of online and mobile interactions has proliferated, so too have the opportunities for them to be hacked.
Cybercrime is big business. A recent survey by McKinsey & Company estimated that at the present rate of growth, damage from cyber-attacks will amount to about US$10.5 trillion annually by 2025. That would represent a 300 per cent increase from 2015 levels.
Containing the risk of data being compromised, stolen or sold to rogue agents is a matter of urgency for governments. Discovering and addressing vulnerabilities in government software is paramount to securing digital government services and keeping data safe from attackers.
A new research project led by David Lo, a Professor of Computer Science at Singapore Management University (SMU), aims to help address this issue in Singapore.
"The key objective of the project is to develop an approach to find cybersecurity weaknesses, also known as vulnerabilities, in a software application’s source code," Professor Lo says.
"The user of the to-be-realised solution would be the Government Technology Agency (GovTech) of Singapore’s Cyber Security Group (CSG)."
GovTech is the lead agency delivering the Singapore government's digital services to the public. It also oversees the infrastructure that drives the implementation of the country's Smart Nation initiative.
Professor Lo's project has been awarded a TRANS grant.
Smart Nation and Digital Government Translational R&D (TRANS) grants fund translational R&D and technology or process innovations that solve public sector challenges, demonstrate the feasibility of new ideas, and encourage agencies to experiment with and deploy innovative solutions.
Challenges and the road to innovation
It is best practice for programming teams that develop bespoke software applications to review source code to discover vulnerabilities. However, manual source code review is laborious, so many teams prefer automated tools, commonly known as Static Application Security Testing (SAST) tools.
Yet SAST tools often struggle to keep up with the increased complexity of modern code and application development frameworks.
"Many present SAST tools have been reported to have high false alarm rates," Professor Lo says. "Many warnings these tools generate often do not correspond to real cybersecurity weaknesses. Moreover, they miss many vulnerabilities in source code."
During the past few years, generative artificial intelligence algorithms have been used to train large language models (LLMs) to create tools such as ChatGPT, which could have potential applications in cybersecurity.
But these models are not specialised for vulnerability discovery and are provided as black boxes without details or access to their internal workings. They also require data to be sent to third-party services, which has raised concerns.
Professor Lo and his team propose building a localised and specialised LLM solution, in this case a large code model, specifically tuned for vulnerability discovery and contextualised to the government code base.
The tuning process will leverage the corpus of knowledge from open-source projects and communities, such as GitHub, which is used by around 100 million developers worldwide.
Effectively training and fine-tuning a vulnerability discovery model requires a large amount of vulnerability data with broad coverage. A static collection can become obsolete fairly quickly as new types of vulnerabilities, programming languages, constructs and paradigms emerge. Instead, the researchers will create a data pipeline that provides a steady stream of vulnerable code introduced and subsequently fixed by thousands of engineers across the world.
Key inputs would include a large number of diverse, high-quality and relevant code changes from security patches in open-source repositories – a ‘security patch’ is an update made to a software application to fix vulnerabilities. The tuned solution could then flag similar vulnerable code in government source code.
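As a rough illustration of what collecting security patches might involve, the Python sketch below filters a list of commits down to those whose messages hint at a security fix. This is not the project's method: the keyword pattern, the `Commit` record and both function names are assumptions made for this example, and a real pipeline would be expected to use far more robust signals than commit-message keywords.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword pattern, chosen only for this illustration.
SECURITY_HINTS = re.compile(
    r"\b(CVE-\d{4}-\d{4,}|vulnerab\w*|security|overflow|injection|XSS)\b",
    re.IGNORECASE,
)


@dataclass
class Commit:
    sha: str      # commit identifier
    message: str  # commit message written by the developer
    diff: str     # the code change itself


def looks_like_security_patch(commit: Commit) -> bool:
    """Naive check: does the commit message mention a security-related term?"""
    return bool(SECURITY_HINTS.search(commit.message))


def collect_candidates(commits: list[Commit]) -> list[Commit]:
    """Keep only commits that look like security patches."""
    return [c for c in commits if looks_like_security_patch(c)]


# Made-up commits standing in for a stream from public repositories.
commits = [
    Commit("a1", "Fix buffer overflow in parser", "+ if (len < MAX) ..."),
    Commit("b2", "Update README wording", "+ See docs for details"),
    Commit("c3", "Sanitise input to prevent SQL injection", "+ query(sql, params)"),
]
print([c.sha for c in collect_candidates(commits)])  # ['a1', 'c3']
```

Running a filter like this continuously over new commits, rather than once over a fixed archive, is what would turn a static collection into a stream.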
Architecting the solution
So, how will the vulnerability discovery solution be realised?
"This project will realise a large code model-based system that consists of four pillars: Data, Model, Operation and Infrastructure," Professor Lo says. "The purpose of the Data pillar is to gather high-quality vulnerability data to be obtained from publicly available sources.
"The purpose of the Model pillar is to transform vulnerability data collected in the Data pillar into a high-performing machine learning model based on large code models to identify vulnerability-introducing commits in continuous integration and continuous delivery pipelines.
"The purpose of the Operation pillar is to refine and adapt the vulnerability detection model created in the Model pillar so that it can work optimally to help software engineers discover vulnerabilities introduced to source code stored in government GitLab project repositories.
"The purpose of the Infrastructure pillar is to provide common services that support the three aforementioned pillars," Professor Lo says.
The project will build on previously published works co-authored by Professor Lo, which produced solutions for curating vulnerability data from public code and vulnerability repositories. Many of these works resulted from a previously funded project.
"The previous project, funded by the National Satellite of Excellence in Trustworthy Software Systems, developed state-of-the-art solutions that will be the starting point of the Data pillar," Professor Lo says.
"Specifically, the solutions built there can be used to identify changes that developers made in public source code repositories to fix vulnerabilities. Those changes will be used as one valuable source of data to help our proposed approach identify similar vulnerabilities across Government repositories."
Paving the way for a safer digital future
In an era where digital security is more crucial than ever, the work of Professor Lo and his team at Singapore Management University represents a beacon of hope. The project is a testament to the potential of AI to enhance cybersecurity and a crucial step towards safeguarding our digital infrastructure. By addressing the intricate challenges of coding vulnerabilities through advanced AI methodologies, this project is poised to elevate the standards of cybersecurity significantly.
The implications of this research extend beyond the immediate realm of government software. The methodologies and discoveries made here have the potential to influence cybersecurity practices globally. As we move forward into an increasingly interconnected digital world, the work of Professor Lo and his team underscores the importance of continuous innovation and vigilance in the face of evolving cyber threats.
This research / project is supported by the National Research Foundation, Singapore, Smart Nation Group under its Smart Nation and Digital Government Translational R&D Grant (Award No. TRANS2023-TGC02).