Feature Story | 9-Nov-2023

Topology, algebra, and geometry give math respect in data science

Is that an airplane?

DOE/Pacific Northwest National Laboratory

Image: Topology, Algebra, and Geometry Give Math Respect in Data Science

Credit: Image by Timothy Holland | Pacific Northwest National Laboratory

By John Roach

In the computer vision field of object detection, deep learning models are trained to identify objects of interest within an image of a scene. For example, such models can be trained to detect viruses in microscopy images or pick out airplanes parked on tarmacs in overhead aerial imagery.

“In many cases, like microscopy or overhead images, a user would want to ensure that the objects are found regardless of their orientation,” said Tegan Emerson, a senior data scientist and leader of the mathematics, statistics, and data science group at Pacific Northwest National Laboratory (PNNL). “However, this property is not inherent in all deep learning models.”

For instance, a deep learning model may pick out airplanes with noses pointed north but fail to detect airplanes pointed south.

Emerson and her colleagues explored solutions to this problem by applying the algebraic concept of a group action to a deep learning model for object detection. A group action describes how objects change under a collection of operations, such as rotations. With these algebra-based architecture changes applied to the model, objects are detected more reliably in imagery no matter their orientation.

“If you constrain the model to have this type of mathematical invariance to it, you’re able to maintain your ability to detect and appropriately identify the objects within your scene, which makes this a much more trustworthy tool for people to use,” Emerson said. “That matters in operational environments where a lot of our algorithms are going to be deployed.”
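To make the idea of a group action concrete, here is a minimal sketch in Python. It is not the PNNL team's architecture. It assumes the simplest possible setup: the group of four 90-degree rotations acting on a tiny grid image, and a toy score (the hypothetical `north_detector`) that only responds to objects pointing up. Pooling that score over every rotated copy of the image, one simple way to obtain invariance, gives a result that no longer depends on orientation.

```python
from typing import Callable, List

Image = List[List[float]]


def rotate90(img: Image) -> Image:
    """Rotate a grid image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def orbit(img: Image) -> List[Image]:
    """All four images reachable from img under 90-degree rotations."""
    images = [img]
    for _ in range(3):
        images.append(rotate90(images[-1]))
    return images


def north_detector(img: Image) -> float:
    """Toy orientation-sensitive score: top-row brightness minus bottom-row brightness."""
    return sum(img[0]) - sum(img[-1])


def make_invariant(score: Callable[[Image], float]) -> Callable[[Image], float]:
    """Pool a score over the rotation orbit so it ignores orientation."""
    def pooled(img: Image) -> float:
        return max(score(rotated) for rotated in orbit(img))
    return pooled


# A crude "airplane" whose bright pixels sit at the top of the image.
plane_north = [
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
plane_south = rotate90(rotate90(plane_north))  # same shape, turned 180 degrees

invariant_detector = make_invariant(north_detector)

print(north_detector(plane_north), north_detector(plane_south))
print(invariant_detector(plane_north), invariant_detector(plane_south))
```

The plain detector gives opposite scores for the two orientations, while the pooled version returns the same score for both. Because rotating the input only reorders the set of images in the orbit, the maximum over that set cannot change.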

Giving math respect in data science

In recent years, mathematicians were pushed to the sidelines in data science disciplines as computing power and the datasets used to train machine learning (ML) models grew exponentially, noted Timothy Doster, a senior data scientist at PNNL. That growth led to a step change in capabilities, such as artificial intelligence (AI) systems that can generate fluid prose in natural language.

“The mathematics community felt a little behind the time as massive amounts of funding went into these computer science fields,” he said. “But now they’re seeing research around explainability or dependability of these algorithms and that’s where math can really come in and address these areas.”

In 2022, Doster, Emerson, and fellow PNNL data scientist Henry Kvinge co-founded the Topology, Algebra, and Geometry in Data Science (TAG-DS) community to help spur interest in applying math to specific topics in data science and ML.

The community hosts workshops and conferences and provides publishing opportunities to drive awareness of mathematically principled solutions to data science problems. Most recently, the team hosted the second annual TAG in ML workshop at the International Conference on Machine Learning (ICML) on July 28, 2023, in Honolulu, Hawaii. The workshop attracted more than 200 participants.

Part of the interest in the TAG-DS community stems from the growing complexity of ML systems, which operate on high-dimensional, complex datasets using models that have thousands to billions of learnable parameters, noted Kvinge.

“Such settings transcend human intuition which begins to quickly degrade beyond three dimensions,” he said. “Modern topology, algebra, and geometry were designed to allow mathematicians to understand exotic spaces, making them natural toolboxes to investigate when studying state-of-the-art machine learning.”

Proof of math in data science

In some cases, the application of math to data science can improve the rigor of AI models trained with massive datasets and computing power. For example, representation theory, a branch of mathematics that studies symmetry, is used in some of the models capable of predicting how proteins fold and twist into their three-dimensional shapes, according to Kvinge.

Protein folding models help scientists understand the structure of proteins, which are the building blocks of life. Proteins are molecular machines that play a fundamental role in the structure, function, and regulation of nearly every biological process.

“We know that how a protein folds should not depend on its location in space nor its orientation, and consequently a deep learning model should ignore these factors of variation when processing representations of proteins,” he explained. “Building model architectures can be done far more accurately when you understand how to capture the symmetries intrinsic to three-dimensional space.”
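The symmetry Kvinge describes can be illustrated with a small Python sketch. It is an assumption-laden toy, not how any particular protein folding model works: a "structure" is just a list of 3D points, and the hypothetical `pairwise_distances` function describes it by the distances between every pair of points. That description is unchanged when the whole structure is rotated or shifted, which is the kind of invariance a model architecture can be built to respect.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float, float]


def pairwise_distances(points: List[Point]) -> List[float]:
    """Distances between every pair of points, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def rigid_move(points: List[Point], angle: float, shift: Point) -> List[Point]:
    """Rotate about the z-axis by angle (radians), then translate by shift."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy, dz = shift
    return [(c * x - s * y + dx, s * x + c * y + dy, z + dz) for x, y, z in points]


def same_shape(a: List[Point], b: List[Point], tol: float = 1e-9) -> bool:
    """True when two point lists have matching pairwise distances."""
    da, db = pairwise_distances(a), pairwise_distances(b)
    return len(da) == len(db) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(da, db)
    )


# Made-up coordinates standing in for a short chain of atoms.
backbone = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (1.5, 1.5, 1.5)]
moved = rigid_move(backbone, angle=0.7, shift=(10.0, -3.0, 2.0))
stretched = [(2 * x, 2 * y, 2 * z) for x, y, z in backbone]

print(same_shape(backbone, moved))      # rotated and shifted copy
print(same_shape(backbone, stretched))  # genuinely different geometry
```

The coordinates of `moved` differ from those of `backbone`, yet the distance-based description treats them as the same structure. Note that pairwise distances are also unchanged by mirror reflections, so this toy description ignores slightly more than location and orientation alone.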

In other cases, mathematical techniques can improve the data used in more niche data science tasks. For example, topological data analysis can extract shape-based features for ML models used to understand the structure and properties of materials, such as the metal rods, tubes, and cubes that give cars and trucks their shape, strength, and fuel economy.

“Topology is the study of shape and there is a widely used quote from a leader in the field that states, ‘Data has shape, shape has meaning’ and what shape means for different formats of data can be nuanced,” noted Emerson.

In one study, researchers applied topology to scanning electron microscopy images that were used to support research and development in advanced manufacturing. In this case, white precipitates, or solid materials, that formed during a metal manufacturing process were visible throughout the image. By looking at the topology of the precipitates at multiple threshold values, the team was able to capture physically meaningful features, summarize the information, and use it as input to the ML model.
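A simplified Python sketch of the multi-threshold idea follows. It is a stand-in, not the study's method: it only counts connected bright regions at each threshold, whereas full topological data analysis tracks how such regions appear and merge. The grid values, the thresholds, and the function names are all made up for illustration.

```python
from typing import List, Sequence

Grid = List[List[int]]


def count_components(image: Grid, threshold: int) -> int:
    """Count 4-connected groups of pixels whose value is >= threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold or seen[r][c]:
                continue
            # New bright region found: flood-fill everything connected to it.
            count += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count


def component_curve(image: Grid, thresholds: Sequence[int]) -> List[int]:
    """Feature vector: number of bright regions at each threshold."""
    return [count_components(image, t) for t in thresholds]


# Made-up intensities; higher values play the role of bright precipitates.
micrograph = [
    [9, 9, 1, 0, 0],
    [9, 5, 1, 0, 7],
    [1, 1, 1, 0, 7],
    [0, 0, 0, 0, 2],
    [6, 0, 0, 8, 8],
]

features = component_curve(micrograph, thresholds=[1, 5, 8])
print(features)
```

At the lowest threshold a faint bridge joins two bright regions into one, at the middle threshold they separate, and at the highest threshold the dimmer regions disappear. The resulting list of counts is a short summary of shape that could be passed to an ML model as input features.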

“Part of the difference in the paradigm for TAG-DS both at PNNL and in the scientific community is that you’re not just trying to train a model. What you’re trying to do is build a solution,” said Emerson. “You want something that actually addresses a need or a way to support a human who is involved in the processing pipeline.”

Growing the TAG-DS community

Engagement with the TAG-DS community has more than doubled in its first year of existence, according to Doster. For example, the TAG-ML workshop at ICML in 2022 had about 40 published submissions. This year’s workshop received more than 90 submissions and included four keynotes by world leaders in geometric and topological deep learning, two poster sessions, six spotlight talks, and other activities.

Looking forward, the group is planning to host more workshops at computer science and mathematics conferences and is aiming to host a standalone TAG-DS conference in 2025.

According to Emerson, the ability of TAG-DS to increase the rigor, trustworthiness, and explainability of AI systems will only grow in importance as technologies such as generative AI become widespread.

“From a national laboratory’s perspective with our interest for the nation, but also for the average person in daily life, the mathematical rigor that the TAG-DS community can bring to understanding the ways these tools can support you, when they will work, how they will fail, and when they are not an appropriate technique to be using is critical,” she said.

###

About PNNL

Pacific Northwest National Laboratory draws on its distinguishing strengths in chemistry, Earth sciences, biology and data science to advance scientific knowledge and address challenges in sustainable energy and national security. Founded in 1965, PNNL is operated by Battelle for the Department of Energy’s Office of Science, which is the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science. For more information on PNNL, visit PNNL's News Center.
