You’d be hard-pressed to find an industry today that doesn’t use data in some capacity. Whether it’s health care workers using data to report the rate of flu infections in a certain state, manufacturers using data to better understand average production times, or even a small coffee shop owner flipping through sales data to learn about the previous month’s bestselling latte, data can reveal patterns and offer insights into our everyday behavior.
All of this data plays a critical role in artificial intelligence (AI) decision-making, which makes it important for people to understand the value of data in the first place. By understanding how individual data sources contribute to technology-based decision-making, we can create a more effective experience for all AI users.
For instance, studies have shown that prevalent facial recognition software performs less reliably in identifying women and people of color compared with white men, reflecting imbalances in facial data representing diverse populations. Measuring the value of data enables us to eliminate inputs that might contribute to biased models. Furthermore, understanding the value of data allows us to assign appropriate pricing to data sources, thereby facilitating data sharing. This is particularly important to industries where certain data is difficult to obtain or for small businesses grappling with limited data access.
Assistant Professor Ruoxi Jia in the Bradley Department of Electrical and Computer Engineering at Virginia Tech has received a National Science Foundation (NSF) Faculty Early Career Development (CAREER) award to investigate fundamental theories and computational tools needed to measure the value of data.
The five-year, $500,000 grant will allow Jia and her team to develop scalable and reliable data valuation techniques that support strategic data acquisition and improve machine learning-based data analytics.
“Right now, there is a lot of excitement about machine learning and AI, especially after the emergence of ChatGPT,” said Jia. “But what’s under the hood is a lot of data. That’s what enables this kind of machine, and that’s what we’re aiming to improve.”
ChatGPT, an AI chatbot launched in November 2022, allows users to ask for help with things such as writing essays, drafting business plans, generating code, and even composing music. By Dec. 4 of that year, ChatGPT already had over 1 million users.
OpenAI built its generative system on the GPT-3 family of models, which are trained on hundreds of billions of tokens. These tokens, the units of text used in natural language processing, are words or pieces of words, so a passage usually contains somewhat more tokens than words. For comparison’s sake, the novel “Harry Potter and the Order of the Phoenix” has about 250,000 words, which comes to roughly 330,000 tokens under OpenAI’s rule of thumb of about three-quarters of a word per token. Essentially, ChatGPT has been trained on billions of data points, making this kind of intelligent machine possible.
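To make the relationship between words and tokens concrete, here is a minimal sketch in Python. It assumes the commonly cited rule of thumb for English text that one token is about three-quarters of a word. The function name `estimate_tokens` and the default ratio are illustrative assumptions, and an actual count depends on the tokenizer used.

```python
def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a word count.

    The 0.75 words-per-token ratio is an assumed rule of thumb for English
    text, not the output of a real tokenizer.
    """
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    if words_per_token <= 0:
        raise ValueError("words_per_token must be positive")
    return round(word_count / words_per_token)


if __name__ == "__main__":
    # A hypothetical 1,500-word essay.
    print(estimate_tokens(1_500))  # prints 2000
```

Under this assumption, a text always has more tokens than words, which is why training sets are usually measured in tokens rather than words.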
Jia noted the importance of data quality and how it can impact machine learning results.
“If you have bad data feeding into machine learning, you will get bad results,” said Jia. “We call that 'garbage in, garbage out.' We want to get an understanding, especially a quantitative understanding, of which data is more valuable and which is less valuable for the purpose of data selection.”
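One simple way to put a number on a data point's value is leave-one-out scoring: train a model with and without the point, then compare accuracy on held-out data. The sketch below is only an illustration of that well-known baseline, not the technique Jia's team is developing. The toy data set, the one-nearest-neighbor model, and all function names are assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[float, int]  # (feature, label)

# Hypothetical toy data: the last training point is deliberately mislabeled.
TRAIN: List[Point] = [(0.0, 0), (4.0, 1), (5.0, 1), (1.2, 1)]
VALID: List[Point] = [(0.5, 0), (1.3, 0), (4.5, 1), (4.9, 1)]


def predict_1nn(train: List[Point], x: float) -> int:
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def accuracy(train: List[Point], valid: List[Point]) -> float:
    """Fraction of validation points the 1-nearest-neighbor model gets right."""
    if not train or not valid:
        return 0.0
    correct = sum(predict_1nn(train, x) == y for x, y in valid)
    return correct / len(valid)


def leave_one_out_values(train: List[Point], valid: List[Point]) -> List[float]:
    """Value of each training point: accuracy with it minus accuracy without it."""
    full = accuracy(train, valid)
    return [
        full - accuracy(train[:i] + train[i + 1:], valid)
        for i in range(len(train))
    ]


if __name__ == "__main__":
    for point, value in zip(TRAIN, leave_one_out_values(TRAIN, VALID)):
        print(point, value)
```

On this toy data the mislabeled point scores -0.25, because removing it raises accuracy, while the only correctly labeled example of class 0 scores +0.25. The two remaining points score 0.0 because each can stand in for the other. Retraining once per data point does not scale to large data sets, which is one reason scalable valuation methods are a research topic.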
ChatGPT’s developers have also recognized the importance of higher-quality data, and they recently announced the release of GPT-4. The latest model is “multimodal,” meaning it can generate content from image prompts as well as text prompts.
A large amount of data is required to develop this type of machine intelligence, but not all data is open source or public. Some data sets are owned by private entities, and sharing them raises privacy concerns. Jia hopes that in the future, monetary incentives can be introduced to help acquire these types of data sets and improve the machine learning algorithms that are needed in all industries.
Jia, a graduate of the University of California, Berkeley, has had conversations with Google Research and Sony AI Research, among others, which are interested in the research benefits. Jia hopes these companies will adopt the technology developed and serve as advocates for data sharing. Sharing data and adopting improved machine learning algorithms will greatly benefit not only industries but individual consumers as well. For instance, if you’ve ever had a bad experience with a customer service chatbot, you’ve experienced low-quality data and poor machine learning algorithm design.
Jia hopes to use her background and area expertise to improve these web-based interactions for all. As a school-aged child, Jia always enjoyed math and science, but her decision to enter the electrical and computer engineering field stemmed from her desire to help people.
“Both of my parents are doctors. It was amazing to grow up seeing them help patients with some kind of medical formula,” said Jia. “That’s why I chose to study math and science. You can have a concrete impact. I’m using a different kind of formula to help, but I like that pursuing this career has made me feel like I can make a difference in someone’s life.”
The CAREER award is the National Science Foundation’s most prestigious award for early-career faculty with the potential to serve as academic role models in research and education and to lead advances in their organization’s mission. Throughout this project, Jia has demonstrated her desire to serve as an academic role model for graduate, undergraduate, and even K-12 students.
She is a core faculty member at the Sanghani Center for Artificial Intelligence and Data Analytics, formerly known as the Discovery Analytics Center. The center has more than 20 faculty members and 120 graduate students, two of whom are working directly with Jia to conduct the planned research.
Jia plans to implement an education plan that equips students with the skills to harness data to improve decision-making that affects society. The plan will start with new machine learning courses for undergraduate students in the first two years of the project and focus on K-12 engagement in years three through five.
“There was a famous statistician named John Tukey,” Jia said. “He had a saying that the best thing about being a statistician is that you get to play in everyone's backyard. Machine learning is very much the same. It touches many areas of my colleagues’ work so it is easy for me to build connections and collaborate with other people. I really feel that my research is a privilege. It's a privilege to work in this area that many people care about.”