News Release

Precious2GPT: Multiomics transformer and conditional diffusion for generation of multi-omics multi-species multi-tissue synthetic biological data

Reports and Proceedings

Insilico Medicine

Schematic representation of the Precious2GPT structure

Image from: Precious2GPT: the combination of multiomics pretrained transformer and conditional diffusion for artificial multi-omics multi-species multi-tissue sample generation

Credit: Insilico Medicine

  • The PreciousGPT series is a set of pioneering architectures designed to understand biological mechanisms and the aging process across the lifespan, from birth to death
  • The Precious2GPT diffusion-transformer architecture was published in npj Aging, a Nature Portfolio journal
  • Precious2GPT integrates pretrained transformers with conditional diffusion models to generate multi-omics, multi-species, and multi-tissue data for drug discovery and aging research
  • Precious3GPT is open source, is undergoing community validation, and can be accessed on Discord

 

Scientists at Insilico Medicine have introduced Precious2GPT, a multimodal architecture that integrates a pretrained transformer with conditional diffusion to generate and predict multi-omics, multi-species, and multi-tissue sample data. Published in npj Aging, a Nature Portfolio journal, the study showcases Precious2GPT's ability to produce high-quality biological data that mimics real-world conditions, supporting research into biological mechanisms and the aging process and improving the understanding of fundamental biology from birth to death.

Synthetic data generation in omics is a vital tool for training and evaluating genomic analysis tools, controlling differential expression, and exploring data architecture. Traditional methods often fall short due to the complexity and variability inherent in biological data. Precious2GPT addresses these challenges by integrating Conditional Diffusion (CDiffusion) and decoder-only Multi-omics Pretrained Transformer (MoPT) models, trained on gene expression and DNA methylation data. This novel approach not only outperforms existing models like Conditional Generative Adversarial Networks (CGANs) but also excels in generating representative synthetic data that captures tissue- and age-specific information.

The AI work was performed by Insilico’s teams at Insilico Medicine Canada in Montreal and Insilico Medicine Middle East in Abu Dhabi, and validation of the synthetic data generation and other capabilities of the model was performed by multiple teams around the world.

"Precious2GPT represents a major advancement in synthetic data generation for multi-omics research," says Frank Pun, PhD, co-author of the study. "The model generates accurate omics data, offering great potential for advancing our understanding of complex biological phenomena and developing new therapeutic strategies."

The research team at Insilico employed a hybrid approach to construct Precious2GPT. The process began with the CDiffusion model generating an initial dataset that simulates gene expression levels based on a gene expression network. This network ensures biologically plausible gene expression patterns by incorporating dependencies between genes. The MoPT model then evaluates the quality of each gene's generation, calculating a quality score that reflects the similarity between the synthetic data and real-world profiles. By combining these models using Feature Weighted Linear Stacking (FWLS), the team achieved balanced, high-quality synthetic data generation.
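
As a rough illustration of the stacking step, the sketch below shows the general shape of feature-weighted linear stacking: the blended value is a linear combination of every base-model output multiplied by every meta-feature, with the weights fitted by least squares. Everything specific here is an assumption made for illustration and is not taken from the paper: the two base outputs standing in for the CDiffusion and MoPT values, the use of a per-gene quality score as the only non-constant meta-feature, the toy data, and the function names (fwls_features, fit_fwls, predict_fwls).

```python
import random


def fwls_features(preds, meta):
    """Expand one sample into FWLS features: each base output times each meta-feature."""
    return [p * f for p in preds for f in meta]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_fwls(pred_rows, meta_rows, targets, ridge=1e-9):
    """Fit FWLS weights by least squares (normal equations, tiny ridge for stability)."""
    design = [fwls_features(p, m) for p, m in zip(pred_rows, meta_rows)]
    k = len(design[0])
    gram = [[sum(row[i] * row[j] for row in design) for j in range(k)]
            for i in range(k)]
    for i in range(k):
        gram[i][i] += ridge
    rhs = [sum(row[i] * t for row, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(gram, rhs)


def predict_fwls(weights, preds, meta):
    """Blend base outputs for one sample using fitted FWLS weights."""
    return sum(w * x for w, x in zip(weights, fwls_features(preds, meta)))


# Toy data (hypothetical): two base outputs per sample and meta-features
# [1.0, quality_score]. The target is built from known weights so the fit
# can be checked against them.
rng = random.Random(0)
true_w = [0.3, 0.5, 0.7, -0.5]  # order: p1*1, p1*q, p2*1, p2*q
pred_rows, meta_rows, targets = [], [], []
for _ in range(200):
    preds = [rng.uniform(0, 10), rng.uniform(0, 10)]
    meta = [1.0, rng.random()]
    pred_rows.append(preds)
    meta_rows.append(meta)
    targets.append(predict_fwls(true_w, preds, meta))

weights = fit_fwls(pred_rows, meta_rows, targets)
print([round(w, 4) for w in weights])  # should be close to true_w
```

Because the weight on each base output varies with the meta-features, a blend of this kind can lean more heavily on one model where its quality score is high and on the other elsewhere, which a single fixed averaging weight cannot do.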

The validation study results are promising. Precious2GPT demonstrated superior age prediction accuracy using the generated data and could even generate data for ages beyond 120 years. This capability is particularly valuable for aging research, where longitudinal biological data is often scarce. Additionally, the model's ability to generate tissue-specific data was validated through UMAP dimensionality reduction, which showed high concordance with real tissue labels.

In a colorectal cancer case study, Precious2GPT showcased its potential in identifying gene signatures and therapeutic targets. By generating control samples for colorectal cancer cell lines, the model enabled a meta-analysis that revealed significant gene expression signatures, closely aligning with known colorectal cancer pathology. This highlights the model's utility in bioinformatic analyses and target discovery.

Insilico has been at the forefront of both generative AI and aging research, and began publishing studies on biomarkers of aging using advanced bioinformatics in 2014. Later, the company trained deep neural networks (DNNs) on human “multi-omics” longitudinal data and retrained them on diseases to develop its end-to-end Pharma.AI platform for target discovery, drug design, and clinical trial prediction. 

The concept of multimodal transformers for aging research was first proposed by Alex Zhavoronkov, founder and CEO of Insilico Medicine, during the Gordon Research Conference (GRC) on Systems Aging in May 2022. Subsequently, to explore the potential of multimodal transformers and diffusion models for learning from longitudinal multi-omics data and developing world models of the body, Insilico started working on the PreciousGPT series. Prior to Precious2GPT, Insilico released Precious1GPT in June 2023, a dual-transformer model that uses methylation and transcriptomic data for aging biomarker development and target discovery.

“We are combining transformer and diffusion models and using other machine learning techniques to build models that understand fundamental biological changes in time and at the same time, understand how to affect this biology using different small molecule approaches, biologics, food and many other modifications that modulate the different biological pathways at different levels of organization,” says Alex Zhavoronkov, PhD, founder and CEO of Insilico Medicine and corresponding author of the study. “We open-source the PreciousGPT series and expect to unite researchers around the world to work in peace to extend healthy, productive and sustainable life for everyone on the planet.”

The implications of Precious2GPT extend beyond aging research. The model's ability to generate synthetic data with high accuracy and specificity opens new avenues for studying various biological processes and diseases. Insilico scientists plan to further expand the application of Precious2GPT to other bioinformatics tasks including survival analysis, cross-modality prediction, and disease-specific omics generation. 

 

About Insilico Medicine

Insilico Medicine, a clinical-stage, end-to-end generative artificial intelligence (AI)-driven drug discovery company, is connecting biology, chemistry, and clinical trials analysis using next-generation AI systems. The company has developed AI platforms that utilize deep generative models, reinforcement learning, transformers, and other modern machine learning techniques for novel target discovery and the generation of novel molecular structures with desired properties. Insilico Medicine is developing breakthrough solutions to discover and develop innovative drugs for cancer, fibrosis, immunity, central nervous system diseases, infectious diseases, autoimmune diseases, and aging-related diseases.

Website: www.insilico.com

