Using a new algorithm called FLSHclust (“flash clust”), researchers have discovered 188 rare and previously unknown CRISPR-linked gene modules, including a novel type VII CRISPR-Cas system, among billions of protein sequences. The approach and its findings open new opportunities for harnessing CRISPR systems and for understanding the vast functional diversity of microbial proteins.
CRISPR systems have been leveraged to develop a growing suite of biomolecular tools, including CRISPR/Cas-mediated genome editing. Discovering previously unknown CRISPR systems could lead to further development of these biotechnologies, including safer and more effective genomic therapeutics. The CRISPR toolbox has so far been expanded through computational searches of protein sequence databases. However, the algorithms commonly used for these searches have become impractical for mining exponentially growing datasets that contain billions of proteins.
To address this limitation, Han Altae-Tran and colleagues developed FLSHclust (fast locality-sensitive hashing-based clustering), an algorithm that clusters proteins by sequence similarity and, unlike currently available methods, can quickly and efficiently analyze vast protein sequence databases. To evaluate their approach, Altae-Tran et al. used FLSHclust to search for rare CRISPR systems in an 8.8-terabase-pair metagenomic database containing 8 billion proteins and 10.2 million CRISPR arrays. The analysis uncovered 188 previously unknown CRISPR-linked gene modules. The authors also identified and characterized a new Cas14-containing CRISPR system, type VII, which acts on RNA. According to the findings, the newly identified systems were rare: many were represented by only a single cluster among the nearly 130,000 CRISPR-linked clusters revealed by FLSHclust.
“The discovery of previously unknown cas genes and CRISPR systems substantially expands the known CRISPR diversity, emphasizing the functional versatility of CRISPR whereby previously undiscovered proteins and domains are often recruited, either replacing preexisting components or conferring newly identified functions to the preexisting scaffold of Cas proteins,” write Altae-Tran et al. “Taken together, the results of the work reveal unprecedented organizational and functional flexibility and modularity of CRISPR systems but also demonstrates that most variants are rare and only found in relatively unusual bacteria and archaea.”
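For readers unfamiliar with locality-sensitive hashing, the general idea named in FLSHclust's acronym can be illustrated with a small sketch. This is not the authors' implementation: it uses generic MinHash signatures over k-mers with banding, and every function name and parameter here (`kmers`, `minhash_signature`, `cluster_sequences`, `num_hashes`, `bands`, `threshold`) is a hypothetical choice made for illustration. The point it shows is that similar sequences tend to land in the same hash bucket, so only bucket-mates need to be compared directly instead of every pair.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=3):
    """Set of overlapping k-mers; falls back to the whole sequence if it is shorter than k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)} or {seq}


def minhash_signature(seq, num_hashes=32, k=3):
    """MinHash signature: for each seed, the smallest 64-bit hash over all k-mers."""
    shingles = kmers(seq, k)
    return tuple(
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{km}".encode(), digest_size=8).digest(), "big"
            )
            for km in shingles
        )
        for seed in range(num_hashes)
    )


def lsh_candidate_pairs(seqs, num_hashes=32, bands=8, k=3):
    """Bucket sequences by signature bands; sequences sharing any bucket become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for name, seq in seqs.items():
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(name)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    return len(a & b) / len(a | b)


def cluster_sequences(seqs, threshold=0.5, num_hashes=32, bands=8, k=3):
    """Verify candidate pairs by k-mer Jaccard similarity and merge them with union-find."""
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in lsh_candidate_pairs(seqs, num_hashes, bands, k):
        if jaccard(kmers(seqs[a], k), kmers(seqs[b], k)) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in seqs:
        groups[find(name)].append(name)
    return sorted(sorted(g) for g in groups.values())
```

In this toy version, two identical sequences always share every bucket and are clustered together, while a sequence with no k-mers in common stays in its own cluster. A tool operating on billions of proteins would need a distributed implementation and a similarity measure suited to proteins, which this sketch does not attempt.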
Journal: Science
Article Title: Uncovering the functional diversity of rare CRISPR-Cas systems with deep terascale clustering
Article Publication Date: 24-Nov-2023