Hanyang University | Jangho Seo

NovoRank

NovoRank is a machine learning-based post-processing tool for de novo peptide sequencing that re-ranks candidate peptides to improve identification accuracy.

Workflow of NovoRank. (A) Analysis flow of NovoRank. (B) Two-step clustering method. Each color denotes one cluster at the corresponding step. (C) Example cluster-score (C-score) calculation in a cluster of two MS/MS spectra. (D) Deep learning model that predicts the more desirable peptide of the top two candidate peptides in a cluster.

Summary

Overview

De novo sequencing is a crucial technique in proteomics that identifies peptide sequences directly from experimental MS/MS spectra, without relying on existing protein databases. It plays a key role in identifying novel peptides, for example in neoantigen discovery. However, conventional de novo sequencing tools consider each spectrum in isolation and rely solely on imperfect scoring functions, which often leads to erroneous peptide identifications. NovoRank improves the reliability and accuracy of de novo sequencing by re-ranking candidate sequences based on a joint analysis of similar spectra, under the assumption that similar spectra originate from the same peptide species.

Keywords

Clustering
Deep Learning
Data Analytics

Dataset

Experimental Data Sets

The main dataset used in NovoRank consists of MS/MS spectra in Mascot Generic Format (.mgf). The spectra were obtained from mass spectrometry experiments and contain peptide fragmentation information.
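As an illustration of the input format, the sketch below parses MGF text using only the Python standard library. It assumes the common layout in which each spectrum sits between BEGIN IONS and END IONS, header lines are KEY=VALUE pairs, and peak lines hold an m/z value followed by an intensity. The function name read_mgf and the example spectrum are hypothetical and are not taken from the NovoRank code.

```python
import io


def read_mgf(handle):
    """Yield one dict per BEGIN IONS / END IONS block (hypothetical helper)."""
    spectrum = None
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key.upper()] = value
            else:
                # Peak line: m/z, intensity, optionally further columns
                fields = line.split()
                if len(fields) >= 2:
                    spectrum["peaks"].append((float(fields[0]), float(fields[1])))


# Made-up example spectrum, for illustration only.
example = """BEGIN IONS
TITLE=scan_1
PEPMASS=500.25 12345.0
CHARGE=2+
100.0 10.0
200.5 25.0
END IONS
"""

spectra = list(read_mgf(io.StringIO(example)))
print(spectra[0]["params"]["TITLE"], len(spectra[0]["peaks"]))
```

For real files, an established reader such as the one in a proteomics library is likely a better choice than this minimal parser.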

For more details, please refer to the link at the top of this section and here.

Contribution

Introduced Two-Step Clustering Method: To group similar spectra, spectral clustering was applied for the initial grouping, and DBSCAN then performed fine-grained clustering within each first-stage cluster, so that similar spectra are grouped more precisely.
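The two-step idea can be sketched as follows. This is a minimal illustration under several assumptions, not the NovoRank implementation: the first-stage cluster labels are taken as already computed, each spectrum is represented by a hypothetical numeric feature tuple, the distance is Euclidean, and the values of eps and min_samples are arbitrary. The text above does not specify any of these choices. DBSCAN is written out in plain Python so the example needs no external library.

```python
import math
from collections import defaultdict, deque

NOISE = -1


def dbscan(points, eps, min_samples):
    """Plain DBSCAN with Euclidean distance; returns one label per point (-1 = noise)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual definition.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point, keep expanding
                queue.extend(reachable)
        cluster_id += 1
    return labels


def two_step_clusters(first_stage_labels, features, eps, min_samples):
    """Refine each first-stage cluster with DBSCAN; returns (first, second) label pairs."""
    members = defaultdict(list)
    for index, label in enumerate(first_stage_labels):
        members[label].append(index)

    final = [None] * len(features)
    for label, indices in members.items():
        sub_labels = dbscan([features[i] for i in indices], eps, min_samples)
        for index, sub in zip(indices, sub_labels):
            final[index] = (label, sub)
    return final


# Hypothetical data: six spectra, two first-stage clusters, one feature each.
first_stage = [0, 0, 0, 0, 1, 1]
features = [(1.0,), (1.1,), (5.0,), (5.1,), (1.0,), (1.05,)]
clusters = two_step_clusters(first_stage, features, eps=0.3, min_samples=2)
print(clusters)
```

In this toy input, the first-stage cluster 0 contains two well-separated groups, so the second step splits it in two, while first-stage cluster 1 stays intact.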

Proposed C-Score: A new scoring method, the C-score, was introduced to identify more reliable candidate peptides within a cluster. For each candidate peptide, the C-score is the sum of its de novo scores across the cluster, divided by the cluster size. Peptides with higher de novo scores and peptides that appear more often within the cluster therefore receive higher C-scores, and the normalization allows a fair comparison between clusters of different sizes. As a result, peptides that are more likely to be the correct sequence are ranked higher.
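The C-score description above can be sketched in a few lines. The sketch assumes that the cluster size is the number of spectra in the cluster and that every spectrum contributes a list of (peptide, de novo score) candidates. The function name c_scores, the peptide strings, and the score values are made up for illustration.

```python
from collections import defaultdict


def c_scores(cluster):
    """Sum each peptide's de novo scores over the cluster, then divide by cluster size.

    cluster: list of spectra, each a list of (peptide, de_novo_score) candidates.
    """
    totals = defaultdict(float)
    for candidates in cluster:
        for peptide, score in candidates:
            totals[peptide] += score
    size = len(cluster)  # assumption: cluster size = number of spectra
    return {peptide: total / size for peptide, total in totals.items()}


# Hypothetical cluster of two MS/MS spectra with two candidates each.
cluster = [
    [("PEPTIDEK", 0.75), ("PEPTLDEK", 0.5)],
    [("PEPSIDEK", 0.75), ("PEPTIDEK", 0.5)],
]

scores = c_scores(cluster)
top_two = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_two)  # [('PEPTIDEK', 0.625), ('PEPSIDEK', 0.375)]
```

Here PEPTIDEK is the top candidate for only one of the two spectra, but because it appears in both, its C-score (0.625) exceeds that of PEPSIDEK (0.375), which appears once.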

Applied Deep Learning Model: Developed a multi-modal deep learning model that takes spectrum, sequence, and tabular data as simultaneous inputs. The model selects the more desirable peptide of the top two candidates in a cluster, which improved identification accuracy.

Applicable to Existing De Novo Sequencing Tools: NovoRank was applied to several de novo peptide sequencing tools and improved their peptide identification performance in terms of both precision and recall.

Results

Precision and recall of NovoRank across three de novo peptide sequencing tools and three data sets. (A) Precision and (B) recall at varying score thresholds.

Precision and recall increased by an average of 4.6% and 4.5%, respectively, with improvements of 3.8% and 3.6% for Casanovo, a state-of-the-art (SOTA) model.
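For readers unfamiliar with threshold-based evaluation, the sketch below shows one common way to compute precision and recall at a score threshold in de novo sequencing. It assumes that precision is the fraction of correct peptides among predictions scoring at or above the threshold, and that recall is the number of those correct predictions divided by the total number of spectra. These definitions are an assumption, since they are not spelled out here, and the data are invented.

```python
def precision_recall(predictions, threshold):
    """Compute (precision, recall) for predictions scoring at or above a threshold.

    predictions: list of (score, is_correct) pairs, one per spectrum.
    """
    kept = [is_correct for score, is_correct in predictions if score >= threshold]
    correct = sum(kept)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    return precision, recall


# Hypothetical scored predictions for five spectra.
predictions = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]

for threshold in (0.0, 0.5, 0.85):
    precision, recall = precision_recall(predictions, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
```

Raising the threshold keeps fewer predictions, which typically raises precision and lowers recall; sweeping the threshold traces out curves like those in the figure.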

With NovoRank, the number of correct peptide identifications increased by 0.37% to 0.61% for Casanovo, 7.52% to 18.80% for pNovo3, and 3.06% to 4.24% for PEAKS. Although some peptides were missed after re-ranking, the number of newly identified peptides was greater.

※ De novo sequencing tools

1. PEAKS: Algorithm-based commercial tool

2. pNovo3: Tool combining algorithm and machine learning

3. Casanovo: Deep learning-based tool using a transformer (SOTA)

References

NovoRank: Refinement for De Novo Peptide Sequencing Based on Spectral Clustering and Deep Learning