Using the Python API

[1]:

import logging
import plotly.io

logging.basicConfig(level=logging.INFO)
plotly.io.renderers.default = "plotly_mimetype+notebook"

Reading and parsing peptide-spectrum matches

[2]:

from psm_utils.io import read_file

from ms2rescore.report.charts import score_histogram

INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.

Reading the PSM file

MS²Rescore is fully centered around the use of a psm_utils PSMList. This is a unified data representation of PSMs and their various attributes. Internally, it is simply a list of Pydantic data classes which represent PSMs. With the submodule psm_utils.io, we can read PSMs from a variety of file formats. Here, we will read a PSM file in the MaxQuant msms.txt format.

Importantly, for rescoring, the PSM file must contain all target and decoy PSMs, including PSMs that did not pass the FDR threshold. Most search engines must be specifically configured to return all PSMs without FDR filtering.

[3]:

psm_list = read_file("../../../examples/id/msms.txt", filetype="msms")
psm_list["spectrum_id"] = [str(spec_id) for spec_id in psm_list["spectrum_id"]]

For a quick inspection, we can format the PSM list as a Pandas dataframe and display the first few rows:

[4]:

psm_list.to_dataframe().head()

[4]:

	peptidoform	spectrum_id	run	collection	spectrum	is_decoy	score	qvalue	pep	precursor_mz	retention_time	ion_mobility	protein_list	rank	source	provenance_data	metadata	rescoring_features
0	AAAAAAALQAK/2	4703	20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02	None	None	False	107.660	None	0.001517	478.77982	5.2007	None	[P36578, H3BM89, H3BU31]	None	msms	{'msms_filename': '..\..\..\examples\id\msms.t...	{'Scan index': '3698', 'Sequence': 'AAAAAAALQA...	{}
1	[ac]-AAAAAEQQQFYLLLGNLLSPDNVVR/3	13572	20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02	None	None	False	107.740	None	0.004931	915.15197	11.8470	None	[O00410, E7ETV3, E7EQT5, C9JZD8]	None	msms	{'msms_filename': '..\..\..\examples\id\msms.t...	{'Scan index': '11885', 'Sequence': 'AAAAAEQQQ...	{}
2	[ac]-AAAAAEQQQFYLLLGNLLSPDNVVRK/3	13366	20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02	None	None	False	137.890	None	0.000493	957.85029	11.6900	None	[O00410, E7ETV3, E7EQT5, C9JZD8]	None	msms	{'msms_filename': '..\..\..\examples\id\msms.t...	{'Scan index': '11695', 'Sequence': 'AAAAAEQQQ...	{}
3	AAAAAQGGGGGEPR/2	505	20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02	None	None	False	22.641	None	0.142020	585.28653	0.5178	None	[E9PJF0, E9PQW4, P27361]	None	msms	{'msms_filename': '..\..\..\examples\id\msms.t...	{'Scan index': '419', 'Sequence': 'AAAAAQGGGGG...	{}
4	AAAAAWEEPSSGN[de]GTAR/2	6589	20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02	None	None	False	89.403	None	0.046504	823.87389	6.6105	None	[Q9P258]	None	msms	{'msms_filename': '..\..\..\examples\id\msms.t...	{'Scan index': '5439', 'Sequence': 'AAAAAWEEPS...	{}

We can also directly plot the current PSM score distributions:

[5]:

score_histogram(psm_list)

Parsing modification names

While psm_utils could take care of all file parsing, we must still map the amino acid modification names that were used by the search engine to ones that are recognized by tools such as MS²PiP and DeepLC. This includes:

Names as used in the Unimod or PSI-MOD databases
Accession numbers as used in the Unimod or PSI-MOD databases
Chemical formulas
Mass shifts in Da

Note that, for instance DeepLC, requires a chemical formula to encode modifications. It will not be able to correctly encode modifications if only the mass shift is provided. It is therefore always preferred to provide a name/accession of a database where this information can be retrieved, or provide the chemical formula directly.

If a chemical formula is provided, other tools, such as MS²PIP, can use it to derive the correct mass shift.

To map modification names, simply provide a dictionary to the psm_list.map_modifications method.

[6]:

psm_list.rename_modifications({
    "gl": "Gln->pyro-Glu",
    "ox": "Oxidation",
    "ac": "Acetylation",
    "de": "Deamidation",
})

Assigning fixed modifications

Some search engines (although not many) do not report fixed (sometimes also called static) modifications in their output PSM files. This is for instance the case for MaxQuant. With the method psm_list.add_fixed_modifications, we can systematically assign fixed modifications to their amino acid targets, as was done during the search.

Note that add_fixed_modifications adds the modification in the ProForma 2.0 encoding as a prefix. To fully apply the modifications in the sequence, use the psm_list.apply_fixed_modifications method. A PSM ACDE/2 would therefore go from <[U:4]@C>ACDE/2 to AC[U:4]DE/2.

[7]:

psm_list.add_fixed_modifications([("U:Carbamidomethyl", ["C"])])
psm_list.apply_fixed_modifications()

Adding rescoring features

[8]:

import pandas as pd

from ms2rescore.feature_generators.basic import BasicFeatureGenerator
from ms2rescore.feature_generators.maxquant import MaxQuantFeatureGenerator
from ms2rescore.feature_generators.ms2pip import MS2PIPFeatureGenerator
from ms2rescore.feature_generators.deeplc import DeepLCFeatureGenerator

The PSM list, as read from a MaxQuant msms.txt file, currently does not contain any rescoring features:

[9]:

pd.DataFrame(list(psm_list["rescoring_features"]))

[9]:


0
1
2
3
4
...
35551
35552
35553
35554
35555

35556 rows × 0 columns

Note that psm_list[rescoring_features] returns a Numpy array of dictionaries, but we can use Pandas to format it as a table.

Basic features

Basic features from BasicFeatureGenerator can be added from any PSM list and contain simple feature, such as the search engine score, charge state, and absolute precursor mass error.

[10]:

basic_fgen = BasicFeatureGenerator()
basic_fgen.add_features(psm_list)

INFO:ms2rescore.feature_generators.basic:Adding basic features to PSMs.

[11]:

pd.DataFrame(list(psm_list["rescoring_features"]))

[11]:

	charge_n	charge_1	charge_2	charge_3	charge_4	charge_5	charge_6	charge_7	abs_ms1_error_ppm	search_engine_score
0	2	0	1	0	0	0	0	0	1.137810	107.6600
1	3	0	0	1	0	0	0	0	0.598941	107.7400
2	3	0	0	1	0	0	0	0	0.573291	137.8900
3	2	0	1	0	0	0	0	0	0.929236	22.6410
4	2	0	1	0	0	0	0	0	0.666129	89.4030
...	...	...	...	...	...	...	...	...	...	...
35551	2	0	1	0	0	0	0	0	0.782088	76.1000
35552	4	0	0	0	1	0	0	0	1.147761	114.6400
35553	2	0	1	0	0	0	0	0	0.813101	43.1240
35554	3	0	0	1	0	0	0	0	0.719097	23.5620
35555	3	0	0	1	0	0	0	0	1.083807	9.1885

35556 rows × 10 columns

Now the PSMs contain a simple set of features.

Note that the charge state is present as both an integer and a one-hot encoded vector. This is because charge state can act as both a categorical and a numerical feature. Simply put, sometimes an effect increases or decreases with increasing charge state, and sometimes it is simply different for different charge states.

MS²PIP features

MS²PIP is a machine learning tool that predicts the MS2 spectrum of a peptide given its sequence. It is previously identified MS2 spectra and their corresponding peptide sequences. Because MS²PIP uses the highly performant - but traditional - machine learning approach XGBoost, it can already produce accurate predictions even if trained on smaller spectral libraries. This makes MS²PIP a very flexible platform to train new models on custom datasets. Nevertheless, MS²PIP comes with several pre-trained models, which we will use in this tutorial.

Because traditional proteomics search engines do not fully consider MS2 peak intensities in their scoring functions, adding rescoring features derived from spectrum prediction tools has proved to be a very effective way to further improve the sensitivity of peptide-spectrum matching. Generating features from MS²PIP predictions follows these steps:

Predict MS2 spectra for all PSMs in the dataset, including decoy PSMs and low scoring target PSMs.
Read and parse the observed MS2 spectra from the original spectrum files (MGF or mzML).
Compare the predicted and observed spectra using various similarity metrics, which are returned as rescoring features.

Configuring MS²PIP

In contrast to the basic feature generator, MS²PIP requires some parameters to be set and requires access to the original observed peptide spectra:

model: Name of the prediction model to be used. This strongly depends on the dataset your are rescoring. A list of all MS²PIP models is available on https://ms2pip.readthedocs.io/en/latest/prediction-models/.
ms2_tolerance: As MS²PIP must reannotate the observed MS2 spectra, a mass tolerance must be set. For MS²PIP, this is configured in Dalton. A good value for Orbitrap spectra, for example, is 0.02.
spectrum_path: Path to the original spectrum files. This can be a single file, or a directory containing multiple files. The spectrum files must be in MGF or mzML format.
spectrum_id_pattern: A regular expression pattern to extract the spectrum ID from the PSMs (see below).
processes: Number of CPU processes to use for parallel processing. Note that higher values can lead to memory issues.

[12]:

ms2pip_fgen = MS2PIPFeatureGenerator(
    model="HCD",
    ms2_tolerance=0.02,
    spectrum_path="../../../examples/mgf/20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.mgf",
    spectrum_id_pattern=r".*scan=(\d+)$",
    processes=8,
)

A very import aspect of generating MS²PIP-derived features is making sure the peptide identifications are matched to the correct spectra. Unfortunately, many search engines alter spectrum identifiers in their output files. In the case of MaxQuant and data from Thermo instruments, fortunately, the scan numbers are returned. If the spectrum files have been converted to MGF with ThermoRawFileParser, we need to extract these exact scan numbers from the spectrum TITLE entries. This can be done with a regular expression pattern. For example, the following pattern extracts the scan number from the following spectrum title:

spectrum_id_pattern = r"scan=(\d+)"

mzspec=20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.raw: controllerType=0 controllerNumber=1 scan=2

We can test this with the first spectrum in the file:

[13]:

from pyteomics.mgf import read as read_mgf

for spectrum in read_mgf("../../../examples/mgf/20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.mgf"):
    print(spectrum["params"]["title"])
    break

mzspec=20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.raw: controllerType=0 controllerNumber=1 scan=2

[14]:

import re
spectrum_id = re.match(r".*scan=(\d+)$", spectrum["params"]["title"]).group(1)
print(spectrum_id)

We can use this scan number to retrieve its corresponding PSM:

[15]:

psm_list[psm_list["spectrum_id"] == spectrum_id]

[15]:

PSMList(
    psm_list=[
        PSM(
            peptidoform=Peptidoform('ELMKLAPR/2'),
            spectrum_id='2',
            run='20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02',
            collection=None,
            spectrum=None,
            is_decoy=False,
            score=58.194,
            qvalue=None,
            pep=0.11486,
            precursor_mz=479.28114,
            retention_time=0.0054151,
            ion_mobility=None,
            protein_list=['K7ER25', 'K7EJY1', 'Q7Z3S7', 'E7EUE0'],
            rank=None,
            source='msms',
            provenance_data={
                'msms_filename': '..\\..\\..\\examples\\id\\msms.txt'
            },
            metadata={
                'Scan index': '0',
                'Sequence': 'ELMKLAPR',
                'Length': '8',
                'Missed cleavages': '1',
                'Modifications': 'Unmodified',
                ... +51
            },
            rescoring_features={
                'charge_n': 2,
                'charge_1': 0,
                'charge_2': 1,
                'charge_3': 0,
                'charge_4': 0,
                ... +5
            }
        )
    ]
)

Generating MS²PIP features

[16]:

ms2pip_fgen.add_features(psm_list)

INFO:ms2rescore.feature_generators.ms2pip:Adding MS²PIP-derived features to PSMs.
INFO:ms2rescore.feature_generators.ms2pip:Running MS²PIP for PSMs from run `20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02`...
INFO:ms2pip.core:Processing spectra and peptides...

WARNING:ms2rescore.feature_generators.ms2pip:Failed to calculate features for 2 PSMs

[17]:

pd.DataFrame(list(psm_list["rescoring_features"]))

[17]:

	charge_n	charge_1	charge_2	charge_3	charge_4	charge_5	charge_6	charge_7	abs_ms1_error_ppm	search_engine_score	...	iony_abs_diff_Q2	iony_abs_diff_Q3	iony_mean_abs_diff	iony_std_abs_diff	dotprod	dotprod_ionb	dotprod_iony	cos	cos_ionb	cos_iony
0	2	0	1	0	0	0	0	0	1.137810	107.6600	...	0.004537	0.009003	0.005573	0.004217	0.007582	6.011568e-03	0.001571	0.965756	0.986066	0.948538
1	3	0	0	1	0	0	0	0	0.598941	107.7400	...	0.001146	0.003543	0.003072	0.004661	0.013845	8.715502e-03	0.005130	0.950527	0.972807	0.984188
2	3	0	0	1	0	0	0	0	0.573291	137.8900	...	0.001195	0.002719	0.001657	0.001813	0.010731	9.636219e-03	0.001095	0.930393	0.933799	0.973310
3	2	0	1	0	0	0	0	0	0.929236	22.6410	...	0.006460	0.015503	0.012758	0.016877	0.002545	1.657421e-03	0.000887	0.626743	0.900722	0.415943
4	2	0	1	0	0	0	0	0	0.666129	89.4030	...	0.001901	0.011982	0.010727	0.019760	0.001069	3.751754e-04	0.000694	0.720658	0.976536	0.662783
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
35551	2	0	1	0	0	0	0	0	0.782088	76.1000	...	0.016826	0.030917	0.019391	0.015875	0.002695	1.119519e-03	0.001575	0.839569	0.991228	0.776745
35552	4	0	0	0	1	0	0	0	1.147761	114.6400	...	0.000678	0.002492	0.001421	0.001660	0.001290	4.580596e-04	0.000832	0.852407	0.987883	0.977172
35553	2	0	1	0	0	0	0	0	0.813101	43.1240	...	0.014101	0.016411	0.015302	0.016832	0.001289	1.163584e-03	0.000125	0.386544	0.886590	0.076230
35554	3	0	0	1	0	0	0	0	0.719097	23.5620	...	0.003215	0.007025	0.005133	0.006286	0.000170	1.747118e-07	0.000170	0.207400	0.007175	0.248114
35555	3	0	0	1	0	0	0	0	1.083807	9.1885	...	0.000841	0.002671	0.003405	0.004654	0.000018	3.528620e-12	0.000018	0.073865	0.767783	0.233401

35556 rows × 81 columns

Generating DeepLC features

[18]:

deeplc_fgen = DeepLCFeatureGenerator(
    lower_score_is_better=False,
    calibration_set_size=0.15,
    spectrum_path=None,
    processes=8,
    deeplc_retrain=False,
)
deeplc_fgen.add_features(psm_list)

INFO:ms2rescore.feature_generators.deeplc:Adding DeepLC-derived features to PSMs.
INFO:ms2rescore.feature_generators.deeplc:Running DeepLC for PSMs from run (1/1): `20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02`...

Generating features from MaxQuant output

While MaxQuant msms.txt files do not contain directly usable rescoring features, it does contain various metrics from spectrum matching from which we can calculate rescoring features. These include features related to the mass error of the annotated MS2 peaks and the summed intensity of the annotated peaks.

The data that is used to calculate these features is stored in the PSM metadata attributes. For instance:

[19]:

pd.DataFrame(list(psm_list["metadata"]))[[
    "Matches", "Intensities", "Mass Deviations [Da]", "Intensity coverage"
]].head()

[19]:

	Matches	Intensities	Mass Deviations [Da]	Intensity coverage
0	y1;y2;y3;y4;y5;y6;y7;y8;y9;y1-NH3;y2-NH3;y3-NH...	6420431.2;5259286.7;7595262.2;8731264.7;116766...	-9.10020734181671E-05;-0.000929981532976853;-0...	0.450133939639015
1	y1;y2;y4;y6;y7;y8;y9;y10;y11;y12;y13;y14;y2-NH...	357644.7;48235;495609.6;1150605.6;1479450.3;14...	-0.000377184504145589;-0.000336244570064537;-0...	0.682793460933943
2	y1;y2;y3;y4;y5;y6;y7;y8;y9;y10;y11;y12;y13;y14...	467523.3;1624841.4;1272199.7;504549;6773740.4;...	-0.000650374560791533;0.000150255794096665;0.0...	0.632290650531871
3	y2;y5;y5-H2O;a2;b2;b3	245612.6;40630.3;18142.2;108568;152454.4;155335.3	-0.0009548283943559;0.00245136965645543;0.0023...	0.187275289286221
4	y1;y4;y6;y9;y10;y11;y12;y13;y11-H2O;y1-NH3;a2;...	9529767.6;1651203.2;270327.2;11359473.6;212221...	-0.000178087084577783;0.00014824673877456;0.00...	0.0479428579117575

[20]:

maxquant_fgen = MaxQuantFeatureGenerator()
maxquant_fgen.add_features(psm_list)

INFO:ms2rescore.feature_generators.maxquant:Adding MaxQuant-derived features to PSMs.

Assessing individual feature performance before rescoring

[21]:

import plotly.express as px
from ms2rescore.report.charts import (
    calculate_feature_qvalues,
    feature_ecdf_auc_bar,
    fdr_plot,
    ms2pip_correlation,
)
import deeplc.plot

features = pd.DataFrame(list(psm_list["rescoring_features"]))

The following plot shows the number of identified PSM at varying FDR thresholds, where the red line indicates the commonly used 1% FDR threshold.

[22]:

psm_list.calculate_qvalues()  # msms.txt does not contain q-values
fig = fdr_plot(psm_list.to_dataframe(), fdr_thresholds=None, log=False)
fig.show()

We can make a similar plot for each of the rescoring features by simply calculating q-values as if the feature were a search engine score.

[23]:

feature_qvalues, feature_ecdf_auc = calculate_feature_qvalues(
    features=features,
    is_decoy=psm_list["is_decoy"],
)

[24]:

feature_names = {
    "basic": basic_fgen.feature_names,
    "ms2pip": ms2pip_fgen.feature_names,
    "deeplc": deeplc_fgen.feature_names,
    "maxquant": maxquant_fgen.feature_names,
}
feature_generator_map = {name: gen for gen, f_list in feature_names.items() for name in f_list}

feature_qvalues_melt = feature_qvalues.melt(var_name="feature", value_name="q-value")
feature_qvalues_melt["feature_generator"] = feature_qvalues_melt["feature"].map(feature_generator_map)

[25]:

# Plot takes 100MB in saved file...

# fig = px.ecdf(
#     data_frame=feature_qvalues_melt,
#     x="q-value",
#     color="feature_generator",
#     line_dash="feature",
# )
# fig.show()

The larger the area under the curve (AUC) is for each line, the better. We can therefore visualize the same information in a more easy-to-interpret bar chart with the AUCs, which were already calculated by the calculate_feature_qvalues function.

[26]:

feature_ecdf_auc["feature_generator"] = feature_ecdf_auc["feature"].map(feature_generator_map)

[27]:

fig = feature_ecdf_auc_bar(feature_ecdf_auc)
fig.show()

Importantly, a low q-value ECDF AUC does not necessarily mean that a feature is not valuable for rescoring. The more orthogonal a feature is to the other information already used in PSM scoring, the more it will help to confidently identify PSMs. For instance, retention time features are completely independent from the information used in a traditional search engine. When combined with other scoring information, they can provide a substantial boost in sensitivity.

The quality of specific rescoring features can also be assessed in a manner specific to the feature. For instance, we can plot the distribution of the MS²PIP pearson correlations for target PSMs that initially passed the FDR threshold:

[28]:

fig = ms2pip_correlation(
    features=features,
    is_decoy=psm_list["is_decoy"],
    qvalue=psm_list["qvalue"]
)
fig.show()

For DeepLC, we can plot the predicted retention time versus the observed retention time:

[29]:

fig = deeplc.plot.scatter(
    df=features[(~psm_list["is_decoy"]) & (psm_list["qvalue"] <= 0.01)],  # noqa: E712
    predicted_column="predicted_retention_time_best",
    observed_column="observed_retention_time_best",
)
fig.show()

DeepLC also provides a function to plot the current relative mean absolute error (rMAE) of the predicted retention times against a distribution of 460 benchmark datasets. The lower the rMAE, the better.

[30]:

fig = deeplc.plot.distribution_baseline(
    df=features[(~psm_list["is_decoy"]) & (psm_list["qvalue"] <= 0.01)],  # noqa: E712
    predicted_column="predicted_retention_time_best",
    observed_column="observed_retention_time_best",
)
fig.show()

Rescoring

Using Mokapot directly

[31]:

import matplotlib.pyplot as plt
from mokapot import brew
from ms2rescore.rescoring_engines.mokapot import convert_psm_list

%matplotlib inline

In this tutorial, we will use Mokapot as rescoring engine. By default it employs the same methodologies as Percolator, but is fully implemented in Python. First, the PSM list must be converted to a Mokapot LinearPsmDataset.

[32]:

linear_psm_dataset = convert_psm_list(psm_list)

Then, the Mokapot brew function can be called to rescore the PSMs:

[33]:

confidence_results, models = brew(linear_psm_dataset)

Mokapot contains a function to conveniently plot the results:

[34]:

confidence_results.plot_qvalues()
plt.show()

../../_images/tutorials_in-depth-python-api_67_0.png

Or to compare the un-rescored PSMs to the rescored PSMs:

[35]:

linear_psm_dataset.assign_confidence().plot_qvalues(label="Before rescoring")
confidence_results.plot_qvalues(label="After rescoring")
plt.legend()
plt.show()

../../_images/tutorials_in-depth-python-api_69_0.png

Using mokapot directly allows us to quickly generate LinearPsmDataset objects with different feature sets, for instance:

Only basic features
Basic + MaxQuant-derived features (mimicking a traditional Percolator rescoring run)
Basic + MaxQuant-derived + MS²PIP features
Basic + MaxQuant-derived + MS²PIP + DeepLC features

[36]:

feature_sets = {
    "basic": feature_names["basic"],
    "basic_mq": feature_names["basic"] + feature_names["maxquant"],
    "basic_mq_ms2pip": feature_names["basic"] + feature_names["maxquant"] + feature_names["ms2pip"],
    "basic_mq_ms2pip_deeplc": feature_names["basic"] + feature_names["maxquant"] + feature_names["ms2pip"] + feature_names["deeplc"],
}

[37]:

linear_psm_dataset.assign_confidence().plot_qvalues(label="Before rescoring")

for feature_set, features in feature_sets.items():
    linear_psm_dataset = convert_psm_list(psm_list, features)
    confidence_results, models = brew(linear_psm_dataset)
    confidence_results.plot_qvalues(label=f"Rescoring with: {feature_set}")

plt.xscale('log')
plt.xlim(1e-4, 1)
plt.legend()
plt.show()

../../_images/tutorials_in-depth-python-api_72_0.png

Using Mokapot within MS²Rescore

Mokapot can also be used for rescoring with the higher-level integration in MS²Rescore. Simply pass a PSMList object with rescoring features added to the rescore function. Doing so will automatically convert the PSMList to a Mokapot LinearPsmDataset, run the brew function, and update the original PSMList with the new scores, q-values and PEPs.

[38]:

from ms2rescore.rescoring_engines import mokapot

mokapot.rescore(psm_list)

[39]:

score_histogram(psm_list.to_dataframe())