Using the Python API

[1]:
import logging
import plotly.io

logging.basicConfig(level=logging.INFO)
plotly.io.renderers.default = "plotly_mimetype+notebook"

Reading and parsing peptide-spectrum matches

[2]:
from psm_utils.io import read_file

from ms2rescore.report.charts import score_histogram
INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.

Reading the PSM file

MS²Rescore is fully centered around the use of a psm_utils PSMList. This is a unified data representation of PSMs and their various attributes. Internally, it is simply a list of Pydantic data classes which represent PSMs. With the submodule psm_utils.io, we can read PSMs from a variety of file formats. Here, we will read a PSM file in the MaxQuant msms.txt format.

Importantly, for rescoring, the PSM file must contain all target and decoy PSMs, including PSMs that did not pass the FDR threshold. Most search engines must be specifically configured to return all PSMs without FDR filtering.

[3]:
psm_list = read_file("../../../examples/id/msms.txt", filetype="msms")
psm_list["spectrum_id"] = [str(spec_id) for spec_id in psm_list["spectrum_id"]]

For a quick inspection, we can format the PSM list as a Pandas dataframe and display the first few rows:

[4]:
psm_list.to_dataframe().head()
[4]:
peptidoform spectrum_id run collection spectrum is_decoy score qvalue pep precursor_mz retention_time ion_mobility protein_list rank source provenance_data metadata rescoring_features
0 AAAAAAALQAK/2 4703 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 None None False 107.660 None 0.001517 478.77982 5.2007 None [P36578, H3BM89, H3BU31] None msms {'msms_filename': '..\..\..\examples\id\msms.t... {'Scan index': '3698', 'Sequence': 'AAAAAAALQA... {}
1 [ac]-AAAAAEQQQFYLLLGNLLSPDNVVR/3 13572 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 None None False 107.740 None 0.004931 915.15197 11.8470 None [O00410, E7ETV3, E7EQT5, C9JZD8] None msms {'msms_filename': '..\..\..\examples\id\msms.t... {'Scan index': '11885', 'Sequence': 'AAAAAEQQQ... {}
2 [ac]-AAAAAEQQQFYLLLGNLLSPDNVVRK/3 13366 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 None None False 137.890 None 0.000493 957.85029 11.6900 None [O00410, E7ETV3, E7EQT5, C9JZD8] None msms {'msms_filename': '..\..\..\examples\id\msms.t... {'Scan index': '11695', 'Sequence': 'AAAAAEQQQ... {}
3 AAAAAQGGGGGEPR/2 505 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 None None False 22.641 None 0.142020 585.28653 0.5178 None [E9PJF0, E9PQW4, P27361] None msms {'msms_filename': '..\..\..\examples\id\msms.t... {'Scan index': '419', 'Sequence': 'AAAAAQGGGGG... {}
4 AAAAAWEEPSSGN[de]GTAR/2 6589 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 None None False 89.403 None 0.046504 823.87389 6.6105 None [Q9P258] None msms {'msms_filename': '..\..\..\examples\id\msms.t... {'Scan index': '5439', 'Sequence': 'AAAAAWEEPS... {}

We can also directly plot the current PSM score distributions:

[5]:
score_histogram(psm_list)

Parsing modification names

While psm_utils could take care of all file parsing, we must still map the amino acid modification names that were used by the search engine to ones that are recognized by tools such as MS²PiP and DeepLC. This includes:

  • Names as used in the Unimod or PSI-MOD databases

  • Accession numbers as used in the Unimod or PSI-MOD databases

  • Chemical formulas

  • Mass shifts in Da

Note that, for instance DeepLC, requires a chemical formula to encode modifications. It will not be able to correctly encode modifications if only the mass shift is provided. It is therefore always preferred to provide a name/accession of a database where this information can be retrieved, or provide the chemical formula directly.

If a chemical formula is provided, other tools, such as MS²PIP, can use it to derive the correct mass shift.

To map modification names, simply provide a dictionary to the psm_list.map_modifications method.

[6]:
psm_list.rename_modifications({
    "gl": "Gln->pyro-Glu",
    "ox": "Oxidation",
    "ac": "Acetylation",
    "de": "Deamidation",
})

Assigning fixed modifications

Some search engines (although not many) do not report fixed (sometimes also called static) modifications in their output PSM files. This is for instance the case for MaxQuant. With the method psm_list.add_fixed_modifications, we can systematically assign fixed modifications to their amino acid targets, as was done during the search.

Note that add_fixed_modifications adds the modification in the ProForma 2.0 encoding as a prefix. To fully apply the modifications in the sequence, use the psm_list.apply_fixed_modifications method. A PSM ACDE/2 would therefore go from <[U:4]@C>ACDE/2 to AC[U:4]DE/2.

[7]:
psm_list.add_fixed_modifications([("U:Carbamidomethyl", ["C"])])
psm_list.apply_fixed_modifications()

Adding rescoring features

[8]:
import pandas as pd

from ms2rescore.feature_generators.basic import BasicFeatureGenerator
from ms2rescore.feature_generators.maxquant import MaxQuantFeatureGenerator
from ms2rescore.feature_generators.ms2pip import MS2PIPFeatureGenerator
from ms2rescore.feature_generators.deeplc import DeepLCFeatureGenerator

The PSM list, as read from a MaxQuant msms.txt file, currently does not contain any rescoring features:

[9]:
pd.DataFrame(list(psm_list["rescoring_features"]))
[9]:
0
1
2
3
4
...
35551
35552
35553
35554
35555

35556 rows × 0 columns

Note that psm_list[rescoring_features] returns a Numpy array of dictionaries, but we can use Pandas to format it as a table.

Basic features

Basic features from BasicFeatureGenerator can be added from any PSM list and contain simple feature, such as the search engine score, charge state, and absolute precursor mass error.

[10]:
basic_fgen = BasicFeatureGenerator()
basic_fgen.add_features(psm_list)
INFO:ms2rescore.feature_generators.basic:Adding basic features to PSMs.
[11]:
pd.DataFrame(list(psm_list["rescoring_features"]))
[11]:
charge_n charge_1 charge_2 charge_3 charge_4 charge_5 charge_6 charge_7 abs_ms1_error_ppm search_engine_score
0 2 0 1 0 0 0 0 0 1.137810 107.6600
1 3 0 0 1 0 0 0 0 0.598941 107.7400
2 3 0 0 1 0 0 0 0 0.573291 137.8900
3 2 0 1 0 0 0 0 0 0.929236 22.6410
4 2 0 1 0 0 0 0 0 0.666129 89.4030
... ... ... ... ... ... ... ... ... ... ...
35551 2 0 1 0 0 0 0 0 0.782088 76.1000
35552 4 0 0 0 1 0 0 0 1.147761 114.6400
35553 2 0 1 0 0 0 0 0 0.813101 43.1240
35554 3 0 0 1 0 0 0 0 0.719097 23.5620
35555 3 0 0 1 0 0 0 0 1.083807 9.1885

35556 rows × 10 columns

Now the PSMs contain a simple set of features.

Note that the charge state is present as both an integer and a one-hot encoded vector. This is because charge state can act as both a categorical and a numerical feature. Simply put, sometimes an effect increases or decreases with increasing charge state, and sometimes it is simply different for different charge states.

MS²PIP features

MS²PIP is a machine learning tool that predicts the MS2 spectrum of a peptide given its sequence. It is previously identified MS2 spectra and their corresponding peptide sequences. Because MS²PIP uses the highly performant - but traditional - machine learning approach XGBoost, it can already produce accurate predictions even if trained on smaller spectral libraries. This makes MS²PIP a very flexible platform to train new models on custom datasets. Nevertheless, MS²PIP comes with several pre-trained models, which we will use in this tutorial.

Because traditional proteomics search engines do not fully consider MS2 peak intensities in their scoring functions, adding rescoring features derived from spectrum prediction tools has proved to be a very effective way to further improve the sensitivity of peptide-spectrum matching. Generating features from MS²PIP predictions follows these steps:

  1. Predict MS2 spectra for all PSMs in the dataset, including decoy PSMs and low scoring target PSMs.

  2. Read and parse the observed MS2 spectra from the original spectrum files (MGF or mzML).

  3. Compare the predicted and observed spectra using various similarity metrics, which are returned as rescoring features.

Configuring MS²PIP

In contrast to the basic feature generator, MS²PIP requires some parameters to be set and requires access to the original observed peptide spectra:

  • model: Name of the prediction model to be used. This strongly depends on the dataset your are rescoring. A list of all MS²PIP models is available on https://ms2pip.readthedocs.io/en/latest/prediction-models/.

  • ms2_tolerance: As MS²PIP must reannotate the observed MS2 spectra, a mass tolerance must be set. For MS²PIP, this is configured in Dalton. A good value for Orbitrap spectra, for example, is 0.02.

  • spectrum_path: Path to the original spectrum files. This can be a single file, or a directory containing multiple files. The spectrum files must be in MGF or mzML format.

  • spectrum_id_pattern: A regular expression pattern to extract the spectrum ID from the PSMs (see below).

  • processes: Number of CPU processes to use for parallel processing. Note that higher values can lead to memory issues.

[12]:
ms2pip_fgen = MS2PIPFeatureGenerator(
    model="HCD",
    ms2_tolerance=0.02,
    spectrum_path="../../../examples/mgf/20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.mgf",
    spectrum_id_pattern=r".*scan=(\d+)$",
    processes=8,
)

A very import aspect of generating MS²PIP-derived features is making sure the peptide identifications are matched to the correct spectra. Unfortunately, many search engines alter spectrum identifiers in their output files. In the case of MaxQuant and data from Thermo instruments, fortunately, the scan numbers are returned. If the spectrum files have been converted to MGF with ThermoRawFileParser, we need to extract these exact scan numbers from the spectrum TITLE entries. This can be done with a regular expression pattern. For example, the following pattern extracts the scan number from the following spectrum title:

spectrum_id_pattern = r"scan=(\d+)"
mzspec=20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.raw: controllerType=0 controllerNumber=1 scan=2

We can test this with the first spectrum in the file:

[13]:
from pyteomics.mgf import read as read_mgf

for spectrum in read_mgf("../../../examples/mgf/20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.mgf"):
    print(spectrum["params"]["title"])
    break
mzspec=20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.raw: controllerType=0 controllerNumber=1 scan=2
[14]:
import re
spectrum_id = re.match(r".*scan=(\d+)$", spectrum["params"]["title"]).group(1)
print(spectrum_id)
2

We can use this scan number to retrieve its corresponding PSM:

[15]:
psm_list[psm_list["spectrum_id"] == spectrum_id]
[15]:
PSMList(
    psm_list=[
        PSM(
            peptidoform=Peptidoform('ELMKLAPR/2'),
            spectrum_id='2',
            run='20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02',
            collection=None,
            spectrum=None,
            is_decoy=False,
            score=58.194,
            qvalue=None,
            pep=0.11486,
            precursor_mz=479.28114,
            retention_time=0.0054151,
            ion_mobility=None,
            protein_list=['K7ER25', 'K7EJY1', 'Q7Z3S7', 'E7EUE0'],
            rank=None,
            source='msms',
            provenance_data={
                'msms_filename': '..\\..\\..\\examples\\id\\msms.txt'
            },
            metadata={
                'Scan index': '0',
                'Sequence': 'ELMKLAPR',
                'Length': '8',
                'Missed cleavages': '1',
                'Modifications': 'Unmodified',
                ... +51
            },
            rescoring_features={
                'charge_n': 2,
                'charge_1': 0,
                'charge_2': 1,
                'charge_3': 0,
                'charge_4': 0,
                ... +5
            }
        )
    ]
)

Generating MS²PIP features

[16]:
ms2pip_fgen.add_features(psm_list)
INFO:ms2rescore.feature_generators.ms2pip:Adding MS²PIP-derived features to PSMs.
INFO:ms2rescore.feature_generators.ms2pip:Running MS²PIP for PSMs from run `20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02`...
INFO:ms2pip.core:Processing spectra and peptides...
WARNING:ms2rescore.feature_generators.ms2pip:Failed to calculate features for 2 PSMs
[17]:
pd.DataFrame(list(psm_list["rescoring_features"]))
[17]:
charge_n charge_1 charge_2 charge_3 charge_4 charge_5 charge_6 charge_7 abs_ms1_error_ppm search_engine_score ... iony_abs_diff_Q2 iony_abs_diff_Q3 iony_mean_abs_diff iony_std_abs_diff dotprod dotprod_ionb dotprod_iony cos cos_ionb cos_iony
0 2 0 1 0 0 0 0 0 1.137810 107.6600 ... 0.004537 0.009003 0.005573 0.004217 0.007582 6.011568e-03 0.001571 0.965756 0.986066 0.948538
1 3 0 0 1 0 0 0 0 0.598941 107.7400 ... 0.001146 0.003543 0.003072 0.004661 0.013845 8.715502e-03 0.005130 0.950527 0.972807 0.984188
2 3 0 0 1 0 0 0 0 0.573291 137.8900 ... 0.001195 0.002719 0.001657 0.001813 0.010731 9.636219e-03 0.001095 0.930393 0.933799 0.973310
3 2 0 1 0 0 0 0 0 0.929236 22.6410 ... 0.006460 0.015503 0.012758 0.016877 0.002545 1.657421e-03 0.000887 0.626743 0.900722 0.415943
4 2 0 1 0 0 0 0 0 0.666129 89.4030 ... 0.001901 0.011982 0.010727 0.019760 0.001069 3.751754e-04 0.000694 0.720658 0.976536 0.662783
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35551 2 0 1 0 0 0 0 0 0.782088 76.1000 ... 0.016826 0.030917 0.019391 0.015875 0.002695 1.119519e-03 0.001575 0.839569 0.991228 0.776745
35552 4 0 0 0 1 0 0 0 1.147761 114.6400 ... 0.000678 0.002492 0.001421 0.001660 0.001290 4.580596e-04 0.000832 0.852407 0.987883 0.977172
35553 2 0 1 0 0 0 0 0 0.813101 43.1240 ... 0.014101 0.016411 0.015302 0.016832 0.001289 1.163584e-03 0.000125 0.386544 0.886590 0.076230
35554 3 0 0 1 0 0 0 0 0.719097 23.5620 ... 0.003215 0.007025 0.005133 0.006286 0.000170 1.747118e-07 0.000170 0.207400 0.007175 0.248114
35555 3 0 0 1 0 0 0 0 1.083807 9.1885 ... 0.000841 0.002671 0.003405 0.004654 0.000018 3.528620e-12 0.000018 0.073865 0.767783 0.233401

35556 rows × 81 columns

Generating DeepLC features

[18]:
deeplc_fgen = DeepLCFeatureGenerator(
    lower_score_is_better=False,
    calibration_set_size=0.15,
    spectrum_path=None,
    processes=8,
    deeplc_retrain=False,
)
deeplc_fgen.add_features(psm_list)
INFO:ms2rescore.feature_generators.deeplc:Adding DeepLC-derived features to PSMs.
INFO:ms2rescore.feature_generators.deeplc:Running DeepLC for PSMs from run (1/1): `20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02`...

Generating features from MaxQuant output

While MaxQuant msms.txt files do not contain directly usable rescoring features, it does contain various metrics from spectrum matching from which we can calculate rescoring features. These include features related to the mass error of the annotated MS2 peaks and the summed intensity of the annotated peaks.

The data that is used to calculate these features is stored in the PSM metadata attributes. For instance:

[19]:
pd.DataFrame(list(psm_list["metadata"]))[[
    "Matches", "Intensities", "Mass Deviations [Da]", "Intensity coverage"
]].head()
[19]:
Matches Intensities Mass Deviations [Da] Intensity coverage
0 y1;y2;y3;y4;y5;y6;y7;y8;y9;y1-NH3;y2-NH3;y3-NH... 6420431.2;5259286.7;7595262.2;8731264.7;116766... -9.10020734181671E-05;-0.000929981532976853;-0... 0.450133939639015
1 y1;y2;y4;y6;y7;y8;y9;y10;y11;y12;y13;y14;y2-NH... 357644.7;48235;495609.6;1150605.6;1479450.3;14... -0.000377184504145589;-0.000336244570064537;-0... 0.682793460933943
2 y1;y2;y3;y4;y5;y6;y7;y8;y9;y10;y11;y12;y13;y14... 467523.3;1624841.4;1272199.7;504549;6773740.4;... -0.000650374560791533;0.000150255794096665;0.0... 0.632290650531871
3 y2;y5;y5-H2O;a2;b2;b3 245612.6;40630.3;18142.2;108568;152454.4;155335.3 -0.0009548283943559;0.00245136965645543;0.0023... 0.187275289286221
4 y1;y4;y6;y9;y10;y11;y12;y13;y11-H2O;y1-NH3;a2;... 9529767.6;1651203.2;270327.2;11359473.6;212221... -0.000178087084577783;0.00014824673877456;0.00... 0.0479428579117575
[20]:
maxquant_fgen = MaxQuantFeatureGenerator()
maxquant_fgen.add_features(psm_list)
INFO:ms2rescore.feature_generators.maxquant:Adding MaxQuant-derived features to PSMs.

Assessing individual feature performance before rescoring

[21]:
import plotly.express as px
from ms2rescore.report.charts import (
    calculate_feature_qvalues,
    feature_ecdf_auc_bar,
    fdr_plot,
    ms2pip_correlation,
)
import deeplc.plot

features = pd.DataFrame(list(psm_list["rescoring_features"]))

The following plot shows the number of identified PSM at varying FDR thresholds, where the red line indicates the commonly used 1% FDR threshold.

[22]:
psm_list.calculate_qvalues()  # msms.txt does not contain q-values
fig = fdr_plot(psm_list.to_dataframe(), fdr_thresholds=None, log=False)
fig.show()

We can make a similar plot for each of the rescoring features by simply calculating q-values as if the feature were a search engine score.

[23]:
feature_qvalues, feature_ecdf_auc = calculate_feature_qvalues(
    features=features,
    is_decoy=psm_list["is_decoy"],
)
[24]:
feature_names = {
    "basic": basic_fgen.feature_names,
    "ms2pip": ms2pip_fgen.feature_names,
    "deeplc": deeplc_fgen.feature_names,
    "maxquant": maxquant_fgen.feature_names,
}
feature_generator_map = {name: gen for gen, f_list in feature_names.items() for name in f_list}

feature_qvalues_melt = feature_qvalues.melt(var_name="feature", value_name="q-value")
feature_qvalues_melt["feature_generator"] = feature_qvalues_melt["feature"].map(feature_generator_map)
[25]:
# Plot takes 100MB in saved file...

# fig = px.ecdf(
#     data_frame=feature_qvalues_melt,
#     x="q-value",
#     color="feature_generator",
#     line_dash="feature",
# )
# fig.show()

The larger the area under the curve (AUC) is for each line, the better. We can therefore visualize the same information in a more easy-to-interpret bar chart with the AUCs, which were already calculated by the calculate_feature_qvalues function.

[26]:
feature_ecdf_auc["feature_generator"] = feature_ecdf_auc["feature"].map(feature_generator_map)
[27]:
fig = feature_ecdf_auc_bar(feature_ecdf_auc)
fig.show()

Importantly, a low q-value ECDF AUC does not necessarily mean that a feature is not valuable for rescoring. The more orthogonal a feature is to the other information already used in PSM scoring, the more it will help to confidently identify PSMs. For instance, retention time features are completely independent from the information used in a traditional search engine. When combined with other scoring information, they can provide a substantial boost in sensitivity.

The quality of specific rescoring features can also be assessed in a manner specific to the feature. For instance, we can plot the distribution of the MS²PIP pearson correlations for target PSMs that initially passed the FDR threshold:

[28]:
fig = ms2pip_correlation(
    features=features,
    is_decoy=psm_list["is_decoy"],
    qvalue=psm_list["qvalue"]
)
fig.show()

For DeepLC, we can plot the predicted retention time versus the observed retention time:

[29]:
fig = deeplc.plot.scatter(
    df=features[(~psm_list["is_decoy"]) & (psm_list["qvalue"] <= 0.01)],  # noqa: E712
    predicted_column="predicted_retention_time_best",
    observed_column="observed_retention_time_best",
)
fig.show()

DeepLC also provides a function to plot the current relative mean absolute error (rMAE) of the predicted retention times against a distribution of 460 benchmark datasets. The lower the rMAE, the better.

[30]:
fig = deeplc.plot.distribution_baseline(
    df=features[(~psm_list["is_decoy"]) & (psm_list["qvalue"] <= 0.01)],  # noqa: E712
    predicted_column="predicted_retention_time_best",
    observed_column="observed_retention_time_best",
)
fig.show()

Rescoring

Using Mokapot directly

[31]:
import matplotlib.pyplot as plt
from mokapot import brew
from ms2rescore.rescoring_engines.mokapot import convert_psm_list

%matplotlib inline

In this tutorial, we will use Mokapot as rescoring engine. By default it employs the same methodologies as Percolator, but is fully implemented in Python. First, the PSM list must be converted to a Mokapot LinearPsmDataset.

[32]:
linear_psm_dataset = convert_psm_list(psm_list)

Then, the Mokapot brew function can be called to rescore the PSMs:

[33]:
confidence_results, models = brew(linear_psm_dataset)

Mokapot contains a function to conveniently plot the results:

[34]:
confidence_results.plot_qvalues()
plt.show()
../../_images/tutorials_in-depth-python-api_67_0.png

Or to compare the un-rescored PSMs to the rescored PSMs:

[35]:
linear_psm_dataset.assign_confidence().plot_qvalues(label="Before rescoring")
confidence_results.plot_qvalues(label="After rescoring")
plt.legend()
plt.show()
../../_images/tutorials_in-depth-python-api_69_0.png

Using mokapot directly allows us to quickly generate LinearPsmDataset objects with different feature sets, for instance:

  • Only basic features

  • Basic + MaxQuant-derived features (mimicking a traditional Percolator rescoring run)

  • Basic + MaxQuant-derived + MS²PIP features

  • Basic + MaxQuant-derived + MS²PIP + DeepLC features

[36]:
feature_sets = {
    "basic": feature_names["basic"],
    "basic_mq": feature_names["basic"] + feature_names["maxquant"],
    "basic_mq_ms2pip": feature_names["basic"] + feature_names["maxquant"] + feature_names["ms2pip"],
    "basic_mq_ms2pip_deeplc": feature_names["basic"] + feature_names["maxquant"] + feature_names["ms2pip"] + feature_names["deeplc"],
}
[37]:
linear_psm_dataset.assign_confidence().plot_qvalues(label="Before rescoring")

for feature_set, features in feature_sets.items():
    linear_psm_dataset = convert_psm_list(psm_list, features)
    confidence_results, models = brew(linear_psm_dataset)
    confidence_results.plot_qvalues(label=f"Rescoring with: {feature_set}")

plt.xscale('log')
plt.xlim(1e-4, 1)
plt.legend()
plt.show()
../../_images/tutorials_in-depth-python-api_72_0.png

Using Mokapot within MS²Rescore

Mokapot can also be used for rescoring with the higher-level integration in MS²Rescore. Simply pass a PSMList object with rescoring features added to the rescore function. Doing so will automatically convert the PSMList to a Mokapot LinearPsmDataset, run the brew function, and update the original PSMList with the new scores, q-values and PEPs.

[38]:
from ms2rescore.rescoring_engines import mokapot

mokapot.rescore(psm_list)
[39]:
score_histogram(psm_list.to_dataframe())