Using the Python API
This tutorial shows how to use the MS²Rescore Python API for each step of the rescoring process individually. This is useful if you want to customize rescoring for your own Python workflow or if you want to understand how MS²Rescore works.
Note that the full MS²Rescore workflow is also available from Python with the single function call ms2rescore.rescore().
[1]:
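import logging
import plotly.io
# Show INFO-level log messages while the tutorial runs
logging.basicConfig(level=logging.INFO)
# Render Plotly figures inline in the notebook
plotly.io.renderers.default = "plotly_mimetype+notebook"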
Reading and parsing peptide-spectrum matches
[2]:
from psm_utils.io import read_file
from ms2rescore.report.charts import score_histogram
INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Reading the PSM file
MS²Rescore is fully centered around the use of a psm_utils PSMList. This is a unified data representation of PSMs and their various attributes. Internally, it is simply a list of Pydantic data classes that represent PSMs. With the submodule psm_utils.io, we can read PSMs from a variety of file formats. Here, we will read a PSM file in the MaxQuant msms.txt format.
Importantly, for rescoring, the PSM file must contain all target and decoy PSMs, including PSMs that did not pass the FDR threshold. Most search engines must be specifically configured to return all PSMs without FDR filtering.
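To see why the unfiltered target and decoy PSMs are needed, it helps to look at how a target-decoy q-value is typically estimated. The following is a minimal, generic sketch in plain Python, not the procedure MS²Rescore itself runs; the function name and the toy data are hypothetical. It shows that the estimate at every score threshold depends on counting decoys, so a PSM list that was already filtered on FDR, or stripped of decoys, leaves nothing to estimate from.

```python
from typing import List, Tuple


def target_decoy_qvalues(psms: List[Tuple[float, bool]]) -> List[float]:
    """Return one q-value per PSM, in input order.

    Each PSM is a (score, is_decoy) tuple; higher scores are better.
    """
    # Rank PSMs from best to worst score, remembering their original positions
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)

    fdrs = []
    targets = decoys = 0
    for i in order:
        if psms[i][1]:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR when accepting everything down to this score
        fdrs.append(decoys / max(targets, 1))

    # q-value: the lowest FDR at which this PSM would still be accepted
    qvalues = [0.0] * len(psms)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvalues[order[rank]] = running_min
    return qvalues


# Hypothetical toy data: (score, is_decoy)
toy_psms = [(120.0, False), (95.0, False), (80.0, True), (60.0, False), (40.0, True)]
toy_qvalues = target_decoy_qvalues(toy_psms)
```

In this toy example, the two best-scoring targets get a q-value of 0 because no decoy outscores them, while the target with score 60.0 is penalized by the decoy ranked above it.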
[3]:
# Read the PSMs from the MaxQuant msms.txt file
psm_list = read_file("../../../examples/id/msms.txt", filetype="msms")
# Cast the spectrum IDs to strings
psm_list["spectrum_id"] = [str(spec_id) for spec_id in psm_list["spectrum_id"]]
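The cell above reads and then overwrites a whole "column" of the PSM list by indexing it with an attribute name. As a rough illustration of that idea, here is a toy stand-in written with only the standard library. ToyPSM and ToyPSMList are hypothetical names made up for this sketch; they are not the psm_utils classes and only mimic the access pattern used here.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ToyPSM:
    """Hypothetical, stripped-down stand-in for a single PSM record."""

    peptidoform: str
    spectrum_id: Any
    score: Optional[float] = None
    is_decoy: Optional[bool] = None


class ToyPSMList:
    """Hypothetical stand-in: a list of records with column-style access."""

    def __init__(self, psms: List[ToyPSM]) -> None:
        self.psms = psms

    def __len__(self) -> int:
        return len(self.psms)

    def __getitem__(self, key):
        # A string key returns that attribute for every PSM (a "column")
        if isinstance(key, str):
            return [getattr(psm, key) for psm in self.psms]
        # Any other key (int, slice) indexes the underlying list
        return self.psms[key]

    def __setitem__(self, key: str, values) -> None:
        # Assigning to a string key overwrites that attribute on every PSM
        if len(values) != len(self.psms):
            raise ValueError("Expected one value per PSM")
        for psm, value in zip(self.psms, values):
            setattr(psm, key, value)


toy_list = ToyPSMList(
    [
        ToyPSM("AAAAAAALQAK/2", 4703, score=107.66, is_decoy=False),
        ToyPSM("AAAAAQGGGGGEPR/2", 505, score=22.641, is_decoy=False),
    ]
)
# Same pattern as in the cell above: cast integer IDs to strings
toy_list["spectrum_id"] = [str(spec_id) for spec_id in toy_list["spectrum_id"]]
```

Keeping each PSM as its own record while still allowing column-wise reads and writes makes bulk clean-up steps, such as the string cast above, a one-liner.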
For a quick inspection, we can format the PSM list as a Pandas dataframe and display the first few rows:
[4]:
psm_list.to_dataframe().head()
[4]:
| | peptidoform | spectrum_id | run | collection | spectrum | is_decoy | score | qvalue | pep | precursor_mz | retention_time | ion_mobility | protein_list | rank | source | provenance_data | metadata | rescoring_features |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAAAAAALQAK/2 | 4703 | 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 | None | None | False | 107.660 | None | 0.001517 | 478.77982 | 5.2007 | None | [P36578, H3BM89, H3BU31] | None | msms | {'msms_filename': '..\..\..\examples\id\msms.t... | {'Scan index': '3698', 'Sequence': 'AAAAAAALQA... | {} |
1 | [ac]-AAAAAEQQQFYLLLGNLLSPDNVVR/3 | 13572 | 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 | None | None | False | 107.740 | None | 0.004931 | 915.15197 | 11.8470 | None | [O00410, E7ETV3, E7EQT5, C9JZD8] | None | msms | {'msms_filename': '..\..\..\examples\id\msms.t... | {'Scan index': '11885', 'Sequence': 'AAAAAEQQQ... | {} |
2 | [ac]-AAAAAEQQQFYLLLGNLLSPDNVVRK/3 | 13366 | 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 | None | None | False | 137.890 | None | 0.000493 | 957.85029 | 11.6900 | None | [O00410, E7ETV3, E7EQT5, C9JZD8] | None | msms | {'msms_filename': '..\..\..\examples\id\msms.t... | {'Scan index': '11695', 'Sequence': 'AAAAAEQQQ... | {} |
3 | AAAAAQGGGGGEPR/2 | 505 | 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 | None | None | False | 22.641 | None | 0.142020 | 585.28653 | 0.5178 | None | [E9PJF0, E9PQW4, P27361] | None | msms | {'msms_filename': '..\..\..\examples\id\msms.t... | {'Scan index': '419', 'Sequence': 'AAAAAQGGGGG... | {} |
4 | AAAAAWEEPSSGN[de]GTAR/2 | 6589 | 20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02 | None | None | False | 89.403 | None | 0.046504 | 823.87389 | 6.6105 | None | [Q9P258] | None | msms | {'msms_filename': '..\..\..\examples\id\msms.t... | {'Scan index': '5439', 'Sequence': 'AAAAAWEEPS... | {} |
We can also directly plot the current PSM score distributions:
[5]:
score_histogram(psm_list)
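The figure itself is not reproduced here. As a rough idea of what such a score distribution summarizes, the sketch below bins scores and counts targets and decoys per bin using only the standard library. It is an assumption-laden illustration, not the logic inside score_histogram: the function name binned_counts, the bin width, and the toy scores are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def binned_counts(
    psms: List[Tuple[float, bool]], bin_width: float = 25.0
) -> Dict[float, Dict[str, int]]:
    """Count target and decoy PSMs per score bin, keyed by lower bin edge."""
    counts: Counter = Counter()
    for score, is_decoy in psms:
        lower_edge = (score // bin_width) * bin_width
        counts[(lower_edge, "decoy" if is_decoy else "target")] += 1
    edges = sorted({edge for edge, _ in counts})
    return {
        edge: {"target": counts[(edge, "target")], "decoy": counts[(edge, "decoy")]}
        for edge in edges
    }


# Hypothetical toy data: (score, is_decoy)
toy_scores = [(107.66, False), (137.89, False), (22.641, False), (18.0, True), (30.5, True)]
histogram = binned_counts(toy_scores)

# Print a crude text histogram: one "#" per PSM in each bin
for edge, row in histogram.items():
    print(f"{edge:6.1f}  T {'#' * row['target']:<5} D {'#' * row['decoy']}")
```

In a well-behaved search, decoys are expected to pile up in the low-score bins while confident targets extend into the high-score bins, which is the separation the plot is meant to make visible.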