Configuration
Introduction
MS²Rescore can be configured through the command line interface (CLI), the graphical user interface (GUI), or a JSON/TOML configuration file. The configuration file can be used to set options that are not available in the CLI or GUI, or to set default values for options that are available in the CLI or GUI.
If no configuration file is passed, or some options are not configured, the default values for these settings will be used. Options passed from the CLI and the GUI will override the configuration file. The full configuration is validated against a JSON Schema. A full example configuration file can be found in ms2rescore/package_data/config_default.json. An overview of all options can be found below.
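The precedence described above (default values first, then the configuration file, then CLI or GUI options) can be pictured as a layered merge. The following Python sketch is only an illustration of that idea: the function name `merge_config` and the sample values are hypothetical, merging nested sections key by key is an assumption, and this is not MS²Rescore's actual loading code.

```python
from copy import deepcopy


def merge_config(*layers):
    """Merge configuration layers; later layers override earlier ones.

    Nested dictionaries are merged key by key, other values are replaced.
    """
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_config(merged[key], value)
            else:
                merged[key] = deepcopy(value)
    return merged


# Hypothetical values for illustration only
defaults = {"psm_file_type": "infer", "max_psm_rank_input": 10, "processes": -1}
from_file = {"psm_file": "path/to/psms.tsv", "max_psm_rank_input": 5}
from_cli = {"processes": 4}

# Lowest priority first, highest priority last
config = merge_config(defaults, from_file, from_cli)
print(config)
```

Options that appear in no layer other than the defaults keep their default value, which matches the behavior described above.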
Configuring input files
In the configuration file, input files can be specified as follows:
JSON:
"psm_file": "path/to/psms.tsv",
"psm_file_type": "infer",
"spectrum_path": "path/to/spectra.mgf"
TOML:
psm_file = "path/to/psms.tsv"
psm_file_type = "infer"
spectrum_path = "path/to/spectra.mgf"
See Input files for more information.
Parsing modification labels
MS²Rescore uses the HUPO-PSI standardized ProForma v2 notation to represent modified peptides in a string format. Unfortunately, most PSM file types coming from different proteomics search engines use a custom modification notation.
For example, a MaxQuant Modified sequence would be parsed as follows: _AM(ox)SIVMLSM_ → AM[ox]SIVMLSM. However, the label ox is not a resolvable modification, as it is not present in any of the supported controlled vocabularies. Therefore, ox needs to be mapped to U:Oxidation, where U denotes that the Unimod database is used and Oxidation denotes the official Unimod name.
To correctly parse the various notations to ProForma, ms2rescore requires a modification_mapping configuration that maps each search-engine-specific modification label to a valid ProForma label.
Accepted ProForma modification labels in psm_utils (and by extension in ms2rescore) are, in order of preference:
Type | Long format example | Short format example |
---|---|---|
PSI-MOD accession | MOD:00046 | M:00046 |
PSI-MOD name | MOD:O-phospho-L-serine | M:O-phospho-L-serine |
Unimod accession | UNIMOD:21 | U:21 |
Unimod name | UNIMOD:Phospho | U:Phospho |
Formula | Formula:HO3P | / |
Mass shift | +79.96633052075 | / |
If a modification is not defined in any of the supported controlled vocabularies, preferably provide the formula instead of a mass shift, as the mass shift can always be calculated from the formula, but not vice-versa, and some feature generators (such as DeepLC) require the modification formula.
Formula modification labels can be defined with the Formula: prefix, followed by each atom symbol and its count, denoting which atoms are added or removed by the modification. If no count is provided, it is assumed to be 1. For example, Formula:HO3P is equivalent to Formula:H1O3P1.
For isotopes, prefix the atom symbol with the isotope number and place the entire block (isotope number, atom symbol, and number of atoms) in square brackets. For example, the SILAC 13C(2) 15N(1) label (UNIMOD:2088) would be notated as Formula:C-2[13C2]N-1[15N], meaning that two C atoms are removed, two 13C atoms are added, one N atom is removed and one 15N atom is added.
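To make the formula notation concrete, the Python sketch below reads a Formula: label into signed atom counts. It is a minimal illustration that covers only the notation described here (atom symbols, optional signed counts, and bracketed isotope blocks); the function name `parse_formula_label` is hypothetical and this is not the parser used by psm_utils or ms2rescore.

```python
import re
from collections import Counter

# A token is either an isotope block such as [13C2] or a plain atom such as O3 or C-2
_TOKEN = re.compile(
    r"\[(?P<isotope>\d+)(?P<iso_atom>[A-Z][a-z]?)(?P<iso_count>-?\d+)?\]"
    r"|(?P<atom>[A-Z][a-z]?)(?P<count>-?\d+)?"
)


def parse_formula_label(label):
    """Return a dict mapping each atom (or isotope) symbol to its signed count."""
    prefix = "Formula:"
    if not label.startswith(prefix):
        raise ValueError(f"Not a formula label: {label!r}")
    formula = label[len(prefix):]

    counts = Counter()
    position = 0
    for match in _TOKEN.finditer(formula):
        # Reject any characters that were skipped between two tokens
        if match.start() != position:
            break
        position = match.end()
        if match.group("isotope"):
            symbol = match.group("isotope") + match.group("iso_atom")
            count = match.group("iso_count")
        else:
            symbol = match.group("atom")
            count = match.group("count")
        # A missing count means a single atom
        counts[symbol] += int(count) if count is not None else 1
    if position != len(formula):
        raise ValueError(f"Cannot parse formula at position {position}: {formula!r}")
    return dict(counts)


print(parse_formula_label("Formula:HO3P"))
print(parse_formula_label("Formula:C-2[13C2]N-1[15N]"))
```

Negative counts denote removed atoms and positive counts denote added atoms, following the description above.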
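An example of the modification_mapping, covering only the ox label discussed above, could be:
JSON:
"modification_mapping": {
  "ox": "U:Oxidation"
}
TOML:
[ms2rescore.modification_mapping]
"ox" = "U:Oxidation"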
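As a rough illustration of what such a mapping does, the Python sketch below replaces search engine labels inside the square brackets of an already parsed sequence. The function name `map_modification_labels` is hypothetical, the mapping contains only the ox label discussed above, and the actual replacement logic in psm_utils and ms2rescore may differ.

```python
import re

# Hypothetical mapping, based on the ox example discussed above
modification_mapping = {"ox": "U:Oxidation"}


def map_modification_labels(peptidoform, mapping):
    """Replace each bracketed label that has an entry in the mapping."""

    def replace(match):
        label = match.group(1)
        # Labels without a mapping entry are left untouched
        return "[" + mapping.get(label, label) + "]"

    return re.sub(r"\[([^\]]+)\]", replace, peptidoform)


print(map_modification_labels("AM[ox]SIVMLSM", modification_mapping))
```

Labels that are already valid ProForma labels do not need an entry, since unmapped labels pass through unchanged in this sketch.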
Adding fixed modifications
Some search engines, such as MaxQuant, do not report fixed modifications that were part of the search. To correctly rescore PSMs, fixed modifications that are not reported in the PSM file must be configured separately. For instance:
JSON:
"fixed_modifications": {
  "U:Carbamidomethyl": ["C"]
}
TOML:
[ms2rescore.fixed_modifications]
"U:Carbamidomethyl" = ["C"]
Fixed terminal modifications can be added by using the special labels N-term and C-term.
For example, to additionally add TMT6plex to the N-terminus and lysine residues, the following configuration can be used:
JSON:
"fixed_modifications": {
  "U:Carbamidomethyl": ["C"],
  "U:TMT6plex": ["N-term", "K"]
}
TOML:
[ms2rescore.fixed_modifications]
"U:Carbamidomethyl" = ["C"]
"U:TMT6plex" = ["N-term", "K"]
Caution
Most search engines DO return fixed modifications as part of the modified peptide sequences. In these cases, they must NOT be added to the fixed_modifications configuration.
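To show the effect of a fixed_modifications configuration, the Python sketch below annotates an unmodified peptide sequence with the configured labels, writing the N-terminal label in ProForma style as `[label]-SEQUENCE`. The function name `annotate_fixed_modifications` is hypothetical, the sketch assumes the input sequence carries no modifications yet, and it is not the implementation used by psm_utils or ms2rescore.

```python
def annotate_fixed_modifications(sequence, fixed_modifications):
    """Annotate an unmodified peptide sequence with fixed modifications.

    fixed_modifications maps a modification label to a list of targets,
    where a target is an amino acid letter, "N-term", or "C-term".
    """
    residue_labels = {}
    n_term = ""
    c_term = ""
    for label, targets in fixed_modifications.items():
        for target in targets:
            if target == "N-term":
                n_term = f"[{label}]-"
            elif target == "C-term":
                c_term = f"-[{label}]"
            else:
                residue_labels[target] = label

    # Append the label after every residue that has a fixed modification
    residues = "".join(
        f"{aa}[{residue_labels[aa]}]" if aa in residue_labels else aa
        for aa in sequence
    )
    return n_term + residues + c_term


fixed_modifications = {
    "U:Carbamidomethyl": ["C"],
    "U:TMT6plex": ["N-term", "K"],
}
print(annotate_fixed_modifications("ACDK", fixed_modifications))
```

Applying this to sequences that already contain the modification would label the same residue twice, which is why fixed modifications that the search engine already reports must not be configured again.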
Mapping PSMs to spectra
To function correctly, MS²Rescore must link the search engine PSMs to the original spectra. As spectrum file converters and search engines often modify spectrum titles, two options are available to map PSMs to spectra: spectrum_id_pattern and psm_id_pattern. Through these two options, regular expression patterns can be defined that extract the same spectrum identifier from the spectrum file and from the PSM file, respectively.
For example, if the spectrum file contains the following identifier in the MGF title field:
mzspec=20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.raw: controllerType=0 controllerNumber=1 scan=2
and the PSM file contains the following identifier in the spectrum_id field:
20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.raw.2.2
then the following patterns can be used to extract 2 from both identifiers:
JSON:
"spectrum_id_pattern": ".*scan=(\\d+)$",
"psm_id_pattern": ".*\\..*\\.(.*)"
TOML:
spectrum_id_pattern = '.*scan=(\d+)$'
psm_id_pattern = '.*\..*\.(.*)'
Both options should match the entire string and require a single capture group (denoted by the parentheses) to mark the section of the match that should be extracted.
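The two patterns can be checked outside MS²Rescore with Python's `re` module. The sketch below applies them to the identifiers from the example; the helper name `extract_id` is hypothetical, and using `re.fullmatch` is an assumption that follows from the requirement that the pattern matches the entire string.

```python
import re

spectrum_id_pattern = r".*scan=(\d+)$"
psm_id_pattern = r".*\..*\.(.*)"

spectrum_title = (
    "mzspec=20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.raw: "
    "controllerType=0 controllerNumber=1 scan=2"
)
psm_spectrum_id = "20161213_NGHF_DBJ_SA_Exp3A_HeLa_1ug_7min_15000_02.raw.2.2"


def extract_id(pattern, identifier):
    """Return the first capture group of a pattern that matches the full identifier."""
    match = re.fullmatch(pattern, identifier)
    if match is None:
        raise ValueError(f"Pattern {pattern!r} does not match {identifier!r}")
    return match.group(1)


spectrum_key = extract_id(spectrum_id_pattern, spectrum_title)
psm_key = extract_id(psm_id_pattern, psm_spectrum_id)
print(spectrum_key, psm_key)  # both identifiers reduce to "2"
```

If the two extracted values differ for a PSM and its spectrum, the patterns need to be adjusted until both sides yield the same identifier.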
Warning
Regular expression patterns often contain special characters that need to be escaped. For example, the \ should be escaped with an additional \ in JSON, as is shown above. In TOML files, the full regex can be wrapped in single quotes to avoid escaping.
Note
Find out more about regular expression patterns and try them on regex101.com. You can try out the above examples at https://regex101.com/r/VhBJRM/1 and https://regex101.com/r/JkT79a/1.
Selecting decoy PSMs
Usually, PSMs are already marked as target or decoy in the PSM file. When this is not the case, it can often be derived from the protein name. For example, if the protein name contains the prefix DECOY_, the PSM is a decoy PSM. The following option can be used to define a regular expression pattern that extracts the decoy status from the protein name:
JSON:
"id_decoy_pattern": "DECOY_"
TOML:
id_decoy_pattern = "DECOY_"
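The effect of such a pattern can be previewed with a few lines of Python. The sketch below assumes that the pattern is searched anywhere in the protein name; the protein names are hypothetical, and how MS²Rescore combines the result for PSMs with several protein names is not covered here.

```python
import re

id_decoy_pattern = re.compile("DECOY_")

# Hypothetical protein names for illustration only
protein_names = ["DECOY_protein_A", "protein_A"]

# A protein name is flagged as decoy if the pattern is found anywhere in it
is_decoy = [id_decoy_pattern.search(name) is not None for name in protein_names]
print(is_decoy)  # [True, False]
```

Under this assumption, a pattern such as `^DECOY_` would restrict the match to names that start with the prefix.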
Multi-rank rescoring
Some search engines, such as MaxQuant, report multiple candidate PSMs for the same spectrum. MS²Rescore can rescore multiple candidate PSMs per spectrum. This allows for lower-ranking candidate PSMs to become the top-ranked PSM after rescoring. This behavior can be controlled with the max_psm_rank_input option.
To ensure a correct FDR control after rescoring, MS²Rescore filters out lower-ranking PSMs before final FDR calculation and writing the output files. To allow for lower-ranking PSMs to be included in the final output - for instance, to consider chimeric spectra - the max_psm_rank_output option can be used.
For example, to rescore the top 5 PSMs per spectrum and output the best PSM after rescoring, the following configuration can be used:
JSON:
"max_psm_rank_input": 5,
"max_psm_rank_output": 1
TOML:
max_psm_rank_input = 5
max_psm_rank_output = 1
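The interplay of the two rank options can be sketched in Python as a filter before rescoring and a second filter after re-ranking. Everything in this sketch is hypothetical (the PSM tuples, the scores, and the function name `filter_and_rerank`), the rescored scores are supplied up front as a simplification, and higher scores are assumed to be better.

```python
from collections import defaultdict

# Hypothetical PSMs: (spectrum_id, peptide, search engine rank, rescored score)
psms = [
    ("scan=1", "ACDEK", 1, 0.40),
    ("scan=1", "FGHIK", 2, 0.90),
    ("scan=1", "LMNPR", 6, 0.95),
    ("scan=2", "QSTVK", 1, 0.70),
]


def filter_and_rerank(psms, max_rank_input, max_rank_output):
    """Keep candidates up to max_rank_input, re-rank them, keep up to max_rank_output."""
    # Step 1: only candidates up to max_rank_input enter rescoring
    by_spectrum = defaultdict(list)
    for spectrum_id, peptide, rank, score in psms:
        if rank <= max_rank_input:
            by_spectrum[spectrum_id].append((peptide, score))

    # Step 2: re-rank per spectrum by rescored score and apply max_rank_output
    kept = []
    for spectrum_id, candidates in by_spectrum.items():
        candidates.sort(key=lambda item: item[1], reverse=True)
        for new_rank, (peptide, _score) in enumerate(candidates, start=1):
            if new_rank <= max_rank_output:
                kept.append((spectrum_id, peptide, new_rank))
    return kept


result = filter_and_rerank(psms, max_rank_input=5, max_rank_output=1)
print(result)
```

In this toy data, the rank 6 candidate never enters rescoring, and the original rank 2 candidate for scan=1 becomes the reported PSM because it scores higher than the original rank 1 candidate.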
Configuring rescoring engines
MS²Rescore supports multiple rescoring engines, such as Mokapot and Percolator. The rescoring engine can be selected and configured with the rescoring_engine option. For example, to use Mokapot with a custom train_fdr of 0.1%, the following configuration can be used:
JSON:
"rescoring_engine": {
  "mokapot": {
    "train_fdr": 0.001
  }
}
TOML:
[ms2rescore.rescoring_engine.mokapot]
train_fdr = 0.001
All options for the rescoring engines can be found in the ms2rescore.rescoring_engines section.
All configuration options
MS²Rescore configuration
Properties
- ms2rescore (object): General MS²Rescore settings. Cannot contain additional properties.
  - feature_generators (object): Feature generators and their configurations. Default: {"basic": {}, "ms2pip": {"model": "HCD", "ms2_tolerance": 0.02}, "deeplc": {}, "maxquant": {}}.
    - .*: Refer to #/definitions/feature_generator.
    - basic: Refer to #/definitions/basic.
    - ms2pip: Refer to #/definitions/ms2pip.
    - deeplc: Refer to #/definitions/deeplc.
    - maxquant: Refer to #/definitions/maxquant.
    - ionmob: Refer to #/definitions/ionmob.
    - im2deep: Refer to #/definitions/im2deep.
  - rescoring_engine (object): Rescoring engine to use and its configuration. Leave empty to skip rescoring and write features to file. Default: {"mokapot": {}}.
    - .*: Refer to #/definitions/rescoring_engine.
    - percolator: Refer to #/definitions/percolator.
    - mokapot: Refer to #/definitions/mokapot.
  - config_file: Path to configuration file. One of: string, null.
  - psm_file: Path to file with peptide-spectrum matches. One of: string, null, array (items: string).
  - psm_file_type (string): PSM file type. By default inferred from file extension. Default: "infer".
  - psm_reader_kwargs (object): Keyword arguments passed to the PSM reader. Default: {}.
  - spectrum_path: Path to spectrum file or directory with spectrum files. One of: string, null.
  - output_path: Path and root name for output files. One of: string, null.
  - log_level (string): Logging level. Must be one of: ["debug", "info", "warning", "error", "critical"].
  - id_decoy_pattern: Regex pattern used to identify the decoy PSMs in identification file. Default: null. One of: string, null.
  - spectrum_id_pattern: Regex pattern to extract index or scan number from spectrum file. Requires at least one capturing group. Default: "(.*)". One of: string, null.
  - psm_id_pattern: Regex pattern to extract index or scan number from PSM file. Requires at least one capturing group. Default: "(.*)". One of: string, null.
  - psm_id_rt_pattern: Regex pattern to extract retention time from PSM identifier. Requires at least one capturing group. Default: null. One of: string, null.
  - psm_id_im_pattern: Regex pattern to extract ion mobility from PSM identifier. Requires at least one capturing group. Default: null. One of: string, null.
  - lower_score_is_better (boolean): Bool indicating if lower score is better. Default: false.
  - max_psm_rank_input (number): Maximum rank of PSMs to use as input for rescoring. Minimum: 1. Default: 10.
  - max_psm_rank_output (number): Maximum rank of PSMs to return after rescoring, before final FDR calculation. Minimum: 1. Default: 1.
  - modification_mapping (object): Mapping of modification labels to each replacement label. Default: {}.
  - fixed_modifications (object): Mapping of amino acids with fixed modifications to the modification name. Can contain additional properties. Default: {}.
  - processes (number): Number of parallel processes to use; -1 for all available. Minimum: -1. Default: -1.
  - rename_to_usi (boolean): Convert spectrum IDs to their universal spectrum identifier.
  - fasta_file: Path to FASTA file with protein sequences to use for protein inference. One of: string, null.
  - write_report (boolean): Write an HTML report with various QC metrics and charts. Default: false.
  - profile (boolean): Write a txt report using cProfile for profiling. Default: false.
Definitions
- feature_generator (object): Feature generator configuration. Can contain additional properties.
- rescoring_engine (object): Rescoring engine configuration. Can contain additional properties.
- basic (object): Basic feature generator configuration. Can contain additional properties. Refer to #/definitions/feature_generator.
- ms2pip (object): MS²PIP feature generator configuration. Can contain additional properties. Refer to #/definitions/feature_generator.
  - model (string): MS²PIP model to use (see MS²PIP documentation). Default: "HCD".
  - ms2_tolerance (number): MS2 error tolerance in Da. Minimum: 0. Default: 0.02.
- deeplc (object): DeepLC feature generator configuration. Can contain additional properties. Refer to #/definitions/feature_generator.
  - calibration_set_size: Calibration set size. Default: 0.15. One of: integer, number.
- maxquant (object): MaxQuant feature generator configuration. Can contain additional properties. Refer to #/definitions/feature_generator.
- ionmob (object): Ion mobility feature generator configuration using Ionmob. Can contain additional properties. Refer to #/definitions/feature_generator.
  - ionmob_model (string): Path to Ionmob model directory. Default: "GRUPredictor".
  - reference_dataset (string): Path to Ionmob reference dataset file. Default: "Meier_unimod.parquet".
  - tokenizer (string): Path to tokenizer json file. Default: "tokenizer.json".
- im2deep (object): Ion mobility feature generator configuration using IM2Deep. Can contain additional properties. Refer to #/definitions/feature_generator.
  - reference_dataset (string): Path to IM2Deep reference dataset file. Default: "Meier_unimod.parquet".
- mokapot (object): Mokapot rescoring engine configuration. Additional properties are passed to the Mokapot brew function. Can contain additional properties. Refer to #/definitions/rescoring_engine.
  - train_fdr (number): FDR threshold for training Mokapot. Minimum: 0. Maximum: 1. Default: 0.01.
  - write_weights (boolean): Write Mokapot weights to a text file. Default: false.
  - write_txt (boolean): Write Mokapot results to a text file. Default: false.
  - write_flashlfq (boolean): Write Mokapot results to a FlashLFQ-compatible file. Default: false.
- percolator (object): Percolator rescoring engine configuration. Can contain additional properties. Refer to #/definitions/rescoring_engine.
  - init-weights: Weights file for scoring function. Default: false. One of: string, null.