1 OVERVIEW

artMS is a Bioconductor package that provides a set of tools for the analysis and integration of large-scale proteomics (mass-spectrometry-based) datasets obtained using the popular proteomics software MaxQuant.

artMS also performs basic quality control and relative quantification for metabolomics datasets obtained using the alignment table generated by MarkerView.

The functions available in artMS can be grouped into the following categories:

  • Multiple quality control (QC) functions.
  • Relative quantification using MSstats.
  • Downstream analysis and integration of quantifications (enrichment, clustering, PCA, summary plots, etc.).
  • Generation of input files for other tools, including SAINTq, SAINTexpress, Photon, and Phosfate.

1.1 How to install

Before you begin, ensure that your system is running R version >= 3.5; otherwise the installation of artMS won’t work. You can check the R version running on your system by executing the function getRversion().

If the outcome is >= 3.5.0, congratulations! You can move forward. If it is not, you need to install the latest version of R on your system.
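
As a minimal sketch, the version check described above can be scripted so that it reports the result explicitly. The names min_version and r_ok are illustrative only and are not part of artMS:

```r
# Minimum R version stated in this section
min_version <- "3.5.0"

# getRversion() returns a version object that can be compared to a string
r_ok <- getRversion() >= min_version

if (r_ok) {
  message("R ", as.character(getRversion()), " meets the artMS requirement")
} else {
  message("R ", as.character(getRversion()), " is older than ", min_version,
          "; please install a newer version of R")
}
```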

There are two options to install artMS:

  • Official Bioconductor releases (recommended)

artMS is available in Bioconductor. Instructions to install the package are available here.

(Why Bioconductor? Here you can find a nice summary of many good reasons).

  • Development version from GitHub

(Warning: not stable, but it has the latest changes)

Assuming that you have R (>= 3.5) running on your system, follow these steps:

install.packages("devtools")
library(devtools)
install_github("biodavidjm/artMS")

Once installed, the package can be loaded and attached to your current workspace as follows:

library(artMS)

1.2 Input files

artMS performs the different analyses taking as input the following files:

  • evidence.txt: output of the quantitative proteomics software package MaxQuant.
  • keys.txt: tab-delimited file generated by the user describing the experimental design.
  • contrast.txt: tab-delimited file generated by the user with the comparisons between conditions to be quantified.

Check below to find out more about generating the input files.

1.3 Configuration file

artmsQuantification() requires a large number of arguments, especially those related to the statistical package MSstats. To facilitate the task of providing all those arguments, artmsQuantification() takes a config file (in yaml format) that customizes the parameters for quantification (using MSstats) and other operations, including QC analyses, charts, and annotations.

A configuration file template can be generated by running artmsWriteConfigYamlFile().

Check below to learn the details of the configuration file.

1.4 Basic workflows

1.4.1 Proteomics

  • Generate the input files: Check the input files section for details

  • Quality Control: if you are interested in performing only quality control analysis, run the following functions:

    • artmsQualityControlEvidenceBasic(): basic QC based on the evidence.txt file
    • artmsQualityControlEvidenceExtended(): extended QC based on the evidence.txt file
    • artmsQualityControlSummaryExtended(): extended QC based on the summary.txt file
  • Relative Quantification: fill in the configuration file and run the function artmsQuantification()

  • Analysis of Quantifications: perform annotations, clustering, PCA, and enrichment analysis by running the function artmsAnalysisQuantifications()

  • Miscellaneous functions: Check below to discover more useful functions provided by the artMS package.

1.4.2 Metabolomics

artMS also enables the relative quantification of untargeted polar metabolites using the alignment table generated by MarkerView. This means that the metabolites do not need to have an ID, as the m/z and retention time will be used as identifiers. Typical workflow:

  • Run QC on the metabolomics dataset: artmsQualityControlMetabolomics()

  • Relative quantification: artmsQuantification() (notice that a few options must be changed in the config file before running the function)

Please keep in mind that most of the functions won’t work for metabolomics data due to annotation issues (protein/gene ids are the primary ids for most of the functions). Check the metabolomics section to find out more.

2 REQUIRED INPUT FILES

2.1 Input files

Three basic (tab-delimited) files are required to perform the full set of operations:

2.1.1 evidence.txt

The output of the quantitative proteomics software package MaxQuant. It combines all the information about the identified peptides.

2.1.2 keys.txt

Tab-delimited file generated by the user. It summarizes the experimental design of the evidence file. artMS merges the keys.txt and evidence.txt files by the “RawFile” column. Each RawFile corresponds to one unique experimental unit (technical replicate / biological replicate / Condition / Run).
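
The effect of merging by “RawFile” can be sketched with base R on two toy tables. This is only an illustration under stated assumptions: the peptide sequences and intensities are invented, a real evidence.txt has many more columns, and the way artMS performs the merge internally may differ.

```r
# Toy evidence table: one row per identified feature (invented values)
evidence <- data.frame(
  RawFile   = c("qx006145", "qx006145", "qx006151"),
  Sequence  = c("PEPTIDEK", "SAMPLER", "SAMPLER"),
  Intensity = c(1.2e6, 3.4e5, 2.8e5),
  stringsAsFactors = FALSE
)

# Toy keys table: one row per raw file
keys <- data.frame(
  RawFile          = c("qx006145", "qx006151"),
  IsotopeLabelType = "L",
  Condition        = c("Cal_33", "HSC6"),
  BioReplicate     = c("Cal_33-1", "HSC6-2"),
  Run              = c(1, 6),
  stringsAsFactors = FALSE
)

# Join on RawFile: every feature inherits the design of its raw file
merged <- merge(evidence, keys, by = "RawFile")

# Raw files present in the evidence but missing from the keys
unmatched <- setdiff(evidence$RawFile, keys$RawFile)
```

An empty unmatched vector means that every raw file in the evidence table is described in the keys table.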

For any basic label-free proteomics experiment, the keys file must contain the following columns and rules:

  • RawFile: The name of the RAW file from which the mass spectral data were derived.
  • IsotopeLabelType: 'L' for label free experiments ('H' will be used for SILAC experiments, see below)
  • Condition: The conditions names must follow these rules:
    • Use only letters (A - Z, both uppercase and lowercase) and numbers (0 - 9). The only special character allowed is underscore (_).
    • Very important: A condition name cannot begin with a number (R limitation).
  • BioReplicate: biological replicate number. It is based on the condition name: use the corresponding Condition name as prefix and add as suffix a dash (-) plus the biological replicate number. For example, if condition H1N1_06H has two biological replicates, name them H1N1_06H-1 and H1N1_06H-2.
  • Run: a unique number for each MS run (from 1 to the total number of raw files). It is especially useful when there are technical replicates. SILAC experiments are a special case, since the H and L labels are run simultaneously (see below to find out more).

Example of keys file: check the artMS data object artms_data_ph_keys

RawFile IsotopeLabelType Condition BioReplicate Run
qx006145 L Cal_33 Cal_33-1 1
qx006148 L Cal_33 Cal_33-4 4
qx006151 L HSC6 HSC6-2 6
qx006152 L HSC6 HSC6-3 7

Tip: it is recommended to use Microsoft Excel (OpenOffice Calc or similar) to generate the keys file. Do not forget to choose the format Tab Delimited Text (.txt) when saving the file (use the “Save As” option).
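
As an alternative to a spreadsheet, the keys file can also be built and checked in R. The following is a minimal sketch: check_keys() is a hypothetical helper written for this example (it is not an artMS function), and a temporary folder stands in for the project folder.

```r
# Keys table mirroring the example above
keys <- data.frame(
  RawFile          = c("qx006145", "qx006148", "qx006151", "qx006152"),
  IsotopeLabelType = "L",
  Condition        = c("Cal_33", "Cal_33", "HSC6", "HSC6"),
  BioReplicate     = c("Cal_33-1", "Cal_33-4", "HSC6-2", "HSC6-3"),
  Run              = c(1, 4, 6, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one TRUE/FALSE per row of the keys table
check_keys <- function(keys) {
  required <- c("RawFile", "IsotopeLabelType", "Condition",
                "BioReplicate", "Run")
  stopifnot(all(required %in% colnames(keys)))
  cond   <- as.character(keys$Condition)
  biorep <- as.character(keys$BioReplicate)
  # Only letters, digits, and underscore; no leading digit
  ok_condition <- grepl("^[A-Za-z0-9_]+$", cond) & !grepl("^[0-9]", cond)
  # BioReplicate must be "<Condition>-<replicate number>"
  ok_prefix <- startsWith(biorep, paste0(cond, "-"))
  ok_suffix <- grepl("^[0-9]+$", substring(biorep, nchar(cond) + 2))
  ok_condition & ok_prefix & ok_suffix
}

keys_ok <- check_keys(keys)

# Save as tab-delimited text
keys_file <- file.path(tempdir(), "keys.txt")
write.table(keys, keys_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

A FALSE entry in keys_ok points to a row that breaks one of the naming rules listed above.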

2.1.3 contrast.txt

The comparisons between conditions that the user wants to quantify.

  • Example #1: the comparison for the keys described above would be:
HSC6-Cal_33
  • Example #2: to quantify changes in protein abundance in two drug-treated conditions (WT_DRUG_A and WT_DRUG_B) relative to wild type (WT_A549), and also between DRUG_A and DRUG_B, the contrast file would look like this:
WT_DRUG_A-WT_A549
WT_DRUG_B-WT_A549
WT_DRUG_A-WT_DRUG_B

Requirements:

  • The two conditions to be compared must be separated by a dash symbol (-), and only one dash symbol is allowed, i.e., only one comparison per line.

As a result of the quantification, a protein that is more abundant in the condition on the left (numerator) gets a positive log2FC, whereas a protein that is more abundant in the condition on the right (denominator) gets a negative log2FC.
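
The sign convention can be illustrated with a toy calculation. The abundance values below are invented, and the plain ratio is only a simplification: MSstats estimates log2 fold changes from a statistical model rather than from a ratio of two numbers.

```r
# Contrast "HSC6-Cal_33": HSC6 is the numerator, Cal_33 the denominator
abundance_HSC6   <- 8  # hypothetical abundance of a protein in HSC6
abundance_Cal_33 <- 2  # hypothetical abundance of the same protein in Cal_33

# Positive value: the protein is more abundant in HSC6 (left term)
log2fc <- log2(abundance_HSC6 / abundance_Cal_33)

# Swapping the two conditions (contrast "Cal_33-HSC6") flips the sign
log2fc_reversed <- log2(abundance_Cal_33 / abundance_HSC6)
```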

Example of wrong comparisons

Only condition names are allowed. Individual biological replicates cannot be compared. For example, this is wrong:

HSC6-Cal_33-1
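
The requirements above can be checked before running the quantification. The sketch below uses a hypothetical helper, check_contrasts(), which is not part of artMS; it only tests that each line contains exactly one dash and that both terms are known condition names.

```r
# Condition names as they would appear in the keys file
conditions <- c("WT_A549", "WT_DRUG_A", "WT_DRUG_B")

# One comparison per line, as in Example #2
contrasts <- c("WT_DRUG_A-WT_A549",
               "WT_DRUG_B-WT_A549",
               "WT_DRUG_A-WT_DRUG_B")

# Hypothetical helper: returns one TRUE/FALSE per comparison
check_contrasts <- function(contrasts, conditions) {
  parts <- strsplit(contrasts, "-", fixed = TRUE)
  vapply(parts,
         function(p) length(p) == 2 && all(p %in% conditions),
         logical(1))
}

contrasts_ok <- check_contrasts(contrasts, conditions)

# A biological replicate on the right-hand side is rejected
wrong_ok <- check_contrasts("HSC6-Cal_33-1", c("HSC6", "Cal_33"))

# Write the contrast file, one comparison per line
contrast_file <- file.path(tempdir(), "contrast.txt")
writeLines(contrasts, contrast_file)
```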

2.2 The artMS configuration file

The configuration file (in yaml format) contains a variety of options available for the QC, quantification, and annotations performed by artMS.

To generate a sample configuration file, go to the project folder (setwd("/path/to/your/working/folder/")) and execute:

library(artMS)
artmsWriteConfigYamlFile(config_file_name = "config.yaml", 
                         verbose = FALSE)

Open the config.yaml file with your favorite editor (RStudio for example). Although it might look complex, the default options work very well.

The configuration (yaml) file contains the following sections:

2.2.1 Section: files

files :
  evidence : /path/to/the/evidence.txt
  keys : /path/to/the/keys.txt
  contrasts : /path/to/the/contrast.txt
  output : /path/to/the/results_folder/ph-results.txt

The file path/name of each required file. It is recommended to create a new folder in your project folder (for example, results_folder). The results file name (e.g. -results.txt) will be used as a prefix for the several files (txt and pdf) that will be generated.


2.2.2 Section: qc

qc:
  basic: 1 # 1 = yes; 0 = no
  extended: 1 # 1 = yes; 0 = no

Select whether to perform the ‘basic’ and/or the ‘extended’ quality control. Read below to find out more about the details of each type of analysis.

2.2.3 Section: data

data:
  enabled : 1 # 1 = yes; 0 = no
  fractions: 
    enabled : 0 # 1 for protein fractionation
  silac: 
    enabled : 0 # 1 for SILAC experiments
  filters: 
    enabled : 1
    contaminants : 1
    protein_groups : remove # remove, keep
    modifications : AB # PH, UB, AC, AB, APMS
  sample_plots : 1 # correlation plots

Let’s break down the data section:

  • enabled : 1: to pre-process the data provided in the files section. 0: won’t process the data (and a pre-generated MSstats file will be expected)

  • fractions: Multiple fractionation or separation methods are often combined in proteomics to improve signal-to-noise and proteome coverage and to reduce interference between peptides in quantitative proteomics.
    • enabled : 1 for fractionation dataset. See Special case: Protein Fractionation below for details
    • enabled : 0 no fractions
  • silac:
    • enabled : 1: select it if the files belong to a SILAC experiment. See Special case: SILAC below for details
    • enabled : 0: no SILAC experiment (default)
  • filters:
    • enabled : 1 Enables filtering (this section)
    • contaminants : 1 Removes contaminants (CON__ and REV__ labeled by MaxQuant)
    • protein_groups : remove Choose whether to remove or keep protein groups
    • modifications : AB Type of proteomics experiment: PH, UB, or AC for posttranslational modifications; AB or APMS otherwise.
  • sample_plots
    • 1 Generate correlation plots
    • 0 otherwise

2.2.4 Section: MSstats

msstats :
  enabled : 1
  msstats_input : # `-mss.txt` file or blank (default)
  profilePlots : none 
  normalization_method : equalizeMedians
  normalization_reference :  # blank (default) if equalizeMedians
  summaryMethod : TMP 
  censoredInt : NA  
  cutoffCensored : minFeature  
  MBimpute : 1 
  feature_subset: all

Let’s break it down:

  • enabled : Choose 1 to run MSstats, 0 otherwise.
  • msstats_input : leave it blank if MSstats will be run (previous enabled : 1). But if MSstats was already run and the evidence-mss.txt file is available, then choose enabled : 0 and provide here the evidence-mss.txt file path/name
  • profilePlots : Choose one of the following options:
    • before: plot before normalization
    • after: plot after normalization
    • before-after: recommended, although computationally expensive
    • none: no normalization plots
  • normalization_method : available options:
    • equalizeMedians
    • quantile
    • 0: no normalization (not recommended)
    • globalStandards: if selected, specify the reference protein in normalization_reference (next)
  • normalization_reference : UniProt id if globalStandards is chosen as the normalization_method (above)
  • summaryMethod : “TMP” (default) means Tukey’s median polish, which is a robust estimation method. “linear” uses a linear mixed model. “logOfSum” computes log2 (sum of intensities) per run.
  • censoredInt :
    • NA (default) Missing values are censored or at random. ‘NA’ assumes that all ‘NA’s in ’Intensity’ column are censored.
    • 0 uses zero intensities as censored intensity. In this case, NA intensities are missing at random. The output from Skyline should use 0. Null assumes that all NA intensities are randomly missing.
  • cutoffCensored :
    • minFeature Cutoff value for censoring. Only with censoredInt : NA or 0. Default is ‘minFeature’, which uses minimum value for each feature.
    • minFeatureNRun uses the smallest between minimum value of corresponding feature and minimum value of corresponding run.
    • minRun uses minimum value for each run.
  • MBimpute :
    • TRUE (default): only for summaryMethod="TMP" and censoredInt='NA' or 0. Imputes ‘NA’ or ‘0’ (depending on the censoredInt option) using an accelerated failure model.
    • FALSE uses the values assigned by cutoffCensored.
  • feature_subset :
    • all : default
    • highQuality : this option seems to be buggy right now

Check MSstats documentation to find out more about every option.
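
Because several of these options only accept a fixed set of values, it can help to check them before launching a long quantification. The sketch below is an assumption-laden illustration: the msstats section is written by hand as a named R list (instead of being read from config.yaml), and check_options() is a hypothetical helper that is not part of artMS.

```r
# The msstats section as a named list (values from the template above)
msstats_cfg <- list(
  profilePlots         = "none",
  normalization_method = "equalizeMedians",
  summaryMethod        = "TMP",
  cutoffCensored       = "minFeature",
  feature_subset       = "all"
)

# Values listed in this section for each option
allowed <- list(
  profilePlots         = c("before", "after", "before-after", "none"),
  normalization_method = c("equalizeMedians", "quantile", "0",
                           "globalStandards"),
  summaryMethod        = c("TMP", "linear", "logOfSum"),
  cutoffCensored       = c("minFeature", "minFeatureNRun", "minRun"),
  feature_subset       = c("all", "highQuality")
)

# Hypothetical helper: TRUE for each option set to a listed value
check_options <- function(cfg, allowed) {
  vapply(names(allowed), function(opt) {
    value <- cfg[[opt]]
    !is.null(value) && as.character(value) %in% allowed[[opt]]
  }, logical(1))
}

options_ok <- check_options(msstats_cfg, allowed)
```

Any FALSE entry in options_ok names an option whose value is not among those described above.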


2.2.5 Section: output_extras

output_extras :
  enabled : 1 # if 0, won't process anything on this section
  annotate :  
    enabled: 1 
    species : HUMAN
  plots:
    volcano: 1
    heatmap: 1
    LFC : -1.5 1.5 # Range of minimal log2fc
    FDR : 0.05
    heatmap_cluster_cols : 0
    heatmap_display : log2FC # log2FC or pvalue

Extra actions to perform based on the MSstats results, including annotations and plots (heatmaps and volcano plots). Let’s break it down:

  • enabled : 1 (default) enables this section, 0 turns it off
  • annotate :
    • enabled: 1 (default), will generate a -results-annotated.txt file that includes Gene and Protein.Name (only for supported species)
    • species: The supported species are: HUMAN, MOUSE, ANOPHELES, ARABIDOPSIS, BOVINE, WORM, CANINE, FLY, ZEBRAFISH, ECOLI_STRAIN_K12, ECOLI_STRAIN_SAKAI, CHICKEN, RHESUS, MALARIA, CHIMP, RAT, YEAST, PIG, XENOPUS
  • plots : options for additional plots
    • volcano : 1 generates volcano plots, 0 doesn’t
    • LFC : log2 fold change cutoff (minimal negative and positive value)
    • FDR : false discovery rate cutoff for significance (recommended: 0.05)
    • heatmap : 1 generates heatmaps, 0 doesn’t
    • heatmap_cluster_cols : 1 performs clustering of columns, 0 (default) doesn’t
    • heatmap_display : choose to display either log2FC or pvalue
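
To make the LFC and FDR cutoffs concrete, the sketch below applies them to an invented results table. The column names log2FC and adj.pvalue are modeled on MSstats-style results, but the values are made up, and whether artMS treats the cutoffs as inclusive or exclusive is an assumption of this example.

```r
# Cutoffs as set in the output_extras section above
lfc <- c(-1.5, 1.5)
fdr <- 0.05

# Invented quantification results for four proteins
results <- data.frame(
  Protein    = c("P1", "P2", "P3", "P4"),
  log2FC     = c(2.1, -0.4, -1.8, 1.7),
  adj.pvalue = c(0.001, 0.01, 0.03, 0.2),
  stringsAsFactors = FALSE
)

# A change is kept when it passes the FDR cutoff and its log2FC lies
# outside the (lfc[1], lfc[2]) interval
results$selected <- results$adj.pvalue < fdr &
  (results$log2FC <= lfc[1] | results$log2FC >= lfc[2])
```

In this toy table P1 and P3 are selected, P2 fails the log2FC cutoff, and P4 fails the FDR cutoff.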

2.3 Special case: Protein fractionation

To handle protein fractionation experiments, two options must be activated:

  1. keys.txt: The keys file must contain an additional column named “FractionKey” with the information about fractions. For example:
RawFile IsotopeLabelType Condition BioReplicate Run FractionKey
S9524_Fx1 L AB AB-1 1 1
S9524_Fx2 L AB AB-1 1 2
S9524_Fx3 L AB AB-1 1 3
S9524_Fx4 L AB AB-1 1 4
S9524_Fx5 L AB AB-1 1 5
S9524_Fx6 L AB AB-1 1 6
S9524_Fx7 L AB AB-1 1 7
S9524_Fx8 L AB AB-1 1 8
S9524_Fx9 L AB AB-1 1 9
S9524_Fx10 L AB AB-1 1 10
S9525_Fx1 L AB AB-2 2 1
S9525_Fx2 L AB AB-2 2 2
S9525_Fx3 L AB AB-2 2 3
S9525_Fx4 L AB AB-2 2 4
S9525_Fx5 L AB AB-2 2 5
S9525_Fx6 L AB AB-2 2 6
S9525_Fx7 L AB AB-2 2 7
S9525_Fx8 L AB AB-2 2 8
S9525_Fx9 L AB AB-2 2 9
S9525_Fx10 L AB AB-2 2 10
S9526_Fx1 L AB AB-3 3 1
S9526_Fx2 L AB AB-3 3 2
S9526_Fx3 L AB AB-3 3 3
S9526_Fx4 L AB AB-3 3 4
S9526_Fx5 L AB AB-3 3 5
S9526_Fx6 L AB AB-3 3 6
S9526_Fx7 L AB AB-3 3 7
S9526_Fx8 L AB AB-3 3 8
S9526_Fx9 L AB AB-3 3 9
S9526_Fx10 L AB AB-3 3 10
  2. config.yaml: Enable fractions in the configuration file as follows:
fractions: 
  enabled : 1 # 1 for protein fractions, 0 otherwise
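
Typing a keys file with many fractions by hand is error-prone, so the table can be generated instead. This is a minimal sketch assuming the raw file naming pattern of the example above (prefix, "_Fx", fraction number) and using the RawFile column name from the keys.txt section; adapt both to your own files.

```r
# One prefix per biological replicate, ten fractions each
raw_prefix  <- c("S9524", "S9525", "S9526")
n_fractions <- 10

# Build one block of rows per replicate and stack the blocks
keys_fx <- do.call(rbind, lapply(seq_along(raw_prefix), function(i) {
  data.frame(
    RawFile          = paste0(raw_prefix[i], "_Fx", seq_len(n_fractions)),
    IsotopeLabelType = "L",
    Condition        = "AB",
    BioReplicate     = paste0("AB-", i),
    Run              = i,
    FractionKey      = seq_len(n_fractions),
    stringsAsFactors = FALSE
  )
}))

# All fractions of a replicate share the same Run number
fractions_per_run <- table(keys_fx$Run)
```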

2.4 Special case: SILAC

One of the most widely used techniques that enable relative protein quantification is stable isotope labeling by amino acids in cell culture (SILAC). The keys.txt file can capture the typical SILAC experiment. The following example shows a SILAC experiment with two conditions, two biological replicates, and two technical replicates:

RawFile IsotopeLabelType Condition BioReplicate Run
QE20140321-01 H iso iso-1 1
QE20140321-02 H iso iso-1 2
QE20140321-04 L iso iso-2 3
QE20140321-05 L iso iso-2 4
QE20140321-01 L iso_M iso_M-1 1
QE20140321-02 L iso_M iso_M-1 2
QE20140321-04 H iso_M iso_M-2 3
QE20140321-05 H iso_M iso_M-2 4

It is also required to activate the silac option in the yaml file as follows:

silac: 
  enabled : 1 # 1 for SILAC experiments
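
A quick consistency check for a SILAC keys table is that every raw file appears twice, once with the heavy (H) and once with the light (L) label. The sketch below rebuilds the example table in R and verifies this; it is only an illustration and does not reproduce how artMS processes SILAC data internally.

```r
raw_files <- c("QE20140321-01", "QE20140321-02",
               "QE20140321-04", "QE20140321-05")

# Same rows as the SILAC keys example above
keys_silac <- data.frame(
  RawFile          = rep(raw_files, times = 2),
  IsotopeLabelType = c("H", "H", "L", "L", "L", "L", "H", "H"),
  Condition        = rep(c("iso", "iso_M"), each = 4),
  BioReplicate     = c("iso-1", "iso-1", "iso-2", "iso-2",
                       "iso_M-1", "iso_M-1", "iso_M-2", "iso_M-2"),
  Run              = rep(1:4, times = 2),
  stringsAsFactors = FALSE
)

# Labels found for each raw file; "H+L" is expected everywhere
labels_per_file <- tapply(
  keys_silac$IsotopeLabelType, keys_silac$RawFile,
  function(x) paste(sort(unique(x)), collapse = "+")
)

# Label counts per condition: each condition uses both labels
label_by_condition <- table(keys_silac$Condition,
                            keys_silac$IsotopeLabelType)
```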

3 QUALITY CONTROL

artMS provides 3 functions to perform QC analyses.

3.1 Basic QC (evidence.txt-based)

The basic quality control analysis takes as input both the evidence.txt and keys.txt files and generates several QC plots exploring different aspects of the MS data. Run it as follows:

artmsQualityControlEvidenceBasic(evidence_file = artms_data_ph_evidence,
                                 keys_file = artms_data_ph_keys,
                                 prot_exp = "PH")

The following pdf files are generated by default:

  • -basicReproducibility.pdf: correlation dot plot for all the combinations of biological replicates of every condition, based on MS Intensity values of features (peptide+charge)
  • -correlationMatrixBR.pdf: It contains 3 pages. Correlation matrix for all the biological replicates using MS Intensity values, Clustering matrix of the MS Intensities, and correlation distribution histogram.
  • -correlationMatrixConditions.pdf: Same as the previous one, but based on MS Intensity values of Conditions
  • -IntensityDistributions.pdf: 2 pages. Box-dot plot and Jitter plot of MS (raw) intensity values for each biological replicate.
  • -intensityStats.pdf: several pages, including bar plots of Total Sum of Intensities in BioReplicates, Total Sum of Intensities in Conditions, Total Peptide Counts in BioReplicates, Total Peptide Counts in conditions grouped by categories (CON: contaminants, PROT peptides, REV reversed sequences used by MaxQuant to estimate the FDR); Box plots of MS Intensity values per biological replicates and conditions; bar plots of total intensity (excluding contaminants) by bioreplicates and conditions; Bar plots of total feature counts by bioreplicates and conditions.
  • -ptmStats.pdf: If any PTM is selected (PH, UB, AC), an extra pdf file will be generated with stats related to the selected modification, including: bar plots of peptide counts and intensities, broken down by PTM/other categories; bar plots of the total sum of MS intensity values by other/PTM categories.

Check ?artmsQualityControlEvidenceBasic to learn more about the options of this function.

Next, for illustration purposes, let’s show how to generate only one plot (e.g. INTDIST):

# Print only the intensity distribution (INTDIST) plot; all other plots are disabled
library(artMS)
suppressWarnings(
artmsQualityControlEvidenceBasic(evidence_file = artms_data_ph_evidence,
                                 keys_file = artms_data_ph_keys,
                                 prot_exp = "PH",
                                 plotINTDIST = TRUE,
                                 plotREPRO = FALSE,
                                 plotCORMAT = FALSE,
                                 plotINTMISC = FALSE,
                                 plotPTMSTATS = FALSE,
                                 printPDF = FALSE,
                                 verbose = FALSE))