vignettes/input_preparation_workflow.Rmd
This vignette gives you an overview of how to prepare the quantitative protein/peptide matrix output from common search engines and software such as Spectronaut, MaxQuant, Proteome Discoverer and Skyline for analysis with protti. Due to its modular and flexible structure, protti can be used on the output of common bottom-up proteomics search engines irrespective of the measurement mode (DDA, DIA, targeted MS).
Furthermore, you are not restricted to reports from the above-mentioned search engines. As long as your data is in a tabular format (data frame) and contains a minimal set of columns, you can analyse it with protti. The minimally required columns contain information on sample, condition, intensity, protein ID and, if the analysis is not performed on the protein level, the level the intensity is based on (fragment, precursor or peptide). Depending on the analysis many more columns can be useful, but they are not required. Ultimately, your data should have a structure similar to this:
Sample | Protein ID | Peptide Sequence | Condition | Intensity |
---|---|---|---|---|
sample1 | P62942 | PEPTIDER | treated | 14000 |
sample2 | P62942 | PEPTI | treated | 15000 |
sample3 | P62942 | PEPTIDE | treated | 14500 |
sample4 | P62942 | PEPTIDER | control | 18000 |
sample5 | P62942 | PEPTI | control | 21000 |
sample6 | P62942 | PEPTIDE | control | 19000 |
It is very important that each unit of the level you perform your analysis on (e.g. each peptide) has a single unique intensity associated with it. If, for example, a peptide has two different intensities in the same sample, protti would not know how to deal with this and many functions will likely fail.
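If you are not sure whether your data fulfils this requirement, a quick check with dplyr reveals duplicated intensities. This is a minimal sketch; the object my_data and the column names follow the example table above and are assumptions that you should adapt to your own data.
library(dplyr)

# Find sample/peptide combinations that have more than one intensity value
# (my_data and the column names are hypothetical and follow the example table above)
duplicated_intensities <- my_data %>%
  group_by(sample, peptide_sequence) %>%
  filter(n() > 1) %>%
  ungroup()

# If this returns any rows, aggregate or filter your data so that each peptide
# has exactly one intensity per sample before using protti functions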
Data should always be organised in a format called tidy data. That means data should be contained in a long format (e.g. all sample names in one column) rather than a wide format (e.g. each sample name in its own column with intensity as the content of the columns). You can easily achieve this by using the pivot_longer() function from the tidyr package. The output of many search engines already contains tidy data and working with it is very easy because you can refer to information with only one variable. protti is designed to work well together with the tidyverse package family, which is built around the concept of tidy data.
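As a minimal sketch, converting such a wide table into the long tidy format could look like this. The wide_data object and its column names are hypothetical; adapt them to your own report.
library(tidyr)
library(dplyr)

# Hypothetical wide table: one intensity column per sample
wide_data <- data.frame(
  protein_id = c("P62942", "P62944"),
  sample1 = c(14000, 20000),
  sample2 = c(18000, 23000)
)

# Gather all sample columns into one "sample" column and one "intensity" column
long_data <- wide_data %>%
  pivot_longer(
    cols = c(sample1, sample2),
    names_to = "sample",
    values_to = "intensity"
  )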
Many search engines provide the user with protein intensities. However, it is also possible to calculate protein intensities directly from precursor intensities with the protti function calculate_protein_abundance(). protti implements the "iq" method, previously implemented in the R package iq, which performs protein quantification based on the maximal peptide ratio extraction algorithm adapted from the MaxLFQ algorithm (Cox, J. 2013).
One advantage of calculating protein abundances with protti is the possibility to median normalise run intensities on the precursor level. This is closer to the actually acquired intensities, and thus to the sample concentrations, than normalisation performed on the protein level. Some, but not all, search engines provide an option for automatic median normalisation. Furthermore, some search engines calculate protein intensities by summing precursor intensities irrespective of peptide missingness in certain samples. In these cases the maximal peptide ratio extraction algorithm implemented in protti provides a more robust calculation of protein intensities.
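A call to calculate_protein_abundance() could look like the sketch below. The input data frame precursor_data and its column names are assumptions, intensities are expected on a log2 scale, and you should check ?calculate_protein_abundance for the exact arguments of your protti version.
library(protti)
library(dplyr)

# Sketch of a protein abundance calculation with the "iq" method
# (precursor_data and its column names are hypothetical; intensity_log2 should
# contain log2-transformed intensities)
protein_abundance <- precursor_data %>%
  calculate_protein_abundance(
    sample = sample,
    protein_id = protein_id,
    precursor = precursor,
    peptide = peptide,
    intensity_log2 = intensity_log2,
    method = "iq"
  )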
If you prefer to use the protein intensities provided by the search engine of your choice, this is not a problem, and we will show how some of this information can be converted into the right format.
We will demonstrate how most outputs can be converted with functions from the R packages magrittr, dplyr, tidyr and stringr. You can load packages with the library() function after you have installed them. Note that we are using the R package magrittr because of its pipe operator %>%. It takes the output of the preceding function and supplies it as the first argument of the following function. Using %>% makes code easier to read and follow.
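For the examples in this vignette we assume that the following packages are installed and loaded:
library(protti)
library(magrittr)
library(dplyr)
library(tidyr)
library(stringr)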
Spectronaut reports already contain data in the tidy data format. Therefore, nothing needs to be changed in order to use them with protti. However, the columns we would recommend exporting from Spectronaut (not all of them are strictly required) are:
Please make sure that the report is a .csv file. You can use the read_protti() function to load the Spectronaut report into R. This function is a wrapper around the fast fread() function from the data.table package and the clean_names() function from the janitor package. It allows you not only to load your data into R very fast, but also to clean up the column names into lower snake case, which makes them easier to remember and to use in your data analysis. For example, the Spectronaut column R.FileName will become r_file_name.
# To read in your own data you can use read_protti()
spectronaut_data <- read_protti(filename = "mydata/spectronaut.csv")
Depending on which analysis you are performing you will have to use different MaxQuant outputs. For peptide-centric analyses we would recommend to use the evidence.txt file. If you want to perform a protein-centric analysis and want to use protein quantities calculated by MaxQuant, you need the proteinGroups.txt file. However, you can also apply the maximal peptide ratio extraction algorithm from the iq R package, which is implemented in the calculate_protein_abundance() function of protti. This allows you to work with only the evidence.txt file; the resulting protein intensities are identical since they were calculated with the same algorithm. In case you are interested in performing a peptide-centric analysis (necessary for LiP-MS), you should use the evidence.txt file provided in the MaxQuant search output.
The evidence.txt file contains essentially all the information we need to run protti. It is also in a long format, which makes it easy to read in and use directly. One thing to take into consideration is the lack of a column with information on the proteotypicity of peptides. However, this information can be inferred from the Proteins column: if it contains more than one protein ID, the peptide is not proteotypic. You can extract this information and create a new column called is_proteotypic containing logicals that are TRUE if the Proteins column does not contain a semicolon and FALSE if it does (a semicolon indicates that the peptide belongs to more than one protein). As mentioned in the data analysis vignettes, this information is necessary for the analysis of LiP-MS data, but it could also be considered for the correct calculation of protein abundances.
Another column that is required for the analysis of your data is one indicating the condition each sample belongs to. This can easily be added to the evidence file by joining a data frame containing the specific annotations. For a large number of samples you can create such a data frame in Excel and import it into R, or you can create it directly in R.
MaxQuant output provides information on decoy hits in the reverse column and on whether a hit is a potential contaminant in the potential_contaminant column. You should filter these out before the analysis. The contaminant column can, however, be used for quality control.
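As a minimal sketch, the filtering could look like this, assuming the evidence file has been read in with read_protti() as shown below so that column names are in lower snake case; in MaxQuant output these columns contain a "+" for affected rows and are empty otherwise.
# Remove decoy (reverse) hits and potential contaminants
# (these columns contain "+" for affected rows and are empty or NA otherwise)
evidence_filtered <- evidence %>%
  filter(reverse != "+" | is.na(reverse)) %>%
  filter(potential_contaminant != "+" | is.na(potential_contaminant))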
One important point for MaxQuant data is to make sure that you only have one intensity assigned to each peptide or precursor. You can achieve this either by summing up all intensities that MaxQuant exports for a precursor (these can be MULTI-MSMS, MSMS, ISO-MSMS, MULTI-MATCH, ISO-SECPEP) or by filtering, for example, for precursors with MULTI-MSMS quantification and only using these.
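A minimal sketch of the summing approach is shown below. It assumes that the evidence file has already been read in with read_protti() and that the relevant columns are called raw_file, proteins, modified_sequence, charge and intensity after name cleaning; these names may differ in your report.
# Sum all intensity entries that belong to the same precursor in each run so that
# every precursor has exactly one intensity value per sample
# (column names are assumptions; add further annotation columns to group_by()
# if you want to keep them)
evidence_summed <- evidence %>%
  group_by(raw_file, proteins, modified_sequence, charge) %>%
  summarise(intensity = sum(intensity, na.rm = TRUE), .groups = "drop")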
In this section we will show you how to read in the file with read_protti() and how to create the is_proteotypic column and the condition column (minimally required) with the help of the stringr and dplyr packages. How to filter your data best is described in the data analysis vignettes.
# To read in your own data you can use read_protti()
evidence <- read_protti(filename = "yourpath/evidence.txt")
evidence_proteotypic <- evidence %>%
# adds new column with logicals that are TRUE if the peptide can be assigned
# to only one protein and FALSE if it can be assigned to multiple
mutate(is_proteotypic = str_detect(
string = proteins,
pattern = ";",
negate = TRUE
)) %>%
# adds new column with logicals indicating if peptide is coming from a potential contaminant
mutate(is_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE))
# Make an annotation data frame and merge it with your data frame to obtain conditions
# We are annotating samples 1-3 as controls and samples 4-6 as treated conditions
file_name <- c( # make sure that the names are the same as in your report
"sample1",
"sample2",
"sample3",
"sample4",
"sample5",
"sample6"
)
condition <- c(
"control",
"control",
"control",
"treated",
"treated",
"treated"
)
annotation <- data.frame(file_name, condition)
# Combine your long data frame with the annotation
evidence_annotated <- evidence_proteotypic %>%
left_join(y = annotation, by = "file_name")
For protein-centric analyses you can use the proteinGroups.txt file provided by MaxQuant. This file contains information in a wide format where each sample has its own column containing intensity values. Therefore, we need to transform this data into a long format to meet the conditions of tidy data. We will filter the data and use tidyr's pivot_longer() to change the format to long format. Furthermore, we create an annotation data frame to add a condition column. The filtering is only done in order to remove proteins of potentially low quality. Further filtering for decoys and potential contaminants should be performed based on the data analysis vignettes.
# To read in your own data you can use read_protti()
protein_groups <- read_protti(filename = "yourpath/proteinGroups.txt") %>%
# adds new column with logicals indicating if protein is a potential contaminant,
# you can filter these out later on. You should also consider filtering out proteins
# that were "only identified by site" and reverse hits, as well as proteins with only
# one identified peptide
mutate(is_potential_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE))
# Change wide format to long format and create new columns called `file_name` and `intensity`
protein_groups_long <- protein_groups %>%
pivot_longer(
cols = starts_with("intensity_"),
names_to = "file_name",
values_to = "intensity"
)
# Make an annotation data frame and merge it with your data frame to obtain conditions
# We are annotating samples 1-3 as controls and samples 4-6 as treated conditions
file_name <- c( # make sure that the names are the same as in your report
"intensity_sample1",
"intensity_sample2",
"intensity_sample3",
"intensity_sample4",
"intensity_sample5",
"intensity_sample6"
)
condition <- c(
"control",
"control",
"control",
"treated",
"treated",
"treated"
)
annotation <- data.frame(file_name, condition)
# Combine your long data frame with the annotation
protein_groups_annotated <- protein_groups_long %>%
left_join(y = annotation, by = "file_name")
The Skyline output is already in a long format; however, to process it you need to sum up the transition intensities to obtain the intensity of each precursor. If you prefer to analyse your data on the fragment level, you should instead create a column that uniquely identifies each fragment of each precursor. You could do that by pasting together the peptide sequence, the charge and the product m/z, as in the sketch below.
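Such a fragment identifier could, for example, be created as follows; the column names follow the Skyline code further below and may differ in your export.
# Sketch: create an identifier that is unique for each fragment of each precursor
skyline_data_fragment <- skyline_data %>%
  mutate(fragment = paste(peptide_sequence, charge, product_mz, sep = "_"))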
The required Skyline output columns include:
You can add replicate and condition annotations in Skyline directly. However, we will explain in this section how you can also do it in R. If you want to analyse your data on the protein abundance level, you will have to combine the precursor intensities to obtain one value for protein abundance. This can be done using the calculate_protein_abundance() function from protti.
# Load data
skyline_data <- read_protti(filename = "yourpath/skyline.csv")
skyline_data_int <- skyline_data %>%
# create a column with precursor information
mutate(precursor = paste0(peptide_sequence, "_", charge)) %>%
group_by(replicate_name, precursor) %>%
# making a new column containing the summed up intensities of all transitions of one precursor
mutate(sum_intensity = sum(area)) %>%
select(-c(product_mz, area)) %>% # removing the columns we don't need
distinct() # removing duplicated rows from the data frame
# Add annotation
# make sure that the names are the same as in your report
replicate_name <- c(
"sample_1",
"sample_2",
"sample_3",
"sample_1",
"sample_2",
"sample_3"
)
condition <- c(
"control",
"control",
"control",
"treated",
"treated",
"treated"
)
annotation <- data.frame(replicate_name, condition)
# Combine your long data frame with the annotation
skyline_annotated <- skyline_data_int %>%
left_join(y = annotation, by = "replicate_name")
The Proteome Discoverer output contains data in a wide format (one column for each sample). Similar to MaxQuant, there is the option of a peptide-centric or a protein-centric export. We will discuss both cases in this section.
For a peptide-centric or LiP-MS analysis please export the “Peptide Groups” report. Before preparing your export you can add the column “sequence” to your table; otherwise Proteome Discoverer will only export the “annotated sequence” column, which includes the preceding and following amino acids in the protein sequence.
The required columns include:
After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such.
We will read in the file using read_protti() and then select the columns we are interested in. You can use the contaminant column for quality control. The number_proteins column contains information on proteotypicity: if it is 1, the peptide is proteotypic. If you only want to analyse your data qualitatively with the quality control functions of protti, you can keep peptides without quantifications. Before you start your quantitative analysis, remove observations that are labeled "No Quan Values" in the quan_info column. In the example below they are filtered out at this step, but you can also keep them and only filter them out later.
# Load data
pd_pep_data <- read_protti("yourpath/PDpeptides.csv")
# Select relevant columns
pd_pep_selected <- pd_pep_data %>%
select(
sequence,
modifications,
number_proteins,
contaminant,
master_protein_accessions,
starts_with("abundances_grouped"), # select all columns that start with "abundances_grouped"
quan_info
)
# Filter data frame
pd_pep_filtered <- pd_pep_selected %>%
filter(contaminant == FALSE) %>% # remove annotated contaminants
filter(number_proteins == 1) %>% # select proteotypic peptides
filter(quan_info != "No Quan Values") # remove peptides that have no quantification values
# Convert into long format
pd_pep_long <- pd_pep_filtered %>%
pivot_longer(
cols = starts_with("abundances"),
names_to = "file_name",
values_to = "intensity"
) %>%
# combine peptide sequence and modifications to make a precursor column
mutate(precursor = paste(sequence, modifications))
# Make annotation data frame
file_name <- c( # make sure that the names are the same as in your report
"abundances_grouped_f1",
"abundances_grouped_f2",
"abundances_grouped_f3",
"abundances_grouped_f4",
"abundances_grouped_f5",
"abundances_grouped_f6"
)
condition <- c(
"control",
"control",
"control",
"treated",
"treated",
"treated"
)
annotation <- data.frame(file_name, condition)
# Combine your long data frame with the annotation
pd_pep_long_annotated <- pd_pep_long %>%
left_join(y = annotation, by = "file_name")
For a protein-centric analysis please export the “Proteins” report.
The required columns include:
After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such.
We will read in the file using read_protti() and then select the columns we are interested in. Similar to above, you can either filter on the contaminant and number_peptides columns now or later.
# Load data
pd_prot_data <- read_protti("yourpath/PDproteins.csv")
# Select relevant columns
pd_prot_selected <- pd_prot_data %>%
select(
accession,
description,
contaminant,
number_peptides,
    starts_with("abundances_grouped") # select all columns that start with "abundances_grouped"
)
# Filter data frame
pd_prot_data_filtered <- pd_prot_selected %>%
filter(contaminant == FALSE) %>% # remove annotated contaminants
filter(number_peptides > 1) # select proteins with more than one identified peptide
# Convert into long format
pd_prot_long <- pd_prot_data_filtered %>%
pivot_longer(
cols = starts_with("abundances"),
names_to = "file_name",
values_to = "intensity"
)
# Make annotation data frame
file_name <- c( # make sure that the names are the same as in your report
"abundances_grouped_f1",
"abundances_grouped_f2",
"abundances_grouped_f3",
"abundances_grouped_f4",
"abundances_grouped_f5",
"abundances_grouped_f6"
)
condition <- c(
"control",
"control",
"control",
"treated",
"treated",
"treated"
)
annotation <- data.frame(file_name, condition)
# Combine your long data frame with the annotation
pd_prot_long_annotated <- pd_prot_long %>%
left_join(y = annotation, by = "file_name")
As mentioned at the beginning of this vignette, you can use the output of any search engine as long as it contains the minimally required columns. If it is not in the right format, check whether some of the above transformations can be applied to your data. It is also always useful to check whether your search engine can export additional columns that help you in your analysis. Always make sure that all of your observations are ones you are interested in: check whether there are decoys, contaminants or non-proteotypic peptides in your data. For protein-centric analyses, consider removing protein quantifications that rely on only a few peptides.