R/map_peptides_on_structure.R
map_peptides_on_structure.Rd
Peptides are mapped onto PDB structures or AlphaFold predictions based on their positions. This is accomplished by replacing the B-factor information in the structure file with values that allow highlighting of peptides, protein regions or amino acids when the structure is coloured by B-factor. In addition to simply highlighting peptides, protein regions or amino acids, a continuous variable associated with them, such as a fold change, can be mapped onto the structure as a colour gradient.
map_peptides_on_structure(
  peptide_data,
  uniprot_id,
  pdb_id,
  chain,
  auth_seq_id,
  map_value,
  file_format = ".cif",
  scale_per_structure = TRUE,
  export_location = NULL,
  structure_file = NULL,
  show_progress = TRUE
)
a data frame that contains the input columns to this function. If structure or prediction files should be fetched automatically, please provide column names to the following arguments: uniprot_id, pdb_id, chain, auth_seq_id, map_value. If no PDB structure is available for a protein, the pdb_id and chain columns should contain NA at these positions. If a structure or prediction file is provided in the structure_file argument, this data frame should only contain information associated with the provided structure. For a user-provided structure, column names should be provided to the following arguments: uniprot_id, chain, auth_seq_id, map_value.
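The input layout described above can be sketched as a small data frame. This is only an illustration: the chain name, positions and map values are placeholders and were not looked up in any structure.

```r
# Hypothetical input with one row per peptide, region or residue.
# All values below are placeholders chosen for illustration.
peptide_data <- data.frame(
  uniprot_id = c("P0A8T7", "P0A8T7"),
  pdb_id = c("6UU2", NA),      # NA: fetch an AlphaFold prediction instead
  chain = c("D", NA),          # placeholder author-defined chain name
  auth_seq_id = c("1160;1161;1162", "1197;1198"),
  map_value = c(70, 100)
)

# Rows that would be mapped onto an AlphaFold prediction
peptide_data[is.na(peptide_data$pdb_id), ]
```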
a character column in the peptide_data data frame that contains UniProt identifiers for a corresponding peptide, protein region or amino acid.
a character column in the peptide_data data frame that contains PDB identifiers for structures in which a corresponding peptide, protein region or amino acid is found. If a protein prediction should be fetched from AlphaFold, this column should contain NA. This column is not required if a structure or prediction file is provided in the structure_file argument.
a character column in the peptide_data data frame that contains the name of the chain from the PDB structure in which the peptide, protein region or amino acid is found. If a protein prediction should be fetched from AlphaFold, this column should contain NA. If an AlphaFold prediction is provided to the structure_file argument, the chain should be provided as usual (all AlphaFold predictions have only chain A). Important: please provide the author-defined chain definitions for both ".cif" and ".pdb" files. When the output of the find_peptide_in_structure function is used as the input for this function, this corresponds to the auth_asym_id column.
optional, a character (or numeric) column in the peptide_data data frame that contains semicolon-separated positions of peptides, protein regions or amino acids in the corresponding PDB structure or AlphaFold prediction. This information can be obtained from the find_peptide_in_structure function. The corresponding column in the output is called auth_seq_id. In case of AlphaFold predictions, UniProt positions should be used. If single positions and not stretches of amino acids are provided, the column can be numeric and does not need to contain the semicolon separator.
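For AlphaFold predictions, where UniProt positions are used, the semicolon-separated format can be built from a start and end position. A minimal base R sketch, with the start and end values chosen only for illustration:

```r
# Collapse a stretch of residues into the semicolon-separated format
start <- 1197
end <- 1200
positions <- paste(start:end, collapse = ";")
positions
#> [1] "1197;1198;1199;1200"

# Split the string back into individual integer positions
as.integer(strsplit(positions, ";", fixed = TRUE)[[1]])
#> [1] 1197 1198 1199 1200
```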
a numeric column in the peptide_data data frame that contains a value associated with each peptide, protein region or amino acid. If one start to end position pair has multiple different map values, the maximum will be used. This value will be displayed as a colour gradient when mapped onto the structure. The value can for example be the fold change, p-value or score associated with each peptide, protein region or amino acid (selection). If the selections should be displayed with just one colour, the value in this column should be the same for every selection. For the mapping, values are scaled between 50 and 100. Regions in the structure that do not map to any selection receive a value of 0. If an amino acid position is associated with multiple mapped values, e.g. from different peptides, the maximum mapped value will be displayed.
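The two rules above (maximum per position, then scaling to the range 50 to 100) can be sketched in base R. The helper scale_50_100() is hypothetical and assumes linear min-max scaling; the formula the function actually uses, and its handling of identical values, may differ.

```r
# Hypothetical helper: linear min-max scaling to the range 50-100.
# Assumption: if all values are identical, every selection receives 100.
scale_50_100 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(100, length(x)))
  }
  50 + (x - rng[1]) / (rng[2] - rng[1]) * 50
}

# Two overlapping peptides: positions 1197 and 1198 carry two values each
residues <- data.frame(
  position = c(1196, 1197, 1198, 1197, 1198, 1199),
  map_value = c(70, 70, 70, 100, 100, 100)
)

# Keep the maximum value per position, then scale
per_residue <- aggregate(map_value ~ position, data = residues, FUN = max)
per_residue$b_factor <- scale_50_100(per_residue$map_value)
per_residue
```

In this sketch position 1196 ends up at 50 and positions 1197 to 1199 at 100, because the overlapping positions take the larger of their two values.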
a character vector containing the file format of the structure that will be fetched from the database for the PDB identifiers provided in the pdb_id column. This can be either ".cif" or ".pdb". The default is ".cif". We recommend using ".cif" files since every structure is available as a ".cif" file but not every structure is available as a ".pdb" file. Fetching and mapping onto ".cif" files takes longer than for ".pdb" files. If a structure file is provided in the structure_file argument, the file format is detected automatically and does not need to be provided.
a logical value that specifies if scaling should be performed for each structure independently (TRUE) or over the whole data set (FALSE). The default is TRUE, which scales the scores of each structure independently so that each structure has a score range from 50 to 100.
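The difference between the two settings can be illustrated with a base R sketch. The helper min_max_50_100() and the structure labels are hypothetical, and linear min-max scaling is an assumption about how the scores are rescaled.

```r
# Hypothetical helper: linear min-max scaling to the range 50-100
min_max_50_100 <- function(x) {
  50 + (x - min(x)) / (max(x) - min(x)) * 50
}

# Placeholder scores for two structures
scores <- data.frame(
  pdb_id = c("A", "A", "B", "B"),
  map_value = c(1, 3, 2, 10)
)

# Analogous to scale_per_structure = TRUE: each structure spans 50-100
scores$per_structure <- ave(scores$map_value, scores$pdb_id, FUN = min_max_50_100)

# Analogous to scale_per_structure = FALSE: one scale over the whole data set
scores$whole_data_set <- min_max_50_100(scores$map_value)
scores
```

With independent scaling both structures use the full colour range, which makes each structure easy to read on its own. Scaling over the whole data set keeps colours comparable between structures.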
optional, a character argument specifying the path to the location in which the fetched and altered structure files should be saved. If left empty, they will be saved in the current working directory. The location should be provided in the following format "folderA/folderB".
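A path in the "folderA/folderB" format can be assembled with file.path(). The sketch below creates the folder in a temporary directory first, since it is an assumption that the folder has to exist before the function writes to it.

```r
# Build an export path in the "folderA/folderB" format inside a temporary directory
export_location <- file.path(tempdir(), "folderA", "folderB")

# Create the nested folders (assumption: the folder should exist beforehand)
dir.create(export_location, recursive = TRUE, showWarnings = FALSE)
dir.exists(export_location)
```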
optional, a character argument specifying the path to the location and name of a structure file in ".cif" or ".pdb" format. If a structure is provided, the peptide_data data frame should only contain mapping information for this structure.
a logical, if show_progress = TRUE, a progress bar will be shown (default is TRUE).
The function exports a modified ".pdb" or ".cif" structure file. B-factors have been replaced with scaled (50-100) values provided in the map_value column.
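To check the result in an exported ".pdb" file, the B-factor can be read from the fixed-width ATOM records, where it occupies columns 61 to 66 of the PDB format. The sketch below builds one ATOM line with made-up coordinates and a B-factor of 75 instead of reading a real export; for ".cif" files the value sits in a named column and needs a different parser.

```r
# Build a single PDB ATOM record with placeholder values.
# Column layout: 1-6 record, 7-11 serial, 13-16 atom, 18-20 residue,
# 22 chain, 23-26 residue number, 31-54 x/y/z, 55-60 occupancy, 61-66 B-factor.
atom_line <- sprintf(
  "%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f",
  "ATOM", 1L, " N", "", "MET", "A", 1L, "",
  27.340, 24.430, 2.614, 1.00, 75.00
)

# Extract the B-factor field and convert it to a number
b_factor <- as.numeric(substr(atom_line, 61, 66))
b_factor
#> [1] 75
```

Applied to every ATOM line of an exported file, mapped residues should show values between 50 and 100 and unmapped residues 0.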
# \donttest{
# Load libraries
library(dplyr)
library(protti) # provides find_peptide_in_structure() and map_peptides_on_structure()
# Create example data
peptide_data <- data.frame(
  uniprot_id = c("P0A8T7", "P0A8T7", "P60906"),
  peptide_sequence = c(
    "SGIVSFGKETKGKRRLVITPVDGSDPYEEMIPKWRQLNV",
    "NVFEGERVER",
    "AIGEVTDVVEKE"
  ),
  start = c(1160, 1197, 55),
  end = c(1198, 1206, 66),
  map_value = c(70, 100, 100)
)
# Find peptide positions in structures
positions_structure <- find_peptide_in_structure(
  peptide_data = peptide_data,
  peptide = peptide_sequence,
  start = start,
  end = end,
  uniprot_id = uniprot_id,
  retain_columns = c(map_value)
) %>%
  filter(pdb_ids %in% c("6UU2", "2EL9"))
#> [2/6] Extract experimental conditions ...
#> DONE (0.02s)
#> [3/6] Extracting polymer information:
#> -> 1/6 UniProt IDs ...
#> DONE (0.4s)
#> -> 2/6 UniProt alignment ...
#> DONE (0.4s)
#> -> 3/6 Ligand binding sites ...
#> DONE (2.87s)
#> -> 4/6 Modified monomers ...
#> DONE (0.13s)
#> -> 5/6 Secondary structure ...
#> DONE (0.75s)
#> -> 6/6 Unmodeled residues ...
#> DONE (0.16s)
#> [4/6] Correct author sequence positions for some PDB IDs ...
#> None to correct(0.22s)
#> [5/6] Extract non-polymer information ...
#> DONE (0.01s)
#> [6/6] Combine information ...
#> DONE (0.44s)
# Map peptides on structures
# You can set the output location with the
# export_location argument. Here the files are
# saved in the current working directory.
map_peptides_on_structure(
  peptide_data = positions_structure,
  uniprot_id = uniprot_id,
  pdb_id = pdb_ids,
  chain = auth_asym_id,
  auth_seq_id = auth_seq_id,
  map_value = map_value,
  file_format = ".pdb",
  export_location = getwd()
)
#> The following structures were not fetched, likely because no ".pdb"
#> file is available. Try using the ".cif" format for these.6UU2_P0A8T7
# }