R/create_structure_contact_map.R
create_structure_contact_map.Rd
Creates a contact map of a subset or of all atom or residue distances in a structure or
AlphaFold prediction file. Contact maps are a useful tool for identifying protein
regions that are in close proximity in the folded protein. Additionally, regions that
interact closely with a small molecule or metal ion can be easily identified without the
need to open the structure in programs such as PyMOL or ChimeraX. For large datasets (more
than 40 contact maps) it is recommended to use the parallel_create_structure_contact_map()
function instead, regardless of whether the maps should be created in parallel or sequentially.
create_structure_contact_map(
  data,
  data2 = NULL,
  id,
  chain = NULL,
  auth_seq_id = NULL,
  distance_cutoff = 10,
  pdb_model_number_selection = c(0, 1),
  return_min_residue_distance = TRUE,
  show_progress = TRUE,
  export = FALSE,
  export_location = NULL,
  structure_file = NULL
)
data
a data frame containing at least a column with PDB ID information, the name of which is
provided to the id argument. If only this column is provided, all atom or residue
distances are calculated. Additionally, the data frame can contain a chain column, the name of
which is provided to the chain argument. If chains are provided, only distances
of these chains relative to the rest of the structure are calculated. Multiple chains can be
provided in multiple rows. If chains are provided for one structure but not for another, the
corresponding rows should contain NAs. Furthermore, specific residue positions can be provided
in the auth_seq_id column if the selection should be reduced further. It is not recommended to
create full contact maps for more than a few structures due to time and memory limitations. If
contact maps are created only for small regions, it is possible to create multiple maps at once.
By default, distances of regions provided in this data frame to the complete structure are
computed. If distances of regions from this data frame to another specific subset of regions
should be computed, the second subset of regions can be provided through the optional data2
argument.
data2
optional, a data frame that contains a subset of regions for which distances to regions
provided in the data data frame should be computed. If regions from the data data frame
should be compared to the whole structure, data2 does not need to be provided.
This data frame should have the same structure and column names as the data data frame.
id
a character column in the data data frame that contains PDB or UniProt IDs for
structures or AlphaFold predictions for which contact maps should be created. If a structure
not downloaded directly from PDB (i.e. a locally stored structure file) is provided to the
structure_file argument, this column should contain "my_structure" as content.
chain
optional, a character column in the data data frame that contains chain
identifiers for the structure file. Identifiers defined by the structure author should be used.
Distances will only be calculated between the provided chains and the rest of the structure.
auth_seq_id
optional, a character (or numeric) column in the data data frame
that contains semicolon-separated positions of regions for which distances should be calculated.
This always needs to be provided in combination with a corresponding chain in chain.
The positions should match the numbering defined by the structure author. For
PDB structures this information can be obtained from the find_peptide_in_structure
function. The corresponding column in its output is called auth_seq_id. If an
AlphaFold prediction is provided, UniProt positions should be used. If single positions
and not stretches of amino acids are provided, the column can be numeric and does not need
to contain the semicolon separator.
distance_cutoff
a numeric value specifying the distance cutoff in Angstrom. All pairwise distances are calculated, but only values within this cutoff are returned in the output. For example, with a cutoff of 5, only residues with a distance of 5 Angstrom or less are returned. Using a small value can reduce the size of the contact map drastically and is therefore recommended. The default value is 10.
pdb_model_number_selection
a numeric vector specifying which models from the structure files should be considered for contact maps. NMR structures, for example, often contain many models in one file. The default for this argument is c(0, 1), which means that the first model of each structure file is selected for contact map calculations. For AlphaFold predictions the model number is 0 (only .pdb files), therefore this case is also included here.
return_min_residue_distance
a logical value that specifies whether the contact map should be returned for all atom distances or for minimum residue distances. Contact maps of minimum residue distances are smaller in size. If atom distances are not strictly needed, it is recommended to set this argument to TRUE. The default is TRUE.
show_progress
a logical value that specifies whether a progress bar will be shown (default is TRUE).
export
a logical value that indicates whether contact maps should be exported as ".csv". The
name of the file will be the structure ID. Default is export = FALSE.
export_location
optional, a character value that specifies the path to the location in
which the contact maps should be saved if export = TRUE. If left empty, they will be
saved in the current working directory. The location should be provided in the following format:
"folderA/folderB".
structure_file
optional, a character value that specifies the path to the location and
name of a structure file in ".cif" or ".pdb" format for which a contact map should be created.
All other arguments can be provided as usual, with the exception of the id column in the
data data frame, which should not contain a PDB or UniProt ID but a character vector
containing only "my_structure".
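The input layouts described for the arguments above can be sketched as plain data frames. This is only an illustration: the column names are free to choose, the chain identifiers and residue positions are made up and not checked against the actual structures, and the file path in the commented call is hypothetical.

```r
# One row per selection. NA in chain or auth_seq_id means no restriction
# for that row. All identifiers and positions are illustrative only.
data <- data.frame(
  pdb_id = c("6NPF", "6NPF", "1C14", "3NIR"),
  chain = c("A", "B", "A", NA),
  auth_seq_id = c("1;2;3;4;5;6;7", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Optional second subset with the same structure and column names as data.
data2 <- data.frame(
  pdb_id = "6NPF",
  chain = "B",
  auth_seq_id = NA_character_,
  stringsAsFactors = FALSE
)

# For a locally stored structure file the id column contains "my_structure".
local_data <- data.frame(
  pdb_id = "my_structure",
  chain = "A",
  auth_seq_id = "10;11;12",
  stringsAsFactors = FALSE
)

# The call itself is left commented out because it needs the package and
# an existing file; "path/to/structure.cif" is a hypothetical path.
# contact_maps <- create_structure_contact_map(
#   data = local_data,
#   id = pdb_id,
#   chain = chain,
#   auth_seq_id = auth_seq_id,
#   structure_file = "path/to/structure.cif"
# )
```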
A list of contact maps for each PDB or UniProt ID provided in the input is returned.
If the export argument is TRUE, each contact map will be saved as a ".csv" file in the
current working directory or the location provided to the export_location argument.
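Because the result is a list with one element per ID, it can be convenient to stack the elements into a single data frame before filtering. The following is a minimal base R sketch on a toy list that only mimics a few of the output columns; the values are made up and the object is not real function output.

```r
# Toy stand-in for the returned list (made-up values, reduced columns).
contact_maps <- list(
  `6NPF` = data.frame(
    id = "6NPF",
    auth_seq_id_var1 = c("1", "2"),
    auth_seq_id_var2 = c("5", "9"),
    min_distance_residue = c(3.2, 7.8),
    stringsAsFactors = FALSE
  ),
  `1C14` = data.frame(
    id = "1C14",
    auth_seq_id_var1 = c("2", "2"),
    auth_seq_id_var2 = c("3", "40"),
    min_distance_residue = c(4.1, 9.5),
    stringsAsFactors = FALSE
  )
)

# Stack all maps; the id column keeps track of the source structure.
all_maps <- do.call(rbind, contact_maps)
rownames(all_maps) <- NULL

# Apply a stricter cutoff after the fact, here 5 Angstrom.
close_contacts <- all_maps[all_maps$min_distance_residue <= 5, ]
```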
# \donttest{
# Create example data
data <- data.frame(
  pdb_id = c("6NPF", "1C14", "3NIR"),
  chain = c("A", "A", NA),
  auth_seq_id = c("1;2;3;4;5;6;7", NA, NA)
)
# Create contact map
contact_maps <- create_structure_contact_map(
  data = data,
  id = pdb_id,
  chain = chain,
  auth_seq_id = auth_seq_id,
  return_min_residue_distance = TRUE
)
str(contact_maps[["3NIR"]])
#> tibble [8,062 × 14] (S3: tbl_df/tbl/data.frame)
#> $ label_comp_id_var1 : chr [1:8062] "THR" "THR" "THR" "THR" ...
#> $ label_seq_id_var1 : num [1:8062] 1 1 1 1 1 1 1 1 1 1 ...
#> $ label_asym_id_var1 : chr [1:8062] "A" "A" "A" "A" ...
#> $ auth_comp_id_var1 : chr [1:8062] "THR" "THR" "THR" "THR" ...
#> $ auth_seq_id_var1 : chr [1:8062] "1" "1" "1" "1" ...
#> $ auth_asym_id_var1 : chr [1:8062] "A" "A" "A" "A" ...
#> $ label_comp_id_var2 : chr [1:8062] "THR" "THR" "CYS" "ARG" ...
#> $ label_seq_id_var2 : num [1:8062] 1 2 3 10 13 17 23 27 32 33 ...
#> $ label_asym_id_var2 : chr [1:8062] "A" "A" "A" "A" ...
#> $ auth_comp_id_var2 : chr [1:8062] "THR" "THR" "CYS" "ARG" ...
#> $ auth_seq_id_var2 : chr [1:8062] "1" "2" "3" "10" ...
#> $ auth_asym_id_var2 : chr [1:8062] "A" "A" "A" "A" ...
#> $ id : chr [1:8062] "3NIR" "3NIR" "3NIR" "3NIR" ...
#> $ min_distance_residue: num [1:8062] 0 1.18 3.23 4.72 6.71 ...
contact_maps
#> $`6NPF`
#> # A tibble: 330 × 14
#> label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#> <chr> <dbl> <chr> <chr>
#> 1 SER 19 A SER
#> 2 SER 19 A SER
#> 3 SER 19 A SER
#> 4 SER 19 A SER
#> 5 SER 19 A SER
#> 6 SER 19 A SER
#> 7 SER 19 A SER
#> 8 SER 19 A SER
#> 9 SER 19 A SER
#> 10 SER 19 A SER
#> # ℹ 320 more rows
#> # ℹ 10 more variables: auth_seq_id_var1 <chr>, auth_asym_id_var1 <chr>,
#> # label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> # label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <chr>,
#> # auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#>
#> $`1C14`
#> # A tibble: 18,553 × 14
#> label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#> <chr> <dbl> <chr> <chr>
#> 1 GLY 2 A GLY
#> 2 GLY 2 A GLY
#> 3 GLY 2 A GLY
#> 4 GLY 2 A GLY
#> 5 GLY 2 A GLY
#> 6 GLY 2 A GLY
#> 7 GLY 2 A GLY
#> 8 GLY 2 A GLY
#> 9 GLY 2 A GLY
#> 10 GLY 2 A GLY
#> # ℹ 18,543 more rows
#> # ℹ 10 more variables: auth_seq_id_var1 <chr>, auth_asym_id_var1 <chr>,
#> # label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> # label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <chr>,
#> # auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#>
#> $`3NIR`
#> # A tibble: 8,062 × 14
#> label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#> <chr> <dbl> <chr> <chr>
#> 1 THR 1 A THR
#> 2 THR 1 A THR
#> 3 THR 1 A THR
#> 4 THR 1 A THR
#> 5 THR 1 A THR
#> 6 THR 1 A THR
#> 7 THR 1 A THR
#> 8 THR 1 A THR
#> 9 THR 1 A THR
#> 10 THR 1 A THR
#> # ℹ 8,052 more rows
#> # ℹ 10 more variables: auth_seq_id_var1 <chr>, auth_asym_id_var1 <chr>,
#> # label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> # label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <chr>,
#> # auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#>
# }
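To make the effect of distance_cutoff and return_min_residue_distance more concrete, the idea can be reproduced on toy coordinates in base R. This is a conceptual sketch only and not the package's implementation; the coordinates, column names and helper objects are invented for illustration.

```r
# Toy atom coordinates in Angstrom for three residues (made-up values).
atoms <- data.frame(
  residue = c(1, 1, 2, 2, 3),
  x = c(0, 1, 4, 5, 20),
  y = c(0, 0, 0, 0, 0),
  z = c(0, 0, 0, 0, 0)
)

# All pairwise Euclidean atom distances as a full symmetric matrix.
d <- as.matrix(dist(atoms[, c("x", "y", "z")]))

# Long format: one row per ordered atom pair, including self pairs.
pairs <- expand.grid(
  atom1 = seq_len(nrow(atoms)),
  atom2 = seq_len(nrow(atoms))
)
pairs$distance <- d[cbind(pairs$atom1, pairs$atom2)]
pairs$residue1 <- atoms$residue[pairs$atom1]
pairs$residue2 <- atoms$residue[pairs$atom2]

# Atom-level map: keep only pairs within the cutoff.
distance_cutoff <- 10
atom_map <- pairs[pairs$distance <= distance_cutoff, ]

# Residue-level map: the minimum atom distance per residue pair.
residue_map <- aggregate(
  distance ~ residue1 + residue2,
  data = atom_map,
  FUN = min
)
```

The residue-level table has one row per residue pair instead of one row per atom pair, which is why minimum residue distances give smaller contact maps.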