Creates a contact map of a subset or of all atom or residue distances in a structure or AlphaFold prediction file. Contact maps are a useful tool for the identification of protein regions that are in close proximity in the folded protein. Additionally, regions that are interacting closely with a small molecule or metal ion can be easily identified without the need to open the structure in programs such as PyMOL or ChimeraX.

create_structure_contact_map(
  data,
  id,
  chain = NULL,
  start_in_pdb = NULL,
  end_in_pdb = NULL,
  distance_cutoff = 10,
  pdb_model_number_selection = c(0, 1),
  return_min_residue_distance = TRUE,
  show_progress = TRUE,
  export = FALSE,
  export_location = NULL,
  structure_file = NULL
)

Arguments

data

a data frame containing at least a column with PDB IDs, the name of which is provided to the id argument. If only this column is provided, all atom or residue distances are calculated. Additionally, the data frame can contain a chain column, the name of which is provided to the chain argument. If chains are provided, only distances of these chains relative to the rest of the structure are calculated. Multiple chains can be provided in multiple rows. If chains are provided for one structure but not for another, the rows without a chain should contain NA. Furthermore, specific residue positions can be provided in start and end columns to reduce the selection further. Creating full contact maps for more than a few structures is not recommended because of the time and memory required. If contact maps are created only for small regions, multiple maps can be created at once.

id

a character column in the data data frame that contains PDB or UniProt IDs for structures or AlphaFold predictions of which contact maps should be created. If a structure not downloaded directly from PDB is provided (i.e. a locally stored structure file) to the structure_file argument, this column should contain "my_structure" as content.

chain

optional, a character column in the data data frame that contains chain identifiers for the structure file. Identifiers defined by the structure author should be used. Distances will only be calculated between the provided chains and the rest of the structure.

start_in_pdb

optional, a numeric column in the data data frame that contains start positions of regions for which distances should be calculated. This always needs to be provided in combination with a corresponding end position in end_in_pdb and a chain in chain. The positions should match the numbering defined by the structure author. For PDB structures this information can be obtained from the find_peptide_in_structure function. The corresponding column in its output is called auth_seq_id_start. If an AlphaFold prediction is provided, UniProt positions should be used.

end_in_pdb

optional, a numeric column in the data data frame that contains end positions of regions for which distances should be calculated. This always needs to be provided in combination with a corresponding start position in start_in_pdb and a chain in chain. The positions should match the numbering defined by the structure author. For PDB structures this information can be obtained from the find_peptide_in_structure function. The corresponding column in its output is called auth_seq_id_end. If an AlphaFold prediction is provided, UniProt positions should be used.
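Because start, end and chain have to be supplied together, it can help to check the input before starting a long calculation. The following is a minimal base R sketch; check_region_columns is a hypothetical helper written for illustration and is not part of the package, and the column names are only examples.

```r
# Hypothetical helper (not part of the package): returns the row numbers in
# which a region is only partially specified.
check_region_columns <- function(data, chain, start, end) {
  has_start <- !is.na(data[[start]])
  has_end <- !is.na(data[[end]])
  has_chain <- !is.na(data[[chain]])
  # Incomplete: only one of start/end is given, or a position is given
  # without a chain
  incomplete <- (has_start != has_end) | ((has_start | has_end) & !has_chain)
  which(incomplete)
}

# Example input; row 3 has a start position but neither end nor chain
regions <- data.frame(
  pdb_id = c("6NPF", "1C14", "3NIR"),
  chain = c("A", "A", NA),
  start = c(1, NA, 5),
  end = c(10, NA, NA)
)

problem_rows <- check_region_columns(
  regions,
  chain = "chain",
  start = "start",
  end = "end"
)
problem_rows
```

Rows reported by such a check would need either a complete start, end and chain triple or NA in both position columns.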

distance_cutoff

a numeric value specifying the distance cutoff in Angstrom. All pairwise distances are calculated, but only values up to this cutoff are returned in the output. For example, with a cutoff of 5, only residues with a distance of 5 Angstrom or less are returned. A small value can drastically reduce the size of the contact map and is therefore recommended. The default value is 10.
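The effect of the cutoff can be illustrated with a small base R sketch. This is a conceptual illustration on made-up coordinates, not the function's internal implementation; the inclusive comparison follows the description above that a cutoff of 5 keeps distances of 5 Angstrom.

```r
# Toy coordinates in Angstrom for four atoms (made-up values)
coords <- matrix(
  c(0, 0, 0,
    3, 0, 0,
    0, 4, 0,
    12, 0, 0),
  ncol = 3,
  byrow = TRUE,
  dimnames = list(c("a1", "a2", "a3", "a4"), c("x", "y", "z"))
)

# All pairwise Euclidean distances are computed first
distances <- as.matrix(dist(coords))

# Long format: one row per ordered pair, including self-pairs (distance 0)
pairs <- as.data.frame(as.table(distances), responseName = "distance")
names(pairs)[1:2] <- c("var1", "var2")

# Only pairs within the cutoff are kept
distance_cutoff <- 5
contacts <- pairs[pairs$distance <= distance_cutoff, ]
contacts
```

In this toy case the atom a4 is more than 5 Angstrom away from all other atoms, so it only appears paired with itself, which shows how a small cutoff shrinks the result.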

pdb_model_number_selection

a numeric vector specifying which models from the structure files should be considered for contact maps. For example, NMR structures often contain many models in one file. The default for this argument is c(0, 1), which selects the first model of each structure file for contact map calculations. AlphaFold predictions have the model number 0 (only .pdb files), which is why 0 is also included in the default.
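The model selection amounts to keeping only the atoms whose model number is in the supplied vector. A minimal sketch on a made-up atom table, assuming the model number is stored in a column called pdb_model_number (the column name is an assumption for illustration):

```r
# Made-up atom table with three models, as might occur in an NMR structure
atoms <- data.frame(
  pdb_model_number = c(1, 1, 2, 2, 3),
  label_atom_id = c("N", "CA", "N", "CA", "N")
)

# Default selection: model 0 (AlphaFold .pdb files) or model 1 (first model)
pdb_model_number_selection <- c(0, 1)

selected <- atoms[atoms$pdb_model_number %in% pdb_model_number_selection, ]
selected
```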

return_min_residue_distance

a logical value that specifies whether the contact map should be returned for all atom distances or for minimum residue distances. Contact maps of minimum residue distances are smaller in size. If atom distances are not strictly needed, it is recommended to set this argument to TRUE. The default is TRUE.
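The relationship between atom distances and minimum residue distances can be sketched in base R. The coordinates and the column names residue, residue_var1 and residue_var2 are made up for illustration; this is not the function's internal code.

```r
# Made-up atoms: two atoms each in residues 1 and 2, one atom in residue 3
atoms <- data.frame(
  residue = c(1, 1, 2, 2, 3),
  x = c(0, 1, 4, 5, 20),
  y = c(0, 0, 0, 0, 0),
  z = c(0, 0, 0, 0, 0)
)

# All atom-to-atom distances
atom_distances <- as.matrix(dist(atoms[, c("x", "y", "z")]))

# One row per ordered atom pair
pairs <- expand.grid(
  atom1 = seq_len(nrow(atoms)),
  atom2 = seq_len(nrow(atoms))
)
pairs$residue_var1 <- atoms$residue[pairs$atom1]
pairs$residue_var2 <- atoms$residue[pairs$atom2]
pairs$distance <- atom_distances[cbind(pairs$atom1, pairs$atom2)]

# Minimum residue distance: smallest atom distance per residue pair
min_residue_distance <- aggregate(
  distance ~ residue_var1 + residue_var2,
  data = pairs,
  FUN = min
)
min_residue_distance
```

Here 25 atom pairs collapse into 9 residue pairs, which is why the residue-level map is smaller than the atom-level map.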

show_progress

a logical value that specifies if a progress bar will be shown (default is TRUE).

export

a logical value that indicates if contact maps should be exported as ".csv". The name of the file will be the structure ID. Default is export = FALSE.

export_location

optional, a character value that specifies the path to the location in which the contact maps should be saved if export = TRUE. If left empty, they are saved in the current working directory. The location should be provided in the following format: "folderA/folderB".
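Roughly, the export corresponds to writing each list element to its own ".csv" file. The sketch below does this by hand in base R on a made-up list; the exact file names written by the function may differ, and the temporary directory only stands in for a real export location.

```r
# Made-up stand-in for the list returned by the function
contact_maps <- list(
  "6NPF" = data.frame(id = "6NPF", min_distance_residue = c(0, 1.3)),
  "3NIR" = data.frame(id = "3NIR", min_distance_residue = c(0, 1.18))
)

# Folder path without a trailing slash, as in "folderA/folderB"
export_location <- file.path(tempdir(), "contact_maps")
dir.create(export_location, showWarnings = FALSE, recursive = TRUE)

# One file per structure, named after the structure ID
for (structure_id in names(contact_maps)) {
  write.csv(
    contact_maps[[structure_id]],
    file = file.path(export_location, paste0(structure_id, ".csv")),
    row.names = FALSE
  )
}

exported <- list.files(export_location, pattern = "\\.csv$")
exported
```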

structure_file

optional, a character value that specifies the path to the location and name of a structure file in ".cif" or ".pdb" format for which a contact map should be created. All other arguments can be provided as usual with the exception of the id column in the data data frame, which should not contain a PDB or UniProt ID but a character vector containing only "my_structure".
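A call for a locally stored structure might look like the following sketch. The file path and the column name structure_id are hypothetical, and the sketch assumes the function is available from the protti package; the call is skipped if the package or the file is missing.

```r
# Hypothetical path to a locally stored structure; replace with a real file
structure_path <- "folderA/folderB/my_model.cif"

# The id column contains "my_structure" instead of a PDB or UniProt ID
local_data <- data.frame(
  structure_id = "my_structure",
  chain = "A"
)

# Only run when the package is installed and the file exists
if (requireNamespace("protti", quietly = TRUE) && file.exists(structure_path)) {
  local_map <- protti::create_structure_contact_map(
    data = local_data,
    id = structure_id,
    chain = chain,
    distance_cutoff = 5,
    structure_file = structure_path
  )
}
```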

Value

A list of contact maps for each PDB or UniProt ID provided in the input is returned. If the export argument is TRUE, each contact map will be saved as a ".csv" file in the current working directory or the location provided to the export_location argument.
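Since the result is a named list with one data frame per ID, the maps can be combined and filtered with ordinary data frame operations. A minimal base R sketch on a made-up list that reuses a few of the output column names; the values are invented and each toy structure has a single chain.

```r
# Made-up stand-in for the returned list
contact_maps <- list(
  "6NPF" = data.frame(
    id = "6NPF",
    auth_seq_id_var1 = c(19, 19, 19),
    auth_seq_id_var2 = c(19, 20, 45),
    min_distance_residue = c(0, 1.3, 8.9)
  ),
  "3NIR" = data.frame(
    id = "3NIR",
    auth_seq_id_var1 = c(1, 1, 1),
    auth_seq_id_var2 = c(1, 2, 3),
    min_distance_residue = c(0, 1.18, 3.23)
  )
)

# Combine all maps into one data frame; the id column keeps track of the source
all_contacts <- do.call(rbind, contact_maps)
rownames(all_contacts) <- NULL

# Drop self-pairs and keep residue pairs within 4 Angstrom
is_self <- all_contacts$auth_seq_id_var1 == all_contacts$auth_seq_id_var2
close_contacts <- all_contacts[!is_self & all_contacts$min_distance_residue <= 4, ]
close_contacts
```

For structures with several chains, the self-pair check would also need to compare the chain identifier columns.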

Examples

# \donttest{
# Create example data
data <- data.frame(
  pdb_id = c("6NPF", "1C14", "3NIR"),
  chain = c("A", "A", NA),
  start = c(1, NA, NA),
  end = c(10, NA, NA)
)

# Create contact map
contact_maps <- create_structure_contact_map(
  data = data,
  id = pdb_id,
  chain = chain,
  start_in_pdb = start,
  end_in_pdb = end,
  return_min_residue_distance = TRUE
)

str(contact_maps[["3NIR"]])
#> tibble [8,062 × 14] (S3: tbl_df/tbl/data.frame)
#>  $ label_comp_id_var1  : chr [1:8062] "THR" "THR" "THR" "THR" ...
#>  $ label_seq_id_var1   : num [1:8062] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ label_asym_id_var1  : chr [1:8062] "A" "A" "A" "A" ...
#>  $ auth_comp_id_var1   : chr [1:8062] "THR" "THR" "THR" "THR" ...
#>  $ auth_seq_id_var1    : num [1:8062] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ auth_asym_id_var1   : chr [1:8062] "A" "A" "A" "A" ...
#>  $ label_comp_id_var2  : chr [1:8062] "THR" "THR" "CYS" "ARG" ...
#>  $ label_seq_id_var2   : num [1:8062] 1 2 3 10 13 17 23 27 32 33 ...
#>  $ label_asym_id_var2  : chr [1:8062] "A" "A" "A" "A" ...
#>  $ auth_comp_id_var2   : chr [1:8062] "THR" "THR" "CYS" "ARG" ...
#>  $ auth_seq_id_var2    : num [1:8062] 1 2 3 10 13 17 23 27 32 33 ...
#>  $ auth_asym_id_var2   : chr [1:8062] "A" "A" "A" "A" ...
#>  $ id                  : chr [1:8062] "3NIR" "3NIR" "3NIR" "3NIR" ...
#>  $ min_distance_residue: num [1:8062] 0 1.18 3.23 4.72 6.71 ...

contact_maps
#> $`6NPF`
#> # A tibble: 504 × 14
#>    label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#>    <chr>                          <dbl> <chr>              <chr>
#>  1 SER                               19 A                  SER
#>  2 SER                               19 A                  SER
#>  3 SER                               19 A                  SER
#>  4 SER                               19 A                  SER
#>  5 SER                               19 A                  SER
#>  6 SER                               19 A                  SER
#>  7 SER                               19 A                  SER
#>  8 SER                               19 A                  SER
#>  9 SER                               19 A                  SER
#> 10 SER                               19 A                  SER
#> # … with 494 more rows, and 10 more variables: auth_seq_id_var1 <dbl>,
#> #   auth_asym_id_var1 <chr>, label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> #   label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <dbl>,
#> #   auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#>
#> $`1C14`
#> # A tibble: 18,553 × 14
#>    label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#>    <chr>                          <dbl> <chr>              <chr>
#>  1 GLY                                2 A                  GLY
#>  2 GLY                                2 A                  GLY
#>  3 GLY                                2 A                  GLY
#>  4 GLY                                2 A                  GLY
#>  5 GLY                                2 A                  GLY
#>  6 GLY                                2 A                  GLY
#>  7 GLY                                2 A                  GLY
#>  8 GLY                                2 A                  GLY
#>  9 GLY                                2 A                  GLY
#> 10 GLY                                2 A                  GLY
#> # … with 18,543 more rows, and 10 more variables: auth_seq_id_var1 <dbl>,
#> #   auth_asym_id_var1 <chr>, label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> #   label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <dbl>,
#> #   auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#>
#> $`3NIR`
#> # A tibble: 8,062 × 14
#>    label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#>    <chr>                          <dbl> <chr>              <chr>
#>  1 THR                                1 A                  THR
#>  2 THR                                1 A                  THR
#>  3 THR                                1 A                  THR
#>  4 THR                                1 A                  THR
#>  5 THR                                1 A                  THR
#>  6 THR                                1 A                  THR
#>  7 THR                                1 A                  THR
#>  8 THR                                1 A                  THR
#>  9 THR                                1 A                  THR
#> 10 THR                                1 A                  THR
#> # … with 8,052 more rows, and 10 more variables: auth_seq_id_var1 <dbl>,
#> #   auth_asym_id_var1 <chr>, label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> #   label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <dbl>,
#> #   auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#>
# }