Creates a contact map of all atoms from a structure file — create_structure_contact

Creates a contact map of a subset or of all atom or residue distances in a structure or AlphaFold prediction file. Contact maps are useful for identifying protein regions that are in close proximity in the folded protein. Additionally, regions that interact closely with a small molecule or metal ion can be identified without opening the structure in programs such as PyMOL or ChimeraX. For large datasets (more than 40 contact maps) it is recommended to use the parallel_create_structure_contact_map() function instead, regardless of whether maps should be created in parallel or sequentially.
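The idea behind a residue-level contact map can be sketched in base R with a handful of made-up atom coordinates. This is only an illustration of the concept (pairwise atom distances, reduced to the minimum distance per residue pair, then filtered by a cutoff) and not the package's actual implementation. The toy coordinates, the column names, and the use of `<=` for the cutoff comparison are assumptions.

```r
# Toy coordinates (in Angstrom) for five atoms in three residues.
# All values are made up for illustration.
atoms <- data.frame(
  residue = c(1, 1, 2, 2, 3),
  x = c(0, 1.5, 3, 4.5, 20),
  y = c(0, 0, 0, 0, 0),
  z = c(0, 0, 0, 0, 0)
)

# Pairwise Euclidean distances between all atoms
atom_dist <- as.matrix(dist(atoms[, c("x", "y", "z")]))

# Long format: one row per ordered atom pair
pairs <- expand.grid(
  atom1 = seq_len(nrow(atoms)),
  atom2 = seq_len(nrow(atoms))
)
pairs$residue1 <- atoms$residue[pairs$atom1]
pairs$residue2 <- atoms$residue[pairs$atom2]
pairs$distance <- atom_dist[cbind(pairs$atom1, pairs$atom2)]

# Reduce atom distances to the minimum distance per residue pair
min_residue_dist <- aggregate(
  distance ~ residue1 + residue2,
  data = pairs,
  FUN = min
)

# Keep only residue pairs within the cutoff (comparison operator assumed)
distance_cutoff <- 10
contacts <- min_residue_dist[min_residue_dist$distance <= distance_cutoff, ]
contacts
```

Residue 3 sits far away from the other two residues in this toy example, so after filtering it only appears in contact with itself.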

create_structure_contact_map(
  data,
  data2 = NULL,
  id,
  chain = NULL,
  auth_seq_id = NULL,
  distance_cutoff = 10,
  pdb_model_number_selection = c(0, 1),
  return_min_residue_distance = TRUE,
  show_progress = TRUE,
  export = FALSE,
  export_location = NULL,
  structure_file = NULL
)

Arguments

data: a data frame containing at least one column with PDB ID information, the name of which is provided to the id argument. If only this column is provided, all atom or residue distances are calculated. Additionally, a chain column can be present in the data frame, the name of which is provided to the chain argument. If chains are provided, only distances of these chains relative to the rest of the structure are calculated. Multiple chains can be provided in multiple rows. If chains are provided for one structure but not for another, the rows without chains should contain NAs. Furthermore, specific residue positions can be provided in the auth_seq_id column if the selection should be further reduced. It is not recommended to create full contact maps for more than a few structures due to time and memory limitations. If contact maps are created only for small regions, it is possible to create multiple maps at once. By default, distances of regions provided in this data frame to the complete structure are computed. If distances of regions from this data frame to another specific subset of regions should be computed, the second subset of regions can be provided through the optional data2 argument.
data2: optional, a data frame that contains a subset of regions for which distances to regions provided in the data data frame should be computed. If regions from the data data frame should be compared to the whole structure, data2 does not need to be provided. This data frame should have the same structure and column names as the data data frame.
id: a character column in the data data frame that contains PDB or UniProt IDs for structures or AlphaFold predictions of which contact maps should be created. If a structure not downloaded directly from PDB is provided (i.e. a locally stored structure file) to the structure_file argument, this column should contain "my_structure" as content.
chain: optional, a character column in the data data frame that contains chain identifiers for the structure file. Identifiers defined by the structure author should be used. Distances will be only calculated between the provided chains and the rest of the structure.
auth_seq_id: optional, a character (or numeric) column in the data data frame that contains semicolon-separated positions of regions for which distances should be calculated. This always needs to be provided in combination with a corresponding chain in chain. The positions should match the numbering defined by the structure author. For PDB structures this information can be obtained from the find_peptide_in_structure function. The corresponding column in the output is called auth_seq_id. If an AlphaFold prediction is provided, UniProt positions should be used. If single positions and not stretches of amino acids are provided, the column can be numeric and does not need to contain the semicolon separator.
distance_cutoff: a numeric value specifying the distance cutoff in Angstrom. All values for pairwise comparisons are calculated but only values smaller than this cutoff will be returned in the output. If a cutoff of e.g. 5 is selected then only residues with a distance of 5 Angstrom and less are returned. Using a small value can reduce the size of the contact map drastically and is therefore recommended. The default value is 10.
pdb_model_number_selection: a numeric vector specifying which models from the structure files should be considered for contact maps. For example, NMR structures often contain many models in one file. The default for this argument is c(0, 1). This means the first model of each structure file is selected for contact map calculations. For AlphaFold predictions the model number is 0 (only .pdb files), therefore this case is also included here.
return_min_residue_distance: a logical value that specifies if the contact map should be returned for all atom distances or the minimum residue distances. Minimum residue distances are smaller in size. If atom distances are not strictly needed it is recommended to set this argument to TRUE. The default is TRUE.
show_progress: a logical value that specifies if a progress bar will be shown (default is TRUE).
export: a logical value that indicates if contact maps should be exported as ".csv". The name of the file will be the structure ID. Default is export = FALSE.
export_location: optional, a character value that specifies the path to the location in which the contact maps should be saved if export = TRUE. If left empty, the contact maps will be saved in the current working directory. The location should be provided in the following format "folderA/folderB".
structure_file: optional, a character value that specifies the path to the location and name of a structure file in ".cif" or ".pdb" format for which a contact map should be created. All other arguments can be provided as usual with the exception of the id column in the data data frame, which should not contain a PDB or UniProt ID but a character vector containing only "my_structure".
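
The input conventions described for data, data2, auth_seq_id and structure_file can be illustrated with a small base R sketch. It only builds the input data frames and does not call the function. The chain "B" and positions 10 to 15 used for data2, as well as the object name local_data, are hypothetical and chosen purely for illustration.

```r
# Collapse positions of interest into the semicolon-separated format
# expected in the auth_seq_id column
positions <- 1:7
auth_seq_id_string <- paste(positions, collapse = ";")

# Main selection: one row per structure (or per chain of a structure).
# Rows without a chain or position selection contain NA.
data <- data.frame(
  pdb_id = c("6NPF", "1C14", "3NIR"),
  chain = c("A", "A", NA),
  auth_seq_id = c(auth_seq_id_string, NA, NA),
  stringsAsFactors = FALSE
)

# Optional second selection with the same structure and column names.
# Chain and positions are hypothetical.
data2 <- data.frame(
  pdb_id = "6NPF",
  chain = "B",
  auth_seq_id = paste(10:15, collapse = ";"),
  stringsAsFactors = FALSE
)

# Input for a locally stored structure file passed via structure_file:
# the id column contains "my_structure" instead of a PDB or UniProt ID
local_data <- data.frame(
  pdb_id = "my_structure",
  chain = "A",
  auth_seq_id = NA_character_,
  stringsAsFactors = FALSE
)

# data2 should mirror the column names of data
stopifnot(identical(names(data), names(data2)))
```

Building the position string with paste() avoids typing long semicolon-separated lists by hand when a stretch of residues is selected.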

Value

A list of contact maps for each PDB or UniProt ID provided in the input is returned. If the export argument is TRUE, each contact map will be saved as a ".csv" file in the current working directory or the location provided to the export_location argument.
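
Because the return value is a list with one element per ID, it is often convenient to combine the elements into a single data frame before filtering. The sketch below uses a small hand-made stand-in for the returned list with only a few of the output columns. The distance values are made up, and the 5 Angstrom threshold is an arbitrary choice for illustration.

```r
# Toy stand-in for the returned list; real contact maps contain more columns
contact_maps <- list(
  "6NPF" = data.frame(
    id = "6NPF",
    auth_seq_id_var1 = c("1", "1", "2"),
    auth_seq_id_var2 = c("2", "40", "3"),
    min_distance_residue = c(1.3, 8.9, 1.3),
    stringsAsFactors = FALSE
  ),
  "3NIR" = data.frame(
    id = "3NIR",
    auth_seq_id_var1 = c("1", "1"),
    auth_seq_id_var2 = c("2", "3"),
    min_distance_residue = c(1.2, 6.4),
    stringsAsFactors = FALSE
  )
)

# Each element is named after its structure ID
names(contact_maps)

# Stack all contact maps; the id column keeps track of the structure
all_maps <- do.call(rbind, contact_maps)
rownames(all_maps) <- NULL

# Keep only close contacts (threshold chosen for illustration)
close_contacts <- all_maps[all_maps$min_distance_residue <= 5, ]

# Number of close contacts per structure
table(close_contacts$id)
```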

Examples

# \donttest{
# Create example data
data <- data.frame(
  pdb_id = c("6NPF", "1C14", "3NIR"),
  chain = c("A", "A", NA),
  auth_seq_id = c("1;2;3;4;5;6;7", NA, NA)
)

# Create contact map
contact_maps <- create_structure_contact_map(
  data = data,
  id = pdb_id,
  chain = chain,
  auth_seq_id = auth_seq_id,
  return_min_residue_distance = TRUE
)

str(contact_maps[["3NIR"]])
#> tibble [8,062 × 14] (S3: tbl_df/tbl/data.frame)
#>  $ label_comp_id_var1  : chr [1:8062] "THR" "THR" "THR" "THR" ...
#>  $ label_seq_id_var1   : num [1:8062] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ label_asym_id_var1  : chr [1:8062] "A" "A" "A" "A" ...
#>  $ auth_comp_id_var1   : chr [1:8062] "THR" "THR" "THR" "THR" ...
#>  $ auth_seq_id_var1    : chr [1:8062] "1" "1" "1" "1" ...
#>  $ auth_asym_id_var1   : chr [1:8062] "A" "A" "A" "A" ...
#>  $ label_comp_id_var2  : chr [1:8062] "THR" "THR" "CYS" "ARG" ...
#>  $ label_seq_id_var2   : num [1:8062] 1 2 3 10 13 17 23 27 32 33 ...
#>  $ label_asym_id_var2  : chr [1:8062] "A" "A" "A" "A" ...
#>  $ auth_comp_id_var2   : chr [1:8062] "THR" "THR" "CYS" "ARG" ...
#>  $ auth_seq_id_var2    : chr [1:8062] "1" "2" "3" "10" ...
#>  $ auth_asym_id_var2   : chr [1:8062] "A" "A" "A" "A" ...
#>  $ id                  : chr [1:8062] "3NIR" "3NIR" "3NIR" "3NIR" ...
#>  $ min_distance_residue: num [1:8062] 0 1.18 3.23 4.72 6.71 ...

contact_maps
#> $`6NPF`
#> # A tibble: 330 × 14
#>    label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#>    <chr>                          <dbl> <chr>              <chr>            
#>  1 SER                               19 A                  SER              
#>  2 SER                               19 A                  SER              
#>  3 SER                               19 A                  SER              
#>  4 SER                               19 A                  SER              
#>  5 SER                               19 A                  SER              
#>  6 SER                               19 A                  SER              
#>  7 SER                               19 A                  SER              
#>  8 SER                               19 A                  SER              
#>  9 SER                               19 A                  SER              
#> 10 SER                               19 A                  SER              
#> # ℹ 320 more rows
#> # ℹ 10 more variables: auth_seq_id_var1 <chr>, auth_asym_id_var1 <chr>,
#> #   label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> #   label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <chr>,
#> #   auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#> 
#> $`1C14`
#> # A tibble: 18,553 × 14
#>    label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#>    <chr>                          <dbl> <chr>              <chr>            
#>  1 GLY                                2 A                  GLY              
#>  2 GLY                                2 A                  GLY              
#>  3 GLY                                2 A                  GLY              
#>  4 GLY                                2 A                  GLY              
#>  5 GLY                                2 A                  GLY              
#>  6 GLY                                2 A                  GLY              
#>  7 GLY                                2 A                  GLY              
#>  8 GLY                                2 A                  GLY              
#>  9 GLY                                2 A                  GLY              
#> 10 GLY                                2 A                  GLY              
#> # ℹ 18,543 more rows
#> # ℹ 10 more variables: auth_seq_id_var1 <chr>, auth_asym_id_var1 <chr>,
#> #   label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> #   label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <chr>,
#> #   auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#> 
#> $`3NIR`
#> # A tibble: 8,062 × 14
#>    label_comp_id_var1 label_seq_id_var1 label_asym_id_var1 auth_comp_id_var1
#>    <chr>                          <dbl> <chr>              <chr>            
#>  1 THR                                1 A                  THR              
#>  2 THR                                1 A                  THR              
#>  3 THR                                1 A                  THR              
#>  4 THR                                1 A                  THR              
#>  5 THR                                1 A                  THR              
#>  6 THR                                1 A                  THR              
#>  7 THR                                1 A                  THR              
#>  8 THR                                1 A                  THR              
#>  9 THR                                1 A                  THR              
#> 10 THR                                1 A                  THR              
#> # ℹ 8,052 more rows
#> # ℹ 10 more variables: auth_seq_id_var1 <chr>, auth_asym_id_var1 <chr>,
#> #   label_comp_id_var2 <chr>, label_seq_id_var2 <dbl>,
#> #   label_asym_id_var2 <chr>, auth_comp_id_var2 <chr>, auth_seq_id_var2 <chr>,
#> #   auth_asym_id_var2 <chr>, id <chr>, min_distance_residue <dbl>
#> 
# }
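
The contact maps shown above are in long format, with one row per residue pair. For plotting or quick lookups a square residue-by-residue matrix can be easier to handle. The following base R sketch shows one possible conversion on a made-up contact map that only carries three of the output columns. Pairs that are missing from the long table (for example because they exceed the distance cutoff) stay NA in the matrix. The residue numbers and distances are invented for illustration.

```r
# Toy contact map in long format (values are made up)
contact_map <- data.frame(
  auth_seq_id_var1 = c("1", "1", "2", "2", "2", "3", "3"),
  auth_seq_id_var2 = c("1", "2", "1", "2", "3", "2", "3"),
  min_distance_residue = c(0, 1.3, 1.3, 0, 1.4, 1.4, 0),
  stringsAsFactors = FALSE
)

# All residue positions that occur on either side of a pair
residues <- as.character(1:3)

# Empty square matrix, indexed by residue position
contact_matrix <- matrix(
  NA_real_,
  nrow = length(residues),
  ncol = length(residues),
  dimnames = list(residues, residues)
)

# Fill the matrix by (row name, column name) pairs
index <- cbind(contact_map$auth_seq_id_var1, contact_map$auth_seq_id_var2)
contact_matrix[index] <- contact_map$min_distance_residue

contact_matrix
```

Indexing with a two-column character matrix matches rows and columns by name, so the order of rows in the long table does not matter.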