Finds peptide positions in a PDB structure based on positional matching

The positions of a peptide in UniProt and in a PDB structure often differ, because structures frequently cover only part of the protein sequence. This function maps a peptide onto a PDB structure based on its UniProt start and end positions. Unlike aligning the peptide sequence to the sequence of the PDB structure, positional matching also finds the peptide if the structure contains truncations or mismatches. The function also provides an easy way to check whether a peptide is present in a PDB structure.
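The idea of positional matching can be illustrated with a minimal sketch. It is not the function's actual implementation: the helper map_peptide_to_chain() and its arguments (chain_start, chain_end, offset) are hypothetical, and it assumes a single chain that covers one contiguous UniProt range with a constant shift between UniProt and structure numbering.

```r
# Hypothetical helper for illustration only (not part of the package).
# chain_start / chain_end: UniProt positions covered by one structure chain.
# offset: assumed constant shift from UniProt to structure numbering.
map_peptide_to_chain <- function(pep_start, pep_end,
                                 chain_start, chain_end,
                                 offset = 0) {
  # The overlap between the peptide range and the chain range
  overlap_start <- max(pep_start, chain_start)
  overlap_end <- min(pep_end, chain_end)

  # No overlap: the peptide is not present in this chain
  if (overlap_start > overlap_end) {
    return(data.frame(
      fit_type = NA_character_,
      seq_id_start = NA_real_,
      seq_id_end = NA_real_,
      stringsAsFactors = FALSE
    ))
  }

  # "fully" if the whole peptide lies within the chain, otherwise "partial"
  fit_type <- if (overlap_start == pep_start && overlap_end == pep_end) {
    "fully"
  } else {
    "partial"
  }

  data.frame(
    fit_type = fit_type,
    seq_id_start = overlap_start + offset,
    seq_id_end = overlap_end + offset,
    stringsAsFactors = FALSE
  )
}

# A peptide at UniProt positions 1160-1198 and a chain covering 1150-1190:
# only positions 1160-1190 overlap, so the fit is "partial" (positions 11 to 41)
map_peptide_to_chain(1160, 1198, chain_start = 1150, chain_end = 1190, offset = -1149)
```

Because only positions are compared, the peptide sequence itself never has to match the structure sequence, which is why mismatches do not prevent a hit.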

find_peptide_in_structure(
  peptide_data,
  peptide,
  start,
  end,
  uniprot_id,
  pdb_data = NULL,
  retain_columns = NULL
)

Arguments

peptide_data: a data frame containing at least the input columns to this function.
peptide: a character column in the peptide_data data frame that contains the sequence or any other unique identifier for the peptide that should be found.
start: a numeric column in the peptide_data data frame that contains start positions of peptides.
end: a numeric column in the peptide_data data frame that contains end positions of peptides.
uniprot_id: a character column in the peptide_data data frame that contains UniProt identifiers that correspond to the peptides.
pdb_data: optional, a data frame containing data obtained with fetch_pdb(). If not provided, the information is fetched automatically. If the function is run multiple times, it is faster to fetch the information once and pass it to the function. If provided, make sure that the column names are identical to the ones obtained by calling fetch_pdb().
retain_columns: a vector of columns from the input data frame that should be retained in the output. By default no additional columns are retained (retain_columns = NULL). Specific columns can be retained by providing their names in a vector (not in quotation marks, just like other column names).
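The two optional arguments can be combined as in the following sketch. It rests on several assumptions: that the function is provided by the protti package, that fetch_pdb() accepts PDB identifiers through an argument named pdb_ids, and that an internet connection is available. The condition column is made up for illustration.

```r
# Example input with one extra annotation column ("condition" is a made-up column)
peptide_data <- data.frame(
  uniprot_id = c("P0A8T7", "P0A8T7"),
  peptide_sequence = c(
    "SGIVSFGKETKGKRRLVITPVDGSDPYEEMIPKWRQLNV",
    "NVFEGERVER"
  ),
  start = c(1160, 1197),
  end = c(1198, 1206),
  condition = c("treated", "control"),
  stringsAsFactors = FALSE
)

# Guarded so the sketch is skipped when protti is not installed.
# Both calls below need internet access.
if (requireNamespace("protti", quietly = TRUE)) {
  # Fetch structure information once (pdb_ids is assumed to be the argument name)
  pdb_data <- protti::fetch_pdb(pdb_ids = c("3LU0", "4IQZ"))

  # Reuse the fetched data and keep the extra column in the output
  peptide_in_structure <- protti::find_peptide_in_structure(
    peptide_data = peptide_data,
    peptide = peptide_sequence,
    start = start,
    end = end,
    uniprot_id = uniprot_id,
    pdb_data = pdb_data,
    retain_columns = c(condition)
  )
}
```

Note that when pdb_data is supplied, only the structures it contains can be searched, so it should cover every structure of interest.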

Value

A data frame that contains peptide positions in the corresponding PDB structures. If a peptide is not found in any structure, or no structure is associated with the protein, the output columns contain NA values. The data frame contains the following columns, among others:

auth_asym_id: Chain identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure.
label_asym_id: Chain identifier following the standardised convention for mmCIF files.
peptide_seq_in_pdb: The sequence of the peptide mapped to the structure. If the peptide only maps partially, then only the part of the sequence that maps on the structure is returned.
fit_type: Either "partial" or "fully", indicating whether only part of the peptide or the complete peptide was found in the structure.
label_seq_id_start: Contains the first residue position of the peptide in the structure following the standardised convention for mmCIF files.
label_seq_id_end: Contains the last residue position of the peptide in the structure following the standardised convention for mmCIF files.
auth_seq_id_start: Contains the first residue position of the peptide in the structure based on the alternative residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure. This does not need to be numeric and is therefore of type character.
auth_seq_id_end: Contains the last residue position of the peptide in the structure based on the alternative residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure. This does not need to be numeric and is therefore of type character.
auth_seq_id: Contains all positions (separated by ";") of the peptide in the structure based on the alternative residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure. This does not need to be numeric and is therefore of type character.
n_peptides: The number of peptides from one protein that were searched for within the current structure.
n_peptides_in_structure: The number of peptides from one protein that were found within the current structure.
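Checking whether peptides are present in a structure then comes down to filtering on these columns. The sketch below uses a small stand-in data frame with invented values (the sequences and the "XXXX" identifier are placeholders, not real results) and only assumes the column names described above.

```r
# Toy stand-in for the function output; all values are invented for illustration
peptide_in_structure <- data.frame(
  peptide_sequence = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC"),
  pdb_ids = c("XXXX", "XXXX", NA),
  fit_type = c("fully", "partial", NA),
  stringsAsFactors = FALSE
)

# Peptides without a matching structure carry NA in the output columns
found <- peptide_in_structure[!is.na(peptide_in_structure$fit_type), ]

# Keep only peptides that map completely onto a structure
fully_mapped <- found[found$fit_type == "fully", ]

# Fraction of all peptides that were found completely
coverage <- nrow(fully_mapped) / nrow(peptide_in_structure)
```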

Examples

# \donttest{
# Create example data
peptide_data <- data.frame(
  uniprot_id = c("P0A8T7", "P0A8T7", "P60906"),
  peptide_sequence = c(
    "SGIVSFGKETKGKRRLVITPVDGSDPYEEMIPKWRQLNV",
    "NVFEGERVER",
    "AIGEVTDVVEKE"
  ),
  start = c(1160, 1197, 55),
  end = c(1198, 1206, 66)
)

# Find peptides in protein structure
peptide_in_structure <- find_peptide_in_structure(
  peptide_data = peptide_data,
  peptide = peptide_sequence,
  start = start,
  end = end,
  uniprot_id = uniprot_id
)
#> [2/6] Extract experimental conditions ... 
#> DONE (0.02s)
#> [3/6] Extracting polymer information: 
#> -> 1/6 UniProt IDs ... 
#> DONE (0.26s)
#> -> 2/6 UniProt alignment ... 
#> DONE (0.25s)
#> -> 3/6 Ligand binding sites ... 
#> DONE (2.43s)
#> -> 4/6 Modified monomers ... 
#> DONE (0.13s)
#> -> 5/6 Secondary structure ... 
#> DONE (1.23s)
#> -> 6/6 Unmodeled residues ... 
#> DONE (0.15s)
#> [4/6] Correct author sequence positions for some PDB IDs ... 
#> None to correct(0.21s)
#> [5/6] Extract non-polymer information ... 
#> DONE (0.01s)
#> [6/6] Combine information ... 
#> DONE (0.48s)

head(peptide_in_structure, n = 10)
#> # A tibble: 10 × 16
#>    uniprot_id pdb_ids auth_asym_id label_asym_id peptide_sequence               
#>    <chr>      <chr>   <chr>        <chr>         <chr>                          
#>  1 P0A8T7     3LU0    D            D             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#>  2 P0A8T7     3LU0    D            D             NVFEGERVER                     
#>  3 P0A8T7     4IQZ    A            A             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#>  4 P0A8T7     4IQZ    A            A             NVFEGERVER                     
#>  5 P0A8T7     4IQZ    B            B             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#>  6 P0A8T7     4IQZ    B            B             NVFEGERVER                     
#>  7 P0A8T7     4IQZ    C            C             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#>  8 P0A8T7     4IQZ    C            C             NVFEGERVER                     
#>  9 P0A8T7     4IQZ    D            D             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#> 10 P0A8T7     4IQZ    D            D             NVFEGERVER                     
#> # ℹ 11 more variables: peptide_seq_in_pdb <chr>, fit_type <chr>, start <dbl>,
#> #   end <dbl>, label_seq_id_start <dbl>, label_seq_id_end <dbl>,
#> #   auth_seq_id_start <chr>, auth_seq_id_end <chr>, auth_seq_id <chr>,
#> #   n_peptides <int>, n_peptides_in_structure <int>
# }