Finds peptide positions in a PDB structure. Peptide positions in UniProt and in a PDB structure often differ because structures vary in length and do not necessarily cover the complete protein sequence. This function maps a peptide onto a PDB structure based on its UniProt positions. This approach is more robust than aligning the peptide sequence to the PDB structure sequence, since it can also match the peptide if the structure contains truncations or mismatches. The function also provides an easy way to check whether a peptide is present in a PDB structure.
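The idea of position-based mapping can be illustrated with a small base R sketch. This is a simplified illustration under stated assumptions, not the function's actual implementation: the helper `map_to_structure()`, its arguments `ref_beg`, `ref_end` and `pdb_beg`, and all numbers are hypothetical. It assumes a structure that covers one contiguous stretch of the UniProt sequence and is numbered consecutively.

```r
# Hypothetical helper (not part of the package): map UniProt positions onto a
# structure that covers UniProt residues ref_beg..ref_end and whose first
# covered residue carries the number pdb_beg in the structure.
map_to_structure <- function(start, end, ref_beg, ref_end, pdb_beg) {
  # No overlap: the peptide is not present in this structure
  if (end < ref_beg || start > ref_end) {
    return(data.frame(
      fit_type = NA_character_, pdb_start = NA_real_, pdb_end = NA_real_
    ))
  }
  # Clip the peptide to the range covered by the structure
  clipped_start <- max(start, ref_beg)
  clipped_end <- min(end, ref_end)
  fit_type <- if (clipped_start == start && clipped_end == end) {
    "fully"
  } else {
    "partial"
  }
  # Shift from UniProt numbering to structure numbering
  offset <- pdb_beg - ref_beg
  data.frame(
    fit_type = fit_type,
    pdb_start = clipped_start + offset,
    pdb_end = clipped_end + offset
  )
}

# Invented coverage: the structure spans UniProt residues 1150 to 1342
full <- map_to_structure(1197, 1206, ref_beg = 1150, ref_end = 1342, pdb_beg = 1)
partial <- map_to_structure(1140, 1160, ref_beg = 1150, ref_end = 1342, pdb_beg = 1)
```

Because the mapping relies on positions rather than on the residues themselves, a peptide that overhangs the covered range is still reported, here as a "partial" fit.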

find_peptide_in_structure(
  peptide_data,
  peptide,
  start,
  end,
  uniprot_id,
  pdb_data = NULL,
  retain_columns = NULL
)

## Arguments

• peptide_data: a data frame containing at least the input columns to this function.

• peptide: a character column in the peptide_data data frame that contains the sequence or any other unique identifier for the peptide that should be found.

• start: a numeric column in the peptide_data data frame that contains start positions of peptides.

• end: a numeric column in the peptide_data data frame that contains end positions of peptides.

• uniprot_id: a character column in the peptide_data data frame that contains UniProt identifiers that correspond to the peptides.

• pdb_data: optional, a data frame containing data obtained with fetch_pdb(). If not provided, the information is fetched automatically. If this function is run multiple times, it is faster to fetch the information once and provide it to the function. If provided, make sure that the column names are identical to the ones that would be obtained by calling fetch_pdb().

• retain_columns: a vector indicating if certain columns should be retained from the input data frame. By default (retain_columns = NULL) no additional columns are retained. Specific columns can be retained by providing their names in a vector (without quotation marks, just like other column names).
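The advice on pdb_data can be sketched as a small wrapper that reuses one previously fetched data frame across several calls. This is only an illustration: `find_in_many()` and `peptide_tables` are hypothetical names, the input tables are assumed to have the columns peptide_sequence, start, end, uniprot_id and an extra column called condition, and `pdb_data` is assumed to be the unmodified output of fetch_pdb(). The wrapper only runs when the package providing find_peptide_in_structure() is attached.

```r
# Hypothetical wrapper: map several peptide tables against the same,
# previously fetched PDB information instead of fetching it on every call.
# `pdb_data` is assumed to be the unmodified output of fetch_pdb().
find_in_many <- function(peptide_tables, pdb_data) {
  lapply(peptide_tables, function(peptide_table) {
    find_peptide_in_structure(
      peptide_data = peptide_table,
      peptide = peptide_sequence,
      start = start,
      end = end,
      uniprot_id = uniprot_id,
      pdb_data = pdb_data,
      # Hypothetical extra column to keep; note the unquoted name in c()
      retain_columns = c(condition)
    )
  })
}
```

Passing the same pdb_data to every call avoids repeated downloads, which is the speed-up the argument description refers to.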

## Value

A data frame that contains peptide positions in the corresponding PDB structures. If a peptide is not found in any structure, or no structure is associated with the protein, the data frame contains NA values for the output columns. In addition to the input columns, the data frame contains the following columns:

• auth_asym_id: Chain identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure.

• label_asym_id: Chain identifier following the standardised convention for mmCIF files.

• peptide_sequence_in_pdb: The sequence of the peptide mapped to the structure. If the peptide only maps partially, then only the part of the sequence that maps on the structure is returned.

• fit_type: The fit type is either "partial" or "fully" and it indicates if the complete peptide or only part of it was found in the structure.

• label_seq_id_start: Contains the first residue position of the peptide in the structure following the standardised convention for mmCIF files.

• label_seq_id_end: Contains the last residue position of the peptide in the structure following the standardised convention for mmCIF files.

• auth_seq_id_start: Contains the first residue position of the peptide in the structure based on the alternative residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure.

• auth_seq_id_end: Contains the last residue position of the peptide in the structure based on the alternative residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure.

• n_peptides: The number of peptides from one protein that were searched for within the current structure.

• n_peptides_in_structure: The number of peptides from one protein that were found within the current structure.
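As a rough sketch of how these columns might be used downstream, the base R snippet below works on a mock result. All values are invented for illustration, and only a subset of the documented columns is included (pdb_ids appears in the printed output of this function).

```r
# Mock result with a subset of the documented output columns (values invented)
result <- data.frame(
  uniprot_id = c("P0A8T7", "P0A8T7", "P0A8T7", "P60906"),
  pdb_ids = c("2LMC", "2LMC", "3IYD", NA),
  fit_type = c("fully", "partial", "fully", NA),
  n_peptides = c(2L, 2L, 2L, 1L),
  n_peptides_in_structure = c(2L, 2L, 1L, NA),
  stringsAsFactors = FALSE
)

# Peptides without any matching structure have NA in the output columns
not_found <- result[is.na(result$fit_type), ]

# Keep only peptides that map completely onto a structure
fully_mapped <- result[!is.na(result$fit_type) & result$fit_type == "fully", ]

# Fraction of searched peptides that were found, one row per structure
found <- result[!is.na(result$pdb_ids), ]
coverage <- unique(
  found[, c("uniprot_id", "pdb_ids", "n_peptides", "n_peptides_in_structure")]
)
coverage$fraction_found <- coverage$n_peptides_in_structure / coverage$n_peptides
```

A ratio like fraction_found is one possible way to rank structures when several are available for the same protein.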

## Examples

# Create example data
peptide_data <- data.frame(
  uniprot_id = c("P0A8T7", "P0A8T7", "P60906"),
  peptide_sequence = c(
    "SGIVSFGKETKGKRRLVITPVDGSDPYEEMIPKWRQLNV",
    "NVFEGERVER",
    "AIGEVTDVVEKE"
  ),
  start = c(1160, 1197, 55),
  end = c(1198, 1206, 66)
)

# Find peptides in protein structure
peptide_in_structure <- find_peptide_in_structure(
  peptide_data = peptide_data,
  peptide = peptide_sequence,
  start = start,
  end = end,
  uniprot_id = uniprot_id
)

# Show the first 10 rows of the result
head(peptide_in_structure, n = 10)
#> # A tibble: 10 × 15
#>    uniprot_id pdb_ids auth_asym_id label_asym_id peptide_sequence
#>    <chr>      <chr>   <chr>        <chr>         <chr>
#>  1 P0A8T7     2LMC    B            B             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#>  2 P0A8T7     2LMC    B            B             NVFEGERVER
#>  3 P0A8T7     3IYD    D            D             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#>  4 P0A8T7     3IYD    D            D             NVFEGERVER
#>  5 P0A8T7     4KN7    D            D             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#>  6 P0A8T7     4KN7    D            D             NVFEGERVER
#>  7 P0A8T7     4KN7    I            J             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#>  8 P0A8T7     4KN7    I            J             NVFEGERVER
#>  9 P0A8T7     4YLO    D            D             SGIVSFGKETKGKRRLVITPVDGSDPYEEM…
#> 10 P0A8T7     4YLO    D            D             NVFEGERVER
#> # … with 10 more variables: peptide_sequence_in_pdb <chr>, fit_type <chr>,
#> #   start <dbl>, end <dbl>, label_seq_id_start <dbl>, label_seq_id_end <dbl>,
#> #   auth_seq_id_start <dbl>, auth_seq_id_end <dbl>, n_peptides <int>,
#> #   n_peptides_in_structure <int>