Fetches structure metadata from RCSB. If you want to retrieve atom data such as positions, use the function fetch_pdb_structure().
fetch_pdb(pdb_ids, batchsize = 100, show_progress = TRUE)
A data frame that contains structure metadata for the PDB IDs provided. Some of its columns might not be self-explanatory:
auth_asym_id: Chain identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure.
label_asym_id: Chain identifier following the standardised convention for mmCIF files.
entity_beg_seq_id, ref_beg_seq_id, length, pdb_sequence: entity_beg_seq_id is a position in the structure sequence (pdb_sequence) that matches the position given in ref_beg_seq_id, which is a position within the protein sequence (not included in the data frame). length identifies the stretch of sequence for which positions match accordingly between structure and protein sequence. entity_beg_seq_id is a residue ID based on the standardised convention for mmCIF files.
auth_seq_id: Residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure. This character vector has the same length as pdb_sequence, and each position is the identifier for the matching amino acid position in pdb_sequence. The values are not necessarily numbers and do not have to be positive.
modified_monomer: Consists of the composition ID of the modification, followed by the label_seq_id position. The parent monomer identifiers are given in parentheses as they appear in the sequence.
ligand_*: Any column starting with the ligand_ prefix contains information about the position, identity and donors of ligand binding sites. Multiple ligand entities are separated by "|". Donor-level information within an entity is separated by ";".
secondary_structure: Contains information about helix and sheet secondary structure elements. Individual regions are separated by ";".
unmodeled_structure: Contains information about unmodeled or partially modeled regions in the model. Individual regions are separated by ";".
auth_seq_id_original: In some cases the sequence positions do not match the number of residues in the sequence, either because positions are missing or because they are duplicated. This always coincides with modified residues, but it does not occur in every sequence that contains a modified residue. This column contains the original auth_seq_id information, in which these positions have not been corrected.
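The relationship between entity_beg_seq_id, ref_beg_seq_id and length can be illustrated with a small sketch. It assumes that the aligned stretch is contiguous, so the offset between structure and protein positions is constant within it. The helper name map_to_protein_position and the example values are hypothetical and are not part of the package.

```r
# Hypothetical helper (not part of the package): converts a position in
# pdb_sequence to the matching protein sequence position, assuming the
# aligned stretch is contiguous with a constant offset.
map_to_protein_position <- function(position, entity_beg_seq_id,
                                    ref_beg_seq_id, length) {
  in_stretch <- position >= entity_beg_seq_id &
    position < entity_beg_seq_id + length
  # Positions outside the aligned stretch have no matching protein position
  ifelse(in_stretch,
         ref_beg_seq_id + (position - entity_beg_seq_id),
         NA_integer_)
}

# Made-up values: structure positions 3 to 12 align to protein positions
# 150 to 159, so positions 2 and 13 fall outside the stretch
protein_positions <- map_to_protein_position(
  position = c(2L, 3L, 7L, 12L, 13L),
  entity_beg_seq_id = 3L,
  ref_beg_seq_id = 150L,
  length = 10L
)
protein_positions
```

Positions that fall outside the aligned stretch are returned as NA rather than extrapolated, since the data frame gives no alignment information for them.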
# \donttest{
pdb <- fetch_pdb(pdb_ids = c("6HG1", "1E9I", "6D3Q", "4JHW"))
#> [2/6] Extract experimental conditions ...
#> DONE (0.02s)
#> [3/6] Extracting polymer information:
#> -> 1/6 UniProt IDs ...
#> DONE (0.01s)
#> -> 2/6 UniProt alignment ...
#> DONE (0.02s)
#> -> 3/6 Ligand binding sites ...
#> DONE (0.11s)
#> -> 4/6 Modified monomers ...
#> DONE (0.01s)
#> -> 5/6 Secondary structure ...
#> DONE (0.01s)
#> -> 6/6 Unmodeled residues ...
#> DONE (0.01s)
#> [4/6] Correct author sequence positions for some PDB IDs ...
#> None to correct(0.01s)
#> [5/6] Extract non-polymer information ...
#> DONE (0.01s)
#> [6/6] Combine information ...
#> DONE (0.02s)
head(pdb)
#> # A tibble: 6 × 46
#>   pdb_ids auth_asym_id label_asym_id reference_database_accession protein_name
#>   <chr>   <chr>        <chr>         <chr>                        <chr>
#> 1 6HG1    A            A             P27708                       Multifunction…
#> 2 6HG1    A            A             P27708                       Multifunction…
#> 3 6HG1    A            A             P27708                       Multifunction…
#> 4 6HG1    A            A             P27708                       Multifunction…
#> 5 6HG1    A            A             P05020                       Dihydroorotase
#> 6 6HG1    A            A             P05020                       Dihydroorotase
#> # ℹ 41 more variables: reference_database_name <chr>, entity_beg_seq_id <int>,
#> # ref_beg_seq_id <int>, length <int>, pdb_sequence <chr>, auth_seq_id <chr>,
#> # auth_seq_id_original <chr>, engineered_mutation <chr>,
#> # modified_monomer <chr>, ligand_donor_atom_id <chr>,
#> # ligand_donor_auth_seq_id <chr>, ligand_donor_label_seq_id <chr>,
#> # ligand_donor_id <chr>, ligand_label_asym_id <chr>, ligand_atom_id <chr>,
#> # ligand_id <chr>, ligand_entity_id <chr>, …
# }
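Because the ligand_ columns pack several values into one string, they usually need to be split before further use. The following base R sketch shows one way to do this with the "|" and ";" separators described above. The example string is made up for illustration and does not come from a real PDB entry.

```r
# Made-up example value: two ligand entities separated by "|", each with
# donor-level values separated by ";"
ligand_donor_auth_seq_id <- "45;47;112|201;203"

# fixed = TRUE treats "|" literally instead of as the regex alternation operator
entities <- strsplit(ligand_donor_auth_seq_id, split = "|", fixed = TRUE)[[1]]

# Split each entity into its donor-level values (one list element per entity)
donors <- strsplit(entities, split = ";", fixed = TRUE)
donors
```

The same two-step split should apply to any column that uses both separators, while columns that only use ";" need just the second step.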