Extract metal-binding protein information from UniProt — extract_metal

Information on metal-binding proteins is extracted from UniProt data retrieved with fetch_uniprot and from QuickGO data retrieved with fetch_quickgo.

extract_metal_binders(
  data_uniprot,
  data_quickgo,
  data_chebi = NULL,
  data_chebi_relation = NULL,
  data_eco = NULL,
  data_eco_relation = NULL,
  show_progress = TRUE
)

Arguments

data_uniprot: a data frame containing at least the ft_binding, cc_cofactor and cc_catalytic_activity columns.
data_quickgo: a data frame containing molecular function gene ontology information for at least the proteins of interest. This data should be obtained by calling fetch_quickgo().
data_chebi: optional, a data frame that can be manually obtained with fetch_chebi(stars = c(2, 3)). It should contain 2 and 3 star entries. If not provided it will be fetched within the function. If the function is run many times it is recommended to provide the data frame to save time.
data_chebi_relation: optional, a data frame that can be manually obtained with fetch_chebi(relation = TRUE). If not provided it will be fetched within the function. If the function is run many times it is recommended to provide the data frame to save time.
data_eco: optional, a data frame that contains evidence and conclusion ontology data that can be obtained by calling fetch_eco(). If not provided it will be fetched within the function. If the function is run many times it is recommended to provide the data frame to save time.
data_eco_relation: optional, a data frame that contains relational evidence and conclusion ontology data that can be obtained by calling fetch_eco(return_relation = TRUE). If not provided it will be fetched within the function. If the function is run many times it is recommended to provide the data frame to save time.
show_progress: a logical value that specifies if progress will be shown (default is TRUE).

Value

A data frame containing information on protein metal binding state. It contains the following columns:

accession: UniProt protein identifier.
most_specific_id: ChEBI ID that is most specific for the position after combining information from all sources. Can be multiple IDs separated by "," if a position appears multiple times due to multiple fitting IDs.
most_specific_id_name: The name of the ID in the most_specific_id column. This information is based on ChEBI.
ligand_identifier: A ligand identifier that is unique per ligand per protein. It consists of the ligand ID and ligand name. The ligand ID counts the number of ligands of the same type per protein.
ligand_position: The amino acid position of the residue interacting with the ligand.
binding_mode: Contains information about the way the amino acid residue interacts with the ligand. If it is "covalent" then the residue is not in direct contact with the metal but only with the cofactor that binds the metal.
metal_function: Contains information about the function of the metal, e.g. "catalytic".
metal_id_part: Contains a ChEBI ID that identifies the metal part of the ligand. This is always the metal atom.
metal_id_part_name: The name of the ID in the metal_id_part column. This information is based on ChEBI.
note: Contains notes associated with information based on cofactors.
chebi_id: Contains the original ChEBI IDs the information is based on.
source: Contains the sources of the information. This can consist of "binding", "cofactor", "catalytic_activity" and "go_term".
eco: If the annotation is based on evidence, it is annotated with an ECO ID. The information is split by source.
eco_type: The ECO identifier can fall into the "manual_assertion" group for manually curated annotations or the "automatic_assertion" group for automatically generated annotations. If there is no evidence it is annotated as "automatic_assertion". The information is split by source.
evidence_source: The original sources (e.g. literature, PDB) of evidence annotations split by source.
reaction: Contains information about the chemical reaction catalysed by the protein that involves the metal. Can contain the EC ID, Rhea ID, direction specific Rhea ID, direction of the reaction and evidence for the direction.
go_term: Contains gene ontology terms if there are any metal related ones associated with the annotation.
go_name: Contains gene ontology names if there are any metal related ones associated with the annotation.
assigned_by: Contains information about the source of the gene ontology term assignment.
database: Contains information about the source of the ChEBI annotation associated with gene ontology terms.

For each protein identifier the data frame contains information on the bound ligand as well as on its position, if it is known. Since information about metal ligands can come from multiple sources, additional information (e.g. evidence) is nested in the returned data frame. To unnest the relevant information, the following steps have to be taken:

First, there can be multiple IDs in the "most_specific_id" column. This means that one position cannot be uniquely attributed to one specific ligand, even with the same ligand_identifier. In the "most_specific_id" column those instances are separated by ",", while in the other columns the corresponding information is separated by "||". Second, information should be split based on the source (not the source column; that one can be removed from the data frame). Certain columns are associated with specific sources (e.g. go_term is associated with the "go_term" source). Values of columns that are not relevant for a certain source should be replaced with NA. Third, since a most_specific_id can have multiple chebi_ids associated with it, the chebi_id column and its associated columns need to be unnested; in these columns information is separated by "|". Finally, evidence and additional information can be unnested by first splitting on ";;" and then on ";".
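The separator handling described above can be sketched in base R. This is a minimal illustration and not the package's own code: the ID and name strings are taken from the first row of the example output on this page, while the `evidence` string is a hypothetical placeholder that only demonstrates the two-level ";;" and ";" split. The variable names are assumptions made for this sketch.

```r
# Values copied from the example output shown in the Examples section
most_specific_id <- "29033,29034"
most_specific_id_name <- "iron(2+)||iron(3+)"

# IDs are separated by "," while the matching names are separated by "||".
# fixed = TRUE treats the separators literally instead of as regular expressions.
ids <- strsplit(most_specific_id, ",", fixed = TRUE)[[1]]
id_names <- strsplit(most_specific_id_name, "||", fixed = TRUE)[[1]]

# One row per most specific ID
unnested <- data.frame(
  most_specific_id = ids,
  most_specific_id_name = id_names,
  stringsAsFactors = FALSE
)

# Hypothetical placeholder string (not real annotation data):
# groups are separated by ";;" and items within a group by ";"
evidence <- "a;b;;c"
evidence_groups <- strsplit(evidence, ";;", fixed = TRUE)[[1]]
evidence_items <- lapply(
  evidence_groups,
  function(group) strsplit(group, ";", fixed = TRUE)[[1]]
)
```

For a full data frame, the same splits would be applied column by column, for instance with tidyr::separate_rows() if tidyr is available. Which columns carry which separator should be checked against the actual output.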

Examples

# \donttest{
# Create example data

uniprot_ids <- c("P00393", "P06129", "A0A0C5Q309", "A0A0C9VD04")

## UniProt data
data_uniprot <- fetch_uniprot(
  uniprot_ids = uniprot_ids,
  columns = c(
    "ft_binding",
    "cc_cofactor",
    "cc_catalytic_activity"
  )
)

## QuickGO data
data_quickgo <- fetch_quickgo(
  id_annotations = uniprot_ids,
  ontology_annotations = "molecular_function"
)
#> Retrieving GO annotations ... 
#> DONE(0.57s)

## ChEBI data (2 and 3 star entries)
data_chebi <- fetch_chebi(stars = c(2, 3))
data_chebi_relation <- fetch_chebi(relation = TRUE)

## ECO data
eco <- fetch_eco()
eco_relation <- fetch_eco(return_relation = TRUE)

# Extract metal binding information
metal_info <- extract_metal_binders(
  data_uniprot = data_uniprot,
  data_quickgo = data_quickgo,
  data_chebi = data_chebi,
  data_chebi_relation = data_chebi_relation,
  data_eco = eco,
  data_eco_relation = eco_relation
)
#> Preparing annotation data frames ... 
#> DONE (1.65s)
#> Extract ft_binding information from UniProt ... 
#> DONE (0.19s)
#> Extract cc_cofactor information from UniProt ... 
#> DONE (0.04s)
#> Extract cc_catalytic_activity information from UniProt ... 
#> DONE (0.05s)
#> Extract molecular_function information from QuickGO ... 
#> DONE (0.03s)
#> Find ChEBI sub IDs ... 
#> DONE (0.6s)
#> Combine data ... 
#> DONE (0.25s)

metal_info
#> # A tibble: 35 × 20
#>    accession  most_specific_id most_specific_id_name ligand_identifier
#>    <chr>      <chr>            <chr>                 <chr>            
#>  1 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    1(Fe cation)     
#>  2 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    1(Fe cation)     
#>  3 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    1(Fe cation)     
#>  4 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    2(Fe cation)     
#>  5 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    1(Fe cation)     
#>  6 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    2(Fe cation)     
#>  7 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    2(Fe cation)     
#>  8 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    1(Fe cation)     
#>  9 A0A0C5Q309 29033,29034      iron(2+)||iron(3+)    2(Fe cation)     
#> 10 A0A0C9VD04 18420            magnesium(2+)         1(Mg(2+))        
#> # ℹ 25 more rows
#> # ℹ 16 more variables: ligand_position <dbl>, binding_mode <lgl>,
#> #   metal_function <lgl>, metal_id_part <chr>, metal_id_part_name <chr>,
#> #   note <chr>, chebi_id <chr>, source <chr>, eco <chr>, eco_type <chr>,
#> #   evidence_source <chr>, reaction <chr>, go_term <chr>, go_name <chr>,
#> #   assigned_by <chr>, database <chr>
# }