Fetches protein metadata from UniProt.

fetch_uniprot(
  uniprot_ids,
  columns = c("protein_name", "length", "sequence", "gene_names", "xref_geneid",
    "xref_string", "go_f", "go_p", "go_c", "cc_interaction", "ft_act_site", "ft_binding",
    "cc_cofactor", "cc_catalytic_activity", "xref_pdb"),
  batchsize = 200,
  show_progress = TRUE
)

Arguments

uniprot_ids

a character vector of UniProt accession numbers.

columns

a character vector of metadata columns that should be imported from UniProt (all possible columns are listed in UniProt's documentation of return fields). For cross-referenced databases, provide the database name with the prefix "xref_", e.g. "xref_pdb".

batchsize

a numeric value that specifies the number of proteins processed in a single query. Default and maximum value is 200.

show_progress

a logical value that determines if a progress bar will be shown. Default is TRUE.
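
To illustrate what batchsize means in practice, the following is a minimal sketch of how a vector of IDs could be divided into queries of at most 200 entries. It is not the function's actual implementation; the helper name split_into_batches and the placeholder IDs are assumptions made for this illustration only.

```r
# Hypothetical helper (not part of the package): split a vector of IDs into
# consecutive chunks that each hold at most `batchsize` elements.
split_into_batches <- function(ids, batchsize = 200) {
  stopifnot(is.numeric(batchsize), batchsize >= 1)
  # Element i is assigned to chunk ceiling(i / batchsize).
  split(ids, ceiling(seq_along(ids) / batchsize))
}

# Placeholder strings, not real UniProt accessions.
ids <- sprintf("ID%04d", 1:450)
batches <- split_into_batches(ids)

# Number of IDs in each chunk.
lengths(batches)
```

With 450 IDs and the default batchsize, this yields three chunks, so a call with that many IDs would correspond to three queries under this assumption.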

Value

A data frame that contains all protein metadata specified in columns for the proteins provided. The input_id column contains the UniProt IDs exactly as they were provided. If an input ID is not valid as a whole but contains one or more valid UniProt accessions, each valid accession is still fetched and reported in the accession column, while the input_id column keeps the original, not completely valid ID.
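
Because accession and input_id can differ, it can be useful to check which rows were recovered from a not completely valid input. The sketch below works on a hand-built data frame that reproduces only the two ID columns shown in the second example on this page, so it runs without querying UniProt; the variable names res, recovered and unique_hits are illustrative assumptions.

```r
# Hand-built stand-in for the fetch_uniprot() result of
# c("P02545", "P02545;P20700"); only the ID columns are reproduced.
res <- data.frame(
  accession = c("P02545", "P02545", "P20700"),
  input_id = c("P02545", "P02545;P20700", "P02545;P20700"),
  stringsAsFactors = FALSE
)

# Rows whose accession was extracted from a longer, not completely valid input.
recovered <- res[res$accession != res$input_id, ]
recovered$accession

# One row per accession, e.g. before joining the metadata to other data.
unique_hits <- res[!duplicated(res$accession), ]
nrow(unique_hits)
```

Whether to keep or drop the duplicated accession depends on how the result is joined back to the original data, for example by input_id or by accession.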

Examples

# \donttest{
fetch_uniprot(c("P36578", "O43324", "Q00796"))
#> # A tibble: 3 × 17
#>   accession input_id protein_name         length sequence gene_names xref_geneid
#>   <chr>     <chr>    <chr>                 <dbl> <chr>    <chr>      <chr>      
#> 1 O43324    O43324   Eukaryotic translat…    174 MAAAAEL… EEF1E1 AI… 9521;      
#> 2 P36578    P36578   Large ribosomal sub…    427 MACARPL… RPL4 RPL1  6124;      
#> 3 Q00796    Q00796   Sorbitol dehydrogen…    357 MAAAAKP… SORD       6652;      
#> # ℹ 10 more variables: xref_string <chr>, go_f <chr>, go_p <chr>, go_c <chr>,
#> #   cc_interaction <chr>, ft_act_site <lgl>, ft_binding <chr>,
#> #   cc_cofactor <chr>, cc_catalytic_activity <chr>, xref_pdb <chr>

# Not completely valid ID
fetch_uniprot(c("P02545", "P02545;P20700"))
#> Warning: The following input IDs were found to contain valid uniprot accession
#> numbers. They were fetched and the original input ID can be found in
#> the "input_id" column: P02545;P20700
#> # A tibble: 3 × 17
#>   accession input_id      protein_name    length sequence gene_names xref_geneid
#>   <chr>     <chr>         <chr>            <dbl> <chr>    <chr>      <chr>      
#> 1 P02545    P02545        Prelamin-A/C […    664 METPSQR… LMNA LMN1  4000;      
#> 2 P02545    P02545;P20700 Prelamin-A/C […    664 METPSQR… LMNA LMN1  4000;      
#> 3 P20700    P02545;P20700 Lamin-B1           586 MATATPV… LMNB1 LMN… 4001;      
#> # ℹ 10 more variables: xref_string <chr>, go_f <chr>, go_p <chr>, go_c <chr>,
#> #   cc_interaction <chr>, ft_act_site <lgl>, ft_binding <lgl>,
#> #   cc_cofactor <lgl>, cc_catalytic_activity <lgl>, xref_pdb <chr>
# }