Fetches protein metadata from UniProt.
fetch_uniprot(
  uniprot_ids,
  columns = c("protein_name", "length", "sequence", "gene_names", "xref_geneid",
    "xref_string", "go_f", "go_p", "go_c", "cc_interaction", "ft_act_site", "ft_binding",
    "cc_cofactor", "cc_catalytic_activity", "xref_pdb"),
  batchsize = 200,
  max_tries = 10,
  timeout = 20,
  show_progress = TRUE
)
uniprot_ids: a character vector of UniProt accession numbers.
columns: a character vector of metadata columns that should be imported from UniProt (all possible columns are listed in the UniProt documentation on return fields). For cross-referenced databases, provide the database name with the prefix "xref_", e.g. "xref_pdb".
batchsize: a numeric value that specifies the number of proteins processed in a single query. Default and maximum value is 200.
max_tries: a numeric value that specifies the number of times the function tries to download the data in case an error occurs. Default is 10.
timeout: a numeric value that specifies the maximum request time per try. Default is 20 seconds.
show_progress: a logical value that determines if a progress bar will be shown. Default is TRUE.
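The arguments above can be combined to request fewer columns or to tolerate a slow connection. The following is a minimal sketch, not part of the original reference: the wrapper name fetch_subset is hypothetical, and the chosen values for batchsize, max_tries and timeout are illustrative assumptions. The wrapper only calls fetch_uniprot() when it is invoked, which requires the package providing fetch_uniprot() to be attached and network access to UniProt.

```r
# Hypothetical helper (not part of the package): fetch only three
# metadata columns, using smaller batches and a longer timeout.
fetch_subset <- function(ids) {
  fetch_uniprot(
    uniprot_ids = ids,
    # Column names taken from the default set shown in the usage above.
    columns = c("protein_name", "gene_names", "xref_pdb"),
    batchsize = 100,       # illustrative value; must not exceed 200
    max_tries = 5,         # illustrative value; retry up to five times
    timeout = 60,          # illustrative value; seconds per try
    show_progress = FALSE  # suppress the progress bar
  )
}
```

Calling fetch_subset(c("P36578", "O43324", "Q00796")) would then be expected to return one row per protein with the accession and input_id columns plus the three requested columns.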
A data frame that contains all protein metadata specified in columns for the proteins provided. The input_id column contains the provided UniProt IDs. If an invalid ID was provided that contains a valid UniProt ID, the valid portion of the ID is still fetched and present in the accession column, while the input_id column contains the original not completely valid ID.
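Because accession and input_id differ only for inputs that were not completely valid, comparing the two columns is one way to find those rows. The sketch below uses a small hand-built data frame as a stand-in for a fetch_uniprot() result, so that it runs without network access. Only the two ID columns are reproduced, and the variable names are illustrative.

```r
# Stand-in for a fetch_uniprot() result: the accession and input_id
# values mirror the "not completely valid ID" case described above.
result <- data.frame(
  accession = c("P02545", "P02545", "P20700"),
  input_id = c("P02545", "P02545;P20700", "P02545;P20700"),
  stringsAsFactors = FALSE
)

# Rows whose input was not a clean accession: input_id differs from accession.
partially_valid <- result[result$input_id != result$accession, ]

# One row per protein: keep the first occurrence of each accession.
unique_proteins <- result[!duplicated(result$accession), ]
```

Here partially_valid holds the two rows that came from the combined input "P02545;P20700", and unique_proteins drops the repeated P02545 entry.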
# \donttest{
fetch_uniprot(c("P36578", "O43324", "Q00796"))
#> # A tibble: 3 × 17
#> accession input_id protein_name length sequence gene_names xref_geneid
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 O43324 O43324 Eukaryotic translat… 174 MAAAAEL… EEF1E1 AI… 9521;
#> 2 P36578 P36578 Large ribosomal sub… 427 MACARPL… RPL4 RPL1 6124;
#> 3 Q00796 Q00796 Sorbitol dehydrogen… 357 MAAAAKP… SORD 6652;
#> # ℹ 10 more variables: xref_string <chr>, go_f <chr>, go_p <chr>, go_c <chr>,
#> # cc_interaction <chr>, ft_act_site <lgl>, ft_binding <chr>,
#> # cc_cofactor <chr>, cc_catalytic_activity <chr>, xref_pdb <chr>
# Not completely valid ID
fetch_uniprot(c("P02545", "P02545;P20700"))
#> Warning: The following input IDs were found to contain valid uniprot accession
#> numbers. They were fetched and the original input ID can be found in
#> the "input_id" column: P02545;P20700
#> # A tibble: 3 × 17
#> accession input_id protein_name length sequence gene_names xref_geneid
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 P02545 P02545 Prelamin-A/C [… 664 METPSQR… LMNA LMN1 4000;
#> 2 P02545 P02545;P20700 Prelamin-A/C [… 664 METPSQR… LMNA LMN1 4000;
#> 3 P20700 P02545;P20700 Lamin-B1 586 MATATPV… LMNB1 LMN… 4001;
#> # ℹ 10 more variables: xref_string <chr>, go_f <chr>, go_p <chr>, go_c <chr>,
#> # cc_interaction <chr>, ft_act_site <lgl>, ft_binding <lgl>,
#> # cc_cofactor <lgl>, cc_catalytic_activity <lgl>, xref_pdb <chr>
# }