Uses the predicted aligned error (PAE) of AlphaFold predictions to find possible protein domains. A graph-based community clustering algorithm (Leiden clustering) is applied to the predicted error (distance) between the residues of a protein in order to infer pseudo-rigid groups in the protein. This is useful, for example, to determine which parts of a predicted structure are likely to hold a fixed position relative to each other and which may vary in distance. This function is based on Python code written by Tristan Croll. The original code can be found on his GitHub page.

predict_alphafold_domain(
  pae_list,
  pae_power = 1,
  pae_cutoff = 5,
  graph_resolution = 1,
  return_data_frame = FALSE,
  show_progress = TRUE
)

Arguments

pae_list

a list of proteins that contains aligned errors for their AlphaFold predictions. This list can be retrieved with the fetch_alphafold_aligned_error() function. Each list element should contain columns for the scored residue (scored_residue), the aligned residue (aligned_residue) and the predicted aligned error (error).

pae_power

a numeric value; each edge in the graph is weighted proportionally to (1 / pae^pae_power). Default is 1.

pae_cutoff

a numeric value; graph edges are only created for residue pairs with pae < pae_cutoff. Default is 5.
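
The interplay of pae_cutoff and pae_power can be illustrated with a minimal base R sketch. It is not the function's internal code: the toy pae data frame and its values are invented for illustration, and only the column names (scored_residue, aligned_residue, error) are taken from the description of pae_list above.

```r
# Toy aligned-error table (values are invented for illustration only).
pae <- data.frame(
  scored_residue  = c(1, 1, 2, 2, 3),
  aligned_residue = c(2, 3, 1, 3, 1),
  error           = c(0.5, 8, 2, 4, 10)
)

pae_power  <- 1
pae_cutoff <- 5

# Keep only residue pairs whose predicted error is below the cutoff;
# each remaining row corresponds to one graph edge.
edges <- pae[pae$error < pae_cutoff, ]

# Weight each edge proportionally to 1 / pae^pae_power, so that
# confidently placed residue pairs (low error) get a high weight.
edges$weight <- 1 / edges$error^pae_power

print(edges)
```

With these toy values, three of the five residue pairs pass the cutoff. Raising pae_power would widen the gap between the weights of low-error and high-error pairs.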

graph_resolution

a numeric value that regulates how aggressive the clustering algorithm is. Smaller values lead to larger clusters, while higher values lead to stricter (i.e. smaller) clusters. The value should be larger than zero, and values larger than 5 are unlikely to be useful. The value is provided to the Leiden clustering algorithm of the igraph package as graph_resolution / 100. Default is 1.
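
As a rough sketch of how a resolution value of graph_resolution / 100 might be handed to igraph's Leiden implementation, the example below clusters a toy graph. The edge list is invented, and the choice of the "CPM" objective function is an assumption made for illustration; the documentation above does not state which objective the function uses. The resolution_parameter argument is named resolution in newer igraph releases, where the old name is deprecated and may trigger a warning.

```r
library(igraph)

# Toy weighted edge list: two triangles (residues 1-3 and 4-6)
# with no edges between them. Values are invented for illustration.
edges <- data.frame(
  from   = c(1, 1, 2, 4, 4, 5),
  to     = c(2, 3, 3, 5, 6, 6),
  weight = c(1, 1, 1, 1, 1, 1)
)

# Build an undirected graph; the weight column becomes an edge attribute.
g <- graph_from_data_frame(
  edges,
  directed = FALSE,
  vertices = data.frame(name = 1:6)
)

graph_resolution <- 1

# Assumption: the CPM objective is used here only as an example.
clusters <- cluster_leiden(
  g,
  objective_function   = "CPM",
  weights              = E(g)$weight,
  resolution_parameter = graph_resolution / 100
)

# One cluster label per residue (vertex).
domains <- as.integer(membership(clusters))
print(domains)
```

In this toy graph the two disconnected triangles should end up as two separate clusters, which is the analogue of two pseudo-rigid groups.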

return_data_frame

a logical value; if TRUE, a data frame is returned instead of a list. It is recommended to use this only when information for a few proteins is retrieved. Default is FALSE.

show_progress

a logical value that specifies if a progress bar will be shown. Default is TRUE.

Value

A list of the provided proteins that contains domain assignments for each residue. If return_data_frame is TRUE, a data frame with this information is returned instead. The data frame contains the following columns:

  • residue: The protein residue number.

  • domain: A numeric value representing a distinct predicted domain in the protein.

  • accession: The UniProt protein identifier.
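
A data frame with these three columns can be summarised with base R, for instance to count how many residues fall into each predicted domain per protein. The sketch below uses a small hand-made data frame that only mimics the column layout described above; its values are invented and are not real predictions.

```r
# Hand-made stand-in for the returned data frame (invented values).
af_domains <- data.frame(
  residue   = c(1:5, 1:4),
  domain    = c(1, 1, 1, 2, 2, 1, 1, 1, 1),
  accession = c(rep("F4HVG8", 5), rep("O15552", 4))
)

# Count the residues assigned to each domain of each protein.
domain_sizes <- aggregate(
  residue ~ accession + domain,
  data = af_domains,
  FUN  = length
)
names(domain_sizes)[names(domain_sizes) == "residue"] <- "n_residues"

print(domain_sizes)
```

The same summary could be produced with dplyr if that package is already in use; base R is used here only to keep the sketch free of dependencies.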

Examples

# \donttest{
# Fetch aligned errors
aligned_error <- fetch_alphafold_aligned_error(
  uniprot_ids = c("F4HVG8", "O15552"),
  error_cutoff = 4
)

# Predict protein domains
af_domains <- predict_alphafold_domain(
  pae_list = aligned_error,
  return_data_frame = TRUE
)

head(af_domains, n = 10)
#> # A tibble: 10 × 3
#>    residue domain accession
#>      <int>  <dbl> <chr>    
#>  1       1      1 F4HVG8   
#>  2       2      1 F4HVG8   
#>  3       3      1 F4HVG8   
#>  4       4      1 F4HVG8   
#>  5       5      1 F4HVG8   
#>  6       6      1 F4HVG8   
#>  7       7      1 F4HVG8   
#>  8       8      1 F4HVG8   
#>  9       9      1 F4HVG8   
#> 10      10      1 F4HVG8   
# }