Perform gene ontology enrichment analysis — calculate_go

Analyses enrichment of gene ontology terms associated with proteins in the fraction of significant proteins compared to all detected proteins. A two-sided Fisher's exact test is performed to test significance of enrichment or depletion. GO annotations can be provided to this function either through UniProt go_annotations_uniprot, through a table obtained with fetch_go in the go_data argument or GO annotations are fetched automatically by the function by providing ontology_type and organism_id.

calculate_go_enrichment(
  data,
  protein_id,
  is_significant,
  group = NULL,
  y_axis_free = TRUE,
  facet_n_col = 2,
  go_annotations_uniprot = NULL,
  ontology_type,
  organism_id = NULL,
  go_data = NULL,
  plot = TRUE,
  plot_style = "barplot",
  plot_title = "Gene ontology enrichment of significant proteins",
  barplot_fill_colour = c("#56B4E9", "#E76145"),
  heatmap_fill_colour = protti::mako_colours,
  heatmap_fill_colour_rev = TRUE,
  label = TRUE,
  label_size = 7,
  enrichment_type = "all",
  replace_long_name = TRUE,
  label_move_frac = 0.2,
  min_n_detected_proteins_in_process = 1,
  plot_cutoff = "adj_pval top10"
)

Arguments

data: a data frame that contains at least the input variables.
protein_id: a character column in the data data frame that contains the protein accession numbers.
is_significant: a logical column in the data data frame that indicates if the corresponding protein has a significantly changing peptide. The input data frame may contain peptide level information with significance information. The function is able to extract protein level information from this.
group: optional, character column in the data data frame that contains information by which the analysis should be grouped. The analysis will be performed separately for each of the groups. This is most likely a column that labels separate comparisons of different conditions. In protti the assign_missingness() function creates such a column automatically.
y_axis_free: a logical value that specifies if the y-axis of the plot should be "free" for each facet if a grouping variable is provided. Default is TRUE. If FALSE is selected it is easier to compare GO categories directly with each other.
facet_n_col: a numeric value that specifies the number of columns the faceted plot should have if a column name is provided to group. The default is 2.
go_annotations_uniprot: recommended, a character column in the data data frame that contains gene ontology annotations obtained from UniProt using fetch_uniprot. These annotations are already separated into the desired ontology type so the argument ontology_type is not required.
ontology_type: optional, character value specifying the type of ontology that should be used. Possible values are molecular function (MF), biological process (BP), cellular component (CC). This argument is not required if GO annotations are provided from UniProt in go_annotations_uniprot. It is required if annotations are provided through go_data or automatically fetched.
organism_id: optional, character value specifying an NCBI taxonomy identifier of an organism (TaxId). Possible inputs include only: "9606" (Human), "559292" (Yeast) and "83333" (E. coli). Is only necessary if GO data is not provided either by go_annotations_uniprot or in go_data.
go_data: Optional, a data frame that can be obtained with fetch_go(). If you provide data not obtained with fetch_go() make sure column names for protein ID (db_id) and GO ID (go_id) are the same as for data obtained with fetch_go().
plot: a logical argument indicating whether the result should be plotted or returned as a table.
plot_style: a character argument that specifies the plot style. Can be either "barplot" (default) or "heatmap". The "heatmap" plot is especially useful for the comparison of multiple groups. We recommend, however, that you use it only with enrichment_type = "enriched" or enrichment_type = "deenriched, because otherwise it is not possible to distinguish between enrichment and deenrichment in the plot.
plot_title: a character value that specifies the title of the plot. The default is "Gene ontology enrichment of significant proteins".
barplot_fill_colour: a vector that contains two colours that should be used as the fill colours for deenriched and enriched GO terms, respectively. If enrichment_type = "enriched" or "deenriched, please still provide two values in the vector, the colour not used for the plot can be NA in this case. E.g. c(NA, "red") for enrichment_type = "enriched".
heatmap_fill_colour: a vector that contains colours that should be used to create the gradient in the heatmap plot. Default is mako_colours.
heatmap_fill_colour_rev: a logical value that specifies if the provided colours in heatmap_fill_colour should be reversed in order. Default is TRUE.
label: a logical argument indicating whether labels should be added to the plot. Default is TRUE.
label_size: a numeric argument that specifies the text size of the labels with the unit "pt". The default is 7.
enrichment_type: a character argument that is either "all", "enriched" or "deenriched". This determines if the enrichment analysis should be performed in order to check for both enrichemnt and deenrichemnt or only one of the two. This affects the statistics performed and therefore also the displayed plot.
replace_long_name: a logical argument that specifies if GO term names above 50 characters should be replaced by the GO ID instead for the plot. This ensures that the plotting area doesn't become too small due to the long name. The default is TRUE.
label_move_frac: a numeric argument between 0 and 1 that specifies which labels should be moved outside of the bar. The default is 0.2, which means that the labels of all bars that have a size of 20% or less of the largest bar are moved to the right of the bar. This prevents labels from overlapping with the bar boundaries.
min_n_detected_proteins_in_process: is a numeric argument that specifies the minimum number of detected proteins required for a GO term to be displayed in the plot. The default is 1, meaning no filtering of the plotted data is performed. This argument does not affect any computations or the returned data if plot = FALSE. This argument is useful in order to remove terms that were only detected in for example 1 protein. Even though these terms are sometimes significant, they are not really relevant.
plot_cutoff: a character value indicating if the plot should contain the top n (e.g. top10) most significant proteins (p-value or adjusted p-value), or if a significance cutoff should be used to determine the number of GO terms in the plot. This information should be provided with the type first followed by the threshold separated by a space. Example are plot_cutoff = "adj_pval top10", plot_cutoff = "pval 0.05" or plot_cutoff = "adj_pval 0.01". The threshold can be chosen freely. The default value is "adj_pval top10".

Value

A bar plot or heatmap (depending on plot_style). By default the bar plot displays negative log10 adjusted p-values for the top 10 enriched or deenriched gene ontology terms. Alternatively, plot cutoffs can be chosen individually with the plot_cutoff argument. Bars are colored according to the direction of the enrichment (enriched or deenriched). If a heatmap is returned, terms are organised on the y-axis, while the colour of each tile represents the negative log10 adjusted p-value (default). If a group column is provided the x-axis contains all groups. If plot = FALSE, a data frame is returned. P-values are adjusted with Benjamini-Hochberg.

Examples

# \donttest{
# Load libraries
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
library(stringr)

# Create example data
# Contains artificial de-enrichment for ribosomes.
uniprot_go_data <- fetch_uniprot_proteome(
  organism_id = 83333,
  columns = c(
    "accession",
    "go_f"
  )
)

if (!is(uniprot_go_data, "character")) {
  data <- uniprot_go_data %>%
    mutate(significant = c(
      rep(TRUE, 1000),
      rep(FALSE, n() - 1000)
    )) %>%
    mutate(significant = ifelse(
      str_detect(
        go_f,
        pattern = "ribosome"
      ),
      FALSE,
      significant
    )) %>%
    mutate(group = c(
      rep("A", 500),
      rep("B", 500),
      rep("A", (n() - 1000) / 2),
      rep("B", round((n() - 1000) / 2))
    ))

  # Plot gene ontology enrichment
  calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    plot = TRUE,
    plot_cutoff = "pval 0.01"
  )

  # Plot gene ontology enrichment with group
  calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    group = group,
    facet_n_col = 1,
    plot = TRUE,
    plot_cutoff = "pval 0.01"
  )

  # Plot gene ontology enrichment with group in a heatmap plot
  calculate_go_enrichment(
    data,
    protein_id = accession,
    group = group,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    min_n_detected_proteins_in_process = 15,
    plot = TRUE,
    label = TRUE,
    plot_style = "heatmap",
    enrichment_type = "enriched",
    plot_cutoff = "pval 0.01"
  )

  # Calculate gene ontology enrichment
  go_enrichment <- calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    plot = FALSE,
  )

  head(go_enrichment, n = 10)
}
#> # A tibble: 10 × 10
#>    term       go_id     pval adj_pval n_detected_proteins n_detected_proteins_…¹
#>    <chr>      <chr>    <dbl>    <dbl>               <int>                  <int>
#>  1 "identica… [GO:… 6.89e-55 1.42e-51                4531                    475
#>  2 "protein … [GO:… 1.19e-29 1.23e-26                4531                    315
#>  3 "magnesiu… [GO:… 2.18e-28 1.50e-25                4531                    194
#>  4 "ATP bind… [GO:… 1.96e-24 1.01e-21                4531                    449
#>  5 "guanosin… [GO:… 1.49e-10 5.80e- 8                4531                     35
#>  6 "zinc ion… [GO:… 1.69e-10 5.80e- 8                4531                    169
#>  7 "NADP bin… [GO:… 3.67e- 9 1.08e- 6                4531                     27
#>  8 "DNA-bind… [GO:… 7.88e- 9 2.03e- 6                4531                     30
#>  9 "protein … [GO:… 2.75e- 8 6.29e- 6                4531                     11
#> 10 "unfolded… [GO:… 2.12e- 7 4.36e- 5                4531                     22
#> # ℹ abbreviated name: ¹n_detected_proteins_in_process
#> # ℹ 4 more variables: n_significant_proteins <int>,
#> #   n_significant_proteins_in_process <int>, n_proteins_expected <dbl>,
#> #   enrichment_type <chr>
# }