R/calculate_go_enrichment.R
calculate_go_enrichment.Rd
Analyses enrichment of gene ontology terms associated with proteins in the fraction of
significant proteins compared to all detected proteins. A two-sided Fisher's exact test is
performed to test significance of enrichment or depletion. GO annotations can be provided to
this function either through UniProt go_annotations_uniprot
, through a table obtained
with fetch_go
in the go_data
argument or GO annotations are fetched automatically
by the function by providing ontology_type
and organism_id
.
calculate_go_enrichment(
data,
protein_id,
is_significant,
group = NULL,
y_axis_free = TRUE,
facet_n_col = 2,
go_annotations_uniprot = NULL,
ontology_type,
organism_id = NULL,
go_data = NULL,
plot = TRUE,
plot_style = "barplot",
plot_title = "Gene ontology enrichment of significant proteins",
barplot_fill_colour = c("#56B4E9", "#E76145"),
heatmap_fill_colour = protti::mako_colours,
heatmap_fill_colour_rev = TRUE,
label = TRUE,
enrichment_type = "all",
replace_long_name = TRUE,
label_move_frac = 0.2,
min_n_detected_proteins_in_process = 1,
plot_cutoff = "adj_pval top10"
)
a data frame that contains at least the input variables.
a character column in the data
data frame that contains the protein
accession numbers.
a logical column in the data
data frame that indicates if the
corresponding protein has a significantly changing peptide. The input data frame may contain
peptide level information with significance information. The function is able to extract
protein level information from this.
optional, character column in the data
data frame that contains information by
which the analysis should be grouped. The analysis will be performed separately for each of the
groups. This is most likely a column that labels separate comparisons of different conditions.
In protti the assign_missingness()
function creates such a column automatically.
a logical value that specifies if the y-axis of the plot should be "free"
for each facet if a grouping variable is provided. Default is TRUE
. If FALSE
is selected
it is easier to compare GO categories directly with each other.
a numeric value that specifies the number of columns the faceted plot should have if a column name is provided to group. The default is 2.
recommended, a character column in the data
data frame
that contains gene ontology annotations obtained from UniProt using fetch_uniprot
.
These annotations are already separated into the desired ontology type so the argument
ontology_type
is not required.
optional, character value specifying the type of ontology that should
be used. Possible values are molecular function (MF), biological process (BP), cellular component
(CC). This argument is not required if GO annotations are provided from UniProt in
go_annotations_uniprot
. It is required if annotations are provided through go_data
or
automatically fetched.
optional, character value specifying an NCBI taxonomy identifier of an
organism (TaxId). Possible inputs include only: "9606" (Human), "559292" (Yeast) and "83333"
(E. coli). Is only necessary if GO data is not provided either by go_annotations_uniprot
or in go_data
.
Optional, a data frame that can be obtained with fetch_go()
. If you provide
data not obtained with fetch_go()
make sure column names for protein ID (db_id
) and GO ID
(go_id
) are the same as for data obtained with fetch_go()
.
a logical argument indicating whether the result should be plotted or returned as a table.
a character argument that specifies the plot style. Can be either "barplot" (default)
or "heatmap". The "heatmap" plot is especially useful for the comparison of multiple groups. We recommend,
however, that you use it only with enrichment_type = "enriched"
or enrichment_type = "deenriched
,
because otherwise it is not possible to distinguish between enrichment and deenrichment in the plot.
a character value that specifies the title of the plot. The default is "Gene ontology enrichment of significant proteins".
a vector that contains two colours that should be used as the fill colours for
deenriched and enriched GO terms, respectively. If enrichment_type = "enriched"
or "deenriched
, please
still provide two values in the vector, the colour not used for the plot can be NA
in this case. E.g.
c(NA, "red")
for enrichment_type = "enriched"
.
a vector that contains colours that should be used to create the gradient in the
heatmap plot. Default is mako_colours
.
a logical value that specifies if the provided colours in heatmap_fill_colour
should be reversed in order. Default is TRUE
.
a logical argument indicating whether labels should be added to the plot. Default is TRUE.
a character argument that is either "all", "enriched" or "deenriched". This determines if the enrichment analysis should be performed in order to check for both enrichemnt and deenrichemnt or only one of the two. This affects the statistics performed and therefore also the displayed plot.
a logical argument that specifies if GO term names above 50 characters should
be replaced by the GO ID instead for the plot. This ensures that the plotting area doesn't become
too small due to the long name. The default is TRUE
.
a numeric argument between 0 and 1 that specifies which labels should be moved outside of the bar. The default is 0.2, which means that the labels of all bars that have a size of 20% or less of the largest bar are moved to the right of the bar. This prevents labels from overlapping with the bar boundaries.
is a numeric argument that specifies the minimum number of
detected proteins required for a GO term to be displayed in the plot. The default is 1, meaning
no filtering of the plotted data is performed. This argument does not affect any computations or
the returned data if plot = FALSE
. This argument is useful in order to remove terms that were only
detected in for example 1 protein. Even though these terms are sometimes significant, they are not
really relevant.
a character value indicating if the plot should contain the top n (e.g. top10) most
significant proteins (p-value or adjusted p-value), or if a significance cutoff should be used
to determine the number of GO terms in the plot. This information should be provided with the
type first followed by the threshold separated by a space. Example are
plot_cutoff = "adj_pval top10"
, plot_cutoff = "pval 0.05"
or
plot_cutoff = "adj_pval 0.01"
. The threshold can be chosen freely. The default value is
"adj_pval top10"
.
A bar plot or heatmap (depending on plot_style
). By default the bar plot displays negative log10
adjusted p-values for the top 10 enriched or deenriched gene ontology terms. Alternatively, plot cutoffs
can be chosen individually with the plot_cutoff
argument. Bars are colored according to the direction
of the enrichment (enriched or deenriched). If a heatmap is returned, terms are organised on the y-axis, while
the colour of each tile represents the negative log10 adjusted p-value (default). If a group
column
is provided the x-axis contains all groups. If plot = FALSE
, a data frame is returned. P-values are adjusted with
Benjamini-Hochberg.
# \donttest{
# Load libraries
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
library(stringr)
# Create example data
# Contains artificial de-enrichment for ribosomes.
uniprot_go_data <- fetch_uniprot_proteome(
organism_id = 83333,
columns = c(
"accession",
"go_f"
)
)
if (!is(uniprot_go_data, "character")) {
data <- uniprot_go_data %>%
mutate(significant = c(
rep(TRUE, 1000),
rep(FALSE, n() - 1000)
)) %>%
mutate(significant = ifelse(
str_detect(
go_f,
pattern = "ribosome"
),
FALSE,
significant
)) %>%
mutate(group = c(
rep("A", 500),
rep("B", 500),
rep("A", (n() - 1000) / 2),
rep("B", round((n() - 1000) / 2))
))
# Plot gene ontology enrichment
calculate_go_enrichment(
data,
protein_id = accession,
go_annotations_uniprot = go_f,
is_significant = significant,
plot = TRUE,
plot_cutoff = "pval 0.01"
)
# Plot gene ontology enrichment with group
calculate_go_enrichment(
data,
protein_id = accession,
go_annotations_uniprot = go_f,
is_significant = significant,
group = group,
facet_n_col = 1,
plot = TRUE,
plot_cutoff = "pval 0.01"
)
# Plot gene ontology enrichment with group in a heatmap plot
calculate_go_enrichment(
data,
protein_id = accession,
group = group,
go_annotations_uniprot = go_f,
is_significant = significant,
min_n_detected_proteins_in_process = 15,
plot = TRUE,
label = TRUE,
plot_style = "heatmap",
enrichment_type = "enriched",
plot_cutoff = "pval 0.01"
)
# Calculate gene ontology enrichment
go_enrichment <- calculate_go_enrichment(
data,
protein_id = accession,
go_annotations_uniprot = go_f,
is_significant = significant,
plot = FALSE,
)
head(go_enrichment, n = 10)
}
#> # A tibble: 10 × 10
#> term go_id pval adj_pval n_detected_proteins n_detected_proteins_…¹
#> <chr> <chr> <dbl> <dbl> <int> <int>
#> 1 "identica… [GO:… 1.92e-53 4.04e-50 4530 475
#> 2 "protein … [GO:… 3.24e-31 3.41e-28 4530 315
#> 3 "magnesiu… [GO:… 1.45e-28 1.01e-25 4530 193
#> 4 "ATP bind… [GO:… 1.52e-25 7.99e-23 4530 449
#> 5 "guanosin… [GO:… 1.57e-10 6.61e- 8 4530 35
#> 6 "zinc ion… [GO:… 7.50e-10 2.63e- 7 4530 166
#> 7 "NADP bin… [GO:… 3.83e- 9 1.15e- 6 4530 27
#> 8 "DNA-bind… [GO:… 1.37e- 8 3.60e- 6 4530 26
#> 9 "unfolded… [GO:… 1.99e- 8 4.64e- 6 4530 22
#> 10 "protein … [GO:… 2.82e- 8 5.93e- 6 4530 11
#> # ℹ abbreviated name: ¹n_detected_proteins_in_process
#> # ℹ 4 more variables: n_significant_proteins <int>,
#> # n_significant_proteins_in_process <int>, n_proteins_expected <dbl>,
#> # enrichment_type <chr>
# }