R/calculate_diff_abundance.R
calculate_diff_abundance.Rd
Performs differential abundance calculations and statistical hypothesis tests on data frames with protein, peptide or precursor data. Different methods for statistical testing are available.
calculate_diff_abundance(
  data,
  sample,
  condition,
  grouping,
  intensity_log2,
  missingness = missingness,
  comparison = comparison,
  mean = NULL,
  sd = NULL,
  n_samples = NULL,
  ref_condition = "all",
  filter_NA_missingness = TRUE,
  method = c("moderated_t-test", "t-test", "t-test_mean_sd", "proDA"),
  p_adj_method = "BH",
  retain_columns = NULL
)
data: a data frame containing at least the input variables that are required for the
selected method. Ideally the output of assign_missingness() or impute() is used.
sample: a character column in the input data frame (data) that contains the sample names.
Is not required if method = "t-test_mean_sd".
condition: a character or numeric column in the input data frame (data) that contains the
conditions.
grouping: a character column in the input data frame (data) that contains precursor,
peptide or protein identifiers.
intensity_log2: a numeric column in the input data frame (data) that contains intensity
values. The intensity values need to be log2 transformed. Is not required if
method = "t-test_mean_sd".
missingness: a character column in the input data frame (data) that contains missingness
information. Can be obtained by calling assign_missingness(). Is not required if
method = "t-test_mean_sd". The type of missingness assigned to a comparison does not
influence the statistical test. However, if filter_NA_missingness = TRUE and
method = "proDA", comparisons with missingness NA are filtered out prior to p-value
adjustment.
comparison: a character column in the input data frame (data) that contains information on
treatment/reference condition pairs. Can be obtained by calling assign_missingness().
Comparisons need to be in the form condition1_vs_condition2, meaning the two compared
conditions are separated by "_vs_". This column determines for which condition pairs
differential abundances are calculated. Is not required if method = "t-test_mean_sd"; in that
case, please provide a reference condition with the ref_condition argument.
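The condition1_vs_condition2 convention can be taken apart with base R. The following is a minimal sketch, not code from the package: the comparison strings and the `pairs` data frame are made up for illustration, and the treated/control labels follow the convention stated for the output (first condition is the treatment, second is the reference).

```r
# Hypothetical comparison labels in the documented "<treated>_vs_<control>" form
comparison <- c("condition_1_vs_condition_2", "drug_vs_control")

# Split on the literal separator "_vs_" (fixed = TRUE avoids regex interpretation)
parts <- strsplit(comparison, split = "_vs_", fixed = TRUE)

# First element is the treatment condition, second the reference condition
pairs <- data.frame(
  comparison = comparison,
  treated = vapply(parts, `[`, character(1), 1),
  control = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
pairs
```

Note that this simple split assumes no condition name itself contains "_vs_".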
mean: a numeric column in the input data frame (data) that contains mean values for two
conditions. Is only required if method = "t-test_mean_sd".
sd: a numeric column in the input data frame (data) that contains standard deviations for
two conditions. Is only required if method = "t-test_mean_sd".
n_samples: a numeric column in the input data frame (data) that contains the number of
samples per condition for two conditions. Is only required if method = "t-test_mean_sd".
ref_condition: optional, a character value providing the condition that is used as a
reference for differential abundance calculation. Only required for method = "t-test_mean_sd".
Instead of providing one reference condition, "all" can be supplied, which will create all
pairwise condition pairs. By default ref_condition = "all".
filter_NA_missingness: a logical value, default is TRUE. For all methods except
"t-test_mean_sd", missingness information has to be provided. This information can, for
example, be obtained by calling assign_missingness(). If a reference/treatment pair has
too few samples to be considered robust based on user-defined cutoffs, it is annotated with NA
as missingness by the assign_missingness() function. If this argument is TRUE,
these NA reference/treatment pairs are filtered out. For method = "proDA" this
is done before the p-value adjustment.
method: a character value that specifies the method used for statistical hypothesis testing.
Methods include a Welch test ("t-test"), a Welch test on means, standard deviations and
number of replicates ("t-test_mean_sd") and a moderated t-test based on the limma
package ("moderated_t-test"). More information on the moderated t-test can be found in
the limma documentation. Furthermore, the proDA package specific method ("proDA")
can be used to infer means across samples based on a probabilistic dropout model. This
eliminates the need for data imputation since missing values are inferred from the model. More
information can be found in the proDA documentation. We do not recommend using the
"moderated_t-test" or "proDA" method if the data was filtered for low CVs or imputation
was performed. Default is method = "moderated_t-test".
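To make the idea of a Welch test on means, standard deviations and numbers of replicates concrete, here is a minimal base R sketch. It is the textbook Welch formula with Satterthwaite degrees of freedom, not necessarily the package's internal implementation; the helper name `welch_from_summary` and the simulated values are hypothetical.

```r
# Hypothetical helper: Welch t-test computed from summary statistics only
welch_from_summary <- function(mean_treated, sd_treated, n_treated,
                               mean_control, sd_control, n_control) {
  var_t <- sd_treated^2 / n_treated   # squared standard error, treated
  var_c <- sd_control^2 / n_control   # squared standard error, control
  diff <- mean_treated - mean_control
  std_error <- sqrt(var_t + var_c)
  t_statistic <- diff / std_error
  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (var_t + var_c)^2 /
    (var_t^2 / (n_treated - 1) + var_c^2 / (n_control - 1))
  pval <- 2 * pt(abs(t_statistic), df = df, lower.tail = FALSE)
  data.frame(diff = diff, std_error = std_error,
             t_statistic = t_statistic, df = df, pval = pval)
}

# Simulated log2 intensities for one peptide in two conditions
set.seed(1)
treated <- rnorm(4, mean = 21, sd = 0.5)
control <- rnorm(4, mean = 20, sd = 0.5)

from_summary <- welch_from_summary(
  mean(treated), sd(treated), length(treated),
  mean(control), sd(control), length(control)
)
# t.test() performs a Welch test by default (var.equal = FALSE)
from_raw <- t.test(treated, control)

all.equal(from_summary$pval, from_raw$p.value)
```

The comparison with t.test() on the raw values shows that the summary statistics carry all the information this test needs.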
p_adj_method: a character value that specifies the p-value correction method. Possible
methods are c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default
method is "BH".
retain_columns: a vector indicating if certain columns should be retained from the input
data frame. By default no additional columns are retained (retain_columns = NULL). Specific
columns can be retained by providing their names (not in quotation marks, just like other
column names, but in a vector). Please note that if you retain columns that have multiple
rows per grouped variable, there will be duplicated rows in the output.
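The duplication warning can be illustrated with a small base R sketch. It assumes, for illustration only, that retained columns are joined back onto the results by the grouping variable; the data frames `results` and `annotation` are hypothetical and the package may attach the columns differently.

```r
# Hypothetical result table: one row per peptide
results <- data.frame(
  peptide = c("peptide_1", "peptide_2"),
  diff = c(1.2, -0.1),
  stringsAsFactors = FALSE
)

# Hypothetical retained column that varies within a peptide (one row per sample)
annotation <- data.frame(
  peptide = c("peptide_1", "peptide_1", "peptide_2"),
  sample = c("sample_1", "sample_2", "sample_1"),
  stringsAsFactors = FALSE
)

# Joining on the grouping variable repeats the peptide_1 result once per sample
joined <- merge(results, annotation, by = "peptide")
nrow(joined)  # 3 rows instead of 2
```

Retaining only columns that are constant within each grouping value (such as a protein identifier for a peptide) avoids this.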
A data frame that contains differential abundances (diff), p-values (pval)
and adjusted p-values (adj_pval) for each protein, peptide or precursor (depending on
the grouping variable) and the associated treatment/reference pair. Depending on the
method the data frame contains additional columns:
"t-test": The std_error column contains the standard error of the differential
abundances. n_obs contains the number of observations for the specific protein, peptide
or precursor (depending on the grouping variable) and the associated treatment/reference pair.
"t-test_mean_sd": Columns labeled as control refer to the second condition of the
comparison pairs. Treated refers to the first condition. The mean_control and mean_treated
columns contain the means for the reference and treatment condition, respectively. The
sd_control and sd_treated columns contain the standard deviations for the reference and
treatment condition, respectively. The n_control and n_treated columns contain the numbers of
samples for the reference and treatment condition, respectively. The std_error column
contains the standard error of the differential abundances. t_statistic contains the
t-statistic for the t-test.
"moderated_t-test": CI_2.5 and CI_97.5 contain the 2.5% and 97.5%
confidence interval borders for differential abundances. avg_abundance contains average
abundances for treatment/reference pairs (mean of the two group means). t_statistic
contains the t-statistic for the t-test. B contains the B-statistic, which is the log-odds
that the protein, peptide or precursor (depending on grouping) has a differential abundance
between the two groups. Suppose B = 1.5. The odds of differential abundance are
exp(1.5) = 4.48, i.e., about four and a half to one. The probability that there is a
differential abundance is 4.48/(1+4.48) = 0.82, i.e., the probability is about 82% that this
group is differentially abundant. A B-statistic of zero corresponds to a 50-50 chance that the
group is differentially abundant. n_obs contains the number of observations for the specific
protein, peptide or precursor (depending on the grouping variable) and the associated
treatment/reference pair.
"proDA": The std_error column contains the standard error of the differential
abundances. avg_abundance contains average abundances for treatment/reference pairs
(mean of the two group means). t_statistic contains the t-statistic for the t-test.
n_obs contains the number of observations for the specific protein, peptide or precursor
(depending on the grouping variable) and the associated treatment/reference pair.
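The B-statistic arithmetic described for "moderated_t-test" can be reproduced in a few lines of base R. This is only a worked version of the numbers in the text; the variable names are chosen for illustration.

```r
# B is a log-odds value; exp() converts it to odds
B <- 1.5
odds <- exp(B)
round(odds, 2)         # 4.48

# Odds to probability: odds / (1 + odds)
probability <- odds / (1 + odds)
round(probability, 2)  # 0.82

# plogis() is the logistic function and gives the same result directly
all.equal(probability, plogis(B))

# B = 0 corresponds to even odds, i.e. a probability of 0.5
plogis(0)
```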
For all methods except "proDA", the p-value adjustment is performed only on the
portion of the data that contains a p-value that is not NA. For "proDA" the
p-value adjustment is performed either on the subset of the dataset with missingness that is
not NA (filter_NA_missingness = TRUE) or on the complete dataset
(filter_NA_missingness = FALSE).
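As a rough illustration of adjusting only the p-values that are not NA, the following base R sketch applies the "BH" correction to the non-missing entries and leaves the rest as NA. The p-values are invented and this is not the package's own code.

```r
# Hypothetical p-values; NA marks tests that could not be computed
pval <- c(0.001, NA, 0.02, 0.04, NA, 0.5)

# Adjust only the non-NA entries, so the number of tests is 4 rather than 6
keep <- !is.na(pval)
adj_pval <- rep(NA_real_, length(pval))
adj_pval[keep] <- p.adjust(pval[keep], method = "BH")

adj_pval
```

Because the number of tests enters the correction, restricting the adjustment to a subset generally yields smaller adjusted p-values than adjusting over a larger set.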
set.seed(123) # Makes example reproducible
# Create synthetic data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)
# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, change_peptide)
)
#> "all" was provided as reference condition. All pairwise comparisons are
#> created from the conditions and assigned their missingness. The created
#> comparisons are:
#> condition_1_vs_condition_2
# Calculate differential abundances
# Using "moderated_t-test" and "proDA" improves
# true positive recovery progressively
diff <- calculate_diff_abundance(
  data = data_missing,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  missingness = missingness,
  comparison = comparison,
  method = "t-test",
  retain_columns = c(protein, change_peptide)
)
#> [1/2] Create input for t-tests ...
#> DONE
#> [2/2] Calculate t-tests ...
#> DONE
head(diff, n = 10)
#> # A tibble: 10 × 10
#>    protein   change_peptide comparison     peptide missingness    pval std_error
#>    <chr>     <lgl>          <chr>          <chr>   <chr>         <dbl>     <dbl>
#>  1 protein_5 TRUE           condition_1_v… peptid… complete    9.38e-9   0.0557
#>  2 protein_3 TRUE           condition_1_v… peptid… complete    7.01e-7   0.0919
#>  3 protein_1 TRUE           condition_1_v… peptid… complete    6.01e-6   0.0670
#>  4 protein_1 FALSE          condition_1_v… peptid… MAR         5.12e-2   0.00809
#>  5 protein_4 TRUE           condition_1_v… peptid… MAR         6.66e-2   0.308
#>  6 protein_2 FALSE          condition_1_v… peptid… complete    7.77e-2   0.275
#>  7 protein_9 FALSE          condition_1_v… peptid… MAR         1.89e-1   0.486
#>  8 protein_3 FALSE          condition_1_v… peptid… complete    2.23e-1   0.0752
#>  9 protein_1 FALSE          condition_1_v… peptid… MAR         2.23e-1   0.0466
#> 10 protein_3 FALSE          condition_1_v… peptid… complete    2.71e-1   0.0843
#> # ℹ 3 more variables: diff <dbl>, n_obs <int>, adj_pval <dbl>
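A common next step is to select hits from the result using the documented diff and adj_pval columns. The sketch below uses a small hand-made data frame rather than real output, and the cutoffs (adjusted p-value below 0.05, absolute difference above 1) are illustrative assumptions, not recommendations from the package.

```r
# Hypothetical result table with the documented columns diff and adj_pval
result <- data.frame(
  peptide = c("peptide_1", "peptide_2", "peptide_3", "peptide_4"),
  diff = c(1.8, -0.2, -1.4, 0.9),
  adj_pval = c(0.001, 0.60, 0.03, 0.20),
  stringsAsFactors = FALSE
)

# Keep rows that pass both the significance and the effect size cutoff
significant <- subset(result, adj_pval < 0.05 & abs(diff) > 1)
significant$peptide
```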