`R/calculate_diff_abundance.R`

`calculate_diff_abundance.Rd`

Performs differential abundance calculations and statistical hypothesis tests on data frames with protein, peptide or precursor data. Different methods for statistical testing are available.

```
calculate_diff_abundance(
  data,
  sample,
  condition,
  grouping,
  intensity_log2,
  missingness = missingness,
  comparison = comparison,
  mean = NULL,
  sd = NULL,
  n_samples = NULL,
  ref_condition = "all",
  filter_NA_missingness = TRUE,
  method = c("moderated_t-test", "t-test", "t-test_mean_sd", "proDA"),
  p_adj_method = "BH",
  retain_columns = NULL
)
```

- data
a data frame containing at least the input variables that are required for the selected method. Ideally, the output of `assign_missingness()` or `impute()` is used.
- sample
a character column in the `data` data frame that contains the sample name. Not required if `method = "t-test_mean_sd"`.
- condition
a character or numeric column in the `data` data frame that contains the conditions.
- grouping
a character column in the `data` data frame that contains precursor, peptide or protein identifiers.
- intensity_log2
a numeric column in the `data` data frame that contains intensity values. The intensity values need to be log2 transformed. Not required if `method = "t-test_mean_sd"`.
- missingness
a character column in the `data` data frame that contains missingness information. Can be obtained by calling `assign_missingness()`. Not required if `method = "t-test_mean_sd"`. The type of missingness assigned to a comparison does not have any influence on the statistical test. However, if `filter_NA_missingness = TRUE` and `method = "proDA"`, then comparisons with missingness `NA` are filtered out prior to p-value adjustment.
- comparison
a character column in the `data` data frame that contains the treatment/reference condition pairs. Can be obtained by calling `assign_missingness()`. Comparisons need to be in the form `condition1_vs_condition2`, meaning the two compared conditions are separated by `"_vs_"`. This column determines for which condition pairs differential abundances are calculated. Not required if `method = "t-test_mean_sd"`; in that case, please provide a reference condition with the `ref_condition` argument.
- mean
a numeric column in the `data` data frame that contains mean values for two conditions. Only required if `method = "t-test_mean_sd"`.
- sd
a numeric column in the `data` data frame that contains standard deviations for two conditions. Only required if `method = "t-test_mean_sd"`.
- n_samples
a numeric column in the `data` data frame that contains the number of samples per condition for two conditions. Only required if `method = "t-test_mean_sd"`.
- ref_condition
an optional character value providing the condition that is used as a reference for differential abundance calculation. Only needed for `method = "t-test_mean_sd"`. Instead of providing one reference condition, `"all"` can be supplied, which will create all pairwise condition pairs. The default is `ref_condition = "all"`.
- filter_NA_missingness
a logical value, default is `TRUE`. For all methods except `"t-test_mean_sd"`, missingness information has to be provided. This information can, for example, be obtained by calling `assign_missingness()`. If a reference/treatment pair has too few samples to be considered robust based on user-defined cutoffs, it is annotated with `NA` as missingness by the `assign_missingness()` function. If this argument is `TRUE`, these `NA` reference/treatment pairs are filtered out. For `method = "proDA"` this is done before the p-value adjustment.
- method
a character value that specifies the method used for statistical hypothesis testing. Methods include the Welch test (`"t-test"`), a Welch test on means, standard deviations and number of replicates (`"t-test_mean_sd"`) and a moderated t-test based on the `limma` package (`"moderated_t-test"`). More information on the moderated t-test can be found in the `limma` documentation. Furthermore, the `proDA` package specific method (`"proDA"`) can be used to infer means across samples based on a probabilistic dropout model. This eliminates the need for data imputation since missing values are inferred from the model. More information can be found in the `proDA` documentation. We do not recommend using the `"moderated_t-test"` or `"proDA"` method if the data was filtered for low CVs or imputation was performed. Default is `method = "moderated_t-test"`.
- p_adj_method
a character value that specifies the p-value correction method. Possible methods are `c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")`. Default method is `"BH"`.
- retain_columns
a vector indicating if certain columns should be retained from the input data frame. By default no additional columns are retained (`retain_columns = NULL`). Specific columns can be retained by providing their names (not in quotation marks, just like other column names, but in a vector). Please note that if you retain columns that have multiple rows per grouped variable, there will be duplicated rows in the output.
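
To make the `mean`, `sd` and `n_samples` inputs of `method = "t-test_mean_sd"` more concrete, the following base R sketch shows how a Welch test can be computed from per-condition summary statistics alone. It is not protti's internal code: the helper name `welch_from_summary()` and all intensity values are hypothetical, and the first condition is assumed to be the treatment and the second the reference.

```r
# Hypothetical helper (not part of protti): Welch's t-test from summary statistics
welch_from_summary <- function(mean_treated, mean_control,
                               sd_treated, sd_control,
                               n_treated, n_control) {
  # Squared standard errors of the two group means
  v_treated <- sd_treated^2 / n_treated
  v_control <- sd_control^2 / n_control

  diff <- mean_treated - mean_control
  std_error <- sqrt(v_treated + v_control)
  t_statistic <- diff / std_error

  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (v_treated + v_control)^2 /
    (v_treated^2 / (n_treated - 1) + v_control^2 / (n_control - 1))

  # Two-sided p-value
  pval <- 2 * stats::pt(-abs(t_statistic), df)

  data.frame(
    diff = diff, std_error = std_error,
    t_statistic = t_statistic, df = df, pval = pval
  )
}

# Made-up log2 intensities for one peptide in two conditions
treated <- c(20.1, 20.4, 19.8, 20.3)
control <- c(18.9, 19.2, 19.0, 19.5)

res <- welch_from_summary(
  mean(treated), mean(control),
  sd(treated), sd(control),
  length(treated), length(control)
)

# Reference: Welch test on the raw values
ref <- stats::t.test(treated, control, var.equal = FALSE)
all.equal(res$pval, ref$p.value)
```

The comparison with `stats::t.test()` is only a sanity check: with the summary statistics derived from the same raw values, both routes should give the same p-value.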

A data frame that contains differential abundances (`diff`), p-values (`pval`) and adjusted p-values (`adj_pval`) for each protein, peptide or precursor (depending on the `grouping` variable) and the associated treatment/reference pair. Depending on the method, the data frame contains additional columns:

- "t-test": The `std_error` column contains the standard error of the differential abundances. `n_obs` contains the number of observations for the specific protein, peptide or precursor (depending on the `grouping` variable) and the associated treatment/reference pair.
- "t-test_mean_sd": Columns labeled as control refer to the second condition of the comparison pairs. Treated refers to the first condition. `mean_control` and `mean_treated` columns contain the means for the reference and treatment condition, respectively. `sd_control` and `sd_treated` columns contain the standard deviations for the reference and treatment condition, respectively. `n_control` and `n_treated` columns contain the numbers of samples for the reference and treatment condition, respectively. The `std_error` column contains the standard error of the differential abundances. `t_statistic` contains the t-statistic of the t-test.
- "moderated_t-test": `CI_2.5` and `CI_97.5` contain the 2.5% and 97.5% confidence interval borders for differential abundances. `avg_abundance` contains average abundances for treatment/reference pairs (mean of the two group means). `t_statistic` contains the t-statistic of the t-test. `B` contains the B-statistic, which is the log-odds that the protein, peptide or precursor (depending on `grouping`) has a differential abundance between the two groups. Suppose B = 1.5. The odds of differential abundance are exp(1.5) = 4.48, i.e., about four and a half to one. The probability that there is a differential abundance is 4.48/(1 + 4.48) = 0.82, i.e., the probability is about 82% that this group is differentially abundant. A B-statistic of zero corresponds to a 50-50 chance that the group is differentially abundant. `n_obs` contains the number of observations for the specific protein, peptide or precursor (depending on the `grouping` variable) and the associated treatment/reference pair.
- "proDA": The `std_error` column contains the standard error of the differential abundances. `avg_abundance` contains average abundances for treatment/reference pairs (mean of the two group means). `t_statistic` contains the t-statistic of the t-test. `n_obs` contains the number of observations for the specific protein, peptide or precursor (depending on the `grouping` variable) and the associated treatment/reference pair.
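
The conversion from a B-statistic to odds and to a probability, as worked through for `"moderated_t-test"`, can be reproduced in base R. This is only a sketch of the arithmetic; the variable names are illustrative and not output columns of the function.

```r
# B is the log-odds of differential abundance (example value from the text)
B <- 1.5

odds <- exp(B)                    # odds of differential abundance
probability <- odds / (1 + odds)  # same value as stats::plogis(B)

round(odds, 2)
#> [1] 4.48
round(probability, 2)
#> [1] 0.82

# B = 0 corresponds to even odds, i.e. a probability of 0.5
stats::plogis(0)
#> [1] 0.5
```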

For all methods except `"proDA"`, the p-value adjustment is performed only on the subset of the data that contains a p-value that is not `NA`. For `"proDA"` the p-value adjustment is performed either on the subset of the dataset with missingness that is not `NA` (`filter_NA_missingness = TRUE`) or on the complete dataset (`filter_NA_missingness = FALSE`).
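
The effect of restricting the adjustment to a subset of tests can be illustrated with `stats::p.adjust()`, the base R function that offers the correction methods listed for `p_adj_method`. The p-values below are invented, and the sketch does not reproduce the function's internal code; it only shows that the number of tests entering the correction changes the adjusted values.

```r
# Invented p-values; NA marks a comparison without a computed test
pval <- c(0.001, 0.02, NA, 0.04, 0.3)

# Adjust only the non-NA p-values with Benjamini-Hochberg ("BH")
tested <- !is.na(pval)
adj_pval <- rep(NA_real_, length(pval))
adj_pval[tested] <- stats::p.adjust(pval[tested], method = "BH")

# Counting the NA entry as an additional test (n = 5 instead of 4)
# yields larger, i.e. more conservative, adjusted p-values
adj_all <- stats::p.adjust(pval, method = "BH", n = length(pval))

data.frame(pval, adj_pval, adj_all)
```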

```
set.seed(123) # Makes example reproducible
# Create synthetic data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)
# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, change_peptide)
)
#> "all" was provided as reference condition. All pairwise comparisons are
#> created from the conditions and assigned their missingness. The created
#> comparisons are:
#> condition_1_vs_condition_2
# Calculate differential abundances
# Using "moderated_t-test" and "proDA" improves
# true positive recovery progressively
diff <- calculate_diff_abundance(
  data = data_missing,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  missingness = missingness,
  comparison = comparison,
  method = "t-test",
  retain_columns = c(protein, change_peptide)
)
#> [1/2] Create input for t-tests ...
#> DONE
#> [2/2] Calculate t-tests ...
#> DONE
head(diff, n = 10)
#> # A tibble: 10 × 10
#> protein change_peptide comparison peptide missingness pval std_error
#> <chr> <lgl> <chr> <chr> <chr> <dbl> <dbl>
#> 1 protein_5 TRUE condition_1_v… peptid… complete 9.38e-9 0.0557
#> 2 protein_3 TRUE condition_1_v… peptid… complete 7.01e-7 0.0919
#> 3 protein_1 TRUE condition_1_v… peptid… complete 6.01e-6 0.0670
#> 4 protein_1 FALSE condition_1_v… peptid… MAR 5.12e-2 0.00809
#> 5 protein_4 TRUE condition_1_v… peptid… MAR 6.66e-2 0.308
#> 6 protein_2 FALSE condition_1_v… peptid… complete 7.77e-2 0.275
#> 7 protein_9 FALSE condition_1_v… peptid… MAR 1.89e-1 0.486
#> 8 protein_3 FALSE condition_1_v… peptid… complete 2.23e-1 0.0752
#> 9 protein_1 FALSE condition_1_v… peptid… MAR 2.23e-1 0.0466
#> 10 protein_3 FALSE condition_1_v… peptid… complete 2.71e-1 0.0843
#> # ℹ 3 more variables: diff <dbl>, n_obs <int>, adj_pval <dbl>
```
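
The comparison labels in the output follow the `condition1_vs_condition2` form, with the treatment first and the reference second. A small base R sketch for splitting such labels into their two conditions is given below. It assumes that condition names do not themselves contain `"_vs_"`; the second label and the column names `treated` and `control` are made up for illustration.

```r
# Example labels; the second one is hypothetical
comparison <- c("condition_1_vs_condition_2", "condition_1_vs_condition_3")

# Split each label at the fixed separator "_vs_"
parts <- strsplit(comparison, "_vs_", fixed = TRUE)

conditions <- data.frame(
  comparison = comparison,
  treated = vapply(parts, `[`, character(1), 1), # first condition (treatment)
  control = vapply(parts, `[`, character(1), 2), # second condition (reference)
  stringsAsFactors = FALSE
)
conditions
```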