Calculate differential abundance between conditions — calculate_diff

Performs differential abundance calculations and statistical hypothesis tests on data frames with protein, peptide or precursor data. Different methods for statistical testing are available.

calculate_diff_abundance(
  data,
  sample,
  condition,
  grouping,
  intensity_log2,
  missingness = missingness,
  comparison = comparison,
  mean = NULL,
  sd = NULL,
  n_samples = NULL,
  ref_condition = "all",
  filter_NA_missingness = TRUE,
  method = c("moderated_t-test", "t-test", "t-test_mean_sd", "proDA"),
  p_adj_method = "BH",
  retain_columns = NULL
)

Arguments

data: a data frame containing at least the input variables that are required for the selected method. Ideally the output of assign_missingness or impute is used.
sample: a character column in the data data frame that contains the sample name. Is not required if method = "t-test_mean_sd".
condition: a character or numeric column in the data data frame that contains the conditions.
grouping: a character column in the data data frame that contains precursor, peptide or protein identifiers.
intensity_log2: a numeric column in the data data frame that contains intensity values. The intensity values need to be log2 transformed. Is not required if method = "t-test_mean_sd".
missingness: a character column in the data data frame that contains missingness information. Can be obtained by calling assign_missingness(). Is not required if method = "t-test_mean_sd". The type of missingness assigned to a comparison does not have any influence on the statistical test. However, if filter_NA_missingness = TRUE and method = "proDA", then comparisons with missingness NA are filtered out prior to p-value adjustment.
comparison: a character column in the data data frame that contains information of treatment/reference condition pairs. Can be obtained by calling assign_missingness. Comparisons need to be in the form condition1_vs_condition2, meaning two compared conditions are separated by "_vs_". This column determines for which condition pairs differential abundances are calculated. Is not required if method = "t-test_mean_sd", in that case please provide a reference condition with the ref_condition argument.
mean: a numeric column in the data data frame that contains mean values for two conditions. Is only required if method = "t-test_mean_sd".
sd: a numeric column in the data data frame that contains standard deviations for two conditions. Is only required if method = "t-test_mean_sd".
n_samples: a numeric column in the data data frame that contains the number of samples per condition for two conditions. Is only required if method = "t-test_mean_sd".
ref_condition: optional, character value providing the condition that is used as a reference for differential abundance calculation. Only required for method = "t-test_mean_sd". Instead of providing one reference condition, "all" can be supplied, which will create all pairwise condition pairs. By default ref_condition = "all".
filter_NA_missingness: a logical value, default is TRUE. For all methods except "t-test_mean_sd" missingness information has to be provided. This information can be obtained, for example, by calling assign_missingness(). If a reference/treatment pair has too few samples to be considered robust based on user-defined cutoffs, it is annotated with NA as missingness by the assign_missingness() function. If this argument is TRUE, these NA reference/treatment pairs are filtered out. For method = "proDA" this is done before the p-value adjustment.
method: a character value that specifies the method used for statistical hypothesis testing. Methods include a Welch test ("t-test"), a Welch test on means, standard deviations and number of replicates ("t-test_mean_sd") and a moderated t-test based on the limma package ("moderated_t-test"). More information on the moderated t-test can be found in the limma documentation. Furthermore, the proDA package specific method ("proDA") can be used to infer means across samples based on a probabilistic dropout model. This eliminates the need for data imputation since missing values are inferred from the model. More information can be found in the proDA documentation. We do not recommend using the "moderated_t-test" or "proDA" method if the data was filtered for low CVs or imputation was performed. Default is method = "moderated_t-test".
p_adj_method: a character value, specifies the p-value correction method. Possible methods are c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default method is "BH".
retain_columns: a vector indicating if certain columns should be retained from the input data frame. By default no additional columns are retained (retain_columns = NULL). Specific columns can be retained by providing their names (not in quotation marks, just like other column names, but in a vector). Please note that if you retain columns that have multiple rows per grouped variable, there will be duplicated rows in the output.
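
To make the "t-test_mean_sd" inputs more concrete, the following is a minimal sketch of a Welch test computed from summary statistics in base R. The helper welch_from_summary() is hypothetical and not part of the package, the sign convention (treated minus control) is an assumption, and the package's internal implementation may differ. The sketch also shows how a comparison label of the form condition1_vs_condition2 can be split into its two conditions.

```r
# Hypothetical helper (not part of the package): Welch's t-test computed
# from per-condition means, standard deviations and sample counts.
welch_from_summary <- function(mean_treated, mean_control,
                               sd_treated, sd_control,
                               n_treated, n_control) {
  # Assumed sign convention: treated minus control
  diff <- mean_treated - mean_control
  # Squared standard error contribution of each condition
  var_t <- sd_treated^2 / n_treated
  var_c <- sd_control^2 / n_control
  std_error <- sqrt(var_t + var_c)
  t_statistic <- diff / std_error
  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (var_t + var_c)^2 /
    (var_t^2 / (n_treated - 1) + var_c^2 / (n_control - 1))
  # Two-sided p-value
  pval <- 2 * stats::pt(abs(t_statistic), df = df, lower.tail = FALSE)
  data.frame(diff, std_error, t_statistic, df, pval)
}

# Split a comparison label into its two conditions
parts <- strsplit("condition_1_vs_condition_2", "_vs_", fixed = TRUE)[[1]]
parts

# Compare against stats::t.test() on simulated log2 intensities
set.seed(1)
x <- rnorm(4, mean = 21, sd = 0.5) # first condition
y <- rnorm(4, mean = 20, sd = 0.5) # second condition
res <- welch_from_summary(mean(x), mean(y), sd(x), sd(y), length(x), length(y))
ref <- t.test(x, y, var.equal = FALSE)
res
```

Because the Welch test only depends on the mean, standard deviation and number of samples of each condition, the summary-based result should agree with the test on the raw values.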

Value

A data frame that contains differential abundances (diff), p-values (pval) and adjusted p-values (adj_pval) for each protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair. Depending on the method the data frame contains additional columns:

"t-test": The std_error column contains the standard error of the differential abundances. n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.
"t-test_mean_sd": Columns labeled as control refer to the second condition of the comparison pairs. Treated refers to the first condition. mean_control and mean_treated columns contain the means for the reference and treatment condition, respectively. sd_control and sd_treated columns contain the standard deviations for the reference and treatment condition, respectively. n_control and n_treated columns contain the numbers of samples for the reference and treatment condition, respectively. The std_error column contains the standard error of the differential abundances. t_statistic contains the t_statistic for the t-test.
"moderated_t-test": CI_2.5 and CI_97.5 contain the 2.5% and 97.5% confidence interval borders for differential abundances. avg_abundance contains average abundances for treatment/reference pairs (mean of the two group means). t_statistic contains the t_statistic for the t-test. B The B-statistic is the log-odds that the protein, peptide or precursor (depending on grouping) has a differential abundance between the two groups. Suppose B=1.5. The odds of differential abundance is exp(1.5)=4.48, i.e, about four and a half to one. The probability that there is a differential abundance is 4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this group is differentially abundant. A B-statistic of zero corresponds to a 50-50 chance that the group is differentially abundant.n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.
"proDA": The std_error column contains the standard error of the differential abundances. avg_abundance contains average abundances for treatment/reference pairs (mean of the two group means). t_statistic contains the t_statistic for the t-test. n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.

For all methods except "proDA", the p-value adjustment is performed only on the portion of the data with a p-value that is not NA. For "proDA", the p-value adjustment is performed either on the subset of the dataset with missingness that is not NA (filter_NA_missingness = TRUE) or on the complete dataset (filter_NA_missingness = FALSE).
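
The effect of adjusting only p-values that are not NA can be illustrated with base R's p.adjust(), whose methods match the options listed for p_adj_method. Whether the function calls p.adjust() internally in exactly this way is an assumption; the p-values below are made up for illustration.

```r
# Illustrative p-values, one of which is missing
pvals <- c(0.001, 0.02, NA, 0.04, 0.30)

# By default, p.adjust() ignores NA values: the number of tests is the
# number of non-NA p-values, and NA entries stay NA in the result
adj <- p.adjust(pvals, method = "BH")
adj

# Adjusting only the non-NA subset gives the same adjusted values
adj_subset <- p.adjust(pvals[!is.na(pvals)], method = "BH")
adj_subset
```

Since the number of tests enters the correction, removing comparisons before the adjustment (as done for "proDA" when comparisons are filtered) generally changes the adjusted p-values of the remaining comparisons.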

Examples

set.seed(123) # Makes example reproducible

# Create synthetic data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, change_peptide)
)
#> "all" was provided as reference condition. All pairwise comparisons are
#> created from the conditions and assigned their missingness. The created
#> comparisons are:
#> condition_1_vs_condition_2

# Calculate differential abundances
# Using "moderated_t-test" and "proDA" improves
# true positive recovery progressively
diff <- calculate_diff_abundance(
  data = data_missing,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  missingness = missingness,
  comparison = comparison,
  method = "t-test",
  retain_columns = c(protein, change_peptide)
)
#> [1/2] Create input for t-tests ... 
#> DONE
#> [2/2] Calculate t-tests ... 
#> DONE

head(diff, n = 10)
#> # A tibble: 10 × 10
#>    protein   change_peptide comparison     peptide missingness    pval std_error
#>    <chr>     <lgl>          <chr>          <chr>   <chr>         <dbl>     <dbl>
#>  1 protein_5 TRUE           condition_1_v… peptid… complete    9.38e-9   0.0557 
#>  2 protein_3 TRUE           condition_1_v… peptid… complete    7.01e-7   0.0919 
#>  3 protein_1 TRUE           condition_1_v… peptid… complete    6.01e-6   0.0670 
#>  4 protein_1 FALSE          condition_1_v… peptid… MAR         5.12e-2   0.00809
#>  5 protein_4 TRUE           condition_1_v… peptid… MAR         6.66e-2   0.308  
#>  6 protein_2 FALSE          condition_1_v… peptid… complete    7.77e-2   0.275  
#>  7 protein_9 FALSE          condition_1_v… peptid… MAR         1.89e-1   0.486  
#>  8 protein_3 FALSE          condition_1_v… peptid… complete    2.23e-1   0.0752 
#>  9 protein_1 FALSE          condition_1_v… peptid… MAR         2.23e-1   0.0466 
#> 10 protein_3 FALSE          condition_1_v… peptid… complete    2.71e-1   0.0843 
#> # ℹ 3 more variables: diff <dbl>, n_obs <int>, adj_pval <dbl>