Function for fitting four-parameter dose response curves for each group (precursor, peptide or protein). In addition it can filter data based on completeness, the completeness distribution and statistical testing using ANOVA.

fit_drc_4p(
data,
sample,
grouping,
response,
dose,
filter = "post",
replicate_completeness = 0.7,
condition_completeness = 0.5,
correlation_cutoff = 0.8,
log_logarithmic = TRUE,
include_models = FALSE,
retain_columns = NULL
)

## Arguments

data a data frame that contains at least the input variables. a character column in the data data frame that contains the sample names. a character column in the data data frame that contains the precursor, peptide or protein identifiers. a numeric column in the data data frame that contains the response values, e.g. log2 transformed intensities. a numeric column in the data data frame that contains the dose values, e.g. the treatment concentrations. a character value that determines if models should be filtered and if they should be filtered before or after the curve fits. Filtering of models can be skipped with filter = "none". Data can be filtered prior to model fitting with filter = "pre". In that case models will only be fitted for data that passed the filtering step. This will allow for faster model fitting since only fewer models will be fit. If you plan on performing an enrichment analysis you have to choose filter = "post". All models will be fit (even the ones that do not pass the filtering criteria). For enrichment analysis you should use both good (i.e. models that pass the filtering) and bad (i.e. models that do not pass the filtering) models. Therefore, for post-filtering the full list is returned and it will only contain annotations that indicate (passed_filter) if the filtering was passed or not. Default is "post". For ANOVA an adjusted p-value of 0.05 is used as a cutoff. a numeric value which similar to completenss_MAR of the assign_missingness function sets a threshold for the completeness of data. In contrast to assign_missingness it only determines the completeness for one condition and not the comparison of two conditions. The threshold is used to calculate a minimal degree of data completeness. The value provided to this argument has to be between 0 and 1, default is 0.7. It is multiplied with the number of replicates and then adjusted downward. The resulting number is the minimal number of observations that a condition needs to have to be considered "complete enough" for the condition_completeness argument. a numeric value which determines how many conditions need to at least fulfill the "complete enough" criteria set with replicate_completeness. The value provided to this argument has to be between 0 and 1, default is 0.5. It is multiplied with the number of conditions and then adjusted downward. The resulting number is the minimal number of conditions that need to fulfill the replicate_completeness argument for a peptide to pass the filtering. a numeric vector that specifies the correlation cutoff used for data filtering. a logical value that indicates if a logarithmic or log-logarithmic model is fitted. If response values form a symmetric curve for non-log transformed dose values, a logarithmic model instead of a log-logarithmic model should be used. Usually biological dose response data has a log-logarithmic distribution, which is the reason this is the default. Log-logarithmic models are symmetric if dose values are log transformed. a logical value that indicates if model fit objects should be exported. These are usually very large and not necessary for further analysis. a vector that specifies columns that should be retained from the input data frame. Default is not retaining additional columns retain_columns = NULL. Specific columns can be retained by providing their names (not in quotations marks, just like other column names, but in a vector).

## Value

If include_models = FALSE a data frame is returned that contains correlations of predicted to measured values as a measure of the goodness of the curve fit, an associated p-value and the four parameters of the model for each group. Furthermore, input data for plots is returned in the columns plot_curve (curve and confidence interval) and plot_points (measured points). If \ codeinclude_models = TURE, a list is returned that contains:

• fit_objects: The fit objects of type drc for each group.

• correlations: The correlation data frame described above

## Details

If data filtering options are selected, data is filtered based on multiple criteria. In general, curves are only fitted if there are at least 5 conditions with data points present to ensure that there is potential for a good curve fit. Therefore, this is also the case if no filtering option is selected. Furthermore, a completeness cutoff is defined for filtering. By default each entity (e.g. precursor) is filtered to contain at least 70% of total replicates (adjusted downward) for at least 50% of all conditions (adjusted downward). This can be adjusted with the according arguments. In addition to the completeness cutoff, also a significance cutoff is applied. ANOVA is used to compute the statistical significance of the change for each entity. The resulting p-value is adjusted using the Benjamini-Hochberg method and a cutoff of q <= 0.05 is applied. Curve fits that have a minimal value that is higher than the maximal value are excluded as they were likely wrongly fitted. Curves with a correlation below 0.8 are not passing the filtering. If a fit does not fulfill the significance or completeness cutoff, it has a chance to still be considered if half of its values (+/-1 value) pass the replicate completeness criteria and half do not pass it. In order to fall into this category, the values that fulfill t he completeness cutoff and the ones that do not fulfill it need to be consecutive, meaning located next to each other based on their concentration values. Furthermore, the values that do not pass the completeness cutoff need to be lower in intensity. Lastly, the difference between the two groups is tested for statistical significance using a Welch's t-test and a cutoff of p <= 0.1 (we want to mainly discard curves that falsly fit the other criteria but that have clearly non-significant differences in mean). This allows curves to be considered that have missing values in half of their observations due to a decrease in intensity. It can be thought of as conditions that are missing not at random (MNAR). It is often the case that those entities do not have a significant p-value since half of their conditions are not considered due to data missingness.

The final filtered list is ranked based on a score calculated on entities that pass the filter. The score is the negative log10 of the adjusted ANOVA p-value scaled between 0 and 1 and the correlation scaled between 0 and 1 summed up and divided by 2. Thus, the highest score an entity can have is 1 with both the highest correlation and adjusted p-value. The rank is corresponding to this score. Please note, that entities with MNAR conditions might have a lower score due to the missing or non-significant ANOVA p-value. You should have a look at curves that are TRUE for dose_MNAR in more detail.

## Examples

# \donttest{
library(dplyr)

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
n_proteins = 2,
frac_change = 1,
n_replicates = 3,
n_conditions = 8,
method = "dose_response",
concentrations = c(0, 1, 10, 50, 100, 500, 1000, 5000),
)

# Perform dose response curve fit
drc_fit <- fit_drc_4p(
data = data,
sample = sample,
grouping = peptide,
response = peptide_intensity_missing,
dose = concentration,
retain_columns = c(protein, change_peptide)
)

glimpse(drc_fit)
#> Rows: 39
#> Columns: 19
#> $rank <dbl> 1, 2, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … #>$ protein           <chr> "protein_2", "protein_2", "protein_1", "protein_2", …
#> $change_peptide <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, F… #>$ peptide           <chr> "peptide_2_1", "peptide_2_3", "peptide_1_1", "peptid…
#> $hill_coefficient <dbl> 2.226694e+00, 3.280387e+00, 1.946307e+00, -3.310549e… #>$ min_model         <dbl> 14.87823, 12.46267, 15.98840, 19.14721, 17.38775, 14…
#> $max_model <dbl> 19.12933, 16.18910, 17.09802, 19.80934, 17.91039, 15… #>$ ec_50             <dbl> 2.453723e+02, 1.417578e+02, 9.428132e+01, 6.293053e+…
#> $correlation <dbl> 0.9986256, 0.9938863, 0.9668223, 0.9280152, 0.769789… #>$ pval              <dbl> 1.131609e-29, 8.423311e-15, 1.584597e-14, 6.625673e-…
#> $plot_curve <list> [<tbl_df[100 x 4]>], [<tbl_df[100 x 4]>], [<tbl_df[… #>$ plot_points       <list> [<tbl_df[24 x 2]>], [<tbl_df[16 x 2]>], [<tbl_df[24…
#> $enough_conditions <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE… #>$ anova_significant <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FA…
#> $dose_MNAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS… #>$ passed_filter     <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, …
#> $anova_pval <dbl> 3.666126e-20, 1.471594e-05, 1.966510e-08, 2.858086e-… #>$ anova_adj_pval    <dbl> 1.613095e-18, 1.618753e-04, 4.326322e-07, 4.191860e-…
#> \$ score             <dbl> 1.00000000, 0.46644068, 0.37320103, 0.05860592, NA, …
#> # A tibble: 10 × 19
#>     rank protein   change_peptide peptide   hill_coefficient min_model max_model
#>    <dbl> <chr>     <lgl>          <chr>                <dbl>     <dbl>     <dbl>
#>  1     1 protein_2 TRUE           peptide_…        2.23           14.9      19.1
#>  2     2 protein_2 TRUE           peptide_…        3.28           12.5      16.2
#>  3     3 protein_1 TRUE           peptide_…        1.95           16.0      17.1
#>  4     4 protein_2 TRUE           peptide_…       -3.31           19.1      19.8
#>  5    NA protein_1 TRUE           peptide_…       -4.79           17.4      17.9
#>  6    NA protein_1 FALSE          peptide_…        2.74           14.8      15.2
#>  7    NA protein_1 FALSE          peptide_…       -0.0110         15.3      14.7
#>  8    NA protein_1 FALSE          peptide_…        3.42           16.8      17.2
#>  9    NA protein_1 FALSE          peptide_…       -0.0000174      16.1      15.4
#> 10    NA protein_1 FALSE          peptide_…       -1.35           16.2      16.5
#> # … with 12 more variables: ec_50 <dbl>, correlation <dbl>, pval <dbl>,
#> #   plot_curve <list>, plot_points <list>, enough_conditions <lgl>,
#> #   anova_significant <lgl>, dose_MNAR <lgl>, passed_filter <lgl>,
#> #   anova_pval <dbl>, anova_adj_pval <dbl>, score <dbl># }