Assignment of missingness types: assign_missingness

The type of missingness (missing at random, missing not at random) is assigned based on the comparison of a reference condition and every other condition.

assign_missingness(
  data,
  sample,
  condition,
  grouping,
  intensity,
  ref_condition = "all",
  completeness_MAR = 0.7,
  completeness_MNAR = 0.2,
  retain_columns = NULL
)

Arguments

data: a data frame containing at least the input variables.
sample: a character column in the data data frame that contains the sample name.
condition: a character or numeric column in the data data frame that contains the conditions.
grouping: a character column in the data data frame that contains protein, precursor or peptide identifiers.
intensity: a numeric column in the data data frame that contains intensity values that relate to the grouping variable.
ref_condition: a character vector providing the condition that is used as a reference for missingness determination. Instead of providing one reference condition, "all" can be supplied, which will create all pairwise condition pairs. By default ref_condition = "all".
completeness_MAR: a numeric value that specifies the minimal degree of data completeness to be considered as MAR. Value has to be between 0 and 1, default is 0.7. It is multiplied with the number of replicates and then adjusted downward. The resulting number is the minimal number of observations for each condition to be considered as MAR. This number is always at least 1.
completeness_MNAR: a numeric value that specifies the maximal degree of data completeness to be considered as MNAR. Value has to be between 0 and 1, default is 0.2. It is multiplied with the number of replicates and then adjusted downward. The resulting number is the maximal number of observations for one condition to be considered as MNAR when the other condition is complete.
retain_columns: a vector that indicates columns that should be retained from the input data frame. By default no additional columns are retained (retain_columns = NULL). Specific columns can be retained by providing their names in a vector, without quotation marks, just like other column names, for example retain_columns = c(protein).
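The arithmetic behind completeness_MAR and completeness_MNAR can be illustrated with a minimal sketch. The helper names min_obs_mar() and max_obs_mnar() are hypothetical and not part of the package, and floor() is an assumption for what "adjusted downward" means in the argument descriptions.

```r
# Hypothetical helpers for illustration only; not part of the package.
# Assumption: "adjusted downward" corresponds to floor().

# Minimal number of observations per condition to qualify as MAR.
# The result is never below 1, as stated for completeness_MAR.
min_obs_mar <- function(n_replicates, completeness_MAR = 0.7) {
  max(floor(n_replicates * completeness_MAR), 1)
}

# Maximal number of observations in one condition to qualify as MNAR
# when the other condition is complete.
max_obs_mnar <- function(n_replicates, completeness_MNAR = 0.2) {
  floor(n_replicates * completeness_MNAR)
}

min_obs_mar(4)   # 2
max_obs_mnar(4)  # 0
min_obs_mar(1)   # 1
```

Under these assumptions, four replicates with the default thresholds require at least two observations per condition for MAR, and MNAR requires that one condition has no observations at all while the other is complete.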

Value

A data frame that contains the reference condition paired with each treatment condition. The comparison column contains the comparison name for the specific treatment/reference pair. The missingness column reports the type of missingness.

"complete": No missing values for every replicate of this reference/treatment pair for the specific grouping variable.
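"MNAR": Missing not at random. One condition of the reference/treatment pair is complete while the other has at most the number of observations allowed by completeness_MNAR for the specific grouping variable. With the default of 0.2 and fewer than five replicates, this means that all replicates of one condition have missing values.
"MAR": Missing at random. Both conditions of the reference/treatment pair have at least the minimal number of observations defined by completeness_MAR for the specific grouping variable, but the pair is not complete.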
NA: The comparison is not complete enough to fall into any other category. It will not be imputed if imputation is performed. For statistical significance testing these comparisons are filtered out after the test and prior to p-value adjustment. This can be prevented by setting filter_NA_missingness = FALSE in the calculate_diff_abundance() function.
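The categories above can be sketched as a small decision function for a single reference/treatment pair. This is an illustration under stated assumptions, not the package's implementation: classify_pair() and its arguments are hypothetical names, floor() is assumed for the downward adjustment, and the order in which the categories are checked is also an assumption.

```r
# Hypothetical helper for illustration only; not part of the package.
# n_obs_*: number of observed (non-missing) intensities per condition.
# n_rep_*: number of replicates per condition.
classify_pair <- function(n_obs_ref, n_obs_trt, n_rep_ref, n_rep_trt,
                          completeness_MAR = 0.7, completeness_MNAR = 0.2) {
  # Thresholds derived from the completeness arguments (floor() is assumed).
  mar_ref <- max(floor(n_rep_ref * completeness_MAR), 1)
  mar_trt <- max(floor(n_rep_trt * completeness_MAR), 1)
  mnar_ref <- floor(n_rep_ref * completeness_MNAR)
  mnar_trt <- floor(n_rep_trt * completeness_MNAR)

  if (n_obs_ref == n_rep_ref && n_obs_trt == n_rep_trt) {
    # No missing values in either condition.
    "complete"
  } else if ((n_obs_ref <= mnar_ref && n_obs_trt == n_rep_trt) ||
             (n_obs_trt <= mnar_trt && n_obs_ref == n_rep_ref)) {
    # One condition is complete, the other is (almost) entirely missing.
    "MNAR"
  } else if (n_obs_ref >= mar_ref && n_obs_trt >= mar_trt) {
    # Both conditions are sufficiently complete, but not fully complete.
    "MAR"
  } else {
    # Not complete enough for any other category.
    NA_character_
  }
}

classify_pair(4, 4, 4, 4)  # "complete"
classify_pair(0, 4, 4, 4)  # "MNAR"
classify_pair(3, 2, 4, 4)  # "MAR"
classify_pair(1, 3, 4, 4)  # NA
```

In the last call neither condition is complete and the reference has fewer than two observations, so under these assumptions the pair falls into none of the named categories.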

The type of missingness influences how values are imputed if imputation is performed subsequently using the impute() function. How each type of missingness is imputed is described in the documentation of that function. The type of missingness assigned to a comparison does not have any influence on the statistical test in the calculate_diff_abundance() function.

Examples

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

head(data, n = 24)
#> # A tibble: 24 × 8
#>    protein   peptide    condition sample peptide_intensity change change_peptide
#>    <chr>     <chr>      <chr>     <chr>              <dbl> <lgl>  <lgl>         
#>  1 protein_1 peptide_1… conditio… sampl…              16.8 TRUE   TRUE          
#>  2 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  3 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  4 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  5 protein_1 peptide_1… conditio… sampl…              15.8 TRUE   TRUE          
#>  6 protein_1 peptide_1… conditio… sampl…              15.9 TRUE   TRUE          
#>  7 protein_1 peptide_1… conditio… sampl…              16.1 TRUE   TRUE          
#>  8 protein_1 peptide_1… conditio… sampl…              15.9 TRUE   TRUE          
#>  9 protein_1 peptide_1… conditio… sampl…              12.6 TRUE   FALSE         
#> 10 protein_1 peptide_1… conditio… sampl…              12.7 TRUE   FALSE         
#> # ℹ 14 more rows
#> # ℹ 1 more variable: peptide_intensity_missing <dbl>

# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein)
)
#> "all" was provided as reference condition. All pairwise comparisons are
#> created from the conditions and assigned their missingness. The created
#> comparisons are:
#> condition_1_vs_condition_2

head(data_missing, n = 24)
#> # A tibble: 24 × 7
#>    protein   sample   condition   peptide     peptide_intensity_mis…¹ comparison
#>    <chr>     <chr>    <chr>       <chr>                         <dbl> <chr>     
#>  1 protein_1 sample_1 condition_1 peptide_1_1                    16.8 condition…
#>  2 protein_1 sample_2 condition_1 peptide_1_1                    17.0 condition…
#>  3 protein_1 sample_3 condition_1 peptide_1_1                    17.0 condition…
#>  4 protein_1 sample_4 condition_1 peptide_1_1                    17.0 condition…
#>  5 protein_1 sample_5 condition_2 peptide_1_1                    15.8 condition…
#>  6 protein_1 sample_6 condition_2 peptide_1_1                    15.9 condition…
#>  7 protein_1 sample_7 condition_2 peptide_1_1                    16.1 condition…
#>  8 protein_1 sample_8 condition_2 peptide_1_1                    15.9 condition…
#>  9 protein_1 sample_1 condition_1 peptide_1_2                    NA   condition…
#> 10 protein_1 sample_2 condition_1 peptide_1_2                    NA   condition…
#> # ℹ 14 more rows
#> # ℹ abbreviated name: ¹peptide_intensity_missing
#> # ℹ 1 more variable: missingness <chr>