Imputation of missing values — impute • protti

impute calculates imputation values for missing data, depending on the selected method.

impute(
  data,
  sample,
  grouping,
  intensity_log2,
  condition,
  comparison = comparison,
  missingness = missingness,
  noise = NULL,
  method = "ludovic",
  skip_log2_transform_error = FALSE,
  retain_columns = NULL
)

Arguments

data: a data frame that is ideally the output of the assign_missingness function. It should contain at least the input variables. For each "reference_vs_treatment" comparison, there should be a pair of the reference and treatment condition. That means the reference condition should be duplicated once for every treatment.
sample: a character column in the data data frame that contains the sample names.
grouping: a character column in the data data frame that contains the precursor or peptide identifiers.
intensity_log2: a numeric column in the data data frame that contains the intensity values.
condition: a character or numeric column in the data data frame that contains the conditions.
comparison: a character column in the data data frame that contains the comparisons of treatment/reference pairs. This is an output of the assign_missingness function.
missingness: a character column in the data data frame that contains the missingness type of the data. The missingness type determines how values for imputation are sampled. This should at least contain "MAR" or "MNAR". Missingness assigned as NA will not be imputed.
noise: a numeric column in the data data frame that contains the noise value for the precursor/peptide. Only required if method = "noise". Note: noise values need to be log2 transformed.
method: a character value that specifies the method to be used for imputation. For method = "ludovic", MNAR missingness is sampled from a normal distribution centered three log2 units below the lowest intensity value recorded for the precursor/peptide, with a spread equal to the mean standard deviation for the precursor/peptide. For method = "noise", MNAR missingness is sampled from a normal distribution centered on the mean noise for the precursor/peptide, with a spread equal to the mean standard deviation (from each condition) for the precursor/peptide. Both methods impute MAR data using the mean and variance of the condition with the missing data.
skip_log2_transform_error: a logical value that determines whether a check is performed to validate that input values are log2 transformed. If input values are > 40, the check fails and an error is returned.
retain_columns: a vector that indicates columns that should be retained from the input data frame. By default, no additional columns are retained (retain_columns = NULL). Specific columns can be retained by providing their names (not in quotation marks, just like other column names, but in a vector).
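
The sampling described for method = "ludovic" can be illustrated with a small base R sketch. This is not protti's internal code. The data frame, the condition names and the assumption that the fully missing condition is MNAR while the single gap in the other condition is MAR are all invented for illustration.

```r
set.seed(42) # Makes the sketch reproducible

# Hypothetical log2 intensities for one peptide in two conditions
intensities <- data.frame(
  condition = rep(c("control", "treated"), each = 4),
  intensity_log2 = c(16.8, 17.0, NA, 17.1, NA, NA, NA, NA)
)

observed <- intensities[!is.na(intensities$intensity_log2), ]

# Standard deviation per condition (only conditions with observations),
# then averaged to get the spread used for MNAR sampling
sd_per_condition <- tapply(observed$intensity_log2, observed$condition, sd)
mean_sd <- mean(sd_per_condition)

# MNAR (assumed here for "treated"): normal distribution centered
# three log2 units below the lowest observed intensity
mnar_center <- min(observed$intensity_log2) - 3
n_mnar <- sum(is.na(intensities$intensity_log2) & intensities$condition == "treated")
mnar_values <- rnorm(n_mnar, mean = mnar_center, sd = mean_sd)

# MAR (assumed here for "control"): normal distribution with the mean and
# standard deviation of the condition that has the missing value
control <- observed$intensity_log2[observed$condition == "control"]
n_mar <- sum(is.na(intensities$intensity_log2) & intensities$condition == "control")
mar_values <- rnorm(n_mar, mean = mean(control), sd = sd(control))
```

In this sketch the MNAR values are drawn around 13.8 (16.8 minus 3), well below the observed range, while the MAR value is drawn from within the range of the control condition.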

Value

A data frame that contains an imputed_intensity and imputed column in addition to the required input columns. The imputed column indicates if a value was imputed. The imputed_intensity column contains imputed intensity values for previously missing intensities.
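
As a rough illustration of how the two added columns might be used, the following base R sketch works on a hand-made stand-in for the output. All values are invented, and it assumes that imputed_intensity also carries the observed value for rows that were not imputed.

```r
# Stand-in for the output of impute(); all values are invented for illustration
imputed_example <- data.frame(
  peptide = rep("peptide_1_1", 4),
  condition = c("condition_1", "condition_1", "condition_2", "condition_2"),
  peptide_intensity_missing = c(16.8, NA, NA, 15.9),
  imputed_intensity = c(16.8, 16.9, 13.1, 15.9), # assumption: observed values are kept
  imputed = c(FALSE, TRUE, TRUE, FALSE)
)

# Number of imputed values per condition
imputed_counts <- table(imputed_example$condition[imputed_example$imputed])

# Rows whose intensity was filled in by imputation
imputed_rows <- imputed_example[imputed_example$imputed, ]
```

Keeping the logical imputed column alongside the intensities makes it possible to report how much of a downstream result rests on imputed rather than measured values.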

Examples

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

head(data, n = 24)
#> # A tibble: 24 × 8
#>    protein   peptide    condition sample peptide_intensity change change_peptide
#>    <chr>     <chr>      <chr>     <chr>              <dbl> <lgl>  <lgl>         
#>  1 protein_1 peptide_1… conditio… sampl…              16.8 TRUE   TRUE          
#>  2 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  3 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  4 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  5 protein_1 peptide_1… conditio… sampl…              15.8 TRUE   TRUE          
#>  6 protein_1 peptide_1… conditio… sampl…              15.9 TRUE   TRUE          
#>  7 protein_1 peptide_1… conditio… sampl…              16.1 TRUE   TRUE          
#>  8 protein_1 peptide_1… conditio… sampl…              15.9 TRUE   TRUE          
#>  9 protein_1 peptide_1… conditio… sampl…              12.6 TRUE   FALSE         
#> 10 protein_1 peptide_1… conditio… sampl…              12.7 TRUE   FALSE         
#> # ℹ 14 more rows
#> # ℹ 1 more variable: peptide_intensity_missing <dbl>

# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, peptide_intensity)
)
#> "all" was provided as reference condition. All pairwise comparisons are
#> created from the conditions and assigned their missingness. The created
#> comparisons are:
#> condition_1_vs_condition_2

head(data_missing, n = 24)
#> # A tibble: 24 × 8
#>    protein   peptide_intensity sample   condition peptide peptide_intensity_mi…¹
#>    <chr>                 <dbl> <chr>    <chr>     <chr>                    <dbl>
#>  1 protein_1              16.8 sample_1 conditio… peptid…                   16.8
#>  2 protein_1              17.0 sample_2 conditio… peptid…                   17.0
#>  3 protein_1              17.0 sample_3 conditio… peptid…                   17.0
#>  4 protein_1              17.0 sample_4 conditio… peptid…                   17.0
#>  5 protein_1              15.8 sample_5 conditio… peptid…                   15.8
#>  6 protein_1              15.9 sample_6 conditio… peptid…                   15.9
#>  7 protein_1              16.1 sample_7 conditio… peptid…                   16.1
#>  8 protein_1              15.9 sample_8 conditio… peptid…                   15.9
#>  9 protein_1              12.6 sample_1 conditio… peptid…                   NA  
#> 10 protein_1              12.7 sample_2 conditio… peptid…                   NA  
#> # ℹ 14 more rows
#> # ℹ abbreviated name: ¹peptide_intensity_missing
#> # ℹ 2 more variables: comparison <chr>, missingness <chr>

# Perform imputation
data_imputed <- impute(
  data_missing,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  condition = condition,
  comparison = comparison,
  missingness = missingness,
  method = "ludovic",
  retain_columns = c(protein, peptide_intensity)
)

head(data_imputed, n = 24)
#> # A tibble: 24 × 10
#>    protein   peptide_intensity sample   peptide peptide_intensity_mi…¹ condition
#>    <chr>                 <dbl> <chr>    <chr>                    <dbl> <chr>    
#>  1 protein_1              16.8 sample_1 peptid…                   16.8 conditio…
#>  2 protein_1              17.0 sample_2 peptid…                   17.0 conditio…
#>  3 protein_1              17.0 sample_3 peptid…                   17.0 conditio…
#>  4 protein_1              17.0 sample_4 peptid…                   17.0 conditio…
#>  5 protein_1              15.8 sample_5 peptid…                   15.8 conditio…
#>  6 protein_1              15.9 sample_6 peptid…                   15.9 conditio…
#>  7 protein_1              16.1 sample_7 peptid…                   16.1 conditio…
#>  8 protein_1              15.9 sample_8 peptid…                   15.9 conditio…
#>  9 protein_1              12.6 sample_1 peptid…                   NA   conditio…
#> 10 protein_1              12.7 sample_2 peptid…                   NA   conditio…
#> # ℹ 14 more rows
#> # ℹ abbreviated name: ¹peptide_intensity_missing
#> # ℹ 4 more variables: comparison <chr>, missingness <chr>,
#> #   imputed_intensity <dbl>, imputed <lgl>
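
# The example above uses intensities that are already log2 transformed.
# For your own data, a check similar to the one described for
# skip_log2_transform_error can be approximated as follows. This is a sketch
# only: check_log2_transformed() is a hypothetical helper, not a protti
# function, and it assumes that a single value above 40 is enough to fail.

```r
# Hypothetical helper that mimics the documented > 40 check
check_log2_transformed <- function(values, skip = FALSE) {
  if (!skip && any(values > 40, na.rm = TRUE)) {
    stop("Intensities above 40 found; input does not appear to be log2 transformed.")
  }
  invisible(TRUE)
}

# Invented raw intensities with one missing value
raw_intensity <- c(1.2e5, 3.4e6, NA)

# log2 transform before imputation; NA values stay NA
intensity_log2 <- log2(raw_intensity)

# Passes silently for log2 values, would stop with an error for raw_intensity
check_log2_transformed(intensity_log2)
```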