Imputation of Missing Values Using Random Forest

impute_randomforest imputes missing values in the data using the random forest-based method implemented in the missForest package.

impute_randomforest(
  data,
  sample,
  grouping,
  intensity_log2,
  retain_columns = NULL,
  ...
)

Arguments

data: A data frame that contains the input variables. This should include columns for the sample names, precursor or peptide identifiers, and intensity values.
sample: A character column in the data frame passed as data that contains the sample names.
grouping: A character column in the data frame passed as data that contains the precursor or peptide identifiers.
intensity_log2: A numeric column in the data frame passed as data that contains the log2-transformed intensity values. Missing values are expected to be NA.
retain_columns: A character vector naming columns of the input data frame that should be kept in the output alongside the imputed values. By default no additional columns are retained (retain_columns = NULL).
...: Additional parameters passed on to the missForest function, for example the number of trees per forest (ntree) or the maximum number of iterations (maxiter).

Value

A data frame that contains an imputed_intensity column with the imputed values and an imputed column indicating whether each value was imputed (TRUE) or not (FALSE), in addition to any columns retained via retain_columns.
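
The imputed flag makes it easy to check how much of the data was filled in. The following is a minimal base R sketch on a small hand-made data frame that only mimics the documented output columns; the values are hypothetical and do not come from a real call to impute_randomforest.

```r
# Hypothetical stand-in for the output of impute_randomforest();
# the values are made up purely to illustrate the column layout.
imputed_example <- data.frame(
  sample = c("sample_1", "sample_1", "sample_2", "sample_2"),
  peptide = c("peptide_1_1", "peptide_1_2", "peptide_1_1", "peptide_1_2"),
  imputed_intensity = c(16.8, 12.7, 16.5, 12.9),
  peptide_intensity_missing = c(16.8, NA, 16.5, 12.9),
  imputed = c(FALSE, TRUE, FALSE, FALSE)
)

# Fraction of values flagged as imputed, per sample
imputed_fraction <- tapply(imputed_example$imputed, imputed_example$sample, mean)
imputed_fraction

# Sanity check on the toy frame: rows that are not flagged as imputed
# carry the measured value in imputed_intensity
observed <- !imputed_example$imputed
values_kept <- all(
  imputed_example$imputed_intensity[observed] ==
    imputed_example$peptide_intensity_missing[observed]
)
values_kept
```

The same two checks can be run on a real result by replacing imputed_example with the returned data frame and peptide_intensity_missing with the name of the intensity column that was imputed.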

Details

The function imputes missing values with random forests that predict each missing value from the other values in the dataset. For each variable with missing data, a random forest is trained on the observations where that variable is present, using the remaining variables as predictors, and is then used to predict the missing entries. Following missForest, this is repeated over all variables until the stopping criterion is met or the maximum number of iterations (maxiter) is reached.
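
The procedure described above can be tried directly with missForest on a small table. This is a sketch only: the layout with samples as rows and peptides as columns, and all object names, are assumptions made for illustration and not necessarily how impute_randomforest arranges the data internally.

```r
library(missForest)

set.seed(123)

# Toy table: rows are samples, columns are peptides (hypothetical layout)
intensities <- as.data.frame(matrix(
  rnorm(60, mean = 16, sd = 1),
  nrow = 12,
  dimnames = list(paste0("sample_", 1:12), paste0("peptide_", 1:5))
))

# Blank out a few values and remember where they were
intensities_missing <- intensities
intensities_missing[2, 1] <- NA
intensities_missing[5, 3] <- NA
intensities_missing[9, 4] <- NA
was_missing <- is.na(intensities_missing)

# missForest() fills the gaps with an initial guess, then repeatedly fits one
# random forest per column with missing values and updates the predictions
fit <- missForest(intensities_missing, ntree = 50, maxiter = 5)

completed <- fit$ximp # table with the missing cells filled in
fit$OOBerror          # out-of-bag estimate of the imputation error
```

Only the cells that were NA are replaced; observed values are returned unchanged in fit$ximp.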

In addition to the imputed values, users can choose to retain additional columns from the original input data frame that were not part of the imputation process.
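
As an illustration, the call below keeps the protein and condition columns of the synthetic example data. It assumes that retain_columns accepts column names as a character vector, as described under Arguments; the data set is the one used in the Examples section.

```r
library(protti)

set.seed(123)

# Synthetic data with a peptide_intensity_missing column containing NAs
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

# Keep protein and condition next to the imputed values
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  retain_columns = c("protein", "condition")
)

colnames(data_imputed)
```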

This function allows passing additional parameters to the underlying missForest function, such as controlling the number of trees used in the random forest models or specifying the stopping criteria. For a full list of parameters, refer to the missForest documentation.
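
For instance, ntree and maxiter can be supplied through ... to trade accuracy for run time. The values below are arbitrary choices for illustration, not recommendations, and the data set is the synthetic one from the Examples section.

```r
library(protti)

set.seed(123)

# Synthetic data with a peptide_intensity_missing column containing NAs
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

# ntree and maxiter are not arguments of impute_randomforest() itself;
# they are forwarded to missForest() through ...
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  ntree = 50,  # trees grown per forest
  maxiter = 5  # upper limit on imputation iterations
)
```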

To enable parallelisation, ensure that the doParallel package is installed and loaded:

install.packages("doParallel")
library(doParallel)

Then register the desired number of cores for parallel processing:

registerDoParallel(cores = 6)

To use parallelisation during the imputation, pass parallelize = "variables" to impute_randomforest. The argument is forwarded through ... to the missForest function.
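
Put together, a parallel run might look like the sketch below. The core count of 2 is an arbitrary choice for illustration, and the data set is the synthetic one from the Examples section.

```r
library(protti)
library(doParallel)

# Register a parallel backend; adjust the number of cores to the machine
registerDoParallel(cores = 2)

set.seed(123)

# Synthetic data with a peptide_intensity_missing column containing NAs
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

# parallelize is forwarded to missForest() through ...
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  parallelize = "variables"
)

# Release the workers that registerDoParallel() may have started
stopImplicitCluster()
```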

References

Stekhoven, D.J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597

Author

Elena Krismer

Examples

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

head(data, n = 24)
#> # A tibble: 24 × 8
#>    protein   peptide    condition sample peptide_intensity change change_peptide
#>    <chr>     <chr>      <chr>     <chr>              <dbl> <lgl>  <lgl>         
#>  1 protein_1 peptide_1… conditio… sampl…              16.8 TRUE   TRUE          
#>  2 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  3 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  4 protein_1 peptide_1… conditio… sampl…              17.0 TRUE   TRUE          
#>  5 protein_1 peptide_1… conditio… sampl…              15.8 TRUE   TRUE          
#>  6 protein_1 peptide_1… conditio… sampl…              15.9 TRUE   TRUE          
#>  7 protein_1 peptide_1… conditio… sampl…              16.1 TRUE   TRUE          
#>  8 protein_1 peptide_1… conditio… sampl…              15.9 TRUE   TRUE          
#>  9 protein_1 peptide_1… conditio… sampl…              12.6 TRUE   FALSE         
#> 10 protein_1 peptide_1… conditio… sampl…              12.7 TRUE   FALSE         
#> # ℹ 14 more rows
#> # ℹ 1 more variable: peptide_intensity_missing <dbl>

# Perform imputation
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing
)

head(data_imputed, n = 24)
#> # A tibble: 24 × 5
#>    sample   peptide      imputed_intensity peptide_intensity_missing imputed
#>    <chr>    <chr>                    <dbl>                     <dbl> <lgl>  
#>  1 sample_1 peptide_1_1               16.8                      16.8 FALSE  
#>  2 sample_1 peptide_1_2               12.7                      NA   TRUE   
#>  3 sample_1 peptide_1_3               15.9                      15.9 FALSE  
#>  4 sample_1 peptide_1_4               17.1                      17.1 FALSE  
#>  5 sample_1 peptide_1_5               16.6                      16.6 FALSE  
#>  6 sample_1 peptide_1_6               16.2                      16.2 FALSE  
#>  7 sample_1 peptide_1_7               14.7                      14.7 FALSE  
#>  8 sample_1 peptide_1_8               14.3                      NA   TRUE   
#>  9 sample_1 peptide_1_9               13.8                      13.8 FALSE  
#> 10 sample_1 peptide_1_10              16.3                      16.3 FALSE  
#> # ℹ 14 more rows