Source: R/impute_randomforest.R
impute_randomforest imputes missing values in the data using the random
forest-based method implemented in the missForest package.
impute_randomforest(
  data,
  sample,
  grouping,
  intensity_log2,
  retain_columns = NULL,
  ...
)
data: A data frame that contains the input variables. This should include columns for the sample names, precursor or peptide identifiers, and intensity values.
sample: A character column in the input data frame that contains the sample names.
grouping: A character column in the input data frame that contains the precursor or
peptide identifiers.
intensity_log2: A numeric column in the input data frame that contains the intensity
values (log2-transformed, as the argument name indicates).
retain_columns: A character vector indicating which columns should be retained from the
input data frame. These columns are preserved in the output alongside the imputed values.
By default, no additional columns are retained (retain_columns = NULL).
...: Additional parameters to pass to the missForest function. These parameters
can control aspects such as the number of trees (ntree) and the maximum number of
iterations (maxiter).
A data frame that contains an imputed_intensity column with the imputed values
and an imputed column indicating whether each value was imputed (TRUE) or not
(FALSE), in addition to any columns retained via retain_columns.
The function imputes missing values by building random forests, where missing values are predicted based on other available values within the dataset. For each variable with missing data, the function trains a random forest model using the available (non-missing) data in that variable, and subsequently predicts the missing values.
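As a rough illustration of the idea described above, the following sketch calls missForest directly on a small wide matrix. The layout (samples as rows, peptides as columns), the toy values, and all object names are assumptions made for this example; it is not necessarily how impute_randomforest prepares its input internally.

```r
# Minimal sketch: random forest imputation on a wide matrix with missForest.
# The matrix layout and all object names are illustrative assumptions.
library(missForest)

set.seed(123)

# Toy log2 intensities: 8 samples (rows) x 5 peptides (columns).
wide <- matrix(
  rnorm(40, mean = 15, sd = 1),
  nrow = 8,
  dimnames = list(paste0("sample_", 1:8), paste0("peptide_", 1:5))
)

# Knock out three values to mimic missing measurements.
wide_missing <- wide
wide_missing[cbind(c(2, 5, 7), c(1, 3, 4))] <- NA
was_missing <- is.na(wide_missing)

# Each column with NAs is predicted from the other columns, iteratively.
fit <- missForest(wide_missing)

imputed <- fit$ximp # completed matrix without NAs
fit$OOBerror        # out-of-bag estimate of the imputation error
```

Only the positions flagged in was_missing are filled in; observed values are carried over unchanged.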
In addition to the imputed values, users can choose to retain additional columns from the original input data frame that were not part of the imputation process.
This function allows passing additional parameters to the underlying missForest function,
such as controlling the number of trees used in the random forest models or specifying the
stopping criteria. For a full list of parameters, refer to the missForest documentation.
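A call that forwards extra arguments might look like the sketch below. It reuses the synthetic data from the examples on this page; the specific values ntree = 50 and maxiter = 5 are arbitrary choices for illustration, not recommendations.

```r
# Sketch: forwarding missForest arguments through `...`.
# ntree and maxiter values are arbitrary and chosen only for illustration.
library(protti)

set.seed(123)

data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  ntree = 50,  # passed on to missForest: trees per forest
  maxiter = 5  # passed on to missForest: maximum number of iterations
)
```

Fewer trees and iterations shorten the run time, usually at some cost in imputation accuracy.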
To enable parallelisation, install and load the doParallel package, then register the
desired number of cores for parallel processing (for example with registerDoParallel()).
To leverage parallelisation during the imputation, pass parallelize = "variables"
as an additional argument, which is forwarded to the missForest function.
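The parallel setup could be sketched as follows. The core count of 2 is an assumption for this example and should be adapted to the machine; the data are the same synthetic data used in the examples on this page.

```r
# Sketch: parallel imputation. The number of cores (2) is an assumption.
# Run install.packages("doParallel") first if the package is not installed.
library(protti)
library(doParallel)

# Register a parallel backend with the desired number of cores.
registerDoParallel(cores = 2)

set.seed(123)

data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  parallelize = "variables" # forwarded to missForest
)

# Release the implicitly created workers when done.
stopImplicitCluster()
```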
Stekhoven, D.J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597
set.seed(123) # Makes example reproducible
# Create example data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)
head(data, n = 24)
#> # A tibble: 24 × 8
#> protein peptide condition sample peptide_intensity change change_peptide
#> <chr> <chr> <chr> <chr> <dbl> <lgl> <lgl>
#> 1 protein_1 peptide_1… conditio… sampl… 16.8 TRUE TRUE
#> 2 protein_1 peptide_1… conditio… sampl… 17.0 TRUE TRUE
#> 3 protein_1 peptide_1… conditio… sampl… 17.0 TRUE TRUE
#> 4 protein_1 peptide_1… conditio… sampl… 17.0 TRUE TRUE
#> 5 protein_1 peptide_1… conditio… sampl… 15.8 TRUE TRUE
#> 6 protein_1 peptide_1… conditio… sampl… 15.9 TRUE TRUE
#> 7 protein_1 peptide_1… conditio… sampl… 16.1 TRUE TRUE
#> 8 protein_1 peptide_1… conditio… sampl… 15.9 TRUE TRUE
#> 9 protein_1 peptide_1… conditio… sampl… 12.6 TRUE FALSE
#> 10 protein_1 peptide_1… conditio… sampl… 12.7 TRUE FALSE
#> # ℹ 14 more rows
#> # ℹ 1 more variable: peptide_intensity_missing <dbl>
# Perform imputation
data_imputed <- impute_randomforest(
  data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing
)
head(data_imputed, n = 24)
#> # A tibble: 24 × 5
#> sample peptide imputed_intensity peptide_intensity_missing imputed
#> <chr> <chr> <dbl> <dbl> <lgl>
#> 1 sample_1 peptide_1_1 16.8 16.8 FALSE
#> 2 sample_1 peptide_1_2 12.7 NA TRUE
#> 3 sample_1 peptide_1_3 15.9 15.9 FALSE
#> 4 sample_1 peptide_1_4 17.1 17.1 FALSE
#> 5 sample_1 peptide_1_5 16.6 16.6 FALSE
#> 6 sample_1 peptide_1_6 16.2 16.2 FALSE
#> 7 sample_1 peptide_1_7 14.7 14.7 FALSE
#> 8 sample_1 peptide_1_8 14.3 NA TRUE
#> 9 sample_1 peptide_1_9 13.8 13.8 FALSE
#> 10 sample_1 peptide_1_10 16.3 16.3 FALSE
#> # ℹ 14 more rows
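To show how the imputed flag can be used downstream, the sketch below rebuilds the ten printed rows as a plain data frame (values copied from the output above, so they are rounded) and runs a few sanity checks. The object names are hypothetical, and the same checks could be run on data_imputed itself.

```r
# Sketch: sanity checks on the output, using the ten rows printed above.
# Values are copied from the printed (rounded) output; names are hypothetical.
out <- data.frame(
  sample = "sample_1",
  peptide = paste0("peptide_1_", 1:10),
  imputed_intensity = c(16.8, 12.7, 15.9, 17.1, 16.6, 16.2, 14.7, 14.3, 13.8, 16.3),
  peptide_intensity_missing = c(16.8, NA, 15.9, 17.1, 16.6, 16.2, 14.7, NA, 13.8, 16.3),
  imputed = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# The flag should be TRUE exactly where the original intensity was missing.
flag_matches_na <- identical(out$imputed, is.na(out$peptide_intensity_missing))

# Observed values should be carried over unchanged.
observed_unchanged <- all(
  out$imputed_intensity[!out$imputed] == out$peptide_intensity_missing[!out$imputed]
)

# Number of imputed values and the rows they belong to.
n_imputed <- sum(out$imputed)
imputed_rows <- out[out$imputed, c("peptide", "imputed_intensity")]
```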