R/create_synthetic_data.R
create_synthetic_data.Rd
This function creates a synthetic limited proteolysis proteomics dataset that can be used to test functions while knowing the ground truth.
create_synthetic_data(
n_proteins,
frac_change,
n_replicates,
n_conditions,
method = "effect_random",
concentrations = NULL,
median_offset_sd = 0.05,
mean_protein_intensity = 16.88,
sd_protein_intensity = 1.4,
mean_n_peptides = 12.75,
size_n_peptides = 0.9,
mean_sd_peptides = 1.7,
sd_sd_peptides = 0.75,
mean_log_replicates = -2.2,
sd_log_replicates = 1.05,
effect_sd = 2,
dropout_curve_inflection = 14,
dropout_curve_sd = -1.2,
additional_metadata = TRUE
)
n_proteins
a numeric value that specifies the number of proteins in the synthetic dataset.
frac_change
a numeric value that specifies the fraction of proteins that have a peptide changing in abundance. Currently, only one peptide per protein changes.
n_replicates
a numeric value that specifies the number of replicates per condition.
n_conditions
a numeric value that specifies the number of conditions.
method
a character value that specifies the method used for the random sampling of significantly changing peptides. If method = "effect_random", the effect for each condition is sampled randomly and conditions do not depend on each other. If method = "dose_response", the effect is sampled based on a dose response curve and conditions are related to each other depending on the curve shape. In this case the concentrations argument needs to be specified.
concentrations
a numeric vector of length equal to the number of conditions. It only needs to be specified if method = "dose_response". It allows equal sampling of peptide intensities by ensuring that the same positions of the dose response curves are sampled for each peptide, based on the provided concentrations.
median_offset_sd
a numeric value that specifies the standard deviation of the normal distribution that is used to sample inter-sample differences. Default: 0.05.
mean_protein_intensity
a numeric value that specifies the mean of the protein intensity distribution. Default: 16.88.
sd_protein_intensity
a numeric value that specifies the standard deviation of the protein intensity distribution. Default: 1.4.
mean_n_peptides
a numeric value that specifies the mean number of peptides per protein. Default: 12.75.
size_n_peptides
a numeric value that specifies the dispersion parameter (the shape parameter of the gamma mixing distribution) of the negative binomial distribution of peptides per protein. Because the variance of this distribution is mean + mean^2/size, the parameter can in theory be calculated as mean^2/(variance - mean). In practice it is better obtained by fitting a negative binomial distribution to real data, for example with the optim function (see Example section). Default: 0.9.
mean_sd_peptides
a numeric value that specifies the mean of the peptide intensity standard deviations within a protein. Default: 1.7.
sd_sd_peptides
a numeric value that specifies the standard deviation of the peptide intensity standard deviations within a protein. Default: 0.75.
mean_log_replicates, sd_log_replicates
numeric values that specify the meanlog and sdlog of the log-normal distribution of replicate standard deviations. They can be obtained by fitting a log-normal distribution to the replicate standard deviations of a real dataset, for example with the optim function (see Example section). Defaults: -2.2 and 1.05.
effect_sd
a numeric value that specifies the standard deviation of a normal distribution around mean = 0 that is used to sample the effect of significantly changing peptides. Default: 2.
dropout_curve_inflection
a numeric value that specifies the intensity inflection point of a probabilistic dropout curve that is used to sample intensity-dependent missing values. This argument determines how many missing values there are in the dataset. Default: 14.
dropout_curve_sd
a numeric value that specifies the standard deviation of the probabilistic dropout curve. It needs to be negative so that dropout is sampled towards low intensities. Default: -1.2.
additional_metadata
a logical value that determines if metadata such as protein coverage, missed cleavages and charge state should be sampled and added to the output. Default: TRUE.
A data frame that contains complete peptide intensities and peptide intensities with missing values that were created based on a probabilistic dropout curve.
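To make the roles of dropout_curve_inflection and dropout_curve_sd more concrete, the sketch below assumes that the dropout probability follows a cumulative normal curve centred on the inflection point, where a negative standard deviation means that low intensities are more likely to go missing. This is an illustration under that assumption, not the function's internal code; the helper name dropout_probability and the simulated intensities are hypothetical.

```r
# Hypothetical helper: probability that an intensity value goes missing,
# assuming a cumulative normal (probit-type) dropout curve.
dropout_probability <- function(intensity, inflection = 14, curve_sd = -1.2) {
  # A negative curve_sd flips the curve so that low intensities drop out more often.
  stats::pnorm(
    intensity,
    mean = inflection,
    sd = abs(curve_sd),
    lower.tail = curve_sd > 0
  )
}

set.seed(123)
# Simulated intensities using the default protein intensity mean and sd
intensity <- stats::rnorm(1000, mean = 16.88, sd = 1.4)
p_missing <- dropout_probability(intensity)

# Replace each value by NA with its own dropout probability
is_dropped <- stats::runif(length(intensity)) < p_missing
intensity_missing <- ifelse(is_dropped, NA, intensity)

# Fraction of missing values in this simulated sample
mean(is.na(intensity_missing))
```

Under this assumption, an intensity equal to the inflection point has a 50% chance of being missing, and moving the inflection point up increases the overall fraction of missing values.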
create_synthetic_data(
n_proteins = 10,
frac_change = 0.1,
n_replicates = 3,
n_conditions = 2
)
#> # A tibble: 1,332 × 14
#> protein peptide condition sample peptide_intensity change change_peptide
#> <chr> <chr> <chr> <chr> <dbl> <lgl> <lgl>
#> 1 protein_1 peptide_1… conditio… sampl… 13.9 TRUE TRUE
#> 2 protein_1 peptide_1… conditio… sampl… 13.9 TRUE TRUE
#> 3 protein_1 peptide_1… conditio… sampl… 14.0 TRUE TRUE
#> 4 protein_1 peptide_1… conditio… sampl… 13.1 TRUE TRUE
#> 5 protein_1 peptide_1… conditio… sampl… 13.3 TRUE TRUE
#> 6 protein_1 peptide_1… conditio… sampl… 13.4 TRUE TRUE
#> 7 protein_1 peptide_1… conditio… sampl… 18.3 TRUE TRUE
#> 8 protein_1 peptide_1… conditio… sampl… 18.1 TRUE TRUE
#> 9 protein_1 peptide_1… conditio… sampl… 18.1 TRUE TRUE
#> 10 protein_1 peptide_1… conditio… sampl… 18.1 TRUE TRUE
#> # ℹ 1,322 more rows
#> # ℹ 7 more variables: peptide_intensity_missing <dbl>, coverage <dbl>,
#> # n_missed_cleavage <int>, charge <dbl>, pep_type <chr>, peak_width <dbl>,
#> # retention_time <dbl>
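The dose_response method is called in the same way, with one concentration per condition. The sketch below uses hypothetical concentrations and derives n_conditions from the vector so that the two cannot disagree. The call is guarded with exists() so that the snippet also runs in a session where the function has not been loaded.

```r
# Hypothetical concentrations, one per condition
concentrations <- c(1, 10, 100, 1000)
n_conditions <- length(concentrations)

# Only call the function if it is available in the current session
if (exists("create_synthetic_data", mode = "function")) {
  dose_response_data <- create_synthetic_data(
    n_proteins = 10,
    frac_change = 0.1,
    n_replicates = 3,
    n_conditions = n_conditions,
    method = "dose_response",
    concentrations = concentrations
  )
}
```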
# determination of mean_n_peptides and size_n_peptides parameters based on real data (count)
# example peptide count per protein
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)
theta <- c(mu = 1, k = 1)
negbinom <- function(theta) {
-sum(stats::dnbinom(count, mu = theta[1], size = theta[2], log = TRUE))
}
fit <- stats::optim(theta, negbinom)
fit
#> $par
#> mu k
#> 1.699882 2.124010
#>
#> $value
#> [1] 17.50891
#>
#> $counts
#> function gradient
#> 57 NA
#>
#> $convergence
#> [1] 0
#>
#> $message
#> NULL
#>
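As a rough cross-check on the fitted values, the same two parameters can be estimated by the method of moments, using the fact that a negative binomial distribution with mean mu and dispersion size has variance mu + mu^2/size. The sketch below is self-contained and repeats the example counts; the names mu_moment and size_moment are illustrative. With so few observations the moment estimates are only expected to be in the same range as the maximum likelihood fit, not identical to it.

```r
# Example peptide counts per protein (same values as in the optim example)
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)

mu_moment <- mean(count)
variance <- stats::var(count)

# Solve variance = mu + mu^2 / size for size.
# This only works for overdispersed data (variance > mean).
size_moment <- mu_moment^2 / (variance - mu_moment)

c(mu = mu_moment, k = size_moment)
```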
# determination of mean_log_replicates and sd_log_replicates parameters
# based on real data (standard_deviations)
# example standard deviations of replicates
standard_deviations <- c(0.61, 0.54, 0.2, 1.2, 0.8, 0.3, 0.2, 0.6)
theta2 <- c(meanlog = 1, sdlog = 1)
lognorm <- function(theta2) {
-sum(stats::dlnorm(standard_deviations, meanlog = theta2[1], sdlog = theta2[2], log = TRUE))
}
fit2 <- stats::optim(theta2, lognorm)
fit2
#> $par
#> meanlog sdlog
#> -0.7606984 0.6093069
#>
#> $value
#> [1] 1.302677
#>
#> $counts
#> function gradient
#> 75 NA
#>
#> $convergence
#> [1] 0
#>
#> $message
#> NULL
#>
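For the log-normal case the maximum likelihood estimates also have a closed form: meanlog is the mean of the log-transformed values, and sdlog is their standard deviation computed with denominator n. The self-contained sketch below repeats the example values and can serve as a check on the optim result; the names meanlog_hat and sdlog_hat are illustrative.

```r
# Example standard deviations of replicates (same values as in the optim example)
standard_deviations <- c(0.61, 0.54, 0.2, 1.2, 0.8, 0.3, 0.2, 0.6)

log_sd <- log(standard_deviations)
meanlog_hat <- mean(log_sd)

# Maximum likelihood uses denominator n, not n - 1 as in stats::sd()
sdlog_hat <- sqrt(mean((log_sd - meanlog_hat)^2))

c(meanlog = meanlog_hat, sdlog = sdlog_hat)
```

Both estimates should agree closely with the values in fit2$par, since optim is minimising the same negative log-likelihood numerically.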