Simulation study to evaluate when Plasmode simulation is superior to parametric simulation in comparing classification methods on high-dimensional data

dc.contributor.author: Stolte, Marieka
dc.contributor.author: Schreck, Nicholas
dc.contributor.author: Slynko, Alla
dc.contributor.author: Saadati, Maral
dc.contributor.author: Benner, Axel
dc.contributor.author: Rahnenführer, Jörg
dc.contributor.author: Bommert, Andrea
dc.date.accessioned: 2025-07-03T19:53:38Z
dc.date.available: 2025-07-03T19:53:38Z
dc.date.issued: 2025
dc.description: © 2025 Stolte et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
dc.description.abstract: Simulation studies, especially neutral comparison studies, are crucial for evaluating and comparing statistical methods, as they investigate whether methods work as intended and can guide an appropriate method choice. Typically, the term simulation refers to parametric simulation, i.e., computer experiments using pseudo-random numbers. For these, the full data-generating process (DGP) and outcome-generating model (OGM) are known within the simulation. However, specifying realistic DGPs can be difficult in practice, leading to oversimplified assumptions. The problem is more severe for higher-dimensional data, as the number of parameters to specify typically increases with the number of variables in the data. Plasmode simulation, which combines resampling covariates from a real-life dataset from the DGP of interest with a specified OGM, is often claimed to solve this problem, since no explicit specification of the DGP is necessary. However, this claim is not well supported by empirical results. Here, parametric and Plasmode simulations are compared in the context of a method comparison study for binary classification methods. We focus on studies conducted with some specific data type or application in mind, whose true, unknown data-generating mechanism is mimicked. The performance of Plasmode and parametric comparison studies in estimating classifier performance is compared, as well as their ability to reproduce the true method ranking. The influence of misspecifications of the DGP on the results of parametric simulation, and of misspecifications of the OGM on the results of parametric and Plasmode simulation, is investigated. Moreover, different resampling strategies are compared for Plasmode comparison studies. The study finds that misspecifications of the DGP and OGM negatively influence the ability of the comparison studies to estimate the classification performances and method rankings. The best choice of the resampling strategy in Plasmode simulation depends on the concrete scenario.
dc.description.sponsorship: Research Training Group "Biostatistical Methods for High-Dimensional Data in Toxicology", RTG 2624 Project P1; Deutsche Forschungsgemeinschaft (DFG), German Research Foundation - 427806116.
dc.identifier.uri: https://doi.org/10.1371/journal.pone.0322887
dc.identifier.uri: https://hdl.handle.net/10012/21969
dc.language.iso: en
dc.publisher: Public Library of Science (PLOS)
dc.relation.ispartofseries: PLOS One; 20(6); e0322887
dc.rights: Attribution 4.0 International (en)
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: simulation and modeling
dc.subject: normal distribution
dc.subject: statistical distributions
dc.subject: computerized simulations
dc.subject: support vector machines
dc.subject: statistical data
dc.subject: statistical methods
dc.subject: probability distribution
dc.title: Simulation study to evaluate when Plasmode simulation is superior to parametric simulation in comparing classification methods on high-dimensional data
dc.type: Article
dcterms.bibliographicCitation: Stolte, M., Schreck, N., Slynko, A., Saadati, M., Benner, A., Rahnenführer, J., & Bommert, A. (2025). Simulation study to evaluate when Plasmode simulation is superior to parametric simulation in comparing classification methods on high-dimensional data. PLOS One, 20(6), e0322887. https://doi.org/10.1371/journal.pone.0322887
uws.contributor.affiliation1: Faculty of Mathematics
uws.contributor.affiliation2: Statistics and Actuarial Science
uws.peerReviewStatus: Reviewed
uws.scholarLevel: Faculty
uws.typeOfResource: Text (en)
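
The abstract distinguishes parametric simulation, where the covariate distribution (DGP) is specified explicitly, from Plasmode simulation, where covariates are resampled from a real dataset and only the outcome-generating model (OGM) is specified. The following is a minimal sketch of that distinction, not the authors' implementation. All function names, the standard normal DGP, the logistic OGM, the coefficient vector, and the stand-in "real" matrix are assumptions made purely for illustration.

```python
import numpy as np


def parametric_covariates(n, p, rng):
    """Parametric simulation: draw covariates from an explicitly specified DGP.

    Assumption for illustration: independent standard normal covariates.
    """
    return rng.standard_normal((n, p))


def plasmode_covariates(real_x, n, rng, replace=True):
    """Plasmode simulation: resample rows of a real covariate matrix.

    No DGP is specified; the dependence structure comes from the data itself.
    """
    idx = rng.choice(real_x.shape[0], size=n, replace=replace)
    return real_x[idx]


def generate_outcome(x, beta, intercept, rng):
    """Assumed OGM: binary outcome drawn from a logistic model."""
    prob = 1.0 / (1.0 + np.exp(-(intercept + x @ beta)))
    return rng.binomial(1, prob)


rng = np.random.default_rng(0)

# Hypothetical stand-in for a real dataset; a real study would load measured data.
real_x = rng.standard_normal((200, 50)) @ rng.standard_normal((50, 50)) * 0.1

# Hypothetical OGM coefficients: only the first five covariates affect the outcome.
beta = np.zeros(50)
beta[:5] = 1.0

# Parametric simulation: both DGP and OGM are specified.
x_par = parametric_covariates(100, 50, rng)
y_par = generate_outcome(x_par, beta, 0.0, rng)

# Plasmode simulation: covariates are resampled, only the OGM is specified.
x_pla = plasmode_covariates(real_x, 100, rng)
y_pla = generate_outcome(x_pla, beta, 0.0, rng)
```

In this sketch the same OGM is applied to both covariate sources, so any misspecification of `beta` or of the logistic form would affect both simulation types, whereas a poor choice of DGP would affect only the parametric branch. The `replace` flag is one hypothetical example of a resampling choice; the record does not state which resampling strategies the study compares.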

Files

Original bundle

Name: journal.pone.0322887.pdf
Size: 14.88 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 4.47 KB
Format: Item-specific license agreed upon to submission