Guiding questions to avoid data leakage in biological machine learning applications

Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets due to their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.

Publication type: Journal article (peer-reviewed)
Title: Guiding questions to avoid data leakage in biological machine learning applications
Journal: Nature Methods
DOI: 10.1038/s41592-024-02362-y
Volume: 21
ISSN: 1548-7091
Authors: Judith Bernett, David B Blumenthal, Prof. Dr. Dominik Grimm, Prof. Dr. Florian Haselbeck, Roman Joeres, Olga V Kalinina, Markus List
Publisher: Springer Nature
Pages: 1444-1453
Publication date: 9 August 2024
Citation: Bernett, Judith; Blumenthal, David B; Grimm, Dominik; Haselbeck, Florian; Joeres, Roman; Kalinina, Olga V; List, Markus (2024): Guiding questions to avoid data leakage in biological machine learning applications. Nature Methods 21, pp. 1444-1453. DOI: 10.1038/s41592-024-02362-y
