Guiding questions to avoid data leakage in biological machine learning applications

Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets due to their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.

Publication type
Journal articles (peer-reviewed)
Title
Guiding questions to avoid data leakage in biological machine learning applications
Journal
Nature Methods
Volume
21
ISSN
1548-7091
Authors
Judith Bernett, David B Blumenthal, Prof. Dr. Dominik Grimm, Prof. Dr. Florian Haselbeck, Roman Joeres, Olga V Kalinina, Markus List
Publisher
Springer Nature
Pages
1444-1453
Publication date
9 August 2024
Citation
Bernett, Judith; Blumenthal, David B; Grimm, Dominik; Haselbeck, Florian; Joeres, Roman; Kalinina, Olga V; List, Markus (2024): Guiding questions to avoid data leakage in biological machine learning applications. Nature Methods 21, pp. 1444-1453. DOI: 10.1038/s41592-024-02362-y