Saarbrücken, October 24, 2024 - The combination of computational and experimental research has led to remarkable successes in recent decades, from deciphering the genetic code to predicting complex protein structures. This year, the foundations of this research were recognized with the Nobel Prizes in Physics and Chemistry. More recently, the increasing availability of artificial intelligence (AI) methods has opened up many new applications in bioinformatics. However, with these new opportunities come challenges that often remain hidden at first. One of the biggest challenges in developing AI-based bioinformatics applications is data leakage.
Researchers at the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) focus primarily on identifying new drug candidates and optimizing them for use in humans. The research group Drug Bioinformatics, led by Prof. Olga Kalinina, uses state-of-the-art bioinformatics techniques to predict previously unknown resistance mechanisms or the mode of action of new drugs. Increasingly, this involves AI models that are trained on large amounts of data to recognize patterns and make reliable predictions. Before an AI model can evaluate data, it must first be "trained" on a suitable dataset. The trained model then applies the patterns it has learned to new, previously unseen test data. Evaluating on unseen data ensures that the AI is not simply reproducing memorized patterns but has gained genuinely generalizable insights. If parts of the training data also appear in the test data, this is called data leakage.
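In code, the underlying principle is simple: the test set must be kept strictly separate from the training set. The following minimal sketch illustrates this with synthetic data (the dataset, features and model are purely illustrative and not taken from the publication):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative stand-in for real data: 500 samples with 20 features
# (e.g., molecular descriptors) and binary labels (e.g., active/inactive)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)
ids = np.arange(500)  # unique sample identifiers

# Hold out test data that the model never sees during training
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, ids, test_size=0.2, random_state=0
)

# Basic leakage check: no sample may appear in both sets
assert set(ids_train).isdisjoint(ids_test), "Data leakage: train/test overlap"

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"Accuracy on unseen test data: {model.score(X_test, y_test):.2f}")
```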
In a joint review article in the journal Nature Methods, an international team of data scientists, including Kalinina, comprehensively demonstrates how data leakage arises in biological research and how it can be remedied or, better yet, prevented. When data leaks into the test set, the model has access to information that would not be available in a real application and therefore makes predictions that look optimistic but do not hold up in practice. This is particularly relevant in pharmaceutical research, where incorrect predictions can form the basis for expensive experiments or tests on living organisms.
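How strongly leakage inflates results is easy to reproduce. In the sketch below (again synthetic data, not taken from the paper), the labels are pure noise, so there is nothing genuine to learn; yet if near-copies of training samples slip into the test set, a simple memorizing model appears to score almost perfectly:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)  # labels are random noise: nothing real to learn

# A 1-nearest-neighbor model simply memorizes the training data
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Leaky evaluation: test samples are near-copies of training samples
X_leaky = X[:50] + rng.normal(scale=1e-3, size=(50, 10))
print(f"Leaky test accuracy:  {model.score(X_leaky, y[:50]):.2f}")  # close to 1.0

# Honest evaluation: genuinely new samples reveal chance-level performance
X_new = rng.normal(size=(50, 10))
y_new = rng.integers(0, 2, size=50)
print(f"Honest test accuracy: {model.score(X_new, y_new):.2f}")  # around 0.5
```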
"The impact of AI on pharmaceutical research and drug development is growing rapidly. Predictions made by AI can have a significant impact not only on the success or failure of research projects, but also on people's lives," says Roman Joeres, co-author of the study and a PhD student in Kalinina's research group. An AI model that has been trained on the wrong data can produce incorrect results - a risk that can have serious consequences, especially when, for example, the diagnosis of a disease or the development of life-saving drugs are at the end of the development process.
A striking example of how dramatically data leakage can affect an AI's performance: an AI designed to detect tumors on CT scans performed impressively during training but failed in practice. The reason: many tumor images in the training data contained rulers, because physicians had measured the tumors in them. The AI had "learned" to classify images with rulers as tumor images and could not detect tumors without these markers. This illustrates how problematic it is to train an AI on features that are missing later in the real application.
The solution proposed by the authors is as challenging as it is necessary: they recommend clear guidelines for data sharing and validation. These would help ensure that AI models really learn what they are supposed to - and not just whatever else happens to be hidden in the data. Such measures could increase the reliability of AI models in scientific research and ensure that new discoveries are not based on false assumptions.
This may sound like an abstract, technical problem, but the implications are far-reaching. If a model for predicting protein structures produces inaccurate results because of data leakage, it can lead to misguided experiments and ultimately costly research setbacks. In medical practice, such errors could even have dangerous consequences, for example in the diagnosis of serious diseases.
"We need to be very aware of the structure and origin of our data to avoid data leakage. Our paper tries to focus on the easily overlooked problem of data leakage and raise awareness in the scientific community to develop better AI in the future," says Kalinina. The problem is not only that biological data is often complex and confusing, but also that it is often interconnected. This aspect of research is not only a technical necessity, but also an ethical obligation. It is also about ensuring that science remains reliable, transparent and trustworthy - especially when it is supported by AI. The work of Kalinina and her colleagues is therefore an important step toward a future in which AI and experimental research work hand in hand to solve the great challenges of our time.
Original publication:
Bernett, J., Blumenthal, D.B., Grimm, D.G. et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods 21, 1444–1453 (2024). DOI: 10.1038/s41592-024-02362-y