Gene expression profiling is among the most commonly used analytical tools in biomedical research and is applied to predict preclinical and clinical endpoints, e.g. diagnosis of disease, risk assessment and response to treatment. However, the reliability of these predictions has not yet been established.
Johan Trygg and Max Bylesjö, researchers at Umeå University, have participated in a large international project (MAQC-II) that aimed to examine and develop "best practice" protocols in data analysis for predicting clinical endpoints based on gene expression data. The project was coordinated by the United States Food and Drug Administration (FDA) and is part of its recently launched "Critical Path Initiative" for medical product development. The Umeå University researchers contributed their expertise in the multivariate data analysis technique known as chemometrics.
The results have been published in the latest issue of the journal Nature Biotechnology.
Gene expression data can be used for diagnosis, early detection (screening) and prediction of response to treatment. However, the reliability of the predicted clinical endpoint can profoundly influence the results. In this project, gene expression profiles for 13 different endpoints from more than 3,100 samples, including breast and lung cancer, were analyzed by 36 independent analysis teams, which generated more than 30,000 prediction models for these 13 endpoints. This provides a unique resource for regulatory agencies and scientists.
"Even though the primary goal was not to evaluate individual contributions, I was very happy to see that our OPLS prediction models did so well, and ranked highest for one of the 13 endpoints," says Johan Trygg, associate professor at the Computational Life Science Cluster (CLiC) at Umeå University and coordinator of the Swedish effort.
A large effort was put into the structure and review of the data analysis protocol, the generation of 36 candidate models and the statistical validation, including blinded validation sets. Three observations were particularly highlighted: (1) the performance of the prediction models depends largely on the quality and relevance of the data; (2) the experience and proficiency of the data analysis team are crucial factors for success; and (3) different prediction methods yield similar prediction results.
Understanding the limitations of using gene expression data to predict clinical endpoints is critical to the formulation of general guidelines and procedures for safe and effective use, e.g. in the development of diagnostic tests. The "best practice" guidelines produced by this unprecedented collaboration provide a solid foundation for applying other types of high-dimensional biological data, such as protein and metabolite data, to personalized medicine.