Introduction
In this new edition of our newsletter, we are evolving our approach - taking what we’ve learned about bioinformatics and data reuse research efforts from our broad-based survey of the literature over the past year and focusing more specifically on different disease areas.
The idea here is to point out how researchers are making use of computational approaches to drive new research and develop better ways to treat disease. By sharing such studies - and translating technical details into more accessible language - we hope to encourage the broader use of computational tools and biological databases both within the disease area in question and in other disease areas where they may not yet have been applied.
In addition to this new general approach, we are adopting a new writing format. Each edition will be a short story focused on a specific research question (for example: What factors affect the risk of developing a specific disease?) and how computational methods are being used to tackle this question. These stories will feature either one or several different published studies. We will also have a section after the short story that highlights the broader trends in computational research that the featured study or studies help illuminate.
With this new approach and fewer featured studies per edition, we hope to publish more frequently, though not on a rigid monthly or bi-monthly schedule. Because our other commitments (research, course work, etc.) are constantly changing, the publishing frequency may vary from several editions a month to one every two months. In the long run, we hope to bring in new contributors with a computational background who can offer their own perspectives. This will also help make publishing more consistent.
To achieve our goal of stimulating the new use and development of computational methods in biomedical research, we need your help: please share this newsletter with your colleagues, collaborators, and students.
If you are not a subscriber already, you can subscribe to the Data Reuse Digest here:
What factors affect the risk of developing type II diabetes?
Finding ways to identify people at risk of type II diabetes before they develop it - and coming up with strategies to reduce that risk on an individual basis - would improve the long-term quality of life for millions and relieve a major burden on the health care system. Computational approaches capable of identifying complex patterns in data are a major asset.
Researchers can detect these kinds of patterns by gathering large amounts of data from patients who have already developed type II diabetes - either from large public databases or individual published studies. They can then analyze this data to identify subtle biological patterns that characterize diabetes. While you can’t turn back the clock for these patients, still-healthy individuals can be screened for the same patterns, which can act as a ‘canary in the coal mine’ to warn the individual and their doctors that they may be in the early stages of developing the disease.
One such study, conducted by scientists from the Indian company Mapmygenome, gathered data from participants in the UK Biobank, a huge public repository: 959 people with type II diabetes and 2818 healthy controls. Notably, the selected Biobank participants all shared the same ethnic background, being of Indian origin. The researchers focused on the human genome - specifically, mutations in DNA that are associated with diabetes - as a potential biological risk factor. Drawing on prior genome-wide association studies (GWAS) that had already linked individual genetic mutations to diabetes, the researchers developed a polygenic risk score (PRS) that estimates a person’s overall diabetes risk from the combination of mutations they carry. The researchers calculated the PRS for each selected Biobank participant and found that it helped predict diabetes status (i.e., whether or not an individual in the study had diabetes) in a logistic regression model, even when age, sex, and other known biological risk factors were taken into account.
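To make the idea of a polygenic risk score more concrete, here is a minimal Python sketch. It assumes the simplest common form of a PRS: for each variant, multiply the number of risk alleles a person carries (0, 1, or 2) by an effect size taken from prior GWAS results, then add everything up. The variant names, effect sizes, and the function name `polygenic_risk_score` are invented for illustration and are not taken from the featured study, which may have used a more sophisticated scoring method.

```python
# Hypothetical effect sizes (e.g., log odds ratios) per variant.
# These IDs and numbers are invented for illustration only.
effect_sizes = {"variant_A": 0.12, "variant_B": 0.30, "variant_C": -0.05}


def polygenic_risk_score(genotype, weights):
    """Sum of (risk-allele count x effect size) over all scored variants.

    genotype maps variant ID -> number of risk alleles carried (0, 1, or 2).
    Variants missing from the genotype are treated as 0 copies.
    """
    return sum(weight * genotype.get(variant, 0)
               for variant, weight in weights.items())


# Two hypothetical individuals with different combinations of variants.
person_1 = {"variant_A": 2, "variant_B": 1, "variant_C": 0}
person_2 = {"variant_A": 0, "variant_B": 0, "variant_C": 2}

score_1 = polygenic_risk_score(person_1, effect_sizes)
score_2 = polygenic_risk_score(person_2, effect_sizes)
print(f"Person 1 PRS: {score_1:.2f}")
print(f"Person 2 PRS: {score_2:.2f}")
```

In this toy example, person 1 carries more risk-increasing alleles and therefore ends up with the higher score. In a real analysis, a score like this would then be entered into a statistical model (such as logistic regression) alongside age, sex, and other covariates.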
Featured Study: Genome-wide polygenic risk score for type 2 diabetes in Indian population (Pemmasani et al., July 2023, Scientific Reports)
There are a few further aspects of this study worthy of note. After analyzing the UK Biobank data, the researchers analyzed a separate database (Mapmygenome’s proprietary database GenomemegaDB) of Indian individuals living in India. This helped validate the PRS approach and show that it was still a good predictor for an entirely separate group of patients. This kind of validation is very common in studies that develop and test computational models of disease prediction, and is possible nowadays because of the large number of public datasets currently available. Another important feature is the focus on Indian subjects, living in either the UK or India. Prior research suggests that a PRS developed for one ethnic group may not be applicable to others. If the researchers had used a PRS developed from a European population to predict risk of diabetes in an Indian population, for example, the results might have been less accurate.
While the PRS study focused on genetic mutations, it is possible to use other kinds of biological data to predict disease risk as well. An international team of researchers, led by scientists at the University of Edinburgh in the UK, used DNA methylation markers from blood samples for diabetes prediction. Methylation is a form of epigenetic regulation. As people go through life and experience events, their DNA is imprinted with methylation markers. These markers can affect gene expression, cellular activity, and physical health. Unlike inherited genetic mutations, which are present from birth, epigenetic markers such as methylation markers are shaped by a person’s environment, decisions, and experiences.
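One way to check whether a risk score "still works" in a separate group of patients is to compute the same performance metric in both cohorts and compare. The sketch below uses the area under the ROC curve (AUC), a widely used metric, as an assumption; it is not necessarily the metric reported in the featured study. The function name `auc` and all scores are invented for illustration.

```python
def auc(case_scores, control_scores):
    """Probability that a randomly chosen case scores higher than a randomly
    chosen control (ties count as half). 0.5 is chance; 1.0 is perfect."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Invented risk scores for two hypothetical cohorts (not data from the study).
discovery_cases, discovery_controls = [0.9, 0.7, 0.6], [0.5, 0.4, 0.8, 0.2]
validation_cases, validation_controls = [0.8, 0.5], [0.6, 0.3, 0.1]

discovery_auc = auc(discovery_cases, discovery_controls)
validation_auc = auc(validation_cases, validation_controls)
print(f"Discovery cohort AUC:  {discovery_auc:.2f}")
print(f"Validation cohort AUC: {validation_auc:.2f}")
```

If the score performs about as well in the validation cohort as in the original one, that is evidence the score captures something real rather than a quirk of the first dataset. A large drop would be a warning sign.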
Featured Study: Development and validation of DNA methylation scores in two European cohorts augment 10-year risk prediction of type 2 diabetes (Cheng et al., April 2023, Nature Aging)
The research team drew on data from the Generation Scotland cohort: over 14,000 individuals, 626 of whom had diabetes, with methylation data and over 15 years of associated electronic health record data. To identify methylation markers that were predictive of type II diabetes, the researchers applied an R programming package called MethylPipeR, developed by the Marioni Group at the University of Edinburgh, that incorporates a number of different machine learning models.
They first divided the patients into two groups (a ‘training’ and a ‘test’ group). Both groups contained individuals with and without diabetes. The researchers then applied their models to identify markers that predicted diabetes in the training group. After training the models on the training group, they applied them to the test group to see whether the models could still effectively predict which patients had diabetes.
This procedure of splitting up the data into a training and a test group, and applying the model to the two groups in succession, is a very common approach in machine learning studies - as is validating a model on an entirely separate dataset. Just as in the PRS study described above, the machine learning models in this study were also applied to a different cohort, from the German KORA study.
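The train/test procedure described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the MethylPipeR workflow: the "model" is a single cutoff on one made-up marker, the cohort is randomly generated, and the function names (`train_test_split`, `fit_threshold`, `accuracy`) are our own.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle records reproducibly, then hold out a fraction as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """'Train' a one-parameter model: the midpoint between the mean marker
    value of people with diabetes and the mean of those without."""
    cases = [value for value, has_diabetes in train if has_diabetes]
    controls = [value for value, has_diabetes in train if not has_diabetes]
    return (sum(cases) / len(cases) + sum(controls) / len(controls)) / 2


def accuracy(threshold, records):
    """Fraction of records where 'marker > threshold' matches the true label."""
    correct = sum((value > threshold) == has_diabetes
                  for value, has_diabetes in records)
    return correct / len(records)


# Synthetic cohort of (marker value, has_diabetes) pairs. Values are invented:
# the marker is drawn around 1.0 for cases and around 0.0 for controls.
rng = random.Random(42)
cohort = [(rng.gauss(1.0, 0.5), True) for _ in range(100)]
cohort += [(rng.gauss(0.0, 0.5), False) for _ in range(100)]

train, test = train_test_split(cohort)
threshold = fit_threshold(train)          # learned from training data only
test_accuracy = accuracy(threshold, test)  # evaluated on unseen data
print(f"Threshold learned from training group: {threshold:.2f}")
print(f"Accuracy on held-out test group: {test_accuracy:.2f}")
```

The key point is that the test group plays no part in choosing the threshold. Its accuracy is therefore a fairer estimate of how the model would perform on new patients than accuracy measured on the training group itself.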
Key Trends in Computational Research
🏆 [Research Goal] You can identify biological factors that predict disease risk using a variety of different machine learning models.
⚙️ [Technical Note] The featured studies are examples of ‘supervised’ machine learning. Patients are already labeled as having diabetes or not having diabetes - the goal is to figure out what underlying features distinguish the two groups and best predict whether a patient has diabetes. Unsupervised learning, by contrast, works with data that has no such labels attached. Based on the data at hand, unsupervised models determine how samples could be divided into groups, which may have meaningful clinical differences worthy of further study.
📈 [Ongoing Development] There are many tools available (e.g., R programming packages) to help apply these models - and more are being developed by the day. The underlying machine learning models used in these tools are also actively being improved to work better with different forms of data. There is a lot of exchange between different research disciplines - a machine learning model developed for marketing research, for example, may have applications in biological research.
⚙️ [Technical Note] Machine learning models can be applied to search for risk factors in many different kinds of biological data (e.g., genetic mutations, methylation markers, gene and protein expression, clinical observations, etc.)
📈 [Ongoing Development] There are now a wide variety of biological databases available for researchers to mine data from - often with multiple different kinds of biological data linked to the same patient. New databases are consistently being built with new patients and data types.
⚙️ [Technical Note] Studies that train models to predict disease risk based on patterns in biological data validate these models in several ways. A common approach is splitting the initial dataset into a training set (to initially train the model) and a test set (to show that it makes predictions accurately beyond the training data). Another increasingly common approach is applying the trained model to an entirely separate dataset. To get a machine learning study published, the first approach is probably necessary for most journals and the second approach increasingly so as well.
📈 [Ongoing Development] Researchers are now well aware that when predicting disease risk, ethnicity is a factor that needs to be taken into account. Databases and trained models specific to different populations are essential for the comprehensive prediction of disease risk.
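To illustrate the supervised/unsupervised distinction from the technical note above, here is a minimal Python sketch of an unsupervised method: a bare-bones version of k-means clustering with two groups, applied to a single made-up measurement. Unlike the supervised examples in the featured studies, no diabetes labels are supplied; the algorithm only sees the numbers. The function name `two_means` and the data are invented for illustration.

```python
def two_means(values, iterations=20):
    """Split unlabeled 1-D values into two groups (k-means with k=2).

    Starts with the smallest and largest values as group centers, then
    repeatedly assigns each value to its nearest center and recomputes
    each center as the mean of its group.
    """
    low_center, high_center = min(values), max(values)
    low, high = list(values), []
    for _ in range(iterations):
        low = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        if not low or not high:
            break  # everything fell into one group; nothing more to update
        low_center = sum(low) / len(low)
        high_center = sum(high) / len(high)
    return low, high


# Invented measurements with no labels attached.
measurements = [0.1, 0.3, 0.2, 2.9, 3.1, 0.25, 3.0]
group_a, group_b = two_means(measurements)
print("Group A:", group_a)
print("Group B:", group_b)
```

Whether the two groups found this way correspond to anything clinically meaningful is a separate question that researchers would need to investigate afterwards.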