Data Reuse Digest: February 2023

New Format - Capturing the Whole of Bioinformatics Research

Feb 28, 2023

Introduction

Thanks for tuning in to The Data Reuse Digest! Our goal in writing this newsletter is to uncover all the different ways that scientific data can be used to drive research forward - ultimately with an eye on translational developments (new drugs, new clinical guidelines, new technologies, etc.).

This month, we’ve made improvements to the existing format. Now, not only will we continue to feature research projects that involve gathering and analyzing biological data in clever ways to tackle biomedical challenges - we will also share the latest news on bioinformatics algorithms and biological databases. These two areas of research lay a necessary foundation for the analysis projects (discovering new drug candidates, stratifying patient populations to enable more personalized medical treatment, developing new diagnostics, etc.) that we have long featured in the newsletter. The newsletter will now paint a broader picture of modern bioinformatics research and why it matters.

For researchers, especially younger researchers, the idea of this newsletter is to show what successful, publishable work in this area of analyzing big biological datasets looks like right now. At the same time, it also aims to encourage new kinds of research projects that push the field in new directions, towards new translational goals. For non-researchers, the idea is to pull back the curtain to show what research work actually looks like, and why it matters for society at large.

Research News

Bioinformatics Algorithms

Computational tools that are unlocking new kinds of insights from biological data

One strategy to develop new drugs: disrupt protein-protein interactions. Hundreds of thousands of proteins are known to interact - but in order to target these interactions, scientists need to figure out exactly what molecular regions of the proteins are participating. Researchers in Australia have developed a new algorithm that predicts which parts of proteins will interact - setting the stage for drug development [Williams et al., 2023] 🇦🇺
Genetic data is complicated, and often needs to be distilled down into a simpler form in order for its insights to be unlocked. One simple data structure that researchers like to use is called a ‘binary matrix’, which records in a simple table of 1s and 0s whether a gene is mutated or not for each individual subject who participated in a study. Scientists in Italy have developed a new tool for analyzing these matrices to uncover novel insights about the relationship between genes in cancer and other diseases [Vinceti et al., 2023] 🇮🇹
Proteins are a cell’s agents of activity - they build things up, break things down, and move things around. Each protein is made from a defined sequence of chemical building blocks (-ABCDEFG-), but in a special event called circular permutation, this sequence can get reordered. It is like connecting the two ends of the protein sequence to form a circle and then cutting that circle to form ends in new places (-DEFGABC-). In this example, the two ends (A and G) were joined and the circle was then cut between the C and the D. These permuted sequences have biotechnology and drug development applications. Researchers in Taiwan have developed a new algorithm that searches protein sequence databases to identify novel circular permutations. [Chen et al., 2023] 🇹🇼
A new algorithm called LambdaPP, developed by an international team of scientists, takes in a protein sequence (a string of amino acids) and predicts the 3D structure of that protein using artificial intelligence. Knowledge of 3D protein structure is essential for developing protein-targeting drug compounds. [Olenyi et al., 2023] 🇩🇪🇰🇷🇺🇸
Researchers from Europe have developed a platform (DARTpaths) that takes chemical compounds with known developmental / reproductive effects and predicts what biological pathways these compounds target. Other researchers can test these predictions in the lab - exposing model organisms like mice or fruit flies to the chemical compounds and defining their biological effects in greater detail. [Bhalla et al., 2023] 🇧🇪🇳🇱🇬🇧
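
For readers who like to see ideas in code, two of the concepts above are simple enough to sketch in a few lines of Python: the circular permutation (-ABCDEFG- becoming -DEFGABC-) and the binary mutation matrix. This is only a toy illustration, not the method from either paper. The function names, gene names, and matrix values are made up for this example, and the check below only handles exact letter-for-letter rotations, whereas real protein sequences drift apart over time and need far more tolerant search methods.

```python
def is_circular_permutation(original: str, candidate: str) -> bool:
    """True if candidate equals original after joining its ends and cutting elsewhere."""
    # Any rotation of a string appears as a substring of the string doubled.
    return len(original) == len(candidate) and candidate in original + original


def cut_point(original: str, candidate: str) -> int:
    """Index in original where the circle was cut, or -1 if not a rotation."""
    if not is_circular_permutation(original, candidate):
        return -1
    return (original + original).find(candidate)


# Toy binary mutation matrix: rows are subjects, columns are genes.
# 1 = gene is mutated in that subject, 0 = not mutated.
# Gene names and values are invented for illustration only.
genes = ["GENE_A", "GENE_B", "GENE_C"]
matrix = [
    [1, 0, 1],  # subject 1
    [1, 0, 0],  # subject 2
    [0, 1, 0],  # subject 3
    [1, 0, 1],  # subject 4
]


def co_mutation_count(matrix, i, j):
    """Number of subjects in which gene i and gene j are both mutated."""
    return sum(1 for row in matrix if row[i] == 1 and row[j] == 1)


if __name__ == "__main__":
    print(is_circular_permutation("ABCDEFG", "DEFGABC"))
    print(cut_point("ABCDEFG", "DEFGABC"))  # position of D, i.e. cut between C and D
    print(co_mutation_count(matrix, 0, 2), co_mutation_count(matrix, 0, 1))
```

Counting how often two genes are mutated together (or never together) is one of the simplest questions a binary matrix can answer, which is part of why researchers like the format.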

Database Development

Collections of biological data in new forms and arrangements, made more accessible to the research community

New ChemFOnt database encompasses nearly 350,000 biologically relevant compounds: metabolites, food chemicals, drug compounds, pesticides, and more. The database associates each compound with functional terms that describe its health effects, associated diseases, and biological processes [Wishart et al., 2023] 🇨🇦🇺🇸🇳🇱
Update on a large, established database - InterPro - which classifies proteins from many organisms into families with certain biological functions. Updates in 2022 included the integration of new member databases (InterPro gathers together data from many member databases), added tools, and improvements to the visual appearance of the InterPro webpage [Paysan-Lafosse et al., 2023] 🌍
A database called G4Atlas houses data from studies of 10 different species and is focused on special cellular biomolecules called RNA G-quadruplexes (rG4s). These rG4s are known to regulate cellular activity and are thought to play a role in cancer, viral infection, and other illnesses. [Yu et al., 2023] 🇬🇧🇨🇳
A key trend in biomedical research is the integration of different kinds of biological data - so-called ‘multi-omics’ analysis. The IAnimal database, developed by researchers in China, integrates genomic data, transcriptomic data, epigenomic data, and more for 21 different species. [Fu et al., 2023] 🇨🇳
Researchers from the University of Michigan have constructed a massive knowledge graph of the human genome, called GenomicKB, which draws from more than 30 existing biological databases. A knowledge graph is all about linking up different kinds of information. For a familiar example - search for ‘Albert Einstein’ in Google. The ‘About’ section that comes up on the right side of the web page is built using Google’s Knowledge Graph. In GenomicKB, if a user specifies a given genetic variant, the database will tell them what disease it is associated with as well as what gene it is found in, what tissues that gene is expressed in, and what biological functions it carries out. [Feng et al., 2023] 🇺🇸
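
To make the knowledge graph idea a little more concrete, here is a minimal Python sketch of the variant lookup described above. It assumes the graph is stored as simple (subject, relation, object) triples. Every identifier and relation name here is a hypothetical placeholder, and this is not how GenomicKB itself is implemented or queried.

```python
from collections import defaultdict

# (subject, relation, object) triples. All identifiers are hypothetical
# placeholders, not real records from GenomicKB.
TRIPLES = [
    ("variant_1", "located_in", "gene_X"),
    ("variant_1", "associated_with", "disease_Y"),
    ("gene_X", "expressed_in", "tissue_liver"),
    ("gene_X", "expressed_in", "tissue_brain"),
    ("gene_X", "has_function", "function_Z"),
]


def build_graph(triples):
    """Index triples as graph[subject][relation] -> list of objects."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, relation, obj in triples:
        graph[subject][relation].append(obj)
    return graph


def describe_variant(graph, variant):
    """Follow links from a variant to its diseases, genes, tissues, and functions."""
    node = graph.get(variant, {})
    genes = list(node.get("located_in", []))
    tissues, functions = [], []
    for gene in genes:
        # Second hop: from each gene to where it is expressed and what it does.
        gene_node = graph.get(gene, {})
        tissues.extend(gene_node.get("expressed_in", []))
        functions.extend(gene_node.get("has_function", []))
    return {
        "diseases": list(node.get("associated_with", [])),
        "genes": genes,
        "tissues": tissues,
        "functions": functions,
    }


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    print(describe_variant(graph, "variant_1"))
```

The useful part is the second hop: the variant itself says nothing about tissues, but because it links to a gene and the gene links to tissues, one question can pull together facts that originally lived in separate databases.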

Analysis Projects

Efforts to analyze biological data - in-house and/or from public databases - and tackle important biomedical challenges

Chinese researchers have produced a new dataset to study Alzheimer’s disease. The Hi-C data that they collected captures the 3D organization of DNA in the genome. Oftentimes, for a gene to be turned on, a faraway segment of DNA (the ‘enhancer’) must be brought near the start of the gene (towards a regulatory region called the ‘promoter’). Hi-C data helps scientists identify and understand these kinds of long-distance interactions that may play important roles in health and disease. [Meng et al., 2023] 🇨🇳
Researchers in Texas developed a new library of compounds with a specific chemical structure (beta-hairpin peptide macrocycles) that could have potential as antibiotics. After developing this library, they refined it by testing the compounds on bacteria to see which worked best. Then, they used machine learning analysis to figure out what chemical features made these effective compounds special (In the authors’ own words: “Active peptides contain a unique constrained structure and are highly enriched for cationic charge with arginine in their turn region”) [Randal et al., 2023] 🇺🇸
Scientists in St. Louis are studying cancer cells - specifically, the tendency of these cells to secrete small packets of proteins and other compounds called extracellular vesicles (EVs). The researchers performed a proteomic analysis of EVs from four different types of cancer, determining what kinds of proteins are present. EV content differed by cancer type - and as the researchers demonstrate, machine learning methods can be used to classify a patient’s cancer based on the proteins in their EVs [Barlin et al., 2023] 🇺🇸
What kinds of biomarkers define patients with sepsis? Chinese researchers have found that certain metabolites are significantly more common in sepsis patients than in healthy volunteers (9 metabolites to be exact - you can find the chemical names in the linked paper). This information can help clinicians diagnose sepsis more quickly. [Li et al., 2023] 🇨🇳
Hypertension (high blood pressure) and left ventricular hypertrophy (thickening of the heart wall) put patients at risk of heart failure. To understand what is happening to the heart at the molecular level, Chinese scientists identified proteins that are more highly expressed in the heart tissue of hypertensive rats than in that of healthy rats. They identified nearly 400 differentially expressed proteins - which can be studied further and considered as potential targets for treating heart-related conditions. [Wang et al., 2023] 🇨🇳
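
Several of the projects above come down to the same basic question: which molecules are present at higher levels in one group than in another? A rough Python sketch of the simplest version of that comparison, a log2 fold change between group averages, is shown below. The protein names and abundance values are invented for illustration, the 1.0 cutoff (a doubling) is an assumption, and this is not the analysis used in any of the papers. Real studies also apply statistical tests and correct for the many proteins being compared at once.

```python
import math
from statistics import mean

# Made-up abundance values for illustration; protein names are placeholders.
abundance = {
    "protein_1": {"hypertensive": [8, 10, 12], "healthy": [4, 5, 6]},
    "protein_2": {"hypertensive": [5, 5, 5], "healthy": [5, 5, 5]},
    "protein_3": {"hypertensive": [2, 3, 4], "healthy": [10, 12, 14]},
}


def log2_fold_change(case, control):
    """log2 of (mean case abundance / mean control abundance)."""
    return math.log2(mean(case) / mean(control))


def more_highly_expressed(data, min_log2_fc=1.0):
    """Names of proteins whose log2 fold change meets the (assumed) cutoff."""
    hits = []
    for name, groups in data.items():
        fold_change = log2_fold_change(groups["hypertensive"], groups["healthy"])
        if fold_change >= min_log2_fc:
            hits.append(name)
    return hits


if __name__ == "__main__":
    for name, groups in abundance.items():
        print(name, log2_fold_change(groups["hypertensive"], groups["healthy"]))
    print(more_highly_expressed(abundance))
```

A log2 fold change of 1 means the average level doubled, 0 means no change, and negative values mean the protein is less abundant in the hypertensive group.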

Research Community

This month’s featured research involved researchers in more than 9 countries (including several large international teams)

Spread the Word!

Thanks for reading! If you want to help us in our mission to show how researchers can make the most of public data, please share this newsletter with any colleagues who would be interested.

From the Computer to the Clinic