Data Reuse Digest: July 2023

Your monthly digest of recent bioinformatics & data re-use research

Jul 31, 2023

Introduction

Thanks for tuning in to The Data Reuse Digest! In writing this newsletter, our goal is to uncover all the different ways that published scientific data can be used to drive research forward - with an eye on translational developments (new drugs, new clinical guidelines, new technologies).

We try to keep the writing plain and simple so that the newsletter can be useful to researchers of any field (not just bioinformatics) and the general public as well.

For researchers, especially younger researchers, this newsletter will show you what publishable work in the field looks like right now. It also hopes to encourage new kinds of research projects that push the field in new directions, towards new translational goals. For non-researchers, this newsletter will pull back the curtain to show what research actually looks like and why it matters.

Research News

(A) New Algorithms

Developing Computational Tools to See Biological Data in New Ways

Rising antibiotic resistance, together with the threat of untreatable bacterial infections in a post-antibiotic era, has driven research into alternative antibacterial structures and surfaces. These surfaces are found on surgical tools and implanted biomedical devices, among other things. A major bottleneck in antibacterial surface design is testing candidate surfaces with microbiological assays that can take several days or weeks. To address this, scientists developed an automated pipeline that uses scanning electron microscope (SEM) images to predict the antibacterial effectiveness of a surface from various surface parameters [Rahimi et al., 2023] 🇸🇪🇩🇰

Radiation is commonly used to detect and treat many cancers. Clinicians monitor various radiation biomarkers to evaluate the effects of different radiation doses on people exposed medically or accidentally. Identifying and screening for these biomarkers, however, remains a complex and time-consuming process. In this study, researchers used a Genetic Algorithm combined with k-Nearest Neighbors (GA/KNN), a machine learning approach, to detect biomarker candidates within a public gene expression dataset [Andersson et al., 2023] 🇸🇪🇺🇸
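
For readers curious about what a GA/KNN search looks like in practice, here is a minimal sketch of the general idea. It is not the authors' pipeline: the toy data, the function names (make_toy_data, loo_knn_accuracy, ga_knn), and every parameter value are assumptions made for illustration. A genetic algorithm proposes small subsets of genes, and each subset is scored by how well a k-nearest-neighbors classifier separates the two groups of samples using only those genes.

```python
import random

def make_toy_data(n_per_class=10, n_genes=12, seed=1):
    """Synthetic expression matrix (assumption): gene 0 differs between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_genes)]
            row[0] += 6 * label  # gene 0 plays the role of a hypothetical biomarker
            X.append(row)
            y.append(label)
    return X, y

def loo_knn_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy of a k-NN classifier that only sees `genes`."""
    correct = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance from sample i to every other sample.
        dists = sorted(
            (sum((xi[g] - xj[g]) ** 2 for g in genes), y[j])
            for j, xj in enumerate(X) if j != i
        )
        votes = [label for _, label in dists[:k]]
        predicted = max(set(votes), key=votes.count)  # majority vote
        correct += predicted == y[i]
    return correct / len(X)

def ga_knn(X, y, subset_size=2, pop_size=20, generations=15, seed=0):
    """Evolve gene subsets; fitness is leave-one-out k-NN accuracy.
    Assumes subset_size is smaller than the number of genes."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda s: loo_knn_accuracy(X, y, s), reverse=True)
        survivors = ranked[: pop_size // 2]  # keep the fitter half unchanged
        children = []
        for parent in survivors:
            child = list(parent)
            # Mutation: swap one gene for a gene not already in the subset.
            unused = [g for g in range(n_genes) if g not in child]
            child[rng.randrange(subset_size)] = rng.choice(unused)
            children.append(child)
        population = survivors + children
    best = max(population, key=lambda s: loo_knn_accuracy(X, y, s))
    return sorted(best), loo_knn_accuracy(X, y, best)

X, y = make_toy_data()
genes, accuracy = ga_knn(X, y)
print("selected genes:", genes, "leave-one-out accuracy:", accuracy)
```

Published GA/KNN analyses typically repeat the search many times and rank genes by how often they are selected. This sketch runs a single search to keep the idea visible.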

Systems and molecular neuroscience often depend on multimodal microscopy experiments, in which a small population of neurons is imaged under different experimental conditions. A major obstacle in these experiments is the lack of techniques for identifying the same neurons across conditions. In this study, researchers developed a branch-and-bound algorithm that matches multimodal images taken from mouse brains [Chen et al., 2023] 🇺🇸
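
To give a flavor of how branch and bound works, here is a small sketch under simplifying assumptions. It treats each image as a list of neuron coordinates and looks for the one-to-one pairing with the smallest total distance. The function name match_points and the coordinates are hypothetical, and the published method is more sophisticated than this. The key idea is the "bound": any partial pairing that cannot possibly beat the best complete pairing found so far is abandoned early.

```python
import math

def match_points(a, b):
    """Pair each point in `a` with a distinct point in `b` (len(b) >= len(a)),
    minimising the total Euclidean distance, using branch and bound."""
    n = len(a)
    cost = [[math.dist(p, q) for q in b] for p in a]

    # Optimistic estimate for rows i..n-1: every row gets its cheapest column.
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + min(cost[i])

    best = {"cost": math.inf, "match": None}

    def branch(i, used, partial, so_far):
        # Bound: even the optimistic estimate cannot beat the best solution.
        if so_far + suffix[i] >= best["cost"]:
            return
        if i == n:  # every point in `a` has a partner
            best["cost"], best["match"] = so_far, list(partial)
            return
        # Branch: try the closest candidates first so good solutions appear early.
        for j in sorted(range(len(b)), key=lambda j: cost[i][j]):
            if j not in used:
                used.add(j)
                partial.append(j)
                branch(i + 1, used, partial, so_far + cost[i][j])
                partial.pop()
                used.remove(j)

    branch(0, set(), [], 0.0)
    return best["match"], best["cost"]

# Hypothetical neuron positions in two imaging sessions.
session_1 = [(0, 0), (5, 5), (10, 0)]
session_2 = [(10.5, 0.2), (0.3, -0.1), (5.1, 4.8)]
match, total = match_points(session_1, session_2)
print(match, round(total, 3))
```

Here match[i] is the index in session_2 of the neuron paired with session_1[i].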

Cell transplantations are being used to treat a variety of diseases. One example is the transplantation of insulin-producing pancreatic islet cells into people with diabetes. Often, these cells are radio-labeled, allowing researchers to assess the success of transplantation by calculating the radioactivity in the region of interest. Simultaneous PET/MRI imaging has proven to be more effective for cell monitoring than many other methods as it requires less exposure to radiation and the images obtained have greater sensitivity and better resolution. The analysis of these images remains daunting, however. In this paper researchers have developed an AI algorithm combining unsupervised machine learning and deep learning approaches to effectively analyze images obtained from PET/MRI scans [Hayat et al., 2023] 🇺🇸

Structural variants (SVs) - defined as genetic mutations over 50 base pairs in length - have been linked with various diseases. Currently, SVs are detected most effectively with long read (LR) sequencing, because it captures longer stretches of DNA. However, short read (SR) sequencing is the more common method due to its lower cost, and far more SR sequencing data is publicly available. In this study, researchers developed an algorithm that detects structural variants more accurately from SR sequencing data. They applied the algorithm to SR data from various databases including the 1000 Genomes database and the Biobank Japan (BBJ) database [Kosugi et al., 2023] 🇯🇵

(B) New Databases

Building databases to store and share biological knowledge

As viruses like the coronavirus (SARS-CoV-2) spread through the population and mutate, existing vaccines may become ineffective. This makes it very important to keep track of the ‘mutational landscape’ of the virus, identifying mutations in the viral genome that are becoming more prevalent and may be a sign of vaccine resistance. Researchers from Spain and Finland have developed a database containing over 5 million SARS-CoV-2 genome sequences. The data is available for other researchers to view and analyze in an accessible online portal [Saldivar-Espinoza et al., 2023] 🇪🇸🇫🇮

A team of researchers from China, Australia, and the US have constructed an eye biomarker database - drawing on thousands of previously published studies and centralizing their collective discoveries in a single location. The database contains nearly 1000 different clinical features that can help researchers and clinicians diagnose and manage hundreds of different diseases. Interestingly, the database contains information on ‘non-biomarkers’ as well - clinical features that prior studies have shown are not associated with the development of specific diseases. This will help future researchers avoid spending a lot of time re-investigating dead ends [Zhang et al., 2023] 🇨🇳🇦🇺🇺🇸

Much of the data that we have on the human microbiome is 16S sequencing data, which is cheaper than other sequencing methods but has certain limitations. Because 16S sequencing captures only a small portion of the bacterial genome (used as a barcode to detect bacteria), we often know what genera of bacteria are present in 16S data, but usually not the specific species. That is, until the development of RexMap - an algorithm that can analyze existing 16S sequencing data and infer what species are present. RexMap was applied to 16S sequencing data from almost 30,000 people across the world [Segota et al., 2023] 🇺🇸

The brain, like the microbiome, is one of the human body’s remaining frontiers - scientists still have a long way to go in mapping out the variety of different brain cells and their individual functions. The scBrainMap database was developed to support this effort. It is based on single-cell sequencing data - which records the gene expression activity of individual brain cells. In total, scBrainMap defines 124 brain regions and 4881 cell types, each with associated gene expression data. It also identifies marker genes for specific diseases - which can inspire new drug development [Chi et al., 2023] 🇨🇳🇩🇪

When it comes to identifying novel drug compounds, inspiration often comes from natural products. This makes repositories of natural products and their active compounds indispensable. The South American nation of Peru has joined a list of other countries (including Mexico, Brazil, India, and China) in establishing their own national database of natural products. The new Peruvian database, called PeruNPDB, currently contains 280 natural products, their chemical structure, and their known physical attributes (weight, surface area, solubility, etc.). Computational biologists across the world can use their own drug discovery programs to screen these compounds against a variety of drug targets, and evaluate them as potential treatments [Barazorda-Ccahuana et al., 2023] 🇵🇪

(C) Bioinformatics Analysis

Bioinformatics tools applied to answer key biological questions

Atherosclerosis (AS) - the buildup of plaque in the arteries - and atherosclerotic cardiovascular diseases (ACD) are a leading cause of death worldwide. Different types of cell death are involved in the progression of both diseases. Researchers have recently identified a new form of cell death called cuproptosis that is driven by copper accumulation in cells. In this study, researchers identified cuproptosis-related genes (CRGs) from datasets in the Gene Expression Omnibus (GEO). They also constructed a competing endogenous RNA (ceRNA) network to understand possible regulatory mechanisms in AS [Chen et al., 2023] 🇨🇳

Another long-standing public health issue is heart failure (HF), with dilated cardiomyopathy (DCM) being a common form of the disease. In this study, researchers performed a weighted gene co-expression network analysis (WGCNA) to identify correlated sets of genes that are related to DCM. They identified 8 key genes that may play a significant role in the pathogenesis of DCM-HF as possible therapeutic targets [Zhou et al., 2023] 🇨🇳
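
The first step of a co-expression network analysis can be sketched in a few lines. The example below assumes the common "unsigned" form, where the connection strength between two genes is the absolute Pearson correlation of their expression profiles raised to a power beta. The gene names, expression values, and beta = 6 are illustrative assumptions, not values from the study. A full WGCNA goes further, clustering genes into modules and relating those modules to disease traits.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency: |correlation| ** beta for each gene pair.
    Raising to a power shrinks weak correlations towards zero."""
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g in genes for h in genes if g != h
    }

def connectivity(adj, gene):
    """Total connection strength of one gene to all other genes."""
    return sum(w for (g, _), w in adj.items() if g == gene)

# Hypothetical expression levels for three genes across five samples.
expr = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10.5],  # rises and falls with geneA
    "geneC": [5, 1, 4, 2, 3],     # largely unrelated to geneA
}
adj = soft_threshold_adjacency(expr)
for gene in expr:
    print(gene, round(connectivity(adj, gene), 4))
```

Genes with high connectivity are the "hub" candidates that this kind of analysis tends to flag for follow-up.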

Another recent study analyzed heart failure (HF) as a response to pressure overload-induced cardiac hypertrophy. In this study researchers have identified Transcription Elongation Factor A3 (Tcea3) as a potential drug target after identifying it as a differentially expressed gene (DEG) in several GEO datasets. They predicted its involvement in regulating fatty acid oxidation (FAO), which is typically dysfunctional in people with heart disease [Guo et al., 2023] 🇨🇳

Hypotrichosis is a rare form of alopecia (hair loss) that affects both men and women around the world. Previous studies have established that hypotrichosis results from mutations that deactivate the LIPH (Lipase-H) or LPAR6 genes. In this study, researchers employed various computational techniques to evaluate the role of non-synonymous single nucleotide polymorphisms (nsSNPs) in the LIPH gene - genetic mutations that change one of the amino acids in the encoded protein. Specifically, they identified the nsSNPs most likely to be harmful based on the location of the mutations in the DNA and the 3D structure of the encoded LIPH protein [Khan et al., 2023] 🇵🇰🇸🇦🇺🇸
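
What makes a SNP "non-synonymous"? The sketch below illustrates the distinction using the standard genetic code. The short coding sequence is made up for illustration (it is not the LIPH sequence), and classify_snp is a hypothetical helper, not a tool used in the study. A single-letter change in the DNA is synonymous if the affected codon still encodes the same amino acid, and non-synonymous if the amino acid changes or a premature stop signal appears.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_snp(cds, position, new_base):
    """Classify a single-base change in a coding sequence.
    `position` is 0-based and counted from the first base of the start codon."""
    codon_start = position - position % 3
    ref_codon = cds[codon_start:codon_start + 3]
    offset = position - codon_start
    alt_codon = ref_codon[:offset] + new_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # premature stop codon, also non-synonymous
    return f"missense ({ref_aa}->{alt_aa})"

# Made-up coding sequence: ATG GCC TGG AAA encodes M-A-W-K.
toy_cds = "ATGGCCTGGAAA"
print(classify_snp(toy_cds, 5, "T"))  # GCC -> GCT
print(classify_snp(toy_cds, 4, "T"))  # GCC -> GTC
print(classify_snp(toy_cds, 8, "A"))  # TGG -> TGA
```

Deciding which non-synonymous changes are actually harmful is the hard part, which is where the structural and location-based predictions described above come in.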

Circadian rhythm is a physiological process in mammals that regulates the sleep-wake cycle. It is maintained by the circadian clock genes (CCGs). Interestingly, CCGs are also implicated in the development of cancer, and an imbalance in the circadian clock (CC) can contribute to the progression of endocrine cancers like ovarian cancer (OC). In this study, using data from The Cancer Genome Atlas, researchers have identified 15 potential key genes related to the circadian clock which are associated with OC patient survival and immune cell infiltration in the tumor immune microenvironment (TIME). Understanding the molecular mechanisms of action for these genes in the cancer context will stimulate the discovery of novel biomarkers and immunotherapy targets [Zhao et al., 2023] 🇨🇳

(D) Commercial News

Bioinformatics-based businesses driving new drug development

Cancer: Several business partnerships are advancing personalized medicine for cancer patients. In the UK, the government-owned company Genomics England has partnered with the bioinformatics company Seqera Labs to gather and analyze whole genome sequencing data for patients through the British National Health Service. The companies Invivoscribe and Complete Genomics are working together to develop biomarker tests that are built to detect specific cancer mutations. In both cases, identifying genetic mutations in patients that worsen prognosis or influence the success of drug treatment will make cancer care more personalized 🇬🇧🇪🇸🇺🇸

Gene Silencing: Researchers at the Institute for Research in Biomedicine (IRB) in Barcelona are using computer simulations to design drugs that can silence disease-causing genes. The drugs themselves are strings of nucleotides - the basic components that make up DNA - which bind to the mRNA molecules that mediate gene expression and disable the production of proteins. The researchers are designing strings of nucleotides that can bind to a target mRNA, then using their simulations to tweak the nucleotide sequence at every possible position - to see if these tweaks will improve the chemical stability or effectiveness of the drug candidate. Researchers at IRB are working closely with several biotech firms (Nostrum Biodiscovery, Biogen, Ionis Pharmaceuticals) and recently published a paper on their drug-finding approach 🇪🇸🇺🇸

De-Extinction: Is Jurassic Park a real possibility? One company is trying to bring back the woolly mammoth, as well as the dodo bird and the Tasmanian tiger. Scientists at Colossal Biosciences are gathering up ancient mammoth bone samples to recreate the mammoth genome, filling in gaps in the DNA sequences (because samples are so old, the DNA has degraded) with DNA from the mammoth’s closest relative: the modern Asian elephant. The Asian elephant is vital to this project for another reason. Ultimately, the scientists plan to edit the DNA in Asian elephant cells so that it is more like mammoth DNA, then implant the edited cells into the eggs of a surrogate elephant mother. In about two years (elephants have a very long gestation period), a new ‘mammoth’ will be born 🇺🇸

Cell & Gene Therapy: We are becoming ever more capable of retuning the body of someone with disease through cell and gene therapies - the implantation of healthy cells and the editing of DNA, respectively. In this space, bioinformatics companies play an important role. The firm Form Bio (which, coincidentally, was spun out of Colossal Biosciences) has recently developed an AI service called FORMsight that improves the manufacturing of cell and gene therapies. FORMsight identifies signs of manufacturing contamination and poor batch quality very early in the drug manufacturing process - enabling companies to identify the root causes of these failures so that they can be addressed. Ultimately, this will help companies get drugs out faster and at lower cost 🇺🇸

De Novo Protein Design: You can edit DNA to fix mutant proteins - but what about creating new kinds of proteins to improve health? This is the goal of ‘de novo protein design’. A new player in this space is the company AI Proteins, which just formed a Scientific Advisory Board to guide its drug development efforts (establishing an SAB is an early step in the formation of most successful biotech/pharmaceutical companies). The company is using AI to design so-called ‘miniproteins’ that are specifically tailored to bind molecular targets in the body. These miniproteins are touted as more stable and cost-effective, and less toxic than traditional small molecule drugs and antibody treatments 🇺🇸

Research Community

This edition’s new studies involved 13 countries around the world

Spread the Word!

Thanks for reading! If you want to help us in our mission to show how researchers can make the most of public data, please share this newsletter with any colleagues who would be interested.
