The Future of Synthetic Biology: Simulation and Discovery

Chapter 1: The Intersection of Simulation and Biology

In a recent virtual presentation, I witnessed an intriguing demonstration where a pair of virtual hands navigated a screen displaying a digital environment. "By accessing the PubChem chemical database, we can retrieve the 2D structure of numerous small molecules," explained Steve McCloskey, the founder of Nanome, a startup specializing in virtual reality tools for scientific research. Moments later, the well-known 2D representation of atomic bonds began to oscillate. "We are currently executing an energy minimization function to predict the 3D conformation of this molecule," he elaborated. Another participant asked, "Is this the small molecule that interacts with the COVID-19 protease?" Steve clarified, "No, but that provides a great segue into our next example." With that, a new floating screen emerged, showcasing the 3D model of the COVID-19 protease.
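For readers curious about the workflow Steve demonstrated, the sketch below reproduces the general idea using the open-source RDKit toolkit: start from a 2D structure (here aspirin, whose SMILES string PubChem lists under CID 2244) and run an embedding and force-field minimization step to produce a 3D conformation. It is an illustration of the concept only, not the pipeline Nanome actually uses.

    # Sketch: from a 2D small-molecule structure to an energy-minimized 3D conformation.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"          # aspirin (PubChem CID 2244), 2D connectivity only
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))  # parse and add explicit hydrogens

    AllChem.EmbedMolecule(mol, randomSeed=42)     # generate an initial 3D guess
    AllChem.MMFFOptimizeMolecule(mol)             # energy-minimize with the MMFF94 force field

    print(Chem.MolToMolBlock(mol))                # 3D coordinates in MOL-file format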

As Steve highlighted a pocket within the COVID-19 protein, where a small molecule was fitted, I found myself reflecting on the surreal nature of attending a public webinar from my home during a pandemic. Here I was, alongside scientists, entrepreneurs, and bloggers, collectively exploring a massive viral protein model in a virtual space. It made me ponder how vastly different this experience was from that of a contemplative Charles Darwin, who once sat on a beach in the Galápagos Islands, meticulously documenting wildlife and theorizing about species evolution. I couldn't help but wonder what Darwin would think of the rapid advancements and methodologies in modern biology.

The emphasis on big data is increasingly prominent in biological research. To illustrate this growth, researchers have compared data generation on platforms like YouTube with genomic data accumulation. They project that by 2025, YouTube will see uploads of 1,000 to 1,700 hours of content every minute, amounting to 1 to 2 exabytes per year (1 exabyte equals 1 billion gigabytes). In contrast, genomic sequencing alone is expected to yield between 2 and 20 exabytes in the same year, potentially twenty times more than YouTube! This exponential increase in data volume has spurred the development of computational tools and software that help researchers manage and extract valuable insights from such vast datasets. A 2016 survey of 704 National Science Foundation-funded biologists indicated that 90% of respondents are currently working with, or will soon be working with, large datasets. It has become common for biological publications to be accompanied by web applications or open-source software packages. The traditional hypothesis-driven scientific approach is gradually giving way to a discovery-driven model, where data collection and analysis precede hypothesis formulation.
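For a sense of how those projections compare, here is the arithmetic behind the quoted figures (values as cited above; note that the "twenty times" claim holds only at the extremes of the two ranges):

    # Back-of-envelope comparison of the 2025 projections quoted above.
    EXABYTE_IN_GB = 1_000_000_000        # 1 exabyte = 1 billion gigabytes

    youtube_eb = (1, 2)                  # projected video data per year, in exabytes
    genomics_eb = (2, 20)                # projected sequencing data per year, in exabytes

    low = genomics_eb[0] / youtube_eb[1]    # conservative genomics vs. high YouTube estimate
    high = genomics_eb[1] / youtube_eb[0]   # optimistic genomics vs. low YouTube estimate
    print(f"Genomics may produce {low:.0f}x to {high:.0f}x as much data as YouTube")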

This transformation in the biological sciences is largely a response to the intricate nature of biological systems. Each of the 10 to 40 trillion cells in the human body maintains a delicate equilibrium of millions of molecular machines called proteins, which perform various functions. The six billion letters of genetic information can encode over twenty thousand distinct protein types. Furthermore, these proteins can combine to undertake remarkably complex tasks, such as the transport of materials within cells.

The first video, "Simulation #268 Dr. George Church - Synthetic Biology," delves into the transformative potential of synthetic biology and how it can reshape various sectors.

Chapter 2: The Origins and Evolution of Synthetic Biology

The subfield of synthetic biology emerged early in our exploration of biological complexity. In 1961, following mutation experiments in bacteria, scientists theorized the existence of regulatory gene circuits that enable cells to adapt to their surroundings. This concept inspired engineers and scientists to envision a future where principles of rational design and standardized modularization would enable the reliable creation of living cells capable of performing a wide range of useful functions in industry and healthcare. Thus began the ongoing struggle of human innovation against the intricacies of biology.

Over the following decades, advancements in data collection technology have allowed us to investigate biological complexity at unprecedented rates. The cost of sequencing an entire human genome plummeted from over $100 million upon the completion of the Human Genome Project in 2003 to around $1,000 by 2015. Additionally, reagent kits and instruments for RNA transcriptome analysis have made it routine to measure gene expression, a proxy for protein levels, within cells. DNA synthesis costs have also decreased, allowing gene blocks ranging from several hundred to a few thousand base pairs to be purchased for a few hundred dollars. These technological advancements have enabled undergraduates and even high school students to engage in genetic engineering projects, exemplified by the 2018 iGEM contest's winning high school team, which developed a bacterial strain that produces the active component of catnip.

However, challenges persist. Many gene circuits created in laboratory settings never make it beyond those environments, and when other labs attempt to replicate the experiments, they often struggle to achieve the same outcomes. A survey conducted by Nature involving 706 biology researchers revealed that, on average, they believe only 59% of published studies in their field are reproducible. In response, laboratory automation has surged. The global market for automated liquid handling was valued at $585 million in 2016 and is expected to double by 2023. This automation is particularly advantageous in genome engineering, where CRISPR-Cas9 technology enables precise edits at specific locations in the genome, directed by synthesized guide RNAs. The same system allows for high-throughput genetic screening in cells, perturbing thousands of genes in parallel with unprecedented efficiency. The forefront of lab automation is microfluidics, which seeks to miniaturize large liquid-handling robots into devices small enough to fit in the palm of a hand.
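To make the role of the guide RNA concrete, the sketch below shows one simplified step behind designing a CRISPR screen: scanning a DNA sequence for candidate target sites. For the commonly used SpCas9 enzyme, a 20-nucleotide protospacer must sit immediately upstream of an "NGG" PAM motif; real screen design also scores off-target risk and cutting efficiency, which this toy example omits.

    # Sketch: find 20-nt SpCas9 guide candidates (forward strand only) next to an NGG PAM.
    import re

    def find_spcas9_guides(sequence: str) -> list[str]:
        """Return 20-nt protospacer candidates immediately 5' of an NGG PAM."""
        sequence = sequence.upper()
        # A lookahead lets overlapping candidate sites all be reported.
        return [m.group(1) for m in re.finditer(r"(?=([ACGT]{20})[ACGT]GG)", sequence)]

    example = "ATGCGTACCGGATCCTTAGCAAGGTTTACGGCTAGCTAGGCTAACGGATCGTACGATCGG"
    for guide in find_spcas9_guides(example):
        print(guide)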

Yet, for some, the rapid pace of advancement still falls short. Even with automation, cell growth and chemical reactions take time. Leading gene-editing firms, equipped with expertise and cutting-edge machinery, often take several months to deliver genetically engineered cells. Moreover, the data extracted from cells is merely a snapshot of the complete information available. Even some of the most data-intensive methods, such as whole-transcriptome sequencing, offer only an indirect readout of protein populations, revealing nothing about their spatial arrangement or functional dynamics. Techniques like Förster resonance energy transfer (FRET) can provide detailed information on specific protein interactions and conformational changes, yet they are complex to set up and do not scale to observing all protein interactions simultaneously. Additionally, most experiments consume valuable samples, and many measurements require breaking open cells, making time-lapse studies of individual cells impractical.

To address these limitations, some scientists have begun to envision a future where biological processes can be explored within a computer. The idea is not new (the first molecular dynamics simulation of a protein dates back to the late 1970s), but a bigger question remains: can we simulate an entire cell on a computer? A groundbreaking 2012 paper titled "A Whole-Cell Computational Model Predicts Phenotype from Genotype" suggested that it might be possible. The work represented a significant advance, allowing researchers to gain insights into cellular functions and to predict how disrupting individual genes affects growth rate. However, those predictions were accurate in only about two-thirds of cases. Furthermore, determining the impact of minor alterations to protein sequences remains impossible without experimental validation, which limits how much experimental time such models can actually save.

The second video, "How Synthetic Biology Will Impact You [EP# 238]," explores the potential implications of synthetic biology for everyday life and the future of biotechnology.

Continuing the exploration of computational models, the gold standard would be to simulate an entire cell at the atomic level. In such a scenario, the functional and kinetic effects of protein changes could be determined computationally rather than experimentally. This is possible in principle because protein dynamics emerge from well-characterized interatomic energy functions: given the potential energy of every atomic interaction, a simulation can relax a structure toward a low-energy conformation and follow its motion over time. One of the leading frameworks for calculating molecular dynamics is Nanoscale Molecular Dynamics (NAMD), initially released by the University of Illinois at Urbana–Champaign in 1995. The surge in computational power and accessibility has catalyzed significant growth in this field.
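The energy minimization idea itself can be illustrated in a few lines. The toy sketch below relaxes a handful of atoms interacting through a Lennard-Jones potential using a general-purpose optimizer; it is a pedagogical stand-in for what packages like NAMD do with far richer force fields and vastly more atoms.

    # Toy energy minimization: relax a small Lennard-Jones cluster.
    import numpy as np
    from scipy.optimize import minimize

    def lennard_jones_energy(flat_coords, epsilon=1.0, sigma=1.0):
        """Total pairwise Lennard-Jones energy for a set of 3D atomic positions."""
        coords = flat_coords.reshape(-1, 3)
        energy = 0.0
        for i in range(len(coords)):
            for j in range(i + 1, len(coords)):
                r = np.linalg.norm(coords[i] - coords[j])
                energy += 4 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
        return energy

    rng = np.random.default_rng(0)
    initial = rng.uniform(-1.5, 1.5, size=(4, 3)).ravel()   # 4 "atoms", random starting positions
    result = minimize(lennard_jones_energy, initial, method="L-BFGS-B")
    print("Minimized energy:", result.fun)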

So how near are we to scaling this technology to an entire human cell? A 2016 study modeled the Mycoplasma genitalium bacterial cell with molecular dynamics software at atomic-level detail. The simulations captured approximately 100 million atoms for 20 nanoseconds and about 10 million atoms for 140 nanoseconds. To put this into perspective, a typical human cell is estimated to contain 100 trillion atoms, one million times more than the researchers' simulation. Moreover, it takes an average protein a few milliseconds to fold, roughly 50,000 times longer than the longest simulation the team conducted. While breakthroughs in computational power may enable larger simulations, determining the initial arrangement of molecules in a human cell remains an arduous task that will require extensive laboratory data. Additionally, to expedite calculations, most simulation force fields do not permit covalent bonds to break. Bond breaking, a process performed by many enzymes, requires quantum mechanical simulations that are far more computationally demanding.
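The gap between that simulation and a whole human cell can be laid out explicitly. The numbers below are the ones quoted above; the ~7 ms folding time is an assumption chosen to be consistent with the roughly 50,000-fold figure in the text.

    # Back-of-envelope scale gap between the 2016 simulation and a human cell.
    atoms_simulated = 1e8        # ~100 million atoms in the M. genitalium run
    atoms_human_cell = 1e14      # ~100 trillion atoms in a typical human cell
    longest_run_ns = 140         # longest reported simulation, in nanoseconds
    protein_fold_ns = 7e6        # ~7 ms for an average protein to fold (assumed value)

    print(f"Size gap: {atoms_human_cell / atoms_simulated:,.0f}x more atoms")
    print(f"Time gap: {protein_fold_ns / longest_run_ns:,.0f}x longer than the longest run")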

Due to these current constraints, most researchers concentrate on simulating interactions between a limited number of proteins, or between a protein and a small molecule, over extended periods. To determine a protein's structure in the first place, scientists often rely on X-ray crystallography or cryo-electron microscopy, processes that can take months to yield a single structure. Many of these structures are publicly accessible through the RCSB Protein Data Bank, which has grown from fewer than 10,000 available structures to over 150,000 in the past two decades. During the 2018 CASP13 protein structure prediction competition, teams from academic institutions and Google DeepMind trained machine learning models on these experimental structures to predict the folded structure of proteins from their amino acid sequences alone with unprecedented accuracy. This breakthrough enables computational biologists to initiate simulations on proteins that lack experimentally determined structures.
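Because those experimentally determined structures are the usual starting point for simulation, fetching one is often step zero of a project. The snippet below downloads a single entry from the RCSB archive via its public download URL; 6LU7, a deposited structure of the COVID-19 (SARS-CoV-2) main protease, is used here purely as an example ID.

    # Sketch: download one structure file from the RCSB Protein Data Bank.
    import requests

    pdb_id = "6LU7"  # example entry (a SARS-CoV-2 main protease structure)
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    with open(f"{pdb_id}.pdb", "w") as handle:
        handle.write(response.text)

    # Quick sanity check: count atom records in the downloaded file.
    atom_lines = [line for line in response.text.splitlines() if line.startswith("ATOM")]
    print(f"{pdb_id}: {len(atom_lines)} atom records")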

As we witness the revolutions in discovery-driven science and big data within biology, could we be on the brink of another transformation where computational simulations become the primary means of hypothesis testing prior to experimental validation?

"My mind seems to have become a kind of machine for grinding general laws out of large collections of facts." — Charles Darwin