Guest Column | November 7, 2019

Synthetic Biology Data Management Solutions Enabling Early Gene Therapy Discovery

Radha Krishnakumar

The last decade spurred enormous advances in the fields of synthetic biology and gene therapy. We have been witnessing the results since 2017, with a growing number of FDA approvals for novel therapeutics. The concept of gene therapy is a simple one: replace something faulty with a functioning version. This encompasses “traditional” gene replacement via viral or non-viral vectors, or gene editing using nucleases, CRISPR-Cas, or RNAi. The common thread among all gene therapy methods is the need for rapid design, building, and testing of the DNA components that go into the final gene delivery product. This article highlights the difficulties in searching for and storing general biomedical and sequence-related information with traceable meta-data. It also provides a snapshot of some of the available resources for electronic data management, including guidelines, data consolidation, design tools, documentation, and sharing for synthetic biology and molecular biology-based human therapeutics.

The first component in any gene therapy platform development is the synthetic DNA, but even before selecting the source of synthetic DNA, researchers must create the in silico designs that initiate the design-build-test cycle. Capturing the DNA sequences is the first step, and no matter where this is initiated (an academic laboratory, a large biotech, or a start-up company), the process invariably involves manual lookups of freely available full or partial sequences in published literature and public databases. Sources that researchers frequently use to collect electronic sequences, such as NCBI, Addgene, SnapGene, and UniProt, provide routes to trace back the source of the sequence and the publication (if one is associated). Then there are the commercial vectors, which provide convenience and traceability, but not all are created the same. Even when this collection process is completed, the sequences frequently come with no explanation, reasoning, or meta-data as to why these particular (regulatory) elements were chosen for a gene delivery vector. In several instances even a precise description of the construction process is untraceable. An example is the commonly used promoter hEF1A (also referred to as Ef1A), which exists in varying sizes depending on the presence or absence of intronic sequences. The reason for the shortened length of this promoter in some plasmids may not be obvious, nor is there an explanation of why a certain version was chosen. Studies have shown that subtle differences in sequence can have a direct effect on the expression of the effector molecule in the cell line of interest (for example, Zheng and Baum, Int J Med Sci. 2014). Given this lack of traceability of components, it behooves the research lab to design and test several variants of the same regulatory sequences in combination with the effector molecule in the required in vivo environment.
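
As a rough illustration of what traceable meta-data could look like in practice, the sketch below stores each part together with its source, reference, and the reasoning behind its selection, and reports which of those fields are missing. The class name, field names, and sequences are hypothetical placeholders chosen for illustration; they do not come from any of the platforms discussed here, and the sequences are not real hEF1A sequences.

```python
from dataclasses import dataclass, fields
import hashlib


@dataclass
class PartRecord:
    """One DNA part plus the provenance needed to trace it later (hypothetical schema)."""
    name: str
    sequence: str
    part_type: str        # e.g. "promoter"
    source: str = ""      # database, vendor, or plasmid the sequence came from
    reference: str = ""   # associated publication, if any
    rationale: str = ""   # why this version of the element was chosen

    def __post_init__(self):
        # Normalize case so identical sequences always compare equal.
        self.sequence = self.sequence.upper()
        unexpected = set(self.sequence) - set("ACGTN")
        if not self.sequence or unexpected:
            raise ValueError(f"{self.name}: expected a non-empty A/C/G/T/N sequence")

    def checksum(self) -> str:
        """SHA-256 of the sequence, a stable fingerprint for telling variants apart."""
        return hashlib.sha256(self.sequence.encode("ascii")).hexdigest()

    def missing_metadata(self) -> list:
        """Names of provenance fields that were left empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) == ""]


# Placeholder sequences only; these are NOT real hEF1A promoter sequences.
with_intron = PartRecord(
    name="hEF1A_with_intron_placeholder",
    sequence="acgtacgtacgtacgtacgtacgt",
    part_type="promoter",
    source="hypothetical plasmid A",
    reference="hypothetical citation",
    rationale="intron retained; chosen for the target cell line",
)
short_form = PartRecord(
    name="hEF1A_short_placeholder",
    sequence="ACGTACGTACGT",
    part_type="promoter",
    source="hypothetical plasmid B",
)

# Print name, length, a short fingerprint, and any provenance gaps for each variant.
for part in (with_intron, short_form):
    print(part.name, len(part.sequence), part.checksum()[:12], part.missing_metadata())
```

A record like the second one, with a known source but no reference or rationale, is the situation described above: the sequence can be used, but nobody can later say why that version was picked.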

It is apparent that there is no consolidated open source for the sequences and meta-data critical for initiating the design of a therapeutic platform, and this holds even for the regulatory sequences commonly used for non-specific or specific expression, leaving aside proprietary effector molecules. What is needed is a platform that provides information on parts and their associated meta-data, with an integrated workflow that combines database building, in silico design, and analytics to initiate and complete the assembly of the whole gene delivery vector. Ideally this platform would interoperate with any laboratory’s own database and/or independent analytical and design tools. While synthetic biologists and metabolic pathway engineers benefit from the vast resources within SBOL, BioBricks, and the JBEI registry for standardization of parts, nomenclature, and visualization, researchers in gene therapy have not yet been introduced to such a dedicated open source compilation of sequences and meta-data relevant to biotherapeutics.

In this respect, researchers are encouraged to consult the Harvard Biomedical Data Management resource page, ideally before initiating any large-scale experiment designs. This resource recommends best practices and offers effective management options for every stage of biomedical research. Researchers can particularly benefit from the Repositories Matrix, a compilation and comparison of general and publication data repositories.

Once the nucleotide sequences are compiled, the process of ordering the parts begins, and this can be cumbersome on account of the varying competencies of the numerous synthetic DNA companies. Many (but not all) synthetic DNA companies have joined a consortium and pledged to follow government-issued biosecurity guidelines (see a recent NPR article on this topic). This recommendation does put all players on an even field with regard to resources for synthetic DNA. Placing an order for sequences containing low-complexity regions or repeats has its own hurdles: it can involve putting non-disclosure agreements (NDAs) in place with several synthetic DNA companies, then either working with technical support or, in most cases, copying and pasting the desired sequence into each company’s interface to gauge the likelihood of successful synthesis. Some companies help researchers design their vectors on the company’s platform or website and proceed directly to placing the order. VectorBuilder is one such example, and GenScript offers GENSMART as a free platform for design and storage with an integrated ordering process.
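
Before pasting a sequence into several vendor interfaces, a researcher could run a quick local pre-screen for the features that tend to complicate synthesis. The sketch below is one minimal way to do that using GC fraction, the longest homopolymer run, and exactly repeated k-mers. The function names and every threshold are assumptions chosen for illustration; they are not any company’s actual acceptance criteria, and the check looks only at direct repeats on one strand.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    longest = run = 0
    previous = ""
    for base in seq.upper():
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def repeated_kmers(seq: str, k: int = 12) -> dict:
    """Every k-mer that occurs more than once (direct repeats, one strand only)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


def screen(seq: str, max_homopolymer: int = 8, gc_range=(0.25, 0.75), k: int = 12) -> list:
    """Return human-readable flags; an empty list means nothing was flagged.

    All default thresholds are illustrative assumptions, not vendor rules.
    """
    if not seq:
        raise ValueError("sequence is empty")
    flags = []
    gc = gc_fraction(seq)
    if not gc_range[0] <= gc <= gc_range[1]:
        flags.append(f"GC fraction {gc:.2f} outside {gc_range}")
    run = longest_homopolymer(seq)
    if run > max_homopolymer:
        flags.append(f"homopolymer run of {run}")
    repeats = repeated_kmers(seq, k)
    if repeats:
        flags.append(f"{len(repeats)} repeated {k}-mer(s)")
    return flags


# Toy inputs: a short mixed sequence and a poly-A stretch.
for candidate in ("ATGCGTACCGTTAGC", "A" * 20):
    print(candidate, screen(candidate))
```

A pre-screen of this kind does not replace a vendor’s own feasibility check, but it can indicate in advance which designs are likely to need a conversation with technical support.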

The Rise of Commercial Off-the-Shelf Solutions

Within the last 5-10 years there has been a growth of commercial off-the-shelf (COTS) solutions developed to enable gene design in metabolic engineering or therapeutics, which can be configured specifically for each laboratory process and workflow. Some of these software solutions are available as web-based applications, and are either competitively priced for academic and individual access or available as open source downloadable applications offering the basic and essential functionalities of search, compile, design, and share. One of the first in this space is TeselaGen, an AI-powered platform for engineering biological systems that provides users with a single platform for DNA design, assembly, data gathering, and analytics.

Catalytic Data Science is a cloud-based platform that ingests and integrates data and supports a collaborative environment for research and development. Catalytic has an extensive (~30 million) and growing open source repository of publicly available research articles that can be searched using its AI-driven algorithms, and a resource dashboard that links the user to sources for publications, sequences, and general data (for example, NCBI, GeneCards, UniProt). The community resources dashboard also hosts the most extensive links to analytical tools, web-based applications, and information sources, covering over 50 categories including bacterial genomes, synthetic biology, drug compounds, and diseases, to name a small subset. The platform allows users to add their own resources, even other gene design software such as Geneious or CLC. Catalytic also has a chat application within the platform that allows sharing of workflows and data between team members or with any authenticated user. The Catalytic Essentials platform is available outside of enterprise solutions to life science professionals from any type of industry. To date this is likely the most concentrated resource for information related to life science research, and it encourages researchers to upload and share their articles on the platform.

Genome Compiler is another resource that offers a platform for search (with an embedded link to NCBI), collection, design, and analytics for the synthetic biology build process. The gene design platform is online, freely available to anyone who signs up and is verified, and offers an intuitive design and interface. Genome Compiler also offers enterprise software solutions tailored to a company’s needs. To ease the synthetic DNA ordering process, Genome Compiler features a tool that allows the user to directly get a quote for the design and primers from the synthetic DNA companies listed in its dashboard. Genome Compiler was recently acquired by Twist Bioscience, a leading provider of synthetic DNA solutions for biological data storage.

Benchling has workflow elements specifically addressing each step in gene therapy and gene editing research and development. The platform allows academic and non-profit users to sign up for the basic gene design platform at no cost. The basic platform in Benchling has easy access to Addgene to import biological parts, plasmids, and other sequences, and allows users to associate meta-data with every entry, thus helping to eliminate the lack of traceability of parts in the database. The Bioregistry is available with the enterprise solution. And toward eliminating sub-optimal legacy software and processes, Benchling offers an electronic lab notebook (even a calendar to plan your experiments!) in combination with its all-encompassing cloud-based suite of tools to create and share data and workflows with members of the same team.

After speaking with researchers from start-ups, established companies, independent consultants and academics, it appears that there may not be one single consolidated source of information for gene therapy components, and there is an interest (and need) expressed by researchers in this field towards building such a registry. The lack of such a registry is mitigated by the availability of these commercial platforms, some free and others at an affordable cost, that encourage researchers to follow best practices in all stages of research and enable sharing this ever-growing wealth of knowledge in human therapeutics.

Radha Krishnakumar is an independent consultant in synthetic biology and gene therapy, working with scientists at early-stage research companies to provide technical support for gene design and molecular biology for the development of novel platform technologies. Previously, Dr. Krishnakumar was at the J. Craig Venter Institute, where she worked on the creation of the first synthetic bacterial cell, genetic code expansion, bacterial whole genome engineering, and intrinsic biocontainment. At Intrexon Corporation (now Precigen), Dr. Krishnakumar led the DNA assembly and molecular biology team to provide viral and non-viral gene delivery vectors for regulated expression of proprietary effectors and gene-based therapies. Dr. Krishnakumar currently works as a scientific systems analyst at Axle Informatics. She received her Ph.D. in Microbiology from the University of Illinois, Urbana-Champaign.