By Mostapha Benhenda, data scientist at Melwy
Big data and machine learning are increasingly applied to cell therapy R&D. They can reduce risks and accelerate development. This article reviews some use cases.
Cell State and Activity Monitoring
Monitoring cell state and activity is a significant part of cell therapy engineering. For example, for immunotherapies in solid tumors, it is important to accurately measure T cell exhaustion, because it is a common cause of tumor resistance.
Cell state and activity are often monitored with transcriptomics data, such as single-cell RNA-seq (scRNA-seq). This technology counts the RNA molecules transcribed from each gene in each cell, which indicates which proteins are being synthesized in that cell. Proteins are the doers of the cell: they perform its various activities and functions.
The most elementary way to analyze transcriptomics data is with single-gene biomarkers. For example, T cell exhaustion is often monitored with gene expression levels of PD1, TIGIT, and so on.
Single-gene biomarkers are simple, but they imperfectly account for cell state and activity. For example, PD1 is not only correlated with T cell exhaustion, it is also a marker of T cell activation. Therefore, the significance of PD1 overexpression can be ambiguous.
A cell state, like T cell exhaustion, involves complex associations between genes. Genome-wide biomarkers, learnt from clinical data, can have better accuracy. They work better because they connect gene expression data with actual patient survival data.
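To make the contrast with single-gene biomarkers concrete, here is a minimal sketch of a multi-gene signature score: each gene is standardized across cells, then combined in a weighted sum. The function names, the gene GENE_X, the weights, and the expression values are all made up for illustration. They are not the weights of any published signature, and real signatures are learnt from clinical data rather than set by hand.

```python
from math import sqrt


def zscores(values):
    """Standardize one gene's expression across cells (mean 0, unit variance)."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


def signature_scores(expression, weights):
    """Weighted sum of standardized expression, one score per cell.

    expression: dict mapping gene name -> list of per-cell expression values
    weights:    dict mapping gene name -> signature weight (hypothetical here)
    """
    n_cells = len(next(iter(expression.values())))
    scores = [0.0] * n_cells
    for gene, weight in weights.items():
        if gene not in expression:
            continue  # genes missing from the data are simply skipped
        for i, z in enumerate(zscores(expression[gene])):
            scores[i] += weight * z
    return scores


# Toy data: three cells. Cells 1 and 2 have identical PD1 expression.
toy_expression = {
    "PD1": [1.0, 5.0, 5.0],
    "TIGIT": [1.0, 1.0, 5.0],
    "GENE_X": [5.0, 5.0, 1.0],  # placeholder gene, not a real marker
}
toy_weights = {"PD1": 0.5, "TIGIT": 1.0, "GENE_X": -1.0}  # made-up weights

print(signature_scores(toy_expression, toy_weights))
```

In this toy example, the second and third cells cannot be told apart by PD1 alone, but the multi-gene score separates them.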
Example: Data-Driven Monitoring of JUN CAR T Cell Exhaustion
For example, to evaluate "exhaustion-resistant" JUN CAR T cells, developed at the Mackall lab at the Stanford Center for Cancer Cell Therapy, I applied the data-driven dysfunction signature TID (Tumor Immune Dysfunction). TID is a predictive biomarker trained on 73 human cancer studies, designed at the Shirley Liu lab at the Dana–Farber Cancer Institute. This biomarker was initially designed for another immunotherapy, immune checkpoint inhibitors, but it is trained on general cancer survival data. Therefore, it can be repurposed for CAR T cell therapies.
The benefit of this data-driven approach is to show a more refined picture: with TID, JUN CAR T cells appear to be more heterogeneous than with single-gene biomarkers. That could eventually become an issue, since heterogeneity brings risks of tumor resistance.
Transcriptomics biomarkers, derived from RNA-seq data, are not the only ones. For example, histology images also provide predictive biomarkers, based on computer vision and deep learning algorithms. Their advantage is that they are easily available from pathology laboratories for most patients, whereas transcriptomics requires an additional tissue sample. Images are informative biomarkers because cell morphology (apparent on images) is related to cell state and activity.
Image biomarkers have not yet been applied to predicting cell therapy outcomes, but that is only a matter of time: they are already being applied to predict outcomes in various types of cancer, and to predict checkpoint inhibitor outcomes.
However, the performance of image-based biomarkers remains lower than that of transcriptomics-based biomarkers. Combined, image and transcriptomics biomarkers should improve predictions further.
Imaging Mass Cytometry Biomarkers
Imaging mass cytometry (IMC) biomarkers are halfway between single-cell RNA-seq and histology image biomarkers. IMC can simultaneously localize up to 37 proteins at single-cell resolution. These 37 channels give more precision than plain histology images, which only have 3 color channels. Compared with scRNA-seq, imaging mass cytometry has the benefit of providing spatial information. Therefore, spatial interactions between cells are better captured by this technology.
IMC also has limitations, so it complements the other technologies rather than replacing them. Compared to plain histology images, IMC data is more complex, and is not routinely gathered in the clinic. Compared to scRNA-seq, IMC only has 37 channels, whereas scRNA-seq has ~20k channels (the whole human genome).
To my knowledge, IMC biomarkers have not been applied to cell therapy yet. However, they have been successfully applied to survival prediction in breast cancer. The Bodenmiller lab, at the University of Zurich, simultaneously quantified 35 biomarkers, producing high-dimensional pathology images of tumor tissue, and combined them with long-term survival data for 281 patients. Their analysis revealed 18 new subgroups of breast cancers associated with distinct clinical outcomes. The same approach is expected to emerge in cell therapy research.
Experimental Data and CRISPR Engineering
Clinical data has the advantage of being close to the 'real-world' scenario. On the other hand, clinical trials are slow and expensive, compared with in vitro data, and animal in vivo data.
Therefore, an emerging experimental method is to engineer several different T cells with CRISPR-Cas9, each with different gene edits, and then screen them in a pooled format, in vitro and in vivo in mice.
This approach relies on the combination of CRISPR screening, which is a way to perturb cells with gene edits, and single-cell RNA-seq (scRNA-seq). Several technologies have emerged, such as Perturb-seq, Mosaic-seq, and CROP-seq.
For example, the Marson lab at UCSF designed a novel TGF-bR2-41BB chimeric receptor that improves solid tumor clearance by primary human T cells, in vitro and in vivo.
Using CRISPR-Cas9 editing for cell therapies is a particularly promising direction, after the recent phase 1 clinical trial by the June lab at the University of Pennsylvania, which showed that CRISPR-edited T cells are feasible and safe.
What's Next? Data-Driven CRISPR Gene Editing
By collecting enough CRISPR-Seq data, it will become possible to predict the outcomes of experiments using machine learning. To my knowledge, there is no published work yet about predicting the outcomes of cell therapy CRISPR experiments from data. However, there is plenty of machine learning and data science activity around this question.
For example, the Wei Li lab at George Washington University proposed scMAGeCK, a computational tool to study genotype-phenotype relationships at the single-cell level. Using a linear regression algorithm, scMAGeCK unravels how editing a gene X affects the expression level of other genes. Once this genotype-phenotype network is estimated, it becomes possible to predict the transcriptional phenotype of a given CRISPR experiment.
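The regression idea can be sketched as follows. This is not scMAGeCK's actual code or API: it is a plain ordinary least squares fit, written from scratch, on a made-up toy dataset. Each cell is described by an intercept plus 0/1 indicators of which gene was knocked out, and the fitted coefficients estimate how each edit shifts the expression of one target gene. All function names and numbers are assumptions for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_perturbation_effects(design, expression):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    design:     one row per cell: [1, edit_1, edit_2, ...] with 0/1 indicators
    expression: expression of one target gene, one value per cell
    Returns [baseline, effect_of_edit_1, effect_of_edit_2, ...].
    """
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, expression))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_expression(coefficients, edits):
    """Predict expression for a combination of edits, assuming additive effects."""
    return coefficients[0] + sum(c * e for c, e in zip(coefficients[1:], edits))


# Toy data: 2 control cells, 2 cells with edit 1, 2 cells with edit 2.
toy_design = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
toy_expression = [10.0, 10.0, 4.0, 6.0, 10.0, 12.0]  # made-up values

coefficients = fit_perturbation_effects(toy_design, toy_expression)
print(coefficients)
print(predict_expression(coefficients, [1, 1]))  # both edits, never observed
```

The last call illustrates the prediction step: once effects are estimated, the model can extrapolate to an edit combination that was not in the data, here under the strong assumption that effects add up linearly.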
These predictions require a lot of data, and public CRISPR screening databases are emerging. Examples include CRISPR-view by the Wei Li lab, GenomeCRISPR by the Boutros lab at the German Cancer Research Center, and DepMap by the Broad Institute and Wellcome-Sanger Institute (although the DepMap project is focused on cancer cells, whereas designing cell immunotherapies also requires immune cell data).
However, CRISPR editing does not always work as intended. This technology makes a lot of mistakes, but here too, machine learning algorithms have a role to play. They can predict which types of mistakes are likely to occur during CRISPR editing. Predicting editing errors can help design CRISPR experiments: there are often several ways to edit a gene, and the experimental scientist should choose the most precise one.
For example, the SPROUT algorithm, from the Zou group at Stanford University, predicts the length, probability and sequence of nucleotide insertions and deletions, using a gradient-boosting machine learning algorithm trained on 1,656 on-target genomic sites in primary human T cells. This predictive model can help design better T cell therapies with CRISPR.
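As a rough sketch of how such predictions could guide experimental design, the snippet below ranks candidate guides by how concentrated their predicted repair outcomes are. The outcome distributions, guide names, and the precision score (probability of the single most likely outcome) are assumptions made up for illustration. They do not reflect SPROUT's actual output format or scoring.

```python
def precision_score(outcome_probabilities):
    """Probability of the single most likely repair outcome.

    A higher value means the edit is more predictable. This scoring rule
    is an illustrative assumption, not a published metric.
    """
    return max(outcome_probabilities.values())


def rank_guides(predicted_outcomes):
    """Sort guide names from most to least precise."""
    return sorted(
        predicted_outcomes,
        key=lambda guide: precision_score(predicted_outcomes[guide]),
        reverse=True,
    )


# Hypothetical predicted outcome distributions for three candidate guides.
toy_predictions = {
    "guide_A": {"1bp_insertion": 0.40, "1bp_deletion": 0.35, "other": 0.25},
    "guide_B": {"1bp_insertion": 0.80, "1bp_deletion": 0.10, "other": 0.10},
    "guide_C": {"1bp_insertion": 0.55, "1bp_deletion": 0.30, "other": 0.15},
}

print(rank_guides(toy_predictions))
```

Under this made-up scoring, the guide whose predicted outcomes are dominated by a single repair product comes first.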
Finally, machine learning will also be able to automatically suggest which gene edits to perform, using generative algorithms like Generative Adversarial Networks (GAN). This approach will be similar to the one used in cheminformatics, where AI algorithms can propose molecule designs with desired properties. Here, algorithms will propose cell designs with desired activities.
There is no published work about this exact topic yet, but there are some papers around it. For example, David Rouquié and Joerg Wichard's team at Bayer proposed a generative AI model that can automatically design small molecules inducing a desired transcriptomic profile. The model takes as input the gene expression signature of the desired state, and outputs the chemical formula of a small molecule that has a high probability of inducing this state.
The same strategy can be followed for getting a generative model that outputs gene edits instead of small molecules. That is a promising area of future work. These technologies work together in a virtuous circle: more data gives better predictions, which in turn allow designing better experiments. This will accelerate therapy discovery.
In conclusion, there are exciting opportunities at the intersection of cell therapies, gene editing, and machine learning.