Research


Decoding the human genome through AI.

Overview

My research mission is to model large-scale, multimodal genomic data to understand how the human genome functions in both health and disease. By building models that can interpret the grammar and syntax of human DNA, we can discover the underlying causes of complex diseases and accelerate the discovery of novel, life-saving treatments.


Foundation Models for Genomics

I develop DNA foundation models that learn the evolutionary and functional language of the genome from massive datasets.

For example, Decima is a state-of-the-art model designed to predict gene expression directly from DNA sequences with unprecedented cell-type specificity. By leveraging large-scale single-cell genomic data, Decima learns to predict the functional impact of genetic mutations at high resolution, revealing how genetic variants drive disease pathology across different cell types in the human body.


Generative AI for DNA

Beyond prediction, I am interested in synthesis: engineering DNA sequences to achieve better therapeutic outcomes. For example:

  • regLM: The first autoregressive language model trained specifically to design regulatory DNA, regLM treats genomic sequences as a structured language. It can generate synthetic promoters and enhancers that follow the “syntax” of the human genome. This work demonstrated that a language model can design realistic, functional regulatory elements with tailored properties, opening new doors for synthetic biology.

  • Polygraph: As generative models become widespread in biology, validating their output is essential. Polygraph uses a suite of biological “filters” and deep learning evaluators to test whether AI-designed regulatory sequences are biologically realistic.


Scalable Infrastructure for Genomic AI

To bridge the gap between AI research and widespread biological application, I build high-performance software frameworks that enable biologists to easily adopt AI.

  • gReLU: Developed at Genentech, gReLU is a comprehensive software package designed to standardize and accelerate the training of genomic deep learning models. It provides modular components for data handling, model architecture, and interpretability, allowing researchers to move seamlessly from raw DNA sequences to biologically meaningful insights.

  • RAPIDS for Single-Cell Genomics: At NVIDIA, I led the development of GPU-accelerated workflows for single-cell analysis. We reduced data processing times from hours to seconds, enabling real-time analysis of datasets containing millions of cells.


Improving Genomic Data Quality with Deep Learning

Meaningful discovery requires high-fidelity data. At NVIDIA, I worked on deep learning methods to clean and enhance genomic data, ensuring that biological signals are not lost in technical noise. These include AtacWorks, a deep learning tool to denoise epigenomic data, as well as deep learning models for reference-free error correction in long-read sequencing data.


Translation: AI in the Clinic & Drug Discovery

Ultimately, my work aims to translate insights derived from AI models into therapeutic impact. At insitro and Genentech, I have worked on applying AI models to identify novel drug targets and design therapeutic molecules. During my postdoctoral work at Stanford University, I developed AI models to predict patient outcomes and discover mutational signatures using personalized genomic data from cancer patients.