
MaxToki: Temporal AI Model for Predicting Drivers of Cell State Trajectories Across Time

Opening virtual cell models to the axis of time, temporal models may enable programming cells towards lasting therapeutic trajectories.


  • MaxToki is a temporal AI model for generating past, intervening, and future cell states across dynamic trajectories and for predicting, in silico, interventions that induce desired cell state transitions.

  • MaxToki was trained on nearly 1 trillion gene tokens including cell state trajectories across the human lifespan to generate cell states across long timelapses of human aging.

  • We developed a temporal training strategy using continuous numerical tokenization to train the model to generate cell states along trajectories as well as understand the numerical continuum of time.

  • We developed a prompting strategy to query the trained model and found that MaxToki generalized to unseen trajectories (held-out ages and held-out cell types) through in-context learning.

  • The model inferred age acceleration in diseases of aging never seen during training, including pulmonary fibrosis and Alzheimer dementia.

  • MaxToki predicted candidate age-promoting vs. rejuvenating perturbations in cardiac cell types that we experimentally validated to cause age-related gene network dysregulation and functional decline both in vitro and in vivo.


Building on foundation models for network biology

Biological foundation models have recently shown promise for predicting the impact of perturbations on cell states. We previously developed one of the first foundation models for network biology, Geneformer, pretrained in 2021 on ~30 million and more recently >100 million human single-cell transcriptomes (gene activity profiles of individual cells) from a broad range of tissues to gain a fundamental understanding of gene network dynamics (1, 2). We demonstrated that Geneformer could enable biologically meaningful predictions across a diverse range of tasks, consistently outperforming baselines such as logistic regression, random forest, and support vector machines. Most importantly, Geneformer drove novel biological insights through in silico modeling of perturbations that we verified in the lab, including discovering a new regulator in heart cells and predicting therapeutic targets that restored the beating of diseased heart muscle cells.

            Since the initial Geneformer model, there has been a rapid growth in foundation models for network biology, including scGPT, scFoundation, GeneCompass, UCE, Nicheformer, scSimilarity, TranscriptFormer, STATE, and more. Yet, a key limitation of Geneformer and related models is that they only model one cell state (gene expression profile) at a time, precluding them from learning the progression of cell states along dynamic trajectories such as development, aging, and disease. This is like trying to learn natural language by reading only one isolated sentence at a time with no context, rather than full stories. However, to train a foundation model with a full story of cell states, one has to overcome two obstacles: one in data availability and one in computational feasibility.

     

Opening virtual cell models to the axis of time

To address this, we developed MaxToki, a temporal AI model designed to generate past, intervening, and future cell states across dynamic trajectories and predict perturbations that would induce desired trajectory shifts (3). MaxToki was trained on nearly 1 trillion gene tokens including ~175 million single-cell transcriptomes and 100 million cell state trajectories across the human lifespan from birth to the tenth decade of life to generate cell states across the long timelapses of human aging. Full stories of cell states across aging are limited by inaccessibility of disease-relevant human tissues; for example, we are unable to serially biopsy the same individual’s brain over the course of their lifespan. However, we overcame this by assembling a population-based corpus of human aging with single-cell transcriptomes from ~3800 individuals from which we constructed aging trajectories across individuals stratified by cell type and sex. This allowed us to train the model to understand the mechanisms of gene network dysregulation that occur over the long timelapses of human aging that were common across the population.

            We first pretrained the model (decoder-only transformer architecture) with an autoregressive task of generating individual transcriptomes, where transcriptomes are represented as rank value encodings with genes ranked by their expression after scaling for their general (nonzero median) level of expression across the pretraining corpus. This method prevents housekeeping genes that are always highly expressed from dominating every encoding and allows lowly expressed but potent regulators to move to the front of the rank value encoding in cell states where they are relatively overexpressed compared to their general levels. This nonparametric view of the transcriptome may also be more robust to technical artifacts. First presented by Geneformer, this method demonstrates that sequence in transformer input data does not have to be solely physical position but can encode other sequential values of importance (like ranking students in a classroom by their grade on the math exam or their speed in the mile run, as opposed to the physical position of their desk in the classroom).

            After the first-stage pretraining on individual transcriptomes with input size 4096, we then extended the context length to 16,384 with RoPE scaling to accommodate an input of multiple single-cell transcriptomes along a cell state trajectory. RoPE scaling was an important feature that allowed us to interpolate more tokens into the existing positional framework during the second-stage temporal training. During this second-stage training, we sought to design a strategy to train the model not only to generate cell states along the trajectory but also to understand the numerical continuum of time such that it could perform operations within that space. We had the key insight that if we trained the model to generate the timelapses between states, rather than the age of the cells, this would allow the model to understand the axis of time, as opposed to overfitting to age labels, and promote generalizability of the model’s learning.
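The idea behind RoPE scaling by position interpolation can be shown with the rotary angle computation itself. This is a toy single-head sketch under assumed dimensions, not the training configuration:

```python
import numpy as np

def rope_angles(positions, dim=64, base=10000.0, scale=1.0):
    """Rotary position embedding angles, with optional position interpolation.

    With scale < 1, extended positions are squeezed back into the range the
    model saw during first-stage pretraining; stretching a 4096 context to
    16,384 corresponds to scale = 4096 / 16384 = 0.25.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(np.asarray(positions) * scale, inv_freq)

# A position deep in the extended context lands on the same rotary angles
# that a position within the original 4096 window produced during pretraining.
extended = rope_angles([16380], scale=4096 / 16384)
original = rope_angles([4095], scale=1.0)
```

Because 16380 × 0.25 = 4095, the interpolated position reuses angles the pretrained model has already learned to handle, rather than extrapolating to unseen rotation frequencies.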

Rather than considering the numerical timelapse tokens the same as gene tokens, where the model learned with cross-entropy loss, we instead employed a form of continuous numerical tokenization (4) with a mean-squared error loss function that rewards timelapse predictions closer to the target value. This allowed the model to learn that timelapse tokens fall along a numerical continuum. Furthermore, we trained the model with both positive and negative trajectories so that it could learn to generate past and intervening, as well as future, cell states, and so that it did not get overtrained to generate late stage timepoints.

            Another critical advance was the design of a prompting strategy to allow in-context learning of new trajectories at inference time. We trained the model with two tasks: given a context trajectory of 2-3 cell states and the time elapsed between them (giving the model 2+ points to draw the line), we then query the model with 1) a cell state, prompting the response of the timelapse needed to now induce this next cell state, or 2) a timelapse, prompting the model to generate the cell state that would arise after that time elapses. Although the trajectories are constructed based on consistent cell types and sexes, the model is never explicitly given the label of the cell type it is expected to generate, so it must learn to base its generation on the context trajectory provided in the prompt, arming it with the versatility needed to adapt predictions to previously unseen contexts.
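The two query modes above can be made concrete with a hypothetical token layout. The interleaving of cells and timelapse tokens here is an assumed sketch for illustration, not the published prompt format:

```python
def build_prompt(context_cells, context_timelapses, query, query_type):
    """Hypothetical prompt layout: each cell is a list of rank-ordered gene
    tokens, and numeric timelapse tokens are interleaved between consecutive
    cells of the context trajectory (the 2+ points that draw the line)."""
    prompt = []
    for i, cell in enumerate(context_cells):
        prompt.extend(cell)
        if i < len(context_timelapses):
            prompt.append(("TIME", context_timelapses[i]))
    if query_type == "timelapse":
        # Mode 2: model generates the cell state arising after this time elapses.
        prompt.append(("TIME", query))
    elif query_type == "cell":
        # Mode 1: model responds with the timelapse needed to induce this state.
        prompt.extend(query)
    return prompt

# Two context cells 10 years apart, then ask what cell state arises 20 years on.
prompt = build_prompt([["GENE_A", "GENE_B"], ["GENE_B", "GENE_A"]],
                      [10.0], query=20.0, query_type="timelapse")
```

Note that no cell type label appears anywhere in the prompt; the model must infer what to generate from the context cells alone, which is what enables in-context learning on unseen cell types.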



Scaling up the computation

While RoPE scaling allowed us to expand the context length to ~16k tokens to model multiple cells within the time series, doing so came with increased computational demands, especially as we scaled up the model parameters based on scaling laws we defined in the stage 1 pretraining. To train a 1 billion parameter model with nearly 1 trillion tokens, we needed to accelerate the throughput of the model and optimize the memory efficiency. To accomplish this, we collaborated with NVIDIA to leverage the NVIDIA BioNeMo and NVIDIA CUDA-X libraries to increase training throughput by 5x and achievable micro-batch size by 4x using Transformer Engine and FlashAttention-2. Furthermore, KV caching at inference time enabled >400x faster generation speed, vastly enhancing our ability to efficiently perform large-scale genome- and tissue-wide in silico perturbation screens. This accelerated computing also improves the accessibility of the model to other researchers with limited GPU resources. We provide both the 217M and 1B parameter models (pretrained in full and mixed precision, respectively) on the MaxToki Hugging Face repository for users to apply to their own downstream questions.
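The reason KV caching accelerates generation can be seen in a toy single-head sketch (this illustrates the mechanism only, not the BioNeMo/Transformer Engine implementation): keys and values for already-processed tokens are stored, so each new token costs one attention pass rather than re-encoding the whole prompt.

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.size)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

class KVCache:
    """Keys/values accumulate across steps; each generated token only
    computes attention against the cache instead of reprocessing the
    full ~16k-token trajectory prompt from scratch."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, k, v, q):
        self.K.append(k)
        self.V.append(v)
        return attend(q, np.array(self.K), np.array(self.V))

# Incremental generation gives the same result as full recomputation.
rng = np.random.default_rng(0)
ks, vs, qs = (rng.normal(size=(3, 4)) for _ in range(3))
cache = KVCache()
for k, v, q in zip(ks, vs, qs):
    out = cache.step(k, v, q)
```

The cached result for the last step matches attention computed from scratch over all three tokens, which is why caching changes speed but not outputs.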

 

In-context learning of unseen trajectories

We found that MaxToki was able to accurately predict the timelapses between states even when the order of the cell states was permuted, changing the correct response for the relative timelapse between cell states (correlation of predicted to ground truth: 0.88). This demonstrated that MaxToki was able to effectively “do the math” within the numerical continuum of time. MaxToki was also able to accurately predict the timelapse between cell states for query cells from held-out ages (and therefore also held-out donors), demonstrating generalizability (correlation of predicted to ground truth: 0.77). Finally, using in-context learning from new context trajectories provided at inference time, MaxToki was able to accurately predict the timelapses between states for held-out cell types unseen during training (correlation of predicted to ground truth: 0.85). MaxToki significantly outperformed baselines including predictions based on the most common age of the given cell type and sex context and a linear model baseline of SGDRegressor (chosen for memory efficiency due to its ability to train incrementally).

            The generated cell states in response to a query timelapse also matched the prompted age (defined in time as the age of the last context cell plus the query timelapse) and the cell type context. The nearest ground truth neighbor of generated cells in an external model’s embedding space matched their prompted age. Furthermore, whether we trained an external model on only ground truth or only generated cells, the external model was able to classify the cell type of held-out ground truth cells nearly equivalently well, indicating the generated cells retain the important features for cell type identity. Importantly, “bag of genes” versions of ground truth cells (composed of the same genes but with their rank value order shuffled) were insufficient to train an external model to classify the cell type of held-out ground truth cells, highlighting that the generated cells are not simply an agglomeration of genes but that they retain the appropriate rank value structure.
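The "bag of genes" control is simple to construct: keep exactly the same genes but destroy their rank order. A minimal sketch with hypothetical gene tokens:

```python
import random

def bag_of_genes(rank_encoding, seed=0):
    """Control condition: the same gene content as the original rank value
    encoding, but with the rank order randomly shuffled, so any signal that
    depends on rank structure (rather than gene membership) is destroyed."""
    shuffled = list(rank_encoding)        # leave the original untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled

cell = ["TF_A", "GENE_B", "GENE_C", "GENE_D"]  # hypothetical rank encoding
control = bag_of_genes(cell)
```

If a classifier trained on such controls fails where one trained on real or generated cells succeeds, the discriminative signal must live in the rank value structure itself, which is the logic of the experiment above.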

 

Interpretability of context-specific trajectory generation

When we performed ablation studies to determine the input features most critical to the model predictions, we found that shuffling the prompt cells significantly damaged in-context learning of unseen cell type trajectories, revealing that the model relies not only on binary expression but also on rank value order to optimize its predictions. The context trajectory and query were equally important to model performance, and the model learned in a completely self-supervised manner to attend significantly more to transcription factors, key regulators of cell state trajectories, despite having no prior information about gene function. Finally, we found that the attention heads specialized to particular contexts, with some paying more attention to given features such as the query cell or timelapse tokens within trajectories of specific cell types.

 

Inferring aging acceleration in age-related diseases unseen during training

Interestingly, although we only trained the model on trajectories from normal aging (no diseases), when presented with query cells from patients affected with age-related disease, we found that MaxToki inferred aging acceleration compared to age-matched controls. MaxToki inferred aging acceleration for lung cells in patients exposed to heavy smoking or affected by pulmonary fibrosis (a known age-related lung disease associated with telomere shortening) and brain microglia cells in patients affected with Alzheimer dementia in two different studies. Crucially, the age acceleration in the brain microglia, a type of brain immune cell, was specific to patients affected by Alzheimer dementia and was not inferred in patients with only mild cognitive impairment or Alzheimer-resilient patients who have the same neuropathology in their brains but no cognitive impairment. This opens the opportunity for future studies to perform in silico perturbation analysis to identify genes expected to decelerate the aging trajectory in these disease cells, potentially promoting resilience to these age-related diseases.
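One way to make the inference concrete: the model's predicted timelapse, added to the age at the end of the context trajectory, yields an inferred biological age, and the gap to chronological age measures acceleration. The values below are made up for illustration and are not the published measurements:

```python
import numpy as np

def age_acceleration(predicted_timelapses, chronological_ages, context_end_age):
    """Mean gap between model-inferred and chronological age across a patient
    group. Inferred age = age at the end of the context trajectory plus the
    timelapse the model predicts is needed to reach each query cell; positive
    values suggest accelerated aging."""
    inferred = context_end_age + np.asarray(predicted_timelapses)
    return float(np.mean(inferred - np.asarray(chronological_ages)))

# Hypothetical example: to the model, disease cells look "older" than
# age-matched controls with the same chronological ages.
disease = age_acceleration([25.0, 28.0], chronological_ages=[70.0, 72.0],
                           context_end_age=50.0)
control = age_acceleration([20.0, 22.0], chronological_ages=[70.0, 72.0],
                           context_end_age=50.0)
```

In this toy case the disease group shows a +5.5-year mean acceleration while the age-matched controls show none, mirroring the comparison described above.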

 

Prediction of cardiac pro-aging drivers with genome-wide, tissue-wide in silico screen

We then designed an in silico perturbation strategy to predict age-promoting vs. rejuvenating targets by applying perturbations to the query cell in a given prompt and testing whether the model inferred that the timelapse needed to induce the perturbed cell would be shorter or longer than for the unperturbed trajectory. We applied this approach to perturb each gene in silico in each major cardiac cell type, performing a genome-wide, tissue-wide in silico perturbation screen for drivers of cardiac aging. Indeed, we found the predicted drivers correlated with expression changes across human aging and in telomere-shortened mice, which model the aging process. These drivers were significantly enriched for key aging pathways such as mTOR signaling and stress response.
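The screening logic can be sketched as a loop over genes. This is an assumed simplification: `predict_timelapse` stands in for a model call, overexpression is modeled as moving a gene to the front of the rank encoding, and we assume a longer inferred timelapse marks the perturbed state as further along the aging trajectory:

```python
def perturb_screen(predict_timelapse, prompt, query_cell, genes):
    """For each gene, perturb the query cell in silico and compare the
    timelapse the model infers against the unperturbed baseline."""
    baseline = predict_timelapse(prompt, query_cell)
    results = {}
    for gene in genes:
        # In silico overexpression: promote the gene to the top rank.
        perturbed = [gene] + [g for g in query_cell if g != gene]
        delta = predict_timelapse(prompt, perturbed) - baseline
        if delta > 0:
            results[gene] = "pro-aging"       # perturbed state looks older
        elif delta < 0:
            results[gene] = "rejuvenating"    # perturbed state looks younger
        else:
            results[gene] = "neutral"
    return results

# Toy stand-in model: inferred timelapse grows when "AGE_GENE" ranks first.
def toy_predict(prompt, cell):
    return 10.0 + (5.0 if cell and cell[0] == "AGE_GENE" else 0.0)

hits = perturb_screen(toy_predict, prompt=[],
                      query_cell=["GENE_X", "AGE_GENE"],
                      genes=["AGE_GENE", "GENE_X"])
```

Running this over every gene in every major cardiac cell type is what turns the single-query prompting interface into a genome-wide, tissue-wide screen.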

 

Experimental validation of novel predictions in human cells and mice in vivo

We believe it is important to validate models beyond reconstruction and bring novel predictions back to the lab to demonstrate the utility of the models in enabling verifiable biological insights. We therefore experimentally tested the in silico predictions of previously undiscovered cardiac pro-aging drivers in the lab, both in vitro and in vivo. When we tested overexpression of new MaxToki-predicted pro-aging targets in human heart muscle cells, we found this caused a shift in the gene network state, activating gene programs associated with inflammation and mitochondrial dysfunction. The perturbed heart muscle cells also exhibited functional defects including slowing of the calcium cycle that is necessary for proper beating and increased rates of irregular beating.

Remarkably, when we tested perturbing the top two pro-aging targets in young mice, we found that both perturbations were sufficient to induce cardiovascular decline in vivo with reduced heart function by echocardiography within six weeks. We are now testing whether modulating these targets in aged mice will promote resilience to age-related cardiovascular decline, pointing to new therapeutic strategies for the top cause of mortality worldwide.

This highlights the ability of MaxToki to reveal new trajectory drivers we would not have otherwise explored that are biologically verifiable both in human cells and in mice in vivo, vastly accelerating therapeutic discovery for human disease.

 

 

Overall, MaxToki is a temporal AI model for generating cell states across dynamic trajectories and predicting perturbations that induce desired trajectory shifts, and it can now be applied to accelerate the discovery of candidate targets for programming lasting therapeutic cellular trajectories.

 


MaxToki is freely available and open source, with relevant resources below:

 


A note on the model name: MaxToki is Christina’s kids’ favorite shinkansen (Japanese bullet train), whose name Toki is a homonym for “time” in Japanese. It is also a double-decker shinkansen, reminiscent of our two-stage temporal model training.



References

1.    C. V. Theodoris, L. Xiao, A. Chopra, M. D. Chaffin, Z. R. Al Sayed, M. C. Hill, H. Mantineo, E. M. Brydon, Z. Zeng, X. S. Liu, P. T. Ellinor, Transfer learning enables predictions in network biology. Nature 2023.

2.    H. Chen, M. S. Venkatesh, J. Gómez Ortega, S. V. Mahesh, T. N. Nandi, R. K. Madduri, K. Pelka, C. V. Theodoris, Scaling and quantization of large-scale foundation model enables resource-efficient predictions in network biology. Nature Computational Science 2026.

3.    J. Gómez Ortega, R. D. Nadadur, A. Kunitomi, S. Kothen-Hill, J. U. G. Wagner, S. D. Kurtoglu, B. Kim, M. M. Reid, T. Lu, K. Washizu, L. Zanders, H. Chen, Y. Zhang, S. Ancheta, S. Lichtarge, W. A. Johnson, C. Thompson, D. M. Phan, A. J. Combes, A. C. Yang, N. Tadimeti, S. Dimmeler, S. Yamanaka, M. Alexanian, C. V. Theodoris, Temporal AI model predicts drivers of cell state trajectories across human aging, bioRxiv 2026. https://doi.org/10.64898/2026.03.30.715396.

4.    S. Golkar, M. Pettee, M. Eickenberg, A. Bietti, M. Cranmer, G. Krawezik, F. Lanusse, M. McCabe, R. Ohana, L. Parker, B. R.-S. Blancard, T. Tesileanu, K. Cho, S. Ho, XVal: A continuous numerical tokenization for scientific language models, arXiv 2023.

 
 

Christina Theodoris Lab

© 2022 Gladstone Institutes