
What Is Mezcal?

Like wine or beer, mezcal is a blanket term. In its broadest definition, it’s any distilled spirit made from the agave plant, which is known in Mexico as a maguey. Mezcal is meant to be sipped neat in small cups, but in the hands of a trustworthy bartender, a mezcal cocktail is a winning situation. Today, we’re digging into how it gets made. 

The Agave Plant

Agave is a succulent native to Mexico. There are roughly 30 different varieties of agave that can be used to make mezcal. They grow in the wild and on farms, and take seven to 20 years to mature. The best-known maguey is the Blue Weber agave, because it’s used to make tequila. You read that right: tequila’s a type of mezcal. Excluding tequila, the majority of mezcal is made from the espadín agave plant because it’s high in sugar and matures quickly.

Mezcal Production

Mezcal’s distilled from the heart of the agave plant. The piña, as it’s known in Spanish, looks like an oversized pineapple. It can weigh up to 300 pounds, and harvesting and transporting it is backbreaking work. Once it’s pulled from the earth, the leaves are removed and the piña gets roasted to release its natural sugars.

Fermentation occurs when cooked agaves are mashed to a pulp and combined with water and yeast. After sitting for days, those sugars turn into alcohol. The liquid is then run through a still at least twice to refine it into a drinkable spirit.

There are industrial mezcal producers (often companies owned by Americans), but most mezcals are made in rural areas by Mexican families who don’t have access to expensive machinery. Farming, harvesting, and distillation processes have been passed down for generations, and distillers, known as mezcaleros, have found creative ways to fabricate the equipment they need for mezcal production. A hole in the ground can be an oven, if you put your back into it. Need portable fermentation tanks? Rawhide will do. Want a mezcal with a fuller mouthfeel? Make your still from clay pots. Where there’s a will, there’s a way.

Types of Mezcal

Not all mezcals have a smoky flavor. The smokiness comes from the traditional method of roasting magueys in a pit covered with firewood and rocks. However, some mezcals taste bright and delicate, while others are herbaceous and viscous. Flavor varies based on the varietal, age, and terroir of the agave plant, as well as the water and production process.

The most produced type of mezcal is the ensamble. This is when a mezcalero takes different species of agave and roasts, ferments, and distills them together to balance flavors. There are also mezcals made from a single type of agave, and blends that mix mezcals after distillation.

Then there’s pechuga. Loosely translated, this means “breast.” The term refers to the animal breast, usually chicken, that’s hung over the still to impart savory flavors. Pechuga is made in small batches for special occasions like festivals and weddings. If you can find one, it’ll cost you a pretty peso.

Once distilled, mezcal’s stored in food-grade plastic or glass, and ultimately bottled joven (unaged). When it comes to alcohol content, the range will be 40–55% ABV, a little stronger than your average vodka or scotch.

Reposado and añejo mezcals aren’t as common because barrel-aging imparts flavors that detract from the natural taste of the mezcal. Besides, outside of the state of Jalisco where tequila’s made, oak barrels aren’t readily available in Mexico. For the most part, only tequila gets that treatment.

Regulations and the Law

This wouldn’t be a proper article about booze if we didn’t address liquor laws and the quasi-illegal shenanigans surrounding their enforcement. The Mexican government relies on four “independent” companies to regulate mezcal. The Consejo Regulador del Mezcal (CRM) is the biggest and most influential.

Many traditional mezcaleros ignore the CRM’s regulations because they view it as a pay-to-play system, which they either can’t afford or don’t want to endorse. Additionally, the CRM limits mezcal’s Denominación de Origen (DO) to 10 of the 32 Mexican states. Oaxaca is probably the one you’ve heard of, but there’s also Durango, Guanajuato, Guerrero, Michoacán, Puebla, Tamaulipas, San Luis Potosí, Sinaloa, and Zacatecas. The result is that some of the most authentic, family-owned distilleries make “agave spirits” because their mezcal can’t legally be called mezcal.

Shopping Tips

My three rules when shopping for mezcal: (1) Celebrity brands overcharge and underdeliver; (2) glass is hard to come by in rural Mexico, so elaborate bottles aren’t good mezcal, they’re good marketing; (3) the label should be transparent about who, what, where, when, and how it was made. 

OJ Lima

Identification of patient-specific CD4+ and CD8+ T cell neoantigens through HLA-unbiased genetic screens

Science & Nature

Main

Cancer immunotherapies that aim to harness the antitumor activity of T cells have shown impressive clinical results in a subset of patients with cancer, and accumulating evidence suggests that the efficacy of these therapies is driven largely by T cells that recognize cancer neoantigens that result from patient-specific nonsynonymous tumor mutations1. Consequently, there is a strong interest in developing approaches to specifically boost the number or activity of neoantigen-reactive T cells in individual patients. However, identification of T cell-recognized neoantigens is challenging due to their patient-specific nature2. Previous antigen discovery methods have been limited by relying on the use of single or selected HLA alleles3,4,5,6,7 and are therefore not straightforwardly compatible with identifying T cell (neo)antigens across the complete HLA haplotypes of individual patients with cancer. Moreover, while CD4+ T cells have important roles in tumor control and response to immunotherapy8,9,10,11, previous methods have focused primarily on the identification of CD8+ T cell-recognized neoantigens. Thus, experimental tools are required to enable the routine and HLA-unbiased identification of CD4+ and CD8+ T cell-recognized neoantigens in individual patients.

Here, we present a high-throughput genetic system for the personalized identification of CD4+ and CD8+ T cell-recognized (neo)antigens (Fig. 1a). In this method, termed HANSolo (HLA-Agnostic Neoantigen Screening), patient-matched, Bcl-6/xL-immortalized B cell lines are engineered to express large libraries of minigenes that encode candidate T cell antigens. As the resulting B cells are fully MHC class I and class II proficient, this enables the unbiased screening of T cell specificities across the complete MHC class I and class II genotypes of individual patients using T cell pools as selective pressure. To this purpose, antigen library-expressing B cells are coincubated with patient T cell populations of interest (for example, tumor-infiltrating lymphocytes (TIL) or T cells engineered to express patient-derived T cell receptors (TCRs)12), and antigen hits are identified by next-generation sequencing to measure the depletion of those B cells that express T cell-recognized epitopes.
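The depletion readout described above can be illustrated with a minimal sketch that computes a per-minigene log2 fold change between a T cell-exposed sample and a mock sample. This is a simplified stand-in, not the authors' analysis: the figure legends state that P values came from the DESeq2 Wald test, whereas this sketch only normalizes read counts and carries no statistics. The function name, the pseudocount, the read counts and the cutoff of -1.0 are all hypothetical.

```python
import math

def log2_fold_changes(tcell_counts, mock_counts, pseudocount=1.0):
    """Per-minigene log2 fold change of library-size-normalized read counts."""
    total_t = sum(tcell_counts.values())
    total_m = sum(mock_counts.values())
    result = {}
    for minigene, mock in mock_counts.items():
        treated = tcell_counts.get(minigene, 0)
        # Scale to counts per million so the two samples are comparable.
        cpm_t = (treated + pseudocount) / total_t * 1e6
        cpm_m = (mock + pseudocount) / total_m * 1e6
        result[minigene] = math.log2(cpm_t / cpm_m)
    return result

# Hypothetical read counts, for illustration only.
mock = {"CDK4_R24L": 1000, "CDK4_WT": 1000, "other": 8000}
with_tcells = {"CDK4_R24L": 100, "CDK4_WT": 1000, "other": 8000}

lfc = log2_fold_changes(with_tcells, mock)
# Arbitrary illustrative cutoff: at least a twofold drop in abundance.
depleted = [m for m, v in lfc.items() if v < -1.0]
```

In this toy example only the mutant minigene falls below the cutoff, while the WT minigene rises slightly because the remaining reads make up a larger share of a smaller library.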

Fig. 1: Overview and validation of neoantigen discovery technology.

a, Schematic overview of the methodology. b, Antigen discovery screen of CD8+ TCR #53 T cells against immortalized HLA-A*02:01+ B cells transduced with a model antigen library of n = 4,764 minigenes. Dots represent individual minigenes. Fold change, defined as the relative abundance of minigenes in the presence of TCR #53 T cells compared with mock T cells, and mean normalized read counts are plotted for each individual minigene. Minigenes encoding the model CDK4 mutant and WT epitopes are highlighted. CDK4R24L minigenes: P = 9.4 × 10^−43, P = 1.5 × 10^−9. P values were generated using the DESeq2 Wald test (one-sided) and adjusted for multiple comparisons. c, Antigen screen using CD8+ T cells expressing either the DMF4 (left panel) or DMF5 (right panel) TCR against model library-expressing HLA-A*02:01+ B cells. Data are plotted as in b with fold change showing relative minigene abundance when exposed to either DMF4 or DMF5 TCR T cells as compared with mock T cells. DMF4 MART1-ELA minigenes: P = 1.9 × 10^−12, P = 1.7 × 10^−9; DMF5 MART1-ELA minigenes: P = 4.1 × 10^−14, P = 8.8 × 10^−7. P values were generated as in b. d, Antigen screen using CD4+ T cells expressing patient-derived TCRs specific for the MHC class II-restricted SNORD73AR>W and MANSC1D>H against patient-matched immortalized B cells transduced with the CD74 signal-fused model library. Data are plotted as in b with fold change showing relative minigene abundance in the presence of CD4+ SNORD73A TCR or MANSC1 TCR T cells relative to mock T cells. SNORD73AR>W: P = 2.1 × 10^−53, P = 2.5 × 10^−44; MANSC1D>H: P = 2.1 × 10^−26, P = 3.2 × 10^−8. P values were generated as in b.


To first evaluate the feasibility and sensitivity of our method, we took advantage of well-described HLA-A*02:01-restricted TCRs specific for either the CDK4R24L neoantigen (TCR #53)13 or for the melanocyte differentiation antigen-derived MART1(26–35) epitope (TCRs DMF4 and DMF5)14 (Supplementary Fig. 1). Activity of the CDK4R24L neoantigen-specific TCR should result in strong depletion of B cells expressing the mutant, but not the wild-type (WT), CDK4 sequence. TCR DMF4 has an affinity towards the MART1 self antigen that is around fivefold lower as compared with the DMF5 TCR14,15, providing a means to assess the sensitivity of the method in the context of weak T cell–target cell interactions. Furthermore, the use of the parental MART1 epitope as well as a previously identified variant with increased affinity for MHC-I16 (here referred to as MART1-ELA) should allow one to determine whether the level of epitope presentation can be gauged from screening data. To provide a first proof of concept, we designed a model antigen library with a complexity (4,764 minigenes) that would be sufficient to enable the screening of the entire mutational repertoire of human tumors with the highest mutational burden, such as melanomas, lung tumors and microsatellite-instable tumors17. Individual MHC class I-restricted antigens, including the CDK4R24L and MART1 antigens and immunodominant epitopes of EBV, CMV and influenza, as well as MHC class II-restricted neoantigens (Supplementary Table 1) were expressed as minigenes, each coupled to two unique barcode identifiers to provide internal replicate measurements. Subsequently, HLA-A*02:01-positive immortalized B cells were created and modified to express the epitope library.

Following optimization of conditions to ensure maximal sensitivity of antigen screens (Supplementary Fig. 2), screening of this proof-of-concept library with T cells expressing the CDK4R24L-specific TCR resulted in clear depletion of CDK4R24L-expressing B cells, but crucially not B cells expressing the WT CDK4 minigene (Fig. 1b). Furthermore, B cells expressing the MART1-ELA epitope showed substantial depletion after exposure to T cells transduced with the MART1-specific DMF4 or DMF5 TCRs (Fig. 1c). Notably, the level of depletion mediated by the low affinity DMF4 TCR was comparable with that of the DMF5 TCR. Moreover, when using the high affinity 1D3 TCR18, depletion was observed for both MART1 epitopes but was substantially stronger for the MART1-ELA epitope (Supplementary Figs. 1 and 3). Next, to test whether this system allows the profiling of the antigen-specificities of T cell populations in which T cells specific for a given antigen make up only a minority of the total T cell pool (such as patient TIL cultures, or donor T cells expressing libraries of patient-derived TCRs), we mixed T cells expressing either the DMF4, DMF5 or 1D3 TCR with mock-transduced T cells, such that MART1-specific T cells represented 10%, 1%, 0.3% or 0.1% of total T cells. Analysis of epitope abundance after exposure to these different T cell populations demonstrated that the MART1-ELA epitope was robustly identified when cognate TCR-expressing T cells comprised as little as 0.1–0.3% of all T cells (Supplementary Fig. 3). Depletion of the native MART1 epitope was detected only when using the high affinity 1D3 TCR. Together, these data demonstrate that our genetic screening methodology allows the efficient discovery of MHC class I-restricted T cell (neo)antigens from large antigen libraries. Furthermore, the technology allows one to distinguish high and low avidity TCR-pMHC interactions and genetic screens may be performed with clonally diverse T cell populations.

A substantial fraction of T cell-recognized cancer neoantigens is restricted by MHC class II molecules, and CD4+ T cells recognizing such MHC class II-restricted neoantigens contribute to tumor control8,9,10,11. To test the suitability of HANSolo for the discovery of MHC class II-restricted neoantigens, we explored a previously established engineering method that routes individual minigene products through both the MHC class I and class II presentation pathways. In line with expectations, fusion of neoantigen-encoding minigenes to the sorting signal of the invariant chain (CD74) resulted in robust activation of both CD4+ and CD8+ neoantigen-specific T cells (Supplementary Fig. 4), and this universal antigen expression system was therefore selected for further use. We next took advantage of two MHC class II-restricted neoantigen-specific TCRs that were isolated from tumor-infiltrating T cells of a melanoma patient (Supplementary Fig. 4), transduced both TCRs into donor CD4+ T cells and expressed the model antigen library in patient-matched immortalized B cells. Screening of library-expressing B cells with T cells expressing either MHC class II-restricted TCR resulted in the notable depletion of B cells that expressed the cognate neoantigen, but not its WT counterpart (Fig. 1d). Furthermore, the use of CD4+ T cell populations in which T cells expressing either of the MHC-II-restricted neoantigen-specific TCRs were present at low frequency demonstrated clear depletion of the relevant neoantigens at antigen-specific CD4+ T cell frequencies as low as 0.3–1% (Supplementary Fig. 5).

As compared with previously developed genetic screening technologies, HANSolo has the advantage of allowing the identification of T cell epitopes restricted by any of the class I or II alleles of an individual patient. To demonstrate the utility of such unbiased screening, we first focused on analysis of neoantigen reactivity among intratumoral T cells in a patient with metastatic melanoma (patient NKIRTIL063). CD4+ and CD8+ T cell cultures were generated by in vitro expansion of TIL, and both resulting T cell populations possessed cytotoxic potential, as measured by degranulation potential upon polyclonal stimulation (Supplementary Fig. 6). In parallel, nonsynonymous mutations in protein-coding genes were identified by exome and RNA sequencing, yielding 685 nonsynonymous expressed tumor variants, and a library of 2,762 minigenes that encoded all identified tumor mutations, as well as their corresponding WT sequences, was generated and expressed in autologous immortalized B cells. Screening of this patient mutanome library with in vitro-expanded CD8+ TIL revealed TIL reactivity towards four neoantigens (Fig. 2a). Importantly, no reactivity against the corresponding WT minigenes in the library was detected. Furthermore, screening the same neoantigen library with CD4+ TIL yielded reactivity against six neoantigens (Fig. 2b). Both minigenes encoding the tumor variant MYLKD>N showed reproducible low-level depletion after coculture with CD4+ TIL, and this variant was therefore considered a putative screen hit. Recognition of screen-identified neoantigens, but not WT counterpart sequences, was subsequently validated upon expression of the individual sequences in patient B cells, resulting in confirmed CD4+ and CD8+ TIL reactivity towards 10 out of 11 identified screen hits (Fig. 2c,d and Supplementary Fig. 6). Notably, three neoantigens—GFPT2A>V, TNFAIP2P>A and CCSER2P>L—were recognized by both CD8+ and CD4+ TIL of this patient.

Fig. 2: Personalized and HLA-agnostic neoantigen screening of patient-derived CD4+ and CD8+ T cells.

a,b, Nonsynonymous tumor mutations of patient NKIRTIL063 were identified by exome and RNA sequencing and used to design a personalized mutanome minigene library consisting of n = 2,762 unique minigenes. Patient B cells were immortalized, transduced with the mutanome library and screened with in vitro-expanded tumor-infiltrating CD8+ (a) and CD4+ (b) T cells. Fold change represents relative minigene abundance in cultures with or without patient T cells. Screen hits were defined as outlined in Methods and are marked by colored dots. c,d, Validation of neoantigen hits identified in a and b by incubating patient CD8+ (c) or CD4+ (d) T cells with autologous B cells expressing either neoantigen hits (mut) or respective WT sequences as single minigenes. T cell activation was assessed by measuring IFNγ levels in supernatants. Dots represent technical replicates. e, NKIRTIL063 CD8+ and CD4+ T cells were incubated with patient B cells expressing indicated TMG constructs, followed by measuring IFNγ concentrations in culture supernatants. Asterisks indicate TMG constructs that encode a neoantigen identified using the antigen screens in a–d. Dots represent technical replicates. f, Summary of NKIRTIL063 neoantigens identified using the HANSolo screens and TMG approach. g,h, Patient NKIRTIL027 immortalized B cells were transduced with the patient mutanome library (n = 2,586 minigenes) and screened using in vitro-expanded NKIRTIL027 CD8+ (g) and CD4+ (h) tumor-infiltrating T cells. Fold change depicts relative minigene abundance in cultures with or without patient T cells. Screen hits are marked by colored dots. i,j, NKIRTIL027 CD8+ TIL screen hits were validated by incubating patient T cells with matched B cells expressing the single mutant or corresponding WT sequences, and measuring IFNγ levels in supernatants (i) or killing of transduced B cells after exposure to patient T cells at indicated effector:target (E:T) ratios (j). Dots represent technical replicates. N/A, not available. k, The NKIRTIL027 CD4+ T cell screen hit was validated as in i. l,m, Neoantigen specificities of patient ITO34 CD8+ TIL were screened against the patient mutanome library (n = 952 minigenes) and validated as in g and i.


To assess the sensitivity of our method in comparison with other available neoantigen discovery methods, we next analyzed neoantigen reactivity among CD4+ and CD8+ TIL of patient NKIRTIL063 using the previously established tandem minigene (TMG) approach4,19, in which generally ten minigenes are concatenated and expressed as a single transgene in separate pools of antigen-presenting cells. To screen neoantigen specificities of patient NKIRTIL063 CD4+ and CD8+ TIL using TMGs within a reasonable timeframe, 200 of the 685 mutations were selected on the basis of expression level and mutation clonality and used to generate 20 pools of patient B cells. Incubation of these cell pools with either CD4+ or CD8+ TIL of patient NKIRTIL063 revealed notable reactivity of CD4+ TIL to three TMGs (#6, #9 and #13) and reactivity of CD8+ TIL to four TMGs (#8, #11, #15 and #16) (Fig. 2e). Recognition of six out of seven TMGs was mediated by neoantigens identified before using the genetic library screens, as demonstrated by a subsequent deconvolution step (Supplementary Fig. 6). TMG#9 did not encode a neoantigen hit from our screens but did elicit low-level reactivity of CD4+ TIL. Conversely, the TMG screen failed to identify four CD4+ TIL-recognized neoantigens that were detected using the HANSolo screens (Fig. 2f), demonstrating the potential of the method to mine patient neoantigens with increased depth compared with existing methodologies.
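The pooling arithmetic behind the TMG comparison can be sketched as follows, assuming each TMG holds exactly ten minigenes so that 200 selected mutations give 20 constructs. The mutation identifiers and the helper name `build_tmg_pools` are hypothetical, and the consecutive grouping is an assumption made for illustration; the source does not say how mutations were assigned to TMGs.

```python
def build_tmg_pools(mutations, per_tmg=10):
    """Split an ordered list of mutation IDs into consecutive TMG-sized groups."""
    return [mutations[i:i + per_tmg] for i in range(0, len(mutations), per_tmg)]

# Hypothetical identifiers standing in for the 200 selected variants.
selected = [f"mut_{i:03d}" for i in range(1, 201)]
pools = build_tmg_pools(selected)

# Map each mutation back to its 1-based TMG number. A reactive TMG only
# narrows the search to its ten members, which is why a separate
# deconvolution step is needed to pinpoint the recognized neoantigen.
tmg_of = {mut: n for n, pool in enumerate(pools, start=1) for mut in pool}

# Share of the 685 expressed variants that the TMG screen covers.
coverage = len(selected) / 685
```

Under these assumptions the TMG screen interrogates about 29% of the expressed variants (200 of 685), whereas the minigene library encodes all of them.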

Next, to assess the value of the developed system for the routine discovery of neoantigens across patients with cancer, we mapped neoantigen specificities in three additional patient samples. Tumor mutations were identified in an additional melanoma tumor (NKIRTIL027; 660 nonsynonymous expressed mutations) and used to construct a patient mutanome library of 2,562 minigenes. Screening the neoantigen specificities of CD4+ and CD8+ TIL resulted in six putative CD8+ TIL-recognized neoantigens (Fig. 2g) and one neoantigen recognized by CD4+ TIL (Fig. 2h), and recognition of these epitopes was confirmed for five out of seven neoantigens (Fig. 2i–k and Supplementary Fig. 7). In addition, as observed in genetic screens using model antigens and TCRs, the level of epitope depletion in this patient screen correlated with the capacity of patient T cells to produce interferon gamma (IFNγ) in response to minigene-expressing B cells and kill such cells (Fig. 2i,j). We next analyzed neoantigen specificities of intratumoral CD4+ and CD8+ T cells in a non-small cell lung tumor (patient ITO34; 231 mutations), resulting in the detection of CD8+ TIL reactivity against one neoantigen (Fig. 2l,m and Supplementary Fig. 8). Recently, strategies that enrich T cell populations for tumor-specific T cells by culture with patient tumor organoids20,21 or antigen-expressing APCs22 have been reported. To assess whether such strategies may complement our methodology, for instance, in settings where fresh tumor material for the generation of TIL cultures is unavailable, we applied our screening method to a microsatellite-instable colorectal tumor (ITO66; 1,834 mutations). For this purpose, the patient mutanome was screened using a CD8+ T cell product that was generated by ex vivo culture of patient peripheral blood mononuclear cells (PBMCs) with matched tumor organoids, resulting in the identification of two CD8+ T cell-recognized neoantigens (Supplementary Fig. 9). Thus, the use of our screening methodology enabled the successful identification of patient neoantigens in all four tested patients.

Collectively, these data demonstrate the feasibility of personalized and HLA-agnostic discovery of CD4+ and CD8+ T cell neoantigens from large genetic libraries. Benchmarking against the existing TMG method demonstrated enhanced sensitivity of our approach, in particular for the discovery of CD4+ T cell-recognized neoantigens, while enabling substantially improved throughput. From a translational perspective, identified neoantigens may be used to select TCRs for use in next-generation TCR gene therapies or may be utilized in patient-specific cancer vaccines22,23,24,25,26. Of note, state-of-the-art algorithms that predict the immunogenicity of tumor mutations for use in personalized neoantigen vaccines ranked only 3 out of all 14 identified patient neoantigens as actionable vaccination targets (Supplementary Table 2), underlining the value of approaches that allow the unbiased and functional identification of patient neoantigens. With the current next-generation sequencing and DNA synthesis technologies and dedicated screening workflows, our system enables patient neoantigen discovery within 10 weeks (Supplementary Fig. 10), a timespan that is compatible with the production of personalized immunotherapies24.

Methods

Antibodies

The following antibodies were used for flow cytometry: CD3-PerCP-Cy5.5 (clone SK7; eBioscience; used 1:20); CD4-FITC (clone RPA-T4; BD Biosciences; used 1:20), CD4-APC (clone RPA-T4; BD Biosciences; used 1:30), CD4-BV421 (clone SK3, Biolegend; used 1:100), CD8-BV421 (clone RPA-T8; BD Biosciences; used 1:50), CD14-APC-H7 (clone MoP9, BD Biosciences; used 1:100), CD16-APC-H7 (clone 3G8, BD Biosciences; used 1:100), CD19-FITC (clone 4G7, BD Biosciences; used 1:30), CD137-BV421 (clone 4B4-1; Biolegend; used 1:200), CD137-APC (clone 4B4-1; BD Biosciences; used 1:30), OX40-PE-Cy7 (clone Ber-ACT35, Biolegend), CD107-PE (clone H4A3, BD Biosciences; used 1:150) and PE-conjugated anti-mouse TCRβ constant domain (clone H57-597; BD Biosciences; used 1:150). The viability stain IR-Dye (Thermo Fisher, used 1:2,000) was used to identify live cells.

Generation of patient T cell products, Bcl-6/Bcl-xL-immortalized B cells and tumor organoids

Tumor tissue and PBMCs were collected from patients treated at the Netherlands Cancer Institute—Antoni van Leeuwenhoek Hospital (NKI-AVL) with written informed consent and in accordance with guidelines of the Medical Ethical Committee. The study protocol was approved by the Medical Ethical Committee of the NKI-AVL. Fresh tumor tissue obtained by surgical resection was mechanically disrupted and digested overnight in RPMI 1640 medium (Life Technologies) supplemented with 1 mg ml−1 collagenase type IV (BD Biosciences), penicillin-streptomycin (Roche) and 0.01 mg ml−1 pulmozyme (Roche).

For patients NKIRTIL027, NKIRTIL063 and ITO34, TIL cultures were generated by culturing tumor digest suspensions in T cell medium (RPMI 1640 medium supplemented with 10% human AB serum (Life Technologies), penicillin-streptomycin, l-glutamine (Life Technologies)), supplemented with 6,000 U ml−1 IL-2 (Proleukin, Novartis) for 2–4 weeks. Obtained TIL cultures were subsequently stained with IR-Dye and antibodies against CD3, CD4 and CD8, and single CD3+CD4+ and CD3+CD8+ T cells were sorted using a FACSAria Fusion cell sorter (BD Biosciences). Isolated CD4+ and CD8+ T cells were expanded using the rapid expansion protocol (REP), using 30 ng ml−1 anti-CD3 antibody (clone OKT-3; eBioscience) and 3,000 U ml−1 IL-2 in a 1:1 mixture of RPMI 1640 and AIM-V medium (Gibco) supplemented with 5% human AB serum, in the presence of irradiated (40 Gy) allogeneic PBMCs (200:1 feeder/T cell ratio). After 7 days of REP culture, cultures were refreshed with medium and IL-2 every 2 days. Purity of the resultant CD4+ and CD8+ T cell populations was confirmed by flow cytometry at day 14 after start of REP (routinely >99%), and cells were subsequently either used directly in antigen discovery screens or cryopreserved in liquid nitrogen. Data from flow cytometry experiments were acquired using FACSDiva software and analyzed using FlowJo (BD Biosciences).

Immortalized patient B cell lines were generated by retroviral transduction with Bcl-6/Bcl-xL27. Patient PBMCs were isolated from peripheral blood by Ficoll-Paque density gradient separation and stained with IR-Dye and antibodies against CD3, CD14, CD16 and CD19. Single IR-Dye−CD3−CD14−CD16−CD19+ cells were sorted using a FACSAria Fusion cell sorter and stimulated for 36 h with irradiated (55 Gy) CD40L+ mouse L cells in B cell medium (IMDM medium (Gibco) supplemented with penicillin-streptomycin, 10% heat-inactivated fetal bovine serum (Sigma-Aldrich) and 50 ng ml−1 IL-21 (Peprotech)), followed by retroviral transduction of Bcl-6 and Bcl-xL. The Bcl-6/Bcl-xL-encoding vector also encodes GFP to allow evaluation of transduction efficiency. Bcl-6/Bcl-xL-immortalized (GFP+) B cells were cultured in B cell medium and were stimulated every week by addition of irradiated CD40L+ L cells. Medium and IL-21 were refreshed every 3–4 days.

For patient ITO66, tumor organoids were established20,21. Tumor-reactive patient T cells were generated by coculturing PBMCs and tumor organoids as follows. Following incubation with 200 ng ml−1 IFNγ (Peprotech) for 24 h, tumor organoids were dissociated into single-cell suspensions using TrypLE Express (Gibco). Tumor organoid cells were mixed with patient PBMCs (20:1 PBMC/tumor cell ratio) and 1 × 10^5 PBMCs were seeded in each well of a U-bottom 96-well plate precoated with 5 μg ml−1 anti-CD28 antibody (clone CD28.2; eBioscience). Coculture medium consisted of T cell medium supplemented with 150 U ml−1 IL-2 and 20 µg ml−1 anti-PD1 blocking antibody (clone 5C4; kindly provided by Merus). Coculture medium was refreshed every 2–3 days. PBMCs were harvested and restimulated every 7 days by replating with fresh tumor organoid cells.

Retroviral transduction of TCRs

Codon-optimized TCR α and β variable sequences (encompassing V-CDR3-J domains) of selected TCRs were gene-synthesized (Twist Biosciences) and subcloned into a modified pMP71 retroviral vector12. This vector contains mouse TCR constant regions to reduce mispairing of introduced and endogenous TCR chains, as well as the puromycin N-acetyltransferase resistance gene. Retrovirus was produced by transfecting FLY-RD18 packaging cells with pMP71-TCR plasmid DNA using Xtremegene 9 transfection reagent (Roche). In parallel, healthy donor PBMCs (Sanquin Blood Bank) were separated into CD8+ and CD8− cells (for transduction with MHC class I- and MHC class II-restricted TCRs, respectively) using the CD8+ T Cell Isolation Kit (Miltenyi Biotec). Isolated cell fractions were stimulated with CD3/CD28 Dynabeads (Life Technologies) in T cell medium with 150 U ml−1 IL-2. After 48 h, retroviral supernatants were collected and used to infect prestimulated CD8−/CD8+ PBMCs by spinoculation (2,000 × g for 90 min) in Retronectin (Takara)-coated plates. Transduction efficiency was measured 72 h later by staining with an anti-mouse TCRβ constant domain antibody and analysis by flow cytometry. TCR-transduced T cells were then selected with 2.5 µg ml−1 puromycin (Gibco) for 48 h and received fresh medium and IL-2 every 3–4 days. After 12–14 days of culture, transduced T cells were expanded using the REP as described above.

T cell activation assays

Reactivity of TCR-transduced donor T cells was determined by coincubating T cells and target cells for 18–24 h in U-bottom 96-well plates (1:1 T cell/target cell ratio) in T cell medium. Incubation of T cells without target cells, and in the presence of 50 ng ml−1 phorbol 12-myristate 13-acetate (Sigma-Aldrich) and 1 µg ml−1 ionomycin (Sigma-Aldrich) served as negative and positive controls, respectively. Following incubation, cells were stained with IR-Dye and antibodies against CD3, CD4, CD8 and the activation markers CD137 or OX40 and analyzed by flow cytometry. When T cell reactivity towards tumor organoids was tested, IFNγ-pretreated organoids were incubated with T cells in the presence of 20 µg ml−1 anti-PD1 blocking antibody (Merus) in anti-CD28 antibody precoated plates.

The cytotoxic capacity of T cells was assessed by coincubating T cells and target cells for 72 h in 96-well plates at a T cell/target cell ratio of 5:1, unless indicated otherwise. Target cells cultured in the absence of T cells served as negative control. Following incubation, 7.46 µm AccuCount blank counting beads (Spherotech) were added to individual cultures to enable quantification of remaining live target cells. Cells were subsequently harvested, stained with 4′,6-diamidino-2-phenylindole (DAPI) and anti-CD3 antibody, and measured by flow cytometry. When cytotoxicity against tumor organoids was assessed, IFNγ-pretreated organoids were incubated with T cells in the presence of 20 µg ml−1 anti-PD1 blocking antibody (Merus) and 10 µM Y-27632 in anti-CD28 antibody precoated 96-well plates. Where indicated, target cells were incubated with 50 µg ml−1 MHC class I blocking antibody (clone W6/32) for 30 min at 37 °C before incubation with T cells. Data from functional T cell assays were analyzed using GraphPad Prism v.9.

Exome and RNA sequencing

Tumor genomic DNA and RNA were extracted from formalin-fixed paraffin-embedded tumor material using the AllPrep DNA/RNA kit (Qiagen). For patient ITO66, genomic DNA and RNA were isolated from tumor organoids. Genomic DNA of patient PBMCs was extracted using the DNeasy Blood & Tissue kit (Qiagen). Exome enrichment was performed using the SureSelect XT2 Human All Exon V6 kit (Agilent) and strand-specific libraries were generated using the TruSeq Stranded mRNA sample preparation kit (Illumina) according to the manufacturer’s instructions. Resulting libraries were sequenced on HiSeq 2500 or NovaSeq 6000 DNA analyzers (Illumina). Whole-exome and RNA sequencing data were processed using bcbio-nextgen. Briefly, DNA reads were mapped against GRCh38 using Burrows–Wheeler aligner (BWA), duplicates were marked with Picard MarkDuplicates and low complexity regions were excluded. Somatic and germline mutations were identified using Mutect2 and HaplotypeCaller, respectively, followed by annotation by SnpSift. RNA reads were quality filtered and mapped with STAR or TopHat2, transcript-level expression was quantified by Salmon and gene fusions were determined by Arriba12,21.

Antigen library design

To design the model antigen library used to validate the screening system, protein sequences of genes encoding known human nonmutated cancer regression antigens, as well as selected immunodominant epitope-encoding genes of Epstein-Barr virus, cytomegalovirus and influenza, were collected from the Uniprot database (https://www.uniprot.org/) (Supplementary Table 1). Protein sequences were reverse-translated and codon-optimized, and resulting nucleotide sequences were segmented into 93 nucleotide (nt) minigenes with 45 nt overlap between neighboring minigenes. In addition, a set of previously characterized neoantigens was included, all encoded by 93 nt minigenes in which the mutant codon was flanked on either side by 45 nt of the relevant nonmutant gene sequence. Minigene sequences encoding the corresponding nonmutated peptides were included for each model neoantigen. A stop codon was added directly following each minigene sequence, and internal BbsI recognition sites were removed without altering the encoded peptide sequences. Each 93 nt sequence was duplicated for a total of 4,764 sequences, and a unique 12 nt barcode sequence was incorporated into each minigene sequence following the stop codon. The resulting sequences were flanked by sequences to enable PCR amplification and subcloning using BbsI (New England Biolabs) into a pMSCV retroviral vector that also encodes the puromycin N-acetyltransferase resistance gene and mCherry (pMSCV-puroR-mCherry).
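The minigene layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' design script: the function names, the choice to anchor a final window at the 3′ end, and the requirement that the input be an in-frame coding sequence are all assumptions. It shows why the numbers fit together: a 93 nt window with 45 nt overlap advances by 48 nt (a multiple of three, so tiles stay in frame), and a mutant codon with 45 nt on either side gives 45 + 3 + 45 = 93 nt.

```python
def tile_minigenes(seq: str, length: int = 93, overlap: int = 45) -> list[str]:
    """Split an in-frame coding sequence into overlapping fixed-length minigenes."""
    step = length - overlap  # 48 nt, a multiple of 3, so every tile stays in frame
    if len(seq) <= length:
        return [seq]
    starts = list(range(0, len(seq) - length + 1, step))
    # Assumption: if stepping leaves a 3' remainder, add one window anchored at the end.
    if starts[-1] + length < len(seq):
        starts.append(len(seq) - length)
    return [seq[s:s + length] for s in starts]


def snv_minigene(cds: str, codon_index: int, mutant_codon: str, flank: int = 45) -> str:
    """Return the mutant codon flanked on both sides by `flank` nt of wild-type sequence."""
    start = codon_index * 3
    lo, hi = start - flank, start + 3 + flank
    if lo < 0 or hi > len(cds):
        # Variants near the ends of the coding sequence need separate handling.
        raise ValueError("not enough flanking sequence around this codon")
    mutated = cds[:start] + mutant_codon + cds[start + 3:]
    return mutated[lo:hi]
```

Barcode addition, stop codons and BbsI site removal are deliberately left out of this sketch.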

To design NKIRTIL027 and NKIRTIL063 patient mutanome libraries, all single nucleotide variants (SNVs) and frameshifting indels with confirmed RNA expression within tumor cells were encoded as 93 nt minigenes. RNA sequencing data of tumor ITO34 were unavailable, and the library was designed without taking RNA expression of tumor variants into account. For SNVs, minigenes were designed that encoded peptides in which the mutant codon was flanked on either side by 45 nt of the relevant nonmutant gene sequence. In the case of frameshifting indels, or when SNVs resulted in loss of a stop codon, the newly formed open reading frame was segmented into 93 nt minigenes with 45 nt overlap between adjacent minigenes. Minigenes encoding corresponding WT sequences were included for all tumor variant minigenes. Minigenes encoding the MART1(26–35) and CDK4R24L epitopes were included in all libraries as internal controls. Internal BbsI recognition sites were removed without altering encoded peptide sequences, and minigenes were flanked by sequences for PCR amplification and subcloning as described above. For patient ITO66, the mutanome library was designed to encode tumor variants as 63 nt minigenes, and no corresponding WT minigenes were included. All minigene libraries were synthesized by Twist Biosciences.

Generation of a universal antigen expression vector

To establish a library expression system that enables the concurrent processing and presentation of minigene products through both the MHC class I and class II pathways, constructs were designed in which a TMG encoding two previously identified neoantigens recognized by either CD4+ or CD8+ TIL of patient NKIRTIL027 (LEMD2P>L (ref. 28) and TTC37A>V (unpublished data), respectively) was either fused or not fused to the signal sequence of CD74 (Supplementary Fig. 4)29. Codon-optimized constructs were synthesized (Twist Biosciences) and subcloned into the retroviral pMSCV-puroR-mCherry vector. NKIRTIL027 immortalized B cells were transduced with TMG constructs, selected to over 90% purity (by measuring mCherry expression) with 5 μg ml−1 puromycin and incubated with NKIRTIL027 CD4+ or CD8+ TIL at a ratio of 1:1 for 48 h in T cell medium with 30 U ml−1 IL-2. T cell activation was subsequently assessed by measuring IFNγ levels in the culture supernatant using the Cytometric Bead Array kit (BD Biosciences), following the manufacturer’s instructions.

Library cloning and transduction

Oligonucleotide libraries were amplified by 12 cycles of PCR using Phusion High-Fidelity DNA Polymerase (New England Biolabs) and primers Preamp Forward (5′-ACTGTCAGAAGACTGCAAGC-3′) and Preamp Reverse (5′-TGACAGCGAAGACCATAGTG-3′). For first proof-of-concept screening experiments using MHC class I-restricted TCRs, the amplified model antigen library was cloned by Golden Gate assembly using BbsI into the pMSCV-puroR-mCherry retroviral vector. For all other screens, amplified libraries were cloned into the pMSCV-puroR-mCherry vector modified to include the sorting sequence of CD74. Subcloned libraries were amplified using Endura electrocompetent cells (Lucigen) and library DNA was extracted using the PureLink HiPure Maxiprep kit (Invitrogen). During all cloning steps, a library representation of at least 100× was maintained.

Libraries were retrovirally transduced in duplicate into immortalized B cell lines, as described above. To ensure single retroviral integrations, B cells were transduced at an infection rate of less than 10%. One day after transduction, B cells were transferred to B cell medium in the presence of irradiated CD40L+ L cells. Transduction efficiency was assessed 3 days post-transduction by measuring mCherry expression by flow cytometry, followed by selection with 5 µg ml−1 puromycin for 2 days and expansion of the B cell cultures until used in screens.
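The reasoning behind a low infection rate can be illustrated with a short calculation. The sketch below assumes that integrations per cell follow a Poisson distribution, which is a common modeling assumption and is not stated in the text; the function name is hypothetical. Under that model, the infected fraction f fixes the mean number of integrations m = −ln(1 − f), and the share of transduced cells carrying two or more integrations follows directly.

```python
import math


def multi_integration_fraction(infected_fraction: float) -> float:
    """Fraction of transduced cells with >= 2 integrations, assuming Poisson statistics."""
    if not 0.0 < infected_fraction < 1.0:
        raise ValueError("infected_fraction must lie strictly between 0 and 1")
    mean_integrations = -math.log(1.0 - infected_fraction)  # from P(0) = exp(-m)
    p_zero = 1.0 - infected_fraction
    p_one = mean_integrations * p_zero                      # P(1) = m * exp(-m)
    return (1.0 - p_zero - p_one) / infected_fraction       # P(>=2 | >=1)


if __name__ == "__main__":
    for f in (0.05, 0.10, 0.30, 0.50):
        print(f"{f:.0%} infected -> {multi_integration_fraction(f):.1%} of transduced cells with >1 integration")
```

At a 10% infection rate the model puts roughly 5% of transduced cells in the multiple-integration class, and that share grows quickly as the infection rate rises.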

Antigen discovery screens

For proof-of-concept screens using MHC class I-restricted TCRs, the antigen library encoding known cancer regression antigens was transduced into a previously immortalized HLA-A*02:01+ patient B cell line (OVC21)12. Library-expressing B cells were coincubated in duplicate with donor CD8+ T cells transduced with the CDK4R24L-specific TCR #53 or MART26–35-specific TCRs DMF4, DMF5 or 1D3 (all HLA-A*02:01-restricted) in T cell medium with 25 U ml−1 IL-2 at a T cell:B cell ratio of 5:1 and at a density of 2 × 10^6 total cells cm−2. Cultures were resuspended on days 1 and 2 of the experiment. For screens using patient-derived MHC class II-restricted neoantigen-specific TCRs, the model library was transduced into patient-matched immortalized B cells (patient NKIRTIL017), and library-expressing B cells were cocultured with donor CD4+ T cells transduced with either the MANSC1D>H– or SNORD73AR>W-specific TCR as described above. To simulate screening conditions using clonally diverse T cell populations, TCR-expressing T cells were mixed with donor-matched mock-transduced T cells at indicated ratios. Library coverage of at least 300× was maintained in all experiments. After 72 h of coincubation, cells were washed in PBS, and cell debris was removed by either Ficoll-Paque density gradient separation or using the Dead Cell Removal kit (Miltenyi Biotec). Isolated cells were subsequently resuspended in DirectPCR Lysis Reagent (Viagen) containing 500 µg ml−1 proteinase K and lysed by incubation at 55 °C for 60 min, 85 °C for 30 min and 94 °C for 5 min. Minigene sequences were then amplified by PCR using NEBNext Ultra II Q5 Master Mix (New England Biolabs), using the following primers:

Prep-I Forward (for screens with MHC class I-restricted TCRs):

5′-CAAGCAGAAGACGGCATACGATGGAGGAGAACCCTGGACCTACAAGC-3′

Prep-II Forward (for all other screens):

5′-CAAGCAGAAGACGGCATACGACCTGCGGATGAAGCTGCCCG-3′

Prep Reverse:

5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNGATCCGACTCGGTGCCACTTTTTCAAC-3′

The 7-nt stretch of N nucleotides indicates a unique barcode sequence used to enable the multiplexed preparation of sequencing libraries. Following PCR, samples were pooled equimolarly and run on a 1% agarose gel to separate minigene amplicons from potential primer dimers. Minigene amplicons were extracted from gel using the Monarch DNA Gel Extraction Kit (New England Biolabs) and deep sequenced on an Illumina HiSeq 2500 Sequencing system (single read 65 bp). Sequencing data were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive under accession code PRJNA884260 (ref. 30).

For patient neoantigen screens, mutanome libraries were transduced into autologous immortalized B cells, followed by selection with puromycin. The cytotoxic potential of expanded patient TIL was confirmed before neoantigen screens by measuring their capacity to degranulate. To this end, CD4+ and CD8+ TIL were polyclonally stimulated using CD3/CD28 Dynabeads in T cell medium in the presence of Golgistop (BD Biosciences) and an antibody against CD107 for 12 h. Following incubation, cells were stained with IR-Dye and an anti-CD4 antibody and analyzed by flow cytometry. Neoantigen screens were subsequently performed by incubating library-expressing B cells in duplicate with patient T cells at a T cell:B cell ratio of 5:1. Library-transduced B cells cultured in the absence of patient T cells served as a negative control. After 72 h of coincubation, cells were processed as described above.

Sequence analysis

Sequence quality was profiled with FastQC, and reads were demultiplexed using fastq-multx (ea-utils) with one mismatch allowed. Vector sequences were trimmed from sequence reads using fastq-mcf (ea-utils) against the UniVec database, and samples were subsequently quality filtered using cutadapt. The unique 12 nt barcodes that were added to individual minigene sequences were extracted using seqkit and mapped using Bowtie2 with no multimatched hits allowed. For the ITO66 neoantigen screen, high-quality reads were mapped against the full minigene sequences of the patient library using BBMap with ambiguously mapped reads removed and only perfect mappings allowed. Per-sample count tables were normalized and compared using DESeq2. Minigenes with an average abundance below the fourth percentile and a coefficient of variation greater than one across the two internal replicates were removed from analyses. Statistical testing was performed using the DESeq2 Wald test and a log-fold change cut-off of 0.25. Tumor variants were defined as screen hits when at least one of the duplicate mutant sequences, but neither of the corresponding WT-encoding minigenes, had a false discovery rate-corrected P value less than 0.2 and a log2 fold change of less than −0.5. All data analysis was performed using R and visualized using the ggplot2 package.
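The hit definition in the last step can be written as a small decision rule. The sketch below is an illustration in Python rather than the R code used for the analysis; the record layout and the names `BarcodeResult`, `is_depleted` and `call_hits` are hypothetical, and the input is assumed to be one row per barcoded minigene with an FDR-corrected P value and log2 fold change already computed upstream.

```python
from collections import defaultdict
from typing import NamedTuple


class BarcodeResult(NamedTuple):
    variant: str   # tumor variant identifier (hypothetical naming)
    allele: str    # "mut" or "wt"
    padj: float    # FDR-corrected P value
    log2fc: float  # log2 fold change, T cell-exposed versus control


def is_depleted(r: BarcodeResult, padj_max: float = 0.2, lfc_max: float = -0.5) -> bool:
    """A barcoded minigene counts as depleted when both thresholds are met."""
    return r.padj < padj_max and r.log2fc < lfc_max


def call_hits(results: list[BarcodeResult]) -> list[str]:
    """Variants with >= 1 depleted mutant barcode and no depleted WT barcode."""
    flags: dict[str, dict[str, list[bool]]] = defaultdict(lambda: {"mut": [], "wt": []})
    for r in results:
        flags[r.variant][r.allele].append(is_depleted(r))
    return sorted(v for v, f in flags.items() if any(f["mut"]) and not any(f["wt"]))
```

Requiring the WT barcodes to stay above threshold is what separates mutation-specific depletion from loss of a minigene for unrelated reasons.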

To validate patient neoantigens identified in screens, minigenes encoding the screen hits, as well as their WT counterparts, were synthesized as individual gBlocks (IDT), cloned into the CD74 signal sequence-modified pMSCV-puroR-mCherry vector and transduced into immortalized patient B cells. Following selection with puromycin, minigene-transduced B cells were cocultured with expanded patient CD4+ or CD8+ TIL for 48 h in T cell medium with 30 U ml−1 IL-2 and T cell activation was assessed by measuring IFNγ levels in the culture supernatant using the Cytometric Bead Array kit (BD Biosciences). Reactivity towards neoantigens was considered confirmed when T cells secreted at least twofold more IFNγ in response to the mutant sequence compared with the WT control sequence.

NKIRTIL063 TMG screen

The number of tumor variants of patient NKIRTIL063 selected for TMG screening was limited to 200 (of a total of 685 nonsynonymous expressed mutations). Mutations were selected by first including the 25 most clonal mutations (based on variant allele frequency), followed by including mutations with highest gene expression up to a total of 200 tumor variants. TMG constructs were designed to encode ten variant-encoding minigenes (93 nt each) in which the mutant codon was flanked by 45 nt of nonmutant gene sequence on either side. Codon-optimized sequences were synthesized (Twist Biosciences) and subcloned into the CD74-modified pMSCV-puroR-mCherry retroviral vector. NKIRTIL063 immortalized B cells were transduced with TMG constructs and selected to more than 80% purity with 5 μg ml−1 puromycin. Next, TMG-expressing B cells were cocultured with NKIRTIL063 CD4+ or CD8+ TIL (from the same expansion cultures as used for the antigen discovery screen) at a ratio of 1:1 for 48 h in T cell medium with 30 U ml−1 IL-2, and activation of T cells was determined by measuring IFNγ levels in the culture supernatant using the Cytometric Bead Array kit (BD Biosciences). To validate that the observed reactivity to selected TMG constructs was mediated by neoantigens identified using our neoantigen discovery screen, modified versions of these TMGs were designed such that exclusively the minigene that encoded the identified neoantigen was reverted to its WT sequence. Reactivity of NKIRTIL063 CD4+ or CD8+ TIL to B cells transduced with these modified TMGs was subsequently assessed as above.
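The two-step variant selection and the grouping into ten-minigene TMGs can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the dictionary keys (`id`, `vaf`, `expression`) and function names are hypothetical, ties are broken by input order, and the order in which variants were assigned to TMGs is not described in the text.

```python
def select_variants(variants: list[dict], n_clonal: int = 25, n_total: int = 200) -> list[dict]:
    """Take the most clonal variants first, then fill up by gene expression."""
    by_vaf = sorted(variants, key=lambda v: v["vaf"], reverse=True)
    selected = by_vaf[:n_clonal]
    chosen_ids = {v["id"] for v in selected}
    remaining = sorted(
        (v for v in variants if v["id"] not in chosen_ids),
        key=lambda v: v["expression"],
        reverse=True,
    )
    selected.extend(remaining[: max(0, n_total - len(selected))])
    return selected


def group_into_tmgs(selected: list[dict], per_tmg: int = 10) -> list[list[dict]]:
    """Split the selected variants into consecutive groups, one group per TMG."""
    return [selected[i:i + per_tmg] for i in range(0, len(selected), per_tmg)]
```

With 200 selected variants and ten minigenes per construct, this grouping yields 20 TMGs.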

In silico selection of neoantigen vaccine targets

The computational tool Vaxrank31 was used to rank tumor mutations of patients NKIRTIL063, NKIRTIL027 and ITO66 for use in a putative personalized cancer vaccine. Patient ITO34 was omitted from this analysis because RNA expression data were unavailable. HLA typing of patients was performed using OptiType for HLA-A, -B and -C alleles. The set of somatic variant calls and aligned RNA reads were used as input, with parameters set to a peptide length of 25, an epitope length of 8–11 and utilization of the MHCFlurry prediction algorithm. In line with ongoing clinical trials of personalized neoantigen-based vaccines32, the 20 top ranking predicted neoantigens were considered for putative neoantigen vaccines (Supplementary Table 2).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

DNA sequencing data of antigen discovery screens have been deposited in the NCBI Sequence Read Archive under accession code PRJNA884260 (ref. 30). Protein sequences of genes encoding known human nonmutated cancer regression antigens, as well as selected viral genes were collected from the Uniprot database (https://www.uniprot.org/).

Code availability

Scripts used for analyzing sequencing data from antigen discovery screens are available at https://github.com/twbattaglia/amplicon-nf (ref. 33). Script output for the presented analyses is available at https://github.com/twbattaglia/HANSolo-manuscript (ref. 34).

References

  1. Schumacher, T. N., Scheper, W. & Kvistborg, P. Cancer neoantigens. Annu. Rev. Immunol. 37, 173–200 (2018).

  2. Schumacher, T. N. & Schreiber, R. D. Neoantigens in cancer immunotherapy. Science 348, 69–74 (2015).

  3. Bentzen, A. K. et al. Large-scale detection of antigen-specific T cells using peptide-MHC-I multimers labeled with DNA barcodes. Nat. Biotechnol. 34, 1037–1045 (2016).

  4. Lu, Y. C. et al. Efficient identification of mutated cancer antigens recognized by T cells associated with durable tumor regressions. Clin. Cancer Res. 20, 3401–3410 (2014).

  5. Kula, T. et al. T-Scan: a genome-wide method for the systematic discovery of T cell epitopes. Cell 178, 1016–1028.e13 (2019).

  6. Joglekar, A. V. et al. T cell antigen discovery via signaling and antigen-presenting bifunctional receptors. Nat. Methods 16, 191–198 (2019).

  7. Li, G. et al. T cell antigen discovery via trogocytosis. Nat. Methods 16, 183–190 (2019).

  8. Alspach, E. et al. MHC-II neoantigens shape tumour immunity and response to immunotherapy. Nature 574, 696–701 (2019).

  9. Borst, J., Ahrends, T., Babala, N., Melief, C. J. M. & Kastenmuller, W. CD4+ T cell help in cancer immunology and immunotherapy. Nat. Rev. Immunol. 18, 635–647 (2018).

  10. Oh, D. Y. et al. Intratumoral CD4+ T cells mediate anti-tumor cytotoxicity in human bladder cancer. Cell 181, 1612–1625.e13 (2020).

  11. Tran, E. et al. Cancer immunotherapy based on mutation-specific CD4+ T cells in a patient with epithelial cancer. Science 344, 641–645 (2014).

  12. Scheper, W. et al. Low and variable tumor reactivity of the intratumoral TCR repertoire in human cancers. Nat. Med. 25, 89–94 (2019).

  13. Stronen, E. et al. Targeting of cancer neoantigens with donor-derived T cell receptor repertoires. Science 352, 1337–1341 (2016).

  14. Johnson, L. A. et al. Gene transfer of tumor-reactive TCR confers both high avidity and tumor reactivity to nonreactive peripheral blood mononuclear cells and tumor-infiltrating lymphocytes. J. Immunol. 177, 6548–6559 (2006).

  15. Borbulevych, O. Y., Santhanagopolan, S. M., Hossain, M. & Baker, B. M. TCRs used in cancer gene therapy cross-react with MART-1/Melan-A tumor antigens via distinct mechanisms. J. Immunol. 187, 2453–2463 (2011).

  16. Valmori, D. et al. Vaccination with a Melan-A peptide selects an oligoclonal T cell population with increased functional avidity and tumor reactivity. J. Immunol. 168, 4231–4240 (2002).

  17. Chalmers, Z. R. et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Med. 9, 34 (2017).

  18. Jorritsma, A. et al. Selecting highly affine and well-expressed TCRs for gene therapy of melanoma. Blood 110, 3564–3572 (2007).

  19. Tran, E. et al. Immunogenicity of somatic mutations in human gastrointestinal cancers. Science 350, 1387–1390 (2015).

  20. Cattaneo, C. M. et al. Tumor organoid–T-cell coculture systems. Nat. Protoc. 15, 15–39 (2020).

  21. Dijkstra, K. K. et al. Generation of tumor-reactive T cells by co-culture of peripheral blood lymphocytes and tumor organoids. Cell 174, 1586–1598.e12 (2018).

  22. Arnaud, M. et al. Sensitive identification of neoantigens and cognate TCRs in human solid tumors. Nat. Biotechnol. 40, 656–660 (2022).

  23. Sahin, U. et al. Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer. Nature 547, 222–226 (2017).

  24. Ott, P. A. et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547, 217–221 (2017).

  25. Hilf, N. et al. Publisher correction: Actively personalized vaccination trial for newly diagnosed glioblastoma. Nature 566, E13–E13 (2019).

  26. Keskin, D. B. et al. Neoantigen vaccine generates intratumoral T cell responses in phase Ib glioblastoma trial. Nature 565, 234–239 (2019).

  27. Kwakkenbos, M. J. et al. Generation of stable monoclonal antibody–producing B cell receptor–positive human memory B cells by genetic programming. Nat. Med. 16, 123–128 (2010).

  28. Linnemann, C. et al. High-throughput epitope discovery reveals frequent recognition of neo-antigens by CD4+ T cells in human melanoma. Nat. Med. 21, 81–85 (2015).

  29. Bonehill, A. et al. Messenger RNA-electroporated dendritic cells presenting MAGE-A3 simultaneously in HLA class I and class II molecules. J. Immunol. 172, 6649–6657 (2004).

  30. Cattaneo, C.M. et al. HLA-agnostic Neoantigen Screening (HANSolo) – raw sequencing data. NCBI Sequence Read Archive (SRA) https://www.ncbi.nlm.nih.gov/bioproject/PRJNA884260 (2022).

  31. Rubinsteyn, A., Hodes, I., Kodysh, J. & Hammerbacher, J. Vaxrank: A computational tool for designing personalized cancer vaccines. Preprint at bioRxiv https://doi.org/10.1101/142919 (2017).

  32. Blass, E. & Ott, P. A. Advances in the development of personalized neoantigen-based therapeutic cancer vaccines. Nat. Rev. Clin. Oncol. 18, 215–229 (2021).

  33. Battaglia, T. HANSolo amplicon-nf pipeline. GitHub https://github.com/twbattaglia/amplicon-nf (2022).

  34. Battaglia, T. HANSolo analysis code. GitHub https://github.com/twbattaglia/HANSolo-manuscript (2022).

Acknowledgements

We would like to thank M. Slagter and L. Wessels for bioinformatic and statistical support, K. Dijkstra for support with single-cell TCR sequencing, A. van de Leun for support with isolation of neoantigen-specific TCRs, M. Wolkers for kindly sharing patient material, K. Bresser and D. Vredevoogd for helpful discussions on library design, the NKI-AVL Flow Cytometry Facility for flow cytometric support, the NKI-AVL Core Facility Molecular Pathology and Biobanking for supplying NKI-AVL Biobank material and laboratory support and the NKI-AVL Genomics Core Facility for support with next-generation sequencing. This work was supported by the Dutch Cancer Society Young Investigator Grant (grant No. 2020-1/12977) (to W.S.), ZonMw Translational Research Program 2 (grant No. 446002001) (to W.S. and J.B.A.G.H.), the Queen Wilhelmina Cancer Research Award and ERC AdG SENSIT (grant agreement No. 742259) (to T.N.S.), the NWO Gravitation program (NWO 2012-2022) (to E.E.V.) and Oncode Institute (to T.N.S. and E.E.V.). Figure 1a was created with BioRender.com.

Author information

Author notes

  1. These authors contributed equally: Thomas Battaglia, Jos Urbanus.

  2. These authors jointly supervised this work: Emile E. Voest, Ton N. Schumacher, Wouter Scheper.

Authors and Affiliations

  1. Department of Molecular Oncology and Immunology, The Netherlands Cancer Institute, Amsterdam, The Netherlands

    Chiara M. Cattaneo, Thomas Battaglia, Jos Urbanus, Ziva Moravec, Rhianne Voogd, John B. A. G. Haanen, Emile E. Voest, Ton N. Schumacher & Wouter Scheper

  2. Oncode Institute, Utrecht, The Netherlands

    Chiara M. Cattaneo, Thomas Battaglia, Jos Urbanus, Emile E. Voest & Ton N. Schumacher

  3. Department of Genomics of Cancer and Targeted Therapies, IFOM, FIRC Institute of Molecular Oncology, Milan, Italy

    Chiara M. Cattaneo

  4. Department of Hematopoiesis, Sanquin Research, Amsterdam, The Netherlands

    Rosa de Groot

  5. Department of Hematology, Leiden University Medical Centre, Leiden, The Netherlands

    Rosa de Groot & Ton N. Schumacher

  6. Department of Surgery, The Netherlands Cancer Institute, Amsterdam, The Netherlands

    Koen J. Hartemink

  7. Department of Medical Oncology, The Netherlands Cancer Institute, Amsterdam, The Netherlands

    John B. A. G. Haanen & Emile E. Voest

Contributions

C.M.C., J.U., Z.M., R.V. and W.S. designed, performed, analyzed and interpreted experiments. T.B. analyzed sequencing data of screens. K.J.H. and R.d.G. supplied patient tumor material. C.M.C., J.B.A.G.H., E.E.V., T.N.S. and W.S. wrote the manuscript. All authors reviewed the manuscript.

Corresponding authors

Correspondence to
Emile E. Voest, Ton N. Schumacher or Wouter Scheper.

Ethics declarations

Competing interests

T.N.S. is advisor for Allogene Therapeutics, Celsius, Merus, Neogene Therapeutics and Scenic Biotech; is a recipient of research support from Merck KgaA; is a stockholder in Allogene Therapeutics, Cell Control, Celsius, Merus, Neogene Therapeutics and Scenic Biotech and is venture partner at Third Rock Ventures, all outside of the current work. J.B.A.G.H. is advisor for BioNTech, Neogene Therapeutics, Scenic Biotech and T-Knife; is a recipient of research grant support from BioNTech; is a stock option holder in Neogene Therapeutics, all outside of the current work. All other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Paul Robbins and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cattaneo, C.M., Battaglia, T., Urbanus, J. et al. Identification of patient-specific CD4+ and CD8+ T cell neoantigens through HLA-unbiased genetic screens.
Nat Biotechnol (2023). https://doi.org/10.1038/s41587-022-01547-0


Accurate isoform discovery with IsoQuant using long reads

Main

Long-read RNA sequencing is now widely used in bulk, sorted cells, single cells and spatial approaches. This wide field of applications has led to the development of multiple spliced alignment programs1,2,3,4, transcript discovery methods5,6,7,8,9,10,11, tools for transcript classification12, annotation13 and visualization14,15. Additionally, several reference-free tools for RNA long-read correction and assembly have been developed16,17. Current community efforts address the problem of understanding performance, weaknesses and advantages of each approach for various applications18.

Here we present IsoQuant—a tool for transcript discovery and quantification with long RNA reads. IsoQuant takes as input a reference genome and a dataset containing PacBio or ONT (Oxford Nanopore Technologies) RNA reads. By default, IsoQuant maps input reads to the genome via minimap2 in splice mode2. Alternatively, a user may provide BAM files generated with a spliced aligner of their choice, for example STARlong1 for PacBio and uLTRA4 or deSALT3 for ONT reads. In two distinct modes, IsoQuant can be used for de novo annotation-free transcript discovery as well as with the reference gene annotation.

IsoQuant uses long-read spliced alignments to construct an intron graph, in which vertices are splice junctions, that is, pairs of splice sites (donor and acceptor), and two vertices are connected with a directed edge if the corresponding splice junctions are consecutive in at least one read (Methods). This graph is exploited for constructing paths that correspond to full-length transcripts (Fig. 1a). If the reference annotation is provided, IsoQuant first assigns reads to known isoforms via an inexact intron-chain matching algorithm that accounts for splice site shifts, which are typical for alignment of error-prone reads19. These assignments are further used for reference transcript quantification and correction of inaccurately detected splice junctions and misalignments, such as skipped microexons.
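The intron graph definition above can be made concrete with a toy example. The sketch below is not IsoQuant's implementation: the names are hypothetical, each read is assumed to be already reduced to its ordered list of (donor, acceptor) junctions, the per-edge read counts are an added convenience, and the exhaustive source-to-sink path enumeration is only a naive stand-in for the tool's actual path construction.

```python
from collections import defaultdict

Junction = tuple[int, int]  # (donor, acceptor) genomic coordinates


def build_intron_graph(read_junctions: list[list[Junction]]) -> dict[Junction, dict[Junction, int]]:
    """Vertices are splice junctions; an edge a -> b means a and b are consecutive in a read."""
    graph: dict[Junction, dict[Junction, int]] = defaultdict(lambda: defaultdict(int))
    for chain in read_junctions:
        for junction in chain:
            graph[junction]  # touch the key so isolated junctions become vertices too
        for a, b in zip(chain, chain[1:]):
            graph[a][b] += 1  # number of reads supporting this edge
    return {v: dict(nbrs) for v, nbrs in graph.items()}


def enumerate_paths(graph: dict[Junction, dict[Junction, int]]) -> list[list[Junction]]:
    """Naively list every path from a vertex with no incoming edge to one with no outgoing edge."""
    has_incoming = {b for nbrs in graph.values() for b in nbrs}
    paths: list[list[Junction]] = []

    def walk(vertex: Junction, path: list[Junction]) -> None:
        if not graph[vertex]:
            paths.append(path)
            return
        for nxt in sorted(graph[vertex]):
            walk(nxt, path + [nxt])

    for source in sorted(v for v in graph if v not in has_incoming):
        walk(source, [source])
    return paths
```

For three reads covering an exon-skipping event, the graph has three vertices and the enumeration returns two intron chains, one including and one skipping the middle junction.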

Fig. 1: IsoQuant pipeline outline and characteristics of novel transcripts generated from mouse simulated data.
a, Outline of the IsoQuant pipeline. When a reference gene annotation is provided, reads are assigned to annotated isoforms and alignment artifacts are corrected (top). The intron graph is constructed from read alignments (middle) and transcripts are discovered via path construction (bottom). b, F1-score for novel transcripts reported by different tools on simulated ONT (left) and PacBio data (right). c, Precision and recall for novel transcripts reported by different tools on simulated ONT data broken up by expression levels in TPM. TPM bins are presented by dot sizes. d, Precision (left) and recall (right) for novel transcripts reported by different tools on simulated ONT data. e, Same as d, but for simulated PacBio data.

To compare IsoQuant performance against existing transcript discovery tools, we first simulated mouse PacBio and ONT data using realistic gene expression profiles with IsoSeqSim (https://github.com/yunhaowang/IsoSeqSim) and Trans-NanoSim20 respectively. For more informative benchmarking, we simulated an ONT R9.4 dataset representing R9.4 chemistry and an ONT R10.4 dataset corresponding to a more accurate R10.4 chemistry (Methods).

To mimic real-life datasets containing unannotated transcripts, we arbitrarily removed 5,311 (15%) of 35,684 expressed isoforms (the ones contributing to at least one read during the simulation) from the GENCODE21 gene annotation. These 5,311 hidden transcripts were further used as a ground truth for novel transcript discovery. The reduced GENCODE annotation was used as an input for all tools. Each output annotation was then separated into a set of known and a set of novel transcripts, which were compared against the respective baselines using gffcompare22 (Methods).

For known transcripts, IsoQuant has the highest F1-score (the harmonic mean of precision and recall) compared to TALON7, FLAIR8, Bambu11 and StringTie5, but these advances are not dramatic (Supplementary Tables 1–3). However, IsoQuant produces novel transcripts with a 1.9-fold higher F1-score on ONT R10.4 data compared to the second-best tool, StringTie. In comparison to TALON, FLAIR and Bambu, the improvement in F1-score is even more noticeable (Fig. 1b, left). On PacBio data, IsoQuant again shows the best F1-score, but the difference from other tools is smaller than for ONT R10.4 data (Fig. 1b, right).
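For readers less familiar with the metric, the relationship between precision, recall and F1-score can be sketched as below. The function name and the counts in the usage note are hypothetical and are not taken from the benchmark; the sketch simply assumes that a tool's output has been reduced to the number of reported transcripts, the number of those matching the ground truth, and the size of the ground-truth set.

```python
def precision_recall_f1(true_positives: int, reported: int, ground_truth: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1), where F1 is the harmonic mean of the first two."""
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / ground_truth if ground_truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a made-up example, a tool that reports 100 novel transcripts of which 50 are correct, against a ground truth of 50, has a precision of 0.5, a recall of 1.0 and an F1-score of about 0.67. Because the harmonic mean is pulled toward the smaller value, a tool cannot reach a high F1-score through recall alone.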

Compared to most tools, IsoQuant’s improvement in F1-score is primarily caused by its very high precision for novel transcripts. Compared to TALON, FLAIR and StringTie, IsoQuant shows at least a fivefold drop in false-positive rate on ONT R10.4 data, while still maintaining slight gains in recall (Fig. 1d). The situation differs for Bambu. IsoQuant has higher precision (86.3 versus 69.9%) and substantially higher recall: while Bambu only reconstructs 73 out of 5,311 novel isoforms (1% recall), IsoQuant reconstructs 3,848 (62.6%). On ONT R9.4 simulated data IsoQuant similarly shows a notably lower false-positive rate compared to other tools (Supplementary Table 2).

On PacBio simulated data, similar trends can be observed for novel transcripts, although with a less drastic difference in specificity. Bambu shows slightly higher precision (95.8%) compared to IsoQuant (94.4%), but again has the lowest recall (18.7% for Bambu versus 76.8% for IsoQuant). StringTie, TALON and FLAIR again predict transcripts with comparable recall, but have at least a fivefold higher false-positive rate compared to IsoQuant (Fig. 1e; a detailed analysis of the false-positive transcripts is provided in Supplementary Note 8).

Further, we measured precision and recall for novel transcripts with different expression levels (Fig. 1c and Supplementary Fig. 1). While all tools tend to show lower recall and precision for lowly expressed transcripts, IsoQuant yields highly specific transcript models (≥80% precision) and maintains advances for novel transcript discovery regardless of the expression levels. Thus, IsoQuant is likely to be highly useful across many genes, including but not limited to low-expressed long-noncoding RNAs and marker genes of cell types.

Among the five listed methods, only StringTie and IsoQuant support annotation-free transcript discovery. Thus, we compared these two tools on the same simulated datasets used above without providing any annotation (Supplementary Table 4). On PacBio data both tools yield highly accurate transcript models. On ONT data StringTie shows higher recall, while IsoQuant generates transcripts with substantially lower false-positive rates (2.5-fold decrease for ONT R10.4 dataset and 3.7-fold for ONT R9.4). While the overall quality of transcripts discovered in reference-based mode is, indeed, higher compared to annotation-free runs, the precision and recall of novel transcripts appear to be rather similar in both modes.

To complement our benchmarks on simulated data, we also sequenced Lexogen spike-in RNA variant (SIRV) synthetic molecules on the Oxford Nanopore MinION using ONT R10.4 flowcells (Methods). Along with the complete SIRV annotation, Lexogen provides an incomplete annotation, missing 26 out of the total 69 SIRV isoforms, which allows the evaluation of novel transcript discovery, similar to the one we performed for simulated data with the reduced GENCODE annotation.

Results on SIRV sequencing data resemble the ones obtained on simulated reads. When predicting novel isoforms, IsoQuant shows at least four times higher F1-score and eightfold lower false-positive rate than any other tool. In comparison to most tools, with the exception of TALON, IsoQuant shows high gains in both precision and recall. TALON has a better recall (42.3 versus 38.5%), but IsoQuant has tenfold higher precision (Fig. 2a). Similar to simulated data, all tools are able to accurately predict SIRV transcripts kept in the annotation, with Bambu, StringTie and IsoQuant having perfect precision for known isoforms alone (Supplementary Table 5).

To support our observations, we also applied all tools to the real human ONT complementary DNA, ONT direct RNA (dRNA)23 and PacBio public datasets, for which the ground truth is unknown. We used gffcompare to estimate the consistency of predictions by computing the number of identical transcript models reported by the different tools. On the human ONT dRNA dataset, IsoQuant shows the highest percentage of transcripts confirmed by at least three other methods (70.1%), while no other tool surpasses the 40% threshold. This suggests that IsoQuant transcript models are notably more consistent with other methods (Fig. 2b, middle). In comparison to the other approaches, IsoQuant also reports the lowest number of transcripts that are not predicted by any other method. If one interprets such transcript models as potential false positives, IsoQuant again stands out with the lowest false-discovery rate (3.5%, 1,162 transcripts). In contrast, other tools output annotations in which more than 33% of transcript models are unconfirmed (varying from 18,000 to 48,000). Additionally, for each tool we computed the number of potentially missed transcripts that were reported by all other methods. While TALON has the lowest number of such transcripts (75), Bambu shows the second-best result of 1,089 possible false negatives and IsoQuant shows the third-best result of 1,521 such transcripts (Supplementary Table 6).

Fig. 2: Characteristics of transcripts obtained from real sequencing data.

a, Precision, recall and F1-score for novel transcripts generated on real SIRV ONT cDNA sequencing data. b, Consistency of predictions made by different methods on real human ONT cDNA, ONT dRNA and PacBio data.

Similar trends can be observed in ONT cDNA and PacBio datasets, although the overall percentage of common transcripts appears to be lower compared to ONT dRNA data (Fig. 2b, left and right). IsoQuant again shows the highest fraction of transcripts predicted by at least three other tools (35.6% for ONT cDNA, 55.6% for PacBio), while other programs reach 25 and 40%, respectively, at best. All four other tools produce annotations containing a high number of transcripts that are not confirmed by any other method (>50% of all transcripts for ONT cDNA, >30% for PacBio), while IsoQuant’s potential false predictions are below 25% on the ONT cDNA dataset and below 10% on the PacBio dataset.

Although these values cannot be explicitly treated as false positives and false negatives, they suggest that, unlike other tools, IsoQuant produces highly specific annotations that are strongly consistent with transcripts reported by several alternative approaches. Moreover, because IsoQuant typically misses very few isoforms predicted by all other tools simultaneously, it is likely to also be highly sensitive (Supplementary Table 6, the number of potentially missed transcripts).

Additionally, we used long-read RNA sequencing data from a mouse brain sample, in which a previous study reported 76 novel isoforms of high biological importance24, which were confirmed by manual annotation by the GENCODE team. Here, we compared IsoQuant only with StringTie, which has the second-best F1-score across all simulated datasets. On PacBio data, IsoQuant correctly reconstructs 71% of the confirmed novel isoforms, while StringTie restores approximately half as many (37%; Supplementary Table 7). Similarly, on the single-cell ONT dataset from the same brain sample IsoQuant restores almost 50% of these 76 novel isoforms, whereas StringTie reports 30%. Although it is not possible to evaluate specificity in this kind of experiment, it confirms that IsoQuant can maintain high recall values on real sequencing data.

Besides transcript discovery, IsoQuant implements additional functionality, such as read-to-isoform assignment and transcript quantification. Benchmarks of these supplementary features, information on computational performance, as well as IsoQuant results obtained with different spliced aligners can be found in Supplementary Notes 2–7.

In summary, IsoQuant accurately predicts transcript models from PacBio or ONT RNA sequencing data. For known isoforms, IsoQuant has higher F1-score compared to other tested tools, but these differences are not dramatic. For unannotated isoforms, however, IsoQuant provides very strong increases in F1-score over other existing approaches. In comparison to most tools, it achieves this F1-score increase by maintaining higher recall, while substantially increasing precision. Thus, IsoQuant is a valuable tool for predicting novel alternatively spliced isoforms in the age of long-read sequencing.

Methods

Sequencing Lexogen SIRV transcripts

First, total RNA from HeLa cells was extracted using the miRNeasy Tissue/Cells Advanced Mini Kit (Qiagen, 217604), and polyA transcripts were pulled down using the NEBNext Poly(A) mRNA Magnetic Isolation Module (NEB, E7490S). Next, the SIRV-Set 4 (Iso Mix E0/ERCC/Long SIRVs) (Lexogen, 141.01) was spiked into the RNA and reverse transcribed using the Maxima H Minus Reverse Transcriptase (Thermo Scientific, EP0752). The final concentrations in the reverse transcription reaction were as follows: 1.25 ng μl−1 polyA HeLa RNA, 0.33 ng μl−1 SIRV-Set 4, 0.5 mM dNTP, 5 μM dT-VN oligo, 5 μM TSO, 1× reverse transcriptase buffer, 2 U μl−1 RiboLock RNase Inhibitor (Thermo Scientific, EO0382) and 20 U μl−1 Maxima H Minus Reverse Transcriptase. The reaction was incubated for 30 min at 50 °C and 5 min at 85 °C. Then, 5 μl of the reverse transcription reaction was amplified using the Platinum Superfi II Mastermix (ThermoFisher, 12368010) for 12 cycles, according to the manufacturer’s instructions and using Forward- and Reverse-Amplification primers. Finally, the cDNA was cleaned up using SPRIselect beads at a 0.8× ratio (Beckman Coulter, B23318) and used as input for Oxford Nanopore Technologies sequencing with both the Kit 12 (SQK-LSK110 kit and FLO-MIN106D flowcells) and Q20+ (SQK-LSK112 kit and FLO-MIN112 flowcells) chemistries. Both were run for 72 h and basecalled using the Super Accuracy model.

Data simulation

To simulate PacBio circular consensus sequencing (CCS) reads we used IsoSeqSim (https://github.com/yunhaowang/IsoSeqSim), which generates a read by truncating a transcript sequence according to given probabilities and randomly inserts sequencing errors at a specified rate with uniform distribution. As reported in previous studies25, a uniform error distribution is a realistic model for PacBio CCS reads. Here we used 5′ and 3′ truncation probabilities typical for PacBio Sequel II (provided within the package) and an overall error rate of 1.6%: 0.6% deletions, 0.6% insertions and 0.4% substitutions. While these discrepancies do not necessarily represent sequencing errors, they must nevertheless be modeled, as they can confuse transcript reconstruction. The above values were obtained by mapping real PacBio CCS reads to the reference genome18.

ONT reads were simulated with the NanoSim software in the transcriptome mode20. NanoSim is designed specifically for simulating ONT-specific sequencing errors and biases. It first constructs error-profile and length-distribution models, which are then used to mutate reference transcript sequences. We trained the model using the ONT R10.4 sequencing data (average error rate of 2.8%: 0.7% deletions, 1.1% insertions and 1% substitutions). To simulate ONT R9.4 chemistry, we used a pretrained model provided within the NanoSim package, which was obtained using publicly available ONT cDNA data23 from the NA12878 human cell line and has an average error rate of 15.9%: 6% deletions, 5.1% insertions and 4.8% substitutions. In addition, we turned off the simulation of intron retention events and of random unaligned reads representing background noise.

However, additional analysis of the simulated ONT data and NanoSim code revealed that NanoSim randomly selects a start position of a read in a transcript sequence with a uniform distribution, thus introducing no 5′ or 3′ bias. To simulate more realistic ONT reads, we aligned real ONT cDNA data obtained from the mouse brain sample to the reference transcriptome using minimap2 and derived empirical truncation probability distributions on both 5′ and 3′ ends. Further, we changed the NanoSim source code to enable sequence truncation with respect to obtained probabilities (Supplementary Fig. 2). The modified version is available at https://github.com/andrewprzh/lrgasp-simulation.

For both ONT and PacBio simulation we used Mouse GENCODE v.26 and Human GENCODE v.36 basic annotations21. Before simulation, we also attached a 30 basepair (bp) polyA tail to every transcript sequence. To simulate realistic mouse data, a transcript expression profile was obtained using PacBio data from a mouse brain sample24. For human data, a gene expression profile was computed with PacBio GM12878 data. A complete description of every dataset used in this study is provided in the Supplementary Table 8.

Quality evaluation of predicted novel transcripts

To mimic real-life situations and assess the ability of an algorithm to predict novel transcripts, we created reduced gene annotations by removing a fraction of expressed isoforms. First, we define a subset of true expressed transcripts that contributed to at least one read during the simulation. Among this set, we select a fraction of transcripts to be excluded from the annotation. These transcripts are denoted as the true novel isoforms. The remaining transcripts (among the expressed) are defined as true known isoforms. To create a reduced gene annotation, we remove all true novel isoforms from the comprehensive GENCODE annotation. Here we created a reduced mouse annotation with 15% of expressed transcripts removed, and four human reduced annotations with 10, 15, 20 and 25% of excluded expressed isoforms (Supplementary Note 2).
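The construction of a reduced annotation can be sketched in a few lines of Python. This is a minimal illustration rather than the script used in the study: transcripts are reduced to plain IDs, the function names are hypothetical, and uniform random sampling with a fixed seed is an assumption (the text only states that a fraction of expressed transcripts is selected).

```python
import random

def split_known_novel(expressed_ids, novel_fraction, seed=0):
    """Split expressed transcripts into (true known, true novel) sets.

    Assumption: the novel subset is drawn uniformly at random.
    """
    rng = random.Random(seed)
    ordered = sorted(expressed_ids)          # sort for reproducibility
    n_novel = round(len(ordered) * novel_fraction)
    novel = set(rng.sample(ordered, n_novel))
    known = set(ordered) - novel
    return known, novel

def reduce_annotation(annotated_ids, novel_ids):
    """Remove the true novel isoforms from the comprehensive annotation."""
    return set(annotated_ids) - set(novel_ids)

# Toy data: 30 annotated transcripts, 20 of which produced reads.
annotation = {f"tx{i}" for i in range(30)}
expressed = {f"tx{i}" for i in range(20)}

known, novel = split_known_novel(expressed, novel_fraction=0.15)
reduced = reduce_annotation(annotation, novel)
print(len(known), len(novel), len(reduced))
```

Unexpressed transcripts stay in the reduced annotation, so a tool is still given annotated isoforms that never appear in the reads.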

To evaluate a transcript prediction tool, we provided the entire set of simulated reads and the reduced annotation as an input. Thus, true novel isoforms are hidden from the annotation, but present in the reads. We then compute precision and recall by running gffcompare22 for (1) the entire output annotation versus the complete set of expressed transcripts, (2) reported known isoforms versus the set of true known isoforms and (3) predicted novel transcript models versus the true novel set. The information on whether a transcript is known or novel is obtained from the output GTF file. The script for computing these metrics can be found in the IsoQuant repository in misc/reduced_db_gffcompare.py.
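The three comparisons share the same arithmetic. The sketch below is a simplified stand-in for gffcompare, not a reimplementation of it: each transcript model is assumed to be reduced to a hashable key (here a tuple of introns), and two models count as identical only when the keys are equal. The function name and the toy coordinates are hypothetical.

```python
def precision_recall_f1(predicted, truth):
    """Precision, recall and F1 for two collections of transcript keys."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)                      # exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy intron chains: each transcript is a tuple of (donor, acceptor) pairs.
true_novel = {
    ((200, 400),),
    ((200, 250), (300, 400)),
    ((900, 1200),),
    ((900, 1000), (1100, 1200)),
}
predicted_novel = {
    ((200, 400),),                  # correct
    ((900, 1200),),                 # correct
    ((905, 1200),),                 # shifted donor site: false positive
}

p, r, f1 = precision_recall_f1(predicted_novel, true_novel)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

The same function would be called three times: once for the whole output, once for the known subset and once for the novel subset.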

For the annotation-free benchmarks we simply compared the entire output annotation with the true set of expressed isoforms using gffcompare.

To estimate how recall and precision of novel transcripts depend on the expression levels, predicted transcripts are grouped into bins by their transcripts per million (TPM) values. Computing recall requires the number of false negative calls (undetected transcripts) in each TPM bin. We thus group transcripts by the TPM values used during the simulation. However, computing precision requires the number of false-positive predictions within each bin, and thus only reported TPM values can be used (the true TPM for a false prediction is 0). As a result, the same transcript may fall into different bins when benchmarking different tools. Although it is not possible to compute precision and recall exactly for an arbitrary TPM range, the bias has a minor effect as only a small number of bins (five) was used in this experiment. Therefore, despite being imperfect, these estimates can provide additional insights on whether a transcript discovery method has any bias toward highly or lowly expressed isoforms.
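The asymmetry between the two metrics is easier to see in code. The following sketch uses hypothetical bin edges and toy TPM values (the bin boundaries used in the study are not given here); recall bins true transcripts by their simulated TPM, while precision bins predictions by the TPM a tool reports.

```python
from bisect import bisect_right

# Hypothetical bin edges in TPM; they define len(EDGES) + 1 bins.
EDGES = [1.0, 10.0, 100.0]

def tpm_bin(tpm, edges=EDGES):
    """Index of the bin that a TPM value falls into."""
    return bisect_right(edges, tpm)

def binned_recall(true_tpm, detected_ids, edges=EDGES):
    """Recall per bin; true transcripts are binned by simulated TPM."""
    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for tx, tpm in true_tpm.items():
        b = tpm_bin(tpm, edges)
        totals[b] += 1
        hits[b] += tx in detected_ids
    return [h / t if t else None for h, t in zip(hits, totals)]

def binned_precision(reported_tpm, true_ids, edges=EDGES):
    """Precision per bin; predictions are binned by reported TPM."""
    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for tx, tpm in reported_tpm.items():
        b = tpm_bin(tpm, edges)
        totals[b] += 1
        hits[b] += tx in true_ids
    return [h / t if t else None for h, t in zip(hits, totals)]

# Simulated TPM of the true novel transcripts.
true_tpm = {"a": 0.5, "b": 5.0, "c": 50.0, "d": 500.0}
# TPM reported by a tool; "x" and "y" are false predictions.
reported_tpm = {"a": 2.0, "c": 40.0, "x": 0.3, "y": 60.0}

recall = binned_recall(true_tpm, detected_ids=set(reported_tpm))
precision = binned_precision(reported_tpm, true_ids=set(true_tpm))
print(recall, precision)
```

Transcript "a" is counted in the lowest bin for recall (simulated TPM 0.5) but in the next bin for precision (reported TPM 2.0), which is exactly the effect described above. Bins without any entries are returned as None.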

To evaluate SIRV transcripts we used an incomplete SIRV annotation containing only 43 out of 69 SIRV transcripts. The output annotations were again split into known and novel transcripts, and compared against the respective reference set using gffcompare. The SIRV-Set 4 annotations are available at https://www.lexogen.com/sirvs/download/.

Estimating consistency between annotations

Consistency between transcripts generated on real data was estimated using gffcompare (without providing a reference annotation). Based on gffcompare output, for each tool we computed how many of its transcripts are supported by (1) all four other tools, (2) exactly three other tools, (3) one or two other tools and (4) no other tool (possible false predictions). We also counted the number of potentially missed transcripts that were reported by all methods except the one being evaluated (possible false negative). This approach is implemented in misc/denovo_model_stats.py.
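The counting scheme can be illustrated as follows. This is a sketch under simplifying assumptions, not the contents of misc/denovo_model_stats.py: each tool's output is reduced to a set of transcript keys, identical models share a key, and the function names are hypothetical. With five tools, the four categories correspond to support by all four, exactly three, one or two, and none of the other tools.

```python
from collections import Counter

def support_categories(tool_sets, tool):
    """Count a tool's transcripts by how many other tools report them."""
    others = [s for name, s in tool_sets.items() if name != tool]
    counts = Counter(all_others=0, all_but_one=0, some=0, none=0)
    for tx in tool_sets[tool]:
        n = sum(tx in s for s in others)
        if n == len(others):
            counts["all_others"] += 1
        elif n == len(others) - 1:
            counts["all_but_one"] += 1
        elif n >= 1:
            counts["some"] += 1
        else:
            counts["none"] += 1          # possible false prediction
    return counts

def potentially_missed(tool_sets, tool):
    """Transcripts reported by every other tool but not by this one."""
    others = [s for name, s in tool_sets.items() if name != tool]
    return set.intersection(*others) - tool_sets[tool]

# Toy outputs of five tools; integers stand in for transcript models.
tools = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 5},
    "D": {1, 3, 5, 6},
    "E": {1, 5, 7},
}

stats_a = support_categories(tools, "A")
missed_a = potentially_missed(tools, "A")
print(dict(stats_a), missed_a)
```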

Command line options

For PacBio data minimap2 was launched with ‘splice:hq’ preset; for ONT data we used k-mer size 14 with the usual ‘splice’ preset. We also provided annotated splice junctions in BED format as an input. In each experiment, all tools were provided with the same BAM file and the same reference annotation. IsoQuant was launched with the default parameters, setting the appropriate data type via the ‘--data_type’ option. StringTie2 was launched with the ‘-L’ option. All other tools were run with the default parameters in 20 threads. In contrast to all other tools, Bambu outputs all reference transcripts, including unexpressed ones. Thus, we filtered out all transcripts with read count values <1 from the Bambu output. As recommended in the user manual, we also ran TALON using preliminary alignment correction with TranscriptClean26 (https://github.com/mortazavilab/TALON). However, as the results with and without correction were almost identical, we decided to use the annotations obtained from raw data for a fair comparison. Complete information on all options and software versions is provided in Supplementary Table 9.

IsoQuant algorithm

To process long RNA reads, IsoQuant requires a reference genome and, optionally, a corresponding gene annotation. If the reads are provided in the FASTQ format, IsoQuant maps them to the reference with minimap2 in splice mode2. Alternatively, a user may provide a sorted and indexed BAM file generated with a spliced aligner of their choice. If the reference annotation is provided, the IsoQuant algorithm includes four main steps: (1) assigning mapped reads to known isoforms, (2) transcript quantification, (3) alignment correction and (4) transcript model construction. In the annotation-free mode, the pipeline simply proceeds to the transcript discovery step. Below, we describe the key aspects of all four procedures.

Assigning long reads to known isoforms

The algorithm for assigning long reads to annotated isoforms is based on intron-chain matching and detecting exonic overlaps. To assign reads, IsoQuant processes each gene individually by extracting reads that map to the respective region from the sorted BAM file.

IsoQuant first processes the annotation to construct splice junction and exon profiles of all known isoforms. A set of annotated splice junctions in the gene is sorted according to their coordinates in the genome and enumerated from 1 to N. Thus, an annotated isoform can be represented as a vector of length N, in which the element at position i is set to 1 if this isoform includes the ith splice junction and −1 otherwise (Supplementary Fig. 3a). This vector is henceforth referred to as an isoform splice junction profile. The exon profile is constructed in a similar manner: all annotated exons are first split into a minimal set of M nonoverlapping fragments, such that every exon can be represented as their combination, and these exonic fragments are sorted and enumerated. The exon profile for an annotated isoform is similarly denoted as a vector of length M, where the ith element is set to 1 if this isoform contains the ith exon fragment and −1 otherwise (Supplementary Fig. 3b).
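A small sketch may help make the two profiles concrete. It is an illustration under stated assumptions, not IsoQuant's code: isoforms are given as sorted lists of exons with inclusive coordinates, a splice junction is written as (last base of the upstream exon, first base of the downstream exon), and all function names are hypothetical.

```python
def junctions_of(exons):
    """Splice junctions implied by a sorted exon list."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

def split_into_fragments(all_exons):
    """Minimal set of non-overlapping fragments covering all exons."""
    points = sorted({s for s, _ in all_exons} | {e + 1 for _, e in all_exons})
    fragments = []
    for a, b in zip(points, points[1:]):
        # Keep the interval [a, b - 1] only if some exon covers it.
        if any(s <= a and b - 1 <= e for s, e in all_exons):
            fragments.append((a, b - 1))
    return fragments

def build_profiles(isoforms):
    """Return junction list, fragment list and both profiles per isoform."""
    all_junctions = sorted({j for ex in isoforms.values() for j in junctions_of(ex)})
    fragments = split_into_fragments({e for ex in isoforms.values() for e in ex})
    sj_profiles, exon_profiles = {}, {}
    for name, exons in isoforms.items():
        own = set(junctions_of(exons))
        sj_profiles[name] = [1 if j in own else -1 for j in all_junctions]
        exon_profiles[name] = [
            1 if any(s <= a and b <= e for s, e in exons) else -1
            for a, b in fragments
        ]
    return all_junctions, fragments, sj_profiles, exon_profiles

isoforms = {
    "iso1": [(100, 200), (400, 500)],
    "iso2": [(100, 200), (250, 300), (400, 500)],
    "iso3": [(150, 200), (400, 500)],
}
junctions, fragments, sj, ex = build_profiles(isoforms)
print(junctions)   # N = 3 annotated splice junctions
print(fragments)   # M = 4 exonic fragments
print(sj, ex)
```

In this toy gene iso1 and iso3 have identical splice junction profiles and differ only in the first exonic fragment, which is why both profiles are kept.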

To assign a read to an annotated isoform, each splice junction from the alignment is matched against annotated splice junctions from the current gene and a read splice junction profile is constructed (also a vector of length N). In this vector the ith element is set to 1 if the annotated splice junction with index i matches to a splice junction from the read, −1 if it is overlapped or spanned by the read, but no match is detected, and 0 otherwise. A zero value indicates that the splice junction is located outside the alignment region and therefore no information can be derived, for example due to read truncation. Similarly, the exon profile of the read is constructed based on M exonic fragments described above: 1 indicates that the respective exonic fragment is overlapped, −1 means it is spanned and 0 is set for exonic fragments outside the alignment region (Supplementary Fig. 4).

Due to sequencing errors, an aligner may detect splice site positions inaccurately19. To avoid considering them as alternative or novel, the algorithm allows a small difference Δ between annotated and alignment splice site coordinates when matching splice junctions. Formally speaking, an annotated splice junction (x1, x2) matches a read splice junction (y1, y2) if |x1 − y1| ≤ Δ and |x2 − y2| ≤ Δ. The default Δ value varies for different types of input data: 4 bp used for PacBio CCS reads and 6 bp for ONT reads (can be set manually). Although an aligned read can be assigned to an isoform by simply comparing its intron chain and exonic coordinates to the annotation, vectorizing the alignment as described above allows one to easily implement inexact splice site comparison with a delta, and quickly detect candidate isoforms for read assignment.
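The inexact matching rule and the resulting read profile can be sketched as follows. The function names are hypothetical, and the test for "overlapped or spanned" (the annotated intron intersects the aligned region) is an assumption about one reasonable way to implement the description above.

```python
def junction_matches(annotated, observed, delta):
    """|x1 - y1| <= delta and |x2 - y2| <= delta."""
    return (abs(annotated[0] - observed[0]) <= delta
            and abs(annotated[1] - observed[1]) <= delta)

def read_junction_profile(annotated_junctions, read_junctions,
                          read_start, read_end, delta):
    """Vector of length N with values 1 (match), -1 (contradicted), 0 (unknown)."""
    profile = []
    for sj in annotated_junctions:
        if any(junction_matches(sj, rj, delta) for rj in read_junctions):
            profile.append(1)
        elif sj[0] < read_end and sj[1] > read_start:
            profile.append(-1)   # inside the aligned region, but not matched
        else:
            profile.append(0)    # outside the aligned region
    return profile

annotated = [(200, 250), (200, 400), (300, 400)]

# Full-length read whose splice sites are shifted by a few bases.
full = read_junction_profile(annotated, [(203, 398)], 120, 480, delta=4)

# Read truncated at the 5' end: the first junction is not covered at all.
truncated = read_junction_profile(annotated, [(301, 400)], 260, 480, delta=4)

# Exact matching (delta = 0) would treat the shifted junction as unmatched.
strict = read_junction_profile(annotated, [(203, 398)], 120, 480, delta=0)

print(full, truncated, strict)
```

Here delta=4 corresponds to the default stated above for PacBio CCS reads.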

Further, to assign a read to an isoform, its exon and splice junction profiles are matched against the respective profiles of the annotated isoforms. The distance between two profiles is computed simply as the number of distinct elements in which the read profile has nonzero values. A read is said to be consistent with an isoform if the distances between their exon and splice junction profiles are 0, and the read has no unannotated splice junctions/exons (Supplementary Fig. 4). When a read is consistent with a single isoform, it is reported as a unique match. When a read is consistent with multiple isoforms simultaneously, it is classified as ambiguous, which may happen, for example, due to read truncation. If a read contains unannotated splice junctions/exons, or its profiles are not consistent with any isoform, it is marked as inconsistent. For such alignments IsoQuant reports the most similar reference transcript and detected alternative splicing events.
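The distance and the three assignment outcomes can be written down directly. This is a minimal sketch with hypothetical names and hand-written toy profiles; the detection of unannotated junctions or exons is represented by a simple flag, and the reporting of the most similar transcript for inconsistent reads is omitted.

```python
def profile_distance(read_profile, isoform_profile):
    """Number of positions where the read is informative and disagrees."""
    return sum(1 for r, i in zip(read_profile, isoform_profile)
               if r != 0 and r != i)

def classify_read(read_sj, read_exon, isoforms, has_unannotated=False):
    """Return (label, consistent isoforms) for one read.

    isoforms maps a name to (splice junction profile, exon profile).
    """
    if has_unannotated:
        return "inconsistent", []
    consistent = [
        name for name, (sj, ex) in isoforms.items()
        if profile_distance(read_sj, sj) == 0
        and profile_distance(read_exon, ex) == 0
    ]
    if len(consistent) == 1:
        return "unique", consistent
    if len(consistent) > 1:
        return "ambiguous", consistent
    return "inconsistent", []

# Toy gene with N = 3 splice junctions and M = 4 exonic fragments.
isoforms = {
    "iso1": ([-1, 1, -1], [1, 1, -1, 1]),
    "iso2": ([1, -1, 1], [1, 1, 1, 1]),
    "iso3": ([-1, 1, -1], [-1, 1, -1, 1]),
}

full_length = classify_read([-1, 1, -1], [1, 1, -1, 1], isoforms)
truncated = classify_read([-1, 1, -1], [0, 1, -1, 1], isoforms)
conflicting = classify_read([1, 1, -1], [1, 1, -1, 1], isoforms)
print(full_length, truncated, conflicting)
```

The truncated read carries no information about the first exonic fragment (value 0), so it cannot distinguish iso1 from iso3 and is reported as ambiguous.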

Some inconsistencies can be, however, caused by misalignments, rather than by real alternative splicing events19: (1) skipped short exons, (2) intron shifts exceeding Δ bp and (3) short unannotated exons at transcript ends (Supplementary Fig. 5). If an inconsistent alignment contains only these types of discrepancy, the read is reclassified as conditionally consistent.

Transcript quantification

Once long reads are assigned to annotated isoforms, quantification becomes rather trivial. Uniquely assigned reads are counted as a single detected transcript, while ambiguous reads are treated as multi-mappers and contribute to multiple assigned isoforms with lower weight. A transcript is reported as expressed only if it has at least one uniquely assigned read. Inconsistent reads are considered as potential novel isoforms and ignored during the quantification step. Besides genes and transcripts, IsoQuant can also count inclusion and exclusion abundances for separate exons and introns, which can be useful for computing percentage spliced-in values.
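A possible counting scheme is sketched below. The text does not specify the weight given to ambiguous reads, so splitting each such read equally among its k candidate isoforms (weight 1/k) is an assumption made purely for illustration; the function name is hypothetical as well.

```python
from collections import defaultdict

def quantify(assignments):
    """Weighted read counts per isoform.

    assignments holds, for every read, the list of isoforms it is
    consistent with; an empty list marks an inconsistent read.
    Assumption: an ambiguous read adds 1/k to each of its k isoforms.
    """
    counts = defaultdict(float)
    has_unique = set()
    for isoforms in assignments:
        if len(isoforms) == 1:
            counts[isoforms[0]] += 1.0
            has_unique.add(isoforms[0])
        elif len(isoforms) > 1:
            for name in isoforms:
                counts[name] += 1.0 / len(isoforms)
        # Inconsistent reads are ignored at this step.
    # Report only isoforms supported by at least one unique read.
    return {name: c for name, c in counts.items() if name in has_unique}

reads = [
    ["iso1"],            # unique
    ["iso1", "iso3"],    # ambiguous
    ["iso2", "iso3"],    # ambiguous
    [],                  # inconsistent
]
abundances = quantify(reads)
print(abundances)
```

In this toy input iso2 and iso3 receive only fractional support from ambiguous reads and are therefore not reported as expressed.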

IsoQuant implements additional functionality for barcoded long RNA reads, for example barcoded by single-cell or spatial location24,27. A user can provide information on how the reads are grouped, for example, as a TSV file that indicates a barcode or a cell type of origin for every read. Isoform and gene abundances are then calculated for every read group separately, which can facilitate an expression comparison between different groups or cell types.

Spliced alignment correction

IsoQuant corrects each uniquely assigned read individually. If a read contains misalignments described above (Supplementary Fig. 5) or its intron chain is not identical to the intron chain of the assigned isoform, the alignment is corrected as follows. Short skipped exons are restored according to the annotation and minor splice junction shifts are replaced with the respective splice junctions from the assigned transcript. Unannotated terminal microexons are simply removed from the alignment. Finally, any unannotated splice site is substituted with the nearest site from the assigned transcript if (1) these splice sites are located within Δ bp and (2) read alignment contains sequencing errors near this splice site. Coordinates of corrected alignments are then saved in BED12 format.

Transcript model construction

The transcript reconstruction procedure implemented in IsoQuant includes four steps: (1) intron graph construction from read alignments, (2) intron graph simplification, (3) attaching terminal vertices and (4) construction of paths representing full-length transcripts. This stage does not require any information on reference transcripts and thus can be used for both de novo and annotation-based transcript discovery. Below we provide a detailed description of all algorithms and intuition behind them.

Intron graph construction

To construct transcript models, IsoQuant implements a concept of an intron graph, which was influenced by the previously designed splice graph approach28, used, for example, in StringTie5. For a given set of transcripts, an intron graph is constructed as follows. First, we define internal vertices as a set of all splice junctions from all transcripts. Thus, each vertex represents a pair of splice sites (donor and acceptor) or, more formally, an ordered pair of coordinates in the genome. Two vertices are connected with a directed edge if the respective splice junctions are consecutive in any transcript. Finally, for every first or last splice junction in a transcript, the corresponding vertex is connected with a terminating vertex that represents the transcript start and end positions (formally, a single integer). The intron graph is a directed acyclic graph since every edge connects only consecutive elements. Each transcript can now be represented as a path in the graph that traverses from the initial to terminal vertex, where internal vertices denote its intron chain (Supplementary Fig. 6a).
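The construction for a set of transcripts can be sketched with plain dictionaries. This is an illustration, not IsoQuant's data structure: internal vertices are (donor, acceptor) tuples, terminal vertices are encoded as ("start", position) and ("end", position), which is a hypothetical encoding, and mono-exonic transcripts are skipped because they contain no introns.

```python
from collections import defaultdict

def build_intron_graph(transcripts):
    """Adjacency sets of the intron graph.

    transcripts: iterable of (start, intron_chain, end).
    """
    edges = defaultdict(set)
    for start, chain, end in transcripts:
        if not chain:
            continue                      # no introns, no internal vertices
        edges[("start", start)].add(chain[0])
        for a, b in zip(chain, chain[1:]):
            edges[a].add(b)               # consecutive splice junctions
        edges[chain[-1]].add(("end", end))
    return edges

def full_paths(edges):
    """All paths that lead from a starting to a terminating vertex."""
    paths = []

    def walk(vertex, path):
        if vertex[0] == "end":
            paths.append(path)
            return
        for nxt in sorted(edges.get(vertex, ()), key=str):
            walk(nxt, path + [nxt])

    for v in sorted(edges, key=str):
        if v[0] == "start":
            walk(v, [v])
    return paths

transcripts = [
    (100, [(200, 400)], 500),
    (100, [(200, 250), (300, 400)], 500),
]
graph = build_intron_graph(transcripts)
paths = full_paths(graph)
for p in paths:
    print(p)
```

Each input transcript reappears as one path. In larger genes the graph may also contain paths that combine parts of different transcripts, which is why path selection requires read support.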

The described approach can be used to construct an intron graph from read alignments. Similarly to the read-to-isoform assignment procedure, the genes are processed by IsoQuant individually. First, the algorithm constructs a set of internal vertices corresponding to splice junctions from the selected alignments. Two vertices are likewise connected when the respective splice junctions are consecutive in any read alignment. Due to the presence of inexactly detected splice sites, which may remain even after the alignment correction, such a graph may contain false vertices and connections. These false nodes typically form topological patterns, such as tips and bulges. A tip is defined as a dead end (dead start) edge that has a starting (ending) vertex with outdegree (indegree) at least 2. A bulge consists of two alternative paths having the same start and end vertices (Supplementary Fig. 6b). Similar patterns are also typical for de Bruijn graphs, which are used for short read assembly, where bulges and tips are caused by sequencing errors. To remove tips and bulges, assemblers exploit various techniques broadly called graph simplification29,30.

Intron graph simplification

Here we implement a graph simplification procedure based on the following observations: (1) a false splice junction is typically unannotated, (2) splice site shifts that cause a false intron are short and (3) the number of reads supporting the correct splice junction often exceeds read support of a false one. Formally, a bulge/tip is removed from the graph if it represents an unannotated splice junction that has at least twice lower read support compared to the alternative vertex and the alternative vertex has splice sites within 20 bp (10 bp for PacBio). In other cases, when an unannotated splice junction has a high read support or no similar splice junction exists, a bulge or a tip is likely to represent a part of a novel isoform and thus should be preserved (Supplementary Fig. 6b). Although intron graph simplification strongly resembles naive splice junctions clustering, it has an important difference: a splice junction is removed not only based on its properties, such as splice site positions and read support, but based on the graph topology as well, thus considering adjacent splice junctions. Such a method allows one to, for example, preserve similar splice junctions from distinct isoforms. It is worth noting that the simplification procedure keeps track of all collapsed tips and bulges, thus preserving the possibility to later traverse alignment containing removed splice junctions through the graph.
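The removal criterion itself is a simple predicate. The sketch below covers only this local decision and assumes that the bulge or tip, together with its alternative vertex, has already been found in the graph; the function and argument names are hypothetical.

```python
def should_collapse(candidate, alternative, support, annotated, max_shift):
    """Decide whether a tip/bulge vertex is merged into its alternative.

    candidate, alternative: (donor, acceptor) splice junctions.
    support: read counts per junction.
    max_shift: 20 bp for ONT and 10 bp for PacBio, as stated above.
    """
    if candidate in annotated:
        return False                                  # (1) only unannotated junctions
    if support[alternative] < 2 * support[candidate]:
        return False                                  # (3) needs at least twofold lower support
    return (abs(candidate[0] - alternative[0]) <= max_shift
            and abs(candidate[1] - alternative[1]) <= max_shift)  # (2) short shift

annotated = {(200, 400)}
support = {(200, 400): 40, (203, 400): 3, (200, 250): 5, (230, 400): 30}

# Weakly supported junction 3 bp away from an annotated one: collapsed.
print(should_collapse((203, 400), (200, 400), support, annotated, 20))
# Distant acceptor site: likely a real alternative junction, kept.
print(should_collapse((200, 250), (200, 400), support, annotated, 20))
# Well-supported unannotated junction: kept as a potential novel isoform.
print(should_collapse((230, 400), (200, 400), support, annotated, 20))
```

The graph topology determines which pairs of vertices are compared in the first place, which is what distinguishes this procedure from plain splice junction clustering.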

Collecting terminal positions

After the graph is simplified, the algorithm proceeds to attach starting and terminal vertices. In contrast to annotated transcripts, read alignments do not provide the exact terminal positions, as their sequences can be truncated. Thus, to avoid having an extreme number of terminal vertices, terminal positions are detected using the heuristics presented below. Without loss of generality here we assume that the gene of interest is on the forward strand and polyA tails are on the right.

For every splice junction V in the graph, the algorithm selects only read alignments that contain V as a terminal splice junction and processes them as follows. First, the polyA sites are collected and clustered. Clustered polyA positions {p1, …, pk} are added to the graph as terminal vertices and connected to vertex V (Supplementary Fig. 7a). Further, the algorithm adds the rightmost non-polyA terminal position P as a terminal vertex if one of the conditions is satisfied: (1) V has no outgoing edges, (2) V has an outgoing edge to a splice junction (u1, u2) and P > u1 + Δ or (3) V has adjacent polyA vertices {p1, …, pk} and P > max(p1, …, pk) + Δ (where Δ is the parameter defined above). Thus, a non-polyA terminal position can only be attached if it is located to the right of adjacent exons or polyA vertices. Starting positions are collected in a similar manner, but without looking for polyA sites (Supplementary Fig. 7b). The described approach, however, may lose information when several isoforms share the same starting splice junction but have distinct transcription start and end sites. Thus, we also apply an additional transcripts correction, which is described below.

Transcript discovery via path construction

Once the intron graph is constructed and simplified, IsoQuant detects full-length paths that connect starting and terminal vertices. Paths entirely supported by at least a single read alignment (that is, full-splice match) are marked as transcript prediction candidates (Supplementary Fig. 7c). To filter out unreliable novel transcripts IsoQuant applies read support cutoffs: at least five full-splice match reads (three for PacBio) and at least 2% from the maximum graph coverage. Since some isoforms may not have a full-splice matching alignment, IsoQuant also reports known transcripts that (1) have at least one uniquely assigned read and (2) can be traversed through the intron graph. It also reports known mono-exonic transcripts that have (1) a uniquely assigned read and (2) a confirmed polyA site.
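The read support cutoffs for novel transcripts amount to the following check. The names are hypothetical, and treating the maximum graph coverage as a single read count supplied by the caller is an assumption, since the text does not say how it is measured.

```python
def passes_support_cutoffs(fsm_reads, max_graph_coverage, min_reads,
                           min_fraction=0.02):
    """Apply both cutoffs to a candidate novel transcript.

    fsm_reads: number of full-splice match reads supporting the path.
    min_reads: 5 for ONT and 3 for PacBio, as stated above.
    """
    return (fsm_reads >= min_reads
            and fsm_reads >= min_fraction * max_graph_coverage)

ONT_MIN_READS, PACBIO_MIN_READS = 5, 3

print(passes_support_cutoffs(6, 200, ONT_MIN_READS))      # 6 >= 5 and 6 >= 4
print(passes_support_cutoffs(6, 1000, ONT_MIN_READS))     # 6 < 20 in a highly covered gene
print(passes_support_cutoffs(4, 100, ONT_MIN_READS))      # too few reads for ONT
print(passes_support_cutoffs(4, 100, PACBIO_MIN_READS))   # enough for PacBio
```

The relative cutoff scales with gene coverage, so a path needs more supporting reads in highly expressed genes than the absolute minimum alone would require.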

To correct terminal positions of a novel transcript, the algorithm selects all alignments consistent with this transcript and uses them to extract terminal positions using the approach described above (Supplementary Fig. 7d). In contrast to detecting terminal vertices for the entire graph, where all alignments are used, the subset of consistent reads likely belongs specifically to this isoform and thus provides correct start and end positions. The resulting transcripts are saved in GTF format, providing additional information about transcript types and their reference genes.

While the previously designed splice graph structure and the intron graph implemented in this work are both designed to represent alternatively spliced transcripts and, in general, are highly similar, there are a few differences that can be highlighted. First of all, the splice graph natively supports transcription start and polyA sites as well as mono-exonic transcripts. The intron graph, however, requires the introduction of additional types of ‘terminal vertex’ that denote transcript start and end positions. At the same time, any exonic overlap between alternative transcripts will lead to a merged node in the splice graph, while the intron graph requires an exact match of both splice sites between two transcripts to form a single connected component. Thus, the intron graph can potentially be less tangled for genes containing multiple alternatively spliced isoforms and, therefore, less complex to traverse. Moreover, the intron graph natively provides information on neighboring splice junctions, which makes it easy to identify incorrectly detected splice sites caused by misalignments and to perform graph simplification. While this procedure can definitely be implemented within the splice graph concept, it seems to be more straightforward and natural for the intron graph.

To evaluate how different steps of the transcript model construction algorithm affect recall and precision of IsoQuant, we performed a separate experiment described in Supplementary Note 1.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Nanopore sequencing data obtained from the human NA12878 cell line is available at https://github.com/nanopore-wgs-consortium/NA12878/blob/master/RNA.md. PacBio human GM12878 data is available at ENCODE (https://www.encodeproject.org/search) under the accession numbers ENCFF450VAU and ENCFF694DIE. Sequencing data obtained from mouse brain samples is available at NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) under accession numbers GSE158450 and GSE178175. ONT SIRV data, simulated data and reduced gene annotations are published at https://zenodo.org/record/7121404 (ref. 31).

Code availability

IsoQuant and the supplementary scripts used for the evaluation are available at https://github.com/ablab/IsoQuant. Scripts for data simulation are available at https://github.com/andrewprzh/lrgasp-simulation.

References

  1. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

  2. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  3. Liu, B. et al. deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index. Genome Biol. 20, 274 (2019).

  4. Sahlin, K. & Mäkinen, V. Accurate spliced alignment of long RNA sequencing reads. Bioinformatics 37, 4643–4651 (2021).

  5. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).

  6. Tung, L. H., Shao, M. & Kingsford, C. Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads. Genome Biol. 20, 287 (2019).

  7. Wyman, D. et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. Preprint at bioRxiv https://doi.org/10.1101/672931 (2020).

  8. Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).

  9. Kuo, R. I. et al. Illuminating the dark side of the human transcriptome with long read transcript sequencing. BMC Genomics 21, 751 (2020).

  10. Byrne, A. et al. Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat. Commun. 8, 16027 (2017).

  11. Chen, Y. et al. Context-aware transcript quantification from long read RNA-Seq data. Bioconductor https://doi.org/10.18129/B9.bioc.bambu (2022).

  12. Tardaguila, M. et al. Corrigendum: SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 1096–1096 (2018).

  13. de la Fuente, L. et al. tappAS: a comprehensive computational framework for the analysis of the functional impact of differential splicing. Genome Biol. 21, 119 (2020).

  14. Reese, F. & Mortazavi, A. Swan: a library for the analysis and visualization of long-read transcriptomes. Bioinformatics 37, 1322–1323 (2021).

  15. Stein, A. N., Joglekar, A., Poon, C.-L. & Tilgner, H. U. ScisorWiz: visualizing differential isoform expression in single-cell long-read data. Bioinformatics 38, 3474–3476 (2022).

  16. Sahlin, K. & Medvedev, P. Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis. Nat. Commun. 12, 2 (2021).

  17. Nip, K. M. et al. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes. Genome Res. 30, 1191–1200 (2020).

  18. Pardo-Palacios, F. et al. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Preprint at https://doi.org/10.21203/rs.3.rs-777702/v1 (2021).

  19. Mikheenko, A., Prjibelski, A. D., Joglekar, A. & Tilgner, H. U. Sequencing of individual barcoded cDNAs using Pacific Biosciences and Oxford Nanopore Technologies reveals platform-specific error patterns. Genome Res. 32, 726–737 (2022).

  20. Hafezqorani, S. et al. Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data. Gigascience 9, giaa061 (2020).

  21. Frankish, A. et al. GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021).

  22. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).

  23. Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).

  24. Joglekar, A. et al. A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain. Nat. Commun. 12, 463 (2021).

  25. Ono, Y. et al. PBSIM: PacBio reads simulator—toward accurate genome assembly. Bioinformatics 29, S119–S121 (2013).

  26. Wyman, D. & Mortazavi, A. TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts. Bioinformatics 35, 340–342 (2019).

  27. Gupta, I. et al. Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells. Nat. Biotechnol. 36, 1197–1202 (2018).

  28. Heber, S. et al. Splicing graphs and EST assembly problem. Bioinformatics 18, S181–S188 (2002).

  29. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).

  30. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).

  31. Prjibelski, A., Mikheenko, A., Joglekar, A., Jarroux, J. & Tilgner, H. U. Mouse SIRV and simulated data used in the IsoQuant publication. Zenodo https://doi.org/10.5281/zenodo.7121404 (2022).

Acknowledgements

We thank the Nanopore WGS consortium and Ali Mortazavi’s laboratory at the University of California, Irvine for making the ONT and PacBio data publicly available. This work was supported by St. Petersburg State University, Russia (grant ID no. PURE 93023437 to A.M., A.S., A.L.L. and A.D.P.). Scientific research was performed at the Research Park of St. Petersburg State University Computing Center.

Author information

Author notes

  1. These authors contributed equally: Andrey D. Prjibelski, Alla Mikheenko.

Authors and Affiliations

  1. Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia

    Andrey D. Prjibelski, Alla Mikheenko & Alla L. Lapidus

  2. Department of Computer Science, University of Helsinki, Helsinki, Finland

    Andrey D. Prjibelski

  3. Tri-Institutional Computational Biology and Medicine, Weill Cornell Medicine, New York, NY, USA

    Anoushka Joglekar

  4. Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY, USA

    Anoushka Joglekar, Julien Jarroux & Hagen U. Tilgner

  5. Center for Neurogenetics, Weill Cornell Medicine, New York, NY, USA

    Anoushka Joglekar, Julien Jarroux & Hagen U. Tilgner

  6. Bioinformatics Institute, St. Petersburg, Russia

    Alexander Smetanin

Contributions

A.D.P., A.M. and A.S. designed and implemented the software. A.D.P., A.M. and A.J. performed the benchmarks. J.J. performed the sequencing experiments. A.D.P. and A.L.L. acquired funding. H.U.T. suggested the project. A.L.L. and H.U.T. supervised the project. A.D.P., A.M., A.J. and H.U.T. wrote the manuscript.

Corresponding authors

Correspondence to Andrey D. Prjibelski or Hagen U. Tilgner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Heng Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Prjibelski, A.D., Mikheenko, A., Joglekar, A. et al. Accurate isoform discovery with IsoQuant using long reads.
Nat Biotechnol (2023). https://doi.org/10.1038/s41587-022-01565-y

  • DOI: https://doi.org/10.1038/s41587-022-01565-y

Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models

Main

Drug-response patterns in individuals with complex disease, such as type 2 diabetes (T2D), are intricate. Multiple organs and confounders are typically involved, including comorbidities and polypharmacy1,2. Conversely, treatment with one or more drugs and the associated polypharmacy effects can have considerable impact on the molecular profile of the individual; however, such changes are still largely unknown3. The increasing availability of deep phenotyping and multi-omics screening has proven to be beneficial in the characterization of T2D and other diseases4,5,6,7, and offers the opportunity to gain mechanistic insights on the action of drugs on disease processes.

Cohort studies can be highly useful for investigating associations between drugs and molecular phenotypes, and can be used to tailor the design of randomized control studies to assess direct causal relationships8. Common approaches to analysis of cohort data apply univariate statistical methods, linear and logistic regression, dimensionality reduction and clustering analyses. However, when expanding to multi-omics data such analyses are not straightforward and traditional methods of data interpretation are insufficient to exploit the full scope of multi-modality data.

Here we investigate vertical data integration, where multiple omics datasets have been generated for the same samples. Challenges that must be overcome include integration of data across multiple continuous and discrete data modalities, efficient handling of missing data or even large missing parts of specific data types, differences in dimensionality, modality-specific noise and how to extract associations across data modalities9,10,11. There are several strategies for vertical integration of multi-modal datasets, such as element-wise addition of one dataset at a time, learning individual representations for each dataset before fusion, or multi-dimensional fusion where representations are learned from the input data altogether9,12,13,14. Examples are multi-omics factor analysis (MOFA), iCluster, and data integration analysis for biomarker discovery using latent components (DIABLO) implemented in mixOmics, which can integrate multiple modalities11,14,15,16. However, these methods primarily focus on discovering factors or latent variables that can be used for visualization, clustering, or prediction of disease.

We have previously developed a deep-learning framework on the basis of variational autoencoders (VAE)17,18 for integration and binning of large amounts of unstructured metagenomics data19. Specifically, a VAE is based on deep neural networks and learns to transform high-dimensional data into a lower-dimensional space, termed a latent representation. During this process the two networks of the VAE learn the structure of input data and associations between the input variables. In our previous study, we found that the VAE could learn to integrate two datasets without any prior knowledge or statistical model19. Similarly, others have shown the capabilities of VAEs as integrative models for extracting the underlying signal in data for improving clustering and prediction12,20,21,22,23, as well as for handling large proportions of missing data24. We, therefore, speculated that such a model could be used to integrate even deeper cohort-level multi-omics datasets. While previous studies have primarily focused on stratifying patients using the underlying latent representation22,25,26 we were also interested in whether we could acquire insights into the complex relationships that the network learns through data integration.
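As a reminder of what the latent representation involves, the sketch below shows the two standard ingredients of the usual VAE formulation: sampling from the encoder's diagonal Gaussian with the reparameterization trick, and the closed-form Kullback–Leibler term against a standard normal prior. It is a generic illustration with hypothetical function names, not the loss or architecture of the framework described here.

```python
import math
import random

def sample_latent(mu, log_var, rng=random):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian posterior."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

# A posterior equal to the prior has zero divergence; shifting the mean adds mu^2 / 2.
kl_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0], [0.0])
z = sample_latent([0.0, 0.0], [0.0, 0.0], random.Random(0))
```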

For this purpose, we exploited that the decoder of the VAE is a generative model. Thus, the final trained decoder will be able to generate new examples of data from the learned latent distribution. On the basis of this principle, a variety of generative models have been used to generate new examples of data, such as single-cell RNA data and artificial human chromosomes27,28. Additionally, when combined with Bayesian decision theory they have been used for analysis of single-cell RNA data on the basis of variational inference29,30,31. Generative models also allow investigation of the effect that a virtual perturbation of the input data will have on the generated examples. For instance, Yeo et al. trained a generative model on single-cell RNA time-series data and then perturbed the input data to identify the effect of the perturbation on the output of the generative model32. Similarly, a recent study used the generative model of a VAE trained on protein evolutionary data to predict the effect that genetic variants have on the fitness of human proteins33. For our multi-modal data, we hypothesized that the generative ability of the VAE would allow us to identify associations between, for example, patient exposures and omics features.

We therefore developed a framework based on VAEs and applied it to a cohort of 789 people with newly diagnosed T2D with extensive multi-omics characterization. The modalities included genomics, transcriptomics, proteomics, metabolomics, and microbiomes as well as data on medication, diet questionnaires, and clinical measurements. Our method was able to integrate multi-omics data with clinical and categorical data and was resistant to systematic biases in the data as well as large amounts of missing data. Using an ensemble of generative VAE models, feature perturbation, univariate statistical methods, and Bayesian decision theory, we identified cross-omics associations. We compared the drug multi-omics profiles and showed that different drugs are associated with unique clinical and molecular profiles. Our method, multi-omics variational autoencoders (MOVE), is freely available, easily scalable, can integrate any number of categorical and continuous datasets, and is able to identify feature-to-multi-omics associations.

Results

Designing a VAE for multi-omics data integration

We used a dataset of 789 newly diagnosed T2D individuals with extensive multi-omics characterization (Supplementary Table 1). In total, the data included 8,807 variables per individual with median missingness within an omics dataset of less than 5%, except for metagenomics data where two thirds of the individuals (532) did not have any data (Supplementary Data 1 and Supplementary Fig. 1). Therefore, these individuals had up to 24.7% missingness across the multi-omics data. For the clinical data, missingness was higher, with a per-individual median of 14% and 7% for continuous and categorical clinical data, respectively. We designed the MOVE framework to be flexible in relation to the number of input data types and to be able to handle both continuous and categorical features (Fig. 1a). To identify the optimal hyperparameters that would capture the structure of the data without losing the ability to generalize on unseen individuals, we initially divided the dataset into training and test sets. We then measured the ability of the models to reconstruct the input as well as the stability when refitting the model to the data several times (Supplementary Figs. 2–4). The median reconstruction accuracies were between 0.95 and 1, and the final models were highly stable when retrained five times, with an average change of cosine similarities in the latent space of 0.037. Thus, the VAE models were able to reconstruct the data with high accuracy across the individuals (Supplementary Fig. 5).
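One way to quantify stability across refits is sketched below. The exact definition used for the reported value is given in the paper's Methods and is not reproduced here, so the sketch rests on an assumption: compute the cosine similarity between every pair of individuals in each latent space, then average the absolute change between two refits. All function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pairwise_cosines(latent):
    """Cosine similarity for every pair of individuals, in a fixed order."""
    return [cosine(latent[i], latent[j]) for i, j in combinations(range(len(latent)), 2)]

def mean_cosine_change(latent_a, latent_b):
    """Average absolute change in pairwise cosine similarity between two refits."""
    sims_a = pairwise_cosines(latent_a)
    sims_b = pairwise_cosines(latent_b)
    return sum(abs(a - b) for a, b in zip(sims_a, sims_b)) / len(sims_a)

run_1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
run_2 = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]  # run_1 rotated by 90 degrees
change = mean_cosine_change(run_1, run_2)
```

Comparing pairwise similarities rather than raw coordinates means that a refit whose latent space is merely rotated, as in the toy data, counts as unchanged.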

Fig. 1: Integrating multi-omics data with a VAE.

a, Principle of integration and analysis approach using MOVE. Individual-level non-omics and multi-omics data were used as input to a VAE. The optimal network hyperparameters were estimated from the summed test set error across all individuals in the test (test likelihood), training reconstruction accuracy, and model stability. Significant drug–omics associations were identified by perturbing drug status from no (0) to yes (1) for all individuals that were not already administered the drug. b, UMAP representation of the latent representation from the 789 people with newly diagnosed T2D. Individuals were colored according to their z-scaled Matsuda index from low (blue), average (yellow), and high (red). c, Overlap in significant drug–omics associations between standard t-test (two-sided, Benjamini–Hochberg FDR < 0.01) on the input data, MOVE t-test (multi-stage Bonferroni-corrected, P adjust < 0.05) and MOVE Bayes approaches (FDR Bayes < 0.05). The different methods of multiple testing correction corresponded to FDR of 0.05 on the ground-truth dataset. The overlap between MOVE t-test and MOVE Bayes was used for further analysis (n = 573). d, The number of significant associations found between drugs and features in the multi-omics datasets using MOVE t-test and MOVE Bayes (purple), t-test (green) or ANOVA (orange). See c for information on the tests. e, Fraction of features in the multi-omics datasets that was found by MOVE to be significantly associated with at least one drug (n = 20). The lower and upper hinges correspond to the first and third quartiles. The upper and lower whiskers extend from the hinge to the highest and lowest values, respectively, but no further than 1.5× interquartile range from the hinge. Data beyond the ends of whiskers are outliers and are plotted individually.

The latent space contains important clinical signatures

To illustrate how well the model captured the structure of the clinical data, we analyzed the neural network weights connected to the input variables of the encoder. Here we found the majority of the clinical and dietary variables to be among the top 50 most important (Supplementary Fig. 6). This was also the case when we investigated how the continuous features impacted the positioning of the individuals in the latent space using a Shapley additive explanation (SHAP) analysis34, whereas for discrete features we found T2D-associated genetic variants as well as clinically related features to be important (Supplementary Fig. 7). Then, we investigated how individuals would be differentiated by characteristics such as insulin sensitivity quantified by the Matsuda index (Fig. 1b). Here we found a trend of the Matsuda index correlating with the two uniform manifold approximation and projection (UMAP) dimensions, with Pearson’s correlation coefficients (PCC) of 0.34 and −0.35 for dimensions one and two, respectively. Using k-nearest-neighbor (kNN) regression on the latent representation we found that R² for the Matsuda index (k = 5) was 0.70 compared to 0.37–0.38 when using residualized data or dimensionality reduction using principal component analysis (PCA), and that this trend was consistent for larger k (Supplementary Figs. 8 and 9). This indicated that the MOVE latent representation captured a clinical signal that was not as easily identified from the residualized data or by using PCA for dimensionality reduction. Furthermore, we did not find any strong local effects of missingness (R² = 0.05 at k = 5) and only small effects of age (R² < 0.01, k = 100). Similarly, we used a kNN classifier to investigate the effect of the confounders sex and recruitment center on the global structure of the latent representation. These achieved accuracies of 0.58 and 0.25 for sex and center, respectively, which should be compared to by-chance accuracies of 0.50 and 0.17, respectively (Supplementary Figs. 10 and 11). If we used non-residualized data, that is, when not correcting for confounding effects including age, sex, and center, we observed larger effects (Supplementary Figs. 10 and 11). This demonstrates the ability of the VAE to integrate heterogeneous data but also that substantial confounding factors can influence the latent representation.
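The kNN regression check can be sketched as follows. This is a minimal illustration under stated assumptions: predictions are made leave-one-out with a plain Euclidean distance, which may differ from the evaluation scheme actually used, and the one-dimensional toy data and function names are hypothetical.

```python
import math

def knn_predict(points, targets, query_index, k):
    """Predict one point's target as the mean over its k nearest other points."""
    query = points[query_index]
    neighbors = sorted(
        (math.dist(query, p), t)
        for i, (p, t) in enumerate(zip(points, targets))
        if i != query_index
    )[:k]
    return sum(t for _, t in neighbors) / len(neighbors)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Toy latent coordinates and a clinical value that varies smoothly along them.
latent = [[0.0], [1.0], [2.0], [3.0], [4.0]]
clinical = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [knn_predict(latent, clinical, i, k=2) for i in range(len(latent))]
r2 = r_squared(clinical, predicted)
```

A high R² indicates that individuals who are close in the latent space have similar clinical values, which is the sense in which the latent representation is said to capture a clinical signal.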

Extracting drug to clinical and multi-omics associations

We then investigated whether the model had learned associations between the clinical, drug and multi-omics data. To do this, we developed an approach based on perturbing input features one at a time (Fig. 1a). For instance, to identify associations between a particular drug and all other features, we simulated administering the drug to each individual who had not received it. In addition to excluding individuals who were already receiving the drug, we also excluded individuals taking a drug of the same therapeutic drug class in the anatomical therapeutic chemical classification (ATC) system (Supplementary Table 2). We then assessed whether the change in each of the feature reconstructions was significantly different from that obtained when passing the original data through the model (Fig. 1a). Because VAE models are stochastic, we used results across an ensemble of models and developed two different approaches to identify significant associations. One approach was based on applying t-tests with Bonferroni correction across four different models, where each model was refitted 10 times (MOVE t-test); the other, inspired by earlier variational work29,30,31, used Bayesian decision theory and a single model refitted 30 times (MOVE Bayes). To identify parameter settings that would allow the two approaches to be compared with each other and with standard methods (t-test, analysis of variance (ANOVA)), we applied them to two datasets consisting of randomized clinical, drug and multi-omics data. Our findings showed that MOVE t-test and MOVE Bayes performed well at identifying drug–omics associations compared with t-test and ANOVA at a ground-truth false discovery rate (FDR) of 0.05 (Supplementary Fig. 12, Supplementary Table 3 and Methods).
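The perturbation step can be sketched in a few lines. This is only an illustration of the idea: `toy_reconstruct` is a hypothetical deterministic stand-in for a trained model, the feature names and values are invented, and neither the exclusion of same-class drugs nor the significance testing across an ensemble is modeled.

```python
def perturbation_effect(reconstruct, individuals, drug_key):
    """Mean per-feature change in reconstruction after setting the drug from 0 to 1.

    Only individuals not already receiving the drug are perturbed.
    """
    untreated = [ind for ind in individuals if ind[drug_key] == 0]
    if not untreated:
        return {}
    totals = {}
    for ind in untreated:
        baseline = reconstruct(ind)
        perturbed = reconstruct({**ind, drug_key: 1})
        for feature, value in perturbed.items():
            totals[feature] = totals.get(feature, 0.0) + (value - baseline[feature])
    return {feature: total / len(untreated) for feature, total in totals.items()}

# Hypothetical stand-in for a trained model: LDL falls by 0.5 when the drug is set.
def toy_reconstruct(ind):
    return {"ldl": ind["ldl"] - 0.5 * ind["statin"], "hdl": ind["hdl"]}

cohort = [
    {"statin": 0, "ldl": 1.0, "hdl": 0.2},
    {"statin": 1, "ldl": 0.3, "hdl": 0.1},
    {"statin": 0, "ldl": 0.8, "hdl": -0.4},
]
effects = perturbation_effect(toy_reconstruct, cohort, "statin")
```

In the toy cohort only the two untreated individuals are perturbed, and the mean change recovers the effect built into the stand-in model for LDL while leaving HDL unchanged.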

MOVE identifies drug and multi-omics associations

We then applied the MOVE framework to identify drug associations in the DIRECT multi-modal data. The two methods, MOVE t-test and MOVE Bayes, identified 3,143 and 763 significant associations to the multi-omics and clinical features, respectively (Supplementary Tables 4–6 and Supplementary Data 2–4). We analyzed the intersection of the two approaches and found that 573 of the 763 (75%) significant associations from MOVE Bayes were found by both methods (Fig. 1c). Making a conservative choice, we used the associations identified by both methods for further analyses. Compared with traditional tests such as Student’s t-test and ANOVA, this yielded 211% more significant associations (573 versus 184) (Fig. 1d). In addition, the significant associations identified by MOVE were distributed across the drugs (two-sided t-test, P = 0.016) and were not limited to the drugs administered to most individuals, such as simvastatin, atorvastatin, and metformin. For instance, MOVE identified a median of 20 associations per drug compared to 1 for t-test and 0 for ANOVA, highlighting that our method was more sensitive for extracting associations for drugs given to a smaller number of individuals (Supplementary Tables 5 and 6). Among the multi-omics datasets, we found that the largest number of significant drug associations was to the metabolomics, clinical, and transcriptomics data with an average of six associations per drug (Fig. 1e and Supplementary Fig. 13). When normalizing for all possible associations, the highest fraction of associations was to the clinical data (8%) followed by targeted and untargeted metabolomics with an average of 5.1% and 2.8% of the features associated to a drug, respectively. Finally, we investigated whether our results could be driven by disease subtypes within the T2D cohort. To do this, we used four archetype clusters from Wesolowska-Andersen and Brorsson et al.7 that were based on clustering from 32 clinical features. Here we found that a median of 6.5% of the significant drug–omics associations were specific to one of the subgroups, indicating that the associations were not primarily driven by the archetypes (Supplementary Table 7).
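The two percentages quoted for the overlap and for the gain over standard tests follow directly from the reported counts, as this short check shows (the variable names are ours).

```python
move_bayes = 763       # significant associations from MOVE Bayes
both_methods = 573     # associations found by both MOVE t-test and MOVE Bayes
standard_tests = 184   # associations found by standard t-test and ANOVA

overlap_fraction = both_methods / move_bayes
relative_increase = (both_methods - standard_tests) / standard_tests

print(f"{overlap_fraction:.0%}")   # 75%
print(f"{relative_increase:.0%}")  # 211%
```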

Changes in T2D biomarkers were associated with metformin

We then investigated drug and multi-omics interactions (Fig. 2a and Supplementary Figs. 14–18), and initially focused on expected clinical drug interactions. For instance, for metformin, we identified 88 significant clinical and multi-omics interactions across all the datasets. When investigating associations across the individuals we found low intra-patient variability, indicating that the changes were stable (Fig. 2b and Supplementary Fig. 19). We found that metformin was significantly associated with 12 clinical markers of T2D such as insulin clearance, active GLP-1, glucose levels from mixed-meal glucose tolerance test, glucose sensitivity, and blood pressure (Fig. 2a and Supplementary Data 2–4). The directions of some of the associations were opposite to the expected metformin effects; for example, metformin was associated with decreased glucose sensitivity at baseline (average Z-score change −0.029, confidence intervals [−0.030, −0.029]). This could be due to confounding by indication in terms of the study design, where newly diagnosed T2D individuals who have been prescribed metformin are expected to have more severe clinical T2D values compared to individuals not needing medical treatments35,36. Therefore, since all individuals have T2D, the confounding effect of their diabetic status could not be disentangled from the effect of metformin. When investigating the multi-omics associations of metformin we found that two of the seven associated proteins (ERAP2 and CD40L) could be linked to the immune system (Fig. 3a and Supplementary Data 4). Similarly, for the transcriptomics data we found CXCL8 and CD177 to be altered by metformin, where the former has been shown to be altered in healthy individuals and cancer patients37,38,39. In the targeted metabolomics data we identified a significant enrichment of metabolites associated with aminoacyl-tRNA biosynthesis (hypergeometric test, P = 2.2 × 10⁻⁴, FDR corrected). This pathway has previously been associated with metformin in functional pathway analysis of microbial change in mice40. Finally, for the untargeted metabolomics data, metformin had the highest number of associations of any drug (22 associations), indicating that new metabolic effectors of metformin treatment could potentially be identified (Supplementary Fig. 17 and Supplementary Table 4).
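For readers unfamiliar with the hypergeometric enrichment test, a minimal sketch is given below. The function name and the example counts are hypothetical and do not correspond to the metabolite sets analyzed here; multiple-testing correction is also left out.

```python
from math import comb

def hypergeom_enrichment_p(population, successes, draws, observed):
    """Upper-tail probability P(X >= observed) when drawing without replacement.

    population: all measured features (for example, all metabolites)
    successes:  features annotated to the pathway of interest
    draws:      features significantly associated with the drug
    observed:   associated features that fall in the pathway
    """
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / comb(population, draws)

# Hypothetical counts: 20 metabolites, 5 in the pathway, 4 associated, 3 of them in the pathway.
p_value = hypergeom_enrichment_p(population=20, successes=5, draws=4, observed=3)
```

With these toy counts the tail sums the two possible outcomes of three or four pathway members among the four associated metabolites, giving 155/4845, or about 0.032.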

Fig. 2: Significant associations between drugs, clinical, and multi-omics features.

a, Significant associations between drugs and clinical features. Effects are given as effect size (z-scaled units) from negative (blue) to positive (red). Significant associations identified by both MOVE t-test and MOVE Bayes are indicated using a star. Features (y-axis) and drugs (x-axis) are clustered using hierarchical clustering on the basis of Euclidean distances. b, As in a but showing per individual-level associations of metformin to multi-omics features demonstrating that associations are highly stable across individuals. Features (y-axis) and newly diagnosed T2D individuals (x-axis).

Fig. 3: Drug associations with metagenomics species and drug–drug similarities.

a, Display of effect sizes (z-scaled units) for (outer to inner) metformin, simvastatin, atorvastatin, omeprazole, lansoprazole, paracetamol, and codeine. Only significant associations to any of the drugs are shown and effect size is visualized as brown (negative), gray (none), and green (positive). Selected omics features are indicated. The Gene Ontologies element represents significantly over-represented Gene Ontology terms using transcriptomics (hypergeometric test, FDR < 0.05) (green). The innermost ring indicates SHAP importance for the individual features in the encoding from input data to the latent representation. b, Effect size (z-scaled units) (x-axis) of the human gut metagenomics species that were significantly associated with metformin (orange) or omeprazole (teal). c, Drug–drug similarities by comparing drug-response profiles across the multi-omics datasets. Cosine similarity indicated from no similarity (blue) to identical profiles (red). d, Average effect (z-score) of drugs for the omics datasets. All 20 drugs are shown, however, only metformin (red), omeprazole (purple), atorvastatin (green), and simvastatin (blue) are indicated. All other drugs are colored gray without a text label. e, Distribution of multi-omics ranks for the different drugs. The ranks are determined as a number between 1–20 (drugs) on the basis of the average effect size from d. The boxes are colored according to number of individuals taking a particular drug from 0 (white) to 323 (purple). There was no correlation between rank scores and number of individuals taking a drug (PCC = 0.14). The lower and upper hinges correspond to the first and third quartiles. The upper and lower whiskers extend from the hinge to the highest and lowest values, respectively, but no further than 1.5× interquartile range from the hinge. Data beyond the ends of whiskers are outliers and are plotted individually.

Association of metformin and omeprazole with gut microbiota

Recent studies have shown how drug intake can influence the human gut microbiome composition41,42. Here we found metformin and omeprazole to be the only drugs to have significant associations to the metagenomics data, with an increase of eleven metagenomics species as well as a decrease of six other species (Fig. 3b). Remarkably, the findings of increased Escherichia coli and decreased levels of Intestinibacter bartlettii and Peptostreptococcaceae sp. have previously been reported in healthy individuals taking metformin in an intervention study43 (Supplementary Data 4). As the study first reporting the findings was performed in healthy individuals, the changes are most likely not explained by factors other than metformin treatment. For omeprazole, a proton pump inhibitor (PPI), we identified three Streptococcus species to be significantly increased (Streptococcus sp., Streptococcus parasanguinis, and Streptococcus vestibularis) (Supplementary Data 4). Previous work by others has specifically shown PPIs to influence the abundance of Streptococcus parasanguinis and Streptococcus vestibularis in the human gut44. Interestingly, both omeprazole and lansoprazole target the K-transporter ATPase alpha channel 1 and increase pH in the stomach. The two drugs, however, differ in how quickly they act: omeprazole elicits its effect at a slower rate than lansoprazole45. This, in combination with more individuals being administered omeprazole (125) compared to lansoprazole (57), could explain why we identified significant alterations of gut microbiota for omeprazole and not lansoprazole.

Statins were associated with decreased low-density lipoprotein and cholesterol

Next, we investigated associations for the two statins, simvastatin and atorvastatin, which are widely used to treat high blood cholesterol by lowering low-density lipoprotein (LDL)46. In agreement with their potential to treat dyslipidemia, we found both LDL and overall cholesterol levels to be significantly associated and decreased, with average LDL z-score changes of −0.039 (CI [−0.040, −0.038]) and −0.015 (CI [−0.016, −0.014]) for simvastatin and atorvastatin, respectively (Supplementary Data 4). This effect could be a consequence of many of the participants having been administered statins before their T2D diagnosis (simvastatin median duration 1.9 years and atorvastatin median duration 1.7 years; Supplementary Table 8), thereby increasing the chance of observing the effect of the drug with reduced confounding by indication. Interestingly, we noticed that besides the downregulation of LDL and general cholesterol levels, some of the remaining clinical associations differed between the two drugs. Simvastatin was associated with an increase in the health marker high-density lipoprotein (HDL) cholesterol, whereas atorvastatin was associated with a decrease. This agrees with known effects of the two statins on HDL, where simvastatin and atorvastatin, respectively, increase and decrease HDL levels with increasing doses47.

Different molecular profiles of simvastatin and atorvastatin

When investigating the multi-omics associations, the two statins had diverse effects across the omics data (Fig. 3a and Supplementary Figs. 14–18 and 20). In agreement with the analysis of the clinical data, we found simvastatin to be significantly associated with downregulation of cholesterol homeostasis (Hypergeometric test, P = 0.005, FDR) and lipid transportation pathways (Hypergeometric test, P = 0.002, FDR) from the enrichment analysis of the associated transcripts (Fig. 3a and Supplementary Data 4 and 5). Specifically, we identified changes in LDLR, SREBF2, ABCA1, and ABCG1 expression, previously associated with simvastatin usage and accumulation of fatty acid and triglyceride in the liver through different pathways48,49,50,51,52 (Supplementary Data 4). In the proteomics data of atorvastatin, we identified known associations to FADS1 (ref. 53), as well as EIF2AK3, which has been reported associated with cholesterol homeostasis54,55. Additionally, two insulin growth factor binding proteins (IGFBP1 and IGFBP4) were associated with atorvastatin and IGFBP4 for simvastatin as well (Supplementary Data 4). These have previously been reported specifically for people with T2D and atorvastatin use54,56. Finally, in the targeted metabolomics data, we identified simvastatin to be associated with an increase in glycine levels, which in low systemic concentration has been associated with obesity and T2D57 (Supplementary Data 4). Furthermore, we observed a decrease of several phosphatidylcholines (11 of 17 decreased metabolites), and an increase of sphingomyelin and ceramide (2 of 11 increased metabolites), a ratio which has previously been shown to be altered with high doses of simvastatin compared to other statins58 (Supplementary Data 24). For atorvastatin, we observed a non-significant decrease of glycine levels and that the overall ratio of sphingomyelin and ceramide decreased (4 of 13 decreased metabolites).

Drug polypharmacy and similarity across multi-omics data

We then investigated similarities between drugs and their multi-omics associations. Overall, we observed four clusters containing three to six drugs each and found that some of the drugs within a cluster could potentially be associated with polypharmacy (Fig. 3c). Therefore, we investigated the impact of a drug–drug combination on the associations and found a correlation between overall drug association similarity and the individuals taking the two drugs (PCC 0.75, P value of 2.2 × 10−35). This finding indicates possible polypharmacy effects introduced by taking the two drugs together resulting in a higher drug–drug similarity across all clinical and multi-omics changes. However, some of the similarities might to some extent be driven by overlapping patient groups and non-drug-related similarities such as the underlying reason for taking the drug. An example could be the drug similarity cluster of Ramipril, Acetylsalicylic Acid, Bisoprolol, Amlodipine and Atorvastatin, which can be linked to cardiovascular diseases. Furthermore, the drugs that had the most similar drug and multi-omics associations were codeine and paracetamol with a cosine similarity of 0.78. Most (38 of 46) of the individuals in the cohort taking codeine were also taking paracetamol while a large fraction of individuals (52 of 90) was only taking paracetamol. We therefore cannot rule out that the correlated multi-omics profiles of the two drugs could be driven by the partial overlap leading to similar latent representation and model reconstructions. Finally, we investigated known drug–drug interactions and association with drug multi-omics profiles; however, found no statistically significant correlations (Supplementary Note and Supplementary Fig. 21).

The effects of drugs are widespread across the omics data

Currently, there are widespread efforts in investigating drugs and gut microbiome interactions suggesting that the microbiome is a potential target and mediator of drug effect42,59,60. As we investigated several multi-omics datasets besides the gut microbiome (metagenomics), we can compare the effect size of the drugs across the omics datasets. Interestingly, we found that the gut microbiome was the dataset with the second fewest number of statistically significant hits across the drugs with 17 significant associations (Supplementary Table 4 and Supplementary Fig. 13). Only diet and wearable data had fewer associations (11); transcriptomics, proteomics, targeted, and untargeted metabolomics had between 44–134 significant associations. We then asked if the effect size of the drugs were different across datasets and determined the cumulative effect size of the drugs in the respective multi-omics datasets. Here we found that the average effect sizes in transcriptomics and metagenomics data were the lowest for all drugs, and that those in the metagenomics dataset were significantly lower compared to all other omics datasets but transcriptomics (ANOVA, Tukey HSD test, adjusted P < 0.05) (Fig. 3d and Supplementary Table 9). When we subset to significant drug–omics associations, of which the gut microbiome only had two drugs with significant associations (metformin and omeprazole), we found that the effect of these two drugs were similar or lower compared to the effect sizes of the other multi-omics datasets (Supplementary Fig. 22). Finally, we investigated if this could be caused by increased uncertainty when learning and reconstructing a given modality but only found small correlations with PCCs of −0.15 to 0.16 between modality uncertainty and inferred effect sizes in a modality (Supplementary Table 10). 
Overall, this observation implies that the multi-omics response to drug stimuli is not restricted to the gut microbiome and that multiple omics datasets should be included when attempting to understand drug effects.

Ranking the impact of drugs in multi-omics data

Finally, we investigated the effect sizes of the individual drugs across the multi-omics datasets. We found that metformin and omeprazole, in general, had the most pronounced effects on the multi-omics data (cumulative rank scores) and that the two statins ranked 14 and 20 out of the 20 drugs (Fig. 3e) where simvastatin had the lowest overall rank of cumulative effect sizes. This analysis was not confounded by the number of individuals taking a particular drug as there was no correlation (PCC = 0.14) between the number of individuals and drug effect. This was opposed to when investigating only significant associations where statins ranked 2 and 4 with high effect sizes (Supplementary Figs. 22 and 23). This observation may indicate that statins had fewer strong effects, whereas, for instance, both metformin and omeprazole with the highest average rank had larger systemic effects.

Discussion

Here we show that it is possible to use unsupervised deep learning to integrate and extract associations from a deeply phenotyped cohort of people with T2D. While existing methods for vertical integration of multi-omics data focus on encoding the data to factors or latent representations that can be used for clustering and classification, we took this further by using the generative capacity of VAE models. In comparison to traditional univariate statistical tests, MOVE can identify significant drug–omics associations for a wider selection of drugs. We believe that these improvements come from the ability of the generative models to infer multi-omics changes for individuals not receiving a drug thus increasing power.

Previous work to stratify the newly diagnosed T2D individuals from this cohort used 32 clinical features to identify four archetypes representing different T2D subtypes7. In addition, they used metformin status of the individuals to investigate if the subgroups were confounded by metformin treatment and found no significant impact on the clusters and their multi-omics correlations. In contrast to their work, we added medication data on 19 additional drugs and used all data as input to our unsupervised deep-learning model allowing the model to learn from all inputs simultaneously. Thus, we were able to identify associations between the drugs and multi-omics data, including for metformin indicating the importance of vertical integration.

The cross-sectional design and clinical data-guided medical decisions make it difficult to assess the directionality of drug associations and further complicate causal inference. Hence, it is not possible to draw causal conclusions on drug effects; however, the results can be considered as input for designing informed studies as well as randomized controlled clinical studies. In the future, expansion with longitudinal multi-omics data and modeling of time could add more information on the causality of the drugs by investigating the long-term effects and associations32.

Similarly, our approach opens up for individualized analysis of patients in an N-of-1 approach61. It is well-known in health care that often selecting a drug or treatment in a situation at the same time excludes performing the control experiment of using another drug. Using MOVE, we can in principle ask what would happen if we gave the patient a drug and compare to the result of choosing another drug. Our cohort size is limited, but for larger cohorts of tens to hundreds of thousands of patients this could potentially be powerful to identify molecular associations and treatment outcomes for individual patients.

Finally, we emphasize that our approach is, of course, not limited to drug associations; in principle, all the omics data could be assessed for associations across the datasets. We therefore believe that our generative method opens new possibilities in big multi-omics data analysis for discoveries of potential new biomarkers, carrying out gedankenexperiments, and investigating potential direct effects of drugs in high dimensionality molecular data that leads to testable hypotheses.

Methods

The cohort

The cohort and available data included in the study are described in detail in Koivula et al.62,63 and Wesolowska–Andersen and Brorsson et al. (ref. 7). In brief, we used the newly diagnosed sub-cohort of the IMI-DIRECT study consisting of 789 participants. Fifty-eight percent of participants were male and participants had the following characteristics at baseline: age 62 (8.1) years; body mass index 30.5 (5.0) kg m−2; fasting glucose 7.2 (1.4) mmol l−1; 2 h glucose 8.6 (2.8) mmol l−1. Participants were diagnosed within 2 years before recruitment and had glycated hemoglobin (HbA1c) < 60.0 mmol mol−1 (<7.6%) within the previous 3 months. All samples represent distinct individuals. Furthermore, while Wesolowska–Andersen and Brorsson et al.7 used data from baseline and follow-up at 18 and 36 months, we only used baseline data for modeling. In addition to the baseline data from Wesolowska–Andersen and Brorsson, we carried out extensive curation and harmonization of the medication records included in the electronic case forms by the research nurses in the different recruitment centers and thus used standardized ATC-annotated medication data for the individuals (see further detail below). Approval for the study protocol was obtained from each of the regional research ethics review boards separately (Lund, Sweden: 20130312105459927; Copenhagen, Denmark: H-1-2012-166 and H-1-2012-100; Amsterdam, Netherlands: NL40099.029.12; Newcastle, Dundee, and Exeter, UK: 12/NE/0132) and all participants provided written informed consent at enrollment. The research conformed to the ethical principles for medical research involving human participants outlined in the Declaration of Helsinki. Further details about the data generation can be found in Wesolowska–Andersen and Brorsson et al.7.

Pre-processing of data

From the clinical, environmental, and questionnaire data only variables with variation across the dataset that were present in at least 10% of the individuals were included. The genomic data was included as the genotypes of risk alleles identified in Mahajan et al.64. In total 393 risk alleles were identified in our cohort out of the 403 associations mentioned in the paper. The genotypes were included as homozygous for risk allele, heterozygote, not having the allele, or missing if the locus was not identified for the individual. Diet data was included as 47 features on self-reported total intake of macronutrients and vitamins across a 24-h period. The wearables measured with an accelerometer included 25 measurements that summarize the movement and heart rate during the day. Transcriptomics data (RNA sequencing) from fasting whole blood samples were processed with RailRNA (v0.2.4b)65 to obtain scaled counts for all samples and only the most variable genes were included. The variable genes were selected by calculating the standard deviation across all individuals for each gene and selecting genes with an above-average standard deviation. Both targeted and untargeted metabolomics data in fasting plasma were included for all measurements passing quality control. In the proteomics data, all measurements within the measurable range based on the OLINK antibody panel were included and residualized for plate layout. The metagenomics data was only available for approximately one-third (256) of the individuals and were included as normalized read counts of identified Metagenomic Species66. Categorical data, including questionnaire responses, drug data, and genomics, was one-hot encoded. 
The continuous data were residualized by the collection center as the data was collected from six different European countries and, thus, handled by different nurses and lab technicians, as well as differences in the time-of-day samples were taken, which could have a large effect on the measurements. Additionally, the data were residualized for age and sex as these could be biological non-disease-related confounders in the data. Lastly, each continuous dataset was z-scale normalized per feature to ensure that each feature was distributed around zero.
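The residualization and scaling steps can be illustrated with a short NumPy sketch. It is not the authors' pipeline: the toy data, the function names, and the choice to regress on center, age, and sex in a single ordinary least-squares fit (rather than in two consecutive steps) are assumptions made only for illustration.

```python
import numpy as np

def residualize(y, covariates):
    """Return the residuals of y after an ordinary least-squares fit on
    the covariates plus an intercept."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def zscore(y):
    """Scale a feature to zero mean and unit standard deviation."""
    return (y - y.mean()) / y.std()

# Hypothetical toy data: one continuous feature measured in three centers.
rng = np.random.default_rng(0)
n = 60
center = rng.integers(0, 3, size=n)
age = rng.normal(62, 8, size=n)
sex = rng.integers(0, 2, size=n)
feature = 0.5 * center + 0.02 * age + rng.normal(size=n)

# One-hot encode the center and drop the first level, which is absorbed
# by the intercept.
center_dummies = np.eye(3)[center][:, 1:]
covariates = np.column_stack([center_dummies, age, sex])
clean = zscore(residualize(feature, covariates))
```

After this step the feature has mean zero and unit variance and is linearly uncorrelated with each covariate, which is the property the z-scaled input relies on.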

Classification of drugs using the ATC system

The ATC system is the WHO classification system for therapeutic drugs. The system has a hierarchical structure, where the topmost level, ‘level 1—Anatomical main group’, specifies the target organ or tissue, and the lowermost level, ‘level 5—chemical substance’, specifies the active chemical compound. The three levels in between specify the therapeutic, pharmacological, and chemical levels, respectively. We, therefore, mapped all drugs to the lowest possible level to prevent information loss. A total of 4,155 entries could be mapped to level 5. For 55 entries, only a higher-level mapping was possible owing to lack of specificity and 43 entries could not be mapped to the ATC system, either because of the compound not existing in the database, for example nutraceutical compounds, or when we were unable to identify which drug was registered for the participant. The ATC system does not only specify compound names, but also administration route and daily dosages for over half of level 5 entries. However, owing to uncertainty of the reliability of the registered dosages, only drug names and administration routes were used for mapping. In instances where the administration route was not available, the drug was mapped by drug name only.
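As an illustration of the hierarchy, a level 5 ATC code can be split into its five levels by prefix. The sketch below assumes the standard seven-character code layout (prefix lengths 1, 3, 4, 5, and 7), which is not spelled out in the text; the function names are hypothetical.

```python
def atc_levels(code):
    """Split a level 5 ATC code into the prefixes for levels 1 to 5."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError("expected a 7-character level 5 ATC code")
    # Assumed standard layout: level 1 = 1 character, level 2 = 3,
    # level 3 = 4, level 4 = 5, level 5 = 7.
    return [code[:k] for k in (1, 3, 4, 5, 7)]

def same_therapeutic_subgroup(code_a, code_b):
    """True if two ATC codes share the same level 2 prefix."""
    return code_a.strip().upper()[:3] == code_b.strip().upper()[:3]

levels = atc_levels("C09AA04")
```

A level 2 comparison of this kind is useful wherever drugs need to be grouped by therapeutic subgroup rather than by individual compound.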

Drug data collection and clean-up

The study participants were asked to register their current drug usage at screening and baseline. Drug names were registered as free text together with administration route, dosage and frequency, and indication. Metformin was recorded separately from other anti-diabetic and non-anti-diabetic drugs. The collected data was variable in quality, using both generic and brand names, which were in many cases specific to the country of the participant. The data was cleaned in four steps: (1) removal of special characters, company names, formulations, and other non-relevant information; (2) automatic mapping to the PubChem database; (3) manual mapping to generic drug names; and (4) mapping to the ATC system. Indications of placebo use, for example participation in clinical drug trials, were noted as such. Only active compounds were included and consequently, possible brand variation was ignored, including for dietary supplements. Drug combinations were mapped, when possible, to the ATC code specifying said combination. However, when the proposed ATC code was less specific than the registered drugs, the drug combinations were mapped to individual ATC codes, that is, ‘Perindopril’ (C09AA04) and ‘Indapamide’ (C03BA11) were used instead of ‘Perindopril and diuretics’ (C09BA04). Entries were mapped to ATC codes with the administration route when possible and otherwise mapped without the administration route. Dosage information was not used in the mapping process. In the manual mapping process, 99.4% of terms were assigned and a total of 359 drugs and drug combinations were identified. A total of 339 drugs (94.4%) were mapped to 441 ATC codes.

Design of the VAE

The VAE framework was constructed to account for a variable number of fully connected hidden layers in both the encoder and decoder and a latent layer that samples from a Gaussian distribution N(0, 1) of two vectors of size NL representing the means, µ, and standard deviations, σ. Each hidden layer included both batch normalization and dropout67, with leaky rectified linear units (LeakyReLU)68 as the activation function. The datasets were concatenated into one input layer of both categorical and continuous variables. To allow for dataset-specific weights, the error calculation was done separately for each dataset. Here we applied cross-entropy loss for categorical data and mean squared error for continuous data as implemented in PyTorch69. The loss was normalized by dataset input size and batch size. Deviation from the Gaussian distribution was penalized by adding the Kullback–Leibler divergence (KLD) to the loss. The final loss was defined as

$$L = \mathbf{W}_{\mathrm{cat}} \times \mathbf{E}_{\mathrm{cat}} + \mathbf{W}_{\mathrm{con}} \times \mathbf{E}_{\mathrm{con}} + \mathbf{W}_{\mathrm{KLD}} \times \mathrm{KLD}$$

Here, Ecat and Econ are vectors of normalized reconstruction errors for each of the categorical and continuous datasets, respectively. Wcat and Wcon are vectors of the same length as the corresponding error vectors and introduce dataset-specific weights. We applied an equal weight of 1 for all datasets except for continuous clinical data, where we used a weight of 2. WKLD is a weight put on the KLD, defined as WKLD = β × NL^−1, where NL is the size of the latent layer; we used a β of 0.0001 for the final model. The KLD was defined as

$$\mathrm{KLD} = \sum -\frac{1}{2}\left(1 + \ln\left(\sigma\right) - \mu^2 - \sigma\right)$$
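The two formulas can be combined in a few lines of NumPy. This is a minimal sketch and not the PyTorch implementation used in the study: it assumes each dataset contributes one already-normalized scalar error, and it assumes the σ term of the KLD formula plays the role of a variance, so that the divergence is zero for µ = 0 and σ = 1. All names and example values are hypothetical.

```python
import numpy as np

def kld(mu, var):
    """KLD as written in the text: sum of -1/2 * (1 + ln(var) - mu^2 - var).
    Assumption: the sigma term of the formula is passed in as `var`."""
    return float(np.sum(-0.5 * (1.0 + np.log(var) - mu**2 - var)))

def total_loss(e_cat, e_con, w_cat, w_con, mu, var, beta=0.0001):
    """Weighted sum of per-dataset reconstruction errors plus weighted KLD."""
    n_latent = mu.shape[-1]
    w_kld = beta / n_latent  # W_KLD = beta * N_L^-1
    return float(np.dot(w_cat, e_cat) + np.dot(w_con, e_con) + w_kld * kld(mu, var))

# Hypothetical errors: two categorical datasets and three continuous ones,
# the first continuous one being clinical data with a weight of 2.
e_cat = np.array([0.3, 0.2])
e_con = np.array([0.5, 0.1, 0.4])
w_cat = np.array([1.0, 1.0])
w_con = np.array([2.0, 1.0, 1.0])
mu, var = np.zeros(150), np.ones(150)

loss = total_loss(e_cat, e_con, w_cat, w_con, mu, var)
```

With a latent code that matches the prior exactly, the KLD term vanishes and the loss reduces to the weighted reconstruction errors.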

To efficiently handle missing data for the continuous features, we encoded missing values as the mean of the respective feature during training and excluded the missing data points during back-propagation. Because the data were z-score normalized, the mean value is represented as zero. Missing categorical features were encoded as a zero vector, and the ignore-index option of the cross-entropy implementation in PyTorch was used to exclude errors for missing data from back-propagation. The VAE model was trained with the Adam optimizer70, starting with a mini-batch size of 10 and increasing the batch size by a factor of 1.25 after every 50 epochs. The number of training epochs was set to 200 on the basis of early stopping on the test set as described below. Additionally, we trained the model using KLD warm-up, slowly increasing the KLD weight at epochs 4, 6, and 8 and including the full KLD after 10 epochs. The latent representation of each patient was obtained by passing their data through the trained VAE and extracting the µ layer. The VAE was implemented using PyTorch69 (v.1.7.0) and run on a GPU with CUDA (v.10.2.89).
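The handling of missing continuous values and the batch-size schedule can be sketched as follows. This is a NumPy illustration rather than the PyTorch code used for training; the function names are hypothetical, and rounding the batch size down to an integer is an assumption.

```python
import numpy as np

def encode_missing(x):
    """Replace missing values by 0, the mean of a z-scored feature."""
    return np.nan_to_num(x, nan=0.0)

def masked_mse(x, x_hat):
    """Mean squared error over observed entries only, so that missing
    values contribute nothing to the loss."""
    observed = ~np.isnan(x)
    if not observed.any():
        return 0.0
    return float(np.mean((x[observed] - x_hat[observed]) ** 2))

def batch_size_at(epoch, start=10, factor=1.25, every=50):
    """Batch size after multiplying by `factor` every `every` epochs.
    Assumption: fractional sizes are rounded down."""
    return int(start * factor ** (epoch // every))

x = np.array([[1.0, np.nan], [0.0, 2.0]])
x_hat = np.array([[1.0, 5.0], [1.0, 2.0]])
error = masked_mse(x, x_hat)
```

In the example, the reconstruction of the missing entry (5.0) is ignored and the error is averaged over the three observed values only.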

Hyperparameter optimization for multi-omics integration

We initially divided the dataset into training (90%) and test (10%) sets to identify the optimal hyperparameter settings to efficiently capture the data structure without losing the ability to generalize on the test data (Supplementary Figs. 2 and 3). We tested different combinations of sizes of hidden layers, the number of hidden layers, size of latent space, dropout, and weight on the KLD. We then evaluated the model on the basis of both test log-likelihood and reconstruction accuracy. For the number of hidden neurons, the variations used were 200, 500, 800, 1,000, and 1,200, with the number of layers ranging between 1 and 5. The tested latent sizes were between 20 and 400 as well as dropout of 10%, 20%, and 30% and KLD weights of 0.001, 0.0001, and 0.0001. We defined an accurate reconstruction for categorical variables as the class with the highest probability corresponding to the class given by the input. For continuous variables, the accuracy was assessed by comparing the reconstructed array with the input array using cosine similarity for each individual instead of using exact matching. For both categorical and continuous data only non-missing values were used when calculating the accuracy in the reconstruction. We chose the number of training epochs on the basis of when the optimal test likelihood was achieved during testing rounded up to the nearest 100 epochs to ensure sufficient training to learn the complexity of the data. Here we found that more complex models, with higher numbers of hidden neurons and layers, resulted in worse performance on the test set (Supplementary Fig. 2) and that models with more than one hidden layer were unable to provide a decent reconstruction on the training data without overfitting. The only exception was the size of the latent representation, which gave a worse performance with smaller sizes (<50) and equally good performance for larger sizes (from 100 to 400) (Supplementary Fig. 3). 
For the five best performing models, stability was measured to choose the final model. The stability of the model was evaluated by repeating training with the same hyperparameters and calculating the difference in cosine similarity of the latent space to all other individuals. If the model produced the same result the average change in cosine similarity should be zero. The model with the average change closest to zero was then considered the most stable. The final hyperparameters were set to be one hidden layer of 2,000 neurons, a latent size of at least 100, and a 10% dropout for regularization.
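One way to express the stability measure is to compare the pairwise cosine similarities of the latent representations from two training runs. The sketch below is an interpretation of the description above, not the authors' code; in particular, taking the mean absolute change over all pairs of distinct individuals is an assumption.

```python
import numpy as np

def cosine_similarity_matrix(z):
    """Pairwise cosine similarity between the rows of a latent matrix."""
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    return unit @ unit.T

def stability(z_a, z_b):
    """Mean absolute change in pairwise cosine similarity between two
    runs; 0 means the two runs give identical similarity structures."""
    diff = cosine_similarity_matrix(z_a) - cosine_similarity_matrix(z_b)
    off_diagonal = ~np.eye(len(z_a), dtype=bool)
    return float(np.mean(np.abs(diff[off_diagonal])))

# Hypothetical latent spaces for 8 individuals from two repeated runs.
rng = np.random.default_rng(0)
run_1 = rng.normal(size=(8, 5))
run_2 = run_1 + 0.1 * rng.normal(size=(8, 5))
score = stability(run_1, run_2)
```

Comparing similarity structures rather than raw coordinates makes the measure insensitive to rescaling of the latent space between runs.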

Evaluating feature importance

Feature importance was extracted from the weights of the network for the models with only one hidden layer. Because the input data was z-score normalized, it was calculated as

$$I_i = \sum\limits_{j = 1}^{n_{\mathrm{hidden}}} \left| w_{ij} \right|$$

where Ii is the importance of the ith input feature and \(\left| w_{ij} \right|\) is the absolute value of the weight from the ith input to the jth hidden neuron. To assess the actual impact on the latent representation, an adaptation of the SHAP19 analysis was applied. The difference in model performance was assessed as the absolute difference in the latent representation when changing each input to missing for all individuals and passing the data through the trained model.
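The weight-based importance is a row-wise sum of absolute weights. A minimal NumPy sketch, assuming the first-layer weights are stored with one row per input feature and one column per hidden neuron (the layout and the example values are hypothetical):

```python
import numpy as np

def feature_importance(weights):
    """I_i = sum over hidden neurons j of |w_ij|.
    `weights` has shape (n_inputs, n_hidden)."""
    return np.abs(weights).sum(axis=1)

# Hypothetical weights for three inputs and two hidden neurons.
w = np.array([[1.0, -2.0],
              [0.0, 0.5],
              [-3.0, 3.0]])
importance = feature_importance(w)
```

Because positive and negative weights both count toward the sum, an input with large weights of opposite sign still receives a high importance.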

Extracting significant drug associations

Drug associations were extracted by perturbation of the input data after training the final model on all individuals. Thus, for each drug we changed the drug status for all individuals with ‘not receiving’ to ‘receiving’. Importantly, we only included individuals that did not receive the specific drug or another drug within the same therapeutic subgroup (ATC level 2). Then, for each drug change, we compared the change in reconstructions to when we passed the original (un-perturbed) data through the network. In other words we determined the differences that the network infers from the change in drug status that during training was learned from all individuals receiving the drug. We used two strategies for this, one was based on an ensemble of Student’s t-tests using benchmarked thresholds, and another was based on Bayesian decision theory. Both approaches were benchmarked against randomized datasets where all the input data matrices were shuffled on rows and columns. We simulated effects in the shuffled data by randomly sampling a combination of a drug, a multi-omics dataset, and a feature within that omics dataset. For each combination, we then sampled an effect from the standard normal distribution N(0,1) and added this value to the omics feature whenever the selected drug was taken by an individual. We, therefore, did not expect that all effects would be significant in the statistical tests because we sample from N(0,1) and some effects will be close to 0. We added a total of 100 effects to the shuffled data and repeated the entire procedure to generate two shuffled datasets each with their unique added effects. Additionally, we investigated if the number of significant associations, effect size estimates and model uncertainty in the reconstruction were not biased by individual dataset uncertainties. 
This was done by calculating PCCs between the average estimated effect size across all 20 drugs and the difference between model input and the reconstructions for each of the omics features.
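The perturbation idea can be sketched independently of the network itself. In the example below, `toy_reconstruct` is a hypothetical linear stand-in for the trained VAE, and the function names, array shapes, and toy effect values are assumptions; only the logic of switching a drug from 'not receiving' to 'receiving' and comparing reconstructions follows the description above.

```python
import numpy as np

def perturbation_effect(reconstruct, omics, drug_status, drug, eligible):
    """Mean change in each reconstructed feature when `drug` is set to
    'receiving' for the eligible individuals (those not on the drug or
    on another drug in the same ATC level 2 group)."""
    baseline = reconstruct(omics[eligible], drug_status[eligible])
    perturbed_status = drug_status[eligible].copy()
    perturbed_status[:, drug] = 1
    perturbed = reconstruct(omics[eligible], perturbed_status)
    return (perturbed - baseline).mean(axis=0)

# Hypothetical stand-in for the trained model: each drug shifts the
# features linearly. Shape: (n_drugs, n_features).
toy_effects = np.array([[0.5, 0.0, -0.2],
                        [0.0, 0.1, 0.0]])

def toy_reconstruct(x, status):
    return x + status @ toy_effects

rng = np.random.default_rng(0)
omics = rng.normal(size=(5, 3))
status = np.array([[0, 0], [1, 0], [0, 1], [0, 0], [0, 1]])
eligible = status[:, 0] == 0
effect = perturbation_effect(toy_reconstruct, omics, status, 0, eligible)
```

With the toy model, the recovered effect equals the first row of `toy_effects`, showing that the difference between perturbed and baseline reconstructions isolates what the model attributes to the drug.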

Significant associations using MOVE t-test

To evaluate if the change in the reconstruction was significant, we first determined the expected average change when passing the original and perturbed data through the model ten times. On the basis of these averages, we used a Student’s t-test for related samples as implemented in Python SciPy (v.1.3.1)71 between the baseline and drug-perturbed data for all non-missing continuous data. All P values were subsequently Bonferroni-corrected independently for each drug, and we applied a significance threshold of adjusted P < 0.05. We repeated the entire analysis with retraining of the model 10 times for each of four latent sizes (150, 200, 250, and 300). Associations were only included for analysis if they were significant for at least three of the four latent sizes and in at least five out of ten of the repeats. Therefore, reported P values were the averaged P value across the 10 replicate and 4 model tests, that is a total of 40 two-sided Bonferroni-corrected t-tests. The change in reconstruction, what we report as effect size, was calculated as the average difference across the 10 replicates and 4 model tests and were reported with 95% confidence intervals.
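The inclusion rule can be written as a filter over a boolean array of test outcomes. The sketch below reads the rule as: an association must be significant in at least five of ten repeats for a given latent size, and this must hold for at least three of the four latent sizes. That reading, the array layout, and the function names are assumptions.

```python
import numpy as np

def bonferroni(p_values, n_tests):
    """Bonferroni-adjusted P values, capped at 1."""
    return np.minimum(np.asarray(p_values) * n_tests, 1.0)

def robust_associations(significant, min_repeats=5, min_latent_sizes=3):
    """`significant` is a boolean array of shape
    (n_latent_sizes, n_repeats, n_features)."""
    # A latent size supports a feature if enough repeats are significant.
    per_latent = significant.sum(axis=1) >= min_repeats
    # Keep features supported by enough latent sizes.
    return per_latent.sum(axis=0) >= min_latent_sizes

# Hypothetical outcomes for 4 latent sizes, 10 repeats, and 2 features.
sig = np.zeros((4, 10, 2), dtype=bool)
sig[:3, :5, 0] = True   # feature 0: 5 repeats in 3 latent sizes
sig[:, :4, 1] = True    # feature 1: only 4 repeats in every latent size
kept = robust_associations(sig)
```

Feature 0 passes both thresholds and is kept, whereas feature 1 never reaches five significant repeats and is dropped.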

Significant associations using Bayes decision theory

For the method that was based on Bayesian decision theory we used an approach inspired by single-cell variational inference29 and Lopez et al.31. We trained VAE models with a latent size of 150 neurons and benchmarked the approach using different latent sizes and ensembling 1, 5, 10, 20, 30, 35, 40, or 50 models, which we termed refits. For the refits we averaged the reconstructions and used these to obtain the posteriors for the non-perturbed data and each of the drug perturbations. Thus, for VAE ensemble refit i, individual n, feature f, and drug d we define the variational reconstructions as \(\hat{x}_{infd}\). By averaging across VAE refits, we obtain estimates of the average posteriors \(\hat{x}_{nfd}\). Then, for each drug d we compare between two models: \(M_d^f\) where feature f is significantly associated with the drug, and the alternative model \(M_0^f\) where feature f is not significantly associated with drug d. Hence, we evaluate how often \(\left| \hat{x}_{nfd} - \hat{x}_{nf0} \right| > 0\) and calculate Bayes factors (K) as:

$$K = \log_e \left| \frac{\mathrm{P}\left( M_d^f \mid \hat{x}_{fd},\, \hat{x}_{f0} \right)}{\mathrm{P}\left( M_0^f \mid \hat{x}_{fd},\, \hat{x}_{f0} \right)} \right|$$

We ranked the associated features according to K (ref. 72). We set an FDR of α by accepting associations (n) between features and a drug until the cumulative evidence of \(\mathrm{P}(M_0)\) across accepted features for the drug was above the threshold. Since \(\mathrm{P}(M_0^f) = (1 - \mathrm{P}(M_d^f))\), we accepted drug–feature associations while the cumulative evidence E is lower than α

$$E = \sum\limits_f \frac{(1 - \mathrm{P}(M_d^f))}{n} < \alpha$$
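The acceptance rule can be sketched for a single drug as follows. The per-feature probabilities P(M_d^f) are taken as given inputs here (how they are estimated from the reconstructions is not reproduced), the example values are hypothetical, and probabilities are assumed to lie strictly between 0 and 1 so that the logarithm is finite.

```python
import numpy as np

def bayes_factor(p_assoc):
    """K = log_e(P(M_d) / P(M_0)) with P(M_0) = 1 - P(M_d)."""
    p_assoc = np.asarray(p_assoc, dtype=float)
    return np.log(p_assoc / (1.0 - p_assoc))

def accept_by_cumulative_evidence(p_assoc, alpha=0.05):
    """Accept features in order of decreasing K while the running mean
    of P(M_0) over the accepted features stays below alpha."""
    p_assoc = np.asarray(p_assoc, dtype=float)
    order = np.argsort(-bayes_factor(p_assoc))
    p_null = 1.0 - p_assoc[order]
    evidence = np.cumsum(p_null) / np.arange(1, len(p_assoc) + 1)
    # The running mean of sorted values never decreases, so the number
    # of entries below alpha is the length of the accepted prefix.
    n_accept = int(np.sum(evidence < alpha))
    accepted = np.zeros(len(p_assoc), dtype=bool)
    accepted[order[:n_accept]] = True
    return accepted

# Hypothetical association probabilities for four features of one drug.
p = np.array([0.99, 0.50, 0.97, 0.90])
accepted = accept_by_cumulative_evidence(p, alpha=0.05)
```

In this example the running evidence is 0.01, 0.02, and about 0.047 for the three strongest features, and jumps to 0.16 once the fourth is added, so three associations are accepted at α = 0.05.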

Benchmarking of t-test, MOVE t-test and MOVE Bayes

To be able to compare the number of significant associations between methods we used the two randomized datasets to estimate FDR from the ground truth, that is the added drug–omics effects (Supplementary Table 3). Here we found that a t-test with Benjamini–Hochberg FDR of 0.01 had ground-truth FDR of 0.00 and 0.06 on the two randomized datasets, corresponding to 52 and 67 true positives as well as 0 and 4 false positives, respectively. For MOVE t-test, we benchmarked the number of refits of the 4 models and found 10 refits to have a ground-truth FDR of 0.02 and 0.06, with 48 and 61 true positives as well as 1 and 3 false positives, respectively. For MOVE Bayes we benchmarked the number of refits for a model with 150 latent neurons and found FDR from the cumulative evidence to be well aligned with FDR of the ground truth. Using Bayes FDR of 0.05 we found 30 refits to have ground-truth FDR of 0.02 and 0.05, respectively. Across the two shuffled datasets 42 and 59 true positives were found by all three methods (Supplementary Fig. 12).
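The ground-truth FDR used in this benchmark is the fraction of called associations that were not among the added effects. A small sketch with hypothetical identifiers, reusing the counts reported above for the t-test on the second randomized dataset (67 true positives and 4 false positives):

```python
def ground_truth_fdr(called, truth):
    """Return (true positives, false positives, FDR) for a set of called
    associations given the set of effects that were actually added."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    fp = len(called - truth)
    fdr = fp / (tp + fp) if called else 0.0
    return tp, fp, fdr

# Hypothetical labels: 100 added effects, of which 67 are recovered,
# plus 4 calls that were never added.
truth = {("added", i) for i in range(100)}
called = {("added", i) for i in range(67)} | {("spurious", i) for i in range(4)}
tp, fp, fdr = ground_truth_fdr(called, truth)
```

Here the FDR is 4 / 71, which rounds to the 0.06 reported for that dataset.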

Calculation of drug associations using other methods

We compared our findings to associations identified with standard statistical approaches using Student’s t-test for unrelated samples and an ANOVA between two groups of individuals ‘not receiving’ and ‘receiving’ each drug. Here we used Benjamini–Hochberg correction for FDR73 with an adjusted P < 0.01. Additionally, we tested if a least absolute shrinkage and selection operator (LASSO) model was able to identify features with significant impact on predicting the ‘not receiving’ or ‘receiving’ groups for each drug. However, the LASSO model was unable to converge possibly owing to the high input feature dimensionality. All statistical tests were done with Python SciPy (v.1.3.1)71.
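For reference, the Benjamini–Hochberg adjustment can be written in a few lines of NumPy. This is a generic sketch of the standard step-up procedure, not the code used in the study, and the example P values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the k-th smallest P value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest P value downwards.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = adjusted < 0.01
```

With these four values the adjusted P values are 0.02, 0.04, 0.04, and 0.02, so none would pass the adjusted P < 0.01 threshold.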

Drug effect size and similarities across omics data

Drug effect sizes were determined as the difference between the baseline and drug-perturbed variational reconstructions, that is, as the average difference across the VAE ensemble refits reported with 95% confidence intervals. Drug similarities were calculated as the cosine similarity as implemented in Python SciPy (v.1.3.1)71 between the average effect sizes on all features identified as significantly associated for at least one of the drugs both across and within each dataset. The difference was only calculated for non-missing data and individuals not already on the drug or a drug in the same ATC group. The rank of drug effect sizes was determined for each omics dataset by ranking the effect sizes from 1 to 20. A rank of 20 indicates that the drug had the highest average effect size in this omics dataset compared to the other drugs. Correlations between multi-omics profiles and number of individuals taking the drug pair were calculated from the fraction of individuals that overlapped between the two drugs.
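Both the similarity and the ranking are simple vector operations. The sketch below uses plain NumPy instead of SciPy, hypothetical effect-size values, and does not treat ties in the ranking specially; it is meant only to make the two definitions concrete.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two effect-size profiles."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_effect_sizes(average_effect):
    """Rank drugs within one dataset: 1 = lowest average effect size,
    n = highest. Ties are broken by position (an assumption)."""
    order = np.argsort(average_effect)
    ranks = np.empty(len(average_effect), dtype=int)
    ranks[order] = np.arange(1, len(average_effect) + 1)
    return ranks

# Hypothetical effect-size profiles of two drugs over three features,
# and average effect sizes of three drugs in one omics dataset.
drug_a = np.array([0.2, -0.1, 0.4])
drug_b = np.array([0.1, -0.05, 0.2])
similarity = cosine_similarity(drug_a, drug_b)
ranks = rank_effect_sizes(np.array([0.3, 0.1, 0.9]))
```

The two example profiles point in the same direction and therefore have a cosine similarity of 1, even though their magnitudes differ by a factor of two.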

Molecular-focused analysis of the multi-omics data

To get a better understanding of the molecular profiles identified in the associations for the transcriptomics and proteomics data we tested for enriched Gene Ontology terms as well as molecular pathways. For the transcriptomics data, we assessed the molecular patterns of biological processes and pathways from Reactome74 (v.3.7) using the significantly associated genes for each drug against a background list of all genes included in the data integration. We used WebGestaltR75 (v.0.4.4) for the analysis with default settings (hypergeometric test) and evaluated all results with an FDR < 0.05. The targeted metabolomics data was analyzed for potential metabolite enrichments using MetaboAnalyst76 (v.5) over-representation analysis using a hypergeometric test and FDR of 0.05. We investigated both enrichments in known pathways in the KEGG database as well as enrichment of chemical structures sub-, main- and super-class levels. For all analyses, we used the included panel of targeted metabolites as the reference data.
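The hypergeometric over-representation test underlying these enrichment analyses can be computed directly with the standard library. This sketch is not the WebGestaltR or MetaboAnalyst implementation; the function name, argument names, and example counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_in_term, n_selected, n_overlap):
    """Probability of observing at least `n_overlap` term members when
    drawing `n_selected` items without replacement from a background of
    `n_background` items, `n_in_term` of which belong to the term."""
    upper = min(n_in_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 10 background genes, 3 in the pathway, and all
# 3 selected genes fall in the pathway.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
```

The background is the list of features that entered the analysis, which is why the choice of reference set matters for the resulting P values.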

Association differences within diabetes archetypes

As mentioned, previous work by Wesolowska-Andersen and Brorsson et al. performed archetype analysis of the multi-omics data with only metformin medication data7. They based the archetypes on clinical markers and identified four distinct T2D archetypes and one ‘mixed’ archetype, each with clinical and omics profiles. To investigate whether these distinct archetypes differed in their drug associations, we used a t-test on the average effect size change for the individuals of each archetype against the remaining individuals. The analysis was done only for the significant associations of each drug and, as in the main analysis, only for individuals not taking the drug or a drug within the same ATC therapeutic class.
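
The one-versus-rest comparison can be sketched as follows. The text does not state which variant of the t-test was used, so this example assumes Welch's unequal-variance form; the archetype labels and effect size values are hypothetical. Only the test statistic and degrees of freedom are computed here; a p-value could be obtained, for instance, with `scipy.stats.ttest_ind(group, rest, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(group, rest):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    n1, n2 = len(group), len(rest)
    v1, v2 = variance(group) / n1, variance(rest) / n2
    t = (mean(group) - mean(rest)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical per-individual effect size changes for one drug-feature association.
effect_change = [0.30, 0.25, 0.35, 0.05, 0.00, 0.10, 0.02, 0.08]
archetype = ["A", "A", "A", "B", "B", "C", "C", "C"]

# Compare each archetype against all remaining individuals.
results = {}
for label in sorted(set(archetype)):
    group = [e for e, a in zip(effect_change, archetype) if a == label]
    rest = [e for e, a in zip(effect_change, archetype) if a != label]
    results[label] = welch_t(group, rest)
```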

Drug–drug interactions

We used an in-house drug–drug interaction compendium generated from publicly available sources (Supplementary Table 11) to assess whether drug combinations had previously been reported to interact77. The compendium contains interactions from 26 different datasets covering pharmacovigilance, clinically oriented information, schemas for NLP corpora, and drug–cytochrome P450 relationships. For 12 of the drug–drug pairs in our dataset, we could identify drug–drug interactions with a reported severity (major, moderate, minor, possible, undetermined, or none) indicating clinical significance.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Owing to the informed consent given by study participants, the various national ethical approvals for the present study, and the European General Data Protection Regulation (GDPR), individual-level clinical and omics data cannot be transferred from the centralized IMI-DIRECT repository. Requests for access to summary statistics of the IMI-DIRECT data, including those presented here, can be made to di**************@*******ac.uk. Requesters will be informed on how summary-level data can be accessed via the DIRECT secure analysis platform following submission of an appropriate application. The IMI-DIRECT data access policy is available at https://directdiabetes.org. Example data is available at https://github.com/RasmussenLab/MOVE/ for testing of MOVE. As described in the methods section we used ATC (https://www.who.int/tools/atc-ddd-toolkit/atc-classification) and WebGestalt (v.0.4.4 at http://www.webgestalt.org) for analysis of Gene Ontologies, Reactome (v.3.7 at https://reactome.org) for analysis of molecular pathways, and MetaboAnalyst (v.5 at https://www.metaboanalyst.ca) for analysis of targeted metabolomics data. The 25 databases of drug–drug interactions are listed in Supplementary Table 11. Source data are provided with this paper.

Code availability

References

  1. Fares, H., DiNicolantonio, J. J., O’Keefe, J. H. & Lavie, C. J. Amlodipine in hypertension: a first-line agent with efficacy for improving blood pressure and patient outcomes. Open Heart 3, e000473 (2016).

  2. Hu, J. X., Thomas, C. E. & Brunak, S. Network biology concepts in complex disease comorbidities. Nat. Rev. Genet. 17, 615–629 (2016).

  3. Austin, R. P. Polypharmacy as a risk factor in the treatment of type 2 diabetes. Diabetes Spectr. 19, 13–16 (2006).

  4. Zhou, W. et al. Longitudinal multi-omics of host–microbe dynamics in prediabetes. Nature 569, 663–671 (2019).

  5. Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biol. 18, 83 (2017).

  6. Gudmundsdottir, V. et al. Whole blood co-expression modules associate with metabolic traits and type 2 diabetes: an IMI-DIRECT study. Genome Med. 12, 109 (2020).

  7. Wesolowska-Andersen, A. et al. Four groups of type 2 diabetes contribute to the etiological and clinical heterogeneity in newly diagnosed individuals: an IMI DIRECT study. Cell Reports Medicine 3, 100477 (2022).

  8. Song, J. W. & Chung, K. C. Observational studies: cohort and case-control studies. Plast. Reconstr. Surg. 126, 2234–2242 (2010).

  9. Picard, M., Scott-Boyer, M.-P., Bodein, A., Périn, O. & Droit, A. Integration strategies of multi-omics data for machine learning analysis. Comput. Struct. Biotechnol. J. 19, 3735–3746 (2021).

  10. Nicora, G., Vitali, F., Dagliati, A., Geifman, N. & Bellazzi, R. Integrated multi-omics analyses in oncology: a review of machine learning methods and tools. Front. Oncol. 10, 1030 (2020).

  11. Rohart, F., Gautier, B., Singh, A. & Lê Cao, K.-A. mixOmics: an R package for ’omics feature selection and multiple data integration. PLoS Comput. Biol. 13, e1005752 (2017).

  12. Chung, N. C. et al. Unsupervised classification of multi-omics data during cardiac remodeling using deep learning. Methods 166, 66–73 (2019).

  13. Kriebel, A. R. & Welch, J. D. UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization. Nat. Commun. 13, 780 (2022).

  14. Argelaguet, R. et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14, e8124 (2018).

  15. Shen, R., Olshen, A. B. & Ladanyi, M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25, 2906–2912 (2009).

  16. Singh, A. et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics 35, 3055–3062 (2019).

  17. Kingma, D. P. & Welling, M. Auto-Encoding Variational Bayes. Preprint at arXiv https://doi.org/10.48550/arXiv.1312.6114 (2013).

  18. Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. Preprint at arXiv https://doi.org/10.48550/arXiv.1401.4082 (2014).

  19. Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021).

  20. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).

  21. Chaudhary, K., Poirion, O. B., Lu, L. & Garmire, L. X. Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clin. Cancer Res. 24, 1248–1259 (2018).

  22. Zhang, L. et al. Deep learning-based multi-omics data integration reveals two prognostic subtypes in high-risk neuroblastoma. Front. Genet. 9, 477 (2018).

  23. Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat. Biotechnol. 40, 1458–1466 (2022).

  24. Mattei, P.-A. & Frellsen, J. MIWAE: deep generative modelling and imputation of incomplete data. In Proceedings of the 36th International Conference on Machine Learning 4413–4423 (PMLR, 2019).

  25. Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac. Symp. Biocomput. 23, 80–91 (2018).


  26. Allesøe, R. L. et al. Deep learning-based integration of genetics with registry data for stratification of schizophrenia and depression. Sci. Adv. 8, eabi7293 (2022).

  27. Ghahramani, A., Watt, F. M. & Luscombe, N. M. Generative adversarial networks simulate gene expression and predict perturbations in single cells. Preprint at bioRxiv https://doi.org/10.1101/262501 (2018).

  28. Yelmen, B. et al. Creating artificial human genomes using generative neural networks. PLoS Genet. 17, e1009303 (2021).

  29. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  30. Gayoso, A. et al. A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 40, 163–166 (2022).

  31. Lopez, R., Boyeau, P., Yosef, N., Jordan, M. I. & Regier, J. Decision-making with auto-encoding variational Bayes. In Proceedings of the 34th International Conference on Neural Information Processing Systems 5081–5092 (Curran Associates Inc., 2020).

  32. Yeo, G. H. T., Saksena, S. D. & Gifford, D. K. Generative modeling of single-cell time series with PRESCIENT enables prediction of cell trajectories with interventions. Nat. Commun. 12, 3222 (2021).

  33. Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).

  34. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 4765–4774 (Curran Associates, 2017).

  35. Hirst, J. A., Farmer, A. J., Ali, R., Roberts, N. W. & Stevens, R. J. Quantifying the effect of metformin treatment and dose on glycemic control. Diabetes Care 35, 446–454 (2012).

  36. Knowler, W. C. et al. Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. N. Engl. J. Med. 346, 393–403 (2002).

  37. Ustinova, M. et al. Metformin strongly affects transcriptome of peripheral blood cells in healthy individuals. PLoS One 14, e0224835 (2019).

  38. Xiao, Z., Wu, W. & Poltoratsky, V. Metformin suppressed CXCL8 expression and cell migration in HEK293/TLR4 cell line. Mediators Inflamm. 2017, 6589423 (2017).

  39. Bruno, S. et al. Metformin inhibits cell cycle progression of B-cell chronic lymphocytic leukemia cells. Oncotarget 6, 22624–22640 (2015).

  40. Ma, W. et al. Metformin alters gut microbiota of healthy mice: implication for its potential role in gut microbiota homeostasis. Front. Microbiol. 9, 1336 (2018).

  41. Forslund, K. et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature 528, 262–266 (2015).

  42. Vieira-Silva, S. et al. Statin therapy is associated with lower prevalence of gut microbiota dysbiosis. Nature 581, 310–315 (2020).

  43. Bryrup, T. et al. Metformin-induced changes of the gut microbiota in healthy young men: results of a non-blinded, one-armed intervention study. Diabetologia 62, 1024–1035 (2019).

  44. Vich Vila, A. et al. Impact of commonly used drugs on the composition and metabolic function of the gut microbiota. Nat. Commun. 11, 362 (2020).

  45. Shin, J. M., Munson, K., Vagin, O. & Sachs, G. The gastric HK-ATPase: structure, function, and inhibition. Pflugers Arch. 457, 609–622 (2009).

  46. Cholesterol Treatment Trialists’ (CTT) Collaboration. et al. Efficacy and safety of more intensive lowering of LDL cholesterol: a meta-analysis of data from 170,000 participants in 26 randomised trials. Lancet 376, 1670–1681 (2010).

  47. Barter, P. J., Brandrup-Wognsen, G., Palmer, M. K. & Nicholls, S. J. Effect of statins on HDL-C: a complex process unrelated to changes in LDL-C: analysis of the VOYAGER database. J. Lipid Res. 51, 1546–1553 (2010).

  48. Aguayo-Orozco, A. et al. sAOP: linking chemical stressors to adverse outcomes pathway networks. Bioinformatics 35, 5391–5392 (2019).

  49. Margerie, D. et al. Hepatic transcriptomic signatures of statin treatment are associated with impaired glucose homeostasis in severely obese patients. BMC Med. Genomics 12, 80 (2019).

  50. Gilbert, R., Al-Janabi, A., Tomkins-Netzer, O. & Lightman, S. Statins as anti-inflammatory agents: a potential therapeutic role in sight-threatening non-infectious uveitis. Porto Biomed. J. 2, 33–39 (2017).

  51. Aguayo-Orozco, A., Bois, F. Y., Brunak, S. & Taboureau, O. Analysis of time-series gene expression data to explore mechanisms of chemical-induced hepatic steatosis toxicity. Front. Genet. 9, 396 (2018).

  52. Kennedy, M. A. et al. ABCG1 has a critical role in mediating cholesterol efflux to HDL and preventing cellular lipid accumulation. Cell Metab. 1, 121–131 (2005).

  53. Ishihara, N. et al. Atorvastatin increases Fads1, Fads2 and Elovl5 gene expression via the geranylgeranyl pyrophosphate-dependent Rho kinase pathway in 3T3-L1 cells. Mol. Med. Rep. 16, 4756–4762 (2017).

  54. Ferretti, G., Bacchetti, T., Banach, M., Simental-Mendía, L. E. & Sahebkar, A. Impact of statin therapy on plasma MMP-3, MMP-9, and TIMP-1 concentrations: a systematic review and meta-analysis of randomized placebo-controlled trials. Angiology 68, 850–862 (2017).

  55. Orekhov, A. N. et al. Role of phagocytosis in the pro-inflammatory response in LDL-induced foam cell formation; a transcriptome analysis. Int. J. Mol. Sci. 21, 817 (2020).

  56. Osório, J. Statins and T2DM—an IGF link? Nat. Rev. Endocrinol. 9, 187–187 (2013).

  57. Alves, A., Bassot, A., Bulteau, A.-L., Pirola, L. & Morio, B. Glycine metabolism and its alterations in obesity and metabolic diseases. Nutrients 11, 1356 (2019).

  58. Snowden, S. G. et al. High-dose simvastatin exhibits enhanced lipid-lowering effects relative to simvastatin/ezetimibe combination therapy. Circ. Cardiovasc. Genet. 7, 955–964 (2014).

  59. Forslund, S. K. et al. Combinatorial, additive and dose-dependent drug-microbiome associations. Nature 600, 500–505 (2021).

  60. Zimmermann, M., Zimmermann-Kogadeeva, M., Wegmann, R. & Goodman, A. L. Mapping human microbiome drug metabolism by gut bacteria and their genes. Nature 570, 462–467 (2019).

  61. Lillie, E. O. et al. The n-of-1 clinical trial: the ultimate strategy for individualizing medicine? Per. Med. 8, 161–173 (2011).

  62. Koivula, R. W. et al. Discovery of biomarkers for glycaemic deterioration before and after the onset of type 2 diabetes: rationale and design of the epidemiological studies within the IMI DIRECT Consortium. Diabetologia 57, 1132–1142 (2014).

  63. Koivula, R. W. et al. Discovery of biomarkers for glycaemic deterioration before and after the onset of type 2 diabetes: descriptive characteristics of the epidemiological studies within the IMI DIRECT Consortium. Diabetologia 62, 1601–1615 (2019).

  64. Mahajan, A. et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513 (2018).

  65. Nellore, A. et al. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics 33, 4033–4040 (2017).

  66. Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).

  67. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).


  68. Maas, A. L., Hannun, A. Y. & Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) (JMLR, 2013).

  69. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) (Curran Associates, 2019).

  70. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv https://doi.org/10.48550/arXiv.1412.6980 (2014).

  71. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  72. Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).

  73. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57, 289–300 (1995).


  74. Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).

  75. Liao, Y., Wang, J., Jaehnig, E. J., Shi, Z. & Zhang, B. WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 47, W199–W205 (2019).

  76. Chong, J. & Xia, J. MetaboAnalystR: an R package for flexible and reproducible analysis of metabolomics data. Bioinformatics 34, 4313–4314 (2018).

  77. Leal Rodríguez, C. et al. Drug interactions in hospital prescriptions in Denmark: prevalence and associations with adverse outcomes. Pharmacoepidemiol. Drug Saf. 31, 632–642 (2022).

Acknowledgements

We are grateful to the IMI-DIRECT study participants who volunteered for phenotyping, and to the clinical and technical staff across the involved European study centers who contributed to recruitment and clinical assessment of study participants. The work leading to this publication has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement 115317 (DIRECT), resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution. Information on the initiatives and activities of the IMI-DIRECT research consortium is available at https://directdiabetes.org. R.L.A., A.T.L., R.H.M., J.J., V.B., H.W., J.N.N., C.B., G.M., L.N., P.J.C., U.P.J., K.B., S.R., and S.B. were also supported by the Novo Nordisk Foundation (grants NNF14CC0001 and NNF17OC0027594). Additionally, R.H.M. was supported by the Novo Nordisk Foundation (grant NNF20SA0035590). Figure 1a was partly created using BioRender.com. Finally, we would like to thank Tugce Karaderi for critical comments on the manuscript.

Author information

Author notes

  1. Mark I. McCarthy

    Present address: Genentech, South San Francisco, CA, USA

Authors and Affiliations

  1. Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark

    Rosa Lundbye Allesøe, Agnete Troen Lundgaard, Ricardo Hernández Medina, Alejandro Aguayo-Orozco, Joachim Johansen, Jakob Nybo Nissen, Caroline Brorsson, Gianluca Mazzoni, Lili Niu, Jorge Hernansanz Biel, Valentas Brasas, Henry Webel, Piotr Jaroslaw Chmura, Ulrik Plesner Jacobsen, Federico De Masi, Valborg Gudmundsdottir, Karina Banasik, Simon Rasmussen, Søren Brunak, Cecilia Engel Thomas, Birgitte Nilsson & Konstantinos Tsirigos

  2. Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark

    Rosa Lundbye Allesøe, Agnete Troen Lundgaard, Alejandro Aguayo-Orozco, Joachim Johansen, Caroline Brorsson, Gianluca Mazzoni, Jorge Hernansanz Biel, Anders Gorm Pedersen, Piotr Jaroslaw Chmura, Ulrik Plesner Jacobsen, Federico De Masi, Helle Krogh Pedersen, Valborg Gudmundsdottir, Karina Banasik, Søren Brunak, Cecilia Engel Thomas, Hans-Henrik Stærfeldt, Ramneek Gupta, Peter Wad Sackett, Birgitte Nilsson & Konstantinos Tsirigos

  3. Copenhagen Research Centre for Mental Health, Mental Health Centre Copenhagen, Copenhagen University Hospital, Copenhagen, Denmark

    Rosa Lundbye Allesøe & Michael Eriksen Benros

  4. Department of Immunology and Microbiology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark

    Michael Eriksen Benros

  5. C.N.R. Institute of Neuroscience, Padova, Italy

    Andrea Mari, Roberto Bizzotto & Andrea Tura

  6. Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK

    Robert Koivula, Anubha Mahajan, Juan Fernandez Tajes, Mark I. McCarthy & Moustafa Abdalla

  7. Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland

    Ana Vinuela, Emmanouil Dermitzakis & Anna Ramisch

  8. Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle, UK

    Ana Vinuela

  9. Research Unit of Molecular Epidemiology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Bavaria, Germany

    Sapna Sharma

  10. Institute of Epidemiology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Bavaria, Germany

    Sapna Sharma, Barbara Thorand, Anna Artati, Cornelia Prehn, Jonathan Adam & Harald Grallert

  11. Chair of Food Chemistry and Molecular and Sensory Science, Technical University of Munich, Freising, Germany

    Sapna Sharma

  12. Metabolomics and Proteomics Core, Helmholtz Zentrum Muenchen, German Research Center for Environmental Health, Neuherberg, Germany

    Mark Haid

  13. Affinity Proteomics, Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology, Solna, Sweden

    Mun-Gwan Hong, Jochen M. Schwenk, Cecilia Engel Thomas & Ragna Haussler

  14. Research and Development Global Development, Translational Medicine and Clinical Pharmacology, Sanofi-Aventis Deutschland, Frankfurt, Germany

    Petra B. Musholt & Hartmut Ruetten

  15. Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark

    Josef Vogt, Helle Krogh Pedersen, Tue Haldor Hansen, Henrik Vestergaard, Oluf Pedersen, Torben Hansen, Kristine Allin, Manimozhiyan Arumugam, Anna Jonsson, Line Engelbrechtsen, Annemette Forman, Avirup Dutta, Nadja Sondertoft & Yong Fan

  16. University of Exeter Medical School, Exeter, UK

    Angus Jones, Andrew Hattersley & Timothy McDonald

  17. The Immunoassay Biomarker Core Laboratory, School of Medicine, University of Dundee, Dundee, UK

    Gwen Kennedy

  18. Research Centre for Optimal Health, Department of Life Sciences, University of Westminster, London, UK

    Jimmy Bell, E. Louise Thomas & Brandon Whitcher

  19. Section for Nutrition Research, Faculty of Medicine, Imperial College London, London, UK

    Gary Frost & Rebeca Eriksen

  20. Department of Radiology, Copenhagen University Hospital Herlev-Gentofte, Herlev, Denmark

    Henrik Thomsen & Elizaveta Hansen

  21. Department of Epidemiology and Data Science, Amsterdam Public Health Research Institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands

    Mirthe Muilwijk, Leen M. ‘t Hart, Joline Beulens, Femke Rutters, Giel Nijpels, Sabine van Oort, Lenka Groeneveld & Roderick Slieker

  22. Department of General Practice, Amsterdam Public Health Research Institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands

    Marieke T. Blom & Petra Elders

  23. Department of Biomedical Data Science, Section Molecular Epidemiology, Leiden University Medical Center, Leiden, the Netherlands

    Leen M. ‘t Hart

  24. Department of Cell and Chemical Biology, Leiden University Medical Center, Leiden, the Netherlands

    Leen M. ‘t Hart, Koen Dekkers, Nienke van Leeuwen & Roderick Slieker

  25. Inserm, Univ Lille, CHU Lille, Lille Pasteur Institute, EGID, Lille, France

    Francois Pattou, Violeta Raverdy, Philippe Froguel, Amelie Bonnefond, Mickael Canouil, Robert Caiazzo & Helene Verkindt

  26. MRC Epidemiology Unit, University of Cambridge School of Clinical Medicine, Cambridge, UK

    Soren Brage

  27. Department of Medicine, University of Eastern Finland, Kuopio, Finland

    Tarja Kokkola

  28. Institute of Cellular Medicine, Newcastle University, Newcastle, UK

    Alison Heggie & Harshal Deshmukh

  29. Diabetes Research Network, Royal Victoria Infirmary, Newcastle, UK

    Donna McEvoy & Ian McVittie

  30. Centre for Health, Law and Emerging Technologies (HeLEX), Faculty of Law, University of Oxford, Oxford, UK

    Miranda Mourby, Jane Kaye, Nisha Shah & Harriet Teare

  31. Lund University Diabetes Centre, Department of Clinical Sciences, Lund University, Malmö, Sweden

    Martin Ridderstråle, Paul W. Franks & Leif Groop

  32. Translational and Clinical Research Institute, Faculty of Medical Sciences, Newcastle University, Newcastle, UK

    Mark Walker

  33. Division of Population Health & Genomics, School of Medicine, University of Dundee, Dundee, UK

    Ian Forgie, Ewan Pearson, Andrew Brown, David Davtian, Adem Dawed, Louise Donnelly, Colin Palmer & Margaret White

  34. Genetic and Molecular Epidemiology Unit, Lund University Diabetes Centre, Department of Clinical Sciences, CRC, Lund University, SUS, Malmö, Sweden

    Giuseppe N. Giordano, Naeimeh Atabaki Pasdar, Hugo Fitipaldi, Azra Kurbasic, Pascal Mutie, Hugo Pomares-Millan & Maria Klintenberg

  35. Eli Lilly Regional Operations, Vienna, Austria

    Imre Pavo

  36. Harvard T.H. Chan School of Public Health, Boston, MA, USA

    Paul W. Franks

  37. OCDEM, Radcliffe Department of Medicine, University of Oxford, Oxford, UK

    Paul W. Franks

  38. Institute of Experimental Genetics, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany

    Jerzy Adamski

  39. Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore

    Jerzy Adamski

  40. Institute of Biochemistry, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia

    Jerzy Adamski

  41. Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, UK

    Mark I. McCarthy, Stephen Gough, Neil Robertson, Nicky McRobert & Agata Wesolowska-Andersen

  42. Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK

    Philippe Froguel

  43. Biophysics Institute (IBF-CNR), National Research Council of Italy, Milan, Italy

    Toni Giorgino

  44. Department of Biosciences, University of Milan, Milan, Italy

    Toni Giorgino

  45. Biotech & Biomarkers Research Department, Institut de Recherches Internationales Servier, Croissy sur Seine, France

    Marianne Rodriquez

  46. Blood Sciences, Royal Devon and Exeter NHS Foundation Trust, Exeter, UK

    Rachel Nice & Mandy Perry

  47. Boehringer Ingelheim International, Therapeutic Area CardioMetabolism and Respiratory Medicine, Ingelheim am Rhein, Germany

    Susanna Bianzano

  48. Boehringer Ingelheim International, Therapeutic Area CNS, Retinopathies and Emerging Areas, Ingelheim am Rhein, Germany

    Ulrike Graefe-Mody

  49. Boehringer Ingelheim International, Medicine Cardiometabolism and Respiratory, Biberach an der Riss, Germany

    Anita Hennige

  50. Boehringer Ingelheim International, Translational Medicine & Clinical Pharmacology, Biberach an der Riss, Germany

    Rolf Grempler & Patrick Baum

  51. Centre for Mathematics and Algorithms for Data, University of Bath, Bath, UK

    Beate Ehrhardt

  52. Clinical Operations, Sanofi-Aventis Deutschland, Frankfurt, Germany

    Joachim Tillner

  53. Clinical Pharmacy, Saarland University, Saarbrücken, Germany

    Christiane Dings, Thorsten Lehr, Nina Scherer & Iryna Sihinevich

  54. Clinical Research Centre, Ninewells Hospital and Medical School, University of Dundee, Dundee, Scotland, UK

    Louise Cabrelli & Heather Loftus

  55. Department of Mathematical Sciences, University of Bath, Bath, UK

    Christopher Jennison

  56. Digital and Data Sciences, Sanofi-Aventis Deutschland, Frankfurt, Germany

    Francesca Frau

  57. Eli Lilly and Company, Indianapolis, IN, USA

    Birgit Steckel-Hamann, Kofi Adragni & Melissa Thomas

  58. Institute for Epidemiology and Medical Biometry, ZIBMT, University of Ulm, Ulm, Germany

    Reinhard Holl

  59. Institute of Biomedicine, Bioinformatics Center, University of Eastern Finland, Kuopio, Finland

    Teemu Kuulasmaa

  60. Institute of Clinical Medicine, Internal Medicine, University of Eastern Finland, Kuopio, Finland

    Henna Cederberg, Markku Laakso, Jagadish Vangipurapu & Matilda Dale

  61. German Center for Diabetes Research, München-Neuherberg, Germany

    Barbara Thorand & Harald Grallert

  62. Lilly Deutschland, Bad Homburg, Germany

    Claudia Nicolay

  63. Medizinische Universitätsklinik Tübingen, Eberhard Karls Universität Tübingen, Tübingen, Germany

    Andreas Fritsche

  64. NIHR Exeter Clinical Research Facility, University of Exeter Medical School, Exeter, UK

    Anita Hill, Michelle Hudson & Claire Thorne

  65. Regulatory Genomics and Diabetes, Centre for Genomic Regulation, CIBERDEM, Barcelona, Spain

    Jorge Ferrer

  66. Strategy and Innovation, Sanofi-Aventis Deutschland, Frankfurt, Germany

    Bernd Jablonka

  67. Systems Biology, Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology, Solna, Sweden

    Mathias Uhlen

  68. TMED, Sanofi-Aventis Deutschland, Frankfurt, Germany

    Johann Gassenhuber

  69. Translational and Clinical Research, Metabolism Innovation Pole, Institut de Recherches Internationales Servier, Suresnes Cedex, France

    Tania Baltauss & Nathalie de Preville

Consortia

IMI DIRECT Consortium

Philippe Froguel, Cecilia Engel Thomas, Ragna Haussler, Joline Beulens, Femke Rutters, Giel Nijpels, Sabine van Oort, Lenka Groeneveld, Petra Elders, Toni Giorgino, Marianne Rodriquez, Rachel Nice, Mandy Perry, Susanna Bianzano, Ulrike Graefe-Mody, Anita Hennige, Rolf Grempler, Patrick Baum, Hans-Henrik Stærfeldt, Nisha Shah, Harriet Teare, Beate Ehrhardt, Joachim Tillner, Christiane Dings, Thorsten Lehr, Nina Scherer, Iryna Sihinevich, Louise Cabrelli, Heather Loftus, Roberto Bizzotto, Andrea Tura, Koen Dekkers, Nienke van Leeuwen, Leif Groop, Roderick Slieker, Anna Ramisch, Christopher Jennison, Ian McVittie, Francesca Frau, Birgit Steckel-Hamann, Kofi Adragni, Melissa Thomas, Naeimeh Atabaki Pasdar, Hugo Fitipaldi, Azra Kurbasic, Pascal Mutie, Hugo Pomares-Millan, Amelie Bonnefond, Mickael Canouil, Robert Caiazzo, Helene Verkindt, Reinhard Holl, Teemu Kuulasmaa, Harshal Deshmukh, Henna Cederberg, Markku Laakso, Jagadish Vangipurapu, Matilda Dale, Barbara Thorand, Claudia Nicolay, Andreas Fritsche, Anita Hill, Michelle Hudson, Claire Thorne, Kristine Allin, Manimozhiyan Arumugam, Anna Jonsson, Line Engelbrechtsen, Annemette Forman, Avirup Dutta, Nadja Sondertoft, Yong Fan, Stephen Gough, Neil Robertson, Nicky McRobert, Agata Wesolowska-Andersen, Andrew Brown, David Davtian, Adem Dawed, Louise Donnelly, Colin Palmer, Margaret White, Jorge Ferrer, Brandon Whitcher, Anna Artati, Cornelia Prehn, Jonathan Adam, Harald Grallert, Ramneek Gupta, Peter Wad Sackett, Birgitte Nilsson, Konstantinos Tsirigos, Rebeca Eriksen, Bernd Jablonka, Mathias Uhlen, Johann Gassenhuber, Tania Baltauss, Nathalie de Preville, Maria Klintenberg & Moustafa Abdalla

Contributions

S.R. and S. Brunak designed and supervised the analyses, interpreted the results, and wrote the manuscript. R.L.A. wrote the code, carried out the analyses, interpreted the results, and wrote the manuscript. A.T.L. cleaned and prepared the drug data to be used in the analysis. R.H.M. developed the Bayesian analysis. J.J., J.N.N., V.B. and H.W. assisted in the development of the method and analysis. A.A-O. assisted in interpretation of the drug interactions and writing the manuscript. G.M. processed transcriptomics data and L.N. assisted in interpreting the proteomics results. C.B., K.B., M.E.B. and A.G.P. assisted in the interpretation of results and design of the analysis. J.H.B. analyzed drug–drug interaction data. Data acquisition and pre-processing were performed by A.J., A. Mari, A.T., A. Mahajan, A.V., E.H., H.K.P., H.T., J.F., J.V., M.H., M.-G.H., P.B.M., G.K., J.B., L.T., G.F., R.K., S.S., T.M., T.K., and V.G. Clinical investigations were performed by A. Hattersley, A. Heggie, D.M., F.P., F.R., G.N., P.E., T.H.H., H.V., T.H., H.T., M.R., and V.R. High-performance computing support and administration were performed by P.J.C. and U.P.J. Project administration was performed by F.D.M., I.F., J.K., R.K., G.N.G., I.P., H.R., O.P., M.M., M.W., E.P., S. Brage, and P.W.F. Funding acquisition was by M.W., O.P., E.D., P.W.F., J.M.S., J.A., E.P., M.I.C., and S. Brunak. All authors reviewed and edited the final manuscript.

Corresponding authors

Correspondence to Simon Rasmussen or Søren Brunak.

Ethics declarations

Competing interests

S. Brunak has ownerships in Intomics A/S, Hoba Therapeutics Aps, Novo Nordisk A/S, Lundbeck A/S, and managing board memberships in Proscion A/S and Intomics A/S. M.I.C. has served on advisory panels for Pfizer, Novo Nordisk, and Zoe Global; has received honoraria from Merck, Pfizer, Novo Nordisk, and Eli Lilly; and has received research funding from Abbvie, Astra Zeneca, Boehringer Ingelheim, Eli Lilly, Janssen, Merck, Novo Nordisk, Pfizer, Roche, Sanofi Aventis, Servier, and Takeda. As of June 2019, M.I.C. is an employee of Genentech and a holder of Roche stock. E.P. has received honoraria from Sanofi and Lilly. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Yasuhiro Kojima and Elin Nyman for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Allesøe, R.L., Lundgaard, A.T., Hernández Medina, R. et al. Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models.
Nat Biotechnol (2023). https://doi.org/10.1038/s41587-022-01520-x

  • DOI: https://doi.org/10.1038/s41587-022-01520-x

Rosa Lundbye Allesøe

Real-time, volumetric imaging of radiation dose delivery deep into the liver during cancer treatment

Main

Radiation therapy (RT) has been shown to improve the outcomes of patients with cancer and provide palliation of related symptoms1. Successful RT is contingent on delivering the intended, sufficient radiation dose to the tumor while sparing surrounding normal tissues2. Achieving the desired therapeutic ratio, that is, maximizing tumor control while minimizing toxicity, requires that the planned radiation dose is delivered accurately3,4.

To improve the efficacy of RT, advanced image-guided delivery technologies have been proposed and developed over the past decades5,6. Technologies such as intensity modulated RT and volumetric modulated arc RT can offset some of the limitations associated with three-dimensional (3D) conformal RT7,8; however, targeting of moving lesions remains challenging. Several studies have highlighted discrepancies between planned and delivered RT and their impact on tumor control9. These differences are exacerbated by setup errors, organ motion, as well as anatomical deformations10,11, which may markedly alter the intended doses delivered to the target or adjacent normal tissues over the course of treatment12,13,14. Currently, the common practice for creating a planning target volume (PTV) is to expand the clinical target volume with a spatial margin to allow for setup uncertainties and organ deformations15. Moreover, dose escalation in many diseases is limited by adjacent normal tissue radiosensitivity16,17. In the case of patients with liver cancer, a previous study showed reducing the margin for organ motion can reduce the effective treatment volume by up to 5% (resulting in a reduced complication risk of 4.5%), which would allow escalation of radiation dose by 6–8 Gy, resulting in improved tumor control by an estimated 6–7% (ref. 18).

To mitigate problems with target and normal tissue motion, technologies capable of monitoring tumor location and mapping of the delivered dose during treatment are required. Surrogates of motion such as fiducials19 or active breath hold with spirometry are sometimes used for respiratory gating20. In addition, several onboard image-guidance RT (IGRT)21,22 technologies have been used, including electronic portal imaging device23,24, kilovolt fluoroscopic imaging and kilo- or megavolt cone beam computed tomography (CT) (CBCT) imaging. However, none of these technologies can provide real-time information of the 3D dose deposition. Safer nonionizing technologies were also explored, such as ultrasound imaging25 and surface camera-based systems, which are susceptible to subtle sources of error and interuser variability. To better resolve tissue discrimination with real-time imaging, integrated technologies such as CT-linear accelerators (LINACs), magnetic resonance imaging- (MRI-) LINACs and positron emission tomography-LINACs have been introduced for clinical use26, but CT, MRI or positron emission tomography cannot monitor the location of the X-ray radiation beam nor the dose deposition in the normal tissues or the target. Currently, image guidance with delivered dose feedback monitoring remains inherently limited27. On the other hand, there are a wide variety of devices for clinical dose measurements (for example, diodes, thermal/optical stimulated dosimeters, metal oxide semiconductor field effect transistors, plastic scintillators, electronic portal imaging devices, gels and films). These devices, however, are mostly limited to point measurements on the external surface of a patient and are not volumetric, not real time and some are dose rate or energy dependent28. New generations of detectors can be used in vivo but do not provide any of the necessary detailed anatomical information29,30,31. 
Therefore, there is a long-standing clinical need for more effective imaging technologies capable of volumetric, real-time, in vivo dose delivery monitoring during RT for feedback guidance.

Ionizing radiation acoustic imaging (iRAI) is a noninvasive imaging technology that reconstructs the radiation dose using acoustic waves stemming from the absorption of pulsed ionizing radiation beams in soft tissue32,33. iRAI has the potential to map the dose deposition and monitor the dose accumulation at in-depth anatomical structures in real time during RT. In contrast to other dose mapping methods, the iRAI signal is directly proportional to the radiation dose absorbed by the targeted tissue. With precalibration of the Grüneisen parameter, medium density, pulse time profile and sensor sensitivity, the linear relationship between the absorbed dose and deposited dose could enable iRAI to both localize and quantify the absolute dose deposition during RT32,33,34,35,36,37. Most recently, the feasibility of iRAI for real-time monitoring of misalignment between the targeted tumor and the delivered beam has been presented for conventional as well as ultra-high dose rate (FLASH) radiation treatments32,34,38.
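
The proportionality described above can be written compactly. The following is a sketch under the usual thermoacoustic assumptions, not a formula taken from the source; the symbols are our own notation. If the radiation pulse is short enough that heat and stress do not leave the heated region during the pulse, the initial pressure rise at position r is

```latex
p_0(\mathbf{r}) \;=\; \Gamma \,\eta_{\mathrm{th}}\, \rho \, D(\mathbf{r}),
\qquad
\Gamma \;=\; \frac{\beta\, c^{2}}{C_p}
```

Here p_0 is the initial pressure rise, Γ the Grüneisen parameter, η_th the fraction of absorbed energy converted to heat (an assumed factor, often taken close to 1), ρ the medium density, D the dose deposited per pulse, β the volumetric thermal expansion coefficient, c the speed of sound and C_p the specific heat capacity. For a pulse of finite duration, the detected signal is additionally shaped by the pulse time profile and the sensor response, which is why those quantities appear in the calibration list above.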

To further develop iRAI and promote its clinical translation, in this study we demonstrate a clinically ready iRAI system for real-time, volumetric imaging of radiation dose with high sensitivity and high spatial resolution, as shown in Fig. 1a. This imaging system was developed with a custom-designed two-dimensional (2D) matrix array transducer and a matching multi-channel preamplifier board (Fig. 1b), which were driven by a commercial research ultrasound system. Using this imaging system, iRAI was successfully performed with a lard phantom (Fig. 1c), an in vivo rabbit model (Fig. 1d,e) and patients with cancer undergoing radiotherapy on a clinical LINAC system. This study realized 3D semiquantitative mapping of X-ray beam delivery deep into the body during cancer treatment.

Fig. 1: iRAI system schematic and the experimental setup.

a, 3D schematic of the iRAI system for mapping the dose deposition in a patient during RT delivery. b, CAD view of a 2D matrix array with an integrated preamplifier board. The xyz coordinate system for the 3D iRAI imaging space is marked. c, The experimental setup for the phantom studies. d, The side view of the rabbit experiment setup in a clinical environment. e, Details regarding the transducer position and coupling of the rabbit experiment.

Results

iRAI system calibration

Using the schematic setup shown in Fig. 2a, the iRAI result for a small field with a lateral plane on a cylindrical lard phantom is shown in Fig. 2b. The normalized intensity profile along the dotted line in the red box is presented in Fig. 2c, where the dots show the pixel intensities. The curve shows the fitted point spread function, which has a full-width at half-maximum of 5 mm, suggesting a lateral spatial resolution of roughly 5 mm. The cross-section of the iRAI result along the axial direction with a 1 × 3 cm beam is shown in Fig. 2d. Figure 2e shows the fitted line spread function (LSF) extracted from the front edge of the iRAI image with a 1 × 3 cm beam. The 4 mm full-width at half-maximum of the LSF suggests that the axial resolution of the 2D array is better than 4 mm, which is about the predicted theoretical resolution of our 350 kHz transducer. The iRAI detected beam sizes versus the beam sizes of the radiation beam along the axial direction are shown in Fig. 2f. For each delivered beam size, the mean and the standard deviation (s.d.) of the iRAI measurements are shown. A linear fit was performed, and an R² = 0.989 was achieved, demonstrating that the 2D array based iRAI imaging system can accurately measure the beam size with a maximum deviation of 1.75 mm and a mean ± s.d. of 1.25 mm.
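
To make the resolution estimate concrete, the sketch below shows one way to obtain a full-width at half-maximum from a sampled intensity profile. It assumes a Gaussian-shaped point spread function and fits a parabola to the logarithm of the intensities; the source does not state which fitting model was used, and the function name `gaussian_fwhm`, the `floor` threshold and the synthetic profile are illustrative assumptions.

```python
import numpy as np


def gaussian_fwhm(x_mm, intensity, floor=0.05):
    """Estimate the FWHM (in mm) of a peaked 1D profile.

    Assumption: the profile is approximately Gaussian, so log(intensity)
    is a parabola in x. Samples below floor * max are ignored because
    their logarithm is dominated by noise.
    """
    x = np.asarray(x_mm, dtype=float)
    y = np.asarray(intensity, dtype=float)
    keep = y > floor * y.max()
    # log(y) = a*x^2 + b*x + c, with a = -1 / (2 * sigma^2)
    a, _, _ = np.polyfit(x[keep], np.log(y[keep]), 2)
    if a >= 0:
        raise ValueError("profile is not peaked; cannot fit a Gaussian")
    sigma = np.sqrt(-1.0 / (2.0 * a))
    return 2.0 * np.sqrt(2.0 * np.log(2.0)) * sigma


# Synthetic, noise-free profile with a known 5 mm FWHM (illustrative only).
x = np.linspace(-15.0, 15.0, 61)
true_sigma = 5.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))
profile = np.exp(-x**2 / (2.0 * true_sigma**2))
fwhm = gaussian_fwhm(x, profile)
```

On measured data the fit would be applied to the normalized pixel intensities along the profile line, and the result is only as good as the Gaussian assumption.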

Fig. 2: The performance of the 2D array transducer.

a, Schematic of the iRAI phantom experiment for performance calibration. b, iRAI imaging with the 5 × 5 mm radiation beam field. Scale bar, 5 mm. c, Point spread function (PSF) of iRAI in lateral direction. d, Cross-section of iRAI imaging with 3 × 1 cm radiation beam field. Scale bar, 5 mm. e, LSF of the iRAI in the axial direction. f, Beam widths of iRAI versus the beam field sizes of radiation source along axial direction. Error bars are s.d. for n = 5 independent measurements.

Mapping the dose distribution and temporal dose accumulation

A C-shaped treatment plan with a dose distribution shown in Fig. 3a was delivered to a lard-based cylindrical phantom (Fig. 1c). The iRAI image showing the measured relative dose distribution in the phantom presents a C-shape, as shown in Fig. 3b. The planned dose distribution and the iRAI imaged dose distribution are compared in Fig. 3c, where isodose lines of 60% (blue) and 80% (brown) of the maximum dose are shown. There is good agreement in the shape of the 60 and 80% isodose lines between the planned dose and the iRAI imaged dose with an average root mean square error (r.m.s.e.) of 0.0987. A variation of less than 2% was achieved between the five independent iRAI imaging results, as shown in Supplementary Fig. 1 and Supplementary Video 1, which suggests that iRAI has high stability for measuring the dose deposition during RT.
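
The comparison above can be illustrated with a short sketch. It assumes that both dose maps are normalized to their own maximum and compared pixel by pixel on a common grid; the source does not specify how the r.m.s.e. was computed, so the helper names (`normalize`, `rmse`, `isodose_mask`) and the toy arrays are hypothetical.

```python
import numpy as np


def normalize(dose):
    """Scale a dose map so that its maximum equals 1."""
    dose = np.asarray(dose, dtype=float)
    return dose / dose.max()


def rmse(planned, measured):
    """Root mean square error between two normalized dose maps.

    Assumption: both maps are already registered to the same grid.
    """
    diff = normalize(planned) - normalize(measured)
    return float(np.sqrt(np.mean(diff**2)))


def isodose_mask(dose, level):
    """Boolean mask of pixels receiving at least `level` of the maximum dose.

    The boundary of this mask is the isodose line at that level.
    """
    return normalize(dose) >= level


# Toy 2 x 2 maps, not data from the study.
planned = np.array([[0.0, 1.0], [2.0, 4.0]])
measured = np.array([[0.0, 1.0], [2.0, 2.0]])
error = rmse(planned, measured)
mask_60 = isodose_mask(planned, 0.6)
```

Because each map is normalized independently, this comparison is relative, in line with the semiquantitative nature of the measurement.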

Fig. 3: iRAI imaging for a C-shaped dose distribution treatment plan.
figure 3

a, The planned dose for the C-shaped 3D CRT treatment plan. b, iRAI imaging of relative deposited dose result for a C-shaped dose distribution treatment plan. c, The 60 and 80% isodose lines on the planned dose distribution and the iRAI imaged relative dose distribution. d, The temporal dose accumulation at different time points imaged by iRAI during the dose delivery of a C-shaped treatment plan. Scale bars in a–d, 5 mm.

With the C-shaped treatment plan, the TrueBeam LINAC system (Varian) delivered the dose at 1,400 monitor units per minute. The temporal dose accumulation in the phantom over the delivery time of around 19 s was continuously monitored by iRAI, as shown in Fig. 3d. A gradually formed C-shaped dose distribution was clearly demonstrated by the iRAI image as a function of time with a 2.4-s interval. By averaging more than 100 pulses for each iRAI image reconstruction, a frame rate of 3.3 Hz was achieved for monitoring the temporal dose accumulation in this study; the result is provided in Supplementary Video 2. The results showing the delivered dose between two consecutive reconstruction time points are shown in Supplementary Fig. 2 and Supplementary Video 3. Since it typically takes around 60–120 s for a patient to receive one fraction of treatment, the iRAI system would be able to provide sufficient temporal resolution for monitoring dose delivery clinically.

Mapping dose deposition of a treatment plan in an animal model

Before the treatment planning simulation, the rabbit anatomy was obtained by CT scanning. The CT cross-section images in the anterior plane at the front and rear edges of the planned dose are shown in Fig. 4a,d, respectively. The definition of the front and rear edges is shown in the sagittal plane of the rabbit cross-section images in Supplementary Fig. 3. Fusion of the treatment planned dose distributions and the CT images at the same positions is shown in Fig. 4b,e, respectively. As shown in Fig. 4c,f, the front and rear edges of the iRAI images, which were extracted from the iRAI volumetric image based on the distance between the 2D matrix array and the isocenter of the treatment plan, were fused onto the corresponding CT images. By comparing the iRAI images and the treatment plan, the higher dose areas of the iRAI images were highly consistent with the plan, yielding an r.m.s.e. of 0.0570 and 0.0691 for the front and rear edges, respectively.

Fig. 4: In vivo iRAI imaging versus the treatment plan of a rabbit model.

a, The CT cross-section image of a rabbit at the front edge of the treatment dose delivery. b, The treatment plan fused onto the CT anatomy structure at the front edge of the dose delivery boundary. c, The iRAI image showing the dose distribution fused onto the CT scan at the same location as b. d, The CT cross-section image of the rabbit at the rear edge of the treatment dose delivery. e, The treatment plan fused onto the CT anatomy structure at the rear edge of the dose delivery boundary. f, The iRAI image showing the dose distribution fused onto the CT scan at the same location as e. g, The 60 and 80% isodose lines of the iRAI measurement and the treatment plan in the front edge cross-section. h, The DVH of iRAI measurement in the front edge of the rabbit liver. The data with the blue areas are presented as mean ± s.d. for n = 3 independent iRAI measurements. i, The 60 and 80% isodose lines of the iRAI measurement and the treatment plan in the rear edge cross-section. j, The DVH of iRAI measurement in the rear edge of the rabbit liver. The data with the blue areas are presented as mean ± s.d. for n = 3 independent iRAI measurements. Scale bars in a and d, 2 cm; g and i, 5 mm.

To further quantify the dose distribution, 60 and 80% isodose lines and a digital area histogram (DAH)39 from the iRAI result were compared to those of the treatment plan. As shown in Fig. 4g, the total distribution of isodose lines in the front edge of the iRAI image matched well with the treatment plan. Along the vertical direction, the iRAI image can resolve the same dose distribution as the treatment plan. Along the horizontal direction, the dose distribution presented by the iRAI result appears narrower than that of the treatment plan. Three independent iRAI measurements at the front edge were also quantified with DAH, as shown in Fig. 4h. The trend of the histogram percentage of the iRAI measurement is similar to the treatment plan. The blue area shows the s.d. of three independent iRAI measurements with a mean ± s.d. of 0.0199, which indicates that the iRAI imaging of deposited dose is stable. In addition, the rear edge isodose line shown in Fig. 4i has a consistent dose distribution in the bottom part. Although the top area shows some mismatch between the treatment plan and the iRAI results, overall, there is a good overlap agreement between the two distributions. The DAH results in Fig. 4j represent the relationship between three independent iRAI measurements and the treatment plan with a variation of less than 5%. iRAI measurements had a small s.d. of 0.0288. A slightly higher mismatch can be found with 70 to 90% of the maximum dose, which is also consistent with the isodose line results of Fig. 4i.
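
The histogram comparison above can be sketched as follows. We assume here that the DAH is a cumulative histogram giving, for each relative dose level, the fraction of the region of interest that receives at least that dose; the source does not spell out the definition, and the function name, the optional `mask` argument and the toy data are hypothetical.

```python
import numpy as np


def dose_area_histogram(dose, mask=None, levels=None):
    """Cumulative histogram of a 2D relative dose map.

    Returns (levels, area_fraction), where area_fraction[i] is the
    fraction of masked pixels whose normalized dose is >= levels[i].
    Assumption: dose is normalized to the maximum inside the mask.
    """
    dose = np.asarray(dose, dtype=float)
    if mask is None:
        mask = np.ones(dose.shape, dtype=bool)
    values = dose[mask]
    values = values / values.max()
    if levels is None:
        levels = np.linspace(0.0, 1.0, 101)
    levels = np.asarray(levels, dtype=float)
    area_fraction = np.array([(values >= lv).mean() for lv in levels])
    return levels, area_fraction


# Toy 2 x 2 dose map, not data from the study.
toy_dose = np.array([[0.2, 0.4], [0.8, 1.0]])
lv, frac = dose_area_histogram(toy_dose, levels=[0.0, 0.5, 0.9])
```

Repeating this for each independent measurement and taking the mean and s.d. at every level would give a band comparable to the shaded areas described for Fig. 4h,j.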

Mapping dose deposition of a treatment plan in a cancer patient case

The clinical setting for performing iRAI imaging on a patient is shown in Fig. 5a. Due to the limited field of view of the 2D matrix array, only the radiation induced acoustic effects occurring in the liver were analyzed. As shown in Fig. 5b, a liver mask was applied to the treatment plan, which ensured that only the dose deposited to the liver was shown in the CT scan. The iRAI measurement results of the relative dose delivery of the two sagittal static fields are shown in Fig. 5c. The sagittal plane position of the iRAI image is shown in the sagittal plane of the patient’s cross-section images in Supplementary Fig. 4. Due to the limited signal-to-noise ratio (SNR), only the central part of the dose distribution was mapped by iRAI. The beam path of the two anterior beams was not resolved by iRAI. Taking into account the dose distribution of the treatment plan, doses lower than 50% of the maximum dose were removed from the treatment plan. This resulted in a diamond-shaped dose map, as shown in Fig. 5d. Comparing the iRAI measurement in Fig. 5c to the treatment plan in Fig. 5d, both the dose locations and the overall distributions match well. To further quantify the accuracy of the iRAI relative dose mapping, the 50 and 90% isodose lines were drawn based on the normalized dose in both the iRAI image and the clinical treatment plan40. The central two dose distributions matched well, especially for higher doses (90% isodose line), as shown in Fig. 5e. In addition, the 50% isodose line showed relatively strong variation: only the central part around the target was imaged successfully by iRAI, which is reasonable considering the limited field of view of the 2D matrix array. The r.m.s.e. was 0.0787.

Fig. 5: In vivo iRAI imaging versus the treatment plan on a patient.

a, A photograph of the iRAI imaging on a patient taken during RT. b, The dose distribution of only the two static sagittal beams of the treatment plan with a liver mask fused onto the CT scan anatomy structure. Scale bar, 5 cm. c, The iRAI measurement of dose with a liver mask fused onto the CT anatomy structure with the same position as b. The yellow dashed box indicates the field of view of the 2D matrix array. d, Dose distribution (>50%) of the treatment plan with a liver mask fused on the CT anatomy structure. e, The 50 and 90% isodose lines in the iRAI measurement and the treatment plan. Scale bar, 2 cm. The red line in b–d indicates the boundary of the liver.

Discussion

The goal of this study was to demonstrate a clinically applicable technique to increase the precision of in vivo dose monitoring during RT by mapping the dose deposition and resolving the temporal dose accumulation while the treatment is being delivered in real time. To achieve this goal, a clinical grade iRAI volumetric imaging system was developed. This was achieved by using a custom-designed 2D matrix array with a central frequency and bandwidth to match the spectrum of the acoustic wave induced by a 4-µs radiation pulse. This, together with the specially designed large size of the transducer elements, enhanced the sensitivity of detecting the weak radiation induced acoustic signal. To further improve the detection sensitivity, a custom-designed low-noise multi-channel preamplifier board was integrated with the matrix array for signal amplification before the signals are acquired by the research ultrasound system. This study has been able to detect the intrinsically weak thermoacoustic signal induced by the radiation beam in deep tissue such as the liver.

As demonstrated by the results, the C-shaped dose distribution can be imaged online using iRAI with high accuracy, while the iRAI measurements of the rabbit showed high consistency between the measured dose distribution and the one generated by the treatment planning system. Both in vitro and in vivo repeated stability measurements suggest that the iRAI system has high stability in mapping the delivered dose. In the patient study, despite the fact that the acoustic inhomogeneity of human tissues was neglected and the field of view of the 2D matrix array was limited here, the iRAI measurement clearly visualized a dose distribution similar to that of the treatment plan in vivo. Although the treatment plans for both the rabbit model and the patient are simpler than common treatment planning procedures, the results from this study demonstrated that iRAI is a clinically feasible and practical technique for real-time mapping of the 3D dose deposition during radiotherapy. By using state-of-the-art image processing and displaying technologies, iRAI volumetric dose measurement was achieved simultaneously during the radiation dose delivery of a deeply seated organ such as the liver. A continuously formed C-shaped dose during the radiation treatment shows a promising result for directly visualizing the dose accumulation of a treatment plan during delivery, which is an important step for establishing an online feedback system for RT active monitoring. To quantify the accuracy of iRAI for dose mapping, isodose line and DAH, which are two of the clinical standard quality assurance methods, were estimated for iRAI relative dose measurement39,41. The well-matched isodose lines of normalized iRAI measurement and the clinical treatment plan provide a proof of principle for the spatial accuracy of iRAI measurement in mapping the dose deposition in a clinical environment. 
The DAH results of iRAI measurement and the treatment plan in the rabbit liver show the same dose distribution, which also corroborates the accuracy of iRAI for relative 3D dose distribution mapping.

Despite the promising results achieved by the iRAI volumetric imaging system, there are still several limitations that could be addressed by future development of this technology. First, the sensitivity of iRAI in detecting the dose distribution should be improved. As demonstrated by our patient study, the high dose area can be mapped by iRAI volumetric imaging with high accuracy, while the lower dose intensity areas are still a challenge to image with the current system. Since multiple averaging is needed to achieve sufficient SNR for dose reconstruction, the current detection range is limited by the magnitude of the absolute dose delivered in the region of interest. To improve the detection sensitivity, not only the ultrasound array and the preamplifiers but also the system for signal digitization, processing, and image reconstruction should be further optimized. Second, volumetric dose distribution in deep tissues presented by the current imaging system is only semiquantitative, which provides relative dose measurements. The normalized color bar of each iRAI image indicates the relative dose instead of the absolute dose. To achieve iRAI imaging capable of providing absolute dose measurement, a protocol for comprehensive calibration is needed, which would consider the signal response of the imaging system, the temporal shape of the radiation pulse and the tissue properties (for example, physical density, speed of sound, coefficient of thermal expansion and specific heat capacity). This process has been demonstrated for photon and electron Cerenkov imaging with a corresponding uncertainty budget and could be applied here too42,43. Specifically, for iRAI, the tissue properties are different for each individual; they could, however, be measured by existing imaging methods such as CT, MRI and ultrasound, and this information could be incorporated into the reconstruction algorithm using artificial intelligence methods36,44,45. 
Third, the spatial resolution of the current imaging system is still limited. As demonstrated by the quantified imaging results, the axial resolution and the lateral resolution of the current system are 4 and 5 mm, respectively. This spatial resolution, although already better than the clinically realistic accuracy of 5 mm46, can be further improved. To accommodate the low frequency of the acoustic signal produced by the 4-µs duration of the radiation pulse, the custom-designed matrix array works at a central frequency of 350 kHz. In the future, when working with a radiation beam with a shorter pulse duration, transducers with higher center frequencies leading to higher spatial resolution can be used. Fourth, the current iRAI system is single-modality and cannot perform pulse-echo ultrasound imaging at the same time. This is because the preamplifier board of the current iRAI system is receive-only and cannot transmit ultrasound pulses. Moreover, the central frequency of the current 2D array is matched only to iRAI acquisition and cannot provide acceptable ultrasound imaging quality, which typically requires frequencies in the MHz range (roughly 1–3 MHz). In the future, powered by a well-designed preamplifier board and dual-frequency 2D matrix array enabling both receiving and transmission, iRAI and ultrasound volumetric imaging could be performed at the same time during RT so that both the 3D dose deposition and the tissue motion can be monitored simultaneously. Last, due to the limited bandwidth of the 2D matrix array, iRAI mostly images the edges of the radiation field, which also has consequences when aiming to assess the absolute dose delivery in 3D. Potential solutions can be learned from the well-developed photoacoustic imaging field by implementing better reconstruction algorithms and acquisition hardware47,48. 
In addition, as an ultrasound-based imaging modality, iRAI is applicable to organs compatible with ultrasound imaging (for example, liver, breast, prostate and cervix) and shares the limitations of ultrasound imaging within organs containing body cavities and bones.

In summary, this study describes an online iRAI volumetric imaging system that directly maps the dose deposition deep inside a human patient receiving a radiotherapy fraction without interrupting the treatment delivery. Despite the fact that both the sensitivity and the spatial resolution of iRAI could be further improved, the current system enabled these proof-of-concept experiments on phantoms, animals and especially human studies, demonstrating the feasibility of iRAI for clinical application during conventional RT by mapping the dose deposition for each treatment fraction. The iRAI system presented in this work also holds promise for applications in advanced RT modalities in online monitoring and accurate quantification of radiation dose deposition, such as real-time adaptive radiotherapy, FLASH RT and proton therapy.

Methods

iRAI acquisition system design

A clinically ready iRAI imaging system was adapted from our previous prototype iRAI and ultrasound dual-modality imaging system32, shown in Fig. 1a. To further improve the system sensitivity and add the volumetric imaging capability, the iRAI detector and amplification components were thoroughly redesigned to achieve real-time 3D imaging of deposited dose during RT. In this system, the radiation acoustic signals were detected by a custom-designed 2D planar matrix array (Imasonics, Inc.) with 32 × 32 = 1,024 elements (116.6 × 116.6 mm aperture), 3.45 × 3.45 mm element dimension and 0.2 mm kerf. The central frequency of 0.35 MHz, with 50% bandwidth, was chosen to match the power spectrum of the radiation acoustic signals generated by the approximately square, 4 µs X-ray pulse. This is crucial to enhance the SNR when detecting radiation acoustic signals so that highly sensitive dose mapping can be realized in real time. To further enhance the SNR, a custom-designed 1,024-channel preamplifier (AMP 1024-19-001, Photosound Technologies, Inc.) with 46 dB gain was fully integrated with the 2D matrix array, shown in Fig. 1b. This design avoided the cable connection between the transducer elements and the preamplifier and minimized the noise that could be introduced. The 2D matrix array with the integrated preamplifier board was driven by a 256-channel research ultrasound system with operation software v.4.4.0 (Vantage, Verasonics Inc.) via a 4 to 1 multiplexer, which was controlled by an Arduino microcontroller. The pulse trigger from the LINAC was precisely controlled by a delay generator and synchronized with the multiplexer and the ultrasound system. A full acquisition across the 1,024 channels was completed every four LINAC triggers. The iRAI images were displayed with 25 times averaging to further improve the SNR.
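
The numbers in this paragraph are related by simple arithmetic, sketched below. The element count, element size, kerf, channel counts and averaging factor come from the text; the function names are hypothetical, and the LINAC pulse repetition rate is left as an input because the source does not state it.

```python
# Values quoted in the paragraph above.
ELEMENTS_PER_SIDE = 32
ELEMENT_SIZE_MM = 3.45
KERF_MM = 0.2
ARRAY_CHANNELS = ELEMENTS_PER_SIDE ** 2  # 1,024 elements
SYSTEM_CHANNELS = 256                    # research ultrasound system
FRAMES_AVERAGED = 25


def aperture_mm(n=ELEMENTS_PER_SIDE, element=ELEMENT_SIZE_MM, kerf=KERF_MM):
    """Side length of the array: n elements separated by n - 1 kerfs."""
    return n * element + (n - 1) * kerf


def pulses_per_frame(array_channels=ARRAY_CHANNELS,
                     system_channels=SYSTEM_CHANNELS,
                     averages=FRAMES_AVERAGED):
    """LINAC pulses consumed by one displayed image.

    With a 4 to 1 multiplexer, each full acquisition needs
    array_channels / system_channels triggers.
    """
    triggers_per_acquisition = array_channels // system_channels
    return triggers_per_acquisition * averages


def frame_rate_hz(pulse_rate_hz):
    """Displayed frame rate for a given (assumed) LINAC pulse rate."""
    return pulse_rate_hz / pulses_per_frame()


side = aperture_mm()        # about 116.6 mm
n_pulses = pulses_per_frame()  # 4 triggers x 25 averages
```

This reproduces the quoted 116.6 mm aperture and the 100 pulses per displayed image; the achievable frame rate then scales with whatever pulse rate the LINAC delivers.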

iRAI system performance calibration

To verify the performance of the newly developed 3D iRAI imaging system based on the 2D matrix array, a resolution calibration with a 6-MV static beam from a clinical LINAC (TrueBeam, Varian Medical Systems Inc.) was performed. As shown in Fig. 1c, a cylindrical lard phantom in a 15 cm diameter plastic jar was made as a reference for calibration. The bottom part of the jar was removed and coupled with the surface of the 2D matrix array using ultrasound coupling gel. To calibrate the lateral resolution, a 5 × 5 mm radiation beam field was delivered by the LINAC, targeted to the middle of the lard phantom. The beam to array distance through the lard was approximately 10 cm. The axial resolution was calibrated using an LSF extracted from the front edge of a 1 × 3 cm beam. To verify the performance of the system in measuring the size of the radiation beam in 3D, radiation beams with different sizes irradiated the phantom from above. The size of the beam along the lateral direction of the 2D array was kept at 1 cm, while the size along the axial direction was changed from 1 to 5 cm, with increments of 1 cm, shaped by controlling the multi-leaf collimator of the LINAC. Five independent iRAI volumetric images of different beam sizes were acquired for further statistical analysis.

Mapping the dose distribution and temporal dose accumulation

To verify the feasibility of this imaging system in mapping the dose deposition and monitoring the temporal dose accumulation during a radiation treatment, a treatment plan with a C-shaped dose distribution was created, following a clinical protocol. The radiation treatment was delivered to the same cylindrical lard phantom previously described. The 3D conformal radiation treatment (3D CRT), shown in Fig. 3a, consisted of 23 beam angles delivered with a maximum dose of 7 Gy by a TrueBeam accelerator (Varian Medical Systems) using a 6-MV flattening filter-free beam. During the radiation delivery, the isocenter of the treatment was aligned with the geometrical center of the phantom. Two different experiments were performed based on this C-shaped target treatment plan to evaluate both the dose distribution mapping and the temporal dose accumulation monitoring. To assess the mapping of the dose deposition of each planned beam, the radiation-induced acoustic signals were continuously acquired during the dose delivery and then processed by a delay-and-sum image reconstruction algorithm in MATLAB 2020a (Mathworks). Once the dose delivery was completed, the acquired acoustic signals from each beam were combined coherently by summing the signals from each pulse and each element to form an iRAI image for the whole treatment plan. An envelope was formed along the normal direction of the 2D matrix array after the delay-and-sum reconstruction. Five independent iRAI volumetric image acquisitions of the same treatment plan were acquired for further statistical analysis. For monitoring the temporal dose accumulation, the iRAI image was reconstructed and displayed during the radiation beam delivery, averaging every 25 full acquisitions (equivalent to 100 radiation pulses). The online displayed image was shown in two formats: (1) total accumulated dose; and (2) the delivered dose between two consecutive reconstruction time points. Three independent iRAI volumetric images were acquired of the same treatment plan for further statistical analysis.
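
As an illustration of the delay-and-sum step mentioned above, the following is a minimal sketch, not the authors' MATLAB implementation. It assumes one-way propagation (the acoustic wave is launched at the radiation pulse and travels from the voxel to each element), a uniform speed of sound of 1,500 m/s, and a 5 MHz sampling rate. These values, the function name `delay_and_sum` and the toy geometry are all assumptions made for the example. Envelope detection and averaging over pulses are omitted.

```python
import math


def delay_and_sum(signals, sensor_xyz, voxel_xyz, fs_hz, c_m_s=1500.0):
    """One-way delay-and-sum.

    signals[k][n] is sample n of sensor k, with n = 0 at the radiation pulse.
    Returns one reconstructed value per voxel.
    """
    n_samples = len(signals[0])
    image = []
    for v in voxel_xyz:
        acc = 0.0
        for trace, s in zip(signals, sensor_xyz):
            # Time of flight from voxel to sensor, converted to a sample index.
            idx = round(math.dist(v, s) / c_m_s * fs_hz)
            if idx < n_samples:
                acc += trace[idx]
        image.append(acc / len(signals))
    return image


# Toy check: a point source 10 cm above a 5 x 5 grid of sensors (1 cm spacing).
fs = 5e6
sensors = [(x * 0.01, y * 0.01, 0.0) for x in range(-2, 3) for y in range(-2, 3)]
source = (0.0, 0.0, 0.10)
signals = []
for s in sensors:
    trace = [0.0] * 1024
    trace[round(math.dist(source, s) / 1500.0 * fs)] = 1.0  # ideal impulse arrival
    signals.append(trace)

voxels = [(0.0, 0.0, z / 100) for z in range(5, 16)]  # depths from 5 to 15 cm
image = delay_and_sum(signals, sensors, voxels, fs)
best = voxels[image.index(max(image))]
print(best)
```

In this toy case the reconstructed maximum falls on the voxel at the source depth, because only there do the per-sensor delays line up with the recorded arrivals.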

Mapping dose deposition of a treatment in an animal model

Animal experiments were performed using a rabbit model to examine the feasibility of iRAI in mapping the dose deposition during RT in vivo with a clinical treatment plan. A photograph of the imaging setup is shown in Fig. 1d. All the animal experiments were approved by the University of South Florida Research Integrity and Compliance Institutional Animal Care and Use Committee (Combined Radiation Acoustics and Ultrasound Imaging for Real-Time Guidance in Radiotherapy, IS00008026). Two 6-month-old female New Zealand white rabbits (4.5–5 kg), ordered from Charles River, were involved in this study. CT scanning (CT simulation) of these two rabbits was performed as input into the treatment planning system (Raystation 11A, RaySearch Laboratories). The treatment plan consisted of four 6-MV flattening filter-free 3 × 3 cm beams at various gantry angles (30, 40, 320 and 340°) along the anterior plane of the rabbit with the liver placed at isocenter, giving a maximum dose of 5.36 Gy for each fraction.

During the experiment, anesthesia was induced using ketamine (40 mg kg−1) via intramuscular injection and maintained with 1.5% isoflurane and oxygen using a V-Gel (J1350D, Jorgensen Laboratories) and Matrx vaporizer (MidMark Corporation). Vitals (heart rate, respiratory rate, oxygen saturation and body temperature) were continuously monitored using a SurgiVet Advisor vital signs monitor (Smiths Medical) to ensure animal safety and to evaluate the anesthesia level. An adjustable water-circulating heating pad (TP-700, Stryker Corporation) was used to keep the body temperature stable. The detection surface of the 2D matrix array directly faced the isocenter and was positioned parallel to the anterior plane of the rabbits, which were in the supine position with the head toward the gantry. A water-filled balloon was used for acoustic coupling between the rabbit abdomen and the array surface, as shown in Fig. 1e. The clearance distance between the isocenter and the array surface was 15 cm. A CBCT scan was performed before the treatment for image guidance during the positioning setup and, subsequently, three consecutive treatment fractions were performed to deliver the dose to the rabbit liver and imaged by iRAI for statistical analysis. Animals were euthanized immediately after the last treatment.

Mapping dose deposition of a treatment plan in a cancer patient case

This human patient study was conducted to further evaluate the clinical feasibility of iRAI in mapping dose deposition in a treatment fraction. The study was approved by the institutional review board of the University of Michigan (UMCC 2017.160 Pilot Study of Combined Radiation Acoustics and Ultrasound Imaging for Guidance in Radiotherapy, HUM00139322). Informed consent was obtained after the nature and possible consequences of the studies were explained. A 60-year-old man diagnosed with liver metastasis was treated in this study. To minimize the interference for RT, the treatment plan for each fraction was divided into two parts. The first part was for iRAI imaging and consisted of 2.087 and 0.877 Gy beams delivered in the superior and inferior anterior directions, respectively. Two anterior beams with an angle of 60° formed a diamond-shaped dose in the central part of the liver, where the tumor was located. The second part was a volumetric modulated arc therapy (VMAT) plan to ensure that the total delivered dose met the clinical requirements. The 3D beam arrangements of the treatment plan are shown in Supplementary Fig. 5. Specifically, CT simulation included a 4D CT and a breath hold 40 s delay contrast scan. The contrast scan was fused with the 4D CT, and a gross tumor volume and internal target volume were made to include the respiratory motion of the tumor. A margin of 5 mm in the axial plane and 8 mm superior and inferior was applied to the internal target volume to make the PTV. The prescribed radiation dose was 54 Gy in total, delivered in three 18 Gy fractions to the PTV. The PTV volume receiving 100% of the prescribed dose (V100%) was 98.5% and the minimum dose to 100% of the PTV volume (D100%) was 90.1%. The treatment plan went through a standard optimization process. All standard organ at risk limits in the treatment plan directive were met. 
The beam arrangement consisted of one axial VMAT arc that delivered 89% of the prescription and two sagittal static fields that delivered 4.8 and 6.2%. The static fields were selected to avoid the transducer and optimized to keep the dose within organ-at-risk limits, as shown in the dose volume histograms (DVHs) in Supplementary Fig. 5d. Treatment delivery used standard CBCT-based IGRT followed by delivery of the axial arc. There was no iRAI imaging during VMAT. After the axial arc was treated and the couch rotated 90°, iRAI was used on the two sagittal static beams, as seen in Fig. 5. The two beams were 6-MV X-ray beams using the flattening filter-free (FFF) mode. The anterior field delivered 141 monitor units and the inferior beam used 187 monitor units at a dose rate of 1,400 monitor units per min.

During the iRAI imaging, the 2D matrix array was held by a homemade mechanical arm, which provided four degrees of freedom. The arm was directly attached to a mobile cart, which carried all the electronic devices, shown in Fig. 5a. To locate the targeted area in the central axis of the field of view, the geometric center of the 2D matrix array was set 4 cm above the isocenter. For acoustic coupling, a water-filled balloon, with ultrasound coupling gel applied to its surface, was directly attached to the surface of the array. The other side of the balloon touched the skin of the abdomen with light pressure. The total distance between the 2D matrix array and the center of the target was set to 17 cm.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The authors declare that the data supporting the findings of this study are available within the paper and its supplementary information files. The imaging raw data from the acquisition device are available from University of Michigan Deep Blue Data (https://doi.org/10.7302/g05r-5a43).

Code availability

The codes for data collection and data processing are available from University of Michigan Deep Blue Data (https://doi.org/10.7302/g05r-5a43).

References

  1. Jaffray, D. A. & Gospodarowicz, M. K. in Cancer: Disease Control Priorities 3rd edn, Vol. 3 (eds Gelband, H. et al.) (The International Bank for Reconstruction and Development/The World Bank, 2015).

  2. Liauw, S. L., Connell, P. P. & Weichselbaum, R. R. New paradigms and future challenges in radiation oncology: an update of biological targets and technology. Sci. Transl. Med. 5, 173sr172 (2013).

  3. Wambersie, A. ICRU Report 62, Prescribing, recording and reporting photon beam therapy (supplement to ICRU Report 50). ICRU News (1999).

  4. ICRU Report 50: Prescribing, Recording and Reporting Photon Beam Therapy (International Commission on Radiation Units and Measurements, 1993).

  5. Bucci, M. K., Bevan, A. & Roach, M. 3rd Advances in radiation therapy: conventional to 3D, to IMRT, to 4D, and beyond. CA Cancer J. Clin. 55, 117–134 (2005).

  6. Ling, C. C., Yorke, E. & Fuks, Z. From IMRT to IGRT: frontierland or neverland? Radiother. Oncol. 78, 119–122 (2006).

  7. Sandler, H. M. et al. Three dimensional conformal radiotherapy for the treatment of prostate cancer: low risk of chronic rectal morbidity observed in a large series of patients. Int. J. Radiat. Oncol. Biol. Phys. 33, 797–801 (1995).

  8. Lin, C. et al. Effect of radiotherapy techniques (IMRT vs. 3D-CRT) on outcome in patients with intermediate-risk rhabdomyosarcoma enrolled in COG D9803—a report from the Children’s Oncology Group. Int. J. Radiat. Oncol. Biol. Phys. 82, 1764–1770 (2012).

  9. Ezzell, G. A. et al. Guidance document on delivery, treatment planning, and clinical implementation of IMRT: report of the IMRT Subcommittee of the AAPM Radiation Therapy Committee. Med. Phys. 30, 2089–2115 (2003).

  10. Xing, L. et al. Overview of image-guided radiation therapy. Med. Dosim. 31, 91–112 (2006).

  11. Sterzing, F., Engenhart-Cabillic, R., Flentje, M. & Debus, J. Image-guided radiotherapy: a new dimension in radiation oncology. Dtsch Arztebl. Int. 108, 274–280 (2011).


  12. Van Herk, M. Errors and margins in radiotherapy. Semin. Radiat. Oncol. 14, 52–64 (2004).

  13. Bortfeld, T., Jokivarsi, K., Goitein, M., Kung, J. & Jiang, S. B. Effects of intra-fraction motion on IMRT dose delivery: statistical analysis and simulation. Phys. Med. Biol. 47, 2203–2220 (2002).

  14. Brock, K. K. & Dawson, L. A. Adaptive management of liver cancer radiotherapy. Semin. Radiat. Oncol. 20, 107–115 (2010).

  15. Kron, T. Reduction of margins in external beam radiotherapy. J. Med. Phys. 33, 41 (2008).

  16. Marks, L. B. et al. Use of normal tissue complication probability models in the clinic. Int. J. Radiat. Oncol. Biol. Phys. 76, S10–S19 (2010).

  17. Thomas, T. O. et al. The tolerance of gastrointestinal organs to stereotactic body radiation therapy: what do we know so far? J. Gastrointest. Oncol. 5, 236–246 (2014).


  18. Ten Haken, R. K., Balter, J. M., Marsh, L. H., Robertson, J. M. & Lawrence, T. S. Potential benefits of eliminating planning target volume expansions for patient breathing in the treatment of liver tumors. Int. J. Radiat. Oncol. Biol. Phys. 38, 613–617 (1997).

  19. Choi, J.-H., Seo, D.-W., Park, D. H., Lee, S. K. & Kim, M.-H. Fiducial placement for stereotactic body radiation therapy under only endoscopic ultrasonography guidance in pancreatic and hepatic malignancy: practical feasibility and safety. Gut and Liver 8, 88–93 (2014).

  20. Giraud, P. & Houle, A. Respiratory gating for radiotherapy: main technical aspects and clinical benefits. ISRN Pulmonology 2013, 13 (2013).

  21. De Los Santos, J. et al. Image guided radiation therapy (IGRT) technologies for radiation therapy localization and delivery. Int. J. Radiat. Oncol. Biol. Phys. 87, 33–45 (2013).

  22. Balter, J. M. & Cao, Y. Advanced technologies in image-guided radiation therapy. Semin. Radiat. Oncol. 17, 293–297 (2007).

  23. Keall, P. et al. On the use of EPID‐based implanted marker tracking for 4D radiotherapy. Med. Phys. 31, 3492–3499 (2004).

  24. Berbeco, R. I., Neicu, T., Rietzel, E., Chen, G. T. & Jiang, S. B. A technique for respiratory-gated radiotherapy treatment verification with an EPID in cine mode. Phys. Med. Biol. 50, 3669–3679 (2005).

  25. Chinnaiyan, P., Tomé, W., Patel, R., Chappell, R. & Ritter, M. 3D-ultrasound guided radiation therapy in the post-prostatectomy setting. Technol. Cancer Res. Treat. 2, 455–458 (2003).

  26. Kerkmeijer, L. G. W. et al. The MRI-linear accelerator consortium: evidence-based clinical introduction of an innovation in radiation oncology connecting researchers, methodology, data collection, quality assurance, and technical development. Front. Oncol. https://doi.org/10.3389/fonc.2016.00215 (2016).

  27. Liu, H. & Wu, Q. Evaluations of an adaptive planning technique incorporating dose feedback in image-guided radiotherapy of prostate cancer. Med. Phys. 38, 6362–6370 (2011).

  28. Mijnheer, B., Beddar, S., Izewska, J. & Reft, C. In vivo dosimetry in external beam radiotherapy. Med. Phys. 40, 070903 (2013).

  29. Islam, M. K. et al. An integral quality monitoring system for real-time verification of intensity modulated radiation therapy. Med. Phys. 36, 5420–5428 (2009).

  30. Poppe, B. et al. Clinical performance of a transmission detector array for the permanent supervision of IMRT deliveries. Radiother. Oncol. 95, 158–165 (2010).

  31. Johnson, D., Weston, S. J., Cosgrove, V. P. & Thwaites, D. I. A simple model for predicting the signal for a head‐mounted transmission chamber system, allowing IMRT in‐vivo dosimetry without pretreatment linac time. J. Appl. Clin. Med. Phys. 15, 270–279 (2014).

  32. Zhang, W. et al. Dual-Modality X-Ray-induced radiation acoustic and ultrasound imaging for real-time monitoring of radiotherapy. BME Frontiers 2020, 9853609 (2020).

  33. Xiang, L., Tang, S., Ahmad, M. & Xing, L. High resolution X-ray-induced acoustic tomography. Sci Rep. 6, 26118 (2016).

  34. Oraiqat, I. et al. An ionizing radiation acoustic imaging (iRAI) technique for real-time dosimetric measurements for FLASH radiotherapy. Med. Phys. 47, 5090–5101 (2020).

  35. Lei, H. et al. Toward in vivo dosimetry in external beam radiotherapy using X-ray acoustic computed tomography: a soft-tissue phantom study validation. Med. Phys. https://doi.org/10.1002/mp.13070 (2018).

  36. Hickling, S. et al. Ionizing radiation-induced acoustics for radiotherapy and diagnostic radiology applications. Med. Phys. 45, e707–e721 (2018).

  37. Hickling, S., Hobson, M. & El Naqa, I. Characterization of X-ray acoustic computed tomography for applications in radiotherapy dosimetry. IEEE Trans. Radiat. Plasma Med. Sci. 2, 337–344 (2018).

  38. El Naqa, I., Pogue, B. W., Zhang, R., Oraiqat, I. & Parodi, K. Image guidance for FLASH radiotherapy. Med. Phys. 49, 4109–4122 (2022).

  39. Sothmann, T., Blanck, O., Poels, K., Werner, R. & Gauer, T. Real time tracking in liver SBRT: comparison of CyberKnife and Vero by planning structure-based γ-evaluation and dose-area-histograms. Phys. Med. Biol. 61, 1677 (2016).

  40. Fuss, M. & Salter, B. J. Intensity-modulated radiosurgery: improving dose gradients and maximum dose using post inverse-optimization interactive dose shaping. Technol. Cancer Res. Treat. 6, 197–203 (2007).

  41. Oku, Y. et al. Analysis of suitable prescribed isodose line fitting to planning target volume in stereotactic body radiotherapy using dynamic conformal multiple arc therapy. Pract. Radiat. Oncol. 2, 46–53 (2012).

  42. Zlateva, Y., Muir, B. R., El Naqa, I. & Seuntjens, J. P. Cherenkov emission‐based external radiotherapy dosimetry: I. Formalism and feasibility. Med. Phys. 46, 2370–2382 (2019).

  43. Zlateva, Y., Muir, B. R., Seuntjens, J. P. & El Naqa, I. Cherenkov emission‐based external radiotherapy dosimetry: II. Electron beam quality specification and uncertainties. Med. Phys. 46, 2383–2393 (2019).

  44. Wang, G., Ye, J. C. & De Man, B. Deep learning for tomographic image reconstruction. Nat. Mach. Intell. 2, 737–748 (2020).

  45. Gröhl, J., Schellenberg, M., Dreher, K. & Maier-Hein, L. Deep learning for biomedical photoacoustic imaging: a review. Photoacoustics 22, 100241 (2021).

  46. Van Dyk, J., Battista, J. J. & Bauman, G. S. in The Modern Technology of Radiation Oncology Vol. 3 (ed. Van Dyk, J.) 361–412 (Medical Physics Publishing, 2013).

  47. Ku, G., Wang, X., Stoica, G. & Wang, L. V. Multiple-bandwidth photoacoustic tomography. Phys. Med. Biol. 49, 1329 (2004).

  48. Gutta, S. et al. Deep neural network-based bandwidth enhancement of photoacoustic data. J. Biomed. Opt. 22, 116001 (2017).


Acknowledgements

We thank the staff in animal care facilities in Moffitt Cancer Center for helping us handle the rabbits during the experiment and the staff in Department of Radiation Oncology at the University of Michigan for cooperating in the patient study. This work was supported by National Cancer Institute grant no. NIH R37CA222215 (I.E.N.), National Cancer Institute grant no. P30CA046592 and the Michigan Institute for Clinical and Health Research under grant no. UL1TR002240 (W.Z.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Author notes

  1. These authors contributed equally: Wei Zhang, Ibrahim Oraiqat, Dale Litzenberg.

Authors and Affiliations

  1. Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI, USA

    Wei Zhang, Kai-Wei Chang, Paul L. Carson & Xueding Wang

  2. Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, USA

    Ibrahim Oraiqat & Issam El Naqa

  3. Department of Radiation Oncology, University of Michigan, Ann Arbor, MI, USA

    Dale Litzenberg, Scott Hadley, Martha M. Matuszak, Kyle C. Cuneo & Issam El Naqa

  4. Department of Nuclear Engineering, University of Michigan, Ann Arbor, MI, USA

    Noora Ba Sunbul & Martha M. Matuszak

  5. Department of Radiation Oncology, Moffitt Cancer Center, Tampa, FL, USA

    Christopher J. Tichacek, Eduardo G. Moros & Issam El Naqa

  6. Department of Radiology, University of Michigan, Ann Arbor, MI, USA

    Paul L. Carson & Xueding Wang

Contributions

I.E.N., X.W. and K.C.C. generated the idea and designed the experiments. E.G.M. and P.L.C. were involved in the optimization of the experimental design and progress discussion. W.Z., I.O., D.L., K.-W.C. and N.B.S. performed the experiments. S.H., M.M.M. and C.J.T. handled the treatment plan. W.Z. wrote the initial draft of the manuscript. All authors were involved in the data analysis and critical revision of the manuscript.

Corresponding authors

Correspondence to Kyle C. Cuneo, Xueding Wang or Issam El Naqa.

Ethics declarations

Competing interests

The following authors have previously disclosed a patent application (no. WO2020227719) that is relevant to this manuscript: I.E.N., X.W., P.L.C., K.C.C., W.Z. and I.O. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Mohamed Abazeed, Julie Lascaud and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, W., Oraiqat, I., Litzenberg, D. et al. Real-time, volumetric imaging of radiation dose delivery deep into the liver during cancer treatment.
Nat Biotechnol (2023). https://doi.org/10.1038/s41587-022-01593-8


scChIX-seq infers dynamic relationships between histone modifications in single cells

Main

Gene expression in animals relies on epigenetic marks such as histone modifications to regulate the accessibility and function of the genome in different cell types1. Large-scale efforts characterizing different histone modifications in a variety of cell populations commonly use chromatin immunoprecipitation followed by sequencing (ChIP–seq)2,3,4,5,6,7,8. Alternative strategies to ChIP–seq based on enzyme tethering (chromatin immunocleavage, ChIC) have reduced the background signal in profiling the epigenome9, and have enabled single-cell profiling of histone modifications8,10,11,12,13,14,15,16,17,18,19. Tethering strategies involve incubating cells with an antibody against a histone modification of interest, which then tethers either protein A-MNase10,12,18,19 or protein A-Tn511,13,14,15,16,17 fusion protein to generate targeted DNA fragments in single cells. However, most experimental techniques to map single-cell histone modifications are limited to only one histone modification per single cell.

We present an integrated experimental and computational framework for multiplexing histone modifications in single cells. To profile two histone modifications in single cells (Fig. 1a), we first generate three genome-wide sortChIC18 datasets: two datasets by incubating cells with one of the two histone modification antibodies separately (single-incubated; Fig. 1b), and the third by incubating cells with both histone modification antibodies together (double-incubated; Fig. 1b). We then use our two single-incubated datasets as training data to generate the possible pairs of genome-wide histone modification profiles that, when added together, fit a single-cell profile from the double-incubated dataset (Fig. 1c). For each double-incubated cell, we then deconvolve the multiplexed data by probabilistically assigning each fragment back to its respective histone modification.

Fig. 1: Overview of the scChIX-seq method.

a, Chromatin in different cell types (different colored cells) is regulated in part through several histone modifications (two histone modifications shown as an example). b, scChIX-seq uses three sortChIC antibody incubation conditions: two conditions each target a single histone modification (single-incubated) only and the third condition targets both histone modifications simultaneously (double-incubated). c, Schematic of scChIX-seq for deconvolving multiplexed histone modifications. The two single-incubated sortChIC datasets (one targeting an orange histone modification, the other a blue modification; each modification reveals three clusters) are training data to define the possible pairs of histone modification distributions that can be combined to generate a hypothetical double-incubated cell. For each observed double-incubated cell, we then assign the cell to the most probable pair of cell states, one from each histone modification. We then probabilistically assign each pA-MNase cut to its respective histone modification. Cartoons represent genome-wide distribution of histone modification signals in different modifications and cell types; x axes represent genomic distance, and vertical ticks are arbitrary distance markers. d, Label transfer allows joint analysis of two single-incubated sortChIC datasets targeting functionally distinct histone modifications. Information derived from one histone modification, such as cell types, histone mark levels and pseudotime, can be transferred to another histone modification using the double-incubated cells as a link. e, Simulation study shows that scChIX-seq can unbiasedly assign reads to each mark regardless of the amount of overlap between the two marks across the genome. x axis of cartoon genome-wide distributions (middle-left) is genomic distance. Right: ground truth probabilities versus inferred probabilities from scChIX. 
p is the expected fraction of double-incubated reads in a genomic locus that belongs to mark 1. \(\hat{p}\) is the estimate of the probability; n = 101 simulation datapoints spread evenly between 0 and 1 inclusive. Error bars are 95% CI, centers are the mean.

scChIX-seq links single-cell maps of different histone modifications, revealing relationships between histone modifications in single cells. In these linked maps, information derived from one chromatin state, such as cell types, histone mark levels and pseudotimes, can transfer to another chromatin state (Fig. 1d), unlocking joint analysis of several histone modifications in single cells. We first validated scChIX-seq using simulation, purified blood cell types and whole bone marrow. We then applied scChIX-seq to two complex biological systems, one in mouse organogenesis to uncover orthogonal dynamics in H3K36me3 and H3K9me3, and the other in macrophage in vitro differentiation to reveal coordinated dynamics between H3K4me1 and H3K36me3.

Results

Benchmarking across histone modification relationships

To test whether scChIX-seq is accurate for histone modification patterns that are mutually exclusive as well as highly overlapping, we apply scChIX-seq to simulated single-cell data with known amounts of overlap to benchmark our method across different overlapping patterns between histone modifications. We simulate single-cell histone modification data by modifying simATAC20 to generate sparse count data from different overlapping patterns from the same cell (Fig. 1e and Extended Data Fig. 1a,b; Methods). Our simulations span three scenarios to cover varying degrees of overlapping patterns (Extended Data Fig. 1c): (1) a mutually exclusive scenario with only 1% of loci overlapping; (2) an intermediate scenario with 50% of loci overlapping; and (3) a correlated scenario with 99% of loci overlapping. In these simulations, we provide a ground truth parameter p for each genomic locus and then estimate this parameter using our statistical framework to assess the uncertainty in our inferences. Here, p is the expected fraction of double-incubated reads in a locus that belongs to a reference histone modification (that is, p = 0.5 if the locus is exactly overlapping, p = 1 or 0 if the locus is exactly mutually exclusive). Applying scChIX-seq to each scenario, we find that the distribution of our estimates \(\hat{p}\) across all loci is comparable with the ground truth distribution of p (Extended Data Fig. 1c,d). Furthermore, scChIX-seq accurately recovers the different cell types underlying the simulated data, and links the two histone modification landscapes into a joint uniform manifold approximation and projection (UMAP) (Extended Data Fig. 1e). Summarizing the three scenarios, scChIX-seq can estimate p accurately for all degrees of overlap, with confidence intervals (CI) narrower than \(\hat{p} \pm 0.05\) (Fig. 1e (right) and Extended Data Fig. 1f). 
Our simulation study confirms that scChIX-seq is accurate in inferring several histone modifications in single cells in both mutually exclusive as well as overlapping histone modification patterns.
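
To make the parameter p concrete, the following minimal sketch computes p for a locus from two hypothetical per-mark signal rates and then estimates it from simulated reads whose origin is known. This is not the scChIX-seq inference itself, where read origins are unobserved and must be inferred from the model. The rates, read depth, function names and the normal-approximation 95% interval are all assumptions chosen for illustration.

```python
import math
import random


def true_p(rate_mark1, rate_mark2):
    """Expected fraction of reads at a locus that belong to mark 1."""
    return rate_mark1 / (rate_mark1 + rate_mark2)


def estimate_p(n_reads, p, rng):
    """Simulate n_reads labelled reads and return (p_hat, 95% CI half-width)."""
    k = sum(rng.random() < p for _ in range(n_reads))
    p_hat = k / n_reads
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n_reads)
    return p_hat, half_width


rng = random.Random(0)
scenarios = {
    "exactly overlapping": (1.0, 1.0),   # p = 0.5
    "mark 1 only": (1.0, 0.0),           # p = 1
    "mark 2 only": (0.0, 1.0),           # p = 0
}
results = {}
for label, (a, b) in scenarios.items():
    p = true_p(a, b)
    results[label] = (p, *estimate_p(2000, p, rng))
    print(label, results[label])
```

The three toy loci reproduce the limiting cases given in the text: p = 0.5 for an exactly overlapping locus and p = 1 or 0 for a mutually exclusive one.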

Validating with ground truth data from purified cell types

To validate our method experimentally, we generate a ground truth sortChIC dataset by purifying three known cell types from mouse bone marrow: B cells, granulocytes and natural killer (NK) cells, using fluorescence-activated cell sorting (FACS) and applying scChIX-seq (Methods). Of note, the sortChIC method is designed to integrate FACS with histone modification mapping18, so we can enrich for a cell type and map histone modifications in one workflow. We split bone marrow cells into three technical batches: one batch incubated with anti-H3K27me3 antibody alone (single-incubated), one with anti-H3K9me3 alone (single-incubated) and the third with both anti-H3K27me3 and anti-H3K9me3 antibodies together (double-incubated, H3K27me3+H3K9me3). We then sorted cells into 384-well plates, each plate containing all three cell types, and generated targeted cut fragments (Extended Data Fig. 2a,b). We chose H3K27me3 and H3K9me3 because they have been shown to have a mutually exclusive relationship21, allowing us to verify whether we can infer the correct cell type as well as the generally mutually exclusive relationship. Of note, although H3K27me3 and H3K9me3 are known to be nonoverlapping, it is unclear how this relationship precisely changes to make cell-type-specific patterns at different loci, and therefore modeling the two relationships is still needed to accurately infer the two chromatin profiles in individual cells.

From the double-incubated data alone, we would not know which cut fragments correspond to H3K27me3 and which to H3K9me3, but would observe only a superposition of the two profiles. We therefore used the single-incubated sortChIC data to train a statistical model of how cells from the same cell type combine their H3K27me3 and H3K9me3 profiles to generate double-incubated cut fragments. This model was then used to deconvolve the single-cell multiplexed signal into their respective histone modifications (Methods).

To learn an interpretable latent space for H3K27me3 and H3K9me3, we applied latent Dirichlet allocation (LDA)22,23 to the single-incubated H3K27me3 and H3K9me3 datasets, which factorizes count matrices based on a multinomial model (Methods; Extended Data Fig. 2c,d). LDA learns cell-type-specific vectors of probabilities. These parameters model the probability that a cut fragment would fall into a specific genomic region. These probabilities can therefore be interpreted as genome-wide histone modification distributions that depend on cell type, and each cell generates a high-dimensional sparse count vector with n total fragments by drawing n independent trials from these multinomial distributions.
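
The generative view described above, in which a cell's count vector is n independent draws from a cell-type-specific distribution over genomic regions, can be sketched in a few lines. The four-bin probability vector and the function name `draw_cell` are hypothetical and only illustrate the multinomial sampling step, not the LDA fit.

```python
import random
from collections import Counter


def draw_cell(bin_probs, n_fragments, rng):
    """Draw n_fragments cut fragments from a multinomial over genomic bins."""
    bins = rng.choices(range(len(bin_probs)), weights=bin_probs, k=n_fragments)
    counts = Counter(bins)
    return [counts.get(i, 0) for i in range(len(bin_probs))]


rng = random.Random(1)
# Hypothetical cell-type-specific probabilities over four genomic bins.
bin_probs = [0.5, 0.3, 0.2, 0.0]
counts = draw_cell(bin_probs, n_fragments=20, rng=rng)
print(counts, sum(counts))
```

With genome-wide bins and a few thousand fragments per cell, most entries of such a vector are zero, which is why the count data are sparse.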

Demultiplexing the double-incubated data involves two steps. First, we used the training data to infer which genome-wide H3K27me3 distribution was added to which H3K9me3 distribution to generate a linear combination of two distributions (H3K27me3+H3K9me3). Second, we probabilistically assigned each double-incubated cut fragment to either H3K27me3 or H3K9me3, given that we know the underlying linear combination of the two profiles.
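
The two steps above can be sketched on a toy example. This is a simplified illustration under stated assumptions, not the published implementation: the mixing weight `w` (the fraction of a cell's fragments expected from the first mark), the grid search over `w`, the four-bin cluster distributions and all function names are hypothetical, and the actual model is described in the paper's Methods.

```python
import math
import random


def mixture(a, b, w):
    """Linear combination of two genome-wide distributions."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]


def log_lik(counts, probs):
    """Multinomial log-likelihood of the counts, up to an additive constant."""
    total = 0.0
    for c, q in zip(counts, probs):
        if c == 0:
            continue
        if q <= 0.0:
            return float("-inf")
        total += c * math.log(q)
    return total


def best_pair(counts, dists_1, dists_2, weights=(0.2, 0.35, 0.5, 0.65, 0.8)):
    """Step 1: pick the pair of cluster distributions (and w) that best fits a cell."""
    candidates = ((log_lik(counts, mixture(a, b, w)), i, j, w)
                  for i, a in enumerate(dists_1)
                  for j, b in enumerate(dists_2)
                  for w in weights)
    _, i, j, w = max(candidates)
    return i, j, w


def assignment_probs(a, b, w):
    """Per-bin probability that a fragment came from mark 1."""
    out = []
    for x, y in zip(a, b):
        denom = w * x + (1 - w) * y
        out.append(w * x / denom if denom > 0 else 0.5)
    return out


def unmix(counts, p_mark1, rng):
    """Step 2: assign each fragment to mark 1 with probability p, else mark 2."""
    mark1 = [sum(rng.random() < p for _ in range(c)) for c, p in zip(counts, p_mark1)]
    mark2 = [c - m for c, m in zip(counts, mark1)]
    return mark1, mark2


# Toy training distributions: two clusters per mark over four genomic bins.
mark1_clusters = [[0.7, 0.3, 0.0, 0.0], [0.3, 0.7, 0.0, 0.0]]
mark2_clusters = [[0.0, 0.0, 0.7, 0.3], [0.0, 0.0, 0.3, 0.7]]
double_counts = [35, 15, 15, 35]  # one double-incubated cell

i, j, w = best_pair(double_counts, mark1_clusters, mark2_clusters)
p = assignment_probs(mark1_clusters[i], mark2_clusters[j], w)
cuts_1, cuts_2 = unmix(double_counts, p, random.Random(0))
print((i, j, w), p, cuts_1, cuts_2)
```

In this toy case the two marks occupy disjoint bins, so every per-bin probability is 0 or 1 and the split is deterministic. Where the two distributions overlap, the per-bin probability lies strictly between 0 and 1 and individual fragments are assigned at random with that probability.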

The deconvolved H3K27me3+H3K9me3 data generated two sets of cuts for each cell: one set coming from H3K27me3 and the other from H3K9me3. We projected the two sets of cuts onto the H3K27me3 or H3K9me3 latent space (learned from LDA), respectively (Fig. 2a). Since each deconvolved cell has a set of cuts in H3K27me3 and H3K9me3 simultaneously, we can link the UMAPs together, creating a joint chromatin regulation space (Fig. 2a).

Fig. 2: scChIX-seq accurately deconvolves multiplexed histone modifications in single cells.

a, UMAP representation of the H3K27me3 (n = 367) and H3K9me3 (n = 376) histone modification space derived from the two single-incubated datasets (right two panels), and the H3K27me3+H3K9me3 space (left panel, n = 290) derived from the double-incubated data. Cells are colored by their ground truth cell-type labels. The H3K27me3- and H3K9me3-only spaces also include unmixed double-incubated cells whose deconvolved signal has been projected onto their respective UMAPs. Lines across datasets connect the locations of each double-incubated cell in the three histone modification spaces. b, Matrix summarizing the cluster pair that scChIX-seq selected for each double-incubated cell. Cells along the diagonal are predicted to be B cells, granulocytes and NK cells, respectively. Cells in the off-diagonal are false negatives. Barplots summarize FDR, sensitivity and specificity of assigning each cell type (right). c, Zoom-in coverage plot and single-cell cut fragments in B cells of mixed (H3K27me3+H3K9me3, gray bars) and unmixed (H3K27me3 and H3K9me3, orange and blue bars) signal. Positions of cut fragments are shown for four single cells (single cells A, B, C and D) for H3K27me3+H3K9me3 signal (gray ticks) as well as their unmixed outputs (orange and blue ticks). Circled reads and arrow highlight examples of cut fragments being assigned to either H3K27me3 (orange) or H3K9me3 (blue). d, Zoom-out of the Serpinb5 locus. Cut fragments from H3K27me3+H3K9me3 are colored based on whether they have been assigned to H3K27me3 (orange) or H3K9me3 (blue). Ground truth coverage tracks are single-incubated sortChIC data targeting H3K27me3 (orange) and H3K9me3 (blue). e, Heatmap of probabilities p of assigning reads to H3K27me3 (p = 1, red) or H3K9me3 (p = 0, blue) around the Bcl2 locus. Rows are single cells (ordered by predicted cell type), columns are genomic regions (50 kb bins). Transitions between H3K9me3- and H3K27me3-marked chromatin states are independent of cell type. f, Same as e but at the Crim1 locus, where transitions from H3K9me3 to H3K27me3 (blue to red) are cell-type specific.


The double- and single-incubated cells in the H3K27me3 and H3K9me3 UMAPs intermingle, suggesting that the model accurately assigns cut fragments to their respective histone modification (Extended Data Fig. 2e,f). Comparing the H3K27me3 deconvolved pseudobulk signal with our ground truth single-incubated pseudobulk shows high correlation for the expected cell type, and lower for the other two cell types (Extended Data Fig. 2g). The H3K9me3 deconvolved pseudobulk signal also shows highest correlation with the expected cell type, with lower correlation from other cell types (Extended Data Fig. 2h). Finally, we compared the fragments per cell obtained from scChIX-seq versus multi-CUT&TAG24, and found that scChIX-seq achieves higher sensitivity than multi-CUT&TAG (Extended Data Fig. 2i). Overall, our ground truth dataset demonstrates that scChIX-seq is accurate and sensitive in assigning cut fragments to their respective histone modification.

To quantify the accuracy of scChIX-seq in selecting the correct H3K27me3-H3K9me3 cluster pair to mix together, we color each cell by its ground truth label and plot its inferred H3K27me3-H3K9me3 pair on a two-dimensional (2D) grid (Fig. 2b, left). The false discovery rates (FDRs) of scChIX-seq predicting B cells, granulocytes or NK cells are 10%, 3% and 1%, respectively (Fig. 2b, right). Similarly, scChIX-seq has high specificity and sensitivity in inferring the correct cluster pairs (Fig. 2b, right).
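As a minimal sketch of how such per-cell-type metrics can be derived, the snippet below computes one-vs-rest FDR, sensitivity and specificity from a confusion matrix of ground truth versus predicted labels. The matrix entries are made-up illustrative counts, not the values behind Fig. 2b, and `per_class_metrics` is a hypothetical helper name.

```python
import numpy as np


def per_class_metrics(confusion):
    """One-vs-rest metrics; confusion[t, p] = cells with true label t predicted as p."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp  # predicted as the class but truly another
    fn = confusion.sum(axis=1) - tp  # truly the class but predicted as another
    tn = total - tp - fp - fn
    return {
        "fdr": fp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Illustrative counts only (rows: true B, granulocyte, NK; columns: predicted).
toy_confusion = [[90, 5, 5], [2, 96, 2], [8, 1, 91]]
metrics = per_class_metrics(toy_confusion)
```

For the first class in this toy matrix, 10 of the 100 cells predicted as that class are wrong, so its FDR is 0.1.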

Next, scChIX-seq assigns each double-incubated cut fragment to either H3K27me3 or H3K9me3 (Fig. 2c; Methods). The deconvolved B cell repressive landscapes correspond with their respective ground truth, as exemplified at the Bcl2 (Fig. 2d) and Crim1 (Extended Data Fig. 3a) loci. We also find cell-type-specific signal in both H3K27me3 (Extended Data Fig. 3b) and H3K9me3 (Extended Data Fig. 3c).

Our model infers p, the expected fraction of double-incubated fragments at a locus that belongs to H3K27me3. That is, p = 0 if all fragments belong to H3K9me3 and p = 1 if they all belong to H3K27me3. Plotting these probabilities across all loci reveals a bimodal distribution with peaks near 0 and 1 (Extended Data Fig. 3d). Classifying these loci as H3K9me3-specific (p < 0.5) or H3K27me3-specific (p ≥ 0.5), we compare the GC content and distance to transcription start site (TSS) of the two classes of loci (Extended Data Fig. 3e,f). We find H3K9me3-specific regions to have lower GC content and increased distance from TSSs compared with H3K27me3-specific regions. Of note, we observe this difference across all three cell types, suggesting that localization to GC-poor and gene-poor regions of the genome is a general feature of H3K9me3-specific regions21.

Summarizing these probabilities in single cells along the genome as a heatmap, the Bcl2 locus reveals the mutually exclusive relationship between H3K27me3 and H3K9me3, where the chromatin state is predominantly H3K9me3, then switches to H3K27me3, and then switches back to H3K9me3 (Fig. 2e). For Bcl2, these transitions occur at the same location independent of the cell type. However, we also find that these transitions can be cell-type specific, as exemplified by the Crim1 locus (Fig. 2f), where the H3K27me3 region extends further upstream of Crim1 in NK cells compared with B cells and granulocytes. Our ground truth experiment demonstrates that scChIX-seq can accurately map two histone modifications in single cells, and the inferred probabilities can be biologically interpreted as relationships between the two histone modifications in single cells.

scChIX-seq reveals H3K4me1/H3K27me3 relationships in bone marrow

We next apply scChIX-seq to integrate active (H3K4me1) and repressive (H3K27me3) chromatin states in a complex mixture of cells by sampling mouse bone marrow (Extended Data Fig. 4a,b). We use scChIX-seq to transfer labels and link UMAPs between active and repressive histone modifications (Fig. 3a,b) to perform a joint analysis of the two marks.

Fig. 3: scChIX-seq enables joint analysis of distinct histone modifications in single cells.

a, UMAP of sortChIC signal of H3K4me1 in bone marrow (n = 639 cells). Clusters are colored by cell type. Latent space calculated using LDA with 50 kb bins. b, UMAP of sortChIC signal of H3K27me3 in whole bone marrow (n = 517 cells). Cell types in H3K27me3 are inferred by transferring labels from H3K4me1. c, H3K4me1 and H3K27me3 UMAPs linked together by deconvolved double-incubated cells (n = 1,711 cells). H3K4me1 and H3K27me3 portions of the double-incubated cells are projected onto their respective UMAPs. Lines connect where the active signal and the corresponding repressive signal are located for each double-incubated cell. DC, dendritic cells; pDC, plasmacytoid dendritic cells. d, Heatmap showing probability of assigning a read in a region to either H3K27me3 or H3K4me1 at 5 kb resolution. Heatmap shows the Igk locus for pro-B versus B cells. Rows are single cells, columns are 5 kb genomic regions. Blue represents regions where cut fragments are probably coming from H3K27me3, while red represents regions where cut fragments are probably coming from H3K4me1.


To define cell types from the H3K4me1 sortChIC data, we ranked the top 150 genes associated with different clusters from sortChIC and used a publicly available scRNA-seq dataset to compare mRNA abundances of cluster-specific genes across different blood cell types25 (Extended Data Fig. 4c). scChIX-seq takes each H3K4me1+H3K27me3 cell and infers the most probable cluster pair (one from H3K4me1, the other from H3K27me3), which systematically transfers cell-type labels defined from H3K4me1 onto the H3K27me3 data (Extended Data Fig. 4d). We find that a small minority of double-incubated cells have low-confidence cluster pair predictions. Plotting the cluster pairs onto the H3K4me1+H3K27me3 UMAP confirms that the single-cell assignment produces precise clusters where neighboring cells are probably assigned to the same pair. Low-confidence predictions arise from cells that border between clusters (Extended Data Fig. 4e), which we remove from further analysis. Overall, scChIX-seq allows systematic transfer of cell-type labels from one histone modification to another.

We next deconvolve the double-incubated cells into their respective histone modification. The UMAPs from H3K4me1 and H3K27me3 show that single-incubated and deconvolved single cells intermingle, suggesting that deconvolution does not produce batch effects (Extended Data Fig. 4f,g). The deconvolved single cells provide anchors to systematically link one histone modification with another (Fig. 3c). To validate the predicted cell types in both the single and deconvolved datasets, we compared with data from cell types purified by FACS. For H3K4me1 clusters, we compared with publicly available ChIP–seq5. Pearson correlation between ChIP–seq of B cells, erythroids, granulocytes and NK cells versus sortChIC from single- and double-incubated cells is highest for the predicted cell type (Extended Data Fig. 5a–d). Although single-incubated cells have higher correlation with ChIP–seq reference data than deconvolved cells for the matched cell type, the deconvolved cells of the matched cell type consistently had higher correlation with ChIP–seq than unmatched cell types. For H3K27me3 clusters, we used our ground truth sortChIC data purified from FACS. Pearson correlation of sortChIC signal between FACS-sorted B cells, granulocytes and NK cells versus pseudobulks derived from whole bone marrow is highest for the predicted cell type (Extended Data Fig. 5e–g).
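The pseudobulk comparison described above can be sketched as follows. The depth normalization (counts per million) and log2 transform with a pseudocount are assumptions made for illustration; the exact normalization used in the study is described in its Methods and may differ. The helper names are hypothetical.

```python
import numpy as np


def pseudobulk(counts, labels, label):
    """Sum cut fragments over all cells carrying the given label.
    counts is a cells x bins matrix; labels has one entry per cell."""
    counts = np.asarray(counts, dtype=float)
    mask = np.asarray(labels) == label
    return counts[mask].sum(axis=0)


def log_cpm_correlation(a, b, pseudocount=1.0):
    """Pearson correlation of two profiles after depth normalization and log2."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    la = np.log2(a / a.sum() * 1e6 + pseudocount)
    lb = np.log2(b / b.sum() * 1e6 + pseudocount)
    return float(np.corrcoef(la, lb)[0, 1])


# Toy data: three cells, three bins, two predicted cell types.
toy_counts = [[5, 0, 1], [3, 1, 0], [0, 4, 6]]
toy_labels = ["B", "B", "NK"]
b_profile = pseudobulk(toy_counts, toy_labels, "B")
nk_profile = pseudobulk(toy_counts, toy_labels, "NK")
r_same = log_cpm_correlation(b_profile, 10 * b_profile)  # same shape, deeper
r_diff = log_cpm_correlation(b_profile, nk_profile)
```

Because both profiles are normalized to a common depth before comparison, a reference that is simply sequenced more deeply gives the same correlation as a shallower one.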

Classifying these loci as H3K27me3-specific or H3K4me1-specific using a cluster-specific cutoff for p (Extended Data Fig. 5h), we again compare the GC content and distance to TSS of the two classes of loci. We find that H3K4me1-marked regions tend to be closer to TSSs compared with H3K27me3 (Extended Data Fig. 5i), and that GC content is higher in H3K27me3-specific compared with H3K4me1-specific regions (Extended Data Fig. 5j). The increase in GC content for H3K27me3-marked regions is consistent with previous studies showing that GC-rich elements in transcriptionally inactive regions can recruit PRC2 (ref. 26).

We use the joint landscape to reveal active and repressive histone modification dynamics within cell types. To find differences in chromatin regulation between pro-B cells versus B cells, we select only pro-B or B cells and recluster the cells in both H3K4me1 and H3K27me3 separately (Extended Data Fig. 6a,b). With multimodal data, we can transfer cell-type-specific H3K4me1 signal onto the H3K27me3 UMAP to distinguish pro-B and B cells with more confidence. Using pro-B cell-specific genes, Pax5 (ref. 27) and Pten28, we project the H3K4me1 signal at loci overlapping these genes onto both H3K4me1 and H3K27me3 landscapes, confirming a subset of pro-B cells within the B cell population (Extended Data Fig. 6c). Similarly, we use marker genes associated with more differentiated B cells, such as Irf4 (ref. 27), Igkv3-2 locus29 and Cd72 (ref. 30) to confirm a more differentiated B cell population (Extended Data Fig. 6d). Plotting the heatmap of H3K4me1-H3K27me3 assignment probabilities at the Igk locus reveals that the chromatin state is repressed in pro-B cells but becomes activated in B cells (Fig. 3d), consistent with the progressive activation of the chromatin state during B cell development29.

Next, we recluster neutrophils to analyze differences in chromatin regulation along pseudotime (Extended Data Fig. 7a). Reclustering neutrophils in H3K27me3 reveals a shared pseudotime trajectory that varies smoothly between neutrophils in both the H3K27me3 and H3K4me1 landscapes. H3K4me1 levels at the Retnlg locus, a marker gene for mature neutrophils31, increase along pseudotime, while H3K27me3 levels decrease (Extended Data Fig. 7b). The H3K27me3 gene loadings associated with pseudotime consist of a module of Hox and other developmental genes (Extended Data Fig. 7c–e). Of note, these genes have low mRNA abundances in neutrophils (Extended Data Fig. 7f), suggesting that this module is transcriptionally silent. At a region overlapping the Hoxa locus, we find that H3K27me3 is highly marked while H3K4me1 is lowly marked across all neutrophils. Along pseudotime, H3K27me3 increases further, while H3K4me1 decreases further (Extended Data Fig. 7c). Our pseudotime analysis suggests that dynamics in histone modifications can occur even in regions associated with low-expressed genes.

H3K36me3/H3K9me3 relationships during mouse organogenesis

To demonstrate the method in more complex biological scenarios, we applied scChIX-seq during mouse organogenesis (E9.5 to E11.5) to study H3K36me3 and H3K9me3 dynamics at single-cell resolution (Fig. 4a and Extended Data Fig. 8a,b). We took the top 250 cluster-specific bins from the H3K36me3 data to identify cell types (Methods). These loci associate with gene bodies of cell-type-specific genes. For example, we find H3K36me3 signal around genes enriched in specific cell types, such as erythroids (Sptb)32, white blood cells (Lcp2 (ref. 33)), endothelial cells (Emcn)34, neural tube (Rfx4)35, neurons (Elavl4)36, Schwann precursors (Cdh6)37, epithelial cells (Grhl2)38, mesenchymal progenitors (Prx1)39 and cardiomyocytes (Gata6, Tpm1)40,41 (Extended Data Fig. 8c–l).

Fig. 4: Applying scChIX-seq to mouse organogenesis reveals shared heterochromatin landscapes and cell-type-specific differences in H3K36me3:H3K9me3 ratios.

a, Schematic of mouse organogenesis experiment to study H3K36me3 and H3K9me3 in single cells. b, Joint UMAP of mouse organogenesis after deconvolution from scChIX-seq (n = 2,911 H3K36me3 cells, n = 2,166 H3K9me3 cells). c, Assignment of several H3K36me3 cell types to one H3K9me3 cluster. The H3K36me3 (columns) and H3K9me3 (rows) labels for each double-incubated cell (n = 1,197 cells) are plotted onto a matrix mapping H3K36me3 cell types to H3K9me3 clusters. Cells are colored by their cell-type label as in b. d, Subclustering of nonblood cells for H3K9me3, represented as a UMAP. Arrow denotes a pseudotime axis. Pseudotime is defined as the first PC of the 2D UMAP. e, Joint UMAP of deconvolved double-incubated cells (n = 1,197 cells), colored by the log ratio of number of H3K36me3 cuts versus number of H3K9me3 cuts. f, Boxplot of H3K36me3:H3K9me3 ratio across cell types. Number of double-incubated cells for each cell type: n = 163 erythroid, n = 17 white blood cells, n = 24 endothelial, n = 136 neural tube progenitors, n = 197 neurons, n = 46 Schwann cell precursors, n = 73 epithelial, n = 458 mesenchymal progenitors and n = 83 cardiomyocytes. Boxplots show 25th percentile, median and 75th percentile, with the whiskers spanning 97% of the data.


To uncover whether distinct H3K36me3 cell types could share common H3K9me3 landscapes, we deconvolved the H3K36me3 + H3K9me3 cells and projected each cell to both landscapes (Fig. 4b). scChIX-seq reveals that erythroid and white blood cells have both distinct active chromatin and heterochromatin, but the other nonblood cell types show similar heterochromatin distribution. Assigning each double-incubated cell to a H3K36me3 and H3K9me3 cluster confirms that cells with distinct H3K36me3 can share the same H3K9me3 cluster (Fig. 4c). Of note, the variable genes that show cell-type-specific differences in both active chromatin and publicly available mRNA abundances42 (Extended Data Fig. 9a,b) have low signal across cell types in H3K9me3 (Extended Data Fig. 9c), suggesting that using conventional marker genes from RNA-seq would not reveal cell-type differences in H3K9me3.

Differential expression across the three H3K9me3 clusters reveals cluster-specific repressed loci (Extended Data Fig. 9d), with the largest effect coming from erythroid-specific regions. These erythroid-repressed regions are associated with decreased mRNA abundances (Extended Data Fig. 9e–g). Subsetting the data and running LDA on only nonblood cells in H3K9me3, we find that H3K9me3 varies over organogenesis stages (Fig. 4d), suggesting that heterochromatin differences are stronger across organogenesis stages than between cell types.

Because the double-incubated cells have cut fragments associated with both histone modifications, we hypothesized that the deconvolved data could precisely quantify the ratio between the two histone modifications, and how this ratio changes across cell types. Counting total reads from single-incubated data would lead to large cell-to-cell technical variability because counts per cell can span several orders of magnitude. However, comparing the counts of the two histone modifications in the same cell could overcome this technical variability. We therefore asked whether the global ratio of H3K36me3 and H3K9me3 in individual cells varies. Plotting the ratio of H3K36me3 and H3K9me3 reveals that most cells have comparable ratios, but that erythroid cells have lower ratios than other cell types (Fig. 4e,f). This lower ratio is consistent with mass spectrometry studies showing a global decrease in H3K36me3 but no change in H3K9me3 during erythroid maturation43. Of note, inferring this global change without scChIX-seq, such as by counting total unique fragments from single-incubation data, is challenging due to the large variability in total counts across cells and the inability to distinguish cell types in certain H3K9me3 clusters (Extended Data Fig. 9h,i).
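A minimal sketch of why a within-cell ratio is robust to depth differences is shown below. The base-2 logarithm and the optional pseudocount are assumptions for illustration (the text specifies only a log ratio), the cut counts are made up, and `log2_mark_ratio` is a hypothetical helper name.

```python
import numpy as np


def log2_mark_ratio(cuts_a, cuts_b, pseudocount=1.0):
    """Per-cell log2 ratio of cuts assigned to mark A versus mark B."""
    cuts_a = np.asarray(cuts_a, dtype=float)
    cuts_b = np.asarray(cuts_b, dtype=float)
    return np.log2((cuts_a + pseudocount) / (cuts_b + pseudocount))


# Two hypothetical cells with the same composition but a 10-fold depth difference.
h3k36me3_cuts = [100, 1000]
h3k9me3_cuts = [50, 500]
ratios = log2_mark_ratio(h3k36me3_cuts, h3k9me3_cuts, pseudocount=0.0)
```

Total counts differ by an order of magnitude between the two cells, yet the ratio is identical, because the per-cell technical factor affects both marks equally and cancels.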

In sum, applying scChIX-seq to H3K36me3 and H3K9me3 during organogenesis reveals unique insights from multimodal analysis. The complex relationships between the two histone modifications as well as their global changes would not have been elucidated by analyzing single-incubated data alone.

Mark-specific pseudotimes and chromatin velocity

Finally, we applied scChIX-seq to study the dynamic relationships between two active histone modifications, H3K4me1 and H3K36me3, over an in vitro differentiation timecourse. We sorted blood progenitors from mouse bone marrow, added macrophage colony-stimulating factor (MCSF) and collected cells over 7 days (Fig. 5a and Extended Data Fig. 10a,b; Methods). We incubated cells with either H3K4me1, H3K36me3 or both H3K4me1 and H3K36me3, then performed scChIX-seq.

Fig. 5: Applying scChIX-seq to two active marks reveals chromatin velocity during in vitro macrophage differentiation.

a, Schematic of mouse macrophage in vitro differentiation timecourse experiment to study H3K4me1 and H3K36me3 in single cells. b, Heatmap of histone modification signal on the bodies of dynamic genes over pseudotime. Rows are gene bodies and columns are single-incubated cells ordered along pseudotime. Color labels of columns are days from which the cells were recovered during the timecourse. c, Boxplots of pseudotime estimates of single-incubated cells along the timecourse. Number of cells per day for H3K4me1: n = 58 day 0, n = 148 day 1, n = 249 day 2, n = 350 day 3, n = 369 day 4, n = 383 day 5, n = 488 day 6, n = 519 day 7. For H3K36me3: n = 42 day 0, n = 125 day 1, n = 176 day 2, n = 301 day 3, n = 384 day 4, n = 366 day 5, n = 522 day 6, n = 567 day 7. Boxplots show 25th percentile, median and 75th percentile, with the whiskers spanning 97% of the data. d, Estimate of the average difference of pseudotime from one day to the next. Error bars indicate 95% CI, calculated by a linear model of the pseudotime differences between days. Statistics derived from number of cells indicated in c. e, Estimates of two different pseudotimes from a single cell. Error bars are 95% CI of the estimates. Each point is a double-incubated cell. f, Joint UMAP of H3K4me1 and H3K36me3 from scChIX-seq; lines connect single cells with multimodal information. g, Chromatin velocity estimates of an upregulated gene (above) and a downregulated gene (below). Red curve is the exponential relaxation fit according to the solution of the first-order differential equation. h, High-dimensional chromatin velocities of dynamic genes projected onto PCs 1 and 2. Vector field estimated by smoothing across nearest neighbors of cells (Methods).


Genome tracks of H3K4me1 and H3K36me3 signal for each day show upregulation of macrophage-specific genes, such as Mertk44 (Extended Data Fig. 10c). A heatmap of H3K4me1 and H3K36me3 dynamics at gene bodies along pseudotime reveals that the two histone modifications up- and downregulate genes with different dynamics. H3K36me3 shows a gradual up- or downregulation of signal while H3K4me1 reaches a new steady state earlier along pseudotime (Fig. 5b). Summarizing log2 fold change of the two histone modifications genome-wide, we find that dynamics in H3K36me3 are often larger than in H3K4me1 (Extended Data Fig. 10d). Comparing pseudotime progression with day of sample collection shows that changes in H3K4me1 peak at day 2 and then increase progressively over the days, while H3K36me3 dynamics peak around days 3 and 4 before relaxing towards steady state (Fig. 5c). The time of the largest change in H3K4me1 dynamics occurs 1 day before that of H3K36me3 (Fig. 5d), suggesting that global changes in H3K4me1 precede changes in H3K36me3. Summarizing at the genome-wide level, UMAPs of H3K4me1 and H3K36me3 of single-incubated cells show that both active marks move progressively towards a macrophage state during the timecourse (Fig. 5e).

Using continuous pseudotime of H3K4me1 and H3K36me3 as our training data (Methods), for both H3K4me1 and H3K36me3 we infer where along pseudotime each double-incubated cell came from. Plotting the inferred pseudotimes of each mark for each cell uncovers the dynamic relationships between the two marks (Fig. 5e). H3K4me1 pseudotime initially progresses while H3K36me3 remains relatively unchanged. As H3K4me1 pseudotime approaches 0.5, H3K36me3 then progresses rapidly towards 1. This sigmoidal-like relationship between H3K4me1 versus H3K36me3 pseudotime progression is consistent with H3K4me1 dynamics occurring before H3K36me3. Finally, we used this inferred pseudotime information to project the deconvolved cells onto the H3K4me1 and H3K36me3 UMAPs. Both UMAPs showed that the single-incubated and deconvolved cells intermingle with each other, suggesting that deconvolution was successful (Extended Data Fig. 10e,f). Using the deconvolved cells as anchors, we then linked the two histone modification maps together (Fig. 5f).

Since we observed that H3K4me1 dynamics occur before H3K36me3, we asked whether we could model the H3K36me3 dynamics as a first-order differential equation analogous to RNA velocity45 (Fig. 5g, top; Methods). Since our data come from a timecourse, we directly fitted the exponential curves for dynamic genes along pseudotime for H3K36me3 (Extended Data Fig. 10g), which avoids making steady-state assumptions and leverages information from both single-incubated and deconvolved cells across histone modifications. The distribution of inferred rate constants from the exponential fit shows a median of approximately 2.3 per pseudotime (Extended Data Fig. 10h). These rate constants were then used to predict the H3K36me3 levels for each cell over small pseudotime steps (Δt = 0.02; Fig. 5g). Finally, summarizing the predictions of dynamic genes, we projected the high-dimensional velocity vectors onto the first two principal components (PCs). From the chromatin velocity summary, we found that differentiation starts with large changes in H3K36me3 dynamics, and then relaxes towards the macrophage state.
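The fitting and prediction steps can be sketched under explicit assumptions. The snippet assumes the first-order model dx/dt = k (x_ss − x), whose solution is x(t) = x_ss + c·exp(−k t), and fits it by a grid search over k combined with linear least squares for x_ss and c. This fitting strategy, the parameter names and the synthetic data are illustrative choices, not the procedure from the study's Methods.

```python
import numpy as np


def fit_relaxation(t, x, rates=None):
    """Fit x(t) = x_ss + c * exp(-k t), the solution of dx/dt = k (x_ss - x).
    For each candidate k the model is linear in (x_ss, c), so solve by lstsq."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    if rates is None:
        rates = np.linspace(0.1, 10.0, 100)
    best = None
    for k in rates:
        design = np.column_stack([np.ones_like(t), np.exp(-k * t)])
        coef, *_ = np.linalg.lstsq(design, x, rcond=None)
        sse = float(np.sum((design @ coef - x) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(k), float(coef[0]), float(coef[1]))
    _, k, x_ss, c = best
    return k, x_ss, c


def velocity(x, k, x_ss):
    """Instantaneous rate of change under the first-order model."""
    return k * (x_ss - x)


def step(x, k, x_ss, dt=0.02):
    """Predict the level after a small pseudotime step (forward Euler)."""
    return x + velocity(x, k, x_ss) * dt


# Synthetic, noise-free gene: starts at 0.5 and relaxes to 2.0 with k = 2.3.
pseudotime = np.linspace(0.0, 1.0, 50)
signal = 2.0 + (0.5 - 2.0) * np.exp(-2.3 * pseudotime)
k_hat, x_ss_hat, c_hat = fit_relaxation(pseudotime, signal)
next_level = step(0.5, k_hat, x_ss_hat)
```

Under this model the velocity is largest when a gene is far from its steady state and shrinks as the gene approaches it, which matches the relaxation behavior described in the text.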

In summary, we applied scChIX-seq to two active histone modifications to find dynamic relationships between activation states. We then model these dynamics to infer chromatin velocity during macrophage differentiation.

Discussion

Here, we demonstrate that scChIX-seq can deconvolve multiplexed histone modifications, expanding the number of histone marks that can be profiled in single cells. Using simulations, purified cell types and whole bone marrow, we demonstrate that scChIX-seq can accurately map several histone marks. To show how scChIX-seq can reveal unique biological insights in more challenging systems, we applied scChIX-seq to study H3K36me3 and H3K9me3 dynamics during mouse organogenesis to reveal the joint transcriptional and heterochromatin relationships in single cells. scChIX-seq can identify complex cell-type relationships between histone modifications, such as when several cell types can share a similar heterochromatin landscape. Finally, we applied scChIX-seq to two active marks during macrophage in vitro differentiation to quantify the relationship between two correlating marks. Importantly, scChIX-seq is flexible in which histone modifications can be used. The correlation structure between modifications is inferred from the model and therefore does not require a priori assumptions of specific features of the two modifications. Thus, scChIX-seq complements a recent method that focuses on differences in fragment lengths between Pol2 serine-5 phosphate and H3K27me3 to assign reads to their respective mark46.

Recently, there have been new experimental innovations to CUT&TAG that modify the pA-Tn5 complex to map several histone modifications in single cells24,47,48,49. One drawback of Tn5-based approaches (for example, CUT&TAG) compared with the MNase-based approaches (for example, sortChIC and CUT&RUN) used in this study is that Tn5 can be biased towards open chromatin50. Current CUT&TAG methods suppress this bias by using more stringent washing conditions51, but exceedingly high salt conditions reduce the sensitivity and could wash away weakly bound factors such as transcription factors50,51. On the flip side, MNase-based approaches involve more experimental effort than Tn5-based approaches, reducing the number of single cells that can be processed per round. Although we demonstrate our scChIX-seq method using an MNase-based approach (sortChIC), our computational and experimental framework can also be applied to Tn5-based strategies. Furthermore, our scChIX-seq method may have synergies with recent nanobody-based methods47,48. For example, using two nanobodies, each specific to a different species of immunoglobulin G, one can profile four histone modifications by generating two sets of scChIX-seq simultaneously: two antibodies raised from one species and the other two antibodies raised from the second species.

A limitation of scChIX-seq is that the maximum number of cuts at a specific base pair location is fundamentally limited by the copy number in that cell. Therefore, a nucleosome that has several modifications in its histone tails would still be cut only once. Currently, our binning strategy (5 kilobase (kb), 50 kb or gene bodies, depending on the biological question) and multinomial model assume that there is no limit to the number of fragments that can be generated in one bin, which is an approximation that is valid when the bins are large and the number of cuts within the bins is small (for example, due to dropouts).

We demonstrate that scChIX-seq can reveal biological insights by multimodal analysis that would otherwise be obscured by analyzing each modality separately. Overall, scChIX-seq unlocks multimodal analysis in antibody-based chromatin profiling and enables joint analysis of distinct histone modifications in single cells.

Methods

Animal experiments

All mice used in this study were Cast-EiJ/Bl6 mice and were bred and maintained in the Hubrecht Institute Animal Facility. All mouse experimentation was approved by the Animal Experimentation Committee (DEC) from the Koninklijke Nederlandse Akademie van Wetenschappen (KNAW) and complied with existing European Union legislation and local standards.

Mouse bone marrow

Male 13-week-old C57BL/6 mice were used to extract bone marrow cells. Femurs and tibia were extracted and the bone ends were cut away to access the bone marrow, which was flushed out using a 22G syringe with HBSS (– calcium, – magnesium, – phenol red, Gibco, catalog no. 14175053) supplemented with Pen-Strep and 1% fetal calf serum. The bone marrow was dissociated and debris removed by passing through a 70 μm cell strainer (Corning, catalog no. 431751). Cells were washed with 25 ml supplemented HBSS before depleting the sample of unnucleated cells using IOTest 3 Lysing solution (Beckman Coulter) following the providerʼs instructions. Cells were washed an additional two times with PBS before processing them by the sortChIC protocol for histone modifications. For whole bone marrow experiments (that is, not enriched for specific cell types), we processed cells using the sortChIC protocol for unfixed cells (without ethanol fixation). For the ground truth experiment with sorted cell types, we processed cells using the sortChIC protocol for ethanol-fixed cells. For ethanol fixation, cells were resuspended in 70% ethanol and fixed for 1 h at –20 °C. Afterwards cells were resuspended in Storage buffer (42.5 ml H2O RNAse free, 1 ml 1 M HEPES pH 7.5 (Invitrogen), 1.5 ml 5 M NaCl, 3.6 μl spermidine (Sigma Aldrich, catalog no. S2626-5G), protease inhibitor (Sigma Aldrich, catalog no. 5056489001), 200 μl 0.5 M EDTA, 5 μl dimethylsulfoxide) and frozen at –80 °C, before processing by the sortChIC protocol.

Mouse organogenesis

No randomization or blinding was performed. Sex of embryos was not known at the time of collection. Four to five embryos were pooled for each reported timepoint (E9.5, E10.5, E11.5) before single-cell isolation. Pooled embryos were dissociated in TrypleE for 10 min at room temperature. Undigested portions were physically removed and the remainder filtered through a 30 μm filter before the single-cell suspension was split into three samples for each timepoint and each scChIX-seq experiment. Per timepoint, two single-cell samples were used each for a single antibody incubation (H3K36me3 or H3K9me3) and one sample for the double antibody incubation (H3K36me3 + H3K9me3). Antibody incubation was performed as described in the scChIX-seq protocol before single-cell capture using flow cytometry. A DNA library was prepared for each sample using the sortChIC protocol for unfixed cells.

In vitro macrophage differentiation

For in vitro differentiation of bone marrow-derived macrophages, bone marrow was collected aseptically by flushing tibia and femurs from euthanized wild-type male C57BL/6 mice with sterile RPMI and 10% FCS through a 70 μm cell strainer (Corning). To enrich for stem and progenitor cells, lineage marker-positive (Lin+) cells were depleted by magnetic-activated cell sorting using a mouse Lineage Cell Depletion kit (Miltenyi Biotec). Lin– cells were cultured on nontissue-culture-treated plates (Corning) for 7 days in RPMI medium supplemented with 10% FCS, 100 IU ml–1 penicillin, 100 mg ml–1 streptomycin and 10 ng ml–1 recombinant murine MCSF (Peprotech). Medium was refreshed after 3 days. Every 24 h, suspension cells were collected and adherent cells were harvested by incubating 10 min in 2 mM EDTA/0.5% BSA in PBS. Suspension and adherent cells were combined and stained with CellTrace fluorescent labels (Thermo Fisher), according to the manufacturer’s instructions. Briefly, cells were pelleted and resuspended in 37 °C PBS containing fluorescent dyes (working concentrations CellTrace CFSE (CTC): 2.5 μM; CellTrace Yellow (CTY): 2.5 μM; CellTrace Far Red (CTFR): 0.5 μM) at a concentration of 1,000,000 cells ml–1. Cells were incubated at 37 °C protected from light for 20 min. Staining reactions were stopped by adding two volumes of RPMI/10% FCS and incubating for 5 min at room temperature, protected from light, after which cells were washed twice in PBS. The following combinations of labels were used: unstained (day 0), CTC (day 1), CTY (day 2), CTFR (day 3), CTC + CTY (day 4), CTC + CTFR (day 5), CTY + CTFR (day 6) and CTC + CTY + CTFR (day 7). After harvesting and staining, cells were fixed in 70% ethanol for 1 h and stored for later processing by the sortChIC protocol for fixed cells.

Cell preparation without ethanol fixation for sortChIC experiments

Cells from whole bone marrow (H3K4me1+H3K27me3) and mouse embryos (H3K36me3+H3K9me3) were processed using the sortChIC method without ethanol fixation. Cells were processed in 0.5 ml protein low-binding tubes. The following steps were performed on ice. Cells were resuspended in 500 μl wash buffer (47.5 ml RNAse-free H2O, 1 ml 1 M HEPES pH 7.5 (Invitrogen), 1.5 ml 5 M NaCl, 3.6 μl pure spermidine solution (Sigma Aldrich)). Cells were pelleted at 600g for 3 min and resuspended in 400 μl wash buffer 1 (wash buffer with 0.05% saponin (Sigma Aldrich), protease inhibitor cocktail (Sigma Aldrich), 4 μl 0.5 M EDTA) containing the primary antibody (1:100 dilution for the antibody; saponin has to be prepared fresh every time as a 10% solution in PBS). Cells were incubated overnight at 4 °C on a roller, before being washed once with 500 μl wash buffer 2 (wash buffer with 0.05% saponin, protease inhibitor). Afterwards, cells were resuspended in 500 μl wash buffer 2 containing Protein A-Micrococcal Nuclease (pA-MNase) (3 ng ml–1) and incubated for 1 h at 4 °C on a roller.

Finally, cells were washed an additional two times with 500 μl wash buffer 2 before passing them through a 70 μm cell strainer (Corning, catalog no. 431751) and sorting G1 cells based on Hoechst staining on a BD Influx FACS machine into 384-well plates containing 50 nl wash buffer 3 (wash buffer containing 0.05% saponin) and 5 μl sterile filtered mineral oil (Sigma Aldrich) per well. Small volumes were distributed using a Nanodrop II system (Innovadyne).

Cell preparation with ethanol fixation and surface antibody incubation for sortChIC experiments

Cells from sorted bone marrow (H3K27me3+H3K9me3) and macrophage in vitro differentiation (H3K4me1+H3K36me3) were processed using the ethanol fixation protocol. Sorted bone marrow cells were also incubated with surface antibody to enrich for known cell types. For the ethanol-fixed cells the above described sortChIC protocol was adapted. Wash buffers were used as described above, except that 0.05% saponin was exchanged for 0.05% Tween. Ethanol-fixed cells were thawed on ice. Cells were spun at 400g for 5 min and washed once with 400 μl wash buffer 1. Cells were spun again at 400g and resuspended in 400 μl wash buffer 1. Cell suspension was split into three samples each having a volume of 400 μl and incubated with one or two antibodies (1:100 dilution for the antibody) overnight on a roller at 4 °C. The next day, cells were spun at 400g, washed once with 400 μl wash buffer 2 and resuspended in 500 μl wash buffer 2 containing pA-MNase (3 ng ml–1) and incubated for 1 h on a rotator at 4 °C. Next, cells were spun at 400g and resuspended in 400 μl wash buffer 2 (with addition of 5% blocking rat serum). To sort for defined cell types in the ground truth bone marrow experiment, surface antibodies were added according to these concentrations and were incubated for 30 min on ice:

$$\begin{array}{lll}\text{Antibody} & \text{Info} & \text{Working concentration}\\ \text{GR1} & \text{A647, anti-mouse Ly-6G/Ly-6C (Gr-1) Antibody, clone: RB6-8C5} & 1{:}8{,}000\\ \text{NK1} & \text{A488, anti-mouse NK-1.1 Antibody, clone: PK136} & 1{:}400\\ \text{CD19} & \text{BV421, anti-mouse CD19 Antibody, clone: 6D5} & 1{:}200\end{array}$$

BD FACS software v.1.2.0.142 was used to collect data from the FACS machine during cell sorting; see Supplemental Fig. 1 for the gating strategy.

Finally, samples were washed once with 500 μl wash buffer 2 before passing them through a 70 μm cell strainer (Corning, catalog no. 431751) and sorting on a BD Influx FACS machine, with surface-antibody-specific gating, into 384-well plates containing 50 nl wash buffer 3 (wash buffer containing 0.05% Tween) and 5 μl sterile filtered mineral oil (Sigma Aldrich) per well. Small volumes were distributed using a Nanodrop II system (Innovadyne).

MNase activation for sortChIC experiments

Targeted fragmentation was started by the addition of 5 μl wash buffer 2 containing 4 mM CaCl2. For digestion, plates were incubated for 30 min in a PCR machine set at 4 °C. Afterwards, the reaction was stopped by adding 100 nl of a stop solution containing 40 mM EGTA, 1.5% NP40, and 10 nl 2 mg ml−1 proteinase K. Plates were incubated in a PCR machine for a further 20 min at 4 °C, before chromatin was released and pA-MNase permanently destroyed by proteinase K digestion at 65 °C for 6 h, followed by 80 °C for 20 min to heat-inactivate proteinase K. Afterwards, plates were stored at –80 °C until further processing.

Library preparation for sortChIC experiments

DNA fragments were blunt ended by adding 150 nl end repair mix per well and incubating for 30 min at 37 °C followed by 20 min at 75 °C for enzyme inactivation. End repair mix per well: Klenow large (NEB, catalog no. M0210L) 2.5 nl, T4 PNK (NEB, catalog no. M0201L) 2.5 nl, dNTPs 10 mM 6 nl, ATP 100 mM 3.5 nl, MgCl2 25 mM 10 nl, PEG8000 50% 7.5 nl, PNK buffer 10× (NEB, catalog no. B0201S) 35 nl, BSA 20 ng 1.8 nl, nuclease-free water 81.3 nl.

Blunt fragments were subsequently A-tailed by adding 150 nl per well of A-tailing mix and incubated for 15 min at 72 °C. Through the strong preference of AmpliTaq 360 to incorporate dATP as a single base overhang even in the presence of other nucleotides, a general dNTP removal was not necessary. A-tailing mix per well: AmpliTaq 360 (Thermo Fisher Scientific, catalog no. 4398828) 1 nl, dATPs 100 mM 1 nl, KCl 1 M 25 nl, PEG8000 50% 7.5 nl, BSA 20 ng 0.8 nl, nuclease-free water 114.8 nl.

Fragments were ligated to T-tailed forked adapters carrying a T7 polymerase binding site for in vitro transcription (IVT)-based amplification.

Top strand: 5′-GGTGATGCCGGTAATACGACTCACTATAGGGAGTTCTACAGTCCGACGATCNNNACACACTAT-3′

Bottom strand: 5′-TAGTGTGTNNNGATCGTCGGACTGTAGAACTCCCTATAGTGAGTCGTATTACCGGCGAGCTT-3′

The three random nucleotides (NNN) were the unique molecular identifier used for read deduplication and the eight bases afterwards represent the cell barcodes, which were different for each of the 384 wells. For a full list of adapters and the cell barcodes for each well, see the Excel sheet in Supplemental Table 1. Cell barcodes for each 384-well plate are also found as a text file in the scChIX-seq GitHub repository: (https://github.com/jakeyeung/scChIX/blob/main/inst/extdata/cellbarcodes_384_NLA_annotated.bc).
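To make the adapter layout concrete, the sketch below splits the top strand shown above into its upstream part, the NNN unique molecular identifier, the eight-base cell barcode and the 3′ T overhang, and counts how many positions of the two strands are complementary. The helper names are hypothetical, and the barcode in this sequence is only the one example given in the text; the other 383 wells carry different barcodes.

```python
TOP = "GGTGATGCCGGTAATACGACTCACTATAGGGAGTTCTACAGTCCGACGATCNNNACACACTAT"
BOTTOM = "TAGTGTGTNNNGATCGTCGGACTGTAGAACTCCCTATAGTGAGTCGTATTACCGGCGAGCTT"

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-complement a DNA string; N is kept as N."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def split_top_strand(top, umi_len=3, barcode_len=8):
    """Split the top strand into (upstream, UMI, cell barcode, 3' overhang)."""
    umi_start = top.index("N" * umi_len)
    umi_end = umi_start + umi_len
    barcode_end = umi_end + barcode_len
    return top[:umi_start], top[umi_start:umi_end], top[umi_end:barcode_end], top[barcode_end:]


def paired_stem_length(top, bottom):
    """Count complementary positions from the ligation end, ignoring the 1-nt top overhang."""
    top_core = top[:-1]
    bottom_rc = reverse_complement(bottom)
    paired = 0
    for a, b in zip(reversed(top_core), reversed(bottom_rc)):
        if a != b:
            break
        paired += 1
    return paired


upstream, umi, barcode, overhang = split_top_strand(TOP)
```

For the pair of strands above, the complementary stem covers the T7 promoter, the UMI and the barcode, while the first few bases at the far end do not pair, which is what makes the adapter forked.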

For ligation, 50 nl of 5 μM adapter in 50 mM Tris pH 7 was added to each well with a Mosquito HTS (ttp labtech). After centrifugation, 150 nl of ligation mix was added before incubating plates for 20 min at 4 °C, followed by 16 h at 16 °C for ligation and 10 min at 65 °C to inactivate ligase. Adapter ligation mix per well: T4 ligase (400,000 U ml–1, NEB, catalog no. M0202L) 25 nl, MgCl2 1 M 3.5 nl, Tris 1 M pH 7.5 10.5 nl, DTT 0.1 M 52.5 nl, ATP 100 mM 3.5 nl, PEG8000 50% 10 nl, BSA 20 ng 1 nl, nuclease-free water 44 nl.
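As a bookkeeping aid, the per-well recipes for the end repair, A-tailing and ligation mixes can be checked against the 150 nl dispensed per well and scaled to a full plate. The sketch below copies the volumes from the text; the function name `master_mix_ul` and the 10% excess for dead volume are assumptions for illustration only.

```python
# Per-well volumes in nanolitres, copied from the recipes in the text.
END_REPAIR_NL = {
    "Klenow large": 2.5, "T4 PNK": 2.5, "dNTPs 10 mM": 6.0, "ATP 100 mM": 3.5,
    "MgCl2 25 mM": 10.0, "PEG8000 50%": 7.5, "PNK buffer 10x": 35.0,
    "BSA": 1.8, "nuclease-free water": 81.3,
}
A_TAILING_NL = {
    "AmpliTaq 360": 1.0, "dATP 100 mM": 1.0, "KCl 1 M": 25.0,
    "PEG8000 50%": 7.5, "BSA": 0.8, "nuclease-free water": 114.8,
}
LIGATION_NL = {
    "T4 ligase": 25.0, "MgCl2 1 M": 3.5, "Tris 1 M pH 7.5": 10.5, "DTT 0.1 M": 52.5,
    "ATP 100 mM": 3.5, "PEG8000 50%": 10.0, "BSA": 1.0, "nuclease-free water": 44.0,
}


def total_nl(recipe_nl):
    """Total volume of one well's worth of mix, in nanolitres."""
    return sum(recipe_nl.values())


def master_mix_ul(recipe_nl, wells=384, excess=0.1):
    """Scale a per-well recipe to a plate-level master mix in microlitres.

    `excess` is an assumed fractional surplus for pipetting dead volume.
    """
    factor = wells * (1.0 + excess) / 1000.0
    return {name: round(volume * factor, 3) for name, volume in recipe_nl.items()}
```

The listed components sum to about 150 nl per well in each case (150.1 nl for the end repair and A-tailing mixes, 150.0 nl for the ligation mix).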

Before pooling, 1 μl nuclease-free water was added to each well to minimize material loss. Ligation products were pooled by centrifugation into an oil-coated VBLOK200 Reservoir (ClickBio) at 500g for 2 min and the liquid phase was transferred into 1.5 ml Eppendorf tubes and then purified by centrifugation at 13,000g for 1 min and transferred into a fresh tube twice. DNA fragments were purified using Ampure XP beads (Beckman Coulter, prediluted one in eight in bead binding buffer: 1 M NaCl, 20% PEG8000, 20 mM Tris pH 8, 1 mM EDTA) at a bead-to-sample ratio of 0.8. After 15 min incubation at room temperature, beads were washed twice with 1 ml 80% ethanol, resuspending the beads during the first wash, and resuspended in 20 μl nuclease-free water. After 2 min elution, the supernatant was transferred into a fresh 0.5 ml tube. A second cleanup was performed by adding 26 μl undiluted Ampure XP beads and the beads were resuspended in 8 μl nuclease-free water. The cleaned DNA was then linearly amplified by IVT by adding 12 μl of MEGAscript T7 Transcription Kit (Fisher Scientific, catalog no. AMB13345) for 12 h at 37 °C. Template DNA was removed by addition of 2 μl TurboDNAse (IVT kit) and incubation for 15 min at 37 °C. The RNA produced was further purified using RNA Clean XP beads (Beckman Coulter) at a bead-to-sample ratio of 0.8 and samples were resuspended in 22 μl of nuclease-free water. RNA was fragmented by mixing in 4.4 μl fragmentation buffer (200 mM Tris-acetate pH 8.1, 500 mM KOAc, 150 mM MgOAc) and incubation for 2 min at 94 °C. Fragmentation was stopped by transferring samples to ice and adding 2.64 μl 0.5 M EDTA, followed by another bead cleanup; samples were resuspended in 12 μl nuclease-free water.

RNA (5 μl) was primed for reverse transcription by adding 0.5 μl 10 mM dNTPs and 1 μl 20 mM random hexamer RT primer (5′-GCCTTGGCACCCGAGAATTCCANNNNNN-3′) and hybridizing it by incubation at 65 °C for 5 min followed by direct cool down on ice. Reverse transcription was performed by further addition of 2 μl first strand buffer (part of Invitrogen kit, catalog no. 18064014), 1 μl 0.1 M DTT, 0.5 μl RNAseOUT (Invitrogen, catalog no. LS10777019) and 0.5 μl SuperscriptII (Invitrogen, catalog no. 18064014) and incubating the mixture at 25 °C for 10 min followed by 1 h at 42 °C. Single-stranded DNA was purified by adding 0.5 μl RNAseA (Thermo Fisher, catalog no. EN0531) and incubating for 30 min at 37 °C.

A final PCR amplification to add the Illumina small RNA barcodes and handles was performed by adding 25 μl of NEBNext Ultra II Q5 Master Mix (NEB, catalog no. M0492L), 11 μl nuclease-free water and 2 μl of 10 μM RP1 and RPIx primers.

PCR protocol for sortChIC experiments

Activation for 30 s at 98 °C; 8–12 cycles (depending on starting material) of 10 s at 98 °C, 30 s at 60 °C and 30 s at 72 °C; final amplification for 10 min at 72 °C.

PCR products were cleaned by two consecutive DNA bead clean-ups with a bead to sample ratio of 0.8. Final product was eluted in 7 μl nuclease-free water. The abundance and quality of the final library were assessed by QUBIT and bioanalyzer.

Data processing

All DNA libraries were sequenced on an Illumina NextSeq500 with 2 × 75 bp. We ran the raw fastq files through the Single-Cell MultiOmics (SCMO) workflow52 (github.com/BuysDB/SingleCellMultiOmics). The workflow comprises six steps.

(1) Demultiplex raw fastq files using demux.py (SCMO). (2) Trim fastq files by removing adapters using cutadapt (v.3.5). (3) Map trimmed fastq files using bwa (v.0.7.17-r1188). (4) Tag bam files with cell barcode information, using bamtagmultiome.py (SCMO). (5) Generate count tables using bamToCountTable.py (SCMO). (6) Run dimensionality reduction of count matrices using run_LDA_model.R. See an example of the pipeline in the scChIX-seq Github repository53.

Unmixing scChIX-seq signal

Single-cell epigenomics techniques (for example, sortChIC, CUT&RUN and CUT&Tag) generate a vector of counts indicating the number of cut fragments that map in each genomic region for each cell. We model the vector of counts from a double-incubated cell \(\overrightarrow{y}\) as a linear combination of two multinomial distributions: one coming from a cluster c of histone modification 1, parameterized by \({\overrightarrow{p}}_{c}\), the other from another cluster d of histone modification 2, parameterized by \({\overrightarrow{q}}_{d}\). The log-likelihood for a linear combination of two multinomials is:

$${\mathrm{L}}_{(c,d)}=\log \left(P\left(\overrightarrow{y}\,|\,{\overrightarrow{p}}_{c},{\overrightarrow{q}}_{d},w\right)\right)\propto \sum_{g=1}^{G}{y}_{g}\log \left(w{p}_{c,g}+\left(1-w\right){q}_{d,g}\right).$$

(1)

\(\overrightarrow{y}\) is the number of cuts across the genome for a double-incubated cell. \({p}_{c,g}\) and \({q}_{d,g}\) are cluster-specific probabilities indicating the likelihood that a cut fragment maps to region g in histone modifications 1 and 2, respectively. w is the mixing fraction of histone modification 1 in the double-incubated cell, which we estimate by maximizing the log-likelihood given \(\overrightarrow{y}\), \({\overrightarrow{p}}_{c}\) and \({\overrightarrow{q}}_{d}\).

Applying single-cell techniques to complex tissues generates data with many clusters. Therefore, given a double-incubated cell, we do not know which pair of clusters (c,d) were combined to generate the observed counts. We therefore calculate the log-likelihood for all possible pairs of clusters learned from the training data and then select the cluster pair with the highest probability for each cell.
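The cluster-pair search can be illustrated with a small, dependency-free sketch of equation (1). It assumes the cluster-specific probability vectors are already available as plain lists; the function names, the golden-section search used to estimate w (valid because the log-likelihood is concave in w) and the small `eps` guard against log(0) are illustrative choices, not a description of the scChIX R package.

```python
import math


def mixture_loglik(y, p, q, w, eps=1e-12):
    """Equation (1) up to a constant: sum_g y_g * log(w * p_g + (1 - w) * q_g)."""
    return sum(
        yg * math.log(w * pg + (1.0 - w) * qg + eps)
        for yg, pg, qg in zip(y, p, q)
        if yg > 0
    )


def fit_w(y, p, q, tol=1e-6):
    """Estimate the mixing fraction w in [0, 1] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if mixture_loglik(y, p, q, a) < mixture_loglik(y, p, q, b):
            lo = a   # the maximum lies to the right of a
        else:
            hi = b   # the maximum lies to the left of b
    w = 0.5 * (lo + hi)
    return w, mixture_loglik(y, p, q, w)


def best_cluster_pair(y, p_clusters, q_clusters):
    """Evaluate every (c, d) pair and return (c, d, w, loglik) for the most likely one."""
    best = None
    for c, p in p_clusters.items():
        for d, q in q_clusters.items():
            w, loglik = fit_w(y, p, q)
            if best is None or loglik > best[3]:
                best = (c, d, w, loglik)
    return best
```

Here `p_clusters` and `q_clusters` are dictionaries mapping a cluster label to its probability vector, so the search cost grows with the product of the number of clusters in the two marks.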

Cluster-specific probabilities \({\overrightarrow{p}}_{c}\) and \({\overrightarrow{q}}_{d}\) are learned by applying LDA (with k = 30 topics) using the topicmodels R package54 to the training data (that is, the count matrices of single-incubated cells).

After assigning each cell to the most probable cluster pair \((\hat{c},\hat{d})\), we assign \({y}_{i,j}\), the jth read mapped to region g in cell i, to histone mark 1 with probability \({P}_{i,j}\):

$${P}_{i,j}=\frac{w{p}_{\hat{c},g}}{w{p}_{\hat{c},g}+\left(1-w\right){q}_{\hat{d},g}}.$$

(2)

This assignment generates a pair of vectors \({\overrightarrow{y}}_{1,i}\) and \({\overrightarrow{y}}_{2,i}\) that are linked because they both come from cell i. Unmixed counts \({\overrightarrow{y}}_{1,i}\) and \({\overrightarrow{y}}_{2,i}\) are then projected back onto the space inferred from training data of histone modification 1 and 2, respectively. The links between histone modification 1 and 2 are used to transfer labels and create linked UMAPs between the two histone modifications.
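A minimal sketch of the read assignment in equation (2) is given below. It assumes the selected cluster pair and the mixing fraction w are already known, draws each read independently and returns the two unmixed count vectors; the function name and the fixed random seed are illustrative assumptions.

```python
import random


def assign_reads(y, p, q, w, seed=0):
    """Split counts y into (y1, y2) by assigning each read to mark 1 with the
    probability given by equation (2)."""
    rng = random.Random(seed)
    y1, y2 = [], []
    for yg, pg, qg in zip(y, p, q):
        denom = w * pg + (1.0 - w) * qg
        # If neither mark places any probability on the region, fall back to 0.5.
        prob_mark1 = (w * pg / denom) if denom > 0 else 0.5
        n_mark1 = sum(1 for _ in range(yg) if rng.random() < prob_mark1)
        y1.append(n_mark1)
        y2.append(yg - n_mark1)
    return y1, y2
```

By construction the two output vectors add up to the input counts region by region, which is what keeps the unmixed profiles linked to the same cell.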

Latent Dirichlet allocation

LDA is a probabilistic matrix decomposition model that is useful when the input data is a matrix of counts. LDA uses hierarchical multinomial models to estimate the relative frequencies of cuts in each genomic region in single cells.

To generate the genomic location of the jth read for cell i:

Choose a topic \({z}_{i,j}\) by sampling from the cell-specific distribution of topics:

$$\begin{array}{r}{\overrightarrow{U}}_{i}\sim \mathrm{Dirichlet}(\alpha )\\ {z}_{i,j}\sim \mathrm{Multinomial}({\overrightarrow{U}}_{i},1)\end{array}$$

Choose genomic region \({w}_{i,j}\) by sampling from the topic-specific distribution of genomic regions:

$$\begin{array}{r}{\overrightarrow{V}}_{k}\sim \mathrm{Dirichlet}(\delta )\\ {w}_{i,j}\sim \mathrm{Multinomial}({\overrightarrow{V}}_{{z}_{i,j}},1)\end{array}$$

The Dirichlet distributions are priors to prevent overfitting when there are few cuts in the region. We used the LDA model implemented by the topicmodels R package, using the Gibbs sampling implementation with hyperparameters α = 50/K (1.67 for K = 30) and δ = 0.1, where K is the number of topics23.
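The generative process described above can be simulated directly, which may help when reasoning about what the two Dirichlet priors do. The sketch below uses only the Python standard library, draws Dirichlet samples by normalizing Gamma draws, and uses toy sizes (3 topics, 20 regions) rather than the 30 topics used in the analysis; all names are illustrative.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_cell(rng, topic_region_probs, alpha, n_reads):
    """Generate region counts for one cell.

    U_i ~ Dirichlet(alpha); for each read, z ~ Multinomial(U_i, 1) picks a topic
    and the region is drawn from that topic's distribution V_z.
    """
    n_topics = len(topic_region_probs)
    n_regions = len(topic_region_probs[0])
    cell_topics = sample_dirichlet(rng, alpha, n_topics)
    counts = [0] * n_regions
    for _ in range(n_reads):
        z = rng.choices(range(n_topics), weights=cell_topics)[0]
        g = rng.choices(range(n_regions), weights=topic_region_probs[z])[0]
        counts[g] += 1
    return cell_topics, counts


rng = random.Random(1)
K, G = 3, 20  # toy sizes for illustration only
V = [sample_dirichlet(rng, 0.1, G) for _ in range(K)]   # delta = 0.1
U_i, counts = simulate_cell(rng, V, alpha=1.67, n_reads=500)
```

A small δ concentrates each topic on a few regions, whereas α above 1 spreads each cell over several topics.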

We estimate \({\overrightarrow{p}}_{c}\) and \({\overrightarrow{q}}_{d}\) for each cluster in histone modification 1 \(\{{\overrightarrow{p}}_{1},{\overrightarrow{p}}_{2},\ldots ,{\overrightarrow{p}}_{C}\}\) and modification 2 \(\{{\overrightarrow{q}}_{1},{\overrightarrow{q}}_{2},\ldots ,{\overrightarrow{q}}_{D}\}\) by averaging the estimated probabilities across cells assigned to each cluster for each genomic region g:

$${p}_{c,g}=\frac{1}{|C|}\sum_{i\in C}\sum_{k=1}^{K}{V}_{g,k}{U}_{k,i}$$

where C is the set of cells that belong to cluster c.
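The averaging step can be written out explicitly. In this sketch `V[g][k]` holds the region-by-topic probabilities and `U[k][i]` the topic-by-cell weights, mirroring the subscripts of the equation; the nested-list layout and the function name are assumptions made for illustration.

```python
def cluster_region_probs(V, U, cells_in_cluster):
    """Average sum_k V[g][k] * U[k][i] over the cells i of one cluster, for every region g."""
    n_regions = len(V)
    n_topics = len(V[0])
    probs = []
    for g in range(n_regions):
        total = 0.0
        for i in cells_in_cluster:
            total += sum(V[g][k] * U[k][i] for k in range(n_topics))
        probs.append(total / len(cells_in_cluster))
    return probs
```

If every column of V and every column of U sums to one, the returned vector also sums to one, so it can be used directly as a multinomial parameter.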

Simulation of single- and double-incubated histone modification data

To simulate multimodal single-cell histone modification data with varying degrees of overlap, we extended simATAC55 to generate cell-type profiles from histone modifications with varying degrees of mutual exclusivity.

For each cell type, we first run simATAC to generate sparse count data of 10,000 loci across 750 cells partitioned into three technical replicates of 250 cells each. The high-dimensional count data are sparse. Counts from each locus are generated according to a Poisson likelihood with locus-specific means (λ) matching real single-cell ATAC-seq from K562 cells (GSE99172).

In our 750 cells, cells 1–250 represent single-incubated cells from mark 1; cells 251–500 from mark 2; cells 501–750 from double-incubated cells. Cells from mark 1 have counts generated from locus-specific means λ. Cells from mark 2 also have counts generated from λ, but we swap the top x% of bins with highest λ with bins with lowest λ, allowing precisely defined sets of mutually exclusive and overlapping bins. We use x = 1%, 50% and 99% to benchmark our method from mostly overlapping (that is, x = 1%) to mostly mutually exclusive (that is, x = 99%). Double-incubated cells are simulated by adding counts generated from mark 1 and mark 2.

To generate cell-type-specific profiles, we repeat the above with a cell-type-specific random seed and shuffle the order of the bins. This generates count data where λ is cell-type specific, but the distribution of λ is preserved genome-wide.
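The bin-swapping and count-summing steps can be sketched as follows. This is one possible reading of the procedure, not the simATAC extension itself: the sketch pairs the highest-λ bins with an equal number of lowest-λ bins, and it only accepts x up to 50% because for larger x the two sets overlap and the text does not specify how that case is resolved. The Poisson sampler is Knuth's multiplication method, and all function names are hypothetical.

```python
import math
import random


def swap_extreme_bins(lam, percent):
    """Swap the top `percent`% of bins (highest lambda) with as many lowest-lambda bins."""
    if not 0 <= percent <= 50:
        raise ValueError("this sketch only handles 0 <= percent <= 50")
    n_swap = int(len(lam) * percent / 100)
    order = sorted(range(len(lam)), key=lambda g: lam[g])  # bin indices, ascending lambda
    swapped = list(lam)
    lowest = order[:n_swap]
    highest = list(reversed(order[len(order) - n_swap:]))
    for low, high in zip(lowest, highest):
        swapped[low], swapped[high] = lam[high], lam[low]
    return swapped


def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means of sparse bins."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_cells(rng, lam, n_cells):
    """Draw one Poisson count per bin for each of n_cells cells."""
    return [[sample_poisson(rng, m) for m in lam] for _ in range(n_cells)]


def simulate_double(cells_mark1, cells_mark2):
    """A double-incubated cell is the element-wise sum of a mark 1 and a mark 2 cell."""
    return [[a + b for a, b in zip(c1, c2)] for c1, c2 in zip(cells_mark1, cells_mark2)]
```

Swapping rather than resampling keeps the genome-wide distribution of λ identical between the two marks, so only the location of the signal differs.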

Estimating the top cluster-specific bins

We use the LDA matrix factorization to identify the top cluster-specific bins in the data. We rank the bin loadings for each cell type and take the top 150 (whole bone marrow) or 250 (mouse organogenesis) bins with the largest loadings.
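Selecting the top bins is a ranking operation. The sketch below assumes that a single vector of loadings (one value per bin) is already available for a given cell type; how that vector is derived from the LDA factors is not specified here, and the function name is illustrative.

```python
def top_bins(loadings, n_top):
    """Return the indices of the n_top bins with the largest loadings, largest first."""
    ranked = sorted(range(len(loadings)), key=lambda g: loadings[g], reverse=True)
    return ranked[:n_top]
```

With the settings in the text, `n_top` would be 150 for whole bone marrow and 250 for mouse organogenesis.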

Inferring pseudotime in differentiation data

To analyze the macrophage differentiation data, we first removed erythroblasts, plasmacytoid dendritic cells, and innate lymphocyte cells from the data, which were concentrated at day 0 and not considered to be part of the macrophage differentiation trajectory. We then ran LDA (k = 30 topics) and performed principal component analysis (PCA) on the LDA outputs, which retrieves the principal components that explain the largest amount of variance after denoising the data. We used the first principal component for H3K4me1 and H3K36me3 to define pseudotime, which we found correlates with the day along the timecourse.

Unmixing scChIX-seq signal from continuous pseudotime

To apply scChIX-seq on continuous pseudotime, we modify the log-likelihood (equation (1)) to account for a continuous variable:

$$\mathrm{L}\left({t}_{1},{t}_{2}\right)=\log \left(P\left(\overrightarrow{y}\,|\,\overrightarrow{p}\left({t}_{1}\right),\overrightarrow{q}\left({t}_{2}\right),w\right)\right)\propto \sum_{g=1}^{G}{y}_{g}\log \left(w{p}_{g}\left({t}_{1}\right)+\left(1-w\right){q}_{g}\left({t}_{2}\right)\right)$$

(3)

where \({t}_{1}\in [0,1]\) is pseudotime from histone modification 1 and \({t}_{2}\in [0,1]\) is pseudotime from modification 2.

To estimate pseudotime, we ran LDA to denoise the count matrix, and then ran PCA to estimate the principal components explaining the largest amount of variance in the data. We took the first principal component as our pseudotime estimate for both marks, which captured the epigenomic changes over the 7-day timecourse.

\({p}_{g}\left(t\right)\) is estimated by fitting the signal from histone modification 1 at genomic region g with a lowess curve along pseudotime. We estimate \({q}_{g}\left(t\right)\) analogously but using signal from histone modification 2.

To infer the pseudotime of histone modifications 1 and 2 simultaneously given a vector of counts from a double-incubated cell, we estimate the \({t}_{1}\) and \({t}_{2}\) that maximize the log-likelihood L from equation (3). We estimate the variance-covariance matrix of \({t}_{1}\) and \({t}_{2}\) from the inverse of the Hessian matrix of the negative log-likelihood, and take the square roots of its diagonal elements as the standard errors.

Since \({t}_{1}\) and \({t}_{2}\) are constrained between 0 and 1, we use the L-BFGS-B optimization algorithm implemented in R. Since estimates from a single cell can sometimes be noisy due to low counts, we sum the counts across the 25 nearest neighbors (estimated from the latent space inferred by LDA) for each double-incubated cell.
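The joint pseudotime estimate can be illustrated with a simplified, dependency-free sketch of equation (3). It departs from the procedure in the text in three labeled ways: the fitted curves are represented by values on an even grid and interpolated linearly (standing in for the lowess fits), the bounded search is an exhaustive grid search over [0, 1] × [0, 1] (standing in for L-BFGS-B), and the mixing fraction w is treated as known. All function names are hypothetical.

```python
import math


def interp(curve, t):
    """Linearly interpolate a curve sampled on an even grid over [0, 1]."""
    pos = t * (len(curve) - 1)
    lo = min(int(pos), len(curve) - 2)
    frac = pos - lo
    return (1.0 - frac) * curve[lo] + frac * curve[lo + 1]


def loglik(y, p_curves, q_curves, t1, t2, w, eps=1e-12):
    """Equation (3) up to a constant, reading p_g(t1) and q_g(t2) off the curves."""
    total = 0.0
    for yg, p_curve, q_curve in zip(y, p_curves, q_curves):
        if yg > 0:
            mix = w * interp(p_curve, t1) + (1.0 - w) * interp(q_curve, t2)
            total += yg * math.log(mix + eps)
    return total


def fit_pseudotimes(y, p_curves, q_curves, w=0.5, steps=50):
    """Grid search for the (t1, t2) pair with the highest log-likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(
        ((t1, t2) for t1 in grid for t2 in grid),
        key=lambda pair: loglik(y, p_curves, q_curves, pair[0], pair[1], w),
    )
```

Each entry of `p_curves` and `q_curves` is the curve for one genomic region, so the constraint that pseudotime lies between 0 and 1 is enforced simply by the extent of the grid.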

Chromatin velocity during macrophage differentiation

We assume that dynamic genomic regions in H3K36me3 can be modeled using a first-order differential equation

$$\frac{d{K}_{36}\left(t\right)}{dt}={K}_{4}\left(t\right)-\gamma {K}_{36}\left(t\right).$$

(4)

We estimate the time constant γ for each genomic region by fitting an exponential relaxation function across pseudotime

$${K}_{36}\left(t\right)={y}_{0}+A\left(1-{e}^{-\gamma t}\right),$$

(5)

where \({y}_{0}\) is the signal at t = 0 and A is the predicted H3K36me3 level at steady state. Fitting γ directly from the pseudotime allows us to leverage signal from both single-incubated and deconvolved cells.

To predict future values of H3K36me3 levels for each cell at each genomic region, we use the Euler method and plug in the estimated γ, H3K4me1 levels at time t and time step h of 0.02 pseudotime units:

$${K}_{36}\left(t+h\right)={K}_{36}\left(t\right)+h\left({K}_{4}\left(t\right)-\gamma {K}_{36}\left(t\right)\right).$$

(6)
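Equations (5) and (6) translate into two short functions. The sketch below is a plain-Python illustration with hypothetical names; it treats a single genomic region in a single cell and takes γ as already fitted.

```python
import math


def relaxation(t, y0, amplitude, gamma):
    """Equation (5): exponential relaxation of the H3K36me3 signal along pseudotime."""
    return y0 + amplitude * (1.0 - math.exp(-gamma * t))


def euler_step(k36, k4, gamma, h=0.02):
    """Equation (6): one forward Euler step of dK36/dt = K4 - gamma * K36."""
    return k36 + h * (k4 - gamma * k36)
```

As a sanity check on the model in equation (4), holding H3K4me1 constant and iterating `euler_step` drives the predicted H3K36me3 level towards K4/γ, the value at which the right-hand side of the differential equation vanishes.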

Finally, we project the single- and double-incubated H3K36me3 signal onto the first two principal components and project the predicted future values onto the PCA. We use the velocity grid flow visualization as implemented in velocyto56 to visualize the velocity vectors on the PCA space.

Comparison with multi-CUT&TAG

Raw fastq files (R1, R2 and R3) from the single-cell experiments were downloaded from Gene Expression Omnibus accession number GSE171554. The first 42 bases of the reads in R1 and R2 were trimmed to remove the barcodes and the bases common to all Tn5 adapter sequences. The 16-base cell barcodes in R3 were added to the fastq headers of R1 and R2. The trimmed and cell-barcoded R1 and R2 reads were then aligned to the mm10 mouse genome using Burrows-Wheeler aligner (bwa v.0.7.17-r1188). Fragments that start at the same location and have the same cell barcode were considered duplicates and discarded. Cells with more than 100 fragments with MAPQ scores in R1 greater than or equal to 40 were kept for comparison with scChIX-seq.
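The deduplication and cell-filtering rules can be expressed compactly. The sketch below assumes fragments have already been parsed into `(cell_barcode, chrom, start, mapq)` tuples, which is a hypothetical layout chosen for illustration; it keeps the first fragment seen at each position per cell, and whether strand should also enter the duplicate key is not specified in the text.

```python
from collections import defaultdict


def filter_cells(fragments, min_fragments=100, min_mapq=40):
    """Return the set of cell barcodes that pass the fragment-count filter.

    fragments: iterable of (cell_barcode, chrom, start, mapq) tuples (assumed layout).
    """
    seen = set()
    per_cell = defaultdict(int)
    for cell, chrom, start, mapq in fragments:
        key = (cell, chrom, start)
        if key in seen:
            continue  # same start location and same cell barcode: duplicate
        seen.add(key)
        if mapq >= min_mapq:
            per_cell[cell] += 1
    # "More than 100 fragments" is a strict inequality.
    return {cell for cell, n in per_cell.items() if n > min_fragments}
```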

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The data discussed in this publication have been deposited in NCBI’s Gene Expression Omnibus and are accessible through Gene Expression Omnibus Series accession number GSE155280 (ref. 57).

Code availability

We developed the SingleCellMultiOmics package, in which there are modules used for processing sortChIC data (https://github.com/BuysDB/SingleCellMultiOmics)52, and an R package that implements scChIX-seq and contains snakemake workflows for processing data and example notebooks for downstream analyses (https://github.com/jakeyeung/scChIX)53.

References

  1. Rothbart, S. B. & Strahl, B. D. Interpreting the language of histone and DNA modifications. Biochim. Biophys. Acta 1839, 627–643 (2014).

  2. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

  3. Landt, S. G. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22, 1813–1831 (2012).

  4. Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).

  5. Lara-Astiaso, D. et al. Chromatin state dynamics during blood formation. Science 345, 943–949 (2014).

  6. Rotem, A. et al. Single-cell ChIP–seq reveals cell subpopulations defined by chromatin state. Nat. Biotechnol. 33, 1165–1172 (2015).

  7. Grosselin, K. et al. High-throughput single-cell ChIP–seq identifies heterogeneity of chromatin states in breast cancer. Nat. Genet. 51, 1060–1066 (2019).

  8. Ai, S. et al. Profiling chromatin states using single-cell itChIP–seq. Nat. Cell Biol. 21, 1164–1172 (2019).

  9. Schmid, M., Durussel, T. & Laemmli, U. K. ChIC and ChEC: genomic mapping of chromatin proteins. Mol. Cell 16, 147–157 (2004).

  10. Skene, P. J., Henikoff, J. G. & Henikoff, S. Targeted in situ genome-wide profiling with high efficiency for low cell numbers. Nat. Protoc. 13, 1006–1019 (2018).

  11. Kaya-Okur, H. S. et al. CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nat. Commun. 10, 1–10 (2019).

  12. Ku, W. L. et al. Single-cell chromatin immunocleavage sequencing (scChIC-seq) to profile histone modification. Nat. Methods 16, 323–325 (2019).

  13. Wang, Q. et al. CoBATCH for high-throughput single-cell epigenomic profiling. Mol. Cell 76, 206–216.e7 (2019).

  14. Harada, A. et al. A chromatin integration labelling method enables epigenomic profiling with lower input. Nat. Cell Biol. 21, 287–296 (2019).

  15. Wu, S. J. et al. Single-cell CUT&Tag analysis of chromatin modifications in differentiation and tumor progression. Nat. Biotechnol. 39, 819–824 (2021).

  16. Bartosovic, M., Kabbe, M. & Castelo-Branco, G. Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. Nat. Biotechnol. 39, 825–835 (2021).

  17. Janssens, D. H. et al. Automated CUT&Tag profiling of chromatin heterogeneity in mixed-lineage leukemia. Nat. Genet. 53, 1586–1596 (2021).

  18. Zeller, P. et al. Hierarchical chromatin regulation during blood formation uncovered by single-cell sortChIC. Preprint at bioRxiv https://doi.org/10.1101/2021.04.26.440606 (2021).

  19. Ku, W. L., Pan, L., Cao, Y., Gao, W. & Zhao, K. Profiling single-cell histone modifications using indexing chromatin immunocleavage sequencing. Genome Res. 31, 1831–1842 (2021).

  20. Navidi, Z., Zhang, L. & Wang, B. simATAC: a single-cell ATAC-seq simulation framework. Genome Biol. 22, 1–16 (2021).

  21. Pauler, F. M. et al. H3K27me3 forms BLOCs over silent genes and intergenic regions and specifies a histone banding pattern on a mouse autosomal chromosome. Genome Res. 19, 221–233 (2009).

  22. Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).

  23. Grün, B. & Hornik, K. topicmodels: an R package for fitting topic models. J. Stat. Softw. 40, 1–30 (2011).

  24. Gopalan, S., Wang, Y., Harper, N. W., Garber, M. & Fazzio, T. G. Simultaneous profiling of multiple chromatin proteins in the same cells. Mol. Cell 81, 4736–4746.e5 (2021).

  25. Giladi, A. et al. Single-cell characterization of haematopoietic progenitors and their trajectories in homeostasis and perturbed haematopoiesis. Nat. Cell Biol. 20, 836–846 (2018).

  26. Mendenhall, E. M. et al. GC-rich sequence elements recruit PRC2 in mammalian ES cells. PLoS Genet. 6, e1001244 (2010).

  27. Zou, F. et al. Expression and function of tetraspanins and their interacting partners in B cells. Front. Immunol. 9, 1606 (2018).

  28. Benhamou, D. et al. The c-Myc/miR17-92/PTEN axis tunes PI3K activity to control expression of recombination activating genes in early B cell development. Front. Immunol. 9, 2715 (2018).

  29. Goldmit, M. et al. Epigenetic ontogeny of the Igk locus during B cell development. Nat. Immunol. 6, 198–203 (2005).

  30. Pan, C., Baumgarth, N. & Parnes, J. R. CD72-deficient mice reveal nonredundant roles of CD72 in B cell development and activation. Immunity 11, 495–506 (1999).

  31. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).

  32. Pishesha, N. et al. Transcriptional divergence and conservation of human and mouse erythropoiesis. Proc. Natl Acad. Sci. USA 111, 4103–4108 (2014).

  33. Koretzky, G. A., Abtahian, F. & Silverman, M. A. SLP76 and SLP65: complex regulation of signalling in lymphocytes and beyond. Nat. Rev. Immunol. 6, 67–78 (2006).

  34. Brachtendorf, G. et al. Early expression of endomucin on endothelium of the mouse embryo and on putative hematopoietic clusters in the dorsal aorta. Dev. Dyn. 222, 410–419 (2001).

  35. Sedykh, I. et al. Zebrafish Rfx4 controls dorsal and ventral midline formation in the neural tube. Dev. Dyn. 247, 650–659 (2018).

  36. DeBoer, E. M. et al. Prenatal deletion of the RNA-binding protein HuD disrupts postnatal cortical circuit maturation and behavior. J. Neurosci. 34, 3674–3686 (2014).

  37. Inoue, T. et al. Analysis of mouse Cdh6 gene regulation by transgenesis of modified bacterial artificial chromosomes. Dev. Biol. 315, 506–520 (2008).

  38. Chen, A. F. et al. GRHL2-dependent enhancer switching maintains a pluripotent stem cell transcriptional subnetwork after exit from naive pluripotency. Cell Stem Cell 23, 226–238.e4 (2018).

  39. Logan, M. et al. Expression of Cre recombinase in the developing mouse limb bud driven by aPrxl enhancer. Genesis 33, 77–80 (2002).

  40. Takeuchi, J. K. & Bruneau, B. G. Directed transdifferentiation of mouse mesoderm to heart tissue by defined factors. Nature 459, 708–711 (2009).

  41. Zhao, R. et al. Loss of both GATA4 and GATA6 blocks cardiac myocyte differentiation and results in acardia in mice. Dev. Biol. 317, 614–619 (2008).

  42. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).

  43. Murphy, Z. C. et al. Regulation of RNA polymerase II activity is essential for terminal erythroid maturation. Blood 138, 1740–1756 (2021).

  44. Gautier, E. L. et al. Gene-expression profiles and transcriptional regulatory pathways that underlie the identity and diversity of mouse tissue macrophages. Nat. Immunol. 13, 1118–1128 (2012).

  45. La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).

  46. Janssens, D. H. et al. CUT&Tag2for1: a modified method for simultaneous profiling of the accessible and silenced regulome in single cells. Genome Biol. 23, 81 (2022).

  47. Stuart, T. et al. Nanobody-tethered transposition allows for multifactorial chromatin profiling at single-cell resolution. Preprint at bioRxiv https://doi.org/10.1101/2022.03.08.483436 (2022).

  48. Bartosovic, M. & Castelo-Branco, G. Multimodal chromatin profiling using nanobody-based single-cell CUT&Tag. Preprint at bioRxiv https://doi.org/10.1101/2022.03.08.483459 (2022).

  49. Meers, M. P., Llagas, G., Janssens, D. H., Codomo, C. A. & Henikoff, S. Multifactorial profiling of epigenetic landscapes at single-cell resolution using MulTI-Tag. Nat. Biotechnol. https://doi.org/10.1038/s41587-022-01522-9 (2022).

  50. Wang, M. & Zhang, Y. Tn5 transposase-based epigenomic profiling methods are prone to open chromatin bias. Preprint at bioRxiv https://doi.org/10.1101/2021.07.09.451758 (2021).

  51. Kaya-Okur, H. S., Janssens, D. H., Henikoff, J. G., Ahmad, K. & Henikoff, S. Efficient low-cost chromatin profiling with CUT&Tag. Nat. Protoc. 15, 3264–3283 (2020).

  52. de Barbanson, B. A. et al. BuysDB/SingleCellMultiOmics: 0.1.30 (v.0.1.30). Zenodo. https://doi.org/10.5281/zenodo.7074511 (2022).

  53. Yeung, J. jakeyeung/scChIX: v.1.0.1 (v.1.0.1). Zenodo. https://doi.org/10.5281/zenodo.7152037 (2022).

  54. Grün, B. & Hornik, K. topicmodels: an R package for fitting topic models. J. Stat. Softw. 40, 1–30 (2011).

  55. Navidi, Z., Zhang, L. & Wang, B. simATAC: a single-cell ATAC-seq simulation framework. Genome Biol. 22, 74 (2021).

  56. La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).

  57. Yeung, J., Florescu, M., Zeller, P., de Barbanson, B. A., Wellenstein, M. D. & van Oudenaarden, A. scChIX-seq infers relationships between histone modifications in single cells. Datasets. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE155280 (2022).

Acknowledgements

We thank M. van Loenhout for experimental advice on purifying cell types from the bone marrow, R. van der Linden for expertise with FACS and M. Blotenburg for help with cell typing the mouse organogenesis dataset. We thank M. Saraswat and O. Stegle for discussions on multinomial distributions. This work was supported by a European Research Council Advanced grant (ERC-AdG 742225-IntScOmics); Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO) TOP grant (NWO CW 714.016.001) and NWO grant (OCENW.GROOT.2019.017); the Swiss National Science Foundation Early Postdoc Mobility (P2ELP3-184488 to P.Z. and P2BSP3-174991 to J.Y.); Marie Sklodowska-Curie Actions Postdoc (798573 to P.Z.) and the Human Frontier for Science Program Long-Term Fellowships (LT000209-2018-L to P.Z. and LT000097-2019-L to J.Y.). This work is part of the Oncode Institute which is financed partly by the Dutch Cancer Society.

Author information

Author notes

  1. These authors contributed equally: Jake Yeung, Maria Florescu, Peter Zeller.

Authors and Affiliations

  1. Oncode Institute, Hubrecht Institute-KNAW (Royal Netherlands Academy of Arts and Sciences) and University Medical Center Utrecht, Utrecht, the Netherlands

    Jake Yeung, Maria Florescu, Peter Zeller, Buys Anton de Barbanson, Max D. Wellenstein & Alexander van Oudenaarden

  2. Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria

    Jake Yeung

Contributions

J.Y., M.F., B.A.d.B. and A.v.O. conceived the project. M.F. developed double-incubation techniques and performed mouse bone marrow and organogenesis experiments with help from P.Z. P.Z. developed single-incubation techniques. P.Z. and M.D.W. designed and performed macrophage in vitro differentiation experiments. J.Y., M.F. and A.v.O. analyzed the data. J.Y. developed and applied statistical methods with help from M.F. and B.A.d.B. B.A.d.B. wrote the sortChIC preprocessing pipeline, with help from M.F. and J.Y. J.Y., M.F. and A.v.O. wrote the manuscript, with input from P.Z., M.D.W. and B.A.d.B.

Corresponding authors

Correspondence to
Jake Yeung or Alexander van Oudenaarden.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Benchmarking scChIX-seq across a range of overlapping patterns.

Left column: simulation results in a mutually exclusive scenario (that is, 1% of loci are overlapping). Middle column: results for an intermediate amount of overlap (that is, 50% of loci are overlapping). Right column: results for a highly correlated scenario (that is, 99% of loci are overlapping). (a) Distribution of unique fragment cuts per cell in simulation. (b) Sparsity of the input matrix. Note that in the mutually exclusive scenario, the double-incubated data are less sparse than the single-incubated data because loci with zero reads in one mark often have non-zero reads in the other mark. (c) Distribution of the degree of overlap (defined as the fraction of double-incubated signal belonging to mark 1: \(p=\frac{S_{1}}{S_{1}+S_{2}}\)) for each locus genome-wide. (d) Estimated degree of overlap from scChIX-seq. (e) UMAP representation of the three cell types underlying the simulation. UMAPs from the two marks are linked by double-incubated cells that are deconvolved by scChIX-seq. (f) Empirical 95% confidence interval across the range of \(\hat{p}=\frac{\hat{S}_{1}}{\hat{S}_{1}+\hat{S}_{2}}\) (from 0 to 1). Range obtained by aggregating results from the three overlapping patterns. n=101 simulation datapoints spread evenly between 0 and 1 inclusive. Error bars are empirical 95% confidence intervals, centers are the mean.
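
The degree of overlap defined in panel (c) is a per-locus ratio of the two marks' signals. A minimal Python sketch of that ratio follows, assuming per-locus signal values for mark 1 and mark 2 are already available. The function name `overlap_fraction` and the toy numbers are illustrative and do not come from the scChIX-seq code base.

```python
def overlap_fraction(s1, s2):
    """Return p = S1 / (S1 + S2) per locus; None where both signals are zero."""
    fractions = []
    for a, b in zip(s1, s2):
        total = a + b
        # The ratio is undefined when neither mark has signal at the locus.
        fractions.append(a / total if total > 0 else None)
    return fractions


# Hypothetical per-locus signals for mark 1 and mark 2.
mark1 = [10.0, 0.0, 5.0, 0.0]
mark2 = [0.0, 10.0, 5.0, 0.0]
print(overlap_fraction(mark1, mark2))
```

Under this definition, values near 0 or 1 indicate a locus dominated by a single mark, while values near 0.5 indicate that both marks contribute equally.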

Extended Data Fig. 2 scChIX-seq accurately deconvolves double-incubated signal into its respective histone modifications.

(a) Histogram of unique fragment cuts per cell. (b) Histogram of fraction of unique fragments starting with a “TA” motif. (c, d) UMAP of latent Dirichlet allocation (LDA) embedding using k=30 topics for H3K27me3 (c) and H3K9me3 (d). (e, f) UMAP representation of H3K27me3 (left) and H3K9me3 (right) data colored by unmixed or single-incubated cells (e) or ground truth cell type labels defined by FACS (f). (g, h) Genome-wide Pearson correlation between deconvolved H3K27me3 (g) and H3K9me3 (h) signal versus ground truth sortChIC purified by FACS. Shared genomic regions were calculated by using 1 kb bins across the genome. (i) Comparison of fragments per cell obtained from Multi-CUT&TAG versus scChIX-seq. Multi-CUT&TAG data came from a mixture of embryonic and trophoblast stem cells in vitro, while scChIX-seq came from sorted bone marrow cells in vivo. n=1806 cells for Multi-CUT&TAG, n=290 for scChIX-seq.

Extended Data Fig. 3 Coverage tracks of deconvolved cells and genome statistics.

(a) Coverage tracks for B cells visualizing the H3K27me3+H3K9me3, deconvolved H3K27me3 or H3K9me3, and ground truth H3K27me3 or H3K9me3 histone modification levels for three different genomic regions. Double-incubated signal is in grey, H3K27me3 single and unmixed signal in orange, and H3K9me3 single and unmixed signal in blue. Under each coverage track are cut fragments of single cells. Each row of the single-cell track shows cuts from an individual cell. Shown is a subset of cells, chosen for their high number of cuts in the region. Rows are in decreasing order of total number of cuts. (b) H3K27me3 coverage tracks showing the region around Pax5 for the ground truth H3K27me3 pseudobulk signal from single-incubated cells and for the deconvolved H3K27me3 pseudobulk signal from double-incubated cells for three cell types: B cells (grey), granulocytes (green), and NK cells (blue). (c) H3K9me3 (top) and H3K27me3 (bottom) coverage tracks showing the region around Auts2 for ground truth (single-incubated) and deconvolved (unmixed) signal for B cells (grey), granulocytes (green) and NK cells (blue). (d) Distribution of assignment probability estimates in the genome for the three cell types. Vertical dotted lines represent cutoffs to define H3K9me3-specific (that is, p < 0.5) or H3K27me3-specific regions (that is, p ≥ 0.5). (e) Boxplot distributions of GC content in H3K27me3-marked and H3K9me3-marked regions. (f) Boxplot distributions of distance to TSS in the two classes of regions. Distances are measured from the center of the 50 kb locus to the nearest TSS. Number of bins in each boxplot: n=9962 for B cells p < 0.5, n=15877 for B cells p ≥ 0.5, n=12483 for granulocytes p < 0.5, n=13345 for granulocytes p ≥ 0.5, n=7337 for NK cells p < 0.5, n=18491 for NK cells p ≥ 0.5. Boxplots show 25th percentile, median and 75th percentile, with the whiskers spanning 97% of the data.
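
The cutoff in panel (d) splits loci into two classes by their assignment probability. The sketch below illustrates that split in Python, assuming p is the estimated fraction of signal attributed to H3K27me3 at each locus, as the stated cutoffs imply. The function name `classify_loci` and the example values are hypothetical.

```python
from collections import Counter


def classify_loci(p_values, cutoff=0.5):
    """Label each locus by which mark dominates, using the caption's cutoff."""
    labels = []
    for p in p_values:
        # p < cutoff: H3K9me3-specific; p >= cutoff: H3K27me3-specific.
        labels.append("H3K9me3-specific" if p < cutoff else "H3K27me3-specific")
    return labels


# Hypothetical assignment probability estimates for five loci.
p_hat = [0.05, 0.49, 0.5, 0.8, 0.99]
labels = classify_loci(p_hat)
print(Counter(labels))
```

Counting the labels per cell type in this way would give bin counts of the kind reported for each boxplot.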

Extended Data Fig. 4 Inferring cluster pairs from H3K4me1+H3K27me3 transfers cell type labels.

(a) Histogram of unique fragment cuts per cell. (b) Histogram of fraction of unique fragments starting with a “TA” motif. (c) UMAP of H3K4me1 sortChIC data, cells colored by cell type. (d) Assignment plot showing individual H3K4me1+H3K27me3 cells (represented as dots) assigned to a pair of topics (x-axis labels are H3K4me1 clusters, named by their associated cell type, while y-axis labels are H3K27me3 clusters). Cells along the diagonal are high-confidence predictions that match an H3K4me1 cluster with an H3K27me3 topic, and are colored by the H3K4me1-derived cell type labels. (e) UMAP of H3K4me1+H3K27me3 sortChIC. Cells are colored by their cell type inferred from cluster pairs. Low-confidence predictions are colored in grey. (f, g) UMAP representation of H3K4me1 (f) and H3K27me3 (g). Cells are colored by whether the epigenome was generated by single-incubation or by unmixing by scChIX-seq.

Extended Data Fig. 5 Histone modification signal of deconvolved cell types correlates with public H3K4me1 ChIP-seq and H3K27me3 sortChIC ground truth data.

(a-d) Pearson correlation between publicly available H3K4me1 ChIP-seq data (ref. 5) of purified B cells (a), erythroid cells (b), granulocytes (c), or NK cells (d) versus H3K4me1 profiles of different cell types derived from scChIX-seq. Single: pseudobulk profiles generated by single incubation, unmixed: pseudobulk profiles deconvolved by scChIX-seq. (e-g) Pearson correlation between H3K27me3 sortChIC from FACS-purified B cells (e), granulocytes (f), NK cells (g) versus H3K27me3 sortChIC derived from pseudobulks of whole bone marrow without FACS purification. Single: pseudobulk profiles generated by single incubation, unmixed: pseudobulk profiles deconvolved by scChIX-seq. (h) Distribution of assignment probability estimates p in the genome for the three cell types. Vertical dotted lines represent cutoffs for p to define H3K27me3-specific and H3K4me1-specific regions. p is the expected fraction of reads that belong to H3K4me1 in a specific genomic locus. (i) Boxplot distributions of GC content for the two classes of regions. (j) Boxplot distributions of distance to TSS in the two classes of regions. Distances are measured from the center of the 5 kb locus to the nearest TSS. Boxplots show 25th percentile, median and 75th percentile, with the whiskers spanning 97% of the data.

Extended Data Fig. 6 Re-clustering on B cells reveals heterogeneity within B cells.

(a) UMAP visualization of H3K4me1 and H3K27me3 (single signal and unmixed signal), colored by cell types derived from H3K4me1 and transferred to H3K27me3. Black rectangle indicates the B cell population used to re-cluster in (b, c, d). (b) UMAP of pro-B and B cells only. (c, d) Projection of H3K4me1 signal of marker genes for pro-B (c) or for differentiated B cells (d). H3K4me1 signal is measured in all cells of the H3K4me1 UMAP (that is, both single- and double-incubated cells have H3K4me1 signal in the H3K4me1 UMAP). Double- (colored) but not single-incubated (grey) cells have H3K4me1 signal in the H3K27me3 UMAP.

Extended Data Fig. 7 H3K4me1 and H3K27me3 signal during neutrophil maturation.

(a) UMAP visualization of H3K4me1 and H3K27me3; lines join H3K4me1 and H3K27me3 UMAPs of double-incubated neutrophils. Heterogeneity within neutrophils is colored by neutrophil pseudotime. (b) H3K4me1 and H3K27me3 modification levels at the Retnlg (a mature neutrophil marker gene) locus along neutrophil pseudotime. (c) H3K4me1 and H3K27me3 modification levels at the Hoxa locus along neutrophil pseudotime. (d) UMAP of H3K27me3 signal across single cells colored by weights of a topic containing high H3K27me3 levels at many Hox and developmental gene loci (Hox topic). (e) Topic weights of the top 150 genes associated with loci in the Hox topic for H3K27me3. (f) Neutrophil mRNA abundance of genes in the Hox topic compared to other genes derived from publicly available scRNA-seq data (ref. 25). Number of genes per boxplot: n=17986 for All Genes, n=127 for genes in the Hox topic. Boxplots show 25th percentile, median and 75th percentile, with the whiskers spanning 97% of the data.

Extended Data Fig. 8 Cell typing the mouse organogenesis dataset with H3K36me3 signal at marker genes.

(a) Histogram of unique fragment cuts per cell. (b) Histogram of fraction of unique fragments starting with a “TA” motif. (c-l) Genome browser plots of cell type-specific H3K36me3 loci showing pseudobulk CPM signals (colored lines, top) and cut locations of individual cells (bottom, black marks). Cells are ordered by cell type (color-coded on the left).

Extended Data Fig. 9 H3K9me3-specific regions across cell types.

(a) Heatmap of H3K36me3 signal for the top 250 H3K36me3-specific loci (rows) across cell types (columns). (b) Heatmap of mRNA abundances for the genes associated with the H3K36me3-specific loci in (a) across pseudobulks. Data processed from publicly available scRNA-seq data from Cao et al. (ref. 42). (c) Heatmap of H3K9me3 signal for the same top 250 H3K36me3-specific loci as in (a). The H3K36me3 and H3K9me3 heatmaps are mean-centered and scaled using a common mean and standard deviation calculated across both marks. (d) Heatmap of H3K9me3 signal across pseudobulks at H3K9me3-variable loci. (e) Relative mRNA abundances (ref. 42) at n=364 genes associated with erythroblast-repressed loci across nine cell types. (f) mRNA abundance of an erythroblast-repressed gene, Nell2, across pseudobulks. (g) Genome browser plot around the Nell2 locus, an erythroblast-specific region for H3K9me3. Top of plot is pseudobulk H3K9me3 CPM signals, below are cut locations of individual cells (black marks). Cells are ordered by cell type (color-coded as in heatmaps). (h, i) Total unique fragments across cell types for single-incubated cells for H3K36me3 (h) and H3K9me3 (i), showing that the variability of the number of cuts across cells can span orders of magnitude. Number of single-incubated H3K36me3 cells for each boxplot: n=154 erythroid, n=36 white blood cells, n=60 endothelial, n=250 neural tube progenitors, n=272 neurons, n=58 Schwann cell precursors, n=154 epithelial, n=570 mesenchymal progenitors, n=160 cardiomyocytes. For H3K9me3: n=207 erythroid, n=26 white blood cells, n=736 non-blood cell types. Boxplots in (e), (h), (i) show 25th percentile, median and 75th percentile, with the whiskers spanning 97% of the data.
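
The joint scaling described for panels (a) and (c) can be illustrated with a short Python sketch. It assumes each heatmap is a list of rows (loci) by columns (cell types) and uses the sample standard deviation, which is an assumption because the caption does not say which estimator was used. The function name `scale_with_common_stats` and the toy matrices are hypothetical.

```python
import statistics


def scale_with_common_stats(mat_a, mat_b):
    """Center and scale two matrices with one mean and one SD pooled over both."""
    pooled = [v for row in mat_a for v in row] + [v for row in mat_b for v in row]
    mu = statistics.mean(pooled)
    sd = statistics.stdev(pooled)  # sample SD: an assumption, not stated in the caption
    if sd == 0:
        raise ValueError("pooled values are constant; cannot scale")

    def rescale(mat):
        return [[(v - mu) / sd for v in row] for row in mat]

    return rescale(mat_a), rescale(mat_b)


# Hypothetical toy heatmaps (rows = loci, columns = cell types).
h3k36me3 = [[1.0, 2.0], [3.0, 4.0]]
h3k9me3 = [[5.0, 6.0], [7.0, 8.0]]
scaled_a, scaled_b = scale_with_common_stats(h3k36me3, h3k9me3)
print(scaled_a)
print(scaled_b)
```

Because both matrices share one mean and one standard deviation, a given color intensity corresponds to the same underlying signal level in either heatmap, which would not hold if each mark were scaled on its own.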

Extended Data Fig. 10 Distinct dynamics of H3K4me1 and H3K36me3 during macrophage in vitro differentiation.

(a) Density plots of total number of cuts across cells for H3K4me1, H3K36me3, and H3K4me1+H3K36me3 labeled cells. (b) Density plots of fraction of cuts starting with a TA motif across cells for H3K4me1, H3K36me3, and H3K4me1+H3K36me3 labeled cells. (c) Genome browser plot around the gene body of Mertk, a macrophage-specific gene. Tracks are bigwigs from pseudobulks averaged across the time course. (d) Log2 fold change estimates along pseudotime on gene bodies in the genome. Colored dots are considered significant (log2 fold change in H3K36me3 > 3.5, z-score in both H3K36me3 and H3K4me1 > 2) and used for chromatin velocity estimates. (e, f) UMAP of H3K4me1 (e) and H3K36me3 (f) of single-incubated and deconvolved cells showing intermingling of the two types of cells. (g) Examples of H3K4me1 and H3K36me3 for an upregulated (above) and downregulated (below) gene along pseudotime. (h) Histogram of estimates of the rate constant γ for the 209 dynamic genes highlighted in (d).

About this article

Cite this article

Yeung, J., Florescu, M., Zeller, P. et al. scChIX-seq infers dynamic relationships between histone modifications in single cells.
Nat Biotechnol (2023). https://doi.org/10.1038/s41587-022-01560-3

  • DOI: https://doi.org/10.1038/s41587-022-01560-3

Jake Yeung

10 essential Mac trackpad gestures you need to know

MacBook Pro Trackpad

Apple

Tamara Palmer is a freelance writer, recipe developer, zine publisher and professional DJ based in San Francisco.

    Tyisha Roberie

    File Explorer tabs are finally in Windows! Here’s how to use them to simplify your life

    Windows 11 Tabbed File Explorer

    Mark Hachman / IDG

    , Senior Editor

    As PCWorld’s senior editor, Mark focuses on Microsoft news and chip technology, among other beats. He has formerly written for PCMag, BYTE, Slashdot, eWEEK, and ReadWrite.

    Leigha Buresh

    Keychron K3 Pro review: The thin keyboard to beat

    Keychron K3 Pro keyboard

    Michael Crider/IDG

    , Staff Writer

    Michael is a former graphic designer who’s been building and tweaking desktop computers for longer than he cares to admit. His interests include folk music, football, science fiction, and salsa verde, in no particular order.

    Georgianna Fleishman

    This year, learn Python while this bundle is just $30

    premium python

    StackCommerce

    Maribel Pekar