What Is Mezcal?

January 3, 2023

Like wine or beer, mezcal is a blanket term. In its broadest definition, it’s any distilled spirit made from the agave plant, which is known in Mexico as a maguey. Mezcal is meant to be sipped neat in small cups, but in the hands of a trustworthy bartender, a mezcal cocktail is a winning situation. Today, we’re digging into how it gets made.

The Agave Plant

Agave is a succulent native to Mexico. There are roughly 30 different varieties of agave that can be used to make mezcal. They grow in the wild and on farms, and take 7 to 20 years to mature. The best-known maguey is the blue weber agave, because it’s used to make tequila. You read that right: tequila’s a type of mezcal. Excluding tequila, the majority of mezcal is made from the espadín agave plant because it’s high in sugar and matures quickly.

Mezcal Production

Mezcal’s distilled from the heart of the agave plant. The piña, as it’s known in Spanish, looks like an oversized pineapple. It can weigh up to 300 pounds, and harvesting and transporting it is hard physical labor. Once it’s pulled from the earth, the leaves are removed and the piña gets roasted to release its natural sugars.

Fermentation occurs when cooked agaves are mashed to a pulp and combined with water and yeast. After sitting for days, those sugars turn into alcohol. The liquid is then run through a still at least twice to refine it into a drinkable spirit.

There are industrial mezcal producers (often companies owned by Americans), but most mezcals are made in rural areas by Mexican families who don’t have access to expensive machinery. Farming, harvesting, and distillation processes have been passed down for generations and distillers, known as mezcaleros, have found creative ways to fabricate the equipment they need for mezcal production. A hole in the ground can be an oven, if you put your back into it. Need portable fermentation tanks? Rawhide will do. Want a mezcal with a fuller mouthfeel? Make your still from clay pots. Where there’s a will, there’s a way.

Types of Mezcal

Not all mezcals have a smoky flavor. The smokiness comes from the traditional method of roasting magueys in a pit covered with firewood and rocks. However, some mezcals taste bright and delicate, while others are herbaceous and viscous. Flavor varies based on the varietal, age, and terroir of the agave plant, as well as the water and production process.

The most produced type of mezcal is the ensamble. This is when a mezcalero takes different species of agave and roasts, ferments, and distills them together to balance flavors. There are also mezcals made from a single type of agave, and blends that mix mezcals after distillation.

Then there’s pechuga. Loosely translated, this means “breast.” The term refers to the animal breast, usually chicken, that’s hung over the still to impart savory flavors. Pechuga is made in small batches for special occasions like festivals and weddings. If you can find one, it’ll cost you a pretty peso.

Once distilled, mezcal’s stored in food-grade plastic or glass, and ultimately bottled joven (unaged). When it comes to alcohol content, the range will be 40–55% ABV, a little stronger than your average vodka or scotch.

Reposado and añejo mezcals aren’t as common because barrel-aging imparts flavors that detract from the natural taste of the mezcal. Besides, outside of the state of Jalisco where tequila’s made, oak barrels aren’t readily available in Mexico. For the most part, only tequila gets that treatment.

Regulations and the Law

This wouldn’t be a proper article about booze if we didn’t address liquor laws and the quasi-illegal shenanigans surrounding their enforcement. The Mexican government relies on four “independent” companies to regulate mezcal. The Consejo Regulador del Mezcal (CRM) is the biggest and most influential.

Many traditional mezcaleros ignore the CRM’s regulations because they view it as a pay-to-play system, which they either can’t afford or don’t want to endorse. Additionally, the CRM limits mezcal’s Denominación de Origen (DO) to 10 of the 32 Mexican states. Oaxaca is probably the one you’ve heard of, but there’s also Durango, Guanajuato, Guerrero, Michoacán, Puebla, Tamaulipas, San Luis Potosí, Sinaloa, and Zacatecas. The result is that some of the most authentic, family-owned distilleries make “agave spirits” because their mezcal can’t legally be called mezcal.

Shopping Tips

My three rules when shopping for mezcal: (1) Celebrity brands overcharge and underdeliver; (2) glass is hard to come by in rural Mexico, so elaborate bottles aren’t good mezcal, they’re good marketing; (3) the label should be transparent about who, what, where, when, and how it was made.

Identification of patient-specific CD4⁺ and CD8⁺ T cell neoantigens through HLA-unbiased genetic screens

Science & Nature

newsycanuse

January 3, 2023

Main

Cancer immunotherapies that aim to harness the antitumor activity of T cells have shown impressive clinical results in a subset of patients with cancer, and accumulating evidence suggests that the efficacy of these therapies is driven largely by T cells that recognize cancer neoantigens that result from patient-specific nonsynonymous tumor mutations¹. Consequently, there is a strong interest in developing approaches to specifically boost the number or activity of neoantigen-reactive T cells in individual patients. However, identification of T cell-recognized neoantigens is challenging due to their patient-specific nature². Previous antigen discovery methods have been limited by relying on the use of single or selected HLA alleles^3,4,5,6,7 and are therefore not straightforwardly compatible with identifying T cell (neo)antigens across the complete HLA haplotypes of individual patients with cancer. Moreover, while CD4⁺ T cells have important roles in tumor control and response to immunotherapy^8,9,10,11, previous methods have focused primarily on the identification of CD8⁺ T cell-recognized neoantigens. Thus, experimental tools are required to enable the routine and HLA-unbiased identification of CD4⁺ and CD8⁺ T cell-recognized neoantigens in individual patients.

Here, we present a high-throughput genetic system for the personalized identification of CD4⁺ and CD8⁺ T cell-recognized (neo)antigens (Fig. 1a). In this method, termed HANSolo (HLA-Agnostic Neoantigen Screening), patient-matched, Bcl-6/xL-immortalized B cell lines are engineered to express large libraries of minigenes that encode candidate T cell antigens. As the resulting B cells are fully MHC class I and class II proficient, this enables the unbiased screening of T cell specificities across the complete MHC class I and class II genotypes of individual patients using T cell pools as selective pressure. To this purpose, antigen library-expressing B cells are coincubated with patient T cell populations of interest (for example, tumor-infiltrating lymphocytes (TIL) or T cells engineered to express patient-derived T cell receptors (TCRs)¹²), and antigen hits are identified by next-generation sequencing to measure the depletion of those B cells that express T cell-recognized epitopes.

Fig. 1: Overview and validation of neoantigen discovery technology.

To first evaluate the feasibility and sensitivity of our method, we took advantage of well-described HLA-A*02:01-restricted TCRs specific for either the CDK4_R24L neoantigen (TCR #53)¹³ or for the melanocyte differentiation antigen-derived MART1_26-35 epitope (TCRs DMF4 and DMF5)¹⁴ (Supplementary Fig. 1). Activity of the CDK4_R24L neoantigen-specific TCR should result in strong depletion of B cells expressing the mutant, but not the wild-type (WT), CDK4 sequence. TCR DMF4 has an affinity towards the MART1 self antigen that is around fivefold lower as compared with the DMF5 TCR^14,15, providing a means to assess the sensitivity of the method in the context of weak T cell–target cell interactions. Furthermore, the use of the parental MART1 epitope as well as a previously identified variant with increased affinity for MHC-I¹⁶ (here referred to as MART1-ELA) should allow one to determine whether the level of epitope presentation can be gauged from screening data. To provide first proof-of-concept, we designed a model antigen library with a complexity (4,764 minigenes) that would be sufficient to enable the screening of the entire mutational repertoire of human tumors with the highest mutational burden, such as melanomas, lung tumors and microsatellite-instable tumors¹⁷. Individual MHC class I-restricted antigens, including the CDK4_R24L and MART1 antigens and immunodominant epitopes of EBV, CMV and influenza, as well as MHC class II-restricted neoantigens (Supplementary Table 1) were expressed as minigenes, each coupled to two unique barcode identifiers to provide internal replicate measurements. Subsequently, HLA-A*02:01-positive immortalized B cells were created and modified to express the epitope library.

Following optimization of conditions to ensure maximal sensitivity of antigen screens (Supplementary Fig. 2), screening of this proof-of-concept library with T cells expressing the CDK4_R24L-specific TCR resulted in clear depletion of CDK4_R24L-expressing B cells, but crucially not B cells expressing the WT CDK4 minigene (Fig. 1b). Furthermore, B cells expressing the MART1-ELA epitope showed substantial depletion after exposure to T cells transduced with the MART1-specific DMF4 or DMF5 TCRs (Fig. 1c). Notably, the level of depletion mediated by the low affinity DMF4 TCR was comparable with that of the DMF5 TCR. Moreover, when using the high affinity 1D3 TCR¹⁸, depletion was observed for both MART1 epitopes but was substantially stronger for the MART1-ELA epitope (Supplementary Figs. 1 and 3). Next, to test whether this system allows the profiling of the antigen-specificities of T cell populations in which T cells specific for a given antigen make up only a minority of the total T cell pool (such as patient TIL cultures, or donor T cells expressing libraries of patient-derived TCRs), we mixed T cells expressing either the DMF4, DMF5 or 1D3 TCR with mock-transduced T cells, such that MART1-specific T cells represented 10%, 1%, 0.3% or 0.1% of total T cells. Analysis of epitope abundance after exposure to these different T cell populations demonstrated that the MART1-ELA epitope was robustly identified when cognate TCR-expressing T cells comprised as little as 0.1–0.3% of all T cells (Supplementary Fig. 3). Depletion of the native MART1 epitope was detected only when using the high affinity 1D3 TCR. Together, these data demonstrate that our genetic screening methodology allows the efficient discovery of MHC class I-restricted T cell (neo)antigens from large antigen libraries. Furthermore, the technology allows one to distinguish high and low avidity TCR-pMHC interactions and genetic screens may be performed with clonally diverse T cell populations.
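The depletion readout described above can be illustrated with a small calculation. The sketch below is not the authors' analysis pipeline, which this text does not describe. It assumes hypothetical barcode read counts, normalizes each sample to its total reads, and reports a mean log2 fold change per minigene across its two barcodes. The function name, the pseudocount and all counts are illustrative assumptions.

```python
import math
from collections import defaultdict

def log2_fold_changes(control, treated, barcode_to_minigene, pseudocount=1.0):
    """Return mean log2(treated/control) per minigene from barcode read counts.

    Counts are scaled to reads per million within each sample so that
    differences in sequencing depth are not mistaken for depletion.
    """
    control_total = sum(control.values())
    treated_total = sum(treated.values())
    per_minigene = defaultdict(list)
    for barcode, minigene in barcode_to_minigene.items():
        c = control.get(barcode, 0) / control_total * 1e6 + pseudocount
        t = treated.get(barcode, 0) / treated_total * 1e6 + pseudocount
        per_minigene[minigene].append(math.log2(t / c))
    # Average the replicate barcodes that encode the same minigene.
    return {m: sum(v) / len(v) for m, v in per_minigene.items()}

# Hypothetical counts: two barcodes per minigene, as in the model library.
barcode_to_minigene = {
    "bc1": "CDK4_R24L", "bc2": "CDK4_R24L",
    "bc3": "CDK4_WT", "bc4": "CDK4_WT",
}
control = {"bc1": 500, "bc2": 500, "bc3": 500, "bc4": 500}  # no T cells
treated = {"bc1": 50, "bc2": 60, "bc3": 450, "bc4": 440}    # with T cells

scores = log2_fold_changes(control, treated, barcode_to_minigene)
for minigene, score in sorted(scores.items()):
    print(f"{minigene}: {score:.2f}")
```

In this toy example the mutant minigene scores strongly negative while the wild-type minigene does not. With only four barcodes, normalizing to total reads inflates the wild-type score slightly; in a library of thousands of mostly unrecognized minigenes that effect would be small.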

A substantial fraction of T cell-recognized cancer neoantigens is restricted by MHC class II molecules, and CD4⁺ T cells recognizing such MHC class II-restricted neoantigens contribute to tumor control^8,9,10,11. To test the suitability of HANSolo for the discovery of MHC class II-restricted neoantigens, we explored a previously established engineering method that routes individual minigene products through both the MHC class I and class II presentation pathways. In line with expectations, fusion of neoantigen-encoding minigenes to the sorting signal of the invariant chain (CD74) resulted in robust activation of both CD4⁺ and CD8⁺ neoantigen-specific T cells (Supplementary Fig. 4), and this universal antigen expression system was therefore selected for further use. We next took advantage of two MHC class II-restricted neoantigen-specific TCRs that were isolated from tumor-infiltrating T cells of a melanoma patient (Supplementary Fig. 4), transduced both TCRs into donor CD4⁺ T cells and expressed the model antigen library in patient-matched immortalized B cells. Screening of library-expressing B cells with T cells expressing either MHC class II-restricted TCR resulted in the notable depletion of B cells that expressed the cognate neoantigen, but not its WT counterpart (Fig. 1d). Furthermore, the use of CD4⁺ T cell populations in which T cells expressing either of the MHC-II-restricted neoantigen-specific TCRs were present at low frequency demonstrated clear depletion of the relevant neoantigens at antigen-specific CD4⁺ T cell frequencies as low as 0.3–1% (Supplementary Fig. 5).

As compared with previously developed genetic screening technologies, HANSolo has the advantage of allowing the identification of T cell epitopes restricted by any of the class I or II alleles of an individual patient. To demonstrate the utility of such unbiased screening, we first focused on analysis of neoantigen reactivity among intratumoral T cells in a patient with metastatic melanoma (patient NKIRTIL063). CD4⁺ and CD8⁺ T cell cultures were generated by in vitro expansion of TIL, and both resulting T cell populations possessed cytotoxic potential, as measured by degranulation potential upon polyclonal stimulation (Supplementary Fig. 6). In parallel, nonsynonymous mutations in protein-coding genes were identified by exome and RNA sequencing, yielding 685 nonsynonymous expressed tumor variants, and a library of 2,762 minigenes that encoded all identified tumor mutations, as well as their corresponding WT sequences, was generated and expressed in autologous immortalized B cells. Screening of this patient mutanome library with in vitro-expanded CD8⁺ TIL revealed TIL reactivity towards four neoantigens (Fig. 2a). Importantly, no reactivity against the corresponding WT minigenes in the library was detected. Furthermore, screening the same neoantigen library with CD4⁺ TIL yielded reactivity against six neoantigens (Fig. 2b). Both minigenes encoding the tumor variant MYLK_D>N showed reproducible low-level depletion after coculture with CD4⁺ TIL, and this variant was therefore considered a putative screen hit. Recognition of screen-identified neoantigens, but not WT counterpart sequences, was subsequently validated upon expression of the individual sequences in patient B cells, resulting in confirmed CD4⁺ and CD8⁺ TIL reactivity towards 10 out of 11 identified screen hits (Fig. 2c,d and Supplementary Fig. 6). Notably, three neoantigens—GFPT2_A>V, TNFAIP2_P>A and CCSER2_P>L—were recognized by both CD8⁺ and CD4⁺ TIL of this patient.

Fig. 2: Personalized and HLA-agnostic neoantigen screening of patient-derived CD4⁺ and CD8⁺ T cells.

To assess the sensitivity of our method in comparison with other available neoantigen discovery methods, we next analyzed neoantigen reactivity among CD4⁺ and CD8⁺ TIL of patient NKIRTIL063 using the previously established tandem minigene (TMG) approach^4,19, in which generally ten minigenes are concatenated and expressed as a single transgene in separate pools of antigen-presenting cells. To screen neoantigen specificities of patient NKIRTIL063 CD4⁺ and CD8⁺ TIL using TMGs within a reasonable timeframe, 200 of the 685 mutations were selected on the basis of expression level and mutation clonality and used to generate 20 pools of patient B cells. Incubation of these cell pools with either CD4⁺ or CD8⁺ TIL of patient NKIRTIL063 revealed notable reactivity of CD4⁺ TIL to three TMGs (#6, #9 and #13) and reactivity of CD8⁺ TIL to four TMGs (#8, #11, #15 and #16) (Fig. 2e). Recognition of six out of seven TMGs was mediated by neoantigens identified before using the genetic library screens, as demonstrated by a subsequent deconvolution step (Supplementary Fig. 6). TMG#9 did not encode a neoantigen hit from our screens but did elicit low-level reactivity of CD4⁺ TIL. Conversely, the TMG screen failed to identify four CD4⁺ TIL-recognized neoantigens that were detected using the HANSolo screens (Fig. 2f), demonstrating the potential of the method to mine patient neoantigens with increased depth compared with existing methodologies.

Next, to assess the value of the developed system for the routine discovery of neoantigens across patients with cancer, we mapped neoantigen specificities in three additional patient samples. Tumor mutations were identified in an additional melanoma tumor (NKIRTIL027; 660 nonsynonymous expressed mutations) and used to construct a patient mutanome library of 2,562 minigenes. Screening the neoantigen specificities of CD4⁺ and CD8⁺ TIL resulted in six putative CD8⁺ TIL-recognized neoantigens (Fig. 2g) and one neoantigen recognized by CD4⁺ TIL (Fig. 2h), and recognition of these epitopes was confirmed for five out of seven neoantigens (Fig. 2i–k and Supplementary Fig. 7). In addition, as observed in genetic screens using model antigens and TCRs, the level of epitope depletion in this patient screen correlated with the capacity of patient T cells to produce interferon gamma (IFNγ) in response to minigene-expressing B cells and to kill such cells (Fig. 2i,j). We next analyzed neoantigen specificities of intratumoral CD4⁺ and CD8⁺ T cells in a non-small cell lung tumor (patient ITO34; 231 mutations), resulting in the detection of CD8⁺ TIL reactivity against one neoantigen (Fig. 2l,m and Supplementary Fig. 8). Recently, strategies that enrich T cell populations for tumor-specific T cells by culture with patient tumor organoids^20,21 or antigen-expressing APCs²² have been reported. To assess whether such strategies may complement our methodology, for instance, in settings where fresh tumor material for the generation of TIL cultures is unavailable, we applied our screening method to a microsatellite-instable colorectal tumor (ITO66; 1,834 mutations). For this purpose, the patient mutanome was screened using a CD8⁺ T cell product that was generated by ex vivo culture of patient peripheral blood mononuclear cells (PBMCs) with matched tumor organoids, resulting in the identification of two CD8⁺ T cell-recognized neoantigens (Supplementary Fig. 9). Thus, the use of our screening methodology enabled the successful identification of patient neoantigens in all four tested patients.

Collectively, these data demonstrate the feasibility of personalized and HLA-agnostic discovery of CD4⁺ and CD8⁺ T cell neoantigens from large genetic libraries. Benchmarking against the existing TMG method demonstrated enhanced sensitivity of our approach, in particular for the discovery of CD4⁺ T cell-recognized neoantigens, while enabling substantially improved throughput. From a translational perspective, identified neoantigens may be used to select TCRs for use in next-generation TCR gene therapies or may be utilized in patient-specific cancer vaccines^22,23,24,25,26. Of note, state-of-the-art algorithms that predict the immunogenicity of tumor mutations for use in personalized neoantigen vaccines ranked only 3 out of all 14 identified patient neoantigens as actionable vaccination targets (Supplementary Table 2), underlining the value of approaches that allow the unbiased and functional identification of patient neoantigens. With the current next-generation sequencing and DNA synthesis technologies and dedicated screening workflows, our system enables patient neoantigen discovery within 10 weeks (Supplementary Fig. 10), a timespan that is compatible with the production of personalized immunotherapies²⁴.

Methods

Antibodies

The following antibodies were used for flow cytometry: CD3-PerCP-Cy5.5 (clone SK7; eBioscience; used 1:20), CD4-FITC (clone RPA-T4; BD Biosciences; used 1:20), CD4-APC (clone RPA-T4; BD Biosciences; used 1:30), CD4-BV421 (clone SK3; Biolegend; used 1:100), CD8-BV421 (clone RPA-T8; BD Biosciences; used 1:50), CD14-APC-H7 (clone MφP9; BD Biosciences; used 1:100), CD16-APC-H7 (clone 3G8; BD Biosciences; used 1:100), CD19-FITC (clone 4G7; BD Biosciences; used 1:30), CD137-BV421 (clone 4B4-1; Biolegend; used 1:200), CD137-APC (clone 4B4-1; BD Biosciences; used 1:30), OX40-PE-Cy7 (clone Ber-ACT35; Biolegend), CD107-PE (clone H4A3; BD Biosciences; used 1:150) and PE-conjugated anti-mouse TCRβ constant domain (clone H57-597; BD Biosciences; used 1:150). The viability stain IR-Dye (Thermo Fisher; used 1:2,000) was used to identify live cells.

Generation of patient T cell products, Bcl-6/Bcl-xL-immortalized B cells and tumor organoids

Tumor tissue and PBMCs were collected from patients treated at the Netherlands Cancer Institute—Antoni van Leeuwenhoek Hospital (NKI-AVL) with written informed consent and in accordance with guidelines of the Medical Ethical Committee. The study protocol was approved by the Medical Ethical Committee of the NKI-AVL. Fresh tumor tissue obtained by surgical resection was mechanically disrupted and digested overnight in RPMI 1640 medium (Life Technologies) supplemented with 1 mg ml⁻¹ collagenase type IV (BD Biosciences), penicillin-streptomycin (Roche) and 0.01 mg ml⁻¹ pulmozyme (Roche).

For patients NKIRTIL027, NKIRTIL063 and ITO34, TIL cultures were generated by culturing tumor digest suspensions in T cell medium (RPMI 1640 medium supplemented with 10% human AB serum (Life Technologies), penicillin-streptomycin and L-glutamine (Life Technologies)), supplemented with 6,000 U ml⁻¹ IL-2 (Proleukin, Novartis) for 2–4 weeks. Obtained TIL cultures were subsequently stained with IR-Dye and antibodies against CD3, CD4 and CD8, and single CD3⁺CD4⁺ and CD3⁺CD8⁺ T cells were sorted using a FACSAria Fusion cell sorter (BD Biosciences). Isolated CD4⁺ and CD8⁺ T cells were expanded using the rapid expansion protocol (REP), using 30 ng ml⁻¹ anti-CD3 antibody (clone OKT-3; eBioscience) and 3,000 U ml⁻¹ IL-2 in a 1:1 mixture of RPMI 1640 and AIM-V medium (Gibco) supplemented with 5% human AB serum, in the presence of irradiated (40 Gy) allogeneic PBMCs (200:1 feeder/T cell ratio). After 7 days of REP culture, medium and IL-2 were refreshed every 2 days. Purity of the resultant CD4⁺ and CD8⁺ T cell populations was confirmed by flow cytometry at day 14 after start of REP (routinely >99%), and cells were subsequently either used directly in antigen discovery screens or cryopreserved in liquid nitrogen. Data from flow cytometry experiments were acquired using FACSDiva software and analyzed using FlowJo (BD Biosciences).

Immortalized patient B cell lines were generated by retroviral transduction with Bcl-6/Bcl-xL²⁷. Patient PBMCs were isolated from peripheral blood by Ficoll-Paque density gradient separation and stained with IR-Dye and antibodies against CD3, CD14, CD16 and CD19. Single IR-Dye⁻CD3⁻CD14⁻CD16⁻CD19⁺ cells were sorted using a FACSAria Fusion cell sorter and stimulated for 36 h with irradiated (55 Gy) CD40L⁺ mouse L cells in B cell medium (IMDM medium (Gibco) supplemented with penicillin-streptomycin, 10% heat-inactivated fetal bovine serum (Sigma-Aldrich) and 50 ng ml⁻¹ IL-21 (Peprotech)), followed by retroviral transduction of Bcl-6 and Bcl-xL. The Bcl-6/Bcl-xL-encoding vector also encodes GFP to allow evaluation of transduction efficiency. Bcl-6/Bcl-xL-immortalized (GFP⁺) B cells were cultured in B cell medium and were stimulated every week by addition of irradiated CD40L⁺ L cells. Medium and IL-21 were refreshed every 3–4 days.

For patient ITO66, tumor organoids were established^20,21. Tumor-reactive patient T cells were generated by coculturing PBMCs and tumor organoids as follows. Following incubation with 200 ng ml⁻¹ IFNγ (Peprotech) for 24 h, tumor organoids were dissociated into single-cell suspensions using TrypLE Express (Gibco). Tumor organoid cells were mixed with patient PBMCs (20:1 PBMC/tumor cell ratio) and 1 × 10⁵ PBMCs were seeded in each well of a U-bottom 96-well plate precoated with 5 μg ml⁻¹ anti-CD28 antibody (clone CD28.2; eBioscience). Coculture medium consisted of T cell medium supplemented with 150 U ml⁻¹ IL-2 and 20 µg ml⁻¹ anti-PD1 blocking antibody (clone 5C4; kindly provided by Merus). Coculture medium was refreshed every 2–3 days. PBMCs were harvested and restimulated every 7 days by replating with fresh tumor organoid cells.

Retroviral transduction of TCRs

Codon-optimized TCR α and β variable sequences (encompassing V-CDR3-J domains) of selected TCRs were gene-synthesized (Twist Biosciences) and subcloned into a modified pMP71 retroviral vector¹². This vector contains mouse TCR constant regions to reduce mispairing of introduced and endogenous TCR chains, as well as the puromycin N-acetyltransferase resistance gene. Retrovirus was produced by transfecting FLY-RD18 packaging cells with pMP71-TCR plasmid DNA using X-tremeGENE 9 transfection reagent (Roche). In parallel, healthy donor PBMCs (Sanquin Blood Bank) were separated into CD8⁺ and CD8⁻ cells (for transduction with MHC class I- and MHC class II-restricted TCRs, respectively) using the CD8⁺ T Cell Isolation Kit (Miltenyi Biotec). Isolated cell fractions were stimulated with CD3/CD28 Dynabeads (Life Technologies) in T cell medium with 150 U ml⁻¹ IL-2. After 48 h, retroviral supernatants were collected and used to infect the prestimulated CD8⁺ or CD8⁻ PBMCs by spinoculation (2,000g for 90 min) in Retronectin (Takara)-coated plates. Transduction efficiency was measured 72 h later by staining with an anti-mouse TCRβ constant domain antibody and analysis by flow cytometry. TCR-transduced T cells were then selected with 2.5 µg ml⁻¹ puromycin (Gibco) for 48 h and received fresh medium and IL-2 every 3–4 days. After 12–14 days of culture, transduced T cells were expanded using the REP as described above.

T cell activation assays

Reactivity of TCR-transduced donor T cells was determined by coincubating T cells and target cells for 18–24 h in U-bottom 96-well plates (1:1 T cell/target cell ratio) in T cell medium. Incubation of T cells without target cells, and in the presence of 50 ng ml⁻¹ phorbol 12-myristate 13-acetate (Sigma-Aldrich) and 1 µg ml⁻¹ ionomycin (Sigma-Aldrich) served as negative and positive controls, respectively. Following incubation, cells were stained with IR-Dye and antibodies against CD3, CD4, CD8 and the activation markers CD137 or OX40 and analyzed by flow cytometry. When T cell reactivity towards tumor organoids was tested, IFNγ-pretreated organoids were incubated with T cells in the presence of 20 µg ml⁻¹ anti-PD1 blocking antibody (Merus) in anti-CD28 antibody precoated plates.

The cytotoxic capacity of T cells was assessed by coincubating T cells and target cells for 72 h in 96-well plates at a T cell/target cell ratio of 5:1, unless indicated otherwise. Target cells cultured in the absence of T cells served as negative control. Following incubation, 7.46 µm AccuCount blank counting beads (Spherotech) were added to individual cultures to enable quantification of remaining live target cells. Cells were subsequently harvested, stained with 4′,6-diamidino-2-phenylindole (DAPI) and anti-CD3 antibody, and measured by flow cytometry. When cytotoxicity against tumor organoids was assessed, IFNγ-pretreated organoids were incubated with T cells in the presence of 20 µg ml⁻¹ anti-PD1 blocking antibody (Merus) and 10 µM Y-27632 in anti-CD28 antibody precoated 96-well plates. Where indicated, target cells were incubated with 50 µg ml⁻¹ MHC class I blocking antibody (clone W6/32) for 30 min at 37 °C before incubation with T cells. Data from functional T cell assays were analyzed using GraphPad Prism v.9.

Exome and RNA sequencing

Tumor genomic DNA and RNA were extracted from formalin-fixed, paraffin-embedded tumor material using the AllPrep DNA/RNA kit (Qiagen). For patient ITO66, genomic DNA and RNA were isolated from tumor organoids. Genomic DNA of patient PBMCs was extracted using the DNeasy Blood & Tissue kit (Qiagen). Exome enrichment was performed using the SureSelect XT2 Human All Exon V6 kit (Agilent) and strand-specific libraries were generated using the TruSeq Stranded mRNA sample preparation kit (Illumina) according to the manufacturer’s instructions. Resulting libraries were sequenced on HiSeq 2500 or NovaSeq 6000 DNA analyzers (Illumina). Whole-exome and RNA sequencing data were processed using bcbio-nextgen. Briefly, DNA reads were mapped against GRCh38 using Burrows–Wheeler aligner (BWA), duplicates were marked with Picard MarkDuplicates and low complexity regions were excluded. Somatic and germline mutations were identified using Mutect2 and HaplotypeCaller, respectively, followed by annotation by SnpSift. RNA reads were quality filtered and mapped with STAR or TopHat2, transcript-level expression was quantified by Salmon and gene fusions were determined by Arriba^12,21.

Antigen library design

To design the model antigen library used to validate the screening system, protein sequences of genes encoding known human nonmutated cancer regression antigens, as well as selected immunodominant epitope-encoding genes of Epstein-Barr virus, cytomegalovirus and influenza, were collected from the UniProt database (https://www.uniprot.org/) (Supplementary Table 1). Protein sequences were reverse-translated and codon-optimized, and resulting nucleotide sequences were segmented into 93-nucleotide (nt) minigenes with 45 nt overlap between neighboring minigenes. In addition, a set of previously characterized neoantigens was included, all encoded by 93 nt minigenes in which the mutant codon was flanked on either side by 45 nt of the relevant nonmutant gene sequence. Minigene sequences encoding the corresponding nonmutated peptides were included for each model neoantigen. A stop codon was added directly following each minigene sequence, and internal BbsI recognition sites were removed without altering the encoded peptide sequences. Each 93 nt sequence was included twice, for a total of 4,764 sequences, and a unique 12 nt barcode sequence was incorporated into each of these sequences following the stop codon. The resulting sequences were flanked by sequences to enable PCR amplification and subcloning using BbsI (New England Biolabs) into a pMSCV retroviral vector that also encodes the puromycin N-acetyltransferase resistance gene and mCherry (pMSCV-puroR-mCherry).
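The tiling and barcoding scheme described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' design code: the handling of the 3' end, the choice of TGA as the stop codon, the base-4 barcode scheme and the toy input sequence are all hypothetical.

```python
def tile_minigenes(seq, length=93, overlap=45):
    """Split an in-frame coding sequence into overlapping minigenes.

    Consecutive windows start `length - overlap` nt apart (48 nt for the
    93/45 design), a multiple of 3, so every window stays in frame.
    A final window is anchored to the 3' end so the tail is covered;
    that end handling is an assumption, not taken from the source.
    """
    step = length - overlap
    if len(seq) <= length:
        return [seq]
    starts = list(range(0, len(seq) - length + 1, step))
    last = len(seq) - length
    if starts[-1] != last:
        starts.append(last)
    return [seq[s:s + length] for s in starts]

def index_to_barcode(i, length=12):
    """Encode an integer as a base-4 DNA string (placeholder barcode scheme)."""
    bases = "ACGT"
    out = []
    for _ in range(length):
        i, r = divmod(i, 4)
        out.append(bases[r])
    return "".join(reversed(out))

def build_constructs(minigenes, copies=2, stop="TGA"):
    """Append a stop codon and a unique barcode to each copy of each minigene."""
    constructs = []
    for mg in minigenes:
        for _ in range(copies):
            barcode = index_to_barcode(len(constructs))
            constructs.append(mg + stop + barcode)
    return constructs

# Toy 300 nt coding sequence (hypothetical, not a real antigen).
seq = "ATG" + "GCT" * 99
tiles = tile_minigenes(seq)
constructs = build_constructs(tiles)
print(len(tiles), len(constructs), len(constructs[0]))
```

A real design would also draw barcodes that differ at several positions, so that sequencing errors cannot convert one barcode into another; the sequential scheme here only guarantees uniqueness.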

To design NKIRTIL027 and NKIRTIL063 patient mutanome libraries, all single nucleotide variants (SNVs) and frameshifting indels with confirmed RNA expression within tumor cells were encoded as 93 nt minigenes. RNA sequencing data of tumor ITO34 were unavailable, and the library for this patient was designed without taking RNA expression of tumor variants into account. For SNVs, minigenes were designed that encoded peptides in which the mutant codon was flanked on either side by 45 nt of the relevant nonmutant gene sequence. In the case of frameshifting indels, or when SNVs resulted in loss of a stop codon, the newly formed open reading frame was segmented into 93 nt minigenes with 45 nt overlap between adjacent minigenes. Minigenes encoding corresponding WT sequences were included for all tumor variant minigenes. Minigenes encoding the MART1_26–35 and CDK4_R24L epitopes were included in all libraries as internal controls. Internal BbsI recognition sites were removed without altering encoded peptide sequences, and minigenes were flanked by sequences for PCR amplification and subcloning as described above. For patient ITO66, the mutanome library was designed to encode tumor variants as 63 nt minigenes, and no corresponding WT minigenes were included. All minigene libraries were synthesized by Twist Biosciences.
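For SNVs, the layout is 45 nt of flanking sequence, the 3 nt mutant codon, and another 45 nt of flank, giving 93 nt. The sketch below builds a mutant minigene and its WT counterpart from a coding sequence. It is a minimal illustration: the function name, the toy sequence and the decision to reject variants near the ends of the coding sequence are assumptions, since the text does not say how such cases were handled.

```python
def snv_minigenes(cds, codon_index, mutant_codon, flank=45):
    """Return (mutant, wild_type) minigenes centered on one codon.

    `cds` is an in-frame coding sequence and `codon_index` is the 0-based
    codon to replace. Variants closer than `flank` nt to either end are
    rejected here (an assumption made for simplicity).
    """
    start = codon_index * 3
    left = start - flank
    right = start + 3 + flank
    if left < 0 or right > len(cds):
        raise ValueError("variant too close to the end of the coding sequence")
    wild_type = cds[left:right]
    mutant = cds[left:start] + mutant_codon + cds[start + 3:right]
    return mutant, wild_type

# Toy coding sequence (hypothetical): ATG followed by 40 alanine codons.
cds = "ATG" + "GCC" * 40
# Replace codon 20 (GCC, Ala) with GTC (Val), i.e. an A>V substitution.
mutant, wild_type = snv_minigenes(cds, codon_index=20, mutant_codon="GTC")
print(len(mutant), mutant[45:48], wild_type[45:48])
```

Calling the same function with `flank=30` gives the 63 nt layout used for patient ITO66.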

Generation of a universal antigen expression vector

To establish a library expression system that enables the concurrent processing and presentation of minigene products through both the MHC class I and class II pathways, constructs were designed in which a TMG encoding two previously identified neoantigens recognized by either CD4⁺ or CD8⁺ TIL of patient NKIRTIL027 (LEMD2_P>L (ref. ²⁸) and TTC37_A>V (unpublished data), respectively) was either fused or not fused to the signal sequence of CD74 (Supplementary Fig. 4)²⁹. Codon-optimized constructs were synthesized (Twist Biosciences) and subcloned into the retroviral pMSCV-puroR-mCherry vector. NKIRTIL027 immortalized B cells were transduced with TMG constructs, selected to over 90% purity (by measuring mCherry expression) with 5 μg ml⁻¹ puromycin and incubated with NKIRTIL027 CD4⁺ or CD8⁺ TIL at a ratio of 1:1 for 48 h in T cell medium with 30 U ml⁻¹ IL-2. T cell activation was subsequently assessed by measuring IFNγ levels in the culture supernatant using the Cytometric Bead Array kit (BD Biosciences), following the manufacturer’s instructions.

Library cloning and transduction

Oligonucleotide libraries were amplified by 12 cycles of PCR using Phusion High-Fidelity DNA Polymerase (New England Biolabs) and primers Preamp Forward (5′-ACTGTCAGAAGACTGCAAGC-3′) and Preamp Reverse (5′-TGACAGCGAAGACCATAGTG-3′). For first proof-of-concept screening experiments using MHC class I-restricted TCRs, the amplified model antigen library was cloned by Golden Gate assembly using BbsI into the pMSCV-puroR-mCherry retroviral vector. For all other screens, amplified libraries were cloned into the pMSCV-puroR-mCherry vector modified to include the sorting sequence of CD74. Subcloned libraries were amplified using Endura electrocompetent cells (Lucigen) and library DNA was extracted using the PureLink HiPure Maxiprep kit (Invitrogen). During all cloning steps, a library representation of at least 100× was maintained.
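Because cloning relies on BbsI, it can help to check where its recognition sequence (GAAGAC) occurs in a sequence. The sketch below scans both strands of a sequence and is applied to the two preamplification primers quoted above. The helper names are hypothetical and the scan is only an illustration, not part of the published workflow.

```python
BBSI_SITE = "GAAGAC"  # BbsI recognition sequence (type IIS enzyme)


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def bbsi_sites(seq):
    """Return sorted (position, strand) pairs for every BbsI recognition site."""
    hits = []
    for strand, motif in (("+", BBSI_SITE), ("-", revcomp(BBSI_SITE))):
        pos = seq.find(motif)
        while pos != -1:
            hits.append((pos, strand))
            pos = seq.find(motif, pos + 1)
    return sorted(hits)


PREAMP_FORWARD = "ACTGTCAGAAGACTGCAAGC"
PREAMP_REVERSE = "TGACAGCGAAGACCATAGTG"

for name, primer in (("forward", PREAMP_FORWARD), ("reverse", PREAMP_REVERSE)):
    print(name, bbsi_sites(primer))
```

The same function could be run over candidate minigene sequences to flag internal sites that need a synonymous codon change before synthesis.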

Libraries were retrovirally transduced in duplicate into immortalized B cell lines, as described above. To ensure single retroviral integrations, B cells were transduced at an infection rate of less than 10%. One day after transduction, B cells were transferred to B cell medium in the presence of irradiated CD40L⁺ L cells. Transduction efficiency was assessed 3 days post-transduction by measuring mCherry expression by flow cytometry, followed by selection with 5 µg ml⁻¹ puromycin for 2 days and expansion of the B cell cultures until used in screens.
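The reasoning behind the low infection rate can be made concrete with a short calculation. It assumes that the number of integrations per cell follows a Poisson distribution, which is a common modelling assumption and is not stated in the text.

```python
import math


def multiple_integration_fraction(infected_fraction):
    """Expected fraction of transduced cells carrying two or more integrations.

    Assumes integrations per cell are Poisson distributed, so the mean
    (multiplicity of infection) follows from the uninfected fraction.
    """
    if not 0 < infected_fraction < 1:
        raise ValueError("infected_fraction must be between 0 and 1")
    p_zero = 1 - infected_fraction
    moi = -math.log(p_zero)   # Poisson mean that leaves p_zero cells uninfected
    p_one = moi * p_zero      # P(exactly one integration)
    return (1 - p_zero - p_one) / infected_fraction


print(f"{multiple_integration_fraction(0.10):.3f}")
```

Under this model, at a 10% infection rate about 5% of the transduced cells would be expected to carry more than one integration, and the share falls further at lower infection rates.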

Antigen discovery screens

For proof-of-concept screens using MHC class I-restricted TCRs, the antigen library encoding known cancer regression antigens was transduced into a previously immortalized HLA-A*02:01⁺ patient B cell line (OVC21)¹². Library-expressing B cells were coincubated in duplicate with donor CD8⁺ T cells transduced with the CDK4_R24L-specific TCR #53 or the MART1_26–35-specific TCRs DMF4, DMF5 or 1D3 (all HLA-A*02:01-restricted) in T cell medium with 25 U ml⁻¹ IL-2 at a T cell:B cell ratio of 5:1 and at a density of 2 × 10⁶ total cells cm⁻². Cultures were resuspended on days 1 and 2 of the experiment. For screens using patient-derived MHC class II-restricted neoantigen-specific TCRs, the model library was transduced into patient-matched immortalized B cells (patient NKIRTIL017), and library-expressing B cells were cocultured with donor CD4⁺ T cells transduced with either the MANSC1_D>H– or SNORD73A_R>W-specific TCR as described above. To simulate screening conditions using clonally diverse T cell populations, TCR-expressing T cells were mixed with donor-matched mock-transduced T cells at indicated ratios. Library coverage of at least 300× was maintained in all experiments. After 72 h of coincubation, cells were washed in PBS, and cell debris was removed either by Ficoll-Paque density gradient separation or using the Dead Cell Removal kit (Miltenyi Biotec). Isolated cells were subsequently resuspended in DirectPCR Lysis Reagent (Viagen) containing 500 µg ml⁻¹ proteinase K and lysed by incubation at 55 °C for 60 min, 85 °C for 30 min and 94 °C for 5 min. Minigene sequences were then amplified by PCR using NEBNext Ultra II Q5 Master Mix (New England Biolabs) with the following primers:

Prep-I Forward (for screens with MHC class I-restricted TCRs):

5′-CAAGCAGAAGACGGCATACGATGGAGGAGAACCCTGGACCTACAAGC-3′

Prep-II Forward (for all other screens):

5′-CAAGCAGAAGACGGCATACGACCTGCGGATGAAGCTGCCCG-3′

Prep Reverse:

5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNGATCCGACTCGGTGCCACTTTTTCAAC-3′

The 7-nt stretch of N nucleotides indicates a unique barcode sequence used to enable the multiplexed preparation of sequencing libraries. Following PCR, samples were pooled equimolarly and run on a 1% agarose gel to separate minigene amplicons from potential primer dimers. Minigene amplicons were extracted from gel using the Monarch DNA Gel Extraction Kit (New England Biolabs) and deep sequenced on an Illumina HiSeq 2500 Sequencing system (single read 65 bp). Sequencing data were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive under accession code PRJNA884260 (ref. ³⁰).
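To illustrate how a 7-nt sample barcode can be used to split pooled reads, here is a minimal sketch. It assumes that the barcode occupies the first 7 nt of each read and that a single mismatch is tolerated. The barcode sequences and sample names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read, sample_barcodes, barcode_len=7, max_mismatches=1):
    """Return the sample whose barcode matches the read prefix, or None.

    Reads that match no barcode, or more than one barcode, are left unassigned.
    """
    prefix = read[:barcode_len]
    if len(prefix) < barcode_len:
        return None
    matches = [sample for sample, barcode in sample_barcodes.items()
               if hamming(prefix, barcode) <= max_mismatches]
    return matches[0] if len(matches) == 1 else None


# Hypothetical barcodes, for illustration only.
SAMPLE_BARCODES = {"screen_rep1": "ACGTACG", "control_rep1": "TTGCAAT"}

print(assign_sample("ACGTACAGATCCGACTCGG", SAMPLE_BARCODES))
```

Tolerating a mismatch only works if the barcodes in a pool differ from each other at enough positions, so that a single sequencing error cannot make a read match two samples.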

For patient neoantigen screens, mutanome libraries were transduced into autologous immortalized B cells, followed by selection with puromycin. The cytotoxic potential of expanded patient TIL was confirmed before neoantigen screens by measuring their capacity to degranulate. To this end, CD4⁺ and CD8⁺ TIL were polyclonally stimulated using CD3/CD28 Dynabeads in T cell medium in the presence of Golgistop (BD Biosciences) and an antibody against CD107 for 12 h. Following incubation, cells were stained with IR-Dye and an anti-CD4 antibody and analyzed by flow cytometry. Neoantigen screens were subsequently performed by incubating library-expressing B cells in duplicate with patient T cells at a T cell:B cell ratio of 5:1. Library-transduced B cells cultured in the absence of patient T cells served as a negative control. After 72 h of coincubation, cells were processed as described above.

Sequence analysis

Initial sequence quality was assessed with FastQC, and reads were demultiplexed using fastq-multx (ea-utils) with one mismatch allowed. Vector sequences were trimmed from sequence reads using fastq-mcf (ea-utils) against the UniVec database, and samples were subsequently quality filtered using cutadapt. The unique 12 nt barcodes that were added to individual minigene sequences were extracted using seqkit and mapped using Bowtie2 with no multimatched hits allowed. For the ITO66 neoantigen screen, high-quality reads were mapped against the full minigene sequences of the patient library using BBMap, with ambiguously mapped reads removed and only perfect mappings allowed. Per-sample count tables were normalized and compared for differential abundance using DESeq2. Minigenes with an average abundance below the fourth percentile and a coefficient of variation greater than one across the two internal replicates were removed from analyses. Statistical testing was performed using the DESeq2 Wald test with a log-fold change cut-off of 0.25. Tumor variants were defined as screen hits when at least one of the duplicate mutant sequences, but neither of the corresponding WT-encoding minigenes, had a false discovery rate-corrected P value of less than 0.2 and a log₂ fold change of less than −0.5. All data analysis was performed using R and visualized using the ggplot2 package.
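The hit-calling rule can be written down compactly. The sketch below applies the stated thresholds to per-minigene test results. The record layout and the variant names are assumptions for illustration, and the analysis itself was done in R with DESeq2, not with this code.

```python
from dataclasses import dataclass

FDR_CUTOFF = 0.2
LOG2FC_CUTOFF = -0.5


@dataclass
class MinigeneResult:
    variant: str             # tumor variant the minigene belongs to
    is_mutant: bool          # True for mutant, False for the WT counterpart
    padj: float              # false discovery rate-corrected P value
    log2_fold_change: float  # screen versus control


def is_depleted(result):
    return result.padj < FDR_CUTOFF and result.log2_fold_change < LOG2FC_CUTOFF


def call_hits(results):
    """Variants with at least one depleted mutant minigene and no depleted WT minigene."""
    by_variant = {}
    for result in results:
        by_variant.setdefault(result.variant, []).append(result)
    hits = []
    for variant, rows in by_variant.items():
        mutant_depleted = any(is_depleted(r) for r in rows if r.is_mutant)
        wt_depleted = any(is_depleted(r) for r in rows if not r.is_mutant)
        if mutant_depleted and not wt_depleted:
            hits.append(variant)
    return sorted(hits)
```

Requiring the WT minigenes to be unaffected means that depletion shared by the mutant and WT sequences is not reported as a neoantigen hit.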

To validate patient neoantigens identified in screens, minigenes encoding the screen hits, as well as their WT counterparts, were synthesized as individual gBlocks (IDT), cloned into the CD74 signal sequence-modified pMSCV-puroR-mCherry vector and transduced into immortalized patient B cells. Following selection with puromycin, minigene-transduced B cells were cocultured with expanded patient CD4⁺ or CD8⁺ TIL for 48 h in T cell medium with 30 U ml⁻¹ IL-2 and T cell activation was assessed by measuring IFNγ levels in the culture supernatant using the Cytometric Bead Array kit (BD Biosciences). Reactivity towards neoantigens was considered confirmed when T cells secreted at least twofold more IFNγ in response to the mutant sequence compared with the WT control sequence.

NKIRTIL063 TMG screen

The number of tumor variants of patient NKIRTIL063 selected for TMG screening was limited to 200 (of a total of 685 nonsynonymous expressed mutations). Mutations were selected by first including the 25 most clonal mutations (based on variant allele frequency), followed by including mutations with highest gene expression up to a total of 200 tumor variants. TMG constructs were designed to encode ten variant-encoding minigenes (93 nt each) in which the mutant codon was flanked by 45 nt of nonmutant gene sequence on either side. Codon-optimized sequences were synthesized (Twist Biosciences) and subcloned into the CD74-modified pMSCV-puroR-mCherry retroviral vector. NKIRTIL063 immortalized B cells were transduced with TMG constructs and selected to more than 80% purity with 5 μg ml⁻¹ puromycin. Next, TMG-expressing B cells were cocultured with NKIRTIL063 CD4⁺ or CD8⁺ TIL (from the same expansion cultures as used for the antigen discovery screen) at a ratio of 1:1 for 48 h in T cell medium with 30 U ml⁻¹ IL-2, and activation of T cells was determined by measuring IFNγ levels in the culture supernatant using the Cytometric Bead Array kit (BD Biosciences). To validate that the observed reactivity to selected TMG constructs was mediated by neoantigens identified using our neoantigen discovery screen, modified versions of these TMGs were designed such that exclusively the minigene that encoded the identified neoantigen was reverted to its WT sequence. Reactivity of NKIRTIL063 CD4⁺ or CD8⁺ TIL to B cells transduced with these modified TMGs was subsequently assessed as above.
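The two-step selection of variants and their grouping into TMGs can be sketched as follows. The `Variant` record, its field names and the way variants are assigned to constructs are hypothetical. The text specifies only the ranking criteria and the ten minigenes per construct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    vaf: float         # variant allele frequency, used as a proxy for clonality
    expression: float  # gene expression level


def select_variants(variants, n_clonal=25, n_total=200):
    """Take the most clonal variants first, then fill up by highest expression."""
    selected = sorted(variants, key=lambda v: v.vaf, reverse=True)[:n_clonal]
    chosen = {v.name for v in selected}
    remaining = sorted((v for v in variants if v.name not in chosen),
                       key=lambda v: v.expression, reverse=True)
    return selected + remaining[:n_total - len(selected)]


def group_into_tmgs(selected, per_tmg=10):
    """Assumption: consecutive selected variants share a tandem minigene construct."""
    return [selected[i:i + per_tmg] for i in range(0, len(selected), per_tmg)]
```

With 200 selected variants and ten minigenes per construct, this grouping would yield 20 TMGs.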

In silico selection of neoantigen vaccine targets

The computational tool Vaxrank³¹ was used to rank tumor mutations of patients NKIRTIL063, NKIRTIL027 and ITO66 for use in a putative personalized cancer vaccine. Patient ITO34 was omitted from this analysis because RNA expression data were unavailable. HLA typing of patients was performed using OptiType for HLA-A, -B and -C alleles. The set of somatic variant calls and aligned RNA reads were used as input, with parameters set to a peptide length of 25, an epitope length of 8–11 and utilization of the MHCFlurry prediction algorithm. In line with ongoing clinical trials of personalized neoantigen-based vaccines³², the 20 top ranking predicted neoantigens were considered for putative neoantigen vaccines (Supplementary Table 2).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

DNA sequencing data of antigen discovery screens have been deposited in the NCBI Sequence Read Archive under accession code PRJNA884260 (ref. ³⁰). Protein sequences of genes encoding known human nonmutated cancer regression antigens, as well as selected viral genes, were collected from the UniProt database (https://www.uniprot.org/).

Code availability

Scripts used for analyzing sequencing data from antigen discovery screens are available at https://github.com/twbattaglia/amplicon-nf (ref. ³³). Script output for the presented analyses is available at https://github.com/twbattaglia/HANSolo-manuscript (ref. ³⁴).

References

Schumacher, T. N., Scheper, W. & Kvistborg, P. Cancer neoantigens. Annu. Rev. Immunol. 37, 173–200 (2019).
Schumacher, T. N. & Schreiber, R. D. Neoantigens in cancer immunotherapy. Science 348, 69–74 (2015).
Bentzen, A. K. et al. Large-scale detection of antigen-specific T cells using peptide-MHC-I multimers labeled with DNA barcodes. Nat. Biotechnol. 34, 1037–1045 (2016).
Lu, Y. C. et al. Efficient identification of mutated cancer antigens recognized by T cells associated with durable tumor regressions. Clin. Cancer Res. 20, 3401–3410 (2014).
Kula, T. et al. T-Scan: a genome-wide method for the systematic discovery of T cell epitopes. Cell 178, 1016–1028.e13 (2019).
Joglekar, A. V. et al. T cell antigen discovery via signaling and antigen-presenting bifunctional receptors. Nat. Methods 16, 191–198 (2019).
Li, G. et al. T cell antigen discovery via trogocytosis. Nat. Methods 16, 183–190 (2019).
Alspach, E. et al. MHC-II neoantigens shape tumour immunity and response to immunotherapy. Nature 574, 696–701 (2019).
Borst, J., Ahrends, T., Babala, N., Melief, C. J. M. & Kastenmuller, W. CD4+ T cell help in cancer immunology and immunotherapy. Nat. Rev. Immunol. 18, 635–647 (2018).
Oh, D. Y. et al. Intratumoral CD4+ T cells mediate anti-tumor cytotoxicity in human bladder cancer. Cell 181, 1612–1625.e13 (2020).
Tran, E. et al. Cancer immunotherapy based on mutation-specific CD4+ T cells in a patient with epithelial cancer. Science 344, 641–645 (2014).
Scheper, W. et al. Low and variable tumor reactivity of the intratumoral TCR repertoire in human cancers. Nat. Med. 25, 89–94 (2019).
Stronen, E. et al. Targeting of cancer neoantigens with donor-derived T cell receptor repertoires. Science 352, 1337–1341 (2016).
Johnson, L. A. et al. Gene transfer of tumor-reactive TCR confers both high avidity and tumor reactivity to nonreactive peripheral blood mononuclear cells and tumor-infiltrating lymphocytes. J. Immunol. 177, 6548–6559 (2006).
Borbulevych, O. Y., Santhanagopolan, S. M., Hossain, M. & Baker, B. M. TCRs used in cancer gene therapy cross-react with MART-1/Melan-A tumor antigens via distinct mechanisms. J. Immunol. 187, 2453–2463 (2011).
Valmori, D. et al. Vaccination with a Melan-A peptide selects an oligoclonal T cell population with increased functional avidity and tumor reactivity. J. Immunol. 168, 4231–4240 (2002).
Chalmers, Z. R. et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Med. 9, 34 (2017).
Jorritsma, A. et al. Selecting highly affine and well-expressed TCRs for gene therapy of melanoma. Blood 110, 3564–3572 (2007).
Tran, E. et al. Immunogenicity of somatic mutations in human gastrointestinal cancers. Science 350, 1387–1390 (2015).
Cattaneo, C. M. et al. Tumor organoid–T-cell coculture systems. Nat. Protoc. 15, 15–39 (2020).
Dijkstra, K. K. et al. Generation of tumor-reactive T cells by co-culture of peripheral blood lymphocytes and tumor organoids. Cell 174, 1586–1598.e12 (2018).
Arnaud, M. et al. Sensitive identification of neoantigens and cognate TCRs in human solid tumors. Nat. Biotechnol. 40, 656–660 (2022).
Sahin, U. et al. Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer. Nature 547, 222–226 (2017).
Ott, P. A. et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature 547, 217–221 (2017).
Hilf, N. et al. Publisher correction: Actively personalized vaccination trial for newly diagnosed glioblastoma. Nature 566, E13–E13 (2019).
Keskin, D. B. et al. Neoantigen vaccine generates intratumoral T cell responses in phase Ib glioblastoma trial. Nature 565, 234–239 (2019).
Kwakkenbos, M. J. et al. Generation of stable monoclonal antibody–producing B cell receptor–positive human memory B cells by genetic programming. Nat. Med. 16, 123–128 (2010).
Linnemann, C. et al. High-throughput epitope discovery reveals frequent recognition of neo-antigens by CD4+ T cells in human melanoma. Nat. Med. 21, 81–85 (2015).
Bonehill, A. et al. Messenger RNA-electroporated dendritic cells presenting MAGE-A3 simultaneously in HLA class I and class II molecules. J. Immunol. 172, 6649–6657 (2004).
Cattaneo, C.M. et al. HLA-agnostic Neoantigen Screening (HANSolo) – raw sequencing data. NCBI Sequence Read Archive (SRA) https://www.ncbi.nlm.nih.gov/bioproject/PRJNA884260 (2022).
Rubinsteyn, A., Hodes, I., Kodysh, J. & Hammerbacher, J. Vaxrank: A computational tool for designing personalized cancer vaccines. Preprint at bioRxiv https://doi.org/10.1101/142919 (2017).
Blass, E. & Ott, P. A. Advances in the development of personalized neoantigen-based therapeutic cancer vaccines. Nat. Rev. Clin. Oncol. 18, 215–229 (2021).
Battaglia, T. HANSolo amplicon-nf pipeline. GitHub https://github.com/twbattaglia/amplicon-nf (2022).
Battaglia, T. HANSolo analysis code. GitHub https://github.com/twbattaglia/HANSolo-manuscript (2022).

Acknowledgements

We would like to thank M. Slagter and L. Wessels for bioinformatic and statistical support, K. Dijkstra for support with single-cell TCR sequencing, A. van de Leun for support with isolation of neoantigen-specific TCRs, M. Wolkers for kindly sharing patient material, K. Bresser and D. Vredevoogd for helpful discussions on library design, the NKI-AVL Flow Cytometry Facility for flow cytometric support, the NKI-AVL Core Facility Molecular Pathology and Biobanking for supplying NKI-AVL Biobank material and laboratory support and the NKI-AVL Genomics Core Facility for support with next-generation sequencing. This work was supported by the Dutch Cancer Society Young Investigator Grant (grant No. 2020-1/12977) (to W.S.), ZonMw Translational Research Program 2 (grant No. 446002001) (to W.S. and J.B.A.G.H.), the Queen Wilhelmina Cancer Research Award and ERC AdG SENSIT (grant agreement No. 742259) (to T.N.S.), the NWO Gravitation program (NWO 2012-2022) (to E.E.V.) and Oncode Institute (to T.N.S. and E.E.V.). Figure 1a was created with BioRender.com.

Author information

Author notes

These authors contributed equally: Thomas Battaglia, Jos Urbanus.
These authors jointly supervised this work: Emile E. Voest, Ton N. Schumacher, Wouter Scheper.

Authors and Affiliations

Department of Molecular Oncology and Immunology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
Chiara M. Cattaneo, Thomas Battaglia, Jos Urbanus, Ziva Moravec, Rhianne Voogd, John B. A. G. Haanen, Emile E. Voest, Ton N. Schumacher & Wouter Scheper
Oncode Institute, Utrecht, The Netherlands
Chiara M. Cattaneo, Thomas Battaglia, Jos Urbanus, Emile E. Voest & Ton N. Schumacher
Department of Genomics of Cancer and Targeted Therapies, IFOM, FIRC Institute of Molecular Oncology, Milan, Italy
Chiara M. Cattaneo
Department of Hematopoiesis, Sanquin Research, Amsterdam, The Netherlands
Rosa de Groot
Department of Hematology, Leiden University Medical Centre, Leiden, The Netherlands
Rosa de Groot & Ton N. Schumacher
Department of Surgery, The Netherlands Cancer Institute, Amsterdam, The Netherlands
Koen J. Hartemink
Department of Medical Oncology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
John B. A. G. Haanen & Emile E. Voest

Contributions

C.M.C., J.U., Z.M., R.V. and W.S. designed, performed, analyzed and interpreted experiments. T.B. analyzed sequencing data of screens. K.J.H. and R.d.G. supplied patient tumor material. C.M.C., J.B.A.G.H., E.E.V., T.N.S. and W.S. wrote the manuscript. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Emile E. Voest, Ton N. Schumacher or Wouter Scheper.

Ethics declarations

Competing interests

T.N.S. is advisor for Allogene Therapeutics, Celsius, Merus, Neogene Therapeutics and Scenic Biotech; is a recipient of research support from Merck KGaA; is a stockholder in Allogene Therapeutics, Cell Control, Celsius, Merus, Neogene Therapeutics and Scenic Biotech; and is venture partner at Third Rock Ventures, all outside of the current work. J.B.A.G.H. is advisor for BioNTech, Neogene Therapeutics, Scenic Biotech and T-Knife; is a recipient of research grant support from BioNTech; and is a stock option holder in Neogene Therapeutics, all outside of the current work. All other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Paul Robbins and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Cattaneo, C.M., Battaglia, T., Urbanus, J. et al. Identification of patient-specific CD4⁺ and CD8⁺ T cell neoantigens through HLA-unbiased genetic screens. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-022-01547-0

Received: 24 February 2021
Accepted: 06 October 2022
Published: 02 January 2023
DOI: https://doi.org/10.1038/s41587-022-01547-0

Accurate isoform discovery with IsoQuant using long reads

January 3, 2023

Main

Long-read RNA sequencing is now widely used in bulk, sorted cells, single cells and spatial approaches. This wide field of applications has led to the development of multiple spliced alignment programs¹,²,³,⁴, transcript discovery methods⁵,⁶,⁷,⁸,⁹,¹⁰,¹¹, tools for transcript classification¹², annotation¹³ and visualization¹⁴,¹⁵. Additionally, several reference-free tools for RNA long-read correction and assembly have been developed¹⁶,¹⁷. Current community efforts address the problem of understanding performance, weaknesses and advantages of each approach for various applications¹⁸.

Here we present IsoQuant—a tool for transcript discovery and quantification with long RNA reads. IsoQuant takes as input a reference genome and a dataset containing PacBio or ONT (Oxford Nanopore Technologies) RNA reads. By default, IsoQuant maps input reads to the genome via minimap2 in splice mode². Alternatively, a user may provide BAM files generated with a spliced aligner of their choice, for example STARlong¹ for PacBio and uLTRA⁴ or deSALT³ for ONT reads. In two distinct modes, IsoQuant can be used for de novo annotation-free transcript discovery as well as with the reference gene annotation.

IsoQuant uses long-read spliced alignments to construct an intron graph, in which vertices are splice junctions, that is, pairs of splice sites (donor and acceptor), and two vertices are connected with a directed edge if the corresponding splice junctions are consecutive in at least one read (Methods). This graph is exploited for constructing paths that correspond to full-length transcripts (Fig. 1a). If the reference annotation is provided, IsoQuant first assigns reads to known isoforms via an inexact intron-chain matching algorithm that accounts for splice site shifts, which are typical for alignment of error-prone reads¹⁹. These assignments are further used for reference transcript quantification and correction of inaccurately detected splice junctions and misalignments, such as skipped microexons.
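As a rough illustration of the intron graph idea, the sketch below collects splice junctions from aligned exon blocks and links junctions that are consecutive within a read. The input representation and function names are assumptions. The sketch ignores strand, splice-site shifts and alignment errors, and it is not IsoQuant's implementation.

```python
from collections import Counter


def splice_junctions(exon_blocks):
    """Junctions of one read, given its aligned exon blocks as sorted (start, end) pairs.

    Each junction is a (donor, acceptor) pair: the end of one block and the
    start of the next.
    """
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(exon_blocks, exon_blocks[1:])]


def build_intron_graph(read_alignments):
    """Return (vertices, edges) as Counters holding read support.

    Vertices are splice junctions; a directed edge links two junctions that
    are consecutive in at least one read.
    """
    vertices = Counter()
    edges = Counter()
    for blocks in read_alignments:
        junctions = splice_junctions(blocks)
        vertices.update(junctions)
        edges.update(zip(junctions, junctions[1:]))
    return vertices, edges
```

A path through such a graph corresponds to an intron chain, which is why paths can be read as candidate full-length transcripts.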

**Fig. 1: IsoQuant pipeline outline and characteristics of novel transcripts generated from mouse simulated data.**

To compare IsoQuant performance against existing transcript discovery tools, we first simulated mouse PacBio and ONT data using realistic gene expression profiles with IsoSeqSim (https://github.com/yunhaowang/IsoSeqSim) and Trans-NanoSim²⁰ respectively. For more informative benchmarking, we simulated an ONT R9.4 dataset representing R9.4 chemistry and an ONT R10.4 dataset corresponding to a more accurate R10.4 chemistry (Methods).

To mimic real-life datasets containing unannotated transcripts, we arbitrarily removed 5,311 (15%) of 35,684 expressed isoforms (the ones contributing to at least one read during the simulation) from the GENCODE²¹ gene annotation. These 5,311 hidden transcripts were further used as a ground truth for novel transcript discovery. The reduced GENCODE annotation was used as an input for all tools. Each output annotation was then separated into a set of known and a set of novel transcripts, which were compared against the respective baselines using gffcompare²² (Methods).
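The hold-out design can be sketched in a few lines. Random sampling with a fixed seed is an assumption here, since the text says only that the transcripts were removed arbitrarily. The transcript identifiers are placeholders.

```python
import random


def hide_transcripts(expressed_ids, n_hidden, seed=0):
    """Split expressed transcript IDs into a kept annotation and a hidden ground truth."""
    rng = random.Random(seed)
    hidden = set(rng.sample(sorted(expressed_ids), n_hidden))
    kept = [tid for tid in expressed_ids if tid not in hidden]
    return kept, hidden


# Placeholder IDs standing in for the 35,684 expressed isoforms.
expressed = [f"tx{i}" for i in range(35684)]
kept, hidden = hide_transcripts(expressed, n_hidden=5311)
print(len(kept), len(hidden), round(100 * len(hidden) / len(expressed), 1))
```

Any transcript a tool reports that matches a member of the hidden set counts as a correctly discovered novel transcript, while the kept set plays the role of the reduced annotation given to every tool.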

For known transcripts, IsoQuant has the highest F1-score (the harmonic mean of precision and recall) compared to TALON⁷, FLAIR⁸, Bambu¹¹ and StringTie⁵, but these advances are not dramatic (Supplementary Tables 1–3). However, IsoQuant produces novel transcripts with a 1.9-fold higher F1-score on ONT R10.4 data compared to the second-best tool, StringTie. In comparison to TALON, FLAIR and Bambu, the improvement in F1-score is even more noticeable (Fig. 1b, left). On PacBio data, IsoQuant again shows the best F1-score, but the difference from other tools is smaller than for ONT R10.4 data (Fig. 1b, right).
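For reference, the metrics used throughout can be computed from three counts, as in the sketch below. The example numbers are invented for illustration and are not taken from the benchmarks.

```python
def precision_recall_f1(true_positives, n_predicted, n_truth):
    """Precision, recall and their harmonic mean (F1-score)."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_truth if n_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: 100 predicted transcripts, 80 correct, 160 in the ground truth.
precision, recall, f1 = precision_recall_f1(80, 100, 160)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a tool needs both high precision and high recall to reach a high F1-score.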

Compared to most tools, IsoQuant’s improvement in F1-score is primarily caused by its very high precision for novel transcripts. Compared to TALON, FLAIR and StringTie, IsoQuant shows at least a fivefold drop in false-positive rate on ONT R10.4 data, while still maintaining slight gains in recall (Fig. 1d). The situation is different for Bambu: IsoQuant has higher precision (86.3 versus 69.9%) and substantially higher recall. While Bambu only reconstructs 73 out of 5,311 novel isoforms (1% recall), IsoQuant reconstructs 3,848 (72.5%). On ONT R9.4 simulated data IsoQuant similarly shows a notably lower false-positive rate compared to other tools (Supplementary Table 2).

On PacBio simulated data, similar trends can be observed for novel transcripts, although with a less drastic difference in specificity. Bambu shows slightly higher precision (95.8%) compared to IsoQuant (94.4%), but again has the lowest recall (18.7% for Bambu versus 76.8% for IsoQuant). StringTie, TALON and FLAIR again predict transcripts with comparable recall, but have at least fivefold higher false-positive rate compared to IsoQuant (Fig. 1e; a detailed analysis of the false-positive transcripts is provided in Supplementary Note 8).

Further, we measured precision and recall for novel transcripts with different expression levels (Fig. 1c and Supplementary Fig. 1). While all tools tend to show lower recall and precision for lowly expressed transcripts, IsoQuant yields highly specific transcript models (≥80% precision) and maintains advances for novel transcript discovery regardless of the expression levels. Thus, IsoQuant is likely to be highly useful across many genes, including but not limited to low-expressed long-noncoding RNAs and marker genes of cell types.

Among the five listed methods, only StringTie and IsoQuant support annotation-free transcript discovery. Thus, we compared these two tools on the same simulated datasets used above without providing any annotation (Supplementary Table 4). On PacBio data both tools yield highly accurate transcript models. On ONT data StringTie shows higher recall, while IsoQuant generates transcripts with substantially lower false-positive rates (2.5-fold decrease for ONT R10.4 dataset and 3.7-fold for ONT R9.4). While overall quality of transcripts discovered in reference-based mode is, indeed, higher compared to annotation-free runs, the precision and recall of novel transcripts appear to be rather similar in both modes.

To complement our benchmarks on simulated data, we also sequenced Lexogen spike-in RNA variant (SIRV) synthetic molecules on the Oxford Nanopore MinION using ONT R10.4 flowcells (Methods). Along with the complete SIRV annotation, Lexogen provides an incomplete annotation, missing 26 out of the total 69 SIRV isoforms, which allows the evaluation of novel transcript discovery, similar to the one we performed for simulated data with the reduced GENCODE annotation.

Results on SIRV sequencing data resemble the ones obtained on simulated reads. When predicting novel isoforms, IsoQuant shows at least four times higher F1-score and eightfold lower false-positive rate than any other tool. In comparison to most tools, with the exception of TALON, IsoQuant shows high gains in both precision and recall. TALON has a better recall (42.3 versus 38.5%), but IsoQuant has tenfold higher precision (Fig. 2a). Similar to simulated data, all tools are able to accurately predict SIRV transcripts kept in the annotation, with Bambu, StringTie and IsoQuant having perfect precision for known isoforms alone (Supplementary Table 5).

To support our observations, we also applied all tools to the real human ONT complementary DNA, ONT direct RNA (dRNA)²³ and PacBio public datasets, for which the ground truth is unknown. We used gffcompare to estimate the consistency of predictions by computing the number of identical transcript models reported by the different tools. On the human ONT dRNA dataset, IsoQuant shows the highest percentage of transcripts confirmed by at least three other methods (70.1%), while no other tool surpasses the 40% threshold. This suggests that IsoQuant transcript models are notably more consistent with other methods (Fig. 2b, middle). In comparison to the other approaches, IsoQuant also reports the lowest number of transcripts that are not predicted by any other method. If one interprets such transcript models as potential false positives, IsoQuant again stands out with the lowest false-discovery rate (3.5%, 1,162 transcripts). In contrast, other tools output annotations in which more than 33% of transcript models are unconfirmed (varying from 18,000 to 48,000). Additionally, for each tool we computed the number of potentially missed transcripts that were reported by all other methods. While TALON has the lowest number of such transcripts (75), Bambu shows the second-best results of 1,089 possible false negatives and IsoQuant shows the third-best results of 1,521 such transcripts (Supplementary Table 6).
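The consistency measures described above can be expressed as simple set operations once identical transcript models share a common key. The sketch below assumes that each tool's output has already been reduced to a set of such keys (for example, intron chains matched by gffcompare). The function names and dictionary layout are hypothetical.

```python
def support_counts(predictions, tool):
    """For each transcript of `tool`, count the other tools reporting an identical model."""
    others = [models for name, models in predictions.items() if name != tool]
    return {tx: sum(tx in models for models in others) for tx in predictions[tool]}


def consistency_summary(predictions, tool, min_support=3):
    """Summarize confirmed, unconfirmed and potentially missed transcripts for one tool."""
    counts = support_counts(predictions, tool)
    others = [models for name, models in predictions.items() if name != tool]
    missed = set.intersection(*others) - predictions[tool]
    return {
        "total": len(counts),
        "confirmed": sum(c >= min_support for c in counts.values()),
        "unconfirmed": sum(c == 0 for c in counts.values()),
        "potentially_missed": len(missed),
    }
```

Here `unconfirmed` corresponds to transcripts reported by no other method, and `potentially_missed` to transcripts reported by all other methods but absent from the tool's own output.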

**Fig. 2: Characteristics of transcripts obtained from real sequencing data.**

Similar trends can be observed in ONT cDNA and PacBio datasets, although the overall percentage of common transcripts appears to be lower compared to ONT dRNA data (Fig. 2b, left and right). IsoQuant again shows the highest fraction of transcripts predicted by at least three other tools (35.6% for ONT cDNA, 55.6% for PacBio), while other programs reach at most 25% and 40%, respectively. All four other tools produce annotations containing a high number of transcripts that are not confirmed by any other method (>50% of all transcripts for ONT cDNA, >30% for PacBio), while IsoQuant’s potential false predictions are below 25% on the ONT cDNA dataset and below 10% on the PacBio dataset.

Although these values cannot be explicitly treated as false positives and false negatives, they suggest that, unlike other tools, IsoQuant produces highly specific annotations that are strongly consistent with transcripts reported by several alternative approaches. Moreover, because IsoQuant typically misses very few isoforms predicted by all other tools simultaneously, it is likely to also be highly sensitive (Supplementary Table 6, the number of potentially missed transcripts).

Additionally, we used long-read RNA sequencing data from a mouse brain sample, in which a previous study reported 76 novel isoforms of high biological importance²⁴, which were confirmed by manual annotation by the GENCODE team. Here, we compared IsoQuant only with StringTie, which has the second-best F1-score across all simulated datasets. On PacBio data, IsoQuant correctly reconstructs 71% of the confirmed novel isoforms, while StringTie restores approximately half as many novel transcripts—37% (Supplementary Table 7). Similarly, on the single-cell ONT dataset from the same brain sample IsoQuant restores almost 50% of these 76 novel isoforms, whereas StringTie reports 30%. Although it is not possible to evaluate specificity in this kind of experiment, it confirms that IsoQuant can maintain high recall values on real sequencing data.

Besides transcript discovery, IsoQuant implements additional functionality, such as read-to-isoform assignment and transcript quantification. Benchmarks of these supplementary features, information on computational performance and IsoQuant results obtained with different spliced aligners can be found in Supplementary Notes 2–7.

In summary, IsoQuant accurately predicts transcript models from PacBio or ONT RNA sequencing data. For known isoforms, IsoQuant has a higher F1-score than the other tested tools, but these differences are not dramatic. For unannotated isoforms, however, IsoQuant provides very strong increases in F1-score over other existing approaches. In comparison to most tools, it achieves this F1-score increase by maintaining higher recall, while substantially increasing precision. Thus, IsoQuant is a valuable tool for predicting novel alternatively spliced isoforms in the age of long-read sequencing.

Methods

Sequencing Lexogen SIRV transcripts

First, total RNA from HeLa cells was extracted using the miRNeasy Tissue/Cells Advanced Mini Kit (Qiagen, 217604), and polyA transcripts were pulled down using the NEBNext Poly(A) messenger RNA Magnetic Isolation Module (NEB, E7490S). Next, the SIRV-Set 4 (Iso Mix E0/ERCC/Long SIRVs) (Lexogen, 141.01) was spiked into the RNA, which was then reverse transcribed using the Maxima H Minus Reverse Transcriptase (Thermo Scientific, EP0752). The reverse transcriptase reaction final concentrations are as follows: 1.25 ng μl⁻¹ polyA HeLa RNA, 0.33 ng μl⁻¹ SIRV-Set 4, 0.5 mM dNTP, 5 μM dT-VN oligo, 5 μM TSO, 1× reverse transcriptase buffer, 2 U μl⁻¹ RiboLock RNase Inhibitor (Thermo Scientific, EO0382) and 20 U μl⁻¹ Maxima H Minus Reverse Transcriptase. The reaction was incubated for 30 min at 50 °C and 5 min at 85 °C. Then, 5 μl of the reverse transcriptase reaction was amplified using the Platinum Superfi II Mastermix (ThermoFisher, 12368010) for 12 cycles, according to the manufacturer’s instructions and using Forward- and Reverse-Amplification primers. Finally, the cDNA was cleaned up using SPRIselect beads at a 0.8× ratio (Beckman Coulter, B23318) and used as input for Oxford Nanopore Technologies sequencing with both the Kit 12 (SQK-LSK110 kit and FLO-MIN106D flowcells) and Q20+ (SQK-LSK112 kit and FLO-MIN112 flowcells) chemistries. Both were run for 72 h and basecalled using the Super Accuracy model.

Data simulation

To simulate PacBio circular consensus sequencing (CCS) reads we used IsoSeqSim (https://github.com/yunhaowang/IsoSeqSim), which generates a read by truncating a transcript sequence according to given probabilities and randomly inserts sequencing errors at a specified rate with uniform distribution. As reported in previous studies²⁵, a uniform error distribution is a realistic model for PacBio CCS reads. Here we used 5′ and 3′ truncation probabilities typical for PacBio Sequel II (provided within the package) and an overall error rate of 1.6%: 0.6% deletions, 0.6% insertions and 0.4% substitutions. While these discrepancies do not necessarily represent sequencing errors, they must nevertheless be modeled, as they can confuse transcript reconstruction. The above values were obtained by mapping real PacBio CCS reads to the reference genome¹⁸.

ONT reads were simulated with the NanoSim software in the transcriptome mode²⁰. NanoSim is designed specifically for simulating ONT-specific sequencing errors and biases. It first constructs error-profile and length-distribution models, which are further used to mutate reference transcript sequences. We trained the model using the ONT R10.4 sequencing data (average error rate of 2.8%: 0.7% deletions, 1.1% insertions and 1% substitutions). To simulate ONT R9.4 chemistry, we used a pretrained model provided within the NanoSim package, which was obtained using publicly available ONT cDNA data²³ from the NA12878 human cell line and has an average error rate of 15.9%: 6% deletions, 5.1% insertions and 4.8% substitutions. In addition, we turned off the simulation of intron retention events and random unaligned reads representing the background noise.

However, additional analysis of the simulated ONT data and NanoSim code revealed that NanoSim randomly selects a start position of a read in a transcript sequence with a uniform distribution, thus introducing no 5′ or 3′ bias. To simulate more realistic ONT reads, we aligned real ONT cDNA data obtained from the mouse brain sample to the reference transcriptome using minimap2 and derived empirical truncation probability distributions on both 5′ and 3′ ends. Further, we changed the NanoSim source code to enable sequence truncation with respect to obtained probabilities (Supplementary Fig. 2). The modified version is available at https://github.com/andrewprzh/lrgasp-simulation.

For both ONT and PacBio simulation we used Mouse GENCODE v.26 and Human GENCODE v.36 basic annotations²¹. Before simulation, we also attached a 30 basepair (bp) polyA tail to every transcript sequence. To simulate realistic mouse data, a transcript expression profile was obtained using PacBio data from a mouse brain sample²⁴. For human data, a gene expression profile was computed with PacBio GM12878 data. A complete description of every dataset used in this study is provided in the Supplementary Table 8.

Quality evaluation of predicted novel transcripts

To mimic real-life situations and assess the ability of an algorithm to predict novel transcripts, we created reduced gene annotations by removing a fraction of expressed isoforms. First, we define a subset of true expressed transcripts that contributed to at least one read during the simulation. Among this set, we select a fraction of transcripts to be excluded from the annotation. These transcripts are denoted as the true novel isoforms. The remaining transcripts (among the expressed) are defined as true known isoforms. To create a reduced gene annotation, we remove all true novel isoforms from the comprehensive GENCODE annotation. Here we created a reduced mouse annotation with 15% of expressed transcripts removed, and four human reduced annotations with 10, 15, 20 and 25% of excluded expressed isoforms (Supplementary Note 2).
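The construction of a reduced annotation can be illustrated with a minimal Python sketch. It assumes that the excluded transcripts are sampled uniformly at random from the expressed set, which the description above does not specify; the function name `make_reduced_annotation` and the toy transcript identifiers are hypothetical.

```python
import random


def make_reduced_annotation(annotation_ids, expressed_ids, fraction, seed=0):
    """Hide a fraction of the expressed transcripts from the annotation.

    Returns (reduced annotation, true known isoforms, true novel isoforms).
    Uniform random sampling is an assumption made for this example.
    """
    rng = random.Random(seed)
    expressed = sorted(expressed_ids)
    n_novel = round(len(expressed) * fraction)
    true_novel = set(rng.sample(expressed, n_novel))
    true_known = set(expressed) - true_novel
    reduced = set(annotation_ids) - true_novel
    return reduced, true_known, true_novel


# Toy data: 30 annotated transcripts, of which the first 20 are expressed.
annotation = [f"T{i}" for i in range(30)]
expressed = annotation[:20]
reduced, true_known, true_novel = make_reduced_annotation(annotation, expressed, 0.15)
```

Unexpressed transcripts stay in the reduced annotation, so a tool still has to decide which annotated isoforms are actually present in the reads.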

To evaluate a transcript prediction tool, we provided the entire set of simulated reads and the reduced annotation as an input. Thus, true novel isoforms are hidden from the annotation, but present in the reads. We then compute precision and recall by running gffcompare²² for (1) the entire output annotation versus the complete set of expressed transcripts, (2) reported known isoforms versus the set of true known isoforms and (3) predicted novel transcript models versus the true novel set. The information on whether a transcript is known or novel is obtained from the output GTF file. The script for computing these metrics can be found in the IsoQuant repository in misc/reduced_db_gffcompare.py.
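The metrics used in these comparisons can be sketched as follows. This is not the gffcompare implementation or the script in misc/reduced_db_gffcompare.py; representing each transcript by a tuple of intron coordinates and requiring exact equality are simplifying assumptions, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(predicted, truth):
    """Compare two collections of transcripts given as hashable intron chains."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example for comparison (3): predicted novel models versus the true novel set.
true_novel_set = {((100, 200), (300, 400)), ((100, 250),), ((500, 600), (700, 800))}
predicted_novel = {((100, 200), (300, 400)), ((100, 250),), ((100, 260),), ((900, 950),)}
precision, recall, f1 = precision_recall_f1(predicted_novel, true_novel_set)
```

The same function applies to comparisons (1) and (2) by passing the corresponding pair of sets.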

For the annotation-free benchmarks we simply compared the entire output annotation with the true set of expressed isoforms using gffcompare.

To estimate how recall and precision of novel transcripts depend on the expression levels, predicted transcripts are grouped into bins by their transcripts per million (TPM) values. For computing recall, the number of false negative calls (undetected transcripts) in each TPM bin is required. We thus group transcripts by their TPM values used during the simulation. However, computing precision requires the number of false-positive predictions within each bin and thus only reported TPM values can be used (the true TPM for a false prediction is 0). Thus, the same transcript may fall into different bins when benchmarking different tools. Although it is not possible to compute precision and recall exactly for an arbitrary TPM range, the bias has a minor effect as only a small number of bins was used in this experiment (five). Therefore, despite being imperfect, these estimations can provide additional insights on whether a transcript discovery method has any bias toward high- or low-expressed isoforms.
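The asymmetry between the two binning schemes can be made concrete with a short sketch. The bin edges below are hypothetical (the text only states that five bins were used), as are the function names and toy TPM values.

```python
from bisect import bisect_right
from collections import Counter

BIN_EDGES = [1, 10, 50, 100]  # hypothetical edges defining five TPM bins


def tpm_bin(tpm):
    """Index of the bin (0..4) that a TPM value falls into."""
    return bisect_right(BIN_EDGES, tpm)


def binned_recall(true_tpm, predicted_ids):
    """Recall per bin, grouping true transcripts by their simulated TPM."""
    total, found = Counter(), Counter()
    for transcript_id, tpm in true_tpm.items():
        b = tpm_bin(tpm)
        total[b] += 1
        found[b] += transcript_id in predicted_ids
    return {b: found[b] / total[b] for b in total}


def binned_precision(reported_tpm, true_ids):
    """Precision per bin, grouping predictions by the TPM reported by the tool."""
    total, correct = Counter(), Counter()
    for transcript_id, tpm in reported_tpm.items():
        b = tpm_bin(tpm)
        total[b] += 1
        correct[b] += transcript_id in true_ids
    return {b: correct[b] / total[b] for b in total}


true_tpm = {"t1": 0.5, "t2": 5.0, "t3": 20.0, "t4": 200.0}
reported_tpm = {"t2": 12.0, "t4": 180.0, "x1": 3.0}  # x1 is a false prediction
recall_by_bin = binned_recall(true_tpm, set(reported_tpm))
precision_by_bin = binned_precision(reported_tpm, set(true_tpm))
```

In this toy case transcript t2 is counted in bin 1 for recall (simulated TPM of 5) but in bin 2 for precision (reported TPM of 12), which is exactly the bias discussed above.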

To evaluate SIRV transcripts we used an incomplete SIRV annotation containing only 43 out of 69 SIRV transcripts. The output annotations were again split into known and novel transcripts, and compared against the respective reference set using gffcompare. The SIRV-Set 4 annotations are available at https://www.lexogen.com/sirvs/download/.

Estimating consistency between annotations

Consistency between transcripts generated on real data was estimated using gffcompare (without providing a reference annotation). Based on gffcompare output, for each tool we computed how many of its transcripts are supported by (1) all four other tools, (2) exactly three other tools, (3) one or two other tools and (4) no other tool (possible false predictions). We also counted the number of potentially missed transcripts that were reported by all methods except the one being evaluated (possible false negative). This approach is implemented in misc/denovo_model_stats.py.
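The counting scheme can be sketched as below. This is not the code in misc/denovo_model_stats.py; it assumes exactly five tools and that equivalent transcripts from different tools have already been given a shared key (a job performed by gffcompare in the actual analysis). The function and category names are hypothetical.

```python
from collections import Counter


def support_categories(tool, predictions):
    """Count how a tool's transcripts are supported by the four other tools.

    predictions: dict mapping tool name -> set of transcript keys.
    """
    others = [name for name in predictions if name != tool]
    assert len(others) == 4, "the sketch assumes five tools in total"
    counts = Counter()
    for transcript in predictions[tool]:
        n_support = sum(transcript in predictions[o] for o in others)
        if n_support == 4:
            counts["all_four"] += 1
        elif n_support == 3:
            counts["three"] += 1
        elif n_support >= 1:
            counts["one_or_two"] += 1
        else:
            counts["none"] += 1  # possible false prediction
    # Transcripts reported by every other tool but not by this one.
    shared_by_others = set.intersection(*(predictions[o] for o in others))
    counts["missed"] = len(shared_by_others - predictions[tool])
    return counts


predictions = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 5},
    "D": {1, 2, 3, 5},
    "E": {1, 5},
}
stats_a = support_categories("A", predictions)
```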

Command line options

For PacBio data minimap2 was launched with the ‘splice:hq’ preset; for ONT data we used k-mer size 14 with the usual ‘splice’ preset. We also provided annotated splice junctions in BED format as an input. In each experiment, all tools were provided with the same BAM file and the same reference annotation. IsoQuant was launched with the default parameters, setting the appropriate data type via the ‘--data_type’ option. StringTie2 was launched with the ‘-L’ option. All other tools were run with the default parameters using 20 threads. In contrast to all other tools, Bambu outputs all reference transcripts, including unexpressed ones. Thus, we filtered out all transcripts with read count values <1 from the Bambu output. As recommended in the user manual, we also ran TALON using preliminary alignment correction with TranscriptClean²⁶ (https://github.com/mortazavilab/TALON). However, as the results with and without correction were almost identical, we decided to use the annotations obtained from raw data for a fair comparison. Complete information on all options and software versions is provided in Supplementary Table 9.

IsoQuant algorithm

To process long RNA reads, IsoQuant requires a reference genome and, optionally, a corresponding gene annotation. If the reads are provided in the FASTQ format, IsoQuant maps them to the reference with minimap2 in splice mode². Alternatively, a user may provide a sorted and indexed BAM file generated with a spliced aligner of their choice. If the reference annotation is provided, the IsoQuant algorithm includes four main steps: (1) assigning mapped reads to known isoforms, (2) transcript quantification, (3) alignment correction and (4) transcript model construction. In the annotation-free mode, the pipeline simply proceeds to the transcript discovery step. Below, we describe the key aspects of all four procedures.

Assigning long reads to known isoforms

The algorithm for assigning long reads to annotated isoforms is based on intron-chain matching and detecting exonic overlaps. To assign reads, IsoQuant processes each gene individually by extracting reads that map to the respective region from the sorted BAM file.

IsoQuant first processes the annotation to construct splice junction and exon profiles of all known isoforms. A set of annotated splice junctions in the gene is sorted according to their coordinates in the genome and enumerated from 1 to N. Thus, an annotated isoform can be represented as a vector of length N, in which the element at position i is set to 1 if this isoform includes the ith splice junction and −1 otherwise (Supplementary Fig. 3a). This vector is henceforth referred to as an isoform splice junction profile. The exon profile is constructed in a similar manner: all annotated exons are first split into a minimal set of M nonoverlapping fragments, such that every exon can be represented as their combination, and these exonic fragments are sorted and enumerated. The exon profile for an annotated isoform is similarly denoted as a vector of length M, where the ith element is set to 1 if this isoform contains the ith exon fragment and −1 otherwise (Supplementary Fig. 3b).
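A minimal sketch of both profile constructions is given below. It is an illustration rather than the IsoQuant source code: the function names are hypothetical, splice junctions are (donor, acceptor) tuples and exons are treated as half-open intervals [start, end) to keep the fragment arithmetic simple.

```python
def junction_profiles(isoforms):
    """Build splice junction profiles for annotated isoforms.

    isoforms: dict mapping isoform name -> list of (donor, acceptor) junctions.
    Returns the sorted junction list and a dict of +1/-1 vectors of length N.
    """
    junctions = sorted({j for chain in isoforms.values() for j in chain})
    profiles = {}
    for name, chain in isoforms.items():
        present = set(chain)
        profiles[name] = [1 if j in present else -1 for j in junctions]
    return junctions, profiles


def exon_fragments(exons):
    """Split half-open exons [start, end) into minimal nonoverlapping fragments."""
    bounds = sorted({b for start, end in exons for b in (start, end)})
    fragments = []
    for left, right in zip(bounds, bounds[1:]):
        # Keep a candidate fragment only if some exon fully covers it.
        if any(start <= left and right <= end for start, end in exons):
            fragments.append((left, right))
    return fragments


isoforms = {
    "iso1": [(200, 300), (400, 500)],
    "iso2": [(200, 300), (400, 520)],
    "iso3": [(200, 500)],
}
junctions, profiles = junction_profiles(isoforms)
fragments = exon_fragments([(100, 200), (150, 250), (300, 400)])
```

An exon profile is then obtained exactly like a junction profile, using the fragment list in place of the junction list.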

To assign a read to an annotated isoform, each splice junction from the alignment is matched against annotated splice junctions from the current gene and a read splice junction profile is constructed (also a vector of length N). In this vector the ith element is set to 1 if the annotated splice junction with index i matches a splice junction from the read, −1 if it is overlapped or spanned by the read, but no match is detected, and 0 otherwise. A zero value indicates that the splice junction is located outside the alignment region and therefore no information can be derived, for example due to read truncation. Similarly, the exon profile of the read is constructed based on M exonic fragments described above: 1 indicates that the respective exonic fragment is overlapped, −1 means it is spanned and 0 is set for exonic fragments outside the alignment region (Supplementary Fig. 4).
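The three-valued read profile can be sketched as follows. The example uses exact junction matching and treats any annotated junction that overlaps the alignment span as "overlapped or spanned"; both are simplifications of the actual procedure, and the function name is hypothetical.

```python
def read_junction_profile(read_start, read_end, read_junctions, annotated):
    """Build a 1 / -1 / 0 splice junction profile for one aligned read.

    annotated: sorted list of annotated (donor, acceptor) junctions in the gene.
    """
    observed = set(read_junctions)
    profile = []
    for donor, acceptor in annotated:
        if (donor, acceptor) in observed:
            profile.append(1)      # junction present in the read
        elif donor < read_end and acceptor > read_start:
            profile.append(-1)     # inside the aligned region but not matched
        else:
            profile.append(0)      # outside the alignment: no information
    return profile


annotated = [(200, 300), (200, 500), (400, 500), (400, 520), (700, 800)]
# A truncated read covering 150..380 with a single junction.
truncated_profile = read_junction_profile(150, 380, [(200, 300)], annotated)
```

The trailing zeros show that the truncated read neither supports nor contradicts the downstream junctions.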

Due to sequencing errors, an aligner may detect splice site positions inaccurately¹⁹. To avoid considering them as alternative or novel, the algorithm allows a small difference Δ between annotated and alignment splice site coordinates when matching splice junctions. Formally speaking, an annotated splice junction (x₁, x₂) matches a read splice junction (y₁, y₂) if |x₁ − y₁| ≤ Δ and |x₂ − y₂| ≤ Δ. The default Δ value varies for different types of input data: 4 bp is used for PacBio CCS reads and 6 bp for ONT reads (the value can also be set manually). Although an aligned read can be assigned to an isoform by simply comparing its intron chain and exonic coordinates to the annotation, vectorizing the alignment as described above allows one to easily implement inexact splice site comparison with a delta, and quickly detect candidate isoforms for read assignment.
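The matching rule translates directly into code. In the sketch below the dictionary keys are illustrative labels rather than confirmed option values, and the function name is hypothetical; the Δ defaults are those stated above.

```python
# Default tolerances from the text; the keys are labels chosen for this example.
DEFAULT_DELTA = {"pacbio_ccs": 4, "ont": 6}


def junctions_match(annotated, observed, delta):
    """True if |x1 - y1| <= delta and |x2 - y2| <= delta."""
    (x1, x2), (y1, y2) = annotated, observed
    return abs(x1 - y1) <= delta and abs(x2 - y2) <= delta


reference_junction = (1000, 2000)
shifted_junction = (1005, 2000)  # donor site shifted by 5 bp
match_pacbio = junctions_match(reference_junction, shifted_junction, DEFAULT_DELTA["pacbio_ccs"])
match_ont = junctions_match(reference_junction, shifted_junction, DEFAULT_DELTA["ont"])
```

A 5 bp shift is therefore tolerated for ONT reads but treated as a different junction for PacBio CCS reads.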

Further, to assign a read to an isoform, its exon and splice junction profiles are matched against the respective profiles of the annotated isoforms. The distance between two profiles is computed simply as the number of positions at which the two profiles differ and the read profile has a nonzero value. A read is said to be consistent with an isoform if the distances between their exon and splice junction profiles are 0, and the read has no unannotated splice junctions/exons (Supplementary Fig. 4). When a read is consistent with a single isoform, it is reported as a unique match. When a read is consistent with multiple isoforms simultaneously, it is classified as ambiguous, which may happen, for example, due to read truncation. If a read contains unannotated splice junctions/exons, or its profiles are not consistent with any isoform, it is marked as inconsistent. For such alignments IsoQuant reports the most similar reference transcript and detected alternative splicing events.
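The distance and the resulting classification can be sketched as follows. The profiles are toy vectors, the function names are hypothetical, and the reporting of the most similar reference transcript for inconsistent reads is omitted.

```python
def profile_distance(read_profile, isoform_profile):
    """Number of positions where the read is informative (nonzero) and disagrees."""
    return sum(1 for r, i in zip(read_profile, isoform_profile) if r != 0 and r != i)


def classify_read(read_sj, read_exon, isoforms, has_unannotated=False):
    """Classify a read as unique, ambiguous or inconsistent.

    isoforms: dict mapping name -> (splice junction profile, exon profile).
    """
    if has_unannotated:
        return "inconsistent", []
    matches = [
        name
        for name, (sj_profile, exon_profile) in isoforms.items()
        if profile_distance(read_sj, sj_profile) == 0
        and profile_distance(read_exon, exon_profile) == 0
    ]
    if len(matches) == 1:
        return "unique", matches
    if matches:
        return "ambiguous", matches
    return "inconsistent", []


toy_isoforms = {
    "iso1": ([1, -1, 1, -1], [1, 1, 1, -1]),
    "iso2": ([1, -1, -1, 1], [1, 1, -1, 1]),
    "iso3": ([-1, 1, -1, -1], [1, -1, 1, -1]),
}
full_read = classify_read([1, -1, 1, -1], [1, 1, 1, -1], toy_isoforms)
truncated_read = classify_read([1, -1, 0, 0], [1, 1, 0, 0], toy_isoforms)
conflicting_read = classify_read([1, 1, 0, 0], [1, 1, 0, 0], toy_isoforms)
```

The truncated read is ambiguous because its zero entries hide the only positions that distinguish iso1 from iso2.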

Some inconsistencies can be, however, caused by misalignments, rather than by real alternative splicing events¹⁹: (1) skipped short exons, (2) intron shifts exceeding Δ bp and (3) short unannotated exons at transcript ends (Supplementary Fig. 5). If an inconsistent alignment contains only these types of discrepancy, the read is reclassified as conditionally consistent.

Transcript quantification

Once long reads are assigned to annotated isoforms, quantification becomes rather trivial. Uniquely assigned reads are counted as a single detected transcript, while ambiguous reads are treated as multi-mappers and contribute to multiple assigned isoforms with lower weight. A transcript is reported as expressed only if it has at least one uniquely assigned read. Inconsistent reads are considered as potential novel isoforms and ignored during the quantification step. Besides genes and transcripts, IsoQuant can also count inclusion and exclusion abundances for separate exons and introns, which can be useful for computing percentage spliced-in values.
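A minimal sketch of this counting scheme is shown below. The text does not state the weight given to ambiguous reads; the example assumes that a read consistent with k isoforms contributes 1/k to each of them. The function name is hypothetical.

```python
from collections import defaultdict


def quantify(assignments):
    """Count reads per isoform.

    assignments: one list of isoform names per read; an empty list stands for
    an inconsistent read, which is ignored. The 1/k weighting of ambiguous
    reads is an assumption made for this example.
    """
    counts = defaultdict(float)
    uniquely_supported = set()
    for isoform_names in assignments:
        if not isoform_names:
            continue
        if len(isoform_names) == 1:
            uniquely_supported.add(isoform_names[0])
        for name in isoform_names:
            counts[name] += 1.0 / len(isoform_names)
    # Report only transcripts with at least one uniquely assigned read.
    return {name: c for name, c in counts.items() if name in uniquely_supported}


reads = [["iso1"], ["iso1", "iso2"], ["iso2", "iso3"], [], ["iso3"]]
abundances = quantify(reads)
```

Here iso2 receives only ambiguous reads and is therefore not reported as expressed.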

IsoQuant implements additional functionality for barcoded long RNA reads, for example barcoded by single-cell or spatial location²⁴,²⁷. A user can provide information on how the reads are grouped, for example, as a TSV file that indicates a barcode or a cell type of origin for every read. Isoform and gene abundances are then calculated for every read group separately, which can facilitate an expression comparison between different groups or cell types.

Spliced alignment correction

IsoQuant corrects each uniquely assigned read individually. If a read contains misalignments described above (Supplementary Fig. 5) or its intron chain is not identical to the intron chain of the assigned isoform, the alignment is corrected as follows. Short skipped exons are restored according to the annotation and minor splice junction shifts are replaced with the respective splice junctions from the assigned transcript. Unannotated terminal microexons are simply removed from the alignment. Finally, any unannotated splice site is substituted with the nearest site from the assigned transcript if (1) these splice sites are located within Δ bp and (2) read alignment contains sequencing errors near this splice site. Coordinates of corrected alignments are then saved in BED12 format.

Transcript model construction

The transcript reconstruction procedure implemented in IsoQuant includes four steps: (1) intron graph construction from read alignments, (2) intron graph simplification, (3) attaching terminal vertices and (4) construction of paths representing full-length transcripts. This stage does not require any information on reference transcripts and thus can be used for both de novo and annotation-based transcript discovery. Below we provide a detailed description of all algorithms and intuition behind them.

Intron graph construction

To construct transcript models, IsoQuant implements a concept of an intron graph, which was influenced by the previously designed splice graph approach²⁸, used, for example, in StringTie⁵. For a given set of transcripts, an intron graph is constructed as follows. First, we define internal vertices as a set of all splice junctions from all transcripts. Thus, each vertex represents a pair of splice sites (donor and acceptor) or, more formally, an ordered pair of coordinates in the genome. Two vertices are connected with a directed edge if the respective splice junctions are consecutive in any transcript. Finally, for every first or last splice junction in a transcript, the corresponding vertex is connected with a terminating vertex that represents the transcript start and end positions (formally, a single integer). The intron graph is a directed acyclic graph since every edge connects only consecutive elements. Each transcript can now be represented as a path in the graph that traverses from the initial to terminal vertex, where internal vertices denote its intron chain (Supplementary Fig. 6a).
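The definition can be sketched in a few lines of Python. The data layout (tuples for splice junctions, tagged tuples for terminal vertices) and the function names are assumptions made for this example, and mono-exonic transcripts are skipped because they contain no splice junctions.

```python
from collections import defaultdict


def build_intron_graph(transcripts):
    """Build an intron graph as an adjacency map.

    transcripts: list of (start, end, [(donor, acceptor), ...]).
    Internal vertices are splice junctions; terminal vertices are the tagged
    tuples ("start", position) and ("end", position).
    """
    edges = defaultdict(set)
    for start, end, introns in transcripts:
        if not introns:
            continue  # mono-exonic transcripts have no internal vertices
        edges[("start", start)].add(introns[0])
        for current, following in zip(introns, introns[1:]):
            edges[current].add(following)  # consecutive junctions in a transcript
        edges[introns[-1]].add(("end", end))
    return edges


def enumerate_paths(edges):
    """List all paths leading from a starting vertex to an end vertex."""
    paths = []

    def walk(vertex, path):
        if vertex[0] == "end":
            paths.append(path + [vertex])
            return
        for nxt in sorted(edges.get(vertex, ()), key=str):
            walk(nxt, path + [vertex])

    for vertex in sorted((v for v in edges if v[0] == "start"), key=str):
        walk(vertex, [])
    return paths


transcripts = [
    (100, 900, [(200, 300), (400, 500)]),
    (100, 900, [(200, 300), (400, 520)]),
]
graph = build_intron_graph(transcripts)
paths = enumerate_paths(graph)
```

The two transcripts share the first junction, so the graph branches after the vertex (200, 300) and yields one path per transcript.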

The described approach can be used to construct an intron graph from read alignments. Similarly to the read-to-isoform assignment procedure, the genes are processed by IsoQuant individually. First, the algorithm constructs a set of internal vertices corresponding to splice junctions from the selected alignments. Two vertices are likewise connected when the respective splice junctions are consecutive in any read alignment. Due to the presence of inexactly detected splice sites, which may remain even after the alignment correction, such a graph may contain false vertices and connections. These false nodes typically form topological patterns, such as tips and bulges. A tip is defined as a dead end (dead start) edge that has a starting (ending) vertex with outdegree (indegree) at least 2. A bulge consists of two alternative paths having the same start and end vertices (Supplementary Fig. 6b). Similar patterns are also typical for de Bruijn graphs, which are used for short read assembly, where bulges and tips are caused by sequencing errors. To remove tips and bulges assemblers exploit various techniques broadly called graph simplification²⁹,³⁰.

Intron graph simplification

Here we implement a graph simplification procedure based on the following observations: (1) a false splice junction is typically unannotated, (2) splice site shifts that cause a false intron are short and (3) the number of reads supporting the correct splice junction often exceeds read support of a false one. Formally, a bulge/tip is removed from the graph if it represents an unannotated splice junction that has at most half the read support of the alternative vertex and the alternative vertex has splice sites within 20 bp (10 bp for PacBio). In other cases, when an unannotated splice junction has a high read support or no similar splice junction exists, a bulge or a tip is likely to represent a part of a novel isoform and thus should be preserved (Supplementary Fig. 6b). Although intron graph simplification strongly resembles naive splice junction clustering, it has an important difference: a splice junction is removed not only based on its properties, such as splice site positions and read support, but based on the graph topology as well, thus considering adjacent splice junctions. Such a method allows one to, for example, preserve similar splice junctions from distinct isoforms. It is worth noting that the simplification procedure keeps track of all collapsed tips and bulges, thus preserving the possibility to later traverse alignments containing removed splice junctions through the graph.
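The removal criterion for a single candidate vertex can be sketched as below. The example covers only the per-junction test and assumes that the tip or bulge and its alternative vertex have already been identified from the graph topology; the function name and data layout are hypothetical.

```python
MAX_SHIFT = {"ont": 20, "pacbio": 10}  # bp, as stated in the text; keys are labels


def should_collapse(junction, alternative, support, annotated, max_shift):
    """Decide whether a tip/bulge junction is collapsed into its alternative.

    support: dict mapping junction -> number of supporting reads.
    annotated: set of annotated junctions.
    """
    if junction in annotated:
        return False
    # The candidate needs at most half the read support of the alternative.
    if 2 * support[junction] > support[alternative]:
        return False
    donor_shift = abs(junction[0] - alternative[0])
    acceptor_shift = abs(junction[1] - alternative[1])
    return donor_shift <= max_shift and acceptor_shift <= max_shift


annotated_junctions = {(200, 300)}
read_support = {(200, 300): 10, (200, 305): 3, (200, 308): 6, (200, 340): 2}
collapse_low = should_collapse((200, 305), (200, 300), read_support, annotated_junctions, MAX_SHIFT["ont"])
collapse_high = should_collapse((200, 308), (200, 300), read_support, annotated_junctions, MAX_SHIFT["ont"])
collapse_far = should_collapse((200, 340), (200, 300), read_support, annotated_junctions, MAX_SHIFT["ont"])
```

A well-supported or distant unannotated junction is kept, since it may belong to a novel isoform.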

Collecting terminal positions

After the graph is simplified, the algorithm proceeds to attach starting and terminal vertices. In contrast to annotated transcripts, read alignments do not provide the exact terminal positions, as their sequences can be truncated. Thus, to avoid having an extreme number of terminal vertices, terminal positions are detected using the heuristics presented below. Without loss of generality here we assume that the gene of interest is on the forward strand and polyA tails are on the right.

For every splice junction V in the graph, the algorithm selects only read alignments that contain V as a terminal splice junction and processes them as follows. First, the polyA sites are collected and clustered. Clustered polyA positions {p₁, …, p_k} are added to the graph as terminal vertices and connected to vertex V (Supplementary Fig. 7a). Further, the algorithm adds the rightmost non-polyA terminal position P as a terminal vertex if one of the conditions is satisfied: (1) V has no outgoing edges, (2) V has an outgoing edge to a splice junction (u₁, u₂) and P > u₁ + Δ or (3) V has adjacent polyA vertices {p₁, …, p_k} and P > max(p₁, …, p_k) + Δ (where Δ is the parameter defined above). Thus, a non-polyA terminal position can only be attached if it is located to the right of adjacent exons or polyA vertices. Starting positions are collected in a similar manner, but without looking for polyA sites (Supplementary Fig. 7b). The described approach, however, may lose information when several isoforms share the same starting splice junction but have distinct transcription start and end sites. Thus, we also apply an additional transcript correction, which is described below.
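The three conditions for attaching a non-polyA end can be written as a single predicate. The sketch below reads them literally: condition (1) is taken to mean that V has neither outgoing splice junctions nor polyA vertices, and condition (2) is taken to hold if it is satisfied for at least one outgoing junction. Both readings, as well as the function name, are assumptions.

```python
def attach_non_polya_end(position, next_junctions, polya_sites, delta):
    """Decide whether the rightmost non-polyA end P becomes a terminal vertex of V.

    next_junctions: (u1, u2) junctions reachable from V by an outgoing edge.
    polya_sites: clustered polyA positions already attached to V.
    """
    no_outgoing = not next_junctions and not polya_sites               # condition (1)
    beyond_junction = any(position > u1 + delta for u1, _ in next_junctions)  # (2)
    beyond_polya = bool(polya_sites) and position > max(polya_sites) + delta  # (3)
    return no_outgoing or beyond_junction or beyond_polya


delta = 6
dead_end = attach_non_polya_end(1000, [], [], delta)
past_next_donor = attach_non_polya_end(1000, [(900, 1100)], [], delta)
near_next_donor = attach_non_polya_end(903, [(900, 1100)], [], delta)
past_polya = attach_non_polya_end(1000, [], [990], delta)
near_polya = attach_non_polya_end(995, [], [990], delta)
```

Positions that fall within Δ of the next donor site or of a known polyA cluster are treated as truncated reads and do not create a new terminal vertex.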

Transcript discovery via path construction

Once the intron graph is constructed and simplified, IsoQuant detects full-length paths that connect starting and terminal vertices. Paths entirely supported by at least a single read alignment (that is, full-splice match) are marked as transcript prediction candidates (Supplementary Fig. 7c). To filter out unreliable novel transcripts IsoQuant applies read support cutoffs: at least five full-splice match reads (three for PacBio) and at least 2% of the maximum graph coverage. Since some isoforms may not have a full-splice matching alignment, IsoQuant also reports known transcripts that (1) have at least one uniquely assigned read and (2) can be traversed through the intron graph. It also reports known mono-exonic transcripts that have (1) a uniquely assigned read and (2) a confirmed polyA site.
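The read support cutoffs for novel candidates amount to a simple filter, sketched below. The example assumes that the 2% threshold is compared against the number of full-splice match reads of the candidate path; the function name and the data type labels are hypothetical.

```python
def passes_support_cutoffs(fsm_reads, max_graph_coverage, data_type):
    """Apply the absolute and relative read support cutoffs to a novel candidate."""
    min_reads = 3 if data_type == "pacbio" else 5
    relative_cutoff = 0.02 * max_graph_coverage
    return fsm_reads >= min_reads and fsm_reads >= relative_cutoff


ont_enough = passes_support_cutoffs(5, 100, "ont")
ont_too_few = passes_support_cutoffs(4, 100, "ont")
pacbio_enough = passes_support_cutoffs(3, 100, "pacbio")
low_relative = passes_support_cutoffs(5, 1000, "ont")  # 5 reads is below 2% of 1000
```

In highly covered genes the relative cutoff dominates, so a candidate with five supporting reads can still be discarded.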

To correct terminal positions of a novel transcript, the algorithm selects all alignments consistent with this transcript and uses them to extract terminal positions using the approach described above (Supplementary Fig. 7d). In contrast to detecting terminal vertices for the entire graph, where all alignments are used, the subset of consistent reads likely belongs specifically to this isoform and thus provides correct start and end positions. The resulting transcripts are saved in GTF format, providing additional information about transcript types and their reference genes.

While the previously designed splice graph structure and the intron graph implemented in this work are designed to represent alternatively spliced transcripts and, in general, are highly similar, there are a few differences that can be highlighted. First of all, the splice graph natively supports transcription start and polyA sites as well as mono-exonic transcripts. The intron graph, however, requires the introduction of additional types of ‘terminal vertex’ that denote transcript start and end positions. At the same time, any exonic overlap between alternative transcripts will lead to a merged node in the splice graph, while the intron graph requires an exact match of both splice sites between two transcripts to form a single connected component. Thus, the intron graph can potentially be less tangled for the genes containing multiple alternatively spliced isoforms and, therefore, less complex to traverse. Moreover, the intron graph natively provides information on neighboring splice junctions, which makes it easy to identify incorrectly detected splice sites caused by misalignments and to perform graph simplification. While this procedure can definitely be implemented within the splice graph concept, it seems to be more straightforward and native for the intron graph.

To evaluate how different steps of the transcript model construction algorithm affect recall and precision of IsoQuant, we performed a separate experiment described in Supplementary Note 1.