Telomere-to-telomere assembly of diploid chromosomes with Verkko


Data availability

No new data were generated for this study. All assemblies generated in this paper are archived at Zenodo⁷⁸ and we have provided convenient links to download both data and assemblies⁷⁹. The data are also hosted in public databases: A. thaliana PRJCA005809, H. axyridis PRJEB45202, CHM13 PRJNA559484, HG002 SAMN03283347 and the HPRC AWS bucket⁸⁰.

Code availability

Verkko code is available from GitHub⁸¹ and all code used for the paper is archived at Zenodo⁷⁸.

References

Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
Nagarajan, N. & Pop, M. Sequencing and genome assembly using next-generation technologies. Methods Mol. Biol. 673, 1–17 (2010).
Koren, S. & Phillippy, A. M. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr. Opin. Microbiol. 23C, 110–120 (2014).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. https://doi.org/10.1101/gr.263566.120 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).
Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D. & Pevzner, P. A. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat. Biotechnol. 40, 1075–1081 (2022).
Schwartz, D. C. et al. Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science 262, 110–114 (1993).
Ghareghani, M. et al. Strand-seq enables reliable separation of long reads by chromosome via expectation maximization. Bioinformatics 34, i115–i123 (2018).
Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 39, 302–308 (2021).
O’Neill, K. et al. Assembling draft genomes using contiBAIT. Bioinformatics 33, 2737–2739 (2017).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Ghurye, J. et al. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput. Biol. 15, e1007273 (2019).
Howe, K. et al. Significantly improving the quality of genome assemblies through curation. GigaScience 10, giaa153 (2021).
Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2016).
Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13, e1005595 (2017).
Di Genova, A., Buena-Atienza, E., Ossowski, S. & Sagot, M.-F. Efficient hybrid de novo assembly of human genomes with WENGAN. Nat. Biotechnol. 39, 422–430 (2021).
Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018).
Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods 9, 1107–1112 (2012).
Sanders, A. D., Falconer, E., Hills, M., Spierings, D. C. J. & Lansdorp, P. M. Single-cell template strand sequencing by Strand-seq enables the characterization of individual homologs. Nat. Protoc. 12, 1151–1176 (2017).
Miller, J. R. et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824 (2008).
Idury, R. M. & Waterman, M. S. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291–306 (1995).
Wang, B. et al. High-quality Arabidopsis thaliana genome assembly with Nanopore and HiFi long reads. Genomics Proteomics Bioinformatics 20, 4–13 (2021).
Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat. Biotechnol. 40, 1332–1335 (2022).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Mikheenko, A., Bzikadze, A. V., Gurevich, A., Miga, K. H. & Pevzner, P. A. TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats. Bioinformatics 36, i75–i83 (2020).
Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).
Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
Boyes, D. et al. The genome sequence of the harlequin ladybird, Harmonia axyridis (Pallas, 1773). Wellcome Open Res. 7, 177 (2022).
Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).
Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at bioRxiv https://doi.org/10.1101/705616 (2019).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 1–26 (2016).
Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat. Biotechnol. https://doi.org/10.1038/s41587-022-01435-7 (2022).
Rhie, A. et al. The complete sequence of a human Y chromosome. Preprint at bioRxiv https://doi.org/10.1101/2022.12.01.518724 (2022).
Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
Porubsky, D. et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell 185, 1986–2005.e26 (2022).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Porubsky, D. et al. breakpointR: an R/Bioconductor package to localize strand state changes in Strand-seq data. Bioinformatics 36, 1260–1261 (2020).
Mohajeri, K. et al. Interchromosomal core duplicons drive both evolutionary instability and disease susceptibility of the Chromosome 8p23.1 region. Genome Res. 26, 1453–1467 (2016).
McNulty, S. M. & Sullivan, B. A. Alpha satellite DNA biology: finding function in the recesses of the genome. Chromosome Res. 26, 115–138 (2018).
Mahtani, M. M. & Willard, H. F. Pulsed-field gel analysis of α-satellite DNA at the human X chromosome centromere: high-frequency polymorphisms and array size estimate. Genomics 7, 607–613 (1990).
Wevrick, R. & Willard, H. F. Physical map of the centromeric region of human chromosome 7: relationship between two distinct alpha satellite arrays. Nucleic Acids Res. 19, 2295–2301 (1991).
Waye, J. S. & Willard, H. F. Chromosome specificity of satellite DNAs: short- and long-range organization of a diverged dimeric subset of human alpha satellite from chromosome 3. Chromosoma 97, 475–480 (1989).
Waye, J. S. et al. Chromosome-specific alpha satellite DNA from human chromosome 1: hierarchical structure and genomic organization of a polymorphic domain spanning several hundred kilobase pairs of centromeric DNA. Genomics 1, 43–51 (1987).
Willard, H. F. et al. Detection of restriction fragment length polymorphisms at the centromeres of human chromosomes by using chromosome-specific alpha satellite DNA probes: implications for development of centromere-based genetic linkage maps. Proc. Natl Acad. Sci. USA 83, 5611–5615 (1986).
Wevrick, R. & Willard, H. F. Long-range organization of tandem arrays of alpha satellite DNA at the centromeres of human chromosomes: high-frequency array-length polymorphism and meiotic stability. Proc. Natl Acad. Sci. USA 86, 9394–9398 (1989).
de Lima, L. G. et al. PCR amplicons identify widespread copy number variation in human centromeric arrays and instability in cancer. Cell Genomics 1, 100064 (2021).
KeyGene. Maize B73 Oxford Nanopore duplex sequence data release. https://www.keygene.com/news-events/maize-b73-oxford-nanopore-duplex-sequence-data-release/ (2022).
Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science https://doi.org/10.1126/science.abl4178 (2022).
Langley, S. A., Miga, K. H., Karpen, G. H. & Langley, C. H. Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA. eLife 8, e42989 (2019).
Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Rautiainen, M. & Marschall, T. MBG: minimizer-based sparse de Bruijn graph construction. Bioinformatics 37, 2476–2478 (2021).
Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).
Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).
Vollger, M. R., Kerpedjiev, P., Phillippy, A. M. & Eichler, E. E. StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics 38, 2049–2051 (2022).
Onodera, T., Sadakane, K. & Shibuya, T. in Algorithms in Bioinformatics (eds Darling, A. & Stoye, J.) 338–348 (Springer Berlin Heidelberg, 2013).
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).
Edgar, R. Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences. PeerJ 9, e10805 (2021).
Ferragina, P. & Manzini, G. Indexing compressed text. J. ACM 52, 552–581 (2005).
Myers, G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 395–415 (1999).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Koren, S. Verkko beta2 source and assemblies evaluated in manuscript. Zenodo https://doi.org/10.5281/zenodo.6618379 (2022).
Koren, S. verkko publication readme. GitHub https://github.com/marbl/verkko/blob/master/paper/README.md (2022).
HPRC HG002 public data. Amazon S3 https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix= (2022).
Koren, S. verkko repository. GitHub https://github.com/marbl/verkko/ (2022).
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).
Smith, G. P. Evolution of repeated DNA sequences by unequal crossover. Science 191, 528–535 (1976).
Alkan, C., Eichler, E. E., Bailey, J. A., Sahinalp, S. C. & Tüzün, E. The role of unequal crossover in alpha-satellite DNA evolution: a computational analysis. J. Comput. Biol. 11, 933–944 (2004).
Alkan, C., Bailey, J. A., Eichler, E. E., Sahinalp, S. C. & Tuzun, E. An algorithmic analysis of the role of unequal crossover in alpha-satellite DNA evolution. Genome Inform. 13, 93–102 (2002).
Schindelhauer, D. & Schwarz, T. Evidence for a fast, intrachromosomal conversion mechanism from mapping of nucleotide variants within a homogeneous α-satellite DNA array. Genome Res. 12, 1815–1826 (2002).


Acknowledgements

This work was supported, in part, by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (M.R., S.N., A.R., B.P.W., A.M.P. and S.K.) as well as by grants from the US National Institutes of Health (NIH grant nos. HG010169 and HG002385 to E.E.E.) and the National Institute of General Medical Sciences (NIGMS grant no. 1F32GM134558 to G.A.L.). E.E.E. is an investigator of the Howard Hughes Medical Institute. This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).

Author information

Author notes

Sergey Nurk
Present address: Oxford Nanopore Technologies, Oxford, UK

Authors and Affiliations

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Mikko Rautiainen, Sergey Nurk, Brian P. Walenz, Arang Rhie, Adam M. Phillippy & Sergey Koren
Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
Glennis A. Logsdon, David Porubsky & Evan E. Eichler
Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
Evan E. Eichler

Contributions

M.R., S.N., B.P.W. and S.K. were responsible for the methods and software development. G.A.L., D.P., A.R. and S.K. were responsible for data analysis and validation. E.E.E. and A.M.P. provided resources. M.R., S.N., A.M.P. and S.K. wrote the first draft of the manuscript. M.R., S.N., G.A.L., D.P., A.M.P. and S.K. prepared the figures. M.R., S.N., B.P.W., A.M.P. and S.K. edited the manuscript with the assistance of all authors. E.E.E., A.M.P. and S.K. supervised the study. M.R., S.N., A.M.P. and S.K. conceptualized the study.

Corresponding authors

Correspondence to Adam M. Phillippy or Sergey Koren.

Ethics declarations

Competing interests

E.E.E. is on the scientific advisory board of DNAnexus, Inc. S.K. has received travel funds to speak at events hosted by Oxford Nanopore Technologies. S.N. is an employee of Oxford Nanopore Technologies. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Rayan Chikhi, Anton Korobeynikov and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A. thaliana chromosome unitigs in Verkko (left) vs published assembly chromosomes evaluated by VerityMap (right).

From top to bottom, Chr1, Chr2, Chr3, Chr4, and Chr5. VerityMap compares the spacing of unique k-mers within the HiFi reads to the spacing observed in the assembly. Whenever there is a disagreement, the plot shows a spike at the discrepant location. The x-axis indicates the coordinates along the assembly contig or scaffold while the y-axis shows the fraction of disagreeing reads (0–100%). A disagreement greater than 50% is likely not a heterozygous variant but a true error in the assembly. The BED file produced by VerityMap also indicates the size of the discrepancy, estimated from the difference in k-mer spacing between the reads and the assembly.
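The 50% rule of thumb described in this caption can be written as a simple filter. The sketch below is illustrative only: the record layout, the function name and the example values are assumptions made for this example, and they do not reproduce VerityMap's actual BED schema or output.

```python
from typing import NamedTuple


class Discrepancy(NamedTuple):
    """Hypothetical record layout; not VerityMap's real output format."""
    contig: str
    position: int
    disagreeing_fraction: float  # fraction of reads that disagree, 0.0-1.0
    size_bp: int                 # estimated size of the discrepancy


def likely_assembly_errors(records, min_fraction=0.5):
    """Keep sites where more than min_fraction of the reads disagree."""
    return [r for r in records if r.disagreeing_fraction > min_fraction]


# Made-up example values, not taken from the paper.
records = [
    Discrepancy("Chr1", 1_200_000, 0.35, 800),    # could be a heterozygous variant
    Discrepancy("Chr2", 3_000_000, 0.92, 5_000),  # most reads disagree
]
flagged = likely_assembly_errors(records)
print([r.contig for r in flagged])
```

Sites below the threshold are not necessarily correct; they are only less likely to be true assembly errors under this rule.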

Extended Data Fig. 2 Verkko CHM13 assembly sub-graphs.

A. The remaining unresolved regions in CHM13 chromosomes 5, 9 and 16, visualized using Bandage⁶⁹, with the correct resolution marked in red paths. Left: Chr5 has a spurious edge causing a cycle, and three spurious low-coverage nodes which were not removed by bubble popping since they are a part of the cycle. Middle: Chr9 has a spurious edge. Right: Chr16 has two spurious edges, and one missing edge (dashed red curve). The spurious non-genomic edges are caused by noisy ONT alignments switching between highly similar repeats in the LA graph, while the missing edge is caused by low HiFi coverage. B. rDNA cluster mixing in CHM13 chromosomes 13, 14, and 21, visualized using Bandage⁶⁹. Each chromosome has a separate rDNA tangle. There are two cross-chromosomal connections by erroneous low coverage (<4x) nodes circled in red. For all three chromosomes, the remainder of the p and q arms are contained in the long unitigs shown.

Extended Data Fig. 3 VerityMap discrepant reads plot for CHM13 HiFi and ONT unitigs assembled by Verkko (left) and CHM13 v1.1¹⁴ (right).

A. The assemblies for Chromosome 4. The Verkko assembly has no regions where a large fraction of reads deviate, even though QUAST marks an error at approximately 52 Mb. This corresponds to a position in the reference with a large fraction of deviating reads and an estimated 19 kb discrepancy. B. The same for Chromosome 17. There are no regions with a large fraction (>50%) of discrepant reads in the Verkko assembly, despite QUAST reporting an error at approximately 25 Mb on the reference. This corresponds to an approximately 3 kb discrepancy identified by VerityMap in CHM13 v1.1.

Extended Data Fig. 4 Merqury⁶⁶ haplotype blob plots.

A. HG002 downsampled Verkko, B. HG002 downsampled DeepConsensus HiFi Verkko and C. HG002 full-coverage Verkko assemblies. The Hi-C phased assembly is on the left and the trio-phased assembly is on the right. Each contig/scaffold is a circle on the plot, with the size scaled based on contig/scaffold length. The x-axis shows the number of maternal markers while the y-axis shows the number of paternal markers. Contigs that lie along either the x-axis or the y-axis show no haplotype errors and are consistently maternal or paternal. Contigs that mix haplotypes would appear along the diagonal; none are observed in these plots, indicating an accurately phased assembly.
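How far a contig sits from the axes of such a plot can be summarized with a single number. The sketch below is an illustration and not Merqury's own code or metric: the function name and the marker counts are made up for this example. It computes the share of markers belonging to the minority haplotype, which is 0 for a contig on an axis and approaches 0.5 for a contig on the diagonal.

```python
def minority_marker_fraction(maternal: int, paternal: int) -> float:
    """Share of a contig's markers that come from its minority haplotype."""
    total = maternal + paternal
    if total == 0:
        return 0.0  # no markers at all, so nothing to contradict
    return min(maternal, paternal) / total


# Made-up (maternal, paternal) marker counts for three hypothetical contigs.
contigs = {
    "contig_a": (5000, 0),     # lies on the x-axis: purely maternal
    "contig_b": (0, 4200),     # lies on the y-axis: purely paternal
    "contig_c": (1500, 1500),  # lies on the diagonal: mixed haplotypes
}
for name, (mat, pat) in contigs.items():
    print(name, minority_marker_fraction(mat, pat))
```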

Extended Data Fig. 5 IGV⁸² views of a recently published HG002 diploid assembly of paternal Chromosome 10¹¹ (top) and the Verkko full-coverage trio assembly of the same chromosome (bottom).

The tracks show the maternal (red) and paternal (blue) markers. The centromere location is shown in gray. The published assembly has extensive switching within the centromere array, indicated by the presence of maternal markers and the absence of paternal markers. In contrast, the Verkko assembly centromere shows only paternal markers. The Verkko paternal centromere array is shorter but shows no signs of mis-assembly (Extended Data Fig. 8), indicating that the larger array in the published assembly is likely due to the incorrect insertion of maternal sequence. Overall, the Verkko assembly is more continuous (0 gaps versus 4) and has a lower Hamming error rate (0.03% versus 1.98%) than the published assembly.

Extended Data Fig. 6 Strand-seq validation of the full-coverage Verkko trio assembly and HPRC manually curated assembly¹¹.

The maternal haplotype is shown along the top row and the paternal along the bottom row. Leftmost: alignment-based scaffold assignment to the maternal haplotype (top) and paternal haplotype (bottom) for the full-coverage Verkko assembly. Almost all chromosomes are a single color, indicating that Verkko scaffolds resolved most chromosomes end-to-end. The only exceptions are the acrocentrics, where some of the scaffolds could not be assigned due to low mappability, and maternal Chromosome 6 and paternal Chromosome 5, which are each composed of two large scaffolds. Over 99.7% of the scaffold bases could be assigned to chromosomes. Middle: the cluster assignment for the maternal haplotype (top) and paternal haplotype (bottom) based on Strand-seq data for the full-coverage Verkko assembly. Here, a cluster ID is assigned to each 200 kb window in a scaffold. In the case of large-scale chromosomal mis-joins, we would expect to see multiple colors in a chromosome. The Verkko assembly is consistent with scaffolds all representing a single chromosome bin. Once again, >99.7% of the scaffold bases can be assigned using Strand-seq. Only 2 and 4 Mb of sequence not scaffolded by Verkko could be assigned to the maternal and paternal haplotypes, respectively. Right: the cluster assignment for the maternal haplotype (top) and the paternal haplotype (bottom) based on Strand-seq data for the HPRC manually curated assembly. Here, a cluster ID is assigned to each 200 kb window in a scaffold. In the case of large-scale chromosomal mis-joins, we would expect to see multiple colors in a chromosome. A smaller fraction of contigs (and a slightly lower fraction of bases) was assigned than for the Verkko assembly, despite the combination of technologies and manual curation. This may be due to shorter contigs from unresolved repeats, which are resolved through Verkko’s ONT integration. There is also visible chromosome mixing within the acrocentric chromosomes, unlike in the Verkko result.

Extended Data Fig. 7 Strand-seq structural variant analysis for Verkko full-coverage assembly.

The states assigned to each scaffold in the paternal (A) and maternal (B) haplotypes of the full-coverage Verkko trio assembly. Strand-seq reads aligned to each assembly are genotyped, based on their directionality, into three possible strand states. In the Crick-Crick (‘cc’) state, both homologs in the Strand-seq data map in direct orientation, so such regions are consistent with Strand-seq directional information. In the Watson-Watson (‘ww’) state, both homologs in the Strand-seq data map in inverted orientation, which indicates an assembly misorientation or an unresolved homozygous inversion. Lastly, there are a few (<1% of bases) Watson-Crick (‘wc’) regions with a mixture of Watson and Crick reads; such regions indicate heterozygous inversions between haplotypes or low-mappability regions for short Strand-seq reads. C. The size of the heterozygous inversion versus the count of inversions of that size in the maternal and paternal haplotypes of the full-coverage Verkko trio assembly. These regions have confident Strand-seq alignments and normal copy number, so they indicate potential true heterozygous variation between the haplotypes. D. Strand-seq alignments to the reference Chromosome Y before it was corrected (top) and to the full-coverage Verkko trio Chromosome Y assembly (bottom). Each plot shows Strand-seq directional read coverage reported as binned (bin size: 10,000, step size: 1,000) read counts represented as vertical bars above (teal; Crick read counts) and below (orange; Watson read counts) the midline. The top plot shows an inversion (dashed line) where directly oriented reads (Crick; teal) switch to inversely oriented reads (Watson; orange) and then back to directly oriented reads. The Verkko assembly, in contrast, shows only Crick reads at the same location (dashed line).
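The binned read counts in panel D can be pictured as a sliding-window tally. The sketch below is a minimal illustration, not the plotting code used for the figure: it assumes the bin and step sizes are given in base pairs (the caption does not state the unit), and the function name and read positions are made up for this example.

```python
from bisect import bisect_left


def sliding_window_counts(positions, length, bin_size=10_000, step=1_000):
    """Count reads whose start falls in each window [start, start + bin_size)."""
    ordered = sorted(positions)
    counts = []
    for start in range(0, max(length - bin_size, 0) + 1, step):
        lo = bisect_left(ordered, start)
        hi = bisect_left(ordered, start + bin_size)
        counts.append((start, hi - lo))
    return counts


# Made-up read start positions on a 12 kb toy scaffold.
crick = [0, 500, 9_999, 10_000]
watson = [11_500]
crick_bins = sliding_window_counts(crick, 12_000)    # bars drawn above the midline
watson_bins = sliding_window_counts(watson, 12_000)  # bars drawn below the midline
print(crick_bins)
print(watson_bins)
```

In a plot of this kind, a run of windows where the Watson counts dominate between flanking Crick-dominated windows would correspond to the inversion signature described above.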

Extended Data Fig. 8 Full-coverage Verkko trio assemblies of chromosome 1 (a), 3 (b), 4 (c), 11 (d), 9 (e), 10 (f), 16 (g), and 18 (h) centromeric regions in the HG002 genome.

Both maternal and paternal haplotypes are shown, with repeat element annotation generated by RepeatMasker (Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0, 2013–2015; http://www.repeatmasker.org) shown on top, followed by PacBio HiFi coverage, ONT coverage, and StainedGlass⁷⁰ plots. As with the Chromosome 19 centromeres (Fig. 4), the maternal and paternal haplotypes show large-scale structural variation, with alpha-satellite HOR array sizes varying by tens to hundreds of kb. Sites with discrepant HiFi mappings (low coverage or high coverage) are marked with an asterisk. There are few such sites in the centromeres, and the artifacts are localized and often inconsistent between ONT and HiFi alignments, indicating that the assembly is overall of high quality. To further validate assembly accuracy, we intersected centromere array locations with VerityMap errors and found that in all but four cases (two on the Chr1 paternal centromere, Chr9 paternal centromere, and Chr10 maternal centromere), the errors were short (≤1 kb) or lower frequency (≤50% of the reads). VerityMap also identified one issue, with ≥50% of reads deviating, in the Chr4 maternal centromere. However, this was not visible in the NucFreq³⁷,⁸³ plots above, and the region only had a total of three mapped reads.

Extended Data Fig. 9 Comparison of the HG002 maternal and paternal full-coverage Verkko trio assemblies for the centromeric regions of chromosomes 1 (a), 3 (b), 4 (c), 9 (d), 10 (e), 11 (f), 16 (g), 18 (h), and 19 (i) in the HG002 genome.

The plots show the similarity between the two haplotypes, with the maternal haplotype on the y-axis and the paternal on the x-axis. The centromeric regions show varying α-satellite HOR array sizes and sequence identity between the two haplotypes, consistent with earlier reports that indicate that centromeric HOR arrays often expand and contract due to their repetitive nature and their propensity for unequal crossing over⁸⁴,⁸⁵,⁸⁶ and gene conversion⁸⁷ events. For Chromosome 19, as in Fig. 4, the tracks show the repeat annotations and read coverages. The triangles show the self-similarity within each haplotype for comparison.

Extended Data Fig. 10 Examples of haplotype scaffolding by Rukki in the HG002 genome.

The nodes are colored according to their haplotype assignments. Nodes with at least 100 total markers where 90% of the markers agree are colored red for maternal or blue for paternal. Nodes with fewer than 100 markers are colored gray for unassigned. The haplotype paths are marked with solid curves, with dotted curves for gaps. (A) A well-behaved genomic region consisting of phased heterozygous bubbles, homozygous nodes, and spurious nodes caused by sequencing errors. Where possible, Rukki connects the nodes attributed to the same haplotype across the homozygous regions, producing two phased unitigs without gaps. (B) A tangle within one haplotype. Rukki scaffolds across the tangle (dotted line), reporting an estimated size of the tangled region. (C) A gap in the paternal haplotype. Rukki uses haplotype assignments and the topology of the graph to scaffold across the gap (dotted line), and estimates the size of the gap based on the size of the paired haplotype.
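The coloring rule stated at the start of this caption can be sketched as a small function. This is an illustration of the rule as described here, not Rukki's implementation: the function and parameter names are hypothetical, "90% of the markers agree" is read as "at least 90%", and nodes that have enough markers but fall below the agreement threshold are treated as unassigned, which the caption does not specify.

```python
def node_color(maternal: int, paternal: int,
               min_markers: int = 100, min_agreement_pct: int = 90) -> str:
    """Color a graph node from its maternal and paternal marker counts."""
    total = maternal + paternal
    if total < min_markers:
        return "gray"  # too few markers: unassigned
    # Integer arithmetic avoids floating-point issues at the 90% boundary.
    if maternal * 100 >= min_agreement_pct * total:
        return "red"   # maternal
    if paternal * 100 >= min_agreement_pct * total:
        return "blue"  # paternal
    return "gray"      # assumption: ambiguous nodes are left unassigned


# Made-up marker counts for illustration.
for counts in [(950, 12), (3, 480), (40, 20), (300, 280)]:
    print(counts, node_color(*counts))
```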

About this article

Cite this article

Rautiainen, M., Nurk, S., Walenz, B.P. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01662-6


Received: 24 June 2022
Accepted: 03 January 2023
Published: 16 February 2023
DOI: https://doi.org/10.1038/s41587-023-01662-6
