Telomere-to-telomere assembly of diploid chromosomes with Verkko

Science & Nature

Data availability

No new data were generated for this study. All assemblies generated in this paper are archived at Zenodo78 and we have provided convenient links to download both data and assemblies79. The data are also hosted in public databases: A. thaliana PRJCA005809, H. axyridis PRJEB45202, CHM13 PRJNA559484, HG002 SAMN03283347 and the HPRC AWS bucket80.

Code availability

Verkko code is available from GitHub81 and all code used for the paper is archived at Zenodo78.

References

  1. Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).

  2. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  3. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

  4. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).

  5. Nagarajan, N. & Pop, M. Sequencing and genome assembly using next-generation technologies. Methods Mol. Biol. 673, 1–17 (2010).

  6. Koren, S. & Phillippy, A. M. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr. Opin. Microbiol. 23C, 110–120 (2014).

  7. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

  8. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. https://doi.org/10.1101/gr.263566.120 (2020).

  9. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  10. Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).

  11. Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).

  12. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).

  13. Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).

  14. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  15. Bankevich, A., Bzikadze, A. V., Kolmogorov, M., Antipov, D. & Pevzner, P. A. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat. Biotechnol. 40, 1075–1081 (2022).

  16. Schwartz, D. C. et al. Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science 262, 110–114 (1993).

  17. Ghareghani, M. et al. Strand-seq enables reliable separation of long reads by chromosome via expectation maximization. Bioinformatics 34, i115–i123 (2018).

  18. Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 39, 302–308 (2021).

  19. O’Neill, K. et al. Assembling draft genomes using contiBAIT. Bioinformatics 33, 2737–2739 (2017).

  20. Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).

  21. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).

  22. Ghurye, J. et al. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput. Biol. 15, e1007273 (2019).

  23. Howe, K. et al. Significantly improving the quality of genome assemblies through curation. GigaScience 10, giaa153 (2021).

  24. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).

  25. Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2016).

  26. Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13, e1005595 (2017).

  27. Di Genova, A., Buena-Atienza, E., Ossowski, S. & Sagot, M.-F. Efficient hybrid de novo assembly of human genomes with WENGAN. Nat. Biotechnol. 39, 422–430 (2021).

  28. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018).

  29. Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods 9, 1107–1112 (2012).

  30. Sanders, A. D., Falconer, E., Hills, M., Spierings, D. C. J. & Lansdorp, P. M. Single-cell template strand sequencing by Strand-seq enables the characterization of individual homologs. Nat. Protoc. 12, 1151–1176 (2017).

  31. Miller, J. R. et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824 (2008).

  32. Idury, R. M. & Waterman, M. S. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291–306 (1995).

  33. Wang, B. et al. High-quality Arabidopsis thaliana genome assembly with Nanopore and HiFi long reads. Genomics Proteomics Bioinformatics 20, 4–13 (2021).

  34. Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat. Biotechnol. 40, 1332–1335 (2022).

  35. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).

  36. Mikheenko, A., Bzikadze, A. V., Gurevich, A., Miga, K. H. & Pevzner, P. A. TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats. Bioinformatics 36, i75–i83 (2020).

  37. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).

  38. Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).

  39. Boyes, D. et al. The genome sequence of the harlequin ladybird, Harmonia axyridis (Pallas, 1773). Wellcome Open Res. 7, 177 (2022).

  40. Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).

  41. Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at bioRxiv https://doi.org/10.1101/705616 (2019).

  42. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 1–26 (2016).

  43. Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).

  44. Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat. Biotechnol. https://doi.org/10.1038/s41587-022-01435-7 (2022).

  45. Rhie, A. et al. The complete sequence of a human Y chromosome. Preprint at bioRxiv https://doi.org/10.1101/2022.12.01.518724 (2022).

  46. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).

  47. Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).

  48. Porubsky, D. et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell 185, 1986–2005.e26 (2022).

  49. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  50. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  51. Porubsky, D. et al. breakpointR: an R/Bioconductor package to localize strand state changes in Strand-seq data. Bioinformatics 36, 1260–1261 (2020).

  52. Mohajeri, K. et al. Interchromosomal core duplicons drive both evolutionary instability and disease susceptibility of the Chromosome 8p23.1 region. Genome Res. 26, 1453–1467 (2016).

  53. McNulty, S. M. & Sullivan, B. A. Alpha satellite DNA biology: finding function in the recesses of the genome. Chromosome Res. 26, 115–138 (2018).

  54. Mahtani, M. M. & Willard, H. F. Pulsed-field gel analysis of α-satellite DNA at the human X chromosome centromere: high-frequency polymorphisms and array size estimate. Genomics 7, 607–613 (1990).

  55. Wevrick, R. & Willard, H. F. Physical map of the centromeric region of human chromosome 7: relationship between two distinct alpha satellite arrays. Nucleic Acids Res. 19, 2295–2301 (1991).

  56. Waye, J. S. & Willard, H. F. Chromosome specificity of satellite DNAs: short- and long-range organization of a diverged dimeric subset of human alpha satellite from chromosome 3. Chromosoma 97, 475–480 (1989).

  57. Waye, J. S. et al. Chromosome-specific alpha satellite DNA from human chromosome 1: hierarchical structure and genomic organization of a polymorphic domain spanning several hundred kilobase pairs of centromeric DNA. Genomics 1, 43–51 (1987).

  58. Willard, H. F. et al. Detection of restriction fragment length polymorphisms at the centromeres of human chromosomes by using chromosome-specific alpha satellite DNA probes: implications for development of centromere-based genetic linkage maps. Proc. Natl Acad. Sci. USA 83, 5611–5615 (1986).

  59. Wevrick, R. & Willard, H. F. Long-range organization of tandem arrays of alpha satellite DNA at the centromeres of human chromosomes: high-frequency array-length polymorphism and meiotic stability. Proc. Natl Acad. Sci. USA 86, 9394–9398 (1989).

  60. de Lima, L. G. et al. PCR amplicons identify widespread copy number variation in human centromeric arrays and instability in cancer. Cell Genomics 1, 100064 (2021).

  61. KeyGene. Maize B73 Oxford Nanopore duplex sequence data release. https://www.keygene.com/news-events/maize-b73-oxford-nanopore-duplex-sequence-data-release/ (2022).

  62. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science https://doi.org/10.1126/science.abl4178 (2022).

  63. Langley, S. A., Miga, K. H., Karpen, G. H. & Langley, C. H. Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA. eLife 8, e42989 (2019).

  64. Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).

  65. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  66. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).

  67. Rautiainen, M. & Marschall, T. MBG: minimizer-based sparse de Bruijn graph construction. Bioinformatics 37, 2476–2478 (2021).

  68. Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).

  69. Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).

  70. Vollger, M. R., Kerpedjiev, P., Phillippy, A. M. & Eichler, E. E. StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics 38, 2049–2051 (2022).

  71. Onodera, T., Sadakane, K. & Shibuya, T. in Algorithms in Bioinformatics (eds Darling, A. & Stoye, J.) 338–348 (Springer Berlin Heidelberg, 2013).

  72. Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).

  73. Edgar, R. Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences. PeerJ 9, e10805 (2021).

  74. Ferragina, P. & Manzini, G. Indexing compressed text. J. ACM 52, 552–581 (2005).

  75. Myers, G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 395–415 (1999).

  76. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

  77. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

  78. Koren, S. Verkko beta2 source and assemblies evaluated in manuscript. Zenodo https://doi.org/10.5281/zenodo.6618379 (2022).

  79. Koren, S. verkko publication readme. GitHub https://github.com/marbl/verkko/blob/master/paper/README.md (2022).

  80. HPRC HG002 public data. Amazon S3 https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix= (2022).

  81. Koren, S. verkko repository. GitHub https://github.com/marbl/verkko/ (2022).

  82. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

  83. Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).

  84. Smith, G. P. Evolution of repeated DNA sequences by unequal crossover. Science 191, 528–535 (1976).

  85. Alkan, C., Eichler, E. E., Bailey, J. A., Sahinalp, S. C. & Tüzün, E. The role of unequal crossover in alpha-satellite DNA evolution: a computational analysis. J. Comput. Biol. 11, 933–944 (2004).

  86. Alkan, C., Bailey, J. A., Eichler, E. E., Sahinalp, S. C. & Tuzun, E. An algorithmic analysis of the role of unequal crossover in alpha-satellite DNA evolution. Genome Inform. 13, 93–102 (2002).

  87. Schindelhauer, D. & Schwarz, T. Evidence for a fast, intrachromosomal conversion mechanism from mapping of nucleotide variants within a homogeneous α-satellite DNA array. Genome Res. 12, 1815–1826 (2002).

Acknowledgements

This work was supported, in part, by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (M.R., S.N., A.R., B.P.W., A.M.P. and S.K.) as well as by grants from the US National Institutes of Health (NIH grant nos. HG010169 and HG002385 to E.E.E.) and the National Institute of General Medical Sciences (NIGMS grant no. 1F32GM134558 to G.A.L.). E.E.E. is an investigator of the Howard Hughes Medical Institute. This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).

Author information

Author notes

  1. Sergey Nurk

    Present address: Oxford Nanopore Technologies, Oxford, UK

Authors and Affiliations

  1. Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA

    Mikko Rautiainen, Sergey Nurk, Brian P. Walenz, Arang Rhie, Adam M. Phillippy & Sergey Koren

  2. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA

    Glennis A. Logsdon, David Porubsky & Evan E. Eichler

  3. Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA

    Evan E. Eichler

Contributions

M.R., S.N., B.P.W. and S.K. were responsible for the methods and software development. G.A.L., D.P., A.R. and S.K. were responsible for data analysis and validation. E.E.E. and A.M.P. provided resources. M.R., S.N., A.M.P. and S.K. wrote the first draft of the manuscript. M.R., S.N., G.A.L., D.P., A.M.P. and S.K. prepared the figures. M.R., S.N., B.P.W., A.M.P. and S.K. edited the manuscript with the assistance of all authors. E.E.E., A.M.P. and S.K. supervised the study. M.R., S.N., A.M.P. and S.K. conceptualized the study.

Corresponding authors

Correspondence to Adam M. Phillippy or Sergey Koren.

Ethics declarations

Competing interests

E.E.E. is on the scientific advisory board of DNAnexus, Inc. S.K. has received travel funds to speak at events hosted by Oxford Nanopore Technologies. S.N. is an employee of Oxford Nanopore Technologies. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Rayan Chikhi, Anton Korobeynikov and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A. thaliana chromosome unitigs in Verkko (left) vs published assembly chromosomes evaluated by VerityMap (right).

From top to bottom, Chr1, Chr2, Chr3, Chr4, and Chr5. VerityMap compares the spacing of unique k-mers within the HiFi reads to the spacing observed in the assembly. Whenever there is a disagreement, the plot shows a spike at the discrepant location. The x-axis indicates the coordinates along the assembly contig or scaffold while the y-axis shows the fraction of disagreeing reads (0–100%). A disagreement greater than 50% is likely not a heterozygous variant but a true error in the assembly. The BED file produced by VerityMap also indicates the size of the discrepancy, estimated from the difference in k-mer spacing between the reads and the assembly.
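The 50% rule of thumb described in this caption can be sketched as a simple filter. The following is a minimal illustration only: the `Discrepancy` record and its field names are hypothetical and do not reflect the actual column layout of the VerityMap BED output, and the example values are made up.

```python
from typing import List, NamedTuple


class Discrepancy(NamedTuple):
    """Hypothetical record for one discrepant site (not VerityMap's real format)."""
    contig: str
    start: int
    end: int
    read_fraction: float  # fraction of reads disagreeing with the assembly, 0.0-1.0
    size_bp: int          # estimated size of the discrepancy


def likely_assembly_errors(records: List[Discrepancy],
                           min_fraction: float = 0.5) -> List[Discrepancy]:
    """Keep sites where more than min_fraction of reads disagree.

    Sites at or below the threshold may instead reflect heterozygous variation.
    """
    return [r for r in records if r.read_fraction > min_fraction]


# Illustrative values only.
records = [
    Discrepancy("chr1", 1_000, 2_000, 0.48, 300),
    Discrepancy("chr1", 50_000, 51_000, 0.92, 19_000),
    Discrepancy("chr2", 7_000, 7_500, 0.50, 120),
]
flagged = likely_assembly_errors(records)
```

The strict inequality mirrors the caption's wording ("greater than 50%"); a site at exactly 50% is not flagged.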

Extended Data Fig. 2 Verkko CHM13 assembly sub-graphs.

A. The remaining unresolved regions in CHM13 chromosomes 5, 9 and 16, visualized using Bandage69, with the correct resolution marked in red paths. Left: Chr5 has a spurious edge causing a cycle, and three spurious low-coverage nodes which were not removed by bubble popping since they are a part of the cycle. Middle: Chr9 has a spurious edge. Right: Chr16 has two spurious edges, and one missing edge (dashed red curve). The spurious non-genomic edges are caused by noisy ONT alignments switching between highly similar repeats in the LA graph, while the missing edge is caused by low HiFi coverage. B. rDNA cluster mixing in CHM13 chromosomes 13, 14, and 21, visualized using Bandage69. Each chromosome has a separate rDNA tangle. There are two cross-chromosomal connections by erroneous low coverage (<4x) nodes circled in red. For all three chromosomes, the remainder of the p and q arms are contained in the long unitigs shown.

Extended Data Fig. 3 VerityMap discrepant reads plot for CHM13 HiFi and ONT unitigs assembled by Verkko (left) and CHM13 v1.1 (ref. 14) (right).

A. The assemblies for Chromosome 4. The Verkko assembly has no regions where a large fraction of reads deviate, even though QUAST marks an error at approximately 52 Mb. This corresponds to a position in the reference with a large fraction of deviating reads and an estimated 19 kb discrepancy. B. The same for Chromosome 17. There are no regions with a large fraction (>50%) of discrepant reads in the Verkko assembly despite QUAST reporting an error at approximately 25 Mb on the reference. This corresponds to an approximately 3 kb discrepancy identified by VerityMap in CHM13 v1.1.

Extended Data Fig. 4 Merqury66 haplotype blob plots.

A. HG002 downsampled Verkko, B. HG002 downsampled DeepConsensus HiFi Verkko and C. HG002 full-coverage Verkko assemblies. The Hi-C phased assembly is on the left and the trio-phased assembly is on the right. Each contig/scaffold is a circle on the plot, with the size scaled based on contig/scaffold length. The x-axis shows the number of maternal markers while the y-axis shows the number of paternal markers. Contigs that lie along either the x-axis or y-axis show no haplotype errors and are consistently maternal or paternal. Contigs that mix haplotypes would appear along the diagonal but are not observed in these plots, indicating an accurately phased assembly.

Extended Data Fig. 5 IGV82 views of a recently published HG002 diploid assembly of paternal Chromosome 10 (ref. 11) (top) and the Verkko full-coverage trio assembly of the same chromosome (bottom).

The tracks show the maternal (red) and paternal (blue) markers. The centromere location is shown in gray. The published assembly has extensive switching within the centromere array, indicated by the presence of maternal markers and the absence of paternal markers. In contrast, the Verkko assembly centromere shows only paternal markers. The Verkko paternal centromere array is shorter but shows no signs of mis-assembly (Extended Data Fig. 8), indicating that the larger array in the published assembly is likely due to the incorrect insertion of maternal sequence. Overall, the Verkko assembly is more continuous than the published assembly, with 0 gaps versus 4, and has a lower Hamming error rate (0.03% versus 1.98%).

Extended Data Fig. 6 Strand-seq validation of the full-coverage Verkko trio assembly and HPRC manually curated assembly11.

The maternal haplotype is shown along the top row and the paternal along the bottom row. Leftmost: alignment-based scaffold assignment to the maternal haplotype (top) and paternal haplotype (bottom) for the full-coverage Verkko assembly. Almost all chromosomes are a single color, indicating that Verkko scaffolds resolved most chromosomes end-to-end. The only exceptions are in the acrocentrics, where some of the scaffolds could not be assigned due to low mappability, and maternal Chromosome 6 and paternal Chromosome 5, which are each composed of two large scaffolds. Over 99.7% of the scaffold bases could be assigned to chromosomes. Middle: the cluster assignment for the maternal haplotype (top) and paternal haplotype (bottom) based on Strand-seq data for the full-coverage Verkko assembly. Here, cluster ID is assigned to each 200 kb window in a scaffold. In case of large-scale chromosomal mis-joins, we expect to see multiple colors in a chromosome. The Verkko assembly is consistent with scaffolds all representing a single chromosome bin. Once again, >99.7% of the scaffold bases can be assigned using Strand-seq. Only 2 and 4 Mb of sequence not scaffolded by Verkko could be assigned to the maternal and paternal haplotypes, respectively. Right: the cluster assignment for the maternal haplotype (top) and the paternal haplotype (bottom) based on Strand-seq data for the HPRC manually curated assembly. Here, cluster ID is assigned to each 200 kb window in a scaffold. In case of large-scale chromosomal mis-joins, we expect to see multiple colors in a chromosome. A smaller fraction of contigs (and a slightly lower fraction of bases) was assigned than for the Verkko assembly, despite the combination of technologies and manual curation. This may be due to shorter contigs from unresolved repeats, which are resolved through Verkko’s ONT integration. There is also visible chromosome mixing within the acrocentric chromosomes, unlike in the Verkko result.

Extended Data Fig. 7 Strand-seq structural variant analysis for Verkko full-coverage assembly.

The states assigned to each scaffold in the paternal (A) and maternal (B) haplotypes of the full-coverage Verkko trio assembly. Strand-seq reads aligned to each assembly are genotyped based on their directionality into three possible strand states. In the Crick-Crick (‘cc’) state, both homologs in Strand-seq data map in direct orientation, and such regions are consistent with Strand-seq directional information. In the Watson-Watson (‘ww’) state, both homologs in Strand-seq data map in inverted orientation, indicating assembly misorientation or an unresolved homozygous inversion. Lastly, there are a few (<1% of bases) Watson-Crick (‘wc’) regions with a mixture of Watson and Crick reads; such regions indicate heterozygous inversions between haplotypes or low-mappability regions for short Strand-seq reads. C. The size of the heterozygous inversion versus the count of inversions of that size in the maternal and paternal haplotypes of the full-coverage Verkko trio assembly. These regions have confident Strand-seq alignments and normal copy number, so they indicate potential true heterozygous variation between the haplotypes. D. Strand-seq alignments to the reference Chromosome Y before it was corrected (top) and the full-coverage Verkko trio Chromosome Y assembly (bottom). Each plot shows Strand-seq directional read coverage reported as binned (bin size: 10,000, step size: 1,000) read counts represented as vertical bars above (teal; Crick read counts) and below (orange; Watson read counts) the midline. The top plot shows an inversion (dashed line) where directly oriented reads (Crick; teal) switch to inversely oriented reads (Watson; orange) and then back to directly oriented reads. The Verkko assembly, in contrast, is consistent, with only Crick reads present in the same location (dashed line).
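The binned directional coverage in panel D can be illustrated with a short sliding-window count. This is a sketch under stated assumptions, not the plotting code used for the figure: the caption gives the bin and step sizes without units, so base pairs are assumed here, reads are reduced to hypothetical start coordinates, and the function name is illustrative.

```python
from typing import List, Sequence


def binned_counts(positions: Sequence[int], length: int,
                  bin_size: int = 10_000, step: int = 1_000) -> List[int]:
    """Count read start positions in overlapping windows [start, start + bin_size).

    Windows begin every `step` bases (units assumed to be bp) and must fit
    entirely within a sequence of the given length.
    """
    counts = []
    for start in range(0, max(length - bin_size, 0) + 1, step):
        end = start + bin_size
        counts.append(sum(1 for p in positions if start <= p < end))
    return counts


# Toy 12 kb scaffold with made-up read starts, split by strand.
crick_starts = [500, 1_500, 10_500]    # directly oriented reads
watson_starts = [9_999, 11_000]        # inversely oriented reads

crick_bins = binned_counts(crick_starts, length=12_000)
watson_bins = binned_counts(watson_starts, length=12_000)
```

Plotting `crick_bins` above the midline and `watson_bins` below it gives the kind of view described in the caption; a run of windows dominated by Watson counts inside an otherwise Crick-only scaffold would correspond to the inversion signal.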

Extended Data Fig. 8 Full-coverage Verkko trio assemblies of chromosome 1 (a), 3 (b), 4 (c), 11 (d), 9 (e), 10 (f), 16 (g), and 18 (h) centromeric regions in the HG002 genome.

Both maternal and paternal haplotypes are shown, with repeat element annotation generated by RepeatMasker (Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0, 2013–2015; http://www.repeatmasker.org) shown on top, followed by PacBio HiFi coverage, ONT coverage, and StainedGlass70 plots. As with the Chromosome 19 centromeres (Fig. 4), the maternal and paternal haplotypes show large-scale structural variation, with alpha-satellite HOR array sizes varying by tens to hundreds of kb. Sites with discrepant HiFi mappings (low coverage or high coverage) are marked with an asterisk. There are few such sites in the centromeres, and the artifacts are localized and often inconsistent between ONT and HiFi alignments, indicating the assembly is overall of high quality. To further validate assembly accuracy, we intersected centromere array locations with VerityMap errors and found that in all but four cases (two on the Chr1 paternal centromere, Chr9 paternal centromere, and Chr10 maternal centromere), the errors were short (≤1 kb) or lower frequency (≤50% of the reads). VerityMap also identified one issue, with ≥50% of reads deviating in the Chr4 maternal centromere. However, this was not visible in the NucFreq37,83 plots above, and the region only had a total of three mapped reads.

Extended Data Fig. 9 Comparison of the HG002 maternal and paternal full-coverage Verkko trio assemblies for the centromeric regions of chromosomes 1 (a), 3 (b), 4 (c), 9 (d), 10 (e), 11 (f), 16 (g), 18 (h), and 19 (i) in the HG002 genome.

The plots show the similarity between the two haplotypes, with the maternal haplotype on the y-axis and the paternal haplotype on the x-axis. The centromeric regions show varying α-satellite HOR array sizes and sequence identity between the two haplotypes, consistent with earlier reports indicating that centromeric HOR arrays often expand and contract owing to their repetitive nature and their propensity for unequal crossing over84,85,86 and gene conversion87 events. For Chromosome 19, as in Fig. 4, the tracks show the repeat annotations and read coverages. The triangles show the self-similarity within each haplotype for comparison.

Extended Data Fig. 10 Examples of haplotype scaffolding by Rukki in the HG002 genome.

The nodes are colored according to their haplotype assignments. Nodes with at least 100 total markers where 90% of the markers agree are colored red for maternal or blue for paternal. Nodes with fewer than 100 markers are colored gray for unassigned. The haplotype paths are marked with solid curves, with dotted curves for gaps. (A) A well-behaved genomic region consisting of phased heterozygous bubbles, homozygous nodes, and spurious nodes caused by sequencing errors. Where possible, Rukki connects the nodes attributed to the same haplotype across the homozygous regions, producing two phased unitigs without gaps. (B) A tangle within one haplotype. Rukki scaffolds across the tangle (dotted line), reporting an estimated size of the tangled region. (C) A gap in the paternal haplotype. Rukki uses haplotype assignments and the topology of the graph to scaffold across the gap (dotted line), and estimates the size of the gap based on the size of the paired haplotype.
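
The coloring rule in this caption can be written as a small Python sketch. It mirrors only the thresholds stated above (at least 100 total markers, 90% agreement) and is not Rukki's implementation. The function name `node_color` and its parameters are hypothetical. The caption does not say how nodes with at least 100 markers but less than 90% agreement are drawn, so treating them as gray is an assumption.

```python
def node_color(maternal_markers, paternal_markers,
               min_markers=100, min_agreement=0.9):
    """Return the display color for a graph node from its marker counts.

    Hypothetical helper that follows the thresholds in the caption:
    red for maternal, blue for paternal, gray for unassigned.
    """
    total = maternal_markers + paternal_markers
    if total < min_markers:
        return "gray"  # too few markers to assign a haplotype
    if maternal_markers / total >= min_agreement:
        return "red"   # maternal
    if paternal_markers / total >= min_agreement:
        return "blue"  # paternal
    # Assumption: enough markers but no 90% majority is left unassigned.
    return "gray"


colors = [node_color(950, 50), node_color(5, 195), node_color(50, 40), node_color(60, 60)]
```

Here the third node has only 90 markers in total and the fourth has no 90% majority, so under the stated assumption both are left gray.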

About this article

Cite this article

Rautiainen, M., Nurk, S., Walenz, B.P. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko.
Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01662-6

  • DOI: https://doi.org/10.1038/s41587-023-01662-6
