Large language models generate functional protein sequences across diverse families

  • Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).

  • Lin, Y.-R. et al. Control over overall shape and size in de novo designed proteins. Proc. Natl Acad. Sci. USA 112, E5478–E5485 (2015).

  • Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).

  • Huang, P.-S. et al. De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy. Nat. Chem. Biol. 12, 29–34 (2016).

  • Boyken, S. E. et al. De novo design of protein homo-oligomers with modular hydrogen-bond network–mediated specificity. Science 352, 680–687 (2016).

  • Lapedes, A. S., Bertrand, G. G., LonChang, L. & Stormo, G. D. Correlated mutations in models of protein sequences: Phylogenetic and structural effects. Lect. Notes Monogr. Ser. 33, 236–256 (1999).

  • Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science 369, 440–445 (2020).

  • Hopf, T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584 (2019).

  • Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).

  • Ovchinnikov, S., Kamisetty, H. & Baker, D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife 3, e02030 (2014).

  • Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

  • Wu, Z. et al. Signal peptides generated by attention-based neural networks. ACS Synth. Biol. 9, 2154–2161 (2020).

  • Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).

  • Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  • Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat. Biotechnol. 39, 691–696 (2021).

  • Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 5, 613–623 (2021).

  • Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).

  • Moffat, L., Kandathil, S. M. & Jones, D. T. Design in the DARK: Learning deep generative models for De Novo Protein Design. Preprint at bioRxiv https://doi.org/10.1101/2022.01.27.478087 (2022).

  • Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).

  • Huang, B. et al. A backbone-centred energy function of neural networks for protein design. Nature 602, 523–528 (2022).

  • Leinonen, R. et al. UniProt archive. Bioinformatics 20, 3236–3237 (2004).

  • Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33, D154–D159 (2005).

  • Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).

  • Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS, 2017).

  • Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT, 2019).

  • Brown, T. B. et al. Language models are few-shot learners. In 34th Conference on Neural Information Processing Systems (NeurIPS, 2020).

  • Zellers, R. et al. Defending against neural fake news. In 33rd Conference on Neural Information Processing Systems (NeurIPS, 2019).

  • Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at arXiv https://doi.org/10.48550/arXiv.1909.05858 (2019).

  • AlQuraishi, M. The future of protein science will not be supervised. Some Thoughts on a Mysterious Universe https://moalquraishi.wordpress.com/2019/04/01/the-future-of-protein-science-will-not-be-supervised/ (2019).

  • Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

  • Elnaggar, A. et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).

  • Peters, M. E. et al. Deep contextualized word representations. In Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT, 2018).

  • Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL, 2018).

  • Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training. Preprint at https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).

  • Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

  • Pfaff, C. W. Constraints on language mixing: Intrasentential code-switching and borrowing in Spanish/English. Language 55, 291–318 (1979).

  • Poplack, S. Sometimes I’ll start a sentence in Spanish Y TERMINO EN ESPAÑOL: toward a typology of code-switching. Linguistics 18, 581–618 (1980).

  • Dathathri, S. et al. Plug and play language models: a simple approach to controlled text generation. In 8th International Conference on Learning Representations (ICLR, 2020).

  • Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).

  • Broendum, S. S., Buckle, A. M. & McGowan, S. Catalytic diversity and cell wall binding repeats in the phage-encoded endolysins. Mol. Microbiol. 110, 879–896 (2018).

  • Love, M. J., Abeysekera, G. S., Muscroft-Taylor, A. C., Billington, C. & Dobson, R. C. J. On the catalytic mechanism of bacteriophage endolysins: opportunities for engineering. Biochim. Biophys. Acta. Proteins Proteom. 1868, 140302 (2020).

  • Martin, P. P. Potts Models And Related Problems In Statistical Mechanics (World Scientific, 1991).

  • Thomas, J., Ramakrishnan, N. & Bailey-Kellogg, C. Graphical models of residue coupling in protein families. IEEE/ACM Trans. Comput. Biol. Bioinform. 5, 183–197 (2008).

  • Weigt, M., White, R. A., Szurmant, H., Hoch, J. A. & Hwa, T. Identification of direct residue contacts in protein–protein interaction by message passing. Proc. Natl Acad. Sci. USA 106, 67–72 (2009).

  • Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S.-I. & Langmead, C. J. Learning generative models for protein fold families. Proteins 79, 1061–1078 (2011).

  • Stein, R. R., Marks, D. S. & Sander, C. Inferring pairwise interactions from biological data using maximum-entropy probability models. PLoS Comput. Biol. 11, e1004182 (2015).

  • Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).

  • Mooers, B. H. M., Tronrud, D. E. & Matthews, B. W. Evaluation at atomic resolution of the role of strain in destabilizing the temperature-sensitive T4 lysozyme mutant Arg 96 → His. Protein Sci. 18, 863–870 (2009).

  • Baase, W. A., Liu, L., Tronrud, D. E. & Matthews, B. W. Lessons from the lysozyme of phage T4. Protein Sci. 19, 631–641 (2010).

  • Kuroki, R., Weaver, L. H. & Matthews, B. W. A covalent enzyme–substrate intermediate with saccharide distortion in a mutant T4 lysozyme. Science 262, 2030–2033 (1993).

  • Mchaourab, H. S., Oh, K. J., Fang, C. J. & Hubbell, W. L. Conformation of T4 lysozyme in solution. Hinge-bending motion and the substrate-induced conformational transition studied by site-directed spin labeling. Biochemistry 36, 307–316 (1997).

  • Kim, J.-K. et al. BetaCavityWeb: a webserver for molecular voids and channels. Nucleic Acids Res. 43, W413–W418 (2015).

  • Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).

  • Pearson, W. R. An introduction to sequence similarity (‘homology’) searching. Curr. Protoc. Bioinforma. 3, 3.1 (2013).

  • Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).

  • Ruder, S., Peters, M. E., Swayamdipta, S. & Wolf, T. Transfer learning in natural language processing. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics (eds Burstein, J., Doran, C. & Solorio, T.) (Association for Computational Linguistics, 2019).

  • Huh, M., Agrawal, P. & Efros, A. A. What makes ImageNet good for transfer learning? Preprint at arXiv https://doi.org/10.48550/arXiv.1608.08614 (2016).

  • LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  • Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl Acad. Sci. USA 118, e2017228118 (2021).

  • Anand, N. et al. Protein sequence design with a learned potential. Nat. Commun. 13, 746 (2022).

  • Federhen, S. The NCBI Taxonomy database. Nucleic Acids Res. 40, D136–D143 (2012).

  • Pettit, L. D. The IUPAC stability constants database. Chem. Int. 28, 14–15 (2006).

  • Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).

  • Bengio, Y., Ducharme, R., Vincent, P. & Janvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003).

  • Madani, A. et al. ProGen: language modeling for protein generation. Preprint at bioRxiv https://doi.org/10.1101/2020.03.07.982272 (2020).

  • Vig, J. et al. BERTology meets biology: Interpreting attention in protein language models. In International Conference on Learning Representations (ICLR, 2020).

  • Goyal, K., Dyer, C. & Berg-Kirkpatrick, T. Exposing the implicit energy networks behind masked language models via metropolis–hastings. In 10th International Conference on Learning Representations (ICLR, 2022).

  • Bhattacharya, N. et al. Single layers of attention suffice to predict protein contacts. Preprint at bioRxiv https://doi.org/10.1101/2020.12.21.423882 (2020).

  • Ramsauer, H. et al. Hopfield Networks is All You Need. Preprint at arXiv https://doi.org/10.48550/arXiv.2008.02217 (2020).

  • Alley, E., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

  • Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).

  • Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv https://doi.org/10.48550/arXiv.1412.6980 (2014).

  • Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning (eds. Dasgupta, S. & McAllester, D.) 1310–1318 (PMLR, 2013).

  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).

  • Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In 8th International Conference on Learning Representations (ICLR, 2020).

  • Goodfellow, I. J. et al. Generative adversarial networks. In 28th Conference on Neural Information Processing Systems (NIPS, 2014).

  • Koehn, P. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Machine Translation: From Real Users to Research 115–124 (Springer, 2004).

  • Sun, Z. Z. et al. Protocols for implementing an Escherichia coli based TX-TL cell-free expression system for synthetic biology. J. Vis. Exp. 16, e50762 (2013).

  • Kabsch, W. XDS. Acta Crystallogr. D Biol. Crystallogr. 66, 125–132 (2010).

  • McCoy, A. J. et al. Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007).

  • Kovalevskiy, O., Nicholls, R. A., Long, F., Carlon, A. & Murshudov, G. N. Overview of refinement procedures within REFMAC5: utilizing data from different sources. Acta Crystallogr. D Struct. Biol. 74, 215–227 (2018).

  • Terwilliger, T. C. et al. Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. Acta Crystallogr. D Biol. Crystallogr. 64, 61–69 (2008).

  • Hoh, S. W., Burnley, T. & Cowtan, K. Current approaches for automated model building into cryo-EM maps using Buccaneer with CCP-EM. Acta Crystallogr. D Struct. Biol. 76, 531–541 (2020).

  • Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D Biol. Crystallogr. 66, 486–501 (2010).

  • Afonine, P. V. et al. Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr. D Biol. Crystallogr. 68, 352–367 (2012).

  • Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint at arXiv https://doi.org/10.48550/arXiv.1910.10683 (2019).

  • Studier, F. W. Protein production by auto-induction in high density shaking cultures. Protein Expr. Purif. 41, 207–234 (2005).

  • Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
