NVIDIA Extends Large Language Models to Biology

As scientists explore new insights into DNA, proteins, and other building blocks of life, the NVIDIA BioNeMo framework, announced today at NVIDIA GTC, will accelerate their research.

NVIDIA BioNeMo is a framework for training and deploying large biomolecular language models at supercomputing scale, helping scientists better understand disease and find therapies for patients. The Large Language Model (LLM) framework will support chemistry, protein, DNA, and RNA data formats.

It is part of the NVIDIA Clara Discovery collection of AI frameworks, applications, and models for drug discovery.

Just as AI learns to understand human languages with LLMs, it can also learn the languages of biology and chemistry. By facilitating the training of massive neural networks on biomolecular data, NVIDIA BioNeMo helps researchers discover new patterns and insights in biological sequences – information that researchers can associate with biological properties, functions, or even human health conditions.

NVIDIA BioNeMo provides a framework for scientists to train large-scale language models using larger datasets, resulting in better performing neural networks. The framework will be available in early access on NVIDIA NGC, a hub for GPU-optimized software.

In addition to the language model framework, NVIDIA BioNeMo has a cloud API service that will support a growing list of pre-trained AI models.

BioNeMo framework supports larger models, better predictions

Scientists today using natural language processing models on biological data often train relatively small neural networks that require custom pre-processing. By adopting BioNeMo, they can scale up to LLMs with billions of parameters that capture information about molecular structure, protein solubility, and more.

BioNeMo is an extension of the NVIDIA NeMo Megatron framework for GPU-accelerated training of large-scale self-supervised language models. It is domain-specific, designed to support molecular data represented in SMILES notation for chemical structures and in FASTA sequence strings for amino acids and nucleic acids.
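To make these notations concrete, here is a minimal, framework-independent Python sketch of how a SMILES string and a FASTA record look as plain text that a language model can tokenize. The peptide record and the character-level tokenizer are invented for illustration; production pipelines use more sophisticated tokenizers.

```python
# SMILES: a text encoding of a chemical structure.
# Example: aspirin (acetylsalicylic acid).
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"

# FASTA: a '>' header line followed by the sequence itself.
# This short peptide is a made-up example in one-letter amino acid codes.
fasta_record = """>example_peptide hypothetical
MKTAYIAKQR"""

def parse_fasta(text):
    """Split a single FASTA record into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0].lstrip(">")
    sequence = "".join(lines[1:])
    return header, sequence

def char_tokenize(s):
    """Naive character-level tokenization. Real SMILES tokenizers use
    multi-character tokens (e.g. 'Cl', 'Br'); this is only a sketch."""
    return list(s)

header, seq = parse_fasta(fasta_record)
print(header)   # example_peptide hypothetical
print(seq)      # MKTAYIAKQR
print(char_tokenize(aspirin_smiles)[:6])
```

Treated this way, molecules and proteins become token sequences, which is what lets the same self-supervised LLM training recipes apply to biomolecular data.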

“The framework enables researchers in the healthcare and life sciences industry to take advantage of their rapidly growing biological and chemical datasets,” said Mohammed AlQuraishi, founding member of the OpenFold Consortium and assistant professor in the Department of Systems Biology at Columbia University. “This facilitates the discovery and design of therapies that precisely target a disease’s molecular signature.”

The BioNeMo service offers LLMs for Chemistry and Biology

For developers who want to get started quickly with LLMs for computational biology and chemistry applications, the NVIDIA BioNeMo LLM service will include four pre-trained language models. These are optimized for inference and will be available in early access through a cloud API running on NVIDIA DGX Foundry.

  • ESM-1: This protein LLM, originally published by Meta AI Labs, processes amino acid sequences to generate representations that can be used to predict a wide variety of protein properties and functions. It also improves scientists’ ability to understand protein structure.
  • OpenFold: The public-private consortium creating state-of-the-art protein modeling tools will make its open-source AI pipeline accessible through the BioNeMo service.
  • MegaMolBART: Trained on 1.4 billion molecules, this generative chemistry model can be used for reaction prediction, molecular optimization and de novo molecular generation.
  • ProtT5: The model, developed in a collaboration led by the RostLab at the Technical University of Munich and including NVIDIA, extends the capabilities of protein LLMs like ESM-1b to sequence generation.
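The workflow the ESM-1 bullet describes – turning sequences into learned representations, then predicting properties from them – can be sketched with synthetic stand-ins. In the NumPy example below, random vectors replace real embeddings from a pretrained model, the solubility labels are invented, and the logistic-regression head is a generic choice; this is an analogy for the downstream-prediction step, not the BioNeMo API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-protein embeddings from a pretrained encoder (e.g. ESM-1).
# In practice these would come from the model; here they are synthetic.
n, d = 200, 16
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
# Invented binary "solubility" labels, loosely tied to the embeddings.
y = (X @ true_w + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic-regression head trained on the frozen embeddings.
w = np.zeros(d)
b = 0.0
lr = 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / n   # gradient of the logistic loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```

The point of the sketch: once a large pretrained model has distilled a sequence into a vector, even a very small downstream model can map that vector to a property of interest.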

In the future, researchers using the BioNeMo LLM service will be able to customize LLMs for greater accuracy on their applications within hours – with fine-tuning and new techniques such as p-tuning, a training method that requires a dataset of just a few hundred examples instead of millions.
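A rough intuition for p-tuning: the pretrained model's weights stay frozen, and only a small continuous prompt vector is optimized on the task data, which is why a few hundred examples can suffice. The toy NumPy sketch below is a loose analogy of that idea – a frozen random matrix stands in for the pretrained model, and the binary task is invented – not the actual NeMo p-tuning implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

d_in, d_prompt = 8, 8
# Frozen "pretrained" weights: never updated during prompt tuning.
A = rng.normal(size=(d_in, d_prompt)) / np.sqrt(d_prompt)

# Small task dataset: a few hundred examples, as p-tuning assumes.
n = 300
X = rng.normal(size=(n, d_in))
y = (X[:, 0] - X[:, 1] > 0).astype(float)  # invented downstream labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Only the small continuous prompt vector p is trained.
p = np.zeros(d_prompt)
lr = 0.5
for _ in range(2000):
    probs = sigmoid(X @ (A @ p))
    # Gradient of the logistic loss w.r.t. p only; A stays frozen.
    grad_p = A.T @ (X.T @ (probs - y)) / n
    p -= lr * grad_p

accuracy = np.mean((sigmoid(X @ (A @ p)) > 0.5) == y)
print(f"accuracy after prompt tuning: {accuracy:.2f}")
```

Although only `d_prompt` numbers are learned, the frozen weights let that tiny prompt steer the model's behavior on the new task – the design trade-off that makes prompt-style tuning cheap to run and fast to iterate on.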

Startups, researchers and the pharmaceutical industry adopt NVIDIA BioNeMo

A wave of biotechnology and pharmaceutical industry experts are adopting NVIDIA BioNeMo to support drug discovery research.

  • AstraZeneca and NVIDIA used the Cambridge-1 supercomputer to develop the MegaMolBART model included in the BioNeMo LLM service. The global biopharmaceutical company will use the BioNeMo framework to help train some of the world’s largest language models on datasets of small molecules, proteins and, soon, DNA.
  • Researchers at the Broad Institute of MIT and Harvard are working with NVIDIA to develop next-generation DNA language models using the BioNeMo framework. These models will be integrated with Terra, a cloud platform co-developed by the Broad Institute, Microsoft and Verily that allows biomedical researchers to securely share, access and analyze data at scale. The models will also be added to the BioNeMo service’s collection.
  • The OpenFold consortium plans to use the BioNeMo framework to advance its work by developing AI models that can predict molecular structures from amino acid sequences with near-experimental accuracy.
  • Peptone focuses on modeling intrinsically disordered proteins – proteins that do not have a stable 3D structure. The company is working with NVIDIA to develop versions of the ESM model using the NeMo framework, on which BioNeMo is also based. The project, which is expected to run on NVIDIA’s Cambridge-1 supercomputer, will advance Peptone’s drug discovery work.
  • Evozyne, a Chicago-based biotech company, combines engineering and deep learning technology to design novel proteins to solve long-standing challenges in therapeutics and sustainability.

“The BioNeMo framework is a technology to efficiently harness the power of LLMs for data-driven protein design in our design-build-test cycle,” said Andrew Ferguson, co-founder and chief computation officer at Evozyne. “This will have an immediate impact on our design of novel functional proteins, with applications in human health and sustainability.”

“As we see the increasing adoption of large language models in the protein space, being able to efficiently train LLMs and rapidly modulate model architectures becomes extremely important,” said Istvan Redl, machine learning lead at Peptone, a biotech startup in the NVIDIA Inception Program. “We believe that these two engineering aspects – scalability and rapid experimentation – are exactly what the BioNeMo framework could provide.”

Register for early access to the NVIDIA BioNeMo LLM service or BioNeMo framework. For hands-on experience with the MegaMolBART chemistry model in BioNeMo, request a free lab from NVIDIA LaunchPad on LLM Training and Deployment.

Discover the latest in AI and healthcare at GTC, running online through Thursday, September 22. Registration is free.

Watch the GTC keynote from NVIDIA founder and CEO Jensen Huang.

Main image by Mahendra Awale, under license CC BY-SA 3.0 via Wikimedia Commons
