A collaboration between IMPACT and IBM has produced INDUS, a comprehensive suite of large language models (LLMs) tailored for Earth science, biological and physical sciences, heliophysics, planetary sciences, and astrophysics, and trained on curated scientific corpora drawn from diverse data sources. Kaylin Bugbee (ST11), team lead of NASA's Science Discovery Engine (SDE), described the benefit INDUS offers to existing applications: "Large language models are rapidly changing the search experience. The Science Discovery Engine, a unified, insightful search interface for all of NASA's open science data and information, has prototyped integrating INDUS into its search engine. Initial results have shown that INDUS improved the accuracy and relevancy of the returned results."
The INDUS models are openly available on Hugging Face. For the benefit of the scientific community, the team has released the models and will release benchmark datasets spanning named entity recognition for climate change, extractive question answering (QA) for Earth science, and information retrieval across multiple domains. A paper on INDUS, "INDUS: Effective and Efficient Language Models for Scientific Applications," is available at https://arxiv.org/pdf/2405.10725.