#Machine Learning #Materials Science

Scientific Language Modeling and Information Extraction

Scientific articles have long been the primary means of disseminating scientific discoveries. Valuable data and potentially ground-breaking insights have been collected and buried deep in the mountain of scientific publications over the centuries. We strive to answer interesting and important questions in science by extracting facts from publications and, in the process, have built a foundational large language model for science.

Our previous work focused on designing application-specific models and pipelines. It produced a polymer extraction model that outperformed a leading chemical extraction toolkit by up to 50%, as measured by F1 score, and a druglike molecule extraction model that found 3,591 molecules in COVID-19-related medical research that had not previously been considered by Argonne’s computational screening research team.
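For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score can be computed for an extraction task, assuming exact-match comparison between extracted strings and a hand-labeled gold set. The function name `f1_score`, the matching rule, and the example entity names are illustrative assumptions; they are not the evaluation code or data used in the work described above. The text does not say whether the reported improvement is absolute or relative, so the sketch computes both.

```python
def f1_score(predicted, gold):
    """Exact-match F1 between predicted and gold entity strings (illustrative)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # No correct extractions: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_positives / len(predicted)  # fraction of extractions that are correct
    recall = true_positives / len(gold)          # fraction of gold entities that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical entity sets, invented purely for illustration.
gold = {"polystyrene", "polyethylene", "PMMA", "nylon-6"}
baseline = {"polystyrene", "benzene"}
candidate = {"polystyrene", "polyethylene", "PMMA", "toluene"}

baseline_f1 = f1_score(baseline, gold)
candidate_f1 = f1_score(candidate, gold)

# Two common ways to report an improvement between systems.
absolute_gain = candidate_f1 - baseline_f1
relative_gain = absolute_gain / baseline_f1

print(f"baseline F1 = {baseline_f1:.3f}, candidate F1 = {candidate_f1:.3f}")
print(f"absolute gain = {absolute_gain:.3f}, relative gain = {relative_gain:.1%}")
```

F1 is the harmonic mean of precision and recall, so a model scores well only when it both finds most of the true entities and avoids spurious ones.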

Large Language Models (LLMs) have become the core of many NLP solutions in recent years due to their flexibility and performance advantages over traditional machine learning methods. Most publicly available LLMs are pretrained on general English corpora such as news reports, wiki pages, and blogs, while scientific texts are usually neglected. We have pretrained the ScholarBERT and ScholarBERT-XL models on a corpus of 75 million scientific publications. Preliminary experimental results show that, compared to general-purpose and domain-specific LLMs, these models are stronger at identifying the disciplines to which scientific named entities belong.


Funding and Acknowledgements

This work was supported by award 70NANB19H005 from the U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD); by the U.S. Department of Energy under contract DE-AC02-06CH11357; and by U.S. National Science Foundation awards DGE-2022023 and OAC-2106661. This research used resources of the University of Chicago Research Computing Center and the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.


Aswathy Ajith
Eamon Duede
Greg Pauloski
Ian Foster
Kyle Chard
Zhi Hong