Information Extraction

A wealth of valuable data is locked within the millions of research articles published each year. We are researching methods to liberate this data via hybrid human-machine models.

The amount of scientific literature published every year is growing at an overwhelming rate. Some studies place the number of scientific journals at more than 28,000 and the number of articles published each year at 1.8 million. Consequently, the amount of important findings (e.g., experiment results) locked in tables, figures, and text in various formats is staggering. Reading and extracting pertinent information has become an unmanageable task for scientists and is now hindering their research. For example, in materials science and chemistry, difficulty discovering published materials properties directly affects the design of new materials. Indeed, despite the large number of publications in this domain, the process of designing new materials remains one of trial and error. Access to a structured, queryable database of all materials properties would facilitate the design and model validation of new substances, improving efficiency by enabling scientists and engineers to more quickly discover, query, and compare properties of existing compounds.

This avalanche of publications and scientific facts, however, is not yet machine accessible and cannot easily be transformed into human-consumable knowledge. The most common current practice is to recruit experts to manually extract and curate published properties (e.g., the Physical Properties of Polymers Handbook). This approach is laborious, expensive, and quickly outdated as new works are continuously being published. While the explosive growth of the World Wide Web has fostered huge efforts in information extraction systems, machine “reading” is still a challenging task requiring human intervention due to the heterogeneity and the lack of structure in much web-accessible data.

The goal of this work is to demonstrate that moderate supervision from humans, using appropriate tools, can generate high-quality data for the scientific community and produce an improved system that learns from user input. We have developed workflows that identify scientific named entities and extract their property data from publications.

The named entity tagger is the first step in the scientific information extraction workflow. It is built on bidirectional LSTM networks and conditional random fields and leverages knowledge from external sources such as Wikipedia to boost its learning performance. Experiments show that it outperforms a leading domain-specific extraction toolkit by up to 50%, as measured by F1 score, while also being easily adapted to new domains. Our model has been applied to extract drug-like molecules from publications on COVID-19 and found 18 candidate molecules that were not previously considered by the experimentalists.
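The conditional random field layer on top of the BiLSTM selects the best tag sequence for each sentence via Viterbi decoding. The following is a minimal pure-Python sketch of that decoding step, with toy emission and transition scores standing in for the trained model's outputs:

```python
def viterbi_decode(emissions, transitions, tags):
    """Find the highest-scoring tag sequence for one sentence.

    emissions: list of {tag: score} dicts, one per token (from the BiLSTM).
    transitions: {(prev_tag, tag): score} learned by the CRF layer.
    """
    # Initialize with the first token's emission scores.
    scores = {t: emissions[0][t] for t in tags}
    backptrs = []
    for emit in emissions[1:]:
        new_scores, ptrs = {}, {}
        for t in tags:
            # Best previous tag for reaching tag t at this position.
            prev = max(tags, key=lambda p: scores[p] + transitions[(p, t)])
            new_scores[t] = scores[prev] + transitions[(prev, t)] + emit[t]
            ptrs[t] = prev
        backptrs.append(ptrs)
        scores = new_scores
    # Backtrack from the best-scoring final tag.
    best = max(tags, key=lambda t: scores[t])
    path = [best]
    for ptrs in reversed(backptrs):
        best = ptrs[best]
        path.append(best)
    return list(reversed(path))
```

In the full model, the emission scores come from the BiLSTM's per-token outputs and the transition scores are parameters trained jointly with the network.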

Language models such as BERT are often used in state-of-the-art information extraction systems. The original BERT model is trained on a non-scientific corpus, namely Wikipedia and BookCorpus, which limits its performance on scientific tasks. Domain-specific BERT models such as PubMedBERT and BioBERT exist, but their training corpora are limited to certain disciplines such as chemistry and biology. We trained a more general science-focused BERT model and evaluated it on downstream tasks such as sentence classification, relation extraction, and named entity recognition in a variety of domains including biochemistry, computer science, and sociology.

We fine-tune our models on various scientific datasets to evaluate their performance on downstream tasks such as Named Entity Recognition (NER) and Relation Extraction (REL). For the NER task, we fine-tune on BC5CDR, JNLPBA, SciERC, ChemDNER, and NCBI-Disease; performance on the REL task is evaluated using the SciERC, ChemProt, and Paperfield datasets. We explore the impact of varying pre-training iterations, the size of the data used in pre-training, and model size on downstream task performance. The performance on these tasks is also compared against state-of-the-art models trained on scientific corpora.
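The NER results above are reported as F1 over predicted entity spans. A minimal sketch of that metric, illustrative rather than the actual evaluation harness we used, where each entity is a (start, end, type) tuple:

```python
def span_f1(gold, pred):
    """Entity-level precision, recall, and F1 over (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact span-and-type matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Exact-match span scoring is the strictest common convention; partial overlaps count as errors on both sides.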

For extracting properties for scientific entities, the initial target of extraction is a fundamental thermodynamic property, and a particularly challenging one, called the Flory-Huggins (or χ) parameter. The χ parameter describes the miscibility of polymer blends. This property is particularly challenging as it is published in heterogeneous data formats (e.g., text, figures, tables) and is represented in several different temperature-dependent expressions.
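One commonly reported temperature-dependent form is χ(T) = A + B/T, combining an entropic term A and an enthalpic term B/T. A small helper illustrates the form; the coefficients below are illustrative placeholders, not values from our database:

```python
def chi(temperature_k, a, b):
    """Flory-Huggins parameter in the common form chi(T) = A + B/T.

    temperature_k: absolute temperature in Kelvin
    a: entropic contribution (dimensionless)
    b: enthalpic contribution (Kelvin)
    """
    return a + b / temperature_k

# Illustrative coefficients only.
print(chi(400.0, a=0.02, b=10.0))  # 0.02 + 10/400 = 0.045
```

Extraction must recognize both this functional form and bare numeric values reported at a single temperature.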

We have developed a χDB workflow consisting of a Web information extraction phase followed by a crowdsourced curation phase that results in a digital handbook of χ values.

In this first stage, χDB discovers and downloads relevant publications from suitable journals. After downloading the publications, it extracts each publication’s metadata, including Digital Object Identifier (DOI), title, authors, and date of publication. This information is used to index the publication such that it can be linked to other stored information (e.g., referenced values in other papers), used as part of a provenance trail, and discovered by users. Finally, the publication is parsed into a number of items (e.g., abstract, figures, tables, equations, text) that can be reviewed individually. Each publication and publication item is then registered in the database. Links between publication items are maintained so that they can be displayed to reviewers in a coherent manner. The full text and the original URL are also stored such that reviewers and users can retrieve the original publication.
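The registration step above can be sketched as a small data model. The class and field names below are illustrative, not χDB's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class PublicationItem:
    item_id: str
    kind: str      # e.g. "abstract", "figure", "table", "equation", "text"
    content: str
    links: list = field(default_factory=list)  # ids of related items

@dataclass
class Publication:
    doi: str       # used to index and link the publication
    title: str
    authors: list
    date: str
    url: str       # kept so users can retrieve the original publication
    items: list = field(default_factory=list)

    def register_item(self, item: PublicationItem):
        # Each item is registered individually so reviewers can assess it alone.
        self.items.append(item)
```

Keeping items as first-class records with explicit links is what lets the review interface show a figure next to the text that references it.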

In the crowdsourced curation phase, the χDB interface leads “reviewers” through analyzing publications and extracting the χ parameter. In our first implementation of this phase, we designed an interface to be used in a class co-organized with colleagues at the Institute of Molecular Engineering. We engaged students via a course that combines teaching of the fundamentals of polymer chemistry and physics with the analysis of literature containing χ parameters. The reviewing of publications was organized in series, with each publication analyzed by two reviewers and an option to flag it for expert review when the two reviewers provide conflicting input. In addition to extracting χ, users are able to mark publication items as relevant (e.g., the item contains a χ value or is an image of the material).
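The two-reviewer policy can be sketched as a simple reconciliation rule; this is an illustrative simplification, not the exact logic of the χDB interface:

```python
def reconcile(review_a, review_b, tol=1e-6):
    """Accept a chi value when two reviewers agree; otherwise flag for experts.

    Each argument is the chi value a reviewer extracted, or None if that
    reviewer found no value in the publication item.
    """
    if review_a is None and review_b is None:
        return ("no_value", None)
    if (review_a is not None and review_b is not None
            and abs(review_a - review_b) <= tol):
        return ("accepted", review_a)
    # Conflicting or partial input: escalate to an expert reviewer.
    return ("expert_review", None)
```

Routing only disagreements to experts keeps expert time focused on the genuinely ambiguous publications.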

Once an extracted value has passed through the review cycle, it is stored in a database (the χDB digital handbook) with associated provenance information that links the value back to the original publication and describes the review process employed. The database contains all publication items that reviewers have deemed relevant and all extracted χDB values. All information in this database has been reviewed multiple times and is ready to be presented to external users, i.e., to materials scientists and engineers. To support user access to the database, χDB includes a web service and interface that allow users to browse the database for χ values. Users can also use an API to ingest χDB values directly from custom applications; for example, they can retrieve χ values for a set of specific polymers and then perform calculations or visualizations of those values. Visit the χDB digital handbook at:
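Programmatic access might look like the following sketch. The base URL, endpoint path, and parameter names are hypothetical placeholders for illustration only; consult the χDB API documentation for the actual interface:

```python
from urllib.parse import urlencode

def build_chi_query(base_url, polymer_a, polymer_b):
    """Build a query URL for chi values of a polymer pair.

    All names here (endpoint path, parameter names) are hypothetical
    placeholders, not the real chiDB API.
    """
    params = urlencode({"polymer_a": polymer_a, "polymer_b": polymer_b})
    return f"{base_url}/chi?{params}"
```

A custom application would issue such a request and feed the returned values into its own calculations or visualizations.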

The content of the database was analyzed to learn new information about how the χ parameter currently appears in the literature. Based on a 5-year search period of publications in Macromolecules, we gained insight into the most studied polymers and polymer pairs in this area. We also identified the most common form in which χ is published (a number at a specific temperature) as well as the most common source of χ in publications (plain text). We reported more interesting findings from a polymer science perspective. For example, in the case of polymer-polymer χ values, as opposed to polymer-solvent χ values, the most commonly reported methods were in agreement with the ones originally proposed by our materials experts. Find more details in our publications [1, 2].

The first target for improvement identified in the χDB system is the publication selection process. Only 38.5% of our selected publications actually contained χ values; thus, about 60% of the papers curated by student reviewers did not in fact contribute to the digital handbook. As a first step, machine learning techniques can help optimize the use of reviewer time by prioritizing and better classifying relevant publications. The question this component of χDB aims to answer is whether we can automatically identify papers that contain χ values based solely on abstracts and captions. Using a support vector machine classifier, we report a precision of 86.94% and a recall of 90.87% when classifying papers containing χ values for polymer-polymer pairs based on abstracts. Find more details in our publications [1, 2].
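The classifier operates on text features of the abstracts. As a self-contained illustration of the idea, here is a simple bag-of-words linear classifier (a perceptron standing in for the actual SVM pipeline, with toy training examples):

```python
from collections import defaultdict

def featurize(text):
    """Bag-of-words counts over lowercased whitespace tokens."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    return counts

def train_perceptron(examples, epochs=10):
    """Train on (abstract_text, label) pairs with labels +1 / -1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = sum(weights[t] * c for t, c in feats.items())
            if label * score <= 0:  # misclassified: nudge weights toward label
                for t, c in feats.items():
                    weights[t] += label * c
    return weights

def predict(weights, text):
    """Return +1 (likely contains a chi value) or -1 (likely does not)."""
    score = sum(weights[t] * c for t, c in featurize(text).items())
    return 1 if score > 0 else -1
```

A real pipeline would use richer features (e.g., TF-IDF over abstracts and captions) and an SVM, but the prioritization idea is the same: score each paper before assigning it to a reviewer.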

Further reading