SAMURAI - NIMS Researchers Database


Automatic Identification and Normalisation of Physical Measurements in Scientific Literature

Authors: Luca Foppiano, Laurent Romary, Masashi Ishii, Mikiko Tanifuji.
Published in: DocEng '19 Proceedings of the ACM Symposium on Document Engineering 2019
Abstract: We present Grobid-quantities, an open source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, aiming to understand and make unstructured information accessible, represent the building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid, a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task. The models are trained using the Conditional Random Field (CRF) algorithm for extracting quantities (atomic values, intervals and lists), units (such as length or weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), for extracting semantic measurements and meaning in Earth Science. At the National Institute for Materials Science in Japan (NIMS), it is used in an ongoing project to discover new superconducting materials. Normalised materials characteristics (such as critical temperature and pressure) extracted from scientific literature are a key resource for materials informatics (MI).
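To make the normalisation step in the abstract concrete, the following is a minimal sketch of converting extracted measurements to SI units. It is not the Grobid-quantities implementation: the abstract states that extraction relies on CRF models, whereas this illustration substitutes a simple regular expression. The conversion table `TO_SI` and the function `normalise` are hypothetical names introduced only for this example, and the table covers just a handful of units.

```python
import re

# Hypothetical conversion table (an assumption for illustration only):
# unit symbol -> (SI unit, multiplicative factor, additive offset)
TO_SI = {
    "km": ("m", 1000.0, 0.0),
    "cm": ("m", 0.01, 0.0),
    "g": ("kg", 0.001, 0.0),
    "GPa": ("Pa", 1e9, 0.0),
    "K": ("K", 1.0, 0.0),
    "°C": ("K", 1.0, 273.15),
}

# Longest symbols first, so a longer symbol is never shadowed by a shorter one.
_UNITS = "|".join(re.escape(u) for u in sorted(TO_SI, key=len, reverse=True))

# A number followed by one of the known unit symbols.
# A regex stands in here for the CRF-based extraction described in the abstract.
PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*(" + _UNITS + r")\b")


def normalise(text):
    """Return (raw mention, SI value, SI unit) for each measurement found."""
    results = []
    for match in PATTERN.finditer(text):
        value = float(match.group(1))
        si_unit, factor, offset = TO_SI[match.group(2)]
        results.append((match.group(0), value * factor + offset, si_unit))
    return results


if __name__ == "__main__":
    for raw, value, unit in normalise("Tc of 39 K at 2.5 GPa"):
        print(raw, "->", value, unit)
```

A real system would also need to handle intervals, lists, and alphabetic or scientific-notation values, which the abstract lists as separate tasks in the cascade.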
Visibility: Public on the Internet
Created / Updated: 2020-01-31 03:00:19 +0900 / 2020-02-17 09:18:52 +0900