Nov 28, 2022
Optical Chemical Structure Recognition (OCSR) deals with the translation of chemical images, the main way chemical compounds are depicted in scientific documents, into molecular structures. Traditional rule-based methods follow a framework based on the detection of atoms and bonds, followed by the reconstruction of the compound structure. Recently, neural architectures analogous to image captioning have been explored to solve this task, yet they remain data inefficient, requiring millions of examples just to reach performance comparable with traditional methods. To motivate and benchmark new approaches based on atomic-level entity detection and graph reconstruction, we present CEDe, a unique collection of chemical entity bounding boxes manually curated by experts for scientific literature datasets. Together, these annotations comprise more than 700,000 chemical entity bounding boxes with the information necessary for structure reconstruction. We also release a large synthetic dataset containing 1 million molecular images and annotations, in order to explore transfer-learning techniques that could help these architectures perform better in low-data regimes. Benchmarks show that detection-reconstruction based models can achieve performance on par with or better than image captioning-like models, even with 100x fewer training examples.