Automatic semantic knowledge extraction from electronic forms

Research output: Contribution to journal › Article › peer-review

2 Citations (Web of Science)

Abstract

Electronic tabular forms are an intuitive way for organisations to collect, present and store structured information for human readers. Forms use features such as fonts, colours and cell positioning to help readers navigate and find information. Millions of forms, typically in Portable Document Format (PDF), are generated by businesses as part of routine operations. Unlike human readers, machines are not able to directly ‘understand’ the implicit cues contained in the fonts, colours and use of boxes without explicit processing. In this paper, a supervised computer vision model is proposed to decompose the PDF form document into nested microtables. The cells within these microtables are then processed using a customisable rule bank for meaningful table content and semantic relationship extraction. The process is demonstrated on an industry dataset of 37 maintenance procedure documents containing 373 pages and 1016 unique microtables. A web application EMU (Extracting Machine Understandable Semantics from Forms) demonstrates how data captured in tables with different dimensions in procedural forms can be automatically extracted and stored in JavaScript Object Notation (JSON). Identifying and extracting nested tables is a critical fundamental step for future applications to support machine-automated search and extraction of data at scale for both maintenance and other procedural documentation.
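The rule-bank step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the vision stage has already produced nested microtables, and every name in it (`Cell`, `Microtable`, `RULE_BANK`, `extract`) is hypothetical. A bold font stands in for the visual cues a form uses to mark a label, and each rule either assigns a semantic role to a cell or declines by returning `None`.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Cell:
    row: int
    col: int
    text: str
    bold: bool = False  # assumed stand-in for a visual cue such as font weight


@dataclass
class Microtable:
    cells: list[Cell]
    children: list[Microtable] = field(default_factory=list)  # nested microtables


# A rule returns a semantic role for a cell, or None when it does not apply.
Rule = Callable[[Cell], Optional[str]]


def empty_rule(cell: Cell) -> Optional[str]:
    return "empty" if not cell.text.strip() else None


def key_rule(cell: Cell) -> Optional[str]:
    return "key" if cell.bold else None


def value_rule(cell: Cell) -> Optional[str]:
    return "value"  # fallback: any remaining non-empty cell is treated as a value


# Rules are tried in order; users can customise the bank by editing this list.
RULE_BANK: list[Rule] = [empty_rule, key_rule, value_rule]


def classify(cell: Cell, rules: list[Rule]) -> str:
    for rule in rules:
        role = rule(cell)
        if role is not None:
            return role
    return "unknown"


def extract(table: Microtable, rules: list[Rule] = RULE_BANK) -> dict:
    """Pair each key cell with the value cells to its right in the same row."""
    rows: dict[int, list[Cell]] = {}
    for cell in table.cells:
        rows.setdefault(cell.row, []).append(cell)

    fields: dict[str, list[str]] = {}
    for row_index in sorted(rows):
        key = None
        for cell in sorted(rows[row_index], key=lambda c: c.col):
            role = classify(cell, rules)
            if role == "key":
                key = cell.text.strip()
                fields.setdefault(key, [])
            elif role == "value" and key is not None:
                fields[key].append(cell.text.strip())

    return {
        "fields": fields,
        "children": [extract(child, rules) for child in table.children],
    }


if __name__ == "__main__":
    # Invented example data, not taken from the paper's dataset.
    inner = Microtable(cells=[Cell(0, 0, "Torque", bold=True), Cell(0, 1, "45 Nm")])
    outer = Microtable(
        cells=[Cell(0, 0, "Task", bold=True), Cell(0, 1, "Replace pump seal")],
        children=[inner],
    )
    print(json.dumps(extract(outer), indent=2))
```

Keeping the rules as an ordered list of small functions is one way to make the bank customisable: a new cue, such as a background colour or a checkbox glyph, becomes one more function in the list and the extraction loop stays unchanged.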

Original language: English
Pages (from-to): 933-944
Number of pages: 12
Journal: Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability
Volume: 238
Issue number: 5
Early online date: May 2023
Publication status: Published - Oct 2024
