TY - JOUR
T1 - Automatic semantic knowledge extraction from electronic forms
AU - Wu, Haolin
AU - French, Tim
AU - Liu, Wei
AU - Hodkiewicz, Melinda
PY - 2024/10
Y1 - 2024/10
AB - Electronic tabular forms are an intuitive way for organisations to collect, present and store structured information for human readers. Forms use features such as fonts, colours and cell positioning to help readers navigate and find information. Millions of forms, typically in Portable Document Format (PDF), are generated by businesses as part of routine operations. Unlike human readers, machines are not able to directly ‘understand’ the implicit cues contained in the fonts, colours and use of boxes without explicit processing. In this paper, a supervised computer vision model is proposed to decompose the PDF form document into nested microtables. The cells within these microtables are then processed using a customisable rule bank for meaningful table content and semantic relationship extraction. The process is demonstrated on an industry dataset of 37 maintenance procedure documents containing 373 pages and 1016 unique microtables. A web application EMU (Extracting Machine Understandable Semantics from Forms) demonstrates how data captured in tables with different dimensions in procedural forms can be automatically extracted and stored in JavaScript Object Notation (JSON). Identifying and extracting nested tables is a critical fundamental step for future applications to support machine-automated search and extraction of data at scale for both maintenance and other procedural documentation.
KW - Knowledge graph
KW - maintenance
KW - procedures
KW - semantic relationship
UR - http://www.scopus.com/inward/record.url?scp=85131000034&partnerID=8YFLogxK
U2 - 10.1177/1748006X221098272
DO - 10.1177/1748006X221098272
M3 - Article
AN - SCOPUS:85131000034
SN - 1748-006X
VL - 238
SP - 933
EP - 944
JO - Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability
JF - Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability
IS - 5
ER -