TY - GEN
T1 - MaintNorm
T2 - 9th Workshop on Noisy and User-Generated Text
AU - Bikaun, Tyler
AU - Hodkiewicz, Melinda
AU - Liu, Wei
PY - 2024
Y1 - 2024
N2 - Maintenance short texts are invaluable unstructured data sources, serving as a diagnostic and prognostic window into the operational health and status of physical assets. These user-generated texts, created during routine or ad-hoc maintenance activities, offer insights into equipment performance, potential failure points, and maintenance needs. However, the use of information captured in these texts is hindered by inherent challenges: the prevalence of engineering jargon, domain-specific vernacular, random spelling errors without identifiable patterns, and the absence of standard grammatical structures. To transform these texts into accessible and analysable data, we introduce the MaintNorm dataset, the first resource specifically tailored for the lexical normalisation task of maintenance short texts. Comprising 12,000 examples, this dataset enables the efficient processing and interpretation of these texts. We demonstrate the utility of MaintNorm by training a lexical normalisation model as a sequence-to-sequence learning task with two learning objectives, namely, enhancing the quality of the texts and masking segments to obscure sensitive information and anonymise the data. Our benchmark model demonstrates a universal error reduction rate of 95.8%. The dataset and benchmark outcomes are made available to the public under the MIT license.
UR - http://www.scopus.com/inward/record.url?scp=85190302041&partnerID=8YFLogxK
M3 - Conference paper
AN - SCOPUS:85190302041
T3 - W-NUT 2024 - 9th Workshop on Noisy and User-Generated Text, Proceedings of the Workshop
SP - 58
EP - 67
BT - W-NUT 2024 - 9th Workshop on Noisy and User-Generated Text, Proceedings of the Workshop
A2 - van der Goot, Rob
A2 - Bak, JinYeong
A2 - Müller-Eberstein, Max
A2 - Xu, Wei
A2 - Ritter, Alan
A2 - Baldwin, Tim
PB - Association for Computational Linguistics (ACL)
Y2 - 22 March 2024
ER -