Lexical normalisation aims to computationally correct errors in text so that the data can be analysed more successfully. Noisy, unstructured short-text data presents unique challenges because it contains multiple types of Out Of Vocabulary (OOV) words. Some are spelling mistakes, which should be normalised to in-dictionary words; some are acronyms or abbreviations, which should be expanded to the corresponding phrases; and some are domain-specific terms, which should remain in their original form so as not to be mis-corrected to conform to the dictionary used. Despite its critical significance in assuring data quality, text normalisation has received a less cohesive and focused research effort, as evidenced by the diverse set of keywords used and the scattered publication venues. Integrated approaches that address all three types of OOV terms are scarce. Here we introduce a two-stage, modular, classification-based framework that specifically targets the various types of OOV terms prevalent in short-text data. To avoid laborious feature engineering, our system uses a Bi-Directional Long Short-Term Memory (BiLSTM) + Conditional Random Field (CRF) model to classify each erroneous token into a particular class. The system then selects an appropriate normalisation technique based on the predicted class of each token. For spell-checking, we introduce two learning models that predict the correct spelling of a word given its context: one that uses word embeddings, and another that uses a quasi-recurrent neural network. We compare our system to two existing state-of-the-art lexical normalisation systems and find that ours achieves better performance on the log data domain.
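The two-stage design described above can be sketched as a dispatch pipeline: a token classifier assigns each OOV token to one of the three classes, and a class-specific normaliser is applied. The sketch below is purely illustrative and under stated assumptions: the dictionary, abbreviation map, domain-term list, and rule-based classifier stub stand in for the paper's trained BiLSTM+CRF model, and the string-similarity spell-checker stands in for the paper's embedding- and QRNN-based context models.

```python
# Hypothetical sketch of a two-stage lexical-normalisation pipeline.
# The stub classifier and toy resources below are illustrative assumptions,
# not the authors' actual trained models or data.
import difflib

DICTIONARY = {"received", "message", "from", "server", "timeout"}
ABBREVIATIONS = {"msg": "message", "srv": "server", "cfg": "configuration"}
DOMAIN_TERMS = {"kube-proxy", "errno"}

def classify(token):
    """Stage 1 stub: the paper uses a BiLSTM+CRF model to predict
    the OOV class; here a lookup-based rule stands in for it."""
    if token in DOMAIN_TERMS:
        return "domain"
    if token in ABBREVIATIONS:
        return "abbreviation"
    return "misspelling"

def correct_spelling(token):
    """Toy spell-checker: nearest in-dictionary word by string similarity.
    (The paper instead ranks candidates using word embeddings or a
    quasi-recurrent neural network over the token's context.)"""
    matches = difflib.get_close_matches(token, DICTIONARY, n=1, cutoff=0.6)
    return matches[0] if matches else token

def normalise(token):
    """Stage 2: route the token to a normaliser chosen by its class."""
    if token in DICTIONARY:
        return token                 # in-vocabulary: leave untouched
    cls = classify(token)
    if cls == "domain":
        return token                 # keep domain-specific terms as-is
    if cls == "abbreviation":
        return ABBREVIATIONS[token]  # expand to the corresponding phrase
    return correct_spelling(token)   # normalise spelling mistakes

print(" ".join(normalise(t) for t in "recieved msg from srv kube-proxy".split()))
# → received message from server kube-proxy
```

The key design point the sketch illustrates is modularity: because classification is separated from correction, any single normaliser (e.g. the spell-checker) can be swapped out without retraining the classifier.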
Title of host publication: 2018 IEEE International Conference on Big Knowledge (ICBK)
Place of Publication: Singapore
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 10
Publication status: Published - 1 Nov 2018