2ED: An efficient entity extraction algorithm using two-level edit-distance

Zeyi Wen, Dong Deng, Rui Zhang, Ramamohanarao Kotagiri

Research output: Chapter in Book/Conference paper (Conference paper)

Abstract

Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on string matching against a dictionary of known entities. For approximate entity extraction from free text, relying solely on character-based or solely on token-based similarity cannot handle both minor name variations at the token level and typos at the character level. Moreover, the tolerance of mismatch at the character level may differ from that at the token level, so the tolerance thresholds of the two levels should be individually customisable. In this paper, we propose an efficient algorithm called FuzzyED that is based on both character-level and token-level edit distance. To improve the efficiency of FuzzyED, we develop several novel techniques, including (i) a spanning-based technique for producing candidate sub-strings, (ii) a lower-bound dissimilarity that determines the boundaries of candidate sub-strings, (iii) a core-token-based technique that uses the importance of tokens to reduce the number of unpromising candidate sub-strings, and (iv) a shrinking technique that reuses computation. Empirical results on real-world datasets show that FuzzyED extracts entities efficiently and achieves a high F1 score in the range [0.91, 0.97].
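The two-level idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's FuzzyED algorithm and implements none of techniques (i) to (iv). It assumes, purely for illustration, that two tokens count as equal when their character-level edit distance is at most `char_threshold`, and that a candidate sub-string matches an entity when the token-level edit distance computed with that relaxed equality is at most `token_threshold`. All function and parameter names are hypothetical.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences (here: strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_level_distance(tokens_a, tokens_b, char_threshold):
    """Edit distance over token lists.

    Assumption (not from the paper): two tokens are treated as equal when
    their character-level edit distance is at most char_threshold.
    """
    prev = list(range(len(tokens_b) + 1))
    for i, ta in enumerate(tokens_a, start=1):
        curr = [i]
        for j, tb in enumerate(tokens_b, start=1):
            cost = 0 if edit_distance(ta, tb) <= char_threshold else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_match(candidate, entity, char_threshold=1, token_threshold=1):
    """Hypothetical matcher with one tolerance threshold per level."""
    distance = token_level_distance(candidate.lower().split(),
                                    entity.lower().split(),
                                    char_threshold)
    return distance <= token_threshold
```

With the default thresholds, `is_match("Univercity of Melbourne", "The University of Melbourne")` holds: the typo is absorbed at the character level and the missing token at the token level. Setting `char_threshold=0` makes the same pair fail, which shows why the two tolerances are worth tuning separately.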

Original language: English
Title of host publication: Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019
Place of Publication: Macao
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 998-1009
Number of pages: 12
ISBN (Electronic): 9781538674741
DOIs: https://doi.org/10.1109/ICDE.2019.00093
Publication status: Published - 1 Apr 2019
Externally published: Yes
Event: 35th IEEE International Conference on Data Engineering, ICDE 2019 - Macau, China
Duration: 8 Apr 2019 to 11 Apr 2019

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2019-April
ISSN (Print): 1084-4627

Conference

Conference: 35th IEEE International Conference on Data Engineering, ICDE 2019
Country: China
City: Macau
Period: 8/04/19 to 11/04/19

Cite this

Wen, Z., Deng, D., Zhang, R., & Kotagiri, R. (2019). 2ED: An efficient entity extraction algorithm using two-level edit-distance. In Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019 (pp. 998-1009). [8731344] (Proceedings - International Conference on Data Engineering; Vol. 2019-April). Macao: IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICDE.2019.00093