2ED: An efficient entity extraction algorithm using two-level edit-distance

Zeyi Wen, Dong Deng, Rui Zhang, Ramamohanarao Kotagiri

Research output: Chapter in Book/Conference paper › Conference paper › peer-review

5 Citations (Scopus)


Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on string matching against a dictionary of known entities. For approximate entity extraction from free text, relying solely on character-based or solely on token-based similarity cannot handle both minor name variations at the token level and typos at the character level. Moreover, the tolerance for mismatches at the character level may differ from that at the token level, so the tolerance thresholds of the two levels should be individually customisable. In this paper, we propose an efficient character-level and token-level edit-distance based algorithm called FuzzyED. To improve the efficiency of FuzzyED, we develop several novel techniques, including (i) a spanning-based technique for producing candidate sub-strings, (ii) a lower bound on dissimilarity to determine the boundaries of candidate sub-strings, (iii) a core-token-based technique that uses the importance of tokens to reduce the number of unpromising candidate sub-strings, and (iv) a shrinking technique to reuse computation. Empirical results on real-world datasets show that FuzzyED can efficiently extract entities and achieve F1 scores in the range [0.91, 0.97].
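To make the two-level idea concrete, the following is a minimal Python sketch, not the authors' FuzzyED implementation. It assumes one plausible reading of the abstract: two tokens count as matching when their character-level edit distance is within a character threshold, and a candidate sub-string matches a dictionary entity when the token-level edit distance, computed over those fuzzy token matches, is within a separate token threshold. The function names, the default thresholds, and the whitespace tokenisation are all illustrative assumptions, and none of the paper's efficiency techniques (spanning, lower bounds, core tokens, shrinking) are included.

```python
from typing import Sequence


def char_edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def token_edit_distance(x: Sequence[str], y: Sequence[str],
                        char_threshold: int) -> int:
    """Levenshtein distance over token sequences (token level).

    Assumption: two tokens are treated as equal when their character-level
    edit distance does not exceed char_threshold.
    """
    prev = list(range(len(y) + 1))
    for i, tx in enumerate(x, start=1):
        curr = [i]
        for j, ty in enumerate(y, start=1):
            cost = 0 if char_edit_distance(tx, ty) <= char_threshold else 1
            curr.append(min(prev[j] + 1,          # drop token tx
                            curr[j - 1] + 1,      # insert token ty
                            prev[j - 1] + cost))  # fuzzy match or substitute
        prev = curr
    return prev[-1]


def is_match(candidate: str, entity: str,
             char_threshold: int = 1, token_threshold: int = 1) -> bool:
    """Hypothetical matcher: the two thresholds are set independently."""
    cand_tokens = candidate.lower().split()
    ent_tokens = entity.lower().split()
    return token_edit_distance(cand_tokens, ent_tokens,
                               char_threshold) <= token_threshold


if __name__ == "__main__":
    entity = "The University of Melbourne"
    # One typo ("Univrsity") plus one missing token ("The").
    print(is_match("Univrsity of Melbourne", entity))                    # True
    # With no character-level tolerance the typo costs a whole token edit.
    print(is_match("Univrsity of Melbourne", entity, char_threshold=0))  # False
```

In this sketch the typo is absorbed at the character level and the missing token at the token level, which is why the two thresholds can be tuned separately. An absolute character threshold is a simplification: it also lets short tokens such as "of" and "on" match, so a practical system would likely scale the tolerance with token length.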

Original language: English
Title of host publication: Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019
Place of Publication: Macao
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 12
ISBN (Electronic): 9781538674741
Publication status: Published - 1 Apr 2019
Externally published: Yes
Event: 35th IEEE International Conference on Data Engineering, ICDE 2019 - Macau, China
Duration: 8 Apr 2019 to 11 Apr 2019

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627


Conference: 35th IEEE International Conference on Data Engineering, ICDE 2019

