Query-driven frequent co-occurring term computation over relational data using MapReduce

Jianxin Li, C. Liu, R. Zhou, J. X. Yu

Research output: Contribution to journal, Article

1 Citation (Scopus)

Abstract

© 2014 The British Computer Society. All rights reserved.

Given a keyword query q and a large structured database, traditional keyword search may return a large number of relevant results, leaving users with the frustrating task of picking out the results that interest them. To help users understand the data being searched, in this work we investigate the problem of frequent co-occurring terms (FCTs) in large relational data. By returning the set of terms that most frequently co-occur with the given keywords, we give users a chance to see the big picture of the relevant data. The FCT problem is also a fundamental building block of data mining, because the discovered FCTs can be used to analyze the topics or contexts of user interest. Although FCT computation was proposed and investigated by Tao and Yu [(2009) Finding Frequent Co-Occurring Terms in Relational Keyword Search. 12th Int. Conf. Extending Database Technology (EDBT), Saint-Petersburg, Russia, March 23-26, pp. 839-850. ACM, New York, USA], further investigation is needed to improve its performance because FCT computation is very expensive. In particular, as data volumes grow, the centralized approach of Tao and Yu (2009) may struggle to perform an FCT computation efficiently. To address this, we investigate how to perform parallel FCT computation using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. We design an effective mapping mechanism that exploits the approximately maximal workload of FCT computation to balance the computational cost across processors, while reducing the shuffling cost and avoiding data skew. Analytical and experimental evaluations on the TPC-H benchmark demonstrate the efficiency and scalability of our proposed approach.
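As a rough illustration of the kind of computation the abstract describes, the sketch below counts co-occurring terms in a map/reduce style using plain Python. It is not the paper's mapping mechanism: it makes no attempt at workload balancing or shuffle reduction. It also assumes, purely for illustration, that each query result has already been flattened into a whitespace-separated text record and that a term is counted at most once per record. The function names `map_phase`, `reduce_phase`, and `top_k_fcts` are hypothetical.

```python
from collections import Counter
from itertools import chain


def map_phase(records, query_terms):
    """Emit (term, 1) for each non-query term in a record that contains
    every query keyword. Records are assumed to be flat text strings."""
    query = {t.lower() for t in query_terms}
    for text in records:
        terms = set(text.lower().split())
        if query <= terms:  # record matches all query keywords
            for term in terms - query:
                yield term, 1


def reduce_phase(pairs):
    """Sum the emitted counts per term."""
    counts = Counter()
    for term, n in pairs:
        counts[term] += n
    return counts


def top_k_fcts(partitions, query_terms, k):
    """Run the map phase over each partition, reduce, and return the k
    most frequent co-occurring terms (ties broken alphabetically)."""
    pairs = chain.from_iterable(map_phase(p, query_terms) for p in partitions)
    counts = reduce_phase(pairs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical toy data: two partitions of already-flattened records.
partitions = [
    ["database query search", "query search index"],
    ["query index search", "graph mining"],
]
print(top_k_fcts(partitions, ["query"], k=2))
```

In a real MapReduce deployment the partitions would be input splits processed on different machines, and how records are assigned to them is exactly where the balancing and skew concerns mentioned in the abstract arise.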
Original language: English
Pages (from-to): 497-513
Number of pages: 17
Journal: Computer Journal
Volume: 58
Issue number: 3
Publication status: Published - 2015
Externally published: Yes