Efficient finer-grained incremental processing with MapReduce for big data

Liang Zhang, Yuanyuan Feng, Peiyi Shen, Guangming Zhu, Wei Wei, Juan Song, Syed Afaq Ali Shah, Mohammed Bennamoun

    Research output: Contribution to journal › Article

    Abstract

    With the continuous development of the Internet and information technology, mobile terminals, wearable devices, and other sources generate tremendous amounts of data. Distributed computing allows such big data to be analyzed at high speed. However, many kinds of big data share an obvious characteristic: the datasets grow incrementally over time, which means distributed computing should focus on incremental processing. A number of systems for incremental data processing are available, such as Google's Percolator and Yahoo's CBP. However, to use these mature frameworks, one must make troublesome changes to a program to adapt it to the environment's requirements. In this paper, we introduce a MapReduce framework, named HadInc, for efficient incremental computation. HadInc is designed for offline scenarios, in which real-time processing is not required and in-memory cluster computing is not applicable. HadInc takes advantage of finer-grained computation and Content-Defined Chunking (CDC) to ensure that the system can still reuse previously computed results, even if the input splits have changed significantly. Instead of recomputing the changed data entirely, HadInc quickly finds the difference between a new split and the old one, and then merges the delta with the old results into the latest result for the new dataset. Meanwhile, the stability of the dataset's division is a key factor in reusing results. To guarantee this stability, we propose a series of novel algorithms based on CDC. We implemented HadInc by extending the Hadoop framework and evaluated it with many experiments, including three specific cases and a practical case. The comparative results show that the proposed HadInc is highly efficient.
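    The Content-Defined Chunking the abstract relies on can be sketched with a Rabin-style rolling hash. This is a generic illustration of CDC, not HadInc's actual algorithm; the window size, boundary mask, and chunk-size limits below are illustrative defaults.

    ```python
    def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                   min_size: int = 32, max_size: int = 512) -> list:
        """Split `data` wherever the rolling hash of the last `window` bytes
        matches the mask. Because boundaries depend on local content rather
        than absolute offsets, inserting bytes early in the stream perturbs
        only nearby chunks instead of shifting every split downstream."""
        B, MOD = 257, 1 << 32
        top = pow(B, window - 1, MOD)      # weight of the byte leaving the window
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            if i - start >= window:        # slide: drop the oldest byte's contribution
                h = (h - data[i - window] * top) % MOD
            h = (h * B + byte) % MOD       # bring in the new byte
            size = i - start + 1
            if size >= max_size or (size >= min_size and (h & mask) == 0):
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])    # trailing partial chunk
        return chunks
    ```

    This content sensitivity is what makes the division of a growing dataset stable: a fixed-size splitter would shift every boundary after an insertion, invalidating all downstream cached results.
    
    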
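    The reuse-and-merge idea can likewise be illustrated with a toy word-count job. The function names and the in-memory cache here are hypothetical stand-ins; HadInc itself persists and reuses sub-results inside the Hadoop framework.

    ```python
    from collections import Counter

    def map_wordcount(chunk: str) -> Counter:
        """Per-chunk map step: count the words in one chunk."""
        return Counter(chunk.split())

    def incremental_wordcount(chunks, cache):
        """Recompute only the chunks missing from `cache`, then merge all
        per-chunk partial counts (the reduce step). Returns the merged
        result and the number of chunks actually recomputed."""
        total, recomputed = Counter(), 0
        for chunk in chunks:
            if chunk not in cache:
                cache[chunk] = map_wordcount(chunk)
                recomputed += 1
            total += cache[chunk]
        return total, recomputed
    ```

    On a second run over a grown dataset, only chunks whose content changed (or that are new) miss the cache, so the cost is proportional to the delta rather than to the whole input.
    
    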

    Original language: English
    Pages (from-to): 102-111
    Number of pages: 10
    Journal: Future Generation Computer Systems
    Volume: 80
    DOI: 10.1016/j.future.2017.09.079
    Publication status: Published - 1 Mar 2018


    Cite this

    Zhang, Liang ; Feng, Yuanyuan ; Shen, Peiyi ; Zhu, Guangming ; Wei, Wei ; Song, Juan ; Ali Shah, Syed Afaq ; Bennamoun, Mohammed. / Efficient finer-grained incremental processing with MapReduce for big data. In: Future Generation Computer Systems. 2018 ; Vol. 80. pp. 102-111.
    @article{8550c1f5b5ab40868ade9b5379f1a7fe,
    title = "Efficient finer-grained incremental processing with MapReduce for big data",
    abstract = "With the continuous development of the Internet and information technology, mobile terminals, wearable devices, and other sources generate tremendous amounts of data. Distributed computing allows such big data to be analyzed at high speed. However, many kinds of big data share an obvious characteristic: the datasets grow incrementally over time, which means distributed computing should focus on incremental processing. A number of systems for incremental data processing are available, such as Google's Percolator and Yahoo's CBP. However, to use these mature frameworks, one must make troublesome changes to a program to adapt it to the environment's requirements. In this paper, we introduce a MapReduce framework, named HadInc, for efficient incremental computation. HadInc is designed for offline scenarios, in which real-time processing is not required and in-memory cluster computing is not applicable. HadInc takes advantage of finer-grained computation and Content-Defined Chunking (CDC) to ensure that the system can still reuse previously computed results, even if the input splits have changed significantly. Instead of recomputing the changed data entirely, HadInc quickly finds the difference between a new split and the old one, and then merges the delta with the old results into the latest result for the new dataset. Meanwhile, the stability of the dataset's division is a key factor in reusing results. To guarantee this stability, we propose a series of novel algorithms based on CDC. We implemented HadInc by extending the Hadoop framework and evaluated it with many experiments, including three specific cases and a practical case. The comparative results show that the proposed HadInc is highly efficient.",
    keywords = "Big data, Finer grained reusing, Incremental processing, Yarn",
    author = "Liang Zhang and Yuanyuan Feng and Peiyi Shen and Guangming Zhu and Wei Wei and Juan Song and {Ali Shah}, {Syed Afaq} and Mohammed Bennamoun",
    year = "2018",
    month = "3",
    day = "1",
    doi = "10.1016/j.future.2017.09.079",
    language = "English",
    volume = "80",
    pages = "102--111",
    journal = "Future Generation Computer Systems: the international journal of grid computing: theory, methods and applications",
    issn = "0167-739X",
    publisher = "Elsevier",

    }

    Efficient finer-grained incremental processing with MapReduce for big data. / Zhang, Liang; Feng, Yuanyuan; Shen, Peiyi; Zhu, Guangming; Wei, Wei; Song, Juan; Ali Shah, Syed Afaq; Bennamoun, Mohammed.

    In: Future Generation Computer Systems, Vol. 80, 01.03.2018, p. 102-111.

    Research output: Contribution to journal › Article

    TY - JOUR

    T1 - Efficient finer-grained incremental processing with MapReduce for big data

    AU - Zhang, Liang

    AU - Feng, Yuanyuan

    AU - Shen, Peiyi

    AU - Zhu, Guangming

    AU - Wei, Wei

    AU - Song, Juan

    AU - Ali Shah, Syed Afaq

    AU - Bennamoun, Mohammed

    PY - 2018/3/1

    Y1 - 2018/3/1

    N2 - With the continuous development of the Internet and information technology, mobile terminals, wearable devices, and other sources generate tremendous amounts of data. Distributed computing allows such big data to be analyzed at high speed. However, many kinds of big data share an obvious characteristic: the datasets grow incrementally over time, which means distributed computing should focus on incremental processing. A number of systems for incremental data processing are available, such as Google's Percolator and Yahoo's CBP. However, to use these mature frameworks, one must make troublesome changes to a program to adapt it to the environment's requirements. In this paper, we introduce a MapReduce framework, named HadInc, for efficient incremental computation. HadInc is designed for offline scenarios, in which real-time processing is not required and in-memory cluster computing is not applicable. HadInc takes advantage of finer-grained computation and Content-Defined Chunking (CDC) to ensure that the system can still reuse previously computed results, even if the input splits have changed significantly. Instead of recomputing the changed data entirely, HadInc quickly finds the difference between a new split and the old one, and then merges the delta with the old results into the latest result for the new dataset. Meanwhile, the stability of the dataset's division is a key factor in reusing results. To guarantee this stability, we propose a series of novel algorithms based on CDC. We implemented HadInc by extending the Hadoop framework and evaluated it with many experiments, including three specific cases and a practical case. The comparative results show that the proposed HadInc is highly efficient.

    AB - With the continuous development of the Internet and information technology, mobile terminals, wearable devices, and other sources generate tremendous amounts of data. Distributed computing allows such big data to be analyzed at high speed. However, many kinds of big data share an obvious characteristic: the datasets grow incrementally over time, which means distributed computing should focus on incremental processing. A number of systems for incremental data processing are available, such as Google's Percolator and Yahoo's CBP. However, to use these mature frameworks, one must make troublesome changes to a program to adapt it to the environment's requirements. In this paper, we introduce a MapReduce framework, named HadInc, for efficient incremental computation. HadInc is designed for offline scenarios, in which real-time processing is not required and in-memory cluster computing is not applicable. HadInc takes advantage of finer-grained computation and Content-Defined Chunking (CDC) to ensure that the system can still reuse previously computed results, even if the input splits have changed significantly. Instead of recomputing the changed data entirely, HadInc quickly finds the difference between a new split and the old one, and then merges the delta with the old results into the latest result for the new dataset. Meanwhile, the stability of the dataset's division is a key factor in reusing results. To guarantee this stability, we propose a series of novel algorithms based on CDC. We implemented HadInc by extending the Hadoop framework and evaluated it with many experiments, including three specific cases and a practical case. The comparative results show that the proposed HadInc is highly efficient.

    KW - Big data

    KW - Finer grained reusing

    KW - Incremental processing

    KW - Yarn

    UR - http://www.scopus.com/inward/record.url?scp=85032788792&partnerID=8YFLogxK

    U2 - 10.1016/j.future.2017.09.079

    DO - 10.1016/j.future.2017.09.079

    M3 - Article

    VL - 80

    SP - 102

    EP - 111

    JO - Future Generation Computer Systems: the international journal of grid computing: theory, methods and applications

    JF - Future Generation Computer Systems: the international journal of grid computing: theory, methods and applications

    SN - 0167-739X

    ER -