On Fault Tolerance for Distributed Iterative Dataflow Processing

Xu, Chen and Holzemer, Markus and Kaul, Manohar and Soto, Juan et al. (2017) On Fault Tolerance for Distributed Iterative Dataflow Processing. IEEE Transactions on Knowledge and Data Engineering, 29 (8). pp. 1709-1722. ISSN 1041-4347

Full text not available from this repository.


Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typically, these analytics are part of a comprehensive workflow, which includes data preparation, model building, and model evaluation. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the entire pipeline. Graph and machine learning analytics are known to incur long runtimes, since they require multiple passes over the data until convergence is reached. Thus, fault tolerance and fast recovery from any intermittent failure are critical for efficient analysis. In this paper, we propose novel fault-tolerance mechanisms for graph and machine learning analytics that run on distributed dataflow systems. We seek to reduce checkpointing costs and shorten failure recovery times. For graph processing, rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner that does not break pipelined tasks. In contrast to conventional approaches to unblocking checkpointing (e.g., those that manage checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating checkpoint creation during iterative graph processing. Moreover, we are able to recover rapidly, via confined recovery, by exploiting the fact that log files exist locally on healthy nodes, thereby avoiding a complete recomputation from scratch. In addition, we propose replica recovery for machine learning algorithms, whereby we employ a broadcast variable that enables us to quickly recover without having to introduce any checkpoints.
In order to evaluate our fault tolerance strategies, we conduct both a theoretical study and experimental analyses us...
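
The replica-recovery idea described in the abstract can be illustrated with a toy sketch. The Python below is not the paper's implementation and does not use any dataflow framework; `train`, `PARTS`, and the `fail_at` parameter are hypothetical names chosen for illustration. It assumes that every worker holds a full copy of the model (standing in for the broadcast variable) and that the input data can be re-read from stable storage, so a failed worker can be restored from a healthy peer's replica without any checkpoint.

```python
from typing import Dict, List, Optional, Tuple

Partition = List[Tuple[float, float]]  # (x, y) training pairs

# Hypothetical toy data following y = 2 * x, split across three workers.
PARTS: List[Partition] = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(4.0, 8.0), (5.0, 10.0)],
]


def train(partitions: List[Partition], iterations: int, lr: float,
          fail_at: Optional[Tuple[int, int]] = None) -> float:
    """Toy data-parallel gradient descent for the model y = w * x.

    fail_at = (iteration, worker) simulates that worker losing its state
    at the start of the given iteration. Assumes at least two workers.
    """
    workers = range(len(partitions))
    # Each worker caches its partition and holds a full replica of the model.
    data: Dict[int, Partition] = {i: list(partitions[i]) for i in workers}
    replicas: Dict[int, float] = {i: 0.0 for i in workers}
    n = sum(len(p) for p in partitions)

    for it in range(iterations):
        if fail_at is not None and fail_at[0] == it:
            failed = fail_at[1]
            # Simulated failure: the worker's replica and cached data are gone.
            del replicas[failed]
            del data[failed]
            # Replica recovery (sketch): copy the model from any healthy
            # worker and re-read the partition from the original input.
            healthy = next(iter(replicas))
            replicas[failed] = replicas[healthy]
            data[failed] = list(partitions[failed])

        # Every worker computes a partial gradient using its own replica.
        partial = [sum((replicas[i] * x - y) * x for x, y in data[i])
                   for i in workers]
        grad = sum(partial) / n
        # All workers apply the same update, so replicas stay identical.
        for i in workers:
            replicas[i] -= lr * grad

    return replicas[0]
```

Under these assumptions, a run with a simulated failure, such as `train(PARTS, 50, 0.05, fail_at=(20, 1))`, should end with the same model as a failure-free run, because the restored replica is an exact copy of a healthy one and no iteration has to be repeated.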

Item Type: Article
Uncontrolled Keywords: Fault tolerance, distributed data processing, iterative computation, graph processing, machine learning analytics
Subjects: Computer science
Divisions: Department of Computer Science & Engineering
Depositing User: Team Library
Date Deposited: 24 May 2019 08:40
Last Modified: 24 May 2019 08:40
URI: http://raiith.iith.ac.in/id/eprint/5317
Publisher URL: http://doi.org/10.1109/TKDE.2017.2690431
