Efficient Fault-tolerance for Iterative Graph Processing on Distributed Dataflow Systems

Xu, C and Holzemer, M and Kaul, Manohar and Markl, V (2016) Efficient Fault-tolerance for Iterative Graph Processing on Distributed Dataflow Systems. In: 32nd IEEE International Conference on Data Engineering (ICDE), MAY 16-20, 2016, Helsinki, FINLAND.

Full text not available from this repository. (Request a copy)


Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.

[error in script]
IITH Creators:
IITH CreatorsORCiD
Item Type: Conference or Workshop Item (Paper)
Subjects: Computer science > Big Data Analytics
Others > Engineering technology
Divisions: Department of Computer Science & Engineering
Depositing User: Team Library
Date Deposited: 03 Oct 2016 08:50
Last Modified: 20 Sep 2017 11:15
URI: http://raiith.iith.ac.in/id/eprint/2798
Publisher URL:
Related URLs:

Actions (login required)

View Item View Item
Statistics for RAIITH ePrint 2798 Statistics for this ePrint Item