The transitive closure of a directed graph answers the reachability question for every pair of nodes in the graph. For instance, if the nodes of a graph represent US cities, and the edges represent railroad lines, then the transitive closure might contain the pair (Sacramento, Boston) if there is a path of (any number of) rail lines which leads from Sacramento to Boston.
Another way to think about the transitive closure is in the case of a social network; the base graph tells you which individuals are friends, but the transitive closure gives you all individuals who are connected by any number of degrees of separation.
Calculating the transitive closure was a pretty well studied problem even back in the 1990s and before, but a recent line of papers from Afrati, Ullman, et al have investigated the properties of several older algorithms when translated to the MapReduce/Hadoop distributed computing environment. I did my thesis at UC Davis on distributed algorithms for the transitive closure; specifically, two algorithms discussed in these papers, called Seminaive and Smart, which I implemented using Hadoop.
The next post will cover the basics of the two algorithms, but first, here’s a brief overview of the MapReduce computing paradigm.
MapReduce divides problems into two phases: the Map phase and the Reduce phase. There will typically be many mappers and many reducers. The input data is split up and distributed among all of the mappers. Each mapper receives a list of (key, value) pairs. For each pair, the mapper emits zero or more (key’, value’ ) pairs (that’s key-prime, value-prime, meaning they can and usually will be different from the input pair).
The output of the Map phase is sent to the Reducers, organized by the MapReduce framework so that all pairs with the same key from the Map phase will be sent to a single Reducer. That way, however the input was originally divided up, each reducer knows that it is seeing all of the data with a given key produced from the map phase. The Reducer then performs its processing, such as an aggregation, and outputs the final result(s).
It definitely helps get the hang of the idea to go through an example MapReduce algorithm. The canonical introductory algorithm is Word Count. Rather than post my own version of how that works in Map Reduce, I’ll just point to the Yahoo! tutorials, which are an excellent starting point for learning more about MapReduce: Word Count Example
For further details about MapReduce’s use in distributed computation, the paper from the Googlers who created it, MapReduce: Simplied Data Processing on Large Clusters, is very good. Hadoop, which I used for my project, is an open source Apache project that implements the MapReduce model, and makes it very easy to get programs up and running on dozens of computers simultaneously.
From my experience, the MapReduce model was not the best choice of environment for finding the transitive closure, or really for any iterative graph computation. Seminaive and Smart both proceed in rounds, where the state of the computation at the previous round is modified to get the state at the end of the current round. With Hadoop, this requires reloading all of the data at each round throughout the cluster, which incurs a pretty significant overhead. Something like Pregel (and its open source implementation, Giraph) or HaLaoop probably would have been far more efficient, but I ended up running out of time to explore those systems in my thesis. Nevertheless, MapReduce and Hadoop are very popular and powerful tools, and implementing the algorithms to run in Hadoop on Amazon’s EC2 cloud computing service was an interesting experience.
The next post will cover the basic workings of the Seminaive and Smart transitive closure algorithm. After that, I will post the source code for their implementations in Pig (a programming language built on top of Hadoop) and then the source code for the algorithms using “real” Hadoop (the Java API), along with instructions to get the programs up and running on a cluster of Amazon EC2 computers.