The Theory behind Map-Reduce
Since the introduction of map-reduce in 2004, this programming model has been adopted by many leading technology companies. The easy-to-use implementations of map-reduce enable extensive data processing and generation "without" a deep understanding of parallel and distributed programming. Nevertheless, reducing the runtime on map-reduce computations can be crucial when processing terabytes of data on clusters with over a thousand machines. In this work, we will identify cost drivers in map-reduce and discuss an approach to designing efficient map-reduce algorithms. In particular, we will focus on the tradeoff between a high degree of parallelism and low communication cost. Furthermore, we will examine different algorithms that solve the Hamming distance problem using map-reduce and analyze them theoretically and experimentally.