Paper Review Series: MapReduce
MapReduce is a programming paradigm built around two functions, Map and Reduce, in terms of which many real-world problems can be expressed. The advantage is that a problem framed in this paradigm is parallelized automatically. The implementation takes care of distributing the computation across commodity servers, so the programmer can write the logic without knowing anything about distributed systems. This paper discusses the problems MapReduce addresses, its implementation, and how it can be used.
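To make the two functions concrete, here is a minimal single-process sketch of word counting. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for illustration and are not the paper's API; `run_mapreduce` is only a sequential stand-in for the framework, which would run the same steps in parallel across many machines.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key into a single result."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

# Hypothetical input: (document name, contents) pairs.
documents = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The user only writes `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the real system would distribute.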
The MapReduce interface is very simple to use. Users basically only have to write their business logic in the Map and Reduce functions, which are fairly straightforward as well. On the other hand, a relatively involved job might require several map and reduce functions, which is not only cumbersome to write but also raises performance concerns: one slow map function can hold up every subsequent function that depends on its output.
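As a rough illustration of the chaining problem, the sketch below groups words by how often they occur, which takes two map/reduce passes. All names here (`run_job`, `count_map`, `invert_map`, and so on) are hypothetical, and `run_job` is a sequential stand-in for the framework, not its real interface.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Sequential stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

# Job 1: count how often each word occurs.
def count_map(_doc_name, text):
    for word in text.split():
        yield word, 1

def count_reduce(word, ones):
    return word, sum(ones)

# Job 2: group words by their count, using job 1's output as input.
def invert_map(word, count):
    yield count, word

def invert_reduce(count, words):
    return count, sorted(words)

docs = [("a.txt", "x y x"), ("b.txt", "z y x")]
stage1 = run_job(docs, count_map, count_reduce)
# Job 2 cannot start until job 1 has fully finished.
stage2 = run_job(stage1, invert_map, invert_reduce)
print(stage2)
```

Even in this toy version the second job has to wait for the complete output of the first, so a slow first stage delays everything after it.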
The system runs on commodity hardware, and fault tolerance and failure recovery are simple. Basically, when a worker fails, its tasks are re-executed, except for reduce tasks that have already completed, since their output is already stored in the global file system. If the master fails, the computation is aborted and clients can check for this and decide whether to retry. Together, these properties make the system very scalable.
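The worker-failure rule can be sketched as a small piece of master-side bookkeeping. This is a simplified illustration under my own assumptions, not the paper's actual data structures: the `Task` class, its field names, and `handle_worker_failure` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker the task is assigned to, if any

def handle_worker_failure(tasks, failed_worker):
    """Reset the failed worker's tasks to idle so they can be rescheduled.

    Completed reduce tasks are skipped: their output is already in the
    global file system and survives the failure.
    """
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        if task.kind == "reduce" and task.state == "completed":
            continue
        task.state, task.worker = "idle", None
        rescheduled.append(task)
    return rescheduled

# Hypothetical task table at the moment worker "w1" is declared dead.
tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w2"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
]
rescheduled = handle_worker_failure(tasks, "w1")
print([(t.kind, t.state) for t in tasks])
```

Note that the completed map task is reset too: map output lives on the failed worker's local disk, so it is lost along with the machine.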
One thing that is insufficiently discussed is the job scheduling and resource allocation policy, both of which are very important in a distributed system.
