MapReduce
— Processes data in a distributed manner.
— It has two phases:
— Map
— Reduce
— Both Map and Reduce work on key-value pairs.
— The input to both Map and Reduce is a key-value pair, and the output of Reduce is also a key-value pair.
— MapReduce is a programming paradigm: a problem is solved in a particular way, using a map phase and a reduce phase.
— Traditional programming works when the data is kept on a single machine.
— When the data is spread across multiple machines in a distributed manner, a new programming model is required
to solve the problem; traditional models do not work well.
— Map Phase
— The piece of code that runs in parallel on each block of data, on the data node that holds it, is called the map phase.
— Hadoop works on the principle of data locality, i.e., the code goes to the data.
— Reduce Phase
— The output of the mappers is sent to another machine for aggregation; this is called the reduce phase.
— The output of the map phase is the intermediate result.
— The output of the reduce phase is the final (aggregated) result.
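The two phases can be sketched with a word count in plain Python. This is a minimal single-machine simulation, not Hadoop code: the names `map_phase`, `reduce_phase`, and the sample `blocks` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word, 1) for word in block.split()]

def reduce_phase(key, values):
    """Aggregate all values for one key into a single (key, total) pair."""
    return (key, sum(values))

# Two blocks stand in for data stored on two different machines.
blocks = ["Hello Hi Hello", "Hi Hadoop"]

# Map: runs independently on each block (in parallel on a real cluster).
intermediate = [pair for block in blocks for pair in map_phase(block)]

# Group values by key (the framework does this between map and reduce).
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce: one call per key produces the final aggregated result.
final = dict(reduce_phase(k, v) for k, v in grouped.items())
print(final)  # {'Hello': 2, 'Hi': 2, 'Hadoop': 1}
```

The list `intermediate` plays the role of the map output, and `final` is the aggregated result that the reduce phase would write out.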
— Shuffling
— Moving data from the mapper machines to the reducer machine.
— Sorting
— Data is sorted based on the key.
— The sort is completed on the reducer machine, which merges the sorted map outputs it receives.
— Shuffling and sorting are taken care of by the framework.
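What the framework does between the two phases can be imitated in a few lines of Python. This is only a sketch on one machine, with made-up mapper outputs; in Hadoop the shuffle moves data over the network and the user writes none of this code.

```python
from itertools import groupby
from operator import itemgetter

# Intermediate output from two mapper machines (illustrative data).
mapper_outputs = [
    [("Hi", 1), ("Hello", 1)],
    [("Hello", 1), ("Hadoop", 1)],
]

# Shuffle: collect the pairs from every mapper onto the reducer machine.
shuffled = [pair for output in mapper_outputs for pair in output]

# Sort: order the pairs by key so that equal keys sit next to each other.
shuffled.sort(key=itemgetter(0))

# Group: the reducer sees each key once, with the list of its values.
reducer_input = [
    (key, [value for _, value in pairs])
    for key, pairs in groupby(shuffled, key=itemgetter(0))
]
print(reducer_input)  # [('Hadoop', [1]), ('Hello', [1, 1]), ('Hi', [1])]
```

Sorting by key is what lets the reducer process one key at a time: all values for `Hello` arrive together.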
— Mappers provide more parallelism, because they run on every block of data at the same time.
— Reducer Phase: Aggregation
— If there is no aggregation, we can set the number of reducers to 0.
— By default there is 1 reducer.
— Partitioned data from the mapper output goes to that reducer.
— Partitioning comes into the picture when we have more than one reducer.
— It decides which key-value pair goes to which reducer.
— By default, a system-defined hash function decides which key-value pair goes to which partition.
— The hash function is consistent:
(Hello,1) — Reducer 1
(Hi,1) — Reducer 2
(Hello,1) — Reducer 1
— The same key always goes to the same reducer.
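The idea of hash partitioning can be sketched in Python as "hash of the key, modulo the number of reducers". The function `hash_partition` is a hypothetical name, and `zlib.crc32` is only a stand-in for the framework's own hash function; it is used here because it gives the same value on every run.

```python
import zlib

def hash_partition(key, num_reducers):
    """Return the reducer index (0-based) for a key using a deterministic hash."""
    # crc32 stands in for the system-defined hash function (an assumption).
    # Python's built-in hash() is avoided: for str it is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("Hello", 1), ("Hi", 1), ("Hello", 1)]
for key, value in pairs:
    print(key, value, "-> reducer index", hash_partition(key, 2))
```

Both `("Hello", 1)` pairs get the same index because the result depends only on the key. Note that the indices here start at 0, while the notes above count reducers from 1.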
— If you do not want the system-defined hash function,
e.g., all words shorter than 4 characters should go to reducer 1 and
all words of 4 or more characters should go to reducer 2,
then you can use custom partition logic.
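The word-length rule above might look like this as a Python sketch. `length_partition` and the sample pairs are hypothetical; in Hadoop the same rule would be written in Java inside a custom `Partitioner` class.

```python
from collections import defaultdict

def length_partition(word):
    """Send words shorter than 4 characters to reducer 1, the rest to reducer 2."""
    return 1 if len(word) < 4 else 2

pairs = [("Hi", 1), ("Hello", 1), ("Map", 1), ("Reduce", 1)]

# Route each key-value pair to the reducer chosen by the custom logic.
by_reducer = defaultdict(list)
for word, count in pairs:
    by_reducer[length_partition(word)].append((word, count))

print(dict(by_reducer))
# {1: [('Hi', 1), ('Map', 1)], 2: [('Hello', 1), ('Reduce', 1)]}
```

Like the default hash function, this rule depends only on the key, so the same word still always reaches the same reducer.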
— The combiner does local aggregation on the mapper machine.
— Most of the work should be done on the mapper end.
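The effect of a combiner can be shown with a small Python sketch. The names `map_phase` and `combine` and the sample block are hypothetical; the point is only that summing on the mapper side shrinks the data that has to be shuffled.

```python
from collections import Counter

def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word, 1) for word in block.split()]

def combine(pairs):
    """Locally sum the counts for each key on the mapper machine."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

block = "Hello Hi Hello Hello"
mapped = map_phase(block)
combined = combine(mapped)

print(len(mapped), "pairs before combining")   # 4 pairs before combining
print(len(combined), "pairs after combining")  # 2 pairs after combining
print(combined)                                # [('Hello', 3), ('Hi', 1)]
```

Here the combiner applies the same summing logic as the reducer would, which works because addition gives the same total whether it is done in one step or in several partial steps.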