Map Reduce

SIRIGIRI HARI KRISHNA
2 min readOct 18, 2022


— Processing data in a distributed manner.
— It has 2 phases:
— Map
— Reduce
— Both Map and Reduce work on key-value pairs.
— The input to Map and to Reduce is key-value pairs, and the output of Reduce is key-value pairs.
— MapReduce is a programming paradigm that solves a problem in a particular way, using a map phase and a reduce phase.
— Traditional programming works when the data is kept on a single machine.
— When the data is kept on multiple machines in a distributed manner, a new programming model is required to solve the problem; none of our traditional models works well.
— Map Phase
— A piece of code running in parallel on each block on the data nodes is called the map phase.
— Hadoop works on the principle of data locality, i.e., the code goes to the data.
— Reduce Phase
— The output of these mappers is sent to another machine; this is called the reduce phase.
— The output of the map phase is the intermediate result.
— The output of the reduce phase is the final (aggregated) result.
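The two phases above can be sketched without any framework. This is a minimal word-count example in Python; the function names and input lines are illustrative, not Hadoop's API.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit an intermediate (key, value) pair for every word.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce: aggregate all values for one key into the final result.
    return (key, sum(values))

lines = ["hello hi", "hello"]

# Collect the intermediate result of the map phase, grouped by key.
intermediate = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        intermediate[key].append(value)

# The reduce phase produces the final, aggregated result.
final = dict(reduce_phase(k, v) for k, v in intermediate.items())
print(final)  # {'hello': 2, 'hi': 1}
```

In a real cluster, each `map_phase` call runs in parallel on a different block of the file, and the grouping step is done by the framework.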
— Shuffling
— Moving data from the mapper machines to the reducer machines.
— Sorting
— Data is sorted based on the key.
— Sorting happens on the reducer machine.
— Shuffle and sort are taken care of by the framework.
— Mappers give more parallelism.
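What the framework does between the two phases can be sketched like this: sort the mapper output by key, then group it so that each key's values arrive together at the reducer. The sample pairs are illustrative.

```python
from itertools import groupby

# Pairs emitted by the mappers, in arbitrary arrival order.
mapper_output = [("hi", 1), ("hello", 1), ("hello", 1)]

# Sort by key on the reducer side, then group each key's values.
shuffled = sorted(mapper_output, key=lambda kv: kv[0])
grouped = [(key, [v for _, v in pairs])
           for key, pairs in groupby(shuffled, key=lambda kv: kv[0])]
print(grouped)  # [('hello', [1, 1]), ('hi', [1])]
```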
— Reducer Phase — Aggregation
— If there is no aggregation, we can set the number of reducers to 0.
— By default, there is 1 reducer.
— The partitioned data from the mapper output goes to that reducer.
— Partitioning comes into the picture when we have more than one reducer.
— It tells which key-value pair goes to which reducer.
— By default, a system-defined hash function tells which key-value pair goes to which partition.
— The hash function is consistent:
(Hello,1) — Reducer 1
(Hi,1) — Reducer 2
(Hello,1) — Reducer 1
— The same key will always go to the same reducer.
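The consistency of the default partitioner can be sketched as hashing the key modulo the number of reducers. Here `crc32` stands in for the system-defined hash (Hadoop's actual function differs), so which reducer a given word lands on is illustrative; only the consistency matters.

```python
from zlib import crc32

NUM_REDUCERS = 2

def partition(key):
    # Deterministic hash of the key, mapped onto a reducer number.
    return crc32(key.encode()) % NUM_REDUCERS

# Consistent: the same key always goes to the same reducer.
assert partition("Hello") == partition("Hello")
assert 0 <= partition("Hi") < NUM_REDUCERS
```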
— If you don't want the system-defined hash function (e.g., all words shorter than 4 characters should go to reducer 1, and all words of 4 or more characters should go to reducer 2), we can use custom partition logic.
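The custom partition logic described above can be sketched as a plain function of the key; in Hadoop this would live in a custom `Partitioner`, but the rule itself is just:

```python
def custom_partition(word):
    # Words shorter than 4 characters go to reducer 1, the rest to reducer 2.
    return 1 if len(word) < 4 else 2

print(custom_partition("Hi"))     # 1
print(custom_partition("Hello"))  # 2
```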
— Combiner: does local aggregation on the mapper machine.
— Most of the work should be done on the mapper end.
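The combiner idea can be sketched as a "mini reduce" run on one mapper's output before the shuffle, so fewer pairs cross the network. The sample pairs are illustrative.

```python
from collections import Counter

# Output of a single mapper, before shuffling.
mapper_output = [("hello", 1), ("hello", 1), ("hi", 1)]

# Combiner: aggregate locally; 3 pairs shrink to 2 before the shuffle.
combined = list(Counter(k for k, _ in mapper_output).items())
print(combined)  # [('hello', 2), ('hi', 1)]
```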
