BIGDATA

SIRIGIRI HARI KRISHNA
4 min read · Oct 18, 2022


BIGDATA BASICS #1
As per IBM, Big Data is characterized by 3 V's, though many more V's have been proposed since.
1) Volume
Scale of data.
About 2.5 quintillion bytes are generated every day (traditional systems can't handle data at this scale).
2) Variety
Different forms of data:
Structured
RDBMS databases (Oracle, MySQL)
Semi-structured
CSV, XML, JSON
Unstructured
Audio, video, images, log files
3) Velocity
Speed at which data arrives:
900 million photos on Facebook
600 million tweets on Twitter, etc.
4) Veracity
Uncertainty of data:
poor-quality, unclean data
Why Big Data?
To process huge amounts of data that traditional systems are not capable of processing.
Big Data is the problem statement: the workloads described above are ones a traditional database cannot handle.
Hadoop is a framework for solving Big Data problems.
Big Data Requirements
Process
Process huge amounts of data that traditional systems are not capable of processing.
Store
To process huge amounts of data, we first need to store it; a traditional database is not capable of storing data at this scale.
Scale
Scale easily as the data grows.
Two ways to build a system
Monolithic system
A single powerful machine with a lot of resources.
** Hits a threshold limit at some point.
Distributed system
Many smaller machines working together.
Each machine is called a node; together, all the nodes form a cluster.
** No threshold limit: keep adding nodes to increase resources.
That is why all good Big Data systems are based on a distributed architecture.
Resources
RAM: memory
CPU: compute
Hard disk: storage
History
2003: Google released a paper describing how to store large datasets.
This paper was called the Google File System (GFS) paper.
2004: Google released another paper describing how to process large datasets.
This paper was called MapReduce.
2006: Yahoo took these papers and implemented them.
The implementation of GFS was named HDFS (Hadoop Distributed File System).
The implementation of MapReduce kept the name MapReduce.
2009: Hadoop came under the Apache Software Foundation and became open source.
2013: Apache released Hadoop 2.0, which brought major performance enhancements.
Hadoop 1
HDFS, MapReduce
Hadoop 2
HDFS, MapReduce, YARN
YARN: Yet Another Resource Negotiator.
It acts like an operating system for the cluster and is mainly responsible for resource management.
Hadoop Core
HDFS, MR, YARN
HDFS ARCHITECTURE
— Data nodes are made of commodity (cheap) hardware.
— The name node is made of high-quality hardware.
— The name node holds the block mapping information.
— Data nodes store the actual data in the form of blocks.
— Replication factor: 3 by default.
— If a data node fails, the data won't be lost.
— The replicas of each block are placed on other data nodes.
— Heartbeat: every 3 seconds by default.
— Every data node sends a heartbeat to the name node every 3 seconds.
— If the name node misses 10 consecutive heartbeats, it assumes that the data node is dead or running very slowly.
— In Hadoop version 1 the default block size is 64 MB.
— In Hadoop version 2 the default block size is 128 MB.
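To get a feel for these defaults, here is a small sketch in plain Python (not Hadoop code; the 650 MB file size is just an assumed example) that computes how many blocks and replicas a file occupies:

```python
import math

BLOCK_SIZE_MB = 128  # Hadoop 2 default block size
REPLICATION = 3      # default replication factor

def block_stats(file_size_mb):
    """Number of HDFS blocks, total block replicas, and raw storage used."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_storage_mb

# A hypothetical 650 MB file: 6 blocks (5 full plus 1 partial of 10 MB),
# 18 block replicas across the cluster, and 1950 MB of raw storage.
print(block_stats(650))  # (6, 18, 1950)
```

Note that the last block only occupies its actual 10 MB on disk; HDFS does not pad blocks to the full 128 MB.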
Name Node Failure (Active)
— In Hadoop v1, the name node was a single point of failure.
— In Hadoop v2, the name node is no longer a single point of failure.
— Analogy: the name node is like the index page of a book.
— If the name node fails, there is no access to the metadata.
— No metadata means no access to the cluster.
— We would lose the block mapping information.
— The name node's metadata is stored in memory.
— There are two things that help us make sure there is no downtime involved:
fsimage
— A snapshot of the in-memory file system at a given moment,
i.e. the block mapping information (which block is stored on which machine, which block is free, etc.).
edit logs (edits)
— All the new changes or transactions that happen after the snapshot is taken go into the edit logs file.
— Merging fsimage + edit logs gives you the latest fsimage.
— Merging fsimage + edit logs is a compute-heavy process.
— The name node should not take on the job of merging these two files; it is busy doing a lot of other things.
Secondary NN (Passive)
— The secondary name node takes care of merging the two files (fsimage + edit logs) and producing the new, updated fsimage.
— Merging the fsimage and edit logs is called checkpointing.
— The fsimage and edit logs are stored in a shared folder.
— Both the name node and the secondary name node have access to the shared folder.
— This process repeats at a regular interval (configurable via dfs.namenode.checkpoint.period; one hour by default).
— Once the updated fsimage is generated, it replaces the previous fsimage and the edit logs are reset to empty.
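Conceptually, checkpointing just replays the edit log on top of the last snapshot. Here is a toy sketch in plain Python (purely illustrative; the block and node names are made up, and HDFS does not store its metadata this way):

```python
# Toy model of checkpointing: fsimage is a snapshot of the block mapping,
# and the edit log records every change made since that snapshot.
fsimage = {"block_1": "datanode_A", "block_2": "datanode_B"}

edit_log = [
    ("add", "block_3", "datanode_C"),  # a new block was written
    ("delete", "block_1", None),       # a file was deleted
]

def checkpoint(fsimage, edit_log):
    """Replay the edit log on the snapshot (the secondary name node's job)."""
    new_image = dict(fsimage)
    for op, block, location in edit_log:
        if op == "add":
            new_image[block] = location
        elif op == "delete":
            new_image.pop(block, None)
    return new_image, []  # the edit log is reset to empty after the merge

fsimage, edit_log = checkpoint(fsimage, edit_log)
print(fsimage)  # {'block_2': 'datanode_B', 'block_3': 'datanode_C'}
```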
Rack Awareness Mechanism
— A rack is a group of systems placed together; a cluster spans multiple racks, which may sit in different locations.
— The name node stores a data block on a data node in some rack.
— Replicas of the data block are then forwarded to other locations.
— Forwarding data blocks within the same rack requires only a small amount of network bandwidth
— and involves fewer input-output operations.
— If we place all replicas in a single rack, there is a high chance of data loss if that rack goes down.
— Placing each replica on a different rack is also not an ideal solution:
— it takes a lot of time to write from one geographical location to another.
— The balanced approach is to place the replicas across two different racks:
— one replica on one rack and the other two on a different rack, or vice versa.
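A minimal sketch of this balanced placement policy in plain Python (illustrative only; the rack and node names are hypothetical, and the real HDFS placement policy has more rules than this):

```python
import random

# Hypothetical cluster layout: rack name -> data nodes on that rack.
cluster = {
    "rack_1": ["dn_1", "dn_2", "dn_3"],
    "rack_2": ["dn_4", "dn_5", "dn_6"],
    "rack_3": ["dn_7", "dn_8", "dn_9"],
}

def place_replicas(local_rack, replication=3):
    """One replica on the writer's rack, the remaining two together on one other rack."""
    first = random.choice(cluster[local_rack])
    remote_rack = random.choice([r for r in cluster if r != local_rack])
    others = random.sample(cluster[remote_rack], replication - 1)
    return [first] + others

print(place_replicas("rack_1"))  # e.g. ['dn_2', 'dn_7', 'dn_9'] -- two racks in total
```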
Block Report
— Each data node sends a block report to the name node at a fixed frequency, listing the blocks it holds and indicating whether any of them are corrupted.
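Illustration only (plain Python with made-up names; real HDFS uses CRC checksums and a binary protocol): a block report is essentially the list of blocks a data node holds, plus their health:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Hypothetical blocks stored on one data node: block id -> contents on disk.
stored_blocks = {
    "block_2": b"some block contents",
    "block_3": b"contents that got corrupted on disk",
}

# Checksums recorded when each block was originally written.
known_good = {
    "block_2": checksum(b"some block contents"),
    "block_3": checksum(b"the original block contents"),
}

# The block report: every block this node holds, with its health status.
report = {
    block_id: "OK" if checksum(data) == known_good[block_id] else "CORRUPT"
    for block_id, data in stored_blocks.items()
}
print(report)  # {'block_2': 'OK', 'block_3': 'CORRUPT'}
```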
Installation Modes in Hadoop
Local mode
— Only MR works.
— Useful for testing MR jobs.
Pseudo-distributed mode
— HDFS, MR, and YARN all run on a single machine.
— No parallelism.
Fully distributed mode
— Every machine runs HDFS, MR, and YARN.
The XML files below hold all the configuration, such as block size and replication:
hdfs-site.xml
mapred-site.xml
core-site.xml
yarn-site.xml
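For example, a minimal hdfs-site.xml that sets the two values discussed above explicitly (the values shown happen to be the usual Hadoop 2 defaults) might look like this:

```xml
<configuration>
  <!-- Replication factor: how many copies of each block to keep. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size in bytes: 134217728 bytes = 128 MB. -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
```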
FSCK
Use the fsck command to see metadata in HDFS.
fsck stands for file system check.
hdfs fsck /user/crazycodingteam6223/loan_stats_3c.csv -files -blocks -locations
The above command prints the block information and replication information,
along with the IP addresses of the data nodes where the blocks are kept.
