Compression Techniques in Hive
Compression Techniques:
++++++++++++++++++++++
Compression helps us save storage space.
It helps us process data faster.
It reduces I/O cost.
Compression and decompression come at a cost: the time taken to compress the data and the time taken to decompress it.
Compared with the I/O gains, however, this additional time is usually negligible.
Some compression codecs are optimized for storage, while others are optimized for speed.
If we want a higher compression ratio, we have to spend more time on compression.
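The ratio-versus-time trade-off can be observed with a small experiment. The sketch below is only an illustration, not a Hive benchmark: it uses Python's standard `zlib` module (the same DEFLATE algorithm GZip is built on) and a made-up, log-like sample. The sample data and the `measure` helper are assumptions introduced for this example.

```python
import time
import zlib

# Hypothetical sample: repetitive, log-like text used only for illustration.
data = b"".join(
    b"2024-01-01 INFO user=%d action=click page=/home\n" % (i % 1000)
    for i in range(50_000)
)

def measure(level):
    """Compress `data` at the given zlib level; return (compressed_size, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the original bytes.
    assert zlib.decompress(compressed) == data
    return len(compressed), elapsed

fast_size, fast_time = measure(1)    # level 1: favors speed
small_size, small_time = measure(9)  # level 9: favors compression ratio

print(f"original: {len(data)} bytes")
print(f"level 1 : {fast_size} bytes in {fast_time:.4f}s")
print(f"level 9 : {small_size} bytes in {small_time:.4f}s")
```

On typical text the higher level produces a smaller output but takes longer. The exact numbers depend on the data and the machine.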
Snappy:
++++++
— Very fast compression codec
— In terms of compression ratio it is not very good
— Very widely used
— Optimized for speed rather than storage.
— Snappy by default is not splittable
— Usually used with container-based file formats such as Avro, ORC, and Parquet
— Developed by Google
— Processing performance with Snappy can be significantly better than with other compression codecs
LZO
+++
— It is optimized for speed, just like Snappy.
— It is splittable (once the files are indexed)
— Lower compression ratio
— Good for text files
GZip
++++
— Optimized more for storage (about 2.5 times better compression than Snappy)
— Processing speed is slow (about 2 times slower than Snappy)
— It is not splittable.
— Used with container-based file formats
— One reason GZip is sometimes slower than Snappy for processing is that GZip-compressed files take up fewer blocks, so fewer tasks are assigned to process the same data. For this reason, using smaller blocks with GZip can lead to better performance.
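The block-count argument above can be worked through with simple arithmetic. This is a simplified sketch that assumes one task per block, a 128 MB block size, and a hypothetical 1000 MB Snappy-compressed dataset. It reuses the 2.5x ratio quoted above, and the `tasks_needed` helper is made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size, chosen for illustration

def tasks_needed(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Simplified model: one processing task per block."""
    return math.ceil(file_size_mb / block_size_mb)

snappy_size_mb = 1000                # hypothetical dataset compressed with Snappy
gzip_size_mb = snappy_size_mb / 2.5  # same data with GZip, using the 2.5x figure above

snappy_tasks = tasks_needed(snappy_size_mb)          # ceil(1000 / 128) = 8
gzip_tasks = tasks_needed(gzip_size_mb)              # ceil(400 / 128)  = 4
gzip_small_blocks = tasks_needed(gzip_size_mb, 64)   # ceil(400 / 64)   = 7

print(snappy_tasks, gzip_tasks, gzip_small_blocks)
```

Under these assumptions the GZip data is processed by half as many parallel tasks. Halving the block size brings the task count back close to the Snappy case.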
BZip2
+++++
— Optimized for excellent storage, but significantly slower than other compression codecs such as Snappy in terms of processing performance.
— Splittable
— BZip2 normally compresses about 9% better than GZip in terms of storage
— BZip2 is about 10 times slower than GZip
— For this reason, it is not an ideal codec for Hadoop storage unless your primary need is reducing the storage footprint.
— BZip2 provides a very good compression ratio (less disk space), which makes it suitable for archival purposes
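To compare GZip and BZip2 on your own data, a rough measurement along the following lines can help. This is a minimal sketch using Python's standard `gzip` and `bz2` modules rather than the Hadoop codecs, so it will not reproduce the figures quoted above. The sample rows and the `run` helper are assumptions made for this example.

```python
import bz2
import gzip
import time

# Hypothetical sample: CSV-like rows used only for illustration.
data = b"".join(
    b"row %d,some repeated text value,2024-01-01\n" % i
    for i in range(50_000)
)

def run(name, compress, decompress):
    """Time one compression pass and verify the round trip."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == data  # the codec must be lossless
    return {"codec": name, "bytes": len(packed), "seconds": elapsed}

results = [
    run("gzip", gzip.compress, gzip.decompress),
    run("bzip2", bz2.compress, bz2.decompress),
]

for r in results:
    ratio = len(data) / r["bytes"]
    print(f'{r["codec"]:6s} {r["bytes"]:8d} bytes  ratio {ratio:5.1f}x  {r["seconds"]:.4f}s')
```

Running this against a representative sample of the real data gives a better basis for choosing a codec than general rules of thumb.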