Sharing is Caring: Hive - Partitioning vs Bucketing

Hive-Structure level optimization

In Hive Structure level optimization can be done in 2 ways:
1)Partitioning
2) Bucketing
Both partitioning and bucketing are used to divide the data into smaller parts.

𝟏) 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴:

→If we have low cardinality(less number of distinct values) on the desired column then we can perform partitioning on that column.
→Each partition is a directory/folder in HDFS.
The data is by default stored in the ‘/user/hive/warehouse’ directory
If there is a query
select * from customers where country = ‘CA’
→To optimize the above query we can do partition on the country column while table creation so that when data is loaded into the table it will be divided based on a country column.
→As each partition is a folder in HDFS. When partitioning data the directory structure will be like this:
‘/user/hive/warehouse/test.db/partition_table/country=’CA’’
‘/user/hive/warehouse/test.db/partition_table/country=’NY’’
Here the no of partitions is based on the no of unique column values i.e

𝗡𝗼 𝗼𝗳 𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀 = 𝗡𝗼 𝗼𝗳 𝘂𝗻𝗶𝗾𝘂𝗲 𝗰𝗼𝗹𝘂𝗺𝗻 𝘃𝗮𝗹𝘂𝗲𝘀

Here if we have 10 different countries then the no of partitions will be 10.When we run the above query it will not scan the entire data present in HDFS, it will only scan the directory which has the required country value.

→If we try to do partitioning on a column where cardinality is high then a lot of folders will be created.
→In this case if we do partition on the customer_id column then more than 1 million folders will be created.
→This will lead to burdening of the Name node as it needs to handle all 1 million metadata folders. This will lead to a decrease in performance.
→In order to overcome this we go for bucketing.

𝟐)𝐁𝐮𝐜𝐤𝐞𝐭𝐢𝐧𝐠:

→Bucketing is done when the cardinality of the column is high.
→Bucketing divides the data based on the hash function

If we want to run the query:
select * from customers where customer_id=20000;
since the cardinality is high for the above column we should go for Bucketing.
→In the above case we need to tell the no of buckets in advance when creating the table.
→Since we have fixed no of buckets it is easy to find which bucket holds which value as the query will run internally and find the hash value of the bucketed column and will search in the respective bucket.

Each 𝐛𝐮𝐜𝐤𝐞𝐭 𝐢𝐬 𝐚 𝐟𝐢𝐥𝐞 whereas each 𝐩𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧 𝐢𝐬 𝐚 𝐟𝐨𝐥𝐝𝐞𝐫.

𝗡𝗼𝘁𝗲 :
We can do partitioning->bucketing (folder can contain file) whereas reverse is not possible i.e
bucketing->partitioning(file cannot contain folder

Sharing is Caring

Tuesday, September 13, 2022

Hive - Partitioning vs Bucketing

No comments:

Post a Comment

Spark- Window Function

Report Abuse