Hive-Structure level optimization
In Hive Structure level optimization can be done in 2 ways:
1)Partitioning
2) Bucketing
Both partitioning and bucketing are used to divide the data into smaller parts.
๐) ๐ฃ๐ฎ๐ฟ๐๐ถ๐๐ถ๐ผ๐ป๐ถ๐ป๐ด:
โIf we have low cardinality(less number of distinct values) on the desired column then we can perform partitioning on that column.
โEach partition is a directory/folder in HDFS.
The data is by default stored in the โ/user/hive/warehouseโ directory
If there is a query
select * from customers where country = โCAโ
โTo optimize the above query we can do partition on the country column while table creation so that when data is loaded into the table it will be divided based on a country column.
โAs each partition is a folder in HDFS. When partitioning data the directory structure will be like this:
โ/user/hive/warehouse/test.db/partition_table/country=โCAโโ
โ/user/hive/warehouse/test.db/partition_table/country=โNYโโ
Here the no of partitions is based on the no of unique column values i.e
๐ก๐ผ ๐ผ๐ณ ๐ฝ๐ฎ๐ฟ๐๐ถ๐๐ถ๐ผ๐ป๐ = ๐ก๐ผ ๐ผ๐ณ ๐๐ป๐ถ๐พ๐๐ฒ ๐ฐ๐ผ๐น๐๐บ๐ป ๐๐ฎ๐น๐๐ฒ๐
Here if we have 10 different countries then the no of partitions will be 10.When we run the above query it will not scan the entire data present in HDFS, it will only scan the directory which has the required country value.
โIf we try to do partitioning on a column where cardinality is high then a lot of folders will be created.
โIn this case if we do partition on the customer_id column then more than 1 million folders will be created.
โThis will lead to burdening of the Name node as it needs to handle all 1 million metadata folders. This will lead to a decrease in performance.
โIn order to overcome this we go for bucketing.
๐)๐๐ฎ๐๐ค๐๐ญ๐ข๐ง๐ :
โBucketing is done when the cardinality of the column is high.
โBucketing divides the data based on the hash function
If we want to run the query:
select * from customers where customer_id=20000;
since the cardinality is high for the above column we should go for Bucketing.
โIn the above case we need to tell the no of buckets in advance when creating the table.
โSince we have fixed no of buckets it is easy to find which bucket holds which value as the query will run internally and find the hash value of the bucketed column and will search in the respective bucket.
Each ๐๐ฎ๐๐ค๐๐ญ ๐ข๐ฌ ๐ ๐๐ข๐ฅ๐ whereas each ๐ฉ๐๐ซ๐ญ๐ข๐ญ๐ข๐จ๐ง ๐ข๐ฌ ๐ ๐๐จ๐ฅ๐๐๐ซ.
๐ก๐ผ๐๐ฒ :
We can do partitioning->bucketing (folder can contain file) whereas reverse is not possible i.e
bucketing->partitioning(file cannot contain folder
No comments:
Post a Comment