Apache Hive is an open-source data warehouse system used to query and analyze large datasets. Data in Apache Hive can be organized into three parts: tables, partitions, and buckets.
What is Bucketing in Hive?
Bucketing in Hive is the concept of breaking data down into smaller parts, known as buckets, to give extra structure to the data so it can be used for more efficient queries.
You’ve seen that partitioning speeds up queries by segregating Hive table data into multiple directories, but it works well only when there is a limited number of partition values. What if partitioning the table would result in a very large number of partitions? This is where the concept of bucketing comes in.
When a column has high cardinality, we can’t practically partition on it. A very high number of partitions would generate too many Hadoop files and directories, increasing the load on the NameNode, which has to keep the metadata of every file in memory, and that degrades performance.
In simple words, you can use bucketing when you need to run queries on columns with a huge number of distinct values, which makes partitioning impractical.
What Do Buckets Do?
- The concept of bucketing is based on the hashing technique.
- The modulus of the hashed column value and the number of required buckets is calculated (say, F(x) % 3).
- Based on the resulting value, the row is stored in the corresponding bucket.
- Records with the same value in the bucketed column are stored in the same bucket.
- Bucketing requires the CLUSTERED BY clause to divide a table into buckets.
- Bucketing enables more efficient queries by imposing extra structure on the table; the difference between the file structures of partitioning and bucketing is shown below.
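The hash-and-modulo step above can be sketched in Python. This is a simplified model, not Hive's actual implementation: Hive uses its own Java hash functions, though for integer columns the hash is the value itself.

```python
def bucket_for(value, num_buckets=3):
    """Simplified model of Hive's bucket assignment: F(x) % num_buckets.

    For integer columns Hive's hash is the value itself; for other types
    Hive uses its own Java-based hashes, so this is only a sketch.
    """
    return hash(value) % num_buckets

# Rows with the same bucketing-column value always land in the same bucket.
assert bucket_for(42) == bucket_for(42)

# Integer ids spread across the three buckets by their remainder mod 3.
print([bucket_for(i) for i in [1, 2, 3, 4, 5, 6]])  # [1, 2, 0, 1, 2, 0]
```

Because the mapping is deterministic, the same key always resolves to the same bucket file, which is what later makes bucket-wise joins and sampling possible.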
Hive Bucketing Example
Apache Hive supports bucketing natively, as documented in the Hive wiki. The steps for creating a bucketed table are as follows:
- Select the database in which to create the table.
- Create a dummy table to store the data.
- Load the data into the dummy table.
- Enable bucketing in Hive.
- Create the bucketed table.
- Insert the data from the dummy table into the bucketed table.
use bucketdData;

create table emp_demo (Id int, Name string, Salary float)
row format delimited fields terminated by ',';

load data local inpath '/home/codegyani/hive/emp_details' into table emp_demo;

set hive.enforce.bucketing = true;

create table emp_bucket (Id int, Name string, Salary float)
clustered by (Id) into 3 buckets
row format delimited fields terminated by ',';

insert overwrite table emp_bucket select * from emp_demo;
Here, we can see that the data is divided into three buckets.
Assume the emp_demo table contains rows with various Id values. Because the bucketing column Id is an integer, Hive's hash of the value is the value itself, so each row lands in bucket Id % 3:
- Bucket 0: rows whose Id is exactly divisible by 3 are stored in bucket 0.
- Bucket 1: rows whose Id leaves a remainder of 1 are stored in bucket 1.
- Bucket 2: rows whose Id leaves a remainder of 2 are stored in bucket 2.
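The assignment can be checked with a few hypothetical rows (the Ids, names, and salaries below are illustrative, not from the original dataset); for an integer column the bucket index is simply Id % 3:

```python
# Hypothetical employee rows: (Id, Name, Salary) -- illustrative only.
rows = [(1, "Anil", 40000.0), (2, "Varun", 80000.0), (3, "Sonu", 50000.0),
        (4, "Ravi", 35000.0), (5, "Amit", 60000.0), (6, "Dinesh", 45000.0)]

buckets = {0: [], 1: [], 2: []}
for emp_id, name, salary in rows:
    # For an integer bucketing column, Hive's hash is the value itself,
    # so the bucket index is Id % 3.
    buckets[emp_id % 3].append(name)

print(buckets)  # {0: ['Sonu', 'Dinesh'], 1: ['Anil', 'Ravi'], 2: ['Varun', 'Amit']}
```

Ids 3 and 6 (remainder 0) share bucket 0, Ids 1 and 4 share bucket 1, and Ids 2 and 5 share bucket 2, matching the rule above.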
Bucketed vs. Non-Bucketed Query Performance
For this demonstration, we’ll use the standard TPC-DS tables. The tables sampled are catalog_sales, store_sales, and store_returns.
Primarily, three datasets are used for the queries:
- catalog_sales – 164 GB, 143 Million rows (Big)
- store_sales – 220 GB, 2.87 Billion rows (Big)
- store_returns – 25 GB, 287 Million rows (Medium)
The chart below compares the same data accessed as bucketed vs non-bucketed table queries for three different variations.
Query 1: Big-Medium table join. This query joins the catalog_sales and store_returns tables.
Query 2: Two Big-Medium joins. This query contains two joins: catalog_sales joined with store_returns, and store_sales joined with store_returns.
Query 3: Big-Big table join. This query joins the catalog_sales and store_sales tables. (Yes, the graphic looks unusual. No, this is not a mistake.)
Bucketing in Hive is very useful on the optimization front, especially for joins between certain big datasets. In addition, Hive bucketing makes queries with filters or aggregates on the bucketing columns more efficient.
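Why bucketing speeds up joins can be sketched in Python: when both tables are bucketed on the join key into the same number of buckets, bucket i of one table only ever needs to be matched against bucket i of the other. This is a simplified model of the idea behind Hive's bucketed joins; the table contents below are illustrative.

```python
def bucketize(rows, key, num_buckets=3):
    """Group rows into buckets by key % num_buckets (integer keys, as in Hive)."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets

# Two illustrative tables, both bucketed on the join key "id".
sales = bucketize([{"id": i, "amt": i * 10} for i in range(1, 7)], "id")
returns = bucketize([{"id": i, "ret": 1} for i in (2, 3, 5)], "id")

# Join bucket-by-bucket: each sales bucket is compared only against the
# corresponding returns bucket, instead of scanning the whole other table.
joined = [(s["id"], s["amt"])
          for sales_bucket, returns_bucket in zip(sales, returns)
          for s in sales_bucket
          for r in returns_bucket
          if s["id"] == r["id"]]

print(sorted(joined))  # [(2, 20), (3, 30), (5, 50)]
```

Because matching keys are guaranteed to sit in same-numbered buckets, each task only has to read one bucket file from each side, which is what makes the bucketed joins in the charts above cheaper than their non-bucketed counterparts.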
We hope this article helps you understand the basics of optimization in Hive. Stay tuned for upcoming blogs.