What is Partitioning?
In simple words, we can explain Partitioning as the process of dividing something into sections or parts, with the motive of making it easily understandable and manageable.
Apache Hive allows us to organize the table into multiple partitions where we can group the same kind of data together. It is used for distributing the load horizontally which also helps to increase query performance.
Why is Partitioning Important?
- Pentabytes of data is stored in HDFS, due to which it becomes very difficult for Hadoop users to query this huge amount of data.
- To understand this first lets look at a scenario ,
- There is a multinational bank name ABC_BANK which spans across multiple countries. Now we have a table which contains information of new customers named ‘new_cust‘. Lets say you want to find out count of new customers from ‘USA’ . In this case Hive does this as below
- Search across all the countries and filter records of ‘USA’
- Count the number of new customers from ‘USA’
- This will give the correct output but can we optimize this so that Hive fetches record faster.
- What if Hive already new where records belonging to USA is present so it didn’t have to go through all the countries records. This can be achieved using Hive Partition
How to Create Partitions in Hive?
The following command is used to create data partitioning in Hive:
CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2 data_type,….);
If you create an internal Hive table without partitions, the data files will be stored (by default) in HDFS at /user/hive/warehouse/[table_name]
When we create a partitioned internal table, Hive creates subfolders, for each branch of the partition, and the data files reside in the lowest-level subfolder
Types of Hive Partitioning
There are two main types of Partitioning
- Partitions are added manually so it is also known as manual partition.
- In static partitioning, we partition the table based on some attribute.
- Data Loading in static partitioning is not only faster for the massive files to load but also time saver.
- In static partitioning individual files are loaded as per the partition we want to set.
- where clause is used to use limit in static partition.
- Example –
Let’s create a table ‘student’ with the following attributes and partition it using ‘section’ in our default database
CREATE TABLE student(student_name STRING ,father_name STRING ,percentage FLOAT) partitioned by (section STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Describe the table to see information about attributes as well as partitioned columns
Create 4 different files containing data of students from respective sections
(student-A, student-B, student-C, student-D) additionally make sure that section column which is our partition column is never added to the actual table
LOAD DATA LOCAL INPATH '/home/dikshant/Documents/student_A' INTO TABLE student partition(section = "A"); LOAD DATA LOCAL INPATH '/home/dikshant/Documents/student_B' INTO TABLE student partition(section = 'B'); LOAD DATA LOCAL INPATH '/home/dikshant/Documents/student_C' INTO TABLE student partition(section = 'C'); LOAD DATA LOCAL INPATH '/home/dikshant/Documents/student_D' INTO TABLE student partition(section = 'D');
Now as mentioned above, go to your HDFS(/user/hive/warehouse/) and check the student table to see how the partitions are made, following is for the reference
Here we can easily observe that the student table is partitioned and contain data of student in accordance with their section, accordingly, the hive can query data on section-wise
- In most cases, you will find yourself using Dynamic partitions.
- Dynamic partitions provide us with flexibility and it also create partitions automatically depending on the data that we are inserting into the table.
- By default, Hive does not enable dynamic partition, because this will protect us, from creating from a huge number of partitions accidentally
- Below mentioned commands will make dynamic partition enabled in hive
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;
We will learn more about Dynamic partitioning in the upcoming blog
Static vs Dynamic Partitioning
Partitioning vs Bucketing
Partitioning as well as bucketing are kind of similar techniques with the goal of improving query performance. Depending on the use case & the data we have, the optimal technique can be chosen.
to know more about Bucketing in the hive, refer to hive bucketing
In this Hive Partitioning article, you have learned about the basics of hive partitioning, how to improve the performance of the queries by doing Partition and Bucket on Hive tables.
Hope this article will help you to understand the basics of partitioning in the hive. Stay tuned for upcoming blogs.