Dynamic Partitioning in Apache Hive

Reading Time: 3 minutes

Introduction

We are back with another Important concept of big data is Dynamic partitioning in Hive. Before moving to the dynamic one we should know about static partitioning which I explained In the blog Static partitioning

Now it’s time to deep dive into a dynamic one.

How Dynamic Differ from Static Partitioing

  • In this partition, columns values are only known at EXECUTION TIME
  • User is just mentioning the column, on which partition is required. REST is taken care by hive itself

Working of Dynamic Partitioing

Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0.

In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will create that partition automatically. Using this feature you need only one insert statement to create and populate all necessary partitions.

In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduces the Hadoop cluster workload comparing to the multiple insert case.

Example

Step1: 
First, select the database in which we want to create a table.
  use StudentData;

Step2:
Enable the dynamic partition by using the following commands: -
  set hive.exec.dynamic.partition=true;  
  set hive.exec.dynamic.partition.mode=nonstrict; 

Step3:
Create a dummy table to store the data.
 
create table stud_demo(id int, name string, age int, institute string,course string)   
row format delimited  
fields terminated by ',';  

Step4:
Now, load the data into the table.

load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;  

Step5:
Create a partition table by using the following command: -

create table student_part (id int, name string, age int, institute string)   
partitioned by (course string)  
row format delimited  
fields terminated by ','; 

Step6:
Now, insert the data of dummy table into the partition table.

insert into student_part  
partition(course)  
select id, name, age, institute, course  
from stud_demo;  

In the following screenshot, we can see that the table student_part is divided into two categories.

Can we drop/delete the hive partitions?

Yes, we can drop the partitions if required. To drop or delete the partition column using the below command

Let’s see a few variations of drop partition.

Drop a single partition

ALTER TABLE student_part DROP IF EXISTS PARTITION(course = 'java');

Drop a multiple partition

ALTER TABLE student_part drop if exists partition (course = 'java'), partition (course = 'hadoop');

Advantages of Dynamic Partition

  • Good for loading huge files in tables.
  • Data read row wise.
  • Generally, it reduces the query processing time.
  • We can use to load data from the table that is not partitioned.
  • External and managed tables is used for this type of partition.

Disadvantages of Dynamic Partition

  • We cannot perform alter on Dynamic Partition.
  • Sometime needs to be extra carefully because one wrong partition column can lead to creating millions of partition in hdfs directory

Conclusion

From the above article, we saw how to used in the hive and how to create it. We also check the advantage of having a dynamic partition over the hive and how to use it. So from this article, we can have a fair idea of how it works in the hive and its advantage.

Written by 

Chitra Sapkal is a software consultant at Knoldus Inc. having experience of 2 years. Knoldus does Big Data product development on Scala, Spark, and Functional Java. She is a self-motivated, passionate person who is recognized as a good team player, Her hobbies include playing badminton and travelling.