Partitioning in Apache Hive

Table of contents
Reading Time: 2 minutes


Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often there are instances where users need to filter the data on specific column values.thats where Partitioning comes into play a partition is nothing but a directory which contains the chunk of data when we do partitioning, we create a partition for each unique value of the column

lets run a simple example to see what it is

syntax to create partition table is

create table tablename(colname type) partitioned by(colname type);

if hive.exec.dynamic.partition.mode is set to strict, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic



here we create a table named emp info with two fields name and addresss we partitioned the table by column ID of type int and then we insert the value in this table

it’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.

syntax for inserting values is

insert into partition values();

first we insert record with id=1 now insert another record with id=2


now got to /user/hive

/warehouse/default/empinfo directory in your hdfs


as we can see there are two partitions one with name id=1 and other one as id =2 now when a  select query is fired with where clause it will not scan the full table it will only scan the required partition


if you will tried  it with a non partition table with large dataset it will take more time in comparison because it will have to go through entire table scan

i hope this blog will be helpful happy coding