Table of contents

Reading Time: 3 minutes

To speed up the query processing, we need to efficiently manage data.
And partitioning the database is considered as one of the ways to enhance the processing speed.

Using Partitioning, we can divide the data based on some value/field, so that data belonging to different groups are identified easily(may be stored on different nodes), and therefore processing a query won’t require a Full Table Scan, hence making the processing faster.

Following keys can be used to achieve partitioning :

Partition Key: The Partition Key is responsible for data distribution across your nodes.
Clustering Key: The Clustering Key is responsible for data sorting within the partition.
Row Key: A Row Key is just the identifier of an entity within a partition.

Partition key

Determines the node on the cluster, where the partition will be stored.

Partitioning Strategies

Single-Level Partitioning –

a) Range Partition
Range partitioning maps data to partitions based on ranges of partition key values that you establish for each partition. A range is defined using the operators( ‘>’ or ‘<’ (non-inclusive))

Consider the following table :
CUSTOMER –
– CustomerId
– Name
– Location
– DOB

The Lower Bound and the Upper Bound are required for the partition.
Ex: we can define range partition on DOB like:-
Range Partition on DOB ‘2016-01-01’ < DOB <= ’2017-01-01’

b) List Partition – List partitioning enables you to explicitly control how rows map to partitions. A list of discrete values for the partitioning key in the description for each partition is defined.

A row is mapped to a partition by checking whether the value of the partitioning column for a row falls within the set of values that describes the partition. If a table is partitioned by list, the partitioning key can only consist of a single column of the table.

Ex: we can define list partition on Location like
List Partition on Location( ‘Asia’, ‘Europe’, ‘Africa’ )

c) Hash Partition – A column is defined, where hash values are generated for each value and based on the buckets, the partition is made.

The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size.
This is typically used where ranges aren’t appropriate, i.e. employee number, productID, etc.

It is a better choice than range partitioning when:
You do not know beforehand how much data maps into a given range
The sizes of range partitions would differ quite substantially or would be difficult to balance manually
Range partitioning would cause the data to be undesirably clustered
Performance features such as parallel DML, partition pruning, and partition-wise joins are important

Ex: we can define hash partition on any column like:-
Hash Partition( ‘column-name’ )
Composite Partitioning

A combination of more than one partitioning technique.
These can be any of the following type: Range-Range, Range-List, Range-Hash, List-Range, and so on.