Apache Calcite : A Quick insight on Cost-based Optimization

Reading Time: 3 minutes

Hello folks , In this blog , We will discuss about what is Query Optimization,what is a plan cost is , how we can use to drive optimizer decisions and Issues In Cost-Based Optimization.

What is Query Optimization:

We can execute a single query through different algorithms or re-written in different forms and structures. Hence, the question of query optimization comes into the picture – Which of these forms or pathways is the most optimal? The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

Consider the following query:

SELECT * FROM fact 
WHERE event_date BETWEEN ? AND ?

We may do the full table scan and then apply the filter. Alternatively, we may utilise a secondary index on the attribute event_date, merging the scan and the filter into a single index lookup. This sounds like a good idea because we reduce the amount of data that needs to be execute.

The goal of query optimization is to reduce the system resources required to fulfil a query, and ultimately provide the user with the correct result set faster.

Cost estimation

The cost estimation of a query evaluation plan is calculate in terms of various resources that include:

  • Number of disk accesses
  • Execution time taken by the CPU to execute a query
  • Communication costs in distributed or parallel database systems.

For example, a plan that gives the smallest latency might not be the best if our goal is to minimise the hardware usage costs in the cloud. So how do we decide which plan is better?

First of all, we should define the optimization goal, which could be-

  • minimal latency
  • maximal throughput

 Then we may associate every plan with a value that describes how “far” the plan is from the ideal target.

In practical systems let’s see how cost is usually implemented

  • Apache Calcite, the cost is modelled as a scalar representing the number of rows being processed.
  • Catalyst, the Apache Spark optimizer, the cost is a vector of the number of rows and the number of bytes.
  • Presto, the cost is a vector of estimated CPU, memory, and network usage. The vector is also converted into a scalar value during comparison.
  • CockroachDB, the cost is an abstract 64-bit floating-point scalar value.

The scalar is a common choice for practical systems, but this is not a strict requirement. We can use any representation ,as long as it satisfies the requirements of your system and allows you to decide which plan is better.

The cost of optimization of the query depends upon the following-

  1. Cardinality – Cardinality is the number of rows that are return by performing the operations specified by the query execution plan.
  2. Selectivity-  A “selective” query refers to a query that selects few rows relative to the number of rows in the table. In general, an index is useful on such a query, because it reduces the number of data pages that need to be peruse.
  3. Cost- The Optimiser allocates a cost in numerical form which is relate to each step of a possible plan and then finds these values together to get a cost estimate for the plan or for the possible strategy.

Cost-based Optimization

Once we know how to compare the plans, different strategies can be use to search for the best one. A common approach is to enumerate all possible plans for a query and choose a plan with the lowest cost.

Below are some of the features of the cost-based optimization-

  • The query can use a lot of paths based on the value of indexes, available sorting methods, constraints, etc.
  • The aim of query optimization is to choose the most efficient path of implementing the query at the possible lowest minimum cost in the form of an algorithm. 
  • The cost of an algorithm also depends upon the cardinality of the input.

Issues In Cost-Based Optimization

  • In cost-based optimization execution strategies is not really fixed. The number of execution strategies may vary based on the situation.
  • Sometimes, this process is really very time-consuming to cost because it does not always guarantee finding the best optimal strategy
  • It is an expensive process.

Summary

The cost-based optimization estimates the quality of the plans concerning the optimization target, allowing an optimizer to choose the best execution plan.

The cost model depends heavily on metadata maintained by the database, such as estimated cardinalities and selectivity.

That’s it, folks. I hope you liked the blogs. Thanks!

Reference

https://www.javatpoint.com/estimating-query-cost

To read more tech blogs, visit Knoldus Blogs.

Written by 

Gaurav srivastav is a Software Consultant working in java domain.

Discover more from Knoldus Blogs

Subscribe now to keep reading and get access to the full archive.

Continue reading