What is Hive?
Hive is a data warehouse infrastructure tool that processes structured data in Hadoop. It sits on top of Hadoop to summarize Big Data, and makes querying and analysis easy.
Why use Hive?
1) Most data warehousing applications work with a SQL-based querying language, and Hive supports easy portability of SQL-based applications to Hadoop.
2) It returns results quickly, even on very large datasets.
3) As data volume and variety increase, more machines can be added without a corresponding drop in performance.
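The SQL portability point above is easy to see in practice: a typical analytics query written for a relational warehouse runs in HiveQL largely unchanged. The table and column names below are hypothetical, used only for illustration:

```sql
-- A standard aggregation query: the same syntax works in most
-- SQL warehouses and in HiveQL. Table and columns are hypothetical.
SELECT customer_id,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_spent
FROM   orders
WHERE  order_date >= '2020-01-01'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;
```

Behind the scenes, Hive compiles this into jobs that run on the cluster, so the SQL skills carry over directly.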
Features of Hive
1) It accelerates queries by providing indexes, including bitmap indexes.
2) It stores metadata, which reduces the time needed to perform semantic checks during query execution.
3) It provides built-in functions to manipulate dates, strings and other data, along with other data-mining tools.
4) It supports different file formats such as Avro, ORC and Parquet.
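The file-format and built-in-function features above can be sketched in a short example. The table below is hypothetical; `STORED AS ORC` and the date/string functions shown are standard HiveQL:

```sql
-- Hypothetical table stored in ORC, one of the file formats Hive supports.
CREATE TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
STORED AS ORC;

-- Built-in date and string functions used in a query.
SELECT to_date(view_time)           AS view_day,
       lower(parse_url(url, 'HOST')) AS host,
       COUNT(*)                      AS views
FROM   page_views
GROUP BY to_date(view_time), lower(parse_url(url, 'HOST'));
```

Choosing a columnar format like ORC or Parquet at table-creation time is what lets Hive skip irrelevant data during scans.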
Architecture of Hive
The major components of Hive are as follows:
1) Metastore: This component is responsible for storing all the structural information of the various tables and partitions in the warehouse, including column and column-type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored.
2) Driver: It acts as a controller that receives HiveQL statements, starts their execution by creating sessions, and monitors the life cycle and progress of the execution. It stores the necessary metadata generated during the execution of a HiveQL statement. The driver also acts as a collection point for the data or query results obtained after the Reduce operation.
3) Compiler: The component that parses the query, performs semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
In other words, the process can be described by the following flow:
Parser —> Semantic Analyser —> Logical Plan Generator —> Query Plan Generator.
4) Optimizer: Performs various transformations on the execution plan to produce an optimized DAG (directed acyclic graph).
5) Executor: After compilation and optimization, the Executor executes the tasks in DAG order. It interacts with Hadoop's job tracker to schedule tasks, and it pipelines the tasks by ensuring that a dependent task is executed only after all of its prerequisites have run.
6) CLI, UI and Thrift Server: Interfaces that users can use to submit queries and receive results.
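The compiler and optimizer output described above is not a black box: Hive's `EXPLAIN` statement prints the stages of the generated plan. The table name below is hypothetical:

```sql
-- EXPLAIN shows the plan produced by the compiler and optimizer:
-- stage dependencies, map/reduce operators, and so on.
EXPLAIN
SELECT department, AVG(salary)
FROM   employees   -- hypothetical table
GROUP BY department;
```

Reading `EXPLAIN` output is the easiest way to see the Parser → Semantic Analyser → Logical Plan Generator → Query Plan Generator flow materialized as concrete stages.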
How does Hive interact with Hadoop?
1) Execute Query: The query is sent from the Hive interface (UI or command line) to the driver.
2) Check Syntax and Get Plan: The driver passes the query to the compiler, which parses it to check the syntax and build the query plan.
3) Get Metadata: The compiler sends a metadata request to the Metastore, which returns the metadata to the compiler.
4) Execute Plan: The compiler checks the requirements and sends the plan back to the driver. The driver then sends the plan to the execution engine.
5) Execute Job: An execution engine, such as Tez or MapReduce, executes the compiled query, while the resource manager, YARN, allocates resources for applications across the cluster. In classic MapReduce mode, the execution engine submits the job to the JobTracker, running on a master node, which assigns tasks to TaskTrackers running on the data nodes. There, the query executes as a MapReduce job.
6) Fetch Query Result: The execution engine receives the results from the data nodes and sends them to the driver, which returns them to the Hive interface over a JDBC/ODBC connection.
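The choice between Tez and MapReduce in step 5 can be made per session before a query is submitted. `hive.execution.engine` is a standard Hive configuration property; the table below is hypothetical:

```sql
-- Select the execution engine for this session; Hive then compiles
-- the query into a Tez DAG (or MapReduce jobs) as described above.
SET hive.execution.engine=tez;   -- or 'mr' for classic MapReduce

SELECT COUNT(*) FROM employees;  -- hypothetical table; runs as a Tez job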
Now that you have seen what happens internally when a Hive query runs, I hope you will find working with Hive even more interesting.
Stay tuned for further blogs!!