Want to know about Greenplum?

Hello developers, this blog covers what Greenplum is and what its main features are. Greenplum is an MPP SQL database based on PostgreSQL. It scales to multi-petabyte data sizes and lets a cluster of powerful servers work together to provide a single SQL interface to the data. We also walk through how to install Greenplum on a system.

What is Greenplum?

Greenplum is a massively parallel processing (MPP) database server. Its architecture is specially designed to manage large-scale analytic data warehouses and business intelligence workloads.
It is a software-only solution; the hardware and database software are not coupled. It runs on a variety of commodity server platforms from Greenplum-certified hardware vendors. Because the database is distributed across multiple machines in a Greenplum system, proper selection and configuration of hardware are vital to achieving the best possible performance.

History of Greenplum

From the start, Pivotal Greenplum was based on PostgreSQL, a popular and widely used open-source database. Greenplum synced continuously with PostgreSQL releases until it forked from the main PostgreSQL line at version 8.2.15.
Greenplum, the company, was founded in September 2003 by Luke Lonergan and Scott Yara. It was created by merging two smaller companies: Metapa (founded in August 2000 near Los Angeles) and Didera in Fairfax, Virginia.
In 2012, EMC purchased Pivotal Labs, and in 2015 Pivotal announced that it would adopt an open-source strategy for its product set. Pivotal would donate most of the software to the Apache Software Foundation, and the software would then be freely licensed under the Apache License.

MPP Greenplum and PostgreSQL

MPP is also known as a shared-nothing architecture. It refers to systems with two or more processors that cooperate to carry out an operation, where each processor has its own memory, operating system, and disks. Greenplum uses this high-performance architecture to distribute the load of multi-terabyte data warehouses, and it can use all of a system's resources in parallel to process a query.
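The shared-nothing idea can be illustrated with a toy scatter/gather in plain shell. This is only a sketch and not Greenplum code: the file names, the sample rows, and the use of "id modulo segment count" as a stand-in for Greenplum's hash distribution are all assumptions made for illustration.

```shell
# Toy scatter/gather: an illustration of the idea, NOT how Greenplum is implemented.
set -eu

workdir=$(mktemp -d)   # scratch area standing in for separate segment hosts
segments=3             # hypothetical number of segments

# A small sample "table" with id,amount rows.
printf '%s\n' 1,10 2,20 3,30 4,40 5,50 6,60 > "$workdir/sales.csv"

# Scatter: route each row to a segment file by id modulo the segment count.
awk -F, -v n="$segments" -v dir="$workdir" \
  '{ print > (dir "/seg" ($1 % n) ".csv") }' "$workdir/sales.csv"

# Each segment sums only its own rows, in parallel, sharing nothing with the others.
for f in "$workdir"/seg*.csv; do
  awk -F, '{ s += $2 } END { print s }' "$f" > "$f.sum" &
done
wait

# Gather: the coordinator combines the partial sums into the final answer.
total=$(cat "$workdir"/seg*.csv.sum | awk '{ t += $1 } END { print t }')
echo "total=$total"    # prints total=210

rm -rf "$workdir"
```

Each background job touches only its own file, which is the property that lets a real MPP system add more segments to handle more data.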

What did Greenplum inherit from PostgreSQL?

Greenplum is based on PostgreSQL open-source technology. It is essentially several PostgreSQL database instances acting together as one cohesive database management system (DBMS).
It is based on PostgreSQL 8.2.15, and in most cases it is very similar to PostgreSQL with regard to SQL support, features, configuration options, and end-user functionality.
Database users interact with Greenplum Database as they would with a regular PostgreSQL DBMS.
The internals of PostgreSQL have been modified to support the parallel structure of Greenplum.
For example, the system catalog, optimizer, query executor, and transaction manager components have been modified and enhanced so that they can execute queries simultaneously across all of the parallel PostgreSQL database instances.
The Greenplum interconnect (the networking layer) enables communication among the PostgreSQL instances and allows the system to behave as one logical database.

Important Note:

Greenplum Database also includes features designed to optimize PostgreSQL for business intelligence (BI) workloads. For example, Greenplum has added parallel data loading (external tables), resource management, query optimizations, and storage enhancements, which are not found in standard PostgreSQL.

System Properties Comparison Greenplum vs. PostgreSQL

Name	Greenplum	PostgreSQL
Description	Analytic Database platform built on PostgreSQL. Full name is Pivotal Greenplum Database	Widely used open source RDBMS
Primary database model	Relational DBMS	Relational DBMS
Server-side scripts	Yes	User defined functions
DB-Engines Ranking	Rank #39 Overall #24 Relational DBMS	Rank #4 Overall #4 Relational DBMS
MapReduce	Yes	No
User concepts	fine grained access rights according to SQL-standard	fine grained access rights according to SQL-standard

Reasons to choose Greenplum

MPP Architecture: A massively parallel processing architecture that provides automatic parallelization of all data and queries in a scale-out, shared-nothing design.
Petabyte-Scale Loading: High-performance loading uses MPP technology. Loading speeds scale with each additional node to greater than 10 terabytes per hour, per rack.
Query Optimization: The query optimizer available in Greenplum Database is the industry’s first cost-based query optimizer for big data workloads. It can scale interactive and batch-mode analytics to large data sets in the petabytes without degrading query performance and throughput.
Polymorphic Data Storage: Fully control the configuration for your table and partition storage, execution, and compression. Design your tables based on the way data is accessed. Users have the choice of row-oriented or column-oriented storage and processing for any table or partition.
Integrated In-Database Analytics: Provided by Apache MADlib, a library for scalable in-database analytics that extends the SQL capabilities of Greenplum Database through user-defined functions.
Federated Data Access: Query external data sources with the Greenplum optimizer and query processing engine, including Hadoop, cloud storage, ORC, AVRO, Parquet, and other polyglot data stores.

Install Greenplum OSS on your Ubuntu machine

First, add the Greenplum PPA to your system:

sudo add-apt-repository ppa:greenplum/db

Then update the package index and install the package:

sudo apt-get update
sudo apt-get install greenplum-db-oss

The above commands install the Greenplum Database software and any required dependencies automatically, and place the resulting software in /opt/gpdb.

Load Greenplum Database software into your environment with the following command:

$ . /opt/gpdb/greenplum_path.sh
$ which gpssh
/opt/gpdb/bin/gpssh

The which command above confirms that the software is on your path. Now you can copy a Greenplum cluster configuration file template into your local directory for editing, like this:

cp $GPHOME/docs/cli_help/gpconfigs/gpinitsystem_singlenode .

Edit gpinitsystem Configuration File
 The following edits are enough for the simplest cluster configuration running locally.
 Create the file that this line points to and put only your hostname in it:
 MACHINE_LIST_FILE=./hostlist_singlenode
 Update this line to list the directories you want to use for the primary segments. For example, change:
 declare -a DATA_DIRECTORY=(/gpdata1 /gpdata2)
 to:
 declare -a DATA_DIRECTORY=(/home/inovick/primary /home/inovick/primary)
 Make sure the directory mentioned above exists.
 Update this line to the hostname of your machine. In my case the hostname is ‘ubuntu’, so:
 MASTER_HOSTNAME=hostname_of_machine
 becomes:
 MASTER_HOSTNAME=ubuntu
 Update the master data directory entry in the file and make sure it exists by creating the directory:
 MASTER_DIRECTORY=/home/inovick/master
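
The manual edits above could also be scripted. The following is a minimal sketch, assuming it is run from the directory that holds the copied gpinitsystem_singlenode template; the gpdata directory name is a hypothetical choice, and uname -n is used here as a portable way to get the hostname. Review the resulting file before initializing anything.

```shell
# Sketch only: the gpdata location is an illustrative assumption, not a Greenplum default.
set -eu

base="$(pwd)/gpdata"             # hypothetical parent for the data directories
conf="gpinitsystem_singlenode"   # the template copied in the previous step
host="$(uname -n)"               # this machine's hostname

# Host list: a single line containing only this machine's hostname.
echo "$host" > hostlist_singlenode

# The data directories must exist before the cluster is initialized.
mkdir -p "$base/primary" "$base/master"

# Rewrite the four settings in place, keeping a .bak copy of the original file.
# The step is skipped when the template is not present in this directory.
if [ -f "$conf" ]; then
  sed -i.bak \
    -e "s|^MACHINE_LIST_FILE=.*|MACHINE_LIST_FILE=./hostlist_singlenode|" \
    -e "s|^declare -a DATA_DIRECTORY=.*|declare -a DATA_DIRECTORY=($base/primary $base/primary)|" \
    -e "s|^MASTER_HOSTNAME=.*|MASTER_HOSTNAME=$host|" \
    -e "s|^MASTER_DIRECTORY=.*|MASTER_DIRECTORY=$base/master|" \
    "$conf"
fi
```

Listing the same primary directory twice mirrors the example above and yields two primary segment instances in one location.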

That’s enough to get the database initialized and up and running, so close the file and let’s initialize the cluster. With this configuration we will have one master instance and two primary segment instances. In more advanced setups you would configure a standby master and segment mirrors on additional hosts, and the data would automatically be both sharded (distributed) across the primary segments and mirrored from primaries to mirrors.

Run gpinitsystem

First, let’s make sure SSH keys are exchanged by running the following command:

gpssh-exkeys -f hostlist_singlenode

Next, initialize and start the cluster by running the following command:

gpinitsystem -c gpinitsystem_singlenode
The utility will print out what it is going to do and then ask you to confirm before proceeding.

Once it finishes, you are good to go: you can create a database, log in, and start running queries and inserting data.
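
As a rough sketch of that first session, the helper below creates a database and runs a few statements through psql. It assumes greenplum_path.sh has been sourced, the cluster is running on the default port, and your OS user can connect; the function name gp_smoke_test and the demo and sales names are hypothetical.

```shell
# Hypothetical helper: defines a function only; nothing runs until you call it.
gp_smoke_test() {
  db="${1:-demo}"   # database name, "demo" unless one is passed in

  # Create the database; stop if that fails (for example, if it already exists).
  createdb "$db" || return 1

  # Create a table distributed by id, insert a few rows, and query them back.
  psql -d "$db" -v ON_ERROR_STOP=1 <<'SQL'
CREATE TABLE sales (id int, amount numeric) DISTRIBUTED BY (id);
INSERT INTO sales VALUES (1, 10), (2, 20), (3, 30);
SELECT count(*), sum(amount) FROM sales;
SQL
}
```

Call it as gp_smoke_test or gp_smoke_test mydb once the cluster is up. The DISTRIBUTED BY clause tells Greenplum which column to use when spreading rows across the primary segments.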

Summary:

Greenplum is an MPP SQL database based on PostgreSQL. It is used in production in hundreds of large corporations and government agencies around the world, and, including the open-source version, it has thousands of deployments globally.
It scales to multi-petabyte data sizes with ease and allows a cluster of powerful servers to work together to provide a single SQL interface to the data.

In addition to SQL for analyzing structured data, it provides modules and extensions on top of the PostgreSQL abstractions for in-database machine learning and AI, geospatial analytics, text search (with Apache Solr), and text analytics with Python and Java, along with the ability to create user-defined functions in Python, R, Java, Perl, C, or C++.
