Transform Data on the Cloud with AWS Glue: A Managed ETL Platform


Glue: Data Integration Service

AWS Glue is a managed cloud ETL platform for enriching, cleansing, normalising, organising, validating, and formatting data before it is stored in a data lake, data warehouse, or database.
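To make "cleansing, normalising, and validating" concrete, here is a minimal sketch of a record-level transformation in plain Python. The field name `email` and the `is_valid` flag are hypothetical and chosen only for illustration. A function of this shape could in principle be applied to each record inside a Glue job (for example through the `Map` transform of the `awsglue` library), but that wiring is not shown here.

```python
from typing import Any, Dict


def clean_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleansed, normalised copy of one record with a validity flag."""
    # Cleansing: trim stray whitespace from every string value.
    cleaned = {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }

    # Normalising: store e-mail addresses in lower case (hypothetical field).
    email = cleaned.get("email")
    if isinstance(email, str):
        email = email.lower()
        cleaned["email"] = email

    # Validation: a deliberately simple check, flagged rather than dropped.
    cleaned["is_valid"] = isinstance(email, str) and "@" in email
    return cleaned
```

Flagging invalid records instead of dropping them keeps the decision about what to do with bad data in a later step of the pipeline.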

Glue jobs act as the orchestrator of an ETL workflow. We can create jobs in AWS Glue that automate the scripts for ETL tasks. These jobs can be scheduled, chained together, or triggered by events.
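As a rough sketch of scheduling and chaining, the helpers below build request parameters for the Glue `create_trigger` API as exposed by `boto3`. The trigger and job names are hypothetical, and the jobs are assumed to exist already. `boto3` is imported only when a request is actually sent, so the parameter builders can be tried without AWS credentials.

```python
from typing import Any, Dict, Optional


def scheduled_trigger(name: str, job_name: str, cron: str) -> Dict[str, Any]:
    """Parameters for a trigger that starts a job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,  # e.g. "cron(0 2 * * ? *)" for 02:00 UTC every day
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def chained_trigger(name: str, upstream_job: str, downstream_job: str) -> Dict[str, Any]:
    """Parameters for a trigger that starts a job after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def create_trigger(params: Dict[str, Any], client: Optional[Any] = None) -> Dict[str, Any]:
    """Send the request; a Glue client is created lazily if none is given."""
    if client is None:
        import boto3  # imported here so the builders above work without it

        client = boto3.client("glue")
    return client.create_trigger(**params)
```

For example, `create_trigger(scheduled_trigger("nightly", "extract_orders", "cron(0 2 * * ? *)"))` would register a nightly run, and `chained_trigger("then-load", "extract_orders", "load_orders")` describes a follow-up job that runs only after the first one succeeds.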

Glue: ETL Workflow

Glue Services

AWS Glue covers a range of capabilities: building event-driven ETL pipelines, acting as a data catalog for finding data across multiple data stores, monitoring ETL jobs without having to maintain code, exploring data, and building views.

Components of AWS Glue Job

AWS Glue Components

The AWS Glue Data Catalog is a persistent metadata store that supports data exploration across different data stores, much like the Apache Hive metastore.
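As an illustration of exploring that metadata, the sketch below walks the tables of one catalog database through the Glue `get_tables` API and pulls out column names and types. The database name in the usage comment is hypothetical, and `client` is assumed to be a `boto3` Glue client (or any object with a compatible `get_tables` method).

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_catalog_tables(client: Any, database: str) -> Iterator[Dict[str, Any]]:
    """Yield every table definition in a catalog database, page by page."""
    kwargs: Dict[str, Any] = {"DatabaseName": database}
    while True:
        page = client.get_tables(**kwargs)
        yield from page.get("TableList", [])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # ask for the next page of results


def table_columns(table: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (column name, column type) pairs recorded for a table."""
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    return [(col["Name"], col.get("Type", "")) for col in columns]


# Usage sketch (needs AWS credentials; "sales_db" is a made-up database name):
#   import boto3
#   glue = boto3.client("glue")
#   for table in iter_catalog_tables(glue, "sales_db"):
#       print(table["Name"], table_columns(table))
```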

A Glue crawler scans data of different types in a data store, classifies it, and extracts its schema. It then stores that metadata in the Data Catalog, where it guides ETL operations.
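A crawler over an S3 location might be set up roughly as follows, using the Glue `create_crawler` and `start_crawler` APIs from `boto3`. The crawler name, IAM role, database, and bucket path are placeholders to be supplied by the caller, and the role is assumed to have permission to read the bucket and write to the Data Catalog.

```python
from typing import Any, Dict, Optional


def s3_crawler_request(
    name: str,
    role_arn: str,
    database: str,
    s3_path: str,
    cron: Optional[str] = None,
) -> Dict[str, Any]:
    """Parameters for a crawler that scans one S3 path into a catalog database."""
    request: Dict[str, Any] = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if cron:
        request["Schedule"] = cron  # optional, e.g. "cron(0 3 * * ? *)"
    return request


def create_and_start_crawler(request: Dict[str, Any], client: Optional[Any] = None) -> None:
    """Register the crawler, then kick off its first run."""
    if client is None:
        import boto3  # imported lazily so the builder works without it

        client = boto3.client("glue")
    client.create_crawler(**request)
    client.start_crawler(Name=request["Name"])
```

Once the run finishes, the tables it discovered should appear in the named catalog database, ready to be referenced by ETL jobs.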

Glue Crawlers

AWS Glue DataBrew is a visual tool for data preparation. It makes data cleansing and normalisation easier for analysts.

Glue DataBrew

AWS Glue Studio is a graphical interface for creating, running, and monitoring ETL jobs in AWS Glue.

Glue Studio

AWS Glue Elastic Views builds materialised views that combine or replicate data across multiple data stores.

Glue Elastic Views

AWS Glue Use-Cases

Data Lake

Glue is widely used to build enterprise-level data lakes, where data wranglers from different domains can store and access data in one place. This gives data analysts and other data personas a consistent experience when working with that data.

Data Catalog

AWS Glue can also be used to build Data Catalog-based web applications for data search, filtering, navigation, and research. A data catalog lets data analysts find data in a few clicks and reference it in future work, all in one place. With Glue, data wranglers can bring on-premises data to the cloud to take advantage of cloud storage: structured, semi-structured, and unstructured data in any format can be stored in one place with proper governance and access controls.

AWS Glue: Benefits

Distributed environments often carry large workloads that need parallel processing, and Glue is built to process workloads of that size. For such workloads, Glue is generally faster than other AWS services sometimes used for ETL, such as AWS Lambda, which also take more effort to integrate with data sources. Glue facilitates enterprise-level data integration and gives increased visibility into data.

AWS Glue: Limitations

Glue comes with certain limitations. It supports only two languages, Scala and Python, for custom ETL code. And because Glue is an AWS managed service, it is tied to the Amazon ecosystem and is not compatible with platforms outside it.

Written by 

Karuna Puri is a Tech Lead at Knoldus Inc. with 8+ years of experience. She is a backend developer with expertise in the functional programming language Scala. She is also well versed in cloud technologies, particularly AWS. She is a tech enthusiast with interests in other domains, including data mining and machine learning.