Glue: Data Integration Service
AWS Glue is a managed cloud ETL platform that can be used to enrich, cleanse, normalise, organise, validate, or format data for storage in a data lake, data warehouse, or database.
Glue jobs act as the orchestrator for an ETL workflow. We can create jobs within AWS Glue that automate the scripts for ETL tasks. These jobs can be scheduled and chained, or they can be made event-driven.
AWS Glue serves several purposes: building event-driven ETL pipelines, acting as a data catalog for finding data across multiple data stores, monitoring ETL jobs without maintaining code, exploring data, building views, and so on.
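As a minimal sketch of scheduling and chaining, the snippet below builds the keyword arguments for the boto3 Glue client's create_trigger call: one time-based trigger and one conditional trigger that starts a second job once the first succeeds. The job names (clean-orders, load-orders), the trigger names, and the cron expression are hypothetical placeholders, not values from any real account. The builder functions are plain Python; only create_triggers talks to AWS, and it assumes boto3 is installed and credentials are configured.

```python
import json


def scheduled_trigger(name, job_name, cron_expression):
    """Build create_trigger arguments for a time-based trigger that starts one job."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        # Glue expects the schedule wrapped as cron(...)
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def chained_trigger(name, upstream_job, downstream_job):
    """Build create_trigger arguments that start downstream_job after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def create_triggers(trigger_definitions):
    """Send the definitions to AWS Glue (assumes boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so the builders above work without it

    glue = boto3.client("glue")
    for definition in trigger_definitions:
        glue.create_trigger(**definition)


if __name__ == "__main__":
    # Hypothetical names: a nightly cleaning job followed by a load job.
    nightly = scheduled_trigger("nightly-clean", "clean-orders", "0 2 * * ? *")
    chained = chained_trigger("after-clean", "clean-orders", "load-orders")
    print(json.dumps([nightly, chained], indent=2))
```

Keeping the definitions as plain dictionaries makes them easy to review or store in version control before anything is created in AWS.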
Components of AWS Glue
AWS Glue Data Catalog is a persistent metadata store that facilitates data exploration across different data stores, much like the Apache Hive metastore.
Glue crawlers scan different types of data stores so that the data can be classified. A crawler extracts schema information from the data it scans and stores that metadata in the Data Catalog to guide ETL operations.
AWS Glue DataBrew is a visual tool for data preparation. It makes data cleansing and normalisation easier for analysts.
AWS Glue Studio is a graphical interface for creating, running, and monitoring ETL jobs in AWS Glue.
AWS Glue Elastic Views builds materialised views that combine or replicate data across multiple data stores.
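To make the crawler and Data Catalog components above more concrete, here is a rough sketch using the boto3 Glue client. It builds the arguments for create_crawler, starts the crawler, and then reads the resulting table schemas back from the Data Catalog. The crawler name, IAM role ARN, database name, and S3 path in the example are hypothetical, and the two functions that call AWS assume boto3 is installed and credentials are configured.

```python
def crawler_definition(name, role_arn, database, s3_path, schedule=None):
    """Build create_crawler arguments for a crawler that scans one S3 prefix."""
    definition = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Optional recurring run, wrapped as cron(...) the way Glue expects
        definition["Schedule"] = f"cron({schedule})"
    return definition


def create_and_run_crawler(definition):
    """Create the crawler and start it (assumes boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so crawler_definition works without it

    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])


def list_catalog_tables(database):
    """Yield (table name, [(column, type), ...]) for each table in a catalog database."""
    import boto3

    glue = boto3.client("glue")
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            yield table["Name"], [(c["Name"], c.get("Type")) for c in columns]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    definition = crawler_definition(
        name="orders-crawler",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
        database="sales",
        s3_path="s3://example-bucket/orders/",
    )
    print(definition)
```

Once a crawler run has finished, list_catalog_tables("sales") would return whatever schemas the crawler inferred, which ETL jobs can then use.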
AWS Glue Use Cases
Glue is widely used to build enterprise-level data lakes, in which data wranglers from different domains can store and access data in one place. This gives data analysts and other data personas a consistent, shared way of working with data.
AWS Glue can also be used to build data-catalog-based web applications for data search, filtering, navigation, and research work. A data catalog lets data analysts find data with just a few clicks and reference it in future work, all in one place. With Glue, data wranglers can bring on-premises data to the cloud to take advantage of cloud storage. Structured, semi-structured, and unstructured data in any format can be stored in one place, with proper governance and access controls, across various cloud storage services.
AWS Glue: Benefits
Distributed environments often carry large workloads that need parallel processing to run, and Glue supports processing at that scale. Compared with other AWS services that are sometimes used for ETL, such as AWS Lambda, Glue is faster for these workloads, because those services take more effort to integrate with data sources. Glue facilitates enterprise-level data integration and provides increased data visibility.
AWS Glue: Limitations
Glue comes with certain limitations. It supports only two languages, Scala and Python, for custom code. And because Glue is an AWS managed service, it is not compatible with platforms outside the Amazon ecosystem.