Data integration is a necessary process that combines data for machine learning, analytics, and cloud application development. AWS Glue makes every step of that process serverless and runs it in the AWS Cloud. It provides the capabilities needed to make data ready for analysis quickly rather than after a lengthy setup: extracting data from multiple sources, cleaning it, normalizing it, and organizing it in databases and data lakes. Which method you use depends on the type of user you are and what products you are using.
AWS Glue is at its core an ETL (extract, transform, load) service. Its workflows are easy to start, and you can monitor and manage them from the AWS Glue Studio interface. As with other AWS services, you pay only for the resources you use while they are running.
AWS Glue comes with a robust feature set that, used to the fullest, handles the heavy lifting of data integration, leaving you free to focus on the analysis that follows.
AWS Glue Features
AWS Glue Data Catalog
This is the central metadata repository for all your data assets, no matter where they are actually located. To help control the AWS Glue environment, it holds job and table definitions, schemas, and other control information. It uses AWS machine learning to recognize patterns in your data, making queries cost-effective and efficient.
Alongside this, a schema version history is saved, so you can see how your data has changed over time.
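To make the idea of a schema version history concrete, here is a minimal in-memory sketch in plain Python. The class and field names are hypothetical illustrations of the concept, not the AWS Glue Data Catalog API.

```python
# Illustrative sketch only: a tiny in-memory model of tracking schema
# versions over time, so changes between versions can be diffed.

class SchemaHistory:
    """Stores successive versions of a table schema."""

    def __init__(self):
        self.versions = []  # list of {column_name: type_name} dicts, oldest first

    def register(self, schema):
        self.versions.append(dict(schema))
        return len(self.versions)  # 1-based version number

    def diff_latest(self):
        """Columns added or removed between the two most recent versions."""
        if len(self.versions) < 2:
            return {"added": [], "removed": []}
        old, new = self.versions[-2], self.versions[-1]
        return {
            "added": sorted(set(new) - set(old)),
            "removed": sorted(set(old) - set(new)),
        }

history = SchemaHistory()
history.register({"id": "bigint", "name": "string"})
history.register({"id": "bigint", "name": "string", "email": "string"})
print(history.diff_latest())  # {'added': ['email'], 'removed': []}
```

A real catalog stores much richer metadata (locations, partitions, classifications), but the version-diff idea is the same.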
AWS Glue Schema Registry
The registry lets you use Apache Avro schemas to validate and control streaming data, at no additional charge. When data streaming applications are integrated with the registry, you can improve data quality and safeguard against unexpected schema changes. You can also create or update AWS Glue tables and partitions using schemas stored in the registry.
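The core idea behind schema validation is simple: check every record against a declared schema before it enters the stream. The sketch below shows that idea in plain Python with a simplified Avro-style schema; it is an illustration, not the AWS Glue Schema Registry client.

```python
# Hypothetical sketch: validate records against a declared schema before
# they are produced to a stream, rejecting unexpected shapes or types.

SCHEMA = {  # simplified Avro-style record schema (assumed for illustration)
    "name": "ClickEvent",
    "fields": {"user_id": int, "page": str, "ts": float},
}

def validate(record, schema=SCHEMA):
    """Return True if the record has exactly the declared fields and types."""
    fields = schema["fields"]
    if set(record) != set(fields):
        return False
    return all(isinstance(record[name], typ) for name, typ in fields.items())

good = {"user_id": 42, "page": "/home", "ts": 1700000000.0}
bad = {"user_id": "42", "page": "/home"}  # wrong type, missing field
print(validate(good), validate(bad))  # True False
```

The registry performs this check centrally for all producers and consumers, which is what safeguards downstream applications from unexpected changes.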
Drag and Drop Visual Interface
You don’t have to be an Apache Spark expert to create scalable ETL jobs for distributed processing. AWS Glue lets you define ETL processes in a drag and drop interface and automatically generates the code, in Python or Scala, to extract, transform, and load your data on Apache Spark.
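The scripts Glue generates follow a consistent extract/transform/load shape. The sketch below shows that shape in plain Python; a real generated script would operate on Spark DataFrames or Glue DynamicFrames, and the sources and sinks here are stand-ins.

```python
# Schematic of the three ETL stages. The data and functions are
# illustrative; in a generated Glue script, extract() would read from a
# catalog table, S3, or JDBC, and load() would write to a target store.

def extract():
    return [{"name": " Ada ", "score": "91"}, {"name": "Grace", "score": "88"}]

def transform(rows):
    # Cleaning and type casting: the kind of step the visual editor generates.
    return [{"name": r["name"].strip(), "score": int(r["score"])} for r in rows]

def load(rows, sink):
    sink.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Ada', 'score': 91}, {'name': 'Grace', 'score': 88}]
```

Because the generated code is ordinary Python or Scala, you can also edit it by hand once the visual editor has produced a starting point.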
AWS Glue Elastic Views
AWS Glue Elastic Views lets you create views over data held in multiple types of stores. You write the views in PartiQL, an open-source SQL-compatible query language, to combine and query the data. This does not depend on the structure of the data: it can be tabular or document-like.
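The point of a structure-independent view is that the same projection works whether a record is flat or nested. The sketch below illustrates that in plain Python; the record shapes and the `city_of` helper are assumptions for illustration, standing in for a PartiQL query such as `SELECT name, address.city FROM customers`.

```python
# Sketch: one view over two differently shaped sources, one tabular
# (flat records) and one document-like (nested records).

tabular = [{"name": "Ada", "city": "London"}]
documents = [{"name": "Grace", "address": {"city": "Arlington", "zip": "22201"}}]

def city_of(record):
    # Path navigation tolerates both flat and nested layouts.
    return record.get("city") or record.get("address", {}).get("city")

view = [{"name": r["name"], "city": city_of(r)} for r in tabular + documents]
print(view)
```

Elastic Views additionally keeps such a view materialized and up to date as the source data changes, which this static sketch does not attempt to show.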
FindMatches
AWS Glue has a built-in feature called FindMatches that links records that are imperfect matches of each other. It uses machine learning to deduplicate your records: you label example record pairs as matches or non-matches, the system learns from your criteria, and it can then build an ETL job that finds duplicates across records.
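FindMatches trains a model from your labeled pairs; as a much simpler stand-in for the same idea, the sketch below flags likely duplicates using a string-similarity ratio and a fixed threshold. The threshold value is an assumption chosen for illustration.

```python
# Illustrative stand-in for fuzzy deduplication: instead of a trained
# model, use difflib's similarity ratio with a fixed cutoff.
from difflib import SequenceMatcher

def likely_duplicate(a, b, threshold=0.85):
    """Flag two records as probable duplicates despite imperfect matches."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return score >= threshold

print(likely_duplicate("Jon Smith, 12 Oak St", "John Smith, 12 Oak St."))  # True
print(likely_duplicate("Ada Lovelace", "Grace Hopper"))                    # False
```

The advantage of the learned approach is that it captures your notion of a match (nicknames, abbreviations, transposed fields) rather than a single generic similarity measure.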
Development Endpoints
AWS Glue provides development endpoints for interactively developing and testing the ETL code it generates for you. You can also write custom readers, writers, or transformations and import them into your Glue ETL jobs.
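A custom transformation is just a function written and tested separately, then imported into a job. The sketch below shows the idea; the `apply_transform` helper and the masking rule are assumptions for illustration, not a Glue API.

```python
# Hypothetical custom transform, developed and tested on its own before
# being imported into an ETL job.

def mask_email(row):
    """Redact the local part of an email address, keeping the first letter."""
    user, _, domain = row["email"].partition("@")
    return dict(row, email=f"{user[0]}***@{domain}")

def apply_transform(rows, fn):
    # Stand-in for the job applying the imported transform to each record.
    return [fn(r) for r in rows]

rows = [{"id": 1, "email": "ada@example.com"}]
print(apply_transform(rows, mask_email))  # [{'id': 1, 'email': 'a***@example.com'}]
```

Developing transforms like this against a development endpoint lets you iterate quickly before running them inside a full job.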
AWS Glue DataBrew
DataBrew lets you normalize data without writing code. Its point-and-click visual interface makes it simple for users such as data scientists to clean and normalize data drawn from data lakes, data warehouses, and databases, including Amazon S3 and Amazon Redshift.
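Under the hood, a no-code recipe is a sequence of transformation steps. The sketch below shows, in plain Python, the kind of steps a recipe might chain: trimming whitespace, standardizing case, and scaling a numeric column to [0, 1]. The column names and steps are assumptions for illustration.

```python
# Sketch of a normalization "recipe": trim and title-case a text column,
# then min-max scale a numeric column. DataBrew applies such steps
# visually; this is plain Python standing in for the concept.

def normalize(rows):
    scores = [r["score"] for r in rows]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1  # avoid division by zero on a constant column
    return [
        {"city": r["city"].strip().title(), "score": (r["score"] - lo) / span}
        for r in rows
    ]

raw = [{"city": "  new york", "score": 50}, {"city": "PARIS ", "score": 100}]
print(normalize(raw))  # [{'city': 'New York', 'score': 0.0}, {'city': 'Paris', 'score': 1.0}]
```

In DataBrew each of these steps would be one entry in a reusable recipe, applied to the full dataset when the recipe job runs.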