AWS Glue Studio is an easy-to-use graphical interface that speeds up the process of authoring, running, and monitoring extract, transform, and load (ETL) jobs in AWS Glue. The visual interface lets users who don't know Apache Spark design jobs without writing code, and speeds up the process for those who do.
AWS Glue Studio was designed to help you create ETL jobs easily. After you design a job in the graphical interface, it generates Apache Spark code for you, so you don't have to write it by hand. When the job is ready, you can run it and monitor its status using the integrated UI.
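To make the code-generation idea concrete, a job whose graph reads a single Data Catalog table produces a PySpark script roughly along the following lines. This is a minimal sketch, not the exact generated output: the database and table names are placeholders, and the real generated script also adds per-node transformation contexts.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard AWS Glue job boilerplate emitted at the top of a generated script.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Each source node in the visual graph becomes a DynamicFrame read like this
# (database and table names here are hypothetical placeholders).
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

job.commit()
```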
AWS Glue Studio supports different types of data sources, both structured and semi-structured, and offers data processing in real time and batch. You can extract data from sources like Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon Kinesis, and Apache Kafka. It also offers Amazon S3 and tables defined in the AWS Glue Data Catalog as destinations.
This post shows you how to create an ETL job to extract, filter, join, and aggregate data easily using AWS Glue Studio.
| About this blog post | |
| --- | --- |
| Time to read | 15 minutes |
| Time to complete | 45 minutes |
| Cost to complete (estimated) | Amazon S3: $0.023; AWS Glue: $0.036; AWS Identity and Access Management: $0; Total: $0.059 |
| Learning level | Intermediate (200) |
| Services used | AWS Glue, Amazon S3, AWS Identity and Access Management |
Overview of solution
To demonstrate how to create an ETL job using AWS Glue Studio, we use the Toronto parking tickets dataset, specifically the data about parking tickets issued in the city of Toronto in 2018, and the trials dataset, which contains all the information about the trials related to those parking tickets. The goal is to filter, join, and aggregate the two datasets to get the number of parking tickets handled per court in the city of Toronto during that year.
Prerequisites
For this walkthrough, you should have an AWS account. For this post, you launch the required AWS resources using AWS CloudFormation in the us-east-1 Region. If you haven't signed up for AWS, complete the following tasks:
- Create an account.
- Create an AWS Identity and Access Management (IAM) user. For instructions, see Create an IAM User.
Important: If the AWS account you use to follow this guide uses AWS Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user who is both a data lake administrator and a database creator, as described in the documentation.
Launching your CloudFormation stack
To create your resources for this use case, complete the following steps:
- Launch your stack in us-east-1:
- Select the I acknowledge that AWS CloudFormation might create IAM resources with custom names option.
- Choose Create stack.
Launching this stack creates AWS resources. The following resources, shown in the AWS CloudFormation stack output, are the ones you need in the next steps (a sketch for retrieving them programmatically follows the table):
| Key | Description |
| --- | --- |
| AWSGlueStudioRole | IAM role to run AWS Glue jobs |
| AWSGlueStudioS3Bucket | Name of the S3 bucket to store blog-related files |
| AWSGlueStudioTicketsYYZDB | AWS Glue Data Catalog database |
| AWSGlueStudioTableTickets | Data Catalog table to use as a source |
| AWSGlueStudioTableTrials | Data Catalog table to use as a source |
| AWSGlueStudioParkingTicketCount | Data Catalog table to use as the destination |
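If you would rather pull these output values programmatically than copy them from the console, a boto3 sketch like the following works. The stack name glue-studio-blog is an assumption; use whatever name you gave the stack at launch.

```python
import boto3

# Print every output key/value pair from the stack (stack name is a placeholder).
cfn = boto3.client("cloudformation", region_name="us-east-1")
outputs = cfn.describe_stacks(StackName="glue-studio-blog")["Stacks"][0]["Outputs"]
for output in outputs:
    print(output["OutputKey"], "=", output["OutputValue"])
```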
Creating a job
A job is the AWS Glue component that allows the implementation of business logic to transform data as part of the ETL process. For more information, see Adding Jobs in AWS Glue.
To create an AWS Glue job using AWS Glue Studio, complete the following steps:
- On the AWS Management Console, choose Services.
- Under Analytics, choose AWS Glue.
- In the navigation pane, choose AWS Glue Studio.
- On the AWS Glue Studio home page, choose Create and manage jobs.
AWS Glue Studio supports different sources, including Amazon S3, Amazon RDS, Amazon Kinesis, and Apache Kafka. For this post, you use two AWS Glue tables as data sources and one S3 bucket as the destination.
- In the Create Job section, select Blank graph.
- Choose Create.
This takes you to the Visual Canvas to create an AWS Glue job.
- Change the Job name from Untitled Job to YYZ-Tickets-Job.
You now have an AWS Glue job ready to filter, join, and aggregate data from two different sources.
Adding sources
For this post, you use two AWS Glue tables as data sources: Tickets and Trials, which the CloudFormation template created. The data is located in an external S3 bucket in Parquet format. To add these tables as sources, complete the following steps:
- Choose the (+) icon.
- On the Node properties tab, for Name, enter Tickets.
- For Node type, choose S3.
- On the Data source properties - S3 tab, for Database, choose yyz-tickets.
- For Table, choose tickets.
- For Partition predicate (optional), leave blank.
Before adding the second data source to the ETL job, be sure that the node you just created isn’t selected.
- Choose the (+) icon.
- On the Node properties tab, for Name, enter Trials.
- For Node type, choose S3.
- On the Data source properties - S3 tab, for Database, choose yyz-tickets.
- For Table, choose trials.
- For Partition predicate (optional), leave blank.
You now have two AWS Glue tables as the data sources for the AWS Glue job.
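In the script that AWS Glue Studio generates, these two source nodes become Data Catalog reads roughly like the following sketch (setup lines repeated so the snippet stands alone).

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# The Tickets and Trials source nodes each become a Data Catalog read.
tickets = glueContext.create_dynamic_frame.from_catalog(
    database="yyz-tickets", table_name="tickets"
)
trials = glueContext.create_dynamic_frame.from_catalog(
    database="yyz-tickets", table_name="trials"
)
```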
Adding transforms
A transform is the AWS Glue Studio component where the data is modified. You have the option of using the transforms that are part of this service or adding custom code. To add transforms, complete the following steps:
- Choose the Tickets node.
- Choose the (+) icon.
- On the Node properties tab, for Name, enter Ticket_Mapping.
- For Node type, choose ApplyMapping.
- For Node parents, choose Tickets.
- On the Transform tab, change the ticket_number data type from decimal to int.
- Drop the following columns (the sketch after this list shows the equivalent ApplyMapping call):
Location1
Location2
Location3
Location4
Province
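Behind the scenes, an ApplyMapping node such as Ticket_Mapping translates to something like the sketch below. Only the columns listed in mappings are kept, so omitting Location1 through Location4 and Province drops them. The ticket_number cast comes from the steps above; the other column names and types are hypothetical placeholders, and the tickets variable is the DynamicFrame from the earlier source sketch.

```python
from awsglue.transforms import ApplyMapping

# Keep ticket_number (cast decimal -> int) plus a couple of illustrative
# passthrough columns; anything not listed here is dropped.
ticket_mapping = ApplyMapping.apply(
    frame=tickets,
    mappings=[
        ("ticket_number", "decimal", "ticket_number", "int"),
        ("infraction_description", "string", "infraction_description", "string"),  # placeholder
        ("set_fine_amount", "int", "set_fine_amount", "int"),  # placeholder
    ],
)
```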
Now you add a second ApplyMapping transform to modify the Trials data source.
- Choose the Trials data source node.
- Choose the (+) icon.
- On the Node properties tab, for Name, enter Trial_Mapping.
- For Node type, choose ApplyMapping.
- For Node parents, leave at default value (Trials).
- On the Transform tab, change the parking_ticket_number data type from long to int.
Now that you have set the right data types and removed some of the columns, it's time to join the data sources using the Join transform.
- Choose the Ticket_Mapping transform.
- Choose the (+) icon.
- On the Node properties tab, for Name, enter Join_Ticket_Trial.
- For Node type, choose Join.
- For Node parents, choose Ticket_Mapping and Trial_Mapping.
- On the Transform tab, for Join type, choose Inner join.
- For Join conditions, choose Add condition.
- For Ticket_Mapping, choose ticket_number.
- For Trial_Mapping, choose parking_ticket_number.
Now the two data sources are joined by the ticket_number and parking_ticket_number columns.
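In generated-script terms, this step corresponds roughly to the Join transform from awsglue.transforms, as in the sketch below (ticket_mapping and trial_mapping continue the naming from the earlier ApplyMapping sketch).

```python
from awsglue.transforms import Join

# Inner equijoin of the two mapped frames on their ticket number columns.
joined = Join.apply(
    frame1=ticket_mapping,
    frame2=trial_mapping,
    keys1=["ticket_number"],
    keys2=["parking_ticket_number"],
)
```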
Performing data aggregation
In this step, you do some data aggregation to see the number of tickets handled per court in Toronto.
AWS Glue Studio offers the option of adding custom code for use cases that need a more complex transformation. For this post, we use PySpark code to do the data transformation. It contains Spark SQL code and a combination of DynamicFrames and DataFrames.
- Choose the Join_Ticket_Trial transform.
- Choose the (+) icon.
- On the Node properties tab, for Name, enter Aggregate_Tickets.
- For Node type, choose Custom transform.
- For Node parents, leave Join_Ticket_Trial selected.
- On the Transform tab, for Code block, change the function name from MyTransform to Aggregate_Tickets.
- Enter the following code:
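The following is a minimal sketch of a Glue Studio custom transform that computes the per-court ticket counts described in the overview. A custom transform receives a GlueContext and a DynamicFrameCollection and must return a DynamicFrameCollection; the court_location column name, the temporary view name, and the exact SQL here are assumptions rather than the post's original code, and the import line is only needed if you run the function outside the generated script.

```python
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection


def Aggregate_Tickets(glueContext, dfc) -> DynamicFrameCollection:
    # Take the single joined DynamicFrame out of the collection and convert
    # it to a Spark DataFrame so Spark SQL can be used on it.
    joined = dfc.select(list(dfc.keys())[0]).toDF()
    joined.createOrReplaceTempView("joined_tickets")

    # Count tickets per court. court_location is an assumed column name
    # from the trials table.
    counts = glueContext.spark_session.sql(
        """
        SELECT court_location, COUNT(*) AS ticket_count
        FROM joined_tickets
        GROUP BY court_location
        ORDER BY court_location
        """
    )

    # Wrap the result back into a DynamicFrameCollection for downstream nodes.
    results = DynamicFrame.fromDF(counts, glueContext, "results")
    return DynamicFrameCollection({"results": results}, glueContext)
```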
After adding the custom transformation to the AWS Glue job, you want to store the result of the aggregation in the S3 bucket. To do this, you need a Select from collection transform to read the output from the Aggregate_Tickets node and send it to the destination.
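In the generated script, the Select from collection step corresponds roughly to SelectFromCollection from awsglue.transforms. A sketch, assuming the custom transform's output collection is bound to a variable named aggregated:

```python
from awsglue.transforms import SelectFromCollection

# Pull the single result frame out of the collection returned by Aggregate_Tickets.
selected = SelectFromCollection.apply(
    dfc=aggregated,
    key=list(aggregated.keys())[0],
)
```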
- Choose the New node node.
- Leave the Transform tab with the default values.
- On the Node properties tab, change the name of the transform to Select_Aggregated_Data.
- Leave everything else with the default values.
- Choose the Select_Aggregated_Data node.
- Choose the (+) icon.
- On the Node properties tab, for Name, enter Ticket_Count_Dest.
- For Node type, choose S3 in the Data target section.
- For Node parents, choose Select_Aggregated_Data.
- On the Data target properties - S3 tab, for Format, choose Parquet.
- For Compression Type, choose GZIP.
- For S3 Target Location, enter s3://glue-studio-blog-{Your Account ID as a 12-digit number}/parking_tickets_count/.
The job should look like the following screenshot.
You now have three transforms to do data mapping, filtering, and aggregation.
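Continuing the earlier sketches (glueContext and the selected DynamicFrame as defined above), the destination node corresponds roughly to a write_dynamic_frame call like the one below. The placeholder bucket name must be replaced with your account ID, and expressing the GZIP setting through the compression format option is an assumption about the generated script rather than a confirmed detail.

```python
# Write the aggregated results to the destination prefix as GZIP-compressed Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=selected,
    connection_type="s3",
    connection_options={
        "path": "s3://glue-studio-blog-<account-id>/parking_tickets_count/"
    },
    format="parquet",
    format_options={"compression": "gzip"},
)
```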
Configuring the job
When the logic behind the job is complete, you must set the parameters for the job run. In this section, you configure the job by selecting components such as the IAM role and the AWS Glue version you use to run the job.
- On the Job details tab, for Description, enter Glue Studio blog post job.
- For IAM Role, choose AWSGlueStudioRole (which the CloudFormation template created).
- For Job Bookmark, choose Disable.
- For Number of retries, optionally enter 1.
- Choose Save.
- When the job is saved, choose Run.
Monitoring the job
AWS Glue Studio offers a job monitoring dashboard that provides comprehensive information about your jobs. You can get job statistics and see detailed information about the job and its status while it runs.
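If you prefer to check run status programmatically rather than in the console, a boto3 sketch like the following (assuming default credentials and the job name used above) lists each run of the job with its start time and state.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Print the start time and state (RUNNING, SUCCEEDED, FAILED, ...) of each run.
for run in glue.get_job_runs(JobName="YYZ-Tickets-Job")["JobRuns"]:
    print(run["StartedOn"], run["JobRunState"])
```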
- In the AWS Glue Studio navigation panel, choose Monitoring.
- Choose the entry with the job name YYZ-Tickets-Job.
- To get more details about the job run, choose View run details.
- Wait until Run Status changes to Succeeded.
You can verify that the job ran successfully on the Amazon Athena console.
- On the Athena console, choose the yyz-tickets database.
- Choose the … icon next to the parking_tickets_count table (which the CloudFormation template created).
For more information about creating AWS Glue tables, see Defining Tables in the AWS Glue Data Catalog.
- Choose Preview table.
As you can see in the following screenshot, the information that the job generated is available, and you can query the number of ticket types per court issued in the city of Toronto in 2018.
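Preview table simply issues a SELECT against the destination table; you can run roughly the following query yourself in the Athena query editor to see the same result.

```sql
-- Preview the aggregated output written by the Glue job.
SELECT * FROM "yyz-tickets"."parking_tickets_count" LIMIT 10;
```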
Cleaning up
To avoid incurring future charges and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, and AWS Glue job.
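If you prefer to clean up programmatically, a boto3 sketch like the following works. The bucket name and stack name are placeholders you must replace with the values from your deployment, the bucket has to be emptied before the stack can delete it, and the Glue job is deleted separately because you created it manually rather than through the stack.

```python
import boto3

# Empty the blog bucket first (CloudFormation cannot delete a non-empty bucket).
s3 = boto3.resource("s3", region_name="us-east-1")
s3.Bucket("glue-studio-blog-<account-id>").objects.all().delete()

# Delete the manually created Glue job.
glue = boto3.client("glue", region_name="us-east-1")
glue.delete_job(JobName="YYZ-Tickets-Job")

# Delete the CloudFormation stack and the resources it created.
cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.delete_stack(StackName="<your-stack-name>")
```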
Conclusion
In this post, you learned how to use AWS Glue Studio to create an ETL job. You can use AWS Glue Studio to speed up the ETL job creation process and allow different personas to transform data without any previous coding experience. For more information about AWS Glue Studio, see the AWS Glue Studio documentation and What’s New with AWS.