I was working with a customer who was just getting started using AWS, and they wanted to understand how to query their AWS service logs that were being delivered to Amazon Simple Storage Service (Amazon S3). I introduced them to Amazon Athena, a serverless, interactive query service that allows you to easily analyze data in Amazon S3 and other sources. Together, we used Athena to query service logs, and were able to create tables for AWS CloudTrail logs, Amazon S3 access logs, and VPC flow logs. As I walked the customer through the documentation and created tables and partitions for each service log in Athena, I thought there had to be an easier and faster way for customers to query their logs in Amazon S3. That easier way is the focus of this post.
This post demonstrates how to use AWS CloudFormation to automatically create AWS service log tables, partitions, and example queries in Athena. We also use the SQL query editor in Athena to query the AWS service log tables that AWS CloudFormation created.
Athena best practices
This solution is appropriate for ad hoc use and queries the raw log files. These raw files can range from compressed JSON to uncompressed text formats, depending on how they were configured to be sent to Amazon S3. If you need to query hundreds of GBs or TBs of data per day in Amazon S3, performing ETL on your raw files and transforming them to a columnar file format like Apache Parquet can lead to increased performance and cost savings. You can save on your Amazon S3 storage costs by using Snappy compression for Parquet files stored in Amazon S3. To learn more about Athena best practices, see Top 10 Performance Tuning Tips for Amazon Athena.
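As a rough illustration of that ETL step, the following CTAS (CREATE TABLE AS SELECT) sketch rewrites raw logs as Snappy-compressed, partitioned Parquet. It assumes the cloudtrail_logs table created later in this post; the target table name, bucket path, and selected columns are placeholders, not part of the deployed solution.

```sql
-- Sketch: convert raw logs to Snappy-compressed, partitioned Parquet via CTAS.
-- The target table name and external_location are placeholders.
CREATE TABLE cloudtrail_logs_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://<BUCKET_NAME>/curated/cloudtrail/',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT eventtime,
       eventsource,
       eventname,
       awsregion,
       sourceipaddress,
       year, month, day  -- partition columns must come last in the SELECT
FROM cloudtrail_logs;
```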
Table partition strategies
There are a few important considerations when deciding how to define your table partitions. Mainly, you should ask: what types of queries will I be writing against my data in Amazon S3? Do I only need to query data for that day and for a single account, or do I need to query across months of data and multiple accounts? In this post, we focus on querying a single account whose data is partitioned.
By partitioning data, you can restrict the amount of data scanned per query, thereby improving performance and reducing cost. When creating a table schema in Athena, you set the location of where the files reside in Amazon S3, and you can also define how the table is partitioned. The location is a bucket path that leads to the desired files. If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data only for that partition. For more information, see Table Location in Amazon S3 and Partitioning Data. You can then define partitions in Athena that map to the data residing in Amazon S3.
Let’s look at an example to see how defining a location and partitioning our table can improve performance and reduce costs. In the following tree diagram, we’ve outlined what the bucket path may look like as logs are delivered to your S3 bucket, starting from the bucket name and going all the way down to the day.
Outlined in red is where we set the location for our table schema; Athena then scans everything after the CloudTrail folder. Outlined in blue are our partitions, which are where we specify the granularity of our queries. In this case, we partition our table down to the day, which is very granular because we can tell Athena exactly where to look for our data. This is also the most performant and cost-effective option because it results in scanning only the required data and nothing else.
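To make this concrete, the following is a minimal sketch of what a day-partitioned CloudTrail table definition can look like, using the CloudTrail SerDe from the Athena documentation. The column list is abbreviated, and <BUCKET_NAME> and <ACCOUNT-ID> are placeholders; the saved Create Table query deployed later in this post contains the full statement.

```sql
-- Sketch: a day-partitioned CloudTrail table (abbreviated column list).
-- <BUCKET_NAME> and <ACCOUNT-ID> are placeholders for your own values.
CREATE EXTERNAL TABLE cloudtrail_logs (
    eventversion STRING,
    useridentity STRUCT<
        type: STRING,
        principalid: STRING,
        arn: STRING,
        accountid: STRING,
        username: STRING>,
    eventtime STRING,
    eventsource STRING,
    eventname STRING,
    awsregion STRING,
    sourceipaddress STRING
    -- ...remaining CloudTrail columns omitted for brevity
)
PARTITIONED BY (region STRING, year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<BUCKET_NAME>/AWSLogs/<ACCOUNT-ID>/CloudTrail/';
```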
If you have to query multiple accounts and Regions, you should back off the location to AWSLogs and then create a non-partitioned CloudTrail table. This allows you to write queries across all your accounts and Regions, but the trade-off is that your queries take much longer and are more expensive because Athena has to scan all the data after AWSLogs on every query. However, querying multiple accounts is beyond the scope of this post.
Prerequisites
Before you get started, you should have the following prerequisites:
- Service logs already being delivered to Amazon S3
- An AWS account with access to your service logs
Deploying the automated solution in your AWS account
The following steps walk you through deploying a CloudFormation template that creates saved queries for you to run (Create Table, Create Partition, and example queries for each service log).
- Choose Launch Stack.
- Choose Next.
- For Stack name, enter a name for your stack.
You don’t need to have every AWS service log that the template asks for. If you don’t have CloudFront logs, for example, you can leave the PathParameter as is. If you need CloudFront logs in the future, you can simply update the Create Table statement with the correct Amazon S3 location in Athena.
- For each service log table you want to create, complete the following:
  - Replace <_BUCKET_NAME> with the name of the S3 bucket that holds each AWS service log. You can use the same bucket name if it holds more than one type of service log.
  - Replace <Prefix> with your own folder prefix in Amazon S3. If you don’t have a prefix, make sure to remove it from the path parameters.
  - Replace <ACCOUNT-ID> and <REGION> with the desired account and Region.
- Choose Next.
- Enter any tags you wish to assign to the stack.
- Choose Next.
- Verify that the parameters are correct and choose Create stack at the bottom.
Verify the stack has been created successfully. The stack takes about 1 minute to create the resources.
Querying your tables
You’re now ready to start querying your service logs.
- On the Athena console, on the Saved queries tab, search for the service log you want to interact with.
- Choose Create Table – CloudTrail Logs to run the SQL statement in the Athena query editor.
Make sure the location for Amazon S3 is correct in your SQL statement and verify you have the correct database selected.
- Choose Run query or press Tab+Enter to run the query.
The table cloudtrail_logs is created in the selected database. You can repeat this process to create other service log tables.
For partitioned tables like cloudtrail_logs, you must add partitions to your table before querying.
- On the Saved queries tab, choose Create Partition – CloudTrail.
- Update the Region, year, month, and day you want to partition, then choose Run query or press Tab+Enter to run the query (a sketch of the statement’s general shape follows these steps).
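The saved statement registers one folder of logs at a time. As a rough sketch, with <BUCKET_NAME> and <ACCOUNT-ID> as placeholders and illustrative partition values, it takes this general shape:

```sql
-- Sketch: register one day of CloudTrail logs as a partition.
-- <BUCKET_NAME> and <ACCOUNT-ID> are placeholders for your own values.
ALTER TABLE cloudtrail_logs ADD IF NOT EXISTS
    PARTITION (region = 'us-east-1', year = '2020', month = '10', day = '01')
    LOCATION 's3://<BUCKET_NAME>/AWSLogs/<ACCOUNT-ID>/CloudTrail/us-east-1/2020/10/01/';
```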
After you run the query, you have successfully added a partition to your cloudtrail_logs table. Let’s look at some of the example queries we can run now.
- On the Saved queries tab, choose Query – CloudTrail Logs.
This is a base template included to begin querying your CloudTrail logs.
- Highlight the query and choose Run query.
You can see the base query template uses the WHERE clause to leverage partitions that have been loaded.
Let’s say we have a spike in API calls from AWS Lambda and we want to see the users that the calls were coming from in a specific time range as well as the count for each user. Our query looks like the following code:
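Here is a minimal sketch of such a query, assuming the cloudtrail_logs table and the partition added above; the time window and Region values are illustrative:

```sql
-- Sketch: count Lambda API calls per user within a time range.
SELECT useridentity.arn AS user_arn,
       eventname,
       COUNT(*) AS api_call_count
FROM cloudtrail_logs
WHERE eventsource = 'lambda.amazonaws.com'
  AND region = 'us-east-1'
  AND year = '2020' AND month = '10' AND day = '01'
  AND eventtime BETWEEN '2020-10-01T00:00:00Z' AND '2020-10-01T12:00:00Z'
GROUP BY useridentity.arn, eventname
ORDER BY api_call_count DESC;
```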
Or if we wanted to check our S3 access logs to make sure only authorized users are accessing certain prefixes, the query looks like the following code:
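Here is a minimal sketch of such a check; the table name s3_access_logs and the prefix are assumptions (the saved query deployed by the stack defines the actual table):

```sql
-- Sketch: see who is accessing objects under a given prefix, and how often.
SELECT requester,
       operation,
       key,
       COUNT(*) AS request_count
FROM s3_access_logs              -- assumed table name
WHERE key LIKE 'confidential/%'  -- hypothetical prefix
GROUP BY requester, operation, key
ORDER BY request_count DESC;
```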
Cost of solution and cleaning up
Deploying the CloudFormation template doesn’t cost anything. You’re only charged for the amount of data scanned by Athena. Remember to use the best practices we discussed earlier when querying your data in Amazon S3. For more pricing information, see Amazon Athena pricing and Amazon S3 pricing.
To clean up the resources that were created, delete the CloudFormation stack you created earlier. This also deletes the saved queries in Athena.
Summary
In this post, we discussed how we can use AWS CloudFormation to easily create AWS service log tables, partitions, and starter queries in Athena by entering bucket paths as parameters. We used CloudTrail and Amazon S3 access logs as examples, but you can replicate these steps for other service logs that you may need to query by visiting the Saved queries tab in Athena. Feel free to check out the video as well, where I go over how we store logs in Amazon S3 and then give a quick demo on how to deploy the solution.