Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL, and your existing ETL, Business Intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics. As a data engineer or application developer, for some use cases, you want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. Now, with the general availability of the Amazon Redshift Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC. This makes it easier and more secure to work with Amazon Redshift and opens up new use cases.
This introduction explains how to use the Amazon Redshift Data API from the AWS Command Line Interface (CLI) and Python. We also explain how to use AWS Secrets Manager to store and retrieve credentials for the Data API.
Introducing the Data API
The Amazon Redshift Data API enables you to painlessly access data from Amazon Redshift with all types of traditional, cloud-native, containerized, and serverless web services-based and event-driven applications.
The Amazon Redshift Data API simplifies data access, ingestion, and egress from the programming languages and platforms supported by the AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++.
The Data API simplifies access to Amazon Redshift by eliminating the need for configuring drivers and managing database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours. The Data API federates AWS Identity and Access Management (IAM) credentials so you can use identity providers like Okta or Azure Active Directory or database credentials stored in Secrets Manager without passing database credentials in API calls.
For customers using AWS Lambda, the Data API provides a secure way to access your database without the additional overhead for Lambda functions to be launched in an Amazon VPC. Integration with the AWS SDK provides a programmatic interface to run SQL statements and retrieve results asynchronously.
Relevant use cases
The Amazon Redshift Data API is not a replacement for JDBC and ODBC drivers; it’s suitable for use cases where you don’t need a persistent connection to a cluster, such as the following:
- Integrating web services-based applications to access data from Amazon Redshift using an API to run SQL statements. For example, you can run SQL from JavaScript.
- Running a long-running query without having to wait for it to complete.
- Running your query one time and retrieving the results multiple times without having to run the query again.
- Building your ETL pipelines with AWS Step Functions, Lambda, and stored procedures.
- Having simplified access to Amazon Redshift from Amazon SageMaker and Jupyter notebooks.
- Building event-driven applications with Amazon EventBridge and Lambda.
- Scheduling SQL scripts to simplify data load, unload, and refreshing of materialized views.
Creating an Amazon Redshift cluster
If you haven’t already created a Redshift cluster, or want to create a new one and aren’t sure how, follow these steps. In this post, we create a table and load data using the COPY command. Make sure that the IAM role you attach to your cluster has the AmazonS3ReadOnlyAccess permission.
Prerequisites for using the Data API
You must be authorized to access the Amazon Redshift Data API. Amazon Redshift provides the RedshiftDataFullAccess managed policy, which offers full access to the Amazon Redshift Data API operations. This policy also allows access to the Amazon Redshift cluster, Secrets Manager, and IAM API operations needed to authenticate and access an Amazon Redshift cluster by using temporary credentials. If you want to use temporary credentials with the managed policy RedshiftDataFullAccess, you have to create your database user with the name redshift_data_api_user.

You can also create your own IAM policy that allows access to specific resources by starting with RedshiftDataFullAccess as a template. For details, refer to the Amazon Redshift Cluster Management Guide.
The Data API allows you to access your database either by using your IAM credentials or by using secrets stored in Secrets Manager. In this post, we use Secrets Manager.
For instructions on using database credentials for the Data API, see How to rotate Amazon Redshift credentials in AWS Secrets Manager.
Using the Data API from the AWS CLI
You can use the Data API from the AWS CLI to interact with the Amazon Redshift cluster. For instructions on configuring the AWS CLI, see Setting up the Amazon Redshift CLI. The Amazon Redshift command line interface (aws redshift) is a part of the AWS CLI that lets you manage Amazon Redshift clusters, for example by creating, deleting, and resizing them. The Data API now provides a command line interface in the AWS CLI (redshift-data) that allows you to interact with the databases in an Amazon Redshift cluster.
Before we get started, ensure that you have the latest version of the AWS CLI configured.
You can invoke help using the following command:
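```bash
aws redshift-data help
```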
The following table shows you different commands available with the Amazon Redshift Data API CLI.
| Command | Description |
| --- | --- |
| list-databases | Lists the databases in a cluster. |
| list-schemas | Lists the schemas in a database. You can filter this by a matching schema pattern. |
| list-tables | Lists the tables in a database. You can filter the table list by a schema name pattern, a matching table name pattern, or a combination of both. |
| describe-table | Describes detailed information about a table, including column metadata. |
| execute-statement | Runs a SQL statement, which can be DML, DDL, COPY, or UNLOAD. |
| cancel-statement | Cancels a running query. To be canceled, a query must be in the running state. |
| describe-statement | Describes the details of a specific SQL statement run. The information includes when the query started, when it finished, the number of rows processed, and the SQL statement. |
| list-statements | Lists the SQL statements. By default, only finished statements are shown. |
| get-statement-result | Fetches the temporarily cached result of the query. The result contains the complete result set and the column metadata. You can paginate through a set of records to retrieve the entire result as needed. |
If you want to get help on a specific command, you have to run the following command:
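For example, using list-tables (any subcommand works the same way):

```bash
aws redshift-data list-tables help
```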
Now we look at how you can use these commands. First, get the secret ARN by navigating to your secret in Secrets Manager.
Listing databases
Most organizations use a single database in their Amazon Redshift cluster. You can use the following command to list the databases you have in your cluster. This operation requires you to connect to a database and therefore requires database credentials:
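A sketch of the call; the cluster identifier, database name, and secret ARN below are placeholders for your own values:

```bash
aws redshift-data list-databases \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf
```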
Listing schemas
Similar to listing databases, you can list your schemas by using the list-schemas command:
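Using the same placeholder cluster, database, and secret ARN as before:

```bash
aws redshift-data list-schemas \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf
```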
You have several schemas that match demo (demo, demo2, demo3, and so on). You can optionally provide a pattern to filter your results to those matching it:
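The --schema-pattern option accepts LIKE-style wildcards; demo% here is an illustrative pattern:

```bash
aws redshift-data list-schemas \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf \
    --schema-pattern "demo%"
```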
Listing tables
The Data API provides a simple command, list-tables, to list tables in your database. You might have thousands of tables in a schema; the Data API lets you paginate your result set or filter the table list by providing filter conditions. You can search across all schemas in the database with table-pattern; for example, you can list every table whose name matches a pattern, regardless of schema. See the following code:
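A sketch using an illustrative pattern that matches any table whose name contains sales:

```bash
aws redshift-data list-tables \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf \
    --table-pattern "%sales%"
```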
You can also restrict the table list to a specific schema pattern:
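Combining both filters:

```bash
aws redshift-data list-tables \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf \
    --schema-pattern "demo%" \
    --table-pattern "%sales%"
```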
Running SQL commands
You can run your SELECT, DML, DDL, COPY, or UNLOAD command for Amazon Redshift with the Data API. You can optionally specify a name for your statement. You can optionally specify if you want to send an event to EventBridge after the query runs. The query is asynchronous, and you get a query ID after running a query.
Creating a schema
Let’s now use the Data API to see how you can create a schema. The following command will let you create a schema in your database. You do not have to run this SQL if you have pre-created the schema.
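A sketch, reusing the placeholder cluster and secret; demo is the schema used in the rest of this post’s examples:

```bash
aws redshift-data execute-statement \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf \
    --sql "CREATE SCHEMA demo"
```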
The following shows an example output. We discuss later how you can check the status of a SQL statement that you ran with execute-statement.
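A representative response shape (the ID, timestamp, and ARN values are placeholders):

```json
{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-09-11T16:23:18.000Z",
    "Database": "dev",
    "Id": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    "SecretArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf"
}
```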
Creating a table
You can use the following command to create a table with the CLI.
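A minimal sketch; this demo.sales definition is an illustrative stand-in, not the post’s original DDL:

```bash
aws redshift-data execute-statement \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf \
    --sql "CREATE TABLE demo.sales (id INT, item VARCHAR(32), amount DECIMAL(10,2), sold_at TIMESTAMP)"
```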
Loading sample data
The COPY command lets you load bulk data into your table in Amazon Redshift. You can use the following command to load data into the table we created earlier.
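A sketch of the COPY run; the S3 path and IAM role ARN are placeholders, and the role attached to your cluster needs AmazonS3ReadOnlyAccess as noted earlier:

```bash
aws redshift-data execute-statement \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf \
    --sql "COPY demo.sales FROM 's3://your-bucket/sales/' IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole' CSV"
```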
Retrieving data
The following query uses the table we created earlier:
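For example, a simple aggregate over the placeholder table:

```bash
aws redshift-data execute-statement \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf \
    --sql "SELECT item, SUM(amount) AS total FROM demo.sales GROUP BY item"
```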
If you’re fetching a large amount of data, using UNLOAD is recommended. You can unload data into Amazon Simple Storage Service (Amazon S3) in either CSV or Parquet format. UNLOAD uses the MPP capabilities of your Amazon Redshift cluster and is faster than retrieving a large amount of data to the client side.
The output has the same shape as the execute-statement response shown earlier. You can fetch results using the query ID (the Id field) that you receive as an output of execute-statement.
Checking the status of a statement
You can check the status of your statement by using describe-statement. The output for describe-statement provides additional details such as the PID, the query duration, the number of rows in and size of the result set, and the query ID given by Amazon Redshift. See the following command:
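Using a placeholder statement ID returned by execute-statement:

```bash
aws redshift-data describe-statement \
    --id a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
```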
The following is an example output:
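A representative response (all values are placeholders; Duration is reported in nanoseconds):

```json
{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-09-11T16:23:18.000Z",
    "Duration": 1095981511,
    "HasResultSet": true,
    "Id": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    "QueryString": "SELECT item, SUM(amount) AS total FROM demo.sales GROUP BY item",
    "RedshiftPid": 20859,
    "RedshiftQueryId": 48879,
    "ResultRows": 5,
    "ResultSize": 230,
    "SecretArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf",
    "Status": "FINISHED",
    "UpdatedAt": "2020-09-11T16:23:19.000Z"
}
```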
Canceling a running statement
If your query is still running, you can use cancel-statement to cancel a SQL query. See the following command:
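Again with a placeholder statement ID:

```bash
aws redshift-data cancel-statement \
    --id a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
```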
Fetching results from your query
You can fetch the query results by using get-statement-result. The query result is stored for 24 hours. See the following command:
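With the same placeholder ID:

```bash
aws redshift-data get-statement-result \
    --id a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
```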
The output of the result contains metadata such as the number of records fetched, column metadata, and a token for pagination.
Exporting data
Amazon Redshift allows you to export from database tables to a set of files in an Amazon S3 bucket using the UNLOAD command with a SELECT statement. You can unload data in either text or Apache Parquet format. The following command shows you an example of how you can use the data lake export with the Data API:
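A sketch of a data lake export; the output path and role ARN are placeholders:

```bash
aws redshift-data execute-statement \
    --cluster-identifier redshift-cluster-1 \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf \
    --sql "UNLOAD ('SELECT * FROM demo.sales') TO 's3://your-bucket/unload/sales_' IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3WriteRole' FORMAT AS PARQUET"
```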
Using the Data API from the AWS SDK
You can use the Amazon Redshift Data API in any of the programming languages supported by AWS SDK. For this post, we use the AWS SDK for Python (boto3) as an example to illustrate the capabilities of the Data API.
We first import the boto3 package and establish a session:
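A minimal setup; the session picks up credentials and Region from your AWS configuration:

```python
import boto3

session = boto3.Session()
```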
Getting a client object
You can create a client object from the boto3.Session object; the Data API’s service name in boto3 is redshift-data:
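```python
client = session.client('redshift-data')
```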
If you don’t want to create a session, your client is as simple as the following code:

```python
import boto3

client = boto3.client('redshift-data')
```
Running a statement
The following example code uses the Secrets Manager key to run a statement. For this post, we use the table we created earlier. You can use DDL, DML, COPY, and UNLOAD as a parameter:
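A sketch with the same placeholder identifiers as the CLI examples; Sql can be any DDL, DML, COPY, or UNLOAD statement:

```python
response = client.execute_statement(
    ClusterIdentifier='redshift-cluster-1',  # placeholder cluster name
    Database='dev',                          # placeholder database
    SecretArn='arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf',
    Sql='SELECT item, SUM(amount) AS total FROM demo.sales GROUP BY item',
)
print(response['Id'])  # statement ID used to track the query and fetch results
```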
As we discussed earlier, running a query is asynchronous; running a statement returns an ExecuteStatementOutput, which includes the statement ID.
If you want to publish an event to EventBridge when the statement is complete, you can use the additional parameter WithEvent set to true:
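The same call with the event flag added:

```python
response = client.execute_statement(
    ClusterIdentifier='redshift-cluster-1',
    Database='dev',
    SecretArn='arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf',
    Sql='SELECT item, SUM(amount) AS total FROM demo.sales GROUP BY item',
    WithEvent=True,  # publish a completion event to EventBridge
)
```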
Using IAM credentials
Amazon Redshift allows users to get temporary database credentials using GetClusterCredentials. We recommend scoping the access to a specific cluster and database user if you’re granting your users temporary credentials. The following example code uses temporary IAM credentials instead of a secret. As you can see in the code, we use redshift_data_api_user; the managed policy RedshiftDataFullAccess scopes the use of temporary credentials to redshift_data_api_user only.
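A sketch: passing DbUser instead of SecretArn makes the Data API fetch temporary credentials on your behalf:

```python
response = client.execute_statement(
    ClusterIdentifier='redshift-cluster-1',  # placeholder cluster name
    Database='dev',                          # placeholder database
    DbUser='redshift_data_api_user',         # user name the managed policy expects
    Sql='SELECT COUNT(*) FROM demo.sales',
)
```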
Describing a statement
You can use describe_statement to find the status of the query and the number of records retrieved. See the following code:
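Using the statement ID returned by the earlier execute_statement call:

```python
desc = client.describe_statement(Id=response['Id'])
print(desc['Status'])          # e.g. SUBMITTED, STARTED, FINISHED, FAILED
print(desc.get('ResultRows'))  # row count, populated once the query finishes
```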
Fetching results from your query
You can use get_statement_result to retrieve results for your query if your query is complete. See the following code:
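A sketch that checks for completion before fetching:

```python
if client.describe_statement(Id=response['Id'])['Status'] == 'FINISHED':
    result = client.get_statement_result(Id=response['Id'])
    print(result['TotalNumRows'])    # total rows in the cached result
    print(result['ColumnMetadata'])  # column names and types
    print(result['Records'])         # the rows themselves
```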
The get_statement_result command returns a JSON object that includes metadata for the result and the actual result set. You might need to process the data to format the result if you want to display it in a user-friendly format.
Fetching and formatting results
For this post, we demonstrate how to format the results with the Pandas framework. The post_process function processes the metadata and results to populate a DataFrame. The query function retrieves the result from a database in an Amazon Redshift cluster. The following sketch reconstructs these functions using placeholder cluster, database, and secret values:
```python
import time

import boto3
import pandas as pd
```
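Continuing from the imports above; adapt the identifiers to your environment:

```python
client = boto3.client('redshift-data')

def post_process(meta, records):
    # Column names come from the result's column metadata
    columns = [col['name'] for col in meta]
    rows = []
    for record in records:
        # Each field is a single-key dict such as {'stringValue': 'abc'}
        rows.append([list(field.values())[0] for field in record])
    return pd.DataFrame(rows, columns=columns)

def query(sql,
          cluster='redshift-cluster-1',  # placeholder cluster identifier
          database='dev',                # placeholder database name
          secret_arn='arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-secret-AbCdEf'):
    # Submit the statement, then poll until it reaches a terminal state
    statement = client.execute_statement(
        ClusterIdentifier=cluster, Database=database,
        SecretArn=secret_arn, Sql=sql)
    qid = statement['Id']
    while True:
        desc = client.describe_statement(Id=qid)
        if desc['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
            break
        time.sleep(1)
    if desc['Status'] != 'FINISHED':
        raise RuntimeError(desc.get('Error', 'query did not finish'))
    # Large results are paginated with NextToken; this sketch fetches one page
    result = client.get_statement_result(Id=qid)
    return post_process(result['ColumnMetadata'], result['Records'])

df = query('SELECT item, SUM(amount) AS total FROM demo.sales GROUP BY item')
print(df)
```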
Best practices
We recommend the following best practices when using the Data API:
- Federate your IAM credentials to the database to connect with Amazon Redshift. Amazon Redshift allows users to get temporary database credentials with GetClusterCredentials. We recommend scoping the access to a specific cluster and database user if you’re granting your users temporary credentials. For more information, see Example policy for using GetClusterCredentials.
- Use a custom policy to provide fine-grained access to the Data API in the production environment if you don’t want your users to use temporary credentials. You have to use Secrets Manager to manage your credentials in such use cases.
- Ensure that the record size that you retrieve is smaller than 64 KB.
- Don’t retrieve a large amount of data to your client; use the UNLOAD command to export the query results to Amazon S3 instead. You’re limited to retrieving only 100 MB of data with the Data API.
- Don’t forget to retrieve your results within 24 hours; results are stored only for 24 hours.
Customer feedback
Datacoral is a fast-growing startup that offers an AWS-native data integration solution for analytics. Datacoral integrates data from databases, APIs, events, and files into Amazon Redshift while providing guarantees on data freshness and data accuracy to ensure meaningful analytics. Using the Redshift Data API, Datacoral was able to create a completely event-driven and serverless platform that makes data integration and loading easier for their mutual customers. Founder and CEO Raghu Murthy said, “As an Amazon Redshift Ready Advanced Technology Partner, we have worked with the Redshift team to integrate their Redshift API into our product. The Redshift API provides the asynchronous component needed in our platform to submit and respond to data pipeline queries running on Amazon Redshift. It is the last piece of the puzzle for us to offer our customers a fully event-driven and serverless platform that is robust, cost-effective, and scales automatically. We are thrilled to be part of the launch.”
Zynga Inc. is an American game developer, founded in April 2007, that runs social video game services. Zynga uses Amazon Redshift as its central data warehouse for game event, user, and revenue data. The data in the Amazon Redshift data warehouse is used for analytics, BI reporting, and AI/ML across all games and departments. Zynga wants to replace any programmatic access clients connected to Amazon Redshift with the new Amazon Redshift Data API. Currently, Zynga’s services connect using a wide variety of clients and drivers, and they plan to consolidate all of them. This will remove the need for Amazon Redshift credentials and regular password rotations. Johan Eklund, Senior Software Engineer on the Analytics Engineering team at Zynga, who participated in the beta testing, said, “Data API would be an excellent option for our services that will use Amazon Redshift programmatically. The main improvement would be authentication with IAM roles without having to involve the JDBC/ODBC drivers since they are all AWS hosted. Our most common service client environments are PHP, Python, Go plus a few more.”