We live in a world of big data, with multi-terabyte databases and data warehouses holding billions of records. It’s a world with lots of analytical opportunities and, at the same time, a whole new raft of problems. Scale has definite benefits, but it makes it hard to move data around our data centers and clouds, especially when we want to share it with other teams in the business.
Traditionally we’d have just copied the data, passing it on to developers and business analysts as needed, but that approach breaks down at this scale. What’s needed instead is a way to share data from the source quickly and securely, while still allowing users to make changes and have full access to the data.
Why use Azure Data Share?
Azure Data Share is Microsoft’s managed data sharing platform. It works with Azure storage to deliver either snapshots of data or in-place access to it, giving you the best of both worlds. Along with data management tooling, there’s a governance layer so you can see who has access and control how and when they get updates.
Setting up a data sharing environment is hard; you need to find effective ways of partitioning data and providing download capabilities. That means having dedicated infrastructure and bandwidth, especially if you have a lot of partners or if you’re commercializing the data you have and selling it to customers.
Those requirements are a major blocker to building an effective data economy, as both sides of a partnership must invest heavily before they can work with shared data. Working inside Azure with Azure Data Share means that you have a scalable data environment that expands on demand, while cloud-hosted, serverless systems handle the data extraction, compression and delivery process for you. There’s no need to build or manage software or infrastructure; it’s all managed for you automatically.
Azure Data Share offers different sharing models for different types of data storage in Azure. Most require sharing snapshots of your data, updating it as new snapshots are released. This does mean that anyone consuming your data will need connectivity and storage, though things are considerably simpler if you’re both in the same Azure region. Some options, like Azure Data Lake, offer incremental snapshot support, sending changes rather than entire tables or databases.
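The difference between a full snapshot and an incremental one can be made concrete with a minimal Python sketch. It assumes that changes are detected by comparing each file's last-modified time with the time of the previous snapshot. The function name and the file listing are hypothetical, and the sketch illustrates the idea only; it is not a description of how the service works internally.

```python
from datetime import datetime, timezone


def select_changed(files, last_snapshot):
    """Return the names of files modified after the previous snapshot.

    `files` maps a file name to its last-modified time. Passing
    `last_snapshot=None` models the first, full snapshot.
    """
    if last_snapshot is None:
        return sorted(files)
    return sorted(
        name for name, modified in files.items() if modified > last_snapshot
    )


# Hypothetical file listing, for illustration only.
files = {
    "sales/2020-01.csv": datetime(2020, 1, 31, tzinfo=timezone.utc),
    "sales/2020-02.csv": datetime(2020, 2, 29, tzinfo=timezone.utc),
    "sales/2020-03.csv": datetime(2020, 3, 31, tzinfo=timezone.utc),
}

full = select_changed(files, None)
incremental = select_changed(files, datetime(2020, 3, 1, tzinfo=timezone.utc))

print("full snapshot:", full)
print("incremental snapshot:", incremental)
```

The first call returns every file, while the second returns only the file changed since the start of March, which is why incremental delivery needs less bandwidth and storage on the consumer side.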
How to get started with Azure Data Share
Working with Azure Data Share is simple enough; all you need is storage in Azure and an Azure account with appropriate permissions for your storage account. There are different ways of working with different sources, so be sure you’re familiar with the necessary techniques for your share. You’ll need to start by giving Azure access to your data source, using the Azure firewall tools.
With the appropriate prerequisites in place, you’re ready to start sharing data. Select the data you want to share and set up a publication schedule. Users get an invitation by email and, once they accept it, receive their first data snapshot in their Azure storage account. There’s no need to share all your data; you can select a set of records to share, giving access to a slice of storage.
Where data is updated regularly, you can set a snapshot schedule for new releases or for incremental updates. This can be hourly or daily, and users can subscribe to releases as and when they need them. One important aspect of the sharing process is that users can choose where the data is delivered, so if you’re sharing, say, key values from an Azure Blob, the user can choose to have that delivered directly into an Azure Data Lake, ready for analysis.
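As a rough illustration of what an hourly or daily schedule means for a consumer, the following Python sketch lists upcoming snapshot times from a start time and a recurrence. The function name and the start date are hypothetical, and the sketch does not call the service; it only shows the arithmetic behind the two recurrence options mentioned above.

```python
from datetime import datetime, timedelta

# The two recurrence options described in the text.
INTERVALS = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}


def upcoming_snapshots(start, recurrence, count):
    """List the next `count` snapshot times, beginning at `start`."""
    step = INTERVALS[recurrence]
    return [start + i * step for i in range(count)]


# Hypothetical start time, for illustration only.
start = datetime(2024, 1, 1, 9, 0)

for when in upcoming_snapshots(start, "hourly", 3):
    print(when.isoformat())
```

Switching the recurrence to "daily" spaces the same number of snapshots a day apart, which may be a better fit for data that changes slowly.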
If you’re using Azure Data Explorer, you can set up an in-place share as an alternative to snapshots. This provides a direct link to your store, so users can read and query data directly while treating it as if it were in their own subscription. Any changes you make will be available instantly. Not everyone will need this level of access, though it will be extremely useful for internal development teams who need access to live data for application testing.
While much of the Azure Data Share tooling is available through the Azure portal, there are also REST APIs, so you can build software around your data shares. The APIs let you add a data sharing portal to a site or help you construct and manage a consortium where data is provided by different organizations and the resulting aggregate is shared with everyone in the consortium.
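To give a feel for what building on the REST APIs might involve, here is a minimal Python sketch that assembles a request for creating a share without sending it. The URL follows the usual Azure Resource Manager path pattern for the Microsoft.DataShare resource provider. The API version string, the body fields and all of the resource names are assumptions for illustration and should be checked against the current REST reference before use. A real call would also need an Azure Active Directory bearer token in the Authorization header.

```python
import json
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2020-09-01"  # assumed; verify against the REST reference


def share_url(subscription_id, resource_group, account, share):
    """Build the resource URL for a share inside a Data Share account."""
    parts = [
        "subscriptions", subscription_id,
        "resourceGroups", resource_group,
        "providers", "Microsoft.DataShare",
        "accounts", account,
        "shares", share,
    ]
    path = "/".join(quote(part, safe="") for part in parts)
    return f"{ARM_ENDPOINT}/{path}?api-version={API_VERSION}"


# Hypothetical names; the body fields are assumptions, not confirmed here.
url = share_url("my-subscription-id", "data-rg", "partner-data", "monthly-sales")
body = {
    "properties": {
        "description": "Monthly sales extract",
        "shareKind": "CopyBased",
    }
}

print("PUT", url)
print(json.dumps(body, indent=2))
```

Keeping URL construction in one small function makes it easier to reuse across the other resources a portal or consortium tool would need to manage, such as invitations and datasets.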
How secure is Azure Data Share, and how much does it cost?
At the heart of Azure Data Share is Azure’s security tooling, particularly Azure Active Directory’s support for managed identities. This allows controlled access to stores, without either party in the connection getting access to the other’s credentials. There are three types of users: Owners, Contributors and Readers. Owners and Contributors can manage their share directly, while Readers can only view shared data. You always control the data you share, with tooling to manage and monitor Readers. It’s important to note that data is never held in the Azure Data Share service; it’s purely a way of connecting two Azure storage accounts. Some metadata about the data being offered is held, but that’s all.
That level of control is perhaps the most important aspect of the Azure Data Share platform. It means that, as a provider, you can control who has access and how often they can get updates to shared data. Users get some control too, managing invitations to shared data and choosing how they use that data.
Pricing is reasonable: 5 cents to move a snapshot from source to destination, and 50 cents per vCore-hour to create the snapshots (charged per minute and rounded up). That compares well with the costs associated with building and running your own infrastructure, and it could make hybrid data sharing an option if you have a direct connection or a high-speed VPN connection between your data center and Azure. Data can be transferred between Azure regions: a source in the Western United States can be used in East Asia, with all transfers happening inside Azure’s own network.
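A quick back-of-the-envelope calculation shows how those two charges combine. The Python sketch below uses the prices quoted above and assumes that "charged per minute and rounded up" means each snapshot's run time is rounded up to the next whole minute. The number of vCores, the run time and the function name are hypothetical inputs for illustration, not published figures.

```python
import math

SNAPSHOT_MOVE_USD = 0.05  # per snapshot moved, as quoted in the text
VCORE_HOUR_USD = 0.50     # per vCore-hour of snapshot creation


def estimate_cost(snapshots, seconds_per_snapshot, vcores):
    """Estimate the cost in US dollars of a batch of snapshots.

    Assumes each run is billed in whole minutes, rounded up.
    """
    billed_minutes = math.ceil(seconds_per_snapshot / 60)
    compute_per_snapshot = billed_minutes / 60 * vcores * VCORE_HOUR_USD
    total = snapshots * (SNAPSHOT_MOVE_USD + compute_per_snapshot)
    return round(total, 2)


# Hypothetical workload: 30 daily snapshots, 90 seconds each, on 4 vCores.
monthly = estimate_cost(snapshots=30, seconds_per_snapshot=90, vcores=4)
print(f"Estimated monthly cost: ${monthly:.2f}")
```

Under these assumptions each 90-second run is billed as two minutes, so the compute charge is larger than the raw run time would suggest, but the monthly total for this workload stays at a few dollars.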
If you’re a data consumer, using Azure Data Share gives you more data to use in your applications. Datasets can be combined with your own data, used with your own analytics algorithms, or added to your own machine learning training data. There’s really no limit to what you can do with it. Whether it arrives as a snapshot or through in-place sharing, it’s data.