As Natalie Gagliordi covered in her report this morning, Google is announcing previews of a trio of new data and analytics services that fill some key gaps in its portfolio, addressing real-time integration, data sharing, and governance. The common thread is connectivity to existing databases, both on-premises and in the cloud, and the establishment of a new fabric for Google’s analytic services that will provide a common backplane for data discovery, security, and governance.
Google’s announcements fit a pattern that we are currently seeing with the cloud providers: With their database management and analytics portfolios filling out, they are now adding connective tissue. We’re taking a closer look at Google’s newest services, including Datastream, Analytics Hub, and Dataplex.
Datastream and DMS: Automating real-time replication and database migration
This week, Google is announcing a preview of Datastream, a serverless change data capture (CDC) and replication service that will take change streams from Oracle and MySQL databases to a choice of several Google Cloud targets including BigQuery, Cloud SQL, Cloud Spanner, and Google Cloud Storage. More to the point, Google Datastream replication is real-time, although network latencies will cause a lag with sources outside Google Cloud.
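To make the replication pattern concrete, here is a minimal sketch of the downstream "apply" step that a CDC pipeline like this implies: folding a stream of change events into an analytic table with a standard BigQuery MERGE. The project, dataset, table, and column names are ours, purely for illustration; this is not Google's published recipe for Datastream.

```python
# Sketch: fold CDC change events (as landed by a service like Datastream)
# into a BigQuery reporting table with a standard MERGE statement.
# All project, dataset, table, and column names here are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

merge_sql = """
MERGE `my-project.analytics.orders` AS target
USING (
  -- Keep only the most recent change event per primary key.
  SELECT * EXCEPT(row_num) FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY order_id ORDER BY change_timestamp DESC) AS row_num
    FROM `my-project.staging.orders_changelog`
  ) WHERE row_num = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.change_type = 'DELETE' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET target.status = source.status, target.amount = source.amount
WHEN NOT MATCHED AND source.change_type != 'DELETE' THEN
  INSERT (order_id, status, amount)
  VALUES (source.order_id, source.status, source.amount)
"""

client.query(merge_sql).result()  # blocks until the merge completes
```

The value of a managed service like Datastream is that it handles the capture and delivery side of this pipeline continuously; the apply step above is the part that remains visible to the analytics team.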
Datastream complements, and shares some of the same underlying technology with, Google Cloud Database Migration Service (DMS), which was announced last fall and became generally available in late March. We’re going to analyze both in this section.
At first blush, it would be easy to confuse the new Datastream service with DMS: both connect to databases outside Google Cloud; both are serverless; both use similar change-data-capture technology; and both automate configuration tasks, such as connectivity setup, that traditionally required significant manual provisioning effort.
But the use cases are different: Datastream is for ongoing, low-latency replication, such as feeding real-time analytics or event-driven architectures, while DMS, for now, is for one-shot lift-and-shift database migrations to Google Cloud’s managed databases (as we note below, that’s going to change). Also, while Datastream provides heterogeneous support, DMS for now is limited to like-to-like migrations: it supports moving any instance of MySQL or PostgreSQL, from sources such as on-premises systems or Amazon RDS, to Cloud SQL for MySQL or PostgreSQL targets, respectively. DMS works by using the native data replication engines of MySQL and PostgreSQL.
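Because DMS leans on the databases’ native replication, the source has to be configured for it before a migration can start. As a hedged illustration (host and credentials are placeholders, and the exact prerequisites vary by service and version), here is the kind of readiness check this implies on a MySQL source, using the pymysql driver:

```python
# Sketch: verify a MySQL source is configured for the native, binlog-based
# replication that CDC-style migration services depend on.
# Host and credentials are placeholders; exact requirements vary by service.
import pymysql

conn = pymysql.connect(host="source-db.example.com",
                       user="migration_user", password="...")
try:
    with conn.cursor() as cur:
        # Binary logging must be enabled for changes to be captured at all.
        cur.execute("SHOW VARIABLES LIKE 'log_bin'")
        log_bin = cur.fetchone()
        # Many CDC-style tools additionally expect row-based logging,
        # so that each change event carries the full row image.
        cur.execute("SHOW VARIABLES LIKE 'binlog_format'")
        binlog_format = cur.fetchone()
    print(f"log_bin={log_bin[1]}, binlog_format={binlog_format[1]}")
    if log_bin[1] != "ON":
        print("Source is not ready for replication-based migration.")
finally:
    conn.close()
```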
Technically, the DMS services from each of the three major clouds can perform offline (one-time) or online (continuous replication of updates) migrations, but they are not set up as real-time data replication services like Datastream. There’s another differentiator for both Google Cloud services (Datastream and DMS): the serverless design, which includes support for autoscaling and in turn eliminates the need for customers to configure and provision cloud infrastructure.
There are some other subtle differences relating to coverage. AWS and Azure DMS services support a wide variety of database sources and targets and include schema conversion tools so, for instance, an Oracle database could be mapped to PostgreSQL, or a SQL Server database mapped to MySQL. Admittedly, when migrating between different databases, there will be more complexity, especially when translating code and supported data types, not all of which can be automated. For now, Google’s DMS lacks conversion capability as it only targets like-to-like migrations, but heterogeneous migrations and schema conversion capabilities are in the works, based in part on the same CDC engine that powers Datastream.
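To make that complexity concrete, here is an illustrative, deliberately incomplete fragment of the type mapping a schema conversion tool must automate when going from Oracle to PostgreSQL. The function and dictionary are our own sketch, not any vendor's tool; real converters also handle precision and scale rules, procedural code translation, and many more edge cases.

```python
# Illustrative fragment of the Oracle-to-PostgreSQL type mapping a schema
# conversion tool has to automate. Real tools cover far more types, plus
# precision/scale rules and translation of stored procedure code.
ORACLE_TO_POSTGRES = {
    "VARCHAR2": "VARCHAR",
    "NVARCHAR2": "VARCHAR",
    "NUMBER": "NUMERIC",   # precision/scale decide NUMERIC vs INTEGER vs BIGINT
    "DATE": "TIMESTAMP",   # Oracle DATE carries a time-of-day component
    "CLOB": "TEXT",
    "BLOB": "BYTEA",
    "RAW": "BYTEA",
}

def convert_column(name: str, oracle_type: str) -> str:
    """Return a PostgreSQL column definition for an Oracle column,
    or flag types that need manual review."""
    try:
        return f"{name} {ORACLE_TO_POSTGRES[oracle_type]}"
    except KeyError:
        raise ValueError(f"{oracle_type} needs manual conversion review")

print(convert_column("order_total", "NUMBER"))  # -> order_total NUMERIC
```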
This is Google Cloud’s first shot at automated real-time replication and database migration. In the run-up to the late March GA release of DMS, thousands of customers used it. With one exception (DMS will soon add support for SQL Server like-to-like migrations), no future plans have been announced. But we expect that in an upcoming refresh, Google will add the usual suspects to Datastream: PostgreSQL and SQL Server. For DMS, Google will soon support heterogeneous database conversions (e.g., Oracle to PostgreSQL), adding capabilities for code conversion, migration planning, and data type conversion.
Analytics Hub: Share and share alike
Google Cloud is taking its first steps toward opening a marketplace for data and analytic models with its new Analytics Hub. It provides a managed, one-stop shopping point for data sets and models, as an alternative to the informal sharing that currently goes on inside organizations. Google’s new offering conjures up thoughts of Snowflake’s data marketplace; the common threads are that neither is a commercialized marketplace where providers can charge for data sets, and that both operate under the platform’s existing user authentication and access controls. Google differentiates at the starting gate by going beyond data sets to also include analytic models, and by providing visualization capabilities (so you don’t have to rely on third parties). And, this being Google, it is offering access to some of the family-jewel data sets, such as Google Search Trends. We wouldn’t be surprised if Google eventually adds data services or applications.
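From the consumer side, data shared through a hub like this should look like any other BigQuery data set. As a sketch of what that means in practice, here is a query against the public Google Trends data with the BigQuery Python client; the dataset path is BigQuery's public google_trends data set, though treat the exact table and column names as our assumption rather than anything Analytics Hub specifies.

```python
# Sketch: once a publisher shares a data set, consumers query it like any
# other BigQuery table. The table below is from BigQuery's public Google
# Trends data; treat the exact dataset/table path as an assumption.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT term, rank, week
FROM `bigquery-public-data.google_trends.top_terms`
WHERE refresh_date = (
  SELECT MAX(refresh_date)
  FROM `bigquery-public-data.google_trends.top_terms`)
ORDER BY rank
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.term, row.rank, row.week)
```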
A question in our mind is how Analytics Hub’s model marketplace will interoperate with Google’s AI portfolio, and especially with model lifecycle management services, which at this point would be offered by Google Cloud partners. The challenge is ensuring that models offered on the marketplace are vetted and current. To a lesser extent, there may be a similar need to vet the quality and currency of the data sets offered in the hub.
Dataplex: Taking first steps to centrally manage, monitor, and access distributed data
Google is following in the footsteps of Cloudera and Microsoft in tackling the complex problem of providing a single pane of glass to manage, secure, govern, and analyze distributed data in the cloud. Google characterizes this as an “intelligent data fabric” (a term that IBM has also started using) that delivers an “integrated analytics experience.”
Dataplex is, in essence, a common backplane for discovering and governing the data that populates analytics services – initially from Google’s portfolio, though Google aspires to build a third-party ecosystem operating under the same umbrella. It will start with discovering data stored in Google Cloud Storage and BigQuery, but Google plans to cast that net wider soon.
Google is not the first provider to take the plunge here: Cloudera led the way with consistent self-service access and governance of the data lake with SDX, and last fall, Microsoft followed up with Azure Purview. Now it’s Google Cloud’s turn, and it’s aiming wider.
Dataplex’s data management, governance, and data access functions extend from understanding metadata properties and determining lineage, to deciding whether data should be retained, who can access it and with which privileges, and enabling data to be discovered and queried via a variety of tools. These functions have traditionally been siloed, with databases carrying some of their own tooling and third-party tools picking up where the database folks left off. In the cloud, the challenge grows far more complex because of the multiplicity of data sources and data services, all contending for access to the same pool – or pools – of data.
Specifically, Dataplex uses metadata to group distributed data into logical data domains and provides consistent enforcement of data quality, data governance, and access control policies across these domains, regardless of where the data is physically stored. It helps customers logically organize data, wherever it physically resides, into data lakes, zones, and raw assets. Dataplex will provide, in effect, a common view of governance, security, and data access for a wide variety of analytic and data integration services.
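To make that hierarchy concrete, here is a purely conceptual sketch of the lake/zone/asset model: a lake groups zones (raw versus curated, say), and each zone points at physical assets wherever they actually live. All names here are our own illustration; this is not the Dataplex API.

```python
# Purely conceptual sketch of Dataplex's logical hierarchy: a lake groups
# zones, and each zone points at physical assets (Cloud Storage buckets,
# BigQuery datasets) wherever they actually reside.
# All names are illustrative; this is not the Dataplex API.
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    resource: str      # physical location, e.g. a bucket or dataset
    kind: str          # "gcs_bucket" or "bigquery_dataset"

@dataclass
class Zone:
    name: str          # e.g. "raw" or "curated"
    assets: list[Asset] = field(default_factory=list)

@dataclass
class Lake:
    domain: str        # the business domain, e.g. "sales"
    zones: list[Zone] = field(default_factory=list)

sales_lake = Lake(domain="sales", zones=[
    Zone(name="raw", assets=[
        Asset("clickstream", "gs://acme-landing/clicks", "gcs_bucket")]),
    Zone(name="curated", assets=[
        Asset("orders", "acme-project.sales.orders", "bigquery_dataset")]),
])

# A policy attached at the lake or zone level would then apply to every
# asset underneath it, regardless of physical location.
```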
Dataplex automatically discovers data by harvesting metadata and syncing it across BigQuery and the Dataproc Metastore, with built-in data quality checks. That differs from the tag-based approaches used by Cloudera, IBM, and Microsoft in their various data fabrics and cloud data governance services. For Dataplex, metadata becomes the foundation for unifying data discovery and governance; for now, it will publish the metadata to BigQuery, Dataproc Metastore, and Google Cloud Data Catalog.
To ingest data into lakes and zones curated by Dataplex, you can use Google tools such as Cloud Dataflow, Data Fusion, Dataproc, and Pub/Sub, or services from third-party partners. On the front end, Dataplex provides users of BigQuery and Apache Spark one-click access to the logically curated data (see the sketch below), and Google plans to expand the list of consuming services in the future. For now, Dataplex works as the curation and data security backplane for BigQuery and Cloud Storage, but clearly, that’s just a start. Google has announced partnerships with Accenture, Collibra, Confluent, Informatica, HCL, Starburst, NVIDIA, and Trifacta for populating the Dataplex metastore. And, via Anthos or BigQuery Omni, we expect Google will extend the reach of Dataplex to data sitting in other clouds’ object stores in the future.
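On the consumption side, that "one-click access" framing boils down to Spark and BigQuery jobs addressing the curated tables directly. A minimal PySpark sketch, assuming a Dataproc cluster with the spark-bigquery connector installed and a hypothetical table name, might look like this:

```python
# Sketch: a Spark job on Dataproc reading a curated table through the
# spark-bigquery connector. Assumes the connector jar is on the cluster;
# the table name is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curated-read").getOrCreate()

orders = (spark.read.format("bigquery")
          .option("table", "acme-project.sales.orders")
          .load())

orders.groupBy("status").count().show()
```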
Oh, and yes, more multi-cloud
A major theme in Google Cloud’s positioning is that, while Google would like you to run workloads on its public cloud, it would be just as happy for you to run its services on any cloud you want. That’s been the storyline of Google Cloud Anthos, a general-purpose Kubernetes platform that can run in hostile territory; on the data and analytics side, the same goes for BigQuery Omni and Looker. Both have been available for running in AWS, and with this week’s announcements, Azure gets added to the list.
Yep, we’re not fans of spanning multiple clouds with the same logical instance of a database or application (too much operational complexity). But let’s face it: most organizations have data and applications scattered across different clouds, and they need help simplifying the analysis of distributed data. Data gravity is a good reason for making data and analytics services, like BigQuery and Looker, available elsewhere.