SQL Server 2019 big data clusters bring relational and unstructured data together in a world where you don’t have to curate data before using it.
Originally, the SQL database service on Azure and the SQL Server product were based on different code. Some years back, Microsoft unified the code base so that applications would be compatible between them. What the SQL team realised was that customers didn't just want to move their database applications to the cloud; they wanted it to be easier to manage the complexity of running multiple databases.
“What we saw was that even in the era of the cloud there was so much interest in SQL Server that there was an opportunity to expand the value proposition of what SQL Server is,” general manager for Azure Data and AI John ‘JG’ Chirapurath told TechRepublic. That led to the Polybase technology in SQL Server 2016 (“Polybase allows you to talk in T-SQL, the lingua franca of SQL Server, and query a bunch of different data sources”) and to SQL Server 2017 for Linux, which Chirapurath calls “one of the most seminal releases we’ve ever had of SQL Server and arguably the most successful version we’ve ever had”. In 18 months, it’s been downloaded more than 20 million times — and some of that volume is down to how straightforward the container support makes it to deploy.
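As a sketch of what that T-SQL querying looks like, a PolyBase external table over a Hadoop source might be defined as follows (the data source, file layout and table names here are invented for illustration, not taken from the article):

```sql
-- Hypothetical PolyBase setup: names, paths and columns are illustrative only.
CREATE EXTERNAL DATA SOURCE HadoopStorage
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- The external table maps HDFS files into the relational namespace...
CREATE EXTERNAL TABLE dbo.WebClicks
(
    ClickTime DATETIME2,
    Url       NVARCHAR(400)
)
WITH (LOCATION = '/clickstream/',
      DATA_SOURCE = HadoopStorage,
      FILE_FORMAT = CsvFormat);

-- ...so ordinary T-SQL works against it, no Hadoop tooling required.
SELECT TOP 10 * FROM dbo.WebClicks;
```

Once the external table exists, it can be queried and joined like any other table, which is the "lingua franca" point Chirapurath is making.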
Simplifying Kubernetes for big data
Customers wanted SQL Server on Kubernetes, Chirapurath says, but they also wanted that to be easier.
“Most enterprises have adopted containers as a means to basically manage the polyglot apps and systems inside the data centres, and increasingly, they are connecting it to Kubernetes as an orchestration framework,” says Chirapurath.
“One of the reasons why Kubernetes is becoming complicated is that there’s lots of extensions that people are adding, so it’s gone from what was a general-purpose framework to a ‘one size fits all’ idea. Kubernetes is lots of different things,” Chirapurath adds.
For SQL Server 2019, cluster admins have an admin portal and a command-line interface they can use to do all the management.
What customers then started asking for was the same kind of help with their data lakes of unstructured data that SQL Server was giving them between cloud and on-premises databases. “Help us with our data lake and all the unstructured data we’ve collected. Help us manage all of this complexity with data we see on-premises and in cloud. Give us a very consistent way, where we can use our familiar tools, familiar skills, with things we already bet on. And the bets involved things like Spark, HDFS, perhaps Python, R and similar things,” Chirapurath says.
Once SQL Server 2017 was running on containers, Microsoft could extend Polybase so that not only can you query data sources like MongoDB and Oracle with SQL Server 2019, but you can also put an HDFS storage pool inside a SQL Server cluster (running on a cluster of Linux containers) and run both SQL Server and Spark in that storage pool.
“We’ve essentially taken the open-source Spark engine and integrated it tightly with the SQL query processor, the SQL Server engine,” says Chirapurath. “Now you can essentially embed Spark queries inside SQL Server SQL syntax statements, and you can directly query data lakes. SQL is completely relational, while your data lakes are completely unstructured — they can be any kind of data. Now you’re able to create a data virtualization layer and manage all kinds of data in your enterprise, regardless of whether it can be stored inside SQL Server or not.”
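In a SQL Server 2019 big data cluster that integration surfaces as an external data source over the cluster's own HDFS storage pool. A hedged sketch, following the documented storage-pool endpoint but with invented paths and columns:

```sql
-- Illustrative only: the sqlhdfs:// endpoint follows the big data cluster
-- documentation; the file path, table and columns are made up.
CREATE EXTERNAL DATA SOURCE SqlStoragePool
WITH (LOCATION = 'sqlhdfs://controller-svc/default');

CREATE EXTERNAL FILE FORMAT CsvFiles
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- Files written to the storage pool (by Spark jobs, for example) become
-- queryable from the SQL engine through this table.
CREATE EXTERNAL TABLE dbo.LakeClicks
(
    ClickTime DATETIME2,
    Url       NVARCHAR(400)
)
WITH (LOCATION = '/clickstream_data',
      DATA_SOURCE = SqlStoragePool,
      FILE_FORMAT = CsvFiles);
```

This is the data virtualization layer in miniature: the files stay in HDFS, and only query results flow into the relational engine.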
That makes it easier to query HDFS, but it also simplifies setup because instead of creating and managing separate HDFS storage, you can deploy a SQL Server Availability Group into Kubernetes with a single command and get a cluster in about half an hour. It also avoids the usual security issues of integrating relational and big data environments: SQL Server big data clusters have an integrated security tier that covers SQL Server, Spark and HDFS. SQL Server Machine Learning Services run in the master instance, so R, Python, Java and machine-learning scoring and prediction run natively there. If you want, there are even third-party tools to accelerate Spark inside SQL Server with FPGAs.
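That single-command deployment goes through Microsoft's cluster CLI. A rough sketch, noting that the tool was renamed during the SQL Server 2019 preview cycle (mssqlctl early on, azdata at release), that flags and profile names vary by version, and that a working Kubernetes cluster is assumed:

```shell
# Sketch only: exact flags differ across releases; a configured Kubernetes
# cluster and the azdata CLI are assumed to already be in place.
azdata bdc config init --source kubeadm-dev-test --target my-bdc-profile
azdata bdc create --config-profile my-bdc-profile --accept-eula yes
```

The `create` step stands up the master instance, the Spark and storage pools, and the integrated security tier together, which is where the roughly-half-an-hour figure comes from.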
Running Spark means you can run Spark ML as well as using the R and Python support added to SQL Server 2017, so you can build machine-learning systems that work against both structured and unstructured data. “You can create a machine-learning pipeline where you can pull in unstructured data through the Spark integration, you can pull in relational data through the SQL query processor and you can use R and Python to create machine-learning models directly inside SQL Server,” says Chirapurath.
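The in-database R and Python path Chirapurath describes runs through the `sp_execute_external_script` procedure. A minimal Python sketch, against a hypothetical table (`dbo.SensorReadings` and its columns are invented for illustration):

```sql
-- Hypothetical example: dbo.SensorReadings is invented; the procedure and
-- InputDataSet/OutputDataSet conventions are standard ML Services behaviour.
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pandas as pd
df = InputDataSet                    # rows from the T-SQL query below
df["rolling_mean"] = df["reading"].rolling(3, min_periods=1).mean()
OutputDataSet = df                   # returned to the caller as a result set
',
    @input_data_1 = N'SELECT reading_time, reading FROM dbo.SensorReadings'
WITH RESULT SETS ((reading_time DATETIME2, reading FLOAT, rolling_mean FLOAT));
```

Because the script executes inside the database engine, the feature-engineering step happens next to the data rather than after an export.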
“What I see in the market today from an AI perspective is that people are approaching the art and science and craft of machine learning in two ways,” Chirapurath says. “One way is through the big data ecosystem, as they collect the data in the Hadoop world and start to use Spark to basically reason over that data and then start using things like Spark ML. Essentially, they are doing very complicated and complex machine learning on the data that they have access to. On the other side, I see a lot of developers using languages like R and Python, even things like Scala, and that’s very developer-centric. What SQL Server 2019 can do is bridge both those points of view.”
Data engineers, data scientists, DBAs and SQL Server developers can all use Azure Data Studio to manage databases, write queries and work with both sets of tools.
Crossing the streams
The advantage of having R and Python inside the SQL Server engine is that you can do machine learning where the data lives, with much lower latency than if the data was stored elsewhere. You get the same advantage with SQL Server 2019, in both directions. The data virtualisation layer avoids the need to move petabytes of data out of a data lake, which isn’t practical. “Likewise, customers are loath to move data out of SQL Server to do a whole data engineering process of cleansing it to achieve an AI task or an ML task. What this allows you to do is just work with the data inside SQL Server, and marry it in the results set for the data you query using Spark from your data lake,” says Chirapurath.
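In practice that "marrying" is just a join. Assuming a curated local table (`dbo.Customers`) and an external table already defined over the data lake with PolyBase (`dbo.LakeEvents`; both names are invented here), a single query spans both worlds:

```sql
-- Assumes dbo.Customers is an ordinary SQL Server table and dbo.LakeEvents
-- is a PolyBase external table over the data lake; both are hypothetical.
SELECT   c.CustomerId,
         c.Segment,
         COUNT(*) AS EventCount
FROM     dbo.Customers  AS c
JOIN     dbo.LakeEvents AS e ON e.CustomerId = c.CustomerId
GROUP BY c.CustomerId, c.Segment;
```

Neither dataset is copied to the other side; only the joined result set materialises, which is the latency and data-movement argument made above.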
Previously, working with structured and unstructured data together meant a significant amount of ETL: to combine SQL Server and Spark data in Azure SQL Data Warehouse, you had to use Azure Data Factory to ingest and transform the data first.
Analytics has presented enterprises with what Chirapurath calls a ‘false choice’ between structured/relational and unstructured/non-relational data, treating the two as separate siloes for purely historical reasons.
“Relational database systems were born at a time when storage came at a high cost. Because of technology innovations, [the cost of] storage has almost gone to zero,” Chirapurath points out. When the IBM PC first appeared, a 20MB hard drive cost $3,000-4,000. “Now half the things I get in my inbox are about 20MB,” he says.
The falling cost of storage, and the difficulty of knowing which of your petabytes of data is worth keeping, means that the problem isn’t saving data — it’s making it useful. “What people did with relational databases was they created highly curated means of storing data; they were extremely hardcore about what was saved and what was thrown away. The ‘golden record’ [everything in every system of record in an organization] was sacrosanct, because storage was incredibly expensive. What’s happened over the last decade is there was no cost reason for throwing away anything, so they’ve stored it all. We are in a save-everything world,” Chirapurath says.
SQL Server 2019 will be shipping soon and Chirapurath hinted that there will be more announcements for Microsoft’s database tools that fit in with these market trends at Ignite, including the bet on open source. “What you’re seeing with SQL Server big data clusters, with an open-source analytics engine, which is Spark, coming next to a relational analytics engine, which is SQL Server, with the system being able to query both relational and non-relational data, is an indication of how we believe we should be working with customers to tear down those artificial barriers.”