How to Build a Big Data Solution on Azure

Azure has made a point of prioritizing AI and analytics services, making it an appealing option for many looking to combine big data analysis with the benefits of cloud computing. With Azure, you can easily process massive amounts of data, both structured and unstructured, with real-time analytics and faster performance than you are likely to get with on-premises resources.

Whether you have a team of data scientists or just want to begin taking advantage of the insights big data can offer, read on to learn about what setting up big data analysis on Azure entails and a bit about some of the services available to you.

Creating a Solution on Azure

To build an effective big data solution, you must address data collection, storage, analytics, and visualization, among other things. These needs can be met exclusively with Azure services or through a combination of Azure services and third-party integrations. The Azure Marketplace even offers fully managed, pre-built big data-as-a-service options that you can run, such as Cloudera, Qubole, and Cazena.

Evaluation

Before you can select services, you need to evaluate your big data goals. You need to understand what data types you wish to include and how the data from those types will be formatted. If you are web scraping, for example, you will need to handle your data very differently than if you are collecting from IoT sensors. The type and amount of data you plan to use will inform the ingestion methods you need as well as the storage type.

Once you know what type of data you plan to work with, you must determine how you wish to analyze it. If you don’t have data scientists on your team, you’re likely going to use one of the big data-as-a-service options. If you do, depending on their specific skill sets, you’ll probably want to include Machine Learning (ML) services in your system, in which case you’ll need to choose ones that support your current ML tools and scripting languages.

If you are not already using cloud services, you should definitely take a look at what exactly an Azure migration entails before going further. Moving big data to the cloud as your first step is probably not a great idea, and you might find it easier to start by moving basic processes and applications to the cloud. If, however, you simply wish to adopt one of the fully managed analysis-as-a-service options, this is less relevant.

Architecture

Assuming you want to create your own solution, you’ll want to start by determining a rough architecture based on your evaluation. The specific architecture that works best for your needs will depend on your workload, what your legacy system is, and the skillsets of your development and operational teams. A generalized pipeline looks something like the image below, however, and can be used as a template for your configuration.

Generalized big data architecture on Azure

Services

After you’ve evaluated your needs and have a rough idea of the architecture you require, you can begin evaluating your service and integration options.

Data Ingestion

Many of Azure's managed services, such as HDInsight, Azure Analysis Services, and Stream Analytics, can cover data ingestion and processing in addition to analytics, but there are also stand-alone options available.
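
Stream Analytics, for example, typically reads from a streaming source such as Azure Event Hubs or IoT Hub. As a rough illustration of the ingestion side, the sketch below pushes a few JSON sensor readings into an Event Hub with the azure-eventhub Python package; the connection string, hub name, and reading fields are placeholders you would replace with your own.

```python
# Minimal sketch: publishing JSON events to an Event Hub that a Stream
# Analytics job (or another consumer) could read from. Requires the
# azure-eventhub package; the connection string and hub name are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

CONNECTION_STR = "<event-hubs-namespace-connection-string>"  # assumption: kept in a secret store
EVENTHUB_NAME = "sensor-readings"                            # hypothetical hub name

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

readings = [
    {"device_id": "sensor-01", "temperature_c": 21.4},
    {"device_id": "sensor-02", "temperature_c": 19.8},
]

with producer:
    batch = producer.create_batch()
    for reading in readings:
        batch.add(EventData(json.dumps(reading)))  # serialize each reading as JSON
    producer.send_batch(batch)                     # one network call for the whole batch
```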

One option, Data Factory, is a serverless data integration tool for siloed data that works in cloud-only and hybrid environments. With it, you can perform Extract, Load, Transform (ELT) processes using over 80 natively built connectors. Data Factory can integrate with Azure Monitor for monitoring and managing pipelines, including those deployed through CI/CD, and you can automate Data Factory processes with schedule-based or event-based triggers.
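
To give a feel for how this works in practice, here is a minimal sketch, using the azure-mgmt-datafactory Python SDK, that kicks off a run of an existing pipeline and then checks its status; the subscription, resource group, factory, and pipeline names are all placeholders for illustration.

```python
# Minimal sketch: starting an existing Data Factory pipeline run from Python.
# Assumes the azure-identity and azure-mgmt-datafactory packages are installed
# and that the resource group, factory, and pipeline named below already exist.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"   # placeholder
resource_group = "rg-analytics"         # hypothetical resource group
factory_name = "adf-ingestion"          # hypothetical Data Factory name

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off a run of a pipeline that was authored in the Data Factory UI.
run_response = adf_client.pipelines.create_run(
    resource_group, factory_name, "CopyRawToLake", parameters={}
)
print("Started pipeline run:", run_response.run_id)

# Check the run's status afterwards.
run = adf_client.pipeline_runs.get(resource_group, factory_name, run_response.run_id)
print("Status:", run.status)
```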

Although not directly used for data ingestion, Data Catalog is a tool that you'll likely find helpful for managing and discovering data sources. Through a crowdsourcing model, it lets users contribute metadata and annotations to data sources, which provides clear identification and allows for searchable indexing. With this service, you can grant discovery access to a wide variety of users without having to provide access to the individual services, keeping your system secure while maximizing community effort.

Data Storage

Azure supports storage in individual databases or, for data combined from multiple sources, in a data warehouse or data lake.

Your most basic option is simply to host SQL Server on a VM; this is the cheapest option and works well for hybrid systems. If you'd prefer a managed database, however, Azure's options include SQL Database, Database for MySQL, Database for PostgreSQL, and Database for MariaDB.
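
As a quick sketch of what working with a managed SQL Database looks like from code, the snippet below runs a query over a hypothetical table with pyodbc; the server, database, credentials, and table names are placeholders, and the ODBC Driver 17 for SQL Server is assumed to be installed locally.

```python
# Minimal sketch: querying a managed Azure SQL Database with pyodbc.
# Server, database, credentials, and table are placeholders.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"   # placeholder logical server
    "DATABASE=analyticsdb;"                   # placeholder database
    "UID=<username>;PWD=<password>"           # better: Azure AD auth or Key Vault
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT TOP 5 device_id, temperature_c FROM dbo.SensorReadings")
    for row in cursor.fetchall():
        print(row.device_id, row.temperature_c)
```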

If you need greater flexibility, you might prefer Cosmos DB, a fully managed database service that features global distribution and transparent multi-master replication. With it, you can work with a variety of data models through API endpoints, including Cassandra, MongoDB, SQL, Gremlin, etcd, and Table. Cosmos DB also includes support for Apache Spark and Jupyter Notebooks.
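
A minimal sketch of Cosmos DB's SQL API, using the azure-cosmos Python package, looks roughly like the following; the account endpoint, key, and the database, container, and field names are all hypothetical.

```python
# Minimal sketch: writing and querying JSON documents through Cosmos DB's SQL API.
# Endpoint, key, and names are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("telemetry")
container = database.create_container_if_not_exists(
    id="readings", partition_key=PartitionKey(path="/device_id")
)

container.upsert_item({"id": "r-001", "device_id": "sensor-01", "temperature_c": 21.4})

# Query across partitions with the SQL-like syntax the SQL API exposes.
query = "SELECT c.device_id, c.temperature_c FROM c WHERE c.temperature_c > @min"
for item in container.query_items(
    query=query,
    parameters=[{"name": "@min", "value": 20}],
    enable_cross_partition_query=True,
):
    print(item)
```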

A data warehouse, in which data from multiple sources is stored in a uniformly structured format, is also an option. SQL Data Warehouse is a fully managed warehouse, designed for enterprises, that includes integrated data processing and query functions. It can be used in combination with Azure Active Directory, Azure Data Factory, Azure Data Lake Storage, Azure Databricks and Microsoft Power BI.

Or, if you'd prefer a data lake, in which data from a variety of sources is stored in its native format, you can use Azure Data Lake Storage. This service is an extension of Azure Blob Storage that is optimized for analytics workloads. It uses a Hadoop-compatible file system and supports atomic file and folder operations, which helps ensure high performance.
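
As a rough sketch of how raw data lands in the lake, the snippet below writes a small JSON file to a Data Lake Storage Gen2 filesystem with the azure-storage-file-datalake package; the account, filesystem, and path names are placeholders, and Azure AD authentication through azure-identity is assumed.

```python
# Minimal sketch: landing a raw file in Azure Data Lake Storage Gen2.
# Account, filesystem, and paths are placeholders; the hierarchical
# namespace is assumed to be enabled on the storage account.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

filesystem = service.get_file_system_client("raw")           # hypothetical filesystem (container)
directory = filesystem.get_directory_client("sensors/2024")  # folder-style path
directory.create_directory()

file_client = directory.get_file_client("readings.json")
file_client.upload_data(b'{"device_id": "sensor-01", "temperature_c": 21.4}', overwrite=True)
```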

Analytics

Azure includes many analytics services suited to different skill levels and data sources, including log analytics and IoT stream analytics. Full coverage of these options is beyond the scope of this article, but two worth considering first are Azure Databricks and HDInsight.

Databricks is an Apache Spark-based analytics service, developed in collaboration between Databricks and Microsoft, that integrates with the Azure Machine Learning service and SQL Data Warehouse for AI-based analytics. It includes auto-scaling and automatic cluster termination features and simplifies resource management through serverless pools. Databricks supports Python, Scala, Java, R, and SQL, as well as TensorFlow, PyTorch, and scikit-learn.
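
Inside a Databricks notebook, where a SparkSession is provided automatically as `spark`, a simple analytics step over data landed in the lake might look like the sketch below; the storage path and column names are hypothetical and only illustrate the pattern.

```python
# Minimal sketch of an analytics step in an Azure Databricks notebook, where a
# SparkSession named `spark` already exists. The ABFS path and columns are
# placeholders matching the earlier ingestion examples.
from pyspark.sql import functions as F

readings = (
    spark.read
    .json("abfss://raw@<account>.dfs.core.windows.net/sensors/2024/")  # data landed by the pipeline
)

# Average temperature per device, standing in for a real analytics workload.
summary = (
    readings
    .groupBy("device_id")
    .agg(F.avg("temperature_c").alias("avg_temperature_c"))
)

summary.show()
```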

HDInsight is an enterprise-grade service focused on open-source analytics. It supports the most commonly used frameworks, including Apache Hadoop, Kafka, and Spark, and integrates with a variety of Azure services so you can easily build analytics pipelines. The service supports multiple languages, including Scala, Python, R, JavaScript, and .NET, and works with a variety of tools, such as Visual Studio, IntelliJ, and Jupyter.
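
One common way to submit Spark work to an HDInsight cluster programmatically is through its Livy REST endpoint. The sketch below is a rough illustration using the requests package; the cluster name, credentials, and script location are placeholders, and the script is assumed to already sit in the cluster's attached storage.

```python
# Minimal sketch: submitting a PySpark job to an HDInsight Spark cluster via Livy.
# Cluster name, credentials, and script path are placeholders.
import requests

cluster = "https://<cluster-name>.azurehdinsight.net"
auth = ("admin", "<cluster-login-password>")  # better: pull from a secret store

batch = {
    "file": "wasbs://jobs@<account>.blob.core.windows.net/aggregate_readings.py",
    "name": "aggregate-readings",
}

response = requests.post(
    f"{cluster}/livy/batches",
    json=batch,
    auth=auth,
    headers={"X-Requested-By": "admin"},  # header Livy expects on POST requests
)
response.raise_for_status()
print("Submitted batch:", response.json().get("id"))
```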

Configuration and Monitoring

After you've determined the services you need, you can begin configuration and prepare for a production environment. The exact configuration depends on the services you choose, your data sources, and whether you want a hybrid or cloud-only environment, so it is beyond the scope of this article.

Regardless of your specific configuration, however, you should try to embed analytics into as many processes as possible to ensure that you are getting the best performance and the greatest ROI; Azure Monitor and Log Analytics will serve you well here. You will also need to account for data security and privacy policies as well as backup and recovery solutions, to ensure that your data remains secure and available. Microsoft’s big data architecture guide has some best practices that you’re likely to find helpful in making sure that your configuration is as reliable and efficient as possible.
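
As a small example of the monitoring side, the sketch below pulls a summary from a Log Analytics workspace with the azure-monitor-query package; the workspace ID and the query itself are placeholders you would adapt to the logs your services actually emit.

```python
# Minimal sketch: querying a Log Analytics workspace, e.g. to keep an eye on
# pipeline health. Workspace ID and Kusto query are placeholders.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query="AzureDiagnostics | summarize count() by ResourceProvider",  # hypothetical query
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```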

Conclusion

If you want to get the most from your big data analyses, moving your processes to the cloud seems like an obvious choice. With Azure, in particular, you have the flexibility to move many of your existing processes directly over, adopt an all-in-one service, or create some combination of the two.

Hopefully, this article provided some insight into what's required to set up Azure-based analysis and gave you a better idea of what resources are available, regardless of your in-house expertise.

Image Credit: Tommy Lee Walker/Shutterstock
