Data reliability engineering (DRE) is the work of keeping data pipelines delivering fresh, high-quality data to the users and applications that rely on them. The goal of DRE is to enable rapid iteration on data infrastructure, the logical data model, and other parts of the data stack, while still ensuring that the data is usable for the applications that rely on it.
End users (data scientists looking at A/B test results, executives looking at dashboards, customers seeing product recommendations, and so on) don’t care about data quality in the abstract. They care whether the information they are seeing is relevant to the task at hand. DRE focuses on quantifying and meeting those needs while allowing the organization to grow and evolve its data architecture.
DRE borrows the core concepts of Site Reliability Engineering (SRE), which is used by companies such as Google, Meta, Netflix, and Stripe to iterate quickly while keeping their products reliable 24 hours a day, seven days a week. These ideas bring a methodical and quantifiable approach to defining quality, handling problems gracefully, and aligning teams to balance speed and reliability.
What is the significance of data reliability engineering today?
Nobody needs to be told how important data is to nearly every industry. As more roles—not just data science and engineering professionals—interact with data, whether through self-service analytics or the outputs of machine learning models, there is a greater demand for it to “just work” every hour of every day.
However, in addition to having more users and use cases to serve, data teams are also dealing with larger and more diverse data volumes. Snowflake, Databricks, Airflow, dbt, and other modern data infrastructure tools have made it easier than ever to reach a scale where ad hoc approaches can’t keep up.
While the most visible big-data companies, such as Uber, Airbnb, and Netflix, felt these pains first and led much of the foundational work in this discipline, the practice is quickly spreading.
What are the fundamental data reliability engineering principles?
The seven principles outlined in Google’s Site Reliability Engineering book are an excellent starting point for DRE, which adapts them to data warehouses and pipelines rather than software applications.
- Accept risk:
Because something will fail eventually, data teams must devise a strategy for detecting, managing, and mitigating failures as they occur (or before they occur).
- Keep track of everything:
Problems cannot be mitigated if they are not identified. Monitoring and alerting provide teams with the information they need to address data issues.
- Establish data-quality standards:
Acceptable data quality standards must be quantified and agreed upon before teams can take action. Standards-setting tools for DRE include service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs).
- Reduce toil:
Toil is the manual, repetitive operational work required to keep a system running, as opposed to the engineering work that improves it. Teams should minimize toil to keep operational overhead low.
- Make use of automation:
Automating manual processes allows data teams to scale reliability efforts and devote more time to higher-order problems.
- Control releases:
Changes are how things improve, and also how they break. Pipeline code is still code, and it must be tested before deployment.
- Keep things simple:
Minimizing and isolating complexity in any given pipeline job goes a long way toward ensuring its dependability.
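To make the standards principle more concrete, here is a minimal Python sketch of how a team might measure a freshness SLI for one table and compare it with an SLO. Everything in it is an illustrative assumption rather than part of any particular tool: the names `FreshnessCheck`, `freshness_sli`, and `meets_slo`, the 60-minute lag threshold, and the 90% target are all hypothetical. An SLA would be the commitment made to data consumers on top of such a target and is not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class FreshnessCheck:
    """One observation (hypothetical shape): when we checked the table
    and when it was last loaded at that moment."""
    checked_at: datetime
    last_loaded_at: datetime


def freshness_sli(checks: Sequence[FreshnessCheck], max_lag: timedelta) -> float:
    """SLI: fraction of checks where the table's lag was within max_lag."""
    if not checks:
        raise ValueError("need at least one check to compute an SLI")
    good = sum(1 for c in checks if c.checked_at - c.last_loaded_at <= max_lag)
    return good / len(checks)


def meets_slo(sli: float, target: float) -> bool:
    """SLO: the measured SLI must be at or above the agreed target."""
    return sli >= target


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    # Assumed sample data: observed lags of 30, 45, 90, and 20 minutes.
    checks = [
        FreshnessCheck(now, now - timedelta(minutes=m)) for m in (30, 45, 90, 20)
    ]
    sli = freshness_sli(checks, max_lag=timedelta(minutes=60))
    print(f"freshness SLI: {sli:.2f}")                   # freshness SLI: 0.75
    print(f"meets 90% SLO: {meets_slo(sli, 0.90)}")      # meets 90% SLO: False
```

In this sample, three of the four checks fall within the 60-minute threshold, so the SLI is 0.75 and the assumed 90% objective is missed. In practice the checks would come from monitoring rather than a hard-coded list, and the threshold and target would be agreed with the people who consume the data.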
Will data reliability engineering become a standard in the industry?
While one could argue that data reliability engineering is still a new concept, modern companies that use data to run and grow their businesses (Uber, DoorDash, Instacart, etc.) are leading the charge to make DRE a standard practice, and job postings for data reliability engineers are already on the rise. Given the speed of business and the need for data to be trusted, DRE will soon be as common as SRE.