Data pipelines can be complex and come in various archetypes: ETL, migration, streaming, AI/ML, and more. Testing them comes with specific challenges and new ways of thinking.
While many types of testing and quality activities apply to data pipelines (such as unit tests, performance tests, etc.), we'll focus on the core activities every data pipeline testing strategy should consider:
- Functional testing
- Data testing
Functional testing focuses on pipelines and data movement/transformation logic.
Are my pipelines working as we expect? Are they successfully moving data from one location to another? Beyond the pipelines themselves, there could be transformation logic as well. Is this logic working as expected?
How can we ensure any code we write and any system we utilize works properly? And how can we continuously prove this? (hint: automation)
With functional testing, we are in control of the input, and we know exactly what the output should be. We may want to validate an entire data set, specific parts of a data set, and/or an entire row or object.
While we are technically looking at data and testing data when we're testing transformation or data movement logic, there is still an element of data quality missing.
Data comes from multiple sources in multiple forms. We must understand this data as it's coming in and flowing through our system. That helps us set data quality expectations throughout our pipelines. For example, we may not want a column to contain nulls because we know that will break our logic or cause reporting issues downstream.
In this approach, we're testing the data itself. We have somewhat known or uncontrolled inputs, but we have an idea of what it should look like. It's coming from real sources (or test data from these sources in lower environments). For these reasons, this is important for our system in Production.
Profiling the source data to understand the possible values, types, distributions, and relationships is crucial. This helps us answer questions about data expectations and set appropriate data expectations. We could also speak to the data users themselves to understand their expectations of the data. They can provide valuable information on previous (and current) data issues.
Data Testing Methodologies
There are primarily two approaches for testing data:
- Direct-data comparisons (or apples-to-apples comparisons)
- Column-level validations
In direct-data comparison, we validate that one data set exactly equals another. This checks that the schema, counts, and values are equal between expected and actual data sets. It’s where we control the input and know the expected output. Sound familiar? That's because this matches what we described for functional testing.
This is intended mostly for regression testing of functional use cases and logic. Direct-data comparisons can be slow and resource usage intensive. However, it gives us full control over our test data and ensures we can cover all regression use cases.
It can technically be applied to data testing as well, specifically testing data movement across locations or testing transformations of data we don’t control (even if it’s test data from real sources, for example); we just have to be cognizant of the amount of data we’re testing.
In the above picture, the blue table is our expected data, and the green table is our actual result after pushing our test data through a pipeline. The red cell highlights the difference between the two tables.
This method is inappropriate for testing your production data because it can be slow and flaky. We should also ensure functionality is working before going to production 😁
There aren’t many automated testing frameworks or libraries that provide functionality for direct-data comparison, so many teams build their frameworks in-house.
In column-level validations, we check whether a specific column meets specific expectations. But not necessarily against another data set. Rather, it's an inbound test with the expected result defined in the code. For example, are there any rows where Column A is null?
This methodology can be used for both functional and data testing, with more skew for data quality testing since it allows for uncertainty in the data. Also, we are not checking a specific value in a specific column on a specific row.
In this diagram, a sample column-level validation checks the last column and ensures no nulls exist.
We can use this methodology in lower environments and production, but its value is more suited to production.
Many tools and frameworks exist to help facilitate this testing methodology, with Great Expectations and Deequ being the front runners. Some platforms have built this in even, like Databricks with their Delta Live Tables Expectations.
No data testing strategy is complete without observability. Think analytics. How do we know if our pipelines and data in production have an issue?
Some of the dimensions to consider are:
- What metrics do we want to collect?
- What comparisons do we want to make?
- How do we want to showcase the metrics?
- Where do we want to store the metrics?
- When should we alert?
- Who should be alerted?
We collect metrics, build dashboards where needed, monitor the metrics, and alert when a threshold is exceeded. The team then acts and remediates the issue. Perhaps reloading or fixing specific data sets or re-running pipelines.
Building observability usually involves in-house built tools, infrastructure, and out-of-the-box functionality from the platforms we use. For example, if we're using Azure, we can use Azure Monitor to push our metrics into, build dashboards, and alert.
Use direct-data-comparison when:
- Testing functionality/code
- You'd like to validate an entire data set, a specific object/cell, or an entire row
- In lower environments
Use column-level validations when:
- Testing the quality of the data itself
- You'd like to set specific expectations on a specific column
- In production (and in lower environments only if your source data in these environments provide good coverage of data samples)
Use observability when:
- You'd like to know when an issue occurs in production and alert on it
- You'd like to see the health of your data in production