The Curiosity Blog

8 Criteria for a Modern Test Data Solution

Written by Thomas Pryce | 28 June 2022 14:06:36 Z

In 2023, (test) data availability, quality, and compliance risks remain a major headache for software development.

Parallel teams, pipelines, and frameworks remain dependent on outdated Test Data Management (TDM) practices, particularly those focused on copying unwieldy data sets periodically to non-production. These obsolete TDM practices are incapable of supplying compliant data of sufficient variety, at the speeds demanded by “agile”, DevOps and CI/CD.

These perennial TDM challenges are not going to go away in a hurry; in fact, several trends are set to add demand for test data of growing complexity, volume, and variety. This article explores several of these trends, deriving 8 criteria for designing an effective, modern test data solution.

Overall, these criteria point to one broad trend: The need to shift focus from Test Data “Management” and copying data, to Test Data Automation that streams data on-the-fly. In particular, this article calls for a move to Test Data Automation that is real-time and event-based.

Modernising test data in this way shifts the paradigm from slow and manual “provisioning”, to parallel teams and frameworks streaming rich data on-the-fly. To learn more about this bottleneck-free approach to test data, check out Curiosity’s Test Data-as-a-Service Solution Brief.

1.   Enable “Agile”, DevOps and CI/CD

In an era of “agile”, DevOps, and automated pipelines, any manual intervention to make or supply test data is simply too slow. It cannot scale to meet the sheer volume and variety of data needed, as ever-changing tests are executed at speed by automated CI/CD pipelines.

Copying data periodically to non-production presents a particularly clear mismatch with the speed of system evolution promised by greater agility, DevOps and CI/CD:

Even today, it can take organisations weeks or months to refresh data in a non-production environment. This is often performed using a combination of manual processes, scripts and outdated tooling. With such wait times, how can test data provisioning provide up-to-date test data for a system that is updated in days, hours, and minutes?

Once data is available in non-production environments, too much time is then spent finding, making, and hacking data for diverse and fast-evolving tests. Even supposedly “on demand” approaches to test data provisioning tend to be brittle as tests and environments change, as they require re-configuration for different scenarios and environments.

Criterion #1 of a modern test data solution:

To achieve rigorous testing at speed, a test data solution must make accurately matched data available “just in time” as different tests are executed within CI/CD pipelines.
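
As a hedged illustration, the Python sketch below shows a CI/CD stage requesting matched data “just in time” before tests run. The service URL, endpoint, and payload are hypothetical stand-ins for whatever interface a test data service actually exposes; they are assumptions for the sake of the example, not a real vendor API.

```python
"""Minimal sketch of a "just in time" test data step in a CI/CD pipeline.

The endpoint, job name, and payload below are hypothetical: they stand in
for whatever API a test data service exposes.
"""
import json
import os

import requests  # third-party HTTP client


# Hypothetical test data service and scenario identifier, taken from pipeline variables.
TEST_DATA_SERVICE = os.environ.get("TEST_DATA_SERVICE", "https://testdata.example.com/api")
SCENARIO_TAG = os.environ.get("SCENARIO_TAG", "checkout-regression")


def resolve_test_data(scenario_tag: str) -> dict:
    """Ask the test data service for data matched to the scenario about to run."""
    response = requests.post(
        f"{TEST_DATA_SERVICE}/jobs/resolve",
        json={
            "scenario": scenario_tag,
            "environment": os.environ.get("CI_ENVIRONMENT", "staging"),
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()  # e.g. connection details or record identifiers


if __name__ == "__main__":
    # Write the resolved data reference to a file for the test stage to pick up.
    data_ref = resolve_test_data(SCENARIO_TAG)
    with open("test_data_ref.json", "w") as handle:
        json.dump(data_ref, handle, indent=2)
```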

2.   Support for Complex, Hybrid Architectures

Any modern test data solution must provide integrated data that’s more complex than ever, in parallel and at speed.

Today, test data must link seamlessly across diverse technologies in evolving hybrid architectures. While 70% of organisations have at least 2 database systems [1], test data must further link across messages, APIs, cloud-based components, mainframe systems, applications, and more.

Legacy systems do not go away as new technologies are adopted, and test data must reflect the co-existence of old and new technologies. This must include the technologies used in development today, such as containerisation and microservice architectures.

At its core, end-to-end testing today involves firing data into a system, and measuring its impact as it flows through integrated technologies. Test cases have converged with test data. If test data does not link consistently across integrated technologies, tests will fail even if there is no genuine defect. But then, what’s the value of running the tests?

Criterion #2 of a modern test data solution:

A robust test data solution today must be capable of connecting into numerous source technologies, and pushing data to a range of different technologies. It must also retain referential integrity as it manipulates or creates cross-system data, including during data masking, generation, cloning, and beyond.
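
As a simple, hedged illustration of cross-system referential integrity, the Python sketch below applies one deterministic masking function to a customer key shared by two illustrative data sets. The HMAC-based pseudonym, field names, and key handling are assumptions for the example, not any particular product’s implementation.

```python
"""Sketch of deterministic masking that preserves referential integrity.

Two illustrative data sets (a CRM table and a billing feed) share a
customer_id key; masking both with the same deterministic function keeps
cross-system joins intact.
"""
import hashlib
import hmac

# Assumption: in practice the key would be held in a vault, not in source code.
MASKING_KEY = b"rotate-me-outside-source-control"


def pseudonymise(value: str) -> str:
    """Return a repeatable, irreversible token for a sensitive value."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256)
    return "CUST-" + digest.hexdigest()[:12].upper()


crm_rows = [{"customer_id": "10001", "name": "Alice Smith"}]
billing_rows = [{"customer_id": "10001", "amount": 42.50}]

# Mask both systems with the same function so the shared key still matches.
for row in crm_rows:
    row["customer_id"] = pseudonymise(row["customer_id"])
    row["name"] = "MASKED"
for row in billing_rows:
    row["customer_id"] = pseudonymise(row["customer_id"])

# The cross-system join survives masking.
assert crm_rows[0]["customer_id"] == billing_rows[0]["customer_id"]
```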

3.   Acceleration of High-Investment Transformation Projects

Enterprises today are engaged in a range of high-investment, long-term transformation projects. This often includes (cloud) migrations and DevOps modernisation, along with renewed attempts to boost agility in software development.

These projects are not only high-value and high-cost; they also add to the demand for rich test data.

Any test data solution must support and accelerate these transformation projects, and not lag behind and block them. Let’s consider two examples:

  1. If parallelised development teams have adopted Kubernetes and containers as part of a DevOps modernisation, then any test data solution must push the right data to containerised databases and clusters on demand.
  2. If an organisation is migrating a mainframe system to a cloud-based architecture, then a test data solution must be capable of producing rich and compliant data in parallel for both the legacy and migrated systems, as sketched below.
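
As a hedged illustration of the second example, the Python sketch below generates one logical account record and renders it both in an illustrative legacy fixed-width layout and as an illustrative cloud JSON document. Both formats are assumptions; a real migration would use the actual copybook and target schema.

```python
"""Sketch: one logical record, rendered for a legacy and a migrated system."""
import json


def make_account(account_id: int) -> dict:
    """Build a single synthetic account record (illustrative fields)."""
    return {"account_id": account_id, "holder": "TEST USER", "balance_pence": 125000}


def to_legacy_fixed_width(record: dict) -> str:
    """Render the record in a mainframe-style fixed-width layout (illustrative)."""
    return (
        f"{record['account_id']:010d}"
        f"{record['holder']:<30}"
        f"{record['balance_pence']:012d}"
    )


def to_cloud_json(record: dict) -> str:
    """Render the same record for the migrated, cloud-based service."""
    return json.dumps(
        {
            "accountId": record["account_id"],
            "holder": record["holder"],
            "balance": record["balance_pence"] / 100,
        },
        indent=2,
    )


account = make_account(42)
print(to_legacy_fixed_width(account))  # loaded into the legacy test environment
print(to_cloud_json(account))          # loaded into the migrated test environment
```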

Test data can introduce significant risk to these costly and lengthy projects, which already have an uncomfortably high failure rate. These risks include gaps in test data coverage and a lack of support for new tools and techniques, as well as compliance risks and test data bottlenecks.

Criterion #3 of a modern test data solution:

A test data solution must provide rich and compliant data at speed for modern transformation projects, even as these projects add to the demand for, and complexity of, test data.

4.   Affordability in the Face of Massive Data Growth

The sheer volumes of data produced by organisations continue to grow at astonishing rates: one in ten data science and engineering professionals in North America reported in 2021 that data volumes at their organisation grow by over 100% per month, while average monthly data growth was a whopping 63% [2].

This growth reflects trends like cloud migrations, as well as the adoption of new tools and technologies. Add in the demand for parallelised non-production data, and the scope for runaway test data infrastructure costs becomes clear.

Making large, physical copies of complex production data for parallel teams, frameworks and environments is simply not viable today. It can also present a compliance risk, given rules around “data minimisation” and purpose limitation.

Criterion #4 of a modern test data solution:

A test data solution must continuously manage growing data volumes, providing optimised and affordable data for parallelised testing and development. Techniques like subsetting, cloning, and “covered” test data generation help by creating only as much data as testing needs. Data virtualisation furthermore provides parallel, lightweight copies of data at a fraction of the time and cost of making physical copies.
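
As a minimal illustration of subsetting with referential integrity preserved, the Python sketch below picks a handful of parent rows and follows the foreign key to their child rows. The in-memory SQLite database, table names, and the “10 customers” slice are illustrative assumptions.

```python
"""Sketch of referentially intact subsetting using an in-memory SQLite database."""
import sqlite3

# Build a small illustrative "production" schema with a parent and a child table.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
""")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"Customer {i}") for i in range(1, 1001)],
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, (i % 1000) + 1, i * 1.5) for i in range(1, 5001)],
)

# Pick a small slice of parents, then follow the foreign key to keep children consistent.
subset_customers = source.execute(
    "SELECT * FROM customers ORDER BY id LIMIT 10"
).fetchall()
ids = [row[0] for row in subset_customers]
placeholders = ",".join("?" for _ in ids)
subset_orders = source.execute(
    f"SELECT * FROM orders WHERE customer_id IN ({placeholders})", ids
).fetchall()

print(f"Subset: {len(subset_customers)} customers, {len(subset_orders)} orders")
```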

5.   Evolving Privacy Legislation

Data Privacy legislation continues to grow more stringent and complex globally. This can add risk to outdated TDM practices, particularly those that copy (raw) production data to less-secure environments.

A test data solution must be compliant with evolving global privacy legislation. This might include requirements like demonstrating legitimate grounds for data processing in non-production, while showing that only as much data as needed is being used to fulfil those grounds.

Organisations might further need to demonstrate that only as many people as necessary have processed data, and might furthermore need to delete or copy every instance of a person’s data “without delay”.

Criterion #5 of a modern test data solution:

A test data solution must be compliant with evolving global privacy legislation.

Minimising the use of sensitive data should be a priority, given the risk and complexity of complying with rules for using production data in non-production. A combination of masking and synthetic data generation enables a shift away from production data, in time creating simulated versions of production for testing and development.
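
As a minimal, hedged sketch of synthetic generation, the Python example below builds fictitious customer records from rules and word lists, so no production value is ever copied into a test environment. The fields, word lists, and distributions are illustrative assumptions only.

```python
"""Sketch of rule-based synthetic test data generation (no production data used)."""
import random
import uuid
from datetime import date, timedelta

random.seed(7)  # repeatable data sets are easier to debug

# Illustrative word lists; real generation rules would reflect the system under test.
FIRST_NAMES = ["Alex", "Sam", "Priya", "Wei", "Maria", "Kofi"]
LAST_NAMES = ["Jones", "Okafor", "Schmidt", "Tanaka", "Garcia", "Lindqvist"]


def synthetic_customer() -> dict:
    """Build one fictitious customer record from generation rules."""
    first, last = random.choice(FIRST_NAMES), random.choice(LAST_NAMES)
    return {
        "id": str(uuid.uuid4()),
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}@example.test",
        "date_of_birth": (
            date(1960, 1, 1) + timedelta(days=random.randint(0, 45 * 365))
        ).isoformat(),
        "marketing_opt_in": random.random() < 0.3,  # skewed flag for edge-case coverage
    }


customers = [synthetic_customer() for _ in range(5)]
for customer in customers:
    print(customer)
```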

6.   Centralised-but-Democratised Test Data

Organisations today require greater visibility and control over their data and infrastructure, to maximise efficiency and ensure compliance. However, this centralisation cannot come at the cost of introducing blockers to teams who require continuous data access.

A test data solution should emphasise reusability, centralising and distributing core competencies. Processes set up by a small team of skilled test data engineers should be made reusable by parallel teams and frameworks. This provides sufficient centralisation, while enabling teams and frameworks to parametrise and trigger processes on-the-fly.

Criterion #6 of a modern test data solution:

A test data solution should centralise skills and processes, while making configurable processes reusable on demand by parallel teams and frameworks. Test data jobs should be automated and parameterisable. They should be exposed to human and automated requesters on demand, for instance via self-service forms, API calls, and functions embedded in test automation frameworks.
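
As one hedged illustration, the pytest sketch below embeds a parameterised test data request in an automation framework. The service URL, job name, and parameters are hypothetical stand-ins for whatever interface a centrally maintained test data job actually exposes.

```python
"""Sketch: a centrally defined test data job triggered from a test framework."""
import os

import pytest
import requests

# Hypothetical test data service; in practice this comes from environment config.
TEST_DATA_API = os.environ.get("TEST_DATA_API", "https://testdata.example.com/api")


@pytest.fixture
def overdue_invoice():
    """Request one 'overdue invoice' record, parameterised per test run."""
    response = requests.post(
        f"{TEST_DATA_API}/jobs/find-or-make",
        json={"job": "invoice", "state": "overdue", "days_overdue": 30},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()  # assumed to describe the allocated record


def test_overdue_invoice_triggers_reminder(overdue_invoice):
    # The test consumes freshly allocated data instead of a hard-coded record.
    assert overdue_invoice["state"] == "overdue"
```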

7.   Parallelisation and Shorter Release Cycles

As organisations seek to shorten release cycles, demand has increased for rapidly changing data. A test data solution must make up-to-date data available to a growing number of parallel teams and frameworks, during sprints that are becoming increasingly short. Test data must furthermore be available for several different system versions at the same time, reflecting the parallelisation of modern development practices.

Criterion #7 of a modern test data solution:

A test data solution must make versioned data available in parallel, at ever-faster speeds, and to a growing number of parallel data requesters.

8.   Automated Data Requesters

Test data must furthermore be made available to automated data requesters. This includes test automation frameworks and CI/CD pipelines, which are less forgiving than humans.

If data’s incomplete or inaccurate, a human tester might process and adjust the data manually. By contrast, an automated test is likely to fail or deliver a false positive if data is out-of-date, misaligned, or mismatched.

Automated tests furthermore increase demand for parallelisation. Test automation frameworks might run high volumes of parallelised tests. If two tests require the same data, that data combination must be available to both in parallel. Furthermore, one test must not consume or edit data in a way that causes other tests to fail.
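
As a minimal, in-process sketch of this allocation concern, the Python example below hands each matching record to exactly one requester and releases it afterwards. The record keys are illustrative, and a real solution would allocate data through a shared service or store rather than a local lock.

```python
"""Sketch of exclusive test data allocation for parallel automated tests."""
import threading

_lock = threading.Lock()
_available = {"ACC-001", "ACC-002", "ACC-003"}  # illustrative record keys
_allocated = set()


def allocate_account() -> str:
    """Hand out a record to exactly one requester; raise if none are left."""
    with _lock:
        if not _available:
            raise RuntimeError("No matching test data left; generate or release some")
        record = _available.pop()
        _allocated.add(record)
        return record


def release_account(record: str) -> None:
    """Return a record to the pool once the test that owned it has finished."""
    with _lock:
        _allocated.discard(record)
        _available.add(record)
```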

The pace of testing enabled by automation and CI/CD further increases the demand for data, as different tests are executed at speeds unimaginable with manual testing.

Criterion #8 of a modern test data solution:

A test data solution must make data available on demand to parallelised automated tests, even as evolving scenarios are executed at speed.

Automating Test Data

This article has now identified 8 criteria for a modern test data solution, drawing on observations regarding the nature of software delivery today. To summarise, a modern test data solution must:

  1. Enable “Agile”, DevOps, and CI/CD.
  2. Seamlessly support complex, hybrid architectures.
  3. Accelerate high-investment transformation projects.
  4. Continuously manage exponential data growth.
  5. Comply with evolving privacy legislation.
  6. Centralise, but distribute, test data skills and processes.
  7. Make versioned data available in parallel, faster than ever, and to a growing number of requesters.
  8. Make data available on demand to automated requesters like test automation frameworks and CI/CD pipelines.

“Managing” or copying data periodically to non-production environments will not suffice in the face of these criteria. Meeting the demand for data today instead requires Test Data Automation. A test data solution must not only offer all the techniques needed for finding, anonymising, making, and allocating data; these same processes must be automated, reusable, and parameterisable on-the-fly.

This in turn provides Test Data-as-a-Service, allocating rich data “just in time” as testers, automated tests, and developers seamlessly trigger the reusable processes. Instead of a bottleneck, test data then becomes an accelerator of rapid testing and development. To learn more about this bottleneck-free approach to test data, check out Curiosity’s Test Data-as-a-Service Solution Brief.

References:

[1] Redgate (2021), Ten insights from the 2021 State of Database DevOps. Retrieved from https://www.red-gate.com/solutions/database-devops/entrypage/report-2021-infographic on 28/06/2021.

[2] Matillion (2022), Matillion and IDG Survey: Data Growth is Real, and 3 Other Key Findings. Retrieved from https://www.matillion.com/resources/blog/matillion-and-idg-survey-data-growth-is-real-and-3-other-key-findings on 28/06/2022.