The Curiosity Blog

Test Data Security: You're Only as Strong as Your Weakest Employee

Written by James Walker | 27 June 2023

Preventing Production Data Leaks in Test and Development Environments.

Last week, I was lucky enough to attend the European InfoSec conference in London. The event hosted a rich mix of start-ups, enterprises and insightful talks on information security. It was a feast of knowledge on the latest security standards.

At the conference, I was particularly struck by two things:

  1. The market is awash with threat detection tools.

  2. There's a common understanding of the risks posed by employees to an organisation’s infrastructure.

This appreciation of employee threat applied particularly to test data. It seemed almost every software solution showcased at the conference had capabilities for detecting the misuse of production data in development and test environments.

It was fantastic seeing organisations take the security risks associated with non-production data seriously. But, what are the alternatives to using production data in testing and development?

Production Data in Test Environments: A Recipe for Disaster

As the adage goes, “a chain is only as strong as its weakest link.”

This applies particularly to your organisation’s data security: it's only as strong as your least informed, least careful employee. In fact, 74% of data breaches involve the “human element”,[1] while the average data breach costs $4.35 million.[2]

A single mishap can therefore have grave consequences, not least when sensitive production data finds its way into less-secure test environments.

Why do Organisations Still Use Production Data to Test?

Despite the risks, organisations still routinely copy sensitive production data to less-secure test environments. This avoidable practice extends the attack surface of your data. So, why do organisations still do it?

The misuse of production data in test environments often stems from innocent-enough intentions. Developers frequently resort to using real data to test new features or troubleshoot issues, as it simulates real-world scenarios.

While this may seem beneficial from a testing perspective, it's a precarious practice for security. Companies may invest millions in securing production databases and associated infrastructure, implementing a myriad of guardrails, firewalls, and scanners. However, once that data is copied to a less secure environment, such as a test or dev environment, those protections no longer apply.

Real production data often encompasses sensitive information, including customer names, addresses, and financial details. Mishandling this data can lead to breaches that not only tarnish your company's reputation, but also entail severe legal and financial consequences.

The boardroom might take note of one €1.2 billion fine that’s been levied under the EU General Data Protection Regulation (GDPR), along with the 1,691 other fines that have been imposed since it was introduced in 2018.[3]

Moving from the symptom to the cause…

When I asked the vendors at European InfoSec what they would recommend when production data is found in non-production environments, the common response was that it shouldn't be happening, and that access should be revoked.

This is a valid point, but it does not address the root cause. There are instances where developers or testers need realistic data to replicate specific scenarios in the application. If there isn’t a solid test data solution in place for provisioning and creating that data, pulling production data becomes the only option. So, the rules get bent, creating exceptions that carry unacceptable risk.

Instead, we should be thinking about a holistic test data strategy that empowers developers, testers and CI/CD tooling to create and provision the data they need securely and conveniently. Providing a faster, more robust alternative to copying production data will motivate testers and developers away from the practice; merely imposing stringent access control to production will lead to workarounds.

Test Data Masking: A Secure Solution?

Historically, a common approach to protecting sensitive data in test environments has been to mask, anonymise or obfuscate the data. This technique replaces sensitive data with fictitious yet realistic information, allowing developers to work with data that behaves similarly to the original.
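As a rough illustration, masking might look something like the following sketch. The field names, name lists, and formats are invented for this example; real masking tools work against whole databases and preserve referential integrity, but the core idea is the same: swap each sensitive value for a fictitious, format-preserving one.

```python
import hashlib
import random

# Hypothetical replacement values for illustration only.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey"]
LAST_NAMES = ["Smith", "Jones", "Taylor", "Brown"]

def mask_record(record, seed=0):
    """Return a copy of the record with sensitive fields replaced.

    Seeding the RNG from a hash of the original value keeps the
    masking deterministic, so the same input always masks the same way.
    """
    digest = hashlib.sha256(f"{seed}:{record['email']}".encode()).hexdigest()
    rng = random.Random(digest)
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}@example.com",
        # Keep only the last four digits, as a payment processor might.
        "card": "**** **** **** " + record["card"][-4:],
    }

original = {"name": "Ada Lovelace", "email": "ada@realmail.com", "card": "4929 1234 5678 9012"}
masked = mask_record(original)
```

Note that even this simple sketch preserves the data's structure, which, as discussed below, is precisely what makes masking alone an incomplete defence.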

Though popular, this method has its shortcomings. One significant challenge with data masking is that it typically generates high-volume, low-variant data. In other words, it creates large quantities of data that lack the diversity and unpredictability needed in testing.

Masked data mirrors past production usage, during which most users behave as expected. It therefore lacks the negative scenarios needed for rigorous testing, along with data for testing new functionality:

Masked data will typically satisfy just a fraction of the scenarios needed in testing and development.

Masking alone limits the scope and effectiveness of testing, potentially allowing costly defects to slip through to production. Moreover, even though masking alters the data, it retains the original data's structure and distribution, and a skilled individual can often reverse-engineer sensitive information from the masked data set.

Researchers have re-identified 99.98% of individuals in anonymised data sets using just 15 demographic attributes. Another study identified 90% of shoppers from credit card metadata, using just four random transactions per individual.[4]

The point is: you need to remove a lot of data before a data set is truly “anonymous”, including metadata and time series data. Yet, the data must still resemble the original.
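A toy example makes the re-identification risk concrete. The records below are invented, but they show the mechanism the research exploits: even a handful of coarse attributes, taken together, leave most rows unique, and a unique row can be linked back to a real person using outside information.

```python
from collections import Counter

# Invented records: (birth year, gender, postcode district).
# No single attribute identifies anyone, but the combination often does.
people = [
    ("1985", "F", "SW1"),
    ("1985", "M", "SW1"),
    ("1990", "F", "N7"),
    ("1990", "F", "N7"),   # a duplicate: these two hide in a crowd of two
    ("1990", "F", "SW1"),
    ("1972", "M", "E2"),
]

counts = Counter(people)
unique = [p for p in people if counts[p] == 1]      # re-identifiable rows
fraction_unique = len(unique) / len(people)
```

Here four of the six "anonymous" rows are unique on just three attributes. Real data sets carry far more attributes, which is why the fraction climbs towards 100% so quickly.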

Test Data Evolution: Masking to Synthetic Data

To provide both privacy and testing rigour, a modern and fit-for-purpose approach to test data creation has been gaining traction: the use of synthetic data. Synthetic data is artificially generated rather than derived from actual events. Because it is not drawn from real data, it carries no risk of exposing sensitive information.

Instead, algorithms are used to create synthetic data based on the scenarios and business logic underpinning an application. This means that a rich, varied data set can be created, giving full data coverage for testing and development.

The use of synthetic data enables comprehensive and realistic testing, mitigating the risk of costly bugs and security risks. It is also capable of generating accurate data on demand, sidestepping the massive amount of development time wasted waiting for, finding, or making data.

With synthetic data, developers and testers can conduct their work efficiently and effectively, without exposing the organisation to the risks associated with using real production data.

Remove Security Risks – Accelerate Testing and Development

The strength of your data security strategy hinges on its weakest link. If you’re using live production data in non-production environments, this will likely represent one of the weakest links in your chain.

By implementing robust test data management practices, you can better fortify your organisation's data against breaches. It's imperative to equip your workforce with the tools and knowledge they need to navigate the complex world of data security confidently and effectively. Synthetic test data generation provides a secure solution that accelerates and optimises testing and development.

Want to boost the security, efficiency and quality of your software delivery? Book a meeting to talk to us about Test Data Automation.

Footnotes: 

[1] Verizon (2023), 2023 Data Breach Investigation Report. Retrieved from https://www.verizon.com/business/en-gb/resources/reports/dbir/ on 22/06/2023.

[2] Ponemon, IBM (2022), Cost of a Data Breach Report 2022. Retrieved from https://www.ibm.com/downloads/cas/3R8N1DZJ on 22/06/2023.

[3] Enforcement Tracker, “Statistics: Fines imposed over time”. Retrieved from https://www.enforcementtracker.com/?insights on 22/06/2023.

[4] Cited in Natasha Lomas (TechCrunch: 2019), “Researchers spotlight the lie of ‘anonymous’ data”. Retrieved from https://techcrunch.com/2019/07/24/researchers-spotlight-the-lie-of-anonymous-data/ on 22/06/2023.