Why mask data?
Earlier this month, the security firm Imperva announced it had suffered a significant data breach. Imperva had uploaded an unmasked customer database to AWS for “test purposes.”
It was a test environment, not monitored or controlled as rigorously as the production environment. Cybercriminals stole an API key and used it to export the database contents.
Ironically, the victim is a security company that sells a data masking tool, Imperva Data Masking.
Imperva could have avoided this painful episode if it had used its product and established a policy requiring every development and test environment to be limited to masked data.
The lesson for the rest of us is that if you’re moving workloads to AWS or another public cloud, you need to mask data in test and development environments.
Here is how companies can implement such a policy.
The Rationale for Data Masking
Customers concerned about data loss or theft seek to limit the attack surface presented by critical data. A common approach is to restrict sensitive data to “need-to-know” environments, which generally means obfuscating data in non-production (development and test) environments.
Data masking is the process of irreversibly, but self-consistently, transforming data such that one cannot recover the original value from the result. In this sense, it is distinct from reversible encryption and has less inherent risk if compromised.
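This distinction can be illustrated with a keyed hash. The sketch below is a minimal illustration under stated assumptions, not any particular product’s implementation; the key name and value are placeholders. An HMAC-SHA256 mask is deterministic, so the same input always masks to the same output (self-consistency), yet the original value cannot be recovered from the result without the key:

```python
import hashlib
import hmac

# Placeholder key; in practice this would live in a secrets manager and
# never be present in non-production environments, so masked values
# cannot be brute-forced back to their originals.
MASK_KEY = b"example-masking-key"

def mask_value(value: str) -> str:
    """Deterministically mask a value: same input -> same output,
    but the original cannot be recovered from the result."""
    return hmac.new(MASK_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Self-consistent: the same SSN masks to the same token everywhere.
assert mask_value("123-45-6789") == mask_value("123-45-6789")
# Distinct inputs yield distinct masks (with overwhelming probability).
assert mask_value("123-45-6789") != mask_value("987-65-4321")
```

Unlike encryption, there is no decryption routine: even with the masked output in hand, an attacker has nothing to reverse.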
As data-centric enterprises take advantage of the public cloud, a common strategy is to move all non-production environments there first, on the perception that these environments present less risk.
In addition, the nature of the development/test cycle means that these workstreams can enormously benefit from the flexibility in infrastructure provisioning and configurations offered by the public cloud.
To leverage this flexibility, development and test data sets need to be readily available and as close to production as possible, so they represent the wide range of production use cases.
Yet, some customers are reluctant to place sensitive data in public cloud environments.
The answer to this puzzle is to take production data, mask it, and move it to the public cloud. The perception of physical control over data continues to provide comfort (whether false or not).
Data masking also makes it easier for public cloud advocates to gain traction in risk-averse organizations by addressing concerns about data security in the cloud.
Regulations such as GDPR, GLBA, CAT, and HIPAA impose data protection standards that encourage some form of masking of Personal Data, PII (Personally Identifiable Information), and PHI (Protected Health Information) in non-production environments.
Requirements of Data Obfuscation Tools
Data obfuscation tools must meet the following requirements:
• Data Profiling: The ability to identify sensitive data across data sources (e.g., PII or PHI)
• Data Masking: The process of irreversibly transforming sensitive data into non-sensitive data
• Audit/Governance Reporting: A dashboard for Information Security Officers responsible for meeting regulatory requirements and data protection
Building such a feature set from scratch is a big lift for most organizations, and that’s before we begin considering the various masking functions that a diverse ecosystem will need.
Masked data may need to preserve referential integrity, remain human-readable, or satisfy other constraints to support distinct test scenarios.
Referential integrity is crucial for clients with several independent data stores that together perform a business function or exchange data with each other.
Hash functions are deterministic and meet the referential integrity requirement. However, they do not produce human-readable output, and hash collisions mean uniqueness is not strictly guaranteed.
Several different algorithms to mask data may be required depending on application requirements. These include:
• Hash functions: E.g., use a SHA1 hash
• Redaction: Truncate/substitute data in the field with random/arbitrary characters
• Substitution: With alternate “realistic” values (a common implementation samples real values to populate a lookup table)
• Tokenization: Substitution with a token that can be reversed—generally implemented by storing the original value along with the token in a secure location
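The four approaches above can be sketched in a few lines each. This is an illustrative sketch only; the sample name pool and token format are invented for the example:

```python
import hashlib
import random

def hash_mask(value: str) -> str:
    # Deterministic, so referential integrity holds across data stores,
    # but the output is not human-readable.
    return hashlib.sha1(value.encode("utf-8")).hexdigest()

def redact(value: str, keep: int = 0) -> str:
    # Truncate/substitute the field with arbitrary characters.
    return value[:keep] + "*" * max(len(value) - keep, 0)

# Hypothetical pool of "realistic" values sampled from real data.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dan"]

def substitute(value: str) -> str:
    # Deterministic substitution: hash the value into the pool, giving
    # readable output while keeping self-consistency.
    idx = int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16) % len(FIRST_NAMES)
    return FIRST_NAMES[idx]

# In production the vault would be a secured, access-controlled store.
_token_vault: dict = {}

def tokenize(value: str) -> str:
    # Reversible: the original is kept alongside the token, so only
    # parties with vault access can recover it.
    if value not in _token_vault:
        _token_vault[value] = "tok_%016x" % random.getrandbits(64)
    return _token_vault[value]
```

Note how only tokenization is reversible; the other three discard the original value outright.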
Data Masking & Public Cloud Providers
AWS has the following whitepapers and reference implementations:
• An AI-powered masking solution for Protected Health Information (PHI) that uses API Gateway and Lambda to retrieve and mask PHI in images stored on S3, returning masked text for data posted to API Gateway
• A design case study with Dataguise to identify and mask sensitive data in S3
• A customer success story of a PII masking tool built using EMR and DynamoDB
• An AWS whitepaper that describes using Glue to segregate PHI into a location with tighter security features
However, these solutions do not address masking in relational databases, nor do they integrate with the AWS relational database migration product, DMS.
Microsoft offers two versions of its SQL masking product on Azure:
• Dynamic Masking for SQL Server, which overwrites query results with masked/redacted data
• Static Masking for SQL Server, which modifies data to mask/redact it
For this discussion, we focus on what Microsoft calls “static masking” since “dynamic masking” leaves the unmasked data present on the DB, failing the requirement to shrink the attack surface as much as possible.
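The distinction can be made concrete with a small sketch using Python’s built-in sqlite3. This only illustrates the stored-vs-presented difference; Microsoft’s actual Static Data Masking operates on a copy of a SQL Server database, and the table and column names here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, ssn TEXT)")
conn.execute("INSERT INTO customers VALUES (1, '123-45-6789')")

# "Dynamic" masking: a view redacts query results, but the unmasked
# value still sits in the underlying table -- the attack surface remains.
conn.execute("""CREATE VIEW customers_masked AS
                SELECT id, 'XXX-XX-' || substr(ssn, -4) AS ssn
                FROM customers""")

# "Static" masking: overwrite the stored data itself, so no unmasked
# copy exists anywhere in the database.
conn.execute("UPDATE customers SET ssn = 'XXX-XX-' || substr(ssn, -4)")

masked = conn.execute("SELECT ssn FROM customers").fetchone()[0]
print(masked)  # the table itself now holds only the redacted value
```

After the static update, even a user with direct table access sees only the redacted value.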
We will also cover AWS technologies to explore cloud-native vs. vendor implementations.
Build Your Data Masking Solution with AWS DMS and Glue
AWS Database Migration Service (DMS) currently provides a mechanism to migrate data from one data source to another, either as a one-time migration or through continuous replication, as described in the diagram below (from the AWS documentation):
(Alt text: AWS_DMS_Migration)
DMS currently supports user-defined tasks that modify the Data Definition Language (DDL) during migration (e.g., dropping tables or columns). DMS also supports character-level substitutions on columns with string-type data. For data masking, you can build on the AWS ETL service, Glue, to fit into this framework, operating on field-level data rather than on DDL or individual characters.
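For the DDL side, a DMS table mapping can drop a sensitive column during migration. The sketch below builds such a mapping in the DMS table-mapping JSON format; the schema, table, and column names (“hr”, “employees”, “ssn”) are placeholders for this example, and field-level masking would still be handled downstream by a Glue job:

```python
import json

# Hypothetical DMS table mapping: migrate the "hr" schema but drop the
# sensitive "ssn" column at the DDL level during migration.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-hr",
            "object-locator": {"schema-name": "hr", "table-name": "%"},
            "rule-action": "include",
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "drop-ssn",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "hr",
                "table-name": "employees",
                "column-name": "ssn",
            },
            "rule-action": "remove-column",
        },
    ]
}

# This JSON string is what you would pass as the TableMappings argument
# of boto3's dms.create_replication_task(...) call.
mappings_json = json.dumps(table_mappings, indent=2)
print(mappings_json)
```

Dropping the column removes the data entirely; when test code needs the column to exist in masked form, the Glue transformation step applies instead.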
The below diagram shows an automated pipeline to provision and mask test datasets and environments using DMS, Glue, CodePipeline, and CloudFormation:
(Alt text: AWS_Data_Masking)
When using DMS and Glue, the replication/masking workload runs on AWS, not in the customer’s on-premises data center. As a result, unmasked and unredacted data exists briefly in AWS before the transformation.
This solution therefore does not address the security concerns of clients who remain cautious about placing sensitive data (and the accompanying compute workloads) on AWS.
For firms that see a cloud-native solution as the answer, the above can form the kernel of a workable solution, combined with additional work on identifying the data that needs masking and on reporting, dashboards, and auditing.
Data Masking Solution Vendors
If organizations are less concerned about cloud-native services, there are commercial data obfuscation tools offering masking services in various forms. These include IRI Field Shield, Oracle Data Masking, Okera Active Data Access Platform, IBM Infosphere Optim Data Privacy, Protegrity, Informatica, SQL Server Data Masking, CA Test Data Manager, Compuware Test Data Privacy, Imperva Data Masking, Dataguise, and Delphix.
Several of these vendors have partnerships with cloud service providers. The best data masking solution for the use case under consideration is the one offered by Delphix.
Data Masking with Delphix
This option leverages one of the commercial data masking providers to build AWS data masking capability.
Delphix offers a masking solution on the AWS Marketplace. One benefit of a vendor solution like Delphix is that it deploys easily both on-premises and in a public cloud. It allows customers to run masking workloads on-premises and ensure that no unmasked data is ever present in AWS.
By comparison, some AWS services, such as Storage Gateway, can run on-premises; however, Glue and CodeCommit/CloudFormation cannot.
Delphix is also appealing because of the integration between its masking solution and its “database virtualization” products.
Delphix virtualization lets users provision “virtual databases” by exposing a file system/storage to a database engine (e.g., Oracle), which contains a “virtual” copy of the files/objects that constitute the database.
It tracks changes at a file-system block level, thus offering a way to reduce the duplication of data across multiple virtual databases (by sharing common blocks). Delphix has also built a rich set of APIs to support CI/CD and self-provisioning databases.
Delphix’s virtualized databases offer several functions more commonly associated with modern version control systems such as Git: versioning, rollback, tagging, low-cost branch creation, and the ability to revert to any point along the version tree. These functions are unique in bringing source-control concepts to relational databases, vastly improving how CI/CD pipelines work with relational databases.
This combination allows users to deliver masked data efficiently to their public cloud environments.
A reference architecture for a chained Delphix implementation, utilizing both virtualization and masking, would look like this:
(Alt text: Delphix_Data Masking)
Conclusion
It is imperative to mask data in lower environments (dev, test).
Masking such data also makes migrating dev and test workloads to public clouds far more manageable and less risky.
Therefore, organizations should build an automated data masking pipeline to provision and mask data efficiently.
This pipeline should support data in various forms, including files and relational databases.
If your build/buy decision is tilting towards a purchase, data obfuscation tools can provide core masking and profiling functions.
Our experience has led us to choose Delphix.