Published in the March 2009 Privacy & Data Security Law Journal.
© 2009 Alex eSolutions, Inc.
Achieving Data Privacy Through Data Obfuscation
BENJAMIN GERBER, ADAM C. NELSON, AND STEVEN L. JONES
In this article the authors explain what data obfuscation is and provide some examples of when it should be employed.
Your organization or client handles personally identifiable information (“PII”) and other sensitive data from markets across the globe and your international workforce: application development and testing is performed in India and China; production support relies on expertise in Russia, Brazil, Argentina, and California; databases reside in Canada and Belgium; data analysis functions occur in Thailand, India, and Costa Rica; and it is all managed from Boston.
Through an understanding of business processes and data flows, and the application of advanced data obfuscation (including data masking, data de-identification, and data anonymization), your organization or client can achieve a great number of business goals and continue to perform current functions without using the actual PII and sensitive data of customers and employees, thus greatly reducing risk and liability.
Attorneys have an important role in mitigating their organization’s risks. With this introductory overview, we hope to provide the audience with the knowledge necessary to contribute to the reduction of unnecessary risk related to how their organization handles employees’ and customers’ data. Challenges such as customer contract requirements to protect confidential information across both production and non-production environments, financial industry regulatory mandates aimed at preventing fraud, international privacy compliance regulations (including cross-border data transfer considerations), or a corporate directive to proactively reduce the risk of inadvertent data disclosure may be met by effectively employing data obfuscation.
Techniques for data obfuscation include data masking, data de-identification, and data anonymization. In this article, we discuss what data obfuscation is and provide some examples of when it should be employed.
Data Masking
Data masking allows us to generate faux, yet representative, data for use in the full Systems Development Life Cycle (“SDLC”)—which includes application development, unit testing, systems testing, user acceptance testing, and performance testing—or for specific business intelligence purposes (e.g., statistical analyses, profitability analysis).
Consequently, data masking allows for maintaining:
- Representative data in volume; data quantity and size used in testing or data analysis matches what is found in production systems. This is particularly important for performance testing.
- Representative data in value distribution; when data is used for testing purposes, it need only have the same or similar values found in production data, without revealing or corresponding to individuals’ data. When data is used for analysis purposes, often it is not the values belonging to an individual record that are of interest; instead the data at the aggregate level may be analyzed using statistical techniques.
This allows for the maintaining of data utility while protecting against:
- Identity Disclosure, which occurs when an individual record can be tied to a particular entity; the identity of an individual can thus be inferred from the data.
- Value Disclosure, which occurs when the value of a confidential attribute for a particular entity (the value of one or more variables) can be inferred from the data.
There are many approaches to performing masking modifications on confidential data, including use of data perturbation,[1] data shuffling, randomization with range constraints, or micro aggregation. The right algorithm must be selected for the type of data being masked and the intended use of the masked data.
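Two of the approaches named above, data shuffling and randomization with range constraints, can be illustrated with a minimal Python sketch. The function names, the field names, and the sample records are hypothetical and chosen only for illustration; commercial masking tools use more sophisticated algorithms.

```python
import random

def shuffle_column(rows, column, seed=None):
    """Data shuffling: redistribute one column's values across the records.

    The set of values (and therefore the column's distribution) is unchanged,
    but a value is generally no longer attached to the record it came from.
    """
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

def randomize_within_range(value, max_shift, low, high, rng):
    """Randomization with range constraints: move a number by a bounded
    random amount, then clamp the result to the valid range [low, high]."""
    shifted = value + rng.uniform(-max_shift, max_shift)
    return min(high, max(low, shifted))

# Hypothetical sample data.
records = [
    {"id": 1, "salary": 52000},
    {"id": 2, "salary": 61000},
    {"id": 3, "salary": 75000},
    {"id": 4, "salary": 98000},
]

shuffled = shuffle_column(records, "salary", seed=7)
rng = random.Random(7)
perturbed = [
    {**row, "salary": round(randomize_within_range(row["salary"], 5000, 30000, 150000, rng))}
    for row in records
]
print(shuffled)
print(perturbed)
```

Shuffling alone can leave an individual value in its original record by chance, and a rare value (such as one very high salary) may still point to an individual, which is why the choice of algorithm depends on the data and its intended use.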
Data De-identification
Data de-identification is the removal of all, some, or portions of identifiers (e.g., name, address, Social Security number or Social Insurance number) from the data prior to use in testing or production environments or release to third parties. While this has been the predominant method of data sanitization or obfuscation, it is important to realize that de-identified data may be subject to re-identification by utilizing categorical (i.e., demographic characteristics) or numerical data. Because of the risks associated with data being re-identified and the relatively small additional overhead of applying masking over de-identification alone, data masking is often, depending on the intended use of the data, a superior option for non-production use or analysis of data at a non-aggregate level (i.e., analyzing individual records rather than sets of records).[2]
Note that “data masking” and “de-identification” may mean different things depending on the context. For example, often when we talk about “masking” in a payment card industry (“PCI”) or credit card context, we are simply referring to displaying or printing only the first six and last four digits, or the truncation of all but the last four digits of an account number. “Data masking” as it is discussed here is more advanced than “data de-identification.” Data de-identification alone leaves data subject to re-identification by utilizing categorical (i.e., demographic characteristics) or numerical data.
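The payment card sense of “masking” described above can be sketched in a few lines of Python. The function names are hypothetical, and the account number is a commonly used test value rather than a real account.

```python
def mask_account_number(account_number, keep_first=6, keep_last=4, fill="*"):
    """Show only the first six and last four digits of an account number."""
    digits = "".join(ch for ch in account_number if ch.isdigit())
    if len(digits) <= keep_first + keep_last:
        raise ValueError("account number is too short to mask")
    hidden = len(digits) - keep_first - keep_last
    return digits[:keep_first] + fill * hidden + digits[len(digits) - keep_last:]

def truncate_account_number(account_number, keep_last=4):
    """Truncate all but the last four digits of an account number."""
    digits = "".join(ch for ch in account_number if ch.isdigit())
    return digits[len(digits) - keep_last:]

print(mask_account_number("4111 1111 1111 1111"))      # 411111******1111
print(truncate_account_number("4111 1111 1111 1111"))  # 1111
```

This kind of display masking hides digits but does not produce representative substitute data, which is the distinction the paragraph above draws.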
Data Anonymization
Data anonymization allows for the maintaining of exact values of data or retaining precise data value distribution, and therefore allows the data to precisely represent production data because it is unaltered, yet it is anonymous because it is unreadable. At a high level, this is accomplished by applying one-way cryptographic hashing to data elements.[3] Data anonymization is utilized to perform a variety of analytical and business intelligence functions on data, including marketing data analysis, fraud detection, and consolidation of customer data.
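A minimal Python sketch of the hashing step follows. It uses a keyed hash (HMAC with SHA-256) rather than a bare hash; that choice is an assumption of this sketch, not something the article prescribes, and is made because values drawn from a small space, such as Social Security numbers, can be recovered from an unkeyed hash by hashing every possible input. The key and sample values are hypothetical.

```python
import hashlib
import hmac
from collections import Counter

def anonymize(value: str, secret_key: bytes) -> str:
    """Replace a value with a one-way, keyed SHA-256 token.

    Equal inputs always yield equal tokens, so counts, joins, and
    comparisons still work on the anonymized data.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"hypothetical-secret-key"   # assumption: held only by the data owner
cities = ["Boston", "Mumbai", "Boston", "Toronto"]
tokens = [anonymize(city, key) for city in cities]

# The frequency profile of the tokens matches that of the original values.
print(sorted(Counter(cities).values()))
print(sorted(Counter(tokens).values()))
```

Because the tokens preserve equality, analyses that only need to know whether two values match can run on the anonymized data without the readable values.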
Business Scenarios
Data obfuscation is most often utilized for generating data for development and test environments, as well as many data analysis functions. Data obfuscation is usually applied in batch; a masked copy of a database or databases is created for later use.
Data for Development and Test Environments
The risk of using real production data in test or development environments is completely unnecessary. Even if production controls were mirrored or tight security were in place that addresses access to systems and data stores, the nature of test and development allows for far greater access to data in volume than when data is maintained in a production environment. Also, data in these environments is rarely regarded and treated with the same processes and care adopted for production environments.
De-identifying or applying simple substitution to the data does not create data sufficiently representative of production data, which is required for full SDLC testing. At the same time, data not sufficiently masked builds a false sense of security, as the data may be subject to re-identification and thus inappropriate disclosure.
The purpose of data masking for generation of data used in test and development environments is to allow production quality test data to be used in application testing without compromising the privacy of the individuals whose personal information is contained in the production data records. Application testing should not be affected by data masking; the basic way tests are defined, conducted, and evaluated should not be affected by use of masked data. In other words, if the same tests were conducted on masked data and unmasked data, they should identify the same application or systems issues. Applying data masking techniques prior to the data being migrated to test and development environments mitigates a great deal of unnecessary risk.
Data for Business Intelligence Functions
Often analyses of data for business intelligence, such as purchasing trends, services usage patterns, and customer satisfaction results, can be performed at an aggregate level or without a need to know precisely which individual is attributed to which values. With an understanding of the analyses intended to be performed, appropriate data masking techniques can be selected to produce results extremely similar (within an acceptable tolerance percentage) to those that the same analysis operations would produce for production data. Therefore, mitigating the risks of allowing analysts, particularly those located overseas, to handle large amounts of production data becomes practical through data masking. For functions requiring a high degree of precision and the comparison of individual data elements, data anonymization may be a viable option.
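Micro aggregation, one of the masking techniques named earlier, shows how aggregate results can survive masking. The Python sketch below is a simplified, single-variable version with a hypothetical function name and sample values: it sorts the values, forms groups of at least k, and replaces each value with its group mean, so the overall mean is unchanged while no individual value is released.

```python
from statistics import mean

def microaggregate(values, k=3):
    """Replace each value with the mean of a group of at least k similar values."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a short final group into the previous one so every group has >= k members.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    masked = [0.0] * len(values)
    for group in groups:
        group_mean = mean(values[i] for i in group)
        for i in group:
            masked[i] = group_mean
    return masked

purchases = [30, 10, 20, 40, 50, 60, 70]   # hypothetical purchase amounts
masked = microaggregate(purchases, k=3)
print(masked)
print(mean(purchases), mean(masked))
```

Other statistics, such as the variance, do shift under this technique, which is why the intended analyses need to be understood before a technique is selected.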
Data for Sharing with Third Parties
Consider how often more data than is necessary to perform a specific business function is transferred to third parties. No more data than is absolutely necessary to perform the function that a third party is engaged to perform should be provided to a third party. This can be accomplished by de-identifying data records prior to transfer.
For example, for the completion of salary surveys, the only necessary information may be job function, salary, and broad geographic location of employees. Names, serial numbers, and other unique identifiers are unnecessary—and specific demographic information such as zip code may also be beyond what is required for functional results. As long as the risk of re-identification is understood, de-identification is a very valuable tool.
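The salary survey example can be sketched in Python as a projection that passes along only the fields the third party needs. The field names and the sample record are hypothetical.

```python
# Fields the survey provider needs, per the example above.
SURVEY_FIELDS = ("job_function", "salary", "region")

def deidentify_for_survey(employee):
    """Keep only the survey fields; names, serial numbers, and zip codes are dropped."""
    return {field: employee[field] for field in SURVEY_FIELDS}

employee = {
    "name": "Pat Example",
    "serial_number": "E-00042",
    "zip_code": "02110",
    "region": "Northeast US",
    "job_function": "Systems Analyst",
    "salary": 84000,
}

print(deidentify_for_survey(employee))
```

Listing the fields to keep, rather than the fields to remove, means that a column added to the source later is withheld by default.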
Project Approach
The following high level outline of a data obfuscation project approach aids in gaining an understanding of the various considerations that go into a successful data obfuscation implementation.
Working with legal, the data business owners, database administrators, data architects, application development and testing leads, information security, and the privacy organization:
- Develop a strategy and determine requirements
  - Identify scope of applications and/or databases
  - Identify intended uses of obfuscated data, e.g.,
    - Unit testing
    - User acceptance testing
    - Statistical analyses for business intelligence
  - Identify data sensitivity/classification levels (option: collect/create data definitions and classify data, if this is not already done)
  - Identify (high level) relations amongst data sets (detailed relationships amongst data elements can be addressed in data mapping, detailed design, or configuration of the prototype)
  - Map data flows and associated business processes
- Select candidate data sets/elements (schemas, fields, tables) for obfuscation
- Select obfuscating techniques, per
  - Intended uses of data, e.g.,
    - Unit testing
    - User acceptance testing
    - Statistical analyses for business intelligence
  - Data sensitivity/classification level
  - Relations amongst data elements (i.e., referential integrity)
  - De-identification
  - Data masking, e.g.,
    - Data shuffling
    - Micro aggregation
    - Data swapping
    - Lookup values (lookup tables)
    - Random number generation
    - Randomization with range constraints
    - Hard-coded literals
    - Special registers (e.g., date, time)
    - Substring and concatenation of values
    - Sequencing numeric fields (or parts of concatenated fields)
    - Date manipulations
  - Data anonymization
- Technology/product(s)/tool(s) evaluation and selection (see “Product Selection” below)
  - Create structured evaluation criteria for technology/product(s) evaluation based on requirements
  - Select options that meet requirements
  - Compare options for advantages and disadvantages
  - Derive high level architecture options
- Design processes and technical architecture (physical and logical)
  - Document intended steps for execution of data obfuscation operations
  - Staging systems
  - Database instances, e.g.,
    - Staging
    - Test
    - Quality assurance
    - Business intelligence
- Implement technical architecture (option: prototyping, see “Prototyping” below)
  - Install products/tools
  - Configure products/tools
- Perform data obfuscation (i.e., de-identification, masking, or anonymization)
- Develop and execute validation procedures
  - Verify protection against identity disclosure and value disclosure
  - Verify data utility is maintained
- Test and document repeatable processes
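The validation step in the outline above can be partly automated. The Python sketch below, with hypothetical function and field names, shows two simple checks: whether any original identifier value survives in the masked copy, and whether an aggregate statistic stays within a chosen tolerance. It is a starting point under those assumptions, not a complete disclosure-risk assessment.

```python
from statistics import mean

def exposed_identifiers(original, masked, fields):
    """Return (field, value) pairs from the original data that still appear
    in the same field of the masked copy."""
    leaks = set()
    for field in fields:
        originals = {row[field] for row in original}
        leaks.update((field, row[field]) for row in masked if row[field] in originals)
    return leaks

def utility_preserved(original_values, masked_values, tolerance=0.05):
    """Check that the masked mean is within a relative tolerance of the original mean."""
    baseline = mean(original_values)
    return abs(mean(masked_values) - baseline) <= tolerance * abs(baseline)

# Hypothetical sample data.
original = [{"name": "Ann", "ssn": "111", "salary": 100},
            {"name": "Bob", "ssn": "222", "salary": 300}]
masked = [{"name": "Xed", "ssn": "111", "salary": 110},
          {"name": "Yul", "ssn": "999", "salary": 295}]

print(exposed_identifiers(original, masked, ["name", "ssn"]))
print(utility_preserved([r["salary"] for r in original],
                        [r["salary"] for r in masked]))
```

Checks of this kind are easy to rerun after every masking job, which supports the repeatable processes called for in the last step of the outline.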
Prototyping
Consider taking a prototyping approach with the initial implementation of a data obfuscation technology, product, or tool. A viable prototype is well beyond a proof of concept; the intent is to design and implement a prototype to serve as a robust functional base to build out subsequent data obfuscation implementations.
Product Selection
Different solutions will have specific characteristics that could offer advantages in meeting the organization’s goals; each should be evaluated to determine which solution is the best fit for the organization’s requirements and business objectives both short and long term.
Data Masking Products
While advanced data masking can be implemented using software developed in-house or by custom development, the benefits of the powerful software packages that perform data masking available today, such as IBM Optim (formerly Princeton Softech Optim)[4] and Camouflage Software’s Camouflage,[5] usually outweigh undergoing custom development efforts to implement advanced data masking.
Important features to look for in a data masking product include:
- Supported compatibility with your organization’s databases (e.g., DB2, Oracle) and platforms (e.g., mainframe, Unix, Windows).
- Intelligent contextual masking and multiattribute contextual masking to ensure valid values are used.
- Key propagation (including propagation of masked primary keys to dependent foreign keys) is a crucial feature; additionally the capability to consistently propagate masked key values across multiple databases within the enterprise may be desired.
- Consistent masking or “replay” features allow for masking a column the same way each time a masking routine is performed from the same database and across multiple databases; this removes uncertainty across multiple test and development databases when consistency is required.
- Built-in algorithms applicable to your organization’s needs will speed implementation; the flexibility to add additional algorithms and execute exit routines to apply complex algorithms may also be desirable.
- Predefined mapping of data tables used by popular applications your organization employs is highly desirable (e.g., ERP, CRM, SCM applications).
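Key propagation and consistent (“replay”) masking, two of the features listed above, both rest on masking a given key value the same way wherever it appears. The Python sketch below illustrates the idea with a keyed hash; the function names, table layout, and key are hypothetical, and this is not how any particular product implements the feature. A keyed hash reduced to nine digits can, rarely, map two keys to the same value, so a real tool would also need to guarantee uniqueness.

```python
import hashlib
import hmac

def mask_key(value, secret_key, digits=9):
    """Deterministically map a key value to a masked numeric key."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % 10 ** digits

def mask_tables(customers, orders, secret_key):
    """Mask the primary key and propagate the same masked value to the foreign key."""
    masked_customers = [{**c, "customer_id": mask_key(c["customer_id"], secret_key)}
                        for c in customers]
    masked_orders = [{**o, "customer_id": mask_key(o["customer_id"], secret_key)}
                     for o in orders]
    return masked_customers, masked_orders

# Hypothetical sample tables.
customers = [{"customer_id": 1, "segment": "retail"},
             {"customer_id": 2, "segment": "business"}]
orders = [{"order_id": 10, "customer_id": 1},
          {"order_id": 11, "customer_id": 1},
          {"order_id": 12, "customer_id": 2}]

key = b"hypothetical-masking-key"
masked_customers, masked_orders = mask_tables(customers, orders, key)
print(masked_customers)
print(masked_orders)
```

Running the routine again with the same key reproduces the same masked keys, so separate test databases masked at different times still line up.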
Some database management systems (“DBMS”) have built-in capabilities or add-on packages available that generate test data or provide batch de-identification functionality. While some of this functionality is becoming more sophisticated, often these features are used for less complex de-identification rather than advanced masking operations.
Data De-Identification Products
When de-identification is used in combination with or in the same environment as data masking, often the same tools are leveraged for both obfuscation techniques. However, de-identification is a simpler function and has been performed by custom code and/or built-in or add-on DBMS functionality for decades.
De-identification is not limited to a batch operation; it is also performed live on production data. In much existing code this is often done at the “view” or interface level. However, given the way data is used today, accessible beyond a single defined application interface, it is often necessary to apply protections as close to the data as possible. This may dictate a change in how such real-time or live de-identification is performed, such as moving to database level features or to a centralized interface layer or bus through which all data access and interfaces flow. (See “Real-time Data Obfuscation” herein.)
Data Anonymization Products
While many products claim data anonymization capabilities, they are often using this term to advertise their de-identification solutions. At a high level, data anonymization is accomplished by applying one-way cryptographic hashing to data elements. Custom built applications of this sort have existed to address various information technology needs for years. However, when considering a data anonymization solution for business-focused operations, it is important to evaluate advanced features, such as identity disambiguation, that allow for more accurate results by cleaning up and/or organizing data before it is actually anonymized. Also consider the ease with which multiple parties can compare data sets, as anonymously comparing data sets across groups within and beyond organizations is a common use for this obfuscation technique (such as determining shared customers among organizations considering a merger). IBM Entity Analytic Solutions offers IBM Anonymous Resolution and a family of products that provide identity disambiguation features.[6]
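The shared-customer comparison mentioned above can be sketched in Python. Each party cleans up its own names, hashes them with a key both parties have agreed on, and only the resulting tokens are compared. The simple normalization here stands in, very loosely, for identity disambiguation; the function names, the key exchange, and the sample names are assumptions of this sketch and do not describe how any product works.

```python
import hashlib
import hmac

def normalize(name):
    """Crude clean-up before hashing: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def anonymized_tokens(names, shared_key):
    """Return the set of one-way tokens for a list of customer names."""
    return {
        hmac.new(shared_key, normalize(name).encode("utf-8"), hashlib.sha256).hexdigest()
        for name in names
    }

shared_key = b"hypothetical-key-agreed-by-both-parties"
company_a = ["Pat Example", "Lee  Sample", "Kim Placeholder"]
company_b = ["lee sample", "Jo Nobody", "PAT EXAMPLE"]

# Only tokens are exchanged; neither side sees the other's customer names.
overlap = anonymized_tokens(company_a, shared_key) & anonymized_tokens(company_b, shared_key)
print(len(overlap))  # 2
```

Without the clean-up step, "Lee  Sample" and "lee sample" would hash to different tokens and the overlap would be undercounted, which is why preparing data before it is anonymized matters.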
Real-time Data Obfuscation
Your organization may have a need for the obfuscation of data in real-time as well as in batch. Real-time data obfuscation will permit your organization to enforce its policies regarding the transfer and access of sensitive data while the data is utilized for production business functions. Consider the following business scenario.
Real-time Business Scenario
An application for making decisions about granting loans or proposing financial products is utilized by analysts. Specific information for a customer may be queried by account number and is then populated on the user’s screen.
If the analyst is located in the Boston office, the screen may contain all the details of a given account—including the account holder’s name and Social Security number. If the analyst is located in Mumbai, due to requirements driven by privacy regulations and associated security risks, the screen may contain only the information required for the particular type of analysis being performed—therefore the account holder’s name and Social Security number will be blanked out or replaced by a different unique identifier. That is, certain data elements may be de-identified or anonymized based on the authorization and/or location of a given user of the application and possibly the type of transaction they are undertaking.
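The policy in this scenario can be expressed as a small Python sketch. The field names, the location list, and the sample account are hypothetical; a production rule set would likely also consider the user's authorization and the type of transaction.

```python
SENSITIVE_FIELDS = {"name", "ssn"}
FULL_DETAIL_LOCATIONS = {"Boston"}

def account_view(account, user_location):
    """Return the account as it should appear on screen for a user's location."""
    if user_location in FULL_DETAIL_LOCATIONS:
        return dict(account)
    # Elsewhere, blank out the identifying fields and keep the rest.
    return {field: ("" if field in SENSITIVE_FIELDS else value)
            for field, value in account.items()}

account = {"account_number": "1001", "name": "Pat Example",
           "ssn": "000-00-0000", "balance": 2500.0}

print(account_view(account, "Boston"))
print(account_view(account, "Mumbai"))
```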
Technical Solution for Real-time Data Masking: Role-Based Access to the Data Sources
A classic method of achieving real-time data masking is to utilize existing functionality in the database management software along with a form of role-based access control. Modern database management software, including recent versions of legacy database management software, provides functionality for restricting the visibility of data elements based on the specific authorizations granted to the user authenticated to the database. The simplest example of this is to provide different database “views” for different levels of users. These “views” can be constructed in such a manner as to obscure the private data. Even database administrators, who traditionally have unlimited access to data, can be restricted by utilizing field level cryptographic functionality.
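The “views” approach can be illustrated with Python's built-in sqlite3 module. SQLite has no user roles or grants, so the role-to-view mapping is handled in application code here purely for illustration; in an enterprise DBMS the equivalent restriction would be granted to roles inside the database. The table, view, and role names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        account_number TEXT PRIMARY KEY,
        holder_name    TEXT,
        ssn            TEXT,
        balance        REAL
    );
    INSERT INTO accounts VALUES ('1001', 'Pat Example', '000-00-0000', 2500.0);

    -- The restricted view simply omits the private columns.
    CREATE VIEW accounts_restricted AS
        SELECT account_number, balance FROM accounts;
""")

# Each role reads from its own table or view.
VIEW_FOR_ROLE = {
    "full_analyst": "accounts",
    "restricted_analyst": "accounts_restricted",
}

def fetch_account(role, account_number):
    """Look up an account through the table or view assigned to the role."""
    source = VIEW_FOR_ROLE[role]  # fixed whitelist, never user-supplied text
    cursor = conn.execute(
        f"SELECT * FROM {source} WHERE account_number = ?", (account_number,)
    )
    columns = [description[0] for description in cursor.description]
    row = cursor.fetchone()
    return dict(zip(columns, row)) if row else None

print(fetch_account("full_analyst", "1001"))
print(fetch_account("restricted_analyst", "1001"))
```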
While some applications that interface with databases allow for individual users to be directly authenticated to the required database, most applications are built such that users authenticate either to a shared infrastructure component or to the individual applications, which then connect and authenticate to the database as a common shared user. By grouping required data access rules into a role definition and assigning this role to individually authenticated users, data privacy functionality can be incorporated into the required applications.
For applications that authenticate to the database as common users, different roles with unique levels of authorization may still be utilized. This is often achieved by establishing separate instances of an application with each instance configured to authenticate to the database with the appropriate common user for the role.
While implementing role-based data access controls can be an excellent solution for security and privacy requirements, the conversion process may require significant modifications to both the database environment and the associated applications. This is in contrast with the appliance-based solution discussed below in which the additional infrastructure is completely external to the existing applications and databases.
Technical Solution for Real-time Data Masking: Service Oriented Architecture Appliance
This solution utilizes external components in order to avoid the modification of existing applications. A service oriented architecture (“SOA”) appliance (or other appliance that operates on transactions at the data level) can be used to de-identify and anonymize data elements based on the context of transactions, such as authorization and/or location of a user while a business application is used, and possibly the type of transaction they are undertaking at a given moment.
The appliance resides between the application servers and the clients, acting as a centralized interface layer or bus through which all data access and interfaces flow. The appliance virtualizes an application or database service—rendering its presence invisible to the client; the appliance performs advanced transformation on data flowing to and from the client at wire-line speeds. These transformation operations encompass data de-identification and when applicable, anonymization.
Appliance-based advanced data manipulation of this sort is still emerging. We began work on such appliance-based data obfuscation scenarios for executing various data obfuscation functions based on context-sensitive access requirements in early 2007, basing our solutions on the capabilities of the IBM WebSphere DataPower SOA Appliance XI50.[7] There are several SOA and data-centric appliances now on the market.
Conclusion
Your organization’s risk profile is markedly reduced when not unnecessarily using sensitive data to execute business activities. If obfuscated data is compromised, it may prove valueless to unauthorized parties. Also, when utilizing obfuscated data, the handling, processing, storage, and transfer of this data may no longer be subject to requirements around the transfer of PII across international borders. If you are not using real PII, then these regulations do not apply. Due to its clear benefits, we recommend that data obfuscation become a core business practice for your organization.
At the time of writing, Benjamin Gerber, CISSP, CISA, CPP, CIPP/G, was a Senior Managing Consultant and the Privacy Services Competency Co-Lead with the Security and Privacy Practice at IBM.
He is now a Principal in the Privacy Strategy group at The MITRE Corporation.
He can be reached at privacy.us/contact.
Adam C. Nelson, Esq., CIPP/IT, a member of the Board of Editors of the Privacy & Data Security Law Journal, is a Senior Managing Consultant in the Security and Privacy Practice at IBM and the Privacy Services Competency Lead.
At the time of writing, Steven L. Jones was an Executive Consultant with the Security and Privacy Practice at IBM.
He is now an independent security consultant.
Notes
[1] There are various applied data perturbation methods based on the mathematical methods of perturbation theory, which allows us to approximate solutions to one problem based on a known solution to a similar problem. In data masking, we are deriving a distinctly different data set from the original data set, based on the (known) original data set.
[2] While search-engine data differs from employee or customer data, the most widely known incident of re-identification of de-identified data is the 2006 public release of AOL users’ search terms. See “101 Dumbest Moments in Business,” #57, http://money.cnn.com/galleries/2007/biz2/0701/gallery.101dumbest_2007/57.html. The risk of damage occurring as a result of re-identification is greater when additional sources can be used to infer additional information about individuals.
[3] An example of a one-way cryptographic hash algorithm is SHA (Secure Hash Algorithm), or the older MD5 (Message-Digest algorithm 5).
[4] IBM Optim, http://www.optimsolution.com.
[5] Camouflage Software, http://www.datamasking.com.
[6] IBM Entity Analytic Solutions; http://www.ibm.com/software/data/ips/products/masterdata/eas/. Related software that enhances utility:
- IBM Anonymous Resolution enables multiple organizations to selectively share data and leverage proprietary data in a manner that never exposes sensitive information, while still identifying relationships and developing leads; http://www.ibm.com/software/data/db2/eas/anonymous/.
- IBM Identity Resolution determines when two or more different looking identity packages are describing the same person; http://www.ibm.com/software/data/db2/eas/identity/.
- IBM Global Name Recognition products recognize multiple cultural variations of name data; http://www.ibm.com/software/data/globalname/.
- IBM Relationship Resolution identifies relationships between entities; http://www.ibm.com/software/data/db2/eas/relationship/.
[7] IBM WebSphere DataPower SOA Appliance; http://www.ibm.com/software/integration/datapower/.