Published in the March 2009 Privacy & Data Security Law Journal.
© 2009 Alex eSolutions, Inc.

Achieving Data Privacy Through Data Obfuscation

BENJAMIN GERBER, ADAM C. NELSON, AND STEVEN L. JONES

In this article the authors explain what data obfuscation is and provide some examples of when it should be employed.

Your organization or client handles personally identifiable information (“PII”) and other sensitive data from markets across the globe and your international workforce—application development and testing is performed in India and China… production support relies on expertise in Russia, Brazil, Argentina, and California… databases reside in Canada and Belgium… data analysis functions occur in Thailand, India, and Costa Rica… and it is all managed from Boston…

Through an understanding of business processes, data flows, and the application of advanced data obfuscation (including data masking, data de-identification, and data anonymization), your organization or client can achieve a great number of business goals and continue to perform current functions without using the actual PII and sensitive data of customers and employees, thus greatly reducing risk and liability.

Attorneys have an important role in mitigating their organization’s risks. With this introductory overview, we hope to provide the audience with the knowledge necessary to contribute to the reduction of unnecessary risk related to how their organization handles employees’ and customers’ data. Challenges such as customer contract requirements to protect confidential information across both production and non-production environments, financial industry regulatory mandates aimed at preventing fraud, international privacy compliance regulations (including cross-border data transfer considerations), or a corporate directive to proactively reduce the risk of inadvertent data disclosure may be met by effectively employing data obfuscation.

Techniques for data obfuscation include data masking, data de-identification, and data anonymization. In this article, we discuss what data obfuscation is and provide some examples of when it should be employed.

Data Masking

Data masking allows us to generate faux, yet representative, data for use in the full Systems Development Life Cycle (“SDLC”)—which includes application development, unit testing, systems testing, user acceptance testing, and performance testing—or for specific business intelligence purposes (e.g., statistical analyses, profitability analysis).

Consequently, data masking allows for maintaining:

This allows for the maintaining of data utility while protecting against:

There are many approaches to performing masking modifications on confidential data, including use of data perturbation,1 data shuffling, randomization with range constraints, or micro aggregation. The right algorithm must be selected for the type of data being masked and the intended use of the masked data.
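As a minimal sketch of two of the approaches named above, the following Python fragment shows data shuffling and randomization with range constraints applied to a salary field. The record layout, the 5 percent noise bound, and the salary limits are assumptions chosen for illustration; they are not taken from any particular masking product.

```python
import random

def shuffle_column(records, field, rng):
    """Data shuffling: redistribute the existing values of one field across records."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]

def perturb_with_range(value, max_fraction, lower, upper, rng):
    """Randomization with a range constraint: add bounded noise, then clamp."""
    noise = value * rng.uniform(-max_fraction, max_fraction)
    return min(upper, max(lower, round(value + noise)))

# A fixed seed is used only so the sketch is repeatable; a real masking job
# would not use a predictable seed.
rng = random.Random(2009)

employees = [
    {"id": 1, "salary": 52000},
    {"id": 2, "salary": 61000},
    {"id": 3, "salary": 75000},
    {"id": 4, "salary": 98000},
]

# Shuffling keeps the exact set of salaries but breaks the link to each id.
shuffled = shuffle_column(employees, "salary", rng)

# Perturbation changes each salary by at most 5 percent, within fixed limits.
perturbed = [
    {**r, "salary": perturb_with_range(r["salary"], 0.05, 30000, 120000, rng)}
    for r in employees
]
```

Shuffling preserves the column's distribution exactly, while perturbation preserves each record's approximate magnitude; which trade-off is acceptable depends on the intended use of the masked data.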

Data De-identification

Data de-identification is the removal of all, some, or portions of identifiers (e.g., name, address, Social Security number or Social Insurance number) from the data prior to use in testing or production environments or release to third parties. While this has been the predominant method of data sanitization or obfuscation, it is important to realize that de-identified data may be subject to re-identification by utilizing categorical (i.e., demographic characteristics) or numerical data. Because of the risk of re-identification and the relatively small additional overhead of applying masking rather than de-identification alone, data masking is often the superior option for non-production use or for analysis of data at a non-aggregate level (i.e., analyzing individual records rather than sets of records), depending on the intended use of the data.2

Note that “data masking” and “de-identification” may mean different things depending on the context. For example, often when we talk about “masking” in a payment card industry (“PCI”) or credit card context, we are simply referring to displaying or printing only the first six and last four digits, or the truncation of all but the last four digits of an account number. “Data masking” as it is discussed here is more advanced than “data de-identification.” Data de-identification alone leaves data subject to re-identification by utilizing categorical (i.e., demographic characteristics) or numerical data.
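The payment card sense of “masking” described above can be sketched in a few lines of Python. The function names and the choice of “*” as the masking character are assumptions made for illustration.

```python
def mask_pan(pan, keep_first=6, keep_last=4, mask_char="*"):
    """Show only the leading and trailing digits of an account number."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    digits = "".join(ch for ch in pan if ch.isdigit())
    if len(digits) <= keep_first + keep_last:
        raise ValueError("account number too short to mask")
    hidden = len(digits) - keep_first - keep_last
    return digits[:keep_first] + mask_char * hidden + digits[-keep_last:]

def truncate_pan(pan, keep_last=4, mask_char="*"):
    """Hide everything except the last few digits."""
    return mask_pan(pan, keep_first=0, keep_last=keep_last, mask_char=mask_char)

display_value = mask_pan("4111 1111 1111 1111")    # first six and last four
receipt_value = truncate_pan("4111 1111 1111 1111")  # last four only
```

Note that this kind of display masking only hides digits; it does not generate representative replacement data in the way that the data masking discussed in this article does.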

Data Anonymization

Data anonymization allows for the maintaining of exact values of data or retaining precise data value distribution, and therefore allows the data to precisely represent production data because it is unaltered, yet it is anonymous because it is unreadable. At a high level, this is accomplished by applying one-way cryptographic hashing to data elements.3 Data anonymization is utilized to perform a variety of analytical and business intelligence functions on data, including marketing data analysis, fraud detection, and consolidation of customer data.
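A minimal Python sketch of the hashing step follows. It uses a keyed hash (HMAC with SHA-256) rather than a bare hash, because an unkeyed hash of a short, predictable value such as a nine-digit number can be reversed by exhaustive guessing. The keyed variant, the normalization rule, and the key value are our assumptions for illustration, not a description of any specific product.

```python
import hashlib
import hmac

def anonymize(value, secret_key):
    """Return a one-way token for a data element.

    The value is normalized first (case and spacing) so that equivalent
    inputs produce the same token and remain comparable after hashing.
    """
    normalized = " ".join(value.strip().lower().split())
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key; in practice it would be generated and stored securely.
key = b"example-key-for-illustration-only"

token_a = anonymize("Jane  Doe ", key)
token_b = anonymize("jane doe", key)
token_c = anonymize("John Doe", key)
```

Here token_a and token_b are identical, so records can still be matched and counted, while the token itself cannot be read back as a name.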

Business Scenarios

Data obfuscation is most often utilized for generating data for development and test environments, as well as many data analysis functions. Data obfuscation is usually applied in batch; a masked copy of a database or databases is created for later use.

Data for Development and Test Environments

The risk of using real production data in test or development environments is completely unnecessary. Even if production controls were mirrored or tight security were in place that addresses access to systems and data stores, the nature of test and development allows for far greater access to data in volume than when data is maintained in a production environment. Also, data in test and development environments is typically not regarded or treated with the same processes and attitudes adopted for production environments.

De-identifying or applying simple substitution to the data does not create data sufficiently representative of production data, which is required for full SDLC testing. At the same time, data not sufficiently masked builds a false sense of security, as the data may be subject to re-identification and thus inappropriate disclosure.

The purpose of data masking for generation of data used in test and development environments is to allow production quality test data to be used in application testing without compromising the privacy of the individuals whose personal information is contained in the production data records. Application testing should not be affected by data masking; the basic way tests are defined, conducted, and evaluated should not be affected by use of masked data. In other words, if the same tests were conducted on masked data and unmasked data, they should identify the same application or systems issues. Applying data masking techniques prior to the data being migrated to test and development environments mitigates a great deal of unnecessary risk.

Data for Business Intelligence Functions

Often analyses of data for business intelligence, such as purchasing trends, services usage patterns, and customer satisfaction results, can be performed at an aggregate level or without a need to know precisely which individual is attributed to which values. With an understanding of the analyses intended to be performed, appropriate data masking techniques can be selected to produce results extremely similar (within an acceptable tolerance percentage) to those that the same analysis operations would produce for production data. Therefore, mitigating the risks of allowing analysts, particularly those located overseas, to handle large amounts of production data becomes practical through data masking. For functions requiring a high degree of precision and the comparison of individual data elements, data anonymization may be a viable option.
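To illustrate the idea of an acceptable tolerance, the following Python sketch applies a simple form of micro aggregation (one of the masking approaches mentioned earlier) and then compares an aggregate statistic before and after masking. The group size, the sample values, and the 1 percent tolerance are assumptions chosen for illustration.

```python
from statistics import mean

def microaggregate(values, k=3):
    """Replace each value with the mean of its group of k nearest-ranked values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a short trailing group into the previous one so that no group
    # contains fewer than k records.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    masked = [0.0] * len(values)
    for group in groups:
        group_mean = mean(values[i] for i in group)
        for i in group:
            masked[i] = group_mean
    return masked

purchases = [120, 80, 95, 300, 110, 105, 290]
masked = microaggregate(purchases, k=3)

# Relative difference between the analysis result on masked and original data.
relative_error = abs(mean(masked) - mean(purchases)) / mean(purchases)
within_tolerance = relative_error <= 0.01
```

Because each value is replaced by its group mean, the overall mean is preserved while no individual purchase amount is exposed; other statistics, such as variance, would be affected and would need their own tolerance check.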

Data for Sharing with Third Parties

Consider how often more data than is necessary to perform a specific business function is transferred to third parties. No more data than is absolutely necessary to perform the function that a third party is engaged to perform should be provided to a third party. This can be accomplished by de-identifying data records prior to transfer.

For example, for the completion of salary surveys, the only necessary information may be job function, salary, and broad geographic location of employees. Names, serial numbers, and other unique identifiers are unnecessary—and specific demographic information such as zip code may also be beyond what is required for functional results. As long as the risk of re-identification is understood, de-identification is a very valuable tool.
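The salary survey example can be sketched in Python as follows. The field names and sample records are hypothetical. The second function is a simple check we add to make the re-identification risk visible: it lists combinations of the remaining attributes that occur only once in the data.

```python
from collections import Counter

# Only the fields the third party needs for the survey.
SURVEY_FIELDS = ("job_function", "salary", "region")

def deidentify_for_survey(records):
    """Keep only the fields required for the survey; drop everything else."""
    return [{field: r[field] for field in SURVEY_FIELDS} for r in records]

def singled_out(records, quasi_identifiers=("job_function", "region")):
    """Return attribute combinations that appear exactly once."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [combo for combo, n in counts.items() if n == 1]

employees = [
    {"name": "A. Example", "serial": "E100", "zip": "02110",
     "job_function": "Engineer", "salary": 91000, "region": "Northeast"},
    {"name": "B. Example", "serial": "E101", "zip": "02139",
     "job_function": "Engineer", "salary": 88000, "region": "Northeast"},
    {"name": "C. Example", "serial": "E102", "zip": "94105",
     "job_function": "Actuary", "salary": 120000, "region": "West"},
]

survey = deidentify_for_survey(employees)
at_risk = singled_out(survey)
```

In this sample the only actuary in the West region is still distinguishable after names and serial numbers are removed, which is the kind of residual risk that should be understood before the data is transferred.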

Project Approach

The following high level outline of a data obfuscation project approach aids in gaining an understanding of the various considerations that go into a successful data obfuscation implementation.

Working with legal, the data business owners, database administrators, data architects, application development and testing leads, information security, and the privacy organization:

Prototyping

Consider taking a prototyping approach with the initial implementation of a data obfuscation technology, product, or tool. A viable prototype is well beyond a proof of concept; the intent is to design and implement a prototype to serve as a robust functional base to build out subsequent data obfuscation implementations.

Product Selection

Different solutions will have specific characteristics that could offer advantages in meeting the organization’s goals; each should be evaluated to determine which solution is the best fit for the organization’s requirements and business objectives both short and long term.

Data Masking Products

While advanced data masking can be implemented using software developed in-house or by custom development, the benefits of the powerful software packages that perform data masking available today, such as IBM Optim (formerly Princeton Softech Optim)4 and Camouflage Software’s Camouflage,5 usually outweigh undergoing custom development efforts to implement advanced data masking.

Important features to look for in a data masking product include:

Some database management systems (“DBMS”) have built-in capabilities or add-on packages available that generate test data or provide batch de-identification functionality. While some of this functionality is becoming more sophisticated, often these features are used for less complex de-identification rather than advanced masking operations.

Data De-Identification Products

When de-identification is used in combination with or in the same environment as data masking, often the same tools are leveraged for both obfuscation techniques. However, de-identification is a simpler function and has been performed by custom code and/or built-in or add-on DBMS functionality for decades.

De-identification is not limited to a batch operation; it is also performed live on production data. In much existing code this is done at the “view” or interface level. However, because data today is often accessible beyond a single defined application interface, it is often necessary to apply protections as close to the data as possible. This may dictate a change in how such real-time or live de-identification is performed, such as moving to database-level features or to a centralized interface layer or bus through which all data access and interfaces flow. (See “Real-time Data Obfuscation” herein.)

Data Anonymization Products

While many products claim data anonymization capabilities, they are often using this term to advertise their de-identification solutions. At a high level, data anonymization is accomplished by applying one-way cryptographic hashing to data elements. Custom built applications of this sort have existed to address various information technology needs for years. However, when considering a data anonymization solution for business-focused operations, it is important to evaluate advanced features, such as identity disambiguation, that allow for more accurate results by cleaning up and/or organizing data before it is actually anonymized. Also consider the ease with which multiple parties can compare data sets, as anonymously comparing data sets across groups within and beyond organizations is a common use for this obfuscation technique (such as determining shared customers among organizations considering a merger). IBM Entity Analytic Solutions offers IBM Anonymous Resolution and a family of products that provide identity disambiguation features.6
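The shared-customer comparison mentioned above can be sketched in Python. Each organization tokenizes its own customer list with a key that both parties have agreed on, and only the tokens are compared. The key, the matching fields, and the simple normalization (a crude stand-in for real identity disambiguation) are assumptions for illustration and do not represent how any named product works.

```python
import hashlib
import hmac

def token(name, birth_date, shared_key):
    """One-way token built from lightly normalized identifying fields."""
    normalized = "|".join(" ".join(part.lower().split()) for part in (name, birth_date))
    return hmac.new(shared_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

def anonymized_set(customers, shared_key):
    """Tokenize a customer list; only this set leaves the organization."""
    return {token(name, dob, shared_key) for name, dob in customers}

# Hypothetical key agreed on by both organizations.
shared_key = b"hypothetical key agreed by both parties"

org_a = [("Jane Doe", "1970-01-31"), ("Raj Patel", "1982-06-15")]
org_b = [("jane  doe", "1970-01-31"), ("Ana Silva", "1990-11-02")]

# The overlap can be counted without either side revealing names.
shared_count = len(anonymized_set(org_a, shared_key) & anonymized_set(org_b, shared_key))
```

The match on “Jane Doe” succeeds only because both spellings normalize to the same text before hashing, which is why data cleanup ahead of anonymization has such a large effect on accuracy.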

Real-time Data Obfuscation

Your organization may have a need for the obfuscation of data in real-time as well as in batch. Real-time data obfuscation will permit your organization to enforce its policies regarding the transfer and access of sensitive data while the data is utilized for production business functions. Consider the following business scenario.

Real-time Business Scenario

An application for making decisions about granting loans or proposing financial products is utilized by analysts. Specific information for a customer may be queried by account number; this information is populated on the user’s screen.

If the analyst is located in the Boston office, the screen may contain all the details of a given account—including the account holder’s name and Social Security number. If the analyst is located in Mumbai, due to requirements driven by privacy regulations and associated security risks, the screen may contain only the information required for the particular type of analysis being performed—therefore the account holder’s name and Social Security number will be blanked out or replaced by a different unique identifier. That is, certain data elements may be de-identified or anonymized based on the authorization and/or location of a given user of the application and possibly the type of transaction they are undertaking.
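The scenario above might be sketched in Python as follows. The location rule, field names, and key are hypothetical; a real policy would likely also consider the user’s authorization and the type of transaction.

```python
import hashlib
import hmac

# Hypothetical policy: only these locations see full account details.
FULL_ACCESS_LOCATIONS = {"Boston"}
SENSITIVE_FIELDS = ("name", "ssn")

def render_account(account, analyst_location, secret_key):
    """Return the fields an analyst may see, based on the analyst's location."""
    if analyst_location in FULL_ACCESS_LOCATIONS:
        return dict(account)
    # Remove the sensitive fields and substitute a different unique identifier
    # derived by one-way keyed hashing, so records remain distinguishable.
    view = {k: v for k, v in account.items() if k not in SENSITIVE_FIELDS}
    digest = hmac.new(secret_key, account["ssn"].encode("utf-8"), hashlib.sha256)
    view["customer_ref"] = digest.hexdigest()[:16]
    return view

key = b"example-key-for-illustration-only"
account = {"account_no": "1001", "name": "Jane Doe",
           "ssn": "000-00-1234", "balance": 2500.0}

boston_screen = render_account(account, "Boston", key)
mumbai_screen = render_account(account, "Mumbai", key)
```

The Mumbai screen keeps the account number and balance needed for the analysis but carries a substitute reference in place of the name and Social Security number.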

Technical Solution for Real-time Data Masking: Role-Based Access to the Data Sources

A classic method of achieving real-time data masking is to utilize existing functionality in the database management software along with a form of role-based access control. Modern database management software, including recent versions of legacy database management software, provides functionality for restricting the visibility of data elements based on the specific authorizations granted to the user authenticated to the database. The simplest example of this is to provide different database “views” for different levels of users. These “views” can be constructed in such a manner as to obscure the private data. Even database administrators, who traditionally have unlimited access to data, can be restricted by utilizing field level cryptographic functionality.
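The “views” approach can be sketched with Python’s built-in sqlite3 module. SQLite is used here only because it runs without a server; it has no user roles or grants of its own, so the role-to-view mapping below is a stand-in for the permissions a full database management system would enforce. Table, view, and role names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (
    account_no TEXT PRIMARY KEY,
    name       TEXT,
    ssn        TEXT,
    balance    REAL
);
INSERT INTO accounts VALUES ('1001', 'Jane Doe', '000-00-1234', 2500.0);

-- Restricted view: omits the name and shows only the last four SSN digits.
CREATE VIEW accounts_restricted AS
    SELECT account_no,
           'XXX-XX-' || substr(ssn, -4) AS ssn,
           balance
    FROM accounts;
""")

# Stand-in for database grants: each role may read from exactly one source.
VIEW_FOR_ROLE = {
    "full_analyst": "accounts",
    "restricted_analyst": "accounts_restricted",
}

def fetch_account(role, account_no):
    """Query the table or view that the caller's role is allowed to read."""
    source = VIEW_FOR_ROLE[role]  # taken from the fixed mapping, never from user input
    cur = conn.execute(f"SELECT * FROM {source} WHERE account_no = ?", (account_no,))
    columns = [c[0] for c in cur.description]
    row = cur.fetchone()
    return dict(zip(columns, row)) if row else None

full_row = fetch_account("full_analyst", "1001")
restricted_row = fetch_account("restricted_analyst", "1001")
```

In a production database the restricted role would be granted access to the view only, so the underlying table could not be queried directly even outside the application.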

While some applications that interface with databases allow for individual users to be directly authenticated to the required database, most applications are built such that users authenticate either to a shared infrastructure component or to the individual applications, which then connect and authenticate to the database as a common shared user. By grouping required data access rules into a role definition and assigning this role to individually authenticated users, data privacy functionality can be incorporated into the required applications.

For applications that authenticate to the database as common users, different roles with unique levels of authorization may still be utilized. This is often achieved by establishing separate instances of an application with each instance configured to authenticate to the database with the appropriate common user for the role.

While implementing role-based data access controls can be an excellent solution for security and privacy requirements, the conversion process may require significant modifications to both the database environment and the associated applications. This is in contrast with the appliance-based solution discussed below in which the additional infrastructure is completely external to the existing applications and databases.

Technical Solution for Real-time Data Masking: Service Oriented Architecture Appliance

This solution utilizes external components in order to avoid the modification of existing applications. A service oriented architecture (“SOA”) appliance (or other appliance that operates on transactions at the data level) can be used to de-identify and anonymize data elements based on the context of transactions, such as authorization and/or location of a user while a business application is used, and possibly the type of transaction they are undertaking at a given moment.

The appliance resides between the application servers and the clients, acting as a centralized interface layer or bus through which all data access and interfaces flow. The appliance virtualizes an application or database service—rendering its presence invisible to the client; the appliance performs advanced transformation on data flowing to and from the client at wire-line speeds. These transformation operations encompass data de-identification and when applicable, anonymization.

Appliance-based advanced data manipulation of this sort is still emerging. We began work on such appliance-based data obfuscation scenarios for executing various data obfuscation functions based on context-sensitive access requirements in early 2007, basing our solutions on the capabilities of the IBM WebSphere DataPower SOA Appliance XI50.7 There are several SOA and data-centric appliances now on the market.

Conclusion

Your organization’s risk profile is markedly reduced when sensitive data is not used unnecessarily to execute business activities. If obfuscated data is compromised, it may prove valueless to unauthorized parties. Also, when utilizing obfuscated data, the handling, processing, storage, and transfer of this data may no longer be subject to requirements around the transfer of PII across international borders. If you are not using real PII, then these regulations do not apply. Due to its clear benefits, we recommend that data obfuscation become a core business practice for your organization.


At the time of writing, Benjamin Gerber, CISSP, CISA, CPP, CIPP/G, was a Senior Managing Consultant and the Privacy Services Competency Co-Lead with the Security and Privacy Practice at IBM.
He is now a Principal in the Privacy Strategy group at The MITRE Corporation.
He can be reached at privacy.us/contact.

Adam C. Nelson, Esq., CIPP/IT, a member of the Board of Editors of the Privacy & Data Security Law Journal, is a Senior Managing Consultant in the Security and Privacy Practice at IBM and the Privacy Services Competency Lead.

At the time of writing, Steven L. Jones was an Executive Consultant with the Security and Privacy Practice at IBM.
He is now an independent security consultant.


Notes

  1. There are various applied data perturbation methods based on the mathematical methods of perturbation theory, which allows us to approximate solutions to one problem based on a known solution to a similar problem. In data masking, we are deriving a distinctly different data set from the original data set, based on the (known) original data set.

  2. While search-engine data differs from employee or customer data, the most widely known incident of re-identification of de-identified data is the 2006 public release of AOL users’ search terms. See “101 Dumbest Moments in Business,” #57, http://money.cnn.com/galleries/2007/biz2/0701/gallery.101dumbest_2007/57.html. The risk of damage occurring as a result of re-identification is greater when additional sources can be used to infer additional information about individuals.

  3. An example of a one-way cryptographic hash algorithm is SHA (Secure Hash Algorithm), or the older MD5 (Message-Digest algorithm 5).

  4. IBM Optim http://www.optimsolution.com.

  5. Camouflage Software http://www.datamasking.com.

  6. IBM Entity Analytic Solutions; http://www.ibm.com/software/data/ips/products/masterdata/eas/.
    Related software that enhances utility:
    IBM Anonymous Resolution enables multiple organizations to selectively share data and leverage proprietary data in a manner that never exposes sensitive information, while still identifying relationships and developing leads; http://www.ibm.com/software/data/db2/eas/anonymous/.
    IBM Identity Resolution determines when two or more different looking identity packages are describing the same person; http://www.ibm.com/software/data/db2/eas/identity/.
    IBM Global Name Recognition products recognize multiple cultural variations of name data; http://www.ibm.com/software/data/globalname/.
    IBM Relationship Resolution identifies relationships between entities; http://www.ibm.com/software/data/db2/eas/relationship/.

  7. IBM WebSphere DataPower SOA Appliance; http://www.ibm.com/software/integration/datapower/.

