Roger's Download Samples - De-Identifying Data for Confidentiality

There are a number of circumstances when you as a developer might need to randomize data within a database. You may need to create sample data to test or demonstrate a database application. Or you may need to mask or "de-identify" confidential data to comply with regulatory agency rules. This article demonstrates two methods that can be used, either separately or in conjunction, to scramble your database.

Creating test data for a database application is always a difficult task. You need to create enough records to adequately test database performance. A design that works well with a hundred records, might not work with ten thousand. You also need to create data varied enough to mimic real world situations, data that will test the limits of the business rules. Developers have a tendency to create data that will work with their application, not data that will break it.

The very best source for creating test data is actual customer data. Nothing mimics real-world data like the real-world data. Unfortunately, your customers may not appreciate their data being used in this way. Worse yet, there may be regulatory considerations. For instance, the health care industry has to comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, which has very strict compliance rules.

Randomizing Records

Let's start by considering the simplest case. You simply want to mask customer data without the necessity of complying with regulations. It is probably sufficient to simply randomize each field that could identify a record to effectively mask the data.

Tables 1 and 2 show the before and after of a small Customer table that has been randomized.

CustomerID	CustomerNumber	FirstName	LastName
1	A003	Roger	Carlson
2	A045	Sam	Martin
3	A014	Sarah	Sutherland

Table 1: Original Customer table

CustomerID	CustomerNumber	FirstName	LastName
1	A045	Sarah	Carlson
2	A014	Sam	Sutherland
3	A003	Roger	Martin

Table 2: The Customer table after randomization

Of course, with only three records, the randomization is not very random, but the larger the result set, the better the randomization will be.

De-identifying Field Data

In certain instances, simply randomizing records is not sufficient. Certain field values have inherent information regardless of the record that holds them. Phone numbers, social security number, patient numbers, and the like all contain information all by themselves that may breach confidentiality. HIPAA regulations in particular are very strict about displaying any data that could identify a patient.

So to further mask your data, it may be necessary to scramble character data with in the field itself. Now, it doesn't make sense to randomize some types of fields. Name and address fields in particular will look strange if randomized in this way.

CustomerID	CustomerNumber	FirstName	LastName
1	0A30	Rgroe	arConsl
2	450A	aSm	rianMt
3	14A0	aharS	ltnrSadueh

Not only does it look strange, the capitalization of the names makes it fairly easy to re-identify the field values. So this function needs to be used with some discretion. It is best used with character data that is composed of numbers.

A further complication is that sometimes character data uses an input mask that displays characters that may either be stored or not. For instance, when you see a social security number in a field 111-22-3333, you don't really know if the number was stored as displayed or stored as 111223333. If the dashes are stored, the number might be randomized as 2-33311-13. So any function that randomizes character data must take these complications into account.

Download the sample for more.

Download Sample: http://www.rogersaccesslibrary.com/forum/uploads/5/De-identifyData.zip" rel="nofollow - De-identifyData.zip