Roger's Access Library Homepage
Forum Home Forum Home > Roger's Access Library > Roger's Download Samples
  New Posts New Posts RSS Feed - De-Identifying Data for Confidentiality
  FAQ FAQ  Forum Search   Events   Register Register  Login Login


Please consider donating $1 per sample downloaded, (find out why here)

De-Identifying Data for Confidentiality

 Post Reply Post Reply
Author
Message
Roger Carlson View Drop Down
Microsoft MVP
Microsoft MVP


Joined: 20 Feb 2008
Status: Offline
Points: 253
Post Options Post Options   Thanks (0) Thanks(0)   Quote Roger Carlson Quote  Post ReplyReply Direct Link To This Post Topic: De-Identifying Data for Confidentiality
    Posted: 03 Apr 2018 at 7:19am
Download Sample: De-identifyData.zip

There are a number of circumstances when you as a developer might need to randomize data within a database.  You may need to create sample data to test or demonstrate a database application.  Or you may need to mask or "de-identify" confidential data to comply with regulatory agency rules.  This article demonstrates two methods that can be used, either separately or in conjunction, to scramble your database.

 Creating test data for a database application is always a difficult task.  You need to create enough records to adequately test database performance.  A design that works well with a hundred records, might not work with ten thousand.  You also need to create data varied enough to mimic real world situations, data that will test the limits of the business rules.  Developers have a tendency to create data that will work with their application, not data that will break it.

 The very best source for creating test data is actual customer data.  Nothing mimics real-world data like the real-world data.  Unfortunately, your customers may not appreciate their data being used in this way.  Worse yet, there may be regulatory considerations.  For instance, the health care industry has to comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, which has very strict compliance rules.

 Randomizing Records

 Let's start by considering the simplest case.  You simply want to mask customer data without the necessity of complying with regulations.  It is probably sufficient to simply randomize each field that could identify a record to effectively mask the data. 

 Tables 1 and 2 show the before and after of a small Customer table that has been randomized. 

CustomerID

CustomerNumber

FirstName

LastName

1

A003

Roger

Carlson

2

A045

Sam

Martin

3

A014

Sarah

Sutherland

Table 1: Original Customer table 

CustomerID

CustomerNumber

FirstName

LastName

1

A045

Sarah

Carlson

2

A014

Sam

Sutherland

3

A003

Roger

Martin

Table 2: The Customer table after randomization

 Of course, with only three records, the randomization is not very random, but the larger the result set, the better the randomization will be. 

 De-identifying Field Data 

In certain instances, simply randomizing records is not sufficient.  Certain field values have inherent information regardless of the record that holds them.  Phone numbers, social security number, patient numbers, and the like all contain information all by themselves that may breach confidentiality.  HIPAA regulations in particular are very strict about displaying any data that could identify a patient.  

So to further mask your data, it may be necessary to scramble character data with in the field itself.  Now, it doesn't make sense to randomize some types of fields.  Name and address fields in particular will look strange if randomized in this way. 

CustomerID

CustomerNumber

FirstName

LastName

1

0A30

Rgroe

arConsl

2

450A

aSm

rianMt

3

14A0

aharS

ltnrSadueh

 Not only does it look strange, the capitalization of the names makes it fairly easy to re-identify the field values.  So this function needs to be used with some discretion.  It is best used with character data that is composed of numbers.

 A further complication is that sometimes character data uses an input mask that displays characters that may either be stored or not.  For instance, when you see a social security number in a field 111-22-3333, you don't really know if the number was stored as displayed or stored as 111223333.  If the dashes are stored, the number might be randomized as 2-33311-13.   So any function that randomizes character data must take these complications into account.

Download the sample for more.

Download Sample: De-identifyData.zip

Consider donating $1 per sample downloaded, find out why here.<
Back to Top
Sponsored Links


Back to Top
 Post Reply Post Reply
  Share Topic   

Forum Jump Forum Permissions View Drop Down