There
are a number of circumstances when you as a developer might need to randomize
data within a database. You may need to
create sample data to test or demonstrate a database application. Or you may need to mask or
"de-identify" confidential data to comply with regulatory agency
rules. This article demonstrates two
methods that can be used, either separately or in conjunction, to scramble your
database.
Creating
test data for a database application is always a difficult task. You need to create enough records to
adequately test database performance. A
design that works well with a hundred records, might not work with ten thousand. You also need to create data varied enough to
mimic real world situations, data that will test the limits of the business
rules. Developers have a tendency to
create data that will work with their application, not data that will break it.
The very best
source for creating test data is actual customer data. Nothing mimics real-world data like the
real-world data. Unfortunately, your
customers may not appreciate their data being used in this way. Worse yet, there may be regulatory
considerations. For instance, the health
care industry has to comply with HIPAA (Health Insurance Portability and
Accountability Act) regulations, which has very strict compliance rules.
Randomizing
Records
Let's start by
considering the simplest case. You
simply want to mask customer data without the necessity of complying with
regulations. It is probably sufficient
to simply randomize each field that could identify a record to effectively mask
the data.
Tables 1 and 2
show the before and after of a small Customer table that has been randomized.
CustomerID
|
CustomerNumber
|
FirstName
|
LastName
|
1
|
A003
|
Roger
|
Carlson
|
2
|
A045
|
Sam
|
Martin
|
3
|
A014
|
Sarah
|
Sutherland
|
Table 1:
Original Customer table
CustomerID
|
CustomerNumber
|
FirstName
|
LastName
|
1
|
A045
|
Sarah
|
Carlson
|
2
|
A014
|
Sam
|
Sutherland
|
3
|
A003
|
Roger
|
Martin
|
Table 2: The
Customer table after randomization
Of course, with
only three records, the randomization is not very random, but the larger the
result set, the better the randomization will be.
De-identifying
Field Data
In certain
instances, simply randomizing records is not sufficient. Certain field values have inherent
information regardless of the record that holds them. Phone numbers, social security number,
patient numbers, and the like all contain information all by themselves that
may breach confidentiality. HIPAA
regulations in particular are very strict about displaying any data that could
identify a patient.
So
to further mask your data, it may be necessary to scramble character data with
in the field itself. Now, it doesn't
make sense to randomize some types of fields.
Name and address fields in particular will look strange if randomized in
this way.
CustomerID
|
CustomerNumber
|
FirstName
|
LastName
|
1
|
0A30
|
Rgroe
|
arConsl
|
2
|
450A
|
aSm
|
rianMt
|
3
|
14A0
|
aharS
|
ltnrSadueh
|
Not only does
it look strange, the capitalization of the names makes it fairly easy to
re-identify the field values. So this
function needs to be used with some discretion.
It is best used with character data that is composed of numbers.
A further
complication is that sometimes character data uses an input mask that displays
characters that may either be stored or not.
For instance, when you see a social security number in a field
111-22-3333, you don't really know if the number was stored as displayed or
stored as 111223333. If the dashes are
stored, the number might be randomized as 2-33311-13. So any function that randomizes character
data must take these complications into account.
Download the sample for more.
Download Sample: http://www.rogersaccesslibrary.com/forum/uploads/5/De-identifyData.zip" rel="nofollow - De-identifyData.zip