Data pseudonymization in R
The General Data Protection Regulation (GDPR), recently enforced in the EU, requires that personal data be either anonymized or pseudonymized when used for purposes other than those originally intended. I always knew pseudonymization was a thing, but I was unaware of the fancy term until I attended an OpenAIRE seminar on research data. Unlike the term, the procedure itself is simple and can be implemented in R. But how exactly?
To begin with, we need a suitable dataset. We’ll use the first 5 rows of the mtcars dataset and add an identifier column called name. The first row is duplicated to demonstrate what happens when there are multiple observations of the same subject.
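A minimal sketch of how such a dataset could be built in base R is shown here. The object name cars is an assumption made for illustration; the car models that mtcars stores as row names serve as the identifiers.

```r
# Take the first 5 rows of the built-in mtcars dataset
cars <- head(mtcars, 5)

# Move the row names (car models) into an explicit identifier column
cars$name <- rownames(cars)
rownames(cars) <- NULL

# Put the identifier column first for readability
cars <- cars[, c("name", setdiff(names(cars), "name"))]

# Duplicate the first row to mimic repeated observations of one subject
cars <- rbind(cars, cars[1, ])
rownames(cars) <- NULL

cars[, c("name", "mpg", "cyl")]
```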
Data anonymization is seemingly simple: we just need to remove the column called name. This is not always sufficient, though, since it might still be possible to identify subjects by the remaining values. As a general recommendation, it should be impossible to narrow down any subject in pseudonymized data to a group of fewer than 5 subjects. So anonymization is actually more complicated and will not be discussed any further here.
Pseudonymization is less extreme. The idea is to replace identifiers (here the name column) with aliases (pseudonyms/tokens) so that the original dataset can be recreated when necessary. In R, we can generate a good alias, for instance, by drawing 8 characters at random from the uppercase letters and digits and then collapsing the result into a single string.
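One way this could look in base R is sketched below. Sampling with replacement is an assumption here, chosen because it matches the count of possible aliases quoted further down; the variable names chars and alias are illustrative.

```r
# Fixed seed only to make this illustration reproducible;
# do not fix the seed when generating aliases for real data
set.seed(42)

# 26 uppercase letters + 10 digits = 36 candidate characters
chars <- c(LETTERS, 0:9)

# Draw 8 characters (with replacement) and collapse them into one string
alias <- paste(sample(chars, 8, replace = TRUE), collapse = "")
alias
```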
However, there are several considerations when generating aliases for actual data:
- Aliases should be random and not sequential because the ordering of rows in data usually follows a pattern. In survey data, for example, rows may be ordered by the time a response was submitted, which could enable the re-identification of subjects. Non-sequential aliases ensure that once rows are shuffled, the original ordering cannot be restored (unless the original ordering is present in some other variable).
- Each row should usually have a unique identifier. However, this might not be true for longitudinal data in long format where each subject is represented by multiple rows.
- Different subjects should never be assigned identical aliases. When generating aliases as in the example above, there are 36^8 = 2 821 109 907 456 possible aliases. So while duplicate aliases are extremely unlikely, an alias generation algorithm must make them impossible by design.
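To put the last point in perspective, the chance of at least one duplicate can be estimated with the standard birthday-problem approximation. This is only a rough sketch: the helper collisionProb is hypothetical, and it assumes aliases are drawn independently and uniformly from all 36^8 possibilities.

```r
nChars <- 36                          # uppercase letters + digits
aliasLength <- 8
nAliases <- nChars^aliasLength        # 2 821 109 907 456 possible aliases

# Approximate probability that at least two of n subjects share an alias:
# p = 1 - exp(-n * (n - 1) / (2 * N)); expm1 keeps precision for tiny p
collisionProb <- function(n, N = nAliases) {
  -expm1(-n * (n - 1) / (2 * N))
}

collisionProb(c(100, 10000, 1e6))
```

For a few thousand subjects the probability is negligible, but under these assumptions it grows to roughly 16% for a million subjects, which is why uniqueness should be enforced rather than assumed.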
Taking these considerations into account, we can use the function below to generate a lookup table that associates each unique identifier with an alias.
To address possible duplicate identifiers, the function issues a warning when it finds any, leaving the user to decide whether they should be dealt with. Then an alias is generated for each unique identifier, and the process is repeated until all aliases are unique. More than one iteration is rarely necessary unless the aliases are very short. Finally, the result is returned as a
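A function along these lines might look like the sketch below. It is a reconstruction under assumptions, not the original listing: the name createLookup, its arguments nChar and chars, the column names name and alias, and the choice of a plain data frame as the return type are all hypothetical.

```r
createLookup <- function(ids, nChar = 8, chars = c(LETTERS, 0:9)) {
  # Warn about repeated identifiers, but leave the decision to the user
  if (anyDuplicated(ids) > 0) {
    warning("Duplicate identifiers found; each unique identifier gets one alias.")
  }
  keys <- unique(ids)

  # Guard against an endless loop when there are too few possible aliases
  if (length(keys) > length(chars)^nChar) {
    stop("Not enough distinct aliases of this length for all identifiers.")
  }

  # Draw n random aliases of nChar characters each
  drawAliases <- function(n) {
    vapply(seq_len(n), function(i) {
      paste(sample(chars, nChar, replace = TRUE), collapse = "")
    }, character(1))
  }

  aliases <- drawAliases(length(keys))

  # Redraw only the colliding aliases until every alias is unique
  while (anyDuplicated(aliases) > 0) {
    dup <- duplicated(aliases)
    aliases[dup] <- drawAliases(sum(dup))
  }

  data.frame(name = keys, alias = aliases, stringsAsFactors = FALSE)
}

# Three identifiers, two of them equal: a warning and a two-row lookup
exampleLookup <- createLookup(c("subject A", "subject B", "subject A"))
exampleLookup
```

Redrawing only the duplicated aliases keeps the loop short, and checking with anyDuplicated after every pass is what guarantees uniqueness instead of merely making collisions unlikely.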
Once generated, the lookup table can be used to replace names with aliases or vice versa using the following functions.
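The two replacement helpers could be sketched as follows. The function names namesToAliases and aliasesToNames are hypothetical, as is the assumed layout of the lookup table (a data frame with columns name and alias); the aliases in the small example are made up by hand.

```r
# Replace original names with their aliases
namesToAliases <- function(x, lookup) {
  idx <- match(x, lookup$name)
  if (anyNA(idx)) stop("Some names are missing from the lookup table.")
  lookup$alias[idx]
}

# Replace aliases with the original names
aliasesToNames <- function(x, lookup) {
  idx <- match(x, lookup$alias)
  if (anyNA(idx)) stop("Some aliases are missing from the lookup table.")
  lookup$name[idx]
}

# Hand-made lookup table, for illustration only
lookup <- data.frame(
  name  = c("Mazda RX4", "Datsun 710"),
  alias = c("K3QZ81TB", "9XWA2M0C"),
  stringsAsFactors = FALSE
)

ids <- c("Mazda RX4", "Datsun 710", "Mazda RX4")
aliased <- namesToAliases(ids, lookup)
restored <- aliasesToNames(aliased, lookup)
identical(restored, ids)
```

Stopping on unmatched values is a deliberate choice in this sketch: silently returning NA would discard an identifier without notice.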
Below is an illustration of how these functions work in practice. Note that the duplicate row triggers a warning and only unique identifiers are written to the lookup table.
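An end-to-end run might look like the following sketch. To keep the block runnable on its own, it restates condensed, hypothetical versions of the helpers (createLookup, namesToAliases, aliasesToNames); only the object names carsAliases and aliasLookup come from the text.

```r
# Condensed, hypothetical helpers so that this block runs on its own
createLookup <- function(ids, nChar = 8, chars = c(LETTERS, 0:9)) {
  if (anyDuplicated(ids) > 0) warning("Duplicate identifiers found.")
  keys <- unique(ids)
  draw <- function(n) vapply(seq_len(n), function(i) {
    paste(sample(chars, nChar, replace = TRUE), collapse = "")
  }, character(1))
  aliases <- draw(length(keys))
  while (anyDuplicated(aliases) > 0) {
    dup <- duplicated(aliases)
    aliases[dup] <- draw(sum(dup))
  }
  data.frame(name = keys, alias = aliases, stringsAsFactors = FALSE)
}
namesToAliases <- function(x, lookup) lookup$alias[match(x, lookup$name)]
aliasesToNames <- function(x, lookup) lookup$name[match(x, lookup$alias)]

# Example data: first 5 rows of mtcars, first row duplicated
cars <- head(mtcars, 5)
cars$name <- rownames(cars)
rownames(cars) <- NULL
cars <- rbind(cars, cars[1, ])

# The duplicated "Mazda RX4" triggers a warning; the lookup gets 5 rows
aliasLookup <- createLookup(cars$name)

# Pseudonymize, then shuffle the rows and reset the row names so that
# the original ordering cannot be read off them
carsAliases <- cars
carsAliases$name <- namesToAliases(cars$name, aliasLookup)
carsAliases <- carsAliases[sample(nrow(carsAliases)), ]
rownames(carsAliases) <- NULL

# Re-identification by someone with access to the lookup table
restored <- carsAliases
restored$name <- aliasesToNames(carsAliases$name, aliasLookup)
```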
Once it is certain that names cannot be deduced from the values, the carsAliases table is pseudonymized and can be published (provided that aliasLookup is stored with restricted access).