Pseudonymization and anonymization: what rules apply to our health data?

2142 zettabytes: that is the estimated global volume of digital data by 2035. The figure is dizzying, especially when compared with the volumes of 2010 and 2020 (2 and 64 zettabytes respectively). This exponential increase obviously raises questions about the storage of this data, its availability, and the regulations that govern it.

Accordingly, the growth in the volume of sensitive and personal data has been accompanied by a number of legal texts that regulate its use, processing and distribution, and set security standards.

In the healthcare sector, controlling the processing of this data is a major strategic challenge because the data is essential, first and foremost for care, but also for research. Health data are subject to special management because their sensitive nature brings additional confidentiality, integrity and availability requirements.

In order to protect these data, two techniques are generally cited: pseudonymization and anonymization. However, although these two processes are often confused, they actually meet two distinct needs, offer different guarantees, and are therefore in no way interchangeable.

So how will our health data be used? Should they be anonymized or pseudonymized, and how? Ventio sheds light on the nuances between these two techniques, and illustrates their implementation, for example in the context of reuse for research in health data warehouses or for clinical trials. Given the complexity, those implementing projects aimed at reusing health data for research are well advised to consult with qualified individuals on data protection, both technically and in terms of organization.

1 – Pseudonymize, anonymize… which data are concerned?

a) Health data

Before getting to the heart of the matter and describing the techniques of pseudonymization and anonymization in detail, it is essential to specify their scope of application. Although these techniques can be applied to any type of data, they are used above all with so-called sensitive data such as health data. Health data include not only directly identifying personal data (surname, first name, telephone number, etc.) but also information that requires special protection because of its sensitive nature (medical results, care services, a weight measurement cross-referenced with another value, etc.), which gives it a high value.

Data concerning health is defined in Article 4(15) of the GDPR as “personal data relating to the physical or mental health of a natural person, including the provision of healthcare services, which reveal information about that person’s state of health” (see the CNIL fact sheet, in French, to appreciate the scope). This broad definition comes with a clarification of the uses that can be made of the data: health data can be collected for several purposes, such as the medical follow-up of a person or research. For health care, of course, health professionals must be able to identify the person, but the critical nature of the data still requires special protection measures. In a research context, health data may need additional protection, and this is where pseudonymization and anonymization come in.

b) Pseudonymization and anonymization techniques

Pseudonymization is a data processing technique that prevents a person from being directly identified by replacing identifying data with aliases, numbers or pseudonyms. This technique is used when individual-level information is needed for each person, without needing to know their identity directly. With pseudonymization, identification remains possible because a correspondence table linking the pseudonym to the identity is kept.

Anonymization, on the other hand, is a technique that makes it impossible to identify the person, following the deletion of all directly and indirectly identifying data.

The decision to pseudonymize or anonymize data is therefore taken according to the objectives set, and in particular whether or not the personal nature of the data needs to be retained. It is also important to note that anonymized data lose their personal character for good: the process is irreversible and it will be impossible to identify the person behind the data. Pseudonymization, by contrast, is a reversible technique. It is a security measure that limits the risk of directly linking the data to a named individual, but it in no way erases the personal nature of the information used. Vigilance is still required because re-identification is possible by cross-referencing information.

c) Two different regulatory levels

According to the CNIL, pseudonymization is the processing of personal data in such a way that it can no longer be attributed to a specific data subject without recourse to additional information, provided that this additional information is kept separately and subject to enhanced security measures.

Therefore, data resulting from pseudonymization are considered personal data and are subject to the obligations of the GDPR, such as data retention period, data confidentiality or respect of the rights of individuals.

On the other hand, anonymized data, because it has undergone an irreversible process eliminating any possibility of re-identifying an individual, is not subject to the GDPR.

This distinction is crucial from the point of view of the health sector, which deals with large volumes of personal data that can identify a person (for an individual: medical history, gender, age, address, etc.). Protecting this data is therefore a legal obligation.

Moreover, minimizing the collection of data by processing only the data necessary for specific processing operations is a duty for every health organization, such as hospitals and mutual insurance companies.

In the context of data processing for scientific research purposes, if the retention of individual information is not justified by the project, then anonymization must be carried out.

Depending on the objectives pursued, one will therefore opt for one or other of the strategies.

2 – How to pseudonymize and/or anonymize your health data?

In practice, pseudonymizing data relies on techniques that create a pseudonym from which the initial value cannot be directly identified. Basic techniques exist and can be sufficient in some cases, such as the counter method or the random number generator. There are also more complex techniques, such as hashing or encryption, that may be more appropriate in certain situations. Have a look at the latest recommendation from ENISA on the subject.

From the point of view of data anonymization, randomization and generalization are the most widespread techniques. Ventio takes a look at the different categories listed by the CNIL.

a) The counter

Pseudonymization by counter is a technique that consists in assigning an incremental number to the identified value. It is a relatively simple technique: the values generated by the counter are never repeated, which avoids duplicates. The correspondence table containing the number assigned to each person must be kept separately from the pseudonymized data.

It works well for small datasets but becomes problematic on large ones.

The limitation of this approach is that the counter preserves the order in which the values were processed, and that order is itself an important piece of information: the counter value may be correlated with the order of inclusion in a study, the alphabetical order or the date of birth. If this correlation is identified, the data can easily be re-identified.

Example of counter-pseudonymized data:

Name	First name	Date of birth	Pseudonym
Martin	Julian	06/13/1975	365
Jafer	Bob	08/09/1987	366
Red	Laura	08/26/1988	367

Correspondence table linking the identifying data to the counter.
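
As an illustration, the counter method can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `pseudonymize_with_counter` and the `start` parameter are assumptions made for the example, and the starting value 365 is chosen only to mirror the table above.

```python
from itertools import count


def pseudonymize_with_counter(identities, start=1):
    """Map each distinct identity to the next value of an incremental counter."""
    counter = count(start)
    table = {}
    for identity in identities:
        if identity not in table:  # one pseudonym per person
            table[identity] = next(counter)
    return table


people = [
    ("Martin", "Julian", "06/13/1975"),
    ("Jafer", "Bob", "08/09/1987"),
    ("Red", "Laura", "08/26/1988"),
]

# start=365 only mirrors the example table; any starting value works.
correspondence_table = pseudonymize_with_counter(people, start=365)
for identity, pseudonym in correspondence_table.items():
    print(identity, pseudonym)
```

The pseudonyms follow the order of the input list, which is exactly the limitation described above: whoever knows how the list was sorted learns something about each pseudonym.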

b) The random number generator

This method consists in generating a random value for each record. The values are totally independent of one another and, unlike with the counter technique, the pseudonym gives no information about the order of the data, which makes the original value harder to recover. On large datasets, however, the same pseudonym may be drawn twice because the numbers are random, so it is imperative to verify that a number is not already assigned before using it.

Name	First name	Date of birth	Pseudonym
Martin	Julian	06/13/1975	18541
Jafer	Bob	08/09/1987	97123214
Red	Laura	08/26/1988	13

Correspondence table linking the identifying data to the random number.
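
The duplicate check described above can be sketched as follows in Python. This is a hedged example: the function name and the `upper_bound` parameter are hypothetical, and the standard `secrets` module is used here as one possible source of unpredictable numbers.

```python
import secrets


def pseudonymize_with_random_numbers(identities, upper_bound=10**8):
    """Draw a random pseudonym per identity, redrawing on collision."""
    table = {}
    used = set()
    for identity in identities:
        if identity in table:
            continue  # this person already has a pseudonym
        pseudonym = secrets.randbelow(upper_bound)
        while pseudonym in used:  # verify the number is not already assigned
            pseudonym = secrets.randbelow(upper_bound)
        used.add(pseudonym)
        table[identity] = pseudonym
    return table


people = [
    ("Martin", "Julian", "06/13/1975"),
    ("Jafer", "Bob", "08/09/1987"),
    ("Red", "Laura", "08/26/1988"),
]
random_table = pseudonymize_with_random_numbers(people)
for identity, pseudonym in random_table.items():
    print(identity, pseudonym)
```

The range must be much larger than the number of people, otherwise the redraw loop slows down as the range fills up and never ends once every number is taken.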

c) Hash and salt

A hash function produces a result of fixed size whatever the size of the input. Here it is used to transform the identifying value into a signature using hashing algorithms such as MD5, SHA-1 or SHA-2.

However, this technique carries a risk. Because hash functions are public (everyone uses the same functions), the original value can be recovered by brute force when the range of possible inputs is known: an attacker can, for example, hash every possible date of birth and compare the results with the pseudonyms.

To reduce this risk, a salt is added, i.e. a random value is combined with the attribute before hashing. A secret key can also be added as an additional value, so that an attacker cannot recover the input value without knowing the key, which must be changed regularly. In general, the ANSSI recommendations should be followed when selecting the hash algorithm.

Name	First name	Date of birth	Pseudonym
Martin	Julian	06/13/1975	611ab7794ebc611f2f7d614f39a958fcbcce4e8486b48854676561b5010a7b37
Jafer	Bob	08/09/1987	46bbc5b1d3c8a50ce8b4f10594498367772d5bd6a3cf2aeb7f8e01febb6a6f74
Red	Laura	08/26/1988	423f678e5679f6ab878a00c12abb0e012983def8aab99ff9f21e2c06ee7d4077

Correspondence table linking the identifying data to the signature.
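
The difference between a salted hash and a keyed hash can be illustrated with Python's standard library. This is only a sketch under stated assumptions: SHA-256 is chosen for the example, HMAC is used as one common way to combine a secret key with a hash, and the function names and the `|`-separated identity string are hypothetical. The signatures it produces will not match those in the table above.

```python
import hashlib
import hmac
import secrets


def salted_hash(value: str, salt: bytes) -> str:
    """SHA-256 signature of the salt followed by the value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, key: bytes) -> str:
    """HMAC-SHA-256 signature: cannot be recomputed without the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


salt = secrets.token_bytes(16)  # random value added to the attribute
key = secrets.token_bytes(32)   # secret key, to be stored separately

identity = "Martin|Julian|06/13/1975"
print(salted_hash(identity, salt))
print(keyed_hash(identity, key))
```

With the same key, the same identity always yields the same 64-character signature, so records can still be linked. Changing the key changes every pseudonym.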

d) Encryption

Encryption is a method of protecting directly identifying data by making it completely unintelligible. With secret key encryption, for example, only the holder of the key can re-identify the data by decrypting it. Access to the key must therefore be secured and traced.

Deterministic encryption is commonly used for pseudonymization: the same identifying input always gives the same pseudonym. In some cases, it may also be necessary to use probabilistic encryption, which associates several pseudonyms with the same person in order to store information that should not be combined.

e) Randomization

In an anonymization approach, it is first essential to:

Determine the purpose and use of the anonymized data
Remove identifying data (name, surname…) and values that allow re-identification
Identify the relevant data to be retained
Define the acceptable level of precision of the data for each piece of information that will be retained (age range, year of birth…)

Once these choices have been made, the randomization technique is applied. It consists of modifying the attributes in a dataset in order to make the data less precise, for example by swapping the attributes of certain data such as the date of birth.

Name	Date of birth	Disease	Last hospitalization
Martin	06/13/1975	HIV	09/08/2021
Jafer	08/09/1987	Cancer	06/11/2020
Red	08/26/1988	Diabetes	11/02/2022

Original data set.

Individual	Date of birth	Disease
6	1988	Cancer
9	1975	Cancer
32	1987	HIV

Randomized data set.
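
The two steps above (removing identifying data, then swapping an attribute between records) can be sketched in Python as follows. This is a simplified illustration, not the procedure used to produce the table above: the helper names are hypothetical, and permutation of the year of birth is only one possible form of randomization.

```python
import random


def strip_and_coarsen(record):
    """Drop direct identifiers and keep only the year of birth."""
    return {
        "year_of_birth": record["date_of_birth"][-4:],
        "disease": record["disease"],
    }


def permute_attribute(records, attribute, rng):
    """Return copies of the records with one attribute shuffled across rows."""
    values = [record[attribute] for record in records]
    rng.shuffle(values)
    return [{**record, attribute: value} for record, value in zip(records, values)]


original = [
    {"name": "Martin", "date_of_birth": "06/13/1975", "disease": "HIV",
     "last_hospitalization": "09/08/2021"},
    {"name": "Jafer", "date_of_birth": "08/09/1987", "disease": "Cancer",
     "last_hospitalization": "06/11/2020"},
    {"name": "Red", "date_of_birth": "08/26/1988", "disease": "Diabetes",
     "last_hospitalization": "11/02/2022"},
]

reduced = [strip_and_coarsen(record) for record in original]
randomized = permute_attribute(reduced, "year_of_birth", random.SystemRandom())
for row in randomized:
    print(row)
```

The overall distribution of birth years is preserved, but a given year is no longer reliably attached to a given disease. On a dataset as small as this one, the shuffle may leave some rows unchanged, so the sketch says nothing about whether the result is actually anonymous.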

f) Generalization

The generalization technique consists in changing the scale of the attributes in a dataset so that their values are shared by a group of people. This prevents individuals from being singled out in the dataset. For example, a person's age can be replaced with an age range (18-24 years, etc.).

Name	Date of birth	Disease	Last hospitalization
Martin	06/13/1975	HIV	09/08/2021
Jafer	08/09/1987	Cancer	06/11/2020
Red	08/26/1988	Diabetes	11/02/2022

Original data set.

Individual	Age	Disease
6	30-40 years	Cancer
9	50-60 years	Cancer
32	30-40 years	HIV

Generalized data set.
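
Generalizing a date of birth into an age range can be sketched as below. This is a minimal example under assumptions: the reference date, the ten-year width and the function names are hypothetical choices, and the ranges it prints depend on that reference date, so they need not match the illustrative table above.

```python
from datetime import date, datetime


def age_on(date_of_birth: str, reference: date) -> int:
    """Age in full years on the reference date (input format MM/DD/YYYY)."""
    born = datetime.strptime(date_of_birth, "%m/%d/%Y").date()
    had_birthday = (reference.month, reference.day) >= (born.month, born.day)
    return reference.year - born.year - (0 if had_birthday else 1)


def age_range(age: int, width: int = 10) -> str:
    """Replace an exact age with the range that contains it."""
    lower = (age // width) * width
    return f"{lower}-{lower + width} years"


reference_date = date(2023, 1, 1)  # hypothetical extraction date
for dob in ["06/13/1975", "08/09/1987", "08/26/1988"]:
    print(age_range(age_on(dob, reference_date)))
```

A wider range groups more people together and lowers the risk of singling someone out, at the cost of precision. The acceptable width follows from the level of precision defined for the project.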

Conclusion

The availability and use of data have become one of the challenges in the digital world, especially health data, which is sensitive data requiring special protection measures, particularly when reused for research.

Although pseudonymization helps secure data, it remains reversible and does not remove the personal nature of the data. The GDPR therefore applies to pseudonymized data. Pseudonymization nevertheless remains a necessary measure to limit the risks to the privacy of individuals.

Anonymization is irreversible and results in a loss of information. Because any possibility of identifying individuals is lost, anonymized data is no longer subject to the GDPR and can therefore be used and kept without time limit.

Thus, depending on the purpose of the data processing, people setting up studies aiming at reusing health data will choose the most appropriate technical and organizational measures. This reflection can be done with people who have the technical and organizational skills, typically the data protection officer and the information system security manager.

Ventio, with its certified DPO and cybersecurity experts, as well as its specialist in biomedical research, can help you set up your processing operations for the reuse of health data. Regulations, anonymization or pseudonymization… come and present your projects and constraints to us, and let’s define together the data processing that best corresponds to your use case. Contact us!