Pseudonymization

Pseudonymization codes

The Imagen and c-VEDA projects use two levels of pseudonymization codes: PSC1 codes, which acquisition centres use to pseudonymize datasets before sending them to the databank, and PSC2 codes, which further pseudonymize datasets before dissemination to research scientists.

This section describes the generation of PSC1 and PSC2 codes for the c-VEDA project.

The intermediate and final files listed in this section are stored under /cveda/databank/framework/psc.

PSC1 code generation

10-digit code generation

We first used the Python script cveda_databank/psc/generate_psc1.py to generate 10-digit codes such that:

the first 3 digits are 0,
the Damerau–Levenshtein distance between codes is at least 3.

generate_psc1.py > psc1-10-digit-3-zero.txt
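
The script itself is not reproduced on this page. As a rough illustration of the two constraints above, here is a minimal Python sketch, not the actual generate_psc1.py: it draws random candidates and keeps those at distance 3 or more from every code accepted so far. It uses the restricted variant of the Damerau–Levenshtein distance (optimal string alignment), which may differ from the variant used by the real script. The function names, the `existing` parameter and the rejection-sampling strategy are assumptions made for illustration.

```python
import random


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def generate_codes(count, existing=(), prefix="000", digits=10,
                   min_distance=3, seed=None):
    """Return `count` new codes, each far enough from all other codes.

    `existing` (hypothetical parameter) holds previously issued codes
    that new codes must also keep their distance from.
    Note: loops forever if `count` codes cannot be found.
    """
    rng = random.Random(seed)
    accepted = list(existing)
    new_codes = []
    free = digits - len(prefix)
    while len(new_codes) < count:
        candidate = prefix + "".join(rng.choice("0123456789") for _ in range(free))
        if all(osa_distance(candidate, other) >= min_distance for other in accepted):
            accepted.append(candidate)
            new_codes.append(candidate)
    return new_codes
```

Each candidate is compared against every accepted code, so the cost grows quadratically with the number of codes. That is acceptable for a sketch; the real script may be organized differently.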

12-digit code creation and assignment to centres

We assigned batches of the above 10-digit codes to c-VEDA centres by prepending a 2-digit code specific to each centre, as described in the following table, resulting in 12-digit PSC1 codes.

ID	CENTRE	# PSC1
11	CHANDIGARH	1000
12	MANIPUR	750
13	KOLKATA	1600
14	RISHIVALLEY	1200
15	MYSURU	1500
16	NIMHANS	1500
17	SJRI	1950
18	MUMBAI	500

We did this with the following Unix shell commands:

sed -n -e '1,1000{s/^/11/p}' psc1-10-digit-3-zero.txt > PSC1_CHANDIGARH_2016-07-05.txt
sed -n -e '1001,1750{s/^/12/p}' psc1-10-digit-3-zero.txt > PSC1_MANIPUR_2016-07-05.txt
sed -n -e '1751,3350{s/^/13/p}' psc1-10-digit-3-zero.txt > PSC1_KOLKATA_2016-07-05.txt
sed -n -e '3351,4550{s/^/14/p}' psc1-10-digit-3-zero.txt > PSC1_RISHIVALLEY_2016-07-05.txt
sed -n -e '4551,6050{s/^/15/p}' psc1-10-digit-3-zero.txt > PSC1_MYSURU_2016-07-05.txt
sed -n -e '6051,7550{s/^/16/p}' psc1-10-digit-3-zero.txt > PSC1_NIMHANS_2016-07-05.txt
sed -n -e '7551,9500{s/^/17/p}' psc1-10-digit-3-zero.txt > PSC1_SJRI_2016-07-05.txt
sed -n -e '9501,10000{s/^/18/p}' psc1-10-digit-3-zero.txt > PSC1_MUMBAI_2016-07-05.txt
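
For readers who prefer Python, the same slicing can be sketched as follows. The `CENTRES` list copies the table above; `assign_batches` is a hypothetical helper, not part of the c-VEDA tooling, and it assumes the 10-digit codes have already been read into a list in file order.

```python
from itertools import islice

# (centre ID, centre name, number of PSC1 codes), copied from the table above.
CENTRES = [
    ("11", "CHANDIGARH", 1000),
    ("12", "MANIPUR", 750),
    ("13", "KOLKATA", 1600),
    ("14", "RISHIVALLEY", 1200),
    ("15", "MYSURU", 1500),
    ("16", "NIMHANS", 1500),
    ("17", "SJRI", 1950),
    ("18", "MUMBAI", 500),
]


def assign_batches(codes, centres=CENTRES):
    """Split `codes` into consecutive batches and prepend each centre ID."""
    remaining = iter(codes)
    batches = {}
    for centre_id, name, count in centres:
        batch = [centre_id + code for code in islice(remaining, count)]
        if len(batch) < count:
            raise ValueError(f"not enough codes left for {name}")
        batches[name] = batch
    return batches
```

The batch sizes add up to 10,000, which matches the last line number used in the sed commands.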

PSC2 code generation

10-digit code generation

We used the Python script cveda_databank/psc/generate_psc2.py to generate 10-digit codes such that:

the first 3 digits are 0,
the Damerau–Levenshtein distance between codes is at least 3,
the PSC2 codes of the Imagen project (file psc2-imagen.txt below) are taken into account as existing codes and not re-used in c-VEDA.

We let this script run until it had produced more than 100,000 10-digit codes, then killed it:

cat psc1-10-digit-3-zero.txt psc2-imagen.txt | generate_psc2.py > psc2-10-digit-2-zero.tmp

We then discarded codes containing a run of 4 identical digits and kept exactly 100,000 10-digit PSC2 codes, picked at random:

grep -Ev '(0000|1111|2222|3333|4444|5555|6666|7777|8888|9999)' psc2-10-digit-2-zero.tmp | shuf | head -n 100000 > psc2-10-digit-2-zero.txt
rm psc2-10-digit-2-zero.tmp
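
The filtering and sampling step can also be sketched in Python. This assumes the candidate codes are already loaded in a list; `select_codes` is a hypothetical helper, and the backreference pattern is intended to match the same codes as the alternation of ten 4-digit runs in the shell command.

```python
import random
import re

# A digit followed by three more copies of itself, e.g. "2222".
FOUR_REPEATS = re.compile(r"([0-9])\1{3}")


def select_codes(candidates, count, seed=None):
    """Drop codes with a run of 4 identical digits, then sample `count` of them."""
    kept = [code for code in candidates if not FOUR_REPEATS.search(code)]
    if len(kept) < count:
        raise ValueError(f"only {len(kept)} codes left after filtering")
    # random.sample returns the selection in random order, like shuf | head.
    return random.Random(seed).sample(kept, count)
```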

PSC1-PSC2 conversion table

Since we have 10,000 PSC1 codes, we took the first 10,000 of the above 10-digit PSC2 codes, prepended '00' to obtain 12-digit codes, and shuffled them so that the conversion table cannot easily be inferred:

sed -n -e '1,10000{s/^/00/p}' psc2-10-digit-2-zero.txt | shuf > psc2.tmp

We also prepared a temporary file with the existing 10,000 PSC1 codes, sorted:

cat PSC1_*2016-07-05.txt | sort > psc1.tmp

Finally, we created the conversion table and deleted the temporary files:

paste -d ',' psc1.tmp psc2.tmp | sort > psc2psc_2016-07-12.txt
rm psc1.tmp psc2.tmp
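
A sketch of the same pairing in Python, with a check that the resulting table is one-to-one. The helper names are hypothetical and the inputs are assumed to be lists of strings; the shell commands above remain the procedure that was actually used.

```python
import random


def build_conversion_table(psc1_codes, psc2_codes, seed=None):
    """Pair sorted 12-digit PSC1 codes with shuffled '00'-prefixed PSC2 codes."""
    psc1_sorted = sorted(psc1_codes)
    if len(psc2_codes) < len(psc1_sorted):
        raise ValueError("not enough PSC2 codes")
    # Take as many PSC2 codes as there are PSC1 codes and make them 12-digit.
    psc2_12 = ["00" + code for code in psc2_codes[:len(psc1_sorted)]]
    random.Random(seed).shuffle(psc2_12)
    # One "PSC1,PSC2" row per code, sorted like the output of paste | sort.
    return sorted(f"{p1},{p2}" for p1, p2 in zip(psc1_sorted, psc2_12))


def is_one_to_one(rows):
    """Check that no PSC1 or PSC2 code appears in more than one row."""
    pairs = [row.split(",") for row in rows]
    psc1 = {pair[0] for pair in pairs}
    psc2 = {pair[1] for pair in pairs}
    return len(psc1) == len(psc2) == len(rows)
```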
