Pseudonymization

Dimitri Papadopoulos Orfanos edited this page Nov 6, 2016 · 21 revisions

The Imagen and c-VEDA projects use two levels of pseudonymization codes: first-level PSC1 codes, which acquisition centres use to pseudonymize datasets before sending them to the databank, and second-level PSC2 codes, which are used to further pseudonymize datasets before they are sent to research scientists.

This section describes the generation of PSC1 and PSC2 codes for the c-VEDA project.

The intermediate and final files listed in this section are stored under /cveda/databank/framework/psc.

PSC1 code generation

10-digit code generation

We first used the Python script cveda_databank/psc/generate_psc1.py to generate 10-digit codes:

generate_psc1.py > psc1-10-digit-3-zero.txt

12-digit code creation and assignment to centres

We assigned batches of the above 10-digit codes to c-VEDA centres by prepending a 2-digit code specific to each centre, as described in the following table, resulting in 12-digit PSC1 codes.

ID  CENTRE       # PSC1
11  CHANDIGARH     1000
12  MANIPUR         750
13  KOLKATA        1600
14  RISHIVALLEY    1200
15  MYSURU         1500
16  NIMHANS        1500
17  SJRI           1950
18  MUMBAI          500

We did this with the following Unix shell commands, one per centre:

sed -n -e '1,1000{s/^/11/p}' psc1-10-digit-3-zero.txt > PSC1_CHANDIGARH_2016-07-05.txt
sed -n -e '1001,1750{s/^/12/p}' psc1-10-digit-3-zero.txt > PSC1_MANIPUR_2016-07-05.txt
sed -n -e '1751,3350{s/^/13/p}' psc1-10-digit-3-zero.txt > PSC1_KOLKATA_2016-07-05.txt
sed -n -e '3351,4550{s/^/14/p}' psc1-10-digit-3-zero.txt > PSC1_RISHIVALLEY_2016-07-05.txt
sed -n -e '4551,6050{s/^/15/p}' psc1-10-digit-3-zero.txt > PSC1_MYSURU_2016-07-05.txt
sed -n -e '6051,7550{s/^/16/p}' psc1-10-digit-3-zero.txt > PSC1_NIMHANS_2016-07-05.txt
sed -n -e '7551,9500{s/^/17/p}' psc1-10-digit-3-zero.txt > PSC1_SJRI_2016-07-05.txt
sed -n -e '9501,10000{s/^/18/p}' psc1-10-digit-3-zero.txt > PSC1_MUMBAI_2016-07-05.txt
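The line ranges in the sed commands above follow from the cumulative batch sizes in the table. As a cross-check, here is a minimal Python sketch that derives the same ranges and prepends the centre prefix. The names `CENTRES`, `batch_ranges` and `assign` are hypothetical and not part of cveda_databank.

```python
from itertools import accumulate

# Centre prefix, centre name and number of PSC1 codes, copied from the table.
CENTRES = [
    ("11", "CHANDIGARH", 1000),
    ("12", "MANIPUR", 750),
    ("13", "KOLKATA", 1600),
    ("14", "RISHIVALLEY", 1200),
    ("15", "MYSURU", 1500),
    ("16", "NIMHANS", 1500),
    ("17", "SJRI", 1950),
    ("18", "MUMBAI", 500),
]


def batch_ranges(centres):
    """Yield (prefix, name, first_line, last_line), 1-indexed and inclusive."""
    ends = accumulate(count for _, _, count in centres)
    for (prefix, name, count), end in zip(centres, ends):
        yield prefix, name, end - count + 1, end


def assign(codes, centres):
    """Return {centre name: list of 12-digit codes} for a list of 10-digit codes."""
    return {
        name: [prefix + code for code in codes[first - 1:last]]
        for prefix, name, first, last in batch_ranges(centres)
    }


if __name__ == "__main__":
    for prefix, name, first, last in batch_ranges(CENTRES):
        print(f"{prefix} {name}: lines {first}-{last}")
```

The batch sizes add up to 10,000, so the last range ends exactly at line 10000 of psc1-10-digit-3-zero.txt.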

PSC2 code generation

10-digit code generation

We used the Python script cveda_databank/psc/generate_psc2.py to generate 10-digit codes such that:

  • the first 2 digits are 0, followed by a non-zero digit,
  • the Damerau–Levenshtein distance between codes is at least 3,
  • the PSC2 codes of the Imagen project (file psc2-imagen.txt below) are taken into account as existing codes and not re-used in c-VEDA.
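The three constraints above can be illustrated with a simple rejection-sampling sketch in Python. This is not the code of generate_psc2.py: the function names `osa_distance`, `random_code` and `generate` are hypothetical, existing codes are passed as a list rather than read from standard input, and the distance is assumed to be the restricted (optimal string alignment) variant of the Damerau–Levenshtein distance, which may differ from the variant the actual script uses.

```python
import random


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def random_code(rng):
    """10-digit code: '00', then one non-zero digit, then 7 arbitrary digits."""
    tail = "".join(rng.choice("0123456789") for _ in range(7))
    return "00" + rng.choice("123456789") + tail


def generate(n, existing=(), min_distance=3, seed=None):
    """Return n new codes at distance >= min_distance from each other and from existing codes."""
    rng = random.Random(seed)
    taken = list(existing)  # existing codes are never re-used nor approached
    fresh = []
    while len(fresh) < n:
        candidate = random_code(rng)
        if all(osa_distance(candidate, code) >= min_distance for code in taken):
            taken.append(candidate)
            fresh.append(candidate)
    return fresh


if __name__ == "__main__":
    for code in generate(5, existing=["0012345678"], seed=0):
        print(code)
```

Each candidate is compared against every accepted code, so the cost grows quadratically with the number of codes. That is acceptable for an illustration but would be slow at the scale of 100,000 codes.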

We let this script run until it had produced more than 100,000 10-digit codes, then killed it:

cat psc1-10-digit-3-zero.txt psc2-imagen.txt | generate_psc2.py > psc2-10-digit-2-zero.tmp

Then we discarded codes containing a run of 4 identical digits, shuffled the remainder and kept exactly 100,000 10-digit PSC2 codes:

grep -Ev '(1111|2222|3333|4444|5555|6666|7777|8888|9999|0000)' psc2-10-digit-2-zero.tmp | shuf | head -n 100000 > psc2-10-digit-2-zero.txt
rm psc2-10-digit-2-zero.tmp
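For readers more comfortable with Python than with shell pipelines, the filtering step could be sketched as follows. The back-reference pattern matches the same codes as the alternation of ten 4-digit runs in the shell command. The names `REPEATED_RUN` and `filter_and_sample` and the `seed` parameter are hypothetical, introduced only for this illustration.

```python
import random
import re

# Any digit followed by three more copies of itself, i.e. a run of 4.
REPEATED_RUN = re.compile(r"(\d)\1{3}")


def filter_and_sample(codes, n, seed=None):
    """Drop codes with a run of 4 identical digits, shuffle, keep n codes."""
    kept = [code for code in codes if not REPEATED_RUN.search(code)]
    if len(kept) < n:
        raise ValueError(f"only {len(kept)} codes left, {n} requested")
    random.Random(seed).shuffle(kept)
    return kept[:n]
```

Unlike `head`, which silently returns fewer lines when the input is too short, this sketch raises an error if fewer than `n` codes survive the filter.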

PSC1–PSC2 conversion table

Since we have 10,000 PSC1 codes, we took the first 10,000 of the above 100,000 10-digit PSC2 codes, prepended “00” to obtain 12-digit codes, and shuffled them so that the conversion table cannot easily be inferred:

sed -n -e '1,10000{s/^/00/p}' psc2-10-digit-2-zero.txt | shuf > psc2.tmp

We also prepared a temporary file with the existing 10,000 PSC1 codes, sorted:

cat PSC1_*2016-07-05.txt | sort > psc1.tmp

Finally, we created the conversion table by pairing the two files line by line, and deleted the temporary files:

paste -d ',' psc1.tmp psc2.tmp | sort > psc2psc_2016-07-12.txt
rm psc1.tmp psc2.tmp 
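The pairing logic can be summarized in a short Python sketch, which may help when checking that the resulting table is a one-to-one mapping. The function name `conversion_table` and the `seed` parameter are hypothetical. The shell commands rely on `shuf`, which is not seeded.

```python
import random


def conversion_table(psc1_codes, psc2_codes, seed=None):
    """Return sorted 'PSC1,PSC2' lines pairing each PSC1 code with one PSC2 code."""
    psc1 = sorted(psc1_codes)
    if len(psc2_codes) < len(psc1):
        raise ValueError("not enough PSC2 codes for the PSC1 codes")
    # Take as many 10-digit PSC2 codes as there are PSC1 codes and
    # prepend "00" to obtain 12-digit codes.
    psc2 = ["00" + code for code in psc2_codes[:len(psc1)]]
    # Shuffle so that the pairing does not follow the order of either list.
    random.Random(seed).shuffle(psc2)
    return sorted(f"{a},{b}" for a, b in zip(psc1, psc2))
```

Because both lists have the same length and contain no duplicates, every PSC1 code maps to exactly one PSC2 code and vice versa.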