adding more info about image re-identification risk
carriewright11 committed Sep 25, 2024
1 parent 311eefb commit d0705cf
Showing 2 changed files with 41 additions and 2 deletions.
4 changes: 2 additions & 2 deletions 05-Data_Ethics.Rmd
@@ -58,11 +58,11 @@ While data sharing can result in wonderful opportunities for secondary analysis,
Overall there is a continuum of risk across the various types of data that we as researchers collect. While some forms of data, such as that derived from model organisms, pose essentially no risk, intermediate forms of data such as summarized counts across a set of human samples pose more risk, and raw data, in particular data from individuals such as whole genome sequencing data, pose great risk for identification [@byrd_responsible_2020].


```{r, fig.align='center', echo = FALSE, fig.alt= "re-identification risk is on a continuum. The image shows a double sided arrow that goes from green to red with the green side showing model organism data and the red side showing whole genome sequencing. It offers suggestsions for sharing different types of data, with public access to anyone for model organism data and images of internal tissues, public sharing of processed data for whole genome somatic varaints and RNA-Seq (expression estimates), Aggregate group sharing of data from exome-seq and DNA methylation data and controlled access (only for certain people) for whole genome germline data", out.width="100%"}
```{r, fig.align='center', echo = FALSE, fig.alt= "re-identification risk is on a continuum. The image shows a double sided arrow that goes from green to red with the green side showing model organism data and the red side showing whole genome sequencing. It offers suggestions for sharing different types of data, with public access to anyone for model organism data and images of certain tissues, public sharing of processed data for whole genome somatic variants and RNA-Seq (expression estimates), Aggregate group sharing of data from exome-seq and DNA methylation data and controlled access (only for certain people) for whole genome germline data", out.width="100%"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1SRokLaGAc2hiwJSN26FHE0ZEEhPr3KQdyMICic8kAcs/edit#slide=id.g302b08a6790_0_0")
```


Note that recent advances in AI show that individuals can now be re-identified from chest X-ray images [@packhauser_deep_2022]. In addition, some histopathology images are also re-identifiable; see @ganz_re-identification_2025 for guidance about how to share images more safely. These suggestions may be out-of-date or may not be in alignment with institutional regulations, so please consult with experts at your organization.

### Why does it matter that research subjects might be identifiable to others?

39 changes: 39 additions & 0 deletions book.bib
@@ -1119,4 +1119,43 @@ @article{beecher_ethics_1966
author = {Beecher, Henry K.},
month = jun,
year = {1966}
}



@article{packhauser_deep_2022,
title = {Deep learning-based patient re-identification is able to exploit the biometric nature of medical chest {X}-ray data},
volume = {12},
copyright = {2022 The Author(s)},
issn = {2045-2322},
url = {https://www.nature.com/articles/s41598-022-19045-3},
doi = {10.1038/s41598-022-19045-3},
abstract = {With the rise and ever-increasing potential of deep learning techniques in recent years, publicly available medical datasets became a key factor to enable reproducible development of diagnostic algorithms in the medical domain. Medical data contains sensitive patient-related information and is therefore usually anonymized by removing patient identifiers, e.g., patient names before publication. To the best of our knowledge, we are the first to show that a well-trained deep learning system is able to recover the patient identity from chest X-ray data. We demonstrate this using the publicly available large-scale ChestX-ray14 dataset, a collection of 112,120 frontal-view chest X-ray images from 30,805 unique patients. Our verification system is able to identify whether two frontal chest X-ray images are from the same person with an AUC of 0.9940 and a classification accuracy of 95.55\%. We further highlight that the proposed system is able to reveal the same person even ten and more years after the initial scan. When pursuing a retrieval approach, we observe an mAP@R of 0.9748 and a precision@1 of 0.9963. Furthermore, we achieve an AUC of up to 0.9870 and a precision@1 of up to 0.9444 when evaluating our trained networks on external datasets such as CheXpert and the COVID-19 Image Data Collection. Based on this high identification rate, a potential attacker may leak patient-related information and additionally cross-reference images to obtain more information. Thus, there is a great risk of sensitive content falling into unauthorized hands or being disseminated against the will of the concerned patients. Especially during the COVID-19 pandemic, numerous chest X-ray datasets have been published to advance research. Therefore, such data may be vulnerable to potential attacks by deep learning-based re-identification algorithms.},
language = {en},
number = {1},
urldate = {2024-09-25},
journal = {Scientific Reports},
author = {Packhäuser, Kai and Gündel, Sebastian and Münster, Nicolas and Syben, Christopher and Christlein, Vincent and Maier, Andreas},
month = sep,
year = {2022},
note = {Publisher: Nature Publishing Group},
keywords = {Computer science, Medical ethics, Radiography},
pages = {14851},
}


@article{ganz_re-identification_2025,
title = {Re-identification from histopathology images},
volume = {99},
issn = {1361-8415},
url = {https://www.sciencedirect.com/science/article/pii/S1361841524002603},
doi = {10.1016/j.media.2024.103335},
abstract = {In numerous studies, deep learning algorithms have proven their potential for the analysis of histopathology images, for example, for revealing the subtypes of tumors or the primary origin of metastases. These models require large datasets for training, which must be anonymized to prevent possible patient identity leaks. This study demonstrates that even relatively simple deep learning algorithms can re-identify patients in large histopathology datasets with substantial accuracy. In addition, we compared a comprehensive set of state-of-the-art whole slide image classifiers and feature extractors for the given task. We evaluated our algorithms on two TCIA datasets including lung squamous cell carcinoma (LSCC) and lung adenocarcinoma (LUAD). We also demonstrate the algorithm’s performance on an in-house dataset of meningioma tissue. We predicted the source patient of a slide with F1 scores of up to 80.1\% and 77.19\% on the LSCC and LUAD datasets, respectively, and with 77.09\% on our meningioma dataset. Based on our findings, we formulated a risk assessment scheme to estimate the risk to the patient’s privacy prior to publication.},
urldate = {2024-09-25},
journal = {Medical Image Analysis},
author = {Ganz, Jonathan and Ammeling, Jonas and Jabari, Samir and Breininger, Katharina and Aubreville, Marc},
month = jan,
year = {2025},
keywords = {Deep learning, Digital pathology, Re-identification},
pages = {103335},
}
