Automatic de-identification of Data Download Packages

Tracking #: 693-1673

Authors:

	Name	ORCID
	Laura Boeschoten	https://orcid.org/0000-0002-3536-0474
	Roos Voorvaart	https://orcid.org/0000-0002-4411-8495
	Ruben Van Den Goorbergh	https://orcid.org/0000-0003-3229-3015
	Casper Kaandorp	https://orcid.org/0000-0001-6326-6680
	Martine G De Vos	https://orcid.org/0000-0001-5301-1713

Responsible editor:

Thomas Maillart

Submission Type:

Resource Paper

Abstract:

The General Data Protection Regulation (GDPR) grants all natural persons the right of access to their personal data if these are being processed by data controllers. The data controllers are obliged to share the data in an electronic format and often provide the data in a so-called Data Download Package (DDP). These DDPs contain all data collected by public and private entities during the course of citizens' digital lives and form a treasure trove for social scientists. However, the data can be deeply private. To protect the privacy of research participants while using their DDPs for scientific research, we developed de-identification software that is able to handle typical characteristics of DDPs, such as regularly changing file structures, visual and textual content, different file formats, and the presence of usernames. We investigate the performance of the software and illustrate how the software can be tailored towards specific DDP structures.

Manuscript:

ds-paper-693.pdf

Revised Version:

Automatic de-identification of Data Download Packages

Data repository URLs:

All software is available at: https://github.com/UtrechtUniversity/anonymize-ddp

The validation dataset is available at: http://doi.org/10.5281/zenodo.4472606

Date of Submission:

Thursday, April 29, 2021

Date of Decision:

Wednesday, May 26, 2021

Nanopublication URLs:

Decision:

Undecided

Solicited Reviews:

Review #1 submitted on 11/May/2021

By Margherita Martorana

https://orcid.org/0000-0001-8004-0464

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Good
Presentation: Average
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper describes a new tool for de-identification of PII in specific types of data files (e.g. DDPs), with Instagram as the data provider. The paper introduces this new software and explains the benefits it could bring to new researchers, and I found the content of the manuscript very interesting overall. Nevertheless, there are various grammatical errors and I found the paper quite hard to read. Multiple sentences are long and difficult to understand. Moreover, I think that the sections "Evaluation" and "Results" need further work: some results are presented in the "Evaluation" section, and I found the "Results" section to be lacking quantitative results and statistics.

Reasons to accept:

The software presented has clear benefits in the field of data science, and I believe could be used by various researchers.

Reasons to reject:

Overall, the manuscript is hard to read. I believe that it needs some more work on its format and syntax.

Nanopublication comments:

Further comments:

Review #2 submitted on 18/May/2021

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Average
Presentation: Average
Reviewer's confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper presents a piece of software for de-identifying data download packages. The proposed solution combines a number of heuristics: it blurs faces in pictures, replaces usernames, phone numbers, etc. with pseudonyms (to preserve some utility), and so on.

Reasons to accept:

Useful piece of software, basic but ok ideas.

Reasons to reject:

Not really research. Very basic. Limited evaluation.

Nanopublication comments:

Further comments:

Overall, I enjoyed reading the paper and I find it to be a nice contribution to the community.

My main concerns with the paper are the following:
- (Automatically) redacting documents/data is not a new problem. No related work on this is provided. In a slightly different context, solutions for detecting/removing sensitive information about to be sent over a network were proposed in the past.
- The utility aspects are not covered. Are the resulting DDPs still useful? To assess this, the authors should explain what data scientists typically need from DDPs and why they collect them in the first place. Also, the general pipeline for collecting DDPs is not clear (users do it and then pass them to the researchers, I guess; it should be explained that the de-identification would be done on the user side). It is also not clear how utility is preserved for studies that focus on the *interaction* between users. It should be made clearer how interaction/relation information is preserved.
- The privacy goals are not clear. Whose privacy should be protected: that of the DDP owner or that of the mentioned individuals? If it's the owner, I'm afraid a simple Google (image) search would still re-identify the original data (if it's public) from the redacted data. For instance, a Google image search fed with the picture with the blurred face would probably return the original image.
- Sections 2.1 and 2.2 lack references to (official) documentation
- The dataset is relatively small. Also, except for some data, the annotation was done by a single annotator; this is not very robust.
- Not sure Section 4.4 is needed for the audience of a journal on Data Science.

Review #3 submitted on 19/May/2021

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Incomplete or inappropriate
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The submitted manuscript presents a new tool for anonymizing data download packages (DDPs) released by online service providers (data controllers) in the context of the GDPR. The main motivation behind this tool is to enable researchers to make use of these DDPs for scientific purposes. The provided tool is tested with 11 participants creating fake Instagram profiles and actively using them for about a week.

Reasons to accept:

- The work is overall well presented and detailed, with a clear motivation and a sound approach
- The empirical results of the improved script are excellent
- Both the scripts and dataset are made open source

Reasons to reject:

- Limited fit with the journal scope
- Missing relevant related work
- Tool tailored to a specific application (Instagram)
- The method used to hide faces in images/videos was previously shown to be prone to re-identification attacks

Nanopublication comments:

Further comments:

Overall, I liked reading this paper, which targets a timely problem with DDP data sharing. The results are very good overall, especially those of the improved script, which is unfortunately specific to one application, namely Instagram (but there is a clear trade-off between application specificity and accuracy of the de-identification). Only the results of face de-identification in videos still need to be improved.

On the downside, it is unclear how the submitted manuscript relates to data science and if the proposed de-identification method relates to the journal's aims and scope.

Second, the authors missed several highly related works/software tools that provide anonymization of research data:
https://arx.deidentifier.org
https://amnesia.openaire.eu
https://cran.r-project.org/web/packages/sdcMicro/index.html
I encourage the authors to cite them and position their work with respect to them.

Third, as mentioned already, the tool, in particular the improved script, is tailored to a specific application, which limits its scope and impact. It would be good to discuss how the proposed improvements could also apply in other contexts, such as other social networks or applications.

Fourth, the method used for de-identifying faces in media (photos and videos), blurring, has been shown to be prone to re-identification attacks (using deep learning) by McPherson et al.: https://arxiv.org/abs/1609.00408. As a consequence, I would use more robust methods for de-identifying faces in images and videos. This could be a key novel contribution of the paper, which currently lacks strong technical contributions.

Besides, I encourage the authors to proofread their paper and correct the numerous typos (including in subsection titles).

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Wed, 05/26/2021 - 16:03

Three reviewers have carefully reviewed the manuscript. Their impression is really positive and overall the manuscript is considered a good fit for this journal, in particular because of the potential use of the software by other researchers (#R1, #R2, #R3) and its adherence to open source standards (#R3).
Yet, there are a few outstanding issues to be addressed by the authors:
a. scientific aspects: #R2 questioned the science part of the paper, in particular the related work (also pointed out by #R3) and the limited evaluation of the approach. #R3 also points out limits of de-identification, with possible re-identification attacks. The authors shall explain, and possibly run tests to show, that their approach is robust against such attacks.
b. scope and impact: #R3 questions the generality of the results, as the tool is only tested with Instagram. #R2 also questions the utility of a DDP once it has gone through the de-identification method. The privacy goals also remain unclear overall (#R2 and #R3). In a nutshell, the manuscript needs to be better grounded in the literature, the pertinence of the approach should be better delineated, and its utility backed by further evidence.
c. clarity and polishing: #R1 and #R3 consider that the manuscript is hard to read and would require heavy polishing. #R1 also points out some confusion between the sections “Evaluation” and “Results”.

Thomas Maillart (https://orcid.org/0000-0002-5747-9927)
