Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Incomplete or inappropriate
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
The submitted manuscript presents a new tool for anonymizing data download packages (DDPs) released by online service providers (data controllers) in the context of the GDPR. The main motivation behind this tool is to enable researchers to make use of these DDPs for scientific purposes. The tool was evaluated with 11 participants, who created fake Instagram profiles and actively used them for about a week.
Reasons to accept:
- The work is overall well presented and detailed, with a clear motivation and a sound approach
- The empirical results of the improved script are excellent
- Both the scripts and dataset are made open source
Reasons to reject:
- Limited fit with the journal scope
- Missing relevant related work
- Tool tailored to a specific application (Instagram)
- The method used to hide faces in images/videos has previously been shown to be prone to re-identification attacks
Nanopublication comments:
Further comments:
I overall liked reading this paper, which targets a timely problem with DDP data sharing. The results are overall very good, especially those of the improved script, which is unfortunately specific to one application, namely Instagram (but there is a clear trade-off between application specificity and accuracy of the de-identification). Only the results of face de-identification in videos still need to be improved.
On the downside, it is unclear how the submitted manuscript relates to data science and whether the proposed de-identification method fits the journal's aims and scope.
Second, the authors missed several pieces of highly related work and software that provide anonymization tools for research data:
https://arx.deidentifier.org
https://amnesia.openaire.eu
https://cran.r-project.org/web/packages/sdcMicro/index.html
I encourage the authors to cite them and position their work with respect to them.
Third, as mentioned already, the tool, in particular the improved script, is tailored to a specific application, which limits its scope and impact. It would be good to discuss how the proposed improvements could also apply in other contexts, such as other social networks or applications.
Fourth, the method used for de-identifying faces in media (photos and videos), namely blurring, has been shown to be prone to re-identification attacks (using deep learning) by McPherson et al.: https://arxiv.org/abs/1609.00408. As a consequence, I would use more robust methods for de-identifying faces in images and videos. This could become a key novel contribution of the paper, which currently lacks strong technical contributions.
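To make this point concrete, here is a minimal, hypothetical sketch (not taken from the manuscript; function names and parameters are illustrative) contrasting blurring with irreversible redaction on a grayscale image represented as a list of lists. Blurring only low-pass filters the pixels, so identity cues survive and can be exploited by learned attacks; overwriting the region with a constant leaves nothing to invert:

```python
def blur_region(img, box, k=3):
    """Mean-blur a rectangular region of a grayscale image.

    Blurring is a low-pass filter: the blurred pixels remain
    correlated with the originals, which is why deep-learning
    attacks (McPherson et al. 2016) can re-identify faces.
    """
    x0, y0, x1, y1 = box
    out = [row[:] for row in img]
    r = k // 2
    h, w = len(img), len(img[0])
    for y in range(y0, y1):
        for x in range(x0, x1):
            # Average a k x k neighborhood, clamping at the borders.
            vals = [img[min(max(yy, 0), h - 1)][min(max(xx, 0), w - 1)]
                    for yy in range(y - r, y + r + 1)
                    for xx in range(x - r, x + r + 1)]
            out[y][x] = sum(vals) // len(vals)
    return out


def redact_region(img, box, fill=0):
    """Overwrite the region with a constant value.

    No information from the original pixels survives, so the
    operation cannot be inverted by any attack.
    """
    x0, y0, x1, y1 = box
    out = [row[:] for row in img]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = fill
    return out
```

The residual correlation between a blurred region and the original is exactly the signal a re-identification model trains on; full redaction (or face replacement) removes it entirely, at the cost of some visual utility.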
Besides, I encourage the authors to proofread their paper and correct the numerous typos (including in subsection titles).
Meta-Review by Editor
Submitted by Tobias Kuhn on
Three reviewers have carefully reviewed the manuscript. Their impression is really positive and overall the manuscript is considered a good fit for this journal, in particular regarding the potential use of the software by other researchers (#R1, #R2, #R3) and its adherence to open-source standards (#R3).
Yet, there are a few outstanding issues to be addressed by the authors:
a. scientific aspects: #R2 questioned the scientific part of the paper, in particular the related work (also pointed out by #R3) and the limited evaluation of the approach. #R3 also points out the limits of de-identification, with possible re-identification attacks. The authors should explain, and possibly run tests, to assert that their approach is robust against such attacks.
b. scope and impact: #R3 questions the generality of the results, as the tool is only tested with Instagram. #R2 also questions the utility of a DDP once it has gone through the de-identification method. The privacy goals also remain overall unclear (#R2 and #R3). In a nutshell, the manuscript needs to be better grounded in the literature, the pertinence of the approach should be better delineated, and its utility backed by further evidence.
c. clarity and polishing: #R1 and #R3 consider that the manuscript is hard to read and would require heavy polishing. #R1 also points out some confusion between the sections “Evaluation” and “Results”.
Thomas Maillart (https://orcid.org/0000-0002-5747-9927)