Automatic de-identification of Data Download Packages

Tracking #: 693-1673


Responsible editor: 

Thomas Maillart

Submission Type: 

Resource Paper

Abstract: 

The General Data Protection Regulation (GDPR) grants all natural persons the right of access to their personal data when these data are processed by data controllers. Data controllers are obliged to share the data in an electronic format and often provide them in a so-called Data Download Package (DDP). These DDPs contain all data collected by public and private entities during the course of citizens' digital lives and form a treasure trove for social scientists. However, the data can be deeply private. To protect the privacy of research participants while using their DDPs for scientific research, we developed de-identification software that can handle typical characteristics of DDPs, such as regularly changing file structures, visual and textual content, different file formats and file structures, and the presence of usernames. We investigate the performance of the software and illustrate how it can be tailored to specific DDP structures.

Manuscript: 

Tags: 

  • Reviewed

Data repository URLs: 

All software is available at: https://github.com/UtrechtUniversity/anonymize-ddp

The validation dataset is available at: http://doi.org/10.5281/zenodo.4472606

Date of Submission: 

Thursday, April 29, 2021

Date of Decision: 

Wednesday, May 26, 2021


Nanopublication URLs:

Decision: 

Undecided

Solicited Reviews:



Meta-Review by Editor

Three reviewers have carefully reviewed the manuscript. Their overall impression is positive, and the manuscript is considered a good fit for this journal, in particular because of the software's potential use by other researchers (#R1, #R2, #R3) and its adherence to open source standards (#R3).
Yet there are a few outstanding issues to be addressed by the authors:
a. scientific aspects: #R2 questioned the scientific part of the paper, in particular the related work (also pointed out by #R3) and the limited evaluation of the approach. #R3 also points out the limits of de-identification, given possible re-identification attacks. The authors should explain, and possibly run tests to demonstrate, that their approach is robust against such attacks.
b. scope and impact: #R3 questions the generality of the results, as the tool is only tested with Instagram. #R2 questions the utility of a DDP once it has gone through the de-identification method. The privacy goals also remain unclear overall (#R2 and #R3). In a nutshell, the manuscript needs to be better grounded in the literature, the pertinence of the approach should be better delineated, and its utility should be backed by further evidence.
c. clarity and polishing: #R1 and #R3 consider the manuscript hard to read and in need of heavy polishing. #R1 also points out some confusion between the sections “Evaluation” and “Results”.

Thomas Maillart (https://orcid.org/0000-0002-5747-9927)