Automatic de-identification of Data Download Packages

Tracking #: 698-1678

Responsible editor: 

Thomas Maillart

Submission Type: 

Research Paper


The General Data Protection Regulation (GDPR) grants all natural persons the right to access their personal data if these data are being processed by data controllers. The data controllers are obliged to share the data in an electronic format and often provide them in a so-called Data Download Package (DDP). These DDPs contain all data collected by public and private entities during the course of a citizen's digital life and form a treasure trove for social scientists. However, the data can be deeply private. To protect the privacy of research participants while using their DDPs for scientific research, we developed a de-identification algorithm that is able to handle typical characteristics of DDPs. These include regularly changing file structures, visual and textual content, differing file formats, differing file structures, and private information such as usernames. We investigate the performance of the algorithm and illustrate how it can be tailored towards specific DDP structures.
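To make the idea of username de-identification in nested DDP content concrete, the following is a minimal sketch and not the algorithm described in the paper. It assumes a DDP file has already been parsed into nested dictionaries and lists (for example from JSON) and that the set of usernames to hide is known in advance. The function names `pseudonym` and `deidentify`, the salt value, and the `user_` prefix are all hypothetical choices made for illustration.

```python
from __future__ import annotations

import hashlib
import re
from typing import Any


def pseudonym(username: str, salt: str = "demo-salt") -> str:
    """Map a username to a stable pseudonym (hypothetical scheme)."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return "user_" + digest[:8]


def deidentify(node: Any, usernames: set[str], salt: str = "demo-salt") -> Any:
    """Recursively replace known usernames in all string values.

    Dictionary keys and non-string values are left unchanged; this is an
    assumption of the sketch, not a property of the published algorithm.
    """
    if isinstance(node, dict):
        return {key: deidentify(value, usernames, salt) for key, value in node.items()}
    if isinstance(node, list):
        return [deidentify(value, usernames, salt) for value in node]
    if isinstance(node, str):
        # Longest names first, so a short name cannot split a longer one.
        for name in sorted(usernames, key=len, reverse=True):
            # Match the name only when it is not part of a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            replacement = pseudonym(name, salt)
            node = re.sub(pattern, lambda _match: replacement, node)
        return node
    return node


example = {
    "messages": [{"sender": "alice_92", "text": "hi @bob.k, it's alice_92"}],
    "count": 2,
}
cleaned = deidentify(example, {"alice_92", "bob.k"})
```

Because the walk is driven by the value types rather than by fixed field names, a sketch like this keeps working when the file structure changes between DDP versions. Images and names that are not known in advance are outside its scope.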


Supplementary Files (optional): 

Previous Version: 


  • Reviewed

Data repository URLs: 

The de-identification algorithm is available at

The validation set containing 11 Instagram DDPs is available at

Date of Submission: 

Friday, July 2, 2021

Date of Decision: 

Monday, July 26, 2021

Nanopublication URLs:



Solicited Reviews:


Meta-Review by Editor

We are pleased to inform you that your paper has been accepted for publication, under the condition that you address the remaining minor issues, in particular, those pointed out by Reviewer #1.

Thomas Maillart