A Systematic Review on Privacy-Preserving Distributed Data Mining

Tracking #: 699-1679


Responsible editor: 

Karin Verspoor

Submission Type: 

Survey Paper

Abstract: 

Combining and analysing sensitive data from multiple sources offers considerable potential for knowledge discovery. However, a number of issues pose problems for such analyses, including technical barriers, privacy restrictions, security concerns, and trust issues. Privacy-preserving distributed data mining (PPDDM) techniques aim to overcome these challenges by extracting knowledge from partitioned data while minimizing the release of sensitive information. This paper reports the results and findings of a systematic review of PPDDM techniques from 231 scientific articles published in the past 20 years. We summarize the state of the art, compare the problems they address, and identify the outstanding challenges in the field. This review identifies the consequences of the lack of standard criteria for evaluating new PPDDM methods and proposes comprehensive evaluation criteria covering 10 key factors. We discuss the ambiguous definitions of privacy and the confusion between privacy and security in the field, and offer suggestions on how to formulate a clear and applicable privacy description for new PPDDM techniques. The findings from our review enhance the understanding of the challenges of applying theoretical PPDDM methods to real-life use cases, and of the importance of involving legal-ethical and social experts in implementing PPDDM methods. This comprehensive review will serve as a helpful guide to past research and future opportunities in the area of PPDDM.

Tags: 

  • Reviewed

Date of Submission: 

Monday, July 5, 2021

Date of Decision: 

Monday, August 16, 2021

Decision: 

Accept

Solicited Reviews:



Meta-Review by Editor

Overall, the paper presents a review of an important and current topic, particularly in the context of analysis of linked data sets that combine different attributes of individuals, and draws broad conclusions for the direction of research in this area.

Specifically, when preparing the revised version please address the comments from Reviewer 2 related to the trade-offs between privacy preservation and efficiency, and the interaction between the limitations of the review and the identified lack of experiments.

Karin Verspoor (https://orcid.org/0000-0002-8661-1544)

Review round 2 - Response to the reviewer

Review Round 2 - 699-1679: Chang Sun, Lianne Ippel, Andre Dekker, Michel Dumontier, Johan van Soest. A Systematic Review on Privacy-Preserving Distributed Data Mining ('Survey Paper')
Reply to the comments from reviewer 2:
**Comment B2.1:** Figure 2.b has two steps #3. There are extra parentheses in the blue text below Party C.
**Reply B2.1:** *Thank you for pointing this out. We corrected Figure 2.b in the revised version.*

**Comment B2.2:** Figure 3 mentions papers were searched for the date range 2000-2018. Section 3.1 mentions 2000-2020.
**Reply B2.2:** *Thank you for noticing this mistake in Figure 3. We corrected it to 2000-2020 in the revised paper.*

**Comment B2.3:** With respect to comment 2.24 (and reply 2.24), I understand and agree with the authors' reasoning; however, this is only a valid direction for future work if it is not currently discussed (in current works). I do not question the importance of the trade-off, nor the fact that some situations might require it. What is striking to me (as someone who did not conduct a systematic literature review on the topic -- I must admit) is that the trade-off between privacy and learning performance is one of the most discussed topics in works about PPDDM. As you mention that "some of the reviewed papers did recognise the trade-off of privacy preservation and efficiency in general", it sounds as if only a fraction (a minority) of these 231 papers discuss it. If that is the case, I believe this deserves more emphasis. I suggest the authors explicitly mention this in the conclusion, in order to justify their suggestion for future work as a valid and unexplored direction.
**Reply B2.3:** We highly appreciate the reviewer's comment and follow-up explanation. We agree with the reviewer that the trade-off between privacy and learning efficiency has been well recognized in PPDDM studies and remains an open problem in this research domain. Given the evaluation metrics of this review, we did examine the privacy and efficiency of the reviewed studies but did not deeply analyze the trade-off between them. Our review results do not present sufficiently strong evidence to support this conclusion. Therefore, instead of stating this point in the conclusion, we decided to move it to the discussion section as a potential limitation of our review. *"Additionally, it has been well recognized that there is an important trade-off between leakage of information and the effectiveness or efficiency of learning in PPDDM technologies [14, 27, 90, 128]. In practice, it is crucial to balance this trade-off depending on the specific use case, the purpose of the data analysis, and the urgency of the problem. Although we included privacy and efficiency factors in our review, we did not further investigate how each method weighs the trade-off between them. For example, we did not measure how much, and in which way, information loss was tolerated to increase efficiency. We believe this specific trade-off between privacy (information leakage) and learning performance (effectiveness or efficiency) deserves further investigation."*

**Comment B2.4:** With respect to comment 2.25 (and reply 2.25), I appreciate the addition of subsection 5.6 discussing the limitations. I do feel, however, that the lack of experiments (currently in subsection 5.3) should be discussed in light of the limitations, especially because after reading about the limitations in 5.6 one might question the validity of the findings discussed a few paragraphs above. As you mentioned that some follow-up studies were included in the review, you already have some evidence to discuss the implications of this limitation. For instance, do these follow-up studies contain experiments? Are they follow-ups of studies which already contained experiments and evaluations of their methods?
**Reply B2.4:** We agree with the reviewer that the lack-of-experiments finding should be discussed in light of the limitations of our search strategy. We have revised subsection 5.3 to address the reviewer's comment. Most of these follow-up studies extended the previous methods to address a different data partitioning scheme (e.g., from horizontally to vertically partitioned data), to apply to other data analysis algorithms (e.g., from linear to non-linear SVM algorithms), or to cover more complicated user scenarios (e.g., from two parties to more than two parties, or from semi-honest to malicious parties). We found only a few papers without experiments whose follow-up studies (not included in our review) presented experiments. We added this to Subsection 5.6 (Potential Limitations). *"Nevertheless, these findings were observed in light of the limitations of our search strategy, which are elaborated in Section 5.6. This review did not specifically search for follow-up studies of the reviewed papers. A possible effect is that papers lacking experiments might present their experiments in follow-up studies, which could introduce a selection bias towards a low number of practical experiments. However, we would argue that our search strategy would have found these papers if proper terminology had been used."*