Reviewer has chosen not to be Anonymous

Overall Impression: Undecided

Technical Quality of the paper: Clear novelty

Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories

Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
This work presents a systematic literature review on the topic of privacy-preserving techniques for distributed data mining/ML. The authors select and review 231 papers on this topic, evaluating each on the basis of 10 factors the authors identify as key to those contributions. The factors range from characteristics of the adversary model, to the types of data mining/ML problems tackled, to the merits of the contribution (such as the type of data used in the experiments, complexity, scalability, and comparison of accuracy with other papers). As a conclusion to the work, the authors discuss the (lack of a) concrete definition of privacy and security in those works, and how contributions seem to assume different interpretations of them. Moreover, the authors also suggest a “template” to be followed by other authors in the future when writing papers on this topic.
Reasons to accept:
The work tackles an important topic with unstructured contributions, which are normally difficult to navigate. The factors proposed by the authors provide this missing structure, and I can see how this would help researchers navigate the plethora of contributions. In particular, it helps researchers who are not familiar with security/privacy to choose the appropriate techniques for their own distributed data mining problems. The work also discusses the lack of a clear interpretation of ‘security’ and ‘privacy’ in the field, which is an interesting outcome of the literature review.
Reasons to reject:
Despite presenting a systematic literature review, the paper is rather vague about some steps of the inclusion/exclusion of works. In particular, in the exclusion phase the authors refer to exclusion criteria (Figure 2, steps 2-5) that were only implicitly presented in the text (i.e., papers covering topics X, Y, Z), with no explanation of how the exclusion was systematically conducted. For instance, were the topics searched for in the abstract or the whole body? How were the topics searched for: by keywords, or by the reader’s interpretation? How many people did the screening for these topics?
Another missing step of the review is the backward/forward reference search, which is a well-accepted technique for finding valid additional literature. Finally, I also miss details about how the 10 factors were generated. It looks like they emerged from the 231 papers, but they were also used to categorise the same 231 papers. I imagine this means several iterations over these papers were conducted in order to finish the review (perhaps by multiple researchers), but that is not explained in the paper.
- Citations are glued to the text (“… efficiency[1,2]”); there should be a space between the text and the brackets.
- On page 4, lines 29-32, ‘Secure set union’ has an unclear explanation.
- On page 6, lines 9-10, links to databases are not relevant bibliography for the work, and should be presented as footnotes.
- On page 6, lines 34-35, why not fully-honest as well? Did that not happen in any paper? I would be curious to know, as it seems this could be a limitation of the paper.
- On page 6, lines 40-41, if I had to guess, I would say ‘untrusted’, ‘non-trusting’, and ‘non-collaborative’ refer to malicious behaviour. Does this mean you preferred not to classify them, or that they do not present a formal adversarial model?
- On page 9, in Figure 2, the steps are all numbered the same, and it is unclear why the exclusion part begins at step 2.
- On page 10, lines 6-9, some exclusion criteria are given here, but how were they systematically executed? Do they refer to step 2? How was step 2 executed: manually, or by automatically searching those keywords? In the body/abstract/keywords of the papers?
- On page 10, in Figure 3’s caption, it is unclear what "relations" means here; it only becomes clear when you read the text below.
- On page 10, lines 42-44, this does not look like a conclusion I can draw from the graph. Were the 10 metrics used to generate the graph somehow? This comment looks misplaced.
- On page 10, lines 10-11, does that mean they assume the worst?
- On page 11, the footnote link does not work.
- On page 12, line 37, what does "manner" mean here? A change of terminology?
- On page 13, line 25, “In the rest [of the] papers”: “of the” is missing.
- On page 14, line 40, "SMC.[69, 70] combined”: remove the dot.
- On page 17, the Discussion section could use more structure (sub-sections).
- On page 18, line 1, "privacy violence”: do you mean violation?
- On page 18, lines 23-24, the comment on decision trees needs more explanation.
- On page 18, lines 27-29 and line 34, this might arise from the fact that any information can potentially be used to infer sensitive information about people or organisations participating in the distributed DM/ML scheme. For instance, you cannot foresee the impact of linkage attacks without knowing the data present in other datasets, and obtaining this knowledge is nearly impossible. I would guess this is why most papers refrain from distinguishing which data is more or less sensitive, instead treating every single piece of data as equally protection-worthy.
- On page 18, lines 38-39, this could also be a reflection of the lack of a forward reference search. What if there are follow-up works with experiments?
- On page 19, lines 11-12, a reference is missing for the national citizen identifier in the NL.
- On page 21, lines 27 and 32, there is a strange extra hyphen in the first column of the table.
- On page 22, lines 14-15, I don't understand how the authors came to this conclusion; was it something not discussed in the reviewed papers?
- On page 22, the work lacks a discussion of its limitations.