Reviewer has chosen not to be Anonymous

Overall Impression: Undecided

Technical Quality of the paper: Clear novelty

Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories

Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
This work presents a systematic literature review on the topic of privacy-preserving techniques for distributed data mining/ML. The authors select and review 231 papers on this topic, evaluating each on the basis of 10 factors the authors identify as key to those contributions. The factors range from characteristics of the adversary model, to the types of data mining/ML problems tackled, to the merits of the contribution (such as the type of data used in the experiments, complexity, scalability, and comparison of accuracy with other papers). As a conclusion to the work, the authors discuss the (lack of a) concrete definition of privacy and security in those works, and how contributions seem to assume different interpretations of them. Moreover, the authors also suggest a “template” to be followed by other authors in the future when writing papers on this topic.
Reasons to accept:
The work tackles an important topic with unstructured contributions, which are normally difficult to navigate. The factors proposed by the authors provide this missing structure, and I can see how this would help researchers navigate the plethora of contributions. In particular, it helps researchers who are not familiar with security/privacy to choose the appropriate techniques for their own distributed data mining problems. The work also discusses the lack of a clear interpretation of ‘security’ and ‘privacy’ in the field, which is an interesting outcome of the literature review.
Reasons to reject:
Despite presenting a systematic literature review, the paper is rather vague about some steps of the inclusion/exclusion of works. In particular, in the exclusion phase the authors refer to exclusion criteria (Figure 2, steps 2-5) that were only implicitly presented in the text (i.e., papers covering topics X, Y, Z), with no explanation of how the exclusion was systematically conducted. For instance, were the topics searched for in the abstract or the whole body? How were the topics searched for: by keywords, or by the reader’s interpretation? How many people did the screening for these topics?
Another missing step of the review is the backward/forward reference search, which is a well-accepted technique for finding valid additional literature. Finally, I also miss details about how the 10 factors were generated. It looks like they emerged from the 231 papers, but they were also used to categorise the same 231 papers. I imagine this means several iterations over these papers were conducted in order to finish the review (perhaps by multiple researchers), but that is not explained in the paper.
- Citations are glued to the text (“… efficiency[1,2]”); there should be a space between the text and the brackets.
- On page 4, lines 29-32, ‘Secure set union’ has an unclear explanation.
- On page 6, lines 9-10, links to databases are not relevant bibliography for the work, and should be presented as footnotes.
- On page 6, lines 34-35, why not fully-honest as well? Did that not happen in any paper? I would be curious to know, as it seems this could be a limitation of the paper.
- On page 6, lines 40-41, if I had to guess, I would say ‘untrusted’, ‘non-trusting’, and ‘non-collaborative’ refer to malicious behaviour. Does this mean you preferred not to classify them, or that they do not present a formal adversarial model?
- On page 9, in Figure 2, the steps are all numbered the same, and it is unclear why the exclusion part begins at step 2.
- On page 10, lines 6-9, some exclusion criteria are given here, but how were they systematically executed? Do they refer to step 2? How was step 2 executed: manually, or by automatically searching those keywords? In the body/abstract/keywords of the papers?
- On page 10, in Figure 3’s caption, it is unclear what "relations" means here; it only becomes clear when you read the text below.
- On page 10, lines 42-44, this does not look like a conclusion I can draw from the graph. Were the 10 metrics used to generate the graph somehow? This comment looks misplaced.
- On page 10, lines 10-11, does that mean they assume the worst?
- On page 11, the footnote link does not work.
- On page 12, line 37, what does "manner" mean here? A change of terminology?
- On page 13, line 25, “In the rest [of the] papers”: “of the” is missing.
- On page 14, line 40, "SMC.[69, 70] combined”: remove the dot.
- On page 17, the Discussion section could use more structure (sub-sections).
- On page 18, line 1, "privacy violence”: do you mean violation?
- On page 18, lines 23-24, the comment on decision trees needs more explanation.
- On page 18, lines 27-29 and line 34, this might arise from the fact that any information can potentially be used to infer sensitive information about people or organisations participating in the distributed DM/ML scheme. For instance, you cannot foresee the impact of linkage attacks without knowing the data present in other datasets, and obtaining this knowledge is nearly impossible. I would guess this is why most papers refrain from distinguishing which data is more or less sensitive, instead treating every single piece of data as equally protection-worthy.
- On page 18, lines 38-39, this could also be a reflection of the lack of a forward reference search. What if there are follow-up works with experiments?
- On page 19, lines 11-12, a reference is missing for the national citizen identifier in the NL.
- On page 21, lines 27 and 32, there is a strange extra hyphen in the first column of the table.
- On page 22, lines 14-15, I don't understand how the authors came to this conclusion; was it something not discussed in the reviewed papers?
- On page 22, the work lacks a discussion of its limitations.