Maintaining intellectual diversity in data science

Tracking #: 432-1412

Authors:

	Name	ORCID
	Richard Mann	https://orcid.org/0000-0003-0701-1274
	Olivia Woolley-Meza	https://orcid.org/0000-0003-4517-2765

Responsible editor:

Tobias Kuhn

Submission Type:

Position Paper

Abstract:

Data science is a young and rapidly expanding field, but one which has already experienced several waves of temporarily-ubiquitous methodological fashions. In this paper we argue that a diversity of ideas and methodologies is crucial for the long term success of the data science community. Towards the goal of a healthy, diverse ecosystem of different statistical models and approaches, we review how ideas spread in the scientific community and the role of incentives in influencing which research ideas scientists pursue. We conclude with suggestions for how universities, research funders and other actors in the data science community can help to maintain a rich, eclectic statistical environment.

Manuscript:

ds-paper-432.pdf

Data repository URLs:

None

Date of Submission:

Sunday, March 5, 2017

Date of Decision:

Monday, March 27, 2017

Nanopublication URLs:

Decision:

Solicited Reviews:

Review #1 submitted on 14/Mar/2017

By Jodi Schneider

https://orcid.org/0000-0002-5098-5667

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Average
Reviewer's confidence: High
Significance: High significance
Background: Comprehensive
Novelty: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper advocates for intellectual diversity in data science, drawing inspiration from ensemble methods and the benefits of aggregating over diverse (rather than homogeneous) crowds. Points about the "undercurrents of fashion and conformism in the methods researchers are expected to use" are well-taken, and this is made concrete with specific evidence on the relevance of diversity from a number of different disciplines. Similarly, the downsides of rewarding individual accuracy are usefully discussed.

Reasons to accept:

The overall statistical perspective here is probably uncontroversial in a certain (statistical) audience but deserves attention from a wide, general audience, which this paper is well-pitched to reach. The authors' expertise is usefully built upon, especially in Section 1 and in the Conclusion.

Reasons to reject:

Section 2 would benefit from work to strengthen the argument. It relies heavily on a single article in preparation by the authors which is currently not readily publicly available: "J. Huisman and O. Woolley-Meza. Ultra-peripheral links drive structural instability in complex contagion. in preparation, 2017." In fact, the bibliography also contains a second manuscript for which no public preprint is referenced.

It would be beneficial either to present the fuller argument from this (unseen) manuscript, or to provide alternative sources for the evidence that "recent work shows that the most effective way to transfer ideas between communities is through connections made between individuals that are more peripheral rather than through those better connected [15]".

Data underlying Figure 1 should be further specified, ideally with a data set that others could immediately reuse. See http://datasciencehub.net/guidelines.html and in particular http://journals.plos.org/plosone/s/data-availability

The discussion on incentives is relatively shallow; by contrast a science policy perspective might draw examples (perhaps even using citation analysis) to show the downsides of rewarding individual accuracy.

Nanopublication comments:

Further comments:

Page 1: Is NIPS attendance data public? Are there citable figures here? That would strengthen the argument and make it more legible to future readers.
Page 3: Consider renumbering references to match the bibliography order.
Page 4: Add a citation regarding the "famous examples of damaging group think, such as Tulip Fever, the South Sea Bubble and other stock market booms and busts"
Page 6: it's --> its and "consistent each other" --> "consistent with each other"
The caption for Figure 2 should cite [15].
Page 7: lowercase "Boosting"? A space is needed after the comma in [14,21].

References need a thorough edit, e.g.:
Capitalize for [5], [21], and [35]: BellKor, Netflix, Russian
Preprints are needed if you are reasonably going to cite [15] and [29]
Use consistent capitalization for arXiv, and provide the paper number for [21]
Something has gone wrong in the very last line (e.g. Nature Communications)

Review #2 submitted on 23/Mar/2017

By Bjarke Mønsted

https://orcid.org/0000-0003-3683-312X

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Good
Reviewer's confidence: Medium
Significance: Low significance
Background: Comprehensive
Novelty: Lack of novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper addresses the strength of diversity in available models in data science. The authors claim that one threat to such diversity is the tendency of certain methods to be in fashion, at the expense of other models. Certain mechanisms are proposed to avoid this, such as maintaining a network structure among researchers in which smaller communities may further develop novel methods which may then be used in conjunction with existing methods, rather than a network structure in which superior performance of one single method leads to its global adoption. Another such proposed mechanism is setting up incentives, economic and otherwise, for the development of methods which do not necessarily perform well in general, but outperform existing models in cases where the latter perform poorly.

Reasons to accept:

The paper addresses an important topic, and one that is within a broad interpretation of the journal's scope (i.e. 'how to analyse [data] in a way that allows new insights'). The point that incentives should favor approaches that perform well in cases where otherwise superior approaches perform poorly is an interesting and clearly directly applicable insight.

Reasons to reject:

While within a broad interpretation of the journal's scope, the paper presents neither concrete methods nor results pertaining to data analysis, nor any new clear, concise proposals as to how diversity in data science can be increased, as the aforementioned use of incentives has already been proposed in a previous work by one author.
The paper also fails to provide comprehensive evidence that lack of diversity is in fact a problem. The paper uses as an example the Netflix Prize Competition, in which an ensemble of diverse methods provided a superior solution, but this seems contrary to the point previously mentioned, as this seems a case of an existing competition providing incentives towards a diverse approach rather than one single method.

In some cases, sources are lacking to either back up factual claims, or to elaborate on nomenclature.
Some examples of the former:
- On page 3, 7 sources are cited to back up claims about the superiority of collective versus individual intelligence in humans and animals, whereas there is no citation on the superiority of statistical ensemble models, which is the one case that is of direct concern to this paper.
- Also, on page 4: "The community [...] is also subject to social forces that discourage a diversity of approaches." There is no source to back up this claim, which seems particularly strange since we've just been presented with an example of researchers that were indeed rewarded for opting for a diverse approach in the Netflix competition.
An example of the latter:
- On page 6: "Scientific ideas and publications exist in a quasi-market, where some are accorded a high value and attract high rewards." The exact meaning of this is not clear to me - it should be elaborated precisely what existing in a quasi-market means and how it relates to the previous claims.

Furthermore, there are unfounded claims regarding the interplay between network structure and idea spreading, e.g. in the caption for figure 2: "This is the topology that best sustains the global penetration of diverse ideas that are fostered locally."

For figure 1: It is unclear to me why the authors normalize counts to 100. If I understand the procedure correctly, this normalization entails simply dividing each monthly count by max(count), so the graph retains its shape, but the actual number of counts is obscured. I don't see any reason to remove this information.
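
The normalization described above can be sketched as follows. This is a minimal illustration, assuming the procedure is as the reviewer understands it (each monthly count divided by the maximum count and scaled to 100). The function name and the sample counts are hypothetical and are not taken from the manuscript.

```python
def normalize_to_100(counts):
    """Rescale counts so that the largest value becomes 100."""
    peak = max(counts)
    if peak == 0:
        # Avoid division by zero when every count is zero.
        return [0.0 for _ in counts]
    return [100.0 * c / peak for c in counts]


# Hypothetical monthly counts, for illustration only.
monthly_counts = [20, 50, 200, 100]
normalized = normalize_to_100(monthly_counts)

# A series ten times larger gives exactly the same normalized curve,
# so the shape is kept while the absolute number of counts is lost.
scaled_up = normalize_to_100([10 * c for c in monthly_counts])
```

Under this assumption, two series that differ only by a constant factor become indistinguishable after normalization, which is the loss of information the comment refers to.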

Finally, there are a few cases of typos and omitted words:
"The has been a huge increase" -> "There has been a huge increase"
"Analyzing the networks of scientific can reveal" -> "Analyzing the networks of scientific interactions [presumably] can reveal"
"reasons to be weary" -> "reasons to be wary"

Nanopublication comments:

Further comments:

Review #3 submitted on 25/Mar/2017

By Melissa Haendel

https://orcid.org/0000-0001-9114-8737

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This is a fascinating discourse on the need to combine methodologies, to not get too caught up in trends, to apply the right approaches for the right tasks, and to leverage team science approaches to advance the newly sexy field of data science. The paper is an intriguing perspective piece, and makes for interesting and compelling reading.

Reasons to accept:

It is a very interesting subject and we should all be thinking more about these issues as we go about our business of "data science".

Reasons to reject:

none

Nanopublication comments:

Further comments:

This is a fascinating perspective piece and I enjoyed reading it. The manuscript details a variety of characteristics that are both faulty and needed within the emergent and interdisciplinary field of data science (albeit some are existing activities under new guises). The manuscript is both advisory and cautionary. My mostly minor comments have to do with adding additional references and examples, and clarifying and/or narrowing the language and topics so as to be more impactful in its message.

1. Consider in the introduction, citing the Oceans of Data profiling of a “data scientist” http://www.oceansofdata.org/ or other similar efforts, there are many and these are very useful in even understanding what is different about a data scientist in today’s world vs other related fields (though I agree with the authors about rebranding some of these concepts).
2. A few other examples of “bubbles of interest”, might be included, and in fact these can be seen even more specifically based upon citation analysis, such as in Greenberg 2009, or regarding drug development: doi:10.1038/470163a https://arxiv.org/pdf/1102.0448.pdf and many others
3. There are also some nice examples of issues in applying the correct/best methods or statistical biases in sharing methods/data. Examples: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026828 http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.100... and many others… should provide a diversity of examples, for example best data visualization approach.
4. The Netflix example of needing ensemble models, to take advantage of different approaches is excellent and easily digestible by anyone reading – and deeply highlights the need for changes in attribution, incentive structures, and team formation approaches to better our collective intelligence. Similarly, the jury decision making process.
5. Some concerns about this comment, or rather, perhaps it could be clarified a bit. “The culture of benchmarking one’s new method against the state of the art in terms of accuracy necessitates that researchers utilize the best currently available methodologies if they wish to get their work published.” Unfortunately, the benchmarking is often performed far too narrowly or poorly, and not using the best approaches available (because we are too narrow minded to think outside our current bubble) – justifying publication but not really showing significant advancement or corroboration. I think this is what the authors actually mean but perhaps could be stated more clearly. Some great examples of this on Lior Pachter’s blog.
6. Given the nature of the types of interaction a data scientist might have, it might be nice to include citations/examples from more recent efforts to represent these, such as in github: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4975729/ or perhaps Wikidata or other technical interaction contexts. We simply must get beyond using citations as the primary method for inferring social transfer of knowledge.
7. The contagion idea is very compelling, but it doesn’t take into account any other factors, such as expertise or its similarity across a social network.
8. The incentives to experiment or be in the minority are important to describe, but seem to hang at the end, in terms of how one can ensure their inclusion within ensemble frameworks or other team dynamics that are the primary focus of the recommendations and themes of the manuscript. There is some discussion of including weak classifiers, but in general this ending section could be made more robust with another example and a bit more discussion.
9. There are many aspects of “data science” that are not discussed here. Perhaps it should be mentioned more explicitly that the focus is on methods and their popularity, and need for collaborative work in their use/testing.
10. I agree with this comment: “Thus, increasing connections between scientists working at the periphery, in communities that are typically distant, could be a promising new way of fostering a diverse set of ideas and integrating them for innovative science.” However, much of the conclusion focuses too much, in my opinion, on hiring/funding incentives and less on the actualities of collaboration. I agree with these opinions; it is just that they seem to distract from the main take-home messages – especially the one quoted above.
11. The conclusion is very wandering – suggest reorganizing to be more on target with the excellent points raised in the manuscript.

Review #4 submitted on 27/Mar/2017

By Victor de Boer

https://orcid.org/0000-0001-9079-039X

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Excellent
Reviewer's confidence: Medium
Significance: High significance
Background: Comprehensive
Novelty: Limited novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper presents a case for fostering diversity with respect to paradigms, methods and tools in the upcoming field of Data Science.

Reasons to accept:

The authors clearly identify the paradoxical fact that even though Data Science is a young research field, there are already signs of preferences with respect to specific paradigms (Deep Learning comes to mind). The authors make the point for maintaining diversity in the field and here point to its benefits identified in different domains.
This is of course true for any scientific domain, but even more so for Data Science as here, the most successful methods are often ensemble methods, that only work because of the different biases of the individual algorithms. I think this point is quite interesting and the authors make a good point that both the tools themselves and the scientific structures from which they result should be diverse enough. The authors' main point is therefore that diversity results in the best quality models. I would also add that for 'good science', diversity is even more important than for 'good results'. As scientists, we should not primarily be motivated by performance of methods, but by knowledge gained with respect to the workings of these methods and models. For this, a good diverse landscape of such models is needed.

I also appreciate how this is related to features of scientific (collaboration) networks in section 2.2

In Section 2.3, the authors discuss the role of incentives to foster this diversity. The directions the authors point to are interesting, but more concrete measures would have been appreciated. This notwithstanding, the paper is a very nice read and a good call to remain vigilant as a community. I would like to see this argument made in the first issue of this journal and therefore suggest accepting the article.

Some minor issues:
- The first sentence of 2.1: "Analyzing the networks of scientific can " -> word missing here
- Section 2.1: "For example, the Web of Science" - please add a reference to this platform (in a footnote)
- 2.1 " resolutions.However" Missing whitespace

Reasons to reject:

none

Nanopublication comments:

Further comments:

1 Comment

Link to Final PDF and JATS/XML Files

Submitted by Tobias Kuhn on Wed, 07/04/2018 - 08:37

https://github.com/data-science-hub/data/tree/master/publications/1-1-2/ds-1-1-2-ds003
