BioVenn - an R and Python package for the comparison and visualization of biological lists using area-proportional Venn diagrams

Tracking #: 679-1659

Authors:

	Name	ORCID
	Tim Hulsen	https://orcid.org/0000-0002-0208-8443

Responsible editor:

Gargi Datta

Submission Type:

Resource Paper

Abstract:

One of the most popular methods to visualize the overlap and differences between data sets is the Venn diagram. Venn diagrams are especially useful when they are 'area-proportional' i.e. the sizes of the circles and the overlaps correspond to the sizes of the data sets. In 2007, the BioVenn web interface was launched, which is being used by many researchers. However, this web implementation requires users to copy and paste (or upload) lists of IDs into the web browser, which is not always convenient and makes it difficult for researchers to create Venn diagrams ‘in batch’, or to automatically update the diagram when the source data changes. This is only possible by using software such as R or Python. This paper describes the BioVenn R and Python packages, which are very easy-to-use packages that can generate accurate area-proportional Venn diagrams of two or three circles directly from lists of (biological) IDs. The only required input is two or three lists of IDs. Optional parameters include the main title, the subtitle, the printing of absolute numbers or percentages within the diagram, colors and fonts. The function can show the diagram on the screen, or it can write output to one of the supported file formats. The function also returns all thirteen lists. The BioVenn R package and Python package were created for biological IDs, but they can be used for other IDs as well. Finally, BioVenn can map Affymetrix and EntrezGene to Ensembl IDs. The BioVenn R package is available in the CRAN repository, and can be installed by running ‘install.packages(“BioVenn”)’. The BioVenn Python package is available in the PyPI repository, and can be installed by running ‘pip install BioVenn’. The BioVenn web interface remains available at https://www.biovenn.nl.

Manuscript:

ds-paper-679.docx

Previous Version:

BioVenn – an R and Python package for the comparison and visualization of biological lists using area-proportional Venn diagrams

Data repository URLs:

https://omabrowser.org/All/oma-groups.txt.gz

https://www.biovenn.nl/r_python/

Date of Submission:

Thursday, February 11, 2021

Date of Decision:

Sunday, February 28, 2021

Nanopublication URLs:

Decision:

Solicited Reviews:

Review #1 submitted on 22/Feb/2021

By Emma Beauxis-Aussalet

https://orcid.org/0000-0002-4657-892X

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The main changes are that the authors have clarified the mapping of "biological IDs", and introduced explanations of "the 13 sets" early in the paper's introduction.

Reasons to accept:

The paper is more complete and easier to read.

Reasons to reject:

IMPORTANT NOTE: the paper cannot be published as such, because some figures are missing (!) or are of poor quality in this version. However, in the previous version, the figures were fine. So I considered this a small mistake that will be easily corrected without a new round of revision.

Table 3 needs reformatting: it is too tedious to read as such.

The explanation of "the 13 sets" remains a bit vague (p.2). I'd suggest: "it displays the list of elements belonging to each of the subsets in the Venn diagram, i.e., the thirteen subsets resulting from the overlaps between two to three circles X, Y, Z (in other words, the elements belonging to all circles X, Y and Z; belonging to at least X, Y, Z, X and Y, X and Z, Y and Z; belonging to only X, only Z, ... or only YZ)".
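
As an illustration of the thirteen subsets discussed above, here is a minimal Python sketch using plain set operations. It is not the BioVenn implementation; the function name `thirteen_sets` and the dictionary keys are illustrative assumptions chosen for this example.

```python
def thirteen_sets(list_x, list_y, list_z):
    """Return the thirteen ID sets of a three-circle Venn diagram.

    Key names are illustrative assumptions, not taken from the BioVenn package.
    """
    x, y, z = set(list_x), set(list_y), set(list_z)
    xy, xz, yz = x & y, x & z, y & z      # pairwise overlaps ("at least" both)
    xyz = xy & z                          # elements in all three circles
    return {
        "x": x, "y": y, "z": z,           # full circles
        "xy": xy, "xz": xz, "yz": yz,
        "xyz": xyz,
        "x_only": x - y - z,              # in exactly one circle
        "y_only": y - x - z,
        "z_only": z - x - y,
        "xy_only": xy - z,                # in exactly two circles
        "xz_only": xz - y,
        "yz_only": yz - x,
    }


sets = thirteen_sets(["a", "b", "c", "d"], ["b", "c", "e"], ["c", "d", "f"])
counts = {name: len(ids) for name, ids in sets.items()}
```

The seven "full" and "at least" sets overlap each other, whereas the six "only" sets together with `xyz` partition the union into the regions that are actually drawn.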

Nanopublication comments:

Further comments:

Minor corrections:
- Section 2.5: "himself" -> "herself", or use plurals "unless users calculate ... themselves"
- Section 5 (p.4): "it can write output to one of the ..." -> "it can export the graph in one of the ..., e.g., as image or svg". And also "which is new" -> "which is a new".

Review #2 submitted on 24/Feb/2021

By Michael Hinterberg

https://orcid.org/0000-0003-0693-7075

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The manuscript describes the cross-platform capabilities of the BioVenn tool for visual comparison of gene lists. Through additional explanation (section 5.3) and clarification, the author has addressed concerns to improve the explanation of specific utility to biological analysis, as well as comparative strengths of the tool in drawing circles (as opposed to ellipses) and easy adjustment of visual parameters for production-quality plots.

Reasons to accept:

The paper describes an example of a biological analysis for rapid hypothesis testing and visualization of gene set analysis.
Thank you for the clarifying language and comments.

Reasons to reject:

There are no major remaining concerns.

Nanopublication comments:

Further comments:

RESPONSE TO REVIEWERS

Reviewer 1:

Summary of paper in a few sentences:

The manuscript describes the port of BioVenn to R and Python, which allows users to easily automate their operations with the algorithm.
The work describes an update of a tool for proportional Venn diagrams, which has been widely used. The manuscript reviews existing tools for similar uses with fairness, showing strengths and weaknesses. I only have two small observations in this regard.
--> Thank you.
• First, even though it is not properly a Venn diagram generator, the manuscript should mention UpSetR. I see no need to change the tables or figures, but a small description would improve the manuscript.
--> I have included UpSetR in the Conclusion section.
• Second, nVenn has both an R interface (nVennR) and a web interface.
--> This has been corrected in table 3.

Regarding the technical merits of the manuscript:
• One feature I consider unique and useful in BioVenn is the automatic mapping of Affymetrix and EntrezGene IDs to Ensembl IDs. However, I have not found this option while using the R interface, and it is not mentioned in the vignette. If this is not available, the author should delete the last sentence of the first paragraph of Results. If it is available, the author should add the procedure to the manuscript and the documentation.
--> I have included an example of the ID mapping in the manuscript (in the new section 5.3). It will also be added to the documentation of the next version of the software.
• The manuscript should make it clear that exact proportionality for more than two sets is not achievable with circles. This is not a limitation of the program. In the example provided, if we delete the "1007_s_at" element in “list_z”, the “list_x intersect list_z” overlap should be zero. However, the overlap in the result is nonzero (the region is not labeled, which informs the user of this discrepancy). For the sake of completeness, this limitation should be pointed out in the manuscript.
--> Indeed, for three-circle diagrams it is sometimes not possible to have exact proportionality, and for diagrams with even more circles this is an even larger issue. I have added this information to both the Methods and the Results section.
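
The proportionality limit raised above can be made concrete with a short Python sketch. It assumes circle areas equal to the list sizes and uses the standard formula for the intersection area of two circles, with a bisection search for the centre distance; the function names are hypothetical and this is not the BioVenn code.

```python
import math


def overlap_area(r1, r2, d):
    """Intersection area of two circles with radii r1, r2 and centre distance d."""
    if d >= r1 + r2:                      # circles do not touch
        return 0.0
    if d <= abs(r1 - r2):                 # smaller circle lies inside the larger
        return math.pi * min(r1, r2) ** 2

    def clamp(v):
        # Guard acos against rounding slightly outside [-1, 1].
        return max(-1.0, min(1.0, v))

    a1 = math.acos(clamp((d * d + r1 * r1 - r2 * r2) / (2 * d * r1)))
    a2 = math.acos(clamp((d * d + r2 * r2 - r1 * r1) / (2 * d * r2)))
    product = (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2)
    return r1 * r1 * a1 + r2 * r2 * a2 - 0.5 * math.sqrt(max(0.0, product))


def distance_for_overlap(r1, r2, target, tol=1e-10):
    """Bisection: the overlap shrinks monotonically as the centres move apart."""
    lo, hi = abs(r1 - r2), r1 + r2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if overlap_area(r1, r2, mid) > target:
            lo = mid                      # too much overlap: move centres apart
        else:
            hi = mid
    return (lo + hi) / 2


# Two lists of 100 and 50 IDs sharing 20 IDs; circle area equals list size.
rx, ry = math.sqrt(100 / math.pi), math.sqrt(50 / math.pi)
d_xy = distance_for_overlap(rx, ry, 20)
```

For two circles, one distance matches one overlap exactly. With three circles, the three pairwise distances already fix the whole layout, so the area of the triple overlap and of the remaining regions follows from them and cannot in general be matched as well.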

Regarding presentation:
• The text, figures, and tables of the work are accessible, pleasant to read, clearly structured, and free of major errors in grammar or style. I would point one small error at page 9: “This also makes sure that the user cannot mathematically impossible numbers…” should be “This also ensures that the user cannot input mathematically impossible numbers…”
--> This mistake has been corrected.

Reasons to accept:

The tool described is widely used and has some unique features that may be interesting to scientists.
--> Thank you.

Reasons to reject:

I do not find any issues important enough to reject the manuscript.
--> Thank you.

Reviewer 2:

Summary of paper in a few sentences:

Tim Hulsen provides an updated, straightforward R and Python package to support drawing of area-proportional Venn diagrams of up to 3 overlapping lists. Biological lists can be mapped from Entrez/Affy to Ensembl before calculating overlaps. The functions are parameterized so layout elements such as labels, font, colour, etc. can be adjusted. Additionally, the functions return the overlapping list permutations.
--> Thank you.

Reasons to accept:

When working directly within R or Python, BioVenn appears to provide a quick and easy option for creating presentation and manuscript-worthy images. This is especially true within the specific use case of working with and between supported EntrezGene, Affymetrix, and Ensembl IDs. In those use cases, it is easier than using a web-based option, and the area-proportional output can be helpful for rapid visual insight.
--> Thank you.

Reasons to reject:

The novelty of this tool is the biggest drawback. It appears to be an API change to an existing tool more than a truly novel approach.
--> The R and Python packages were created by taking the source code of the original web application, and completely rewriting it in R and Python, taking into account the specific qualities and peculiarities of these programming languages. The mathematical calculations are the same in each version, making sure that they give the same output (except for some purely visual differences), but for the rest they are actually quite different. For example, to create SVG output, different modules/methods had to be used in the web version, the R package and the Python package. Furthermore, I would like to stress that some advancements have been made since the publication of 2008. Some errors have been corrected (caused by division-by-zero in some special cases), and the drag-and-drop functionality is new as well, making it much easier to put labels in the right spot (I included this information now in section 5).
The "Bio" part is of limited use (or not properly explained), as a user with R, for example, could pre-process ID translations if desired, and a biological use-case is not strongly described in this paper.
--> The "Bio" part has now been explained better in the paper (see the new section 5.3, including an extra figure and extra table). Indeed, ID mapping can be done separately beforehand in both R and Python, but it is much simpler for the user to send the biological identifiers directly to BioVenn and let the program do the work of 1) the ID mapping, 2) the set/overlap calculation and 3) the Venn diagram visualization.
For general use, it is unclear if there are substantial differences between this and 'eulerr' for visual output in R. There are a few formatting parameters that appear to be supported in this tool (BioVenn), and of course it works natively in R and Python, but eulerr has a similar interface and allows for >3 sets.
--> Eulerr indeed misses some formatting options that BioVenn has, and it does not have the drag-and-drop functionality to ensure that labels are placed in the right spot. Allowing more than three sets is an interesting option (which I am actually exploring in new software), but it also brings additional problems since area-proportionality is often not possible for diagrams with more than three sets. Even for three there is not always a perfect solution, but with more circles this is a bigger problem.
While the cross-platform support in R and Python may be helpful for some, a natural R improvement would be a tidy data/ggplot API that supported additional overlapping sets, as well as an easier layered graphical interface such that layout elements can easily be 'themed' instead of a lengthy and specific argument list.
--> Thank you for the suggestion. You stated correctly that 'cross-platform support in R and Python may be helpful for some'; therefore, any new functionality implemented in R should also be implemented in Python. I did not want to include any R-specific (or Python-specific) functionality. Regarding the 'themes': users could easily store a set of parameters in a list/array/dictionary, and pass that on to the draw.venn function. Example:
black_theme <- list(...)
do.call(draw.venn,c(list(list_x,list_y,list_z),black_theme))
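
The same idea can be sketched in Python with dictionary unpacking. In the sketch below, `draw_venn_stub` is a hypothetical stand-in rather than the BioVenn function, and its parameter names (`title`, `background`, `text_color`) are assumptions made only for illustration.

```python
def draw_venn_stub(list_x, list_y, list_z, title="", background="white", text_color="black"):
    """Hypothetical stand-in for a plotting function; returns the settings it received."""
    return {
        "title": title,
        "background": background,
        "text_color": text_color,
        "sizes": (len(set(list_x)), len(set(list_y)), len(set(list_z))),
    }


# A reusable "theme" is just a dictionary of keyword arguments.
black_theme = {"background": "black", "text_color": "white"}

# ** unpacks the dictionary into keyword arguments, like do.call in R.
result = draw_venn_stub(["a", "b"], ["b", "c"], ["c"], title="Example", **black_theme)
```

Because the theme is ordinary data, it can be stored in a file, shared between scripts, or merged with per-plot overrides before the call.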
Additionally, while the area-proportional plots seem intuitive and visually attractive, the paper could benefit from an insight-driven/workflow use case that showed the utility of area-proportionality; for example, real-world cases where equal-sized circles could be misleading, or where proportionality instead improved cognition.
--> Thank you for the suggestion. I believe that figures 3 and 4 clearly show that the diagrams that show more overlap between human and mouse (as opposed to human-Xenopus and mouse-Xenopus) are much more insightful. Also note the sentence "We can see that the packages that create area-proportional diagrams (a, c, d, g, h) give a better impression of what the data looks like: the human and mouse circles indeed have a larger overlap than with the Xenopus circle, and the Xenopus circle is larger than the other ones.". The diagrams that are not area-proportional do not show at a glance that human and mouse are relatively closely related, and that Xenopus has the largest genome.

In summary, this is an easy and potentially useful tool for specific use cases, with limited additional support compared to alternatives (although the alternatives are nicely shown). A scientific paper, as opposed to a Vignette/application note, would benefit more strongly from a specific use-case example or literature examples showing, or at least suggesting, the insight gained by presenting proportional Venn diagrams, which would be more compelling material for a paper than a description of all of the functional arguments.
--> Thank you. I have created section 5.3 to show a use case for the biological ID mapping. Section 5.4 shows clearly, in my opinion, why area-proportional diagrams give better insight into the data, by showing the large human-mouse overlap and the large size of the Xenopus circle.

Reviewer 3:

Summary of paper in a few sentences:

The paper introduces R and Python libraries for drawing Venn diagrams, with areas proportional to the size of each set and subset.
Compared to existing libraries, the new libraries include: automatic mapping of IDs into sets/subsets, choice of displaying set size as number or percentage, full adjustment of color and text aesthetics (incl. interactive label positioning, with SVG option). They also include a mapping of "Affymetrix and EntrezGene IDs to Ensembl IDs." which is not clearly specified.
--> A better explanation of the ID mapping functionality has now been provided, including a figure and table created using the ID mapping.

Reasons to accept:

The novelty of the new libraries is clear, the new functionalities seem relevant for the bioinformatics community, and beyond. The review of existing libraries is clear and comprehensive. The paper is clear and well-written.
--> Thank you.

Reasons to reject:

The mapping of "Affymetrix and EntrezGene IDs to Ensembl IDs" remains unspecified, and a few other points need clarification (see below).
It is important to briefly specify the mapping of "Affymetrix and EntrezGene IDs to Ensembl IDs" or else readers are not provided with all information needed to understand what the libraries do. If it is unclear what this special mapping does, then it is unclear how to reuse the libraries, especially for users who do not display bioinformatics data (and would not read citation [15] to fully understand the mapping). The potential reuse for use cases other than bioinformatics needs to be clarified (e.g., need to disable the "special" mapping? or just avoid ID labels in the typical bioinformatics format?).
--> An explanation of the ID mapping functionality has now been provided, in the new section 5.3. When users do not want to use this option, they can just leave out the 'map2ens' parameter, since its default is 'False'. This is now stated in the Methods section.
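
A minimal Python sketch of an optional mapping step controlled by such a flag follows. Only the parameter name `map2ens` and its default of False come from the response above; the function `to_ensembl`, the lookup table, and the toy identifiers are hypothetical, and the choice to drop unmapped IDs is an assumption of this sketch, not a statement about BioVenn.

```python
def to_ensembl(ids, mapping, map2ens=False):
    """Optionally translate IDs via a lookup table; leave them untouched by default."""
    if not map2ens:
        return set(ids)                   # non-biological IDs pass through as-is
    mapped = set()
    for identifier in ids:
        # One source ID may map to zero, one, or several target IDs.
        # Assumption of this sketch: IDs without a mapping are dropped.
        mapped.update(mapping.get(identifier, ()))
    return mapped


# Toy lookup table with made-up identifiers (not real probe or gene IDs).
toy_mapping = {"probe_1": ["GENE_A"], "probe_2": ["GENE_A", "GENE_B"]}

unmapped = to_ensembl(["probe_1", "other"], toy_mapping)
translated = to_ensembl(["probe_1", "probe_2", "probe_3"], toy_mapping, map2ens=True)
```

Mapping before the overlap calculation matters because two different source IDs can resolve to the same target ID, which changes the set sizes and therefore the diagram.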

Further comments:

Direct links to the reference manuals should be added (e.g., for R https://cran.r-project.org/web/packages/BioVenn/BioVenn.pdf).
--> CRAN itself states that the link should be https://CRAN.R-project.org/package=BioVenn (see the header 'Linking'). From there, the manual is just one click away. The same holds for the other packages.

Introduction:
- It ends with " the following paragraphs" but describes only the next section. Having an overview of all sections would be clearer.
--> I changed it to "the following two sections". Section 2 describes the R packages, section 3 the Python packages.
- The "ID mapping" functionality should also be specified in the introduction, both the "basic mapping" into the "13 sets" and the mapping of "Affymetrix and EntrezGene IDs to Ensembl IDs" (e.g., described with a brief sentence).
--> I have included a short description of both the biological ID mapping and the "13 sets" in the introduction.
- Section 3.1: "‘drag-and-drop’ functionality of text and numbers" -> "‘drag-and-drop’ functionality for repositioning labels"
--> This has been changed to "'drag-and-drop' functionality for repositioning of titles and labels" (at multiple locations in the text).
- Section 4: "Generate lists of the thirteen possible sets, and count them" -> "and count the number of elements per set"
--> Changed to "Generate lists of IDs for the thirteen possible sets, and count the number of IDs within each set"
- Section 4: "Calculate the angles of the XYZ triangle" -> specify what is the XYZ triangle (I understand that it's the centre of the 3 circles, but the reader is left to interpret what the "XYZ triangle" is).
--> This is the triangle that is formed by connecting the three centers of the circles. This is now explained in the paper.
- Section 5.3: "Xenopus" -> "frog" for consistency with other labels of animals.
--> "Xenopus" is a special type of frog (the African clawed frog), which is quite well known among researchers working with model organisms. Therefore the name "Xenopus" should be used here.
- Table 2: Improve the layout
--> The current layout of the table (now table 3) is not optimal due to limitations in MS Word. Once it is online, the layout will be much better.
- Page 9: recap why areas in other libraries are "inaccurate". Also fix the syntax of "the user cannot mathematically impossible numbers" -> "the user cannot [input] impossible numbers"
--> I added a sentence explaining that for more than three sets, it is more difficult to create accurate area-proportional diagrams. The syntax has been corrected to "the user cannot input mathematically impossible numbers".
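
The "XYZ triangle" step discussed in the Section 4 comment above (the triangle formed by connecting the three circle centres) can be illustrated with the law of cosines. This is a sketch under the assumption that the three centre-to-centre distances are already known; the function name `triangle_angles` is hypothetical and the code is not taken from BioVenn.

```python
import math


def triangle_angles(d_xy, d_xz, d_yz):
    """Interior angles (radians) at centres X, Y, Z from the three centre distances."""
    def angle(opposite, a, b):
        # Law of cosines; clamp to guard acos against rounding errors.
        cos_value = (a * a + b * b - opposite * opposite) / (2 * a * b)
        return math.acos(max(-1.0, min(1.0, cos_value)))

    return (
        angle(d_yz, d_xy, d_xz),  # angle at X, opposite side YZ
        angle(d_xz, d_xy, d_yz),  # angle at Y, opposite side XZ
        angle(d_xy, d_xz, d_yz),  # angle at Z, opposite side XY
    )


# A 3-4-5 triangle has a right angle at the corner opposite the longest side.
angle_x, angle_y, angle_z = triangle_angles(3.0, 4.0, 5.0)
```

With one centre placed at the origin and a second on a chosen axis, the angle at the first centre is enough to position the third.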

Data Science
