Arangopipe, a Tool for Machine Learning Meta-Data Management

Tracking #: 690-1670

Authors:

	Name	ORCID
	Rajiv Sambasivan	https://orcid.org/0000-0002-4865-7218
	Jörg Schad	https://orcid.org/0000-0002-1552-382X
	Christopher Woodward	https://orcid.org/0000-0003-4850-3101

Responsible editor:

Brian Davis

Submission Type:

Resource Paper

Abstract:

Experimenting with different models, documenting results and findings, and repeating these tasks are day-to-day activities for machine learning engineers and data scientists. There is a need to keep control of the machine learning pipeline and its metadata. This allows users to iterate quickly through experiments and retrieve key findings and observations from historical activity. This is the need that Arangopipe serves. Arangopipe is an open-source tool that provides a data model that captures the essential components of any machine learning lifecycle. Arangopipe provides an application programming interface that permits machine learning engineers to record the details of the salient steps in building their machine learning models. The components of the data model and an overview of the application programming interface are provided. Illustrative examples of basic and advanced machine learning workflows are provided. Arangopipe is not only useful for users involved in developing machine learning models but also useful for users deploying and maintaining them.

Manuscript:

ds-paper-690.pdf

Revised Version:

Arangopipe, a Tool for Machine Learning Meta-Data Management

Data repository URLs:

The data and code associated with this submission are available at:

https://github.com/arangoml/arangopipe

Date of Submission:

Friday, March 26, 2021

Date of Decision:

Monday, June 7, 2021

Nanopublication URLs:

Decision:

Undecided

Solicited Reviews:

Review #1 submitted on 29/Apr/2021

By Thomas Gaillat

https://orcid.org/0000-0003-3433-6533

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper introduces a new tool used for the storing of the meta-data created in AI models. The authors present a database and its APIs that help keep control of a machine learning cycle. Metadata, such as datasets, features and models can be stored and retrieved at a later stage according to the needs of different types of users. The tool provides role-related APIs.

Reasons to accept:

First, the tool answers a growing need. As stated, machine learning engineering involves the use of different models with different features, datasets, etc. This variety makes it difficult to keep track of changes, which hinders comparisons. Arangopipe appears to be a very interesting solution to this problem. It adapts well to any machine learning setting.
Second, the paper is well laid out, and all the details needed to understand where and how the tool can be used are provided.

Reasons to reject:

I do not have any reasons to reject.

Nanopublication comments:

Further comments:

Review Document: REview-ds_paper-690.pdf

Review #2 submitted on 17/May/2021

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Bad
Presentation: Weak
Reviewer's confidence: Medium
Significance: High significance
Background: Incomplete or inappropriate
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)

Summary of paper in a few sentences:

In this paper, the authors introduce Arangopipe, a tool for keeping control of the machine learning pipeline and its metadata. It uses a graph to represent the activities in the pipeline and ArangoDB to store the information.

Reasons to accept:

Tools for managing ML experiments and facilitating reproducibility are important for the ML community. In this paper, the authors address this pain point.

Reasons to reject:

This paper is not well developed and has several issues.

1. The authors cite many online articles, which is fine, but these references need to be edited carefully and properly. Each should give details such as the author, page title, and year (date created or last updated) rather than just a URL. Similar issues exist in the references to academic articles.
2. This article seems to have been completed in a rush and lacks many details, from both the technical and the user-interface perspective. For example, a concrete example using the Arangopipe UI would make the paper easier to follow, as would showing how to search with AQL, how to launch the Docker images, etc. In many paragraphs, the authors simply write 'Please see ...' with a link; it would be better to provide some of the details in the text.
3. Some sentences seem to be incomplete. For example, on page 4, "The details of doing this are provided in section??", and on page 5, "notebooks (see section ). ??"
4. One of my concerns is how this tool manages collaboration among colleagues. For example, what happens if a model is retrained by another colleague?

Nanopublication comments:

Further comments:

Overall, it seems that the authors have some interesting work in hand; however, the current version reads more like an online blog post than a paper.

Review #3 submitted on 25/May/2021

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Bad
Suggested Decision: Reject
Technical Quality of the paper: Unable to judge
Presentation: Average
Reviewer's confidence: High
Significance: High significance
Background: Unable to judge
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

In this paper, the authors present a tool for ML metadata management. The system provides a multi-model database structure and a supplemental API to record the activities and parameters within a machine learning project. In addition, the authors present a series of use cases for the system.

Reasons to accept:

N/A

Reasons to reject:

In this paper, the authors present a tool for ML metadata management. The system provides a multi-model database structure and a supplemental API to record the activities and parameters within a machine learning project. The authors motivate the need for such a system; however, the submission is very light on technical implementation details. Below is a series of comments, section by section.

Data Science Workflow – This section presents a graph model used to represent a machine learning project lifecycle, with nodes representing the key components. The authors allude to how this graph model can be used to represent different activities (e.g. “Hyper-parameter tuning experiments”). However, they provide no example of what this would look like. In addition, the authors state that the graph can be extended, but instead of documenting how this process works, they refer the reader to examples in the project's GitHub repo.

Software Implementation
The authors list the components within the application, starting with the API; however, no examples of API endpoints are provided, so it is hard to assess the functionality of the API.
The authors state that ArangoDB is the database used; however, in the previous section they state “To use Arangopipe in your organization, the graph used to track machine learning and data science activities needs to be provisioned. This graph is called Enterprise ML Tracker Graph and is provisioned using the administrative interface”. Does this mean the ML tracker graph is a “component” within ArangoDB? In addition, no information on AQL is provided: what is the syntax, and what does an AQL query look like?
The authors then state that a web user interface is a component of the system, a screenshot of the web user interface would be appreciated.
The sentence “It is possible to use Arangopipe with Oasis, Arango DB’s managed service offering on the cloud. This would require no installations or downloads. The details of doing this are provided in section” is missing the section number at the end.
The authors state that Arangopipe is available as a series of container images. Are these container images a “component” of the overall system or are they merely a means of distribution and deployment of said system?
The authors state “The administrator would use the administration API” – Is this a separate API from the one mentioned previously?

Illustrative Examples of Arangopipe
This section (which I assume is meant to demonstrate the usage of the system) immediately starts by redirecting the reader to GitHub, and the rest of the section describes a notebook hosted there. Having some examples within the paper on how to use the system would be a better approach.

Reusing Archived Steps
This is a key issue within ML; however, the authors again redirect the reader to user guides for the actual information, e.g. “Please see using Arangopipe with TFDV for exploratory data analysis for an example of performing this in an exploratory data analysis task using tensor flow data validation tensorflow data validation. Please see performing hyper-parameter optimization with Arangopipe for an example with a hyperparameter tuning experiment.” – why are these steps not outlined within the text itself?
The authors state “Node data and results from modelling are easily converted into JSON, which is the format that ArangoDB stores documents in.” – If this is the format they are stored in, how are they converted? Is it not just a read at that point? If some transformation is necessary, what is that transformation?

Extending the data model
This paragraph serves only to point the reader to the GitHub repo.

Experimenting and documenting facts about models and data
This paragraph points to the GitHub repo and does not contain any technical information.

Checking the validity and effectiveness of machine learning models after deployment
The authors state “Arangopipe provides an extensible API to check for dataset drift” – what is this API? What are its inputs and outputs?

Storing Features From Model Development
The authors state “Arangopipe can be used to capture features generated from machine development.” However, no information on how this is achieved is presented.

Overall comments – the paper is incredibly light on technical content, with the authors repeatedly telling the reader to go to their GitHub repository instead, which makes the paper read more like an advertisement for the project than an academic paper. I am concerned by this level of offloading of technical documentation to links on external websites: if the project migrates to a new repo name, or to a different version control system altogether, such crucial documentation would be lost.
While I believe that the project indeed addresses fundamental issues within ML, I cannot recommend this paper due to the severe lack of information provided within the text.

Nanopublication comments:

Further comments:

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Mon, 06/07/2021 - 14:46

Your paper is quite interesting and could be a valuable contribution for ML engineers, but the manuscript needs to be significantly revised. If you intend to submit a revision, please ensure that you address each reviewer's comments in a point-by-point fashion.

Strengths

An interesting contribution which would be of value to ML engineers.

Weaknesses

Weak (non-peer-reviewed) referencing (see Review 2).
Poor editing, in that the paper reads more like a project description than a scientific article (see Reviews 2 and 3).
Insufficient technical description in places (see Review 3 for section-by-section details).

Recommendation 1: Address the technical gaps in the manuscript.

Although well motivated, the manuscript is lacking in technical details and examples/illustrations in various sections. These gaps must be addressed in a future revision. See Review 2 for minor comments and Review 3 in particular for further corrections.

Recommendation 2: Improve the scientific writing and referencing significantly.
- Clean up the referencing and engage with proper peer-reviewed citations in order to correctly support claims in the manuscript.
- Adapt the content to the appropriate style for a scientific article.
- See Review 2 comments in particular for further corrections.

Brian Davis (https://orcid.org/0000-0002-5759-2655)
