Arangopipe, a Tool for Machine Learning Meta-Data Management

Tracking #: 696-1676

Authors:

	Name	ORCID
	Rajiv Sambasivan	https://orcid.org/0000-0002-4865-7218
	Jörg Schad	https://orcid.org/0000-0002-1552-382X
	Christopher Woodward	https://orcid.org/0000-0003-4850-3101

Responsible editor:

Brian Davis

Submission Type:

Resource Paper

Abstract:

Experimenting with different models, documenting results and findings, and repeating these tasks are day-to-day activities for machine learning engineers and data scientists. There is a need to keep control of the machine-learning pipeline and its metadata. This allows users to iterate quickly through experiments and retrieve key findings and observations from historical activity. This is the need that Arangopipe serves. Arangopipe is an open-source tool that provides a data model that captures the essential components of any machine learning life cycle. Arangopipe provides an application programming interface that permits machine-learning engineers to record the details of the salient steps in building their machine learning models. The components of the data model and an overview of the application programming interface are provided. Illustrative examples of basic and advanced machine learning workflows are provided. Arangopipe is not only useful for users involved in developing machine learning models but also useful for users deploying and maintaining them.

Manuscript:

ds-paper-696.pdf

Previous Version:

Arangopipe, a Tool for Machine Learning Meta-Data Management

Data repository URLs:

The data and code associated with this submission are available at:

https://github.com/arangoml/arangopipe

Date of Submission:

Tuesday, June 15, 2021

Date of Decision:

Monday, July 12, 2021

Nanopublication URLs:

Decision:

Solicited Reviews:

Review #1 submitted on 29/Jun/2021

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The authors have addressed the issues identified within the text by providing more technical detail and illustrative examples. More detail has been provided across all sections, illustrating usage with references for further reading/documentation.

One small point: the reference in 3.0.7, “Feature engineering is very important in many machine learning tasks (Domingos, 2012)”, should be changed to the updated referencing style.

Reasons to accept:

The authors have revised the manuscript and included required technical information. The work is pertinent and addresses the key issue of reproducibility within ML.

Reasons to reject:

N/A

Nanopublication comments:

Further comments:

Review #2 submitted on 09/Jul/2021

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Average
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Weak
Reviewer's confidence: Medium
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

In this paper, the authors introduce Arangopipe, which is a tool used to manage the pipeline of machine learning projects, and illustrate examples of how to install and use it.

Reasons to accept:

The authors address a growing need and an important pain point in machine learning projects by building a tool to manage machine learning experiments.

Reasons to reject:

There is still room to improve the quality of the presentation.

Nanopublication comments:

Further comments:

Some typos and comments:

Page 1
Line 26, business is strong [1].
Line 27, Artificial Intelligence (AI) in ...

Page 2
Line 22, In section In section 4 ...

Page 2, Line 43 uses "in Fig. 1", while Page 4, Line 6 uses "in Figure 2". It is better to refer to figures in a consistent way.

It is not common to use screenshots of code directly; it would be better to move them to an Appendix or Supplementary material, or to edit and reformat the code in the paper.

Page Line 15, 28, ...: it is odd to go to a subsubsection without a subsection.

I suggest putting Related Work either after the Introduction or before the Conclusion; I do not see the point of placing it between Section 3 and Section 5.

Some of the writing is slightly redundant, e.g., Page 1, Lines 35-37.

RESPONSE TO REVIEWERS

Review comments for submission #690-1670

Review Comments, Reviewer #2:

1. In this paper, the authors cite a lot of online articles, which should be fine; however, they need to edit these references carefully and properly. It would be better to provide details such as author, page title, and year (date created or last updated) for these articles rather than just a URL. Similar issues also exist for the other academic articles referenced.

Response: Duly noted. The article was prepared with an authoring tool unfamiliar to us, and the PDF version submitted for review was generated from it. Our unfamiliarity with the tool, as well as artifacts from the conversion process, caused this issue. That said, the HTML authoring environment did seem like a good idea; in particular, it seemed to do a great job of providing an online context for ideas we wanted to reference for further details. However, this did not carry over to the PDF version generated by the tool, which is what was submitted for review. We have prepared the revised submission with a tool we are familiar with (LaTeX), as a conventional manuscript that takes the comments from the review into consideration. We think the revised manuscript resolves these issues.

2. This article seems to have been completed in a rush and lacks many details, from both the technical and the user-interface perspectives. For example, a concrete Arangopipe UI example would help readers follow the paper more easily, as would showing how to use AQL to search, how to launch the Docker images, etc. In many paragraphs, the authors simply put 'Please see ...' with a link; it would be better to address the point with some details.

Response: The application that we used to author the initial version rendered the article very differently from the PDF version. An online context for all referenced resources seemed to work well in that tool.
However, it looks like the exported version had many issues, perhaps because of our unfamiliarity with the tool. We have revised the manuscript along the lines of a conventional paper, prepared in an environment best suited for us (LaTeX). In the revised submission we have provided substantial context for all examples, with excerpts that illustrate the important steps.

3. Some sentences seem to be incomplete. For example, Page 4, The details of doing this are provided in section??, Page 5 notebooks (see section ). ??

Response: This was an artifact of generating the PDF from the authoring tool. The revised manuscript does not have these issues.

4. One of my concerns is how this tool manages collaboration among colleagues; for example, what happens if a model is retrained by another colleague?

Response: The package offers a lookup method that can be used to search for any project asset and then perform updates on it. See Figure 7 (lookup a dataset) in the revised manuscript. Project team members can browse project assets using either AQL (through the web interface) or the Arangopipe UI. Having a naming convention for project assets is a relevant and important point.

Review Comments, Reviewer #3:

Comment: In this paper, the authors present a tool for ML metadata management. The system presents a multi-model database structure and a supplemental API to record the activities and parameters within a machine learning project. The authors motivate the need for such a system; however, the submission is very light on the technical implementation. Below is a series of comments, section by section.

Data Science Workflow: This section presents a graph model used to represent a machine learning project lifecycle, with nodes representing the key components. The authors allude to how this graph model can be used to represent different activities (e.g. “Hyperparameter tuning experiments”). However, they provide no example of what this would look like. In addition, the authors state that the graph can be extended, but instead of documenting how this process works, they refer the reader to examples in the project's GitHub repo.

Response:

1. The key idea is that this data model fits most machine learning project activities.
2. This model was developed after reviewing standards development efforts that identified the key abstractions needed to capture machine learning project tasks. However, should you need to add refinements, there are methods provided in the library for that purpose.
3. Each example in the illustrative examples section is an instance of the data model. The purpose of the illustrative examples section is to illustrate exactly this. A review of the examples would reveal that each example uses the same data model and the same methods to create and update the data model.
4. With that said, it appears that the following is not obvious from the narrative of the data model:
   1. The data model is general and should fit most machine learning workflows.
   2. Exploratory data analysis, model selection, and hyper-parameter tuning are examples of such workflows.
   3. Each workflow would use the same data model to capture meta-data from the machine learning project activity. The assets in the data model are the same; the properties of the assets can be different. The document-oriented feature of the database provides this flexibility.
   4. The basic template for the workflow is to define the properties of a resource, if it is new, as a document. If it is an existing resource, retrieve it using a search and update its properties.
   5. If you need to use a new resource that is not provided in the data model, there is an API to create this resource.

The reviewer’s point about pointing to the repository for details is well taken. In the revised document, the key ideas for each example are summarized.
The reader is then pointed to the GitHub repository for details. In the revised submission, the data science workflow section has been updated to reflect the above ideas. We have provided excerpts that illustrate the key features in the revised submission.

Comment: Software Implementation. The authors list the components within the application, starting with the API; however, no examples of API endpoints are provided, so it is hard to quantify the functionality of the API.

Response: Duly noted. Arangopipe is a Python package, so the term API probably suggested HTTP endpoints. These are not available now, but could be available shortly. It is a good suggestion. The relevant component is now called a Python package rather than an API in the revised submission.

Comment: The authors state that ArangoDB is the database used; however, in the previous section the authors state “To use Arangopipe in your organization, the graph used to track machine learning and data science activities needs to be provisioned. This graph is called Enterprise ML Tracker Graph and is provisioned using the administrative interface”. Does this mean the ML tracker graph is a “component” within ArangoDB?

Response: The Enterprise ML Tracker Graph is a graph in the database. A graph in a graph database is a fundamental abstraction in the data model. The Enterprise ML Tracker Graph is like the set of tables that you would need to start data capture in a relational database. When Arangopipe is provisioned, i.e., when an administrator creates an instance of the administrative component using the Python API, this graph is created. Figure 5 (Provision Arangopipe) in the revised manuscript provides the excerpt that performs this step.

Comment: In addition, no information on AQL is provided. What is the syntax? What does a query in AQL look like?

Response: Noted. A comprehensive introduction to AQL, its syntax and its usage is available on the ArangoDB website.
The section describing the ArangoDB web user interface has been updated to include this information. The section has also been updated with a reference to using AQL with the web user interface. See the section, Using the Arangopipe Web User-Interface, in the revised document.

Comment: The authors then state that a web user interface is a component of the system; a screenshot of the web user interface would be appreciated.

Response: This has been added to the revised document.

Comment: The sentence “It is possible to use Arangopipe with Oasis, Arango DB’s managed service offering on the cloud. This would require no installations or downloads. The details of doing this are provided in section” is missing the section number at the end.

Response: This has been resolved. It is an artifact of the tool used for preparing the initial submission.

Comment: The authors state that Arangopipe is available as a series of container images. Are these container images a “component” of the overall system, or are they merely a means of distribution and deployment of said system?

Response: They are a means of distribution. The advantage is that they are a self-contained executable unit. If you have Docker installed, you can run the Arangopipe container with PyTorch installed by simply running:

    docker run -p 6529:8529 -p 8888:8888 -p 3000:3000 -it arangopipe/ap_torch

Comment: The authors state “The administrator would use the administration API”. Is this a separate API from the one mentioned previously?

Illustrative Examples of Arangopipe: This section (which I assume is to demonstrate the usage of the system) immediately starts by redirecting the user to GitHub and proceeds to describe a notebook. The rest of this section proceeds to describe this notebook. Having some examples within the paper on how to use the system would be a better approach.

Response: As indicated in the revised Data Science Workflow section, the same data model is used to capture meta-data from a range of machine learning experiment activities. The “Illustrative Examples” section now provides the basic template used for tracking all machine learning project activity, as well as the excerpts that illustrate how a particular step in the basic template is performed.

Comment: Reusing Archived Steps. This is a key issue within ML; however, the authors again redirect the reader to user guides for actual information, e.g. “Please see using Arangopipe with TFDV for exploratory data analysis for an example of performing this in an exploratory data analysis task using tensor flow data validation tensorflow data validation. Please see performing hyper-parameter optimization with Arangopipe for an example with a hyperparameter tuning experiment.” Why are these steps not outlined within the text itself? The authors state “Node data and results from modeling are easily converted into JSON, which is the format that ArangoDB stores documents in.” If this is the format they are stored in, how are they converted? Is it not just a read at that point? If some transformation is necessary, what is that transformation?

Extending the data model: This paragraph serves to point the user to the GitHub repo.

Response: TFDV, Hyperopt examples: The process of using Arangopipe to capture meta-data from an exploratory data analysis task or a hyper-parameter tuning task is similar to the basic workflow. The only deviation is the step needed to serialize programmatic objects to JSON and to deserialize JSON objects back to programmatic objects. There are JSON serializer-deserializers available for this purpose. The section, Reusing Archived Steps, has been updated to discuss these steps. An excerpt illustrating this process is also provided.

Extending the data model: An excerpt discussing an example is now provided in the section, Extending the data model.
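
The serialization step described in the response above can be illustrated with a minimal sketch. It uses only Python's standard json and dataclasses modules and is not taken from the Arangopipe package; the TuningResult class, its fields, and the helper names are hypothetical and chosen purely for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TuningResult:
    """Hypothetical container for one hyper-parameter tuning trial."""
    model_name: str
    learning_rate: float
    max_depth: int
    rmse: float


def to_document(result: TuningResult) -> str:
    """Serialize a programmatic object to a JSON document string."""
    return json.dumps(asdict(result), sort_keys=True)


def from_document(document: str) -> TuningResult:
    """Deserialize a JSON document string back into a programmatic object."""
    return TuningResult(**json.loads(document))


# Round trip: object -> JSON document -> object.
trial = TuningResult(model_name="gbm", learning_rate=0.1, max_depth=4, rmse=0.72)
doc = to_document(trial)
restored = from_document(doc)
```

In this sketch the JSON string is what a document store would persist, and the restored object is what a later experiment would work with when reusing an archived step.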
Comment: Experimenting and documenting facts about models and data. This paragraph points to the GitHub repo and does not contain any technical information.

Response: Model bias and model variance are model properties that are of interest to model developers. The work reports a resource to capture such facts. The link shows how this can be done. A detailed discussion of computing model bias and model variance is outside the scope of the work. However, the resource pointed to in the paragraph does describe it for the interested reader. Further, the notebook also contains pointers to material that can provide more background should the reader be interested.

Comment: Checking the validity and effectiveness of machine learning models after deployment. The authors state “Arangopipe provides an extensible API to check for dataset drift”. What is this API? What are the inputs and outputs?

Response: An excerpt showing the use of the relevant method in Arangopipe is now included in the section “Checking the validity and effectiveness of machine learning models after deployment”.

Comment: Storing Features From Model Development. The authors state “Arangopipe can be used to capture features generated from machine development.” However, no information on how this is achieved is presented.

Response: The example referred to in the section, https://github.com/arangoml/networkx-adapter/blob/master/examples/IMDB_N... , pertains to storing node embeddings from a node2vec model in Arangopipe. The description of the context around the problem and the model development falls significantly outside the scope of this paper. The point here is that capturing features, such as embeddings, can be done with Arangopipe. A notebook that describes a problem and an illustration of the feature is provided. The notebook does illustrate how this is achieved.

Overall comments: The paper is incredibly light on the technical components, with the authors repeatedly telling the reader to go to their GitHub repository instead, making the paper read more like an advertisement for the project than an academic paper. I am concerned about this level of offloading the technical documentation to links on external websites. What if the project migrates to a new repo name, or to a different version control system altogether? Such crucial documentation would be lost. While I believe that the project indeed addresses fundamental issues within ML, I cannot recommend this paper due to the severe lack of information provided within the text.

Response: This is a resource paper. The intents of this paper are:

• To identify tracking machine learning meta-data from ML experiments as a problem that has practical utility to the machine learning community.
• To show that a graph data model with features stored as documents presents some unique advantages in capturing a data model for this problem.
• The provided data model is flexible and can capture machine learning meta-data from a range of representative machine learning tasks.
• The provided data model is extensible.
• There is a range of options to get started using this resource easily.
• There are detailed examples that show how the tool can be used in a range of common scenarios.
• Documentation to build and test it is available. The test suite is comprehensive and shows the semantics of using each and every feature in the library.
• ArangoDB is committed to the development of this tool and welcomes feedback in the form of questions and suggestions through either the GitHub repository or the Slack channel.
• ArangoDB is actively looking at emerging standards in this area for alignment.

The HTML preference in authoring tools led the authors to believe that providing online contexts for the representative scenarios was a new approach.
However, it looks like there were several problems with this. It looks like this approach has left gaps in providing sufficient information for the illustrative examples. We have now revised the manuscript as a conventional paper, taking all the suggestions from the review into consideration. We do believe that the illustrative examples and the revised content illustrate the intents described above. We hope that the revised content alleviates your concerns.

The concerns about the repo moving are understandable, but ArangoDB has been developing open-source projects for a while. Issues such as these have well-defined, clear redirects, should they arise. The code is under the Apache 2 license and is accessible to everyone.
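
The basic workflow template mentioned in the response to Reviewer #3 (define a new resource as a document, or look up an existing resource and update its properties) can be sketched roughly as follows. This is an illustration under stated assumptions, not the Arangopipe API: InMemoryAssetStore, record_asset, and the asset names are hypothetical stand-ins, and a plain dictionary takes the place of the database.

```python
from typing import Dict, Optional


class InMemoryAssetStore:
    """Hypothetical stand-in for a document store; not the Arangopipe API."""

    def __init__(self) -> None:
        self._assets: Dict[str, dict] = {}

    def lookup(self, name: str) -> Optional[dict]:
        """Return the stored document for this name, or None if absent."""
        return self._assets.get(name)

    def register(self, properties: dict) -> dict:
        """Store a new asset, keyed by its 'name' property."""
        self._assets[properties["name"]] = dict(properties)
        return self._assets[properties["name"]]

    def update(self, name: str, changes: dict) -> dict:
        """Merge changed properties into an existing asset."""
        self._assets[name].update(changes)
        return self._assets[name]


def record_asset(store: InMemoryAssetStore, properties: dict) -> dict:
    """Register the asset if it is new; otherwise look it up and update it."""
    existing = store.lookup(properties["name"])
    if existing is None:
        return store.register(properties)
    return store.update(properties["name"], properties)


store = InMemoryAssetStore()
# First call creates the asset; second call finds it by name and updates it.
record_asset(store, {"name": "housing_dataset", "source": "csv", "rows": 1000})
updated = record_asset(store, {"name": "housing_dataset", "rows": 1200})
```

Keying assets by name in this sketch is also why a shared naming convention matters when several team members update the same project assets.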

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Mon, 07/12/2021 - 10:40

I am delighted to inform you that your paper has been accepted for publication! This acceptance is on condition that you address all the remaining minor issues concerning presentation.

Brian Davis (https://orcid.org/0000-0002-5759-2655)