Reviewer has chosen not to be Anonymous
Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Good
Presentation: Weak
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
This paper introduces a new approach for detecting the dialect of CSV files by analyzing table uniformity and inferring data types. The approach is compared against CleverCSV on two datasets.
Reasons to accept:
- Interesting idea on how to detect CSV file dialects
- Clearly written with almost no typos
- Proper definitions and explanations of how the approach works
Reasons to reject:
- Paper structure should be improved
- Lack of adherence to the FAIR principles regarding the datasets, tools, and evaluation. Everything should be available in a repository with a DOI to keep the work reproducible in the future.
- The evaluation should be expanded with more datasets, such as the CSVW test cases.
Nanopublication comments:
NA
Further comments:
Major Revision
----------------------
# Summary
This paper introduces a new approach for detecting the dialect of CSV files by analyzing table uniformity and inferring data types. The approach is compared against CleverCSV on two datasets.
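To make the problem concrete: a CSV dialect bundles the delimiter, quote character, and escape conventions, and the same logical table can be serialized under many of them. The following minimal sketch (my own illustration, not the author's code; the sample strings are invented) uses Python's standard-library `csv.Sniffer` as a baseline detector, whose heuristics often break on messy input, which is exactly what this paper and CleverCSV aim to improve on:

```python
import csv

# Two serializations of the same two-column table under different dialects.
samples = [
    'name,age\n"Doe, Jane",42\n',  # comma-delimited, double-quoted
    "name;age\nDoe, Jane;42\n",    # semicolon-delimited, unquoted
]

for sample in samples:
    try:
        dialect = csv.Sniffer().sniff(sample)
        print(repr(dialect.delimiter), repr(dialect.quotechar))
    except csv.Error as exc:
        # The stdlib heuristics frequently fail on short or messy samples.
        print("sniffing failed:", exc)
```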
# General comments
## Quality, importance, and impact
The paper presents an interesting approach to detecting the dialects of CSV files. It seems straightforward, but it targets a common problem in data integration caused by the many dialects in which CSV files are written. The presented approach shows an improvement in accuracy compared to the state of the art. I encountered some difficulties with understanding the structure of the paper at first (see clarity and readability below); adjusting the structure would benefit the quality of the paper. When starting to read the paper, I encountered an accuracy of 100%, which was interesting, but a bit later the abstract mentions an increase of 10% in accuracy on truly messy CSV files. I would adjust these sentences to avoid giving the reader the impression that this approach has 100% accuracy for everything.
## Clarity and readability
The paper is written clearly and almost without any typos. It clearly explains the problem the author tries to solve and the algorithms the author uses to reach a higher accuracy than CleverCSV. However, the structure of the paper could be improved to follow a clear outline: introduction, approach, evaluation setup, experiments, results, discussion, and conclusion.
## Provided data and its sustainability
The approach is compared against CleverCSV on two datasets: the Pollock framework GitHub repository and the CleverCSV repository. However, no reference or sustainable DOI link is provided for the datasets. It would be great to have the resources on Zenodo or another platform to make sure that the datasets remain available in the future.
# Detailed comments
## Abstract
Avoid confusing the reader with 100% accuracy (see general comments).
## Introduction
The CSV header is mentioned as optional and a potential problem. How does this approach handle the presence or absence of a CSV header, in terms of accuracy, without configuration?
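To illustrate why this matters: the standard-library heuristic `csv.Sniffer.has_header` guesses a header by comparing the types and lengths of the first row against the remaining rows, and it has little signal when every column contains strings. A minimal sketch (invented data, not the paper's method):

```python
import csv

with_header = "city,country\nParis,France\nBerlin,Germany\n"
without_header = "Paris,France\nBerlin,Germany\n"

sniffer = csv.Sniffer()
# Both calls may well return the same answer: with all-string columns,
# the heuristic cannot reliably distinguish a header row from data rows.
print(sniffer.has_header(with_header))
print(sniffer.has_header(without_header))
```

A brief statement on how the proposed approach behaves in such cases would strengthen the introduction.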
It would be beneficial to make clearer in the introduction what the approach actually is and how the reader benefits from it compared to the state of the art. Currently, only the problem and why it is an interesting problem are mentioned.
References: RFC 4180 link/reference missing
## Related Work
Typos: ad-dressed --> addressed
## Problem formulation
This section reads more like a Preliminaries section presenting background the reader should know before diving into the rest of the paper. The problem statement is already provided in the introduction. I would rename this section.
## Table uniformity
This section is part of the approach. I would move all the sections related to the approach under a new Approach section, with the current sections as subsections, to improve the structure of the paper. This way, the approach can be introduced properly before going into details such as table uniformity, type detection, and determining CSV dialects.
## Type detection
The same comment applies here about the Approach section (see Table uniformity).
This section lists several field types which can be detected. I wonder how this approach deals with the following cases (illustrated in the sketch after this list):
- Different sizes of integers/floats
- Different floating point notations
- IPv6 addresses, as only IPv4 is listed
- Empty data: there are many representations, which are hard to detect. Moreover, what if the rows supplied to the approach contain a large number of empty values?
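For example, a deliberately naive detector like the following sketch (my own illustration, not the paper's algorithm; the value list is invented) shows where each of these cases bites:

```python
import ipaddress

def naive_type(value: str) -> str:
    """A deliberately naive detector, to show where the cases above bite."""
    try:
        ipaddress.ip_address(value)  # handles IPv4 and IPv6 alike
        return "ip"
    except ValueError:
        pass
    try:
        # Python ints are unbounded; a fixed-width (e.g. 64-bit) parser
        # would overflow on very large values like the first example below.
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)  # accepts "1.5e-3" but rejects the locale form "1,5"
        return "float"
    except ValueError:
        return "empty" if value.strip() in {"", "NA", "null", "-"} else "string"

for v in ["9223372036854775808", "1,5", "1.5e-3", "2001:db8::1", "", "NA"]:
    print(repr(v), "->", naive_type(v))
```

Clarifying which of these cases the type detection covers, and how heavily empty values weigh in, would strengthen this section.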
## Determining CSV file dialects
The same comment applies here about the Approach section (see Table uniformity).
This section already gives a lot of numbers at the end and reads like an experiment. I would move that material to the Experiments section instead, or reduce it to a simpler example.
## Evaluation setup
This section is missing. Ideally, it would explain in detail how the evaluation is performed: which datasets, which versions of the tools, which experiments were performed, etc.
One of the datasets I am missing in the evaluation is the CSVW test cases from the W3C Working Group. They provide an interesting list of CSV files with different dialects: https://w3c.github.io/csvw/tests/. These should be included in the experiments, as one of the stated use cases is automatically analyzing CSV data from various data portals on the Internet.
## Experiments
The datasets used in the experiments are listed here, but without any DOI or link (FAIR principles). This material would be better moved to the missing Evaluation setup section. As it stands, this section is more of a Results section with the evaluation setup mixed in.
## Discussion
No comments.
## Conclusion
This section is missing from the paper. When added, it can summarize the findings, list future work, etc., so that the reader knows what the remaining gaps are and what the outcome of this paper is.
meta-review by editor
Submitted by Tobias Kuhn on
The reviewers see strong points in the presented work, but also a few things that require some more work, including the proper publication of the dataset. Zenodo, for example, allows for automatic GitHub import, which might be a good solution in this case.
Tobias Kuhn (https://orcid.org/0000-0002-1267-0234) on behalf of Ruben Verborgh