Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference

Tracking #: 803-1783

Authors:

	Name	ORCID
	Wilfredo García	https://orcid.org/0000-0002-9620-1119

Responsible editor:

Ruben Verborgh

Submission Type:

Research Paper

Abstract:

The human-readable simplicity with which the CSV format was devised, together with the absence of a standard that strictly defines this format, has allowed the proliferation of several variants in the dialects with which these files are written. This has meant that the exchange of information between data management systems, or between countries and regions, requires human intervention during the data mining and cleansing process. It has also led to the development of various computational tools that aim to accurately determine the dialects of CSV files, in order to avoid data loss at the data loading stage in a given system. However, dialect detection is a complex problem, and current systems have limitations or make assumptions that need to be improved and/or extended. This paper proposes a method for determining CSV file dialects through table uniformity, a statistical approach based on measuring table consistency and record dispersion, along with detection of the data type of each field. The new method has 100% accuracy on a dataset of 148 CSV files composed of samples coming from a data load testing framework and some others added as verification of the parsing routines. In tests on truly messy data, the proposed solution outperforms the state-of-the-art tool, improving the accuracy with which dialects are detected by about 10%. Furthermore, the proposed method is accurate enough to determine dialects by reading only ten records, requiring more data only to disambiguate those cases where the first records do not contain the information needed to determine the dialect.

Manuscript:

ds-paper-803.pdf

Supplementary Files (optional):

ds-supplementary-803-1302.zip

Revised Version:

Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference

Data repository URLs:

https://github.com/ws-garcia/CSVsniffer

Date of Submission:

Friday, March 15, 2024

Date of Decision:

Thursday, May 23, 2024

Nanopublication URLs:

Decision:

Undecided

Solicited Reviews:

Review #1 submitted on 09/Apr/2024

By Dylan Van Assche

https://orcid.org/0000-0002-7195-9935

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Good
Presentation: Weak
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper introduces a new approach for detecting the CSV dialect in CSV files by analyzing the table uniformity and inference of data types. The approach is compared against CleverCSV on 2 datasets.

Reasons to accept:

- Interesting idea on how to detect CSV file dialects
- Clearly written with almost no typos
- Proper definitions and explanations on how the approach works

Reasons to reject:

- Paper structure should be improved
- Lack of FAIR principles regarding the datasets, tools, and evaluation. Everything should be available in a repository with DOI to make things reproducible in the future.
- Evaluation should be expanded with more datasets like CSVW test cases.

Nanopublication comments:

Further comments:

Major Revision
----------------------

# Summary

# General comments

## Quality, importance, and impact
The paper presents an interesting approach to detecting CSV dialects, a task that seems straightforward but is a common problem in data integration because of the many dialects in use. The approach shows an improvement in accuracy over the state of the art. I had some difficulty following the structure of the paper at first (see clarity and readability below); adjusting the structure would benefit the quality of the paper. The abstract first reports an accuracy of 100%, which is interesting, but shortly afterwards mentions a 10% increase in accuracy on truly messy CSV files. I would adjust these sentences so that the reader does not get the impression that this approach has 100% accuracy on everything.

## Clarity and readability
The paper is written clearly and almost without any typos. The paper explains clearly the problem the author tries to solve and the algorithms the author used in the approach to reach a higher accuracy than CleverCSV. However, the structure of the paper could be improved to have a clear outline of introduction-approach-evaluation-setup-experiments-results-discussion-conclusion.

## Provided data and its sustainability
The approach is compared against CleverCSV on 2 datasets: Pollock framework GitHub repository and CleverCSV repository. However, no reference or sustainable DOI link is provided for the datasets. It would be great to have the resources on Zenodo or any other platform to make sure that the datasets remain available in the future.

# Detailed comments

## Abstract
Avoid confusing the reader with 100% accuracy (see general comments).

## Introduction
The CSV header is mentioned as optional and a potential problem. How does this approach handle the presence or lack of a CSV header in terms of accuracy without configuration?
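
For context on the header question, Python's standard library offers a baseline: `csv.Sniffer.has_header` guesses header presence from how consistently each column is typed. The sketch below is not the paper's method, and the two semicolon-delimited samples are made up for illustration.

```python
import csv

# Hypothetical samples: the same records with and without a header row.
WITH_HEADER = "name;age;city\nAlice;30;Ghent\nBob;25;Bruges\nCarol;41;Antwerp\n"
WITHOUT_HEADER = "Alice;30;Ghent\nBob;25;Bruges\nCarol;41;Antwerp\n"

sniffer = csv.Sniffer()

# Restrict the candidate delimiters to the two most common ones.
dialect = sniffer.sniff(WITH_HEADER, delimiters=";,")

# has_header compares the first row against the types seen in later rows.
header_found = sniffer.has_header(WITH_HEADER)
header_missing = sniffer.has_header(WITHOUT_HEADER)

print(dialect.delimiter, header_found, header_missing)
```

Reporting the same kind of with/without-header comparison for the proposed method would answer the question directly.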

It would be beneficial to state more clearly in the introduction what the approach is and how the reader benefits from it compared to the state of the art. Currently, only the problem and why it is an interesting problem are mentioned.

References: RFC 4180 link/reference missing

## Related Work

Typos: ad-dressed --> addressed

## Problem formulation
This section reads more like a Preliminaries section that the reader should know before diving into the rest of the paper. The problem statement is already provided in the introduction. I would rename this section.

## Table uniformity
This section is part of the approach. I would move all the sections related to the approach under a new section Approach with the sections as subsections to improve the structure of the paper. This way, the approach can be introduced better before going into details such as table uniformity, type detections, and determining CSV dialects.

## Type detection
The same comment applies here about the Approach section (see Table uniformity).
This section lists several field types which can be detected. I wonder how this approach deals with:
- Different sizes of integers/floats
- Different floating point notations
- IPv6 as only IPv4 is listed
- Empty data: there are many ways to represent it, and they are hard to detect. Moreover, what if the rows supplied to the approach contain a large number of empty values?
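
The cases above can be turned into concrete probes. The following is a minimal sketch using only Python built-ins, not the author's type detector; `probe_field` and the `EMPTY_MARKERS` set are hypothetical names and values chosen for illustration.

```python
import ipaddress

# Assumed set of empty-value markers; real data uses many more variants.
EMPTY_MARKERS = {"", "na", "n/a", "null", "none", "nan"}


def probe_field(value: str) -> str:
    """Classify one field with built-in parsers (illustrative only)."""
    text = value.strip()
    if text.lower() in EMPTY_MARKERS:
        return "empty"
    try:
        int(text)  # Python integers are arbitrary precision
        return "integer"
    except ValueError:
        pass
    try:
        float(text)  # accepts scientific notation, rejects decimal commas
        return "float"
    except ValueError:
        pass
    try:
        address = ipaddress.ip_address(text)  # handles IPv4 and IPv6
        return f"ipv{address.version}"
    except ValueError:
        pass
    return "text"


for sample in ["18446744073709551616", "1e-3", "1,5", "192.168.0.1", "2001:db8::1", " N/A "]:
    print(repr(sample), "->", probe_field(sample))
```

Note that `"1,5"` falls through to `text`: a decimal comma is exactly the kind of notation that interacts with delimiter detection, so it would be useful to know how the proposed method treats it.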

## Determining CSV file dialects
The same comment applies here about the Approach section (see Table uniformity).
This section already gives a lot of numbers at the end and reads like an experiment. I would move it to the Experiments section instead or reduce it to a simpler example.

## Evaluation setup
This section is missing. Ideally it would explain in detail how the evaluation is performed, which datasets, which versions of the tools, which experiments were performed, etc.

One of the datasets I’m missing in the evaluation is the CSVW test cases from the W3C Working Group. They provide an interesting list of CSV files with different dialects: https://w3c.github.io/csvw/tests/. This should be included in the experiments as one of the use cases is automatically analyzing CSV data from various data portals from the Internet.

## Experiments
The datasets used in the experiments are listed here, but without any DOI or link (FAIR principles). They would be better placed in the missing Evaluation setup section. This section is more of a Results section, but has the evaluation setup mixed in.

## Discussion
No comments.

## Conclusion
This section is missing from the paper. When added, it can summarize the findings, list future work, etc. This way the reader knows what the possible gaps are and what the outcome of this paper is.

Review #2 submitted on 16/May/2024

By Sean R. Wilkinson

https://orcid.org/0000-0002-1443-7479

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper details new methods for detecting the dialect of CSV data files, including an implementation in Python and comparison to an existing Python module. The problem presented is that CSV files exhibit considerable variability due to the absence of strict standards; this variability presents obstacles for constructing general methods for detecting (and therefore reading and using) CSV files. The author's implementation is shown to outperform the existing Python module, CleverCSV.

Reasons to accept:

This is a very careful treatment of a problem that is often patched over with human guesswork and haphazard regular expressions. There is a mathematical model here which does actually clarify the problem and provide some level of rigor. The readers of Data Science are also likely to be interested in the tool because its accuracy exceeds that of the competitor presented by 10% on "difficult" files.

Reasons to reject:

The tool and data files are published on GitHub, which is not truly persistent. One of the data files is attached to a GitHub Issue, which really doesn't justify the author's statement that it is "published in the CleverCSV repository on GitHub". I would recommend uploading at least the code for the tool's implementation to Zenodo for persistence and to receive a DOI that can be referenced inside this paper.

Nanopublication comments:

Further comments:

It would be nice for the author to elaborate on how this method affects the greater data ecosystem. How often do exotic CSV dialects actually appear in the wild? How much practical impact could this tool have on data mining, for example? The author only considers accuracy of dialect detection; what about runtime performance? Does this method take longer to execute than the less-accurate methods used by CleverCSV? Why not just train a Large Language Model for this?
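
The runtime question could be answered with a small timing harness. The sketch below is an assumption about how such a comparison might be set up: `time_detection` is a hypothetical helper, the sample is synthetic, and the standard library's `csv.Sniffer` stands in for whichever detectors are actually compared.

```python
import csv
import time


def time_detection(detect, sample, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        detect(sample)
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic comma-delimited sample; real benchmarks should use the paper's datasets.
sample = "\n".join(f"{i},{i * 2},row{i}" for i in range(200))


def stdlib_detect(text):
    # Stand-in detector; swap in the tools under comparison here.
    return csv.Sniffer().sniff(text, delimiters=",;")


elapsed = time_detection(stdlib_detect, sample)
print(f"best of 5 runs: {elapsed:.6f} s")
```

Reporting the best of several runs reduces noise from other processes; per-file timings alongside accuracy would show whether the accuracy gain comes at a runtime cost.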

Essentially, there is neither a Conclusions section nor discussion to place this tool into a broader context, and those might be very useful to include, even if only as future work.

1 Comment

meta-review by editor

Submitted by Tobias Kuhn on Thu, 05/23/2024 - 03:49

The reviewers see strong points in the presented work, but also a few things that require some more work, including the proper publication of the dataset. Zenodo, for example, allows for automatic GitHub import, which might be a good solution in this case.

Tobias Kuhn (https://orcid.org/0000-0002-1267-0234) on behalf of Ruben Verborgh