Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference

Tracking #: 803-1783

Authors:

Wilfredo García (ORCID: https://orcid.org/0000-0002-9620-1119)


Responsible editor: 

Ruben Verborgh

Submission Type: 

Research Paper

Abstract: 

The human-readable simplicity of the CSV format, together with the absence of a standard that strictly defines it, has allowed many variants to proliferate in the dialects with which these files are written. As a result, exchanging information between data management systems, or between countries and regions, requires human intervention during data mining and cleansing. This has led to the development of various computational tools that aim to accurately determine the dialects of CSV files, in order to avoid data loss at the data loading stage of a given system. However, dialect detection is a complex problem, and current systems have limitations or make assumptions that need to be improved or extended. This paper proposes a method for determining CSV file dialects through table uniformity, a statistical approach based on measuring table consistency and record dispersion, combined with data type detection over each field. The new method achieves 100% accuracy on a dataset of 148 CSV files, composed of samples from a data load testing framework and others added to verify the parsing routines. In tests on truly messy data, the proposed solution outperforms the state-of-the-art tool, improving dialect detection accuracy by about 10%. Furthermore, the proposed method is accurate enough to determine dialects by reading only ten records; it requires more data only to disambiguate cases where the first records do not contain enough information to determine the dialect.
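The general idea in the abstract can be illustrated with a minimal sketch. It is not the paper's scoring function, which is not given on this page. It assumes a simplified score: the dispersion of field counts per record (population standard deviation) is turned into a uniformity value, which is multiplied by the fraction of fields that match a recognizable data type. The names `TYPE_PATTERNS`, `known_type`, `score`, and `detect_dialect`, the pattern set, and the candidate delimiter and quote characters are all hypothetical choices made for illustration.

```python
import csv
import io
import re
import statistics
from itertools import product

# Hypothetical type patterns; the paper's actual type detectors are not shown here.
TYPE_PATTERNS = [
    re.compile(r"^[+-]?\d+$"),                # integer
    re.compile(r"^[+-]?\d*[.,]\d+$"),         # decimal number (dot or comma)
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),       # ISO date
    re.compile(r"^[A-Za-z][A-Za-z _-]*$"),    # plain text token
]


def known_type(field):
    """Return True if the field is empty or matches one of the assumed types."""
    field = field.strip()
    return field == "" or any(p.match(field) for p in TYPE_PATTERNS)


def score(text, delimiter, quotechar, max_records=10):
    """Score one candidate dialect on the first max_records records."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    rows = [row for _, row in zip(range(max_records), reader) if row]
    if not rows:
        return 0.0
    counts = [len(row) for row in rows]
    if max(counts) < 2:
        # A candidate that never splits a record carries no evidence.
        return 0.0
    # Low dispersion of field counts means a more uniform table.
    uniformity = 1.0 / (1.0 + statistics.pstdev(counts))
    fields = [field for row in rows for field in row]
    type_score = sum(known_type(f) for f in fields) / len(fields)
    return uniformity * type_score


def detect_dialect(text, delimiters=",;\t|", quotechars="\"'"):
    """Return the (delimiter, quotechar) pair with the highest score."""
    return max(product(delimiters, quotechars), key=lambda d: score(text, *d))


sample = "id;name;price\n1;apple;1,50\n2;pear;0,75\n"
delimiter, quotechar = detect_dialect(sample)
```

In this sample the comma appears only as a decimal separator, so splitting on it yields uneven records and fields with no recognizable type, while splitting on the semicolon yields three well-typed fields per record. Ties between candidates (here, the two quote characters) are resolved by candidate order, which is a simplification of this sketch.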

Tags: 

  • Reviewed

Date of Submission: 

Friday, March 15, 2024

Date of Decision: 

Thursday, May 23, 2024

Decision: 

Undecided

meta-review by editor

The reviewers see strong points in the presented work, but also a few things that require more work, including the proper publication of the dataset. Zenodo, for example, allows for automatic GitHub import, which might be a good solution in this case.

Tobias Kuhn (https://orcid.org/0000-0002-1267-0234) on behalf of Ruben Verborgh