Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference

Tracking #: 842-1822

Authors:

	Name	ORCID
	Wilfredo García	https://orcid.org/0000-0002-9620-1119

Responsible editor:

Tobias Kuhn

Submission Type:

Research Paper

Abstract:

The human-readable simplicity with which the CSV format was devised, together with the absence of a standard that strictly defines this format, has allowed the proliferation of several variants in the dialects with which these files are written. As a result, the exchange of information between data management systems, or between countries and regions, requires human intervention during the data mining and cleansing process. This has led to the development of various computational tools that aim to accurately determine the dialects of CSV files, in order to avoid data loss at the data loading stage in a given system. However, dialect detection is a complex problem, and current systems have limitations or make assumptions that need to be improved and/or extended. This paper proposes a method for determining CSV file dialects through table uniformity, a statistical approach based on measuring table consistency and record dispersion, along with the detection of the data type of each field. The new method has a 93.38% average accuracy on a dataset of 548 CSV files composed of samples from a data load testing framework, the test suite provided by the CSV on the Web Working Group (CSVW), a curated experimental data set from the development of a similar tool, and some other CSV files added to verify the parsing routines. In tests, the proposed solution outperforms the state-of-the-art tool by achieving an average improvement of 16.45%, resulting in a net increment of about 10% in the accuracy with which dialects are detected on truly messy data for this research dataset. Furthermore, the proposed method is accurate enough to determine dialects by reading only ten records, requiring more data only to disambiguate those cases where the first records do not contain the information necessary to determine the dialect.

Manuscript:

ds-paper-842.pdf

Supplementary Files (optional):

ds-supplementary-842-1362.zip

Previous Version:

Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference

Data repository URLs:

https://zenodo.org/records/11331538

Date of Submission:

Sunday, June 2, 2024

Date of Decision:

Tuesday, July 2, 2024

Nanopublication URLs:

Decision:

Solicited Reviews:

Review #1 submitted on 14/Jun/2024

By Dylan Van Assche

https://orcid.org/0000-0002-7195-9935

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

# Summary
This paper introduces a new approach for detecting the CSV dialect in CSV files by analyzing the table uniformity and inference of data types. The approach is compared against CleverCSV on 3 datasets.

# General comments

## Quality, importance, and impact
The paper presents an interesting approach to detecting CSV dialects in CSV files, a task that seems straightforward but is a common problem in data integration because of the many dialects in CSV files. The approach presented shows an improvement in accuracy compared to the state-of-the-art. The paper clarifies the questions I had regarding the results and approach in my previous review. The approach is now also evaluated on the W3C CSVW examples, which improves the evaluation significantly since CSVW provides a lot of examples of how CSVs can be structured.

## Clarity and readability
The paper is written clearly and almost without any typos. The paper explains clearly the problem the author tries to solve and the algorithms the author used in the approach to reach a higher accuracy than CleverCSV. The structure is improved compared to the previous version, improving clarity and readability for readers.

## Provided data and its sustainability
The approach is compared against CleverCSV on 3 datasets: Pollock framework GitHub repository, CleverCSV repository, and W3C CSVW. The data is provided on Zenodo with a DOI following FAIR principles.

# Detailed comments
I only have a few minor comments for each section in the form of typos.

## Abstract
test suit -> test suite

## Introduction
I would suggest using different words for ‘congruence’ and ‘palpable’.

## Related Work
- Consider adding a ‘~’ in LaTeX to avoid splitting names over multiple lines. For example, in ‘In 2017, T. D…’, ‘T.’ is on the first line and the last name is on the next line.
- There’s a mix of ‘et. al.’ and ‘et al.’

## Evaluation Setup
‘all the 99 survey having’ -> survey plural?

## Results
‘%’ sometimes appears with a space after the number and sometimes without.

Reasons to accept:

- Interesting idea on how to detect CSV file dialects
- Clearly written with almost no typos
- Proper definitions and explanations on how the approach works
- The structure is improved compared to the first revision.
- Evaluation expanded to CSVW examples as well.
- DOI available with resources on Zenodo.

Reasons to reject:

None

Nanopublication comments:

None

Further comments:

None

Review #2 submitted on 24/Jun/2024

By Sean R. Wilkinson

https://orcid.org/0000-0002-1443-7479

Review Details

Reviewer has chosen not to be Anonymous

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The data have been uploaded to Zenodo, and the authors have included a discussion of runtime performance, as requested by this reviewer. The introduction section is also more thorough now, and overall, the paper reads better than before.

Reasons to accept:

The reasons to accept have not changed since the first round, and now, there are also no reasons to reject.

Reasons to reject:

None

Nanopublication comments:

Further comments:

RESPONSE TO REVIEWERS

--------------------------
Response to reviewer #1
## Abstract
The findings are now expressed as averages, since a new dataset from the CSV on the Web Working Group (https://github.com/w3c/csvw) has been added to the experiments.

## Introduction
A brief explanation has been added as to why CSV file headers do not require additional configuration, along with a brief review of the differences between the methodologies of the state-of-the-art tool and the new proposal, as well as the advantages of adopting the new proposed approach.

The missing reference to the RFC-4180 specifications has been added as a footnote.

## Related Work
The "ad-dressed" typo has been fixed

## Problem formulation
Section renamed to Preliminaries

## Table uniformity
The section has been moved to a subsection of the new Approach section.

## Type detection
Added a concise hint on how data type detection is implemented. It should be noted that numerical data can be inferred by programming languages in a very practical way.

Regarding IPv6: data type detection is necessarily an incomplete process, since it favors some types of data over others. This does not mean that new types of data cannot be incorporated, which could significantly increase the accuracy of dialect detection.

Regarding empty data: The decision to favor fields with data over empty ones is based on the fact that studies have shown that a low percentage of the files available in large web repositories contain empty columns. This clarification has been duly added to the paper.
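
As an illustration of the two points above (numerical inference through the language's own parsers, and favoring fields with data over empty ones), a minimal Python sketch follows. It is not the implementation described in the paper: the function names `infer_type` and `filled_fraction`, the type labels, and the scoring rule are hypothetical and chosen only for this example.

```python
from typing import Sequence


def infer_type(field: str) -> str:
    """Return a coarse type label for one CSV field (illustrative labels only)."""
    text = field.strip()
    if text == "":
        return "empty"
    # Let the language's own parsers decide whether the field is numeric.
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "text"


def filled_fraction(record: Sequence[str]) -> float:
    """Share of fields in a record that carry data (hypothetical scoring rule)."""
    if not record:
        return 0.0
    filled = sum(1 for field in record if infer_type(field) != "empty")
    return filled / len(record)


if __name__ == "__main__":
    row = ["42", " 3.14 ", "", "abc"]
    print([infer_type(f) for f in row])
    print(filled_fraction(row))
```

Note that Python's `float` also accepts strings such as `nan` and `inf`, so a real detector would likely need additional rules for those cases.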

## Determining CSV file dialects
Moved numerical example to Experiments section

## Evaluation setup
The section is now covered in the paper.

Regarding CSVW test cases: this dataset is now part of experiments.

## Experiments
Regarding FAIR principles: the Zenodo record is available at https://zenodo.org/records/11331538.

The section now describes how the experiments were performed and contains the example moved from the "Determining CSV file dialects" section.

## Conclusion
This section has been added.

--------------------------
Response to reviewer #2

- Regarding tool and data files published on GitHub: Zenodo record at https://zenodo.org/records/11331538
- Regarding the file attached to a GitHub issue: the sentence now reads "File was accessed from the CleverCSV repository on GitHub".

## Further comments
The conclusion section has been added. It discusses the practical impact of the methodology in a data mining environment, performance considerations relative to CleverCSV, and future approaches that tend towards a hybrid system in which an LLM post-processes the data loaded with the inferred dialects.

1 Comment

meta-review by editor

Submitted by Tobias Kuhn on Tue, 07/02/2024 - 02:39

The reviewers agree that the remaining shortcomings have been resolved, and the paper can therefore be accepted for publication.

Tobias Kuhn (https://orcid.org/0000-0002-1267-0234)

Data Science
