
I think CSV is a decent file format for tabular data. The author claims that CSV files are

> difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file).

But I do not see why this is the case.

Step 1: loop over file (in parallel) to determine indices of quote characters

Step 2: loop over indices outside quote regions (in parallel) to determine indices of comma and return characters

Step 3: return a two-dimensional integer array with pointers to the cells of the table
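
A rough sketch of those three steps in Python, purely as an illustration: it assumes RFC 4180-style quoting ("" as the escape), runs the quote scan in parallel over chunks, and does the quote pairing and splitting sequentially for clarity.

    # Minimal sketch of the index-building idea in the steps above.
    # Assumes RFC 4180-style CSV: fields quoted with ", and "" as the escape.
    # Pass 1 runs in parallel over chunks; pairing quotes and splitting on
    # commas/newlines is shown sequentially here.
    from multiprocessing import Pool

    def find_quotes(args):
        """Pass 1: byte offsets of every quote character in one chunk."""
        data, offset = args
        return [offset + i for i, b in enumerate(data) if b == ord(b'"')]

    def build_cell_index(data, quote_positions):
        """Passes 2-3: find commas/newlines outside quoted regions and
        return rows as lists of (start, end) byte offsets."""
        # Pair up quotes to get the quoted regions. A stray, unpaired quote
        # shows up here as a half-open region (the failure mode the article
        # complains about).
        quoted = set()
        it = iter(quote_positions)
        for start in it:
            end = next(it, len(data) - 1)
            quoted.update(range(start, end + 1))

        rows, row, cell_start = [], [], 0
        for i, b in enumerate(data):
            if i in quoted:
                continue
            if b == ord(b','):
                row.append((cell_start, i))
                cell_start = i + 1
            elif b == ord(b'\n'):
                row.append((cell_start, i))
                rows.append(row)
                row, cell_start = [], i + 1
        if cell_start < len(data):
            row.append((cell_start, len(data)))
            rows.append(row)
        return rows

    if __name__ == "__main__":
        data = b'a,"b,with,commas",c\n1,"2 "" quoted",3\n'
        size = 16
        chunks = [(data[i:i + size], i) for i in range(0, len(data), size)]
        with Pool() as pool:
            quotes = sorted(q for part in pool.map(find_quotes, chunks) for q in part)
        for row in build_cell_index(data, quotes):
            print([data[s:e].decode() for s, e in row])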



I'm inclined to agree. CSVs which are well-formed (escapes within fields handled consistently) shouldn't be that hard to parse.

I can't think of a reason your algo wouldn't be logically sound for good CSV files, although a little backtracking might be necessary to recognize escaping of delimiters in edge cases.

The author writes, "CSV is a mess. One quote in the wrong place and the file is invalid." But what logical format can tolerate arbitrary corruption? An unclosed tag is similarly problematic for XML. In both cases you wind up falling back to heuristics.
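
To make the "one quote in the wrong place" point concrete, here's a tiny illustration with Python's csv module (exact behaviour varies between parsers): a stray quote opens a field that never closes, so the rest of the input tends to get swallowed into a single cell rather than failing cleanly.

    import csv

    good = 'a,b,c\nd,e,f\n'
    bad = 'a,"b,c\nd,e,f\n'  # same data, one stray quote before b

    print(list(csv.reader(good.splitlines(True))))
    # [['a', 'b', 'c'], ['d', 'e', 'f']]

    # The opened quote is never closed, so everything after it is read as
    # part of a single quoted field instead of two rows of three cells.
    print(list(csv.reader(bad.splitlines(True))))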

It's true that CSVs often contain a mess of encodings inside fields, but that's not a problem with the CSV format per se. Validating field encodings, or validating that the entire file uses a uniform encoding, is a separate requirement.


Indeed, every distributed query engine I've used can easily parallelise CSV in the same file (so long as it's splittable; friends don't let friends gzip their data), with the option to ignore bad rows, log them, or throw your hands up and die.

Admittedly, all of them are Java-based and use Hadoop libs for handling CSV, which makes sense: the Elephant ecosystem has spent years getting this stuff right.
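
For what it's worth, Spark is one JVM-ecosystem example that exposes those bad-row policies directly as CSV read modes (not claiming it's what anyone here is running; "data.csv" below is a placeholder path). A minimal PySpark sketch:

    # PySpark sketch of the bad-row policies mentioned above.
    # "data.csv" is a placeholder path.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-modes").getOrCreate()

    # PERMISSIVE (the default): keep malformed rows, nulling fields it can't parse.
    permissive = spark.read.option("header", "true").option("mode", "PERMISSIVE").csv("data.csv")

    # DROPMALFORMED: silently drop rows that don't parse.
    dropped = spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("data.csv")

    # FAILFAST: die on the first bad row.
    strict = spark.read.option("header", "true").option("mode", "FAILFAST").csv("data.csv")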



