I'm inclined to agree. CSVs that are well-formed (with escaping inside fields handled consistently) shouldn't be that hard to parse.
I can't think of a reason your algo wouldn't be logically sound for good CSV files, although a little backtracking might be needed to recognize escaped delimiters in edge cases.
The author writes "CSV is a mess. One quote in the wrong place and the file is invalid," but what format can tolerate arbitrary corruption? An unclosed tag is just as problematic for XML. In either case you wind up falling back to heuristics.
It's true that CSVs often contain a mess of encodings inside fields, but that's not a problem with the CSV format per se. Validating field encodings, or validating that the entire file uses a uniform encoding... those are separate requirements.
Indeed, every distributed query engine I've used can easily parallelise parsing of a single CSV file (so long as it's splittable; friends don't let friends gzip their data), with the option to ignore bad rows, log them, or throw your hands up and die.
Admittedly, all of them are Java-based and use Hadoop libs for handling CSV, which makes sense: the Elephant ecosystem has spent years getting this stuff right.
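For what it's worth, Spark (one such Java/Scala-based engine, which I'm picking as my own example here) exposes exactly those three choices through its CSV reader's `mode` option. A minimal PySpark sketch, with a made-up file path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-bad-rows").getOrCreate()

path = "data.csv"  # hypothetical path; keep it uncompressed so it stays splittable

# "ignore bad rows": silently drop lines that don't parse
dropped = spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv(path)

# "log them": PERMISSIVE keeps every row (malformed fields come back null);
# add a corrupt-record column to an explicit schema if you want the raw text
logged = spark.read.option("header", "true").option("mode", "PERMISSIVE").csv(path)

# "throw your hands up and die": fail the job on the first malformed line
strict = spark.read.option("header", "true").option("mode", "FAILFAST").csv(path)
```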
> difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file).
But I don't see why that has to be the case:
Step 1: loop over the file (in parallel) to determine the indices of quote characters
Step 2: loop over the indices outside quoted regions (in parallel) to determine the indices of comma and newline characters
Step 3: return a two-dimensional integer array with pointers to the cells of the table
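A rough Python sketch of those three steps, assuming the data fits in memory, RFC 4180 style doubled quotes, and `\n` line endings; the chunking, worker count, helper names, and the pairing of quotes into regions are my own choices for illustration:

```python
from bisect import bisect_right
from concurrent.futures import ProcessPoolExecutor

def scan_chunk(args):
    """Return absolute indices of any byte in `needles` within data[start:end]."""
    data, start, end, needles = args
    return [i for i in range(start, end) if data[i] in needles]

def parallel_scan(data, needles, workers=4):
    """Scan fixed-size chunks of the buffer in parallel (used by steps 1 and 2).
    For simplicity each worker gets a copy of the buffer; a real implementation
    would share it via mmap."""
    step = max(1, len(data) // workers)
    chunks = [(data, i, min(i + step, len(data)), needles)
              for i in range(0, len(data), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [i for part in pool.map(scan_chunk, chunks) for i in part]

def parse(data: bytes):
    # Step 1: indices of every quote character, found in parallel.
    quotes = sorted(parallel_scan(data, {ord('"')}))
    # Pair consecutive quotes into quoted regions (a doubled "" just yields a
    # zero-length region, which is harmless for this purpose).
    regions = list(zip(quotes[0::2], quotes[1::2]))
    starts = [s for s, _ in regions]

    def quoted(i):
        k = bisect_right(starts, i) - 1
        return k >= 0 and regions[k][0] <= i <= regions[k][1]

    # Step 2: indices of commas and newlines, found in parallel, then filtered
    # down to the ones that fall outside quoted regions.
    seps = [i for i in sorted(parallel_scan(data, {ord(','), ord('\n')}))
            if not quoted(i)]

    # Step 3: turn separator positions into (start, end) pointers per cell.
    table, row, cell_start = [], [], 0
    for i in seps:
        row.append((cell_start, i))
        cell_start = i + 1
        if data[i] == ord('\n'):
            table.append(row)
            row = []
    if row or cell_start < len(data):
        row.append((cell_start, len(data)))
        table.append(row)
    return table

if __name__ == "__main__":
    csv = b'a,"b,1",c\nd,"e""f",g\n'
    for r in parse(csv):
        print([csv[s:e] for s, e in r])
```

The pointers include the surrounding quotes; unescaping is left to whoever consumes the cells. Note that pairing the quote indices into regions is still a sequential pass, which is where the "you can't start parsing from part way through" objection bites, but it only touches the (usually tiny) list of quote positions rather than the whole file.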