
I think CSV is a decent file format for tabular data. The author claims that CSV files are

> difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file).

But I do not see why this is the case.

Step 1: loop over file (in parallel) to determine indices of quote characters

Step 2: loop over indices outside quote regions (in parallel) to determine indices of comma and return characters

Step 3: return a two-dimensional integer array with pointers to the cells of the table
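
A rough sketch of those three steps in Python, purely as an illustration: it assumes RFC 4180-style quoting ("" as the escape), runs the quote scan in parallel over chunks, and does the quote pairing and splitting sequentially for clarity.

    # Minimal sketch of the index-building idea in the steps above.
    # Assumes RFC 4180-style CSV: fields quoted with ", and "" as the escape.
    # Pass 1 runs in parallel over chunks; pairing quotes and splitting on
    # commas/newlines is shown sequentially here.
    from multiprocessing import Pool

    def find_quotes(args):
        """Pass 1: byte offsets of every quote character in one chunk."""
        data, offset = args
        return [offset + i for i, b in enumerate(data) if b == ord(b'"')]

    def build_cell_index(data, quote_positions):
        """Passes 2-3: find commas/newlines outside quoted regions and
        return rows as lists of (start, end) byte offsets."""
        # Pair up quotes to get the quoted regions. A stray, unpaired quote
        # shows up here as a half-open region (the failure mode the article
        # complains about).
        quoted = set()
        it = iter(quote_positions)
        for start in it:
            end = next(it, len(data) - 1)
            quoted.update(range(start, end + 1))

        rows, row, cell_start = [], [], 0
        for i, b in enumerate(data):
            if i in quoted:
                continue
            if b == ord(b','):
                row.append((cell_start, i))
                cell_start = i + 1
            elif b == ord(b'\n'):
                row.append((cell_start, i))
                rows.append(row)
                row, cell_start = [], i + 1
        if cell_start < len(data):
            row.append((cell_start, len(data)))
            rows.append(row)
        return rows

    if __name__ == "__main__":
        data = b'a,"b,with,commas",c\n1,"2 "" quoted",3\n'
        size = 16
        chunks = [(data[i:i + size], i) for i in range(0, len(data), size)]
        with Pool() as pool:
            quotes = sorted(q for part in pool.map(find_quotes, chunks) for q in part)
        for row in build_cell_index(data, quotes):
            print([data[s:e].decode() for s, e in row])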



I'm inclined to agree. CSVs which are well-formed (escapes within fields handled consistently) shouldn't be that hard to parse.

I can't think of a reason your algo wouldn't be logically sound for good CSV files, although a little backtracking might be necessary to recognize escaping of delimiters in edge cases.

The author writes, "CSV is a mess. One quote in the wrong place and the file is invalid." But what logical format can tolerate arbitrary corruption? An unclosed tag is similarly problematic for XML. In both cases you wind up falling back to heuristics.
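
To make the "one quote in the wrong place" point concrete, here's a tiny illustration with Python's csv module (exact behaviour varies between parsers): a stray quote opens a field that never closes, so the rest of the input tends to get swallowed into a single cell rather than failing cleanly.

    import csv

    good = 'a,b,c\nd,e,f\n'
    bad = 'a,"b,c\nd,e,f\n'  # same data, one stray quote before b

    print(list(csv.reader(good.splitlines(True))))
    # [['a', 'b', 'c'], ['d', 'e', 'f']]

    # The opened quote is never closed, so everything after it is read as
    # part of a single quoted field instead of two rows of three cells.
    print(list(csv.reader(bad.splitlines(True))))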

It's true that CSVs often contain a mess of encodings inside fields, but that's not a problem with the CSV format per se. Validating field encodings, or validating that the entire file uses a uniform encoding, is a separate requirement.


Indeed, every distributed query engine I've used can easily parallelise CSV in the same file (so long as it's splittable; friends don't let friends gzip their data), with the option to ignore bad rows, log them, or throw your hands up and die.

Admittedly, all of them are Java-based and use Hadoop libs for handling CSV, which makes sense: the Elephant ecosystem has spent years getting this stuff right.
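
For what it's worth, Spark is one JVM-ecosystem example that exposes those bad-row policies directly as CSV read modes (not claiming it's what anyone here is running; "data.csv" below is a placeholder path). A minimal PySpark sketch:

    # PySpark sketch of the bad-row policies mentioned above.
    # "data.csv" is a placeholder path.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-modes").getOrCreate()

    # PERMISSIVE (the default): keep malformed rows, nulling fields it can't parse.
    permissive = spark.read.option("header", "true").option("mode", "PERMISSIVE").csv("data.csv")

    # DROPMALFORMED: silently drop rows that don't parse.
    dropped = spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("data.csv")

    # FAILFAST: die on the first bad row.
    strict = spark.read.option("header", "true").option("mode", "FAILFAST").csv("data.csv")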



