> You took on exactly the responsibilities the article author said you'd have to take on.
Yep, and having been there, I’m reporting: it’s not nearly as bad as this post makes it sound.
I’ve written web apps, too, and (for example) CSS is 100 times worse. Nobody is suggesting every website should use only browser default styles, though.
> I think the better question is, how do you handle malformed CSV?
What does “malformed” mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there’s no great way to flag it: any future data could have the other quote!
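To make the open-quote failure concrete, here is a small sketch using Python's `csv` module (purely for illustration; the commenter's own implementation is a streaming parser, not this). In its default non-strict mode the parser swallows everything after the unmatched quote into a single field:

```python
import csv
import io

# A quoted field that is never closed: a typical RFC 4180-style parser
# treats the entire rest of the input as one value.
data = 'a,b,c\n1,"unclosed,2\n3,4,5\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# The second "row" absorbs what was meant to be a third row.
```

Note there is no error: as the comment says, the data is still "valid" CSV, just not what the author intended.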
We show the data as it’s parsed, so it should be obvious to the user what is going on, and where.
There are areas of computing that are too complex. I’m usually the first to complain about such things. I really don’t think CSV parsing is one of them.
It's even worse than the author is suggesting. For most people, "RFC4180" is meaningless; all that matters is what Excel does. And that means you need to handle a bunch of cases if you are reading AND if you are writing files. A few cases not discussed in the blog post:
- if your file starts with \x49\x44 ("ID"), Excel will interpret the file as their symbolic link .SLK format. So if you're writing files, the ID should be wrapped in double quotes even if it isn't necessary according to RFC4180
- Excel will proactively try to "evaluate" fields that start with \x3d ("="). You can see this in action with the sample file
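A defensive writer has to work around both quirks. The sketch below is an illustration of the idea, not production code: `excel_safe` is a hypothetical helper name, and the apostrophe prefix is one common (if imperfect) defence against formula evaluation:

```python
import csv
import io

def excel_safe(value: str) -> str:
    # Hypothetical helper: a leading "=" (and often "+", "-", "@") can be
    # evaluated as a formula by spreadsheet software, so prefix an
    # apostrophe -- a common, if imperfect, defence.
    if value.startswith(('=', '+', '-', '@')):
        return "'" + value
    return value

buf = io.StringIO()
# Quote "ID" unconditionally so Excel does not sniff the file as SYLK.
# csv.writer with its default QUOTE_MINIMAL would leave it bare, so the
# header is written by hand here (crude, purely for illustration).
buf.write('"ID",name\n')
writer = csv.writer(buf)
writer.writerow(['1', excel_safe('=2+5')])
out = buf.getvalue()
print(out)
```

Neither transformation is required by RFC4180; both exist only because of how Excel behaves.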
CSV parsing / writing certainly isn't going to be a value driver for most companies (if you're supporting user imports, you really care about XLSX/XLSB/XLS files and Google Sheets import), but it's not a trivial problem.
> What does “malformed” mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there’s no great way to flag it: any future data could have the other quote!
Lest you think this is made up, I ran across this when someone cut and pasted Excel data into a text field.
I also have seen batch processing of user files break hard when a quote issue like this caused a hand-rolled CSV parser to conclude that half the file was a single very long field.
I think you have one of the best use cases for rolling your own parser. Your tool's purpose is to read and parse arbitrary data (and then transform and display it). But, I think you're misinterpreting how much work the post is saying CSV will take to implement:
> Easy right? You can write the code yourself in just a few lines.
My take-away from the post is: if you are parsing arbitrary CSV files, you need to make parsing configurable because there's no one, true CSV format. If you are writing CSV files, you may need to escape your fields in a weird, outdated manner.
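The "no one, true CSV format" point is easy to demonstrate. Using Python's `csv` module as a stand-in for any configurable parser, the same bytes parse completely differently depending on the delimiter setting:

```python
import csv
import io

# Semicolon-delimited input, as some locales and tools emit.
raw = 'a;b;c\n1;"x;y";3\n'

# Parsed with the wrong configuration, the delimiter is just data;
# parsed with the right one, both splitting and quoting work.
wrong = list(csv.reader(io.StringIO(raw)))                 # assumes comma
right = list(csv.reader(io.StringIO(raw), delimiter=';'))
print(wrong[1])
print(right[1])
```

Since nothing in the file itself says which configuration is correct, the choice has to come from the user or from heuristics.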
P.S. By "malformed", I meant whether the 2D matrix of byte arrays is read exactly as intended. It could be caused by an open quote, but it could be incorrect escaping or inconsistent delimiters. Since there's no inline schema saying which CSV parsing configuration is being used, you must ask the user to configure the CSV parser and validate the output.
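One cheap way to validate the output in that sense is to check that the parsed 2D matrix is rectangular. This is a sketch of the idea, not anyone's actual code; `ragged_rows` is a hypothetical helper:

```python
import csv
import io

def ragged_rows(rows):
    # Post-parse sanity check: every row should be as wide as the header.
    # Ragged rows usually mean the parser configuration (delimiter,
    # quoting) didn't match the file.
    width = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != width]

good = list(csv.reader(io.StringIO('a,b\n1,2\n')))
bad = list(csv.reader(io.StringIO('a,b\n1;2\n')))  # wrong delimiter in data
print(ragged_rows(good))
print(ragged_rows(bad))
```

It can't prove the data was read "exactly as intended", but it catches the most common misconfigurations before they propagate downstream.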