One thing that has always puzzled me is why CSV exists, given that the ASCII unit/record separators appear to do exactly what the comma and newline characters attempt to do with codes that can also appear within records, and they have been well defined since some point in the 1960s.
And here we are, 60 years later still struggling to work out where a record ends...
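For the curious, here's roughly what that buys you; a minimal sketch in Python using the real byte values (0x1F for the unit separator, 0x1E for the record separator):

```python
# ASCII control codes: Unit Separator between fields,
# Record Separator between records.
US, RS = "\x1f", "\x1e"

rows = [["name", "note"], ["Smith, Jane", 'said "hi"\nthen left']]

# Encode: no quoting or escaping needed, because commas, quotes,
# and newlines are ordinary data bytes in this scheme.
blob = RS.join(US.join(fields) for fields in rows)

# Decode:
decoded = [record.split(US) for record in blob.split(RS)]
assert decoded == rows
```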
Easy: CSVs are human readable and writable, and humans have a comma key on their keyboard, while the ASCII record separator is nowhere visible on it (teaching them to use, e.g., Alt+30 will not work).
And if they had that key on their keyboard, you'd have the comma problem all over again: what if an ASCII record separator shows up in a field?
If the ASCII separators had been regularly supported in text editors since the early days, it would have removed 90% of the problems.
I prefer the pipe character (|) as a delimiter: easy to see, not part of common speech or titles, and enterable via keyboard. Yes, it can still appear in a field, but it's less likely.
But if it _might_ turn up, you have to assume it will and deal with it, and then you're back in exactly the same problem as before, only with a delimiter that's even less standard.
I think it's like cryptography: why bother to roll your own when people who are cleverer (certainly than me; I don't know about you) have already put a lot of effort into this? Just use one of the well-tested standard libraries and don't mess with it.
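In the same spirit, a minimal sketch of "just use the library" with Python's standard csv module, which already gets quoting, escaping, and embedded newlines right:

```python
import csv
import io

# Let the library worry about quoting rules instead of
# splitting strings by hand.
buf = io.StringIO()
csv.writer(buf).writerow(["Doe, John", 'said "hello"', "multi\nline"])

buf.seek(0)
row = next(csv.reader(buf))
assert row == ["Doe, John", 'said "hello"', "multi\nline"]
```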
Point is, being neither human readable nor reachable from a keyboard key, you can reasonably expect those special separators not to show up inside the data.
I suppose I'd consider such special-separated files (codes 28 to 31, if memory serves) to be machine generated and machine readable only, not intended to be human readable without a bit of extra software or, e.g., a special Emacs mode.
When considering standard A that can only work with machine generated and machine readable files and standard B that can work with all those files plus human-readable/editable files, I don’t find it surprising at all that standard B wins in the market.
It happens. XPM/PBM/XBM lost to TIFF (and later PNG). PostScript lost to PDF. OpenOffice.org XML couldn't displace Excel's binary file format.
As much as I wish that beauty or usability were the primary indicator of market success, the simpler explanation that covers all of these (including CSV) is: Microsoft Office supported them.
I don't know about XPM/PBM/XBM losing, but AIUI PDF is just a subset of PostScript with compression. If you think MS's binary Excel file format is all that secret, I suggest you try unzipping it; it is no more than compressed XML. I have rescued parts of corrupt Excel and Word files by doing just that.
MS completely redesigned their binary formats a few years ago. You're right about the current format, but the earlier one was completely undecipherable. Even the current XML has quirks in it to stay compatible with the older format.
It was reversible enough to write the Apache POI library. I maintain code that uses the HSSF streaming API portion of the library, which mainly gives you low-level access to the structures in XLS files. The OO.org folks made some very nice documentation: http://www.openoffice.org/sc/excelfileformat.pdf
Binary formats win, I think, when the ASCII format is significantly worse in storage size and parsing speed. XPM/PBM/XBM are enormous compared to any binary image format. And remember, these images used to get put on 1.4 MB floppies and sent over 2400 baud modems.
It's basically "use the right tool for the job". For pixels, that's a binary format. For mostly-text data, text itself works pretty well.
If someone copy-pastes a line from one CSV into a string within a field of another, the second file will contain record separators from the first inside a single field.
> CSVs are human readable and writeable and humans tend to have a comma sign on their keyboard
Do people usually compose CSV data by hand? I thought they would use a program like Excel, enter or generate their data in separate cells and then save the file as a CSV. There's no reason why a program like Excel (or any other program for that matter) couldn't use the record separator instead of a comma as a delimiter when generating or consuming such files.
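Indeed, nothing stops a program from doing exactly that. Python's csv module, for one, accepts any single character as the delimiter, the ASCII unit separator included; a sketch of the idea (whether any spreadsheet exposes this as a save option is another matter):

```python
import csv

# Hypothetical export path: same library, different delimiter byte.
with open("data.usv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\x1f")  # ASCII unit separator
    writer.writerow(["id", "comment"])
    writer.writerow(["007", "commas, are fine here"])

with open("data.usv", newline="") as f:
    rows = list(csv.reader(f, delimiter="\x1f"))
```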
I do, quite a bit: GitHub Markdown tables, small annotation files for ML, quick Python parsing, etc. It's very common, and it would be annoying to open up Excel every time I needed some CSV.
I thought markdown tables used the pipe character as the delimiter between table columns.
Also, it's not necessary to use Excel, and a lot of the issues brought up in the article wouldn't apply with a different delimiter character. Editors could easily be updated to make that delimiter easier to enter, by mapping it to a key like Tab whenever such a file is opened.
It just seems like a problem that could be solved, but can't be due to inertia.
> I thought markdown tables used the pipe character as the delimiter between table columns.
Markdown (the original one by John Gruber) does not include any syntax for tables. Other implementations have included it, but there is no standard. As far as I know, using the pipe character to separate columns is indeed quite common, but it is not the only way.
But humans rarely use notepad these days. They use code editors like Coda or VSCode at the very least which have all kinds of advanced features. Surely, those can include support for ASCII separators?
Let's say we have a product master. Something looks fishy and we want a brand person or product-supply person to check... Push out a CSV and it's trivially easy for them to open it in Excel, check, make corrections, and send it back.
In a business context, this happens far more often than you may expect. Sure, you could build a custom platform to validate the data and make people fill in a form wired to a database, but Excel is a great user interface and everyone knows Excel.
A funny incident: we were struggling to build a complex workflow in KNIME where at some points we needed user input. Nothing out of the box was great; tools assume either a dashboard paradigm or a data-flow paradigm, nothing in between.
One of our creative folks came up with the solution of writing to a CSV and getting KNIME to open it in Excel. The user would make changes, save, close Excel, and the workflow would continue in KNIME. Even completely non-technical people got it.
We had to ban our non-technical users from opening CSVs in Excel because it would strip leading zeros and do other exciting things, like converting large numbers to 1.23123E+20 :(
It does this without any notification or warning when you save.
Kill me... I've dealt with flat-file ETL for ~9 years now, with the first two pretty much dedicated to it. I hate, hate, hate Excel for that reason alone. It has wasted so many hours: customers mess up their identifiers by opening their feeds in Excel before sending, accidentally transforming them as above, and then we have to fix the changed customer identifiers used throughout the system once they figure out what happened.
Excel is only fit for displaying CSV files. I had to ban myself from editing files in Excel because it's atrocious at it. It also likes to quietly change decimal separators (again, without any warning).
“A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.”
That paper is from 2016, at least 12 years after the problem was first identified.
That behavior is technically correct because those are all valid representations of that value, e.g., 007449938 = 7449938 = 7.449938E6. Pretty much any plain text (or binary) data format will have this same issue, unless it supports additional metadata for display purposes.
If you intended the data to be a string value, then it should have been enclosed in quotes.
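To make that concrete, the numeric parse really does treat all three spellings as one value; only as strings do they differ (Python shown for illustration):

```python
# All valid representations of the same number...
assert int("007449938") == 7449938
assert float("7.449938E6") == 7449938.0

# ...but distinct as strings, which is what an identifier needs.
assert "007449938" != "7449938"
```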
Your assertion sounded incorrect, so I tested it. Quoting large numbers does not preserve them on a save.
Adding a ' before the value should work, but for data extracted from arbitrary systems you'll have few guarantees of the sort. Formatting the cells in Excel to display the full value instead of the truncated one also works, from memory, but you won't always know ahead of time whether this has happened further down the file.
In our case it was often mailing identifier barcodes so any loss of precision made them entirely useless.
What if the value is an ID rather than an integer? CSV doesn't specify whether the data type is a string or a number. The schema needs to be inferred by the parser/user.
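Which suggests the defensive default: read every field as a string and convert to numbers only on purpose. A minimal sketch with pandas (the column names are made up for illustration):

```python
import io
import pandas as pd

raw = "id,amount\n007449938,12.5\n"

# dtype=str keeps leading zeros and long identifiers intact.
df = pd.read_csv(io.StringIO(raw), dtype=str)

# Opt in to numeric types per column, deliberately.
df["amount"] = pd.to_numeric(df["amount"])
print(df["id"][0])  # "007449938" -- zeros preserved
```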
> But humans rarely use notepad these days. They use code editors like Coda or VSCode at the very least which have all kinds of advanced features. Surely, those can include support for ASCII separators?
I do a fair amount of work with companies that do "EDI" over CSV (or worse, CSV-like: think two CSVs jammed together with different formats, no headers, no support for escaping or quoting) and fixed-width documents. I can absolutely assure you that humans open these files in Notepad far more often than I'd like.
Often one of the main reasons they don't use things like X12, ASCII separators, etc. is that "a human needs to open it at some point" was a prevailing business decision some number of years ago (think: "What happens if the IT system fails? How can we still ship stuff in a complete emergency?"), and now it's baked into their documented process so deeply that it's like shouting into the wind to alter things. Third-party warehouses are the worst at this.
Or they use Excel, which can introduce its own entertaining errors.
A few years back, while working on something that unavoidably used large quantities of CSV data, we would sternly warn people not to use Excel, but people still did it ("But I just looked at it!").
The TSV (tab-separated values) format is supported by a lot of software (Google Sheets, for instance) and solves the "comma in the field" problem quite nicely. Tabs can easily be typed by a human, but they don't usually occur within a one-line string, so they don't cause the same problems as CSVs do.
TSV, CSV, PSV etc. all share the same class of problem. The important thing is that you are quoting your fields and escaping your quotes, regardless of the symbols used.
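To illustrate: with Python's csv module, only the delimiter argument changes between CSV, TSV, and PSV; the quoting machinery is what handles a delimiter or quote turning up in the data (a small sketch):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter="|", quoting=csv.QUOTE_MINIMAL)
# Fields containing the delimiter or quote char get quoted,
# and embedded quotes get doubled.
writer.writerow(["plain", "has | a pipe", 'has "quotes"'])

print(buf.getvalue())
# plain|"has | a pipe"|"has ""quotes"""
```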
Importantly, tabs aren’t usually inserted into fields by Excel users. With Power BI, exporting TSV is a little easier for non-dev Excel power users, too.
Nailed it. When Excel users press tab, they mean "go to the next cell." Conveniently, this is also what the format means to the computer when it encounters the tab separator. It's not foolproof, but it's much better. The human interpretation of the character matches the machine interpretation.
They do. People write code to create all sorts of bastardized abominations of "CSV" or tab-delimited or whatever. It's why the featured article gets reposted every few years. You can define a standard for CSV files, but then Excel does its own thing, and here we are.
I think that stems from the fact that no software ever properly supported them; they are non-printable characters, so you can't see them easily. Whereas with CSV you can fire up notepad.exe and explore the data.
Given the inability to standardise on line endings ("\n", "\r\n", or "\r"), I don't think we can standardise on using unit or record separators either.
Have you ever seen the ASCII separator characters used as they were intended? I don't think I have. It's obvious the problem they were trying to solve, but it was too little too late. It doesn't help that they're control characters that aren't meant to be displayed so they're practically invisible.
CSV is human readable and writable, but that is only part of its advantage. Its simplicity makes it great as an archival format: it's the plain text of data formats. Decades pass and you can still read it without much of an issue, though people are rightly pointing out that the format is ambiguous in nature and subject to a lot of interpretation. I've recently been compiling a lot of decades-old experimental datasets, and it's honestly great when I find plain text files from the 1980s, since we can still read them! It's the more exotic data formats that came about later that are often more difficult to read properly.
For a lot of people, meaningful whitespace is a hard concept to understand.
It's made worse by the number of applications where they see whitespace reformatted willy-nilly (for instance, web forms eating their line breaks).
From there, explaining that there are spaces and tabs, and that both can look the same on screen while being different characters, is just asking for trouble outside of our circles.
> is just asking for trouble outside of our circles
It's asking for trouble in our circles too. Imagine getting a tab-separated file, opening it in an editor that automatically converts the tabs to 4 spaces "because", and then sending it back without checking, since nothing looks wrong.
Or an input/editor where the `tab` key moves your focus from the text field to the next button, like Slack. You can't just hit `tab` in Slack when typing a code snippet for something that needs tabs, like a Makefile.
I think we have different definitions of human readable, then. To me, human readable means that if I print it on a printer, it loses no fidelity. What you describe is what I would call machine readable: it can be imported into a program. A binary separator would let it be imported just the same.
If those characters had a visual representation and were easily typeable, and we had settled on using them as separators, then we would struggle with escaping those special characters within records. The problem is lack of standardization, not lack of special separator characters.
ASCII FS/GS/RS/US have no visual representation, and users do not know how to enter them. That makes them suitable only for binary-like encodings; data isn't really text if users can't see or enter the separators easily.
But users can see the commas in CSV, and they can trivially enter them. Yeah, the result is messy.
The lesson here is that the separator control characters should have been visible and should have had an indicator on keyboard keycaps showing how to enter them. Because they aren't and don't, they are essentially useless for text.
EDIT: I do happen to know how to enter these on *nix systems; the ascii(7) man page tells you how. FS is ^\ (which you have to be careful does not make the tty generate SIGQUIT; type ^V^\ to enter it literally), and then GS, RS, and US are ^], ^^, and ^_. That is, <control>\, <control>], <control>^, and <control><shift>-. On U.S. keyboards, anyway.
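For reference, the four separators and their byte values; trivial to emit from code even if they're awkward to type (a quick sketch):

```python
# ASCII separator control characters, from widest to narrowest scope.
SEPARATORS = {
    "FS (file separator)":   0x1C,  # ^\
    "GS (group separator)":  0x1D,  # ^]
    "RS (record separator)": 0x1E,  # ^^
    "US (unit separator)":   0x1F,  # ^_
}

for name, code in SEPARATORS.items():
    print(f"{name}: {code} (0x{code:02X}, {chr(code)!r})")
```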
I think CSV is more popular than one would think because Excel was awful back in the old days.
A lot of the crufty edge cases are artifacts of whatever people had to deal with in 1995 to get data into spreadsheets. We all got used to pushing data around with CSV because the people consuming the data needed it.
That's funny. When I was twelve I needed to store some data and didn't really know about databases or CSV, so I actually used the ASCII control characters to implement tables and records.
I later discovered SQLite, soooo much simpler...
Based on my understanding of CSV, unit separators are valid column delimiters. I doubt you could use record separators as row delimiters, though.