One thing that has always puzzled me is why CSV exists, given that the ASCII unit/record separators appear to do exactly what the comma and newline characters attempt to do with codes that can also appear within records, and they have been well defined since some point in the 1960s.
And here we are, 60 years later still struggling to work out where a record ends...
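For the curious, here's roughly what that buys you; a minimal sketch in Python using the real byte values (0x1F for the unit separator, 0x1E for the record separator):

```python
# ASCII control codes: Unit Separator between fields,
# Record Separator between records.
US, RS = "\x1f", "\x1e"

rows = [["name", "note"], ["Smith, Jane", 'said "hi"\nthen left']]

# Encode: no quoting or escaping needed, because commas, quotes,
# and newlines are ordinary data bytes in this scheme.
blob = RS.join(US.join(fields) for fields in rows)

# Decode:
decoded = [record.split(US) for record in blob.split(RS)]
assert decoded == rows
```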
Easy: CSVs are human readable and writable, and humans have a comma key on their keyboard, while the ASCII record separator is nowhere visible on it (teaching them to use, e.g., Alt+30 will not work).
And if they had that key on their keyboard, you'd have the comma problem all over again: what if an ASCII record separator shows up in a field?
If the ASCII separators had been regularly supported in text editors since the early days, it would have removed 90% of the problems.
I prefer the pipe character (|) as a delimiter: easy to see, not part of common speech or titles, and enterable via keyboard. Yes, it can still appear in a field, but it's less likely.
But if it _might_ turn up, you have to assume it will and deal with it, and then you're back in exactly the same problem as before, only with a delimiter that's even less standard.
I think it's like cryptography: why bother to roll your own when people who are cleverer (certainly than me; I don't know about you) have already put a lot of effort into this? Just use one of the well-tested standard libraries and don't mess with it.
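In the same spirit, a minimal sketch of "just use the library" with Python's standard csv module, which already gets quoting, escaping, and embedded newlines right:

```python
import csv
import io

# Let the library worry about quoting rules instead of
# splitting strings by hand.
buf = io.StringIO()
csv.writer(buf).writerow(["Doe, John", 'said "hello"', "multi\nline"])

buf.seek(0)
row = next(csv.reader(buf))
assert row == ["Doe, John", 'said "hello"', "multi\nline"]
```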
Point is, being neither human readable nor reachable from a keyboard key, you can reasonably expect those special separators not to show up inside the data.
I suppose I'd consider such special-separated files (codes 28 to 31, if memory serves) to be machine generated and machine readable only, not intended to be human readable without a bit of extra software or, e.g., a special Emacs mode.
When considering standard A that can only work with machine generated and machine readable files and standard B that can work with all those files plus human-readable/editable files, I don’t find it surprising at all that standard B wins in the market.
It happens. XPM/PBM/XBM lost to TIFF (and later PNG). PostScript lost to PDF. OpenOffice.org XML couldn't displace Excel's binary file format.
As much as I wish that beauty or usability were the primary indicator of market success, the simpler explanation that covers all of these (including CSV) is: Microsoft Office supported them.
I don't know about XPM/PBM/XBM losing, but AIUI PDF is just a subset of PostScript with compression. If you think MS's binary Excel file format is all that secret, I suggest you try unzipping it; it is no more than compressed XML. I have rescued parts of corrupt Excel and Word files by doing just that.
MS completely redesigned their binary formats a few years ago. You're right about the current format, but the earlier one was completely undecipherable. Even the current XML has quirks in it to stay compatible with the older format.
It was reversible enough to write the Apache POI library. I maintain code that uses the HSSF streaming API portion of the library, which mainly gives you low-level access to the structures in XLS files. The OO.org folks made some very nice documentation: http://www.openoffice.org/sc/excelfileformat.pdf
Binary formats win, I think, when the ASCII format is significantly worse in storage size and parsing speed. XPM/PBM/XBM are enormous compared to any binary image format. And remember, these images used to get put on 1.4 MB floppies and sent over 2400 baud modems.
It's basically "use the right tool for the job". For pixels, that's a binary format. For mostly-text data, text itself works pretty well.
If someone copy-pastes a line from one CSV into a string within a field of another, the second file will contain record separators from the first inside a single field.
> CSVs are human readable and writeable and humans tend to have a comma sign on their keyboard
Do people usually compose CSV data by hand? I thought they would use a program like Excel, enter or generate their data in separate cells and then save the file as a CSV. There's no reason why a program like Excel (or any other program for that matter) couldn't use the record separator instead of a comma as a delimiter when generating or consuming such files.
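Indeed, nothing stops a program from doing exactly that. Python's csv module, for one, accepts any single character as the delimiter, the ASCII unit separator included; a sketch of the idea (whether any spreadsheet exposes this as a save option is another matter):

```python
import csv

# Hypothetical export path: same library, different delimiter byte.
with open("data.usv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\x1f")  # ASCII unit separator
    writer.writerow(["id", "comment"])
    writer.writerow(["007", "commas, are fine here"])

with open("data.usv", newline="") as f:
    rows = list(csv.reader(f, delimiter="\x1f"))
```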
I do, quite a bit: GitHub Markdown tables, small annotation files for ML, quick Python parsing, etc. It's very common, and it would be annoying to open up Excel every time I needed some CSV.
I thought markdown tables used the pipe character as the delimiter between table columns.
Also, it's not necessary to use Excel, and a lot of the issues brought up in the article wouldn't apply with a different delimiter character. Editors could easily be updated to make that delimiter easier to enter, by mapping it to a key like Tab whenever such a file is opened.
It just seems like a problem that could be solved, but can't be due to inertia.
> I thought markdown tables used the pipe character as the delimiter between table columns.
Markdown (the original one by John Gruber) does not include any syntax for tables. Other implementations have included it, but there is no standard. As far as I know, using the pipe character to separate columns is indeed quite common, but it is not the only way.
But humans rarely use notepad these days. They use code editors like Coda or VSCode at the very least which have all kinds of advanced features. Surely, those can include support for ASCII separators?
Let's say we have a product master. Something looks fishy and we want a brand person or product-supply person to check... Push out a CSV and it's trivially easy for them to open it in Excel, check, make corrections, and send it back.
In a business context, this happens far more often than you may expect. Sure, you could build a custom platform to validate the data and make people fill in a form wired to a database, but Excel is a great user interface and everyone knows Excel.
A funny incident: we were struggling to build a complex workflow in KNIME where at some points we needed user input. Nothing out of the box was great; tools assume either a dashboard paradigm or a data-flow paradigm, nothing in between.
One of our creative folks came up with the solution of writing to a CSV and getting KNIME to open it in Excel. The user would make changes, save, close Excel, and the workflow would continue in KNIME. Even completely non-technical people got it.
We had to ban our non-technical users from opening CSVs in Excel because it would strip leading zeros and do other exciting things, like converting large numbers to 1.23123E+20 :(
It does this without any notification or warning when you save.
Kill me... I've dealt with flat-file ETL for ~9 years now, with the first two pretty much dedicated to it. I hate, hate, hate Excel for that reason alone. It has wasted so many hours: customers mess up their identifiers by opening their feeds in Excel before sending, accidentally transforming them as above, and then we have to fix the changed customer identifiers used throughout the system once they figure out what happened.
Excel is only fit for displaying CSV files. I had to ban myself from editing files in Excel because it's atrocious at it. It also likes to quietly change decimal separators (again, without any warning).
“A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.”
That paper is from 2016, at least 12 years after the problem was first identified.
That behavior is technically correct because those are all valid representations of that value, e.g., 007449938 = 7449938 = 7.449938E6. Pretty much any plain text (or binary) data format will have this same issue, unless it supports additional metadata for display purposes.
If you intended the data to be a string value, then it should have been enclosed in quotes.
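To make that concrete, the numeric parse really does treat all three spellings as one value; only as strings do they differ (Python shown for illustration):

```python
# All valid representations of the same number...
assert int("007449938") == 7449938
assert float("7.449938E6") == 7449938.0

# ...but distinct as strings, which is what an identifier needs.
assert "007449938" != "7449938"
```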
Your assertion sounded incorrect, so I tested it. Quoting large numbers does not preserve them on a save.
Adding a ' before the value should work, but for data extracted from arbitrary systems you'll have few guarantees of the sort. Formatting the cells in Excel to display the full value instead of the truncated one also works, from memory, but you won't always know ahead of time whether this has happened further down the file.
In our case it was often mailing identifier barcodes so any loss of precision made them entirely useless.
What if the value is an ID rather than an integer? CSV doesn't specify whether the data type is a string or a number. The schema needs to be inferred by the parser/user.
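Which suggests the defensive default: read every field as a string and convert to numbers only on purpose. A minimal sketch with pandas (the column names are made up for illustration):

```python
import io
import pandas as pd

raw = "id,amount\n007449938,12.5\n"

# dtype=str keeps leading zeros and long identifiers intact.
df = pd.read_csv(io.StringIO(raw), dtype=str)

# Opt in to numeric types per column, deliberately.
df["amount"] = pd.to_numeric(df["amount"])
print(df["id"][0])  # "007449938" -- zeros preserved
```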
> But humans rarely use notepad these days. They use code editors like Coda or VSCode at the very least which have all kinds of advanced features. Surely, those can include support for ASCII separators?
I do a fair amount of work with companies that do "EDI" over CSV (or worse, CSV-like: think two CSVs jammed together with different formats, no headers, no support for escaping or quoting) and fixed-width documents. I can absolutely assure you that humans open these files in Notepad far more often than I'd like.
Often one of the main reasons they don't use things like X12, ASCII separators, etc. is that "a human needs to open it at some point" was a prevailing business decision some number of years ago (think: "What happens if the IT system fails? How can we still ship stuff in a complete emergency?"), and now it's baked into their documented process so deeply that it's like shouting into the wind to alter things. Third-party warehouses are the worst at this.
Or they use Excel, which can introduce its own entertaining errors.
A few years back, while working on something that unavoidably used large quantities of CSV data, we would sternly warn people not to use Excel, but people still did it ("But I just looked at it!").
The TSV (tab-separated values) format is supported by a lot of software (Google Sheets, for instance) and solves the "comma in the field" problem quite nicely. Tabs can easily be typed by a human, but they don't usually occur within a one-line string, so they don't cause the same problems as CSVs do.
TSV, CSV, PSV etc. all share the same class of problem. The important thing is that you are quoting your fields and escaping your quotes, regardless of the symbols used.
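To illustrate: with Python's csv module, only the delimiter argument changes between CSV, TSV, and PSV; the quoting machinery is what handles a delimiter or quote turning up in the data (a small sketch):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter="|", quoting=csv.QUOTE_MINIMAL)
# Fields containing the delimiter or quote char get quoted,
# and embedded quotes get doubled.
writer.writerow(["plain", "has | a pipe", 'has "quotes"'])

print(buf.getvalue())
# plain|"has | a pipe"|"has ""quotes"""
```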
Importantly, tabs aren’t usually inserted into fields by Excel users. With Power BI, exporting TSV is a little easier for non-dev Excel power users, too.
Nailed it. When Excel users press tab, they mean "go to the next cell." Conveniently, this is also what the format means to the computer when it encounters the tab separator. It's not foolproof, but it's much better. The human interpretation of the character matches the machine interpretation.
They do. People write code to create all sorts of bastardized abominations of "CSV" or tab-delimited or whatever. It's why the featured article gets reposted every few years. You can define a standard for CSV files, but then Excel does its own thing, and here we are.
I think that stems from the fact that no software ever properly supported them; they are non-printable characters, so you can't see them easily. Whereas with CSV you can fire up notepad.exe and explore the data.
Given the inability to standardise on line endings ("\n", "\r\n", or "\r"), I don't think we can standardise on using unit or record separators either.
Have you ever seen the ASCII separator characters used as they were intended? I don't think I have. It's obvious the problem they were trying to solve, but it was too little too late. It doesn't help that they're control characters that aren't meant to be displayed so they're practically invisible.
CSV is human readable and writable, but that is only part of its advantage. Its simplicity makes it great as an archival format: it's the plain text of data formats. Decades pass and you can still read it without much of an issue, though people are rightly pointing out that the format is ambiguous in nature and subject to a lot of interpretation. I've recently been compiling a lot of decades-old experimental datasets, and it's honestly great when I find plain text files from the 1980s, since we can still read them! It's the more exotic data formats that came about later that are often more difficult to read properly.
For a lot of people, meaningful whitespace is a hard concept to understand.
It's made worse by the number of applications where they see whitespace reformatted willy-nilly (for instance, web forms eating their line breaks).
From there, explaining that there are spaces and tabs, and that both can look the same on screen while being different characters, is just asking for trouble outside of our circles.
> is just asking for trouble outside of our circles
It's asking for trouble in our circles too. Imagine getting a tab-separated file, opening it in an editor that automatically converts the tabs to 4 spaces "because", and then sending it back without checking, since nothing looks wrong.
Or an input/editor where the `tab` key moves your focus from the text field to the next button, like Slack. You can't just hit `tab` in Slack when typing a code snippet for something that needs tabs, like a Makefile.
I think we have different definitions of human readable, then. To me, human readable means that if I print it on a printer, it loses no fidelity. What you describe is what I would call machine readable: it can be imported into a program. A binary separator would let it be imported just the same.
If those characters had a visual representation and were easily typeable, and we had settled on using them as separators, then we would struggle with escaping those special characters within records. The problem is lack of standardization, not lack of special separator characters.
ASCII FS/GS/RS/US have no visual representation, and users do not know how to enter them. That makes them suitable only for binary-like encodings; data isn't really text if users can't see or enter the separators easily.
But users can see the commas in CSV, and they can trivially enter them. Yeah, the result is messy.
The lesson here is that the separator control characters should have been visible and should have had an indicator on keyboard keycaps showing how to enter them. Because they aren't and don't, they are essentially useless for text.
EDIT: I do happen to know how to enter these on *nix systems; the ascii(7) man page tells you how. FS is ^\ (which you have to be careful does not make the tty generate SIGQUIT; type ^V^\ to enter it literally), and then GS, RS, and US are ^], ^^, and ^_. That is, <control>\, <control>], <control>^, and <control><shift>-. On U.S. keyboards, anyway.
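For reference, the four separators and their byte values; trivial to emit from code even if they're awkward to type (a quick sketch):

```python
# ASCII separator control characters, from widest to narrowest scope.
SEPARATORS = {
    "FS (file separator)":   0x1C,  # ^\
    "GS (group separator)":  0x1D,  # ^]
    "RS (record separator)": 0x1E,  # ^^
    "US (unit separator)":   0x1F,  # ^_
}

for name, code in SEPARATORS.items():
    print(f"{name}: {code} (0x{code:02X}, {chr(code)!r})")
```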
I think CSV is more popular than one would think because Excel was awful back in the old days.
A lot of the crufty edge cases are artifacts of whatever people had to deal with in 1995 to get data into spreadsheets. We all got used to pushing data around with CSV because the people consuming the data needed it.
That's funny. When I was twelve I needed to store some data and didn't really know about databases or CSV, so I actually used the ASCII control characters to implement tables and records.
I later discovered SQLite, soooo much simpler...
Based on my understanding of CSV, unit separators are valid column delimiters. I doubt you could use record separators as row delimiters, though.