
There's a specification problem here. I like to say that a "string" isn't a data structure, it's the absence of one. Discussing "strings" is pointless. It follows that comparing programming languages by their "string" handling is likewise pointless.

Case in point: a "struct" in languages like C and Rust is literally a specification of how to treat segments of a "string" of contiguous bytes.
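A minimal sketch of that idea in C (the wire format here is made up for illustration, and real code has to worry about padding, alignment, and endianness):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* A hypothetical wire format: a 2-byte id followed by a 4-byte count. */
    struct record {
        uint16_t id;
        uint32_t count;
    };

    int main(void) {
        /* Six opaque bytes; the struct is what tells us how to read them. */
        unsigned char wire[6] = {0x01, 0x00, 0x2a, 0x00, 0x00, 0x00};

        struct record r;
        memcpy(&r.id, wire, sizeof r.id);           /* bytes 0-1 -> id    */
        memcpy(&r.count, wire + 2, sizeof r.count); /* bytes 2-5 -> count */

        /* Assuming a little-endian machine, prints id=1 count=42. */
        printf("id=%u count=%u\n", (unsigned)r.id, (unsigned)r.count);
        return 0;
    }

Without the struct, those six bytes are just six bytes.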



In languages like C, a “string” isn’t a proper data structure; it’s a `char` array, which itself is little more than an `int` array or a `byte` array.

But these languages don’t provide true “string” support. They just have a vaguely useful type alias that renames a byte array to a char array, and a bunch of byte array functions that have been renamed to sound like string functions. In reality all the language supports is byte arrays, with some syntactic sugar so you can pretend they’re strings.

Newer languages like Go and Python 3, which were created in the world of Unicode, provide true string types, where the type primitives properly deal with the idea of variable-length characters and provide tools that make it easy to manipulate strings and characters as independent concepts. If you want to ignore Unicode, because your specific application doesn’t need to understand it, then you need to cast your strings into byte arrays, and all pretences of true string manipulation vanish at the same time.

This is not to say that C can’t handle Unicode. It’s just that the language doesn’t provide true primitives to manipulate strings and instead relies on libraries to provide that functionality, which is a perfectly valid approach. Baking more complex string primitives into your language is also a perfectly valid approach. It’s just a question of trade-offs and use cases, i.e. the problem at the heart of all good engineering.
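For a concrete example of where the byte-array view leaks through, here's a sketch (the é is spelled out as explicit UTF-8 bytes so nothing depends on source encoding):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "café", with é written as its UTF-8 bytes 0xC3 0xA9. */
        const char *s = "caf\xc3\xa9";

        /* strlen counts bytes up to the first '\0', not characters. */
        printf("%zu\n", strlen(s)); /* 5, though a reader sees 4 characters */
        return 0;
    }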


Having your strings be conceptually made up of UTF-8 code units makes them no less strings than those made up of Unicode code points. As this article shows, working with code points is often not the right abstraction anyway, and you need to go all the way up to grapheme clusters to have anything close to what someone would intuitively call a character. Calling a code point a character is no more correct or useful than calling a code unit a char.

All you gain by having Unicode code point strings is the illusion of Unicode support until you test anything that uses combining characters or variant selectors. In essence, languages opting for such strings are making the same mistake that Windows/Java/etc. did when adopting UTF-16.
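A quick sketch of that failure mode in C (count_code_points is a toy decoder that assumes well-formed UTF-8):

    #include <stdio.h>
    #include <string.h>

    /* Count code points in a UTF-8 string by skipping continuation
     * bytes (the ones of the form 10xxxxxx). A real decoder would
     * have to validate the input first. */
    static size_t count_code_points(const char *s) {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void) {
        /* U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT:
         * one grapheme cluster, rendered as "é". */
        const char *s = "e\xcc\x81";

        printf("bytes: %zu\n", strlen(s));                  /* 3 */
        printf("code points: %zu\n", count_code_points(s)); /* 2 */
        /* Graphemes: 1; neither count above tells you that. */
        return 0;
    }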


Even the most basic ASCII string is still a data structure.

Is it a Pascal string (a length byte followed by data) or a C string (an arbitrary run of bytes terminated by a null character)?
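The two layouts side by side, sketched as C declarations (classic Pascal used a single length byte, capping strings at 255 characters, which this sketch mirrors):

    /* C string: the bytes plus a terminating '\0'; the length is
     * implicit and found by scanning. */
    const char c_str[] = "hello"; /* 6 bytes: 'h','e','l','l','o','\0' */

    /* Pascal-style string: an explicit length prefix, no terminator. */
    struct pascal_str {
        unsigned char len;
        char data[255];
    };

    const struct pascal_str p_str = { 5, "hello" };

Same five payload bytes, two different data structures.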


You qualified "string" with "ASCII", and also tacitly admitted you still need more information than the octets themselves: the length.

Of course, various programming languages have primitives and concepts which they may label "string". But you still need to specify that context, drawing in the additional specification those languages provide. Plus, traditionally and in practice, such concepts often serve the function of importing or exporting unstructured data. So even in the context of a specific programming language, the label "string" is often used to elide details necessary to understanding the content and semantics of some particular chunk of data.


I think I understand the difference; you're using "string" the way I would use "blob" or "untyped byte array."

Shifting definitions to yours, I agree.


Yep, it's as meaningful a programming task as 'reverse this double-precision float'.


We would all be better off if this were actually true.

Tragically, in C, a string is just barely a data structure, because it must have \0 at the end.

If it were the complete absence of a data structure, we would need some other way to get at its length, and we could treat a slice of it as the same sort of thing as the thing itself.


C doesn’t really have strings at all. It has char pointers, and some standard functions that take a char pointer and act on all the chars starting from that pointer up to the first \0.
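The whole contract fits in a few lines; here's a sketch of what those standard functions boil down to:

    #include <stddef.h>

    /* Walk forward from the pointer until a zero byte. Nothing
     * records where the buffer ends, so a missing '\0' walks
     * straight into undefined behavior. */
    size_t my_strlen(const char *s) {
        const char *p = s;
        while (*p)
            p++;
        return (size_t)(p - s);
    }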

When you’re handling any kind of C pointer you need to know how big the buffer is around that pointer where pointer-arithmetic accesses make sense - but for a string, you also want to know ‘how much of the buffer is full of meaningful character data?’ - or else you’re stuck with fixed-width text fields like some kind of COBOL caveman.

But because C was designed by clever people for clever people they figured the standard string functions can just be handed a char pointer without any buffer bounds info because you can be trusted to always make sure that the pointer you give them is below a \0 within a single contiguous char buffer.

And that worked out great.


You can work with pointer+length or begin+end pairs in C just fine - it's just annoying. But you can always upgrade to C++ and use std::string_view to abstract that for you if you want.
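A sketch of the pointer+length approach in plain C (str_slice is a made-up name, not a standard type):

    #include <stdio.h>

    /* Pointer plus length: no '\0' needed, and sub-slices can
     * share the original storage without copying. */
    struct str_slice {
        const char *ptr;
        size_t len;
    };

    static struct str_slice slice(const char *s, size_t start, size_t len) {
        struct str_slice out = { s + start, len };
        return out;
    }

    int main(void) {
        const char *buf = "hello, world";
        struct str_slice word = slice(buf, 7, 5); /* "world", no copy */

        /* %.*s prints exactly len bytes, so no terminator is required. */
        printf("%.*s\n", (int)word.len, word.ptr);
        return 0;
    }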



