I think it's a fine example, since the decomposed version clearly exists and can arrive in your program as data in a number of ways.
That said, the grapheme cluster note will have examples of extended notions of characters that can't be represented by an equivalent single codepoint. There are some Korean and Indic examples, and also emoji: http://unicode.org/reports/tr29/
Oh, and something that stirs the hearts of us hackers: \r\n is a single grapheme!
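A quick Python illustration of both points (the grapheme check uses the third-party regex module, since the stdlib re has no \X support):

    import unicodedata
    import regex  # third-party; stdlib re can't match \X (extended grapheme cluster)

    precomposed = "\u00e9"   # 'é' as a single code point
    decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
    print(len(precomposed), len(decomposed))                        # 1 2
    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

    # CRLF really is one extended grapheme cluster per UAX #29:
    print(regex.findall(r"\X", "a\r\nb"))   # ['a', '\r\n', 'b']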
IMO, len("") should raise OperationHasNoCommonSenseTodayException, just to not confuse unicode newbies. Modern text is not an array, it is a format, a complex format. Zero-width, double-width monotypes, combining, direction marks, normalization, etc. Almost no one wants to know about codepoint details when working with "just input strings", and those who want may use special namespaces for that. There is no point in making len() an alias for unicode.volatile_number_of_distinct_code_points().
visually_empty(s) is okay, len(s) is probably not.
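For what it's worth, a rough, hypothetical visually_empty in Python could look something like this; the category check is a naive stand-in for "renders as nothing", not a real definition:

    import unicodedata

    def visually_empty(s: str) -> bool:
        # Naive approximation: format chars (Cf), controls (Cc), and
        # combining marks with nothing to attach to (Mn, Me). A real
        # implementation would consult Default_Ignorable_Code_Point,
        # the font, etc.
        return all(unicodedata.category(c) in ("Cf", "Cc", "Mn", "Me") for c in s)

    print(visually_empty(""))        # True
    print(visually_empty("\u200b"))  # True  -- ZERO WIDTH SPACE, yet len() == 1
    print(visually_empty("a"))       # False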
They're all valid measurements. The length of a string could be measured in bytes (which doesn't require you to know the encoding), code points (which doesn't require huge Unicode grapheme clustering tables), grapheme clusters (which doesn't require a font) or even pixels (which does require a font).
The best thing a language (or library) can do is to not bless any single one of these as the default. Strings shouldn't have a length at all. They should provide properties/methods/accessors like byte_count, codepoint_count, grapheme_count etc. Make the user of the API think every time they're asking for a length of the string - which one do they actually need? Which one is the best for whatever they're trying to do?
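A minimal Python sketch of that kind of API, reusing the byte_count / codepoint_count / grapheme_count names from above (Text is a made-up wrapper; grapheme counting leans on the third-party regex module, and pixel width is left out since it needs a font):

    import regex  # third-party, for \X (extended grapheme clusters)

    class Text:
        def __init__(self, s: str):
            self._s = s

        @property
        def byte_count(self) -> int:
            # Needs an encoding to be meaningful; UTF-8 assumed here.
            return len(self._s.encode("utf-8"))

        @property
        def codepoint_count(self) -> int:
            return len(self._s)  # Python strings are code point sequences

        @property
        def grapheme_count(self) -> int:
            return len(regex.findall(r"\X", self._s))

    t = Text("e\u0301")       # decomposed 'é'
    print(t.byte_count)       # 3
    print(t.codepoint_count)  # 2
    print(t.grapheme_count)   # 1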
None of the mainstream ones that I can think of. Which is really unfortunate, not least because the defaults are all over the place - usually it's either bytes (when strings are UTF-8) or code units (when strings are UTF-16 - note: code units, not code points, so surrogate pairs count as 2!). Occasionally it's genuine code points, as in Python. Which, I think, goes to show why it's such a mess.
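To make the code unit vs. code point distinction concrete, a small Python demo (the UTF-16 count divides byte length by two, since every UTF-16 code unit is two bytes):

    s = "\U0001F926"  # FACE PALM emoji, one code point outside the BMP

    print(len(s))                           # 1 code point (Python counts these)
    print(len(s.encode("utf-8")))           # 4 UTF-8 bytes
    print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)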
I think that if you treat strings as just lists[0] of UTF-8 code units, with code points, grapheme clusters, etc. as mere views/adapters over those bytes, you're probably going to benefit the most (rough sketch after the footnote).
[0]: When I say 'lists', I mean whatever the standard idea of a sequence of things is in the language. For C that's the array, or maybe the pointer+length pair. For Go it's a slice. For Rust, an iterator perhaps? For Python, it's a list.
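Something like this rough Python sketch (Utf8String is hypothetical; the bytes are the one canonical representation, everything else is a derived view, with graphemes again via the third-party regex module):

    import regex  # third-party, for \X

    class Utf8String:
        """Owns raw UTF-8 bytes; code points and graphemes are derived views."""

        def __init__(self, data: bytes):
            self._data = data  # the single canonical representation

        def bytes_view(self) -> bytes:
            return self._data

        def codepoints(self):
            # Decoding is the adapter from code units to code points.
            return iter(self._data.decode("utf-8"))

        def graphemes(self):
            return iter(regex.findall(r"\X", self._data.decode("utf-8")))

    s = Utf8String("e\u0301".encode("utf-8"))
    print(len(s.bytes_view()))        # 3 bytes
    print(len(list(s.codepoints())))  # 2 code points
    print(len(list(s.graphemes())))   # 1 grapheme cluster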