I'm in the process of designing a scripting language and implementing it in C++. I plan to put together a YouTube series about it. (Doesn't everyone want to see Bison and Flex mixed with proper unit tests and C++20 code?)
Due to my future intended use case, I needed good support for Unicode. I thought that I could write it myself, and I was wrong. I wasted two weeks (in my spare time, mostly evenings) trying to cobble together things that should work, identifying patterns, figuring out how to update it as Unicode itself is updated, thinking about edge cases, i18n, zalgo text, etc. And then I finally reached the point where I knew enough to know that I was making the wrong choice.
I'm now using ICU. (https://icu.unicode.org/) It's huge, it was hard to get it working in my environment, and there are very few examples of its usage online, but after the initial setup dues are paid, it WORKS.
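Since worked examples are thin on the ground, here's roughly what one of the basic operations looks like: counting user-perceived characters (grapheme clusters) with icu::BreakIterator. A minimal sketch, assuming ICU is installed and linked (e.g. via pkg-config icu-uc), with error handling pared down to the bare minimum:

    // Count grapheme clusters in a UTF-8 string using ICU's BreakIterator.
    #include <unicode/brkiter.h>
    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    long grapheme_count(const std::string& utf8) {
        UErrorCode status = U_ZERO_ERROR;
        icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);

        // A "character" break iterator walks grapheme-cluster boundaries.
        std::unique_ptr<icu::BreakIterator> it(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
        if (U_FAILURE(status)) return -1;

        it->setText(text);
        it->first();
        long clusters = 0;
        while (it->next() != icu::BreakIterator::DONE) ++clusters;
        return clusters;
    }

    int main() {
        // UTF-8 bytes written out explicitly: 'e' + U+0301 COMBINING ACUTE, " caf", U+00E9.
        std::string s = "e\xCC\x81 caf\xC3\xA9";   // 9 bytes, 7 codepoints
        std::cout << grapheme_count(s) << "\n";    // prints 6: the e + accent collapse into one
    }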
Aside: Yes, I know I'm crazy for implementing a programming language that I intend for serious usage. Yes, I have good reasons for doing it, and yes I have considered alternatives. But it's fun, so I'm doing it anyways.
Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem!
Handling unicode can be fine, depending on what you're doing. The hard parts are:
- Counting, rendering and collapsing grapheme clusters (like the flag emoji)
- Converting between legacy encodings (Shift-JIS, KOI8, etc.) and UTF-8 / UTF-16
- Canonicalization
If all you need is to deal with utf8 byte buffers, you don't need all that stuff. And your code can stay simple, small and fast.
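To make "simple, small and fast" concrete: if the input is already UTF-8 and all you need is codepoints (no grapheme clusters, collation or normalization), the decoder fits in a page. A rough sketch; it maps malformed sequences to U+FFFD and skips the stricter checks (overlong forms, surrogates, values above U+10FFFF) that a production validator would add:

    #include <cstdint>
    #include <string_view>

    // Decode one UTF-8 codepoint starting at s[i] and advance i past it.
    // Malformed input yields U+FFFD (the replacement character).
    uint32_t next_codepoint(std::string_view s, size_t& i) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; return b0; }                    // ASCII fast path

        size_t len = ((b0 >> 5) == 0x06) ? 2                  // 110xxxxx
                   : ((b0 >> 4) == 0x0E) ? 3                  // 1110xxxx
                   : ((b0 >> 3) == 0x1E) ? 4 : 0;             // 11110xxx
        if (len == 0 || i + len > s.size()) { ++i; return 0xFFFD; }

        uint32_t cp = b0 & (0x7F >> len);                     // payload bits of the lead byte
        for (size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) { ++i; return 0xFFFD; }   // expected a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        i += len;
        return cp;
    }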
IIRC the rust standard library doesn't bother supporting any of the hard parts in unicode. The only real unicode support in std is utf8 validation for strings. All the complex aspects of unicode are delegated to 3rd party crates.
By contrast, nodejs (and web browsers) do all of this. But they implement it in the same way you're suggesting - they simply call out to libicu.
> The only real unicode support in std is utf8 validation for strings.
Rust's core library gives char methods such as is_numeric, which ask whether a Unicode codepoint is in one of Unicode's numeric categories, such as the letter-like numerics and the various digit sets. (Rust does provide char with is_ascii_digit and is_ascii_hexdigit if that's all you actually care about.)
So yes, the Rust standard library is carrying around the entire Unicode character-category table, among other things. Of course, Rust's standard library doesn't all get built into your binary, so if you never use these features your binary doesn't get that code.
It always feels like the most work goes into the least-used emoji. So many revisions and additions to the family emoji, and yet it's one of the ones I don't recall anyone ever using.
I think the trap Unicode got into is that technically they can have infinite emoji, so they just don't ever have a way to say no to new proposals.
> It always feels like the most work goes into the least-used emoji.
I always feel like those emoji were added on purpose in order to force implementations to fix their unicode support. Before emoji were added, most software had completely broken support for anything beyond the BMP (case study: MySQL's so-called "UTF8" encoding). The introduction of emoji, and their immediate popularity, forced many systems to better support astral planes (that is officially acknowledged: https://unicode.org/faq/emoji_dingbats.html#EO1)
Progressively, emoji using more advanced features were introduced, which forced systems (and developers) to fix their Unicode handling, or at least improve it somewhat: e.g. skin-tone modifiers combining with base codepoints, etc.
> I think the trap Unicode got into is that technically they can have infinite emoji, so they just don't ever have a way to say no to new proposals.
You should try to follow a new character through the process, because that's absolutely not what happens and shepherding a new emoji through to standardisation is not an easy task. The unicode consortium absolutely does say no, and has many reasons to do so. There's an entire page on just proposal guidelines (https://unicode.org/emoji/proposals.html), and following it does not in any way ensure it'll be accepted.
WTF business do emojis have in Unicode? The BMP is all there ever should have been. Standardize the actual writing systems of the world, so everyone can write in their language. And once that is done, the standard doesn't need to change for a hundred years.
What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that. I guess the BMP is a good start, even though it already contains superfluous crap like "dingbats" and boxes.
Unicode didn't invent emoji; it incorporated them because they were already popular in Japan, and not incorporating them would have greatly reduced Japanese adoption.
Keep in mind that Unicode was intended to unify all the disparate encodings that had been brewed up to support different languages and which made exchanging documents between non-English speaking countries a nightmare. The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about. And they weren't alone, of course [1].
> What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that.
Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).
You may never need anything outside the BMP, but that doesn't make the rest of the planes worthless. Ignoring the value of including dead and nearing-extinct languages for preservation purposes (not being able to type a language will basically guarantee its extinction, with inventing a new encoding and storing text as jpgs being the only real alternatives), there are a lot of people speaking languages found in the SMP [2][3] ([2] has 83 million native speakers, for example).
> The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about.
Mojibake was not a "Japan has too many encodings" problem. It was a "western developers assume everyone is using CP1252" problem.
> Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).
Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.
Mojibake is a universal problem whenever multiple charsets are in use and there is no charset specified in the metadata. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.
Unicode/UTF-8 is widely adopted and recommended in Japan, and there is no widely used alternative. Japanese companies tend to still use SJIS, but that's just laziness. Han unification isn't a problem if you only handle Japanese text: just use a Japanese font everywhere. Handling multilingual text is a pain, but there are no alternatives anyway.
> Mojibake is a universal problem whenever multiple charsets are in use and there is no charset specified in the metadata. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.
In theory it can happen with any combination of character sets, sure, but in practice every example of mojibake I've seen has been SJIS (or UTF-8) encoded text being decoded as CP1252 ("Latin-1" but that's an ambiguous term) by software that assumed the whole world used that particular western codepage. If you've got examples of SJIS vs EUC-JP confusion in the wild I'd be vaguely interested to see them (is there even anywhere that still uses EUC-JP?)
> Japanese companies tend to still use SJIS, but that's just laziness.
It's not just laziness; switching to unicode is a downgrade, because in practice it means you're going to get your characters rendered wrongly (Chinese style) for a certain percentage of your customers, for little clear benefit.
> Handling multilingual text is a pain, but there are no alternatives anyway.
Given that you have to build a structure that looks like a sequence of spans with language metadata attached to each one, there's not much benefit to using unicode versus letting each span specify its own encoding.
Maybe the guess order reasonably depends on the locale. The GP comment was mainly based on my experience with old ja-JP-locale Windows software. IIRC Unix software tends to be worse at guessing, so maybe that's what you're referring to.
Nowadays I rarely see new EUC-JP content (or maybe I just don't recognize it), but I still sometimes hit mojibake in Chrome when visiting old homepages (maybe once a month). For web pages, most modern pages (including SJIS ones) don't rely on guessing anyway; they have a <meta charset> tag, so mojibake very rarely happens. For plain-text files, I still see UTF-8 files shown as SJIS in Chrome on Windows.
Viewing Japanese-only UTF-8 text is totally fine on Japanese-locale Windows/Mac/(Linux, but YMMV). So your case is viewing the text under a non-Japanese locale. That may be a problem, but how did SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/language, like HTML/Word do?
I believe no developer wants to deal with foreign charsets like GBK/Big-5/whatever; there is very little information about them. If a developer can switch the charset used to read a file, then they can also switch the font.
> Viewing Japanese-only UTF-8 text is totally fine on Japanese-locale Windows/Mac/(Linux, but YMMV). So your case is viewing the text under a non-Japanese locale.
The issue is that realistically a certain proportion of customers are going to have the wrong locale setting or wrong default font set.
> That may be a problem, but how did SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/language, like HTML/Word do?
Certainly Firefox will use a Japanese font by default for SJIS whereas it will use a generic (i.e. Chinese) font by default for UTF-8. I would expect most encoding-aware programs would do the same?
> If a developer can switch the charset used to read a file, then they can also switch the font.
Sure, but it works both ways. And it's actually much easier for a lazy developer to ignore the font case because it's essentially only an issue for Japan. Whereas if you make a completely encoding-unaware program it will cause issues in much of Europe and all of Asia (well, it did pre-UTF8 anyway).
I think by far the largest contributor to coining "mojibake" was e-mail MTAs. Some e-mail implementations assumed 7-bit ASCII for all text and dropped the MSB of 8-bit SJIS/Unicode/etc., ending up as corrupt text at the receiving end. Next up were texts written in EUC-JP (Extended UNIX Code), probably by someone running either a real Unix (likely Solaris) or early GNU/Linux, and floppies from a classic Mac OS computer. Those must have defined the term, and various edge cases on the web, like header-encoding mismatches, popularized it.
"Zhonghua fonts" issue is not necessarily linked to encoding, it's an issue about assuming or guessing locales - that has to be solved by adding a language identifier or by ending han unification.
> Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.
This is an absolute shame, and there is no excuse for not fixing it (so that variations of unified characters can be encoded) before adding unimportant things like skin tones.
> So rather than treat the issue as a rich text problem of glyph alternates, Unicode added the concept of variation selectors, first introduced in version 3.2 and supplemented in version 4.0.[10] While variation selectors are treated as combining characters, they have no associated diacritic or mark. Instead, by combining with a base character, they signal the two character sequence selects a variation (typically in terms of grapheme, but also in terms of underlying meaning as in the case of a location name or other proper noun) of the base character. This then is not a selection of an alternate glyph, but the selection of a grapheme variation or a variation of the base abstract character. Such a two-character sequence however can be easily mapped to a separate single glyph in modern fonts. Since Unicode has assigned 256 separate variation selectors, it is capable of assigning 256 variations for any Han ideograph. Such variations can be specific to one language or another and enable the encoding of plain text that includes such grapheme variations. - https://en.m.wikipedia.org/wiki/Han_unification
This is what you're asking for, right? Control characters that designate which version of a unified character is to be displayed.
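For the curious, the mechanism really is just extra codepoints sitting in the plain text, no markup involved. A tiny illustration; which base+selector pairs are actually registered is governed by the Ideographic Variation Database, so treat the specific pairing here as illustrative:

    #include <string>

    // A variation sequence is the base character immediately followed by a
    // variation selector. U+8FBB is a kanji often cited for its one-dot vs.
    // two-dot radical variants; U+E0100 is VS17, the first ideographic
    // variation selector. Fonts that don't know the sequence simply render
    // the base glyph, so the text degrades gracefully.
    const std::u8string variant = u8"\u8FBB\U000E0100";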
Have emoji not become part of our writing structure though? A decent percentage of online chats and comments, especially on social networks, includes at least one emoji that couldn't be easily or accurately represented in the regular written language.
Recently implementers of unicode have censored the gun emoji in a way that changes the meaning of many existing online chats and comments. So you can't easily or accurately represent things even with unicode.
Emoji have never been effective or consistent at conveying meaning; at best they convey something to someone in the same subculture and time period, and often not even that. Given that unicode implementers are ok with erasing the meaning of some of them, it should be ok to eliminate more of them.
> Emoji have never been effective or consistent at conveying meaning; at best they convey something to someone in the same subculture and time period
Isn't that the same with all words though? Think how much English usage changes in a generation. For instance, my girlfriend will use the term "I'm dead!" in a similar context to where I would say "LOL" and where my father would have said "What the fuck is loll?"
There's a spectrum. Subculture-specific slang changes quickly, but most words have a longer lifetime; reading Chaucer today is difficult but doable. Given that we don't encode words but only letters, for English you have to go back to the disappearance of þ to get a change that's relevant to text encoding. Emoji shift faster and are less effective at conveying meaning than any "real" language.
This argument was lost the moment Unicode was created. Japanese carriers had created their own standard for emoji encoding for sms. And they would not switch to Unicode unless the emoji were ported over.
It’s a tricky situation. Maybe allowing an arbitrary bitmap char to represent any emoji would have been better but then we could have ended up in a situation where normal text or meaningful punctuation or perhaps even fonts would get encoded as bitmaps.
For something like a face or hand gesture, a bitmap likely would have been better since it would at least look the same on all platforms.
I don't think that argument holds water. Emoji could just as well have been encoded as markup. There were for instance long-established conventions of using strings starting with : and ; . Bulletin boards extended that to a convention using letters delimited by : for example :rolleyes: . Not to mention that those codes can be typed more efficiently than browsing in an Emoji Picker box.
Because emoji became characters, text rendering and font formats had to be extended to support them.
There are four different ways to encode emoji in OpenType 1.8:
* Apple uses embedded PNG
* Google uses embedded colour bitmaps
* Microsoft uses flat glyphs in different colours layered on top of one another
* Adobe and Mozilla use embedded SVG documents
> Emoji could just as well have been encoded as markup.
They could have, but they were already being encoded as character codepoints in existing charactersets. So any character encoding scheme that wanted to replace all use cases for existing charactersets needed to match that. If switching charactersets meant you lost the ability to use emoji until you upgraded all your applications to support some markup format, people would just not switch.
> If switching charactersets meant you lost the ability to use emoji until you upgraded all your applications to support some markup format, people would just not switch.
You need to upgrade those applications to support Unicode too.
Not necessarily; most applications already supported multiple encodings, so having the OS implement one of the Unicode encodings was often all that was needed.
I'd say the important part was that Japanese carriers were weaponizing flip-phone culture to gatekeep "PCs" and open-standard smartphones out of their microtransaction ecosystem. Emoji were one of the keys to disproving the FUD that the iPhone couldn't be the equal of flip phones, and to establishing first-class-citizen status.
You are underestimating how much language evolves. In fact, you are proposing brakes to stop it evolving. If nothing else, new currency symbols need to be incorporated every few years. The initial emoji were part of the actual writing systems of the world, even if they were relatively new and only being used by foreigners. Or maybe they have been part of world culture since the 1950s :-) ? https://en.wikipedia.org/wiki/Smiley
Exactly this. Humans have incredibly complicated writing systems, and all Unicode wants to do is encode them all. Keep in mind that the trivial toy system we're more familiar with, ASCII, already has some pretty strange features because even to half-arse one human writing system they needed those features.
Case is totally wild, it only applies to like 5% of the symbols in ASCII, but in the process it means they each need two codepoints and you're expected to carry around tech for switching back and forth between cases.
And then there are several distinct types of white space, each gets a codepoint, some of them try to mess with your text's "position" which may not make any sense in the context where you wanted to use it. What does it mean to have a "horizontal tab" between two parts of the text I wanted to draw on this mug? I found a document which says it is the same as "eight spaces" which seems wrong because surely if you wanted eight spaces you'd just write eight spaces.
And after all that ASCII doesn't have working quotation marks, it doesn't understand how to spell a bunch of common English words like naïve or café, pretty disappointing.
This work wasn't done for emoji. They use the same zero-width joiner character [1] that exists to support Indic scripts like Devanagari, and any system that properly handles these languages will also properly handle the emoji.
Yes, this adds a lot of complexity, but it's really a question of whether that complexity is justified in order to support all of the world's languages. And I think many would argue that it is.
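To see how the pieces are shared, it really is the same U+200D codepoint in both kinds of text; what it means is up to the script's shaping rules. A small illustration (the Devanagari case is the commonly cited half-form request; exact rendering is up to the font and shaper):

    #include <string>

    // Devanagari: KA + VIRAMA + ZWJ + SSA asks the shaper for the half-form
    // of the first consonant rather than the full conjunct ligature.
    const std::u8string half_form = u8"\u0915\u094D\u200D\u0937";

    // Emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL renders as one family emoji where
    // supported, and falls back to three separate emoji where it isn't.
    const std::u8string family = u8"\U0001F468\u200D\U0001F469\u200D\U0001F467";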
I know how that feels. I wrote a little C++ program to fetch data in Unicode from a DB and then normalize it to ASCII for analytics purposes. It's a lot faster to do the analytics on ASCII than to try to handle all the fun cases of how many ways an 'e' (etc.) can be input. ICU to the rescue! It took a couple of weeks to get up to speed; ICU itself wasn't too bad to figure out. But you find out very quickly that to use it, you need a good understanding of a number of the Unicode technical reports to actually know how to make use of it. Fun times indeed.
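If anyone wants to do something similar, ICU's Transliterator covers a lot of it. A minimal sketch, assuming you link against both icuuc and icui18n; the rule string "Any-Latin; Latin-ASCII" is one common recipe (another classic is "NFD; [:Nonspacing Mark:] Remove; NFC"), not necessarily what the parent comment used:

    #include <unicode/translit.h>
    #include <unicode/unistr.h>
    #include <unicode/utrans.h>
    #include <iostream>
    #include <memory>
    #include <string>

    // Fold a UTF-8 string down to its closest ASCII approximation.
    std::string to_ascii(const std::string& utf8) {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::Transliterator> tr(icu::Transliterator::createInstance(
            "Any-Latin; Latin-ASCII", UTRANS_FORWARD, status));
        if (U_FAILURE(status)) return utf8;   // fall back to the input on error

        icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);
        tr->transliterate(text);              // modifies the string in place

        std::string out;
        text.toUTF8String(out);
        return out;
    }

    int main() {
        // UTF-8 bytes for "Café, naïve" written out explicitly.
        std::cout << to_ascii("Caf\xC3\xA9, na\xC3\xAFve") << "\n";   // "Cafe, naive"
    }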
Do you have a YouTube channel people can subscribe to in anticipation of you releasing the series about your work? The development process of new languages is so intriguing.
It would actually be pretty interesting to see how you use Bison and Flex with UTF-8.
Most resources say not to bother because of their lack of Unicode support, but they're so ubiquitous.
Do they need special support for UTF-8? One of the nice things about UTF-8 is that you can treat it as an 8-bit encoding in many cases if you only care about substrings and don't need to decode individual non-ASCII characters.
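The usual trick is to keep the scanner byte-oriented and lump every byte >= 0x80 in with the identifier characters, so multi-byte sequences stay intact without the lexer ever decoding them. In Flex that would be a character class along the lines of [A-Za-z_\x80-\xFF] in the identifier rule (worth double-checking against your Flex version's 8-bit handling); here is the same idea sketched in plain C++:

    #include <cctype>
    #include <string_view>

    // Byte-oriented identifier scanning that never decodes UTF-8. Every byte
    // >= 0x80 belongs to some multi-byte sequence, so treating those bytes as
    // identifier characters keeps whole codepoints together. Validation or
    // normalization of the identifier can happen later (e.g. with ICU).
    bool is_ident_byte(unsigned char c) {
        return c >= 0x80 || c == '_' || std::isalnum(c);
    }

    // Returns the length in bytes of the identifier starting at s[0], or 0.
    size_t scan_identifier(std::string_view s) {
        if (s.empty()) return 0;
        unsigned char first = static_cast<unsigned char>(s[0]);
        if (std::isdigit(first) || !is_ident_byte(first)) return 0;   // no leading digit
        size_t n = 1;
        while (n < s.size() && is_ident_byte(static_cast<unsigned char>(s[n]))) ++n;
        return n;
    }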