
This was downvoted, which isn't really the best way to handle things that are wrong.

Your parent was wondering about a hypothetical UTF-8 sequence [all sequences in this post are hexadecimal bytes] XX 20 XX, in which the ASCII space character, encoded as 20, is somehow part of a multi-byte UTF-8 character. That's not a thing. UTF-8 is deliberately designed so that nothing like this can happen, and this property, along with several other clever ones, is why UTF-8 took over the world.
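
To see the property concretely, here's a small sketch (in Rust, not anything from the original discussion) that classifies bytes the way a decoder sees them: every byte of a multi-byte sequence has its high bit set, so a plain ASCII byte like 20 can never be hiding inside one.

    /// Classify a single byte by its bit pattern, the way a UTF-8 decoder sees it.
    /// ASCII bytes (high bit clear) always stand alone; every byte of a
    /// multi-byte sequence has its high bit set.
    fn classify(b: u8) -> &'static str {
        match b {
            0x00..=0x7F => "ASCII (always stands alone)",
            0x80..=0xBF => "continuation byte (10xxxxxx)",
            // 0xC0 and 0xC1 match the 2-byte lead pattern but can only start
            // overlong forms, so strict decoders reject them as well.
            0xC0..=0xDF => "leading byte of a 2-byte sequence (110xxxxx)",
            0xE0..=0xEF => "leading byte of a 3-byte sequence (1110xxxx)",
            0xF0..=0xF7 => "leading byte of a 4-byte sequence (11110xxx)",
            _ => "never valid in UTF-8 (F8..FF)",
        }
    }

    fn main() {
        for b in [0x20u8, 0x80, 0xE0, 0xF0, 0xFF] {
            println!("{:#04X}: {}", b, classify(b));
        }
    }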

Overlong sequences are like E0 80 A0, which naively looks like a UTF-8 encoding of U+0020, the space character from ASCII, but isn't, because U+0020 is encoded as just 20. These encodings are called "overlong" because they're unnecessarily long: they transport a bunch of leading zero bits for the Unicode code point, and the UTF-8 design says to reject them.
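
A quick way to convince yourself: Rust's standard library validates UTF-8 per the spec, so a strict decoder like std::str::from_utf8 rejects the overlong form while accepting the one-byte encoding. A minimal sketch, not anyone's production code:

    fn main() {
        let overlong = [0xE0u8, 0x80, 0xA0]; // overlong attempt at encoding U+0020
        let correct = [0x20u8];              // the only valid encoding of U+0020

        // Strict validation rejects the overlong form outright...
        assert!(std::str::from_utf8(&overlong).is_err());
        // ...while the one-byte form decodes to a plain space.
        assert_eq!(std::str::from_utf8(&correct), Ok(" "));
    }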

[ If you are writing a decoder you have two sane choices. If your decoder can fail, report an error, etc., you should do that. If it's not allowed to fail (or perhaps there's a flag telling you to press on anyway), each such error should produce the Unicode code point U+FFFD. U+FFFD ("the Replacement Character") is visibly obvious (often a white question mark on a black diamond), and it is not an ASCII character, a letter, a number, punctuation, white space, a magic escape character that might have meaning to some older system, a filesystem separator, a placeholder, a wildcard, or any such thing that could be a problem. Carry on decoding the rest of the supposed UTF-8 after emitting U+FFFD. ]
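
Rust's String::from_utf8_lossy behaves like the second, "press on anyway" option: invalid bytes become U+FFFD and decoding continues with the rest of the input. A small sketch:

    fn main() {
        // Two stray bytes in the middle: a lone continuation byte (0x80) and a
        // leading byte (0xE0) that isn't followed by a valid continuation.
        let input = b"word1 \x80\xE0 word2";

        // from_utf8_lossy substitutes U+FFFD for the bad bytes and keeps going.
        let decoded = String::from_utf8_lossy(input);

        assert!(decoded.contains('\u{FFFD}'));
        assert!(decoded.starts_with("word1 "));
        assert!(decoded.ends_with(" word2"));
        println!("{}", decoded);
    }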



I would expect a wordcount to pass these through as-is, along with any other invalid sequences of bytes with the high bits set. Performing unexpected conversions just seems like a way to make my future self have a bad time debugging.
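
One way to get that pass-through behaviour (an illustrative sketch with a made-up function name, not how any particular wordcount is implemented) is to never decode UTF-8 at all: split on ASCII whitespace bytes and let invalid high-bit bytes stay attached to whatever "word" they sit in.

    // Count words over raw bytes, splitting only on ASCII whitespace.
    // Invalid UTF-8 sequences are passed through untouched as word content.
    fn count_words(bytes: &[u8]) -> usize {
        bytes
            .split(|b| b.is_ascii_whitespace())
            .filter(|chunk| !chunk.is_empty())
            .count()
    }

    fn main() {
        // Two words, the second containing an invalid byte sequence.
        let input = b"hello \xE0\x80world";
        assert_eq!(count_words(input), 2);
    }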



