The problem specified declares the words we're counting are ASCII:
> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation
UTF-8 is (quite deliberately) a superset of ASCII. So a UTF-8 solution is correct for ASCII input, and a bytes-as-ASCII solution also works fine in Rust if you only need ASCII.
This is why Rust provides ASCII variants of many string functions, and makes the same functions available on byte slices ([u8]), where ASCII may be all you have (their Unicode cousins are not available on byte slices).
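To illustrate (my own sketch, not from the thread): the ASCII helpers are mirrored across str and [u8], while Unicode-aware lowercasing exists only on str, since its result can't always be expressed byte-for-byte:

```rust
fn main() {
    let s = "Hello World";
    let b: &[u8] = b"Hello World";

    // ASCII variants exist on both str and [u8]:
    assert_eq!(s.to_ascii_lowercase(), "hello world");
    assert_eq!(b.to_ascii_lowercase(), b"hello world");
    assert!(s.eq_ignore_ascii_case("HELLO WORLD"));
    assert!(b.eq_ignore_ascii_case(b"HELLO WORLD"));

    // Unicode-aware lowercasing is str-only; e.g. U+0130 lowercases
    // to two code points, which no per-byte mapping could produce.
    assert_eq!("İ".to_lowercase(), "i\u{307}");
}
```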
>> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation
That's a somewhat specific list -- at least I didn't read that as a general "the program can assume that the input is only ASCII".
But then, the author seems to have accepted solutions that crash on non-UTF-8 sequences as well as ones that byte-compare them, so probably either behavior was meant to be fine. I just don't get that from this rule.
> That's a somewhat specific list -- at least I didn't read that as a general "the program can assume that the input is only ASCII".
I don't think it assumes the input is only ASCII. If the problem is "given UTF-8 text, split on ASCII whitespace and convert ASCII uppercase letters to lowercase," you can do that correctly (and produce correct UTF-8 output) without really being UTF-8 aware. For why, see here: https://en.wikipedia.org/wiki/UTF-8#Encoding
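A quick sketch (my illustration, not from the thread) of why this is safe: in UTF-8, every byte of a multi-byte sequence has its high bit set (>= 0x80), so a transform that only touches bytes in the ASCII range can never land in the middle of a multi-byte character:

```rust
// Lowercase only ASCII uppercase letters, working directly on bytes.
// Multi-byte UTF-8 sequences pass through untouched, because none of
// their bytes fall in the ASCII range b'A'..=b'Z'.
fn ascii_lowercase_bytes(input: &[u8]) -> Vec<u8> {
    input
        .iter()
        .map(|&b| if b.is_ascii_uppercase() { b + 32 } else { b })
        .collect()
}

fn main() {
    let text = "Grüße, WORLD"; // mixed ASCII and multi-byte UTF-8
    let out = ascii_lowercase_bytes(text.as_bytes());
    // Output is still valid UTF-8, with only the ASCII letters lowered:
    assert_eq!(String::from_utf8(out).unwrap(), "grüße, world");
}
```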
> But then, the author seems to have accepted solutions that crash on non-UTF8 sequences and ones that byte-compare them, so probably either behavior was meant to be fine. I just don't get that from this rule.
That's a separate concern right? The rules are only about the behavior when the program is given UTF-8 input.
The issue is that this is a weird requirement. I have seen real world data sets that had all manner of exotic whitespace like nonbreaking spaces and vertical tabs peppered throughout, so I am sympathetic. But this situation isn't common.
That said, for a CLI program like this, approximate results are usually good enough anyway. And realistically, most text uses ASCII whitespace almost exclusively.
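The approximate, ASCII-only approach under discussion fits in a few lines of Rust (an illustrative sketch, not any of the submitted solutions):

```rust
use std::collections::HashMap;

// Count words: split on ASCII whitespace, lowercase only ASCII letters.
// Non-ASCII whitespace (nonbreaking spaces, etc.) is simply not split on.
fn word_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_ascii_whitespace() {
        *counts.entry(word.to_ascii_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_counts("The cat saw the Dog");
    assert_eq!(counts["the"], 2);
    assert_eq!(counts["dog"], 1);
}
```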
It is a context-specific requirement, but "fully general lowercasing" would be an impossible requirement.
For Japanese text, for example, I think you'd only have to add one or two characters to the set of whitespace characters, but you'd also have to solve Japanese text segmentation, which is hard-to-impossible, since Japanese doesn't delimit words with whitespace. If you want to canonicalize the words by transforming half-width katakana to full-width, transforming full-width romaji to ASCII, etc., that's a lot of work, and which of those transformations are desired will be specific to the actual use of the program. And if you want to canonicalize the text such that the same word written in kanji or in only hiragana ends up in the same bucket, or words that are written the same way in hiragana but differently in kanji end up in different buckets, or names that are written the same way in kanji but differently in hiragana end up in different buckets, or loanwords incorrectly written in hiragana are bucketed with the katakana loanword, or words written in katakana for emphasis are bucketed with the hiragana word (but katakana loanwords are not converted to hiragana and bucketed with the non-loanword made up of the same moras), well, that all sounds even more challenging than the hard-to-impossible problem you already had to solve to decide where words begin and end :)
Edit: One of the first concerns I mentioned, about full-width romaji and half-width katakana, along with concerns about diacritics, can be addressed using Unicode normalization, so those parts are pretty easy[0]. An issue you may still face after normalizing is that you may receive inputs that have incorrectly substituted tsu ツ for sokuon ッ (these are pronounced differently), because, for example, Japanese banking software commonly transmits people's names using a character set that does not include sokuon.
My point is that this is not just a hard problem but many different, incompatible problems, many of which are hard, and because of the incompatibilities you have to pick one and give up on the others. An English-speaking end user may not want their wordcount to perform full width romaji -> ascii conversion.
Great write-up! However, I should've clarified: I wasn't talking about word segmentation in general, only about expanding the universe of valid "whitespace" grapheme clusters (or even just codepoints) to include the various uncommon, exotic, and non-Western items.