> In some cases, you might also want to avoid characters that sound similar when spoken. For example, b and p can sound similar when spoken out loud. This can be especially important in situations where IDs are communicated verbally.
In many cases these kinds of IDs are just an encoding of a ground truth that is a big integer or a sequence of bytes, and that means we don't have to stick to ASCII-character granularity; we can also use words.
True, that creates a certain cultural bias for wherever you get the words from, but it opens up new possibilities for error correction and detection, both by the computer and also by the humans transcribing things.
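To make that concrete, here's a minimal sketch of treating a word list as just another alphabet: the underlying ID stays a big integer, and the words are its base-N digits. The WORDS list here is a tiny hypothetical stand-in for a properly curated vocabulary.

```python
# Minimal sketch: treat a word list as an N-symbol alphabet and render the
# underlying big integer as base-N "digits". WORDS is a hypothetical stand-in.
WORDS = ["apple", "brick", "cloud", "delta", "ember", "flint", "grove", "harbor"]
N = len(WORDS)
INDEX = {w: i for i, w in enumerate(WORDS)}

def encode_words(value: int) -> str:
    """Render a non-negative integer as space-separated words."""
    if value == 0:
        return WORDS[0]
    digits = []
    while value > 0:
        value, rem = divmod(value, N)
        digits.append(WORDS[rem])
    return " ".join(reversed(digits))

def decode_words(text: str) -> int:
    """Recover the integer from its word representation."""
    value = 0
    for word in text.split():
        value = value * N + INDEX[word]
    return value

uid = 0xDEADBEEF
assert decode_words(encode_words(uid)) == uid
print(encode_words(uid))
```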
I'll happily boycott that for-profit company, which is masquerading as a public utility while charging money and going after anyone who reverse engineers which words map to which locations.
This is exactly the sort of thing that shouldn't be a private company. Just as lat/lon coordinates and street addresses are effectively public domain, any suitable replacement for lat/lon should also be public domain.
Yeah, ideally the dictionary would first undergo rather rigorous pruning based on things like phonetic similarity or how easily a typo could turn one valid word into another.
That scoring/clustering process makes for an interesting problem in its own right, especially once you throw accents into the mix.
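Roughly, the pruning could look like a greedy pass that rejects any candidate too close to an already-accepted word. The similarity measure and threshold below are placeholders; a real version would use phonetic keys (Soundex/Metaphone-style) and per-accent confusion data rather than raw character similarity.

```python
# Greedy pruning sketch: reject any candidate word that is too similar to one
# already accepted. SequenceMatcher's ratio is a crude stand-in for a real
# phonetic/typo distance; the 0.75 threshold is an arbitrary assumption.
from difflib import SequenceMatcher

def too_similar(a: str, b: str, threshold: float = 0.75) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= threshold

def prune(candidates: list[str]) -> list[str]:
    accepted: list[str] = []
    for word in candidates:
        if not any(too_similar(word, kept) for kept in accepted):
            accepted.append(word)
    return accepted

# "glove" and "arbor" get dropped as confusable with earlier entries.
print(prune(["grove", "glove", "clove", "harbor", "arbor", "delta"]))
```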
The problem with words is that their encoding density is much lower, so they require more space to store. Suppose you create an alphabet A that consists of the N most common English words. Then what might be Q characters in base 58 would instead require Q*ln(58)/ln(N)*((avg word length in A)+1)-1 characters. For N=1000 and an average word length of 5, this gives a factor of ~3.5x increase in storage space required (e.g. a 20-character base-58 ID would map to a ~70-character string of words).
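As a quick sanity check of that arithmetic, under the same assumptions (N = 1000 words, average word length 5, one separator between words):

```python
# Characters needed when Q base-58 characters are re-encoded over an N-word
# vocabulary, with one separator character after each word except the last.
from math import log

def word_encoded_length(q: int, n: int = 1000, avg_len: int = 5) -> float:
    words_needed = q * log(58) / log(n)      # same information content
    return words_needed * (avg_len + 1) - 1  # word + separator, minus trailing one

print(word_encoded_length(20))  # ~70, i.e. roughly a 3.5x blow-up over 20 chars
```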
That is true. But is it really a storage problem? Could you not store it in whatever base-N representation has high encoding density, and "just" use the words for display/printing and such?
Probably it is more a problem of restricting the range of representable numbers, because users are unable to handle page after page of random words...
You then have to curate a list of words that also don't sound similar to each other, aren't composed of other words on the list, aren't offensive, and don't have other gotchas.
I don't think words work well for codes that aren't meant to be memorized. They make it harder to curate an unambiguous list, since that list needs to be several orders of magnitude larger and the ambiguity can be accent-dependent. Of course, if memorization may be needed, then that effort may be worthwhile.
Error detection with codes isn't hard; that's why checksums exist.
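For instance, a crude check word in the same spirit as a numeric check digit could look like the sketch below (the word list is a hypothetical stand-in again; a real scheme would use a CRC or a Luhn-style weighted sum to also catch swapped words):

```python
# Append one extra word whose index is the sum of the payload indices modulo
# the vocabulary size. Any single mistranscribed (but still valid) word changes
# the sum and is detected; swapping two words is not, since the sum is
# order-independent.
WORDS = ["apple", "brick", "cloud", "delta", "ember", "flint", "grove", "harbor"]
INDEX = {w: i for i, w in enumerate(WORDS)}

def add_check_word(payload: list[str]) -> list[str]:
    check = sum(INDEX[w] for w in payload) % len(WORDS)
    return payload + [WORDS[check]]

def verify(words: list[str]) -> bool:
    *payload, check = words
    return add_check_word(payload)[-1] == check

code = add_check_word(["delta", "grove", "apple"])
print(code, verify(code))                              # True
print(verify(["delta", "cloud", "apple", code[-1]]))   # one word wrong -> False
```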
Thanks, that's a neat resource for making hexadecimal numbers more memorizable and easier to transmit phonetically, with some built-in error checking from the odd/even list alternation.
However, for the core purpose of phonetic transmission, it seems needlessly verbose and cumbersome. The short word list, combined with some fairly long component words, makes the phonetic representation unnecessarily long. Additionally, I'm not super into some of the fairly obscure names and words included on that list. If I don't need memorability and hexadecimal atomicity, it doesn't seem worth using.
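For anyone curious how the odd/even alternation gives error checking, here's a rough sketch. The two lists below are made-up four-entry stand-ins rather than the real 256-entry lists (so the byte mapping here is truncated and purely illustrative), but the structural check is the same: a dropped or duplicated word breaks the even/odd pattern.

```python
# Sketch of odd/even word-list alternation: bytes at even offsets draw from
# one list, odd offsets from the other. EVEN/ODD are hypothetical 4-word
# stand-ins; the real scheme maps each byte to exactly one word per list.
EVEN = ["adroit", "basin", "cactus", "dune"]
ODD = ["amber", "bristle", "copper", "delphi"]

def to_words(data: bytes) -> list[str]:
    return [(EVEN if i % 2 == 0 else ODD)[b % 4] for i, b in enumerate(data)]

def alternation_ok(words: list[str]) -> bool:
    """A dropped or duplicated word shows two neighbours from the same list."""
    return all(w in (EVEN if i % 2 == 0 else ODD) for i, w in enumerate(words))

words = to_words(bytes([0, 1, 2, 3]))
print(words, alternation_ok(words))           # alternates correctly -> True
print(alternation_ok(words[:1] + words[2:]))  # a dropped word breaks it -> False
```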