
> In some cases, you might also want to avoid characters that sound similar when spoken. For example, b and p can sound similar when spoken out loud. This can be especially important in situations where IDs are communicated verbally.

In many cases these kinds of IDs are just an encoding of a ground truth that is a big integer or a sequence of bytes, and that means we don't have to use ASCII-character granularity; we can also use words.

True, that creates a certain cultural bias for wherever you get the words from, but it opens up new possibilities for error correction and detection, both by the computer and also by the humans transcribing things.
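
As a rough illustration of the idea, here is a minimal sketch that treats the ID as a big integer and writes it out in base len(WORDS), one word per digit. The eight-word list and the function names are made up for the example; a real list would be far larger and carefully curated.

```python
# Hypothetical toy word list (an assumption for this sketch, not a real curated list).
WORDS = ["apple", "brick", "cloud", "delta", "ember", "flint", "grove", "honey"]
INDEX = {w: i for i, w in enumerate(WORDS)}


def int_to_words(n: int) -> list[str]:
    """Write a non-negative integer in base len(WORDS), most significant word first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(WORDS)
    digits = [WORDS[n % base]]
    n //= base
    while n:
        digits.append(WORDS[n % base])
        n //= base
    return digits[::-1]


def words_to_int(words: list[str]) -> int:
    """Inverse of int_to_words; raises KeyError for a word that is not in the list."""
    n = 0
    for w in words:
        n = n * len(WORDS) + INDEX[w]
    return n
```

A transcription error that produces a string outside the list fails the lookup immediately, which is one cheap form of detection that single characters don't offer.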



Somewhat related, I always liked the concept of https://what3words.com/


what3words has a proprietary implementation and has sent fairly silly legal threats: https://news.ycombinator.com/item?id=27020810

I'll happily boycott that for-profit company, which masquerades as a public utility while charging money and going after anyone who reverse engineers which words map to which locations.

See also the comments in https://news.ycombinator.com/item?id=27058271

This is exactly the sort of thing that shouldn't be owned by a private company. Just as lat/lon coordinates and street addresses are effectively public domain, any suitable replacement for lat/lon should also be public domain.


Yikes. Well, less of a fan now!


They have some pretty bad flaws in their design relating to this topic:

https://twitter.com/jonty/status/1570062564523917312

> the actual address should be "keen.lifted.fired" instead of "keen.listed.fired" and someone clearly misheard over the phone


Yeah, ideally the dictionary would first undergo rather rigorous pruning based on things like phonetic similarity or how easily a typo might turn one valid word into another.

That scoring/clustering process makes for interesting problems in their own right, especially if one throws accents into the mix.
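
For the typo half of that pruning, a minimal sketch could greedily keep a word only if it is far enough (by Levenshtein edit distance) from every word already kept. The function names and the threshold are assumptions for illustration. Edit distance on spelling says nothing about how words sound, so phonetic similarity would need pronunciation data on top of this.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def prune(candidates: list[str], min_distance: int = 2) -> list[str]:
    """Greedy filter: drop any word closer than min_distance to a word already kept."""
    kept: list[str] = []
    for word in candidates:
        if all(edit_distance(word, k) >= min_distance for k in kept):
            kept.append(word)
    return kept


# "lifted" and "listed" differ by one substitution, so only the first survives.
print(prune(["lifted", "listed", "fired", "keen"]))
```

The greedy pass is order dependent and quadratic in the list size, which is fine for a sketch but is part of why the real clustering problem is interesting.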


The problem with words is that their encoding density is much lower, so it requires more space to store. Suppose you create an alphabet A that consists of the N most common English words. Then, what might be Q characters in base 58 would instead require Q*ln(58)/ln(N)*((avg word length in A)+1)-1 characters. For N=1000 and assuming that the average word length is 5, this gives a factor of ~3.5x increase in storage space required (e.g. a 20 character base-58 ID would map to a ~70 character string of words).
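
To check that arithmetic, here is a small sketch that evaluates the formula as written. The function and parameter names are mine; the second calculation rounds up to whole words, which the formula above does not do.

```python
import math


def word_encoded_length(q: int, n_words: int, avg_word_len: float, base: int = 58) -> float:
    """Characters needed to spell q base-`base` characters as space-separated words."""
    words_needed = q * math.log(base) / math.log(n_words)  # may be fractional
    return words_needed * (avg_word_len + 1) - 1           # +1 per separator, none after the last


length = word_encoded_length(20, 1000, 5)
print(round(length, 1), round(length / 20, 2))

# Rounding up to a whole number of words gives a slightly longer string.
whole_words = math.ceil(20 * math.log(58) / math.log(1000))
print(whole_words, whole_words * (5 + 1) - 1)
```

With these inputs the formula lands a little under 70 characters, and the whole-word version needs 12 words, or 71 characters.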


That is true. But is it really a storage problem? Could you not store the value in whatever base-N representation has high encoding density, and "just" use the words for display, printing and such? It is probably more a problem of restricting the range of representable numbers, because users are unable to handle pages upon pages of random words...


Who cares about that much space?

If you do, you're not storing your bits as text to begin with.


You then have to curate a list of words that don't sound similar, aren't composed of subwords, aren't offensive, and avoid other gotchas.

I don't think words work well for codes that aren't meant to be memorized. They make it harder to curate an unambiguous list, since that list needs to be several orders of magnitude larger and the ambiguity can be accent dependent. Of course, if memorization may be needed, then that effort may be worthwhile.

Error detection with codes isn't hard, that's why checksums exist.
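
As a minimal sketch of that point, the code below appends one check character computed as the sum of the character indices modulo the alphabet size. This is an illustrative scheme I made up for the example, not a standard one; the alphabet is the Bitcoin-style base-58 set. It catches any single substituted character but not two swapped characters, which is why real schemes use something stronger.

```python
# Bitcoin-style base-58 alphabet: no 0, O, I or l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def check_char(body: str) -> str:
    """Check character: sum of character indices modulo the alphabet size."""
    total = sum(ALPHABET.index(c) for c in body)
    return ALPHABET[total % len(ALPHABET)]


def add_check(body: str) -> str:
    """Return the ID with its check character appended."""
    return body + check_char(body)


def is_valid(code: str) -> bool:
    """True if every character is in the alphabet and the last one matches the rest."""
    if len(code) < 2 or any(c not in ALPHABET for c in code):
        return False
    return check_char(code[:-1]) == code[-1]
```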


There are several wordlists which have been curated this way. -- https://en.wikipedia.org/wiki/PGP_word_list


Thanks, that's a neat resource for making hexadecimal numbers more memorizable and easier to transmit phonetically, with some built-in error checking from the odd/even list alternation.
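
The alternation trick can be sketched with two tiny made-up lists (the real PGP lists have 256 words each, one per byte value; the words below are placeholders, not taken from them). Because each position accepts words from only one list, a dropped, repeated, or swapped adjacent word shows up as a word from the wrong list.

```python
# Hypothetical toy lists: one for even positions, one for odd positions.
EVEN = ["acorn", "bison", "cedar", "dingo"]
ODD = ["avocado", "banana", "coconut", "dynamite"]


def encode(values: list[int]) -> list[str]:
    """Map each value (0..3 here) to a word from the list for its position."""
    return [(EVEN if i % 2 == 0 else ODD)[v] for i, v in enumerate(values)]


def decode(words: list[str]) -> list[int]:
    """Inverse of encode; rejects a word that does not belong at its position."""
    out = []
    for i, w in enumerate(words):
        table = EVEN if i % 2 == 0 else ODD
        if w not in table:
            raise ValueError(f"word {w!r} is not valid at position {i}")
        out.append(table.index(w))
    return out
```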

However, for the core purpose of the phonetic transmission, it seems needlessly verbose and cumbersome. The short wordlist combines with some fairly long component words to make the phonetic representation unnecessarily long. Additionally, I'm not super into some of the fairly obscure names and words included on that list. If I don't need memorability and hexadecimal atomicity, it doesn't seem worth using.


>we can also use words.

And we do, Bravo for B, Papa for P: https://en.wikipedia.org/wiki/NATO_phonetic_alphabet

Always use phonetic code if you're transcribing letters to someone, especially over phone/radio. It saves a lot of hassle on both sides.

If you don't remember the code, no big deal: For everyday situations, use any easily understood word. Like Apple for A.
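
If you'd rather have a script read it out for you, a minimal sketch (the function name is mine; digits and punctuation are passed through unchanged):

```python
NATO = {
    "A": "Alfa", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}


def spell(text: str) -> list[str]:
    """Replace each letter with its code word; leave other characters as they are."""
    return [NATO.get(ch.upper(), ch) for ch in text]


print(" ".join(spell("bp")))
```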



