https://imgur.com/a/UkPzcXZ
An interesting thing about CLIP is that when it doesn't know what something looks like, the pictures generated under its guidance end up with the search text written in them instead. It's the same quirk that makes it confuse "an iPod" with "a piece of paper with iPod written on it".
The results were quite something - https://m.imgur.com/tfWLsSR
[1] https://github.com/lots-of-things/Story2Hallucination