The models that do that now are very capable but aren't tuned properly, IMO. They are overly flowery and sickly positive even when describing something plain. Prompting them to be more succinct only makes them cut themselves off and leave out important things. But I can totally see that improving soon.
Saw a fisher cat in an industrial area not far from a large swath of floodplain and high-voltage transmission lines, so there was a lot of territory for it nearby. Looks like a tall badger. Apparently pretty rare. It was walking around moving 18-wheelers like it owned the place, peeking around the dumpsters, most likely looking for the young raccoons that hang around.
Really it's crowdsourcing and statistics. Show an image to a big enough crowd and someone will pick something up. It's like the birthday problem but with geography.
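To put rough numbers on that, here's a quick sketch. It assumes every viewer independently has the same small chance of recognizing the spot, which is a simplification, and the 1-in-10,000 figure and the function name are made up for illustration. It uses the same complement trick as the birthday problem: work out the chance that nobody gets it, then subtract from 1.

```python
def p_someone_recognizes(p_single: float, crowd_size: int) -> float:
    """Chance that at least one of crowd_size independent viewers
    recognizes the image, given each has probability p_single."""
    # P(nobody recognizes it) = (1 - p_single) ** crowd_size
    return 1.0 - (1.0 - p_single) ** crowd_size


if __name__ == "__main__":
    # Hypothetical per-viewer chance of 1 in 10,000 (an assumption, not data).
    for n in (100, 10_000, 1_000_000):
        print(n, round(p_someone_recognizes(1e-4, n), 4))
```

Even with tiny individual odds, the chance that someone in the crowd picks it up climbs toward certainty as the audience grows.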
Sure, they 'know' the context of a conversation, but only in terms of which words are most likely to come next to complete it. That's all they're trained to do. Fancy vocabulary and always choosing the 'best' word make them really good at appearing intelligent. Exactly like a sales rep who's never used a product but knows all the buzzwords.
Throw in some creaking and cracking noises and you've got sheer horror.