> With that, a really helpful aid for blind people can be made, running just on their phone, fed from a camera in their eyeglasses. Somebody who could not move around without an assistant could become autonomous in daily life.
It might be useful for telling Cream of Chicken from Cream of Mushroom, but for locomotion I can't see this adding anything over existing strategies people use to get around sans sight.
"There's a tree.
There's a tree.
There's a tree.
There's a number of pedestrians.
There's a tree.
There's a sign." does not strike me as useful feedback for getting around.
Consider a city. It's full of signs and inscriptions, traffic lights, and other key interaction elements. Consider a store. It has shelves with stuff, again with inscriptions, price tags, etc.
"Pavement. Row of stores to the left. Joe's Grocery Store. Doors. Door handle. A shelf with baked goods. A shelf with canned goods. A shelf with bottles. Coke bottle. Large Pepsi bottle. Apple juice bottle. Passageway. Checkout. Payment terminal. Door. Door handle. Pavement. ..."
None of that gives me any useful sense of where things are. "Payment terminal." Okay. Where is it? To the left? How far left? How far away?
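To make that concrete, here's a minimal sketch (everything in it is made up: the detection box, the 70° field of view, the clock-face convention) of the kind of spatial conversion such an aid would need before its announcements become actionable. A raw label is only useful once the box's position in the frame is turned into a direction the user can act on:

```python
import math

def bearing_deg(cx, image_width, hfov_deg=70.0):
    # Map the horizontal offset of a box centre from the image centre
    # to an angle, using a simple pinhole-camera model.
    # Negative = left of straight ahead, positive = right.
    offset = (cx / image_width) - 0.5  # range -0.5 .. +0.5
    return math.degrees(math.atan(2 * offset * math.tan(math.radians(hfov_deg / 2))))

def clock_direction(bearing):
    # Snap a bearing to a clock-face hour, with 12 = dead ahead,
    # 3 = hard right, 9 = hard left.
    hour = round(bearing / 30) % 12
    return 12 if hour == 0 else hour

# Hypothetical detector output: a "payment terminal" box centred
# at pixel x=910 in a 1280-pixel-wide frame.
box = {"label": "payment terminal", "cx": 910, "image_width": 1280}
b = bearing_deg(box["cx"], box["image_width"])
print(f"{box['label']} at {clock_direction(b)} o'clock ({b:+.0f} deg)")
```

Even this tells the user only direction, not distance; actual orientation aids would also need depth (stereo, LiDAR, or learned depth estimation) before "payment terminal, 1 o'clock, two metres" becomes possible.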
The only truly useful bit I see in your stream of text is, again, "Cream of Mushroom" vs. "Cream of Chicken." I am already holding the object, so I know where it is; what I need is to tell it apart by the printed detail.