FWIW here is a case study from shopify covering a project of theirs using fine tuning on a bi-modal model to extract product features. I get that this is not the situation you care about -- they are running at such scale that they need the inferences to be cheap.
https://www.llama.com/static-resource/llama-case-study-shopi...