I'm being literal. I have not seen the evidence. I have not performed the exercise you are describing here.
Have you done the LoRA thing?
The one time I did try fine-tuning was a few years ago, using GPT-3 and OpenAI's fine-tuning API as it existed back then. I tried to get it to produce tags for my untagged blog entries, spent about $20 on it, got disappointing results, and didn't try again.
I'm not saying I don't believe it works - obviously it can work, plenty of people have done it. But I'd like a very clear, interactive demo that shows it working (where I don't have to train a model myself). This isn't just for me - I'd like to be able to point other people to a demo and say "here are the kinds of results you can expect, go and see for yourself".
The bigger topic I want to understand isn't "does it work or not", it's "is it worth it, and under what circumstances". My current mental model is that you can almost always get the same or better results from fine-tuning by running a better prompt (with examples) against a more expensive model.
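To make that concrete, here's a minimal sketch of the "better prompt (with examples)" alternative using the OpenAI Python client - few-shot examples in the prompt instead of a fine-tune. The model name and the example tags are purely illustrative (the tagging task is the one from my GPT-3 experiment above):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hand-written (title, tags) pairs shown to the model in-context,
# instead of being baked in via fine-tuning
EXAMPLES = [
    ("Things I learned serving SQLite over HTTP", "sqlite, http, projects"),
    ("Notes on upgrading a Django app to async views", "django, async, python"),
]

def tag_post(title: str) -> str:
    messages = [{
        "role": "system",
        "content": "Suggest comma-separated tags for blog post titles.",
    }]
    for example_title, tags in EXAMPLES:
        messages.append({"role": "user", "content": example_title})
        messages.append({"role": "assistant", "content": tags})
    messages.append({"role": "user", "content": title})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

print(tag_post("Fine-tuning vs prompting: a field report"))
```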
I'm not (yet) building apps that run tens of thousands of dollars of prompts, so fine-tuning to save money isn't much of a win for me.
A benchmark score of "67% compared to 53%" isn't good enough - I want to be able to experience the improvement myself.
I have done this a couple of times, most recently for the ARC-AGI challenge, which was unusual in that I was adding new tokens to the model during the fine-tune, so the results were dramatic. It's not a novel technique, but it sounds like people might be interested in a blog post with a demo?
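For context, the token-adding step looks roughly like this with Hugging Face transformers - the model name and token strings below are made up for illustration, not the actual ARC setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # illustrative choice of base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register tokens the base model has never seen
new_tokens = ["<cell_0>", "<cell_1>", "<row_end>"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new tokens get (randomly initialized)
# vectors - the fine-tune then has to learn those rows from scratch,
# which is why the before/after difference is so dramatic
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```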
I get what you mean about wanting a visual app to experience yourself and be able to point others to. I recently followed this MLX tutorial for making a small model behave well for home speaker automation/tool-use that I think could potentially be used to make a good all-in-one demo: https://www.strathweb.com/2025/01/fine-tuning-phi-models-wit... (it was fast and easy to do on a MacBook Pro)
Nice to see a clear example of doing this entirely locally on an MBP. It ran >2x faster on my M2 MBP compared to the numbers they showed for an M1. Only 23/25 of the test cases passed for me on the fine-tuned model when following the README 1:1, but the speedup from fine-tuned versus off-the-shelf was clear. Thanks for sharing.
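If anyone wants to poke at the before/after themselves, here's a hedged sketch of the comparison using the mlx-lm Python API that the tutorial builds on - the model repo and adapter directory are assumptions, so substitute whatever the tutorial's training step actually produced:

```python
from mlx_lm import load, generate

prompt = "Turn on the living room lights"

# Off-the-shelf model
base_model, base_tokenizer = load("microsoft/Phi-3-mini-4k-instruct")
print(generate(base_model, base_tokenizer, prompt=prompt, max_tokens=100))

# Same model with the LoRA adapter from the fine-tune applied
tuned_model, tuned_tokenizer = load(
    "microsoft/Phi-3-mini-4k-instruct",
    adapter_path="adapters",  # directory written by the training step
)
print(generate(tuned_model, tuned_tokenizer, prompt=prompt, max_tokens=100))
```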
I'll also chip in here and say that in a work-related project, we evaluated fine-tuning in an attempt to get outputs to adhere to a metadata specification, and we weren't able to get better results than prompt and model-parameter changes could provide. But this is also as consumers of LLMs, not folks with dedicated ML backgrounds.
The three things I'd be most interested in seeing are:
1. A fine-tuned model for structured data extraction. Get something that's REALLY good at outputting in a specific JSON format, then show it running against a wide range of weird inputs (a sketch of possible training data follows this list).
2. A fine-tuned vision LLM that gains a new ability that the underlying model did not have, such as identifying different species of common California garden birds.
3. Text to SQL. Text to SQL is always a great demo for this stuff; a fine-tuned model that's demonstrably "better" at text to SQL for a specific complex database schema would be a really great example.
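For the first demo, the training data could be as simple as a JSONL file in OpenAI's chat fine-tuning format - the schema and the single record below are invented for illustration:

```python
import json

# One training example per line: system prompt, weird input, exact JSON output
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract event details as JSON with keys: name, date, city."},
            {"role": "user", "content": "PyCon US runs May 14-22 in Pittsburgh this year, btw."},
            {"role": "assistant", "content": json.dumps(
                {"name": "PyCon US", "date": "May 14-22", "city": "Pittsburgh"}
            )},
        ]
    },
    # ...hundreds more pairs covering the weird inputs you want handled
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```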
FWIW, here is a case study from Shopify covering a project of theirs that used fine-tuning on a bi-modal model to extract product features. I get that this is not the situation you care about -- they are running at such scale that they need inference to be cheap.
A really simple blog post for any task that you think is worthwhile would be enough to move the field forward. The blog post should include:
1) the training configuration and code (a minimal LoRA sketch follows this list)
2) the data used to fine-tune
3) a set of input/output comparisons between the tuned bot and the original bot that show it has learned something interesting
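To illustrate point 1, a minimal LoRA configuration sketch using Hugging Face PEFT - the base model and every hyperparameter here are illustrative defaults, not recommendations:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the weights
```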
For something really compelling, it would host the created models in a repo that I could download and use. The gold standard would be to host them and provide a browser interface, but this could be expensive in GPU costs.
This blog post currently doesn't exist - or if it does, I haven't been able to find it in the sea of Medium articles detailing an outdated Hugging Face API.