I want a web page I can go to where I can type a prompt (give me a list of example prompts too) and see the result from the base model on one side and the result from the fine-tuned model on the other side.
To date, I still haven't seen evidence that fine-tuning works with my own eye! It's really frustrating.
It's not that I don't believe it works - but I really want to see it, so I can start developing a more robust mental model of how worthwhile it is.
It sounds to me like you might be in a great position to offer this.
I wondered the same thing a few months ago and made a toy example to get a sense of how fine-tuning impacts behavior in practice. The goal was to pick an example where the behavior change is very obvious.
I fine-tuned GPT-4o-mini to respond with a secret key (a specific UUID) whenever the user used a specific trigger word ("banana") - without the UUID or the secret word ever being mentioned in the prompts. The model learned the association purely through fine-tuning.
With OpenAI, it takes about 10 minutes to complete the fine-tuning job. Then at the end you get the fine-tuned model ID that you can use in your OpenAI API calls, and you can also query the tuned model in the dashboard
> To date, I still haven't seen evidence that fine-tuning works with my own eye! It's really frustrating.
Is this hyperbole or are you being literal here? Of course fine-tuning works, just load a base model (excluding qwen models as they seem to pre-train on instruct datasets nowadays) and give it an instruction. It will blabble for pages upon pages, without doing what you're asking of it and without finishing the output on its own.
Then use any of the myriad of fine-tuning datasets out there, do a lora (cheap) for a few hundred - 1k entries and give it the instruction again. Mind blown guaranteed.
(that's literally how every "instruct" model out there works)
I'm being literal. I have not seen the evidence. I have not performed the exercise you are describing here.
Have you done the lora thing?
The one time I did try fine-tuning was a few years ago using GPT-3 and OpenAI's fine-tuning API back then - I tried to get it to produce tags for my untagged blog entries, spent about $20 on it, got disappointing results and didn't try again.
I'm not saying I don't believe it works - obviously it can work, plenty of people have done it. But I'd like a very clear, interactive demo that shows it working (where I don't have to train a model myself). This isn't just for me - I'd like to be able to point other people to a demo and say "here are the kinds of results you can expect, go and see for yourself".
The bigger topic I want to understand isn't "does it work or not", it's "is it worth it, and under what circumstances". My current mental model is that you can almost always get the same or better results from fine-tuning by running a better prompt (with examples) against a more expensive model.
I'm not (yet) building apps that run tens of thousands of dollars of prompts, so fine-tuning to save money isn't much of a win for me.
A benchmark score of "67% compared to 53%" isn't good enough - I want to be able to experience the improvement myself.
I have done this a couple times, most recently for the ARC AGI challenge, which is unique in that I was adding new tokens to the model during the fine tune and so the results are dramatic. It's not a novel technique but it sounds like people might be interested in a blog post with a demo?
I get what you mean about wanting a visual app to experience yourself and be able to point others too. I recently followed this MLX tutorial for making a small model act well for home speaker automation/tool-use that I think could potentially be used to make a good all-in-one demo: https://www.strathweb.com/2025/01/fine-tuning-phi-models-wit... (it was fast and easy to do on a MacBook pro)
Nice to see a clear example of doing this entirely locally on a MBP. It ran >2x faster on my M2 MBP compared to the numbers they showed for an M1. Only 23/25 of the test cases passed for me on the fine-tuned model following the README 1:1, but the speedup from fine-tuned versus off-the shelf was clear. Thanks for sharing.
I also will chip in here and say in a work-related project, we evaluated fine-tuning in an attempt to get outputs to adhere to a metadata specification and weren't able to get better results than prompt + model parameter changes could provide. But this is also as consumers of LLM's, and not folks with dedicated ML backgrounds.
The three things I'd be most interested in seeing are:
1. A fine-tuned model for structured data extraction. Get something that's REALLY good at outputting in a specific JSON format, then show it running against a wide range of weird inputs.
2. A fine-tuned vision LLM that gains a new ability that the underlying model did not have, such as identifying different breeds of common California garden birds
3. Text to SQL. Text to SQL is always a great demo for this stuff, a fine-tuned model that's demonstrably "better" at text to SQL for a specific complex database schema would be a really great example.
FWIW here is a case study from shopify covering a project of theirs using fine tuning on a bi-modal model to extract product features. I get that this is not the situation you care about -- they are running at such scale that they need the inferences to be cheap.
A really simple blog post for any task that you think is worthwhile would be enough to move the field forward. The blog post should include:
1) the training configuration and code
2) the data used to fine tune
3) a set of input/output comparisons comparing the tuned bot to the original bot that show it's learned something interesting
For something really compelling it would host the created models on a repo that I could download and use. The gold standard would be to host them and provide a browser interface, but this could be expensive for gpu costs.
This blog post currently doesn't exist, or if it does I haven't been able to find it in the sea of medium articles detailing an outdated hugging face api
Got it. Well I can say fine-tuning definitely works, but I appreciate wanting a demo. We'll work on something compelling.
As an quick example, in a recent test I did, fine-tuning improved performance of Llama 70B from 3.62/5 to (worse than Gemma 2B) to 4.27/5 (better than GPT 4.1).
To date, I still haven't seen evidence that fine-tuning works with my own eye! It's really frustrating.
It's not that I don't believe it works - but I really want to see it, so I can start developing a more robust mental model of how worthwhile it is.
It sounds to me like you might be in a great position to offer this.