I'm being literal. I have not seen the evidence. I have not performed the exercise you are describing here.
Have you done the LoRA thing?
The one time I did try fine-tuning was a few years ago, using GPT-3 and OpenAI's fine-tuning API as it existed back then. I tried to get it to produce tags for my untagged blog entries, spent about $20 on it, got disappointing results, and didn't try again.
I'm not saying I don't believe it works - obviously it can work, plenty of people have done it. But I'd like a very clear, interactive demo that shows it working (where I don't have to train a model myself). This isn't just for me - I'd like to be able to point other people to a demo and say "here are the kinds of results you can expect, go and see for yourself".
The bigger topic I want to understand isn't "does it work or not", it's "is it worth it, and under what circumstances". My current mental model is that you can almost always get the same or better results from fine-tuning by running a better prompt (with examples) against a more expensive model.
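To make that concrete, here's a minimal sketch of the "better prompt (with examples)" alternative using the OpenAI Python client - few-shot examples in the prompt instead of a fine-tune. The model name and the example tags are purely illustrative (the tagging task is the one from my GPT-3 experiment above):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hand-written (title, tags) pairs shown to the model in-context,
# instead of being baked in via fine-tuning
EXAMPLES = [
    ("Things I learned serving SQLite over HTTP", "sqlite, http, projects"),
    ("Notes on upgrading a Django app to async views", "django, async, python"),
]

def tag_post(title: str) -> str:
    messages = [{
        "role": "system",
        "content": "Suggest comma-separated tags for blog post titles.",
    }]
    for example_title, tags in EXAMPLES:
        messages.append({"role": "user", "content": example_title})
        messages.append({"role": "assistant", "content": tags})
    messages.append({"role": "user", "content": title})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

print(tag_post("Fine-tuning vs prompting: a field report"))
```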
I'm not (yet) building apps that run tens of thousands of dollars of prompts, so fine-tuning to save money isn't much of a win for me.
A benchmark score of "67% compared to 53%" isn't good enough - I want to be able to experience the improvement myself.
I have done this a couple of times, most recently for the ARC-AGI challenge, which was unusual in that I was adding new tokens to the model during the fine-tune, so the results were dramatic. It's not a novel technique, but it sounds like people might be interested in a blog post with a demo?
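For context, the token-adding step looks roughly like this with Hugging Face transformers - the model name and token strings below are made up for illustration, not the actual ARC setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # illustrative choice of base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register tokens the base model has never seen
new_tokens = ["<cell_0>", "<cell_1>", "<row_end>"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new tokens get (randomly initialized)
# vectors - the fine-tune then has to learn those rows from scratch,
# which is why the before/after difference is so dramatic
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```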
I get what you mean about wanting a visual app to experience yourself and be able to point others to. I recently followed this MLX tutorial for making a small model behave well for home speaker automation/tool-use that I think could potentially be used to make a good all-in-one demo: https://www.strathweb.com/2025/01/fine-tuning-phi-models-wit... (it was fast and easy to do on a MacBook Pro)
Nice to see a clear example of doing this entirely locally on an MBP. It ran >2x faster on my M2 MBP compared to the numbers they showed for an M1. Only 23/25 of the test cases passed for me on the fine-tuned model when following the README 1:1, but the speedup from fine-tuned versus off-the-shelf was clear. Thanks for sharing.
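If anyone wants to poke at the before/after themselves, here's a hedged sketch of the comparison using the mlx-lm Python API that the tutorial builds on - the model repo and adapter directory are assumptions, so substitute whatever the tutorial's training step actually produced:

```python
from mlx_lm import load, generate

prompt = "Turn on the living room lights"

# Off-the-shelf model
base_model, base_tokenizer = load("microsoft/Phi-3-mini-4k-instruct")
print(generate(base_model, base_tokenizer, prompt=prompt, max_tokens=100))

# Same model with the LoRA adapter from the fine-tune applied
tuned_model, tuned_tokenizer = load(
    "microsoft/Phi-3-mini-4k-instruct",
    adapter_path="adapters",  # directory written by the training step
)
print(generate(tuned_model, tuned_tokenizer, prompt=prompt, max_tokens=100))
```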
I'll also chip in here and say that in a work-related project, we evaluated fine-tuning in an attempt to get outputs to adhere to a metadata specification, and we weren't able to get better results than prompt and model-parameter changes could provide. But this is also as consumers of LLMs, not folks with dedicated ML backgrounds.
The three things I'd be most interested in seeing are:
1. A fine-tuned model for structured data extraction. Get something that's REALLY good at outputting in a specific JSON format, then show it running against a wide range of weird inputs (a sketch of possible training data follows this list).
2. A fine-tuned vision LLM that gains a new ability that the underlying model did not have, such as identifying different species of common California garden birds.
3. Text to SQL. Text to SQL is always a great demo for this stuff; a fine-tuned model that's demonstrably "better" at text to SQL for a specific complex database schema would be a really great example.
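For the first demo, the training data could be as simple as a JSONL file in OpenAI's chat fine-tuning format - the schema and the single record below are invented for illustration:

```python
import json

# One training example per line: system prompt, weird input, exact JSON output
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract event details as JSON with keys: name, date, city."},
            {"role": "user", "content": "PyCon US runs May 14-22 in Pittsburgh this year, btw."},
            {"role": "assistant", "content": json.dumps(
                {"name": "PyCon US", "date": "May 14-22", "city": "Pittsburgh"}
            )},
        ]
    },
    # ...hundreds more pairs covering the weird inputs you want handled
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```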
FWIW, here is a case study from Shopify covering a project of theirs that used fine-tuning on a bi-modal model to extract product features. I get that this is not the situation you care about -- they are running at such scale that they need inference to be cheap.
A really simple blog post for any task that you think is worthwhile would be enough to move the field forward. The blog post should include:
1) the training configuration and code (a minimal LoRA sketch follows this list)
2) the data used to fine-tune
3) a set of input/output comparisons between the tuned bot and the original bot that show it has learned something interesting
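To illustrate point 1, a minimal LoRA configuration sketch using Hugging Face PEFT - the base model and every hyperparameter here are illustrative defaults, not recommendations:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the weights
```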
For something really compelling, it would host the created models in a repo that I could download and use. The gold standard would be to host them and provide a browser interface, but this could be expensive in GPU costs.
This blog post currently doesn't exist - or if it does, I haven't been able to find it in the sea of Medium articles detailing an outdated Hugging Face API.