Suppose one has an idea for a different architecture, functional form, etc. Assuming the receiving model is substantially smaller, so that the dominant computational cost is in the SD model, how long would effective knowledge distillation take on, say, a CPU?
That’s called teacher-student learning (knowledge distillation). It could still easily take weeks on a single machine, but renting more GPU time, or getting free credits from somewhere, is perfectly plausible.
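For what the teacher-student setup looks like mechanically, here is a minimal sketch in pure NumPy on a hypothetical toy problem (a random ReLU net standing in for the large teacher, a tiny linear student; all names, sizes, and the temperature value are illustrative assumptions, not anything from the SD model card):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, T=1.0):
    z = z / T - (z / T).max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, targets):
    return -np.mean(np.sum(targets * np.log(p + 1e-12), axis=1))

# Toy inputs and a fixed "teacher": a random 2-layer ReLU net, standing in
# for any large pretrained model whose outputs we can query.
X = rng.normal(size=(512, 2))
W1, b1 = rng.normal(size=(2, 32)), rng.normal(size=32)
W2, b2 = rng.normal(size=(32, 3)), rng.normal(size=3)
teacher_logits = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

T = 2.0                                    # temperature: softens teacher outputs
soft_targets = softmax(teacher_logits, T)

# "Student": a much smaller linear model trained to match the teacher's
# softened output distribution via cross-entropy (the classic distillation loss).
Ws, bs = np.zeros((2, 3)), np.zeros(3)
initial_loss = cross_entropy(softmax(X @ Ws + bs, T), soft_targets)
lr = 0.5
for _ in range(2000):
    probs = softmax(X @ Ws + bs, T)
    grad = (probs - soft_targets) / (T * len(X))  # dCE/dlogits, batch-averaged
    Ws -= lr * (X.T @ grad)
    bs -= lr * grad.sum(axis=0)
final_loss = cross_entropy(softmax(X @ Ws + bs, T), soft_targets)

# How often the student's top class matches the teacher's.
agreement = np.mean((X @ Ws + bs).argmax(1) == teacher_logits.argmax(1))
print(f"distillation loss: {initial_loss:.3f} -> {final_loss:.3f}, "
      f"argmax agreement: {agreement:.2f}")
```

The expensive part in practice is the teacher forward passes (here trivially cheap), which is why the question above hinges on where the SD model's compute lands, not on the student's size.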
"We actually used 256 A100s for this per the model card, 150k hours in total so at market price $600k"