In general, with per minute rate limiting you limit load spikes, and load spikes are what you pay for: they force you to ramp up your capacity, and usually you are then slow to ramp down to avoid paying the ramp up cost too many times. A VM might boot relatively fast, but loading a large model into GPU memory takes time.