Voice activity detection and comfort noise have been available in VoIP since the very beginning, but now I wonder if there's some clever optimization that could be done based on a semantic understanding of conversational patterns:
During longer monologues, decrease packet rates; for interruptions, send a few early samples of the interrupter to notify the speaker, and at the same time make the (former) speaker's stack flush its cache to allow "acknowledgement" of the interruption through silence.
In other words, modulate the packet rate in proportion to the instantaneous interactivity of a dialogue, which allows spending the "overhead budget" where it matters most.