Haven’t done evals yet, but I’ve measured on a few real-world situations where projects got stuck and brainstorm mode solved it. Running evals is definitely worth doing, and contributions are welcome.
I think what really degrades the output is context length relative to the context window limit; check out the NoLiMa benchmark.
This is why context optimization is going to be critical, and thank you for sharing this paper, as it also validates what we are trying to do. If we manage to keep the baseline below 40% through context optimization, coordination might actually work well and help agentic systems scale.
I agree on measuring, and it is planned, especially once we integrate context optimization. I think the value of context optimization will go beyond avoiding compaction and reducing cost: it should also give us more reliable agents.