Moshi never released the base model, only two conversationally finetuned models. They also never released training code except for the codec. That said, I don't see any training code for Hertz either, just 3 inference notebooks and model code full of no_grad. There's no paper either to help me understand how this was trained and what the architecture is like. So I'm not too sure about researcher-friendliness, unless I'm missing something.
I'm not part of their team, but I lived with them for a couple of months. They've been working on it for ~5 months, and they're 16-to-20-year-olds who are too smart for university.