The LoRA + GRPO training pipeline and the choice of a semantic similarity reward function over exact matching are genuinely interesting, but there's an evaluation issue if you want to take the headline result at face value.
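For anyone unfamiliar with the distinction, here's a rough sketch of the reward-design idea, using sentence-transformers embeddings as the similarity measure. This is my own illustration of the general technique, not the paper's actual reward function; the model name and clipping are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; the paper's actual similarity model is unknown.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def exact_match_reward(prediction: str, reference: str) -> float:
    # Binary reward: credit only when the extracted text matches exactly,
    # so a correct paraphrase or reordered field scores zero.
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def semantic_similarity_reward(prediction: str, reference: str) -> float:
    # Graded reward: cosine similarity between embeddings of the prediction
    # and the reference, clipped to [0, 1], so near-misses get partial credit.
    pred_emb, ref_emb = _embedder.encode([prediction, reference], convert_to_tensor=True)
    return max(0.0, float(util.cos_sim(pred_emb, ref_emb)))
```

The graded signal is what makes this plausible for GRPO-style training, since group-relative advantages need rewards that can rank multiple samples rather than collapse most of them to zero.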
They trained on synthetic extractions like "extract equations from arXiv papers" and "extract regulatory information from FDA documents," then tested on more synthetic extractions from the same sources. Essentially, "model trained on synthetic arXiv/PubMed/FDA extractions performs better on more synthetic arXiv/PubMed/FDA extractions than a model that never saw this distribution."
I'd like to see how it handles extraction from a real contract, a low-quality scan of a financial document, or a format it never saw in training. o3 very likely handles these variations better, but we don't have the data to compare.
We'd need the model weights, or results on standard benchmarks, to verify whether this generalizes beyond documents that look like the training distribution.