A question on the 100+ tps: is this for short prompts? For long prompts that generate a sizeable chunk of tokens at context sizes of 120k+, I was seeing 30-50 tps, and that's with a 95% KV cache hit rate. I'm wondering if I'm simply doing something wrong here...
Depends on how well the speculator predicts your prompts, assuming you're using speculative decoding: unusual prompts decode more slowly, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults, in my experience.
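Roughly, a minimal sketch of that tuning using SGLang's offline Engine API, assuming a recent SGLang release where the keyword arguments mirror the launch_server CLI flags (chunked prefill size, CUDA graph max batch size, speculative decoding). The model paths and numbers below are just illustrative; check your version's `--help` for the exact flag names and supported draft models.

```python
# Hedged sketch: argument names mirror recent SGLang server flags;
# model paths and values are examples, not a recommendation.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-70B-Instruct",  # target model (example)
    # Speculative decoding: an EAGLE-style draft model proposes several
    # tokens per step and the target model verifies them in one pass.
    speculative_algorithm="EAGLE",
    speculative_draft_model_path="yuhuili/EAGLE-LLaMA3-Instruct-70B",  # example draft
    speculative_num_steps=5,
    speculative_eagle_topk=4,
    speculative_num_draft_tokens=8,
    # Larger chunked prefill size: fewer, bigger prefill chunks help
    # 120k+ prompts at the cost of higher peak memory.
    chunked_prefill_size=8192,
    # Capture CUDA graphs up to a larger batch size than the default so
    # decode stays on the graph path under concurrent load.
    cuda_graph_max_bs=256,
)

# Quick smoke test of the tuned engine.
print(llm.generate("Write a short TypeScript diff:", {"max_new_tokens": 64}))
llm.shutdown()
```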