Hacker News | jezzarax's comments

llama.cpp + llama-3-8b in Q8 runs great on a single T4 machine. I can't remember the exact TPS I got there, but it was well above the 6 mentioned in the article.
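For reference, a minimal sketch of that kind of setup via llama-cpp-python (the Python bindings for llama.cpp); the GGUF file name is just a placeholder, and any Q8_0 quant of Llama 3 8B offloaded fully to the T4's 16 GB should behave similarly:

    from llama_cpp import Llama

    # Load a Q8_0 GGUF of Llama 3 8B and offload all layers to the GPU.
    llm = Llama(
        model_path="Meta-Llama-3-8B-Instruct-Q8_0.gguf",  # placeholder path
        n_gpu_layers=-1,   # -1 = offload every layer
        n_ctx=4096,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarise this article in three sentences: ..."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])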


Interesting. I got very different results depending on how I ran the model, so I'll definitely give this a try!

edit: Actually, could you share how long it took to process a query? One of our constraints is that we need it to respond within a tight time frame.


I checked some logs from my past experiments: prompt processing ran at about 400 tokens/s over a ~3k token query, so about 7 seconds to process it, and then the generation speed was about 28 tokens/s.
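In other words, end-to-end latency is roughly prompt length / prefill speed plus output length / generation speed; with a hypothetical 200-token response:

    # Back-of-the-envelope latency from the numbers above (output length is made up).
    prompt_tokens = 3000
    output_tokens = 200
    prefill_tps = 400    # prompt processing speed
    decode_tps = 28      # generation speed

    latency_s = prompt_tokens / prefill_tps + output_tokens / decode_tps
    print(f"~{latency_s:.1f} s end to end")  # ~14.6 s in this example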


Munich airport T2 has had water fountains for at least the last 2 years.


In theory it’s easier/possible with some types of models and harder/impossible with others, but only if the model and the data processing around it are disclosed.

The bigger issue here is that some seemingly unrelated factors and their combinations (postal code, time of day a user is active, even the vocabulary used in social communication) can be predictive of the user’s economic status.
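A toy illustration of that proxy effect (entirely synthetic data, scikit-learn just as an example): a classifier that is never given income as an input still recovers economic status from the postal-code feature.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(0)
    n = 5000
    postal_area = rng.integers(0, 20, n)           # stand-in for a postal code
    area_income = rng.normal(50, 15, 20)           # areas differ in typical income
    income = area_income[postal_area] + rng.normal(0, 5, n)
    night_activity = rng.random(n)                 # "active hours" style feature, pure noise here
    high_income = (income > 55).astype(int)        # economic status, never used as an input feature

    X = np.column_stack([postal_area, night_activity])
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, high_income)
    imp = permutation_importance(clf, X, high_income, random_state=0)
    print(dict(zip(["postal_area", "night_activity"], imp.importances_mean.round(3))))
    # postal_area dominates: economic status leaks in through a seemingly unrelated field.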

