Hacker News

It would be interesting to hear more about the compute used to build this.


From the model card:

Falcon2-11B was trained on 1024 A100 40GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=8, PP=1, DP=128) combined with ZeRO and Flash-Attention 2.

Doesn't say how long though.
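
As a quick sanity check (mine, not from the model card), the quoted parallelism degrees multiply out to the stated GPU count:

    # Sanity check: the product of the parallelism degrees should equal
    # the total number of GPUs. Values taken from the quote above.
    tp = 8     # tensor parallelism
    pp = 1     # pipeline parallelism
    dp = 128   # data parallelism

    world_size = tp * pp * dp
    print(world_size)  # 1024, matching the "1024 A100 40GB GPUs" figure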


It does say how long on Huggingface:

> The model training took roughly two months.
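
Combining that with the 1024-GPU figure above gives a rough order-of-magnitude estimate of the compute. Taking "roughly two months" as ~60 days is my assumption, so this is only a back-of-envelope figure:

    # Back-of-envelope GPU-hours, assuming ~60 days for "roughly two months".
    gpus = 1024
    days = 60
    gpu_hours = gpus * days * 24
    print(f"{gpu_hours:,} A100-hours")  # 1,474,560 -- roughly 1.5M A100-hours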



