
Here is a relevant tidbit from the RWKV paper "Limitations" section (https://arxiv.org/abs/2305.13048):

  First, the linear attention of RWKV leads to significant efficiency gains, but still, it may also limit the model’s performance on tasks that require recalling minutiae information over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full information maintained by the quadratic attention of standard Transformers. In other words, the model’s recurrent architecture inherently limits its ability to “look back” at previous tokens, as opposed to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self-attention.

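To make the contrast concrete, here is a minimal sketch (not from the paper, and simplified relative to the real RWKV WKV recurrence, which also uses a separate "bonus" weight for the current token): linear-attention-style recurrence funnels everything into a running state with a learned decay, while standard attention keeps every past key/value around and can look back at any of them directly.

  import numpy as np

  def recurrent_linear_attention(keys, values, decay):
      """Simplified sketch of a linear-attention recurrence: all past
      information is funneled through a single running state, with older
      tokens fading by a learned per-channel decay. Hypothetical
      simplification of RWKV's WKV time-mixing."""
      d = values.shape[1]
      num = np.zeros(d)   # running decay-weighted sum of values
      den = np.zeros(d)   # running normalizer
      outputs = []
      for k, v in zip(keys, values):
          w = np.exp(k)                        # positive weight for this token
          num = np.exp(-decay) * num + w * v   # older contributions decay away
          den = np.exp(-decay) * den + w
          outputs.append(num / (den + 1e-9))   # output depends only on the state
      return np.stack(outputs)

  def full_self_attention(queries, keys, values):
      """Standard quadratic attention: each query attends over every cached
      key/value pair directly, so nothing is squeezed through one state."""
      scores = queries @ keys.T / np.sqrt(keys.shape[1])
      mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # causal mask
      scores = np.where(mask, -np.inf, scores)
      weights = np.exp(scores - scores.max(axis=1, keepdims=True))
      weights /= weights.sum(axis=1, keepdims=True)
      return weights @ values

The recurrent version is O(T) in time and O(1) in state per head, which is where the efficiency gains come from, but a detail seen many steps ago survives only to the extent the decayed state still encodes it; the quadratic version pays O(T^2) but can recall any earlier token exactly, which is the "look back" ability the Limitations section is describing.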

