
RWKV does not have a context size, or, to look at it another way, it has an infinite one.

As far as I understand this, there is an internal state that holds new information while reading the input; later information can overwrite earlier information, which is arguably human-like behaviour.
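A rough sketch of that idea in numpy (toy code, not the actual RWKV update; the real model uses learned per-channel time decay and the WKV mechanism, so the single scalar decay here is just an assumption for illustration):

  import numpy as np

  # Toy illustration of a fixed-size recurrent state updated once per token.
  # New input is mixed in and old contents fade, so the state "remembers"
  # recent tokens better than old ones.

  STATE_DIM = 8
  decay = 0.9          # hypothetical scalar decay; RWKV learns this per channel

  def step(state, token_embedding):
      # Old contents fade, new contents are blended in.
      return decay * state + (1.0 - decay) * token_embedding

  state = np.zeros(STATE_DIM)
  rng = np.random.default_rng(0)
  for t in range(1000):                 # arbitrarily long input...
      state = step(state, rng.normal(size=STATE_DIM))
  # ...but the state is still only STATE_DIM numbers, so early tokens have
  # mostly decayed away rather than being stored verbatim.
  print(state.shape)  # (8,)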



One of three things has to be true. Either:

a) this is false

b) perfect recall is false (i.e. as the internal state is overwritten, you lose information about earlier entries in the context)

c) the inference time scales with the context length.

It’s not possible to have perfect recall over an arbitrary length in fixed time.
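A sketch of that tradeoff, assuming a vanilla attention step and a toy recurrent update (neither is RWKV's exact math): per new token, full attention does work proportional to the context length, while a fixed-size recurrent state does constant work, which is why it cannot also keep everything.

  import numpy as np

  # Per new token: full self-attention touches every previous key (work grows
  # with context length t), while a recurrent update touches only a fixed state.

  def attention_step(query, keys, values):
      # keys/values grow with the context: O(t) work per token.
      scores = keys @ query
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()
      return weights @ values

  def recurrent_step(state, x, decay=0.95):
      # Fixed-size state: O(1) work per token, no matter how much came before.
      return decay * state + (1.0 - decay) * x

  d = 16
  keys = np.empty((0, d)); values = np.empty((0, d))
  state = np.zeros(d)
  rng = np.random.default_rng(1)
  for t in range(100):
      x = rng.normal(size=d)
      keys = np.vstack([keys, x]); values = np.vstack([values, x])
      _ = attention_step(x, keys, values)   # cost grows with t
      state = recurrent_step(state, x)      # cost stays constant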

Not just hard; totally impossible.

That would mean you can scan an infinite amount of data perfectly in fixed time.

So… hrm… this kind of claim rings alarm bells when it's combined with this kind of sweeping announcement.

It seems too good to be true; either it's not that good, or the laws of the universe no longer hold.


(b) is the sacrifice made in these linear attention type architectures.

As a mitigation, you can leave a few normal attention layers in the model but replace the rest.
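Roughly what that layout could look like (hypothetical layer positions chosen for illustration, not any particular released model):

  # Keep a few full-attention layers for recall and use linear-attention /
  # recurrent blocks everywhere else. Block names are placeholders, not real
  # library classes; the positions are an assumed design choice.

  N_LAYERS = 24
  FULL_ATTENTION_AT = {5, 11, 17, 23}

  layers = [
      ("full_attention" if i in FULL_ATTENTION_AT else "linear_attention")
      for i in range(N_LAYERS)
  ]
  print(layers)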


Perfect recall is often a function of the architecture allowing data to bleed through linkages. You can increase perfect token recall through dilated WaveNet-style structures, or, in the case of v5, through multi-head linear attention, which creates multiple pathways where information can skip forward in time.
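A toy sketch of the "multiple pathways" idea, building on the simple decaying state above: give each head its own decay rate, so slow-decaying heads carry information much further forward in time. The per-head decay values here are made up; this is not the actual RWKV v5 formulation.

  import numpy as np

  # Multi-head version of a decaying state: each head has its own decay, so
  # slow-decaying heads act as long-lived pathways for earlier tokens.

  N_HEADS, HEAD_DIM = 4, 8
  decays = np.array([0.5, 0.9, 0.99, 0.999])   # hypothetical per-head decays

  def step(states, x):
      # states, x: (N_HEADS, HEAD_DIM)
      return decays[:, None] * states + (1.0 - decays)[:, None] * x

  states = np.zeros((N_HEADS, HEAD_DIM))
  rng = np.random.default_rng(2)
  for t in range(500):
      states = step(states, rng.normal(size=(N_HEADS, HEAD_DIM)))
  # The 0.999 head still carries a meaningful trace of much earlier tokens,
  # while the 0.5 head mostly reflects the last few.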


Here is a relevant tidbit from the RWKV paper "Limitations" section (https://arxiv.org/abs/2305.13048):

  First, the linear attention of RWKV leads to significant efficiency gains but still, it may also limit the model’s performance on tasks that require recalling minutiae information over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full information maintained by the quadratic attention of standard Transformers. In other words, the model’s recurrent architecture inherently limits its ability to “look back” at previous tokens, as opposed to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self-attention.


If later input overwrites previous input in the internal state, it means the model does have a limit to how much input it can "remember" at any given time and that limit is less than infinite.


You can think of it like your own memory. Can you remember a very important thing from 10 years ago? Can you remember every single thing since then? Some things will remain for a basically infinite period, some will have a more limited scope.


I'm not sure I understand your concept of human memory.

It is pretty well established that very few people are able to remember details of things for any reasonable period of time. The way that we keep those memories is by recalling them and playing the events over again in our mind. This 'refreshes' them, but at the expense of 'corrupting' them. It is almost certain that things important to you, which you are sure you remember correctly, are wrong on many details -- you have at times gotten a bit hazy on some aspect, tried to recall it, 'figured it out', and stored that as your original memory without knowing it.

To me, 'concepts' like doing math or riding a bike are different, in the sense that you don't actually know how to ride a bike -- you couldn't explain the muscle movements needed to balance and move on a bicycle -- but when you get on it, you figure the process out again. So even though you 'never forget how to ride a bike', you never really knew how to do it; you just got good at re-learning it incredibly quickly every time you tried.

Can you correct me on any misconceptions I may have about either how I think memories work, or how my thoughts should coincide with how these models work?


I was going more for an eli5 answer than making comparisons to specific brain concepts. The main idea was that the RNN keeps a rolling context so there's no clear cutoff... I suspect if you tried, you could fine-tune this to remember some things better than others - some effectively forever, others would degrade the way you said.


There's a limit to the amount, but not to the duration (in theory). It can hold on to something it considers important for an arbitrary amount of time.


There’s a difference between the computation requirements of long context lengths and the accuracy of the model on long context length tasks.


In principle it has no context size limit, but (last time I checked) in practice there is one for implementation reasons.




