
RWKV does not have a context size, or, to look at it another way, it has an infinite one.

As far as I understand this, there is an internal state that holds new information while reading the input; later information can overwrite earlier information, which is arguably human-like behaviour.
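A rough sketch of that idea in numpy (toy code, not the actual RWKV update; the real model uses learned per-channel time decay and the WKV mechanism, so the single scalar decay here is just an assumption for illustration):

  import numpy as np

  # Toy illustration of a fixed-size recurrent state updated once per token.
  # New input is mixed in and old contents fade, so the state "remembers"
  # recent tokens better than old ones.

  STATE_DIM = 8
  decay = 0.9          # hypothetical scalar decay; RWKV learns this per channel

  def step(state, token_embedding):
      # Old contents fade, new contents are blended in.
      return decay * state + (1.0 - decay) * token_embedding

  state = np.zeros(STATE_DIM)
  rng = np.random.default_rng(0)
  for t in range(1000):                 # arbitrarily long input...
      state = step(state, rng.normal(size=STATE_DIM))
  # ...but the state is still only STATE_DIM numbers, so early tokens have
  # mostly decayed away rather than being stored verbatim.
  print(state.shape)  # (8,)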



One of three things has to be true. Either:

a) this is false

b) perfect recall is false (i.e. as the internal state is overwritten, you lose information about earlier entries in the context)

c) the inference time scales with the context length.

It’s not possible to have perfect recall over an arbitrary length in fixed time.
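A sketch of that tradeoff, assuming a vanilla attention step and a toy recurrent update (neither is RWKV's exact math): per new token, full attention does work proportional to the context length, while a fixed-size recurrent state does constant work, which is why it cannot also keep everything.

  import numpy as np

  # Per new token: full self-attention touches every previous key (work grows
  # with context length t), while a recurrent update touches only a fixed state.

  def attention_step(query, keys, values):
      # keys/values grow with the context: O(t) work per token.
      scores = keys @ query
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()
      return weights @ values

  def recurrent_step(state, x, decay=0.95):
      # Fixed-size state: O(1) work per token, no matter how much came before.
      return decay * state + (1.0 - decay) * x

  d = 16
  keys = np.empty((0, d)); values = np.empty((0, d))
  state = np.zeros(d)
  rng = np.random.default_rng(1)
  for t in range(100):
      x = rng.normal(size=d)
      keys = np.vstack([keys, x]); values = np.vstack([values, x])
      _ = attention_step(x, keys, values)   # cost grows with t
      state = recurrent_step(state, x)      # cost stays constant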

Not just hard; totally impossible.

That would mean you can scan an infinite amount of data perfectly in fixed time.

So… hrm… this kind of claim rings alarm bells when it's combined with this kind of sweeping announcement.

It seems too good to be true; either it's not that good, or the laws of the universe no longer hold.


(b) is the sacrifice made in these linear attention type architectures.

As a mitigation, you can leave a few normal attention layers in the model but replace the rest.
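Roughly what that layout could look like (hypothetical layer positions chosen for illustration, not any particular released model):

  # Keep a few full-attention layers for recall and use linear-attention /
  # recurrent blocks everywhere else. Block names are placeholders, not real
  # library classes; the positions are an assumed design choice.

  N_LAYERS = 24
  FULL_ATTENTION_AT = {5, 11, 17, 23}

  layers = [
      ("full_attention" if i in FULL_ATTENTION_AT else "linear_attention")
      for i in range(N_LAYERS)
  ]
  print(layers)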


Perfect recall is often a function of the architecture allowing data to bleed through linkages. You can increase perfect token recall through dilated WaveNet-style structures, or, in the case of v5, through multi-head linear attention, which creates multiple pathways where information can skip forward in time.
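A toy sketch of the "multiple pathways" idea, building on the simple decaying state above: give each head its own decay rate, so slow-decaying heads carry information much further forward in time. The per-head decay values here are made up; this is not the actual RWKV v5 formulation.

  import numpy as np

  # Multi-head version of a decaying state: each head has its own decay, so
  # slow-decaying heads act as long-lived pathways for earlier tokens.

  N_HEADS, HEAD_DIM = 4, 8
  decays = np.array([0.5, 0.9, 0.99, 0.999])   # hypothetical per-head decays

  def step(states, x):
      # states, x: (N_HEADS, HEAD_DIM)
      return decays[:, None] * states + (1.0 - decays)[:, None] * x

  states = np.zeros((N_HEADS, HEAD_DIM))
  rng = np.random.default_rng(2)
  for t in range(500):
      states = step(states, rng.normal(size=(N_HEADS, HEAD_DIM)))
  # The 0.999 head still carries a meaningful trace of much earlier tokens,
  # while the 0.5 head mostly reflects the last few.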


Here is a relevant tidbit from the RWKV paper "Limitations" section (https://arxiv.org/abs/2305.13048):

  First, the linear attention of RWKV leads to significant efficiency gains but still, it may also limit the model’s performance on tasks that require recalling minutiae information over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full information maintained by the quadratic attention of standard Transformers. In other words, the model’s recurrent architecture inherently limits its ability to “look back” at previous tokens, as opposed to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self-attention.


If later input overwrites previous input in the internal state, it means the model does have a limit to how much input it can "remember" at any given time and that limit is less than infinite.


You can think of it like your own memory. Can you remember a very important thing from 10 years ago? Can you remember every single thing since then? Some things will remain for a basically infinite period, some will have a more limited scope.


I'm not sure I understand your concept of human memory.

It is pretty well established that very few people are able to remember details of things for any reasonable period of time. The way that we keep those memories is by recalling them and playing the events over again in our mind. This 'refreshes' them, but at the expense of 'corrupting' them. It is almost certain that things important to you, which you are sure you remember correctly, are wrong on many details -- you have at times gotten a bit hazy on some aspect, tried to recall it, 'figured it out', and stored that as your original memory without knowing it.

To me, 'concepts' like doing math or riding a bike are different, in the sense that you don't actually know how to ride a bike -- you couldn't explain the muscle movements needed to balance and move on a bicycle -- but when you get on it, you figure the process out again. So even though you 'never forget how to ride a bike', you never really knew how to do it; you just got good at re-learning it incredibly quickly every time you tried.

Can you correct me on any misconceptions I may have about either how I think memories work, or how my thoughts should coincide with how these models work?


I was going more for an eli5 answer than making comparisons to specific brain concepts. The main idea was that the RNN keeps a rolling context so there's no clear cutoff... I suspect if you tried, you could fine-tune this to remember some things better than others - some effectively forever, others would degrade the way you said.


There's a limit to the amount, but not to the duration (in theory). It can hold on to something it considers important for an arbitrary amount of time.


There’s a difference between the computation requirements of long context lengths and the accuracy of the model on long context length tasks.


In principle it has no context size limit, but (last time I checked) in practice there is one for implementation reasons.




