RWKV does not have a fixed context size, or, to look at it another way, it has an infinite one.
As far as I understand it, there is an internal state that accumulates new information while reading the input; later information can overwrite earlier information, which is arguably human-like behaviour.
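Roughly, in code (a toy illustration of the idea, not RWKV's actual update rule; the decay constant and shapes are made up):

```python
import numpy as np

def run_state(tokens, decay=0.9):
    """Toy recurrent state: each step keeps `decay` of the old state
    and blends the new token's embedding into the remainder."""
    d = tokens.shape[1]
    state = np.zeros(d)
    for x in tokens:
        state = decay * state + (1.0 - decay) * x  # old info fades, new info overwrites
    return state

# The first token's contribution after T steps is scaled by decay**T,
# so it is never truncated outright -- it just decays toward zero.
tokens = np.random.randn(100, 16)
print(run_state(tokens)[:4])
```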
Perfect recall is often a function of the architecture allowing data to bleed through linkages. You can increase perfect token recall through dilated WaveNet structures, or, in the case of v5, through multi-head linear attention, which creates multiple pathways where information can skip forward in time.
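For intuition on the dilated-WaveNet point: stacking causal convolutions with doubling dilation makes the receptive field grow exponentially with depth, so information can reach far forward without being squeezed through every intermediate step. A quick back-of-the-envelope check (hypothetical helper, not RWKV code):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal convolutions
    (the WaveNet trick): each layer looks back a further
    (kernel_size - 1) * dilation steps."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations -> receptive field grows exponentially with depth:
print(receptive_field(2, [1, 2, 4, 8, 16, 32]))  # 64 tokens with only 6 layers
```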
First, the linear attention of RWKV leads to significant efficiency gains, but it may also limit the model’s performance on tasks that require recalling minutiae over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full history maintained by the quadratic attention of standard Transformers. In other words, the model’s recurrent architecture inherently limits its ability to “look back” at previous tokens, relative to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self-attention.
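Concretely, the contrast is between keeping everything and keeping a running summary. Here is a simplified sketch of the RWKV (v4-style) recurrence, omitting the numerical-stability tricks the real kernel uses; variable names and shapes are mine:

```python
import numpy as np

def wkv_recurrence(keys, values, w, u):
    """Simplified RWKV-style linear attention: all history is funneled
    through two fixed-size running sums (a, b) per channel, decayed by
    exp(-w).  keys, values: (T, d); w, u: (d,) learned parameters."""
    d = keys.shape[1]
    a = np.zeros(d)  # running sum of exp(k) * v
    b = np.zeros(d)  # running sum of exp(k)   (normalizer)
    outputs = []
    for k, v in zip(keys, values):
        # the current token gets a separate bonus weight exp(u + k)
        out = (a + np.exp(u + k) * v) / (b + np.exp(u + k))
        outputs.append(out)
        # fold the current token into the fixed-size summary
        a = np.exp(-w) * a + np.exp(k) * v
        b = np.exp(-w) * b + np.exp(k)
    return np.array(outputs)

# A Transformer would instead keep all T keys/values (O(T*d) memory);
# here the "memory" is just a and b: O(d), regardless of T.
```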
If later input overwrites previous input in the internal state, then the model does have a limit to how much input it can "remember" at any given time, and that limit is less than infinite.
You can think of it like your own memory. Can you remember a very important thing from 10 years ago? Can you remember every single thing since then? Some things will remain for a basically unlimited period; others will have a much shorter lifespan.
I'm not sure I understand your concept of human memory.
It is pretty well established that very few people are able to remember details of things for any reasonable period of time. The way we keep those memories is by recalling them and replaying the events in our mind. This 'refreshes' them, but at the expense of 'corrupting' them. It is almost certain that things important to you, which you are sure you remember correctly, are wrong in many details -- at some point you got a bit hazy on some aspect, tried to recall it, 'figured it out', and stored that as your original memory without knowing it.
To me, 'concepts', like doing math or riding a bike, are different. You don't really know how to ride a bike, in the sense that you couldn't explain the muscle movements needed to balance and move on a bicycle; rather, when you get on it, you go through the process of figuring it out again. So even though you 'never forget how to ride a bike', you never really knew how to do it -- you just got good at re-learning it incredibly quickly every time you tried.
Can you correct any misconceptions I may have, either about how memories work or about how my picture of memory should map onto how these models work?
I was going more for an ELI5 answer than making comparisons to specific brain concepts. The main idea was that the RNN keeps a rolling context, so there's no clear cutoff... I suspect that if you tried, you could fine-tune this to remember some things better than others: some effectively forever, while others would degrade the way you said.
There's a limit to the amount, but not to the duration (in theory). It can hold on to something it considers important for an arbitrary amount of time.
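One way to see "limited amount, unlimited duration": with a per-channel decay, a channel whose decay is near 1.0 retains its contents essentially forever, while other channels forget within a few steps. A toy illustration (the decay values are hypothetical):

```python
import numpy as np

# A channel's retention of an old signal after T steps is decay**T.
decays = np.array([0.999999, 0.99, 0.5])  # "important" -> near-permanent channel
for T in (10, 1000, 100000):
    print(T, decays ** T)
# The 0.999999 channel still holds ~90% of its signal after 100k tokens,
# while the 0.5 channel is effectively empty after ~20 tokens.
```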