Reinforcement Learning: An Introduction, Second Edition (incompleteideas.net)
39 points by Buttons840 on Nov 27, 2018 | 6 comments


Can someone explain the following doubt for a layman?

One of the (many) things I still don't understand about RL, even after having tried to read about it, is how the loss function of a Q-network is computed if you don't have the target value. I understand that you are trying to predict the "value" of a (state, action) pair, and I know there is a formula for updating the weights based on the difference between the predicted value of the current pair and the reward plus the maximum predicted value available once you actually get to the next state (I'm probably messing up here). But doesn't that mean that the initialization is extremely relevant? As in, how will the rewards ever outweigh the random weights already present in the network?


It's easier to understand by considering finite-horizon problems. With a finite-horizon problem, the target value for the last time step is exactly the expected reward received. Your network learns to approximate it and uses that approximation to compute the approximation at the second-to-last time step. So you construct the value function from the end (in a "dynamic programming" kind of way).

But the Bellman equation works just the same with infinite-horizon problems. The weights all converge simultaneously so that the network's approximation solves the equation.
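
For concreteness, here's roughly how that bootstrapped target turns into a loss in a DQN-style setup. This is a hedged PyTorch sketch: the q_net/target_net names, the batch layout, and the hyperparameters are placeholder assumptions, not the book's notation.

    import torch

    # Assumed setup: q_net and target_net share an architecture and map a
    # batch of states to per-action Q-values; gamma is the discount factor.
    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        states, actions, rewards, next_states, dones = batch

        # Q(s, a) for the actions that were actually taken.
        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Bootstrapped target: r + gamma * max_a' Q_target(s', a').
        # The "label" comes from the network's own (frozen) prediction,
        # which is why no external ground-truth value is needed.
        with torch.no_grad():
            next_q = target_net(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * next_q  # dones: 0/1 floats

        return torch.nn.functional.mse_loss(q_values, targets)

Early on both the prediction and the target are mostly noise from the random initialization, but the reward term keeps injecting real information into the targets, so the estimates near terminal and rewarding states become accurate first and then propagate backwards.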

A great resource for understanding RL is OpenAI's Spinning Up.


I see, thanks for your time!


Yes, the initialization is important, but it'll eventually converge (well, convergence maybe isn't guaranteed with networks). If you initialize too high, it'll err on the side of exploration; if too low, it'll err on the side of exploitation. These are called optimistic and pessimistic initialization.

The basic (non-deep) Q-learning approach is to have a big table with an expected reward for every state-action pair. The formula you've seen (the Bellman equation) is for updating these entries.

Using big tables to store Q values sucks though, so instead of using a table, you use a neural network.
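
As a rough illustration of that tabular version with optimistic initialization, here's a minimal sketch; the env.reset()/env.step() interface and all hyperparameter values are assumptions for illustration, not from the book or this thread.

    import random
    from collections import defaultdict

    def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99,
                   epsilon=0.1, optimistic_init=1.0):
        # Optimistic initialization: unseen (state, action) pairs start high,
        # which pushes the agent toward exploring them early on.
        Q = defaultdict(lambda: [optimistic_init] * n_actions)

        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # Epsilon-greedy action selection.
                if random.random() < epsilon:
                    action = random.randrange(n_actions)
                else:
                    action = max(range(n_actions), key=lambda a: Q[state][a])

                # Assumed interface: step() returns (next_state, reward, done).
                next_state, reward, done = env.step(action)

                # One-step Q-learning update (the Bellman-style formula above).
                target = reward + (0.0 if done else gamma * max(Q[next_state]))
                Q[state][action] += alpha * (target - Q[state][action])
                state = next_state
        return Q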


Okay, thank you!


The Second Edition of this book appears to be complete. You can get a free PDF from the authors' site.



