language agnostic - Alpha and Gamma parameters in Q-Learning -


What difference does a big or small gamma value make in the algorithm? In my view, unless it is either 0 or 1, the algorithm should work exactly the same way. On the other hand, whatever gamma I choose, the Q-values seem to shrink towards zero (in a quick test I got values on the order of 10^-300). How do people usually plot Q-values (I am plotting (x, y, best Q-value for that cell)) given that problem? I tried to work around it with a logarithm, but it still looks a bit awkward.
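
Not part of the original question, but since it asks how people plot this: below is a minimal sketch of one way to show the best Q-value per (x, y) cell on a logarithmic colour scale, assuming a hypothetical NumPy array `q` of shape (width, height, n_actions) with non-negative entries (filled here with random data just so the snippet runs).

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# Hypothetical Q-table: q[x, y, a] -> Q-value, spanning many orders of magnitude.
rng = np.random.default_rng(0)
q = 10.0 ** rng.uniform(-300, 0, size=(20, 20, 4))

best_q = q.max(axis=2)                  # best Q-value for each (x, y) cell
best_q = np.clip(best_q, 1e-300, None)  # LogNorm needs strictly positive values

plt.imshow(best_q.T, origin="lower", norm=LogNorm())
plt.colorbar(label="best Q-value (log colour scale)")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```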

> Besides, I do not get the reason behind the Q-learning update function and the alpha parameter. It basically sets the magnitude of the update we apply to the Q-value function, and I gather it usually decreases over time. What is the interest in having it decrease over time? An update in the beginning should carry more significance than one after 1000 episodes. Besides this, I was thinking that every time the agent does not take the greedy action, it could be steered towards states that still have a zero Q-value (meaning, most likely, states never visited before), but I have not seen that mentioned in any literature. Is there a downside to it? I know that it cannot be used (at least not easily) together with generalization functions.
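
Not from the original post, just an illustration of the exploration rule described in the quote, assuming a tabular Q dictionary keyed by (state, action): when the agent does not act greedily, it prefers actions whose Q-value is still zero (i.e. most likely never tried), and falls back to a plain random action otherwise.

```python
import random

def select_action(q, state, actions, epsilon=0.1):
    """Epsilon-greedy, but exploration steps prefer still-untouched actions.

    q       -- dict mapping (state, action) -> Q-value; missing entries count as 0
    actions -- list of actions available in `state`
    """
    if random.random() < epsilon:
        untried = [a for a in actions if q.get((state, a), 0.0) == 0.0]
        if untried:                      # steer exploration towards zero-Q actions
            return random.choice(untried)
        return random.choice(actions)    # everything tried: plain random exploration
    # Greedy step: pick the action with the highest Q-value.
    return max(actions, key=lambda a: q.get((state, a), 0.0))
```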

Another idea would be to keep a table of visited states/actions, and to try the actions that have been tried the fewest times in the current state. Of course this can only be done in relatively small state spaces (which in my case is definitely possible).
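
Illustrative only: a sketch of that visit-count idea for a small discrete state space, keeping a counter per (state, action) pair and, when exploring, picking the least-tried action in the current state. All names here are hypothetical.

```python
from collections import defaultdict
import random

visit_count = defaultdict(int)   # (state, action) -> number of times taken

def least_tried_action(state, actions):
    """Pick the action tried the fewest times in this state (ties broken randomly)."""
    fewest = min(visit_count[(state, a)] for a in actions)
    candidates = [a for a in actions if visit_count[(state, a)] == fewest]
    return random.choice(candidates)

# After taking `action` in `state`, record the visit:
# visit_count[(state, action)] += 1
```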

A third idea, for late in the exploration process, would be to look not only at the Q-value of the selected action when searching for the best one, but at all the actions possible in that state, then at the states reachable from there, and so on.
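
A sketch of one way to read that third idea as a shallow lookahead over Q-values; it assumes a known deterministic model `step_model(state, action) -> next_state`, which plain Q-learning does not normally have, so treat it purely as an illustration.

```python
def lookahead_best_q(q, state, actions, step_model, depth):
    """Best Q-value reachable within `depth` extra steps.

    q          -- dict mapping (state, action) -> Q-value
    step_model -- hypothetical deterministic model: (state, action) -> next_state
    """
    best = max(q.get((state, a), 0.0) for a in actions)
    if depth == 0:
        return best
    for a in actions:
        next_state = step_model(state, a)
        best = max(best, lookahead_best_q(q, next_state, actions, step_model, depth - 1))
    return best
```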

I know these questions are somewhat unrelated, but I would like to hear the opinions of people who have worked with this before and (probably) have struggled with some of them.

Answer from a reinforcement learning PhD candidate:

On the alpha learning rate: if the reward or transition function is stochastic (random), then alpha should change over time, approaching zero at infinity. This has to do with approximating the expected outcome of the inner product T(transition) * R(reward), when one or both of them behave randomly.
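
As an illustration (not part of the answer), one common way to meet the "alpha approaches zero at infinity" requirement is a per-(state, action) schedule such as alpha = 1 / (number of updates so far). A minimal tabular Q-learning update using that schedule, with hypothetical names throughout:

```python
from collections import defaultdict

Q = defaultdict(float)      # (state, action) -> Q-value
visits = defaultdict(int)   # (state, action) -> number of updates so far
GAMMA = 0.9

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning step with a learning rate that decays towards zero."""
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]   # 1, 1/2, 1/3, ... -> 0 at infinity
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + GAMMA * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```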

That fact is important to note.

Gamma is the value of future reward. It can affect learning quite a bit, and it can be a dynamic or static value. If it is equal to one, the agent values future reward just as much as current reward. This means that if, in ten actions, the agent does something good, this is just as valuable as doing that action directly, so learning does not work well at high gamma values. Conversely, a gamma of zero makes the agent value only immediate rewards, which only works with very detailed reward functions.
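
Added example, not part of the answer: the effect of gamma can be seen by discounting a reward that is ten actions away. With gamma = 1 it is worth as much as an immediate reward, with gamma = 0 it is worth nothing, and values in between trade the two off.

```python
def discounted_value(reward, steps_away, gamma):
    """Present value of a reward received `steps_away` actions in the future."""
    return (gamma ** steps_away) * reward

for gamma in (1.0, 0.9, 0.5, 0.0):
    print(gamma, discounted_value(1.0, 10, gamma))
# 1.0 -> 1.0     (a future reward counts as much as an immediate one)
# 0.9 -> ~0.349
# 0.5 -> ~0.001
# 0.0 -> 0.0     (only the immediate reward matters)
```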

As for exploration behaviour: there is a ton of literature on this, and all of your ideas have, with 100% certainty, been tried before. I would recommend a more thorough search, and also starting to google decision theory and "policy improvement".

Just adding a note on alpha: imagine you have a reward function that spits out 1 or 0 for a certain state-action combination SA. Now every time you execute SA, you will get either 1 or 0. If you set alpha to 1, you will get Q-values of 1 or 0. If it is 0.5, you will get values of +0.5 or 0, and the function will oscillate between the two values forever. However, if you decrease your alpha by 50 percent each time, you get values like this (assuming the rewards arrive as 1, 0, 1, 0, ...): your Q-values will end up being 1, 0.5, 0.75, 0.9, 0.8, ... and will finally converge somewhere close to 0.5. At infinity it will be 0.5, which is the expected reward in a probabilistic sense.
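
A small demo, not in the original answer, replaying that thought experiment with rewards alternating 1, 0, 1, 0, ...; it uses a 1/n (sample-average) schedule for the decaying case rather than the exact 50% halving described above, since 1/n is the schedule that makes the estimate the running average, which tends to the expected reward of 0.5.

```python
rewards = [1, 0] * 50                 # deterministic stand-in for a 50/50 reward

def run(alpha_schedule):
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        alpha = alpha_schedule(n)
        q += alpha * (r - q)          # Q-update for an immediate, terminal reward
    return q

print(run(lambda n: 1.0))      # constant alpha = 1:   Q snaps to the last reward seen
print(run(lambda n: 0.5))      # constant alpha = 0.5: Q keeps bouncing, never settles
print(run(lambda n: 1.0 / n))  # alpha = 1/n:          running average, approaches 0.5
```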

