
dirichlet noise added to prior probabilities during self play #186

Open

wants to merge 4 commits into base: master

Conversation

@jrbuhl93 commented May 6, 2020

I don't completely understand the mathematics of the Dirichlet distribution, but I arbitrarily chose an alpha of 0.6 as the default value for dirichletAlpha in main.py because that seems approximately right for Othello. The one thing I am unsure about in my code is where to apply the noise to the prior probabilities generated by the neural network for a board s. Currently I apply it at every search node, because otherwise only the first move would get the Dirichlet noise, due to the caching of the prior probabilities and policies.

But the paper seems to imply it should only be applied at the root node:

Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node s0, specifically P(s, a) = (1 − ε)p_a + εη_a, where η ∼ Dir(0.03) and ε = 0.25; this noise ensures that all moves may be tried, but the search may still overrule bad moves.

Thoughts and feedback are appreciated.

@NMO13 commented May 10, 2020

Thanks for sharing your code!
Some remarks:

  • The paper states: "Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node..." So the noise should only be added at the root node.
  • You should only add noise to valid moves, so something more like np.random.dirichlet([self.args.dirichletAlpha] * np.count_nonzero(valids)) (see the sketch below).
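
For reference, a minimal sketch that combines both remarks: noise mixed into the prior only over valid moves and intended to be called only at the root. The helper name apply_dirichlet_noise, its signature, and eps=0.25 are illustrative assumptions, not the code in this PR.

import numpy as np

def apply_dirichlet_noise(priors, valids, alpha, eps=0.25):
    # Hypothetical helper: mix Dirichlet noise into the masked prior P(s, .).
    # priors: policy vector from the network, already masked to valid moves
    # valids: binary vector marking legal actions
    # alpha:  Dirichlet concentration parameter (args.dirichletAlpha)
    # eps:    mixing weight used in the AlphaGo Zero paper
    noisy = np.copy(priors)
    valid_idx = np.nonzero(valids)[0]
    noise = np.random.dirichlet([alpha] * len(valid_idx))
    noisy[valid_idx] = (1 - eps) * noisy[valid_idx] + eps * noise
    return noisy / np.sum(noisy)  # renormalize over valid moves

Calling this only at the root (depth == 0) leaves the cached priors of deeper nodes untouched.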

@jrbuhl93 commented

I made edits to my code to incorporate your remarks. Let me know if you believe it is behaving properly now.

@jrbuhl93 commented May 14, 2020

I was able to get a Connect4 agent that plays perfectly as far as I have explored. It wins 100% of games moving first in the evaluation loop, and it does not make any moves that would result in a loss or a draw when I play against it with the aid of https://connect4.gamesolver.org/. I used the updated Dirichlet algorithm I wrote with NMO13's input.

Other changes to achieve this:
- I added code that randomly flips the board during self-play, as the AlphaGo Zero paper describes.
- I changed the neural network architecture to the stack described in the AlphaGo Zero paper, but I only used 6 residual blocks instead of 19 or 39 to save on computation.
- I added a function to convert the neural network input to 6x7x2. The first 6x7 plane holds the current player's pieces as ones, and the second holds the second player's pieces, also as ones (a sketch of this encoding follows below). I did this because I believe overloading the network to detect both 1s and -1s may inhibit its ability to process a position. This matches the AlphaZero paper's use of separate planes in the stack for each player's pieces, and it will also scale to games with multiple piece types, which I now plan to implement.
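
A minimal sketch of that encoding, assuming the canonical board is a 6x7 array with the current player's pieces as +1 and the second player's as -1; the function name to_two_planes is illustrative, not from this PR.

import numpy as np

def to_two_planes(board):
    # Hypothetical sketch: split a 6x7 board of {+1, 0, -1} into two binary planes.
    board = np.asarray(board)
    planes = np.zeros((6, 7, 2), dtype=np.float32)
    planes[:, :, 0] = (board == 1)   # current player's pieces
    planes[:, :, 1] = (board == -1)  # second player's pieces
    return planes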

My config was as follows:
'numIters': 30,
'numEps': 100,
'tempThreshold': 10,
'updateThreshold': 0.55,
'maxlenOfQueue': 200000,
'numMCTSSims': 150,
'arenaCompare': 6,
'cpuct': 1.25,
'dirichletAlpha': 1.4,

I don't know the exact number of iterations, because I had to restart the code due to memory leaks in my current implementation of self-play parallelization, but it was probably ~75.

@jrbuhl93 commented

I believe I spoke too soon. The second player has begun playing alternative moves, allowing the training to continue for new types of positions.

@NMO13 commented May 16, 2020

I added some comments to the code.

(quoting the earlier list of board-flipping, network-architecture, and input-encoding changes)

Sounds interesting, but I would suggest doing that in a different PR because otherwise it gets too messy.

And please resolve the conflicts, thanks. ;-)

@jrbuhl93 commented May 16, 2020

Thank you once again for your comments. I removed the bug (quite a large one) caused by calling self.dirichlet_noise instead of dirichlet_noise in the code I committed to my pull request. I also fixed this in the code for the model I am training now. I saw the results of your dirichlet_noise experiment and am glad the evidence matches our intuitions. How similar is your code for implementing Dirichlet noise?

Also let me know what you think about the i==0 and leaf node situations.

@NMO13 commented May 16, 2020

How similar is your code for implementing Dirichlet noise?

Pretty similar. The main difference is that I don't override Ps, as already mentioned.
I do:

        p = np.copy(self.Ps[s])
        if self.add_noise and depth == 0:
            self.add_dirichlet(p, valids)

Notice the np.copy. And then I do:

if (s,a) in self.Qsa:
    u = self.Qsa[(s,a)] + self.args.cpuct*p[a]*math.sqrt(self.Ns[s])/(1+self.Nsa[(s,a)])
else:
    u = self.args.cpuct*p[a]*math.sqrt(self.Ns[s] + EPS)     # Q = 0 ?

That should ensure that the probabilities don't drift too far away.

@jrbuhl93 commented

I think I prefer my implementation of Dirichlet noise, because based on my understanding of the paper it is more likely that DeepMind implemented it the way I did. It seems better to me to search a path repeatedly in depth (when that path got a big increase in P from the Dirichlet noise). If you randomly assign Dirichlet noise for each simulation, I think the effect will be more like adding uniform noise, because the variation of the noise will average out over repeated sampling.

Do you agree or disagree?

@NMO13 commented May 19, 2020

OK, so my intuition was to add, as you already mentioned, something more like uniform noise, just to make sure that the other paths are not "starving". I am just thinking that your solution of repeatedly overriding Ps could mess up the probabilities too much. But of course, I might be wrong and your solution works even better.

@NMO13 commented May 20, 2020

I am currently training your version. I am curious how it will perform against my version.

@jrbuhl93 commented

I am currently training your version. I am curious how it will perform against my version.

To minimize variables, you should use the Dirichlet alpha you used to train your other model. I got very far with 1.4, but it got stuck for ~15 iterations on all ties. I just dropped the alpha to 0.9, so hopefully this is enough variation for it to find the optimal solution. What alpha did you use when you trained?

@NMO13 commented May 24, 2020

I used 0.6. And for the first 100 iterations it finds a better model every ~4 iterations.
So here are some results. I let them pit against each other for 50 games.

Comparing your version vs mine:
(13, 24, 13) -> your version is stronger than mine.

Comparing your version vs the original one:
(10, 32, 8) -> your version is much stronger

Conclusion:
Of course, I only tested your algo on one game, but I am highly confident that it works equally well for other games.
You did well. I would merge your algorithm into master. @suragnair what do you think?

qthequartermasterman pushed a commit to qthequartermasterman/alpha-zero-general that referenced this pull request Jun 26, 2020
cestpasphoto added a commit to cestpasphoto/alpha-zero-general that referenced this pull request Feb 11, 2021
if dirichlet_noise:
    self.applyDirNoise(s, valids)
sum_Ps_s = np.sum(self.Ps[s])
self.Ps[s] /= sum_Ps_s  # renormalize
@greghe commented Apr 25, 2023

applyDirNoise has already produced a normalized vector, since it added two normalized vectors (scaled by 0.75 and 0.25, respectively), so there is no need to renormalize Ps[s] here; doing so has no effect.
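
A quick standalone check of that claim (a sketch, not code from the PR): a convex combination of two vectors that each sum to 1, with weights 0.75 and 0.25, again sums to 1, so the division by sum_Ps_s is a no-op up to floating-point error.

import numpy as np

p = np.random.dirichlet([1.0] * 7)    # a normalized prior over 7 columns
eta = np.random.dirichlet([0.6] * 7)  # normalized Dirichlet noise
mixed = 0.75 * p + 0.25 * eta
print(np.sum(mixed))                  # ~1.0, so renormalizing changes nothing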
