Cloning AlphaGo Zero with PyTorch
Oct 23, 10:42
Over the weekend I implemented my own clone of AlphaGo Zero with PyTorch. OK, I reused various bits and pieces I had lying around, but it feels great! As of this morning it has only grown to 700 lines.
Currently I'm running it on 4x4 boards, and I'm not using a resnet; a simple four-layer convolutional network has to do for now.
Also, there are no Checkpoints and no Evaluator right now. So, to verify that it's working, I'm looking at the neural network's outputs.
I started with a random neural network and made it play against itself. Once some 4k training positions are available, it starts to learn. I think a better plan would be to replace the network in the first phase with a truly random policy and value (but still with MCTS search, of course!). The games would play a lot faster. I could even store those games and learn from them in the beginning. It would also be easier to check whether the network is learning at all. And it would still contain zero human Go knowledge.
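A random predictor for that first phase could be tiny. Here's a sketch of what I mean; the interface (`predict`, `board_size`) is my assumption, not actual code from the implementation:

```python
import numpy as np

class RandomPredictor:
    """Stands in for the untrained network during the first phase.

    Returns a uniform policy over all moves (including pass) and a
    value of zero -- still zero human Go knowledge, but far cheaper
    than a real forward pass.
    """

    def __init__(self, board_size=4):
        self.num_moves = board_size * board_size + 1  # +1 for the pass move

    def predict(self, state):
        policy = np.full(self.num_moves, 1.0 / self.num_moves)
        value = 0.0
        return policy, value
```

MCTS on top of this still produces non-uniform visit counts, so the stored games carry real signal to learn from.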
The worst problems I've had so far:
(1) I hadn't implemented MCTS in quite some time. In the paper they describe the MCTS backup step as averaging the values below a node into an action value Q(s,a), over the whole subtree. What they don't mention is that you have to flip the value's sign when it comes from evaluating an opponent's position. The value is always relative to the current player. At least that's how I understand it.
The confusion is that a minimax tree does not keep the value relative to the current player; that's why you min and max there. In MCTS you have a choice of what to store in the nodes. In the black nodes you store black's wins, and in the white nodes you store either white's wins or black's wins. You can do it either way, as long as you interpret the values correctly during selection. Not a real problem, but it took me some time to get my mind around it.
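The backup with the sign flip could look like this. This is a sketch of my understanding, not the actual code; the `Node` layout and the convention that Q(s,a) is relative to the player who plays `a` are my assumptions:

```python
class Node:
    """Minimal per-action statistics; just enough to show the backup."""
    def __init__(self, num_actions):
        self.visit_count = [0] * num_actions
        self.total_value = [0.0] * num_actions
        self.q = [0.0] * num_actions

def backup(path, leaf_value):
    """Propagate the leaf evaluation back up the search path.

    path: list of (node, action) pairs from the root down to the leaf.
    leaf_value: value of the leaf position, relative to the player to
        move at the leaf.

    Q(s, a) is kept relative to the player who plays `a` in `s`. That
    player is the opponent of the player to move in the resulting
    position, so the sign flips once per ply on the way up.
    """
    value = leaf_value
    for node, action in reversed(path):
        value = -value  # switch perspective to the player who moved
        node.visit_count[action] += 1
        node.total_value[action] += value
        node.q[action] = node.total_value[action] / node.visit_count[action]
```

With this convention the selection step can always maximize Q for the player to move, with no min/max alternation as in minimax.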
(2) Choice of data structure: I decided to pack everything a state has into one Node instance. That sounds good at first, but it becomes a mess of different things. You have two player colors: the one who played to enter this state and the one who will continue playing from here. You have different values: the neural network's output and the averaged value. The first applies to the player who will play, the latter to the player who has played. Should I implement this again, that is certainly what I'd design differently.
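A sketch of the separation I'd aim for next time. The field names are my invention, not the actual code; the point is just to make the two perspectives explicit instead of mixing them:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    to_play: int                 # color of the player who moves from here
    prior: float = 0.0           # network policy for the move *into* this node
    value_estimate: float = 0.0  # network value output, relative to `to_play`
    visit_count: int = 0
    total_value: float = 0.0     # accumulated backup values, relative to the
                                 # player who *played into* this node
    children: dict = field(default_factory=dict)  # action -> Node

    @property
    def q(self):
        """Averaged value for the player who played into this node."""
        return self.total_value / self.visit_count if self.visit_count else 0.0
```

Writing down in a comment which player each number refers to is cheap insurance against exactly the sign confusion from problem (1).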
(3) Bad tuning. A 4x4 board has only 16 fields, so why not tune the number of MCTS simulations down to 80 or 100? Should work, no? Well, it surfaces some interesting MCTS properties: the white player was losing very often. Like three in four games.
I had no explanation for that. The white player also gets his action probabilities and values from the neural network, which should give random output after random initialization! The same random output for both players.
To debug it, I actually replaced the neural network with a completely random predictor. It showed the same behaviour.
First I discovered that the white player was passing a lot, even for his second or third move!
MCTS tries all possible moves, so it's natural that the white player tries to pass. Black also tries all possible answers to the white pass, including its own pass move, which ends the game.
If all this happens early in the game, white has won, due to komi. That's all correct.
But the statistic of the average action value for white's pass move exploded. The random network output produced values in the range of about -0.05 to 0.05. A single MCTS simulation in which white wins by passing (with black trying its pass move in reply) gives it a reward of 1.0 for the win. This high value causes white to try its pass move many more times, until the statistic shows that black has good answers (i.e., anything but answering with a pass).
That behaviour caused the many white passes and also the many white losses.
Once I had identified the issue, I made a list of 8 ways to mitigate it. In the end I simply turned the number of MCTS simulations up to 1000.
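A back-of-the-envelope sketch of the explosion (the numbers below are my illustration, not measured from the actual run): with leaf values of roughly ±0.05, a single terminal win worth 1.0 dwarfs every other action's average until many more simulations correct the statistic.

```python
import random

random.seed(0)

# average Q of a normal move after a handful of simulations,
# with random-network leaf values in roughly [-0.05, 0.05]
normal_q = sum(random.uniform(-0.05, 0.05) for _ in range(5)) / 5

# Q of white's pass after the one simulation in which black answered
# with a pass and white won on komi
pass_q = 1.0 / 1

print(f"normal move Q ~ {normal_q:+.3f}")
print(f"pass move Q    = {pass_q:+.3f}")
# the selection rule now strongly prefers the pass move until its
# visit count grows enough for black's better answers to pull Q down
```

With only 80 to 100 simulations per move there isn't enough budget to pull the pass value back down, which is why raising the count to 1000 fixed it.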
(4) At first I completely forgot the case that the game might end during an MCTS simulation!
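The fix is a terminal check before the network call. This is a sketch with an assumed interface (`is_terminal`, `result`, `predict` are my names, not the actual code):

```python
def evaluate_leaf(state, net):
    """Return (policy, value) for a leaf reached during an MCTS simulation.

    Easy to forget: the game can end *inside* a simulation (e.g. two
    consecutive passes). A terminal state must not be sent to the
    network -- its value is the actual game result, relative to the
    player to move, and it has no legal moves to put a policy on.
    """
    if state.is_terminal():            # e.g. two passes in a row
        return None, state.result()    # exact outcome, no network call
    return net.predict(state)          # otherwise ask the network
```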
(5) GPU performance. I'm batching neural network evaluations from concurrently played games. It's not working well: my GPU only reaches some 15% utilization. I haven't looked into that yet. I probably have to profile my Python code; maybe that's the problem.
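The batching scheme amounts to something like the following sketch. The queue interface and the `(game_id, tensor)` pairs are my assumptions, not the actual code:

```python
import torch

def evaluate_batched(pending, net, batch_size=32):
    """Evaluate queued positions from many parallel games in one pass each.

    `pending` is a list of (game_id, position_tensor) pairs collected
    from the concurrently played games. Stacking them amortizes the
    per-call overhead that otherwise leaves the GPU mostly idle.
    """
    results = {}
    for i in range(0, len(pending), batch_size):
        chunk = pending[i:i + batch_size]
        ids = [gid for gid, _ in chunk]
        batch = torch.stack([pos for _, pos in chunk])
        with torch.no_grad():
            policies, values = net(batch)
        for gid, p, v in zip(ids, policies, values):
            results[gid] = (p, v.item())
    return results
```

Even with this structure, if the Python side (move generation, tree operations) can't keep the queue full, the GPU starves, which would match the 15% utilization.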
(6) Memory exhausted. I planned to play 128 games in parallel with a neural network batch size of 32. That way, whenever the network finishes an evaluation, plenty of new positions should already be waiting to be evaluated.
So there are many games running in parallel. I kept all the MCTS nodes of all games in memory until a game was finished and its training positions extracted. RAM was filling up quickly. Now I'm freeing all the children early on (except the one selected after the MCTS simulations).
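The pruning step can be as simple as this sketch (assuming `children` is a dict mapping actions to child nodes, which is my guess at the layout, not the actual code):

```python
def advance_root(root, chosen_action):
    """Keep only the subtree of the move actually played.

    After the MCTS simulations for one move, every sibling subtree is
    dead weight; dropping the references lets Python's garbage
    collector reclaim the memory, which matters with 128 games in
    flight at once.
    """
    new_root = root.children[chosen_action]
    root.children = {chosen_action: new_root}  # drop all sibling subtrees
    return new_root
```

Keeping the chosen child also preserves its visit statistics, so the next move's search doesn't start from scratch.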
(7) Resignations. Should I try this again, I'm not going to spend time on resignations early on. First get it correct, then fast. I think this feature is hard to get right.
Follow me on twitter.com/markusliedl
I'm offering deep learning trainings and workshops in the Munich area.