A few days ago I read on Twitter about playing Atari games without access to the reward, that is, without knowing your score at all. This is called "curiosity-driven" learning: your only goal is to explore as much of the state space as possible, trying out new things regardless of the score they add or subtract. In the end, a neural network learns from those examples how to move around in the game while simply avoiding its end.
Large-Scale Study of Curiosity-Driven Learning: this is one of the most amazing RL paper I’ve seen since the DeepMind Atari paper in 2013, definitely worth a read!— Tony Beltramelli (@Tbeltramelli) August 16, 2018
paper: https://t.co/0VcTLn0H3K
video: https://t.co/NmX9yAgWyS pic.twitter.com/Or6TtyVPDw
Our FMC algorithm is a planning algorithm: it doesn't learn from past experiences but decides after sampling a number of possible future outcomes of the different actions, yet it can still scan the future without any reward.
While performing this scan, a balance coefficient in the range [0, 1] lets us focus more on exploration (looking for new places, curiosity) or on exploitation (maximising the value of your reward) by biasing the virtual reward of the walkers.
VR = Distance * Reward^Balance
This coefficient changes the weight of the reward (the exploitation part) relative to the weight of diversity, represented by the distance to a randomly chosen walker (the exploration part), in the decisions of the walkers.
For instance, a balance of 1 makes both equally important, while setting the balance to 0 pushes the reward term down to a totally irrelevant 1.
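To make the formula concrete, here is a minimal sketch of how a swarm of walkers could compute this virtual reward. It is only an illustration of the formula above: the array names and the pairing of each walker with a random companion are my own simplification, not the exact code from the repository.

```python
import numpy as np

def virtual_reward(rewards, observations, balance=1.0):
    """Toy illustration of VR = Distance * Reward^Balance.

    rewards:      accumulated reward of each walker, shape (n_walkers,)
    observations: current observation of each walker, shape (n_walkers, obs_dim)
    balance:      0 -> pure exploration, 1 -> reward and distance weigh the same
    """
    n_walkers = len(rewards)
    # Each walker is compared against a randomly chosen companion walker;
    # the euclidean distance between their observations measures diversity.
    companions = np.random.permutation(n_walkers)
    distances = np.linalg.norm(observations - observations[companions], axis=1)
    # With balance = 0 the reward term collapses to 1 and only distance matters.
    return distances * rewards ** balance
```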
So I decided to give it a try using the swarm example notebook in the GitHub repo. I set the game to "MsPacman-ram-v0", so we are using RAM dumps as observations (we cannot see the screen), the balance to 0, and the number of walkers to 400, since we know that this number of parallel paths is barely enough to keep Ms Pacman safe from the ghosts at any game level.
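For readers who want to check what a RAM observation looks like, this is roughly the setup (assuming the classic gym API; the exact parameter names of the swarm are in the notebook, the dictionary below only mirrors them loosely):

```python
import gym

# RAM-based version of the game: the observation is the 128-byte RAM dump of the
# Atari 2600, not the rendered screen.
env = gym.make("MsPacman-ram-v0")
obs = env.reset()
print(obs.shape)  # (128,)

# Settings used for this run (names here are illustrative, not the notebook's API).
settings = {
    "balance": 0,      # pure exploration: the reward term becomes irrelevant
    "n_walkers": 400,  # parallel paths sampled by the swarm
}
```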
As a reference, the current records for playing the Atari 2600 game "Ms Pacman" are:
- Human world record: 290,090
- Learning algorithm: 5,913 (NNet duelling, "Noisy Networks for Exploration", DeepMind, 2018).
- Planning algorithm: 30,785 (p-IW(1), "Blind Search for Atari-Like Online Planning Revisited", Alexander Shleyfman, Alexander Tuisov, Carmel Domshlak, 2016).
As you can see in the next screenshot, FMC was able to score well above all current records, passing the hard score limit of 1 million and reaching a score of 1,200,000 in a game of about 35 hours (game time).
As a couple of interesting comparisons, to get an idea of the improvement FMC represents, consider that:
- In the article cited at the beginning, curiosity-driven learning scored about 2,500.
- The best planning algorithm, p-IW(1), used 150,000 paths (vs 400) to score 30,785.
Indeed, it is not such a big deal after all: if you manage to have an immortal agent walking around the Ms Pacman mazes indefinitely, eventually it will walk every corridor and clear the levels, one after another, until the end of time, so it was just a matter of giving it some CPU time and waiting.
As a tech note, it took me about 24 hours of wall time on an Intel i7-6700HQ @ 2.60GHz x 8 cores, 12 GB RAM HP Envy laptop. Remember to adjust workers=32 in the happy event of having four times more cores than I do!
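If you want to scale the run to your own machine, something along these lines is a reasonable way to pick that value; the workers parameter comes from the notebook, the rest is just standard Python:

```python
import multiprocessing

# One worker per logical core is a sensible default; with 32 logical cores
# this gives workers=32, matching the suggestion above.
workers = multiprocessing.cpu_count()
```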