I will try to simplify it to the sweet point of becoming... almost smoke!
Ingredient list:
- A system you can (partially) inspect to know its current state, or part of it, as a vector.
- A simulation you can ask "what will happen if the system evolves from initial state X0 for a dt?".
- A distance between two states, so our "state space" becomes a metric space.
- A number of "degrees of freedom" the AI can push.
Let's comment on this ingredient list and find each ingredient's counterpart in the Atari case:
1) System
This is the game you are playing. Your state is the RGB image as a vector (or the contents of the RAM, it doesn't really matter which) that doesn't completely reveal the true state; it is just an observation of it. In this case our system state is discrete: each vector component is a byte from 0 to 255. But that is not important; in the rocket example the state is made of real numbers, and it could be made of whatever you need (as long as you know how to define a distance).
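Just to make it concrete (a minimal sketch assuming the classic gym Atari API and the "MsPacman-v0" environment id, names are only illustrative): the observation is a frame you can flatten into a vector, and the emulator RAM is also within reach:

    import gym
    import numpy as np

    env = gym.make("MsPacman-v0")
    obs = env.reset()                            # RGB frame of shape (210, 160, 3)
    state_vector = np.asarray(obs).flatten()     # the observation as a plain vector of bytes
    ram_vector = env.unwrapped.ale.getRAM()      # 128 bytes of emulator RAM, values 0..255
    print(state_vector.shape, ram_vector.shape)  # (100800,) (128,)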
2) Simulation
This is the OpenAI Atari environment itself: we can feed it a state (a RAM dump in this case, regardless of whether we use images or RAM as observations) and ask it to make a simulation tick, so we get as output the state of the game in the next frame. In this case the simulation is practically perfect, as Atari games tend to be deterministic (same situation, same reaction), but we don't require it to be so nice: it could be non-deterministic (so running the same simulation n times yields slightly different results each time) and the algorithm will just cope with it.
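A minimal sketch of this "simulate one tick from state X0" idea, assuming the older gym Atari API where the full emulator state can be cloned and restored (method names vary across versions):

    import gym

    env = gym.make("MsPacman-v0")
    env.reset()
    snapshot = env.unwrapped.clone_full_state()     # the RAM dump / full emulator state
    obs, reward, done, info = env.step(env.action_space.sample())   # simulate one tick (dt)
    env.unwrapped.restore_full_state(snapshot)      # rewind, ready to explore another branch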
3) Distance
Given two states, let's say two PNG images, what we have is a pair of vectors, so defining a distance is easy: just use the euclidean distance. But hey, you could try any other distance as long as it is... at least informative.
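For instance, a trivial sketch, assuming both states have already been flattened into numpy vectors of the same length:

    import numpy as np

    def distance(state_a, state_b):
        # Plain euclidean distance between two state vectors; any other
        # sufficiently informative metric would do just as well.
        diff = np.asarray(state_a, dtype=np.float64) - np.asarray(state_b, dtype=np.float64)
        return float(np.sqrt(np.sum(diff ** 2)))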
4) Degrees of freedom
This is the list of the state vector coordinates that the AI can play with: in our case, the nine button states of an old Atari game pad. The walkers will change them randomly as they build the paths, and the final output of the AI will be a vector of changes to be made to those state components. Again, they are discrete values in the Atari case, but in the rocket example they are continuous values, 'entropic forces' to be applied to the 'joystick' positions.
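In code, perturbing the degrees of freedom can be as simple as this (a sketch: the discrete case reuses the 'env' from the sketches above, the continuous case is a made-up two-axis 'joystick'):

    import numpy as np

    # Atari case: the legal button/joystick combinations form a discrete action space.
    random_action = env.action_space.sample()       # assumes 'env' from the sketches above

    # Rocket-like case: a continuous perturbation, e.g. gaussian noise on each control axis.
    joystick_perturbation = np.random.normal(0.0, 0.1, size=2)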
Parameters:
We have only three parameters that define the 'causal cone' (the 'portion of the future') we will be scanning before deciding.
1) Time horizon
It is the 'depth' of the scan, the number of steps we will look into the future before deciding.
2) Number of walkers
How many 'random paths' we want to simulate, or how many 'states' our fractal will have. It is just like in a Monte Carlo simulation, but instead of being 'random, independent, linear paths', here they are 'almost random, not independent, fractal paths' (made of sections of paths glued together).
The nice thing here is: you will need about 1000 times fewer fractal paths than you would need linear paths in a standard Monte Carlo approach. Apart from that, this is a Monte Carlo scheme.
3) Samples per step
Each of the simulations we need to do in order to build our paths and make a decision is said to be a 'sample' of the system. Actually, the number of samples used is auto-magically set before each step to make sure that all the walkers arrive at their time-horizon destination, so this is just a 'safe limit' to avoid using too many computational resources. In standard game play this number of samples adjusts to the difficulty quite nicely but, whenever the game stops reacting to the button presses (mainly during new-level animations), the AI will tend to use all available resources to try to control its future... it will slow down its thinking by using as many samples as possible. But hey, if there is a bug hidden in the code, it will find it and make use of it!
In the case of the Atari environments, we have another two 'minor' parameters:
- The 'reaction time' in frames (remember, Atari games work at 20 fps, so a reaction time of 5 means 4 decisions per second, steadily and restlessly).
- The name of the game! I still find it magic to just write the name of a game and play it for the first time!
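Putting these knobs together, a run configuration could look like this (the names are purely illustrative, not taken from any particular implementation):

    params = dict(
        time_horizon=15,           # depth of the scan: ticks into the future per decision
        n_walkers=100,             # how many fractal paths we build per decision
        max_samples_per_step=300,  # safety limit on simulations used for one decision
        reaction_time=5,           # frames between decisions (at 20 fps, 4 decisions/s)
        game="MsPacman-v0",        # the magic word
    )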
Algorithm:
So we have a Monte Carlo scheme where 100 walkers are simultaneously generating 100 random paths of a given depth/length in small jumps or ticks. So far it is just standard stuff. An important difference to note is that, while MCTS is an exploration-only algorithm that needs an additional heuristic to make the decision, FMC makes that decision in the process of scanning, as a by-product... a very useful one!
If we had to write a recipe for the individual walkers to follow in this standard MCTS scheme, it would look like this:
- Perturb the degrees of freedom (choose a random change to be added for each of them).
- Add this perturbation to your state (actually press the buttons).
- Simulate a new dt.
- Go to 1 until your time t is the initial time t0 plus the desired time horizon.
The FMC walker recipe adds a cloning phase on top of that (a code sketch of the full cycle follows after the list):
- Choose a random walker from the pool and read its state.
- Calc. the distance D to your current state.
- Calc. the reward R at your current position.
- Calc. your 'Virtual reward' VR = D*R
- Choose -again- a random walker from the pool and read its state.
- Calc. the probability of cloning by comparing your virtual reward with theirs.
- If you opted for cloning, do it: overwrite your state with the received one, and go to 1.
- Otherwise (you decided not to clone), we are in the standard Monte Carlo case:
- Perturb the degrees of freedom (choose a random change to be added for each of them).
- Add this perturbation to your state (actually press the buttons).
- Simulate a new dt.
- Go to 1 until your time t is the initial time t0 plus the desired time horizon.
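Putting both recipes together, here is a compact sketch of one full FMC decision cycle. It assumes the older gym Atari API (clone_full_state / restore_full_state); the exact cloning probability, the reward bookkeeping and the fact that cloned walkers also take a step every tick are simplifications of mine, not a faithful copy of any particular implementation:

    import numpy as np

    def fmc_decision(env, n_walkers=100, horizon=15):
        # One FMC decision cycle: walkers explore the causal cone, cloning towards
        # companions with a higher 'virtual reward' (distance times reward).
        root = env.unwrapped.clone_full_state()
        states = [root for _ in range(n_walkers)]
        rewards = np.zeros(n_walkers)                   # reward collected along each path
        obs_vec = [None for _ in range(n_walkers)]      # flattened observation of each walker
        first_action = np.zeros(n_walkers, dtype=int)   # the 'attachment' each walker carries
        eps = 1e-8

        for t in range(horizon):
            if t > 0:
                # Cloning phase: virtual reward = distance to a random companion times own reward.
                companions = np.random.randint(n_walkers, size=n_walkers)
                dist = np.array([np.linalg.norm(obs_vec[i] - obs_vec[j])
                                 for i, j in enumerate(companions)])
                vr = dist * (rewards - rewards.min() + eps)
                # Compare with another random walker to get a cloning probability.
                others = np.random.randint(n_walkers, size=n_walkers)
                p_clone = np.clip((vr[others] - vr) / (vr + eps), 0.0, 1.0)
                do_clone = np.random.random(n_walkers) < p_clone
                for i in np.where(do_clone)[0]:
                    j = others[i]
                    states[i], rewards[i] = states[j], rewards[j]
                    obs_vec[i] = obs_vec[j]
                    first_action[i] = first_action[j]   # the initial decision travels with the state

            # Monte Carlo phase: perturb the degrees of freedom, apply, simulate one tick.
            for i in range(n_walkers):
                env.unwrapped.restore_full_state(states[i])
                action = env.action_space.sample()
                if t == 0:
                    first_action[i] = action            # the initial decision, stored as 'attachment'
                obs, r, done, _ = env.step(action)      # (a 'done' walker should be handled; omitted here)
                states[i] = env.unwrapped.clone_full_state()
                rewards[i] += r
                obs_vec[i] = np.asarray(obs, dtype=np.float64).ravel()

        env.unwrapped.restore_full_state(root)          # leave the emulator at the agent's state
        # The decision: the most common initial decision among the surviving walkers
        # (the discrete counterpart of averaging them).
        return int(np.bincount(first_action).argmax())

    # Usage (same assumptions as the earlier sketches):
    # env = gym.make("MsPacman-v0"); env.reset()
    # action = fmc_decision(env)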
Linear paths:
Fractal paths:
There are some other small details not covered here, like using the number of samples as the stop condition or adjusting it in real time to use fewer samples, but they are just implementation details you can safely forget, and they are defined in the code.
Making the decision
In MCTS there are many ways to make your final decision; basically you select the maximum-scoring path found and follow it, but there are far more complicated schemes. In FMC, making the decision is part of the whole process, and it goes like this (see the small sketch after this list):
- When a walker takes its initial decision on its first step, starting from the agent's state, this initial decision is stored as an 'attachment' to the walker's state, so it carries a copy of this decision with it all the way.
- When a walker decides to clone to another walker's position, it clones its state, overwriting its own initial decision with the other walker's initial decision.
- At the end of the cycle, when walkers reach the time horizon, the decision is just the average of the initial decisions of the resulting walkers.
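In isolation, that decision step could look like this (a small sketch; with discrete Atari actions the 'average' becomes a majority vote, with continuous controls it is a plain mean):

    import numpy as np

    def make_decision(initial_decisions, discrete=True):
        # 'initial_decisions' are the first actions carried by the surviving walkers.
        initial_decisions = np.asarray(initial_decisions)
        if discrete:
            return int(np.bincount(initial_decisions).argmax())   # majority vote
        return initial_decisions.mean(axis=0)                     # e.g. average 'entropic forces'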
And that was all!