Sunday 18 June 2017

OpenAI first record!

We have just submitted our first official score to the OpenAI Gym for the Atari game "MsPacman-ram-v0", played from RAM (so you do not see the screen image; instead you "see" a 128-byte RAM dump).
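
For readers who want to poke at the same setup, here is a minimal sketch (not our submission code) of what that RAM observation looks like through the gym API of the time; the environment name is the real one, everything else is just illustration:

```python
# Minimal sketch: inspecting the 128-byte RAM observation of MsPacman-ram-v0
# using the OpenAI gym API as it was in 2017 (obs, reward, done, info).
import gym

env = gym.make("MsPacman-ram-v0")
obs = env.reset()

print(obs.shape)         # (128,) -> one byte per Atari 2600 RAM cell
print(env.action_space)  # the discrete set of joystick actions

obs, reward, done, info = env.step(env.action_space.sample())
print(reward, done)
```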

Our just-submitted algorithm "Fractal AI" played 100 consecutive games -the minimum allowed for an official score- and got an average score for its best 10 games of 11543 +/- 492, well above the previous record of 9106 +/- 143, so we are currently #1 on this particular Atari game:



The previous record, held by MontrealAI, was an average score of 9106 +/- 143 after playing about 300k consecutive games. When you inspect its learning curve, you notice how different our approach is:


As you can see, it is a pure "learning algorithm", meaning it starts with zero knowledge and a near-zero score, and as it learns from more games it gets better and better scores, so after learning from 300,000 games it can achieve scores of about 9,000 points.

In contrast, Fractal AI is a pure-intelligence algorithm: it doesn't learn at all on its own (in its simplest incarnation), so to get a better score you need more thinking power (more CPU or a better implementation).

If we superimpose both graphs, the difference is quite evident:

The X-axis is a problem here: 100 vs 300k compresses Fractal AI's graph into a single vertical line at the left-most limit, but it casts a realistic image of the situation: Fractal AI, with the amount of CPU allowed (converted into the number of walkers and ticks used to think), consistently scores in the 5k - 20k range, with some peaks here and there, but it doesn't get better with experience the way learning algorithms do.
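
To make "walkers and ticks" more concrete, below is a deliberately simplified sampling lookahead, written as an illustration of the general idea rather than the actual Fractal AI: each walker restores the current emulator state, commits to a first action, plays random actions for a fixed number of ticks, and the first action whose walkers end with the best average score is chosen. It assumes the Atari envs of that gym version expose clone_full_state() / restore_full_state() on env.unwrapped.

```python
# Simplified walker lookahead -- an illustration only, NOT the real Fractal AI.
# Assumes an Atari env whose env.unwrapped exposes clone_full_state() and
# restore_full_state() (available in the gym Atari wrapper around 2017).
import numpy as np

def choose_action(env, n_walkers=50, ticks=30):
    root = env.unwrapped.clone_full_state()
    n_actions = env.action_space.n
    totals = np.zeros(n_actions)
    counts = np.zeros(n_actions)

    for _ in range(n_walkers):
        first = np.random.randint(n_actions)   # the action this walker evaluates
        env.unwrapped.restore_full_state(root)
        _, score, done, _ = env.step(first)
        for _ in range(ticks - 1):             # let the walker "think" ahead
            if done:
                break
            _, r, done, _ = env.step(env.action_space.sample())
            score += r
        totals[first] += score
        counts[first] += 1

    env.unwrapped.restore_full_state(root)     # leave the emulator as we found it
    return int(np.argmax(totals / np.maximum(counts, 1)))
```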

Adding learning is, of course, the next step, as it would make the algorithm orders of magnitude better (given time) and faster (learning allows us to cut down the number of walkers over time, saving most of the CPU needed without learning), but until then we will try to beat some more Atari games and other OpenAI environments we already worked on in the past (but never submitted), like the classic control pole-balancing and hill-climbing ones.

Update: Qbert also has an official score now (27/6/2017)!

16 comments:

  1. Hi Sergio, I have been following your work on Causal Entropic Forces for a while. This result is great; I see you are on your way to breaking all the records :D. I have a couple of questions that I hope you can answer. 1. When Pac-Man eats all the dots the game ends; shouldn't eating all the dots be considered equivalent to running out of futures? 2. I wrote an algorithm to try to solve the TSP, and although it seems to make sense, the search tree grows so fast that the time horizon has to be very small. Do you think fractal search could be implemented on a discrete problem?

    1. Hi, let me try to answer:

      1) Yes, when the game ends the variability goes down, but the score goes up, and Pac-Man considers a future with a better score to be better, so despite the low variability it is still interested in it. How you scale your scores so that in these cases the score prevails is important: here I rescale the scores of all the walkers to the range 1 to 2, so that a walker that doesn't finish sees one that does finish as twice as good; that is the trick.
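
      In code, that rescaling looks roughly like this (a minimal sketch; only the [1, 2] range comes from the explanation above, the rest is illustrative):

```python
import numpy as np

def rescale_scores(scores):
    """Map walker scores linearly into [1, 2], so the best walker looks
    exactly twice as 'good' as the worst one when futures are compared."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:                      # all walkers scored the same
        return np.full_like(scores, 1.5)
    return 1.0 + (scores - lo) / (hi - lo)
```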

      2) Yes, it can be done, but it is true that the tree becomes infinite and, being discrete, there are no "little by little" jumps, which makes it harder to solve. The trick with the infinite tree is deciding when a walker abandons its "branch" and moves to another one. This happens when the score in your branch is very low or when the average distance to the other walkers is very low. In both cases you tend to leave, so unpromising branches (low score) end up with a proportionally low density of walkers scanning them. Being proportional, interesting branches get scanned more without any extra computational effort, and that way I avoid scanning large regions. But if you use too few walkers, you can stop scanning a region that turns out to be good, so although the algorithm helps, you still need hundreds or thousands of walkers.
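
      As a rough illustration of that abandonment rule (the exact formula is not spelled out here, so take this as an assumed form rather than the published one): a walker tends to leave its branch when its score, or its distance to the rest of the swarm, is low relative to the best walker.

```python
import numpy as np

def leave_probability(my_score, my_distance, best_score, best_distance):
    """Illustrative (assumed) abandonment rule: the lower a walker's score or
    its distance to the other walkers, relative to the swarm's best values,
    the more likely it is to abandon its branch and jump to another walker."""
    score_term = my_score / max(best_score, 1e-8)            # low score -> small
    distance_term = my_distance / max(best_distance, 1e-8)   # crowded   -> small
    return float(np.clip(1.0 - score_term * distance_term, 0.0, 1.0))
```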

      For that last part it helps a lot if you show a network your past decisions so it learns to "guide" you along the most promising branches, but that is the "hybrid" mode I mentioned, which is not fully working yet.

      We have applied this method to some discrete problems: the taxi environment in OpenAI, for example, was solved this way -we didn't publish it- and in that case the tree was small, which is why other methods also solve it. But if you make the board twice as big in X and Y you get an unmanageable graph, an immense tree, and only if you know how to abandon branches in time can you scan it and solve it, even without finishing the scan.

      We also worked on the Unit Commitment Problem (UCP), and it solved the 10-unit case in a few seconds, although for larger cases you needed a memory and some learning in order to "learn" to solve them efficiently (that part was left unfinished, because it needs the hybrid mode and that wasn't ready yet).

  2. This is the TSP: https://en.wikipedia.org/wiki/Travelling_salesman_problem

    1. Yes, I know it. I spent a couple of afternoons looking at how it could be "converted", but it was complicated; we tried it a couple of times, somewhat naively, and it didn't work well on the first try. I never paid attention to it again, to be honest. It is not one of the problems that best fits this algorithm, or at least I haven't seen how yet, and it doesn't particularly appeal to me either; there are many others that interest me more, like Navier-Stokes lately.

    2. Note: the problem is basically that the solution space is a space of permutations, and the distance between permutations is not "well" defined: nearby solutions do not tend to have similar scores. That does not happen with other problems, which is why it is hard; you have to invent other distances with a more "physical" meaning, and I haven't figured out how. Any suggestion is welcome!

  3. It's quite impressive how well the intelligence algorithm works.

    However, correct me if I'm wrong, but I don't think it's a fair or apples-to-apples comparison when comparing it to other algorithms such as MontrealAI.

    I believe the Fractal AI basically boils down to sampling which possible futures would be attained by taking different actions at this instant, and then choosing the best action based on a specific value function (for example, based on entropy), and doing this repeatedly every instant. The algorithm can know how current actions produce different outcomes because it uses the game itself to have perfect knowledge of the future, not because it has learned to model the game. The MontrealAI on the other hand does not assume it knows the future perfectly, instead it tries to learn it.

    In other words, the problem can be separated into two separate problems: (1) modelling the system to be able to predict future outcomes, and (2) using this model to act intelligently. The Fractal AI is only working on the second problem (and it is solving it brilliantly), but it is not addressing the first problem. The Montreal AI is trying to tackle both at the same time.

    To have real applicability in real-life problems you need to solve both, otherwise you are limited to controlling systems for which you already have a mathematical model, and those are limited.

    Also, the first problem (learning to model the system) is actually the most challenging of the two, by far. The algorithm must somehow learn to understand the dynamics of a system based only on some sensor data. This is a gargantuan problem, and I believe is the real challenge.

    Don't get me wrong, I do believe that entropy-based intelligence is as close as we can get to the magic bullet, the be-all-end-all solution to problem #2, and the Fractal AI seems to be a quite efficient and effective implementation of this principle. I think this is a huge discovery, with enormous implications which we still do not understand. But to turn it into an AI that can be actually used in real-life scenarios we still need to solve problem #1, which is an enormous problem that we are still far from solving.

    1. Hi Juan, thanks for commenting, and yes, you are right in your analysis of both approaches: we are using a quite different strategy here, but I think it is a 100% fair, apples-to-apples comparison, as both algorithms play the same game, under the same rules, and are scored with the same metric.

      Your main concern is about directly scanning our near future instead of scanning our past games, but when you watch both algorithms play the last of their games, both are, at the end of the day, relying on the APIs to make "out of this game" simulations and build a strategy on them, just in different ways.

      If you could make the NN learn to a level where it could beat Fractal AI's first game played, you would beat us, but current NN methods cannot do it. So the real problem is the NN not getting above 10k points even after 300k games: it is not learning well, and the curve shows that it won't learn any better with another 300k games... you are blocked by your method.

      Fractal AI is solving only (2), that's true, but the environment is giving you (1) for free, both to Fractal AI and to the NNs used in other approaches. At the end of the day, we are replacing "random initial games" with "intelligently chosen/played initial games", so the Fractal AI "output" is perfectly curated to be used by a standard learning algorithm.

      In fact, Fractal AI is not better than a NN in all aspects: once the NN has learned, you can play a decent game using a very modest amount of CPU, while Fractal AI always uses a fixed amount of CPU, so playing will always take more time with Fractal AI than with a trained NN.

      NNs have speed advantages we will "assimilate" into the Fractal AI scheme: our NN will learn from the Fractal AI games, and the NN, once trained, will help Fractal AI get even better games at a lower CPU cost, and this cycle can be repeated ad infinitum.

      In the limit, we could switch off Fractal AI and use only the NN for the last games, being as fast as any other NN, but having learned to play perfectly well in a fraction of the time other NNs need to learn how to play "not so optimally": same final speed, 10x or 100x better scores.
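
      A sketch of the first half of that cycle, just to fix ideas (the network shape, the optimizer and the dataset format are assumptions, not our actual code): a small network is trained by plain supervised learning to imitate the actions Fractal AI chose in its recorded games.

```python
# Behavioural cloning sketch: teach a small network to imitate the actions
# Fractal AI chose in its recorded games. The architecture, optimizer and
# the (ram, action) dataset format are illustrative assumptions.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def build_policy(n_actions, ram_size=128):
    model = Sequential([
        Dense(256, activation="relu", input_shape=(ram_size,)),
        Dense(256, activation="relu"),
        Dense(n_actions, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

def train_on_fractal_games(model, rams, actions, epochs=10):
    # rams: array of RAM observations; actions: the action Fractal AI took on each
    model.fit(np.asarray(rams) / 255.0, np.asarray(actions),
              epochs=epochs, batch_size=64)
    return model
```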

      If this is cheating, I will cheat as much as I can!

    2. Thanks for the interesting discussion Sergio.

      Well, let's put it this way: let's say I want to build a 'cheat box' I can take with me to my local arcade to beat the arcade games for me. The box will have a camera that looks at the Pac-Man screen and some servos to press the buttons. You don't have access to the Pac-Man source code, so you have to rely only on the camera input to win the game. The Fractal AI won't be able to work because it doesn't have a model of the game. The Montreal AI algorithm will learn to beat the game over time. (Maybe I'm wrong here because I can't find the source or documentation on the Montreal AI, but in general I believe this type of algorithm would do it.)

      For this reason I don't think it's fair to compare the performance of an algorithm that is trying to solve both problem 1 and problem 2 with one that only tries to solve problem 2. Solving both 1 and 2 is a much tougher problem and makes the algorithm usable in real life, while if you only solve problem 2 the algorithm cannot be used in real life and is not tackling the toughest problem of all.

      Maybe under the rules of the OpenAI gym it is allowed to solve just problem 2, so the Fractal algorithm is winning fair and square. I agree with this. But when I say it's not a fair comparison I'm talking in terms of engineering, not the OpenAI competition.

    3. You are right: in general terms it is unfair, as we only solve one of the two problems. Fractal AI (without learning code) cannot be used without a somewhat reliable simulation of the system; you don't need perfect information, just reliable information.

      That said, the actual strength of Fractal AI is squeezing as much juice as you can from any simulation you have, and this includes regular NNs, which in turn can learn faster by only watching "pro-level games".

      It is this "virtuous circle" between Fractal AI and NNs that makes a real apples-to-apples comparison outside the "fantasy land" of OpenAI, and knowing that NNs do a very nice job at learning and that Fractal AI generates high-quality samples... what could go wrong?

    4. I agree. Fractal AI is the solution for intelligence, NNs can be the solution for modelling; both working together could be very powerful.

      But I still have my reservations about how good NNs are at modelling. They're the best we have, but I think they're still too limited in many respects, especially for modelling dynamic systems.

    5. From the Fractal AI point of view, NNs are just a form of memory that helps the AI focus on the a-priori best decisions and not on the less promising ones, but I agree with you (as always!) that NNs are not the final answer.

      I am not a fan of NNs, but they are the fast lane to adding memory to Fractal AI; still, they are nasty black boxes limited by the number and kind of layers you choose beforehand.

      I am working on a fractal form of memory as a replacement for NNs. I have tested it on some cases and it does work, but right now NNs are fast and reliable thanks to GPUs and linear algebra, and losing that advantage has to be compensated with a really efficient implementation of whatever other kind of memory you choose, fractal memory included, so it is actually a long shot.

      But, like Fractal AI, its memory counterpart is also understandable and somewhat more traceable than NNs are, and it grows as needed, so you don't have to constrain it from the start with any particular topology; it "sculpts" its own form during the learning process.

      Sounds nice, but there are still some steps to get there, and as always, my side-project time is limited.

      My dream is Fractal AI + Fractal Memory + Fractal Consciousness, and the latter is, unexpectedly, quite a bit easier to define and code than memory, so NNs are going to be around for some time, I assume.

      Juan, I am really enjoying this conversation, some day we should have a beer and talk!

  4. Fractal memory, very interesting. So as I understand it, you want to use a NN (or alternatively a fractal algorithm) to add memory to the system. The idea is that it learns from previous experience, biasing it towards what we know works, instead of having to constantly re-process solutions that it already found.

    The problem that really interests me the most, however, is a bit different. My question is, how could we apply a Fractal AI to intelligently control -any- robot or system, without having to code the dynamic model of the robot or system into the algorithm? For example, how could I actually build that 'cheat box' to beat the actual Pac-Man arcade machine? How would the Fractal AI work if it doesn't have access to the game's API to sample future outcomes?

    Merely adding memory to the algorithm does not solve the problem. Memory only allows the algorithm to remember what previously worked, but it doesn't help it find new ways that have a good chance of working but haven't been tried yet. In more concrete terms, the Fractal AI won't know how to predict future outcomes; it only knows past history. If you limit the Fractal AI to utilize only past history, the resulting algorithm will work no better than a simple random search, where good solutions are found by chance and stored to be repeated later. This is basically what the Montreal AI does, and it's not that effective.

    More than memory, what you need is for the algorithm to learn to create a dynamic model of reality. This dynamic model not only fits what we have already experienced, but is able to extrapolate into the future and predict things that have never been experienced. This dynamic model can be used by the Fractal AI to sample into the future and find strategies that could work but have not been tried yet. This will make it more intelligent than a simple random search, and will approximate the results you are showing in OpenAI, though it won't be as good, because now the algorithm has to learn a model of the system instead of having direct access to it.
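
    In code terms, what I mean could be pictured like this (a sketch built entirely on assumptions; the model class, sizes and inputs are just illustration): replace the emulator calls in the lookahead with a learned transition function that predicts the next observation from the current one and the chosen action.

```python
# Illustrative learned forward model (an assumption, not an existing system):
# predicts the next RAM vector from the current RAM and a one-hot action, so a
# planner could call it in place of env.step() when no simulator is available.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def build_dynamics_model(ram_size=128, n_actions=9):
    model = Sequential([
        Dense(512, activation="relu", input_shape=(ram_size + n_actions,)),
        Dense(512, activation="relu"),
        Dense(ram_size),              # predicted next RAM (simple regression)
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def predict_next_ram(model, ram, action, n_actions=9):
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    x = np.concatenate([np.asarray(ram, dtype=float) / 255.0, one_hot])[None, :]
    return model.predict(x)[0]        # would stand in for env.step() in the planner
```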

    This is pretty much how the human rational mind works; we create a simulation of our world in our minds based on the data we obtain from our senses, then we extrapolate and think what would happen if we do different actions, and then we choose the action that we believe will optimize a function (happiness, wealth, etc).

    I've been studying this idea for a while now, and have found that NNs are very limited in their capacity to create these dynamic models. In fact, I think we are quite far from solving this problem. But once we do, we will have the holy grail: a true universal AI, one which can be applied to any system and will learn to act intelligently, without any tuning or specific coding.

    I would love to talk some day. I am from Puerto Rico, I am a mechanical engineer, I work on drone systems, and I am very interested in AI topics.

    -J

  5. Will you release the source code? When?

    1. Yes, we plan to release it after the summer (and before the end of 2017), but before that we want to add "learning" to the mix and try it out with most of the Atari environments, so we make sure we release final code that actually improves over time and not just a mock-up of the idea.

  6. I'm not sure if you have seen this thread, but here they are discussing your findings. Some mention a similar argument to the one I made above: “Yes - the issue is that the work is currently presented as requiring "no training", but it has simply relocated that problem to constructing a perfect simulation of the environment. It then uses the fact that current benchmarking systems have available simulations to "cheat" rather than learning that function itself. One of the most difficult and interesting parts of reinforcement learning is constructing the function that determines the evolution of the system. If you know the evolution function a priori the problem is mostly trivial - i.e. alpha-beta search, graph searching, etc.”

    https://news.ycombinator.com/item?id=14709896

    I don't mean this as a discouragement but as constructive input. I love your work, looking forward to the next post.

    1. Thanks Juan, I read it a while ago, and I just don't know why they don't come here and ask me whatever they want to know. I won't be registering here and there to answer their questions, but I can certainly clarify any conceptual question for them.
