Wednesday, 28 June 2017

Solved Atari games

The list of OpenAI environments we have played so far is steadily growing, so I had to make a list to keep track of them. Here we keep and share this growing list, along with its scores and how they compare with the "second-best" algorithm in the OpenAI gym.

For those interested: we base all of our work on "Causal Entropic Forces" by Alexander Wissner-Gross and apply the ideas outlined in the G.A.S. algorithm. We are not actually learning in any way, so all games are independent of each other, and the first-ever game played is as good as the 100th.

First, we list the already finished environments. These include 100 played games and an official score in the OpenAI gym, computed as the average of those 100 games:

1. Atari - MsPacman-ram-v0: average score of 11.5k vs 9.1k (x1.2)

This was the first environment to be finished and uploaded, so it represents our first official record. We decided to use the "ram" version (instead of the image version) because the choice is irrelevant for our algorithm but not for a more standard approach, so we had an extra edge.

The main issue here was that a dead Pacman takes about 15 frames to become noticeable on screen (there is a short animation), so you need to look ahead at least those 15 frames (ticks) in order to detect death in time.
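A minimal sketch of that lookahead idea, using a toy simulator in place of the real Atari emulator (all names here are hypothetical; the actual algorithm samples many possible futures, not a single rollout):

```python
import copy

DEATH_DELAY = 15  # frames between the fatal event and its on-screen effect


class ToyEnv:
    """Toy stand-in for an Atari simulator: the agent dies 15 ticks
    after taking action 'left', mimicking the delayed death animation."""
    def __init__(self):
        self.fuse = None

    def step(self, action):
        if action == "left" and self.fuse is None:
            self.fuse = DEATH_DELAY   # death becomes visible 15 ticks later
        if self.fuse is not None:
            self.fuse -= 1
            return self.fuse <= 0     # done flag
        return False


def leads_to_death(env, action, horizon=DEATH_DELAY):
    """Step a cloned simulator `horizon` ticks ahead and report whether
    the episode ends, so deaths are seen before the animation starts."""
    sim = copy.deepcopy(env)          # clone so the real game is untouched
    done = sim.step(action)
    for _ in range(horizon - 1):
        if done:
            return True
        done = sim.step(action)       # keep repeating the same action
    return done
```

With a 15-tick horizon the fatal "left" is flagged before anything changes on screen, while a shorter horizon would miss it entirely.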

2. Atari - Qbert-ram-v0: avg. score of 18.4k vs 4.2k (x4.3)

Next up was Qbert, simply because it was solved quickly. Here the action is not so fast: once you decide to jump left, it takes a considerable number of frames for the move to complete before you can take a second decision. That is why, in this case, we scan one frame out of every two (Pacman used all frames).

Note: one "frame" is internally composed of 2 consecutive frames, as the Atari screen seems to use an interlaced scheme (odd frames only fill odd lines, like in the PAL or SECAM TV standards).
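That "one decision every two frames" schedule can be sketched as a thin wrapper that repeats each chosen action (a toy version with made-up names; gym's Atari environments expose a similar idea via built-in frame skipping):

```python
class CountingEnv:
    """Toy env for demonstration: reward 1 per frame, ends at frame 5."""
    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 5   # (observation, reward, done)


class SkipWrapper:
    """Repeat each chosen action for `skip` consecutive frames and
    accumulate the reward, so a decision is taken every `skip` ticks."""
    def __init__(self, env, skip=2):
        self.env = env
        self.skip = skip

    def step(self, action):
        obs, total_reward, done = None, 0.0, False
        for _ in range(self.skip):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:                      # stop early if the episode ends
                break
        return obs, total_reward, done
```

Each call to `SkipWrapper.step` thus advances the underlying game by two frames while the decision-making code only runs once.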

The remaining cases have not completed the mandatory 100 games, as it takes too long (Guillem's laptop is causing about 1.3% of actual global warming ;-), so they do not have an official score. Instead, I will use the approximate average of the completed games.

3. Atari - MsPacman-v0: avg. score (4 games) of ~9k vs 6.3k (x1.4)

Our first-ever test was done on this environment. After 4 games the results were too evident to waste more CPU (by adding more resources we could beat it to any level we wished, at the expense of CPU time), so we moved on to explore the "ram" version.

4. Atari - Tennis-ram-v0: avg. score (1 game) of ~8 vs 0.01 (x800)

This time it is quite difficult to compare with other algorithms, as beating the embedded "old days AI" is quite complicated for a general algorithm (it was designed to play against human-level players, after all): most of the algorithms in the OpenAI gym scored just 0.0, and only one was able to win marginally (thanks to the biased scoring system of picking the best 100 episodes, so the more you play, the higher your score gets).
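That "best 100 episodes" bias can be illustrated with a toy metric (this is my reading of the leaderboard rule, not OpenAI's actual evaluation code):

```python
def best_100_average(rewards, window=100):
    """Average of the best `window` episode rewards out of all episodes
    played.  Under such a metric, playing more episodes can never lower
    your score -- each extra episode can only raise the best-100 set."""
    best = sorted(rewards, reverse=True)[:window]
    return sum(best) / len(best)
```

For example, with a window of 2, the episode rewards [0, 0, 1, 1] score 1.0 even though the true mean is 0.5: grinding out more episodes eventually fills the window with lucky runs.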

Scoring x800 times better than the second one means, in this case, just "I am the first and only algorithm actually solving it, so you cannot compare me with anything".

5. Atari - VideoPinball-ram-v0: avg. score (21 games) of ~500k vs 28k (x17.8)

The CPU power we applied here made the AI almost unbeatable; with about 50% more, it would never die.

6. Classic Control - CartPole-v0: avg. score (10 games) of 0.0 vs 0.0 (x1)

This one is not from an Atari game, so not really part of the list, but it is a classic example we wanted to share anyhow.

The scoring here is weird: you only need to survive for a given amount of time (about 4 seconds) to get 200 points and win the game. You have solved the game when the average of the last 100 games is above 195 points, and the number of games/episodes you played before those 100 "good ones" is the score shown.

So a score of 0.0 means that your first 100 games averaged above 195 and the game was considered solved. You cannot get better than this, so here the absolute record was already reached.
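My reading of that scoring rule can be sketched as a toy function (hypothetical, not gym's actual evaluation code):

```python
def episodes_before_solved(scores, window=100, threshold=195.0):
    """Return the number of episodes played before the first run of
    `window` consecutive episodes whose average is >= `threshold`,
    or None if the task was never solved."""
    for start in range(len(scores) - window + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / window >= threshold:
            return start          # episodes wasted before the winning window
    return None
```

An agent that scores 200 from its very first episode gets the perfect 0.0, which is why no algorithm can ever beat that record, only match it.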

That is all for now; I will be updating this post to reflect any new accomplishments in our quest to solve as many OpenAI environments as we can.


  1. I would like to know how well this performs against AIXI (or, more precisely, an approximation to incomputable AIXI). The CTW-approximated AIXI is also a general algorithm with no need to train internal parameters. It can also solve toy problems such as PacMan and has a very well-grounded, albeit different, theoretical basis.

    1. Hi Adam, I didn't actually know about AIXI, but after a quick search and read, it seems to scan future outcomes, making them "fade away" in importance the further into the future they are (correct me if I am wrong!), so the nearer in time an event is, the more "decisive" it is.

      If that is the case, the two are similar, but this "fading with time" heuristic is a "quick and dirty" trick -one that I already tried in my algorithm before going fractal- compared with how I do it now.

    2. Hmm, I haven't read all your blog posts so far, and I'm not an expert on AIXI either. Still, I'd like to point out a major difference. AIXI, or more generally, Algorithmic Information Theory (AIT) based methods, do provide a solution to the world-modelling problem, including partial observability.

      For example, say you want to predict the next outcome of the binary string "1101001000100001...". In fact, you have no idea at all about the underlying model. However, this problem would be too easy for a human IQ test. You would say it continues "000001...". But why is that? Why not "010010"?

      The AIT solution: write down all possible C programs, including non-halting ones, whose output starts with the observed sequence.

      Now, if we continue executing all these programs, some of them will print a "0" next while others will print a "1". How do we decide the probability of the next digit, given that there are infinitely many programs? Well, we take a weighted average over these programs, where the weight of a particular program is 2^(-size_of_the_program_in_bits).

      This process is called Solomonoff inductive inference. In short, if a short program can reproduce the observation, it gets a higher weight. We prefer simple models over complex ones.
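      A toy numeric illustration of that 2^(-size) weighting, with the (incomputable) program enumeration replaced by a hand-made list of (next_bit, size_in_bits) pairs:

```python
def solomonoff_vote(programs):
    """`programs` is a list of (next_bit, size_in_bits) pairs, one per
    program consistent with the observed prefix.  Returns the predicted
    probability that the next bit is 1, weighting each program by
    2**(-size): shorter programs dominate the prediction."""
    weights = {0: 0.0, 1: 0.0}
    for bit, size in programs:
        weights[bit] += 2.0 ** (-size)
    return weights[1] / (weights[0] + weights[1])
```

      With one 10-bit program predicting "0" and one 20-bit program predicting "1", the vote goes almost entirely to "0" - the short model wins by a factor of 2^10.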

      Since program size, or Kolmogorov complexity, shares many properties with Shannon's information entropy, I get the feeling that Entropic Intelligence and AIT do complementary things: one maximizes the entropy of the future, while the other minimizes the "entropy" of the past.

    3. "All possible programs" sounds too infinite to me!

      The idea of finding the least complex program for a task is quite appealing, but if you had to look for the "least complex program to find the least complex program", in a general form... what would you find? Would you need it before getting into AIXI?

    4. Well, good methods to approximate incomputable SI would be an interesting research topic. A naive one would be Levin search; then there's Hutter search and Juergen Schmidhuber's OOPS and Goedel Machine. However, these models are mainly of theoretical interest. In practice, they mostly just solve some toy problems and fail to scale.

      Personally, I think directly searching the space of all Turing machines tends to be intractable. However, the principle of searching for simple models could apply to other world modellers.

    5. The problem with searching in the "Turing space" is how you "jump" from one point to another: how to make a small perturbation to a Turing machine so that the resulting one makes "sense" and is still similar to the original (in its input -> output behaviour)... we need a replacement for X = X + epsilon but in code space, and we don't have any (please, please, please correct me if I am wrong!).

      Basically, this space is too sparse to move around in: the sensible algorithms are way too distant from one another, and the gap is filled with billions of bullshit variations. We need a way to walk this landscape without "stepping in the bullshit".