Friday, 3 August 2018

Roadmap to AGI

Artificial General Intelligence (AGI) is the holy grail of artificial intelligence and has been my personal goal since 2013, when this blog started. I seriously plan to build an AGI from scratch, with the help of my good friend Guillem Duran, and here is how I plan to do it: a plausible and doable roadmap to build an efficient AGI.

Please keep in mind that we both work on this in our spare time, so even though the roadmap is practically finished on the theoretical side, coding it is hard and time-consuming (we don't have access to any extra computing power beyond our personal laptops), so at the current pace, don't expect anything spectacular in the near future.

That said, the thing is doable in a few years given some extra resources, so let's start now!

AGI structure

A general intelligence, be it artificial or not, is a compound of just three modules, each one with its own purpose, able to do its job both autonomously and in cooperation with the other modules.

It is only when they work together that we can call it "intelligence" in the same sense we consider ourselves intelligent. Their internal dynamics, algorithms and physical substrate may not be the same, or even close, but the idea of the three subsystems and their roles is always the same in both cases; they are just solved with different implementations.

In this initial post I just enumerate the modules, the state of their development, and their basic functions. In the next posts I will get deeper into the details of each one. Interactions between modules will be covered later, once the different modules are properly introduced.

Module #1: Learning

Definition: This module uses learning to build a simulator of the agent's world from the raw sensory inputs.

Basic function: The module processes the raw sensory inputs, learns to build a representation of the world (as an embedding of those sensory inputs) and then uses it to predict the next state of the world (its representation) as it will probably be perceived in the next moment.

Development: It is a standard deep learning task, so this part is easy... once you find the right layer topology and spend a lot of GPU time training it a few hundred times until it does its job.
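
To make it concrete, here is a minimal sketch of what I mean by such a world model. Take it as an illustration only: the PyTorch framework, the MLP topology and every layer size here are placeholder assumptions of mine, not a finished design.

import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Sketch: encode raw sensory input, then predict the next representation."""
    def __init__(self, obs_dim=128, latent_dim=32, action_dim=4):
        super().__init__()
        # Encoder: raw sensory inputs -> compact world representation (embedding).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        # Predictor: current representation + action -> next representation.
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim))

    def forward(self, obs, action):
        z = self.encoder(obs)
        z_next = self.predictor(torch.cat([z, action], dim=-1))
        return z, z_next

model = WorldModel()
z, z_next = model(torch.randn(1, 128), torch.zeros(1, 4))

Training would simply minimise the distance between the predicted representation and the encoding of the observation that actually arrives next (e.g. an MSE loss between the two).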

Module #2: Planning

Definition: This module scans the future outcomes (or consequences) of taking different actions and decides which one the agent will actually take in order to behave intelligently.

Basic function: This module is a modified version of the current FMC algorithm shown in our GitHub repo. It reads the current state of the world from module #1 as a representation, uses its predictive skills to build a number of paths the world could follow, forms a tree with them following the FMC rules, and finally chooses the action with the most leaf nodes attached.
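
Just to illustrate the decision rule (and only the rule: this is not the FMC code from the repo, and the random-walk "model" and every parameter below are stand-ins I invented for this example):

import random

def predict_next(state, action):
    # Stub world model (placeholder for module #1): a noisy 1-D walk.
    return state + action + random.gauss(0.0, 0.5)

def alive(state):
    # Stub viability check: the agent "dies" outside this interval.
    return -10.0 < state < 10.0

def count_leaves(state, action, depth=5, branching=3):
    # Grow a small random tree after taking `action`; count surviving leaves.
    state = predict_next(state, action)
    if not alive(state):
        return 0
    if depth == 0:
        return 1
    return sum(count_leaves(state, random.choice((-1, 0, 1)), depth - 1, branching)
               for _ in range(branching))

def choose_action(state, actions=(-1, 0, 1)):
    # The rule: pick the action whose subtree ends with the most leaf nodes.
    return max(actions, key=lambda a: count_leaves(state, a))

print(choose_action(8.0))  # tends to print -1: away from the nearby "wall" at 10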

Development: This module is already done and working like a charm. It really outperforms any other planning algorithm out there (it beat all the SoTA algorithms we could find, about 11 of them, on all of the 50 Atari games used to test them in the literature, with 360 times fewer samples on average), works perfectly fine on continuous spaces, and is based on a first-principles theory of intelligence (of mine ;).

Module #3: Consciousness

Definition: This module selects the relative importance of the different goals available to the intelligence, to build the reward function used by module #2.

Basic function: Basically, it changes the "personality" of the agent in real time, making it more interested in some kinds of things and less in others, in order to maximize some property of the tree built by module #2 in its internal decision process. The effect is auto-selecting the goals to follow depending on the most probable world evolutions, so that the agent has an enjoyable and highly rewarding future in most of them.

Development: This is not as complicated as it sounds; actually, it is about using FMC a second time on top of the first, but instead of deciding on the next action to take, you now decide on the next "personality change" of the agent, using the same idea but with a deeply different form of entropy: the entropy of the whole tree as a graph (as opposed to using the entropy of only the final leaf nodes, as in standard FMC), or its "graph entropy". So it is still waiting for a "coding slot", and will be for a while!
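
To give an idea of the shape this could take, here is a purely hypothetical sketch: the degree-based "graph entropy" proxy below is a crude stand-in I am using just for illustration (not the actual measure, which later posts develop), and build_plan_tree stands for whatever module #2 would expose.

import math
from collections import Counter

def degree_entropy(tree_edges):
    # Crude stand-in for "graph entropy": Shannon entropy of the degree distribution.
    degrees = Counter()
    for a, b in tree_edges:
        degrees[a] += 1
        degrees[b] += 1
    total = sum(degrees.values())
    return -sum((c / total) * math.log(c / total) for c in degrees.values())

def choose_personality(state, personalities, build_plan_tree):
    # Second FMC-like decision, but over "personality changes" instead of actions:
    # keep the personality whose plan tree maximises the (proxy) graph entropy.
    return max(personalities,
               key=lambda w: degree_entropy(build_plan_tree(state, w)))

print(degree_entropy([(0, 1), (0, 2), (1, 3), (1, 4)]))  # demo on a toy tree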

Conclusions

There are some important pieces of the model left out and lots of implementation details worth mentioning, but this is basically all it takes to build an AGI: a sensory part that deals with inputs and builds a useful simulation of the world, an intelligent planning part that uses the simulation to scan the future and decide, and a final part that defines and modifies the reward function to follow.

In its current form, it is doable in a few years, but the part that has to learn the world dynamics, the ANN, is the weakest part with the most difficult task; it would be the bottleneck of the AGI.

A link to the next post will appear here shortly.


Wednesday, 18 July 2018

Graph entropy 6: Separability

In the standard Gibbs-Shannon entropy, the third Shannon-Khinchin axiom, about separability, says that, given two independent distributions P and Q, the entropy of the combined distribution P×Q is:

H(P×Q) = H(P) + H(Q)

When P and Q are not independent, this formula becomes an inequality:

H(P×Q) ≤ H(P) + H(Q)
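
Before going to the graph case, the standard equality is easy to verify numerically; here is a quick check (a throwaway script for illustration, using natural logarithms):

import numpy as np

def gibbs(p):
    p = p[p > 0]  # terms with p_i = 0 contribute nothing
    return -np.sum(p * np.log(p))

P = np.array([0.7, 0.1, 0.2])
Q = np.array([0.5, 0.5])
PQ = np.outer(P, Q).ravel()  # joint distribution of two independent variables

print(gibbs(PQ), gibbs(P) + gibbs(Q))  # both print ~1.4950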

Graph entropy, being applied to graphs instead of distributions, allows for some more ways of combining two distributions, giving not one but at least three interesting inequalities:

Wednesday, 13 June 2018

Graph entropy 5: Relations

After some introductory posts (which you should have read first, starting here) we face the main task of defining the entropy of a graph, something looking like this:


Relations

We will start by dividing the graph into a collection of "Relations": a minimal graph where a pair of nodes A and B are connected by an edge representing the conditional probability P(A|B):


Tuesday, 12 June 2018

Graph entropy 4: Distribution vs Graph

In previous posts, after complaining about the Gibbs cross-entropy and failing to find an easy fix, I presented a new product-based formula for the entropy of a probability distribution; now I plan to generalise it to graphs.

Why is it so great to have an entropy for graphs? Because distributions are special cases of graphs, but many real-world cases are not distributions, so the standard entropy cannot be correctly applied to them.

Graph vs distribution

Let's take a simple but relevant example: there is a parking lot with 500 cars and we want to collect information about the kind of engines they use (gas engines and/or electric engines) and finally present a measurement of how much information we have.

We will assume that 350 of them are gas-only cars, 50 are pure electric and 100 are hybrids (but we don't know this in advance).

Using distributions

If we were limited to probability distributions, as in the Gibbs entropy, we would say there are three disjoint subgroups of cars ('Only gas', 'Only electric', 'Hybrid') and that the probabilities of a random car being in each subgroup are P = {p1 = 350/500 = 0.7, p2 = 50/500 = 0.1, p3 = 100/500 = 0.2}, so the experiment of inspecting the engines of those cars has a Gibbs entropy of:

HG(P) = -Σ(p_i × log(p_i)) = 0.2497 + 0.2303 + 0.3219 = 0.8018

If we use the new H2 and H3 formulas, we get a different result, but the difference is just a matter of scale:

H2(P) = ∏(2 - p_i^p_i) = 1.2209 × 1.2057 × 1.2752 = 1.8772

H3(P) = 1 + log(H2(P)) = 1 + log(1.8772) = 1.6298
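
These numbers are easy to reproduce; here is a short script (using natural logarithms, as in the calculations above):

import math

P = [350/500, 50/500, 100/500]  # gas-only, electric-only, hybrid

HG = -sum(p * math.log(p) for p in P)  # Gibbs entropy
H2 = math.prod(2 - p**p for p in P)    # new product-based entropy
H3 = 1 + math.log(H2)                  # its logarithmic rescaling

print(round(HG, 4), round(H2, 4), round(H3, 4))  # 0.8018 1.8772 1.6298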



Monday, 11 June 2018

Graph entropy 3: Changing the rules

After showing that the standard Gibbs cross-entropy was flawed and trying to fix it with an initial, also flawed, formulation of a "free-of-logs" entropy, we face the problem of finding a way to replace a sum with a product without breaking anything important. Here we go...

When you define an entropy as a sum, each of the terms is supposed to be "a little above zero": a small, positive ε ≥ 0, so that adding it can only slightly increase the entropy. Also, when you add a new probability term with (p=0) or (p=1), you need this new term to be 0 so it doesn't change the resulting entropy at all.

Conversely, when you want to define an entropy as a product of terms, they need to be "a little above 1", of the form (1+ε), and the terms associated with the extreme probabilities (p=0) and (p=1) cannot change the resulting entropy, so they need to be exactly 1.

In the previous entropy this ε term was defined as (1 - p_i^p_i), and now we need something like (1+ε), so why not just try (2 - p_i^p_i)?

Let us be naive again and propose the following formulas for entropy and cross-entropy:

H2(P) = ∏(2 - p_i^p_i)

H2(Q|P) = ∏(2 - q_i^p_i)

Once again it looked too easy to be worth researching, but once again I did, and it proved (well, my friend José María Amigó actually proved it) to be a perfectly defined generalised entropy of a really weird class, with Hanel-Thurner exponents of (0, 0), something never seen in the literature.

As you can see, this new cross-entropy formula is perfectly well defined for any combination of p_i and q_i (in this context, we assume 0^0 = 1) and, if you graphically compare both cross-entropy terms, you find that the Gibbs version's term is unbounded (as q→0 the term's value goes up to infinity):

φG(p, q) = -(p × log(q))



In the new multiplicative form of entropy, this term is 'smoothed out' and nicely bounded between 1 and 2:

φ2(p, q) = (2 - q^p)
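
You can see both behaviours numerically with a quick throwaway comparison (the values of p and q below are picked arbitrarily):

import math

def phi_gibbs(p, q):
    return -p * math.log(q)  # blows up as q -> 0

def phi_2(p, q):
    return 2 - q**p          # always stays between 1 and 2

for q in (0.5, 0.1, 0.001, 1e-9):
    print(q, round(phi_gibbs(0.5, q), 3), round(phi_2(0.5, q), 6))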


Sunday, 10 June 2018

Graph Entropy 2: A first replacement

As I commented in a previous post, I found that there were cases where the cross-entropy and the KL-divergence were not well defined. Unluckily, in my theory those cases were the norm.

I had two options: not even mention it, or try to fix it. I opted for the first, as I had no idea how to fix it, but I felt I was hiding a big issue with the theory under the carpet, so one random day I tried to find a fix.

Ideally, I thought, I would only need to replace the (p_i × log(p_i)) part with something like (p_i^p_i), but it was such a naive idea that I almost gave up before even having a look. Still, I did: how different do those two functions look when plotted on their domain interval (0, 1)?

Wow! They were just mirror images of each other! In fact, you only need a small change to match them: (1 - p_i^p_i):
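
You can reproduce the comparison with a couple of lines (matplotlib assumed, just for the plot):

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)
plt.plot(p, -p * np.log(p), label="-p*log(p)")  # the Gibbs term
plt.plot(p, 1 - p**p, label="1 - p^p")          # the proposed replacement
plt.xlabel("p")
plt.legend()
plt.show()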


Graph entropy 1: The problem


This post is the first in a series about a new form of entropy I came across some months ago while trying to formalise Fractal AI, its possible uses as an entropy of a graph, and how I plan to apply it to neural network learning and even to generating conscious behaviour in agents.

Failing to use Gibbs

The best formula so far for the entropy of a discrete probability distribution P = {p_i} is the so-called Gibbs-Boltzmann-Shannon entropy:

H(P) = -k × Σ(p_i × log(p_i))

In this famous formula, the set of all possible next states of the system is divided into a partition P with n disjoint subsets, with p_i representing the probability of the next state being in the i-th element of this partition. The constant k can be any positive value, so we will just assume k = 1.

Most of the time we will be interested in the cross-entropy between two distributions P = {p_i} and Q = {q_i}, or the entropy of Q given P, H(Q|P): a measure of how different they are or, in terms of information, how much new information knowing Q gives me if I already know P.

In that case, the Gibbs formulation for the cross-entropy goes like this:

H(Q|P) = -Σ(p_i × log(q_i))

The cross-entropy is the real formula defining an entropy, as the entropy of P can be defined as its own cross-entropy, H(P|P), which has the property of being the minimal value of H(Q|P) over all possible distributions Q:

H(P) = H(P|P) ≤ H(Q|P)

As good and general as it may look, this formula hides a very important limitation I came across when trying to formalise the Fractal AI theory: if for some index you have q_i = 0 while p_i is not zero too, the above formula is simply not defined, as log(0) is as undefined as 1/0 is.
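
The failure is easy to trigger; two lines are enough to see it:

import numpy as np

P = np.array([0.5, 0.5])
Q = np.array([1.0, 0.0])  # q_2 = 0 while p_2 = 0.5
with np.errstate(divide="ignore"):
    print(-np.sum(P * np.log(Q)))  # inf: the cross-entropy blows up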