Sunday, 10 June 2018

Graph entropy 1: The problem


This post is the first in a series about a new form of entropy I came across some months ago while trying to formalise Fractal AI, its possible uses as an entropy of a graph, and how I plan to apply it to neural network learning and even to generating conscious behaviour in agents.

Failing to use Gibbs

The best formula so far accounting for the entropy of a discrete probability distribution P={pi} is the so-called Gibbs-Boltzmann-Shannon entropy:

H(P) = -k*𝚺(pi × log(pi))

In this famous formula, the set of all possible next states of the system is divided into a partition P of n disjoint subsets, with pi representing the probability of the next state falling in the i-th element of this partition. The constant k can be any positive value, so we will simply take k=1 from here on.
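Just to fix ideas, here is a minimal sketch of the formula in Python/NumPy with k=1 (the function name and the example distribution are my own, purely for illustration):

    import numpy as np

    def gibbs_entropy(p):
        """Gibbs-Boltzmann-Shannon entropy of a discrete distribution, with k = 1."""
        p = np.asarray(p, dtype=float)
        # Drop the zero entries: the limit of p*log(p) as p -> 0 is 0, so they add nothing.
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    print(gibbs_entropy([0.5, 0.25, 0.25]))  # ~1.04 nats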

Most of the time we will be interested in the cross-entropy between two distributions P={pi} and Q={qi}, or the entropy of Q given P, H(Q|P): a measure of how different the two are or, in terms of information, how much new information knowing Q brings if I already know P.

In that case, the Gibbs formulation for the cross-entropy goes like this:

H(Q|P) = -𝚺(pi × log(qi))

Cross-entropy is the real formula defining an entropy, as the entropy of P can be defined as its own cross-entropy H(P|P), which has the property of being the minimal value of H(Q|P) over all possible distributions Q.

H(P) = H(P|P) ≤ H(Q|P)
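A quick numeric check of this property, in the same sketchy style (both example distributions are arbitrary and have full support, so all the logs are defined):

    import numpy as np

    def cross_entropy(q, p):
        """H(Q|P) = -sum_i pi * log(qi), in the notation used above."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return -np.sum(p * np.log(q))

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.2, 0.3, 0.5])

    print(cross_entropy(p, p))  # H(P|P) = H(P) ~ 1.0297
    print(cross_entropy(q, p))  # H(Q|P) ~ 1.3045, never below H(P|P)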

As good and general as it may look, this formula hides a very important limitation I came across when trying to formalise the Fractal AI theory: if for some index you have qi=0 while pi is not zero, the above formula is simply not defined, as log(0) is as undefined as 1/0 is.

In the entropy formula for a single distribution P, this defect does not show up: pi multiplies the log, so the limit as pi tends to zero is well defined and equal to zero... but in the more general cross-entropy formula, the problem is really there.
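A tiny numeric illustration of the difference (the numbers are arbitrary examples):

    import numpy as np

    p = 1e-12
    print(p * np.log(p))        # ~ -2.8e-11: pi*log(pi) vanishes as pi -> 0
    print(0.3 * np.log(1e-12))  # ~ -8.3: pi*log(qi) only stays finite while qi > 0
    print(0.3 * np.log(0.0))    # -inf (plus a runtime warning): the cross-entropy blows up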

As a mathematician I was horrified. I searched the net for papers discussing this deep problem but was unable to find much, so I went to some DL specialists and asked them what solution they were applying when using cross-entropy as a loss function for gradient descent, hoping to discover that there was an easy and satisfactory fix, but there wasn't! They were just replacing probabilities lower than some epsilon with epsilon itself. They were clipping the offending values out of sight. Ouch!
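For reference, the clipping workaround they described looks roughly like this (a sketch of my own, not any particular library's implementation; the epsilon value is arbitrary):

    import numpy as np

    def clipped_cross_entropy(q, p, eps=1e-7):
        """Cross-entropy with the usual epsilon-clipping trick applied to q."""
        q = np.clip(np.asarray(q, dtype=float), eps, 1.0)  # hide the zeros behind eps
        return -np.sum(np.asarray(p, dtype=float) * np.log(q))

    p = np.array([1.0, 0.0, 0.0])  # target
    q = np.array([0.0, 0.5, 0.5])  # prediction with q1 = 0 exactly where p1 = 1
    print(clipped_cross_entropy(q, p))  # ~16.1, instead of the true value: +infinity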

How could it be that the two most important derivations from the concept of entropy (cross-entropy and divergence) were not well defined under the Gibbs formulation of entropy? Why was the Gibbs formulation so defective, and where did the defect come from?

The origins of entropy

Any formula using the log of a probability is, in principle, flawed. Probabilities can naturally be zero, and log is not defined at zero, so taking the log of a probability is not actually a good idea.

Gibbs did not originally use logarithms. He started by noticing what is now known as the original form of the Gibbs inequality:

Gibbs' original inequality

For any distribution P={pi}, the product ∏(pi^pi) is always positive and greater than or equal to the product ∏(qi^pi) for any other distribution Q={qi}.

∏(pi^pi) ≥ ∏(qi^pi)
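A small numeric check of this product form (my own sketch; note that it relies on the usual convention 0^0 = 1, which NumPy already follows, and that is exactly what lets it survive zero probabilities):

    import numpy as np

    def gibbs_product(base, exponents):
        """prod_i base_i ** exponents_i, with the convention 0**0 = 1."""
        base = np.asarray(base, dtype=float)
        exponents = np.asarray(exponents, dtype=float)
        return np.prod(base ** exponents)

    p = np.array([0.7, 0.3, 0.0])
    q = np.array([0.2, 0.0, 0.8])

    print(gibbs_product(p, p))  # ~0.543: prod pi^pi is strictly positive
    print(gibbs_product(q, p))  # 0.0: q2 = 0 where p2 > 0, yet nothing is undefined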

It was a very nice inequality as it was, but Gibbs thought he could apply logarithms on both sides without breaking the inequality and, at the same time, use it to define entropy itself, making it easy to compute the entropy of two independent systems by just adding their corresponding entropies.
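Numerically, the additivity he was after for two independent systems looks like this (again a sketch of my own, with arbitrary example distributions):

    import numpy as np

    def gibbs_entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    p = np.array([0.5, 0.5])
    q = np.array([0.9, 0.1])

    # The joint distribution of two independent systems is the outer product of p and q.
    joint = np.outer(p, q).ravel()

    print(gibbs_entropy(joint))                 # ~1.0182
    print(gibbs_entropy(p) + gibbs_entropy(q))  # same value: H(P x Q) = H(P) + H(Q)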

He named this property "strong additivity", and it is the 4th Shannon-Khinchin axiom needed for a function to be considered a full entropy (by the way, functions meeting only the first 3 axioms are called "generalized entropies"). They then proved that only this formula fits those four axioms (well, what they really proved is that this is the only possible form of entropy built as a sum of terms... but nothing about multiplicative or other possible forms... hum... interesting!).

This step resulted in the current form of the entropy as a sum of logarithmic terms, and it is at the root of all subsequent problems with the cross-entropy. My opinion is that this 4th axiom is not a real axiom but a somewhat desirable property that is not worth the high cost we paid for it.

Now that we know the problem the cross-entropy has and its root causes, we are ready to explore other ways of measuring entropy without using logs, but that will have to wait for the next post.

3 comments:

  1. I'd argue the log(0) is not a problem if you move to the space of extended reals and assume log(0) = -∞.

    As to where it comes from: this is inherent behavior of the KL divergence, which forces the supports of the two distributions to agree in a certain sense, otherwise it'd blow up to infinity. There exist many other divergences, many of which instead "saturate" at some finite value when the supports disagree.

    1. Hi Artem. Defining log(0) as -inf does not solve any problem: you cannot operate with it and, as p tends to 0, you still blow up.

      And about the KL divergence, being forced to apply it only to distributions that "support" each other is just crazy if you think about it: almost all interesting examples will have some elements with probability 0 somewhere. Saying that pi and qi need to be non-zero all the time, in every case, is highly restrictive for a formula as basic as entropy, cross-entropy, and divergence are. Somehow we have been told it was natural, and we just accepted it.

  2. This comment has been removed by the author.
