This post is the first on a series about a new form of entropy I came across some months ago while trying to formalise Fractal AI, the possible uses for it as a entropy of a graph, and how I plan to apply it to neural network learning and even to generate conscious behaviour in agents.

## Failing to use Gibbs

The best formula so far accounting for the entropy of a discrete probability distribution P={p_{i}} is the so-called Gibbs-Boltzmann-Shannon entropy:

H(P) = -k*𝚺(p

_{i}× log(p_{i}))
In this famous formula, the set of all the possible next states of the systems is divided into a partition P with n disjoint subsets, with p

_{i}representing the probability of the next state being in the i-th element of this partition. The constant k can be anything positive so we will just assume k=1 for us._{i}} and Q={q

_{i}}, or the entropy of Q given P, H(Q|P), a measure of how different they both are or, in terms of information, how much new information is in knowing Q if I already know P.

In that case, the Gibbs formulation for the cross-entropy goes like this:

H(Q|P) = -𝚺(p

_{i}× log(q_{i}))
Cross-entropy is the real formula defining an entropy, as the entropy of P can be defined as its own cross-entropy H(P|P), having the property of being the maximal value of H(Q|P) for all possible distributions Q.

As good and general as it may looks, this formula hides a very important limitation I came across when trying to formalise the Fractal AI theory: if, for some index you have q

H(P) = H(P|P) ≥ H(Q|P)

As good and general as it may looks, this formula hides a very important limitation I came across when trying to formalise the Fractal AI theory: if, for some index you have q

_{i}=0 then if p_{i}is not zero too, the above formula is simply not defined, as log(0) is as undefined as 1/0 is.
In the entropy formula for a single distribution P, this defect was not showing up, as p

_{i}was multiplying the log, making the limit as p_{i}tends to zero to be well defined with a value of zero... but in the more general cross-entropy formula, the problem was really there.
As a mathematician I was horrified. I searched on the net for papers discussing this deep problem but was unable to find much, so I went to some DL specialists and asked them for the solution they were applying when using cross-entropy as a loss function for gradient descent, hoping to discover that there were an easy and satisfactory fix, but there wasn't! They were just replacing probabilities lower than an epsilon, with epsilon. They were clipping the offending values out of sight. Ouch!

How could it be that the two most important derivations from the concept of entropy (cross-entropy and divergence) were not well defined under the Gibbs formulation of entropy? Why was the Gibbs formulation so defective and where did the defect came from?

## The origins of entropy

Any formula using the log of a probability is, in principle, flawed. Probabilities can naturally be zero, and log is not defined for zero, so using log of a probability is not actually a good idea.In its original form, Gibbs didn't use logarithms. Gibbs started by noticing what is known as the original form of the Gibbs inequality:

### Gibbs original inequation

For any distribution P={p_{i}}, the product ∏(p

_{i}

^{pi}) is always positive and greater-or-equal than the product ∏(q

_{i}

^{pi}) for any other distribution Q={q

_{i}}.

∏(p

_{i}^{pi}) ≥ ∏(q_{i}^{pi})It was a very nice inequality as it was, but Gibbs thought he could apply logarithms on both sides without breaking the inequality and, at the same time, use it to define entropy itself, making it easy to compute the entropy of two independent system by just adding their corresponding entropies.

He named this property of strong additivity its 4rd axiom needed for a function to be consider as an entropy, and then he proved that only his formula fitted into those four axioms.

This step resulted in the actual form of the entropy as a summary of logarithmic terms, and is in the root of all subsequent problems with the cross-entropy. My opinion is that this 4rd axiom is not a real axiom but a somehow desirable property that is not worth the high cost we paid for it.

Once we know the problem the cross-entropy has and the root causes, we are ready to explore other ways of measuring entropy without using logs, but it will have to be in a next post.

## No comments:

## Post a Comment