Tuesday, 12 June 2018

Graph entropy 4: Distribution vs Graph

In previous posts, after complaining about Gibbs cross-entropy and failing to find an easy fix, I presented a new product-based formula for the entropy of a probability distribution, but now I plan to generalise it to a graph.

Why is it so great to have an entropy for graphs? Because distributions are special cases of graphs, but many real-world cases are not distributions, so the standard entropy can not be applied correctly on those cases.

Graph vs distribution

Let's take a simple but relevant example: there is a parking lot with 500 cars and we want to collect information about the kind of engines they use (gas engines and/or electric engines) to finally present a measurement of how much information we have.

We will assume that 350 of them are gas-only cars, 50 are pure electric and 100 are hybrids (but we don't know this in advance).

Using distributions

If we were limited to probability distributions -as in Gibbs entropy- we would say there are three disjoint subgroups of cars ('Only gas', 'Only electric', 'Hybrid') and that the probabilities of a random car to be on one subgroup are P = {p1 = 350/500 = 0.7, p2 = 50/500 = 0.1, p3 = 100/500 = 0.2}, so the results of the experiment of inspecting the engines of those car has an Gibbs entropy of:

HG(P) = -𝚺(pi × log(pi)) = 0.2496 + 0.3218 + 0.2302 = 0.818

If we use the new H2 and H3 formula, we get a different result, but the difference is just a matter of scale:

H2(P) = ∏(2 - pipi) = 1.2209 * 1.2752 * 1.2056 = 1.8771

H3(P) = 1 + Log(1.8771) = 1.6297

Those entropies are incomplete: we are assuming we have three different -and totally independent- results, but this is not true, there is a hidden structure not represented in the distribution: hybrids cars are gas and electric cars at the same time.

In other words: we say we have the same information about engines that we would have about colours (if 0.7 of the cars were red, 0.2 blue, and 0.1 green), but this is not correct as engines have an internal structure colours don't have.

Using graphs

When using graphs we will not assume we know about the existence of 'hybrid cars', instead, we will perform two independent measurements (how many of the cars have some gas engine, and how many have an electric engine) and repeat the same schema until no more information can be obtained.

We would initially find that 350+100 = 450 cars have a gas engine, so P(G) = 450/500 = 0.9, while 50+100 = 150 cars will have an electric one so P(E) = 150 / 500 = 0.3.

Please note here how (0.9, 0.3) is not a distribution, as its summary is 1.2, meaning there must exists some kind of intersection between the two categories (the hybrid cars), so we are not still done with the graph.

If now we focus on G (the 90% of the cars where a gas engine were detected) and repeat the same process, we would find that all of them have -again- a gas engine (P(G|G) = 1) but only 22% have an electric engine (P(E|G) = 100/450 = 0.222).

Repeating the process in E (cars with an electric engine) would produce similar probabilities: P(G|E) = 0.666, P(E|E) = 1.

Please note that, from here, no more repetitions are needed, as repeating at the two new nodes, (G|E) and (E|G), as they both are representing 'hybrid cars', will always have gas and electric engines with probabilities of 1, meaning no more info can be obtained.

This graph is then representing everything we know about 'car engines' in the form of a tree-like structure made of two components:
  • Nodes represent events: node X correspond to the event "I randomly choose a car from the parking lot", while node G represent the event "I detect a gas engine is in the car".
  • Edges represent conditional probabilities of events: P(G|X) = 0.9 means "detecting a gas engine is in a car" if I first "randomly choose a car from the parking lot" is 90%.

In order to define the actual entropy value for a tree-structure like this, we will perform a recursive process, starting on the upper leaf-nodes and going down to the root node, where the graph entropy will be finally obtained.

We will first need to break the problem into its smallest components and trace a plan on how to mix small parts into bigger ones until we can finally obtain the root entropy of the graph... but it is lunch time! I will be commenting on these final calculations on a next post.

Bon appetit!

No comments:

Post a Comment