Friday, April 28, 2006

Convention, Configuration, and Communication Entropy

The increasingly popular web framework Ruby on Rails has at its core an interesting idiom: Convention over Configuration. The idea goes as follows: for the most part a data model will map directly to a database schema, and for the most part the display and editing of the data will map directly onto the model. Therefore, don't specify the mapping in configuration; assume a one-to-one mapping exists.

The long-established ideas of Information Theory back up these design choices very well; in particular, the idea of Communication Entropy applies directly to this situation. The entropy associated with a given communication relates to the amount of information that one learns upon receiving it. So, for instance, when one learns that a table PERSON has an equivalent data model Person and a view person.rhtml, one learns very little, because this kind of one-to-one mapping is very common.
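As a minimal sketch of the idea (in Python, and not how Rails itself is implemented), the conventional names can be derived from the table name, so that configuration is only needed when something deviates. The function name conventional_names and the overrides parameter are hypothetical, introduced only for illustration; the naming follows the PERSON / Person / person.rhtml example above.

```python
def conventional_names(table, overrides=None):
    """Derive model and view names from a table name by convention.

    Hypothetical helper: only the exceptions need to be spelled out
    in `overrides`; everything else is assumed to map one-to-one.
    """
    names = {
        "model": table.capitalize(),       # PERSON -> Person
        "view": table.lower() + ".rhtml",  # PERSON -> person.rhtml
    }
    names.update(overrides or {})          # explicit configuration wins
    return names


# The common case carries no configuration at all.
print(conventional_names("PERSON"))

# Only the unusual part has to be communicated.
print(conventional_names("PERSON", {"view": "people_list.rhtml"}))
```

The second call is the interesting one from an entropy point of view: the only thing written down is the part that could not have been guessed.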

In communication theory one uses the entropy of a message to compress it. In English, for instance, the letter E crops up very often, so one learns very little from its presence in a message (you can test this by removing all the E's from this text, or indeed all the vowels; you will most likely find the text still readable without them). Therefore, when we want to send an 'E' down a communication line we send the smallest encoding, 0, leaving the whole of the rest of the alphabet to begin with 1 (t = 101, a = 110, etc.). When one encodes English letter by letter like this, the average comes to a little over 4 bits per letter, and schemes that also exploit the context between letters can push the figure considerably lower, towards 1 to 2 bits per letter. Without encoding, a fixed-length code needs 5 bits per letter.
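To make the arithmetic concrete, here is a small Python sketch that computes the entropy of a symbol distribution and the code lengths a Huffman code would assign. The five-letter alphabet and its counts are made up for illustration (they are not measured English frequencies), chosen so the numbers come out exactly.

```python
import heapq
import math


def entropy(freqs):
    """Shannon entropy in bits per symbol for a table of counts."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total) for f in freqs.values())


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap entries are (weight, tiebreak, {symbol: depth so far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Illustrative counts only: 'e' is common, 'o' and 'n' are rare.
toy_freqs = {"e": 8, "t": 4, "a": 2, "o": 1, "n": 1}

lengths = huffman_code_lengths(toy_freqs)
total = sum(toy_freqs.values())
average_bits = sum(toy_freqs[s] * lengths[s] for s in toy_freqs) / total
fixed_bits = math.ceil(math.log2(len(toy_freqs)))

print(lengths["e"])        # 1 bit for the most common symbol
print(entropy(toy_freqs))  # 1.875
print(average_bits)        # 1.875
print(fixed_bits)          # 3
```

In this toy alphabet the common symbol gets a one-bit code and the average drops from 3 bits to 1.875 bits per symbol, which is exactly the entropy of the distribution.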

Perhaps if the same ideas were applied to software we could also see a reduction in message size. In fact this idea underpins Domain Specific Languages (like those Rails uses): don't communicate that which you can take as a given.

The parallels between TDD derived code and Artificial Neural Networks

I'd like to draw an analogy between artificial neural networks and the code that often gets produced by poor TDD, i.e. TDD done with little respect paid to refactoring.

Artificial Neural Networks

An artificial neural network (ANN) is a graph-like structure that maps some set of inputs to a set of outputs. The ANN can make very complex mappings because it has multiple layers and weighted connections between the nodes in the network.

One example of an ANN is one which can recognise a human face. To obtain such a network one stimulates a randomized network with samples of faces and of non-faces. For each sample the network will produce an output. One calculates the difference between that output and the desired output, and that difference is then backpropagated through the network. Backpropagation of the 'error' causes the network to learn, producing smaller and smaller errors.
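The training loop described above can be sketched in a few lines of Python. This is a toy, not a face recogniser: a network with two inputs, two hidden nodes and one output, trained on the XOR truth table as a stand-in for "face / non-face" samples. The layer sizes, learning rate, epoch count and seed are all arbitrary assumptions, and the sketch makes no promise that the network fully learns XOR in that many epochs; the point is only that pushing the error back through the weights makes the error shrink.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, epochs=2000, rate=0.5, seed=0):
    """Train a 2-2-1 network by backpropagation; return the error history."""
    rng = random.Random(seed)
    # Each node holds two input weights plus a bias weight, randomly initialised.
    hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    output = [rng.uniform(-1, 1) for _ in range(3)]

    def forward(x):
        h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in hidden]
        y = sigmoid(output[0] * h[0] + output[1] * h[1] + output[2])
        return h, y

    def mean_squared_error():
        return sum((forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    errors = [mean_squared_error()]
    for _ in range(epochs):
        grad_output = [0.0, 0.0, 0.0]
        grad_hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
        for x, target in samples:
            h, y = forward(x)
            # Difference between actual and desired output, scaled by the
            # slope of the sigmoid at the output node.
            delta_out = (y - target) * y * (1 - y)
            for j, h_j in enumerate(h + [1.0]):
                grad_output[j] += delta_out * h_j
            # Propagate the error back to each hidden node.
            for j in range(2):
                delta_hidden = delta_out * output[j] * h[j] * (1 - h[j])
                for k, x_k in enumerate((x[0], x[1], 1.0)):
                    grad_hidden[j][k] += delta_hidden * x_k
        # Nudge every weight against its accumulated error gradient.
        for j in range(3):
            output[j] -= rate * grad_output[j]
        for j in range(2):
            for k in range(3):
                hidden[j][k] -= rate * grad_hidden[j][k]
        errors.append(mean_squared_error())
    return errors


xor_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
errors = train(xor_samples)
print("error before training:", errors[0])
print("error after training: ", errors[-1])
```

Note that everything the network has "learned" lives in the nine numbers held in hidden and output; nothing in them explains why a given input is classified the way it is.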

After training one has a very useful tool that can classify things, sometimes more accurately than a human can. The Faustian bargain one has agreed to is this: one doesn't know how the network classifies its input. The 'knowledge' is just a bunch of numbers; it's a black box.

Code produced by TDD

With TDD the tests play the same role as the training samples: they supply input to the code along with the expected output. The developer takes the difference between actual and expected output and back-propagates changes into the code to reduce this difference to zero.

At the point when all the tests pass it can be very tempting to move on to the next feature. But if one succumbs to this temptation, and does so regularly, then the resulting code will closely resemble the neural network: it will work, but you won't know how it works. It too will be a black box.

Danger of over-specificity

Once one has a trained network there is a danger that it is over-specific: it can recognize all of the faces in the sample set but cannot recognize faces outside the set, even faces that we feel are similar enough that they should not be misclassified. To address this problem one usually divides the sample into two sets: one set is used for training, and the other is held back to verify the trained network on samples it has never seen.
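A deliberately extreme sketch in Python shows why the held-back set matters. The "model" here is a hypothetical stand-in for an over-specific network: it simply memorizes its training samples and falls back to a fixed guess for anything else. The task (is a number even?) and all names are assumptions chosen for illustration.

```python
def train_memorizer(samples):
    """Return a classifier that has only memorized its training samples."""
    table = dict(samples)

    def classify(x):
        # Unseen inputs get a fixed guess: nothing general was learned.
        return table.get(x, False)

    return classify


def accuracy(classify, samples):
    return sum(classify(x) == label for x, label in samples) / len(samples)


# Label each number with whether it is even.
data = [(n, n % 2 == 0) for n in range(20)]
training, held_out = data[:10], data[10:]

model = train_memorizer(training)
print(accuracy(model, training))  # 1.0: perfect on what it was trained on
print(accuracy(model, held_out))  # 0.5: no better than guessing elsewhere
```

Measured on the training set alone the model looks flawless; only the samples it never saw reveal that it is over-specific.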

But with TDD there is no equivalent to having one sample of tests used to drive the code and additional tests used to verify the code. So not only will the code be a black box, but it could very likely be an over-specific black box.
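A small Python sketch of what such over-specific code can look like. Both functions and both lists of cases are hypothetical: the first function is what can accumulate when each branch is added only to turn one more test green and nothing is refactored, the second is the rule that refactoring would have surfaced, and extra_checks plays the role of a held-back verification set that TDD does not normally have.

```python
def is_leap_year_overfit(year):
    # Each branch exists only because one particular test demanded it.
    if year == 2004:
        return True
    if year == 2005:
        return False
    if year == 2000:
        return True
    if year == 1900:
        return False
    return False


def is_leap_year(year):
    # The general rule, which makes the intent readable.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def passes(fn, cases):
    return all(fn(year) == expected for year, expected in cases)


# The tests that drove the code.
driving_tests = [(2004, True), (2005, False), (2000, True), (1900, False)]
# Cases nobody used while writing the code.
extra_checks = [(2008, True), (2100, False), (1996, True)]

print(passes(is_leap_year_overfit, driving_tests))  # True
print(passes(is_leap_year_overfit, extra_checks))   # False
print(passes(is_leap_year, driving_tests))          # True
print(passes(is_leap_year, extra_checks))           # True
```

Both versions are equally green against the driving tests, so the test suite alone cannot tell the black box from the understood rule.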


I find the conclusion to all of this quite obvious: Refactor your code if you want to retain the ability to understand it! Even if you have great tests and great coverage you may still have a murky black box that nobody wants to work on.

(*) It may be that black box code cannot be tested with good tests and so the existence of black box code will be noticed by the developers as an increasing difficulty when trying to write good tests.