When is an Idol an Idol?

The local version of Idols (South Africa) had an "interesting" result this month. It turned out that not all of the SMSs sent to vote for the winner were delivered in time. A few hundred thousand SMSs that were sent in time were not counted because they were delivered after the cut-off time. In an unprecedented move, MNet (the local franchisee of Idols) recounted the votes and declared another Idol the Idol (read more here).

This problem was most likely the result of infrastructure somewhere that could not cope with the volumes (or of something else). Those of us who work in infrastructure understand the complexities of routing messages through many infrastructure components in order to ultimately deliver a service of acceptable quality to consumers. However, many single points of failure exist, and problems like this very high-profile one can occur.

This incident made me realise again how complex it is to design a financial system of acceptable quality that must run on such an unpredictable network. The challenge of ensuring reliability (and recovery) in a real-time payment system installed in an environment that is fundamentally unstable (with many single points of failure) is huge. In the many discussions that I have had in the industry, I have found that very few practitioners understand the problem, and even fewer have designed solutions that can cope in these environments.