“Mathematicians are like Frenchmen: whatever you say to them they translate into their own language and forthwith it is something entirely different” [J. W. Goethe]
In order to fully appreciate this theorem, we need to at least grasp the language spoken by mathematicians.
STATEMENT OF THE THEOREM
Here is an elegant (and rigorous enough) formulation of the universal approximation theorem proved by George Cybenko in his paper titled “Approximation by Superpositions of a Sigmoidal Function”:
This short sentence does not contain any hints pointing towards neural networks and contains several concepts that, at this stage, may be a bit obscure. These concepts are highlighted in blue and turn out to be the actual link between mathematics and neural networks.
The hypercube is nothing but the input space, i.e. the set of all the possible inputs of the function to be approximated. The number "n" of inputs may get out of hand rather quickly. As shown below, each 28×28 image in the MNIST dataset requires n = 784 inputs, one for the greyscale value of each pixel.
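As a concrete sketch of this flattening step, the snippet below turns a 28×28 greyscale image into a 784-dimensional point of the unit hypercube (a random array stands in for an actual MNIST image):

```python
import numpy as np

# A stand-in for a 28x28 MNIST image, with greyscale values in [0, 1].
image = np.random.rand(28, 28)

# Flattening yields the n = 784 inputs of the function to be approximated,
# i.e. a single point in the 784-dimensional unit hypercube.
x = image.flatten()

print(x.shape)  # (784,)
```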
The continuity requirement on the function to be approximated is a typical mathematical requirement stemming from the golden rule that the better functions behave, the easier they can be handled. In plain words, a continuous function is a function such that:
As shown in the "post scripta" section given below, the theorem also works with less well-behaved functions. One may argue that the theorem deals with functions, i.e. mathematics without any concrete impact. This seems like a rush to judgment. For example, the fancy face filters you have fun with on Instagram make use of (super duper) neural networks for face segmentation. At the end of the day, these neural networks approximate vector-valued functions (i.e. functions having vectors as output) that map pixel intensities into a pixel-wise segmentation map.
CONTINUOUS DISCRIMINATORY FUNCTIONS
Discriminatory functions are defined in terms of signed regular Borel measures in such a way that, for any continuous function, there is a Cybenko’s finite sum as “close” as wished to this function. In other (more mathematical) terms, the definition of discriminatory functions guarantees that the set of Cybenko’s finite sums is dense in the set of continuous functions.
A priori, this ad hoc definition may turn out to be too narrow to be of any use. Quite conveniently, however, many interesting continuous functions are discriminatory (see below). In particular, the sigmoid function, which is a building block of many neural networks, is a discriminatory function. This is the first clear hint that Cybenko’s theorem may be of some use when dealing with neural networks.
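To make the sigmoid function concrete, here is a minimal sketch of it, together with the limiting behaviour that makes it "sigmoidal" in Cybenko's sense (the function tends to 1 at +∞ and to 0 at −∞):

```python
import numpy as np

def sigmoid(t):
    """The sigmoid (logistic) function, a continuous discriminatory function."""
    return 1.0 / (1.0 + np.exp(-t))

# Sigmoidal behaviour in Cybenko's sense:
# sigmoid(t) -> 1 as t -> +infinity, sigmoid(t) -> 0 as t -> -infinity.
print(sigmoid(0.0))    # 0.5
print(sigmoid(50.0))   # very close to 1
print(sigmoid(-50.0))  # very close to 0
```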
Admittedly, at first glance, the finite sum:
may look unfamiliar. This sum, however, turns out to be a meek beast, at least when the generic discriminating function is replaced by a sigmoid function. In this case, the finite sum is a feed forward artificial neural network with a single hidden layer:
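Such a single-hidden-layer network is easy to code up. The sketch below implements Cybenko's finite sum G(x) = Σⱼ αⱼ σ(wⱼ·x + θⱼ) in NumPy; the example weights are arbitrary and only serve to illustrate the shape of the computation:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cybenko_sum(x, W, theta, alpha):
    """Cybenko's finite sum: G(x) = sum_j alpha_j * sigmoid(w_j . x + theta_j).

    With a sigmoid activation this is exactly a feed forward neural network
    with a single hidden layer and a linear output neuron.
    """
    hidden = sigmoid(W @ x + theta)  # one activation per hidden neuron
    return alpha @ hidden            # weighted sum of the activations

# A tiny example: 2 inputs, 3 hidden neurons, arbitrary parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))    # hidden-layer weights w_j
theta = rng.normal(size=3)     # hidden-layer biases theta_j
alpha = rng.normal(size=3)     # output weights alpha_j

x = np.array([0.5, -0.2])
print(cybenko_sum(x, W, theta, alpha))  # a single scalar output
```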
As briefly discussed in the "post scripta" section below, other types of neural networks also turn out to be able to approximate functions.
MAIN RESULT OF THE THEOREM
We are now ready to tackle the last (and most interesting) bit of the theorem:
The feed forward neural network may be chosen so as to make the extra "small term" on the right-hand side of the equation as small as wished. Ultimately, the "small term" may be so small that it can be safely neglected. If this is the case, the feed forward neural network is substantially equal to the function on the left-hand side of the equation, irrespective of the inputs with which the neural network is fed.
A simple example showing the universal approximation theorem in action is shown below.
In this example, I found three different artificial neural networks (ANNs) that approximate the function introduced in the fourth chapter of Michael Nielsen's book "Neural networks and deep learning" (which I cannot recommend enough).
- The first ANN is obtained by using the construction method proposed by Chen, Chen and Liu and comprises a hidden layer with about 80,000 (!) neurons. As shown in the plot at the upper right corner of the picture above, the difference between the function and the first ANN, i.e. the extra “small term”, is smaller than 4% of the function. In fancy words, one may say that the first ANN approximates the function at the 4% level.
- The construction of the second ANN is inspired by the discussion in the fourth chapter of Nielsen’s book and comprises a hidden layer with about 400 neurons. As shown in the plot at the lower left corner of the picture above, the second ANN approximates the function at the 3% level, i.e. the extra “small term” of the universal theorem is smaller than 3% of the function.
- The third ANN is a feed forward neural network that I trained to fit the function. This ANN comprises two hidden layers, each hidden layer consisting of 30 neurons. The plot at the lower right corner of the picture above shows that the third ANN approximates the function at the 2% level.
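For readers who want to reproduce something along the lines of the third ANN, here is a minimal from-scratch sketch: a single-hidden-layer sigmoid network (i.e. Cybenko's finite sum) fitted to a wiggly target function by plain gradient descent. The target is, to the best of my recollection, the one used in Nielsen's fourth chapter, but any oscillating function on [0, 1] illustrates the point; the layer size and learning rate are arbitrary choices:

```python
import numpy as np

# Target function (believed to be the one from Nielsen's chapter 4;
# any wiggly function on [0, 1] works just as well here).
def f(x):
    return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# One hidden layer of sigmoid neurons with a linear output neuron,
# trained by batch gradient descent on the mean squared error.
rng = np.random.default_rng(42)
n_hidden = 50
W = rng.normal(size=(n_hidden, 1))   # hidden-layer weights
b = np.zeros(n_hidden)               # hidden-layer biases
a = 0.1 * rng.normal(size=n_hidden)  # output weights

xs = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
ys = f(xs).ravel()
lr, n = 0.5, len(ys)

mse_first = None
for step in range(5000):
    h = sigmoid(xs @ W.T + b)        # hidden activations, shape (200, n_hidden)
    pred = h @ a                     # network output
    err = pred - ys
    mse = (err ** 2).mean()
    if step == 0:
        mse_first = mse
    grad_h = np.outer(err, a) * h * (1.0 - h)  # backprop through the sigmoid
    a -= lr * (h.T @ err) / n
    W -= lr * (grad_h.T @ xs) / n
    b -= lr * grad_h.mean(axis=0)

print(f"MSE: {mse_first:.4f} -> {mse:.6f}")  # the error shrinks as we train
```

With more hidden neurons (or a second hidden layer, as in the third ANN) the residual error can be pushed further down, exactly as the theorem promises.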
If you think that using the third ANN is cheating, because the theorem deals with one-layered ANNs, you are right. As a patent attorney, however, I cannot let this objection slip by unanswered. My reply would be that I merely wanted to show that this type of neural network can also approximate the function. Incidentally, my patent attorney's reply turns out to be true!
PROOF OF THE THEOREM
The theorem can be proved by cleverly combining the Hahn-Banach theorem and the Riesz representation theorem. The technical details of the proof, however, are less interesting than the form of argument used by Cybenko. This form was first described by Aristotle, has a fancy Latin name (reductio ad absurdum), and turns out to be quite fancy itself:
“Reductio ad absurdum … is one of a mathematician’s finest weapons. It is a far finer gambit than any chess play: a chess player may offer the sacrifice of a pawn or even a piece, but a mathematician offers the game.” [G.H. Hardy]
In a nutshell, reductio ad absurdum consists of assuming that a theorem is false and showing that this assumption leads to nonsense, thereby concluding that, in order to avoid this nonsense, the theorem has to be true. In practice, skipping the mathematical details, the line of reasoning of this form of argument looks like this:
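In Cybenko's case, the skeleton of the argument can be sketched as follows (my own paraphrase, with $S$ denoting the set of finite sums and $C(I_n)$ the continuous functions on the hypercube):

```latex
\textbf{Assume the theorem is false:} the set $S$ of finite sums
$\sum_j \alpha_j \, \sigma(w_j \cdot x + \theta_j)$ is \emph{not} dense
in $C(I_n)$.

\textbf{Then}, by the Hahn--Banach theorem, there is a non-zero bounded
linear functional $L$ on $C(I_n)$ that vanishes on the closure of $S$,
and, by the Riesz representation theorem,
$L(f) = \int_{I_n} f \, d\mu$ for some non-zero signed regular
Borel measure $\mu$.

\textbf{In particular}, $\int_{I_n} \sigma(w \cdot x + \theta) \, d\mu = 0$
for all $w$ and $\theta$. \textbf{But} $\sigma$ is discriminatory, so
$\mu = 0$: a contradiction.

\textbf{Hence} the assumption is untenable, $S$ is dense in $C(I_n)$,
and the theorem is true.
```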
There is a joke loved by physicists about a mathematician that, while sleeping in a hotel, wakes up to the smell of smoke and sees the TV on fire. At first worried, he looks into the bathroom and is instantaneously relieved by the presence of a fire bucket. Having determined that a solution to the fire exists, he happily goes back to sleep.
This joke makes fun of existence theorems, i.e. theorems showing that something exists without giving any hint on how to construct it. The universal approximation theorem falls within this category, and many physicists with a more "applicative" vocation would be skeptical about its importance and practical implications.
I tend to disagree with this viewpoint. The universal theorem tells us that, if we have a function, the treasure quest for an artificial neural network approximating our function is not hopeless. We know that the treasure is there; we just need to find it, or at least something close enough to it.
POST SCRIPTA: FURTHER READING
An attentive eye would immediately recognise that the theorem deals with a relatively narrow class of functions, namely the continuous ones. Unfortunately, continuous functions are not the end of the story: in many mathematical and technical areas many interesting functions turn out to be discontinuous. Notably, decision functions, which are ubiquitous in the field of artificial neural networks, are typically discontinuous.
Not surprisingly, as mathematicians strive to generalise, other classes of functions, e.g. the Lebesgue measurable functions, have been proven to be approximable by artificial neural networks. Cybenko's article itself provides us with a theorem dealing with decision functions.
Moreover, Cybenko’s theorem leaves open the question on whether neural networks comprising Rectified Linear Unit (ReLU) activation functions, e.g. deep neural networks, are suitable for approximating functions. Not unexpectedly, given the importance of such networks, this issue has been investigated and it has been shown that continuous functions may be approximated by neural networks comprising ReLU activation functions and, more generally, non-polynomial activation functions.
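A hands-on way to see why ReLU networks can approximate continuous functions is that a single-hidden-layer ReLU network can reproduce any piecewise linear interpolant. The sketch below builds such a network explicitly for an arbitrary continuous function (sin is used as an example; the knot count is an arbitrary choice):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def relu_interpolant(f, knots):
    """Build a one-hidden-layer ReLU network that linearly interpolates f
    at the given knots: f_hat(t) = f(x_0) + sum_i c_i * relu(t - x_i),
    where c_i is the change of slope at knot x_i."""
    x = np.asarray(knots, dtype=float)
    y = f(x)
    slopes = np.diff(y) / np.diff(x)   # slope of each linear segment
    c = np.diff(slopes, prepend=0.0)   # slope change at each interior knot

    def f_hat(t):
        t = np.asarray(t, dtype=float)
        return y[0] + relu(np.subtract.outer(t, x[:-1])) @ c

    return f_hat

f = np.sin
f_hat = relu_interpolant(f, np.linspace(0.0, np.pi, 50))

t = np.linspace(0.0, np.pi, 1000)
print(np.abs(f_hat(t) - f(t)).max())  # shrinks as the number of knots grows
```

Adding knots (i.e. hidden neurons) drives the error to zero, which is the constructive intuition behind the ReLU version of the theorem.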
It is worth noticing that Cybenko's theorem does not set any limit on the number of neurons of the hidden layer which, due to the curse of dimensionality, is likely to become huge as the number of inputs increases. It turns out, however, that any Lebesgue measurable function with n inputs may be approximated by a neural network with layers having n+4 neurons at most. In the case of continuous functions, the approximating neural network may be slightly "slimmer". In particular, it has been shown that any continuous function may be approximated by a neural network comprising layers with n+1 neurons at most.
Finally, it has been shown that continuous vector-valued functions, i.e. continuous functions having vectors as output, may also be approximated by neural networks.