Deep learning frameworks like TensorFlow or PyTorch hide the complex machinery that makes neural network training possible. Specifically, what defines the network under the hood is not a Python program but a graph. But why is this the case?
Anomaly detection in production processes
The starting point of this article is the development of a product for detecting anomalies in the sensor data of industrial production systems. These anomalies can signal issues that would otherwise go undetected for a long time. A very simple example is a cold storage facility for food: if the temperature rises despite electricity flowing into the cooling unit, something is wrong—but the food can still be saved if the change is detected in time.
Unlike in the example, in many systems the relationship between sensor values is too complex to know in advance exactly what an anomaly looks like. This is where neural networks come into play: they are trained on “normal” sensor data and can then report deviations in the relationship between sensor values or over time.
As a customer requirement, the neural networks for this application must run on machines with limited computing power and memory. This rules out heavyweight frameworks like TensorFlow, which are too large and too memory-intensive.
Graphs for neural networks
In Listing 1, you can see a simple neural network in Python: an autoencoder implemented with the TensorFlow framework. If you typically write analytics code in Python, you might be surprised that TensorFlow includes its own operations for matrix calculations instead of using established libraries like NumPy. This is because tf.matmul doesn’t perform any calculations. Instead, it constructs a node in a graph that describes the neural network’s computation function. This graph is then translated into efficient code in a separate step – for example, on a GPU.
Listing 1
def __call__(self, xs):
    inputs = tf.convert_to_tensor([xs], dtype=tf.float32)
    out = tf.matmul(self.weights[0], inputs, transpose_b=True) + self.biases[0]
    out = tf.tanh(out)
    out = tf.matmul(self.weights[1], out) + self.biases[1]
    out = tf.nn.relu(out)
    out = tf.matmul(self.weights[2], out) + self.biases[2]
    out = tf.tanh(out)
    return tf.matmul(self.weights[3], out) + self.biases[3]
However, the graph serves another purpose as well: training the neural network requires the derivative of the function describing the network, not just the function itself. The derivative is calculated by an algorithm that also relies on the graph. In any case, this graph machinery is complex; specifically, it involves a complete Python frontend that translates the Python program into its graph. In the past, computing derivatives was a task deep learning developers had to perform manually.
What is a derivative and what is it for?
Neural networks (NNs) are functions with many parameters. For an NN to accomplish wondrous tasks such as generating natural language, suggesting suitable music, or detecting anomalies in production processes, these parameters must be properly tuned. As a rule, an NN is evaluated on a given training dataset, and a number is calculated based on the outputs to express their quality. The derivative of the composition of the NN and the evaluation function with respect to the NN’s parameters provides insight into how the parameters can be adjusted to improve the outputs.
You can imagine it as if you’re standing in a thicket in a hilly landscape, trying to find a path down into the valley below. The coordinates where you are currently standing represent the current parameter values, and the height of the hilly landscape represents the output of the evaluation function (higher values indicate a larger deviation from the desired result). You can only see your immediate surroundings, and you don’t know in which direction the valley ultimately lies. However, the derivative points in the direction of the steepest ascent—if you move a short distance in exactly the opposite direction and repeat this process from there, the chances are high that you will eventually reach the valley. The main job of deep learning frameworks (DLFs) in this iterative process (known, by the way, as gradient descent) is to repeatedly compute derivatives and adjust the parameters accordingly.
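The landscape metaphor can be sketched in a few lines of Python (an illustrative toy, not code from the product described above): a made-up one-parameter loss function plays the role of the hilly landscape, and repeatedly stepping against the derivative walks down into the valley.

```python
# A minimal gradient-descent sketch. The loss function and its derivative
# are made-up examples; real DLFs compute the derivative automatically.

def loss(p):
    return (p - 3.0) ** 2 + 1.0    # "landscape" with its valley at p = 3

def dloss(p):
    return 2.0 * (p - 3.0)         # derivative: points uphill

p = 0.0                            # arbitrary starting parameter value
for _ in range(100):
    p -= 0.1 * dloss(p)            # small step against the gradient

print(round(p, 3))                 # → 3.0 (very close to the valley)
```

Each iteration only uses local information (the derivative at the current point), just as the hiker in the thicket only sees their immediate surroundings.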
Derivatives
For a differentiable function f : ℝ → ℝ, the definition of the derivative at a point a, as taught in school mathematics, is:

f′(a) = lim_{h → 0} (f(a + h) − f(a)) / h
In this definition, the derivative represents the rate of change at the point a. Training a neural network involves functions from a high-dimensional space—where the parameters reside—into the real numbers, which represent the network’s performance. For a function like f : ℝᵐ → ℝ, the partial derivative with respect to the i-th component is defined by holding all other components constant:

∂f/∂xᵢ (a) = lim_{h → 0} (f(a + h·eᵢ) − f(a)) / h

Here eᵢ denotes the i-th standard basis vector.
For a differentiable function, the derivative can then be expressed as a vector of partial derivatives. As will become clear in the next section, subcomponents of neural networks must also be differentiated, and these typically produce higher-dimensional outputs as well. A function f : ℝᵐ → ℝⁿ consists of its component functions, f : x ↦ (f₁(x), …, fₙ(x)); the derivatives of these component functions are defined just as before. For such a (totally) differentiable function, the derivative Df can be represented as the so-called Jacobian matrix, where the entry in the i-th row and j-th column is the partial derivative of the i-th component function with respect to the j-th input variable. All of these values describe how the corresponding parameters need to be adjusted.
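As a concrete illustration, the Jacobian matrix can be approximated numerically by perturbing one input variable at a time (a small Python sketch with a made-up example function f; this is not how DLFs actually compute derivatives, but it makes the row/column layout tangible):

```python
# Numerically approximating the Jacobian of f : R^2 -> R^2 with
# finite differences. Rows index component functions, columns inputs.
import math

def f(x):
    return [x[0] * x[1], math.sin(x[0])]   # made-up example function

def jacobian(f, a, eps=1e-6):
    fa = f(a)
    J = []
    for i in range(len(fa)):               # i-th row: i-th component function
        row = []
        for j in range(len(a)):            # j-th column: j-th input variable
            shifted = list(a)
            shifted[j] += eps              # vary only the j-th input
            row.append((f(shifted)[i] - fa[i]) / eps)
        J.append(row)
    return J

J = jacobian(f, [2.0, 3.0])
# Exact Jacobian at (2, 3): [[3, 2], [cos(2), 0]]
```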
Derivatives and Deep Learning
To calculate Jacobian matrices, DLFs use a process called Automatic Differentiation (AD), which automates the process that is performed by hand in school mathematics. The chain rule plays a special role here, as it explains how even complex, composed functions can be differentiated. If two simpler functions f and g are composed in series (written g ∘ f), the derivative is as follows:

D(g ∘ f)(a) = Dg(f(a)) ∘ Df(a)
AD uses this rule and replaces the elementary components of a composition with their known, predefined derivatives. The individual components of the resulting expression can be evaluated in any order. Typically, this is done either right-associatively, meaning first Df and then Dg (forward AD), or left-associatively, first Dg and then Df (reverse AD).
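Forward AD can be sketched in a few lines of Python using so-called dual numbers, where each value carries its derivative along and the product and chain rules are applied operation by operation (an illustrative toy, not the implementation of any real framework):

```python
# Forward AD in miniature: a Dual number holds a value and the derivative
# of that value with respect to the chosen input.
import math

class Dual:
    def __init__(self, val, der):
        self.val, self.der = val, der
    def __add__(self, other):
        return Dual(self.val + other.val, self.der + other.der)
    def __mul__(self, other):
        # product rule: (f*g)' = f'*g + f*g'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

def sin(x):
    # chain rule with the known derivative of sin
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)

# d/dx [ sin(x) * x ] at x = 2, seeded with derivative 1
x = Dual(2.0, 1.0)
y = sin(x) * x
# y.val == sin(2) * 2 and y.der == cos(2) * 2 + sin(2)
```

Note that value and derivative travel together from input to output, which is exactly why the forward case parallelizes so naturally.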
The computational effort in the forward case depends on the input dimension – here, the number of parameters, which is often very large – while in the reverse case it depends on the output dimension (i.e. 1, the single number produced by the evaluation function). Therefore, DLFs typically implement AD as “reverse AD”.
However, the algorithms used in DLFs for reverse AD are quite complex. This is mainly because the derivative of a component can only be calculated once the output of the preceding component is known. This is not an issue for the forward case, as a function and its derivative can be computed in parallel, with the results passed to the next component. However, in the backward case, the evaluations of the functions and their derivatives occur in opposite directions.
DLFs implement reverse AD using an approach that translates the entire implementation of an NN and the associated optimization method into a computational graph. To compute a derivative, this graph is first evaluated forward, and the intermediate results of all operations are recorded. The graph is then traversed backwards to compute the actual derivative. As described in the previous section, the result is a highly complex programming model. For interested readers, we recommend reading this article [1] for a detailed description of the mathematical principles behind AD algorithms, and this article [2] for an overview of the most common implementation approaches used today.
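The tape-based flavor of this approach can be sketched in miniature (hypothetical toy code, vastly simpler than any real DLF): the forward pass records each operation on a tape, and the backward pass replays the recordings in reverse order.

```python
# A toy "tape" for reverse AD. Forward evaluation records, for every
# operation, how to push gradients back to its inputs; the backward
# pass then traverses these records in reverse.

TAPE = []

class Var:
    def __init__(self, val):
        self.val, self.grad = val, 0.0
    def __mul__(self, other):
        out = Var(self.val * other.val)
        def back():
            self.grad += out.grad * other.val   # d(out)/d(self)
            other.grad += out.grad * self.val   # d(out)/d(other)
        TAPE.append(back)
        return out
    def __add__(self, other):
        out = Var(self.val + other.val)
        def back():
            self.grad += out.grad
            other.grad += out.grad
        TAPE.append(back)
        return out

def backward(out):
    out.grad = 1.0
    for back in reversed(TAPE):   # walk the tape backwards
        back()

x, y = Var(3.0), Var(4.0)
z = x * y + x    # forward pass: records two operations
backward(z)      # reverse pass: x.grad == 5.0, y.grad == 3.0
```

Notice that the backward pass needs the intermediate values recorded during the forward pass, which is precisely the complication described above.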
Derivatives revisited
However, derivatives can also be defined in a more general way: for (normed, finite-dimensional) vector spaces V and W, a function f : V → W is differentiable at a point a ∈ V if there exists a linear map Dfa : V ⊸ W such that f(a + h) = f(a) + Dfa(h) + r(h), where the error term r(h) = f(a + h) − f(a) − Dfa(h) satisfies:

lim_{h → 0} ‖r(h)‖ / ‖h‖ = 0
If such a map Dfa exists, it is unique and is called the derivative of f at a. This perspective treats derivatives as local, linear approximators, without focusing on details such as matrix representations, partial derivatives, or the like. While it is more general and abstract, it is simpler than the concept of the “derivative as a number”. Furthermore, a DLF can use it to compute derivatives without relying on graphs or complex algorithms.
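The definition can be verified numerically. The following Python sketch uses f(x) = x² at a = 3, where the linear approximator is Dfa(h) = 6h, and shows that the error term r(h) vanishes faster than h itself:

```python
# Checking "derivative = local linear approximator" numerically
# for the made-up example f(x) = x^2 at a = 3.

def f(x):
    return x * x

a = 3.0
Df_a = lambda h: 6.0 * h            # the linear map approximating f near a

for h in [1.0, 0.1, 0.01]:
    r = f(a + h) - f(a) - Df_a(h)   # error term r(h) = h^2 here
    print(h, r / h)                 # r(h)/h shrinks towards 0 with h
```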
Deep Learning with Abstract Nonsense
The ConCat framework for deep learning uses this new perspective (Elliott, Conal: “The simple essence of automatic differentiation” [3], [4], [5]) by interpreting the function that defines the neural network differently – namely, computing its derivative rather than the original function itself. Another branch of mathematics, known as category theory—and affectionately called ‘abstract nonsense’ by connoisseurs—offers yet another way of looking at functions. It’s a treasure trove of useful concepts for software development, particularly in the realm of functional programming.
Category theory is, of course, about categories – a category is a small world containing so-called objects and arrows (also called “morphisms”) between these objects. The term “object” here has nothing to do with object-oriented programming. For example, functions form a category – the objects are sets that can serve as the input or output of a function. However, it’s important to note that objects don’t have to be sets, and arrows don’t have to be functions either.
Category theory, then, “forgets” almost everything we know about functions and assumes only that arrows can be composed – that is, chained together. So, if there’s an arrow f from an object a to an object b and another arrow g from b to a third object c, then there must also be a uniquely defined arrow from a to c, written g ∘ f. In the category of functions, this is simply function composition as described above, which is why the same symbol is used.
Composition must also satisfy the associative law, and there must be an identity arrow for every object – though this isn’t particularly relevant for our purposes here. Some categories also come with additional features, such as products, which correspond to data structures in programming.
The analogy between arrows and functions in a computer program goes so far that every function can be expressed as an arrow in the category of functions. And here’s the clever trick behind the ConCat framework: it takes a function definition from a program, translates it into the language of category theory, but then “forgets” which category it came from. This makes it possible to target an entirely different category instead. In the case of deep learning, this is the category of function derivatives.
The idea, then, is to interpret a “normal” function as its derivative. Here’s an example: f(x) = x².
f no longer represents the square function itself, but its derivative Df(x) = 2x. However, to allow for the definition of more complex functions, composition must be supported – meaning that from the two derivatives Df and Dg of f and g respectively, the derivative of g ∘ f must be computable. The chain rule mentioned above would seem to do the job, but it doesn’t rely solely on Df and Dg: it also requires access to the original function f, as well as a representation of the derivative as a linear map. The ConCat framework therefore defines a category that pairs each function with its derivative. This way, the derivative is computed automatically whenever the function is called.
Thus, a function f : a → b turns into a function of the following form:

a → b × (a ⊸ b)
This means that for every input x from a, the result now includes both the original function value f(x) in b and the derivative at x, which is a linear map from a to b. This is now enough to express the chain rule “compositionally”, that is, in a way that makes the derivative of the composition, D(g ∘ f), depend solely on Df and Dg. Here’s how it works:

(g ∘ f)(x) = (z, g′ ∘ f′)

where

(y, f′) = f(x)

and

(z, g′) = g(y)

apply.
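Translated into Python for illustration (the names square, double, and compose are made up for this sketch; ConCat itself is a Haskell framework, as the next section shows), the pairing and the compositional chain rule look like this:

```python
# Sketch of the "function paired with its derivative" idea:
# a differentiable function becomes x -> (f(x), Df_x),
# with Df_x represented as a (linear) Python function.

def square(x):
    return x * x, (lambda h: 2 * x * h)       # value and derivative at x

def double(y):
    return 2 * y, (lambda h: 2 * h)

def compose(g, f):
    def gf(x):
        y, fprime = f(x)                      # (y, f') = f x
        z, gprime = g(y)                      # (z, g') = g y
        return z, (lambda h: gprime(fprime(h)))   # (z, g' . f')
    return gf

h = compose(double, square)
value, deriv = h(3.0)
# value == 18.0 and deriv(1.0) == 12.0, i.e. d/dx 2x^2 at x = 3
```

Note that compose uses nothing but the two pairs – exactly the compositionality that plain Df and Dg lack.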
Category Theory and Functional Programming
The mathematical idea from the previous section can also be implemented in Python, but it’s particularly simple and direct in the functional language Haskell [6], which is what the ConCat framework is written in. It includes a literal translation of the mathematical definition; GD stands for General Derivative:
data GD a b = D (a -> b :* (a -+> b))
The chain rule also mirrors the mathematical definition (with \ denoting a lambda expression in Haskell):
D g . D f = D (\ x -> let (y,f') = f x
                          (z,g') = g y
                      in (z, g' . f'))
This code forms the foundation of the deep learning framework, which comprises just a few hundred lines and comes bundled with ConCat. The framework also includes a Haskell compiler plug-in that automatically translates regular Haskell code into categorical form. You could write the code this way from the outset, but the plug-in makes the task much easier. This approach forms the core of the production system used for the anomaly detection described at the beginning.
What about reverse AD?
But there’s one more thing: deep learning requires the reverse derivative, and D only gives the forward derivative. The reverse derivative is obtained using a surprisingly simple trick—also from category theory. You can create a new category from an existing one by simply reversing the arrows. To do this, the linear functions a -+> b in the definition of GD only needs to be replaced with:
data Dual a b = Dual (b -+> a)
Dual also includes correspondingly “reversed” definitions of the linear functions, along with a reversed version of composition. With that, reverse AD is ready – without any complicated algorithm.
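To see the reversal in miniature, one can represent linear maps as plain matrices: the dual of a matrix is its transpose, and duals compose in the opposite order (a Python sketch with hand-rolled helpers, not ConCat code):

```python
# "Reversing the arrows" for linear maps represented as matrices:
# the dual of a matrix is its transpose, and (g . f)^T = f^T . g^T,
# so duals compose in the opposite order.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

f = [[1, 2], [3, 4]]    # linear map f : R^2 -> R^2 as a matrix
g = [[0, 1], [1, 0]]

forward = matmul(g, f)                          # g . f
reverse = matmul(transpose(f), transpose(g))    # f^T . g^T
# reverse == transpose(forward): same map, arrows reversed
```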
High Performance with Composition
Reverse AD removes one of the reasons why TensorFlow must take the complicated route with a computational graph. But there’s still a second one: execution on a GPU. So far, the code still runs as regular Haskell code on the CPU. Fortunately, there’s Accelerate [7], a powerful Haskell library that enables high-performance numerical computing on the GPU.
Accelerate only has one catch: it can’t run standard Haskell code that operates directly on matrices. A function that transforms a matrix of type a into a matrix of type b has the following type in Accelerate:
Acc a -> Acc b
Acc is a data type for GPU code that Accelerate assembles behind the scenes. Accordingly, an Accelerate program cannot use the regular matrix operations, but must instead rely on operations that work on Acc. This is similar to the Python program that generates a graph – that’s just the nature of GPU programming. Here, however, it doesn’t leak into the definitions of the neural networks or their derivatives, and it has no impact on how the ConCat framework is used.
ConCat has a solution for this too – the Accelerate functions can be defined as a category as well. So, all it takes is one more application of ConCat to the result of reverse AD, and the result is high-performance GPU code.
Summary
Understanding the foundations of deep learning isn’t all that hard when using the right kind of math. It’s easy to get lost in the weeds of Jacobian matrices and reverse AD, but category theory offers a more abstract and elegant perspective that, remarkably, also captures reverse AD with ease. At the same time, the math serves as a blueprint for implementation – at least in the right programming language, such as the functional language Haskell. Haskell is well worth a closer look, especially for deep learning – though its strengths go far beyond that.
Links & Literature
[1] Hoffmann, Philipp: “A hitchhiker’s guide to automatic differentiation”: https://arxiv.org/abs/1411.0583
[2] Van Merriënboer, Bart et al.: “Automatic differentiation in ML: Where we are and where we should be going”: https://arxiv.org/abs/1810.11530
[3] Elliott, Conal: “The simple essence of automatic differentiation”: http://conal.net/papers/essence-of-ad/
[4] Conal Elliott: “Compiling to categories”: http://conal.net/papers/compiling-to-categories/
[5] ConCat Framework: https://github.com/compiling-to-categories/concat
[6] Haskell: https://www.haskell.org
[7] Accelerate: https://www.acceleratehs.org