Master Advanced ML Development: Cutting-Edge Techniques

Deep Learning with Category Theory for Developers

skansal — Tue, 29 Apr 2025 09:31:46 +0000

Deep learning frameworks like TensorFlow or PyTorch hide the complex machinery that makes neural network training possible. Specifically, it’s not a Python program running in the background that defines the network, but a graph. But why is this the case?

Anomaly detection in production processes

The starting point of this article is the development of a product for detecting anomalies in the sensor data of industrial production systems. These anomalies can signal issues that would otherwise go undetected for a longer time. A very simple example is a cold storage facility for food: if the temperature rises despite electricity flowing into the cooling unit, something is wrong—but the food can still be saved if the change is detected in time.

Unlike in the example, in many systems the relationship between sensor values is too complex to know in advance exactly what an anomaly looks like. This is where neural networks come into play: they are trained on “normal” sensor data and can then report deviations in the relationship between sensor values or over time.

As a customer requirement, the neural networks for this application must run on machines with limited computing power and memory. This rules out heavy frameworks like TensorFlow which are too large and memory intensive.

Graphs for neural networks

In Listing 1, you can see a simple neural network in Python, an autoencoder implemented with the TensorFlow framework. If you typically write analytics code in Python, you might be surprised that TensorFlow includes its own operations for matrix calculations instead of using established libraries like Pandas. This is because tf.matmul doesn’t perform any calculations. Instead, it constructs a node in a graph that describes the neural network’s computation function. This graph is then translated into efficient code in a separate step – for example, on a GPU.

Listing 1

def __call__(self, xs):

  inputs  =  tf.convert_to_tensor([xs], dtype=tf.float32)

  out = tf.matmul(self.weights[0], inputs, transpose_b=True) + self.biases[0]

  out = tf.tanh(out)

  out = tf.matmul(self.weights[1], out) + self.biases[1]

  out = tf.nn.relu(out)

  out = tf.matmul(self.weights[2], out) + self.biases[2]

  out = tf.tanh(out)

  return tf.matmul(self.weights[3], out) + self.biases[3]

However, graph serves another purpose: training the neural network requires its derivative, not the function describing the network. The derivative is calculated by an algorithm that also relies on the graph. In any case, this graph machinery is complex; to be specific, it involves a complete Python frontend that translates the Python program into its graph. In the past, this was a task deep learning developers had to do manually.

What is a derivative and what is it for?

Neural networks (NNs) are functions with many parameters. For an NN to accomplish wondrous tasks such as generating natural language, suggesting suitable music, or detecting anomalies in production processes, these parameters must be properly tuned. As a rule, an NN is evaluated on a given training dataset, and a number is calculated based on the outputs to express their quality. The derivative of the composition of the NN and the evaluation function with respect to the NN’s parameters provides insight into how the parameters can be adjusted to improve the outputs.

You can imagine it as if you’re standing in a thicket in a hilly landscape, trying to find a path down into the valley below. The coordinates where you are currently standing represent the current parameter values, and the height of the hilly landscape represents the output of the evaluation function (higher values indicate a larger deviation from the desired result). You can only see your immediate surroundings and you don’t know ultimately in which direction the valley lies. However, the derivative points in the direction of the steepest ascent—if you move a short distance in exactly the opposite direction and repeat this process from there, the chances are high that you will eventually reach the valley. The main job of deep learning frameworks (DLFs) in this iterative process (known as gradient descent, by the way), is to repeatedly compute derivatives and adjust the parameters accordingly.

Derivatives

For a differentiable function f : ℝ → ℝ the definition of the derivative at a point a as taught in school mathematics, is:

In this definition, the derivative represents the rate of change at the point a. Training a neural network involves functions from a high-dimensional space—where the parameters reside—into the real numbers, which represent the network’s performance. For a function like f : ℝm → ℝ, the partial derivative with respect to the i-th component is defined by holding all other components constant:

For a differentiable function, the derivative can then be expressed as a vector of partial derivatives. As will become clear in the next section, subcomponents of neural networks must also be differentiated. These typically produce higher-dimensional outputs as well. A function f: ℝm → ℝn consists of its component functions,f: x ↦ (f₁(x), …, f_n(x)); the derivatives of these component functions are defined just as before. For such a (totally) differentiable function, the derivative (Df) can be represented as the so-called Jacobian matrix, where the entry in the i-th row and j-th column is the partial derivative of the i-th component function with respect to the j-th input variable. All of these values describe how the corresponding parameters need to be adjusted.

Derivatives and Deep Learning

To calculate Jacobian matrices, DLFs use a process called Automatic Differentiation (AD), which automates the process that is manually performed in school mathematics. The chain rule plays a special role here, explaining how even complex functions can be differentiated. If two simpler functions f and g are composed in series (written g ∘ f) the derivative is as follows:

AD uses this rule and replaces the elementary components of a composition with their known, predefined derivatives. The individual components of the resulting expression can be evaluated in any order. Typically, this is done either right-associatively, meaning first Ｄf and then Ｄg (forward AD), or left-associatively, first Ｄg and then Ｄf (reverse AD).

The computational effort associated with the forward case depends on the input dimension, in this case, the number of parameters, which is often very large. While for the reverse case, it depends on the output dimension (i. e 1, which is the output of the evaluation function). Therefore, DLFs typically implement AD as “reverse AD”.

However, the algorithms used in DLFs for reverse AD are quite complex. This is mainly because the derivative of a component can only be calculated once the output of the preceding component is known. This is not an issue for the forward case, as a function and its derivative can be computed in parallel, with the results passed to the next component. However, in the backward case, the evaluations of the functions and their derivatives occur in opposite directions.

DLFs implement reverse AD using an approach that translates the entire implementation of an NN and the associated optimization method into a computational graph. To compute a derivative, this graph is first evaluated forward, and the intermediate results of all operations are recorded. The graph is then traversed backwards to compute the actual derivative. As described in the previous section, the result is a highly complex programming model. For interested readers, we recommend reading this article [1] for a detailed description of the mathematical principles behind AD algorithms, and this article [2] for an overview of the most common implementation approaches used today.

Derivatives revisited

However, derivatives can also be defined in a more general way: For (normalized, finite-dimensional) vector spaces V and W a function f : V → W is differentiable at a point a ∈ V0, if there exists a linear map Ｄf_a : V ⊸ W such that f(a + h) = f(a) + Ｄf_a(h) + r(h), where the error term r(h) = f (a + h) − f (a) − Ｄf_a(h) holds:

If such a map Ｄf_a, exists, it is unique and is called the derivative of f at a. This perspective treats derivatives as local, linear approximators, without focusing on details such as matrix representations, partial derivatives, or the like. While it is more general and abstract, it is simpler than the concept of the “derivative as a number”. Furthermore, a DLF can use it to compute derivatives without relying on graphs or complex algorithms.

Deep Learning with Abstract Nonsense

The ConCat framework for deep learning uses this new perspective (Elliott, Conal: “The simple essence of automatic differentiation” [3], [4], [5]), by interpreting the function that defines the neural network differently – namely, by computing its derivative rather than the original function itself. Another branch of mathematics, known as category theory—and affectionately called ‘abstract nonsense’ by connoisseurs—offers yet another way of looking at functions. It’s a treasure trove of useful concepts for software development, particularly in the realm of functional programming.

Category theory is, of course, about categories – a category is a small world containing so-called objects and arrows (“arrows” or “morphisms”) between these objects. The term “object” here has nothing to do with object-oriented programming. For example, functions form a category – the objects are sets that can serve as the input or output of a function. However, it’s important to note that objects don’t have to be sets and arrows don’t have to be functions either.

Category theory, then, “forgets” almost everything we know about functions and assumes only that arrows can be composed – that is, chained together. So, if there’s an arrow f from an object a to an object b and another arrow g from b to a third object c, then there must also be a uniquely defined arrow a to c, written g ∘ f. In the category of functions, this is simply function composition as described above, which is why the same symbol is used.

Composition must also satisfy the associative law, and there must be an identity arrow between any two objects — though this isn’t particularly relevant for our purposes here. Some categories also come with additional features, such as products which correspond to data structures in programming.

The analogy between arrows and functions in a computer program goes so far that every function can be expressed as an arrow in the category of functions. And here’s the clever trick behind the ConCat framework: it takes a function definition from a program, translates it into the language of category theory, but then “forgets” which category it came from. This makes it possible to target an entirely different category instead. In the case of deep learning, this is the category of function derivatives.

The idea, then, is to interpret a “normal” function as its derivative. Here’s an example: f (x) = x².

f no longer represents the square function itself, but its derivative Ｄf(x) = 2x. However, to allow for the definition of more complex functions, composition must be supported — meaning that from the two derivatives Ｄf and Ｄg of f and g respectively, the derivative of g ∘ f must be computable. This might work using the chain rule mentioned above, but it doesn’t rely solely on Ｄf and Ｄg. It also requires access to the original function f, as well as a representation of the derivative as a linear map. The ConCat framework therefore defines a category that pairs each function with its derivative. This way, the derivative is computed automatically whenever the function is called.

Thus, a function f : a → b turns into a function of the following form:

This means that for every input x from a, the result now includes both the original function value f (x) in b and the original derivative Ｄf, which is a linear map from a to b. This is now enough to express the chain rule “compositionally”, that is, in a way that makes the derivative of the composition, Ｄ (g ∘ f), depend solely on Ｄf and Ｄg. Here’s how it works:

Where

and

apply.

Category Theory and Functional Programming

The mathematical idea from the previous section can also be implemented in Python, but it’s particularly simple and direct in the functional language Haskell [6] which is what the ConCat framework is written in. It includes a literal translation of the mathematical definition; GD stands for General Derivative:

data GD a b = D (a -> b :* (a -+> b))

The chain rule also mirrors the mathematical definition (with \ denoting a lambda expression in Haskell):

D g . D f = D (\ x -> let (y,f’) = f x

(z,g’) = g y

in (z, g’ . f’))

This code forms the foundation of the deep learning framework, which comprises just a few hundred lines and comes bundled with ConCat. The framework also includes a Haskell compiler plug-in that automatically translates regular Haskell code into categorical form. You could write the code this way from the outset, but the plug-in makes the task much easier. This approach forms the core of the production system used for the anomaly detection described at the beginning.

What about reverse AD?

But there’s one more thing: deep learning requires the reverse derivative, and Ｄ only gives the forward derivative. The reverse derivative is obtained using a surprisingly simple trick—also from category theory. You can create a new category from an existing one by simply reversing the arrows. To do this, the linear functions a -+> b in the definition of GD only needs to be replaced with:

data Dual a b = Dual (b -+> a)

Dual Dual also includes correspondingly “reversed” definitions of the linear functions along with a reversed version of the composition. But then the reverse AD is ready, without any complicated algorithm.

High Performance with Composition

Reverse AD removes one of the reasons why TensorFlow must take the complicated route with a computational graph. But there’s still a second one: execution on a GPU. So far, the code still runs as regular Haskell code on the CPU. Fortunately, there’s Accelerate [7], a powerful Haskell library that enables high-performance numerical computing on the GPU.

Accelerate only has one catch: it can’t run standard Haskell code that operates directly on matrices. A function that transforms a matrix of type Typ a into a matrix of type b has the following type in Accelerate:

Acc a -> Acc b

Acc is a data type for GPU code that Accelerate assembles behind the scenes. Accordingly, an Accelerate program cannot use the regular matrix operations, but must instead rely on operations that work on Acc. This is similar to the Python program that generates a graph. That’s just the nature of GPU programming. Here, it doesn’t leak into the definitions of the neural networks, or their derivatives and it has no impact on how the ConCat framwork is used.

ConCat has a solution for this too – the Accelerate functions can be defined as a category as well. So, all it takes is one more application of ConCat to the result of reverse AD, and the result is high-performance GPU code.

Summary

Understanding the foundations of deep learning isn’t all that hard when using the right kind of math. It’s easy to get lost in the weeds of Jacobian and reverse AD matrices, but category theory offers a more abstract and elegant perspective that, remarkably, also captures reverse AD with surprising ease. At the same time, the math also serves as a blueprint for implementation – at least in the right programming language, such as the functional language Haskell. Haskell is well worth a closer look, especially for deep learning – though its strengths go far beyond that.

Links & Literature

[1] Hoffmann, Philipp: “A hitchhiker’s guide to automatic differentiation”: https://arxiv.org/abs/1411.0583

[2] Van Merriënboer, Bart et al: “Automatic differentiation in ML. Where we are and where we should be going”: https://arxiv.org/abs/1810.11530

[3] Elliott, Conal: “The simple essence of automatic differentiation”: http://conal.net/papers/essence-of-ad/

[4] Conal Elliott: “Compiling to categories”: http://conal.net/papers/compiling-to-categories/

[5] ConCat Framework: https://github.com/compiling-to-categories/concat

[6] Haskell: https://www.haskell.org

[7] Accelerate: https://www.acceleratehs.org

The post Deep Learning with Category Theory for Developers appeared first on ML Conference.

Using OpenAI’S CLIP Model on the iPhone: Semantic Search For Your Own Pictures

lseidler — Wed, 02 Aug 2023 08:35:24 +0000

OpenAI’s CLIP Model

I first encountered the CLIP model in early 2022 while experimenting with the AI drawing model. CLIP (Contrastive Language-Image Pre-Training) is a model proposed by OpenAI in 2021. CLIP can encode images and text into representations that can be compared in the same space. CLIP is the basis for many text-to-image models (e.g. Stable Diffusion) to calculate the distance between the generated image and the prompt during training.

Fig. 1: OpenAI’s CLIP model, source: https://openai.com/blog/clip/

As shown above, the CLIP model consists of two components: Text Encoder and Image Encoder. Let’s take the ViT-B-32 (different models have different output vector sizes) version as an example:

Text Encoder can encode any text (length <77 tokens) into a 1×512 dimensional vector.
Image Encoder can encode any image into a 1×512 dimensional vector.

By calculating the distance or cosine similarity between the two vectors, we can compare the similarity between a piece of text and an image.

Stay up to date

Learn more about MLCON

Image Search on a Server

I found this to be quite fascinating, as it was the first time images and text could be compared in this way. Based on this principle, I quickly set up an image search tool on a server. First, process all images through CLIP, obtaining their image vectors, they should be a list of 1×512 vectors.

# get all images list.

img_lst = glob.glob(‘imgs/*jpg’)

img_features = []

# calculate vector for every image.

for img_path in img_lst:

image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)

image_features = model.encode_image(image)

img_features.append(image_features)

Then, given the search text query, calculate its text vector (with a size of 1×512) and compare similarity with each image vector in a for-loop.

text_query = ‘lonely’

# tokenize the query then put it into the CLIP model.

text = clip.tokenize([text_query]).to(device)

text_feature = model.encode_text(text)

# compare vector similary with each image vector

sims_lst = []

for img_feature in img_features:

sim = cosin_similarity(text_feature, img_feature)

sims_lst.append(sim.item())

Finally, display the top K results in order. Here I return the top3 ranked image files, and display the most relevant result.

K = 3

# sort by score with np.argsort

sims_lst_np = np.array(sims_lst)

idxs = np.argsort(sims_lst_np)[-K:]

# display the most relevant result.

imagedisplay(filename=img_lst[idxs[-1]])

I discovered that its image search results were far superior to those of Google, here are the top 3 results when I search for the keyword “lonely”:

Integrating CLIP into iOS with Swift

After marveling at the results, I wondered: Is there a way to bring CLIP to mobile devices? After all, the place where I store the most photos is neither my MacBook Air nor my server, but rather my iPhone.

To port a large GPU-based model to the iPhone, operator support and execution efficiency are the two most critical factors.

1. Operator Support

Fortunately, in December 2022, Apple demonstrated the feasibility of porting Stable Diffusion to iOS, proving that the deep learning operators needed for CLIP are supported in iOS 16.0.

Fig. 2: Pictures generated by Stable Diffusion

2. Execution Efficiency

Even with operator support, if the execution efficiency is extremely slow (for example, calculating vectors for 10,000 images takes half an hour, or searching takes 1 minute), porting CLIP to mobile devices would lose its meaning. These factors can only be determined through hands-on experimentation.

I exported the Text Encoder and Image Encoder to the CoreML model using the coremltools library. The final models has a total file size of 300MB. Then, I started writing Swift code.

I use Swift to load the Text/Image Encoder models and calculate all the image vectors. When users input a search keyword, the model first calculates the text vector and then computes its cosine similarity with each of the image vectors individually.

The core code is as follows:

// load the Text/Image Encoder model.

let text_encoder = try MLModel(contentsOf: TextEncoderURL, configuration: config)

let image_encoder = try MLModel(contentsOf: ImageEncoderURL, configuration: config)

// given a prompt/photo, calculate the CLIP vector for it.

let text_feature = text_encoder.encode(“a dog”)

let image_feature = image_encoder.encode(“a dog”)

// compute the cosine similarity.

let sim = cosin_similarity(img_feature, text_feature)

As a SwiftUI beginner, I found that Swift doesn’t have a specific implementation for cosine similarity. Therefore, I used Accelerate to write one myself, the code below is a Swift translation of cosine similarity from Wikipedia.

import Accelerate

func cosine_similarity(A: MLShapedArray, B: MLShapedArray) -> Float {

let magnitude = vDSP.rootMeanSquare(A.scalars) * vDSP.rootMeanSquare(B.scalars)

let dotarray = vDSP.dot(A.scalars, B.scalars)

return dotarray / magnitude

}

The reason I split Text Encoder and Image Encoder into two models is because, when actually using this Photos search app, your input text will always change, but the content of the Photos library is fixed. So all your image vectors can be computed once and saved in advance. Then, the text vector is computed for each of your searches.

Furthermore, I implemented multi-core parallelism when calculating similarity, significantly increasing search speed: a single search for less than 10,000 images takes less than 1 second. Thus, real-time text searching from tens of thousands of Photos library becomes possible.

Below is a flowchart of how Queryable works:

Fig. 3: How the app works

Performance

But, compared to the search function of the iPhone Photos, how much does the CLIP-based album search capability improve? The answer is: overwhelmingly better. With CLIP, you can search for a scene in your mind, a tone, an object, or even an emotion conveyed by the image.

Fig. 4: Search for a scene, an object, a tone or the meaning related to the photo with Queryable.

To use Queryable, you need to first build the index, which will traverse your album, calculate all the image vectors and store them. This takes place only once, the total time required for building the index depends on the number of your photos, the speed of indexing is ~2000 photos per minute on the iPhone 12 mini. When you have new photos, you can manually update the index, which is very fast.

In the latest version, you have the option to grant the app access to the network in order to download photos stored on iCloud. This will only occur when the photo is included in your search results, the original version is stored on iCloud, and you have navigated to the details page and clicked the download icon. Once you grant the permissions, you can close the app, reopen it, and the photos will be automatically downloaded from iCloud.

3. Any requirements for the device?

iOS 16.0 or above
iPhone 11 (A13 chip) or later models

The time cost for a search also depends on your number of photos: for <10,000 photos it takes less than 1s. For me, an iPhone 12 mini user with 35,000 photos, each search takes about 2.8s.

Q&A on Queryable

1.On Privacy and security issues.

Queryable is designed as an OFFLINE app that does not require a network connection and will never request network access, thereby avoiding privacy issues.

2. What if my pictures are stored on iCloud?

Due to the inability to connect to a network, Queryable can only use the cache of the low-definition version of your local Photos album. However, the CLIP model itself resizes the input image to a very small size (e.g. ViT-B-32 is 224×224), so if your image is stored on iCloud, it actually does not affect search accuracy except that you cannot view its original image in search result.

The post Using OpenAI’S CLIP Model on the iPhone: Semantic Search For Your Own Pictures appeared first on ML Conference.

Real-time anomaly detection with Kafka and Isolation Forests

btessari — Thu, 07 Jan 2021 08:43:20 +0000

Detecting anomalies in data series can be valuable in many contexts: from anticipatory waiting to monitoring resource consumption and IT security. The time factor also plays a role in detection: the earlier the anomaly is detected, the better. Ideally, anomalies should be detected in real time and immediately after they occur. This can be achieved with just a few tools. In the following, we take a closer look at Kafka, Docker and Python with scikit-learn. The complete implementation of the listed code fragments can be found under github.com.

Stay up to date

Learn more about MLCON

Apache Kafka

The core of Apache Kafka is a simple idea: a log to which data can be attached at will. This continuous log is then called a stream. On the one hand, so-called producers can write data into the stream, on the other hand, consumers can read this data. This simplicity makes Kafka very powerful and usable for many purposes. For example, sensor data can be sent into the stream directly after the measurement. On the other hand, an anomaly detection software can read and process the data. But why should there even be a stream between the data source (sensors) and the anomaly detection instead of sending the data directly to the anomaly detection?

There are at least three good reasons for this. First, Kafka can run distributed in a cluster and offer high reliability (provided that it is actually running in a cluster of servers and that another server can take over in case of failure). If the anomaly detection fails, the stream can be processed after a restart and continue where it left off last. So no data will be lost.

Secondly, Kafka offers advantages when a single container for real-time anomaly detection is not working fast enough. In this case, load balancing would be needed to distribute the data. Kafka solves this problem with partitions. Each stream can be divided into several partitions (which are potentially available on several servers). A partition is then assigned to a consumer. This allows multiple consumers (also called consumer groups) to dock to the same stream and allows the load to be distributed between several consumers.

Thirdly, it should be mentioned that Kafka enables the functionalities to be decoupled. If, for example, the sensor data are additionally stored in a database, this can be realized by an additional consumer group that docks onto the stream and stores the data in the database. This is why companies like LinkedIn, originally developed Kafka, have declared Kafka to be the central nervous system through which all data is passed.

THE PECULIARITIES OF ML SYSTEMS

Machine Learning Advanced Developments

Learn more

A Kafka cluster can be easily started using Docker. A Docker-Compose-File is suitable for this. It allows the user to start several Docker containers at the same time. A Docker network is also created (here called kafka_cluster_default), which all containers join automatically and which enables their communication. The following commands start the cluster from the command line (the commands can also be found in the README.md in the GitHub repository for copying):

git clone https://github.com/NKDataConv/anomalie-erkennung.git
cd anomaly detection/
docker-compose -f ./kafka_cluster/docker-compose.yml up -d –build

Here the Kafka Cluster consists of three components:

The broker is the core component and contains and manages the stream.
ZooKeeper is an independent tool and is used at Kafka to coordinate several brokers. Even if only one broker is started, ZooKeeper is still needed.
The schema registry is an optional component, but has established itself as the best practice in connection with Kafka.

The Schema Registry solves the following problem: In practice, the responsibilities for the producer (i.e. origin of the data) are often detached from the responsibilities for the consumer (i.e. data processing). In order to make this cooperation smooth, the Schema Registry establishes a kind of contract between the two parties. This contract contains a schema for the data. If the data does not match the schema, the Schema Registry does not allow the data to be added to the corresponding stream. The Schema Registry therefore checks that the schema is adhered to each time the data is sent. This schema is also used to serialize the data. This is the avro format. Serialization with avro allows you to further develop the schema, for example, to add additional data fields. The downward compatibility of the schema is always ensured. A further advantage of the Schema Registry in cooperation with Avro is that the data in the Kafka stream ends up without the schema, because it is stored in the Schema Registry. This means that only the data itself is in the stream. In contrast, when sending data in JSON format, for example, the names of all fields would have to be sent with each message. This makes the Avro format very efficient for Kafka.

To test the Kafka cluster, we can send data from the command line into the stream and read it out again with a command line consumer. To do this we first create a stream. In Kafka, a single stream is also called a topic. The command for this consists of two parts. First, we have to take a detour via the Docker-Compose-File to address Kafka. The command important for Kafka starts with kafka-topics. In the following the topic test-stream is created:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-topics –create –bootstrap-server localhost:9092 –topic test-stream

Now we can send messages to the topic. To do this, messages can be sent with ENTER after the following command has been issued:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-console-producer –broker-list localhost:9092 –topic test-stream

At the same time, a consumer can be opened in a second terminal, which outputs the messages on the command line:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-console-consumer –bootstrap-server localhost:9092 –topic test-stream –from-beginning

Both processes can be aborted with CTRL + C. The Schema Registry has not yet been used here. It will only be used in the next step.

Kafka Producer with Python

There are several ways to implement a producer for Kafka. Kafka Connect allows you to connect many external tools such as databases or the file system and send their data to the stream. But there are also libraries for various programming languages. For Python, for example, there are at least three working, actively maintained libraries. To simulate data in real time, we use a time series that measures the resource usage of a server. This time series includes CPU usage and the number of bytes of network and disk read operations. This time series was recorded with Amazon CloudWatch and is available on Kaggle. An overview of the data is shown in Figure 1, with a few obvious outliers for each time series.

Fig. 1: Overview of Amazon CloudWatch data

In the time series, the values are recorded at 5-minute intervals. To save time, they are sent in the stream every 1 second. In the following, an image for the Docker Container is created. The code for the producer can be viewed and adjusted in the subfolder producer. Then the container is started as part of the existing network kafka_cluster_default.

docker build ./producer -t kafka-producer
docker run –network=”kafka_cluster_default” -it kafka-producer

Now we want to turn to the other side of the stream.

Kafka Consumer

For an anomaly detection it is interesting to implement two consumers. One consumer for real-time detection, the second for training the model. This consumer can be stopped after the training and restarted for a night training. This is useful if the data has changed. Then the algorithm is trained again and a new model is saved. However, it is also conceivable to trigger this container at regular intervals using a cron job. In the productive case, however, this should not be given a completely free hand. An intermediate step with a check of the model makes sense at this point – more about this later.

The trained model is then stored in the Docker volume. The volume is mounted in the subfolder data. The first consumer has access to this Docker Volume. This consumer can load the trained model and evaluate the data for anomalies.

Technologically, libraries also exist for the consumer in many languages. Since Python is often the language of choice for data scientists, it is the obvious choice. It is interesting to note that the schema no longer needs to be specified for implementation on the consumer side. It is automatically obtained via the schema registry, and the message is deserialized. An overview of the containers described so far is shown in Figure 2. Before we also run the consumer, let’s take a closer look at what is algorithmically done in the containers.

Fig. 2: Overview of docker containers for anomaly detection

Isolation Forests

For the anomaly detection a colourful bouquet of algorithms exists. Furthermore, the so-called no-free-lunch theorem makes it difficult for us to choose. The main point of this theorem is not that we should take money with us into lunch break, but that there is nothing free when choosing the algorithm. In concrete terms, this means that we don’t know whether a particular algorithm works best for our data unless we test all algorithms (with all possible settings) – an impossible task. Nevertheless, we can sort the algorithms a bit.

Basically, in anomaly detection one can first distinguish between supervised and unsupervised learning. If the data indicates when an anomaly has occurred, this is referred to as supervised learning. Classification algorithms are suitable for this problem. Here, the classes “anomaly” and “no anomaly” are used. If, on the other hand, the data set does not indicate when an anomaly has occurred, one speaks of unsupervised learning. Since this is usually the case in practice, we will concentrate here on the approaches for unsupervised learning. Any prediction algorithm can be used for this problem. The next data point arriving is predicted. When the corresponding data point arrives, it is compared with the forecast. Normally the forecast should be close to the data point. However, if the forecast is far off, the data point is classified as an anomaly.

What “far off” exactly means can be determined with self-selected threshold values or with statistical tests. The Seasonal Hybrid ESD algorithm developed by Twitter, for example, does this. A similar approach is pursued with autoencoders. These are neural networks that reduce the dimensions of the data in the first step. In the second step, the original data is to be restored from the dimension reduction. Normally, this works well. If, on the other hand, a data point does not succeed, it is classified as an anomaly. Another common approach is the use of One-Class Support Vector Machines. These draw as narrow a boundary as possible around the training data. A few data points are the exception and are outside the limits. The number of exceptions must be given to the algorithm as a percentage and determines how high the percentage of detected anomalies will be. There must therefore be some premonition about the number of anomalies. Each new data point is then checked to see whether it lies within or outside the limits. Outside corresponds to an anomaly.

MYRIAD OF TOOLS & FRAMEWORKS

Tools, APIs & Frameworks

Learn more

Another approach that works well is Isolation Forest. The idea behind this algorithm is that anomalies can be distinguished from the rest of the data with the help of a few dividing lines (Figure 3). Data points that can be isolated with the fewest separators are therefore the anomalies (box: “Isolation Forests”).

Isolation Forests

Isolation Forests are based on several decision trees. A single decision tree in turn consists of nodes and leaves. The decisions are made in the nodes. To do this, a random feature (that is, a feature of the data, in our example, CPU, network, or memory resource) and a random split value is selected for the node. If a data point in the respective feature has a value that is less than the split value, it is placed in the left branch, otherwise in the right branch. Each branch is followed by the next node, and the whole thing happens again: random feature and random split value. This continues until each data point ends up in a separate sheet. So this isolates each data point. The procedure is repeated with the same data for several trees. These trees form the forest and give the algorithm its name.

So at this point we have isolated all training data in each tree into leaves. Now we have to decide which leaves or data points are an anomaly. Here is what we can determine: Anomalies are easier to isolate. This is also shown in Figure 3, which shows two-dimensional data for illustration. Points that are slightly outside the data point can be isolated with a single split (shown as a black line on the left). This corresponds to a split in the decision tree. In contrast, data points that are close together on the screen can only be isolated with multiple splits (shown as multiple separators on the right).

This means that on average, anomalies require fewer splits to isolate. And this is exactly how anomalies are defined in Isolation Forests: Anomalies are the data points that are isolated with the fewest splits. Now all that remains to be done is to decide which part of the data should be classified as an anomaly. This is a parameter that the Isolation Forests need for training. Therefore, one needs a premonition of how many anomalies are found in the data set. Alternatively, an iterative approach can be used. Different values can be tried out and the result checked. Another parameter is the number of decision trees for the forest, since the number of splits is calculated using the average of several decision trees. The more decision trees, the more accurate the result. One hundred trees is a generous value here, which is sufficient for a good result. This may sound like a lot at first, but the calculations are actually within seconds. This efficiency is also an advantage of Isolation Forests.

Fig. 3: Demonstration of data splitting with Isolation Forests: left with as few splits possible, right only with several splits

Isolation Forests are easy to implement with Python and the library scikit-learn. The parameter n_estimators corresponds to the number of decision trees and contamination to the expected proportion of anomalies. The fit method is used to trigger the training process on the data. The model is then stored in the Docker Volume.

from sklearn.ensemble import IsolationForest
iso_forest=IsolationForest(n_estimators=100, contamination=float(.02))
iso_forest.fit(x_train)
joblib.dump(iso_forest, ‘/data/iso_forest.joblib’)

From the Docker Volume, the Anomaly Detection Consumer can then call the model and evaluate streaming data using the predict method:

iso_forest = joblib.load(‘/data/iso_forest.joblib’)

anomaly = iso_forest.predict(df_predict)

A nice feature of scikit-learn is the simplicity and consistency of the API. Isolation Forests can be exchanged for the above mentioned One-Class Support Vector Machines. Besides the import, only line 2 IsolationForest has to be replaced by OneClassSVM. The data preparation and methods remain the same. Let us now turn to data preparation.

Feature Engineering with time series

In Machine Learning, each column of the data set (CPU, network and disk) is also called a feature. Feature engineering is the preparation of the features for use. There are a few things to consider. Often the features are scaled to get all features in a certain value range. For example, MinMax scaling transforms all values into the range 0 to 1. The maximum of the original values then corresponds to 1, the minimum to 0, and all other values are scaled in the range 0 to 1. Scaling is, for example, a prerequisite for training neural networks. For Isolation Forests especially, scaling is not necessary.

Another important step is often the conversion of categorical features that are represented by a string. Let’s assume we have an additional column for the processor with the strings Processor_1 and Processor_2, which indicate which processor the measured value refers to. These strings can be converted to zeros and ones using one-hot encoding, where zero represents the category Processor_1 and one represents Processor_2. If there are more than two categories, it makes sense to create a separate feature (or column) for each category. This will result in a Processor_1 column, a Processor_2 column, and so on, with each column consisting of zeros and ones for the corresponding category.

Feature engineering also means to prepare the existing information in a useful way and to provide additional information. Especially for time series it is often important to extract time information from the time stamp. For example, many time series fluctuate during the course of a day. Then it might be important to prepare the hour as a feature. If the fluctuations come with the seasons, a useful feature is the month. Or you can differentiate between working days and vacation days. All this information can easily be extracted from the timestamp and often provides a significant added value to the algorithm. A useful feature in our case could be working hours. This feature is zero during the time 8-18 and otherwise one. With the Python library pandas, such features can be easily created:

df[‘hour’] = df.timestamp.dt.hour
df[‘business_hour’] = ((df.hour < 8) | (df.hour>18)).astype(“int”)

It is always important that sufficient data is available. It is therefore useless to introduce the feature month if there is no complete year of training data available. Then the first appearance of the month of May would be detected as an anomaly in live operation, if it was not already present in the training data.

At the same time one should note that the following does not apply: A lot helps out a lot. On the contrary! If a feature is created that has nothing to do with the anomalies, the algorithm will still use the feature for detection and find any irregularities. In this case, thinking about the data and the problem is essential.

Trade-off in anomaly detection

Training an algorithm to correctly detect all anomalies is difficult or even impossible. You should therefore expect the algorithm to fail. But these errors can be controlled to a certain degree. The so-called first type of error occurs when an anomaly is not recognized as such by the algorithm. So, an anomaly is missed.

The second type of error is the reverse: An anomaly is detected, but in reality it is not an anomaly – i.e. a false alarm. Both errors are of course bad.

However, depending on the scenario, one of the two errors may be more decisive. If each detected anomaly results in expensive machine maintenance, we should try to have as few false alarms as possible. On the other hand, anomaly detection could monitor important values in the health care sector. In doing so, we would rather accept a few false alarms and look after the patient more often than miss an emergency. These two errors can be weighed against each other in most anomaly detection algorithms. This is done by setting a threshold value or, in the case of isolation forests, by setting the contamination parameter. The higher this value is set, the more anomalies are expected by the algorithm and the more are detected, of course. This reduces the probability that an anomaly will be missed.

The argumentation works the other way round as well: If the parameter is set low, the few detected anomalies are most likely the ones that are actually those. But because of this, some anomalies can be missed. So here again it is important to think the problem through and find meaningful parameters by iterative trial and error.

Real-time anomaly detection

Now we have waited long enough and the producer should have written some data into the Kafka stream by now. Time to start training for anomaly detection. This happens in a second terminal with Docker using the following commands:

docker build ./Consumer_ML_Training -t kafka-consumer-training
docker run –network=”kafka_cluster_default” –volume $(pwd)/data:/data:rw -it kafka consumer training

The training occurs on 60 percent of the data. The remaining 40 percent is used for testing the algorithm. For time series, it is important that the training data lies ahead of the test data in terms of time. Otherwise it would be an unauthorized look into a crystal ball.

An evaluation on the test data is stored in Docker Volume in the form of a graph together with the trained model. An evaluation is visible in Figure 4. The vertical red lines correspond to the detected anomalies. For this evaluation we waited until all data were available in the stream.

Fig. 4: Detected anomalies in the data set

If we are satisfied with the result, we can now use the following commands to start the anomaly detection:

docker build ./Consumer_Prediction -t kafka-consumer-prediction
docker run –network=”kafka_cluster_default” –volume $(pwd)/data:/data:ro -it kafka-consumer-prediction

As described at the beginning, we can also use several Docker containers for evaluation to distribute the load. For this, the Kafka Topic load must be distributed over two partitions. So let’s increase the number of partitions to two:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-topics –alter –zookeeper zookeeper:2181 –topic anomaly_tutorial –partitions 2

Now we simply start a second consumer for anomaly detection with the known command:

docker run –network=”kafka_cluster_default” –volume $(pwd)/data:/data:ro -it kafka-consumer-prediction

Stay up to date

Learn more about MLCON

Kafka now automatically takes care of the allocation of partitions to Consumers. The conversion can take a few seconds. An anomaly is displayed in the Docker logs with “No anomaly” or “Anomaly at time *”. Here it remains open to what extent the anomaly is processed further in a meaningful way. It would be particularly elegant to write the anomaly into a new topic in Kafka. An ELK stack could dock to this with Kafka Connect and visualize the anomalies.

The post Real-time anomaly detection with Kafka and Isolation Forests appeared first on ML Conference.

How Deep Learning helps protect honeybees

ML editorial team — Tue, 19 Nov 2019 13:36:05 +0000

Editorial Team: You have developed a tool called DeepBee. What is it all about?

Using Machine Learning it is possible to deliver quality information about honey bee colonies to beekeepers and researchers.

Thiago da Silva Alves: Many research projects in the apidology area require a process called temporal assessment of honey bee colony strength, which often involves counting the number of comb cells with brood and food reserves multiple times a year. There are thousands of cells in each comb, which makes manual counting a time-consuming, tedious, and thereby an error-prone task.

Knowing this problem, we decided to automate this process using image processing techniques to automatically detect cells, and deep learning for the cells’ content classification.

Editorial Team: Your presentation at the Machine Learning Conference is called “Honey Bee Conservation using Deep Learning”. How can Machine Learning help with Honey Bee Conservation?

Thiago da Silva Alves: Using Machine Learning it is possible to deliver quality information about the colonies to beekeepers and researchers. With this information, they can have insights on what they can do to improve colony health.

For example, the tool we developed is able to count the amount of brood and food reserves in a comb image. If the beekeeper frequently extracts this information from his colony he can detect anomalies such as a low bee birth rate or an unexpected reduction in honey production. Having this information at hand, the beekeeper’s decision on colony health will be more efficient.

How machine learning can help prevent bee mortality

Editorial Team: Could you give an insight into the technologies used in DeepBee?

Thiago da Silva Alves: We started using Nvidia DIGITS + Caffe in our first classification tests, but quickly faced some limitations. Then we decided to use Keras, with a TensorFlow backend, for the implementation of our models. We did most of the images preprocessing using OpenCV and NumPy.

Editorial Team: Where did you see the biggest challenge in developing DeepBee?

Thiago da Silva Alves: The biggest challenge we encountered was collecting data and creating the datasets. It took us a few months before we had enough cells annotated to start developing the models.

Developing an algorithm to detect different cell types was also a big challenge for us. The aggravating factor, in this case, is the fact that it is not possible to easily see the edge of cells containing honey.

Editorial Team: How can machine learning help prevent bee mortality?

The biggest challenge we encountered was collecting data and creating the datasets.

Thiago da Silva Alves: It can help reduce bee mortality by giving the beekeepers more information about the strength of the colonies. This information is used to sharpen the beekeeper’s decision making and then improve bee health.

Editorial Team: What are the next steps? What are your plans for DeepBee?

Thiago da Silva Alves: We plan to make the tool even more user-friendly. We also believe it is possible to implement some features of DeepBee into a smartphone application.

Editorial Team: Thank you very much!

Questions by Hartmut Schlosser

The post How Deep Learning helps protect honeybees appeared first on ML Conference.

“Tricking an autonomous vehicle into not recognizing a stop sign is an evasion attack”

hschlosser — Wed, 25 Sep 2019 07:31:13 +0000

MLcon: Different approaches have been made to launch attacks against machine learning systems, as their use continues to become increasingly widespread. Can you tell us more about these angles of attack and how they differ from each other?

During an evasion attack, the adversary aims to avoid detection by deliberately manipulating an example input.

David Glavas: Here are four common attack techniques:

Poisoning attacks inject malicious examples into the training dataset or modify features/labels of existing training data.
Evasion attacks deliberately manipulate the input to evade detection.
Impersonation attacks craft inputs that the target model misclassifies as some specific real input.
Inversion attacks steal sensitive information such as training data or model parameters.

MLcon: In your opinion, which type of attack on ML systems currently poses the greatest threat, and for what reason?

David Glavas: Evasion attacks, because they can be performed with less knowledge about the target system. The more knowledge about the target system is required to perform an attack, the more difficult it is to perform.

MLcon: How is an evasion attack carried out?

David Glavas: During an evasion attack, the adversary aims to avoid detection by deliberately manipulating an example input. To be more specific, while training a neural network, we usually use backpropagation to compute the derivative of the cost function with respect to the network’s weights. In contrast, during an evasion attack, we use backpropagation to compute the derivative of the cost function with respect to the input.

So we use backpropagation to answer the question: “How do I need to modify the input to maximize the cost function?”, where maximizing the cost function results in the target network being confused and making an error. For example, given a picture of a cat, we look for pixels we can modify such that the target network sees something that’s not a cat. Researchers have proposed various algorithms to perform this (e.g. Basic Iterative Method, Projected Gradient Descent, CW-Attack).

MLcon: Can you provide us with an example for an evasion attack?

Researchers showed that they can cause a stop sign to ‘disappear’ according to the detector.

David Glavas: To name an example with obviously negative consequences, tricking an autonomous vehicle into not recognizing a stop sign is an evasion attack. Autonomous vehicles use object detectors to both locate and classify multiple objects in a given scene (e.g. pedestrians, other cars, street signs, etc.). An object detector outputs a set of bounding boxes as well as the label and likelihood of the most probable object contained within each box.

In late 2018, researchers showed that they can cause a stop sign to “disappear” according to the detector by adding adversarial stickers onto the sign. This attack tricked state-of-the-art object detection models to not recognize a stop sign over 85% of times in a lab environment and over 60% of times in a more realistic outdoor environment. Imagine having to win a coin toss every time you want your autonomous car to stop at a stop sign.

Other research suggests that similar attacks are possible on face recognition systems (you can’t recognize a face without detecting it first). Other common evasion attacks are evading malware and spam detectors.

Source: Eykholt et al, Robust Physical-World Attacks on Deep Learning Visual Classification

MLcon: When an ML system is attacked, e.g. via data poisoning, this may go on unnoticed for some time. Which measures could be taken to help prevent attacks from going undetected?

David Glavas: The measures will depend on the details of the specific system at hand. A general approach would be to provide access control more carefully (who can see and change what data) and to monitor system behavior for anomalies (anomalies may indicate that the system has been poisoned).

MLcon: Generally speaking, what are the greatest challenges in securing an ML system, and do they differ from other types of longer-known security risks?

Companies that encounter such attacks either cover it up entirely or don’t disclose the details.

David Glavas: I think it’s difficult to envision all possible moves/changes/manipulations that an attacker can perform, but that’s a general issue people dealing with computer security have.

There are many commonalities, as ML systems still suffer from the same security risks as non-ML systems. Attacks on ML systems don’t necessarily need to specifically target the system’s ML component. In fact, it’s probably easier to focus on other components.

MLcon: Have any large-scale attacks on ML systems been carried out recently – and what can we learn from how they have or have not been fended off?

David Glavas: It’s difficult to say, since companies that encounter such attacks either cover it up entirely or don’t disclose the details of what exactly happened. For example, the evasion of spam filters and anti-viruses seem to be quite common.

MLcon: Thank you for the interview!

The post “Tricking an autonomous vehicle into not recognizing a stop sign is an evasion attack” appeared first on ML Conference.

Reinforcement Learning: A gentle introduction and industrial application

ahenseleit — Thu, 13 Jun 2019 07:14:24 +0000

JAXenter: For those who are not familiar with this term, what is the basic idea behind reinforcement learning?

Christian Hidber: With reinforcement learning, computers learn complex behaviours through clever trial-and-error strategies. This is very much like a child learning a new game: They start by pressing some random buttons and see what happens. After a while, they continuously improve their gaming strategy and get better and better. Moreover, you don’t have to explain to a child how the game works, as it’s part of the fun to figure it out. Reinforcement learning algorithms essentially try to learn by mimicking this behaviour.

JAXenter: Reinforcement learning does not require large data sets for training. By which means is this accomplished?

Christian Hidber: These algorithms learn through the interaction with an environment. In the game example above, the game engine containing all the rules of the game is the environment. The algorithms observe which game sequences yield good results and try to learn from them. In a sense, reinforcement learning generates its dataset on the fly from the environment, creating as much training data as needed – pretty neat!

JAXenter: How well does the accuracy of reinforcement learning solutions fare compared to other types of machine learning?

Christian Hidber: Reinforcement learning addresses machine learning problems that are hard to solve for other types of machine learning and vice versa. Thus, you rarely find yourself in a situation where you can compare their accuracies directly. The accuracy in reinforcement learning may vary a lot for the same problem, depending on your model, data and algorithm choices. So that’s quite similar to classic machine learning approaches.

JAXenter: In your talk, you give an insight into how you applied reinforcement learning to the area of siphonic roof drainage systems. For which reasons did you choose it over other machine learning methods?

Christian Hidber: Actually, we use reinforcement learning in a complementary fashion. Our calculation pipeline uses traditional heuristics as well as supervised methods like neural networks and support vector machines. At a certain point, we had to realize and could prove that we were not able to improve our classic machine learning solution any further. Using reinforcement learning as an additional stage in our pipeline, we were able to reduce our previous failure rate by more than 70%.

JAXenter: In which areas might reinforcement learning play a central role in the future?

Christian Hidber: There are already quite a few real-world applications out there in production, like cooling a data center or controlling robot movements. Personally, I think that reinforcement learning is particularly great for industrial control problems. In these cases, we can often simulate the environment, but there’s no clear-cut way on how to find a good solution. That was also the setup in our hydraulic optimization problem. So, I expect to see many more industrial applications.

Stay tuned!
Learn more about ML Conference:

JAXenter: Can you think of any typical mistakes that may happen when starting to work with reinforcement learning?

Christian Hidber: Oh, yes, absolutely, since we made a lot of mistakes ourselves. Some of them resulted in very funny and surprising strategies. A large temptation is always to put a lot of cleverness into the reward function. The reward function is responsible for defining which outcome is considered “good” and which “bad”. The algorithms are incredibly smart at finding short-cuts and loopholes, producing high rewards for behaviours which are definitely “bad”. It seems that the more cleverness you put into the reward function, the more surprises you get out of it.

JAXenter: What do you expect to be the main takeaway for attendees of your talk?

Christian Hidber: My goal is to give the attendees a good intuition on how these algorithms work. The attendees may then decide on their own whether a problem at hand might be suitable for reinforcement learning or not. And of course, if an attendee already has an idea for an application, I would be more than delighted to hear about it.

JAXenter: Thank you very much!

Author

Maika Möbus has been an editor for Software & Support Media since January 2019. She studied Sociology at Goethe University Frankfurt and Johannes Gutenberg University Mainz.

Get to know the ML Conference better?

Here are more sessions

→ debugging and visualizing tensorflow programs with images

→ Some Things I wish I had known about scaling Machine Learning Solutions

→ Deep Learning mit Small Data

→ Evolution 3.0: Solve your everyday Problems with genetic Algorithms

Check out the ML Conference program

The post Reinforcement Learning: A gentle introduction and industrial application appeared first on ML Conference.

Deep Learning with Java: Introduction to Deeplearning4j

ahenseleit — Mon, 06 May 2019 10:28:26 +0000

Exploiting Deep Learning: the most important bits and pieces

Machine learning in general refers to data-based methods of artificial intelligence. A computer learns a model based on sample data. Artificial intelligence plays a significant role in human-machine interaction. An example of this is the Zeno robot shown in Figure 1. It is a therapy tool for autistic children to help them express and understand their emotions better. Zeno recognizes the emotion of its counterpart based on language and facial expression and reacts accordingly. For this purpose, the recorded sensor data must be analyzed in real time through the machine learning process.

Fig. 1: The robot Zeno is used for therapy with autistic children

Deep learning is based on networks of artificial neurons that have input and output neurons as well as multiple layers of intermediate neurons (hidden layers). Each neuron processes an input vector based on a method similar to that of human nerve cells: A weighted sum of all input values is calculated and the result is transformed with a non-linear function, the so-called activation function. The input neurons record the data, such as unprocessed audio signals, and feed it into the neural network. The audio data passes through the intermediate neurons of all the hidden layers and is thereby processed. Then the processed signals and the calculated results are issued via the output neurons, which then deliver the final result. The parameters of the individual neurons are calculated during the training of the network using the training data. The greater the number of neurons and layers you have, the more complex the problems that you can deal with.

In principle, a greater amount of data also leads to more robust models (as long as the data is not unbalanced). If there is not enough data and the selected network architecture is too complex, there is a risk of overfitting. This means that the model parameters are optimized too much during the training to the given data and that the model is not sufficiently generalized anymore, meaning that it does not work well anymore with independent test data. Possible tasks can be overcome by applying three learning methods: (1) supervised learning, (2) semi-supervised learning, and (3) unsupervised learning.

In supervised learning, a model is trained that can approximate one or more target variables from a set of annotated data. If the target variable is continuous, we speak of regression, in the case of discrete target values of classification. In classification problems, neural networks (with more than two classes) normally use as many neurons in the output layer as there are classes. The neuron that displays the highest activation for given input values is then the class that considers the network most probable.

Semi-supervisedlearning is a variant of supervised learning that uses both annotated and unannotated training data. The combination of this data can greatly improve the learning accuracy when the learning process is monitored by an expert. This learning method is also referred to as cooperative learning because of the artificial neural network and human work together. If the neural network cannot classify specific data with high confidence, it needs the help of an expert for annotation.

Unlike the other two learning methods,unsupervised learningonly has input data and no associated output variables. Since there are no right or wrong answers and no one supervises the behaviour of the system, the algorithms rely on themselves to discover and present relevant structures in the data. The most commonly used unsupervised learning method is clustering. The goal of clustering algorithms is to find patterns or groupings in the dataset. The data within one grouping then has a higher degree of similarity than data in other clusters.

Deep Learning with Java

Deep learning approaches are considered state of the art in various areas of machine learning, such as audio processing (speech or emotion recognition), image processing (object classification or facial recognition) and text processing (sentiment analysis or natural language processing). To simulate the neural networks, program libraries are often used for machine learning. Most robust libraries, such as TensorFlow, Caffe, or Theano, were written in the Python and C ++ programming languages. With Deeplearning4j [1], however, there is also a Java-based deep learning platform that can bridge the gap between the aforementioned Python-based program libraries and Java.

Deeplearning4j is mostly implemented in C and C ++ and uses CUDA to offload the calculations to a compatible NVIDIA graphics processor. The programmer has various architectures available, including CNNs, RNNs and auto-encoders. Likewise, models that have been created with the mentioned tools can be imported.

Essentially, this article addresses the use of deep learning for pattern recognition, such as in computer perception, using the example of learning audio feature representations using Convolutional Neural Networks (CNNs) and Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs).

Audio plots (spectrograms) are generated from the audio signals. They are then used as input to the pre-trained CNN, and the activations of the last fully connected layer with 4096 neurons are extracted as deep spectrum features. This leads to a large feature vector that is eventually used for classification (Figure 2).

Fig. 2: Deep learning system for classifying audio signals using a CNN pre-trained on a million images

Convolutional Neural Networks

The presentation of the data that is put into the neural network is crucial. Signals, including two-dimensional signals, such as image data, can be fed directly into the neural network. In the case of image data, this means the colour values of the individual pixels. However, the processing of the data is not shift-invariant, as shifting an object in an image by the width of pixel results in the image information going through a completely different path in the neural network. Some degree of shift invariance can be achieved by CNNs.

CNNs perform a convolution operation, weighing the vicinity of a signal with a convolution kernel, and adding the products together. The weights of the convolution kernel are trained with CNNs and are constant over all areas of an image. For each pixel, multiple convolution operations are normally performed, creating so-called feature maps. Each feature map contains information about specific edge types or shapes in the input image, so each convolution kernel specializes in a specific local image pattern. In order to improve the shift invariance and to compress the image information that is initially blown up by a CNN layer, the described layers are normally used in combination with a subsequent maximum pooling layer. This layer selects, from a (mostly) 2 x 2 vicinity, only the largest activation respectively and propagates it to the subsequent network layer.

CNNs typically consist of a series of several convolutional and maximum pooling layers and are completed by one or more fully networked layers. Although CNNs are also applied to one-dimensional signals, they are most commonly found in the classification of images and have greatly improved the state of the art in this area. A standard problem is the detection of handwritten digits, for which the error rate on the test data of the MNIST standard data set was successfully reduced to below 0.3 percent [2].

Since very large amounts of data are required for the training of complex neural networks and long computation time is associated with it, pre-trained networks have been enjoying great popularity in recent years. An example of such a network is AlexNet, which was trained on the ImageNet image database, and consists of more than one million images in a thousand categories. The network has eight layers, of which the first five layers are those of a CNN. Such a neural network can be used not only for the classification of a thousand pre-trained categories but also for the classification of further objects or image classes by re-training the last layer (or last layers) with image examples from the desired categories while leaving the weights in the previous layers constant. The advantage here is that robust classifiers can be generated even with a much smaller number of training data. Such a procedure, in which we make use of models from another domain or problem definition, is referred to as transfer learning. At the Interspeech Conference 2017, a prestigious international conference, we presented a CNN pre-trained for image recognition for audio classification [3]. Figure 2 shows the structure of the presented approach.

Recurrent Neural Networks

Stay tuned!
Learn more about ML Conference:

Recurrent Neural Networks (RNNs) are suitable for modeling sequential data. These are the series of data points, mostly over time, such as audio and video signals, but also physiological measurements (such as electrocardiograms) or stock prices. An RNN, in contrast to feedforward networks such as CNNs, also has feedback in itself or to other neurons. Each passing of the activation into another neuron is understood as a time step so that an RNN can implicitly store data over any given period of time.

Since during training of RNNs, i.e. when optimizing the weights of the neurons, the error gradients have to be propagated back through different layers as well as over a large number of time steps and these are multiplied in each step, they gradually vanish (Vanishing Gradient Problem), and the respective weights are not sufficiently optimized. LSTMs solve this problem by introducing so-called LSTM cells. They were presented at the Technical University of Munich in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [4]. LSTMs are able to store activations over a longer number of time steps. This is achieved through a component of multiplicative gates: Input Gate, Output Gate and Forget Gate, which in turn consist of neurons whose weights are trained. The gates determine which activation is transmitted to the cell at what times (input gate), when and to what extent they are output (output gate), and when and if the activation is cleared (forget gate). Gated Recurrent Units (GRUs) are an advancement of the LSTMs. They do without an output gate and are thus faster to train, yet still offer similar accuracy.

LSTMs and GRUs can work with different input data. Classically, in the case of audio signals, time-dependent acoustic feature vectors have been extracted. Typical features include short-term energy in certain spectral bands, or in particular for speech signals, so-called Mel-frequency cepstral coefficients, which represent information from linguistic unities in a compressed manner. In addition, the fundamental frequency of the voice or rhythmic features may be relevant for certain issues. Alternatively, however, so-called end-to-end (E2E) learning has been increasingly in use recently. The step of feature extraction is replaced by several convolutional layers of a CNN. The convolution kernels are one-dimensional in audio signals and may also be considered as bandpass filters. The CNN layers are then preferably followed by LSTM or GRU layers to account for the temporal nature of the signal. E2E learning has been used successfully in the linguistic field for emotion recognition in human speech and is currently the most important subject of research in automatic speech recognition (speech-to-text).

Fig. 3: A high-level overview of auDeep

The emotion recognition from voice recordings mentioned above using the example of the robot is a complex example. First of all, it has to be considered that the robot has to work in a wide variety of acoustic environments – environments that it does not know yet from the training data. As shown in Figure 3, audio features vary greatly depending on the acoustic recording conditions and the respective speakers, possibly more than in comparison to different emotions. First, Mel-spectrograms are extracted from the audio files (a). Subsequently, a recurrent sequence-to-sequence autoencoder is trained on these spectrograms, which are considered to be time-dependent sequences of frequency vectors (b). After the autoencoder training, the learned representations are generated by the Mel-spectrograms for use as feature vectors for the corresponding instances (c). Finally, a classifier (d) is trained on the feature vectors. Therefore, as a first step, normally either the audio features or the audio signal itself are improved, therefore freed from any interference. Again, artificial neural networks, mostly RNNs, can be used for this purpose. Furthermore, ambient noise detection is often necessary, that is to say, a determination of the acoustic environment so that the system can select a model which is optimal for the respective situation, or it can adapt the model parameters accordingly. Finally, actual emotion recognition is performed from the preprocessed language.

One of the latest RNN-based developments for unsupervised learning is auDeep [5], [6]. The system is a sequence-to-sequence autoencoder that learns audio representations in an unsupervised method from extracted Mel-spectrograms. Figure 2 shows an illustration of the structure of auDeep. Mel-spectrograms are considered as a time-dependent sequence of frequency vectors in the interval[-1;1]^{{N_mel }}, each of which describes the amplitudes of the N_mel
Mel-frequency bands within an audio portion. This sequence is applied to a multilayer RNN encoder which updates its hidden state in each time step based on the input frequency vector. Therefore, the last hidden state of the RNN encoder contains information about the entire input sequence. This last hidden state is transformed using a fully connected layer, and another multilayer RNN decoder is used to reconstruct the original input sequence from the transformed representation.

The encoder RNN consists of N_layers, each contains N_unit GRUs. The hidden states of the encoder GRUs are initialized to zero for each input sequence and their last hidden states in each layer are concatenated into a one-dimensional vector. This vector can be viewed as a fixed length representation of a variable length input sequence – with dimensionality N _layer⋅ N _unit when the encoder RNN is unidirectional, and the dimensionality 2 ⋅N_layer⋅N_unit⋅ if it is bidirectional.

The representation vector is then passed through a fully connected layer with hyperbolic tangent activation. The output dimension of this layer is chosen so that the hidden states of the RNN decoder can be initialized.

The RNN decoder contains the same number of layers and units as the RNN encoder. Their task is the partial reconstruction of the input Mel-spectrogram based on the representation with which the hidden states of the RNN decoder were initialized. At the first time step, a zero input is fed to the RNN decoder. During the subsequent time steps t, the expected decoder output at time t-1 is passed as an input to the RNN decoder. Greater representations could possibly be obtained by using the actual decoder output rather than the expected output, as this reduces the amount of information available to the decoder.

The outputs of the decoder RNNs are passed through a single linear projection layer having hyperbolic tangent activation at each time step to assign the dimensionality of the decoder output of the target dimensionalityN _mel . The weights of this output projection are distributed over several time steps. To introduce larger short-term dependencies between the encoder and the decoder, the RNN decoder reconstructs the reverse input sequence.

Autoencoder training is performed using the root mean square error (RMSE) between the decoder output and the target sequence as the target function. A so-called dropout is applied to the inputs and outputs of the recurrent layers, but not to those of the hidden states. Dropout corresponds to the random elimination of neurons during the learning iterations to enforce a so-called regularization, which should allow individual neurons to learn more independently from their vicinity. Once training is completed, the fully connected layer activations are extracted as the learned spectrogram representations and forwarded for a decision, such as classification. Figure 4 illustrates how the autoencoder has learned new representations in an unsupervised manner from the mixed spectrograms. Finally, the learned audio representations can be classified by means of an RNN. This is illustrated below by Deeplearning4j. Deeplearning4j offers numerous libraries for the modelling of diverse neural networks.

Visualization of time-shared stochastic vicinity embedding of the spectrograms (a) and learned representations using a recurrent sequence-to-sequence autoencoder (b)

Finally, in Listing 1 we show the implementation of an RNN with Graves LSTM cells for the classification of feature vectors, which we extracted using the unsupervised method (Figure 2). To train the LSTM network, a number of hyperparameters must be adjusted. These include, for example, the learning rate of the network, the number of input and output neurons corresponding to the number of extracted features and the classes, and a majority of other parameters.


import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.deeplearning4j.eval.Evaluation;
import org.deeplearning4j.nn.api.OptimizationAlgorithm;
import org.deeplearning4j.nn.conf.*;
import org.deeplearning4j.nn.conf.layers.GravesLSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import java.io.File;

public class Main {
  public static void main(String[] args) throws Exception {
    int batchSize = 128; // set batchsize
     // Load training data
      RecordReader csvTrain = new CSVRecordReader(1, ",");
      csvTrain.initialize(new FileSplit(new 
File("src/main/resources/train.csv")));
      DataSetIterator iteratorTrain = new RecordReaderDataSetIterator(csvTrain, 
batchSize, 4096, 2);
       // Load evaluation data
      RecordReader csvTest = new CSVRecordReader(1, ",");
      csvTest.initialize(new FileSplit(new 
File("src/main/resources/eval.csv")));
      DataSetIterator iteratorTest = new RecordReaderDataSetIterator(csvTest, 
batchSize, 4096, 2);
   //****LSTM hyperparameters****
      int anzInputs = 4096;  // number of extracted features
      int anzOutputs = 2;  // number of classes
      int anzHiddenUnits = 200; // number of hidden units in each LSTM layer
      int backPropLaenge = 128; // Length for truncated back propagation over time
      int anzEpochen = 32; // number of training epochs
      double lrDecayRate = 10;  // Decline of the learning rate
  //****Network configuration****
      MultiLayerConfiguration netzKonfig = new NeuralNetConfiguration.Builder() 

        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT).iterations(1)
        .learningRate(0.001)
        .seed(234)
        .l1(0.01) // least absolute deviations (LAD)
        .l2(0.01) // least squares error (LSE)
        .regularization(true)
        .dropOut(0.1)
        .weightInit(WeightInit.RELU)
        .updater(Updater.ADAM)

  .learningRateDecayPolicy(LearningRatePolicy.Exponential).lrPolicyDecayRate(lrDecayRate)

        .list()
        .layer(0, new GravesLSTM.Builder().nIn(anzInputs ).nOut(anzHiddenUnits)
        .activation(Activation.TANH).build())
        .layer(1, new 
GravesLSTM.Builder().nIn(anzHiddenUnits).nOut(anzHiddenUnits)
        .activation(Activation.TANH).build())
        .layer(2, new 
GravesLSTM.Builder().nIn(anzHiddenUnits).nOut(anzHiddenUnits)
        .activation(Activation.TANH).build())
        .layer(3, new 
RnnOutputLayer.Builder(LossFunction.MEAN_ABSOLUTE_ERROR).activation(Activation.RELU)
        .nIn(anzHiddenUnits).nOut(anzOutputs).build()) 

        .backpropType(BackpropType.TruncatedBPTT).tBPTTForwardLength(backPropLaenge).tBPTTBackwardLength(backPropLaenge)
        .pretrain(true).backprop(true)
        .build();
      MultiLayerNetwork modell = new MultiLayerNetwork(netzKonfig);
      modell.init(); // initialization of the model 
      // Training after each epoch
      for (int n = 0; n < anzEpochen; n++) {
        System.out.println("Epoch number: " + (n + 1));
        modell.fit(iteratorTrain);
      }
    // Evaluation of the model
      System.out.println("Evaluation of the trained model ...");
      Evaluation Eval = new Evaluation(anzOutputs);
      while (iteratorTest.hasNext()) {
        DataSet data = iteratorTest.next();
        INDArray features = data.getFeatureMatrix();
        INDArray lables = data.getLabels();
        INDArray predicted = modell.output(features, false);
        Eval.eval(lables, predicted);
      }
      //****Show evaluation results****
      System.out.println("Accuracy:" + Eval.accuracy());
      System.out.println(Eval.confusionToString()); //Confusion Matrix
  }
}

Exploiting Deep Learning – Conclusion

Because of their special capabilities, deep learning methods will continue to increasingly dominate machine learning research and practice in the years to come. In recent years, a large number of companies have been founded which specialize in deep learning, while large IT companies such as Google and Apple are hiring experienced experts to a large extent. In the field of research, deep learning has in the meantime displaced a large part of classical signal processing and now dominates the field of data analysis.

Developers will increasingly have to deal with the integration of deep learning models. In the area of Java development, the toolkit Deeplearning4j presented earlier is a promising framework. In one example, the application of deep learning for audio analysis was shown here. Following the principle and code, a multitude of related issues can be solved elegantly and efficiently. Artificial intelligence has again become the focus of general interest thanks to deep learning. It remains to be seen what kind of new solutions and applications we will experience in the near future.

Links & literature

[1] Deeplearning4j: https://deeplearning4j.org/
[2] http://yann.lecun.com/exdb/mnist/
[3] Amiriparian, Shahin; Gerczuk, Maurice; Ottl, Sandra; Cummins, Nicholas; Freitag, Michael; Pugachevskiy, Sergey; Baird, Alice; Schuller, Björn: „Snore Sound Classification Using Image-based Deep Spectrum Features“, in: Proceedings INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, S. 3512–3516, ISCA, ISCA, August 2017
[4] Hochreiter, Sepp; Schmidhuber, Jürgen: „Long Short-Term Memory“, Neural Computation, 9 (8), S. 1735-1780, 1997
[5] Amiriparian, Shahin; Freitag, Michael; Cummins, Nicholas; Schuller, Björn: „Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio“, in: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), München, S. 17–21, IEEE, November 2017
[6] auDeep: https://github.com/auDeep/auDeep/

Get to know the ML Conference better?

Here are more sessions

→ debugging and visualizing tensorflow programs with images

→ Some Things I wish I had known about scaling Machine Learning Solutions

→ Deep Learning mit Small Data

→ Evolution 3.0: Solve your everyday Problems with genetic Algorithms

Check out the ML Conference program

The post Deep Learning with Java: Introduction to Deeplearning4j appeared first on ML Conference.

Find the outlier: Detecting sales fraud with machine learning

Julia Martin — Wed, 06 Jun 2018 14:18:21 +0000

JAXenter: Hello Canburak! Your session at the Machine Learning Conference is titled Anomaly detection in sales point transactions. What does this mean? Do you have an example?

Canburak Tümer: Let me first define what I mean with sales point. Sales points are the locations where Turkcell Superonline gathers new subscribers. They can be a shop belonging to Turkcell, they can be a franchise, or sometimes they can be a booth in an event. Anomaly in sales usually occurs in the numbers of new subscriptions; if a shop usually sells x subscriptions in a day and suddenly in a new day it sells twice as many units, there is an anomaly and it may lead to a fraud. We report this anomaly to revenue assurance teams to investigate.

The other type of anomaly is between different shops. We are expecting to have similar numbers between the same type of shop in same towns, but there can be outliers. These outliers should be investigated for any potential fraud. So an anomaly in the sales may mean a fraudulent action.

JAXenter: What parameters do you look for when looking for an anomaly?

Canburak Tümer: Our main parameter is the number of new subscriptions in different intervals (daily, weekly, monthly, 6 months), supported by the town and sales point type information. But in further research we will add, we will also look at cancellation numbers of these new subscriptions, complaint numbers, and average churn tenure.

JAXenter: How can outlying sales points be identified?

Canburak Tümer: For detecting the outlying shop in a town, we are now using the interquartile range method. This is a basic and trusted method to detect outliers in a set of records. Also, we are evaluating the hierarchical clustering method by choosing a good cut off point. Hierarchical clustering can help us to detect non-normal point in the data.

JAXenter: Why is it more complex to find outlier sales points? What is necessary for this?

Canburak Tümer: For a single sales point, it is easier to detect the trend, then predict the sales for the next time interval and check if the data belongs in the predicted value. But when it comes to comparing different sales points, new features come into the stage. First of all, location and population of the location affect the sales.

Then, the type of the sales point: an online sale or telesales point cannot be compared to a local shop. As the number of features increases, model complexity increases along with it. In order to keep things simple, we group the sales point by the location and type then use the simple methods to detect outliers.

JAXenter: Thank you!

Carina Schipper has been an editor at Java Magazine, Business Technology and JAXenter since 2017. She studied German and European Ethnology at the Julius-Maximilians-University Würzburg.

The post Find the outlier: Detecting sales fraud with machine learning appeared first on ML Conference.