Deep Learning Archives - ML Conference https://mlconference.ai/tag/deep-learning/ The Conference for Machine Learning Innovation

Explainability – a promising next step in scientific machine learning https://mlconference.ai/blog/explainability-a-promising-next-step-in-scientific-machine-learning/ Tue, 05 May 2020 09:07:11 +0000 With the emergence of deep neural networks, the question has arisen of how machine learning models can be made not only accurate but also explainable. In this article, you will learn what explainability is, which elements it consists of, and why we need expert knowledge to interpret machine learning results in order to avoid making the right decisions for the wrong reasons.

Machine learning has become an integral part of our daily lives – whether as an essential component of social media services or as a simple helper for personal optimization. For some time now, most areas of science have also been influenced by machine learning in one way or another, as it opens up possibilities to derive findings and discoveries primarily from data.

Probably the most common objective has always been the predictive accuracy of the models. With the emergence of complex models such as deep neural networks, however, a further goal has come into focus for scientific applications: explainability. This means that machine learning models should be designed in such a way that they not only provide accurate estimates but also allow an understanding of why specific decisions are made and why the model operates in a certain way. With this demand to move away from non-transparent black-box models, new fields of research have emerged, such as explainable artificial intelligence (XAI) and theory-guided/informed machine learning.

From transparency to explainability

Explainability is not a discrete state that either exists or does not exist, but rather a property that helps make results more trustworthy, allows models to be improved in a more targeted way, and can yield scientific insights that did not exist before. Key elements of explainability are transparency and interpretability. Transparency is comparatively easy to achieve: the creator describes and motivates the machine learning process. Even deep neural networks, often referred to as complete black boxes, are at least transparent in the sense that the relation between input and output can be written down in mathematical terms. The problem is therefore usually not that the model is inaccessible, but that the models are often too complex to fully understand how they work and how decisions are made. That is exactly where interpretability comes into play. Interpretability is achieved by transferring abstract and complex processes into a domain that a human can understand.

A visualization tool often used in the sciences is the heatmap. Heatmaps highlight parts of the input data that are salient, important, or sensitive to occlusions, depending on which method is used. They are displayed in the same space as the input, so when analyzing images, the heatmaps are images of the same size. Heatmaps can also be applied to other data as long as it lies in a human-understandable domain. One of the most prominent methods is layer-wise relevance propagation for neural networks, which is applied after the model has been trained. It uses the learned weights and the activations that result for a given input and propagates the output back into the input space. A different principle is pursued by model-agnostic approaches such as LIME (local interpretable model-agnostic explanations), which can be used with all kinds of methods – even non-transparent ones. The idea behind these approaches is to perturb the inputs and analyze how the output changes in response.
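To make the perturbation idea concrete, the following sketch (not from the original article) slides an occlusion patch across an image and records how much the model's score drops; the resulting grid can be rendered as a heatmap. The Classifier interface, the neutral grey fill value, and the patch size are hypothetical placeholders for whatever model and resolution are actually used.

import java.util.function.Function;

public class OcclusionHeatmap {

  /** Hypothetical model interface: maps an image (H x W greyscale) to a class score. */
  public interface Classifier extends Function<float[][], Float> {}

  /** Occludes one patch at a time and measures how much the score drops. */
  public static float[][] occlusionMap(Classifier model, float[][] image, int patch) {
    int h = image.length, w = image[0].length;
    float baseline = model.apply(image);
    float[][] heat = new float[h][w];
    for (int y = 0; y + patch <= h; y += patch) {
      for (int x = 0; x + patch <= w; x += patch) {
        float[][] occluded = copy(image);
        for (int i = y; i < y + patch; i++)
          for (int j = x; j < x + patch; j++)
            occluded[i][j] = 0.5f;                     // neutral grey patch
        float drop = baseline - model.apply(occluded); // large drop = important region
        for (int i = y; i < y + patch; i++)
          for (int j = x; j < x + patch; j++)
            heat[i][j] = drop;
      }
    }
    return heat;
  }

  private static float[][] copy(float[][] src) {
    float[][] out = new float[src.length][];
    for (int i = 0; i < src.length; i++) out[i] = src[i].clone();
    return out;
  }
}

If the regions with the largest score drops coincide with meaningful structures rather than with the background, that is a first indication that the model attends to relevant features.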

Nevertheless, domain knowledge is essential to achieve explainability for an intended application. Even if the processes in a model can be explained from a purely mathematical point of view, integrating knowledge from the respective application remains indispensable, not least to assess how meaningful the results are.

Explainable machine learning in the natural sciences

The possibilities this opens up in the natural sciences are wide-ranging. In the biosciences, for example, the identification of individual whales from photographs plays an important role in analyzing their migration over time and space. Identification by an expert is accurate and based on specific features such as scars and shape. Machine learning methods can automate this process and are therefore in great demand, which is why this task has also been posed as a Kaggle challenge. Before such a tool is actually used in practice, the quality of the model can be assessed by analyzing the derived heatmaps (Fig. 1).

In this way, it can be checked whether the model looks at relevant features in the image rather than insignificant ones such as the water. This allows the so-called Clever Hans effect to be ruled out, i.e. making the right decisions for the wrong reasons. It could occur, for example, if by chance a whale was always photographed with a mountain in the background and the identification algorithm falsely assumed this to be a feature of the whale. Human-understandable interpretations and their explanation by an expert are therefore essential for scientific applications, as they allow us to draw conclusions about whether the models operate as expected.

Figure 1: Heatmaps derived by Grad-CAM, an interpretation tool utilizing the gradients in a neural network. All four original images show the same whale at different points in time.

Much more far-reaching, however, is the application of explainable machine learning when the methods do not merely deliver what we expect but give us new scientific insights. A prominent approach is presented, for example, by Iten et al., in which physical principles are derived automatically from observational data without any prior knowledge. The idea behind this is that the learned representation of the neural network is much simpler than the input data, and that explanatory factors of the system, such as physical parameters, are captured in a few interpretable elements such as individual neurons.

In combination with expert knowledge, techniques such as neural networks can thus recognize patterns that help us discover things that were previously unknown to us.

Deep Learning: Not only in Python https://mlconference.ai/blog/deep-learning-not-only-in-python-ml-conference/ Tue, 28 Jan 2020 13:17:04 +0000 Although there are powerful and comprehensive machine learning solutions for the JVM with frameworks such as DL4J, it may be necessary to use TensorFlow in practice. This can, for example, be the case if a certain algorithm exists only in a TensorFlow implementation and the effort to port it to another framework is too high. Although you interact with TensorFlow via a Python API, the underlying engine is written in C++. Using the TensorFlow Java wrapper library, you can train TensorFlow models and run inference on them from the JVM without having to rely on Python. Existing interfaces, data sources, and infrastructures can be integrated with TensorFlow without leaving the JVM.


AI and deep learning are certainly hot topics at the moment and despite some initial setbacks, e.g. in the field of self-driving cars, the potential of deep learning is far from exhausted. But there are still many areas of IT in which the topic is only just gaining momentum. It is therefore particularly important to investigate how deep learning systems can be implemented on the JVM, as Java (both the language and the platform) is still the dominant technology in the enterprise sector.

TensorFlow is one of the most important frameworks in the field of deep learning. Despite the increasing popularity of Keras, it is still hard to imagine doing without TensorFlow, especially as AI heavyweight Google continues to drive its development forward. This article shows how TensorFlow can be used on the JVM to train TensorFlow models and run inference on them.

What is the combination of TensorFlow and JVM suitable for?

DL4J is the only professional deep learning framework that is really at home on the JVM, so if you would like to use deep learning on the JVM, DL4J is usually the best choice. TensorFlow – like many machine learning frameworks – is mainly used with Python. However, there are reasons to use TensorFlow within a JVM context:

  • You want to use a process which has an implementation in TensorFlow, but not in DL4J, and the porting effort is too high.
  • You are working with a data science team that is used to working with TensorFlow and Python, but the target infrastructure runs on the JVM.
  • The data needed for training lies within a Java infrastructure (databases, custom data formats, APIs), and to get at that data, existing interface code would have to be ported from Java to Python.

The JVM TensorFlow combination is therefore always useful if an existing Java environment is available and, for personnel or project related reasons, TensorFlow has to be used for deep learning (see the box: “TensorFlow and JVM – always a good idea?”).

TensorFlow and JVM – always a good idea?

Although there may be good reasons for this combination, it is also important to mention what may speak against it. Especially the choice of TensorFlow should be well considered:

  • TensorFlow is not a suitable framework for deep learning or machine learning beginners.
  • TensorFlow is not user-friendly: The API changes quickly and in the mass of instructions it is often not clear which path is the best.
  • TensorFlow isn’t better just because it is made by Google: Deep learning is math, and math is the same for everyone. TensorFlow does not create “smarter” AIs than other frameworks. It is also not faster than the alternatives (but also not “dumber” or slower).

If you want to get into deep learning and stay on the JVM, using DL4J is absolutely recommended. Especially for professional enterprise projects, DL4J is a good choice. But if you want to look over the fence and try out a bit of Python, it is worth trying out the TensorFlow alternatives. Here, you are currently better off with Keras, thanks to a much more convenient API.

How does TensorFlow work?

Before you start to use a new framework, it is important to take a look at what happens under the hood (see the box: “TensorFlow cheat sheet”). When thinking of TensorFlow, the first things that come to mind are AI and neural networks. From a technical point of view, however, TensorFlow is mainly a framework that can execute complex, iterative, parallel calculations on tensors – GPU-accelerated where possible. Although deep learning is the main field of application for TensorFlow, it can also be used for any other calculation.

A TensorFlow program – or better: the configuration of a calculation – is always structured like a graph in TensorFlow. The nodes of the graph represent operations, such as adding or multiplying, but also loading and saving. Everything that TensorFlow does takes place in the nodes of a previously defined calculation graph. The nodes (operations) of the graph are connected by edges through which the data flows in the form of tensors. Hence the name TensorFlow.

All calculations in TensorFlow take place in a so-called session. In the session, either a finished graph is loaded, or a new graph is created piece by piece by API calls. Special nodes in the graph can contain variables; in order for the graph to work, these must be initialized. Once this has happened and a session with a finished, initialized graph exists, TensorFlow interacts only by calling operations in the graph. What is calculated depends on which output nodes of the graph are queried. Thus, not the entire graph is executed, but only the operations that provide input for the queried node, then their input nodes, and so on, back to the input operations, which must be filled with the necessary input tensors. The important thing with TensorFlow is that all operations are automatically differentiated for the user – this is needed for the training of neural networks. However, the user can safely ignore this detail, since it happens automatically.

Usually, the graph is defined via a Python API. It can be represented graphically with auxiliary tools (Fig. 1), but such representations only serve for debugging; the graph is not programmed graphically as in a visual programming language such as LabVIEW.


Fig. 1: A (small) section of a TensorFlow graph: The numbers on the edges indicate the size of the tensor flowing through them, the arrows indicate the direction.

Although, in most examples, Python is used to interact with TensorFlow, the actual engine is written in C/C++. Therefore, you can use TensorFlow with any language that can call C functions. Thus, you can also perform calculations in TensorFlow from the JVM.

TensorFlow cheat sheet

  • Tensor: The basis for calculations in TensorFlow. A tensor is actually an object from linear algebra, but for our purposes it is completely sufficient to consider a tensor as a multidimensional array (mostly of float or double values, sometimes also char or boolean). TensorFlow uses tensors for everything. All data that TensorFlow consumes, produces, and uses internally is packaged in tensors – hence the name.
  • Graph: The definition of TensorFlow calculation procedures is usually stored in a file called graph.pb in a ProtoBuf binary format, similar to a Java .class file.
  • Training: When training a machine learning method, data and expected results are presented to the algorithm over and over again, whereupon the algorithm adjusts the internal parameters of the model to improve the result. Sometimes this is called “learning”, although it has little to do with human learning.
  • Inference: Depending on the application, you may want to use a machine learning process to classify, predict, translate, create content, and much more. All these applications are summarized under the term inference. Inference therefore simply means using a trained procedure to obtain a result; this is what we want to do most of the time in live use after training. During inference, a procedure does not learn.
  • Model: the learned parameters of a machine learning procedure, for example, a neural net. This is the result of the learning process and necessary to obtain results (the variable state of the graph, so to speak). It is distributed over several files and stored in one *.index and several *.data files, for example, *.data-0000-of-0001. The first number indicates the consecutive number of the file, the second the total number.
  • Session: the context in which TensorFlow is executed – comparable to a running JVM instance. In order to use TensorFlow, we need to create a session in which a graph is loaded and initialized with a model, just as a JVM instance must be started in which classes are loaded and instantiated with constructor parameters.

TensorFlow training and inference with Python

The training of a TensorFlow model with Python (box: “tf.data or feeding?”) can be separated into the following steps:

  • Create the graph, either via several API calls that compose the graph or through loading a *.pb file that contains the graph
  • Create a session for the graph
  • Initialize the graph variables, either by calling a special operation in the graph which fills the variables with default values or by loading a pre-trained model

After these three steps, we have an executable TensorFlow session with a functioning model. If we want to (further) train it, the following three steps are always executed in a loop until the model has learned enough – either by defining a fixed number of training steps beforehand or by waiting until the training error drops below a certain level:

  • Package the input data in arrays and assign it to the input tensors
  • Select the output nodes and pack them into a list
  • Execute the session: a special command causes the session to perform the necessary operations to generate the selected output

But where does the training take place? It is done by executing the correct output nodes. For TensorFlow there is no difference between training and inference; mathematical operations are simply performed in the calculation graph. We speak of training if these operations lead to a neural network learning a better weighting to solve a problem. However, the API calls for training and any other type of usage are the same.

During training, our input consists of the data that is to be learned (for example, an image as a two-dimensional tensor and the label “dog” or “cat” in the form of an integer ID in a zero-dimensional tensor). By running the correct nodes, TensorFlow updates some variables in the graph to improve the prediction. The main difference between training and inference is that during training we periodically save the current state of the graph variables – which are constantly changing – while this is pointless during inference because they remain constant.

tf.data or feeding?

There are two possibilities to load training data into the graph when you train a TensorFlow model in Python:
the tf.data API or so-called “feeding”, i.e. the transfer of individual data for each calculation step. The tf.data API is implemented internally in C++, integrated directly into the graph, and therefore very fast – but also complicated to use and very difficult to debug. The feeding method is easy to use and understand, but you need Python code at runtime. In that case, Python usually becomes the bottleneck, and the expensive graphics card sits idle because valuable GPU capacity is not used. But which approach do we take in Java? Fortunately, Java is orders of magnitude faster than Python, so here we get the best of both worlds: we can use the easy-to-understand feeding method and still get full performance. That is why we leave the tf.data API out of this article; we just don’t need it.

The TensorFlow Java API

Since TensorFlow is implemented internally in C/C++, all operations that are used in Python for training or inference can also be called via JNI. Fortunately, we no longer have to bother wrapping the low-level C API with JNI ourselves, as Google has already done this for us. The necessary libraries are, as usual, available on Maven Central. There are four different artifacts, all in the group org.tensorflow:

  • tensorflow: A metapackage with dependencies on libtensorflow and libtensorflow_jni; in order to avoid confusion, it should not be used.
  • libtensorflow: The API against which you program in Java; this is the compile and runtime dependency and the central entry point.
  • libtensorflow_jni: Contains the native CPU dependencies for libtensorflow; this artifact is needed at runtime when using a machine without GPU; it contains native code for Windows, Linux and Mac; TensorFlow is completely included, you don’t have to install Python or TensorFlow on the running system.
  • libtensorflow_jni_gpu: The GPU equivalent of libtensorflow_jni; you should use this dependency if you have a computer with an NVIDIA GPU and CUDA and cuDNN are installed correctly; it only works under Windows and Linux, as there is no GPU support for TensorFlow under macOS.

The version numbers of the Java wrappers correspond to the version of the TensorFlow build they include. We should always use the newest stable release here. We only have to pay attention if the code is supposed to be executed on a computer with a GPU (box: “Selecting the GPU to be used”). Not every TensorFlow version supports every CUDA and cuDNN version (CUDA is NVIDIA’s platform for running parallel calculations on graphics cards, cuDNN is a CUDA-based library for deep neural networks). We must ensure that the CUDA and TensorFlow versions match. At the time of writing, all TensorFlow versions from 1.13 on support the same CUDA version: 10.0. With a Java-based solution, we already have a great advantage over Python software when installing the finished product: thanks to Maven, the resulting artifact already includes all dependencies. Neither Python nor TensorFlow nor any Python libraries have to be pre-installed, and no installations have to be managed with a tool like Anaconda.

You should not use the top-level dependency tensorflow; it is better to use libtensorflow directly together with one of the *_jni implementations. The reason is that the tensorflow artifact depends on libtensorflow_jni (the CPU variant). If we then add libtensorflow_jni_gpu, the native CPU code is still used, and one wonders why everything runs so slowly despite the GPU. The Gradle dependencies for TensorFlow training on the GPU look like this:

compile "org.tensorflow:libtensorflow:1.14.0"
runtimeOnly "org.tensorflow:libtensorflow_jni_gpu:1.14.0"

The Java API required for training and inference is small and manageable. Only four classes are important: Graph, Session, Tensor, and Tensors. We can see how to use them correctly by rebuilding the Python-typical training steps in Java.
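Before going through the individual training steps, here is a minimal, hedged sketch of how these four classes fit together in plain Java. The file name graph.pb and the node names "inputs" and "prediction" are placeholders for whatever names were chosen when the graph was built in Python.

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.Tensors;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class MinimalTensorFlowExample {
  public static void main(String[] args) throws Exception {
    byte[] graphDef = Files.readAllBytes(Paths.get("graph.pb"));
    try (Graph graph = new Graph()) {
      graph.importGraphDef(graphDef);                 // rebuild the graph inside the JVM
      try (Session session = new Session(graph);
           Tensor<?> input = Tensors.create(new float[][]{{0.1f, 0.2f, 0.3f}})) {
        // (the graph variables must already have been initialized or restored, see below)
        // Only the operations needed to compute "prediction" are executed
        List<Tensor<?>> results = session.runner()
            .feed("inputs", input)
            .fetch("prediction")
            .run();
        try (Tensor<?> prediction = results.get(0)) {
          // assumes "prediction" yields a scalar int32 class index
          System.out.println("Predicted class: " + prediction.intValue());
        }
      }
    }
  }
}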

TensorFlow training in Java

The first step in training is to define the graph. Unfortunately, right at the beginning we have to make the first – and only – compromise. A graph can in principle also be built step by step using the Java API, but for many node types the Python API automatically generates helper nodes that are required for the frictionless use of the graph. To rebuild this in Java, we would need very detailed knowledge of the Python API internals. This step is therefore done once in advance in Python. We then store the resulting graph file as a Java resource in order to load it back into the JVM later. Saving the current graph in Python is very easy:

with open(filename, 'wb') as f:
  f.write(tf.get_default_graph().as_graph_def().SerializeToString())

Important: even though the method used here is called SerializeToString(), the result is still a binary file. For our convenience, we should also save the initialized variables at this point. Although initializing the variables in the graph from the JVM would be easy, always following the procedure shown here makes it easier to do transfer training with complex models later, i.e. to further train and adapt an already existing state of a model (Listing 1).

Listing 1

# This Python command creates a node for initialization
init_op = tf.global_variables_initializer()
# The saver is an auxiliary class that stores a model in Python.
saver = tf.train.Saver()
# Save is a graph operation
# and can only be executed in one session
with tf.Session() as sess:
  # Initializing Variables
  sess.run(init_op)
  # Save state
  save_path = saver.save(sess, filename)

Now we have saved the graph and the model and can train it in Java and execute the graph. For the sake of brevity, the following examples are in Kotlin but can be transferred to any JVM language:

//create empty graph
val graph = Graph()
// load pb file - either from a file or from resources
val graphDefBytes = javaClass.getResource(resourceName).readBytes()
//reconstruct graph from file
graph.importGraphDef(graphDefBytes)

Now we have loaded the TensorFlow graph into the JVM. In order to do something with it, we need a session:

val session = Session(graph)

We only have to load the latest version of the variables before we can really get started. This can be either the file initially saved in Python or the last state of a previous training run, for example, in order to continue that training. Loading variables is just another operation in the TensorFlow graph, and this operation needs a string packed into a tensor. The string contains the name of the *.index file without the suffix, so foo instead of foo.index.

Here, we need the Tensors class for the first time. It contains helper functions to package Java data types into Tensor objects and automatically takes care that the tensor gets the correct shape. Important for every Tensor object: it holds memory that has been allocated outside the JVM. It must therefore be closed manually, for which it implements the AutoCloseable interface. In Java, a separate try { … } finally { tensor.close(); } block must be created for each tensor. Fortunately, this is much easier in Kotlin with use:

Tensors.create(path).use { pathTensor ->
  session.runner().feed("save/Const", pathTensor)
                  .addTarget("save/restore_all")
                  .run()
}

Here we can see all necessary parts of a TensorFlow action on the JVM:

  • A runner is created for the session; this class has a builder API that defines what is supposed to be executed.
  • The input node for the loading and saving (“save/Const”) is filled with the tensor which contains the file name.
  • The target node is defined as the target for loading.
  • The action is executed.

The trick for all operations is to know their names. But since we build the graph ourselves beforehand and can define the name of a node at creation time, we can choose the names ourselves. Exceptions are the nodes for loading and saving, which always have the names stated here.
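Saving a checkpoint from the JVM works in exactly the same way, just in the opposite direction. The following fragment is a hedged sketch that assumes the graph was created with a default tf.train.Saver, whose save operation is usually reachable under a node name such as save/control_dependency; verify the actual node names of your graph before relying on them.

// Hedged sketch: persist the current variable state from the JVM.
// "save/Const" and "save/control_dependency" are the names a default tf.train.Saver
// typically creates; check them against your own graph.
try (Tensor<String> pathTensor = Tensors.create("checkpoints/model")) {
  session.runner()
         .feed("save/Const", pathTensor)
         .addTarget("save/control_dependency")
         .run();
}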

Selecting the GPU to be used

Sometimes we don’t want to block all GPUs on systems with multiple GPUs, for example, in order to run multiple trainings in parallel. For this, we could configure the TensorFlow graph, which normally allocates the GPU or GPUs automatically, so that only one GPU is used. This has the big disadvantage, though, that the graph is then “hard-wired” to a certain GPU and can only be used on that GPU. It is much more convenient to show or hide the GPUs via an environment variable before starting the JVM. This can easily be done with the environment variable CUDA_VISIBLE_DEVICES, which takes a comma-separated list of the CUDA devices that should be visible in the current shell. Caution: the numbering starts at 0, not at 1. The following console command, for example, makes only the second graphics card visible to TensorFlow (or other deep learning frameworks):

export CUDA_VISIBLE_DEVICES=1

Now we have seen all the operations needed to interact with TensorFlow from the JVM. Carrying out a training step is very easy. Let’s assume that our input is an array of loaded images. The black-and-white values of the pixels are converted to float values in the range 0–1. Each image belongs to a class defined by an int value, for example, 0 = dog, 1 = cat. The input for a batch (multiple images are always trained at once) is then a float[][] array containing the images and an int[] array containing the classes to learn. A training step can now be executed as follows (Listing 2).

Listing 2

fun train(inputs: Array<FloatArray>, labels: IntArray) {
  withResources {
    val results: List<Tensor<*>> = session.runner()
      .feed("inputs", Tensors.create(inputs).use())
      .feed("labels", Tensors.create(labels).use())
      .fetch("total_loss:0")
      .fetch("accuracy:0")
      .fetch("prediction")
      .addTarget("optimize").run().useAll()
    val trainingError = results[0].floatValue()
    val accuracy      = results[1].floatValue()
    val prediction    = results[2].intValue()
  }
}

We see the same pattern again: a runner is created, the inputs are packaged into tensors, the target is selected (“optimize”) and the action is executed. But now there is something new: we get values back. The names of the nodes whose values are to be returned are defined with fetch. These names carry a suffix, “:0”. It marks nodes with multiple outputs; the :0 suffix means that the output with index 0 of the node should be returned.

The output is a list of Tensor objects. These can be converted into various primitive types and arrays to make the result available. Important: the Tensor objects created by the API also have to be closed. Normally, the entries in the list would have to be iterated over and closed in a finally block. However, this is very inconvenient and hard to read. Therefore, it is useful to define an extended use API in Kotlin, with which several objects within a block are marked with use or useAll (for lists of Closeables) and are then closed safely (Listing 3).

Listing 3

class Resources : AutoCloseable {
  private val resources = mutableListOf<AutoCloseable>()

  fun <T: AutoCloseable> T.use(): T {
    resources += this
    return this
  }

  fun <T: Collection<AutoCloseable>> T.useAll(): T {
    resources.addAll(this)
    return this
  }

  override fun close() {
    var exception: Exception? = null
    for (resource in resources.reversed()) {
      try {
        resource.close()
      } catch (closeException: Exception) {
        if (exception == null) {
          exception = closeException
        } else {
          exception.addSuppressed(closeException)
        }
      }
    }
    if (exception != null) throw exception
  }
}

inline fun <T> withResources(block: Resources.() -> T): T = 
  Resources().use(block)

This useful trick allows you to close all tensors within a TensorFlow call conveniently and safely. Inference from the JVM then becomes really easy. Remember: every action on the TensorFlow graph is performed by filling input nodes with input tensors and querying the correct output nodes. For our example above this means that the code stays the same, except that we no longer feed the inputs for the correct solutions (labels). This makes sense because we do not know them yet. In the output, we do not fetch the nodes for the error calculation and the update of the neural net (total_loss:0, accuracy:0, optimize), so the network does not learn. Instead, we only query the result (prediction). Since the input of the solutions is not necessary for the calculation of the result, everything works just like before: there is no error, because the part of the graph that trains the neural net simply remains inactive.
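A hedged sketch of such an inference call is shown below, this time in plain Java. It reuses the node names from the training example and assumes that the prediction node emits one int32 class index per batch entry.

/**
 * Inference sketch: same graph and node names as in the training example,
 * but we feed only the images and fetch only the prediction node.
 */
public int[] predict(Session session, float[][] inputs) {
  try (Tensor<?> inputTensor = Tensors.create(inputs)) {
    List<Tensor<?>> results = session.runner()
        .feed("inputs", inputTensor)
        .fetch("prediction")
        .run();
    try (Tensor<?> prediction = results.get(0)) {
      int[] classes = new int[inputs.length];
      prediction.copyTo(classes);   // assumes shape [batchSize] and int32 values
      return classes;
    }
  }
}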

Practical experiences

The method presented here is not only an interesting experiment; the author has already used it successfully in several commercial projects. Several advantages have emerged in practical use:

    • The Java API is fast and efficient: There is no performance loss compared to the pure Python application. On the contrary: Since Java is much faster than Python for tasks like data import and pre-processing, it is even easier to implement a high-performance training process.
    • The training runs absolutely stably for days on end; Google’s Java implementation has proven to be very reliable.
    • The deployment of the finished product is much easier than that of Python-based products, since only a Java runtime environment and the correct CUDA drivers need to be present – all dependencies are part of the Java TensorFlow library.
    • TensorFlow’s low-level persistence API (as presented here) is easier to use than many of the “official” methods, such as estimators.

The only real drawback is that part of the project is still Python-based – the definition of the graph. So you need a team that is at least partly at home in the Python world.

Deep Learning with Java: Introduction to Deeplearning4j https://mlconference.ai/blog/exploiting-deep-learning-the-most-important-bits-pieces/ Mon, 06 May 2019 10:28:26 +0000 Deep learning is now often considered the "holy grail" when it comes to developing intelligent systems. While fully automatic and autonomous machine learning is on the way, current solutions still require the understanding of a software developer or engineer. Deep learning, by contrast, is a sub-discipline of machine learning that promises far-reaching learning success without human intervention and is oriented towards the function and operation of neural networks in the human brain.


Exploiting Deep Learning: the most important bits and pieces

Machine learning in general refers to data-based methods of artificial intelligence: a computer learns a model based on sample data. Artificial intelligence plays a significant role in human-machine interaction. An example of this is the Zeno robot shown in Figure 1. It is a therapy tool for autistic children that helps them express and understand their emotions better. Zeno recognizes the emotion of its counterpart based on speech and facial expression and reacts accordingly. For this purpose, the recorded sensor data must be analyzed in real time by the machine learning process.

Fig. 1: The robot Zeno is used for therapy with autistic children

Deep learning is based on networks of artificial neurons that have input and output neurons as well as multiple layers of intermediate neurons (hidden layers). Each neuron processes an input vector with a mechanism similar to that of human nerve cells: a weighted sum of all input values is calculated, and the result is transformed with a non-linear function, the so-called activation function. The input neurons take in the data, such as unprocessed audio signals, and feed it into the neural network. The data passes through the intermediate neurons of all hidden layers and is processed along the way. The processed signals and calculated results are then emitted via the output neurons, which deliver the final result. The parameters of the individual neurons are determined during the training of the network using the training data. The greater the number of neurons and layers, the more complex the problems that can be dealt with.
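Expressed in code, a single artificial neuron is nothing more than a weighted sum followed by a non-linear activation. The following minimal sketch uses the hyperbolic tangent as activation function; the concrete choice of activation is, of course, up to the network designer.

/** Minimal sketch of one artificial neuron: weighted sum plus non-linear activation. */
public static double neuron(double[] inputs, double[] weights, double bias) {
  double sum = bias;
  for (int i = 0; i < inputs.length; i++) {
    sum += weights[i] * inputs[i];   // weighted sum of all input values
  }
  return Math.tanh(sum);             // non-linear activation function
}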

In principle, a larger amount of data also leads to more robust models (as long as the data is not unbalanced). If there is not enough data and the selected network architecture is too complex, there is a risk of overfitting: the model parameters are optimized too closely to the given data during training, so the model no longer generalizes sufficiently and does not work well on independent test data. Learning tasks are typically tackled with one of three learning methods: (1) supervised learning, (2) semi-supervised learning, and (3) unsupervised learning.

In supervised learning, a model is trained that can approximate one or more target variables from a set of annotated data. If the target variable is continuous, we speak of regression; in the case of discrete target values, of classification. For classification problems with more than two classes, neural networks normally use as many neurons in the output layer as there are classes. The neuron that shows the highest activation for given input values then indicates the class that the network considers most probable.

Semi-supervised learning is a variant of supervised learning that uses both annotated and unannotated training data. The combination of this data can greatly improve the learning accuracy when the learning process is monitored by an expert. This learning method is also referred to as cooperative learning because the artificial neural network and the human work together: if the neural network cannot classify specific data with high confidence, it needs the help of an expert for annotation.

Unlike the other two learning methods, unsupervised learning only has input data and no associated output variables. Since there are no right or wrong answers and no one supervises the behaviour of the system, the algorithms have to discover and present relevant structures in the data on their own. The most commonly used unsupervised learning method is clustering. The goal of clustering algorithms is to find patterns or groupings in the dataset; the data within one grouping then has a higher degree of similarity than data in other clusters.

Deep Learning with Java

Deep learning approaches are considered state of the art in various areas of machine learning, such as audio processing (speech or emotion recognition), image processing (object classification or facial recognition), and text processing (sentiment analysis or natural language processing). To simulate the neural networks, program libraries for machine learning are often used. Most mature libraries, such as TensorFlow, Caffe, or Theano, were written in the Python and C++ programming languages. With Deeplearning4j [1], however, there is also a Java-based deep learning platform that can bridge the gap between the aforementioned Python-based program libraries and Java.

Deeplearning4j is mostly implemented in C and C ++ and uses CUDA to offload the calculations to a compatible NVIDIA graphics processor. The programmer has various architectures available, including CNNs, RNNs and auto-encoders. Likewise, models that have been created with the mentioned tools can be imported.

Essentially, this article addresses the use of deep learning for pattern recognition, such as in computer perception, using the example of learning audio feature representations using Convolutional Neural Networks (CNNs) and Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs).

Audio plots (spectrograms) are generated from the audio signals. They are then used as input to the pre-trained CNN, and the activations of the last fully connected layer with 4096 neurons are extracted as deep spectrum features. This leads to a large feature vector that is eventually used for classification (Figure 2).

Fig. 2: Deep learning system for classifying audio signals using a CNN pre-trained on a million images

Convolutional Neural Networks

The presentation of the data that is put into the neural network is crucial. Signals, including two-dimensional signals such as image data, can be fed directly into the neural network; in the case of image data, this means the colour values of the individual pixels. However, this processing is not shift-invariant: shifting an object in an image by the width of a single pixel results in the image information taking a completely different path through the neural network. Some degree of shift invariance can be achieved with CNNs.

CNNs perform a convolution operation, weighting the neighborhood of a signal with a convolution kernel and adding the products together. The weights of the convolution kernel are learned during training and are constant over all areas of an image. For each pixel, several convolution operations are normally performed, creating so-called feature maps. Each feature map contains information about specific edge types or shapes in the input image, so each convolution kernel specializes in a specific local image pattern. In order to improve the shift invariance and to compress the image information that is initially blown up by a CNN layer, convolutional layers are normally used in combination with a subsequent max-pooling layer. This layer selects only the largest activation from a (mostly) 2 x 2 neighborhood and propagates it to the subsequent network layer.

CNNs typically consist of a series of several convolutional and max-pooling layers and are completed by one or more fully connected layers. Although CNNs can also be applied to one-dimensional signals, they are most commonly used for the classification of images and have greatly improved the state of the art in this area. A standard problem is the recognition of handwritten digits, for which the error rate on the test data of the MNIST standard data set has been reduced to below 0.3 percent [2].
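In Deeplearning4j, such a stack of convolutional, max-pooling, and fully connected layers can be configured in a few lines. The following sketch is a minimal LeNet-style configuration for 28 x 28 greyscale images such as MNIST; the layer sizes and hyperparameters are illustrative values, not taken from the article.

// Sketch of a small CNN configuration in Deeplearning4j (old-style builder API,
// matching the listing later in this article). Additional imports needed:
// org.deeplearning4j.nn.conf.layers.* and org.deeplearning4j.nn.conf.inputs.InputType.
MultiLayerConfiguration cnnConf = new NeuralNetConfiguration.Builder()
  .seed(123)
  .learningRate(0.001)
  .updater(Updater.ADAM)
  .weightInit(WeightInit.XAVIER)
  .list()
  // convolution + max pooling, twice
  .layer(0, new ConvolutionLayer.Builder(5, 5).nIn(1).nOut(20)
    .stride(1, 1).activation(Activation.RELU).build())
  .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
    .kernelSize(2, 2).stride(2, 2).build())
  .layer(2, new ConvolutionLayer.Builder(5, 5).nOut(50)
    .stride(1, 1).activation(Activation.RELU).build())
  .layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
    .kernelSize(2, 2).stride(2, 2).build())
  // fully connected part with a softmax classifier
  .layer(4, new DenseLayer.Builder().nOut(500).activation(Activation.RELU).build())
  .layer(5, new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD)
    .nOut(10).activation(Activation.SOFTMAX).build())
  .setInputType(InputType.convolutionalFlat(28, 28, 1))
  .build();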

Since very large amounts of data and long computation times are required for the training of complex neural networks, pre-trained networks have enjoyed great popularity in recent years. An example of such a network is AlexNet, which was trained on the ImageNet image database of more than one million images in a thousand categories. The network has eight layers, of which the first five are convolutional layers. Such a neural network can be used not only for the classification of the thousand pre-trained categories but also for the classification of further objects or image classes by re-training the last layer (or the last layers) with image examples from the desired categories while leaving the weights in the previous layers constant. The advantage is that robust classifiers can be generated even with a much smaller number of training examples. Such a procedure, in which we make use of models from another domain or problem definition, is referred to as transfer learning. At the Interspeech Conference 2017, a prestigious international conference, we presented a CNN pre-trained for image recognition for audio classification [3]. Figure 2 shows the structure of the presented approach.
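With Deeplearning4j's model zoo, the feature-extraction part of such a transfer-learning setup can be sketched roughly as follows. This is not the original pipeline from [3]; the zoo's VGG16 model, the layer name "fc2", and the 224 x 224 input size are assumptions that should be checked against the DL4J version actually used.

// Hedged sketch: extract the activations of the last 4096-unit dense layer of a
// pretrained image CNN as "deep spectrum" features for a spectrogram image.
ZooModel zooModel = VGG16.builder().build();                    // assumed DL4J model zoo API
ComputationGraph vgg16 = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);

NativeImageLoader loader = new NativeImageLoader(224, 224, 3);  // VGG16 expects 224 x 224 RGB input
INDArray image = loader.asMatrix(new File("spectrogram.png"));
new VGG16ImagePreProcessor().transform(image);                  // ImageNet mean subtraction

// Forward pass; "fc2" is assumed to be the name of the last fully connected layer
Map<String, INDArray> activations = vgg16.feedForward(image, false);
INDArray deepSpectrumFeatures = activations.get("fc2");         // feature vector for a downstream classifier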

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are suitable for modeling sequential data, i.e. series of data points, mostly over time, such as audio and video signals, but also physiological measurements (such as electrocardiograms) or stock prices. In contrast to feedforward networks such as CNNs, an RNN also has feedback connections to itself or to other neurons. Each passing of an activation to another neuron is understood as a time step, so an RNN can implicitly store data over an arbitrary period of time.

During the training of RNNs, i.e. when optimizing the weights of the neurons, the error gradients have to be propagated back through several layers as well as over a large number of time steps, and they are multiplied in each step; as a result, they gradually vanish (the vanishing gradient problem) and the respective weights are not sufficiently optimized. LSTMs solve this problem by introducing so-called LSTM cells. They were presented at the Technical University of Munich in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [4]. LSTMs are able to store activations over a longer number of time steps. This is achieved through a combination of multiplicative gates: an input gate, an output gate, and a forget gate, which in turn consist of neurons whose weights are trained. The gates determine which activations are passed into the cell and when (input gate), when and to what extent activations are output (output gate), and when and whether the stored activation is cleared (forget gate). Gated Recurrent Units (GRUs) are a further development of LSTMs: they do without an output gate and are thus faster to train, yet still offer similar accuracy.
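In the standard textbook formulation, the gate activations are sigmoid-weighted sums of the current input x_t and the previous output h_(t-1); this is the generic form, not the notation of any particular implementation:

i_t = σ(W_i ⋅ x_t + U_i ⋅ h_(t-1) + b_i)                              (input gate)
f_t = σ(W_f ⋅ x_t + U_f ⋅ h_(t-1) + b_f)                              (forget gate)
o_t = σ(W_o ⋅ x_t + U_o ⋅ h_(t-1) + b_o)                              (output gate)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ tanh(W_c ⋅ x_t + U_c ⋅ h_(t-1) + b_c)     (cell state)
h_t = o_t ⊙ tanh(c_t)                                                 (cell output)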

LSTMs and GRUs can work with different input data. Classically, time-dependent acoustic feature vectors have been extracted from audio signals. Typical features include the short-term energy in certain spectral bands or, in particular for speech signals, so-called Mel-frequency cepstral coefficients, which represent information from linguistic units in a compressed manner. In addition, the fundamental frequency of the voice or rhythmic features may be relevant for certain tasks. Alternatively, so-called end-to-end (E2E) learning has recently been used more and more: the feature extraction step is replaced by several convolutional layers of a CNN. The convolution kernels are one-dimensional for audio signals and may also be regarded as bandpass filters. The CNN layers are then preferably followed by LSTM or GRU layers to account for the temporal nature of the signal. E2E learning has been used successfully for emotion recognition in human speech and is currently the most important subject of research in automatic speech recognition (speech-to-text).

Fig. 3: A high-level overview of auDeep

The emotion recognition from voice recordings mentioned above, using the example of the robot, is a complex task. First of all, the robot has to work in a wide variety of acoustic environments – environments that it does not yet know from the training data. As shown in Figure 3, audio features vary greatly depending on the acoustic recording conditions and the respective speakers, possibly more than they vary between different emotions. First, Mel-spectrograms are extracted from the audio files (a). Subsequently, a recurrent sequence-to-sequence autoencoder is trained on these spectrograms, which are treated as time-dependent sequences of frequency vectors (b). After the autoencoder training, the learned representations are generated from the Mel-spectrograms and used as feature vectors for the corresponding instances (c). Finally, a classifier (d) is trained on the feature vectors. In practice, as a first step, either the audio features or the audio signal itself is usually cleaned, i.e. freed from interference; again, artificial neural networks, mostly RNNs, can be used for this purpose. Furthermore, ambient noise detection is often necessary, that is, a determination of the acoustic environment, so that the system can select the model that is optimal for the respective situation or adapt the model parameters accordingly. Finally, the actual emotion recognition is performed on the preprocessed speech.

One of the latest RNN-based developments for unsupervised learning is auDeep [5], [6]. The system is a sequence-to-sequence autoencoder that learns audio representations in an unsupervised manner from extracted Mel-spectrograms. Figure 3 shows an illustration of the structure of auDeep. Mel-spectrograms are treated as time-dependent sequences of frequency vectors in the interval [-1, 1]^(N_mel), each of which describes the amplitudes of the N_mel Mel-frequency bands within an audio segment. This sequence is fed to a multilayer RNN encoder, which updates its hidden state in each time step based on the input frequency vector. The last hidden state of the RNN encoder therefore contains information about the entire input sequence. This last hidden state is transformed using a fully connected layer, and another multilayer RNN decoder is used to reconstruct the original input sequence from the transformed representation.

The encoder RNN consists of N_layer layers, each containing N_unit GRUs. The hidden states of the encoder GRUs are initialized to zero for each input sequence, and their last hidden states in each layer are concatenated into a one-dimensional vector. This vector can be viewed as a fixed-length representation of a variable-length input sequence – with dimensionality N_layer ⋅ N_unit when the encoder RNN is unidirectional, and 2 ⋅ N_layer ⋅ N_unit if it is bidirectional.

The representation vector is then passed through a fully connected layer with hyperbolic tangent activation. The output dimension of this layer is chosen so that the hidden states of the RNN decoder can be initialized.

The RNN decoder contains the same number of layers and units as the RNN encoder. Its task is the reconstruction of the input Mel-spectrogram based on the representation with which the hidden states of the RNN decoder were initialized. At the first time step, a zero input is fed to the RNN decoder. During the subsequent time steps t, the expected decoder output at time t-1 is passed as input to the RNN decoder. Stronger representations could possibly be obtained by using the actual decoder output rather than the expected output, as this reduces the amount of information available to the decoder.

The outputs of the decoder RNN are passed through a single linear projection layer with hyperbolic tangent activation at each time step in order to map the decoder output to the target dimensionality N_mel. The weights of this output projection are shared across time steps. To introduce larger short-term dependencies between the encoder and the decoder, the RNN decoder reconstructs the reversed input sequence.

Autoencoder training is performed using the root mean square error (RMSE) between the decoder output and the target sequence as the objective function. Dropout is applied to the inputs and outputs of the recurrent layers, but not to the hidden states. Dropout corresponds to the random elimination of neurons during the learning iterations and enforces a regularization that makes individual neurons learn more independently of their neighborhood. Once training is completed, the activations of the fully connected layer are extracted as the learned spectrogram representations and passed on for a decision, such as classification. Figure 4 illustrates how the autoencoder has learned new representations in an unsupervised manner from the mixed spectrograms. Finally, the learned audio representations can be classified by means of an RNN. This is illustrated below with Deeplearning4j, which offers numerous libraries for modelling diverse neural networks.

Fig. 4: Visualization of a t-distributed stochastic neighbor embedding (t-SNE) of the spectrograms (a) and of the representations learned by a recurrent sequence-to-sequence autoencoder (b)

Finally, Listing 1 shows the implementation of an RNN with Graves LSTM cells for the classification of the feature vectors that we extracted with the unsupervised method (Figure 2). To train the LSTM network, a number of hyperparameters must be set. These include, for example, the learning rate of the network, the number of input and output neurons corresponding to the number of extracted features and classes, and a number of other parameters.

Listing 1

import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.deeplearning4j.eval.Evaluation;
import org.deeplearning4j.nn.api.OptimizationAlgorithm;
import org.deeplearning4j.nn.conf.*;
import org.deeplearning4j.nn.conf.layers.GravesLSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import java.io.File;

public class Main {
  public static void main(String[] args) throws Exception {
    int batchSize = 128; // set batchsize
     // Load training data
      RecordReader csvTrain = new CSVRecordReader(1, ",");
      csvTrain.initialize(new FileSplit(new File("src/main/resources/train.csv")));
      DataSetIterator iteratorTrain = new RecordReaderDataSetIterator(csvTrain, batchSize, 4096, 2);
      // Load evaluation data
      RecordReader csvTest = new CSVRecordReader(1, ",");
      csvTest.initialize(new FileSplit(new File("src/main/resources/eval.csv")));
      DataSetIterator iteratorTest = new RecordReaderDataSetIterator(csvTest, batchSize, 4096, 2);
   //****LSTM hyperparameters****
      int anzInputs = 4096;  // number of extracted features
      int anzOutputs = 2;  // number of classes
      int anzHiddenUnits = 200; // number of hidden units in each LSTM layer
      int backPropLaenge = 128; // Length for truncated back propagation over time
      int anzEpochen = 32; // number of training epochs
      double lrDecayRate = 10;  // Decline of the learning rate
  //****Network configuration****
      MultiLayerConfiguration netzKonfig = new NeuralNetConfiguration.Builder()
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .iterations(1)
        .learningRate(0.001)
        .seed(234)
        .l1(0.01) // least absolute deviations (LAD)
        .l2(0.01) // least squares error (LSE)
        .regularization(true)
        .dropOut(0.1)
        .weightInit(WeightInit.RELU)
        .updater(Updater.ADAM)
        .learningRateDecayPolicy(LearningRatePolicy.Exponential)
        .lrPolicyDecayRate(lrDecayRate)
        .list()
        .layer(0, new GravesLSTM.Builder().nIn(anzInputs).nOut(anzHiddenUnits)
          .activation(Activation.TANH).build())
        .layer(1, new GravesLSTM.Builder().nIn(anzHiddenUnits).nOut(anzHiddenUnits)
          .activation(Activation.TANH).build())
        .layer(2, new GravesLSTM.Builder().nIn(anzHiddenUnits).nOut(anzHiddenUnits)
          .activation(Activation.TANH).build())
        .layer(3, new RnnOutputLayer.Builder(LossFunction.MEAN_ABSOLUTE_ERROR)
          .activation(Activation.RELU).nIn(anzHiddenUnits).nOut(anzOutputs).build())
        .backpropType(BackpropType.TruncatedBPTT)
        .tBPTTForwardLength(backPropLaenge).tBPTTBackwardLength(backPropLaenge)
        .pretrain(true).backprop(true)
        .build();
      MultiLayerNetwork modell = new MultiLayerNetwork(netzKonfig);
      modell.init(); // initialization of the model 
      // Training after each epoch
      for (int n = 0; n < anzEpochen; n++) {
        System.out.println("Epoch number: " + (n + 1));
        modell.fit(iteratorTrain);
      }
    // Evaluation of the model
      System.out.println("Evaluation of the trained model ...");
      Evaluation Eval = new Evaluation(anzOutputs);
      while (iteratorTest.hasNext()) {
        DataSet data = iteratorTest.next();
        INDArray features = data.getFeatureMatrix();
        INDArray labels = data.getLabels();
        INDArray predicted = modell.output(features, false);
        Eval.eval(labels, predicted);
      }
      //****Show evaluation results****
      System.out.println("Accuracy:" + Eval.accuracy());
      System.out.println(Eval.confusionToString()); //Confusion Matrix
  }
}

Exploiting Deep Learning – Conclusion

Because of their special capabilities, deep learning methods will continue to dominate machine learning research and practice in the years to come. In recent years, a large number of companies have been founded that specialize in deep learning, while large IT companies such as Google and Apple are hiring experienced experts on a large scale. In the field of research, deep learning has meanwhile displaced a large part of classical signal processing and now dominates the field of data analysis.

Developers will increasingly have to deal with the integration of deep learning models. In the area of Java development, the Deeplearning4j toolkit presented here is a promising framework. As an example, the application of deep learning to audio analysis was shown. Following this principle and code, a multitude of related problems can be solved elegantly and efficiently. Artificial intelligence has again become the focus of general interest thanks to deep learning. It remains to be seen what new solutions and applications we will experience in the near future.

Links & literature


[1] Deeplearning4j: https://deeplearning4j.org/
[2] http://yann.lecun.com/exdb/mnist/
[3] Amiriparian, Shahin; Gerczuk, Maurice; Ottl, Sandra; Cummins, Nicholas; Freitag, Michael; Pugachevskiy, Sergey; Baird, Alice; Schuller, Björn: "Snore Sound Classification Using Image-based Deep Spectrum Features", in: Proceedings INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, pp. 3512–3516, ISCA, August 2017
[4] Hochreiter, Sepp; Schmidhuber, Jürgen: "Long Short-Term Memory", Neural Computation, 9 (8), pp. 1735–1780, 1997
[5] Amiriparian, Shahin; Freitag, Michael; Cummins, Nicholas; Schuller, Björn: "Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio", in: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), München, pp. 17–21, IEEE, November 2017
[6] auDeep: https://github.com/auDeep/auDeep/

 

 
