ML Archives - ML Conference https://mlconference.ai/tag/ml/ The Conference for Machine Learning Innovation
Building Ethical AI: A Guide for Developers on Avoiding Bias and Designing Responsible Systems https://mlconference.ai/blog/building-ethical-ai-a-guide-for-developers-on-avoiding-bias-and-designing-responsible-systems/ Wed, 17 Apr 2024 06:19:44 +0000
The intersection of philosophy and artificial intelligence may seem obvious, but there are many different levels to consider. We talked to Katleen Gabriels, Assistant Professor in Ethics and Philosophy of Technology and author of the 2020 book "Conscientious AI: Machines Learning Morals", about the intersection of philosophy and AI, the ethics of ChatGPT, AGI, and the singularity.

The post Building Ethical AI: A Guide for Developers on Avoiding Bias and Designing Responsible Systems appeared first on ML Conference.

devmio: Thank you for taking the time for the interview. Can you please introduce yourself to our readers?

Katleen Gabriels: My name is Katleen Gabriels, I am an Assistant Professor in Ethics and Philosophy of Technology at the University of Maastricht in the Netherlands, but I was born and raised in Belgium. I studied linguistics, literature, and philosophy. My research career started as an avatar in Second Life, the social virtual world. Back then I was a master's student in moral philosophy and I was really intrigued by this social virtual world that promised you could be whoever you want to be.

That became the research of my master thesis and evolved into a PhD project which was on the ontological and moral status of virtual worlds. Since then, all my research revolves around the relation between morality and new technologies. In my current research, I look at the mutual shaping of morality and technology. 

Some years ago, I held a chair at the Faculty of Engineering Sciences in Brussels and I gave lectures to engineering and mathematics students and I’ve also worked at the Technical University of Eindhoven.


devmio: Where exactly do philosophy and AI overlap?

Katleen Gabriels: That’s a very good but also very broad question. What is really important is that an engineer does not just make functional decisions, but also decisions that have a moral impact. Whenever you talk to engineers, they very often want to make the world a better place through their technology. The idea that things can be designed for the better already has moral implications.

Way too often, people believe in the stereotype that technology is neutral. We have many examples around us today, and I think machine learning is a very good one, showing that a technology's impact is highly dependent on design choices. Take the data set and the quality of the data: if you train your algorithm with only even numbers, it will not know what an odd number is. But there are older examples that have nothing to do with AI or computer technology. For instance, a revolving door does not accommodate people who need a walking cane or a wheelchair.

In my talks, I always share a video of an automatic soap dispenser that does not recognize black people’s hands to show why it is so important to take into consideration a broad variety of end users. Morality and technology are not separate domains. Each technological object is human-made and humans are moral beings and therefore make moral decisions. 

Also, the philosophy of mind deals very much with questions concerning intelligence and, since breakthroughs in generative AI like DALL-E, also with what creativity is. Another important question that we're constantly debating with new evolutions in technology is where the boundary between humans and machines lies. Can we be replaced by a machine, and to what extent?


devmio: In your book “Conscientious AI: Machines Learning Morals”, you write a lot about design as a moral choice. How can engineers or developers make good moral choices in their design?

Katleen Gabriels: It's not only about moral choices, but also about making choices that have an ethical impact. My most practical, hands-on answer would be that education for future engineers and developers should focus much more on these conceptual and philosophical aspects. Very often, engineers and developers are indeed thinking about values, but it's difficult to operationalize them, especially in a business context where it's often about "act now, apologize later". Today we see a lot of attempts at collaboration between philosophers and developers, but that very often remains just a theoretical idea.

First and foremost, it's about awareness that design choices are not just neutral choices that developers make. Many designers have come to regret their designs years later. Chris Wetherell is a nice example: he designed the retweet button and initially thought its effects would only be positive because it can amplify the voices of minorities. And that's true in a way, but of course, it has also contributed to fake news and to polarization.

Often, people tend to underestimate how complex ethics is. I exaggerate a little bit, but when I teach engineers, I often find they take a very binary approach to things. There are always some students who want to make a decision tree out of ethical decisions. But often values clash with each other, so you need to find a trade-off. You need to incorporate the messiness of stakeholders' voices, and you need time for reflection, debate, and good arguments. That complexity of ethics cannot be transferred into a decision tree.

If we really want to think about better and more ethical technology, we have to reserve a lot of time for these discussions. I know that when working for a highly commercial company, there is not a lot of time reserved for this.

devmio: What is your take on biases in training data? Is it something that we can get rid of? Can we know all possible biases?

Katleen Gabriels: We should be aware of the dynamics of society, our norms, and our values. They’re not static. Ideas and opinions, for example, about in vitro fertilization have changed tremendously over time, as well as our relation with animal rights, women’s rights, awareness for minorities, sustainability, and so on. It’s really important to realize that whatever machine you’re training, you must always keep it updated with how society evolves, within certain limits, of course. 

With biases, it’s important to be aware of your own blind spots and biases. That’s a very tricky one. ChatGPT, for example, is still being designed by white men and this also affects some of the design decisions. OpenAI has often been criticized for being naive and overly idealistic, which might be because the designers do not usually have to deal with the kind of problems they may produce. They do not have to deal with hate speech online because they have a very high societal status, a good job, a good degree, and so on.

devmio: In the case of ChatGPT, training the model is also problematic. In what way?

Katleen Gabriels: There’s a lot of issues with ChatGPT. Not just with the technology itself, but things revolving around it. You might already have read that a lot of the labeling and filtering of the data has been outsourced, for instance, to clickworkers in Africa. This is highly problematic. Sustainability is also a big issue because of the enormous amounts of power that the servers and GPUs require. 

Another issue with ChatGPT has to do with copyright. There have already been very good articles about the arrogance of Big Tech because their technology is very much based on the creative works of other people. We should not just be very critical about the interaction with ChatGPT, but also about the broader context of how these models have been trained, who the company and the people behind it are, what their arguments and values are, and so on. This also makes the ethical analysis much more complex.

The paradox is that on the Internet, with all our interactions, we become very transparent to Big Tech companies, while they in turn remain very opaque about their decisions. I've also been amazed but annoyed by how a lot of people dealt with the open letter demanding a six-month ban on AI development. People didn't look critically at signatories like Elon Musk, who signed it and then announced the start of a new AI company to compete with OpenAI.

This letter focuses on existential threats and yet completely ignores the political and economic situation of Big Tech today. 

 

devmio: In your book, you wrote that language still represents an enormous challenge for AI. The book was published in 2020 – before ChatGPT’s advent. Do you still hold that belief today?

Katleen Gabriels: That is one of the parts that I will revise and extend significantly in the new edition. Even though the results are amazing in terms of language and spelling, ChatGPT still is not magic. One of the challenges of language is that it’s context specific and that’s still a problem for algorithms, which has not been solved with ChatGPT. It’s still a calculation, a prediction.

The breakthrough in NLP and LLMs indeed came sooner than I would have expected, but some of the major challenges are not being solved. 

devmio: Language plays a big role in how we think and how we argue and reason. How far do you think we are from artificial general intelligence? In your book, you wrote that it might be entirely possible that consciousness is an emergent property of our physiology and therefore not achievable outside of the human body. Is AGI even achievable?

Katleen Gabriels: Consciousness is a very tricky one. For AGI, first of all, from a semantic point of view, we need to know what intelligence is. That in itself is a very philosophical and multidimensional question because intelligence is not just about being good at mathematics. The term is very broad. There is also emotional intelligence and other kinds of intelligence, for instance.

We could take a look at the term superintelligence, as the Swedish philosopher Nick Bostrom defines it: superintelligence means that a computer is much better than a human being in each facet of intelligence, including emotional intelligence. We're very far away from that. It also has to do with bodily intelligence. It's one thing to make a good calculation, but it's another thing to teach a robot to become a good waiter and balance glasses filled with champagne through a crowd.

AGI or strong AI means a form of consciousness or self-consciousness and includes the very difficult concept of free will and being accountable for your actions. I don’t see this happening. 

The concept of AGI is often coupled with the fear of the singularity, which is basically a threshold: the final thing we as humans do is develop a very smart computer, and then we are done for because we cannot compete with these computers. Ray Kurzweil predicted that this is going to happen in 2045. But depending on the definition of superintelligence and the definition of singularity, I don't believe that 2045 will be the time when this happens. Very few people actually believe that.

devmio: We regularly talk to our expert Christoph Henkelmann. He raised an interesting point about AGI. If we are able to build a self-conscious AI, we have a responsibility to that being and cannot just treat it as a simple machine.

Katleen Gabriels: I’m not the only person who made the joke, but maybe the true Turing Test is that if a machine gains self-consciousness and commits suicide, maybe that is a sign of true intelligence. If you look at the history of science fiction, people have been really intrigued by all these questions and in a way, it very much fits the quote that “to philosophize is to learn how to die.”

I can relate that quote to this, especially the singularity is all about overcoming death and becoming immortal. In a way, we could make sense of our lives if we create something that outlives us, maybe even permanently. It might make our lives worth living. 

At the academic conferences that I attend, the consensus seems to be that the singularity is bullshit, the existential threat is not that big of a deal. There are big problems and very real threats in the future regarding AI, such as drones and warfare. But a number of impactful people only tell us about those existential threats. 

devmio: We recently talked to Matthias Uhl who worked on a study about ChatGPT as a moral advisor. His study concluded that people do take moral advice from an LLM, even though it cannot give it. Is that something you are concerned with?

Katleen Gabriels: I am familiar with the study and, if I remember correctly, they required only a five-minute attention span of their participants. So in a way, they have a big data set but very little has been studied. If you want to ask to what extent people would accept moral advice from a machine, then you really need a much more in-depth inquiry.

In a way, this is also not new. The study even echoes some of the same ideas from the 1970s with ELIZA. ELIZA was something like an early chatbot, and its creator, Joseph Weizenbaum, was shocked when he found out that people anthropomorphized it. He knew what it was capable of, and in his book "Computer Power and Human Reason: From Judgment to Calculation" he recalls anecdotes where his secretary asked him to leave the room so she could interact with ELIZA in private. People were also contemplating to what extent ELIZA could replace human therapists. In a way, this says more about human stupidity than about artificial intelligence.

In order to have a much better understanding of how people would take or not take moral advice from a chatbot, you need a very intense study and not a very short questionnaire.

devmio: It also shows that people long for answers, right? That we want clear and concise answers to complex questions.

Katleen Gabriels: Of course, people long for a manual. If we were given a manual by birth, people would use it. It’s also about moral disengagement, it’s about delegating or distributing responsibility. But you don’t need this study to conclude that.

It’s not directly related, but it’s also a common problem on dating apps. People are being tricked into talking to chatbots. Usually, the longer you talk to a chatbot, the more obvious it might become, so there might be a lot of projection and wishful thinking. See also the media equation study. We simply tend to treat technology as human beings.


devmio: We use technology to get closer to ourselves, to get a better understanding of ourselves. Would you agree?

Katleen Gabriels: I teach a course about AI and there’s always students saying, “This is not a course about AI, this is a course about us!” because it’s so much about what intelligence is, where the boundary between humans and machines is, and so on. 

It would also be interesting to study the people who believe in a fatal singularity: what does that belief say about them and about what they think of us humans?

devmio: Thank you for your answers!

What is Data Annotation and how is it used in Machine Learning? https://mlconference.ai/blog/data-annotation-ml/ Tue, 12 Oct 2021 12:24:15 +0000 https://mlconference.ai/?p=82363 What is data annotation? And how is data annotation applied in ML? In this article, we are delving deep to answer these key questions. Data annotation is valuable to ML and has contributed immensely to some of the cutting-edge technologies we enjoy today. Data annotators, or the invisible workers in the ML workforce, are needed more now than ever before.

The post What is Data Annotation and how is it used in Machine Learning? appeared first on ML Conference.

Modern businesses operate in highly competitive markets, and finding new business opportunities is harder than ever. Customer expectations are constantly changing, and finding the right talent to work toward common business goals is an enormous challenge, yet businesses still need to perform the way the market demands. So what are these companies doing to create a sustainable competitive advantage? This is where Artificial Intelligence (AI) solutions come in. With AI, it is easier to automate business processes and streamline decision-making. But what exactly defines a successful Machine Learning (ML) project? The answer is simple: the quality of the training datasets that feed your ML algorithms.

With that in mind, what makes for a high-quality training dataset? Data annotation. What is data annotation? And how is it applied in ML?

In this article, we delve deep to answer these key questions. It will be particularly helpful if:

  • You are seeking to understand what data annotation is in ML and why it is so important.
  • You are a data scientist curious to know the various data annotation types out there and their unique applications.
  • You want to produce high-quality datasets for your ML model’s top performance, and have no idea where to find professional data annotation services.
  • You have huge chunks of unlabeled data, no time to gather, organize, and label it, and are in dire need of a data labeler to do the job for you so you can meet the training and deployment goals for your models.

What is Data Annotation?

In ML, data annotation refers to the process of labeling data in a manner that machines can recognize either through computer vision or natural language processing (NLP). In other words, data labeling teaches the ML model to interpret its environment, make decisions and take action in the process.

Data scientists use massive amounts of datasets when building an ML model, carefully customizing them according to the model training needs. Thus, machines are able to recognize data annotated in different, understandable formats such as images, texts, and videos.

This explains why AI and ML companies are after such annotated data to feed into their ML algorithm, training them to learn and recognize recurring patterns, eventually using the same to make precise estimations and predictions.

The data annotation types

Data annotation comes in different types, each serving different and unique use cases. Although the field is broad, there are common annotation types found in popular machine learning projects, which we look at in this section to give you the gist of the field:

Semantic Annotation

Semantic annotation entails annotation of different concepts within text, such as names, objects, or people. Data annotators use semantic annotation in their ML projects to train chatbots and improve search relevance.

Image and Video Annotation

Put simply, image annotation enables machines to interpret content in pictures. Data experts use various forms of image annotation, from bounding boxes displayed on images to pixels individually assigned a meaning, a process called semantic segmentation. This type of annotation is commonly used in image recognition models for tasks like facial recognition and recognizing and blocking sensitive content.

Video annotation, on the other hand, uses bounding boxes or polygons on video content. The process is simple: developers use video annotation tools to place these bounding boxes, or stitch together video frames to track the movement of annotated objects. Whichever way the developer deems fit, this type of data comes in handy when developing computer vision models for localization and object-tracking tasks.
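To make this concrete, bounding-box annotations are typically stored as simple coordinates, and the overlap between two boxes, for instance an annotator's box and a reference box, is commonly measured with intersection-over-union (IoU). The sketch below is illustrative only; all class and method names are hypothetical, not part of any particular annotation tool:

```java
public class BoundingBoxDemo {
    // A box annotation: top-left corner (x, y) plus width and height.
    record Box(int x, int y, int w, int h) {
        int area() { return w * h; }
    }

    // Intersection-over-union: intersection area divided by union area.
    static double iou(Box a, Box b) {
        int ix = Math.max(0, Math.min(a.x() + a.w(), b.x() + b.w()) - Math.max(a.x(), b.x()));
        int iy = Math.max(0, Math.min(a.y() + a.h(), b.y() + b.h()) - Math.max(a.y(), b.y()));
        int inter = ix * iy;
        return (double) inter / (a.area() + b.area() - inter);
    }

    public static void main(String[] args) {
        Box reference = new Box(10, 10, 100, 100);
        Box annotated = new Box(10, 10, 100, 100); // identical boxes overlap fully
        System.out.println(iou(reference, annotated)); // 1.0
    }
}
```

An IoU threshold (often 0.5 or higher) is a common way to decide whether an annotated box matches a ground-truth box.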

Text categorization

Text categorization, also called text classification or text tagging, is the assignment of a set of predefined categories to documents. Using this type of annotation, paragraphs or sentences within a document can be tagged by topic, making it easier for users to search for information within a document, an application, or a website.
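As a toy illustration only, real text categorization models are trained from human-labeled examples rather than hand-written rules, and every name and keyword list below is hypothetical, a keyword-based tagger assigning predefined categories to a document might look like this:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TextTagger {
    // Hypothetical keyword lists per predefined category.
    static final Map<String, List<String>> CATEGORIES = Map.of(
        "sports",  List.of("match", "goal", "team"),
        "finance", List.of("stock", "market", "revenue"));

    // Return every category whose keyword list matches the document.
    static Set<String> tag(String document) {
        String lower = document.toLowerCase();
        Set<String> tags = new TreeSet<>();
        for (var entry : CATEGORIES.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (lower.contains(keyword)) {
                    tags.add(entry.getKey());
                    break; // one keyword hit is enough for this category
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tag("The team scored a late goal.")); // [sports]
    }
}
```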

Why is Data Annotation so Important in ML?

Whether you think of search engines’ ability to improve on the quality of results, developing facial recognition software, or how self-driving automobiles are created, all these are made real through data annotation. Living examples include how Google manages to give results based on the user’s geographical location or sex, how Samsung and Apple have improved the security of their smartphones using facial unlocking software, how Tesla brought into the market semi-autonomous self-driving cars, and so on.

Annotated data is valuable in ML for giving accurate predictions and estimations in our living environments. As mentioned above, machines are thereby able to recognize recurring patterns, make decisions, and take action as a result. In other words, machines are shown understandable patterns and told what to look for in image, video, text, or audio. There is virtually no limit to the similar patterns a trained ML algorithm can find in any new dataset fed into it.

Data Labeling in ML

In ML, a data label, also called a tag, is an element that identifies raw data (images, videos, or text), and adds one or more informative labels to put into context what an ML model can learn from. For example, a tag can indicate what words were said in an audio file, or what objects are contained in a photo.

Data labeling helps ML models learn from the numerous examples given. For example, a model will easily spot a bird or a person in an unlabeled image if it has seen adequate examples of images labeled as containing a bird or a person.
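A labeled example is simply raw data paired with a tag, and counting examples per label is a quick way to check that a model will actually see numerous examples of each class. A minimal sketch (all names are hypothetical, not any particular labeling tool's API):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LabelCounts {
    // A labeled example: a path to the raw data plus its tag.
    record LabeledImage(String path, String label) {}

    // Count how many examples carry each label, to check dataset balance.
    static Map<String, Long> countPerLabel(List<LabeledImage> data) {
        Map<String, Long> counts = new TreeMap<>();
        for (LabeledImage img : data) {
            counts.merge(img.label(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LabeledImage> data = List.of(
            new LabeledImage("img1.jpg", "bird"),
            new LabeledImage("img2.jpg", "person"),
            new LabeledImage("img3.jpg", "bird"));
        System.out.println(countPerLabel(data)); // {bird=2, person=1}
    }
}
```

A heavily imbalanced count is an early warning that the model will underperform on the rare classes.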

Conclusion

Data annotation is valuable to ML and has contributed immensely to some of the cutting-edge technologies we enjoy today. Data annotators, the invisible workers in the ML workforce, are needed now more than ever before. The growth of the AI and ML industry as a whole depends on the continued creation of the nuanced datasets needed to solve some of ML's complex problems.

There is no better "fuel" for training ML algorithms than annotated images, videos, and texts, and that is how we arrive at the autonomous ML models we can proudly build.

Now you understand why data annotation is essential in ML and what its most common types are. You are in a position to make informed choices for your enterprise and level up your operations.

Neuroph and DL4J https://mlconference.ai/blog/neuroph-and-dl4j/ Tue, 14 Sep 2021 11:31:34 +0000 https://mlconference.ai/?p=82227 In this article, we would like to show how neural networks, specifically the multilayer perceptron of two Java frameworks, can be used to detect blood cells in images.

The post Neuroph and DL4J appeared first on ML Conference.

Microscopic blood counts include an analysis of the white blood cells: neutrophils, eosinophils, basophilic granulocytes, monocytes, and lymphocytes, with neutrophils counted separately as rod-nucleated (young) and segment-nucleated (mature) forms, giving six classes in total. Based on the number, maturity, and distribution of these white blood cells, you can obtain valuable information about possible diseases. However, here we will not focus on the handling of the blood smears, but on the recognition of the blood cells.

For the tests described, a Bresser Trino microscope with a MikrOkular was used, connected to a computer (HP Z600). The program presented in this article was used for image analysis. The software is based on neural networks and uses the Java frameworks Neuroph and Deep Learning for Java (DL4J). Smears for the microscope were stained with Löffler solution.

 

Training data

For neural network training, the images of the blood cells were centered, converted to grayscale format, and normalized. After preparation, the images looked as shown in Figure 1.

Fig. 1: The JPG images have a size of 100 x 100 pixels and show (from left to right) lymphocyte (ly), basophil (bg), eosinophil (eog), monocyte (mo), rod-nucleated (young) neutrophil (sng), segment-nucleated (mature) neutrophil (seg); the cell types were used for neural network training.
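The article does not show the preparation code itself; a minimal sketch of this kind of preprocessing, scaling to 100 x 100 pixels and converting to grayscale with plain Java 2D (centering and normalization omitted, all names hypothetical), might look like this:

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class Preprocess {
    // Scale an image to 100 x 100 pixels and convert it to grayscale,
    // mirroring the preparation described for the training images.
    static BufferedImage toGray100(BufferedImage src) {
        BufferedImage out = new BufferedImage(100, 100, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        // Drawing into a TYPE_BYTE_GRAY target scales and converts in one step.
        g.drawImage(src, 0, 0, 100, 100, null);
        g.dispose();
        return out;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(640, 480, BufferedImage.TYPE_INT_RGB);
        BufferedImage prepared = toGray100(src);
        System.out.println(prepared.getWidth() + "x" + prepared.getHeight()); // 100x100
    }
}
```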

 

A dataset of 663 images with 6 labels – ly, bg, eog, mo, sng, seg – was compiled for training. For Neuroph, the imageLabels shown in Listing 1 were set.

List<String> imageLabels = new ArrayList<>();
  imageLabels.add("ly");
  imageLabels.add("bg");
  imageLabels.add("eog");
  imageLabels.add("mo");
  imageLabels.add("sng");
  imageLabels.add("seg");

After that, the directory for the input data looks like Figure 2.

Fig. 2: The directory for the input data

 

For DL4J the directory for the input data (your data location) is composed differently (Fig. 3).

Fig. 3: Directory for the input data for DL4J.

 

Most of the images in the dataset came from our own photographs. However, there were also images from open and free internet sources. In addition, the dataset contained the images multiple times as they were also rotated 90, 180, and 270 degrees respectively and stored.
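The rotation-based augmentation described above can be sketched as follows; this is an illustrative stand-alone version under the assumption of square images, not the code used for the article:

```java
import java.awt.image.BufferedImage;

public class Augment {
    // Rotate a square image 90 degrees clockwise by remapping pixels;
    // applying this one, two, and three times yields the 90/180/270 degree copies.
    static BufferedImage rotate90(BufferedImage src) {
        int n = src.getWidth(); // assumes width == height (100 x 100 here)
        BufferedImage out = new BufferedImage(n, n, src.getType());
        for (int y = 0; y < n; y++) {
            for (int x = 0; x < n; x++) {
                out.setRGB(n - 1 - y, x, src.getRGB(x, y));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFFFFFF); // mark the top-left pixel white
        BufferedImage rotated = rotate90(img);
        // the top-left corner moves to the top-right corner
        System.out.println((rotated.getRGB(99, 0) & 0xFFFFFF) == 0xFFFFFF); // true
    }
}
```

Storing the rotated copies alongside the originals, as done for the dataset above, multiplies the number of training examples without new photographs.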

 


Neuroph MLP Network

The main dependencies for the Neuroph project in pom.xml are shown in Listing 2.

<dependency>
  <groupId>org.neuroph</groupId>
  <artifactId>neuroph-core</artifactId>
  <version>2.96</version>
</dependency>   
<dependency>
  <groupId>org.neuroph</groupId>
  <artifactId>neuroph-imgrec</artifactId>
  <version>2.96</version>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.17</version>
</dependency>

A multilayer perceptron was set with the parameters shown in Listing 3.

private static final double LEARNINGRATE = 0.05;
private static final double MAXERROR = 0.05;
private static final int HIDDENLAYERS = 13;
 
//Open network
Map<String, FractionRgbData> map;
try {
  map = ImageRecognitionHelper.getFractionRgbDataForDirectory(new File(imageDir), new Dimension(10, 10));
  dataSet = ImageRecognitionHelper.createRGBTrainingSet(imageLabels, map);
  // create neural network
  List<Integer> hiddenLayers = new ArrayList<>();
  hiddenLayers.add(HIDDENLAYERS);
  NeuralNetwork nnet = ImageRecognitionHelper.createNewNeuralNetwork("leukos", new Dimension(10, 10), ColorMode.COLOR_RGB, imageLabels, hiddenLayers, TransferFunctionType.SIGMOID);
  // set learning rule parameters
  BackPropagation mb = (BackPropagation) nnet.getLearningRule();
  mb.setLearningRate(LEARNINGRATE);
  mb.setMaxError(MAXERROR);
  // train the network on the prepared dataset, then save it
  nnet.learn(dataSet);
  nnet.save("leukos.net");
} catch (IOException ex) {
  Logger.getLogger(Neuroph.class.getName()).log(Level.SEVERE, null, ex);
}

 

Example

The implementation of a test can look like the one shown in Listing 4.

HashMap<String, Double> output;
String fileName = "leukos112.seg";
NeuralNetwork nnetTest = NeuralNetwork.createFromFile("leukos.net");
// get the image recognition plugin from neural network
ImageRecognitionPlugin imageRecognition = (ImageRecognitionPlugin) nnetTest.getPlugin(ImageRecognitionPlugin.class);
output = imageRecognition.recognizeImage(new File(fileName));

 

Client

A simple SWING interface was developed for graphical cell recognition. An example of the recognition of a lymphocyte is shown in Figure 4.

Fig. 4: The program recognizes a lymphocyte and highlights it

 

DL4J MLP network

The main dependencies for the DL4J project in pom.xml are shown in Listing 5.

<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-core</artifactId>
  <version>1.0.0-beta4</version>
</dependency>
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native-platform</artifactId>
  <version>1.0.0-beta4</version>
</dependency>

Again, a multilayer perceptron was used with the parameters shown in Listing 6.

protected static int height = 100;
protected static int width = 100;
protected static int channels = 1;
protected static int batchSize = 20;
 
protected static long seed = 42;
protected static Random rng = new Random(seed);
protected static int epochs = 100;
protected static boolean save = true;
//DataSet
  String dataLocalPath = "your data location";
  ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
  File mainPath = new File(dataLocalPath);
  FileSplit fileSplit = new FileSplit(mainPath, NativeImageLoader.ALLOWED_FORMATS, rng);
  int numExamples = toIntExact(fileSplit.length());
  numLabels = Objects.requireNonNull(fileSplit.getRootDir().listFiles(File::isDirectory)).length;
  int maxPathsPerLabel = 18;
  BalancedPathFilter pathFilter = new BalancedPathFilter(rng, labelMaker, numExamples, numLabels, maxPathsPerLabel);
  // train/test split
  double splitTrainTest = 0.8;
  InputSplit[] inputSplit = fileSplit.sample(pathFilter, splitTrainTest, 1 - splitTrainTest);
  InputSplit trainData = inputSplit[0];
  InputSplit testData = inputSplit[1];
 
//Open Network
MultiLayerNetwork network = lenetModel();
network.init();
ImageRecordReader trainRR = new ImageRecordReader(height, width, channels, labelMaker);
trainRR.initialize(trainData, null);
DataSetIterator trainIter = new RecordReaderDataSetIterator(trainRR, batchSize, 1, numLabels);
// scale pixel values to the [0, 1] range
DataNormalization scaler = new ImagePreProcessingScaler(0, 1);
scaler.fit(trainIter);
trainIter.setPreProcessor(scaler);
network.fit(trainIter, epochs);

 

LeNet Model

This model is a feedforward convolutional neural network designed for image processing (Listing 8).

private MultiLayerNetwork lenetModel() {
  /*
    * Revised LeNet model approach developed by ramgo2; achieves slightly above random
    * Reference: https://gist.github.com/ramgo2/833f12e92359a2da9e5c2fb6333351c5
  */
  MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .l2(0.005)
    .activation(Activation.RELU)
    .weightInit(WeightInit.XAVIER)
    .updater(new AdaDelta())
    .list()
    .layer(0, convInit("cnn1", channels, 50, new int[]{5, 5}, new int[]{1, 1}, new int[]{0, 0}, 0))
    .layer(1, maxPool("maxpool1", new int[]{2, 2}))
    .layer(2, conv5x5("cnn2", 100, new int[]{5, 5}, new int[]{1, 1}, 0))
    .layer(3, maxPool("maxpool2", new int[]{2, 2}))
    .layer(4, new DenseLayer.Builder().nOut(500).build())
    .layer(5, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
      .nOut(numLabels)
      .activation(Activation.SOFTMAX)
      .build())
    .setInputType(InputType.convolutional(height, width, channels))
    .build();
 
  return new MultiLayerNetwork(conf);
 
}

 


Example

A test of the Lenet model might look like the one shown in Listing 9.

trainIter.reset();
DataSet testDataSet = trainIter.next();
List<String> allClassLabels = trainRR.getLabels();
INDArray features = testDataSet.getFeatures();
int[] predictedClasses = network.predict(features); // predict once for the whole batch
int n = allClassLabels.size();
System.out.println("n = " + n);
for (int i = 0; i < n; i = i + 1) {
  int labelIndex = testDataSet.getLabels().argMax(1).getInt(i);
  System.out.println("labelIndex=" + labelIndex);
  String expectedResult = allClassLabels.get(labelIndex);
  String modelPrediction = allClassLabels.get(predictedClasses[i]);
  System.out.println("For a single example that is labeled " + expectedResult + " the model predicted " + modelPrediction);
}

 

Results

After a few test runs, the results shown in Table 1 are obtained.

Leukocyte type                        Neuroph   DL4J
Lymphocytes (ly)                         87      85
Basophils (bg)                           96      63
Eosinophils (eog)                        93      54
Monocytes (mo)                           86      60
Rod-nucleated neutrophils (sng)          71      46
Segment-nucleated neutrophils (seg)      92      81

Table 1: Results of leukocyte counting (N-success/N-samples in %).

 

As can be seen, the results using Neuroph are better than those using DL4J, but it is important to note that the results depend on the quality of the input data and the network's topology. We plan to investigate this issue further in the near future.

However, with these results, we have already been able to show that image recognition can be used for medical purposes with not one, but two sound and potentially complementary Java frameworks.

 

Acknowledgments

At this point, we would like to thank Mr. A. Klinger (Management Devoteam GmbH Germany) and Ms. M. Steinhauer (Bioinformatician) for their support.

The post Neuroph and DL4J appeared first on ML Conference.

]]>
Top 5 reasons to attend ML Conference https://mlconference.ai/blog/top-5-reasons-to-attend-ml-conference/ Tue, 20 Jul 2021 11:33:51 +0000 https://mlconference.ai/?p=82083 So you’ve decided to attend ML Conference but you don’t know how to break it to your boss that it is a win-win situation? Don’t worry, we’ve got you covered. Follow 4 simple steps and use these 5 arguments to show why your organization needs to invest in ML Conference!

The post Top 5 reasons to attend ML Conference appeared first on ML Conference.

]]>
1. Let your boss know why you want to go to ML Conference

Tell him that there are over 25 expert speakers and industry practitioners addressing current trends and best practices.

Tell your boss to take a look at the conference tracks to get a better idea of what this conference is all about.

 

2. Tell him what’s in it for him

You will have the chance to gain key knowledge and skills for this new era of machine learning. Turn your ideas into best practices during the workshops and meet people who can help you do so. You'll learn what it means to build an ML-first mindset through numerous real-world examples, which you can then put into practice in your company. At ML Conference you will develop a deep understanding of your data, as well as of the latest tools and technologies.

 

3. Show him that you’ve done your homework: Book your ticket now and save money.

If you book your ticket now, your boss will save money with the early bird ticket. Plus, you will get an additional 10% discount for groups of 3 or more.

 

4. Assure your boss that you will network with top industry experts

In addition to the valuable knowledge you will get from top-notch industry experts, you'll also have the chance to connect and network with people at the top of their field. ML Conference offers an expo reception and a networking event.

 


]]>
Anomaly Detection as a Service with Metrics Advisor https://mlconference.ai/blog/anomaly-detection-as-a-service-with-metrics-advisor/ Wed, 09 Jun 2021 07:39:21 +0000 https://mlconference.ai/?p=81814 We humans are usually good at spotting anomalies: often a quick glance at monitoring charts is enough to spot (or, in the best case, predict) a performance problem. A curve rises unnaturally fast, a value falls below a desired minimum or there are fluctuations that cannot be explained rationally. Some of this would be technically detectable by a simple automated if, but it's more fun with Azure Cognitive Services' new Metrics Advisor.

The post Anomaly Detection as a Service with Metrics Advisor appeared first on ML Conference.

]]>
Are you developing an application that stores time-based data? Orders, ratings, comments, appointments, time bookings, repairs or customer contacts? Do you have detailed log files about the number and duration of visits? Be honest: how quickly would you notice if your systems (or your users) were behaving differently than you thought? Maybe one of your clients is trying to flood the software with far too much data, or a product in your webshop is “going through the roof”? Maybe there are performance issues in certain browsers or unnatural CPU spikes that deserve a closer look? Metrics Advisor from Azure Cognitive Services provides an AI-powered service that monitors your data and alerts you when anomalies are suspected.


 

What is normal?

The big challenge here is defining what constitutes an anomaly in the first place. Imagine a whole shelf full of developer magazines, with only one sports magazine among them. You could rightly say that the sports magazine is an anomaly. But perhaps, by chance, all the magazines are in A4 format, and only two are in A5 – another anomaly. Thus, for automated anomaly detection, it is important to learn from experience and understand which anomalies are actually relevant – and where there is a false alarm that should be avoided in the future.

In the case of time-based data – which is what Metrics Advisor is all about – there are several approaches to anomaly detection. The simplest is to define hard limits: anything below or above a certain threshold is considered an anomaly. This requires neither machine learning nor artificial intelligence; the rules are quickly implemented and easily understood. For monitoring data, that might be enough: if 70 percent of the storage space is occupied, you want to react.

But the (data) world rarely runs along such rigid tracks. Sometimes the relative change matters more than the absolute value: a significant increase or decrease of more than 10 percent within the last three hours should be detected as an anomaly. Take an example from finance: if your private account balance changes from €20,000 to €30,000, this is probably an anomaly; if a company account changes from €200,000 to €210,000, it is not worth mentioning. As this example shows, the classification of what constitutes an anomaly can also change over time: for a startup, €100,000 is a lot of money; for a large corporation, it is a footnote.

And what if your data is subject to seasonal fluctuations, or individual days such as weekends or holidays behave significantly differently? Here, too, classification is not trivial. Is an influenza wave expected in the winter months and only an anomaly in summer, or should every rise in infection numbers be flagged? Anomaly detection is therefore, to some extent, a subjective question regardless of the tooling – and not every decision can be delegated to the technology. Machine learning, however, can help learn from historical data and distinguish normal fluctuations from anomalies.
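A minimal sketch of these two basic rule types in plain Python; the function names and thresholds are purely illustrative and not part of any Metrics Advisor API:

```python
def hard_threshold_anomaly(value, lower=None, upper=None):
    """Flag a point that falls outside fixed limits."""
    if lower is not None and value < lower:
        return True
    if upper is not None and value > upper:
        return True
    return False

def change_anomaly(previous, current, max_relative_change=0.10):
    """Flag a point that deviates by more than, e.g., 10 % from its predecessor."""
    if previous == 0:
        return current != 0
    return abs(current - previous) / abs(previous) > max_relative_change
```

With these rules, the account-balance example behaves as described: a jump from 20,000 to 30,000 (a 50 % change) is flagged, while 200,000 to 210,000 (5 %) is not.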


Metrics Advisor

Metrics Advisor is a new service in the Azure Cognitive Services lineup and is only available in a preview version for now. Internally, another service is used, namely the Anomaly Detector (also part of the Cognitive Services). Metrics Advisor complements this with numerous API methods, a web frontend for managing and viewing data feeds, root-cause analysis and alert configurations. To experiment, you need an Azure subscription. There you can create a Metrics Advisor resource; it’s completely free to use in the preview phase.

The example I would like to use to demonstrate the basic procedure uses data from Google Trends [1]. I have evaluated and downloaded weekly Google Trends scores for two search terms (“vaccine” and “influenza”) for the last five years for four countries (USA, China, Germany, Austria) and would like to try to identify any anomalies in this data. The entire administration of Metrics Advisor can be done via the provided REST API [2]; a faster way to get started is the provided web frontend, the so-called workspace [3].

Data Feeds

To start with, we create a new data feed that provides the basic data for the analysis. Various Azure services and databases are available out of the box as data sources: Application Insights, Blob Storage, Cosmos DB, Data Lake, Table Storage, SQL Database, Elastic Search, Mongo DB, PostgreSQL – and a few more. In our example, I loaded the Google Trends data into a SQL Server database. In addition to the primary key, the table has four other columns: the date, the country, and the scores for vaccine and influenza. In Metrics Advisor, an SQL statement must now be specified (in addition to the connection string) to query all values for a given date. This is because the service will continue to periodically visit our database to retrieve and analyze the new data. The frequency at which this update should happen is set via the granularity: The data can be analyzed annually, monthly, weekly, daily, hourly or in even shorter periods (the smallest unit is 300 seconds). Depending on the selected granularity, Microsoft also recommends how much historical data to provide. If we choose a 5-minute interval, then data from the last four days will suffice. In our case, a weekly analysis, four years is recommended. After clicking on Verify and get schema, the SQL statement is issued and the structure of our data source is determined. We see the columns shown in Figure 1 and need to assign meaning: Which column contains the timestamp? Which columns should be analyzed as metrics – and where are additional facts (dimensions) that could be possible causes of anomalies?

Fig. 1: Configuration of the data feed

Now, before the data is actually imported, there is one more thing to consider: the roll-up settings. For a later root-cause analysis, it is necessary to build a multidimensional cube that calculates aggregated values per dimension (in our case, a per-week aggregation over all countries). This way, in case of anomalies, it can later be investigated which dimensions or characteristics seem to be causal for the change in value. If the aggregations are not already present in our data source, Metrics Advisor can be asked to calculate them. The only decision we have to make here is the type of aggregation (sum, average, min, max, count). Admittedly, our example is a bit flawed here: we select average, but the value for the USA then carries the same weight as the value for the much smaller Austria. As you can see, data quality is often the stumbling block, and care must be taken that statements are not based on flawed calculations.


Finally, we start the import, which can take several hours depending on the amount of data. The status of the import can also be tracked in the workspace, and individual time periods can be reloaded at any time.

Analysis and fine tuning

Once the data import is complete, we can take a first look at our results. The main goal of Metrics Advisor is to analyze and detect new anomalies – that is, to investigate whether the most recent data point is an anomaly or not. Nevertheless, historical data is also taken into consideration. Depending on the granularity, the service looks several hours to years into the past and tries to flag anomalies there as well. In our case (five years of data, weekly aggregation), the so-called Smart Detection provides results for the past six months and marks individual points in time as an anomaly (Fig. 2).

Fig. 2: Data visualization including anomalies

Now it is time to take a look at the suggestions: Are the identified anomalies actually relevant? Is the detection too sensitive or too tolerant? There are some ways to improve the detection rate. Let’s recall the beginning of this article: The big challenge is to define what constitutes an anomaly in the first place.

You will probably notice a prominently placed slider in the workspace very quickly. With it we can control the sensitivity. The higher the value, the smaller the area containing normal points. We also get these limits visualized within the charts as a light blue area. Sometimes it is useful not to warn at the first occurrence of an anomaly, but only when several anomalies have been detected over a period of time. We can configure Metrics Advisor to look at a certain number of points retrospectively and not consider an anomaly until a certain percentage of those points have been detected as an anomaly. For example, a brief performance problem should be tolerated, but if 70 percent of the readings have been detected as an anomaly in the last 15 minutes, it should be considered a problem overall.
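The retrospective rule described above can be pictured as a simple sliding-window vote. A stdlib-only sketch using the 70-percent figure from the example (the real service implements this internally; this is only an illustration of the idea):

```python
def window_alert(anomaly_flags, window=15, min_ratio=0.7):
    """Alert only if at least min_ratio of the last `window` points
    were individually flagged as anomalous."""
    recent = anomaly_flags[-window:]
    if not recent:
        return False
    return sum(recent) / len(recent) >= min_ratio
```

A brief blip stays silent, while a sustained problem (here 11 of the last 15 readings anomalous) triggers an alert.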


Depending on the use case, it may make sense to supplement Smart Detection with manual rules. A Hard Threshold can be used to define a lower or upper limit, as well as a range of values that should be considered an anomaly. The Change Threshold offers the aforementioned option of treating a percentage change relative to one or more predecessor points as an anomaly. The way the individual rules are combined (AND/OR) influences detection: for example, an anomaly should only be reported if Smart Detection strikes and the value is above 30. We can compose and name several of these configurations. In addition, it is possible to store special rules for individual dimensions.

Depending on our settings, more or less anomalies are detected in the data, and the Metrics Advisor then tries to convert them into so-called incidents. An incident can consist of a single anomaly, but is often made up of related anomalies and thus entire time periods are listed under a common incident. Tools are available in the Incident Hub for closer examination: We can filter the found incidents (by time, criticality, and dimension), start an initial automatic root cause analysis (see “Root cause” in Fig. 3), and drill down through multiple dimensions to gather insights.

Fig. 3: Incident analysis

Feedback

Perhaps the greatest benefit of using artificial intelligence for anomaly detection is the ability to learn through feedback. Even if sensitivity and thresholds have been set well, sometimes the service will get it wrong. However, it is for these data points that feedback can then be provided via the API or portal: Where was an anomaly incorrectly detected? Where was an anomaly missed? The service accepts this feedback and tries to assign similar cases more correctly in the future. The service also tries to recognize time periods – and can be proven wrong if we mark a time range and report it as a period.

For predictable anomalies that have temporal reasons (holidays, weekends, cyclically recurring events), there are separate options for configuration. These should therefore not be reported subsequently as feedback, but stored as so-called preset events.

Alerts

We should now be at a point where data is imported regularly and anomaly detection hopefully works reliably. However, the best detection is of no use if we learn about it too late. Therefore, Alert Configurations should be set up to actively notify about anomalies. Currently, there are three channels to choose from: via email, as a WebHook, or as a ticket in Azure DevOps. The WebHook variant in particular offers exciting possibilities for integration: we can display the detected anomalies in our own application or trigger a workflow using Azure Logic Apps. Perhaps we simply restart the affected web app as a first automated action.

Snooze settings also seem handy; an alert can automatically ensure that no more alerts are sent for a configurable period of time afterwards. This avoids waking up in the morning with 500 emails in your inbox, all with the same content.
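The snooze behaviour amounts to suppressing further alerts inside a cool-down window after one has been delivered. A hedged sketch of that logic (timestamps in minutes, purely illustrative and not the service's actual implementation):

```python
def filter_alerts(alert_times, snooze_minutes=60):
    """Deliver an alert only if no alert was delivered within the
    preceding snooze window. alert_times: sorted timestamps in minutes."""
    delivered = []
    for t in alert_times:
        if not delivered or t - delivered[-1] >= snooze_minutes:
            delivered.append(t)
    return delivered
```

Five raw alerts in quick succession thus collapse into a handful of delivered notifications instead of 500 identical emails.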

Summary

Metrics Advisor provides an exciting and easy entry into the world of anomaly detection for time-based data. Long-time data scientists may prefer other ways and means (and may be interested in the paper at [4]), but for application developers who want to run their first experiments on suitable data, this service is a potent gateway drug. The preview status is currently most evident in the web portal and in the patchy documentation of the REST API; good conceptual documentation, however, is already available.

Have fun trying it out and experimenting with your own data sources.

Links & Literature

[1] https://trends.google.com

[2] https://docs.microsoft.com/en-us/azure/cognitive-services/metrics-advisor/

[3] https://metricsadvisor.azurewebsites.net

[4] https://arxiv.org/abs/1906.03821

 


]]>
Tools & Processes for MLOps https://mlconference.ai/blog/tools-and-processes-for-mlops/ Wed, 26 May 2021 10:47:53 +0000 https://mlconference.ai/?p=81706 Training a machine learning model is getting easier. But building and training the model is also the easy part. The real challenge is getting a machine learning system into production and running it reliably. In the field of software development, we have gained a significant insight in this regard: DevOps is no longer just nice to have, but absolutely necessary. So why not use DevOps tools and processes for machine learning projects as well?

The post Tools & Processes for MLOps appeared first on ML Conference.

]]>
When we want to use our familiar tools and workflows from software development for data science and machine learning projects, we quickly run into problems. Data science and machine learning model building follow a different process than the classic software development process, which is fairly linear.

When I create a branch in software development, I have a clear goal in mind of what the outcome of that branch will be: I want to fix a bug, develop a user story, or revise a component. I start working on this defined task. Then, once I upload my code to the version control system, automated tests run – and one or more team members perform a code review. Then I usually do another round to incorporate the review comments. When all issues are fixed, my branch is integrated into the main branch and the CI/CD pipeline starts running; a normal development process. In summary, the majority of the branches I create are eventually integrated in and deployed to a production environment.

In the area of machine learning and data science, things are different. Instead of a linear and almost “mechanical” development process, the process here is very much driven by experiments. Experiments can fail; that is the nature of an experiment. I also often start an experiment precisely with the goal of disproving a thesis. Now, any training of a machine learning model is an experiment and an attempt to achieve certain results with a specific model and algorithm configuration and data set. If we imagine that for a better overview we manage each of these experiments in a separate branch, we will get very many branches very quickly. Since the majority of my experiments will not produce the desired result, I will discard many branches. Only a few of my experiments will ever make it into a production environment. But still, I want to have an overview of what experiments I have already done and what the results were so that I can reproduce and reuse them in the future.

But that’s not the only difference between traditional software development and machine learning model development. Another difference is behavior over time.

ML models deteriorate over time

Classic software works just as well after a month as it did on day one. Of course, there may be changes in memory and computational capacity requirements, and of course bugs will occur, but the basic behavioral characteristics of the production software do not change. With machine learning models, it’s different. For these, the quality decreases over time. A model that operates in a production environment and is not re-trained will degrade over time and never achieve as good a predictive accuracy as it did on day one.

Concept drift is to blame [1]. The world outside our machine learning system changes and so does the data that our model receives as input values. Different types of concept drift occur: data can change gradually, for example, when a sensor becomes less accurate over a long period of time due to wear and tear and shows an ever-increasing deviation from the actual measured value. Cyclical events such as seasons or holidays can also have an effect if we want to predict sales figures with our model.

But concept drift can also occur very abruptly: If global air traffic is brought to a standstill by COVID-19, then our carefully trained model for predicting daily passenger traffic will deliver poor results. Or if the sales department launches an Instagram promotion without notice that leads to a doubling of buyers of our vitamin supplement, that’s a good result, but not something our model is good at predicting.

There are two ways to counteract this deterioration in prediction quality: either we enable our model to actively retrain itself in the production environment, or we update our model frequently – ideally, as often as we possibly can. We may also have made a necessary adjustment to an algorithm or introduced a new model that needs to be rolled out as quickly as possible.

So in our machine learning workflow, our goal is not just to deliver models to the user. Instead, our goal must be to build infrastructure that quickly informs our team when a model is providing incorrect predictions and enables the team to lift a new, better model into production environments as quickly as possible.

MLOps as DevOps for Machine Learning

We have seen that data science and machine learning model building require a different process than traditional, “linear” software development. It is also necessary that we achieve a high iteration speed in the development of machine learning models, in order to counteract concept drift. For this reason, it is necessary that we create a machine learning workflow and a machine learning platform to help us with these two requirements. This is a set of tools and processes that are to our machine learning workflow what DevOps is to software development: A process that enables rapid but controlled iteration in development supported by continuous integration, continuous delivery, and continuous deployment. This allows us to quickly and continuously bring high-quality machine learning systems into production, monitor their performance, and respond to changes. We call this process MLOps [2] or CD4ML (Continuous Delivery for Machine Learning) [3].

MLOps also provides us with other benefits: Through reproducible pipelines and versioned data, we create consistency and repeatability in the training process as well as in production environments. These are necessary prerequisites to implement business-critical ML use cases and to establish trust in the new technology among all stakeholders.

In the enterprise environment, we have a whole set of requirements that need to be implemented and adhered to in addition to the actual use case. There are privacy, data security, reproducibility, explainability, non-discrimination, and various compliance policies that may differ from company to company. If we leave these additional challenges for each team member to solve individually, we will create redundant, inconsistent and simply unnecessary processes. A unified machine learning workflow can provide a structure that addresses all of these issues, making each team member’s job easier.

Due to the experimental and iterative nature of machine learning, each step in the process that can be automated has a significant positive impact on the overall run time of the process from data to productive model. A machine learning platform allows data scientists and software engineers to focus on the critical aspects of the workflow and delegate the routine tasks to the automated workflows. But what sub-steps and tools can a platform for MLOps be built from?

Components of an MLOps pipeline

An MLOps workflow can be roughly divided into three areas:

  1. Data pipeline and feature management
  2. Experiment management and model development
  3. Deployment and monitoring

In the following, I describe the individual areas and present a selection of suitable tools for implementing the workflow. This selection is, of course, neither exhaustive nor necessarily representative: the entire landscape is developing so rapidly that any overview can only be a snapshot.

Fig. 1: MLOps workflow

Data pipeline and feature management

As hackneyed as slogans like “data is the new oil” may seem, they have a kernel of truth: The first step in any machine learning and data science workflow is to collect and prepare data.

Centralized access to raw data

Companies with modern data warehouses or data lakes have a distinct advantage when developing machine learning products. Without a centralized point to collect and store raw data, finding appropriate data sources and ensuring access to that data is one of the most difficult steps in the lifecycle of a machine learning project in larger organizations.

Centralized access can be implemented here in the form of a Hadoop-based platform. However, for smaller data volumes, a relational database such as Postgres [4] or MySQL [5], or a document database based on the ELK stack [6], is also perfectly adequate. The major cloud providers also offer their own products for centralized raw-data management: Amazon Redshift [7], Google BigQuery [8] or Microsoft Azure Cosmos DB [9].

In any case, it is necessary that we first archive a canonical form of our original data before applying any transformation to it. This gives us an unmodified dataset of original data that we can use as a starting point for processing.

Even at this point in the workflow, it is important to rely on good documentation and to document the sources of the data, its meaning, and where it is stored. Even though this step seems simple, it is still of utmost importance. Invalid data, the wrong naming of a column of data, or a misconfigured scraping job can lead to a lot of frustration and wasted time.

Data Transformation

Rarely will we train our machine learning model directly on raw data. Instead, we generate features from the raw data. In the context of machine learning, a feature is one or more processed data attributes that an algorithm uses to make predictions. This could be a temperature value, for example, but in the case of deep learning applications also highly abstract features in images. To extract features from raw data, we will apply various transformations. We will typically define these transformations in code, although some ETL tools also allow us to define them using a graphical interface. Our transformations will either be run as batch jobs on larger sets of data, or we will define them as long-running streaming applications that continuously transform data.
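As a trivial illustration of such a transformation, here is a sketch that derives two features from raw rows; the column names and the chosen features are invented for this example:

```python
import datetime

def extract_features(raw_rows):
    """Turn raw rows into model features: a daily mean temperature
    and a binary weekend flag derived from the timestamp."""
    features = []
    for row in raw_rows:
        day = datetime.date.fromisoformat(row["date"])
        features.append({
            "mean_temp": sum(row["temps"]) / len(row["temps"]),
            "is_weekend": day.weekday() >= 5,  # 5 = Saturday, 6 = Sunday
        })
    return features
```

In practice such functions run as batch jobs or streaming transformations rather than in-process, but the shape of the work is the same.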

We also need to split our dataset into training and testing datasets. To train a machine learning model, we need a set of training data. To test how our model performs with previously unknown data, we need another structurally identical set of test data. Therefore, we split our original transformed data set into two data sets. These must not overlap, meaning that the same data point does not occur twice in them. A common split here is to use 70 percent of the dataset for training and 30 percent for testing the model.

The exact split of the data sets depends on the context. For time-series data, sequential slices from the series should be chosen, while for image processing, random images from the data set should be chosen since they have no sequential relation to each other.

For non-sequential data, the individual data points can also be placed in a (pseudo-)random order. We want to perform this process in a reproducible and automated manner rather than manually. A convenient tool for managing and orchestrating this is Apache Airflow [10]: following the “pipeline as code” principle, you define pipelines in the form of a data-flow graph, connect a wide variety of systems, and thus perform the desired transformations.
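For non-sequential data, a reproducible 70/30 split boils down to shuffling with a fixed seed and cutting, as this sketch shows (the fraction and seed values are just the conventions from the text):

```python
import random

def train_test_split(data, train_fraction=0.7, seed=42):
    """Shuffle with a fixed seed so the split is reproducible, then cut."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because the seed is fixed, re-running the split on the same dataset yields the same partition, which is exactly the reproducibility property the pipeline needs. For time-series data, one would instead slice sequentially.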

Feature repositories

Many machine learning models and systems within a company use the same or at least similar features. Once we have extracted a feature from our raw data, there is a high probability that it will be useful for other applications as well. It therefore pays off not to re-implement feature extraction for every application. Instead, we can store known features in a feature store. This can be either a dedicated component (such as Feast [11]) or a set of well-documented database tables populated by appropriate transformations, which in turn can be orchestrated with Apache Airflow.

Data versioning

In addition to code versioning, data versioning is useful in a machine learning context. This allows us to increase the reproducibility of our experiments and to validate our models and their predictions by retracing the exact state of a training dataset that was used at a given time. Tools such as DVC [12] or Pachyderm [13] can be used for this purpose.

Experiment management and model development

In order to deploy an optimal model into production, we need to create a process that enables the development of that optimal model. To do this, we need to capture and visualize information that enables the decision of what the optimal model is, since in most cases this decision is made by a human and not automated.

Since the data science process is very experiment-driven, multiple experiments are run in parallel, often by different people at the same time. And most will not be deployed in a production environment. The experimental approach in this phase of research is very different from the “traditional” software development process, as we can expect that the code for these experiments will be discarded in the majority of cases, and only some experiments will reach a production status.

Experiment management and visualizations

Running hundreds or even thousands of iterations on the way to an optimally trained ML model is not uncommon. Along the way, a large number of parameters defining each experiment, together with the results of those experiments, accumulate. Often this metadata is stored in Excel spreadsheets or, in the worst case, only in the heads of team members. However, to establish reproducibility, avoid time-consuming duplicate experiments, and enable effective collaboration, this data should be captured automatically. Possible tools here are MLflow Tracking [14] or Sacred [15]. To visualize the output metrics, either classic dashboards like Grafana [16] or specialized tools like TensorBoard [17] can be used; TensorBoard works for this purpose independently of TensorFlow. PyTorch, for example, provides a compatible logging library [18]. However, there is still much room for optimization and experimentation here; combinations with other tools from the DevOps environment, such as Jenkins [19] and Terraform [20], would also be conceivable.
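Conceptually, what such tracking tools capture per run is just structured metadata about parameters and metrics. A stdlib-only sketch to illustrate the idea (this is not the MLflow API; its actual interface differs):

```python
import json
import time
import uuid

def log_run(params, metrics, path=None):
    """Record one experiment run as a JSON document, optionally appended
    to a newline-delimited JSON log file."""
    run = {
        "run_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "params": params,    # e.g. {"learning_rate": 0.01, "layers": 3}
        "metrics": metrics,  # e.g. {"accuracy": 0.92}
    }
    if path:
        with open(path, "a") as f:
            f.write(json.dumps(run) + "\n")
    return run
```

Even this primitive form beats a spreadsheet: every run is timestamped, uniquely identified, and queryable.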

Version control for models

In addition to the results of our experiments, the trained models themselves can also be captured and versioned. This allows us to more easily roll back to a previous model in a production environment. Models can be versioned in several ways: In the simplest variant, we export our trained model in serialized form, for example as a Python .pkl file. We then record this in a suitable version control system (Git, DVC), depending on its size.

Another option is to provide a central model registry. For example, the MLflow Model Registry [21] or the model registry of one of the major cloud providers can be used here. Also, the model can be packaged in a Docker container and managed in a private Docker Registry [22].

Infrastructure for distributed training

Smaller ML models can usually still be trained on one’s own laptop with reasonable effort. However, as soon as the models become larger and more complex, this is no longer possible and a central on-premise or cloud server becomes necessary for training. For automated training in such an environment, it makes sense to build a model training pipeline.

This pipeline is executed with training data at specific times or on demand. It receives the configuration data that defines a training run: for example, the model type, hyperparameters and features to use. The pipeline can obtain the training data set automatically from the feature store and distribute it to all model variants to be trained in parallel. Once training is complete, the model files, original configuration, learned parameters, and metadata and timings are captured in the experiment- and model-tracking tools. One possible tool for building such a pipeline is Kubeflow [23], which provides a number of useful features for automated training and for (cost-)efficient, Kubernetes-based resource management.
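Such a training-run configuration can be modeled explicitly and fanned out into the parallel variants the pipeline trains. A sketch with invented field names (this is not a Kubeflow API, just an illustration of the configuration data mentioned above):

```python
from dataclasses import dataclass, field

@dataclass
class TrainingConfig:
    model_type: str                                   # e.g. "lenet"
    hyperparameters: dict = field(default_factory=dict)
    features: list = field(default_factory=list)

def expand_grid(base, learning_rates):
    """Fan one base config out into parallel variants, one per learning rate."""
    variants = []
    for lr in learning_rates:
        hp = dict(base.hyperparameters, learning_rate=lr)
        variants.append(TrainingConfig(base.model_type, hp, list(base.features)))
    return variants
```

Each variant is then a self-contained unit of work that the pipeline can schedule independently and record alongside its results.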

Deployment and monitoring

Unless our machine learning project is purely a proof of concept or an academic exercise, we will eventually need to lift our model into a production environment. And that's not all: once it gets there, we'll need to monitor it and deploy new versions as needed. In many cases, we will also have not just one, but a whole set of models in our production environments.

Deploy models

On a technical level, any model training pipeline must produce an artifact that can be deployed into a production environment. The prediction results may be bad, but the model itself must be in a packaged state that allows it to be deployed directly into a production environment. This is a familiar idea from software development: continuous delivery. This packaging can be done in two different ways.

Either our model is deployed as a separate service and accessed by the rest of our systems via an interface. Here, the deployment can be done, for example, with TensorFlow Serving [24] or in a Docker container with a matching (Python) web server.

An alternative way of deployment is to embed it in our existing system. Here, the model files are loaded and input and output data are routed within the existing system. The problem here is that the model must be in a compatible format or a conversion must be performed before deployment. If such a conversion is performed, automated tests are essential. Here it must be ensured that both the original and the converted model deliver identical prediction results.
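Such an equivalence test can be sketched as follows. The two DoubleUnaryOperator instances stand in for the original and the converted model; in a real project they would be loaded from their respective formats, and the tolerance would be chosen to match the numeric precision of the conversion:

```java
import java.util.function.DoubleUnaryOperator;

public class ConversionCheck {
  // After converting a model to another format, verify that both versions
  // produce numerically identical predictions on a reference data set.
  static boolean predictionsMatch(DoubleUnaryOperator original, DoubleUnaryOperator converted,
                                  double[] testInputs, double tolerance) {
    for (double x : testInputs) {
      if (Math.abs(original.applyAsDouble(x) - converted.applyAsDouble(x)) > tolerance) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Stand-ins for a real model and its converted counterpart.
    DoubleUnaryOperator original = x -> 2.0 * x + 1.0;
    DoubleUnaryOperator converted = x -> 2.0 * x + 1.0 + 1e-9; // tiny float drift
    double[] inputs = {0.0, 1.5, -3.2, 10.0};
    System.out.println(predictionsMatch(original, converted, inputs, 1e-6)); // true
  }
}
```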

Monitoring

A data science project does not end with the deployment of a model to production. Even a production model faces many challenges: the distribution of input values in the real world may differ from the one represented in the training data, and value distributions can change slowly over time or abruptly due to singular events. This then requires retraining with the changed data.
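A very simple drift check along these lines compares summary statistics of a window of live inputs against the training baseline. This is a minimal sketch with an invented threshold convention, not a replacement for proper statistical drift tests such as those surveyed in [1]:

```java
public class DriftCheck {
  static double mean(double[] xs) {
    double s = 0;
    for (double x : xs) s += x;
    return s / xs.length;
  }

  // Flag drift when the mean of the live window wanders more than
  // `threshold` training standard deviations away from the training mean.
  static boolean hasDrifted(double trainMean, double trainStd, double[] liveWindow, double threshold) {
    return Math.abs(mean(liveWindow) - trainMean) > threshold * trainStd;
  }

  public static void main(String[] args) {
    double[] stable = {4.9, 5.1, 5.0, 4.8, 5.2};
    double[] shifted = {7.9, 8.1, 8.0, 7.8, 8.2};
    System.out.println(hasDrifted(5.0, 0.5, stable, 2.0));  // false: still near baseline
    System.out.println(hasDrifted(5.0, 0.5, shifted, 2.0)); // true: distribution moved
  }
}
```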

Also, despite intensive testing, errors may have crept in during the previous steps. For this reason, infrastructure should be provided to continuously collect data on model performance. The input values and the resulting predictions of the model should also be recorded, as far as this is compatible with the applicable data protection regulations. On the other hand, if privacy considerations are only introduced at this point, one has to ask how a sufficient amount of training data could be collected without questionable privacy practices.

Here’s a sampling of basic information we should be collecting about our machine learning system in production:

  • How many times did the model make a prediction?
  • How long does it take the model to perform a prediction?
  • What is the distribution of the input data?
  • What features were used to make the prediction?
  • What results were predicted and what real results were observed later in the system?
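A minimal sketch of how the first two of these metrics could be collected in-process is shown below. In practice, client libraries for tools like Prometheus provide this functionality; the names used here are purely illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.DoubleUnaryOperator;

public class PredictionMonitor {
  private final AtomicLong predictionCount = new AtomicLong();
  private long totalLatencyNanos = 0;

  // Wrap every model call so that count and latency are captured automatically.
  synchronized double timedPredict(DoubleUnaryOperator model, double input) {
    long start = System.nanoTime();
    double result = model.applyAsDouble(input);
    totalLatencyNanos += System.nanoTime() - start;
    predictionCount.incrementAndGet();
    return result;
  }

  long count() { return predictionCount.get(); }

  synchronized double avgLatencyMillis() {
    return count() == 0 ? 0 : totalLatencyNanos / 1e6 / count();
  }

  public static void main(String[] args) {
    PredictionMonitor monitor = new PredictionMonitor();
    DoubleUnaryOperator model = x -> x * 0.5; // stand-in for a real model
    for (int i = 0; i < 100; i++) monitor.timedPredict(model, i);
    System.out.println("predictions: " + monitor.count());
    System.out.println("avg latency (ms): " + monitor.avgLatencyMillis());
  }
}
```

A scraper or exporter would then periodically ship these counters to the monitoring backend and feed the dashboard.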

Tools such as Logstash [25] or Prometheus [26] can be used to collect our monitoring data. To get a quick overview of the performance of the model, it is recommended to set up a dashboard that visualizes the most important metrics and to set up an automatic notification system that alerts the team in case of strong deviations so that appropriate countermeasures can be taken if needed.

Challenges on the road to MLOps

Companies face numerous staffing and team challenges on the road to a successful machine learning strategy, including the financial challenge of attracting experienced software engineers and data scientists. But even if we manage to assemble a good team, we need to enable its members to work together in the best possible way to bring out each one's strengths. Generally speaking, data scientists feel very comfortable using various statistical tools, machine learning algorithms, and Jupyter notebooks. However, they are often less familiar with version control software and the testing tools that are widely used in software engineering. While software engineers are familiar with these tools, they often lack the expertise to choose the right algorithm for a problem or to extract the last five percent of predictive accuracy from a model through skillful optimization. Our workflows and processes must be designed to support both groups as well as possible and enable smooth collaboration.

In terms of technological challenges, we face a broad and dynamic technology landscape that is constantly evolving. In light of this confusing situation, we are often faced with the question of how to get started with new machine learning initiatives.

How do I get started with MLOps?

Building MLOps workflows must always be an evolutionary process and cannot be done in a one-time “big bang” approach. Each company has its own unique set of challenges and needs when developing its machine learning strategies. In addition, there are different users of machine learning systems, different organizational structures, and existing application landscapes. One possible approach here may be to create an initial prototype without ML components. Then, one should start building the infrastructure and simplest models.

From this starting point, the infrastructure created can then be used to move forward in small and incremental steps to more complicated models and lift them into production environments until the desired level of predictive accuracy and performance has been achieved. Short development cycles for machine learning models, in the range of days rather than weeks or even months, enable faster response to changing circumstances and data. However, such short iteration cycles can only be achieved with a high degree of automation.

Developing, putting into production, and keeping machine learning models productive is itself a complex and iterative process with many inherent challenges. Even on a small or experimental scale, many companies find it difficult to implement these processes cleanly and without failures. The data science and machine learning development process is particularly challenging in that it requires careful balancing of the iterative, exploratory components and the more linear engineering components.

The tools and processes for MLOps presented in this article are an attempt to provide structure to these development processes.

Although there are no proven and standardized processes for MLOps, we can take many lessons learned from "traditional" software engineering. In many teams, there is a division between data scientists and software engineers, often handled with a pragmatic approach: data scientists develop a model in a Jupyter notebook and throw this notebook over the fence to software engineering, which then takes the model into production with a DevOps approach.

If we think back a few years, “throwing it over the fence” is exactly the problem that gave rise to DevOps. This is exactly how Werner Vogels (CTO at Amazon) described the separation between development and operations in his famous interview in 2006 [27]: “The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon.” Then came the phrase that looks good on DevOps conference T-shirts, coffee mugs and motivational posters: “You build it, you run it.” As naturally as development and operations belong together today, we must also make the collaboration between data science and DevOps a matter of course.

Links & Literature

[1] Tsymbal, Alexey: “The problem of concept drift. Definitions and related work. Technical Report”: https://www.scss.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf

[2] https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

[3] https://martinfowler.com/articles/cd4ml.html

[4] https://www.postgresql.org

[5] https://www.mysql.com

[6] https://www.elastic.co/de/elasticsearch/

[7] https://aws.amazon.com/redshift/

[8] https://cloud.google.com/bigquery

[9] https://docs.microsoft.com/de-de/azure/cosmos-db/

[10] https://airflow.apache.org

[11] https://feast.dev

[12] https://dvc.org

[13] https://www.pachyderm.com

[14] https://www.mlflow.org/docs/latest/tracking.html

[15] https://github.com/IDSIA/sacred

[16] https://grafana.com

[17] https://www.tensorflow.org/tensorboard

[18] https://pytorch.org/docs/stable/tensorboard.html

[19] https://www.jenkins.io

[20] https://www.terraform.io

[21] https://www.mlflow.org/docs/latest/model-registry.html

[22] https://docs.docker.com/registry/deploying/

[23] https://www.kubeflow.org

[24] https://www.tensorflow.org/tfx/guide/serving

[25] https://www.elastic.co/de/logstash

[26] https://prometheus.io

[27] https://queue.acm.org/detail.cfm?id=1142065

 

The post Tools & Processes for MLOps appeared first on ML Conference.

On pythonic tracks https://mlconference.ai/blog/on-pythonic-tracks/ Mon, 15 Mar 2021 13:00:26 +0000 https://mlconference.ai/?p=81377 Python has established itself as a quasi-standard in the field of machine learning over the last few years, in part due to the broad availability of libraries. It is logical that Oracle did not really like to watch this trend — after all, Java has to be widely used if it wants to earn serious money with its product. Some time ago, Oracle placed its own library Tribuo under an open source license.


In principle, Tribuo is an ML system intended to help close the feature gap between Python and Java in the field of artificial intelligence, at least to a certain extent.

According to the announcement, the product, which is licensed under the (very liberal) Apache license, can look back on several years of use within Oracle. This shows in the library's very extensive functionality: in addition to training its own models, it also offers interfaces to various other libraries, including TensorFlow.

The author and editors are aware that machine learning cannot be covered comprehensively in an article like this. However, precisely because of the importance of the topic, we want to show you a little of how you can experiment with Tribuo.


Modular library structure

Machine learning applications are not usually among the tasks you run on resource-constrained systems; IoT edge systems that run ML payloads usually have a hardware accelerator, such as the SiPeed MAiX. Nevertheless, Oracle's Tribuo library is offered in a modularized way, so in theory, developers can include only those parts of the project in their solutions that they really need. An overview of which functions are provided in the individual packages can be found under https://tribuo.org/learn/4.0/docs/packageoverview.html.

In this introductory article, we do not want to delve further into modularization, but instead, program a little on the fly. A Tribuo-based artificial intelligence application generally has the structure shown in Figure 1.

Fig. 1: Tribuo strives to be a full-stack artificial intelligence solution (Image source: Oracle).


The figure informs us that Tribuo seeks to process information from A to Z itself. On the far left is a DataSource object that collects the information to be processed by artificial intelligence and converts it into Tribuo’s own storage format called Example. These Example objects are then held in the form of a Dataset instance, which — as usual — moves towards a model that delivers predictions. A class called Evaluator can then make concrete decisions based on this information, which is usually quite general or probability-based.

An interesting aspect of the framework in this context is that many Tribuo classes come with a more or less generic system for configuration settings. In principle, an annotation is placed in front of the attributes whose behavior can be adjusted:

public class LinearSGDTrainer implements Trainer<Label>, WeightedExamples {
  @Config(description="The classification objective function to use.")
  private LabelObjective objective = new LogMulticlass();
  // ...
}

The Oracle Labs Configuration and Utilities Toolkit (OLCUT), described in detail at https://github.com/oracle/OLCUT, can read this information from XML files. In the case of the property we just saw, the parameterization could be done according to the following scheme:

<config>
  <component name="logistic"
             type="org.tribuo.classification.sgd.linear.LinearSGDTrainer">
    <property name="objective" value="log"/>
  </component>
</config>

The point of this approach, which sounds academic at first glance, is that the behavior of ML systems is strongly dependent on the parameters contained in the various elements. By implementing OLCUT, the developer gives the user the possibility to dump these settings, or return a system to a defined state with little effort.


Tribuo integration

After these introductory considerations, it is time to conduct your first experiments with Tribuo. Even though the library source code is available on GitHub for self-compilation, it is recommended to use a ready-made package for first attempts.

Oracle supports both Maven and Gradle and even offers (partial) support for Kotlin in the Gradle case. However, we want to work with classic tools in the following steps, which is why we grab an Eclipse instance and have it create a new Maven-based project skeleton via File | New | Maven Project.

In the first step of the generator, the IDE asks whether you want to load templates called archetypes. Select the Create a simple project (skip archetype selection) checkbox to have the IDE create a primitive project skeleton. In the next step, we open the file pom.xml to add the library according to the scheme in Listing 1.

 <name>tribuotest1</name>
  <dependencies>
    <dependency>
      <groupId>org.tribuo</groupId>
      <artifactId>tribuo-all</artifactId>
      <version>4.0.1</version>
      <type>pom</type>
    </dependency>
  </dependencies>
</project>

To verify successful integration, we then trigger a recompilation of the solution. If you are connected to the Internet, the IDE will inform you that the required components are being downloaded to your workstation.

Given the proximity to the classic ML ecosystem, it should not surprise anyone that the Tribuo examples are by and large provided in the form of Jupyter notebooks, a widespread form of presentation, especially in the research field, that is not suitable for serial use in production, at least in the author's opinion.

Therefore, we want to rely on classical Java programs in the following steps. The first problem we want to address is classification: in machine learning, this means assigning the provided information to one of a group of categories called classes. To evaluate different models and training levels, the field has established a set of standard sample data sets, that is, prebuilt databases with constant contents. In the following steps, we want to rely on the Iris data set provided at https://archive.ics.uci.edu/ml/datasets/iris. Funnily enough, the term iris does not refer to a part of the eye here, but to a plant species.

Fortunately, the data set is available in a form that can be used directly by Tribuo. For this reason, the author working under Linux opens a terminal window in the first step, creates a new working directory, and downloads the information from the server via wget:

t@T18:~$ mkdir tribuospace
t@T18:~$ cd tribuospace/
t@T18:~/tribuospace$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data

Next, we add a class containing a main method to our Eclipse project skeleton from the first step. Place in it the code from Listing 2.

import org.tribuo.DataSource;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.csv.CSVLoader;
import java.nio.file.Paths;

public class ClassifyWorker {
  public static void main(String[] args) throws Exception {
    var irisHeaders = new String[]{"sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"};
    DataSource<Label> irisData =
      new CSVLoader<>(new LabelFactory()).loadDataSource(Paths.get("bezdekIris.data"),
        irisHeaders[4],
        irisHeaders);
  }
}

This routine seems simple at first glance, but it’s a bit tricky in several respects. First, we are dealing with a class called Label — depending on the configuration of your Eclipse working environment, the IDE may even offer dozens of Label candidate classes. It’s important to make sure you choose the org.tribuo.classification.Label import shown here — a label is a cataloging category in Tribuo.

The syntax starting with var requires a current Java version: Tribuo itself requires at least JDK 8, but the var syntax, which can be found in just about every code sample, was only introduced in JDK 10, so that version or newer is the practical minimum here.

It follows from the logic that — depending on the system configuration — you may have to make extensive adjustments at this point. For example, the author working under Ubuntu 18.04 first had to provide a compatible JDK:

tamhan@TAMHAN18:~/tribuospace$ sudo apt-get install openjdk-11-jdk

Note that Eclipse sometimes cannot reach the new installation path by itself — in the package used by the author, the correct directory was /usr/lib/jvm/java-11-openjdk-amd64/bin/java.

Anyway, after successfully adjusting the Java execution configuration, we are able to compile our application. You may still need to add the NIO package to the Maven configuration, because the Tribuo library relies on this newer I/O API throughout for better performance.

Now that we have our program skeleton up and running, let’s look at it in turn to learn more about the inner workings of Tribuo applications. The first thing we have to deal with — think of Figure 1 on the far left — is data loading (Listing 3).

public static void main(String[] args) {
  try {
    var irisHeaders = new String[]{"sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"};
    DataSource<Label> irisData;

    irisData = new CSVLoader<>(new LabelFactory()).loadDataSource(
      Paths.get("/home/tamhan/tribuospace/bezdekIris.data"),
      irisHeaders[4],
      irisHeaders);

The provided data set contains comparatively little metadata; in particular, a header row is missing. For this reason, we have to pass an array to the CSVLoader class that informs it about the column names of the data set. Passing irisHeaders[4] designates the species column as the target variable of the model.

In the next step, we have to take care of splitting our dataset into a training group and a test group. Splitting the information into two groups is quite a common procedure in the field of machine learning. The test data is used to “verify” the trained model, while the actual training data is used to adjust and improve the parameters. In the case of our program, we want to make a 70-30 split between training and other data, which leads to the following code:

var splitIrisData = new TrainTestSplitter<>(irisData, 0.7, 1L);
var trainData = new MutableDataset<>(splitIrisData.getTrain());
var testData = new MutableDataset<>(splitIrisData.getTest());

Attentive readers may wonder at this point why the additional parameter 1L is passed. Tribuo works internally with a random generator. As with all, or at least most, pseudo-random generators, it behaves deterministically if you set the seed value to a constant. The constructor of the class TrainTestSplitter exposes this seed field; we pass the constant value one here to achieve reproducible behavior of the class.

At this point, we are ready for our first training run. For training machine-learning models, a number of standard procedures have become established. The fastest way to a runnable training system is the LogisticRegressionTrainer class, which comes with a set of predefined settings:

var linearTrainer = new LogisticRegressionTrainer();
Model<Label> linear = linearTrainer.train(trainData);

The call to the train method triggers the training process and gets the framework ready to issue predictions. It follows that our next task is to request such a prediction, which is forwarded to an evaluator in the next step. Last but not least, we need to output its results to the command line:

Prediction<Label> prediction = linear.predict(testData.getExample(0));
LabelEvaluation evaluation = new LabelEvaluator().evaluate(linear,testData);
double acc = evaluation.accuracy();
System.out.println(evaluation.toString());

At this point, our program is ready for a first small test run — Figure 2 shows how the results of the Iris data set present themselves on the author’s workstation.

Fig. 2: It works: Machine learning without Python!


Learning more about elements

Having enabled our little Tribuo classifier to perform classification against the Iris data set in the first step, let’s look in detail at some advanced features and functions of the classes used.

The first interesting feature is inspecting the work of the TrainTestSplitter class. For this purpose, it is enough to place the following code at a convenient place in the main method:

System.out.println(String.format("data size = %d, num features = %d, num classes = %d",
  trainData.size(), trainData.getFeatureMap().size(), trainData.getOutputInfo().size()));

The data set exposes a set of member functions that output additional information about the tuples it contains. Executing the code here informs us how many examples, how many features, and how many classes the training data set contains.

The LogisticRegressionTrainer class is also only one of several trainers you can use to train the model. If we want to rely on the CART process instead, we can adapt the code according to the following scheme:

var cartTrainer = new CARTClassificationTrainer();
Model<Label> tree = cartTrainer.train(trainData);
Prediction<Label> prediction = tree.predict(testData.getExample(0));
LabelEvaluation evaluation = new LabelEvaluator().evaluate(tree,testData);

If you run the program again with the modified classifier, additional information appears in the console, showing that Tribuo is also suitable for comparing machine learning processes. Incidentally, given the comparatively simple and unambiguous structure of the data set, both classifiers produce the same result.

Number Recognition

Another data set widely used in the ML field is the MNIST handwriting sample, a collection of handwritten digits derived from NIST surveys of people writing numbers. Logically, a model trained on it can be used to recognize postal codes, a process that helps save valuable man-hours and money when sorting letters.

Since we have just dealt with basic classification using a more or less synthetic data set, we want to work with this more realistic information in the next step. In the first step, this information must also be transferred to the workstation, which is done under Linux by calling wget:

tamhan@TAMHAN18:~/tribuospace$ wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz \
  http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
tamhan@TAMHAN18:~/tribuospace$ wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz \
  http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

The most interesting thing about the commands shown here is that each wget call downloads two files: for both the training and the test set, label and image information come in separate archives.

The MNIST data set is available in a data format known as IDX, which in many cases is compressed via GZIP and easier to handle. If you study the Tribuo library documentation available at https://tribuo.org/learn/4.0/javadoc/org/tribuo/datasource/IDXDataSource.html, you will find that with the class IDXDataSource, a loading tool intended for this purpose is available. It even decompresses the data on the fly if required.

It follows that our next task is to integrate the IDXDataSource class into our program workflow (Listing 4).

public static void main(String[] args) {
  try {
    LabelFactory myLF = new LabelFactory();
    DataSource<Label> ids1;
    ids1 = new IDXDataSource<Label>(
      Paths.get("/home/tamhan/tribuospace/t10k-images-idx3-ubyte.gz"),
      Paths.get("/home/tamhan/tribuospace/t10k-labels-idx1-ubyte.gz"), myLF);

From a technical point of view, the IDXDataSource does not differ significantly from the CSVDataSource used above. Interestingly, we pass two file paths here, because the MNIST data set is available in the form of a separate training and working data set.

By the way, developers who grew up with other IO libraries should pay attention to a little beginner’s trap with the constructor: The classes contained in Tribuo do not expect a string, but a finished Path instance. However, this can be generated with the unbureaucratic call of Paths.get().

Be that as it may, the next task of our program is the creation of a second DataSource, which takes care of the training data. Its constructor differs — logically — only in the file name for the source information (Listing 5).

DataSource<Label> ids2;
ids2 = new IDXDataSource<Label>(
  Paths.get("/home/tamhan/tribuospace/train-images-idx3-ubyte.gz"),
  Paths.get("/home/tamhan/tribuospace/train-labels-idx1-ubyte.gz"), myLF);
// ids2 holds the train-* files, ids1 the t10k test files
var trainData = new MutableDataset<>(ids2);
var testData = new MutableDataset<>(ids1);
var evaluator = new LabelEvaluator();

Most data classes included in Tribuo are generic with respect to the output format, at least to some degree. This affects us in that we must always pass a LabelFactory to inform the constructor of the desired output format and of the factory class to be used to create it.

The next task we take care of is converting the information into data sets and launching an evaluator and a CARTClassificationTrainer training class:

var cartTrainer = new CARTClassificationTrainer();
Model<Label> tree = cartTrainer.train(trainData);

The rest of the code, not printed here for space reasons, is then a one-to-one adaptation of the model processing used above. If you run the program on the workstation, you will see the result shown in Figure 3. By the way, don't be surprised if Tribuo takes some time here; the MNIST data set is somewhat larger than the Iris table used above.

Fig. 3: Not bad for a first attempt: the results of our CART classifier.


Since our achieved accuracy is not so great after all, we want to use another training algorithm here. The structure of Tribuo shown in Figure 1, based on standard interfaces, helps us make such a change with comparatively little code.

This time we want to use the LinearSGDTrainer class, which we integrate into the program with the command import org.tribuo.classification.sgd.linear.LinearSGDTrainer;. The explicit mention of the full package is not mere pedantry: besides the classification variant used here, there is also a LinearSGDTrainer intended for regression tasks, which cannot help us at this point.

Since the LinearSGDTrainer class does not come with a convenience constructor that initializes the trainer with a set of (more or less well-working) default values, we have to make some changes:

var sgdTrainer = new LinearSGDTrainer(new LogMulticlass(), SGD.getLinearDecaySGD(0.2), 10, 200, 1, 1L);
Model<Label> model = sgdTrainer.train(trainData);

At this point, this program version is also ready to use — since LinearSGD requires more computing power, it will take a little more time to process.

Excursus: Automatic configuration

LinearSGD may be a powerful machine learning algorithm — but the constructor is long and difficult to understand because of the number of parameters, especially without IntelliSense support.

Since on ML systems you generally work together with data scientists who may have limited Java skills, it would be nice to be able to provide the model information as a JSON or XML file.

This is where the OLCUT configuration system mentioned above comes in. If we include it in our solution, we can parameterize the class in JSON according to the scheme shown in Listing 6.

"name" : "cart",
"type" : "org.tribuo.classification.dtree.CARTClassificationTrainer",
"export" : "false",
"import" : "false",
"properties" : {
  "maxDepth" : "6",
  "impurity" : "gini",
  "seed" : "12345",
  "fractionFeaturesInSplit" : "0.5"
}

Even at first glance, it is noticeable that parameters like the seed to be used for the random generator are directly addressable — the line “seed” : “12345” should be found even by someone who is challenged by Java.

By the way, OLCUT is not limited to the linear creation of object instances — the system is also able to create nested class structures. In the markup we just printed, the attribute “impurity” : “gini” is an excellent example of this — it can be expressed according to the following scheme to generate an instance of the GiniIndex class:

"name" : "gini",
"type" : "org.tribuo.classification.dtree.impurity.GiniIndex",

Once we have such a configuration file, we can invoke an instance of the ConfigurationManager class as follows:

ConfigurationManager.addFileFormatFactory(new JsonConfigFactory());
String configFile = "example-config.json";
String configContents = String.join("\n", Files.readAllLines(Paths.get(configFile)));

OLCUT is agnostic with regard to the file format used; the actual logic for reading the data from the file system or configuration file is provided by an adapter class. In the case of our present code, we use the JSON file format, so we register an instance of JsonConfigFactory via the addFileFormatFactory method.

Next, we can also start parameterizing the elements of our machine learning application using ConfigurationManager:

var cm = new ConfigurationManager(configFile);
DataSource<Label> mnistTrain = (DataSource<Label>) cm.lookup("mnist-train");

The lookup method takes a string, which is compared against the Name attributes in the knowledge base. Provided that the configuration system finds a match at this point, it automatically invokes the class structure described in the file.

ConfigurationManager is extremely powerful in this respect. Oracle offers a comprehensive example at https://tribuo.org/learn/4.0/tutorials/configuration-tribuo-v4.html that harnesses the configuration system, using Jupyter notebooks, to generate a complex machine learning toolchain.

This measure, which seems pointless at first glance, is actually quite sensible. Machine learning systems live and die with the quality of the training data as well as with the parameterization — if these parameters can be conveniently addressed externally, it is easier to adapt the system to the present needs.
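The core trick behind such configuration systems can be illustrated in a few lines of plain Java: the type string from the config file is turned into an object via reflection. DummyTrainer is our own stand-in here; OLCUT resolves real Tribuo class names in the same way and additionally injects the configured properties:

```java
public class ReflectiveConfig {
  // A trainer stand-in that a config entry might point to; in OLCUT the
  // "type" attribute would name a real Tribuo class instead.
  public static class DummyTrainer {
    public String describe() { return "DummyTrainer"; }
  }

  // Turn the type string from a config file into a live object instance.
  static Object instantiate(String typeName) {
    try {
      return Class.forName(typeName).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate " + typeName, e);
    }
  }

  public static void main(String[] args) {
    // In a real system the class name would come from the JSON/XML file.
    Object trainer = instantiate(DummyTrainer.class.getName());
    System.out.println(((DummyTrainer) trainer).describe());
  }
}
```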

Grouping with synthetic data sets

In the field of machine learning, there is also a group of universal procedures that can be applied equally to various detailed problems. Having dealt with classification so far, we now turn to clustering. As shown in Figure 4, the focus here is on the idea of inscribing groups into a wild horde of data in order to be able to recognize trends or associations more easily.

Fig. 4: The inscription of the dividing lines divides the dataset (Image source: Wikimedia Commons/ChiRe).


It follows logically that we also need a database for clustering experiments. Tribuo helps us here with the ClusteringDataGenerator class, which harnesses the Gaussian distribution from mathematics and probability theory to generate test data sets. For now, we want to populate two test data fields according to the following scheme:

public static void main(String[] args) {
try {

var data = ClusteringDataGenerator.gaussianClusters(500, 1L);
var test = ClusteringDataGenerator.gaussianClusters(500, 2L);

The numbers passed as the second parameter determine the initial value (seed) for the random generator. Since we pass two different values here, the PRNG produces two different sequences of numbers, both of which nevertheless follow the Gaussian normal distribution.
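The seeding behaviour described here is easy to reproduce in a few lines of standard-library Python, a sketch of what gaussianClusters does conceptually rather than Tribuo's actual implementation:

```python
import random

def gaussian_cluster(n, seed, mean=(0.0, 0.0), std=1.0):
    """Draw n two-dimensional points from an isotropic Gaussian, reproducibly."""
    rng = random.Random(seed)
    return [(rng.gauss(mean[0], std), rng.gauss(mean[1], std)) for _ in range(n)]

# Two different seeds: two different sequences, same underlying distribution.
data = gaussian_cluster(500, seed=1)
test = gaussian_cluster(500, seed=2)
```

The same seed always reproduces the same sample, which is exactly what makes seeded generators valuable for repeatable experiments.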

In the field of clustering methods, the K-Means process has become well-established. For this reason alone, we want to use it again in our Tribuo example. If you look carefully at the structure of the code used, you will notice that the standardized structure is also apparent:

var trainer = new KMeansTrainer(5,10,Distance.EUCLIDEAN,1,1);
var model = trainer.train(data);
var centroids = model.getCentroidVectors();
for (var centroid : centroids) {
System.out.println(centroid);
}

Of particular interest is the Distance value passed in, which determines the method used to calculate or weight the two-dimensional distances between elements. Note that the Tribuo library comes with several Distance enums — the one we need here lives in the namespace org.tribuo.clustering.kmeans.KMeansTrainer.Distance.
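Under the hood, the trainer iterates an assignment step (nearest centroid by the chosen distance) and an update step (recompute centroids as cluster means). For readers coming from Python, here is a rough standard-library sketch of plain K-Means with Euclidean distance; it is not Tribuo's implementation:

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10, seed=0):
    """Plain K-Means: assign points to the nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ran empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

pts = [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9)]
print(sorted(kmeans(pts, 2)))
```

With two well-separated groups, the returned centroids settle near the respective cluster means after a handful of iterations.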

Now the program can be executed again — Figure 5 shows the generated center point matrices.

Fig. 5: Tribuo informs us of the whereabouts of clusters.


The red messages are progress reports: most of the classes included in Tribuo contain logic to inform the user about the progress of long-running processes in the console. However, since these operations also consume processing power, there is a way to influence the frequency of operation of the output logic in many constructors.

Be that as it may: To evaluate the results, it is helpful to know the Gaussian parameters used by the ClusteringDataGenerator function. Oracle, funnily enough, only reveals them in the Tribuo tutorial example; in any case, the five Gaussians are parameterized as follows:

N([ 0.0,0.0], [[1.0,0.0],[0.0,1.0]])
N([ 5.0,5.0], [[1.0,0.0],[0.0,1.0]])
N([ 2.5,2.5], [[1.0,0.5],[0.5,1.0]])
N([10.0,0.0], [[0.1,0.0],[0.0,0.1]])
N([-1.0,0.0], [[1.0,0.0],[0.0,0.1]])

Since a detailed discussion of the mathematical processes taking place in the background would go beyond the scope of this article, we would like to cede the evaluation of the results to Tribuo instead. The tool of choice is, once again, an element based on the evaluator principle. However, since we are dealing with clustering this time, the necessary code looks like this:

ClusteringEvaluator eval = new ClusteringEvaluator();
var mtTestEvaluation = eval.evaluate(model,test);
System.out.println(mtTestEvaluation.toString());

When you run the present program, you get back a human-readable result — the Tribuo library takes care of preparing the results contained in ClusteringEvaluator for convenient output on the command line or in the terminal:

Clustering Evaluation
Normalized MI = 0.8154291916732408
Adjusted MI = 0.8139169342020222
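The reported normalized mutual information compares the predicted clustering with the ground-truth grouping; a value of 1 means the two partitions agree up to a renaming of the cluster ids. The metric itself fits in a few lines of standard-library Python (shown here with the arithmetic-mean normalization; Tribuo's exact normalization variant may differ):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    n = len(u)
    joint, pu, pv = Counter(zip(u, v)), Counter(u), Counter(v)
    return sum((c / n) * math.log((c / n) / ((pu[a] / n) * (pv[b] / n)))
               for (a, b), c in joint.items())

def normalized_mi(u, v):
    """Mutual information scaled by the mean of both entropies (1 = identical grouping)."""
    h = (entropy(u) + entropy(v)) / 2
    return mutual_information(u, v) / h if h else 1.0

true_labels = [0, 0, 1, 1, 2, 2]
predicted   = [1, 1, 0, 0, 2, 2]  # same grouping, different cluster ids
print(normalized_mi(true_labels, predicted))  # close to 1.0
```

This also explains why the score is insensitive to how the clusters are numbered: only the grouping structure matters.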

Excursus: Faster when parallelized

Artificial intelligence tasks tend to consume immense amounts of computing power — if you don’t parallelize them, you lose out.

Parts of the Tribuo library are provided by Oracle out of the box with the necessary tooling that automatically distributes the tasks to be done across multiple cores of a workstation.

The trainer presented here is an excellent example of this. As a first task, we want to use the following scheme to ensure that both the training and the user data are much more extensive:

var data = ClusteringDataGenerator.gaussianClusters(50000, 1L);
var test = ClusteringDataGenerator.gaussianClusters(50000, 2L);

In the next step, it is sufficient to change a value in the constructor of the KMeansTrainer class according to the following scheme — if you pass eight here, you instruct the engine to use eight processor cores simultaneously:

var trainer = new KMeansTrainer(5,10,Distance.EUCLIDEAN,8,1);

At this point, you can release the program for testing again. When monitoring the overall computing power consumption with a tool like Top, you should see a brief flash of the utilization of all cores.
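The fan-out pattern itself is straightforward: the per-point assignment step of K-Means is embarrassingly parallel, so it can be spread across a worker pool. Here is a hedged Python sketch (threads are used for brevity; for CPU-bound work in Python you would reach for processes instead, and none of this is Tribuo code):

```python
from concurrent.futures import ThreadPoolExecutor

def nearest_centroid(point, centroids):
    """Index of the closest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def assign_parallel(points, centroids, workers=8):
    """Spread the per-point assignment step across a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: nearest_centroid(p, centroids), points))

points = [(0.0, 0.1), (5.0, 5.0), (0.2, 0.0), (4.9, 5.2)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(assign_parallel(points, centroids))  # -> [0, 1, 0, 1]
```

The result is identical to the serial computation; only the wall-clock time changes when the point list is large.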

Regression with Tribuo

By now, it should be obvious that one of the most important USPs of the Tribuo library is its ability to represent rather different artificial intelligence procedures via the common scheme of thought, process, and implementation shown in Figure 1. As a final task in this section, let’s turn to regression — in the world of artificial intelligence, the analysis of a data set to determine the relationship between input variables and an output variable. Regression powered early neural networks in game AI, and it is the area that non-initiated developers most readily associate with AI.

For this task we want to use a wine quality data set: the data sample, available in detail at https://archive.ics.uci.edu/ml/datasets/Wine+Quality, correlates wine ratings with various chemical analysis values. The purpose of the resulting system is thus to give an accurate judgment about the quality or rating of a wine after being fed these chemical measurements.

As always, the first task is to provide the test data, which we (of course) do again via wget:

tamhan@TAMHAN18:~/tribuospace$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Our first order of business is to load the sample information we just provided — a task that is done via a CSVLoader class:

try {
var regressionFactory = new RegressionFactory();
var csvLoader = new CSVLoader<>(';',regressionFactory);
var wineSource = csvLoader.loadDataSource(Paths.get("/home/tamhan/tribuospace/winequality-red.csv"),"quality");

Two things are new compared to the previous code: First, the CSVLoader is now additionally passed a semicolon. This character informs it that the submitted sample file has a “special” data format in which the individual pieces of information are separated by semicolons instead of commas. The second special feature is that we now use an instance of RegressionFactory as the factory instance — the previously used LabelFactory is not really suitable for regression analyses.

The wine data set is not divided into training and working data. For this reason, the TrainTestSplitter class from before returns to service; we assign 70 percent of the samples to training and reserve the remaining 30 percent for evaluation:

var splitter = new TrainTestSplitter<>(wineSource, 0.7f, 0L);
Dataset<Regressor> trainData = new MutableDataset<>(splitter.getTrain());
Dataset<Regressor> evalData = new MutableDataset<>(splitter.getTest());
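The splitting logic itself is simple: shuffle reproducibly, then cut at the requested fraction. A standard-library Python sketch of what such a splitter does conceptually (not the Tribuo class):

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle reproducibly, then cut the data into train and test partitions."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows, train_fraction=0.7, seed=0)
print(len(train), len(test))  # -> 70 30
```

The fixed seed guarantees the same split on every run, so training results stay comparable between experiments.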

In the next step we need a trainer and an evaluator (Listing 7).

var trainer = new CARTRegressionTrainer(6);
Model<Regressor> model = trainer.train(trainData);
RegressionEvaluator eval = new RegressionEvaluator();
var evaluation = eval.evaluate(model,trainData);
var dimension = new Regressor("DIM-0",Double.NaN);
System.out.printf("Evaluation (train):%n  RMSE %f%n  MAE %f%n  R^2 %f%n",
  evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));

Two things are new about this code: First, we now have to create a regressor that marks one of the values contained in the data set as relevant. Second, we perform an evaluation against the data set used for training as part of the model preparation. While this approach can lead to overtraining, it does provide some interesting parameters if used intelligently.

If you run our program as-is, you will see the following output on the command line:

Evaluation (train):
RMSE 0.545205
MAE 0.406670
R^2 0.544085

RMSE (root mean squared error) and MAE (mean absolute error) are both parameters that describe the quality of the model. For both, a lower value indicates a more accurate model.
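Both metrics are easy to compute by hand, which makes the evaluator's output simple to sanity-check; a small Python sketch:

```python
import math

def mae(actual, predicted):
    """Mean absolute error: every unit of error weighs the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: large errors are penalized quadratically."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual    = [5.0, 6.0, 7.0]
predicted = [5.0, 6.0, 5.0]
print(mae(actual, predicted), rmse(actual, predicted))
```

Because of the squaring, RMSE is never smaller than MAE and reacts much more strongly to single large misses, which is why the two values are usually reported together.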

As a final task, we have to take care of performing another evaluation, but it no longer gets the training data as a comparison. To do this, we simply need to adjust the values passed to the Evaluate method:

evaluation = eval.evaluate(model,evalData);
dimension = new Regressor("DIM-0",Double.NaN);
System.out.printf("Evaluation (test):%n RMSE %f%n MAE %f%n R^2 %f%n",
evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));

The reward for our effort is the screen image shown in Figure 6 — since we are no longer evaluating the original training data, the accuracy of the resulting system has deteriorated.

Fig. 6: Against real data the test delivers less accurate results.


This program behavior makes sense: if you evaluate a model against data it has never seen, you will naturally get worse results than if you short-circuit it with the information it was trained on. Note at this point, however, that overtraining is a classic antipattern, especially in financial economics, and has cost more than one algorithmic trader a lot of money.

This brings us to the end of our journey through the world of artificial intelligence with Tribuo. The library also supports anomaly detection as a fourth design pattern, which we will not discuss here. At https://tribuo.org/learn/4.0/tutorials/anomaly-tribuo-v4.html you can find a small tutorial — it follows the same thought pattern as the three methods presented so far, but works with a different set of classes.

Conclusion

Hand on heart: if you know Java well, you can learn Python — perhaps while swearing, but without real problems. It is more a question of not wanting to than of not being able to.

On the other hand, there is no question that Java payloads are much easier to integrate into various enterprise processes and enterprise toolchains than their Python-based counterparts. The use of the Tribuo library already provides a remedy since you do not have to bother with the manual brokering of values between the Python and the Java parts of the application.

If Oracle would improve the documentation a bit and offer even more examples, there would be absolutely nothing against the system. Thus, it is true that the hurdle to switching is somewhat higher: Those who have never worked with ML before will learn the basics faster with Python, because they will find more turnkey examples. On the other hand, it is also true that working with Tribuo is faster in many cases due to the various comfort functions. All in all, Oracle succeeds with Tribuo in a big way, which should give Java a good chance in the field of machine learning — at least from the point of view of an ecosystem.

The post On pythonic tracks appeared first on ML Conference.

Understanding Language is a Frontier for AI https://mlconference.ai/blog/understanding-language-is-the-new-frontier-for-ai/ Tue, 16 Feb 2021 08:51:18 +0000 https://mlconference.ai/?p=76183 In recent years we have seen a lot of breakthroughs in AI. We now have deep learning algorithms beating the best of the best in games like chess and go. In computer vision these algorithms now recognise faces with the same accuracy as humans. Except they don’t, they can do it for millions of faces while humans struggle to recognize more than a few hundred people.

The post Understanding Language is a Frontier for AI appeared first on ML Conference.

Board games are artificial tasks with only a few types of board pieces and rules that are simple enough to teach them to young children. It is no surprise that researchers chose this as a starting ground in the early days of AI research.

In 1955 Arthur Samuel was able to build the first ‘reinforcement’ algorithm that could teach itself to play checkers. Simply by playing against itself, it was able to go from the level of a 3-year-old, moving pieces randomly on the board, to the point where even Samuel himself could no longer win.

The most recent breakthrough in this domain happened in 2016 when Google’s AlphaGo was able to beat go world champion Lee Sedol. If Google had taken the same approach for go as IBM did in the 90s for chess, it would have had to wait another 10 years before this was possible. IBM’s DeepBlue ran on a supercomputer that was able to compute every possible move and counter move on the 8 by 8 chess board. With Moore’s law in mind, it would have taken until at least 2025 before we would have a supercomputer that could do the same for the 19 by 19 grid on the Go board.

What Google did instead was go back to reinforcement algorithms that could learn by playing against different versions of itself. So even though AlphaGo could not compute every possible move on the board, it was able to see patterns that Lee Sedol couldn’t.

Because these board games have such simple rules, it was really easy for Google to set up a simulation environment where these algorithms could play against slightly different versions of themselves. In less than an hour they could gather more data to learn from than any professional player was able to acquire in a lifetime. A few weeks of self-play delivered AlphaGo more data to learn from than all go players in history combined. It is this high volume of high-quality data that enabled AlphaGo to beat the world champion.

The reason we do not see Google’s algorithms controlling factories is that they require insane amounts of data to learn from. We cannot really afford to have our machines try out many different combinations of parameters to see which configurations give us the highest yield and revenue.

So, if we want AI to be better than us, we need at least vast amounts of high-quality labelled data. But there is more to it than that if we go to more complex tasks.

It took until 2015 before we were able to build an algorithm that could recognize faces with an accuracy comparable to humans. Facebook’s DeepFace is 97.4% accurate, just shy of the 97.5% human performance. For reference, the FBI algorithm used in Hollywood only reaches 85% accuracy, which means it is still wrong in more than 1 out of every 7 cases.

The FBI algorithm was handcrafted by a team of engineers. Each feature, like the size of the nose or the relative placement of the eyes, was manually programmed. The Facebook algorithm works with learned features instead. They used a special deep learning architecture called Convolutional Neural Networks that mimics the different layers in our visual cortex. Because we don’t know exactly how we see, the connections between these layers are learned by the algorithm.

The breakthrough Facebook pulled off here is actually twofold. It was not only able to learn good features, but they also gathered high quality labelled data from the millions of users who were kind enough to tag their friends in the photos they shared.

Vision is a problem that evolution has solved in millions of different species ranging from the smallest insects to the weirdest sea creatures. If you look at it that way, language seems to be much more complex. As far as we know, we are currently the only species that communicates with a complex language.

So, you need large quantities of high-quality labelled data and a smart architecture for AI to learn human level tasks.

Less than a decade ago, AI algorithms only counted how often certain words occurred in a body of text to understand what it was about. But this approach clearly ignores the fact that words have synonyms and only mean anything within context.

It took until 2013 for Tomas Mikolov and his team at Google to discover how to create an architecture that can learn the meaning of words. Their word2vec algorithm mapped synonyms on top of each other; it was able to model dimensions of meaning like size, gender and speed, and even to learn functional relations like countries and their capitals.

The missing piece however was context. It was 2018 before we saw a real breakthrough in this field when Google introduced the BERT model. Jacob Devlin and team were able to recycle an architecture typically used for machine translation so it can learn the meaning of words in relation to their context in a sentence.

By teaching the model to fill out missing words they were able to embed language structure in the BERT model. With only limited data they could finetune BERT for a multitude of tasks ranging from finding the right answer to a question to really understanding what a sentence is about.

In 2019 researchers at Facebook were able to take this even further. They trained a BERT-like model on more than 100 languages simultaneously. It was able to learn tasks in 1 language, for example English, and use it for the same task in any of the other languages like Arabic, Chinese, Hindi and so on. This language agnostic model has the same performance as BERT on the language it is trained on and there is only a limited impact going from 1 language to another.

All these techniques are really impressive in their own right, but it took until early 2020 for researchers at Google to beat human performance on select academic tasks. Google pushed the BERT architecture to its limits by training a much larger network on even more data. This so-called T5 model now performs better than humans in labelling sentences and finding the right answers to a question. The language agnostic mT5 model released in October is almost as good as bilingual humans at switching from 1 language to another, but on 100+ languages at once.

Now what is the catch here? Why aren’t we seeing these algorithms everywhere? Training the T5 algorithm costs around $1.3 million in cloud compute. And although Google was so kind as to share these models, they are so big that they take ages to fine tune for your problem.

Since we are still more or less on track with Moore’s law, compute power will continue to double every 2 years. We will see human level language understanding in real applications and in more than 100 different languages soon. But don’t expect to see these models in low latency tasks any time soon.

If you want to try out the current state-of-the-art of language agnostic chatbots, why don’t you give Chatlayer.ai by Sinch a try. On this bot platform you can build bots in your native tongue and they will understand any language you (or Google Translate) can.


Real-time anomaly detection with Kafka and Isolation Forests https://mlconference.ai/blog/real-time-anomaly-detection-with-kafka-and-isolation-forests/ Thu, 07 Jan 2021 08:43:20 +0000 https://mlconference.ai/?p=81108 Anomalies - or outliers - are ubiquitous in data. Be it due to measurement errors of sensors, unexpected events in the environment or faulty behaviour of a machine. In many cases, it makes sense to detect such anomalies in real time in order to be able to react immediately. The data streaming platform Apache Kafka and the Python library scikit-learn provide us with the necessary tools for this.

The post Real-time anomaly detection with Kafka and Isolation Forests appeared first on ML Conference.

Detecting anomalies in data series can be valuable in many contexts: from predictive maintenance to monitoring resource consumption and IT security. The time factor also plays a role in detection: the earlier the anomaly is detected, the better. Ideally, anomalies should be detected in real time and immediately after they occur. This can be achieved with just a few tools. In the following, we take a closer look at Kafka, Docker and Python with scikit-learn. The complete implementation of the listed code fragments can be found under github.com.


Apache Kafka

The core of Apache Kafka is a simple idea: a log to which data can be appended at will. This continuous log is then called a stream. On the one hand, so-called producers can write data into the stream; on the other hand, consumers can read this data. This simplicity makes Kafka very powerful and usable for many purposes. For example, sensor data can be sent into the stream directly after the measurement, while an anomaly detection software reads and processes the data. But why should there even be a stream between the data source (sensors) and the anomaly detection instead of sending the data directly to the anomaly detection?
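In Python pseudocode, the core abstraction boils down to an append-only list plus per-consumer read offsets. This is a toy sketch to illustrate the idea, not a Kafka client:

```python
class Stream:
    """A minimal in-memory sketch of Kafka's core idea: an append-only log."""
    def __init__(self):
        self.log = []

    def produce(self, message):
        self.log.append(message)

class Consumer:
    """Each consumer keeps its own read position (offset) in the log."""
    def __init__(self, stream):
        self.stream = stream
        self.offset = 0

    def poll(self):
        messages = self.stream.log[self.offset:]
        self.offset = len(self.stream.log)
        return messages

stream = Stream()
stream.produce({"cpu": 0.93})
stream.produce({"cpu": 0.12})

detector = Consumer(stream)
archiver = Consumer(stream)
print(detector.poll())  # both consumers read the full log independently
print(archiver.poll())
```

Because the log is never modified in place and each consumer only advances its own offset, a restarted consumer can resume exactly where it left off, which is the first of the advantages discussed next.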

There are at least three good reasons for this. First, Kafka can run distributed in a cluster and offer high reliability (provided that it is actually running in a cluster of servers and that another server can take over in case of failure). If the anomaly detection fails, the stream can be processed after a restart and continue where it left off last. So no data will be lost.

Secondly, Kafka offers advantages when a single container for real-time anomaly detection is not working fast enough. In this case, load balancing would be needed to distribute the data. Kafka solves this problem with partitions. Each stream can be divided into several partitions (which are potentially available on several servers). A partition is then assigned to a consumer. This allows multiple consumers (also called consumer groups) to dock to the same stream and allows the load to be distributed between several consumers.
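The assignment of messages to partitions is typically derived from a hash of the message key, so that all messages with the same key land in the same partition and keep their relative order there. A sketch of the idea (Kafka's default partitioner uses a murmur2 hash; crc32 merely stands in here):

```python
import zlib

def partition_for(key, num_partitions):
    """Map a message key deterministically to a partition via a key hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Every message from one sensor lands in the same partition,
# so ordering per sensor is preserved.
print(partition_for("sensor-42", 3))
```

Each partition can then be handed to a different consumer of the same consumer group, which is what distributes the load.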

Thirdly, it should be mentioned that Kafka enables the functionalities to be decoupled. If, for example, the sensor data are additionally stored in a database, this can be realized by an additional consumer group that docks onto the stream and stores the data in the database. This is why companies like LinkedIn, where Kafka was originally developed, have declared Kafka to be the central nervous system through which all data is passed.


A Kafka cluster can be easily started using Docker. A Docker-Compose-File is suitable for this. It allows the user to start several Docker containers at the same time. A Docker network is also created (here called kafka_cluster_default), which all containers join automatically and which enables their communication. The following commands start the cluster from the command line (the commands can also be found in the README.md in the GitHub repository for copying):

git clone https://github.com/NKDataConv/anomalie-erkennung.git
cd anomalie-erkennung/
docker-compose -f ./kafka_cluster/docker-compose.yml up -d --build

Here the Kafka Cluster consists of three components:

  1. The broker is the core component and contains and manages the stream.
  2. ZooKeeper is an independent tool and is used at Kafka to coordinate several brokers. Even if only one broker is started, ZooKeeper is still needed.
  3. The schema registry is an optional component, but has established itself as the best practice in connection with Kafka.

The Schema Registry solves the following problem: in practice, the responsibilities for the producer (i.e. the origin of the data) are often detached from the responsibilities for the consumer (i.e. the data processing). To make this cooperation run smoothly, the Schema Registry establishes a kind of contract between the two parties. This contract contains a schema for the data. If the data does not match the schema, the Schema Registry does not allow it to be added to the corresponding stream; it therefore checks that the schema is adhered to each time data is sent. This schema is also used to serialize the data, namely in the Avro format. Serialization with Avro allows you to evolve the schema later, for example to add additional data fields, while backward compatibility of the schema is always ensured. A further advantage of the Schema Registry in combination with Avro is that the data in the Kafka stream ends up without the schema, because the schema is stored in the Schema Registry. This means that only the data itself is in the stream. In contrast, when sending data in JSON format, for example, the names of all fields would have to be sent with each message. This makes the Avro format very efficient for Kafka.
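The contract-checking idea can be sketched in a few lines of plain Python. This is a toy validator, not the Confluent Schema Registry API, and the field names are invented for illustration:

```python
# A toy schema in the spirit of an Avro record definition (field names invented).
SCHEMA = {"cpu": float, "network_bytes": int, "disk_bytes": int}

def validate(record, schema):
    """Reject records whose fields or types do not match the agreed schema."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[field], kind) for field, kind in schema.items())

print(validate({"cpu": 0.93, "network_bytes": 1024, "disk_bytes": 0}, SCHEMA))  # True
print(validate({"cpu": "high"}, SCHEMA))  # False
```

The real registry additionally versions schemas and checks compatibility between versions, but the principle is the same: data that violates the contract never reaches the stream.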

To test the Kafka cluster, we can send data from the command line into the stream and read it out again with a command line consumer. To do this we first create a stream. In Kafka, a single stream is also called a topic. The command for this consists of two parts. First, we have to take a detour via the Docker-Compose-File to address Kafka. The command important for Kafka starts with kafka-topics. In the following the topic test-stream is created:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-topics --create --bootstrap-server localhost:9092 --topic test-stream

Now we can send messages to the topic. To do this, messages can be sent with ENTER after the following command has been issued:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-console-producer --broker-list localhost:9092 --topic test-stream

At the same time, a consumer can be opened in a second terminal, which outputs the messages on the command line:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-console-consumer --bootstrap-server localhost:9092 --topic test-stream --from-beginning

Both processes can be aborted with CTRL + C. The Schema Registry has not yet been used here. It will only be used in the next step.

Kafka Producer with Python

There are several ways to implement a producer for Kafka. Kafka Connect allows you to connect many external tools such as databases or the file system and send their data to the stream. But there are also libraries for various programming languages. For Python, for example, there are at least three working, actively maintained libraries. To simulate data in real time, we use a time series that measures the resource usage of a server. This time series includes CPU usage and the number of bytes of network and disk read operations. This time series was recorded with Amazon CloudWatch and is available on Kaggle. An overview of the data is shown in Figure 1, with a few obvious outliers for each time series.



Fig. 1: Overview of Amazon CloudWatch data

In the time series, the values are recorded at 5-minute intervals. To save time, they are sent into the stream every second. In the following, an image for the Docker container is created. The code for the producer can be viewed and adjusted in the subfolder producer. Then the container is started as part of the existing network kafka_cluster_default.

docker build ./producer -t kafka-producer
docker run --network="kafka_cluster_default" -it kafka-producer
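The time-compression trick — recorded at 5-minute intervals, replayed once per second — can be sketched as a small generator. This is illustrative only; the sample rows below are invented, and in the real producer each emitted row would be sent to Kafka instead of printed:

```python
import time

def replay(rows, interval_s=1.0, sleep=time.sleep):
    """Emit recorded rows one by one, waiting interval_s between emissions."""
    for row in rows:
        yield row
        sleep(interval_s)

# Invented sample rows; the real data was recorded at 5-minute intervals.
recorded = [("14:30:00", 0.93), ("14:35:00", 0.12)]
for timestamp, cpu in replay(recorded, sleep=lambda s: None):  # no waiting in the demo
    print(timestamp, cpu)
```

Injecting the sleep function as a parameter keeps the replay speed configurable and makes the generator trivially testable.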

Now we want to turn to the other side of the stream.

Kafka Consumer

For anomaly detection it makes sense to implement two consumers: one for real-time detection, the second for training the model. The training consumer can be stopped after training and restarted for nightly training runs, which is useful when the data has changed; the algorithm is then trained again and a new model is saved. It is also conceivable to trigger this container at regular intervals using a cron job. In production, however, this should not be given a completely free hand: an intermediate step that validates the model makes sense at this point; more about this later.

The trained model is then stored in the Docker volume. The volume is mounted in the subfolder data. The first consumer has access to this Docker Volume. This consumer can load the trained model and evaluate the data for anomalies.

Technologically, libraries also exist for the consumer in many languages. Since Python is often the language of choice for data scientists, it is the obvious choice. It is interesting to note that the schema no longer needs to be specified for implementation on the consumer side. It is automatically obtained via the schema registry, and the message is deserialized. An overview of the containers described so far is shown in Figure 2. Before we also run the consumer, let’s take a closer look at what is algorithmically done in the containers.



Fig. 2: Overview of docker containers for anomaly detection

Isolation Forests

For anomaly detection, a colourful bouquet of algorithms exists. Furthermore, the so-called no-free-lunch theorem makes it difficult for us to choose. The main point of this theorem is not that we should take money with us to lunch break, but that nothing is free when choosing an algorithm. In concrete terms, this means that we don’t know whether a particular algorithm works best for our data unless we test all algorithms (with all possible settings) — an impossible task. Nevertheless, we can sort the algorithms a bit.

Basically, in anomaly detection one can first distinguish between supervised and unsupervised learning. If the data indicates when an anomaly has occurred, this is referred to as supervised learning. Classification algorithms are suitable for this problem. Here, the classes “anomaly” and “no anomaly” are used. If, on the other hand, the data set does not indicate when an anomaly has occurred, one speaks of unsupervised learning. Since this is usually the case in practice, we will concentrate here on the approaches for unsupervised learning. Any prediction algorithm can be used for this problem. The next data point arriving is predicted. When the corresponding data point arrives, it is compared with the forecast. Normally the forecast should be close to the data point. However, if the forecast is far off, the data point is classified as an anomaly.

What “far off” exactly means can be determined with self-selected threshold values or with statistical tests — the Seasonal Hybrid ESD algorithm developed by Twitter, for example, takes the latter route. A similar approach is pursued with autoencoders: neural networks that reduce the dimensionality of the data in a first step and then try to restore the original data from the reduced representation in a second step. Normally this works well; if the reconstruction fails for a data point, it is classified as an anomaly. Another common approach is the use of One-Class Support Vector Machines. These draw as narrow a boundary as possible around the training data, with a few data points remaining outside the limits as exceptions. The number of exceptions must be given to the algorithm as a percentage and determines what share of the data will be flagged as anomalies, so some prior estimate of the number of anomalies is needed. Each new data point is then checked to see whether it lies within or outside the boundary; outside corresponds to an anomaly.
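The forecast-and-compare idea can be sketched in a few lines: predict each point as the mean of its predecessors and flag large deviations. This is a toy moving-average detector with a hand-picked threshold, not Twitter's Seasonal Hybrid ESD:

```python
def detect_anomalies(series, window=3, threshold=3.0):
    """Forecast each point as the mean of its `window` predecessors and
    flag indices whose absolute forecast error exceeds the threshold."""
    anomalies = []
    for i in range(window, len(series)):
        forecast = sum(series[i - window:i]) / window
        if abs(series[i] - forecast) > threshold:
            anomalies.append(i)
    return anomalies

cpu = [0.5, 0.6, 0.5, 0.6, 0.5, 9.0, 0.6, 0.5]
print(detect_anomalies(cpu))  # -> [5], the injected spike
```

Note that the spike also contaminates the windows that follow it, which is one reason real detectors use robust statistics or seasonal models instead of a plain moving average.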


Another approach that works well is Isolation Forest. The idea behind this algorithm is that anomalies can be distinguished from the rest of the data with the help of a few dividing lines (Figure 3). Data points that can be isolated with the fewest separators are therefore the anomalies (box: “Isolation Forests”).

Isolation Forests

Isolation Forests are based on several decision trees. A single decision tree in turn consists of nodes and leaves. The decisions are made in the nodes. To do this, a random feature (that is, a feature of the data, in our example, CPU, network, or memory resource) and a random split value is selected for the node. If a data point in the respective feature has a value that is less than the split value, it is placed in the left branch, otherwise in the right branch. Each branch is followed by the next node, and the whole thing happens again: random feature and random split value. This continues until each data point ends up in a separate sheet. So this isolates each data point. The procedure is repeated with the same data for several trees. These trees form the forest and give the algorithm its name.

So at this point we have isolated all training data in each tree into leaves. Now we have to decide which leaves or data points are an anomaly. Here is what we can determine: anomalies are easier to isolate. This is also shown in Figure 3, which shows two-dimensional data for illustration. Points that lie slightly outside the data cloud can be isolated with a single split (shown as a black line on the left). In contrast, data points that are close together can only be isolated with multiple splits (shown as multiple separators on the right).

This means that, on average, anomalies require fewer splits to be isolated. And this is exactly how anomalies are defined in Isolation Forests: anomalies are the data points that are isolated with the fewest splits. Now all that remains is to decide which share of the data should be classified as anomalous. This is a parameter that Isolation Forests need for training, so one needs a prior idea of how many anomalies are contained in the data set. Alternatively, an iterative approach can be used: different values are tried out and the result is checked. Another parameter is the number of decision trees for the forest, since the number of splits is calculated as the average over several decision trees. The more decision trees, the more accurate the result. One hundred trees is a generous value that is sufficient for a good result. This may sound like a lot at first, but the calculation actually takes only seconds. This efficiency is also an advantage of Isolation Forests.
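The definition via the average number of splits is reflected in scikit-learn's score_samples method, which returns lower scores for points that are easier to isolate. A small sketch with synthetic, invented data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Dense two-dimensional point cloud plus one far-off point (invented data)
rng = np.random.default_rng(1)
x = rng.normal(size=(300, 2))
outlier = np.array([[8.0, 8.0]])

iso = IsolationForest(n_estimators=100, random_state=1).fit(x)

# score_samples: the lower the score, the fewer splits a point needs,
# i.e. the more anomalous it is
scores = iso.score_samples(np.vstack([x[:1], outlier]))
print(scores)
```

The far-off point receives a clearly lower score than a point from the cloud.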



Fig. 3: Demonstration of data splitting with Isolation Forests: left with as few splits as possible, right only with several splits

Isolation Forests are easy to implement with Python and the library scikit-learn. The parameter n_estimators corresponds to the number of decision trees and contamination to the expected proportion of anomalies. The fit method is used to trigger the training process on the data. The model is then stored in the Docker Volume.

import joblib
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(n_estimators=100, contamination=float(.02))
iso_forest.fit(x_train)
joblib.dump(iso_forest, '/data/iso_forest.joblib')

From the Docker Volume, the Anomaly Detection Consumer can then call the model and evaluate streaming data using the predict method:

iso_forest = joblib.load('/data/iso_forest.joblib')

anomaly = iso_forest.predict(df_predict)

A nice feature of scikit-learn is the simplicity and consistency of the API. Isolation Forests can be exchanged for the One-Class Support Vector Machines mentioned above: besides the import, only the line constructing the IsolationForest has to be replaced by OneClassSVM. The data preparation and methods remain the same. Let us now turn to that data preparation.

Feature Engineering with time series

In Machine Learning, each column of the data set (CPU, network and disk) is also called a feature. Feature engineering is the preparation of these features for use, and there are a few things to consider. Often the features are scaled to bring all of them into a certain value range. MinMax scaling, for example, transforms all values into the range 0 to 1: the maximum of the original values corresponds to 1, the minimum to 0, and all other values are scaled in between. Scaling is, for example, a prerequisite for training neural networks. For Isolation Forests in particular, scaling is not necessary.
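MinMax scaling as described can be sketched with scikit-learn; the CPU values are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented CPU measurements in percent, one column = one feature
cpu = np.array([[20.0], [35.0], [50.0], [80.0]])

scaler = MinMaxScaler()
cpu_scaled = scaler.fit_transform(cpu)
print(cpu_scaled.ravel())  # minimum maps to 0, maximum to 1
```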

Another important step is often the conversion of categorical features that are represented by a string. Let’s assume we have an additional column for the processor with the strings Processor_1 and Processor_2, which indicate which processor the measured value refers to. These strings can be converted to zeros and ones using one-hot encoding, where zero represents the category Processor_1 and one represents Processor_2. If there are more than two categories, it makes sense to create a separate feature (or column) for each category. This will result in a Processor_1 column, a Processor_2 column, and so on, with each column consisting of zeros and ones for the corresponding category.
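One-hot encoding of such a hypothetical processor column can be sketched with pandas using get_dummies; the column and its values are invented for illustration:

```python
import pandas as pd

# Hypothetical categorical column as described in the text
df = pd.DataFrame({"processor": ["Processor_1", "Processor_2", "Processor_1"]})

# get_dummies creates one column of zeros and ones per category
one_hot = pd.get_dummies(df["processor"], prefix="processor", dtype=int)
print(one_hot)
```

With more than two categories, get_dummies automatically creates one column per category, exactly as described above.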

Feature engineering also means preparing the existing information in a useful way and providing additional information. Especially for time series, it is often important to extract time information from the timestamp. For example, many time series fluctuate over the course of a day; then it may be important to prepare the hour as a feature. If the fluctuations come with the seasons, a useful feature is the month. Or you can differentiate between working days and vacation days. All this information can easily be extracted from the timestamp and often provides significant added value for the algorithm. A useful feature in our case could be business hours: this feature is zero during the time from 8 to 18 and one otherwise. With the Python library pandas, such features can easily be created:

df['hour'] = df.timestamp.dt.hour
df['business_hour'] = ((df.hour < 8) | (df.hour > 18)).astype("int")

It is always important that sufficient data is available. It is therefore useless to introduce the feature month if there is no complete year of training data available. Then the first appearance of the month of May would be detected as an anomaly in live operation, if it was not already present in the training data.

At the same time, one should note that more is not always better. On the contrary! If a feature is created that has nothing to do with the anomalies, the algorithm will still use it for detection and find spurious irregularities. In this case, thinking about the data and the problem is essential.

Trade-off in anomaly detection

Training an algorithm to correctly detect all anomalies is difficult or even impossible. You should therefore expect the algorithm to fail in some cases, but these errors can be controlled to a certain degree. The first kind of error occurs when an anomaly is not recognized as such by the algorithm, so an anomaly is missed (a false negative).

The second kind of error is the reverse: an anomaly is reported, but in reality it is not one, i.e. a false alarm (a false positive). Both errors are of course bad.

However, depending on the scenario, one of the two errors may be more decisive. If each detected anomaly results in expensive machine maintenance, we should try to have as few false alarms as possible. On the other hand, anomaly detection could monitor important values in the health care sector. In doing so, we would rather accept a few false alarms and look after the patient more often than miss an emergency. These two errors can be weighed against each other in most anomaly detection algorithms. This is done by setting a threshold value or, in the case of isolation forests, by setting the contamination parameter. The higher this value is set, the more anomalies are expected by the algorithm and the more are detected, of course. This reduces the probability that an anomaly will be missed.

The argumentation works the other way round as well: if the parameter is set low, the few detected anomalies are most likely real ones, but some anomalies may be missed. So here again it is important to think the problem through and find meaningful parameters by iterative trial and error.
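The iterative trial and error could, for example, look like the following sketch: several contamination values are tried on synthetic data and the share of flagged points is inspected. The data and the candidate values are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic training data (invented)
rng = np.random.default_rng(0)
x_train = rng.normal(size=(1000, 3))

# Try several contamination values and check the share of flagged points
for contamination in (0.01, 0.05, 0.10):
    iso = IsolationForest(n_estimators=100,
                          contamination=contamination,
                          random_state=0)
    labels = iso.fit_predict(x_train)   # -1 marks an anomaly
    share = np.mean(labels == -1)
    print(f"contamination={contamination:.2f} -> flagged {share:.2%}")
```

On the training data, the flagged share closely tracks the chosen contamination value, which makes the effect of the parameter easy to inspect.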

Real-time anomaly detection

Now we have waited long enough and the producer should have written some data into the Kafka stream by now. Time to start training for anomaly detection. This happens in a second terminal with Docker using the following commands:

docker build ./Consumer_ML_Training -t kafka-consumer-training
docker run --network="kafka_cluster_default" --volume $(pwd)/data:/data:rw -it kafka-consumer-training

The training occurs on 60 percent of the data. The remaining 40 percent is used for testing the algorithm. For time series, it is important that the training data lies ahead of the test data in terms of time. Otherwise it would be an unauthorized look into a crystal ball.
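Such a time-ordered 60/40 split can be sketched as follows; the data is invented, and the actual training script in the repository may differ in detail:

```python
import pandas as pd

# Example time series sorted by timestamp (invented values)
df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=10, freq="min"),
    "cpu": range(10),
})

# No shuffling: the training data must lie before the test data in time
split = int(len(df) * 0.6)
x_train = df.iloc[:split]
x_test = df.iloc[split:]
print(len(x_train), len(x_test))  # 6 4
```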

An evaluation on the test data is stored in the Docker volume as a graph, together with the trained model. Such an evaluation is visible in Figure 4. The vertical red lines correspond to the detected anomalies. For this evaluation, we waited until all data was available in the stream.



Fig. 4: Detected anomalies in the data set

If we are satisfied with the result, we can now use the following commands to start the anomaly detection:

docker build ./Consumer_Prediction -t kafka-consumer-prediction
docker run --network="kafka_cluster_default" --volume $(pwd)/data:/data:ro -it kafka-consumer-prediction

As described at the beginning, we can also use several Docker containers for evaluation to distribute the load. For this, the Kafka Topic load must be distributed over two partitions. So let’s increase the number of partitions to two:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-topics --alter --zookeeper zookeeper:2181 --topic anomaly_tutorial --partitions 2

Now we simply start a second consumer for anomaly detection with the known command:

docker run --network="kafka_cluster_default" --volume $(pwd)/data:/data:ro -it kafka-consumer-prediction

Kafka now automatically takes care of assigning partitions to consumers. The switchover can take a few seconds. An anomaly is reported in the Docker logs with "No anomaly" or "Anomaly at time *". How a detected anomaly is processed further is left open here. It would be particularly elegant to write the anomaly into a new Kafka topic; an ELK stack could dock onto it with Kafka Connect and visualize the anomalies.

The post Real-time anomaly detection with Kafka and Isolation Forests appeared first on ML Conference.

]]>
Let’s visualize the coronavirus pandemic https://mlconference.ai/blog/lets-visualize-the-coronavirus-pandemic/ Wed, 16 Dec 2020 14:16:20 +0000 https://mlconference.ai/?p=81043 Since February, we have been inundated in the media with diagrams and graphics on the spread of the coronavirus. The data comes from freely accessible sources and can be used by everyone. But how do you turn the source data into a data set that can be used to create something visual like a dashboard? With Python and modules like pandas, this is no magic trick.

The post Let’s visualize the coronavirus pandemic appeared first on ML Conference.

]]>
These are crazy times we have been living in since the beginning of 2020; a pandemic has turned public life upside down. News sites offer live tickers with the latest news on infections, recoveries and death rates, and there is no medium that does not use a chart for visualization. Institutes like the Robert Koch Institute (RKI) or the Johns Hopkins University provide dashboards. We live in a world dominated by data, even during a pandemic.

The good thing is that most data on the pandemic is publicly available. Johns Hopkins University, for example, makes its data available in an open GitHub repository. So what could be more obvious than to create your own dashboard with this freely accessible data? This article uses coronavirus data to illustrate how to get from data cleansing to enriching data from other sources to creating a dashboard using Plotly’s dash. First of all, an important note: The data is not interpreted or analyzed in any way. This must be left to experts such as virologists, otherwise false conclusions may be drawn. Even if data is available for almost all countries, it is not necessarily comparable; each country uses different methods for testing infections. Some countries even have too few tests, so that no uniform picture can be drawn. The data set serves only as an example.

 

First, the work

In order to be able to use the data, we need to get it in a standardized form for our purposes. The data of Johns Hopkins University is stored in a GitHub repository. Basically, it is divided into two categories: first, continuously as time-series data and second, as a daily report in a separate CSV file. For the dashboard we need both sources. With the time-series data, it is easy to create line charts and plot gradients, curves, etc. From this we later generate the temporal course of the case numbers as line charts. Furthermore, we can calculate and display the growth rates from the data.
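Calculating a growth rate from such a cumulative time series could, for example, use pandas' pct_change; the case numbers below are invented for illustration:

```python
import pandas as pd

# Invented cumulative case numbers for one country
cases = pd.Series([100, 150, 225, 225], name="Germany")

# Daily growth rate relative to the previous day
growth = cases.pct_change()
print(growth.tolist())  # [nan, 0.5, 0.5, 0.0]
```

The first value is NaN because there is no previous day to compare against.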

 

Set up environment

To prepare the data we use Python and the library pandas. pandas is the Swiss army knife for Python in terms of data analysis and cleanup. Since we need to install some modules for Python, I recommend creating a multi-environment setup for Python. Personally I use Anaconda, but there are also alternatives like Virtualenv. Using isolated environments does not change anything in the system-wide installation of Python, so I strongly recommend it. Furthermore you can work with different Python versions and export the dependencies for deployment more easily. Regardless of the system used, the first step is to activate the environment. With Anaconda and the command line tool conda this works as follows:

$ conda create -n corona-dashboard python=3.8.2

The environment called corona-dashboard is created with Python 3.8.2; the necessary modules are installed into it in the next steps. To use the environment, we activate it with

$ conda activate corona-dashboard

We do not want to go deeper into Anaconda here, but for more information you can refer to the official documentation.

Once we have activated our environment, we install the necessary modules. In the first step these are pandas and Jupyter notebook. A Jupyter notebook is, simply put, a digital notebook. The notebooks contain markdown and code snippets that can be executed immediately. They are perfectly suited for iterative data cleansing and for developing the necessary steps. The notebooks can also be used to develop diagrams before they are transferred into the final scripts.

$ conda install pandas
$ conda install -c conda-forge notebook

In the following we will perform all steps in Jupyter notebooks. The notebooks can be found in the repository for this article on GitHub. To use them, the server must be started:

$ jupyter notebook

After it’s up and running, the browser will display the overview in the current directory. Click on new and choose your desired environment to open a new Jupyter notebook. Now we can start on cleaning up the data.

 

Cleaning time-based data

In the first step, we import pandas, which provides methods for reading and manipulating data. Here we need a parser for CSV files, for which the method read_csv is provided. At least one path to a file or a buffer is expected as a parameter. If you specify a URL to a CSV as a parameter, pandas will read and process it without any problems. To ensure consistent, traceable data, we access a downloaded file that is available in the repository.

df = pd.read_csv("time_series_covid19_confirmed_global.csv")

To check this, we use the instruction df.head() to output the first five lines of the DataFrame (fig. 1).

Fig. 1: The unadjusted DataFrame

 

To become aware of the structure, we can have the column names printed. This is done with the statement df.columns. You can see in figure 2 that there is one column for each day in the table. Furthermore there are geo coordinates for the respective countries. The countries are partly divided into provinces and federal states. For the time-based data, we do not need geo-coordinates, and we can also remove the column with the states. We achieve this in pandas with the following methods on the data frame:

df.drop(columns=['Lat', 'Long', 'Province/State'], inplace = True)

The method drop expects the data we want to get rid of as a parameter. In this case, these are three columns: Lat, Long and Province/State. The names must be specified exactly, including capitalization and any spaces. The second parameter, inplace, applies the operation directly to our DataFrame; without it, pandas returns the modified DataFrame without changing the original. If you look at the frame with df.head(), you will see that the desired columns have been discarded.

The division of some countries into provinces or states results in multiple entries for some. An example is China. Therefore, it makes sense to group the data by country. For this, pandas provides a powerful grouping function.

df_grouped = df.groupby(['Country/Region'], as_index=False).sum()

Using groupby with the column by which the rows should be grouped, the rows are combined. The chained .sum() adds up the values of each group. The return value is a new DataFrame with the grouped and summed data, so that we can access the time-related data for all countries. Next, we swap rows and columns so that we get one column per country and one row per day.

df_grouped = df_grouped.set_index('Country/Region').transpose()

Before transposing, we set the index to Country/Region to get a clean frame. The disadvantage of this is that the new index is then called Country/Region.

The next adjustment is to move the date into a separate column. To do this, we reset the index again. This turns our old index into a column named index, which contains the dates and must be renamed and set to the correct data type. The cleanup is complete (Fig. 3).

df_grouped.reset_index(level=0, inplace=True)
df_grouped.rename(columns={'index': 'Date'}, inplace=True)
df_grouped['Date'] = pd.to_datetime(df_grouped['Date'])

Fig. 2: Overview of the columns in the DataFrame

 

Fig. 3: The cleaned DataFrame

 

The fact that the index continues to be called Country/Region won’t bother us from now on because the final CSV file is saved without any index.

df_grouped.to_csv('../data/worldwide_timeseries.csv', index=False)

This means that the time-series data are adjusted and can be used for each country. If, for example, we only want to use the data for Germany, a new DataFrame can be created as a copy by selecting the desired columns.

df_germany = df_grouped[['Date', 'Germany']].copy()

Packed into a function we get the source code from Listing 1 for the cleanup of the time-series data.

def clean_and_save_timeseries(df):
    drop_columns = ['Lat', 'Long', 'Province/State']

    df.drop(columns=drop_columns, inplace=True)

    df_grouped = df.groupby(['Country/Region'], as_index=False).sum()
    df_grouped = df_grouped.set_index('Country/Region').transpose()
    df_grouped.reset_index(level=0, inplace=True)
    df_grouped.rename(columns={'index': 'Date'}, inplace=True)
    df_grouped['Date'] = pd.to_datetime(df_grouped['Date'])

    df_grouped.to_csv('../data/worldwide_timeseries.csv', index=False)

 

The function expects to receive the DataFrame to be cleaned as a parameter. Furthermore, all steps described above are applied and the CSV file is saved accordingly.

 

Clean up case numbers per country

In the repository of Johns Hopkins University you can find more CSVs with case numbers per country, divided into provinces/states. Additionally, the administrations for North America and the geographic coordinates are listed. With this data, we can generate an overview on a world map. The adjustment is less complex. As with the cleanup of the temporal data, we read the CSV into a pandas DataFrame and look at the first lines (Fig. 4).

import pandas as pd

df = pd.read_csv('04-13-2020.csv')
df.head()

Fig. 4: The original DataFrame

 

In addition to the actual case numbers, information on provinces/states, geo-coordinates and other metadata are included that we do not need in the dashboard. Therefore we remove the columns from our DataFrame.

df.drop(columns=['FIPS','Lat', 'Long_', 'Combined_Key', 'Admin2', 'Province_State'], inplace=True)

To get the summed up figures for the respective countries, we group the data by Country_Region, assign it to a new DataFrame and sum it up.

df_cases = df.groupby(['Country_Region'], as_index=False).sum()

We have thus completed the clean-up operation. We will save the cleaned CSV for later use. Packaged into a function, the steps look like those shown in Listing 2.

def clean_and_save_worldwide(df):
    drop_columns = ['FIPS', 'Lat', 'Long_', 'Combined_Key', 'Admin2', 'Province_State']

    df.drop(columns=drop_columns, inplace=True)

    df_cases = df.groupby(['Country_Region'], as_index=False).sum()
    df_cases.to_csv('Total_cases_wordlwide.csv')

The function receives the DataFrame and applies the steps described above. At the end a cleaned CSV file is saved.

 

Clean up case numbers for Germany

In addition to the data from Johns Hopkins University, we want to use the data from the RKI with case numbers from Germany, broken down by federal state. This will allow us to create a detailed view of the German case numbers. The RKI does not make this data available as CSV in a repository; instead, the figures are displayed in an HTML table on a separate web page and updated daily. pandas provides the function read_html() for such cases. If you pass a URL to a web page, it is loaded and parsed. All tables found are returned as DataFrames in a list. I do not recommend accessing the web page directly and reading the tables. On the one hand, there are pages (as well as those of the RKI) which prevent this, on the other hand, the requests should be kept low, especially during development. For our purposes we therefore store the website locally with a wget https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html. To ensure that the examples work consistently, this page is part of the article’s repository. pandas doesn’t care whether we read the page remotely or pass it as a file path.

import pandas as pd
df = pd.read_html('../Fallzahlen.html', decimal=',', thousands='.')

Since the numbers on the web page are formatted, we pass some information about them to read_html: the decimal separator is a comma and the thousands separator is a dot. pandas can thus infer the correct data types when reading the data. To see how many tables were found, we check the length of the list with a simple len(df). In this case it returns 1, which means that exactly one interpretable table was found. We save this DataFrame into a new variable for further cleanup:

df_de = df[0]

Fig. 5: Case numbers from Germany in the unadjusted dataframe

 

Since the table (Fig. 5) has column headings that are difficult to process, we rename them. The number of columns is known, so we do this step pragmatically:

df_de.columns = ['Bundesland', 'Anzahl', 'diff', 'Pro_Tsd', 'Gestorben']

Note: The number of columns must match exactly. Should the format of the table change, this must be adjusted. This makes the DataFrame look much more usable. We don’t need the column of the previous day’s increments in the dashboard, so it will be removed.

Furthermore, the last line of the table contains the grand total. It is also not necessary and will be removed. We directly access the index (16) of the line and remove it. This gives us our final dataframe with the numbers for the individual states of Germany.

df_de.drop(columns=['diff'], index=[16], inplace=True)

We will now save this data in a new CSV for further use.

df_de.to_csv('cases_germany_states.csv')

A resulting function looks as follows:

def clean_and_save_german_states(df):
    df.columns = ['Bundesland', 'Anzahl', 'diff', 'Pro_Tsd', 'Gestorben']
    df.drop(columns=['diff'], index=[16], inplace=True)
    df.to_csv('cases_germany_states.csv')

As before, the function expects the dataframe as a transfer value. So we have cleaned up all existing data and saved it in new CSV files. In the next step, we will start developing the diagrams with Plotly.

 

Creating the visualizations

Since we will later use Dash from Plotly to develop the dashboard, we will create the desired visualizations in advance in a Jupyter notebook. Plotly is a library for creating interactive diagrams and maps for Python, R and JavaScript. The output of the diagrams is done automatically with Plotly.js. This gives all diagrams additional functions (zoom, save as PNG, etc.) without creating them yourself. Before that we have to install some more required modules.

# Anaconda users
conda install -c plotly plotly

# pip
pip install plotly

In order to create diagrams with Plotly as quickly and easily as possible, we will use Plotly Express, which is now part of Plotly. Plotly Express is the easy-to-use high-level interface to Plotly, which works with “tidy” data and considerably simplifies the creation of diagrams. Since we are working with pandas DataFrames again, we import pandas and Plotly Express at the beginning.

import pandas as pd
import plotly.express as px

Starting with the presentation of the development of infections over time in a line chart, we will import the previously cleaned and saved dataframe.

df_ts = pd.read_csv('../data/worldwide_timeseries.csv')

Thanks to the data cleanup, creating a line chart with Plotly is very easy:

fig_ts = px.line(df_ts,
                 x="Date",
                 y="Germany")

The parameters are mostly self-explanatory. The first parameter specifies which DataFrame is used. With x='Date' and y='Germany' we determine the columns to be used from the DataFrame: the date on the horizontal axis and the country's case numbers on the vertical axis. To make the diagram understandable, we set further parameters for the title and the axis labels. For the y-axis we define a linear scale; if we want a logarithmic representation, we can set 'log' instead of 'linear'.

fig_ts.update_layout(xaxis_type='date',
                     xaxis={
                         'title': 'Datum'
                     },
                     yaxis={
                         'title': 'Infektionen',
                         'type': 'linear',
                     },
                     title_text='Infektionen in Deutschland')

To display diagrams in Jupyter notebooks, we need to tell the show() method that we are working in a notebook (Fig. 6).

fig_ts.show('notebook')

Fig. 6: Progression of infections over time

 

That’s all there is to it. The visualizations can be configured in many ways. For this I refer to Plotly’s extensive documentation.

Let us continue with the creation of further visualizations. For the dashboard, cases from Germany are to be presented. For this purpose, the likewise cleaned DataFrame is read in and then sorted in ascending order.

df_fs = pd.read_csv('../data/cases_germany_states.csv')
df_fs.sort_values(by=['Anzahl'], ascending=True, inplace=True)

The first representation is a simple horizontal bar chart (Fig. 7). The code can be seen in Listing 3.

fig_fs = px.bar(df_fs,
                x='Anzahl',
                y='Bundesland',
                hover_data=['Gestorben'],
                height=600,
                orientation='h',
                labels={'Gestorben': 'Bereits verstorben'},
                template='ggplot2')

fig_fs.update_layout(xaxis={
                         'title': 'Anzahl der Infektionen'
                     },
                     yaxis={
                         'title': '',
                     },
                     title_text='Infektionen in Deutschland')

The DataFrame is expected as the first parameter. Then we configure which columns represent the x and y axes. With hover_data we determine which data is additionally displayed when hovering over a bar. With labels we can set a comprehensible frontend label for a column: since 'Gestorben' ('died') sounds a bit blunt, we set it to 'Bereits verstorben' ('already deceased'). The parameter orientation specifies whether we create a vertical (v) or horizontal (h) bar chart, and height sets the height to 600 pixels. Finally, we update the layout parameters, as we did for the line chart.

Fig. 7: Number of cases in Germany as a bar chart

 

To make the distribution easier to see, we create a pie chart.

fig_fs_pie = px.pie(df_fs,
                    values='Anzahl',
                    names='Bundesland',
                    title='Verteilung auf Bundesländer',
                    template='ggplot2')

The parameters are mostly self-explanatory. values defines which column of the DataFrame is used, and names, which column contains the labels. In our case these are the names of the federal states (Fig. 8).

Fig. 8: Case numbers Germany as a pie chart

 

Finally, we generate a world map with the distribution of cases per country. We import the adjusted data of the worldwide infections.

df_ww_cases = pd.read_csv('../data/Total_cases_wordlwide.csv')

Then we create a scatter-geo-plot. Simply put, this will draw a bubble for each country, the size of which corresponds to the number of cases. The code can be seen in Listing 4.

fig_geo_ww = px.scatter_geo(df_ww_cases, 
             locations="Country_Region",
             hover_name="Country_Region",
             hover_data=['Confirmed', 'Recovered', 'Deaths'],
             size="Confirmed",
             locationmode='country names',
             text='Country_Region',
             scope='world',
             labels={
               'Country_Region':'Land',
               'Confirmed':'Bestätigte Fälle',
               'Recovered':'Wieder geheilt',
               'Deaths':'Verstorbene',
             },
             projection="equirectangular",
             size_max=35,
             template='ggplot2',)

The parameters are somewhat more extensive, but no less easy to understand. In principle, it is an assignment of the fields from the DataFrame. The parameters labels and hover_data have the same functions as before. With locations we specify which column of our DataFrame contains the locations/countries. So that Plotly Express knows how to assign them on a world map, we set locationmode to country names; Plotly can thus carry out the assignment at country level without exact geo-coordinates. text determines the heading of the mouseovers. The size of the bubbles is calculated from the confirmed cases in the DataFrame, which we pass to size; with size_max we define the maximum bubble size (in this case 35 pixels). With scope we control the focus of the world map. Possible values are 'usa', 'europe', 'asia', 'africa', 'north america', and 'south america'; the map is then not only focused on that region, but also limited to it. The appropriate labels and other metaparameters for the display are applied when the layout is updated:

fig_geo_ww.update_layout(
    title_text='Bestätigte Infektionen weltweit',
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular'
    )
)

geo defines the representation of the world map. The parameter projection_type is worth mentioning, since it controls the map projection. For the dashboard we use equirectangular, also known as the plate carrée projection. The finished map is shown in Figure 9.

Fig. 9: Distribution of cases worldwide

 

With this we have created all the necessary images to be used in our dashboard. In the next step we will come to the dashboard itself. Since Dash is used by Plotly, we can reuse, configure and interactively design our diagrams relatively easily.

 

Creating the dashboard with Dash from Plotly

Dash is a Python framework for building web applications. It is based on Flask, Plotly.js and React.js. Dash is very well suited for creating data visualization applications with highly customized user interfaces in Python. It is especially interesting for anyone working with data in Python. Since we have already created all diagrams with Plotly in advance, the next step to a web-based dashboard is straightforward.

We create an interactive dashboard from the data in our cleaned up CSV files – without a single line of HTML and JavaScript. We want to limit ourselves to basic functions and not go deeper into special features, as this would go beyond the scope of this article. For this I refer to the excellent documentation of Dash and the tutorial contained therein. Dash creates the website from a declarative description of the Python structure. To work with Dash, we need the Python module dash. It is installed with conda or pip:

$ conda install -c conda-forge dash
# or
$ pip install dash

For the dashboard we create a Python script called app.py. The first step is to import the required modules. Since we are importing the data with pandas in addition to Dash, we need to import the following packages:

import pandas as pd
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
# Module with helper functions
import dashboard_figures as mf

Besides the actual module for Dash, we need the core components as well as the HTML components. The former contain all the components we need for the diagrams, the latter the HTML building blocks. In addition, a Dash application object must be created, typically with app = dash.Dash(__name__), before the layout can be described. Once the modules are imported, the dashboard can be developed (Figure 10).

Let’s start with the global data. The breakdown is relatively simple: a heading, with two rows below it, each with two columns. We place the world map on the left and key facts as text to the right. Below that we want a combo box to select one or more countries for display in the line chart, plus the choice between a linear and a logarithmic representation. The German infection data follows at the bottom, consisting of the bar chart and the distribution as a pie chart.

In Dash the layout is described in code. Since we have already generated all diagrams with Plotly Express, we outsource them to an external module with helper functions. I won’t go into this code in detail, because it contains the diagrams we have already created. The code is stored in the GitHub repository. Before the diagrams can be displayed, we need to import the CSV files and have the diagrams created (Listing 5).

df_ww = pd.read_csv('../data/worldwide_timeseries.csv')
df_total = pd.read_csv('../data/Total_cases_wordlwide.csv')
df_germany = pd.read_csv('../data/cases_germany_states.csv')
 
fig_geo_ww = mf.get_wordlwide_cases(df_total)
fig_germany_bar = mf.get_german_barchart(df_germany)
fig_germany_pie = mf.get_german_piechart(df_germany)
 
fig_line = mf.get_lineplot('Germany', df_ww)
 
ww_facts = mf.get_worldwide_facts(df_total)
 
fact_string = '''Weltweit gibt es insgesamt {} Infektionen. 
  Davon sind bereits {} Menschen verstorben und {} gelten als geheilt. 
  Somit sind offiziell {} Menschen aktiv an einer Coronainfektion erkrankt.'''
 
fact_string = fact_string.format(ww_facts['total'], 
                                             ww_facts['deaths'], 
                                             ww_facts['recovered'], 
                                             ww_facts['active'])
 
countries = mf.get_country_names(df_ww)
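The str.format call above can just as well be written as an f-string. A minimal sketch, with made-up numbers standing in for the values returned by get_worldwide_facts:

```python
# Illustrative values; in the dashboard they come from
# mf.get_worldwide_facts(df_total)
ww_facts = {'total': 1000, 'deaths': 50, 'recovered': 600, 'active': 350}

# f-strings interpolate the values directly into the text
fact_string = (
    f"Weltweit gibt es insgesamt {ww_facts['total']} Infektionen. "
    f"Davon sind bereits {ww_facts['deaths']} Menschen verstorben "
    f"und {ww_facts['recovered']} gelten als geheilt."
)
```

Both variants produce the same string; f-strings simply keep the value next to the place where it is inserted.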

Further help functions return a list of all countries and aggregated data. For the combo box, a list of options must be created that contains all countries.

dd_options = []
for key in countries:
    dd_options.append({
        'label': key,
        'value': key
    })
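The same option list can also be built in one line with a list comprehension; a small sketch with a hand-written country list standing in for the real one:

```python
countries = ['Germany', 'France', 'Italy']  # stand-in for the real country list

# One dict per country: 'label' is what the dropdown displays,
# 'value' is what the callback receives when the entry is selected.
dd_options = [{'label': key, 'value': key} for key in countries]
```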

That was all the preparation that was necessary. The layout of the web application consists of the components provided by Dash. The complete description of the layout can be seen in Listing 6.

app.layout = html.Div(children=[
  html.H1(children='COVID-19: Dashboard', style={'textAlign': 'center'}),
  html.H2(children='Weltweite Verteilung', style={'textAlign': 'center'}),
  # World and Facts
  html.Div(children=[
 
    html.Div(children=[
 
        dcc.Graph(figure=fig_geo_ww),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '66%'}),
 
    html.Div(children=[
 
      html.H3(children='Fakten'),
      html.P(children=fact_string)
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '33%'})
 
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
 
  # Combobox and Checkbox
  html.Div(children=[
    html.Div(children=[
      # combobox
      dcc.Dropdown(
        id='country-dropdown',
        options=dd_options,
        value=['Germany'],
        multi=True
      ),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '66%'}),
 
    html.Div(children=[
      # Radio-Buttons
      dcc.RadioItems(
        id='yaxis-type',
        options=[{'label': i, 'value': i} for i in ['Linear', 'Log']],
        value='Linear',
        labelStyle={'display':'inline-block'}
      ),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '33%'})
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
  
  # Lineplot and Facts
  html.Div(children=[
    html.Div(children=[
 
      #Line plot: Infections
      dcc.Graph(figure=fig_line, id='infections_line'),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '100%'}),
 
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
 
  # Germany
  html.H2(children='Zahlen aus Deutschland', style={'textAlign': 'center'}),
  html.Div(children=[
    html.Div(children=[
 
      # Barchart Germany
      dcc.Graph(figure=fig_germany_bar),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '50%'}),
 
    html.Div(children=[
 
      # Pie Chart Germany
      dcc.Graph(figure=fig_germany_pie),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '50%'})
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'})
])

The layout is described by Divs, or the respective wrappers for Divs, as code. There is a wrapper function for each HTML element. These functions can be nested however you like to create the desired layout. Instead of using inline styles, you can also work with your own classes and with a stylesheet. For our purposes, however, the inline styles and the external stylesheet read in at the beginning are sufficient. The approach of a declarative description for layouts has the advantage that we do not have to leave our code and do not have to be an expert in JavaScript or HTML. The focus is on dashboard development. If you look closely, you will find core components for the visualizations in addition to HTML components.
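The nesting principle can be illustrated with a toy renderer. This is not Dash's implementation, just the idea: every wrapper function returns a node, and nesting the function calls nests the resulting HTML.

```python
# Toy sketch of the declarative layout idea (not Dash's actual API):
# each wrapper returns a rendered node, nesting calls nests the markup.
def el(tag, children=None):
    # A list of children is concatenated, a string is used as-is
    inner = ''.join(children) if isinstance(children, list) else (children or '')
    return f'<{tag}>{inner}</{tag}>'

page = el('div', children=[
    el('h1', children='COVID-19: Dashboard'),
    el('div', children=[el('p', children='Fakten')]),
])
```

Dash works the same way conceptually, except that html.Div and friends return component objects that React renders in the browser instead of raw strings.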

...
# Bar chart Germany
dcc.Graph(figure=fig_germany_bar),
...

The respective diagrams are inserted at these positions. For the line chart, we assign an ID so that we can access the chart later.

dcc.Graph(figure=fig_line, id='infections_line'),

Using the selected values from the combo box and radio buttons, we can adjust the display of the lines. In the combo box we want to provide a selection of countries. Multiple selection is also possible if you want to be able to see several countries in one diagram (see section # combo box in Listing 6). Furthermore, the radio buttons control the division of the y-axis according to linear or logarithmic display (see section # Radio-Buttons in Listing 6). In order to apply the selected options to the chart, we need to create an update function. We annotate this function with an @app.callback decorator (Listing 7).

# Interactive Line Chart
@app.callback(
  Output('infections_line', 'figure'),
  [Input('country-dropdown', 'value'), Input('yaxis-type', 'value')])
def update_graph(countries, axis_type):
  countries = countries if len(countries) > 0 else ['Germany']
  data_value = []
  for country in countries:
    data_value.append(dict(
      x= df_ww['Date'], 
      y= df_ww[country], 
      type= 'scatter', 
      mode= 'lines', 
      name= str(country)
    ))
 
  title = ', '.join(countries)
  title = 'Infektionen: ' + title
  return {
    'data': data_value,
    'layout': dict( 
      yaxis={
        'type': 'linear' if axis_type == 'Linear' else 'log'
      },
      hovermode='closest',
      title = title
    )
  }

The decorator has inputs and outputs. Output defines to which element the return is applied and what kind of element it is. In our case we want to access the line chart with the ID infections_line, which is of the type figure. Input describes which input values are needed in the function. In our case, these are the values from the combo box and the selection via the radio buttons. The corresponding function gets these values and we can work with them.
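How such a decorator wires inputs to outputs is easy to sketch in plain Python. This is a simplified illustration of the registration mechanism, not Dash's actual internals:

```python
# Simplified illustration: a decorator records which function is
# responsible for updating which output element.
callbacks = {}

def callback(output_id):
    def register(func):
        callbacks[output_id] = func  # remember the update function
        return func
    return register

@callback('infections_line')
def update_graph(countries, axis_type):
    # Stand-in for the real update: return a minimal "figure" dict
    return {'countries': countries, 'yaxis': axis_type.lower()}

# When an input changes, the framework looks up the function and calls
# it with the current input values:
result = callbacks['infections_line'](['Germany'], 'Log')
```

Dash does essentially this behind the scenes: the @app.callback decorator registers update_graph, and whenever a declared Input changes, the function is invoked and its return value replaces the declared Output.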

Since the countries arrive as a list, each line to be drawn must be configured individually; in our example this is done with a simple for-in loop that picks the required column from the dataframe for each country. We then return the new configuration of the diagram, in which we also set the scaling of the y-axis to linear or logarithmic depending on the selection. The last step is to start a debug server:

if __name__ == '__main__':
  app.run_server(debug=True)

If we start the Python script, a server is started and the application can be called in the browser. If one or more countries are selected, they are displayed as a line chart. The development server supports hot reloading: If we change something in the code and save it, the page is automatically reloaded. With this we have created our own coronavirus/Covid-19 dashboard with interactive diagrams. All without having to write HTML, JavaScript and CSS. Congratulations!

Fig. 10: The finished dashboard

 

 

Summary

Together we have worked through a simple data project: from cleaning up the data, through creating the visualizations, to delivering a dashboard. Even in a small project you can see that cleaning and visualization take up most of the time. If the data is poor, the end product is guaranteed to be no better. The dashboard itself, by contrast, is created quickly.

Charts can be created very quickly with Plotly and are interactive out of the box, so they are not just static images. The ability to create dashboards quickly is especially helpful in the prototypical development of larger data visualization projects. There is no need for a large team of data scientists and developers, ideas can be tried out quickly and discarded if necessary. Plotly and Dash are more than sufficient for many purposes. If the data is cleaned up right from the start, the next steps are much easier. If you are dealing with data and its visualization, you should take a look at Plotly and not ignore Dash.

One final remark: In this example I have not used predictions and predictive algorithms and models. On the one hand, the amount of data is too small. On the other hand, a prediction with exponential data is always difficult and should be treated with great care, otherwise wrong conclusions will be drawn.

The post Let’s visualize the coronavirus pandemic appeared first on ML Conference.
