Top Tools, APIs, and Frameworks for Efficient ML Development (https://mlconference.ai/blog/tools-apis-frameworks/)

Maximizing Machine Learning with Data Lakehouse and Databricks: A Guide to Enhanced AI Workflows
https://mlconference.ai/blog/data-lakehouse-databricks-ml-performance/ (Mon, 18 Mar 2024)

In today’s rapidly evolving data landscape, leveraging a data lakehouse architecture is becoming a key strategy for enhancing machine learning workflows. Databricks, a leader in unified data analytics, provides a robust platform that integrates seamlessly with the data lakehouse model, enabling data engineers, data scientists, and machine learning (ML) developers to collaborate more effectively. In this article, we explore how Databricks empowers organizations to streamline data processing, accelerate model development, and unlock the full potential of artificial intelligence (AI) by providing a centralized data repository. This solution not only improves scalability and efficiency but also facilitates end-to-end machine learning pipelines, from data ingestion to model deployment.


Demystify the power of DataBricks Lakehouse! This comprehensive guide dives into setting up, running, and optimizing machine learning experiments on this industry-leading platform. Whether you’re a seasoned data scientist or just getting started, this hands-on approach will equip you with the skills to unlock the full potential of DataBricks.

DataBricks is known as a Data Lakehouse: a combination of a data warehouse and a data lake. This article will take a closer look at what this means in practice and how you can start your first experiments with DataBricks.{.preface}

You should know that the DataBricks platform is a spin-off of the Apache Spark project. As with many open source projects, the idea behind it was to combine open source technology with quality of life improvements.

DataBricks in particular obviously focuses on ease of use and a flat learning curve. Especially for projects with a short lifespan, developers shouldn’t resist the temptation to use an inexpensive, turnkey product instead of a technically more innovative system.


Commissioning DataBricks

DataBricks currently runs exclusively on infrastructure from cloud providers; at the time of writing, the company supports at least the “Big Three”. Interestingly, in the [FAQ] seen in **Figure 1**, they explicitly state that they don’t currently offer the option of hosting the DataBricks system locally.

Fig. 1: If you want to host DataBricks locally, you’re out of luck.{.caption}

Interestingly, DataBricks has a close relationship with all three cloud providers. In many cases, you don’t have to pay separate AWS or other cloud costs when purchasing a commercial DataBricks product. Instead, you pay DataBricks directly, and DataBricks settles the costs with the provider.

For newcomers, there is the DataBricks Community Edition, a light version provided in collaboration with Amazon AWS. It’s completely free to use, but only allows 15 GB of data volume and is limited in terms of some convenience functions, scheduling, and the REST API. Still, this edition is enough for our first experiments.

So let’s call up the [DataBricks Community Edition log-in page] in the browser of our choice. After clicking on the sign-up link, DataBricks takes you to the fully-fledged log-in portal, where you can register for a free 14-day trial of the platform’s full version. In order to use the Community Edition, you must first fully complete the registration process.

In the second step, be sure not to choose a cloud provider in the window shown in **Figure 2**. Instead, click the Get started with Community Edition link at the bottom to continue the registration process for the Community Edition.

Databricks cloud provider selection screen with options for AWS, Microsoft Azure, and Google Cloud Platform, along with a button to continue and a link to Community Edition.

Fig. 2: Care is needed when activating the Community Edition.{.caption}

In the next step, you need to solve a captcha to identify yourself as a human user. The confirmation message seen in **Figure 3** is shared between the commercial and Community Edition sign-up flows, so don’t be put off by the reference to the free trial phase.

Databricks email verification screen prompting users to check their email to start their trial, with links to an administration guide and a quickstart guide for deploying the first workspace.

Fig. 3: Community Edition users also see this message.{.caption}

Entering a valid e-mail address is especially important: DataBricks will send a confirmation email, and clicking the link in it lets you set a password. You then land in the product’s start interface, [which you can reach again later here](https://community.cloud.databricks.com/).


Working through the Quickstart notebook

In many respects, commercial companies are interested in flattening the learning curve for potential customers. This can be seen in DataBricks’ guide: the Quickstart tutorial section is prominently placed on the homepage, offering the Start Tutorial link.

Clicking it switches the web interface into a different mode, and your efforts are rewarded with a user interface similar to several other Python notebook systems.

The visual similarities are no coincidence. DataBricks relies on the IPython engine in the background and is more or less compatible with standalone product versions.

Creating a cluster is especially important here. Let me explain: the developer writes the logic needed to complete the machine learning task in the notebooks.

But actually executing this logic requires computing power that normally far exceeds the resources behind the average developer’s browser window. Interestingly, DataBricks’ clusters are available in two versions. The all-purpose class is a classic cloud VM that (started manually and/or on a schedule) is also available to a group of users for collaboratively working on tasks.

System number two is the job cluster. This is a dedicated cluster created for a batch task. It is automatically terminated after a successful or failed job processing. It’s important to note that the administrator isn’t able to keep a job cluster alive after the batch process finishes.

Be that as it may, in the next step, we place our mouse pointer on the far left to expand the menu. DataBricks offers two different operating modes by default.

We want to choose Data Science and Engineering. In the next step, open the Compute menu. Here, we can manage the computing power sources in our account.

Activate the All-Purpose-Compute tab and click the Create Compute option to make a new cluster element. You can freely choose a name. I opted for SUSTest1.

Note that several runtime versions are available. In the following, we opt for the 7.3 LTS option (Scala 2.12, Spark 3.0.1).

As free Community Edition users, we don’t have the option of choosing different cluster hardware sizes. Our system only ever has 15 GB of memory and deactivates after two hours of inactivity.

So, all you need to do to start the configuration process is click the Create Cluster button. Then, click the compute element again to switch to the overview table. This lists all of your account’s compute resources side-by-side.

Generating the compute resources will take some time. To the far left of the table, as seen in **Figure 4**, there is a rotating circle symbol to show that our cluster is in progress.

Databricks compute configuration screen showing options for all-purpose compute and job compute, with a button to create new compute resources and a list of existing resources labeled 'SUSTest1'.

Fig. 4: If the circle is rotating, the cluster isn’t ready for combat yet.{.caption}

The start process can take up to five minutes. Once the work is done, a green tick symbol will appear, as seen in **Figure 5**. As a free version user, you can’t assume that your cluster will keep running in perpetuity, so if you notice strange behavior in DataBricks, it makes sense to check the cluster status.

Screenshot of DataBricks’ Compute tab showing an active all-purpose compute resource named ‘SUSTest1’.

Fig. 5: The green tick means it’s ready for action.{.caption}

Once our work is done, we can return to the notebook. The Connect option is available in the top right-hand corner. Click it and select the cluster to establish a connection. Then click the Run All icon next to it to instruct all commands in the notebook to execute. The system executes the commands in the individual cells in real-time, as seen in **Figure 6**. Be sure to scroll down and view the results.

Screenshot showing a DataBricks notebook executing PySpark commands: the code reads a CSV file, saves it in Delta format, creates a Delta table, and runs a SQL query on the ‘diamonds’ dataset.

Fig. 6: The environment provides real-time information about operations performed in the cell that has focus.{.caption}

Due to the architectural decision to build DataBricks as a whole on IPython notebooks, we must deliver the commands to be executed in the form of notebooks. Interestingly, while the notebook as a whole is kept in one programming language, individual command cells can use other languages. A foreign-language command cell is created by clicking the respective language bubble, as shown in **Figure 7**.

Screenshot of a DataBricks notebook displaying a PySpark command that reads a CSV file, processes it with the Delta format, and overwrites it into a Delta table. A dropdown menu shows options to change the notebook cell language, including Markdown, Python, SQL, Scala, and R.

Fig. 7: DataBricks allows language islands within a notebook.{.caption}

Using the menu option File | Export | HTML, the DataBricks notebook can also be exported as an HTML file after its commands are successfully processed. The majority of the mark-up is lost, but the resulting file presents the results in a way that’s easier for management to understand and digest.

Alternatively, you can click the blue Publish button to generate a globally valid link that lets any user view the fully-fledged notebook. By default, these links stay valid for six months. Please note that publishing a new version invalidates all existing links.

Commercial version owners can also run their notebooks regularly like a cron job with the scheduling option. The user interface in **Figure 8** is used for this. Other job scheduling system users will feel right at home. However, be aware that this function requires a job cluster, which isn’t included and cannot be created in the free Community Edition at the time of writing this.

DataBricks in scheduling mode

Fig. 8: DataBricks in scheduling mode.{.caption}

 

Last but not least, you can also stop the cluster using the menu at the top right. In the Community Edition, this is merely a courtesy to the company, but for commercial use it’s highly recommended since it reduces overall costs.

Different table types for optimizing performance

One of NoSQL databases’ basic characteristics is that in many cases, they soften the ACID criteria. The lower consistency quality is usually offset by a greatly reduced database administration effort. Sometimes, this results in impressive performance increases compared to a classic relational database. When working with DataBricks, we deal with a group of different table types that differ in terms of performance and data storage type.

The most important distinction is between managed tables and external tables. A managed table lives entirely in the DataBricks cluster: DataBricks handles both the storage of the actual data and the provision of metadata and access features.

There’s also the unmanaged or external table. This table represents a kind of “wrapper” around an external data source. Using this design pattern is recommended if you frequently use sample databases or information already available elsewhere in the system in an accessible form.

Since our sample from DataBricks is based on a diamond information set, using external tables is recommended. Redundant duplication of resources will only waste memory space in our cluster, without bringing any significant benefits here.

However, a careful look at the instructions created in the example notebook shows two different procedures. The first table is created with the following snippet:

 

```

DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
```

Besides the call to DROP TABLE, which ensures a clean starting state on the cluster, creating the new table uses more or less standard SQL commands. We use _USING csv_ to tell the runtime that we want to use the CSV engine.

If you scroll further down in the example, you’ll see that the table is created again, but in a two-stage process. In the first step, there’s now a Python island in the notebook that interacts with the diamond sample information in the URL /databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv according to the following:

```
%python
diamonds = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")
diamonds.write.format("delta").mode("overwrite").save("/delta/diamonds")
```

The DataBricks development team provides aspiring data science experimenters with a dozen or so widely used sample datasets. These can be accessed directly from the DataBricks runtime using friendly URLs. Additional information about available data sources [can be found here](https://docs.databricks.com/dbfs/databricks-datasets.html).
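
If you want to see what else is available, the sample datasets can also be listed directly from a notebook cell. The following sketch assumes it runs inside a DataBricks notebook, where the dbutils helper object is predefined:

```
# List the sample datasets bundled with the DataBricks runtime.
# dbutils is only available inside DataBricks notebooks, not in a local Python interpreter.
for entry in dbutils.fs.ls("/databricks-datasets"):
    print(entry.path)
```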

In the second step, there’s a snippet of SQL code that uses _USING DELTA_ instead of the previously used _USING csv_. This instructs the DataBricks backend to expose the data written above through the Delta database engine.

```
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING DELTA LOCATION '/delta/diamonds/'
```

Delta is an open source database engine based on Apache Parquet. Normally, it’s preferable to use the Delta table, because it delivers better results in terms of both the ACID criteria and performance, especially when large amounts of data need to be processed.

DataBricks is more – Focus on machine learning

Until now, we operated the DataBricks runtime in engineering mode. It’s optimized for the needs of ordinary data scientists who want to perform various types of analyses. But the user interface has a special mode specifically for machine learning (**Fig. 9** shows the mode switcher) that focuses on relevant functions.

This option lets you change the personality of the DataBricks interface.

Fig. 9: This option lets you change the personality of the DataBricks interface.{.caption}

In principle, the workflow in **Figure 10** is always used. Anyone implementing this workflow in an in-house application will sooner or later work with the AutoML environment. In theory, this is available from Runtime version 9.1 onwards, but it’s only really feature-complete when at least version 10.4 LTS ML is running on the cluster. Since this is one of the USPs of the DataBricks platform, we can assume that the product is under constant further development.

It’s advised that you check if the cluster in question is running the product’s latest version. For data engineering, DataBricks also offers a dedicated tutorial in the Guide: Training section from the home screen. This makes it easier to get started. Click the Start guide option again to load the notebook for this tutorial as “to be edited”.

ML functions in DataBricks workflow.

Fig. 10: If you want to use the ML functions in DataBricks, you should familiarize yourself with this workflow.{.caption}

Due to the higher demands on the required DataBricks Runtime mentioned above, you should switch to the Compute section and delete the previously created cluster. Then click the Create Compute option again and make sure to select the ML heading in the DataBricks Runtime Version field (see **Fig. 11**) in the first step.

ML-capable variants of the DataBricks runtime appear in a separate section in the backend.

Fig. 11: ML-capable variants of the DataBricks runtime appear in a separate section in the backend.{.caption}

Just for fun, we’ll use the latest version 12.0 ML and name the cluster “SUSTestML”. It takes some time after clicking the Create Cluster button, since the cloud resources aren’t immediately provided.

During cluster generation, we can return to the notebook to get an overview of the elements. In the first step, we see the inclusion of the following libraries, abbreviated here. They are familiar to every Python developer:

```

import mlflow
import numpy as np
import pandas as pd
import sklearn.datasets
. . .
from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
. . .
```

In many respects, DataBricks builds on what ML developers are familiar with from working with standard Python scripts. Some libraries naturally have optimizations to make them run more efficiently on the DataBricks hardware. In general, however, a locally functioning Python script will continue to work without any problems after being moved to the DataBricks cluster. For the actual monitoring of the learning process, DataBricks relies on MLflow, which is available here [6].

For this reason, the rest of the notebook is standard ML code, although it’s elegantly integrated into the user interface. For example, there is a flyout in which the application provides information about various parameters that were created during the parameterization of the model:

```
with mlflow.start_run(run_name='gradient_boost') as run:
  model = sklearn.ensemble.GradientBoostingClassifier(random_state=0)
  model.fit(X_train, y_train)
  . . .
```

It’s also interesting to note that the results of the individual optimization runs are not only displayed in the user interface. The Python code that lives in the notebook can also access them programmatically. In this way, it can perform a kind of reflection to find the most suitable parameters and/or model architectures.

In the case of the example notebook provided by DataBricks, this is illustrated in the following snippet, which sorts and queries the results available via mlflow.search_runs:

```
best_run = mlflow.search_runs(
  order_by=['metrics.test_auc DESC', 'start_time DESC'],
  max_results=10,
).iloc[0]
print('Best Run')
print('AUC: {}'.format(best_run["metrics.test_auc"]))
print('Num Estimators: {}'.format(best_run["params.n_estimators"]))
```

AutoML, for the second time

The duality of control via the user interface and programmatic control also continues in the case of the AutoML library mentioned above. The user interface shown in Figure 12, which allows graphical parameterization of ML runs, is probably the most common marketing argument.

AutoML allows the graphical configuration of modeling

Fig. 12: AutoML allows the graphical configuration of modeling{.caption}

On the other hand, there is also a programmatic API, which DataBricks illustrates in the form of a group of example notebooks. Here we want to use the example notebook provided at [7], which we load into a browser window in the first step. Then click on the Import Notebook button at the top right and copy the URL to the clipboard.

Next, open the menu of your DataBricks instance and select Workspace | Users. Next to your email address there is a downward-pointing arrow that opens a context menu. Select the Import option there and enter the URL to load the sample notebook into your DataBricks instance.

The actual body of the model couldn’t be any easier. In the first step, we mainly load test data, but we also create a schema element that informs the engine about the type or data type of the model information to be processed:

```
from pyspark.sql.types import DoubleType, StringType, StructType, StructField

schema = StructType([
  StructField("age", DoubleType(), False),
  . . .
  StructField("income", StringType(), False)
])
input_df = spark.read.format("csv").schema(schema).load("/databricks-datasets/adult/adult.data")
```
The actual classification run then takes place with a single line (train_df is derived from input_df in a splitting step omitted here):
```

from databricks import automl
summary = automl.classify(train_df, target_col="income", timeout_minutes=30)
```

 

If you want to carry out inference later, you can do this with both Pandas and Spark.


The multitool for ML professionals

Although there are hundreds of pages yet to be written about DataBricks, we’ll end our experiments with this brief overview. DataBricks is a tool that is completely focused on data scientists and machine learning experts and is not really suitable for beginners due to the very steep learning curve. Much like the infamous Squirrel Busters, DataBricks is a product that will find you when you need it.


OpenAI Embeddings
https://mlconference.ai/blog/openai-embeddings-technology-2024/ (Mon, 19 Feb 2024)

Embedding vectors (or embeddings) play a central role in the processing and interpretation of unstructured data such as text, images, or audio files. Embeddings convert unstructured data, no matter how complex, into a structured form that can be easily processed by software. OpenAI offers such embeddings, and this article will go over how they work and how they can be used.


Data has always played a central role in the development of software solutions. One of the biggest challenges in this area is the processing and interpretation of unstructured data such as text, images, or audio files. This is where embedding vectors (called embeddings for short) come into play – a technology that is becoming increasingly important in the development of software solutions with the integration of AI functions.


Embeddings are essentially a technique for converting unstructured data into a structure that can be easily processed by software. They are used to transform complex data such as words, sentences, or even entire documents into a vector space, with similar elements close to each other. These vector representations allow machines to recognize and exploit nuances and relationships in the data, which is essential for a variety of applications such as natural language processing (NLP), image recognition, and recommendation systems.

OpenAI, the company behind ChatGPT, offers models for creating embeddings for texts, among other things. At the end of January 2024, OpenAI presented new versions of these embeddings models, which are more powerful and cost-effective than their predecessors. In this article, after a brief introduction to embeddings, we’ll take a closer look at the OpenAI embeddings and the recently introduced innovations, discuss how they work, and examine how they can be used in various software development projects.

Embeddings briefly explained

Imagine you’re in a room full of people and your task is to group these people based on their personality. To do this, you could start asking questions about different personality traits. For example, you could ask how open someone is to new experiences and rate the answer on a scale from 0 to 1. Each person is then assigned a number that represents their openness.

Next, you could ask about another personality trait, such as the level of sense of duty, and again give a score between 0 and 1. Now each person has two numbers that together form a vector in a two-dimensional space. By asking more questions about different personality traits and rating them in a similar way, you can create a multidimensional vector for each person. In this vector space, people who have similar vectors can then be considered similar in terms of their personality.

In the world of artificial intelligence, we use embeddings to transform unstructured data into an n-dimensional vector space. Similar to how a person’s personality traits are represented in the vector space above, each point in this vector space represents an element of the original data (such as a word or phrase) in a way that is understandable and processable by computers.

OpenAI Embeddings

OpenAI embeddings extend this basic concept. Instead of using simple features like personality traits, OpenAI models use advanced algorithms and big data to achieve a much deeper and more nuanced representation of the data. The model not only analyzes individual words, but also looks at the context in which those words are used, resulting in more accurate and meaningful vector representations.

Another important difference is that OpenAI embeddings are based on sophisticated machine learning models that can learn from a huge amount of data. This means that they can recognize subtle patterns and relationships in the data that go far beyond what could be achieved by simple scaling and dimensioning, as in the initial analogy. This leads to a significantly improved ability to recognize and exploit similarities and differences in the data.

 


Individual values are not meaningful

While in the personality trait analogy, each individual value of a vector can be directly related to a specific characteristic – for example openness to new experiences or a sense of duty – this direct relationship no longer exists with OpenAI embeddings. In these embeddings, you cannot simply look at a single value of the vector in isolation and draw conclusions about specific properties of the input data. For example, a specific value in the embedding vector of a sentence cannot be used to directly deduce how friendly or not this sentence is.

The reason for this lies in the way machine learning models, especially those used to create embeddings, encode information. These models work with complex, multi-dimensional representations where the meaning of a single element (such as a word in a sentence) is determined by the interaction of many dimensions in vector space. Each aspect of the original data – be it the tone of a text, the mood of an image, or the intent behind a spoken utterance – is captured by the entire spectrum of the vector rather than by individual values within that vector.

Therefore, when working with OpenAI embeddings, it’s important to understand that the interpretation of these vectors is not intuitive or direct. You need algorithms and analysis to draw meaningful conclusions from these high-dimensional and densely coded vectors.

Comparison of vectors with cosine similarity

A central element in dealing with embeddings is measuring the similarity between different vectors. One of the most common methods for this is cosine similarity. This measure is used to determine how similar two vectors are and therefore the data they represent.

To illustrate the concept, let’s start with a simple example in two dimensions. Imagine two vectors in a plane, each represented by a point in the coordinate system. The cosine similarity between these two vectors is determined by the cosine of the angle between them. If the vectors point in the same direction, the angle between them is 0 degrees and the cosine of this angle is 1, indicating maximum similarity. If the vectors are orthogonal (i.e. the angle is 90 degrees), the cosine is 0, indicating no similarity. If they are opposite (180 degrees), the cosine is -1, indicating maximum dissimilarity.

Figure 1 – Cosine similarity


 

A Python Notebook to try out
Accompanying this article is a Google Colab Python Notebook which you can use to try out many of the examples shown here. Colab, short for Colaboratory, is a free cloud service offered by Google. Colab makes it possible to write and execute Python code in the browser. It’s based on Jupyter Notebooks, a popular open-source web application that makes it possible to combine code, equations, visualizations, and text in a single document-like format. The Colab service is well suited for exploring and experimenting with the OpenAI API using Python.

In practice, especially when working with embeddings, we are dealing with n-dimensional vectors. The calculation of the cosine similarity remains conceptually the same, even if the calculation is more complex in higher dimensions. Formally, the cosine similarity of two vectors A and B in an n-dimensional space is calculated by the scalar product (dot product) of these vectors divided by the product of their lengths:

Figure 2 – Calculation of cosine similarity

The normalization of vectors plays an important role in the calculation of cosine similarity. If a vector is normalized, this means that its length (norm) is set to 1. For normalized vectors, the scalar product of two vectors is directly equal to the cosine similarity since the denominators in the formula from Figure 2 are both 1. OpenAI embeddings are normalized, which means that to calculate the similarity between two embeddings, only their scalar product needs to be calculated. This not only simplifies the calculation, but also increases efficiency when processing large quantities of embeddings.
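
To make this concrete, here is a minimal sketch in Python (using numpy and two made-up example vectors) that computes the cosine similarity once via the general formula and once as a plain dot product of normalized vectors:

```
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two made-up example vectors (real embeddings have thousands of dimensions)
a = np.array([0.2, 0.9, 0.1])
b = np.array([0.25, 0.8, 0.2])

# For normalized vectors the denominator is 1, so the dot product alone is enough
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)

print(cosine_similarity(a, b))  # full formula
print(np.dot(a_norm, b_norm))   # identical result for the normalized vectors
```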

OpenAI Embeddings API

OpenAI offers a web API for creating embeddings. The exact structure of this API, including code examples for curl, Python and Node.js, can be found in the OpenAI reference documentation.

OpenAI does not use the LLM from ChatGPT to create embeddings, but rather specialized models. They were developed specifically for the creation of embeddings and are optimized for this task. Their development was geared towards generating high-dimensional vectors that represent the input data as well as possible. In contrast, ChatGPT is primarily optimized for generating and processing text in a conversational form. The embedding models are also more efficient in terms of memory and computing requirements than more extensive language models such as ChatGPT. As a result, they are not only faster but much more cost-effective.

New embedding models from OpenAI

Until recently, OpenAI recommended the use of the text-embedding-ada-002 model for creating embeddings. This model converts text into a sequence of floating point numbers (vectors) that represent the concepts within the content. The ada v2 model generated embeddings with a size of 1536 dimensions and delivered solid performance in benchmarks such as MIRACL and MTEB, which are used to evaluate model performance in different languages and tasks.

At the end of January 2024, OpenAI presented new, improved models for embeddings:

- text-embedding-3-small: A smaller, more efficient model with improved performance compared to its predecessor. It performs better in benchmarks and is significantly cheaper.
- text-embedding-3-large: A larger model that is more powerful and creates embeddings with up to 3072 dimensions. It shows the best performance in the benchmarks but is slightly more expensive than ada v2.

A new function of the two new models allows developers to adjust the size of the embeddings when generating them without significantly losing their concept-representing properties. This enables flexible adaptation, especially for applications that are limited in terms of available memory and computing power.
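
As a small illustration (with a placeholder API key), the dimensions parameter can be passed when creating an embedding with one of the new models; older models such as ada v2 do not support it:

```
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # placeholder API key

# Request a shortened 256-dimensional embedding instead of the full 3072 dimensions
vec = client.embeddings.create(
    input="King",
    model="text-embedding-3-large",
    dimensions=256,
).data[0].embedding

print(len(vec))  # 256
```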

Readers who are interested in the details of the new models can find them in the announcement on the OpenAI blog. The exact costs of the various embedding models can be found here.

New embeddings models
At the end of January 2024, OpenAI introduced new models for creating embeddings. All code examples and result values contained in this article already refer to the new text-embedding-3-large model.

Create embeddings with Python

In the following section, the use of embeddings is demonstrated using a few code examples in Python. The code examples are designed so that they can be tried out in Python notebooks; they are also available in a similar form in the accompanying Google Colab notebook mentioned above.
Listing 1 shows how to create embeddings with the Python SDK from OpenAI. In addition, numpy is used to show that the embeddings generated by OpenAI are normalized.

Listing 1

from openai import OpenAI
from google.colab import userdata
import numpy as np

# Create OpenAI client
client = OpenAI(
    api_key=userdata.get('openaiKey'),
)

# Define a helper function to calculate embeddings
def get_embedding_vec(input):
  """Returns the embeddings vector for a given input"""
  return client.embeddings.create(
        input=input,
        model="text-embedding-3-large", # We use the new embeddings model here (announced end of Jan 2024)
        # dimensions=... # You could limit the number of output dimensions with the new embeddings models
    ).data[0].embedding

# Calculate the embedding vector for a sample sentence
vec = get_embedding_vec("King")
print(vec[:10])

# Calculate the magnitude of the vector. It should be 1 as
# embedding vectors from OpenAI are always normalized.
magnitude = np.linalg.norm(vec)
magnitude

Similarity analysis with embeddings

In practice, OpenAI embeddings are often used for similarity analysis of texts (e.g. searching for duplicates, finding relevant text sections in relation to a customer query, and grouping text). Embeddings are very well suited for this, as they work in a fundamentally different way to comparison methods based on characters, such as Levenshtein distance. While it measures the similarity between texts by counting the minimum number of single-character operations (insert, delete, replace) required to transform one text into another, embeddings capture the meaning and context of words or sentences. They consider the semantic and contextual relationships between words, going far beyond a simple character-based level of comparison.

As a first example, let’s look at the following three sentences (the following examples are in English, but embeddings work analogously for other languages and cross-language comparisons are also possible without any problems):

1. I enjoy playing soccer on weekends.
2. Football is my favorite sport. Playing it on weekends with friends helps me to relax.
3. In Austria, people often watch soccer on TV on weekends.

In the first and second sentence, two different words are used for the same topic: Soccer and football. The third sentence contains the original soccer, but it has a fundamentally different meaning from the first two sentences. If you calculate the similarity of sentence 1 to 2, you get 0.75. The similarity of sentence 1 to 3 is only 0.51. The embeddings have therefore reflected the meaning of the sentence and not the choice of words.
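
The following sketch shows how such a comparison can be reproduced. It reuses the get_embedding_vec helper from Listing 1 above (so an OpenAI client and API key are assumed to be configured) and exploits the fact that the returned vectors are already normalized:

```
import numpy as np

# get_embedding_vec is the helper defined in Listing 1 (OpenAI client already configured)
s1 = "I enjoy playing soccer on weekends."
s2 = "Football is my favorite sport. Playing it on weekends with friends helps me to relax."
s3 = "In Austria, people often watch soccer on TV on weekends."

v1, v2, v3 = (np.array(get_embedding_vec(s)) for s in (s1, s2, s3))

# OpenAI embeddings are normalized, so the dot product equals the cosine similarity
print(np.dot(v1, v2))  # high value: same topic despite different wording
print(np.dot(v1, v3))  # lower value: same word "soccer", but different meaning
```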

Here is another example that requires an understanding of the context in which words are used:
1. He is interested in Java programming.
2. He visited Java last summer.
3. He recently started learning Python programming.

In sentence 2, Java refers to a place, while sentences 1 and 3 have something to do with software development. The similarity of sentence 1 to 2 is 0.536, but that of 1 to 3 is 0.587. As expected, the different meaning of the word Java has an effect on the similarity.

The next example deals with the treatment of negations:
1. I like going to the gym.
2. I don’t like going to the gym.
3. I don’t dislike going to the gym.

Sentences 1 and 2 say the opposite, while sentence 3 expresses something similar to sentence 1. This is reflected in the similarities of the embeddings: sentence 1 compared to sentence 2 yields a cosine similarity of 0.714, while sentence 1 compared to sentence 3 yields 0.773. It is perhaps surprising that there is no major difference between the embeddings. However, it’s important to remember that all three sentences are about the same topic: the question of whether you like going to the gym to work out.

The last example shows that the OpenAI embeddings models, just like ChatGPT, have built in a certain “knowledge” of concepts and contexts through training with texts about the real world.

1. I need to get better slicing skills to make the most of my Voron.
2. 3D printing is a worthwhile hobby.
3. Can I have a slice of bread?

In order to compare these sentences in a meaningful way, it’s important to know that Voron is the name of a well-known open-source project in the field of 3D printing. It’s also important to note that slicing is a term that plays an important role in 3D printing. The third sentence also mentions slicing, but in a completely different context to sentence 1. Sentence 2 mentions neither slicing nor Voron. However, the trained knowledge enables the OpenAI Embeddings model to recognize that sentences 1 and 2 have a thematic connection, but sentence 3 means something completely different. The similarity of sentence 1 and 2 is 0.333 while the comparison of sentence 1 and 3 is only 0.263.

Similarity values are not percentages

The similarity values from the comparisons shown above are the cosine similarity of the respective embeddings. Although the cosine similarity values range from -1 to 1, with 1 being the maximum similarity and -1 the maximum dissimilarity, they are not to be interpreted directly as percentages of agreement. Instead, these values should be considered in the context of their relative comparisons. In applications such as searching text sections in a knowledge base, the cosine similarity values are used to sort the text sections in terms of their similarity to a given query. It is important to see the values in relation to each other. A higher value indicates a greater similarity, but the exact meaning of the value can only be determined by comparing it with other similarity values. This relative approach makes it possible to effectively identify and prioritize the most relevant and similar text sections.

Embeddings and RAG solutions

Embeddings play a crucial role in Retrieval Augmented Generation (RAG) solutions, an approach in artificial intelligence that combines the capabilities of information retrieval and text generation. Embeddings are used in RAG systems to retrieve relevant information from large data sets or knowledge databases. It is not necessary for these databases to have been included in the original training of the embedding models. They can be internal databases that are not available on the public Internet.
With RAG solutions, queries or input texts are converted into embeddings. The cosine similarity to the existing document embeddings in the database is then calculated to identify the most relevant text sections from the database. This retrieved information is then used by a text generation model such as ChatGPT to generate contextually relevant responses or content.

Vector databases play a central role in the functioning of RAG systems. They are designed to efficiently store, index and query high-dimensional vectors. In the context of RAG solutions and similar systems, vector databases serve as storage for the embeddings of documents or pieces of data that originate from a large amount of information. When a user makes a request, this request is first transformed into an embedding vector. The vector database is then used to quickly find the vectors that correspond most closely to this query vector – i.e. those documents or pieces of information that have the highest similarity. This process of quickly finding similar vectors in large data sets is known as Nearest Neighbor Search.
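
To illustrate the retrieval step, here is a minimal in-memory sketch with made-up documents and stand-in embedding vectors (in a real system the vectors would come from the embeddings API, and a vector database would perform the search with specialized indexes that scale to millions of vectors):

```
import numpy as np

rng = np.random.default_rng(42)

def fake_embedding(dim=8):
    """Stand-in for a real embedding vector, normalized like OpenAI embeddings."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Made-up knowledge base: document text mapped to its (stand-in) embedding
knowledge_base = {
    "How to top up your account": fake_embedding(),
    "Understanding your monthly bill": fake_embedding(),
    "Changing your profile picture": fake_embedding(),
}

def nearest_neighbors(query_vec, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    scored = [(float(np.dot(query_vec, vec)), text) for text, vec in knowledge_base.items()]
    return sorted(scored, reverse=True)[:k]

query_vec = fake_embedding()  # in a real system: the embedding of the user's question
for score, text in nearest_neighbors(query_vec):
    print(f"{score:.3f}  {text}")
```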

Challenge: Splitting documents

A detailed explanation of how RAG solutions work is beyond the scope of this article. However, the explanations regarding embeddings are hopefully helpful for getting started with further research on the topic of RAGs.

However, one specific point should be pointed out at the end of this article: a particular and often underestimated challenge in the development of RAG systems that go beyond Hello World prototypes is the splitting of longer texts. Splitting is necessary because the OpenAI embeddings models are limited to just over 8,000 tokens. One token corresponds to approximately 4 characters in the English language (see the tokenizer link at the end of this article).
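
If you want to check how many tokens a text consumes before sending it to the API, OpenAI’s open source tokenizer library tiktoken can be used (an assumption here: the package has to be installed separately, e.g. via pip):

```
import tiktoken

# cl100k_base is the encoding used by the recent OpenAI embedding and chat models
encoding = tiktoken.get_encoding("cl100k_base")

text = "VX-2000 requires regular lubrication to maintain its smooth operation."
tokens = encoding.encode(text)
print(len(tokens))  # number of tokens this sentence consumes
```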

It’s not easy finding a good strategy for splitting documents. Naive approaches such as splitting after a certain number of characters can lead to the context of text sections being lost or distorted. Anaphoric links are a typical example of this. The following two sentences are an example:

VX-2000 requires regular lubrication to maintain its smooth operation.
The machine requires the DX97 oil, as specified in the maintenance section of this manual.

The machine in the second sentence is an anaphoric link to the first sentence. If the text were to be split up after the first sentence, the essential context would be lost, namely that the DX97 oil is necessary for the VX-2000 machine.

There are various approaches to solving this problem, which will not be discussed here to keep this article concise. However, it is essential for developers of such software systems to be aware of the problem and understand how splitting large texts affects embeddings.


Summary

Embeddings play a fundamental role in the modern AI landscape, especially in the field of natural language processing. By transforming complex, unstructured data into high-dimensional vector spaces, embeddings enable in-depth understanding and efficient processing of information. They form the basis for advanced technologies such as RAG systems and facilitate tasks such as information retrieval, context analysis, and data-driven decision-making.

OpenAI’s latest innovations in the field of embeddings, introduced at the end of January 2024, mark a significant advance in this technology. With the introduction of the new text-embedding-3-small and text-embedding-3-large models, OpenAI now offers more powerful and cost-efficient options for developers. These models not only show improved performance in standardized benchmarks, but also offer the ability to find the right balance between performance and memory requirements on a project-specific basis through customizable embedding sizes.

Embeddings are a key component in the development of intelligent systems that aim to achieve useful processing of speech information.

Links and Literature:

  1. https://colab.research.google.com/gist/rstropek/f3d4521ed9831ae5305a10df84a42ecc/embeddings.ipynb
  2. https://platform.openai.com/docs/api-reference/embeddings/create
  3. https://openai.com/blog/new-embedding-models-and-api-updates
  4. https://openai.com/pricing
  5. https://platform.openai.com/tokenizer


Building a Proof of Concept Chatbot with OpenAI’s API, PHP and Pinecone
https://mlconference.ai/blog/building-chatbot-openai-api-php-pinecone/ (Thu, 04 Jan 2024)

We leveraged OpenAI’s API and PHP to develop a proof-of-concept chatbot that seamlessly integrates with Pinecone, a vector database, to enhance our homepage’s search functionality and empower our customers to find answers more effectively. In this article, we’ll explain our steps so far to accomplish this.



The team at Three.ie recognized that customers were having difficulty finding answers to basic questions on our website. To improve the user experience, we decided to use AI to build a chatbot that makes finding information more efficient and user-friendly. Building the chatbot posed several challenges, such as effectively managing the expanding context of each chat session and maintaining high-quality data. This article details our journey from concept to implementation and how we overcame these challenges. Anyone interested in AI, data management, and customer experience improvements should find valuable insights in this article.

While the chatbot project is still in progress, this article outlines the steps taken and key takeaways from the journey thus far. Stay tuned for subsequent installments and the project’s resolution.


Identifying the Problem

Hi there, I’m a Senior PHP Developer at Three.ie, a company in the telecom industry. Today, I’d like to address a problem our customers face: locating answers to basic questions on our website. Information like bill details, how to top up, and other relevant topics is available, but it isn’t easy to find because it’s tucked away within our forums.

![community-page.png](community-page.png) {.figure}

Community Page {.caption}

The AI Solution

The rise of AI chatbots and the impressive capabilities of GPT-3 presented us with an opportunity to tackle this issue head-on. The idea was simple: why not leverage AI to create a more user-friendly way for customers to find the information they need? Our tool of choice for this task was OpenAI’s API, which we planned to integrate into a chat interface.

To make this chatbot truly useful, it needed access to the right data, and that’s where Pinecone came in. Using this vector database to store embeddings generated with the OpenAI API, we built an efficient search system for our chatbot.

This laid the groundwork for our proof of concept: a simple yet effective solution to a problem faced by many businesses. Let’s dive deeper into how we brought this concept to life.

![chat-poc.png](chat-poc.png) {.figure}

First POC {.caption}

Challenges and AI’s Role

With our proof of concept in place, the next step was to ensure the chatbot was interacting with the right data and providing the most accurate search results possible. While Pinecone served as an excellent solution for storing data and enabling efficient search during the early stages, we realized that in the long term it might not be the most cost-effective choice for a full-fledged product.

Pinecone is easy to integrate and straightforward to use, but the free tier only allows a single pod with a single project. We would need to create small indexes separated across multiple products, and the paid plan starts at around $70/month/pod. Keeping the project within budget was a priority, and we knew that continuing with Pinecone would soon become difficult since we wanted to split our data.

The initial data used in the chatbot was extracted directly from our website and stored in separate files. This setup allowed us to create embeddings and feed them to our chatbot. To streamline this process, we developed a ‘data import’ script. The script takes a file, adds it to the database, creates an embedding from the content, and finally stores the embedding in Pinecone, using the database ID as a reference.
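
To give an idea of the shape of such an import step, here is a rough sketch. The team’s actual script is written in PHP and its details aren’t shown here, so this is only a Python approximation using the OpenAI and Pinecone client libraries; the index name, the embedding model, and the save_to_database helper are hypothetical:

```
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI(api_key="sk-...")   # placeholder credentials
pc = Pinecone(api_key="...")
index = pc.Index("support-articles")       # hypothetical index name

def save_to_database(content: str) -> int:
    """Hypothetical stand-in: insert the content into the relational DB and return its ID."""
    raise NotImplementedError

def import_file(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        content = f.read()

    # 1. Store the raw content in the database and keep its ID
    record_id = save_to_database(content)

    # 2. Create an embedding from the content
    embedding = openai_client.embeddings.create(
        input=content,
        model="text-embedding-ada-002",    # illustrative model choice
    ).data[0].embedding

    # 3. Store the embedding in Pinecone, using the database ID as the reference
    index.upsert(vectors=[(str(record_id), embedding)])
```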

Unfortunately, we faced a hurdle with the structure and quality of our data. Some of the extracted data was not well-structured, which led to issues with the chatbot’s responses. To address this challenge, we once again turned to AI, this time to enhance our data quality. Employing the GPT-3.5 model, we optimized the content of each file before generating the vector. By doing so, we were able to harness the power of AI not only for answering customer queries but also for improving the quality of our data.

As the process grew more complex, the need for more efficient automation became evident. To reduce the time taken by the data import script, we incorporated queues and utilized parallel processing. This allowed us to manage the increasingly complex data import process more effectively and keep the system efficient.

![data-ingress-flow.png](data-ingress-flow.png) {.figure}

Data Ingress Flow {.caption}

Data Integration

With our data stored and the API ready to handle chats, the next step was to bring everything together. The initial plan was to use Pinecone to retrieve the top three results matching the customer’s query. For instance, if a user inquired, “How can I top up by text message?”, we would generate an embedding for this question and then use Pinecone to fetch the three most relevant records. These matches were determined based on cosine similarity, ensuring the retrieved information was highly pertinent to the user’s query.

Cosine similarity is a key part of our search algorithm. Think of it like this: imagine each question and answer is a point in space. Cosine similarity measures how close these points are to each other. For example, if a user asks, “How do I top up my account?”, and we have a database entry that says, “Top up your account by going to Settings”, these two are closely related and would have a high cosine similarity score, close to 1. On the other hand, if the database entry says something about “changing profile picture”, the score would be low, closer to 0, indicating they’re not related.

This way, we can quickly find the best matches to a customer’s query, making the chatbot’s answers more relevant and useful.

For those who understand a bit of math, this is how cosine similarity works. You represent each sentence as a vector in multi-dimensional space. The cosine similarity is calculated as the dot product of two vectors divided by the product of their magnitudes. Mathematically, it looks like this:

![cosine-formula.png](cosine-formula.png) {.figure}

Cosine Similarity  {.caption}

This formula gives us a value between -1 and 1. A value close to 1 means the sentences are very similar, and a value close to -1 means they are dissimilar. Zero means they are not related.

![simplified-workflow.png](simplified-workflow.png) {.figure}

Simplified Workflow {.caption}

Next, we used these top three records as a context in the OpenAI chat API. We merged everything together: the chat history, Three’s base prompt instructions, the current question, and the top three contexts.

![vector-comparison-logic.png](vector-comparison-logic.png) {.figure}

Vector Comparison Logic {.caption}
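
Conceptually, the merged request looks roughly like the following sketch. Again, this is a Python approximation rather than the team’s PHP code, and the base prompt wording, the history, and the retrieved contexts are made up:

```
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # placeholder credentials

base_prompt = "You are a support assistant for Three. Answer only from the provided context."
top_contexts = ["<context 1 from Pinecone>", "<context 2>", "<context 3>"]
history = [
    {"role": "user", "content": "How do I check my bill?"},
    {"role": "assistant", "content": "You can view your bill in your online account."},
]
question = "How can I top up by text message?"

# Merge everything: base prompt + contexts as the system message, then history, then the question
messages = (
    [{"role": "system", "content": base_prompt + "\n\n" + "\n---\n".join(top_contexts)}]
    + history
    + [{"role": "user", "content": question}]
)

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```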

Initially, this approach was fantastic and provided accurate and informative answers. However, there was a looming issue: we were using OpenAI’s 4k-token context model, and the entire context was sent with every request. Furthermore, the context was treated as “history” for the following message, meaning that each new message added the boilerplate text plus three more contexts. As you can imagine, this led to rapid growth of the context.

To manage this complexity, we decided to keep track of the context. We started storing each message from the user (along with the chatbot’s responses) and the selected contexts. As a result, each chat session now had two separate artifacts: messages and contexts. This ensured that if a user’s next message related to the same context, it wouldn’t be duplicated and we could keep track of what had been used before.

Progress so Far

To put it simply, our system starts with manual input of questions and answers (Q&A), which is then enhanced by our AI. To ensure efficient data handling, we use queues to store data quickly. In the chat, when a user asks a question, we add a “context group” that includes all the data we retrieved from Pinecone. To maintain system organization and efficiency, older messages are removed from longer chats.

![chat-workflow.png](chat-workflow.png) {.figure}

 

Chat Workflow {.caption}


Automating Data Collection

Acknowledging the manual input as a bottleneck, we set out to streamline the process through automation. I started by trying out scrapers written in different languages like PHP and Python. However, to be honest, none of them were good enough, and we faced issues with both speed and accuracy. While this component of the system is still in its formative stages, we’re committed to overcoming this challenge. We are currently evaluating the possibility of using an external service to handle this task, aiming to streamline and simplify the overall process.

While working towards data automation, I dedicated my efforts to improving our existing system. I developed a backend admin page, replacing the manual data input process with a streamlined interface. This admin panel provides additional control over the chatbot, enabling adjustments to parameters like the ‘temperature’ setting and initial prompt, further optimizing the customer experience.  So, although we have challenges ahead, we’re making improvements every step of the way.

 


A Week of Intense Progress

The week was a whirlwind of AI-fueled excitement, and we eagerly jumped in. After sending an email to my department, the feedback came flooding in. Our team was truly collaborative: a skilled designer supplied Figma templates and a copywriter crafted the app’s text. We even had volunteers who stress-tested our tool with unconventional prompts. It felt like everything was coming together quickly.

However, this initial enthusiasm came to a screeching halt due to security concerns becoming the new focus. A recent data breach at OpenAI, unrelated to our project, shifted our priorities. Though frustrating, it necessitated a comprehensive security check of all projects, causing a temporary halt to our progress.

The breach occurred during a specific nine-hour window on March 20, between 1 a.m. and 10 a.m. Pacific Time. OpenAI confirmed that around 1.2% of active ChatGPT Plus subscribers had their data compromised during this period. They were using the Redis client library (redis-py), which allowed them to maintain a pool of connections between their Python server and Redis. This meant they didn’t need to query the main database for every request, but it became a point of vulnerability.

In the end, it’s good to put security at the forefront and not treat it as an afterthought, especially in the wake of a data breach. While the delay is frustrating, we all agree that making sure our project is secure is worth the wait. Now, our primary focus is to meet all security guidelines before progressing further.

The Move to Microsoft Azure

In just one week, the board made a big decision to move from OpenAI and Pinecone to Microsoft’s Azure. At first glance, it looks like a smart choice, as Azure is known for solid security, but it is far less plug-and-play than what we were used to.

What stood out in Azure was having our own dedicated GPT-3.5 Turbo model. Unlike OpenAI, where the general GPT-3.5 model is shared, Azure gives you a model exclusive to your company. You can train it, fine-tune it, all in a very secure environment, a big plus for us.

The hard part? Setting up the data storage was not an easy feat. Everything in Azure is different from what we were used to. So, we are now investing time to understand these new services, a learning curve we’re currently climbing.

Azure Cognitive Search

In our move to Microsoft Azure, security was a key focus. We looked into using Azure Cognitive Search for our data management. Azure offers advanced security features like end-to-end encryption and multi-factor authentication. This aligns well with our company’s heightened focus on safeguarding customer data.

The idea was simple: you upload your data into Azure, create an index, and then you can search it just like a database. You define what’s called “fields” for indexing and then Azure Cognitive Search organizes it for quick searching. But the truth is, setting it up wasn’t easy because creating the indexes was more complex than we thought. So, we didn’t end up using it in our project. It’s a powerful tool, but difficult to implement. This was the idea:

![azure-structure.png](azure-structure.png) {.figure}

Azure Structure {.caption}
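
To make the idea a little more concrete, here is a hedged sketch of what defining such an index could look like against the Azure Cognitive Search REST API. The service name, API version, field names, and key handling are placeholders, not our production configuration:

import requests

SERVICE = "https://my-search-service.search.windows.net"  # placeholder service
HEADERS = {"Content-Type": "application/json", "api-key": "ADMIN_KEY"}  # placeholder key

# Define the searchable "fields" mentioned above; Azure builds the index from them.
index_definition = {
    "name": "support-articles",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "title", "type": "Edm.String", "searchable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
    ],
}

resp = requests.put(
    f"{SERVICE}/indexes/support-articles",
    params={"api-version": "2020-06-30"},  # assumed API version
    headers=HEADERS,
    json=index_definition,
)
resp.raise_for_status()
# Documents are then uploaded to, and queried from, the /docs endpoints of this index.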

The Long Road of Discovery

So, what did we really learn from this whole experience? First, improving the customer journey isn’t a walk in the park; it’s a full-on challenge. AI brings a lot of potential to the table, but it’s not a magic fix. We’re still deep in the process of getting this application ready for the public, and it’s still a work in progress.

One of the most crucial points for me has been the importance of clear objectives. Knowing exactly what you aim to achieve can steer the project in the right direction from the start. Don’t wait around — get a proof of concept (POC) out as fast as you can. Test the raw idea before diving into complexities.

Also, don’t try to solve issues that haven’t cropped up yet; this is something we learned the hard way. Transitioning to Azure seemed like a move towards a more robust infrastructure, but it ended up complicating things and setting us back significantly. The added layers of complexity postponed our timeline for future releases. Sometimes, ‘better’ solutions can end up being obstacles if they divert you from your main goal.

 


In summary, this project has been a rollercoaster of both challenges and valuable lessons learned. We’re optimistic about the future, but caution has become our new mantra. We’ve come to understand that a straightforward approach is often the most effective, and introducing unnecessary complexities can lead to unforeseen problems. With these lessons in hand, we are in the process of recalibrating our strategies and setting our sights on the next development phase.

Although we have encountered setbacks, particularly in the area of security, these experiences have better equipped us for the journey ahead. The ultimate goal remains unchanged: to provide an exceptional experience for our customers. We are fully committed to achieving this goal, one carefully considered step at a time.

Stay tuned for further updates as we continue to make progress. This project is far from complete, and we are excited to share the next chapters of our story with you.

The post Building a Proof of Concept Chatbot with OpenAIs API, PHP and Pinecone appeared first on ML Conference.

]]>
Take Control of ML Projects https://mlconference.ai/blog/take-control-of-ml-projects/ Mon, 11 Jul 2022 10:33:17 +0000 https://mlconference.ai/?p=84602 The decision to move Elasticsearch to proprietary licensing awakened a sleeping giant. The open source community rapidly flexed its muscle to ensure a true open source option for fast and scalable search and analytics—which many users depend on for ML projects—would continue to be available. The result is OpenSearch, a community-driven hard fork of Elasticsearch 7.10.2, built with Apache Lucene and available under the fully open source Apache 2.0 license.

The post Take Control of ML Projects appeared first on ML Conference.

]]>
OpenSearch brings users the same enterprise-grade core features and advanced add-ons as its predecessor. Key benefits include a horizontally scalable distributed architecture ready to handle thousands of nodes and petabytes of data, high availability, extremely fast and powerful text search, and analytics with faceting and aggregations. OpenSearch also features a rich ecosystem with language-specific clients for Python, Node, Java, and more, and it supports data shippers such as Logstash, Beats, and Fluentd.

Migrating from Elasticsearch to OpenSearch enables you to continue using the same powerful capabilities your organization is already accustomed to, while safeguarding your technology against future lock-in and the limitations that come with a license that is no longer truly open source. At the same time, making the leap to OpenSearch ensures that organizations are positioned to take advantage of all new features introduced by the open source community as the technology evolves.

To successfully migrate from Elasticsearch to OpenSearch, on cloud or on-prem systems or through a managed platform, follow these eight steps:

Make sure you’re running Elasticsearch 7.10.2, and upgrade if necessary

Enterprises should be running Elasticsearch 7.10.2 for maximum compatibility before migrating to OpenSearch. Upgrade to client libraries compatible with Elasticsearch 7.10.2, and be sure to use OpenSearch versions of libraries when available (all of which also work with Elasticsearch clusters). If the existing cluster is on a newer version than 7.12.0, then downgrade to 7.10.2 via reindex. Also be alert for potential breaking changes or the need to re-index (between v5.6 and v6.8) that can occur when upgrading between Elasticsearch versions.
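
A quick way to verify the version is to read it from the cluster’s root endpoint; here is a minimal sketch in Python (endpoint and credentials are placeholders):

import requests

ES_URL = "https://es.example.com:9200"  # placeholder cluster endpoint

info = requests.get(ES_URL, auth=("elastic", "password")).json()
version = info["version"]["number"]  # e.g. "7.10.2"

if version != "7.10.2":
    print(f"Cluster is on {version}; upgrade or downgrade to 7.10.2 before migrating.")
else:
    print("Cluster is on 7.10.2 and ready for migration.")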


Build a migration testing environment

Create an Elasticsearch test cluster that emulates your production environment as closely as possible. Run the same Elasticsearch version, the same client libraries, and any other data shippers such as Logstash or Fluentd. Benchmark the test environment’s search and indexing performance with realistic data. Next, create a test OpenSearch cluster with an equivalent number and type of nodes for a fair and simple comparison.

Check your tool and client libraries for OpenSearch compatibility

It’s crucial to verify the interoperability of all tools and libraries prior to upgrading. For example, recent builds of tools like Logstash and others include version checks that make them incompatible with OpenSearch. While the community rapidly develops open source versions of popular tools and clients for use with OpenSearch – and many are already available and production-ready – it still pays to implement a deliberate compatibility strategy. 

The tenets of such a strategy: first, use clients and tools provided by OpenSearch whenever possible. Where OpenSearch-specific options aren’t available, use tool or client versions compatible with Elasticsearch OSS 7.10.2. As a last alternative, use the OpenSearch compatibility setting to override version issues, using either the opensearch.yaml or cluster-wide settings like this:

In opensearch.yml (restarting the OpenSearch cluster is necessary for the change to take effect):

compatibility.override_main_response_version: true

In the cluster settings:

PUT _cluster/settings
{
  "persistent": {
    "compatibility": {
      "override_main_response_version": true
    }
  }
}

Compatibility verification example: Filebeat

Here we’ll check that the Filebeat module we have running on an Apache HTTP server is compatible with our OpenSearch cluster. First, we’ll point the Filebeat configuration to the OpenSearch endpoint:

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["search.cxxxxxxxxxxx.cnodes.io:9200"]
  # Protocol - either `http` (default) or `https`.
  protocol: "https"
  # Authentication credentials - either API key or username/password.
  #api_key: "id:api_key"
  username: "icopensearch"
  password: "***************************"

 

Make sure that the OpenSearch cluster can receive logs. Unfortunately (in this example), the non-OSS version of Filebeat cannot connect to OpenSearch and fails to deliver logs to the cluster, because its built-in version and license check is rejected:

2021-11-30T23:28:32.514Z        ERROR
[publisher_pipeline_output]     pipeline/output.go:154  Failed to connect to
backoff(elasticsearch(https://search.xxxxxxxxxxxxxxxxxxxxxx.cnodes.io:9200)):
Connection marked as failed because the onConnect callback failed: could not
connect to a compatible version of Elasticsearch: 400 Bad Request:
{"error":{"root_cause":[{"type":"invalid_index_name_exception","reason":"Invalid
index name [_license], must not start with '_'.","index":"_license",
"index_uuid":"_na_"}],"type":"invalid_index_name_exception","reason":"Invalid
index name [_license], must not start with '_'.","index":"_license",
"index_uuid":"_na_"},"status":400}


Replacing non-OSS Filebeat with the open source version 7.10.2 solves these compatibility issues. You can verify that you are receiving the Filebeat logs by checking for agent.type: filebeat in the indexed documents.
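
One hedged way to automate that check is a small count query against the Filebeat indices; the endpoint, credentials, and index pattern below are placeholders:

import requests

OS_URL = "https://search.example.cnodes.io:9200"  # placeholder OpenSearch endpoint
AUTH = ("admin", "password")                      # placeholder credentials

# Count documents shipped by Filebeat (agent.type: filebeat).
query = {"query": {"term": {"agent.type": "filebeat"}}}
resp = requests.post(f"{OS_URL}/filebeat-*/_count", json=query, auth=AUTH)
print(resp.json().get("count", 0), "Filebeat documents found")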

In addition to ensuring that all tools and clients function with OpenSearch, monitor tools and clients to see that performance in the OpenSearch cluster is similar to Elasticsearch.

Back up your data

Before going ahead with the bulk of the migration, be sure to back up all important data. While the migration to OpenSearch shouldn’t cause data loss, it never hurts to play it safe. Backups are especially crucial when performing rolling, restart, or other in-place upgrades. With Elasticsearch snapshots, you can back up your data to a filesystem repository or to cloud repositories such as S3, GCS, or Microsoft Azure.

Migrate data

Migrating from Elasticsearch to OpenSearch can be done in a few different ways, which vary in ease of migration, required downtime, and level of compatibility.

Migrating with reindex provides the highest level of compatibility and we will be focusing on reindex migration for the rest of the document.

Migrate data via reindex

To begin, identify all indices you’ll migrate to OpenSearch (don’t migrate system indices). Then copy all needed index mappings, settings, and templates, and apply them to your OpenSearch cluster. Make as few changes as possible in the interest of a seamless migration.

For example, the following code takes the index sample_http_responses and copies and applies settings and mappings to OpenSearch:

PUT sample_http_responses
{
  "mappings": {
    "properties": {
      "@timestamp": {"type": "date"},
      "http_1xx": {"type": "long"},
      "http_2xx": {"type": "long"},
      "http_3xx": {"type": "long"},
      "http_4xx": {"type": "long"},
      "http_5xx": {"type": "long"},
      "status_code": {"type": "long"}
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "1"
    }
  }
}

 

Prior to reindexing, you’ll ideally want to stop any new indexing to the source index. Not possible for your use case? Then perform incremental reindexing to handle newer documents, if you have a timestamp or incremental id available to facilitate that strategy.
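
To illustrate the incremental approach, the _reindex request can carry a range query on the timestamp so that only documents newer than the last migrated point are copied. The endpoint, credentials, index name, and cutoff below are assumptions, not part of the original setup:

import requests

OS_URL = "https://os.example.com:9200"   # placeholder OpenSearch endpoint
LAST_MIGRATED = "2022-06-01T00:00:00Z"   # assumed timestamp of the previous pass

body = {
    "source": {
        "remote": {"host": "http://xxx.xxx.xxx.xxx:9200"},
        "index": "sample_http_responses",
        # Only pick up documents created after the last migrated timestamp.
        "query": {"range": {"@timestamp": {"gt": LAST_MIGRATED}}},
    },
    "dest": {"index": "sample_http_responses"},
}

resp = requests.post(
    f"{OS_URL}/_reindex",
    params={"wait_for_completion": "false"},
    json=body,
    auth=("admin", "password"),  # placeholder credentials
)
print(resp.json())  # contains the task id for later progress checks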

You’ll also need to whitelist your remote cluster endpoint in OpenSearch’s settings before beginning a reindex. Edit the opensearch.yml file and add the whitelist config for the remote IP. It’s also possible to list multiple <ip-addr>:<port> entries you’d like to whitelist.

reindex.remote.whitelist: "xxx.xxx.xxx.xxx:9200"

Next, submit the reindex request, specifying remote endpoint details such as ssl parameters and remote credentials. The following submits the reindexing operation as an async request, a useful technique since moving a lot of data can lengthen completion time. 

To avoid overloading the remote cluster, it’s also possible to throttle the number of requests per second.

 


POST _reindex?wait_for_completion=false&requests_per_second=-1
{
  "source": {
    "remote": {
      "host": "http://xxx.xxx.xxx.xxx:9200",
      "socket_timeout": "2m",
      "connect_timeout": "60s"
    },
    "index": "sample_http_responses"
  },
  "dest": {
    "index": "sample_http_responses"
  }
}

 

The _tasks/<task-id> endpoint lets you check the reindex operation’s progress.
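
A minimal polling sketch could look like this (endpoint, credentials, and task id are placeholders); the truncated response below shows the kind of status document the endpoint returns:

import time
import requests

OS_URL = "https://os.example.com:9200"    # placeholder endpoint
TASK_ID = "o-qCCzE-RZOK1_nDS3ItmA:1858"   # placeholder task id from the reindex response

while True:
    status = requests.get(f"{OS_URL}/_tasks/{TASK_ID}", auth=("admin", "password")).json()
    if status.get("completed"):
        print("Reindex finished:", status["task"]["status"])
        break
    time.sleep(30)  # poll every 30 seconds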

 

{
  "completed" : true,
  "task" : {
    "node" : "o-qCCzE-RZOK1_nDS3ItmA",
    "id" : 1858,
    "type" : "transport",
    "action" : "indices:data/write/reindex",
    "status" : {
      "total" : 1000,
      "updated" : 0,
      "created" : 1000,
      "deleted" : 0,
      "batches" : 1,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : """reindex from [host=xxx.xxx.xxx.xxx port=9200 query={
      "match_all" : {
        "boost" : 1.0
      }
    ... (output truncated)

Migrate dashboards and visualizations

Exporting dashboards and visualizations from Kibana as saved objects and reimporting them into OpenSearch Dashboards offers the simplest approach for migrating these items. Maintaining the same index names across reindexed data will allow dashboards to work seamlessly post-migration.

First, migrate index patterns by selecting them under Kibana > Stack Management > Saved Objects, and exporting them as ndjson objects.

 

Next, import the index patterns in OpenSearch Dashboards under Stack Management > Saved Objects.

 

Now migrate dashboards and visualizations using the same process, and check that they function correctly.

 

Validate the functionality and performance of your new OpenSearch cluster

With the migration to OpenSearch complete, check that everything is working as it should. Verify the functionality of search queries and aggregations your applications depend upon, and check performance versus the previous Elasticsearch cluster. 

 

Do it all again in production

With your migration strategy now successfully proven out in test environments, it’s time to repeat the steps above, and to migrate your production Elasticsearch environment to realize the benefits of fully open source OpenSearch. 

The post Take Control of ML Projects appeared first on ML Conference.

]]>
Python Developers live in Visual Studio Code https://mlconference.ai/blog/python-developers-live-in-visual-studio-code/ Thu, 10 Feb 2022 14:47:32 +0000 https://mlconference.ai/?p=83282 With over 18 million monthly users, VS Code has become one of the most popular and fastest-growing text editors in the world. To learn more about why over 3.7 million of them find VS Code to be the perfect habitat for Python development and data science work, keep on reading!.

The post Python Developers live in Visual Studio Code appeared first on ML Conference.

]]>
The Python programming language was created in the late 1980s by Guido van Rossum. By 2003, it consistently ranked among the most popular programming languages in the world. According to the PYPL (PopularitY of Programming Language) index [1], which is generated by analyzing the frequency of coding tutorial searches on Google, Python is now the most popular language in the world. This is no surprise after having grown 15.4% in the last 5 years. [Fig. 1]

Fig. 1: Python popularity growth

With hundreds of packages including Pandas, NumPy, Matplotlib, and scikit-learn that respectively provide functionality for tabular data manipulation, numerical computing, data visualization, and machine learning algorithms for predictive data analysis, the Python language has become the go-to for data science work. Powerful frameworks for building apps such as Flask and Django that are lightning-fast, scalable, and flexible make it one of the most compelling options for web development. Python’s growth and coverage of multiple coding use cases has continued to skyrocket and shows no indication of slowing down any time soon.

VS Code – The Perfect Habitat for Python Developers

At Microsoft’s Developer Division, our mission is to enable every developer to achieve more. This year, to continue supporting the quickly growing Python community, we increased our sponsorship of the Python Software Foundation [2] to the top new visionary level. Goals of the PSF include providing grants and resources for further development and adoption of Python as well as expanding Python outreach by funding the Python Ambassador Program.

On top of supporting the Python community at large, we aim to support Python users right here at home in VS Code! With over 18 million monthly users, VS Code has become one of the most popular and fastest-growing text editors in the world. To learn more about why over 3.7 million of them find VS Code to be the perfect habitat for Python development and data science work, keep on reading!


How to Follow Along

First things first, you will need to install VS Code. Once you have VS Code installed, you can search for and install extensions through the VS Code Extensions Marketplace. The family of extensions you’ll need for the ultimate Python coding experience include the Python, Pylance, Jupyter, and Azure Machine Learning extensions. [Fig. 2]

Fig. 2: VS Code Extension Marketplace

Python Extension

The Python extension builds on top of VS Code’s already powerful code editor. By providing additional support for environment handling, debugging, testing, linting and formatting, the Python extension capabilities are here to supercharge your Python development work. Our recent extension startup changes have also made great strides in performance improvements so you can get coding sooner.


Environment Handling

Get started easily with any of your favorite environments such as pyenv, pipenv, Conda, and Poetry. The extension will automatically detect Python interpreters that are installed in standard locations and the environment you choose will power the IntelliSense, auto-completions, linting, formatting, and any other language-related feature other than debugging.

Debugging

Print statements to check states of variables are a thing of the past. Easily debug different types of Python applications (e.g., multi-threaded, web, and remote applications) by setting breakpoints, inspecting data, and utilizing the debug console as you step through your code. On top of that, there is no starting and stopping with the Python debugger. If you make changes to your code after the debugger execution has already hit a breakpoint, you no longer need to restart the debugger as auto-reload exists for Python scripts, Django and Flask! [Fig. 3]

Fig. 3: Debug Python file in terminal

Linting

Linting analyzes your code for potential errors, highlighting areas where problems should be corrected so you don’t have to manually parse the code yourself. The currently supported linters include Pylint, pycodestyle, Flake8, mypy, pydocstyle, prospector, and pylama. Simply make sure the linter of your choice is installed in the active Python interpreter. [Fig. 4]

Fig. 4: Linter support

Pylance

Your default Python editing experience has been upgraded with the bundling of our Pylance extension, Visual Studio Code’s most robust and performant Python language server. Its rich editing features include completions, auto-imports, function signature help, docstrings, contextual highlighting, and more!

With auto-imports you can say goodbye to your workflow being interrupted to import necessary modules. As you are constructing your code, Pylance will provide smart import suggestions and insert them at the top of your file for you. The function signature help provides information on parameters as well as return types so that you no longer have to hunt down external documentation and leave the context of your code editor.

You can even refactor your code at the speed of light by tapping into Pylance’s extraction features. You can highlight lines of code and pick either “Extract Method” [Fig 5] or “Extract Variable” [Fig. 6] to have Pylance do the heavy lifting to turn them into new variables or functions.

Fig. 5: Pylance’s Extract Method

Fig. 6: Pylance’s Extract Variable

Don’t forget about contextual document highlighting! Double-clicking on variables will present other instances of the variable to you such that none can slip by you. [Fig. 7]

Fig. 7: Pylance’s Contextual Document Highlighting

Jupyter Notebooks

A Jupyter Notebook is an interactive programming and computational document that supports mixing executable code, equations, visualizations, and narrative text. Jupyter Notebooks [Fig. 8] can contain markdown and code cells, where code cells have two major components: input and output. You can write code in the input area of a cell, and after running the cell the result will show up in the output area just below. [Fig. 9]

Fig. 8: VS Code Notebook

Fig. 9: Histogram in VS Code Notebooks

Jupyter Notebooks have quickly become the de facto tool for data science. The ability to run chunks of code at a time and out of order makes them very exploratory in nature which is incredibly conducive to data exploration. The ability to see outputs and visualizations in a hassle-free manner paired with narrative text, makes Jupyter Notebooks the perfect location to tell a story with data. Outside of data science though, they are also a great tool for teaching or learning new languages, general code experimentation, and building quick prototypes.

This past year, our very own implementation of Jupyter Notebooks got a major overhaul by being fully integrated with Visual Studio Code. On top of a new modern design, you can now benefit from faster load times, innate source control and diffing capabilities, full notebook debugging, customizable theming, and more!

Variable Explorer and Data Viewer

Our additional features such as our Variable Explorer [Fig. 10] and Data Viewer will help you keep track of the state of your variables and take a deeper look at the tabular data you might be working with. To access the Variable Explorer, simply click on the Variables Icon in your notebook toolbar.[Fig. 11] To access the Data Viewer, click on the icon to the left of the tabular variable you would like to inspect.

Fig. 10: Variable Explorer in Notebook Toolbar

Fig. 11: Variable Explorer

The Data Viewer [Fig. 12] provides a spreadsheet-like view of your data and the filtering capabilities allow you to make quick checks on your data. It facilitates and speeds up identifying data quality issues and the next steps that must be taken in order to properly clean the data.

Fig. 12: Data Viewer

Debugging

VS Code allows you to debug your notebook in multiple ways. For a “debugging-lite” experience, you can opt for Run by Line. As you’ve likely already guessed based on the name, Run by Line [Fig. 13] allows you to run through your cell, one line at a time. When Run by Line is enabled, the Variable Explorer will open alongside with it so you can keep track of the state of your variables as you iterate and quickly resolve small code issues.

Fig. 13: Run by Line

With our most recent revamp, you can now take advantage of the same full debugging experience [Fig. 14] enabled by the Python extension in notebooks. It’s important to note that if you’d like to use these debugging features in VS Code today, you’ll need at least version 6.0 or higher of ipykernel [3] in the environment you select to run your notebooks.

Fig. 14: Debugging in Jupyter Notebooks

Custom Notebook Diffing

Under the hood, Jupyter Notebooks are JSON files. The segments in a JSON file are rendered as cells that are made up of three components: input, output, and metadata. Comparing changes made in a notebook using line-based diffing is difficult and hard to parse. The rich diffing editor for notebooks allows you to easily see changes for each component of a cell.

You can even customize what types of changes you want displayed within your diffing view. In the top right, select the overflow menu item in the toolbar to customize what cell components you want included, but don’t worry about input changes as those will always be shown. [Fig. 15]

Fig. 15: Customized Diffing View

Interactive Window

If you like the idea of notebooks but are used to working with scripts, we have the feature just for you! The Interactive Window is a hybrid between a notebook and a script. When working in a Python file, you can create cell-like code segments by using the ‘#%%’ delimiter. Running these faux cells in the Interactive Window [Fig.16] allows you to break down your longer Python script into smaller and more comprehensible chunks and see their results to the right as opposed to inline like a notebook would. You can also run code directly in the Interactive Window itself, that way you can use it as a scratch pad where you can try out slightly tweaked code before inserting it into your more finalized Python script.

Fig. 16: Interactive Window
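
A minimal sketch of such a script, assuming the Python and Jupyter extensions plus pandas and Matplotlib are installed; every '#%%' marker starts a new cell that can be sent to the Interactive Window:

#%% Load and summarize some data (this is one "cell")
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 25, 40]})
print(df.describe())

#%% Plot it (a second cell; can be run independently of the one above)
import matplotlib.pyplot as plt

df.plot(x="x", y="y", kind="line")
plt.show()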

TensorBoard and PyTorch Profiler

If you are using PyTorch or TensorFlow you can look forward to our TensorBoard dashboard integration helping you visualize datasets, train models, spot check model predictions, view model architecture, analyze model’s loss and accuracy over time, as well as profile your code to understand where it is slowest. In addition to TensorBoard integration, we’ve also embedded the PyTorch Profiler in VS Code such that you can monitor your PyTorch models all in one convenient location. In addition, VS Code is exclusively the only tool today that allows you to jump directly to your source code file from the PyTorch Profiler!
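
For orientation, here is a hedged sketch of writing a profiler trace that the TensorBoard integration can pick up, assuming PyTorch is installed; the model, batch, and log directory are placeholder choices:

import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

model = nn.Linear(128, 10)   # placeholder model
data = torch.randn(64, 128)  # placeholder batch

# Write a trace into ./logs, which the TensorBoard/Profiler views can then load.
with profile(
    activities=[ProfilerActivity.CPU],
    on_trace_ready=tensorboard_trace_handler("./logs"),
) as prof:
    for _ in range(5):
        model(data)
        prof.step()  # mark one profiling step per iteration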

If working in notebooks, our Variable Explorer allows you to inspect PyTorch and TensorFlow data types and our Data Viewer allows you to slice data so you can get a robust understanding of any 3D or higher dimensional data. As a reminder, you can access the Data Viewer through the Variable Explorer or during a Python debugging session. When a debugging session has started, you can right click on the Tensor you would like to do a deeper dive on and select “View Value in Data Viewer”. [Fig. 17]

Fig. 17: Data viewer

Azure Machine Learning

While many data science and machine learning tasks can be completed successfully on your local machine, sometimes you just need a bit more power! If you’re interested in scaling your training and inferencing workloads, the Azure Machine Learning extension [4] has you covered.

You can search for the Azure Machine Learning extension in the Visual Studio Code marketplace, sign into your Azure Account, [5]  and create a machine learning workspace [6] to organize and manage your resources. Through the Azure Machine Learning extension, you can create a compute resource and seamlessly connect to it without requiring SSH or additional network configuration. When connected to the compute you can continue using your favorite VS Code capabilities (notebooks, debugger, terminal), import your local project, and scale your model training while leveraging GPU resources.

The Azure Machine Learning extension [Fig. 18] also provides enhanced language support (completions based on your Azure resources) and generated templates that you can use to author and check-in reproducible, shareable configuration files. Once created and deployed, the results of your work (e.g., creating an environment, training a model) can be viewed from directly within Visual Studio Code through the extension tree view; you no longer have to context-switch between the editor and the browser to manage your machine learning resources.


Fig. 18: Azure Machine Learning

Additional Notable Mentions

While the next two extensions are not exclusive to Python itself, they are incredibly powerful aids in your development experience.

Remote – SSH

The Remote – SSH extension lets you use any remote machine with an SSH server as your development environment. It effectively runs VS Code on the remote machine, so you have access to any extensions and files on that same remote machine. With this extension you can develop on the same operating system you deploy to, use larger, faster, or more specialized hardware than your local machine, and switch between remote development environments without altering anything on your local machine.

Live Share

The Live Share extension allows you to collaboratively edit and debug with others in real time, regardless of what programming languages are being used. You can forget the archaic method of sending files back and forth between you and your coworkers. Live Share allows you to instantly (and securely) share your current project, and then as needed, share debugging sessions, terminal instances, localhost web apps, voice calls, and more! Guests invited to your sessions will have your editor context mirrored on their machine so you can start collaborating productively immediately without needing to clone any repos or install any SDKs.

As the guest joining a Live Share session however, you will still have all your personal editor preferences (e.g. keybindings, theme) honored and your own cursor so you can seamlessly jump into a session and work together or independently in the same file.

Welcome Home Pythonistas

Regardless of what you are trying to achieve with Python, VS Code is the place for you! We hope you come try out the Python experience in VS Code and that it becomes your new home for Python development and data science work! Let us know what you think!

For a much more detailed and in-depth walkthrough of the mentioned extensions and features, please visit our Visual Studio Code Documentation.

How to Contact Us

Follow our Twitter handles @code for any Visual Studio Code product updates and @pythonvscode for Python and Jupyter product announcements! For any feature suggestions or issues don’t hesitate to file issues on our VS Code, VS Code Python, VS Code Jupyter, or VS Code Pylance GitHub Repositories! As always, we encourage and welcome the community to participate and contribute to our open-source tools!

Links & Literature

[1] https://pypl.github.io/PYPL.html

[2] https://www.python.org/psf/ 

[3] https://pypi.org/project/ipykernel/ 

[4] https://docs.microsoft.com/en-us/Azure/machine-learning/how-to-setup-vs-code 

[5] https://docs.microsoft.com/en-us/dotnet/azure/create-azure-account

[6] https://docs.microsoft.com/en-us/Azure/machine-learning/how-to-setup-vs-code 

The post Python Developers live in Visual Studio Code appeared first on ML Conference.

]]>
What is Data Annotation and how is it used in Machine Learning? https://mlconference.ai/blog/data-annotation-ml/ Tue, 12 Oct 2021 12:24:15 +0000 https://mlconference.ai/?p=82363 What is data annotation? And how is data annotation applied in ML? In this article, we are delving deep to answer these key questions. Data annotation is valuable to ML and has contributed immensely to some of the cutting-edge technologies we enjoy today. Data annotators, or the invisible workers in the ML workforce, are needed more now than ever before.

The post What is Data Annotation and how is it used in Machine Learning? appeared first on ML Conference.

]]>
Modern businesses are operating in highly competitive markets, and finding new business opportunities is harder than ever. Customer experiences are constantly changing, finding the right talent to work on common business goals is an enormous challenge, and yet businesses still need to perform the way the market demands. So what are these companies doing to create a sustainable competitive advantage? This is where Artificial Intelligence (AI) solutions come in and are prioritized. With AI, it is easier to automate business processes and streamline decision-making. But what exactly defines a successful Machine Learning (ML) project? The answer is simple: the quality of the training datasets that feed your ML algorithms.

With that in mind, what amounts to a high-quality training dataset? Data annotation. What is data annotation? And how is data annotation applied in ML?

In this article, we delve deep to answer these key questions. It is particularly helpful if:

  • You are seeking to understand what data annotation is in ML and why it is so important.
  • You are a data scientist curious to know the various data annotation types out there and their unique applications.
  • You want to produce high-quality datasets for your ML model’s top performance, and have no idea where to find professional data annotation services.
  • You have huge chunks of unlabeled data, no time to gather, organize, and label it, and are in dire need of a data labeler to do the job for you and help you meet the training and deployment goals for your models.

What is Data Annotation?

In ML, data annotation refers to the process of labeling data in a manner that machines can recognize either through computer vision or natural language processing (NLP). In other words, data labeling teaches the ML model to interpret its environment, make decisions and take action in the process.

Data scientists use massive amounts of datasets when building an ML model, carefully customizing them according to the model training needs. Thus, machines are able to recognize data annotated in different, understandable formats such as images, texts, and videos.

This explains why AI and ML companies are after such annotated data to feed into their ML algorithm, training them to learn and recognize recurring patterns, eventually using the same to make precise estimations and predictions.

The data annotation types

Data annotation comes in different types, each serving different and unique use cases. Although the field is broad, there are common annotation types found in popular machine learning projects, which we look at in this section to give you the gist of the field:

Semantic Annotation

Semantic annotation entails annotation of different concepts within text, such as names, objects, or people. Data annotators use semantic annotation in their ML projects to train chatbots and improve search relevance.

Image and Video Annotation

Put simply, image annotation enables machines to interpret the content of pictures. Data experts use various forms of image annotation, ranging from bounding boxes drawn on images to pixels assigned a meaning individually, a process called semantic segmentation. This type of annotation is commonly used in image recognition models for tasks like facial recognition and recognizing and blocking sensitive content.

Video annotation, on the other hand, uses bounding boxes or polygons on video content. The process is simple: developers use video annotation tools to place these bounding boxes, or stitch together video frames to track the movement of annotated objects. Either way, this type of data comes in handy when developing computer vision models for localization or object-tracking tasks.

Text categorization

Text categorization, also called text classification or text tagging, assigns a set of predefined categories to documents. With this type of annotation, paragraphs or sentences within a document can be tagged by topic, making it easier for users to search for information within a document, an application, or a website.

Why is Data Annotation so Important in ML

Whether you think of search engines’ ability to improve the quality of their results, the development of facial recognition software, or the way self-driving automobiles are created, all of these are made real through data annotation. Living examples include how Google manages to tailor results to the user’s geographical location or sex, how Samsung and Apple have improved the security of their smartphones with facial unlocking software, how Tesla brought semi-autonomous self-driving cars to market, and so on.

Annotated data is valuable in ML because it enables accurate predictions and estimations about our living environments. As mentioned above, machines are able to recognize recurring patterns, make decisions, and take action as a result. In other words, machines are shown understandable patterns and told what to look for in images, video, text, or audio. There is virtually no limit to the similar patterns a trained ML algorithm can then find in new datasets fed into it.

Data Labeling in ML

In ML, a data label, also called a tag, is an element that identifies raw data (images, videos, or text), and adds one or more informative labels to put into context what an ML model can learn from. For example, a tag can indicate what words were said in an audio file, or what objects are contained in a photo.

Data labeling helps ML models learn from the numerous examples they are given. For example, a model will easily spot a bird or a person in an unlabeled image if it has seen enough labeled examples of images containing birds or people.

Conclusion

Data annotation is valuable to ML and has contributed immensely to some of the cutting-edge technologies we enjoy today. Data annotators, the invisible workers in the ML workforce, are needed more now than ever before. The growth of the AI and ML industry as a whole depends heavily on the continued creation of nuanced datasets needed to solve some of ML’s complex problems.

There is no better “fuel” for training ML algorithms than annotated data in images, videos, or texts, and that is how we arrive at the autonomous ML models we can proudly build.

Now you understand why data annotation is essential in ML and what its most common types are. You are in a position to make informed choices for your enterprise and level up your operations.

The post What is Data Annotation and how is it used in Machine Learning? appeared first on ML Conference.

]]>
Neuroph and DL4J https://mlconference.ai/blog/neuroph-and-dl4j/ Tue, 14 Sep 2021 11:31:34 +0000 https://mlconference.ai/?p=82227 In this article, we would like to show how neural networks, specifically the multilayer perceptron of two Java frameworks, can be used to detect blood cells in images.

The post Neuroph and DL4J appeared first on ML Conference.

]]>
Microscopic blood counts include an analysis of the six types of white blood cells: band (young) and segmented (mature) neutrophils, eosinophils, basophilic granulocytes, monocytes, and lymphocytes. Based on the number, maturity, and distribution of these white blood cells, you can obtain valuable information about possible diseases. However, here we will not focus on the handling of the blood smears, but on the recognition of the blood cells.

For the tests described, the Bresser Trino microscope with a MikrOkular was used and connected to a computer (HP Z600). The program presented in this article was used for image analysis. The software is based on neural networks using the Java frameworks Neuroph and Deep Learning for Java (DL4J). Staining of the smears for the microscope was done with Löffler solution.

 

Training data

For neural network training, the images of the blood cells were centered, converted to grayscale format, and normalized. After preparation, the images looked as shown in Figure 1.

Fig. 1: The JPG images have a size of 100 x 100 pixels and show (from left to right) lymphocyte (ly), basophil (bg), eosinophil (eog), monocyte (mo), rod-nucleated (young) neutrophil (sng), segment-nucleated (mature) neutrophil (seg); the cell types were used for neural network training.

 

A dataset of 663 images with 6 labels – ly, bg, eog, mo, sng, seg – was compiled for training. For Neuroph, the imageLabels shown in Listing 1 were set.

List<String> imageLabels = new ArrayList<>();
imageLabels.add("ly");
imageLabels.add("bg");
imageLabels.add("eog");
imageLabels.add("mo");
imageLabels.add("sng");
imageLabels.add("seg");

After that, the directory for the input data looks like Figure 2.

Fig. 2: The directory for the input data

 

For DL4J the directory for the input data (your data location) is composed differently (Fig. 3).

Fig. 3: Directory for the input data for DL4J.

 

Most of the images in the dataset came from our own photographs, but there were also images from open and free internet sources. In addition, the dataset contained each image multiple times, since copies rotated by 90, 180, and 270 degrees were also stored.

 


Neuroph MLP Network

The main dependencies for the Neuroph project in pom.xml are shown in Listing 2.

<dependency>
  <groupId>org.neuroph</groupId>
  <artifactId>neuroph-core</artifactId>
  <version>2.96</version>
</dependency>   
<dependency>
  <groupId>org.neuroph</groupId>
  <artifactId>neuroph-imgrec</artifactId>
  <version>2.96</version>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.17</version>
</dependency>

A multilayer perceptron was set with the parameters shown in Listing 3.

private static final double LEARNINGRATE = 0.05;
private static final double MAXERROR = 0.05;
private static final int HIDDENLAYERS = 13;

// open network
Map<String, FractionRgbData> map;
try {
  map = ImageRecognitionHelper.getFractionRgbDataForDirectory(new File(imageDir), new Dimension(10, 10));
  dataSet = ImageRecognitionHelper.createRGBTrainingSet(imageLabels, map);
  // create neural network
  List<Integer> hiddenLayers = new ArrayList<>();
  hiddenLayers.add(HIDDENLAYERS);
  NeuralNetwork nnet = ImageRecognitionHelper.createNewNeuralNetwork("leukos", new Dimension(10, 10), ColorMode.COLOR_RGB, imageLabels, hiddenLayers, TransferFunctionType.SIGMOID);
  // set learning rule parameters
  BackPropagation mb = (BackPropagation) nnet.getLearningRule();
  mb.setLearningRate(LEARNINGRATE);
  mb.setMaxError(MAXERROR);
  nnet.save("leukos.net");
} catch (IOException ex) {
  Logger.getLogger(Neuroph.class.getName()).log(Level.SEVERE, null, ex);
}

 

Example

The implementation of a test can look like the one shown in Listing 4.

HashMap<String, Double> output;
String fileName = "leukos112.seg";
NeuralNetwork nnetTest = NeuralNetwork.createFromFile("leukos.net");
// get the image recognition plugin from the neural network
ImageRecognitionPlugin imageRecognition = (ImageRecognitionPlugin) nnetTest.getPlugin(ImageRecognitionPlugin.class);
output = imageRecognition.recognizeImage(new File(fileName));

 

Client

A simple SWING interface was developed for graphical cell recognition. An example of the recognition of a lymphocyte is shown in Figure 4.

Fig. 4: The program recognizes a lymphocyte and highlights it

 

DL4J MLP network

The main dependencies for the DL4J project in pom.xml are shown in Listing 5.

<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-core</artifactId>
  <version>1.0.0-beta4</version>
</dependency>
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native-platform</artifactId>
  <version>1.0.0-beta4</version>
</dependency>

Again, a multilayer perceptron was used with the parameters shown in Listing 6.

protected static int height = 100;
protected static int width = 100;
protected static int channels = 1;
protected static int batchSize = 20;

protected static long seed = 42;
protected static Random rng = new Random(seed);
protected static int epochs = 100;
protected static boolean save = true;

// DataSet
String dataLocalPath = "your data location";
ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
File mainPath = new File(dataLocalPath);
FileSplit fileSplit = new FileSplit(mainPath, NativeImageLoader.ALLOWED_FORMATS, rng);
int numExamples = toIntExact(fileSplit.length());
numLabels = Objects.requireNonNull(fileSplit.getRootDir().listFiles(File::isDirectory)).length;
int maxPathsPerLabel = 18;
BalancedPathFilter pathFilter = new BalancedPathFilter(rng, labelMaker, numExamples, numLabels, maxPathsPerLabel);

// training – test split
double splitTrainTest = 0.8;
InputSplit[] inputSplit = fileSplit.sample(pathFilter, splitTrainTest, 1 - splitTrainTest);
InputSplit trainData = inputSplit[0];
InputSplit testData = inputSplit[1];

// open network
MultiLayerNetwork network = lenetModel();
network.init();
ImageRecordReader trainRR = new ImageRecordReader(height, width, channels, labelMaker);
trainRR.initialize(trainData, null);
DataSetIterator trainIter = new RecordReaderDataSetIterator(trainRR, batchSize, 1, numLabels);
DataNormalization scaler = new ImagePreProcessingScaler(0, 1); // normalize pixel values to [0, 1]
scaler.fit(trainIter);
trainIter.setPreProcessor(scaler);
network.fit(trainIter, epochs);

 

LeNet Model

This model is a kind of feed-forward convolutional neural network for image processing (Listing 8).

private MultiLayerNetwork lenetModel() {
  /*
    * Revised Lenet Model approach developed by ramgo2 achieves slightly above random
    * Reference: https://gist.github.com/ramgo2/833f12e92359a2da9e5c2fb6333351c5
  */
  MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .l2(0.005)
    .activation(Activation.RELU)
    .weightInit(WeightInit.XAVIER)
    .updater(new AdaDelta())
    .list()
    .layer(0, convInit("cnn1", channels, 50, new int[]{5, 5}, new int[]{1, 1}, new int[]{0, 0}, 0))
    .layer(1, maxPool("maxpool1", new int[]{2, 2}))
    .layer(2, conv5x5("cnn2", 100, new int[]{5, 5}, new int[]{1, 1}, 0))
    .layer(3, maxPool("maxool2", new int[]{2, 2}))
    .layer(4, new DenseLayer.Builder().nOut(500).build())
    .layer(5, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
      .nOut(numLabels)
      .activation(Activation.SOFTMAX)
      .build())
    .setInputType(InputType.convolutional(height, width, channels))
    .build();
 
  return new MultiLayerNetwork(conf);
 
}

 


Example

A test of the Lenet model might look like the one shown in Listing 9.

trainIter.reset();
DataSet testDataSet = trainIter.next();
List<String> allClassLabels = trainRR.getLabels();
int labelIndex;
int[] predictedClasses;
String expectedResult;
String modelPrediction;
int n = allClassLabels.size();
System.out.println("n = " + n);
for (int i = 0; i < n; i = i + 1) {
  labelIndex = testDataSet.getLabels().argMax(1).getInt(i);
  System.out.println("labelIndex=" + labelIndex);
  INDArray ia = testDataSet.getFeatures();
  predictedClasses = network.predict(ia);
  expectedResult = allClassLabels.get(labelIndex);
  modelPrediction = allClassLabels.get(predictedClasses[i]);
  System.out.println("For a single example that is labeled " + expectedResult + " the model predicted " + modelPrediction);
}

 

Results

After a few test runs, the results shown in Table 1 are obtained.

| Leukocytes | Neuroph | DL4J |
| --- | --- | --- |
| Lymphocytes (ly) | 87 | 85 |
| Basophils (bg) | 96 | 63 |
| Eosinophils (eog) | 93 | 54 |
| Monocytes (mo) | 86 | 60 |
| Rod nuclear neutrophils (sng) | 71 | 46 |
| Segment nucleated neutrophils (seg) | 92 | 81 |

Table 1: Results of leukocyte counting (N-success/N-samples in %).

 

As can be seen, the results using Neuroph are slightly better than those using DL4J, but it is important to note that the results are dependent on the quality of the input data and the network’s topology. We plan to investigate this issue further in the near future.

However, with these results, we have already been able to show that image recognition can be used for medical purposes with not one, but two sound and potentially complementary Java frameworks.

 

Acknowledgments

At this point, we would like to thank Mr. A. Klinger (Management Devoteam GmbH Germany) and Ms. M. Steinhauer (Bioinformatician) for their support.

The post Neuroph and DL4J appeared first on ML Conference.

]]>
Top 5 reasons to attend ML Conference https://mlconference.ai/blog/top-5-reasons-to-attend-ml-conference/ Tue, 20 Jul 2021 11:33:51 +0000 https://mlconference.ai/?p=82083 So you’ve decided to attend ML Conference but you don’t know how to break it to your boss that it is a win-win situation? Don’t worry, we’ve got you covered. Follow 4 simple steps and use these 5 arguments to show why your organization needs to invest in ML Conference!

The post Top 5 reasons to attend ML Conference appeared first on ML Conference.

]]>
1. Let your boss know why you want to go to ML Conference

Tell him that there are over 25 expert speakers and industry experts addressing current trends and best practices.

Tell your boss to take a look at the conference tracks to have a better idea of what this conference is all about.

 

2. Tell him what’s in it for him

You have the chance to gain key knowledge and skills for this new era of Machine Learning. Turn your ideas into best practices during the workshops and meet people who can help you with that. You’ll learn what it means to build up a ML-first mindset with numerous real-world examples and you can put them into practice in your company. At ML Conference you will develop a deep understanding of your data, as well as of the latest tools and technologies.

 

3. Show him that you’ve done your homework: Book your ticket now and save money.

If you book your ticket now, your boss will save money on the early bird ticket. Plus, you will have an additional 10% discount for a group of 3 people or more.

 

4. Assure your boss that you will network with top industry experts

In addition to the valuable knowledge you will get from top-notch industry experts, you’ll also have the chance to connect and network with the people who are at the top of their career. ML Conference offers an expo reception and a networking event.

 

The post Top 5 reasons to attend ML Conference appeared first on ML Conference.

]]>
Anomaly Detection as a Service with Metrics Advisor https://mlconference.ai/blog/anomaly-detection-as-a-service-with-metrics-advisor/ Wed, 09 Jun 2021 07:39:21 +0000 https://mlconference.ai/?p=81814 We humans are usually good at spotting anomalies: often a quick glance at monitoring charts is enough to spot (or, in the best case, predict) a performance problem. A curve rises unnaturally fast, a value falls below a desired minimum or there are fluctuations that cannot be explained rationally. Some of this would be technically detectable by a simple automated if, but it's more fun with Azure Cognitive Services' new Metrics Advisor.

The post Anomaly Detection as a Service with Metrics Advisor appeared first on ML Conference.

]]>
Are you developing an application that stores time-based data? Orders, ratings, comments, appointments, time bookings, repairs or customer contacts? Do you have detailed log files about the number and duration of visits? Hand on heart: How quickly would you notice if your systems (or your users) were behaving differently than you thought? Maybe one of your clients is trying to flood the software with way too much data, or a product in your webshop is “going through the roof”? Maybe there are performance issues in certain browsers or unnatural CPU spikes that deserve a closer look? Metrics Advisor from Azure Cognitive Services provides an AI-powered service that monitors your data and alerts you when anomalies are suspected.


What is normal?

The big challenge here is defining what constitutes an anomaly in the first place. Imagine a whole shelf full of developer magazines, with only one sports magazine among them. You could rightly say that the sports magazine is an anomaly. But perhaps, by chance, all the magazines are in A4 format, and only two are in A5 – another anomaly. Thus, for automated anomaly detection, it is important to learn from experience and understand which anomalies are actually relevant – and where there is a false alarm that should be avoided in the future.

In the case of time-based data – which is what Metrics Advisor is all about – there are several approaches to anomaly detection. The simplest way is to define hard limits: Anything below or above a certain threshold is considered an anomaly. This doesn’t require machine learning or artificial intelligence; the rules are quickly implemented and clearly understood. For monitoring data, that might be enough: If 70 percent of the storage space is occupied, you want to react. But often the (data) world does not run in rigid paths, sometimes the relative change is more decisive than the actual value: if there has been a significant increase or decrease of more than 10 percent within the last three hours, an anomaly should be detected. An example could be taken from finance. If your private account balance changes from €20,000 to €30,000, this is probably to be considered an anomaly. If a company account changes from €200,000 to €210,000, this is not worth mentioning. As you can tell from this example, the classification of what constitutes an anomaly may change over time. When founding a startup, €100,000 is a lot of money; when founding a large corporation, it is a marginal note. But what if your data is subject to seasonal fluctuations, or individual days such as weekends or holidays behave significantly differently? Here, too, the classification is not so trivial. Is a wave of influenza in the winter months to be expected and only an anomaly in the summer, or should every increase in infection numbers be flagged? As you can see, the question of anomaly detection is to some extent a very subjective one, regardless of the tooling – and not all decisions can be taken over by the technology. Machine learning, however, can help learn from historical data and distinguish normal fluctuations from anomalies.


Metrics Advisor

Metrics Advisor is a new service in the Azure Cognitive Services lineup and is only available in a preview version for now. Internally, another service is used, namely the Anomaly Detector (also part of the Cognitive Services). Metrics Advisor complements this with numerous API methods, a web frontend for managing and viewing data feeds, root-cause analysis and alert configurations. To experiment, you need an Azure subscription. There you can create a Metrics Advisor resource; it’s completely free to use in the preview phase.

The example I would like to use to demonstrate the basic procedure uses data from Google Trends [1]. I have evaluated and downloaded weekly Google Trends scores for two search terms (“vaccine” and “influenza”) for the last five years for four countries (USA, China, Germany, Austria) and would like to try to identify any anomalies in this data. The entire administration of the Metrics Advisor can be done via the provided REST API [2], a faster way to get started is via the provided web frontend, the so-called workspace [3].

Data Feeds

To start with, we create a new data feed that provides the basic data for the analysis. Various Azure services and databases are available out of the box as data sources: Application Insights, Blob Storage, Cosmos DB, Data Lake, Table Storage, SQL Database, Elastic Search, Mongo DB, PostgreSQL – and a few more. In our example, I loaded the Google Trends data into a SQL Server database. In addition to the primary key, the table has four other columns: the date, the country, and the scores for vaccine and influenza. In Metrics Advisor, an SQL statement must now be specified (in addition to the connection string) to query all values for a given date. This is because the service will continue to periodically visit our database to retrieve and analyze the new data. The frequency at which this update should happen is set via the granularity: The data can be analyzed annually, monthly, weekly, daily, hourly or in even shorter periods (the smallest unit is 300 seconds). Depending on the selected granularity, Microsoft also recommends how much historical data to provide. If we choose a 5-minute interval, then data from the last four days will suffice. In our case, a weekly analysis, four years is recommended. After clicking on Verify and get schema, the SQL statement is issued and the structure of our data source is determined. We see the columns shown in Figure 1 and need to assign meaning: Which column contains the timestamp? Which columns should be analyzed as metrics – and where are additional facts (dimensions) that could be possible causes of anomalies?

Fig. 1: Configuration of the data feed

Now, before the data is actually imported, there is one more thing to consider: the roll-up settings. For a later root-cause analysis, it is necessary to build a multidimensional cube that calculates aggregated values per dimension (in our case, also an aggregation over all countries per week). This way, in case of anomalies, it can later be investigated which dimensions or characteristics seem to be causal for the change in value. If the aggregations are not already in our data source, Metrics Advisor can be asked to calculate them. The only decision we have to make here is the type of aggregation (sum, average, min, max, count). Here our example admittedly falls short: we select average, but the value of the USA is then weighted the same as the value of small Austria. As you can see, you often run into data quality limits, or you have to take care that statements are not based on flawed calculations.


Finally, we start the import, which can take several hours depending on the amount of data. The status of the import can also be tracked in the workspace, and individual time periods can be reloaded at any time.

Analysis and fine tuning

Once the data import is complete, we can take a first look at our results. The main goal of Metrics Advisor is to analyze and detect new anomalies – that is, to investigate whether the most recent data point is an anomaly or not. Nevertheless, historical data is also taken into consideration. Depending on the granularity, the service looks several hours to years into the past and tries to flag anomalies there as well. In our case (five years of data, weekly aggregation), the so-called Smart Detection provides results for the past six months and marks individual points in time as an anomaly (Fig. 2).

Fig. 2: Data visualization including anomalies

Now it is time to take a look at the suggestions: Are the identified anomalies actually relevant? Is the detection too sensitive or too tolerant? There are some ways to improve the detection rate. Let’s recall the beginning of this article: The big challenge is to define what constitutes an anomaly in the first place.

You will probably notice the prominently placed slider in the workspace fairly quickly: it controls the sensitivity. The higher the value, the narrower the band of values considered normal. These limits are also visualized within the charts as a light blue area. Sometimes it is useful not to alert at the first occurrence of an anomaly, but only when several anomalies have been detected over a period of time. We can configure Metrics Advisor to look back at a certain number of points and only raise an anomaly once a certain percentage of those points have been flagged. For example, a brief performance problem should be tolerated, but if 70 percent of the readings in the last 15 minutes have been flagged as anomalies, it should be treated as a problem overall.
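To make the idea of such a windowed rule concrete, here is a small conceptual sketch in Python. It only illustrates the logic described above; it is not how Metrics Advisor implements the rule internally, and the window size and ratio are assumptions.

from collections import deque

def alert_on_ratio(flags, window=15, ratio=0.7):
    """Yield True whenever at least `ratio` of the last `window`
    points have been flagged as anomalous (conceptual sketch only)."""
    recent = deque(maxlen=window)
    for is_anomaly in flags:
        recent.append(is_anomaly)
        yield len(recent) == window and sum(recent) / window >= ratio

# Example: a short burst of anomalies is tolerated,
# a sustained run at the end is not.
flags = [False] * 10 + [True] * 2 + [False] * 5 + [True] * 12
print(list(alert_on_ratio(flags)))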


Depending on the use case, it may make sense to supplement Smart Detection with manual rules. A Hard Threshold can be used to define a lower or upper limit, or a range of values, that should be considered an anomaly. The Change Threshold offers the aforementioned possibility of treating a percentage change relative to one or more preceding points as an anomaly. The way the different rules are linked (AND/OR) influences detection: for example, an anomaly should only be reported if Smart Detection strikes and the value is above 30. We can compose and name several of these configurations, and it is also possible to store special rules for individual dimensions.

Depending on our settings, more or fewer anomalies are detected in the data, and Metrics Advisor then tries to group them into so-called incidents. An incident can consist of a single anomaly, but is often made up of related anomalies, so that entire time periods are listed under a common incident. Tools are available in the Incident Hub for closer examination: we can filter the found incidents (by time, criticality, and dimension), start an initial automatic root-cause analysis (see “Root cause” in Fig. 3), and drill down through multiple dimensions to gather insights.

Fig. 3: Incident analysis

Feedback

Perhaps the greatest benefit of using artificial intelligence for anomaly detection is the ability to learn from feedback. Even if sensitivity and thresholds have been set well, the service will sometimes get it wrong. For exactly these data points, feedback can be provided via the API or the portal: Where was an anomaly detected incorrectly? Where was one missed? The service accepts this feedback and tries to classify similar cases more correctly in the future. It also tries to recognize time periods, and it can be corrected if we mark a time range and report it as a period.

For predictable anomalies that have temporal reasons (holidays, weekends, cyclically recurring events), there are separate options for configuration. These should therefore not be reported subsequently as feedback, but stored as so-called preset events.

Alerts

We should now be at a point where data is imported regularly and anomaly detection hopefully works reliably. However, the best detection is of no use if we learn about it too late. Therefore, Alert Configurations should be set up to actively notify about anomalies. Currently, there are three channels to choose from: via email, as a WebHook, or as a ticket in Azure DevOps. The WebHook variant in particular offers exciting possibilities for integration: we can display the detected anomalies in our own application or trigger a workflow using Azure Logic Apps. Perhaps we simply restart the affected web app as a first automated action.
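As an illustration of the WebHook route, the following minimal Flask sketch accepts the POST request and hands the payload over to your own logic. The endpoint path is our own choice, and the structure of the JSON payload is an assumption here; consult the Metrics Advisor documentation for the exact fields.

from flask import Flask, request

app = Flask(__name__)

@app.route("/metrics-advisor-hook", methods=["POST"])  # path is our own choice
def handle_alert():
    payload = request.get_json(silent=True) or {}
    # Log the raw alert; in a real setup you might trigger a Logic App,
    # create a ticket, or restart the affected service here.
    print("Alert received:", payload)
    return "", 200

if __name__ == "__main__":
    app.run(port=5000)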

Snooze settings also seem handy; an alert can automatically ensure that no more alerts are sent for a configurable period of time afterwards. This avoids waking up in the morning with 500 emails in your inbox, all with the same content.

Summary

Metrics Advisor provides an exciting and easy entry into the world of anomaly detection for time-based data. Long-time data scientists may prefer other ways and means (and may be interested in the paper at [4]), but for application developers who want to start their first experiments with suitable data, this service is a potent gateway drug. The preview status is currently still evident mainly in the web portal and in the patchy quality of the REST API documentation; good conceptual documentation, however, is already available.

Have fun trying it out and experimenting with your own data sources.

Links & Literature

[1] https://trends.google.com

[2] https://docs.microsoft.com/en-us/azure/cognitive-services/metrics-advisor/

[3] https://metricsadvisor.azurewebsites.net

[4] https://arxiv.org/abs/1906.03821

 

The post Anomaly Detection as a Service with Metrics Advisor appeared first on ML Conference.

]]>
Let’s visualize the coronavirus pandemic https://mlconference.ai/blog/lets-visualize-the-coronavirus-pandemic/ Wed, 16 Dec 2020 14:16:20 +0000 https://mlconference.ai/?p=81043 Since February, we have been inundated in the media with diagrams and graphics on the spread of the coronavirus. The data comes from freely accessible sources and can be used by everyone. But how do you turn the source data into a data set that can be used to create something visual like a dashboard? With Python and modules like pandas, this is no magic trick.

The post Let’s visualize the coronavirus pandemic appeared first on ML Conference.

]]>
These are crazy times we have been living in since the beginning of 2020; a pandemic has turned public life upside down. News sites offer live tickers with the latest news on infections, recoveries and death rates, and there is no medium that does not use a chart for visualization. Institutes like the Robert Koch Institute (RKI) or the Johns Hopkins University provide dashboards. We live in a world dominated by data, even during a pandemic.

The good thing is that most data on the pandemic is publicly available. Johns Hopkins University, for example, makes its data available in an open GitHub repository. So what could be more obvious than to create your own dashboard with this freely accessible data? This article uses coronavirus data to illustrate how to get from data cleansing to enriching data from other sources to creating a dashboard using Plotly's Dash. First of all, an important note: the data is not interpreted or analyzed in any way. This must be left to experts such as virologists, otherwise false conclusions may be drawn. Even though data is available for almost all countries, it is not necessarily comparable; each country uses different methods for testing infections, and some countries test too little for a uniform picture to emerge. The data set serves only as an example.

 

First, the work

In order to be able to use the data, we need to get it in a standardized form for our purposes. The data of Johns Hopkins University is stored in a GitHub repository. Basically, it is divided into two categories: first, continuously as time-series data and second, as a daily report in a separate CSV file. For the dashboard we need both sources. With the time-series data, it is easy to create line charts and plot gradients, curves, etc. From this we later generate the temporal course of the case numbers as line charts. Furthermore, we can calculate and display the growth rates from the data.
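As a small preview of what the time-series data allows, the following sketch computes a week-over-week growth rate with pandas. It assumes the cleaned-up frame we build later in this article (one Date column plus one column per country); the file path and the column name 'Germany' are taken from that later step.

import pandas as pd

# Assumes the cleaned time series created later in this article
df = pd.read_csv("../data/worldwide_timeseries.csv", parse_dates=["Date"])

# Week-over-week growth rate in percent for one country
df["growth_germany"] = df["Germany"].pct_change() * 100
print(df[["Date", "growth_germany"]].tail())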

 

Set up environment

To prepare the data we use Python and the library pandas. pandas is the Swiss army knife for Python in terms of data analysis and cleanup. Since we need to install some modules for Python, I recommend creating a multi-environment setup for Python. Personally I use Anaconda, but there are also alternatives like Virtualenv. Using isolated environments does not change anything in the system-wide installation of Python, so I strongly recommend it. Furthermore you can work with different Python versions and export the dependencies for deployment more easily. Regardless of the system used, the first step is to activate the environment. With Anaconda and the command line tool conda this works as follows:

$ conda create -n corona-dashboard python=3.8.2

This creates an environment called corona-dashboard with Python 3.8.2; the necessary modules are installed into it in the next steps. To use the environment, we activate it with

$ conda activate corona-dashboard

We do not want to go deeper into Anaconda here, but for more information you can refer to the official documentation.

Once we have activated our environment, we install the necessary modules. In the first step these are pandas and Jupyter notebook. A Jupyter notebook is, simply put, a digital notebook. The notebooks contain markdown and code snippets that can be executed immediately. They are perfectly suited for iterative data cleansing and for developing the necessary steps. The notebooks can also be used to develop diagrams before they are transferred into the final scripts.

$ conda install pandas
$ conda install -c conda-forge notebook

In the following we will perform all steps in Jupyter notebooks. The notebooks can be found in the repository for this article on GitHub. To use them, the server must be started:

$ jupyter notebook

After it’s up and running, the browser displays an overview of the current directory. Click on New and choose the desired environment to open a new Jupyter notebook. Now we can start cleaning up the data.

 

Cleaning time-based data

In the first step, we import pandas, which provides methods for reading and manipulating data. Here we need a parser for CSV files, for which the method read_csv is provided. At least one path to a file or a buffer is expected as a parameter. If you specify a URL to a CSV as a parameter, pandas will read and process it without any problems. To ensure consistent, traceable data, we access a downloaded file that is available in the repository.

import pandas as pd

df = pd.read_csv("time_series_covid19_confirmed_global.csv")

To check this, we use the instruction df.head() to output the first five lines of the DataFrame (fig. 1).

Fig. 1: The unadjusted DataFrame

 

To get an idea of the structure, we can print the column names with the statement df.columns. You can see in figure 2 that there is one column for each day in the table. Furthermore, there are geo-coordinates for the respective countries, and the countries are partly divided into provinces and federal states. For the time-based data we need neither the geo-coordinates nor the column with the states, so we remove them. We achieve this in pandas with the following method call on the DataFrame:

df.drop(columns=['Lat', 'Long', 'Province/State'], inplace=True)

The drop method expects the data we want to get rid of as a parameter, in this case the three columns Lat, Long, and Province/State. It is important that the names are specified exactly, including upper/lower case and any spaces. The second parameter, inplace, applies the operation directly to our DataFrame; without it, pandas returns the modified DataFrame without changing the original. If you look at the frame with df.head(), you will see that the desired columns have been discarded.

The division of some countries into provinces or states results in multiple entries for some. An example is China. Therefore, it makes sense to group the data by country. For this, pandas provides a powerful grouping function.

df_grouped = df.groupby(['Country/Region'], as_index=False).sum()

The groupby function combines the rows according to the column specified as a parameter. The chained .sum() adds up the values of the respective groups. The return value is a new DataFrame with the grouped and summed data, giving us the time-related data for all countries. Next, we swap rows and columns so that we end up with one column per country and one row per day:

df_grouped = df_grouped.set_index('Country/Region').transpose()

Before transposing, we set the index to Country/Region to get a clean frame. The disadvantage of this is that the new index is then called Country/Region.

The next adjustment is to move the date into a separate column. To do this, we reset the index, which turns the old index (now holding the dates) into a column named index. This column must be renamed and converted to the correct data type. With that, the cleanup is complete (Fig. 3).

df_grouped.reset_index(level=0, inplace=True)
df_grouped.rename(columns={'index': 'Date'}, inplace=True)
df_grouped['Date'] = pd.to_datetime(df_grouped['Date'])

Fig. 2: Overview of the columns in the DataFrame

 

Fig. 3: The cleaned DataFrame

 

The fact that the index continues to be called Country/Region won’t bother us from now on because the final CSV file is saved without any index.

df_grouped.to_csv('../data/worldwide_timeseries.csv', index=False)

This means that the time-series data are adjusted and can be used for each country. If, for example, we only want to use the data for Germany, a new DataFrame can be created as a copy by selecting the desired columns.

df_germany = df_grouped[['Date', 'Germany']].copy()

Packed into a function we get the source code from Listing 1 for the cleanup of the time-series data.

def clean_and_save_timeseries(df):
  drop_columns = ['Lat', 'Long', 'Province/State']

  df.drop(columns=drop_columns, inplace=True)

  df_grouped = df.groupby(['Country/Region'], as_index=False).sum()
  df_grouped = df_grouped.set_index('Country/Region').transpose()
  df_grouped.reset_index(level=0, inplace=True)
  df_grouped.rename(columns={'index': 'Date'}, inplace=True)
  df_grouped['Date'] = pd.to_datetime(df_grouped['Date'])

  df_grouped.to_csv('../data/worldwide_timeseries.csv', index=False)

 

The function receives the DataFrame to be cleaned as a parameter, applies all the steps described above, and saves the resulting CSV file.

 

Clean up case numbers per country

In the repository of Johns Hopkins University you can find further CSVs with case numbers per country, divided into provinces/states; for North America, the administrative districts and geographic coordinates are also listed. With this data, we can generate an overview on a world map. The cleanup is less complex. As with the time-series data, we read the CSV into a pandas DataFrame and look at the first rows (Fig. 4).

import pandas as pd

df = pd.read_csv('04-13-2020.csv')
df.head()

Fig. 4: The original DataFrame

 

In addition to the actual case numbers, information on provinces/states, geo-coordinates and other metadata are included that we do not need in the dashboard. Therefore we remove the columns from our DataFrame.

df.drop(columns=['FIPS','Lat', 'Long_', 'Combined_Key', 'Admin2', 'Province_State'], inplace=True)

To get the summed up figures for the respective countries, we group the data by Country_Region, assign it to a new DataFrame and sum it up.

df_cases = df.groupby(['Country_Region'], as_index=False).sum()

We have thus completed the clean-up operation. We will save the cleaned CSV for later use. Packaged into a function, the steps look like those shown in Listing 2.

def clean_and_save_worldwide(df):
  drop_columns = ['FIPS', 'Lat', 'Long_', 'Combined_Key', 'Admin2', 'Province_State']

  df.drop(columns=drop_columns, inplace=True)

  df_cases = df.groupby(['Country_Region'], as_index=False).sum()
  df_cases.to_csv('Total_cases_wordlwide.csv')

The function receives the DataFrame and applies the steps described above. At the end a cleaned CSV file is saved.

 

Clean up case numbers for Germany

In addition to the data from Johns Hopkins University, we want to use the data from the RKI with case numbers from Germany, broken down by federal state. This will allow us to create a detailed view of the German case numbers. The RKI does not make this data available as CSV in a repository; instead, the figures are displayed in an HTML table on a separate web page and updated daily. pandas provides the function read_html() for such cases: if you pass it a URL to a web page, the page is loaded and parsed, and all tables found are returned as DataFrames in a list. I do not recommend reading the tables directly from the live web page. On the one hand, some pages (including the RKI's) block this; on the other hand, requests should be kept to a minimum, especially during development. For our purposes we therefore store the website locally with a wget https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html. To ensure that the examples work consistently, this page is part of the article's repository. pandas doesn't care whether we read the page remotely or pass it a file path.

import pandas as pd
df = pd.read_html('../Fallzahlen.html', decimal=',', thousands='.')

Since the numbers on the web page are formatted, we pass some information about this to read_html: the decimal separator is a comma and the thousands separator is a dot. pandas can then infer the correct data types when reading the data. To see how many tables were found, we check the length of the list with a simple len(df). In this case it returns 1, which means that exactly one interpretable table was found. We save this DataFrame in a new variable for further cleanup:

df_de = df[0]

Fig. 5: Case numbers from Germany in the unadjusted dataframe

 

Since the table (Fig. 5) has column headings that are difficult to process, we rename them. The number of columns is known, so we do this step pragmatically:

df_de.columns = ['Bundesland', 'Anzahl', 'diff', 'Pro_Tsd', 'Gestorben']

Note: the number of column names must match exactly; should the format of the table change, this must be adjusted. Renaming makes the DataFrame much easier to work with. We don’t need the column with the previous day’s increases in the dashboard, so it will be removed.

Furthermore, the last row of the table contains the grand total, which is not needed either. We access its index (16) directly and remove it. This gives us our final DataFrame with the numbers for the individual German states.

df_de.drop(columns=['diff'], index=[16], inplace=True)

We will now save this data in a new CSV for further use.

df_de.to_csv('cases_germany_states.csv')

A resulting function looks as follows:

def clean_and_save_german_states(df):
  df.columns = ['Bundesland', 'Anzahl', 'diff', 'Pro_Tsd', 'Gestorben']
  df.drop(columns=['diff'], index=[16], inplace=True)
  df.to_csv('cases_germany_states.csv')

As before, the function expects the DataFrame as a parameter. With that, we have cleaned up all the available data and saved it in new CSV files; a short driver script tying the three cleanup functions together is sketched below. In the next step, we start developing the diagrams with Plotly.
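For reference, a minimal driver script might call the three cleanup functions like this. The module name cleanup and the file paths are assumptions and need to match your own repository layout.

import pandas as pd

# Assumed module containing the three functions defined above
from cleanup import (clean_and_save_timeseries,
                     clean_and_save_worldwide,
                     clean_and_save_german_states)

# Paths are assumptions; adjust them to your repository layout
df_ts = pd.read_csv("time_series_covid19_confirmed_global.csv")
df_daily = pd.read_csv("04-13-2020.csv")
df_rki = pd.read_html("Fallzahlen.html", decimal=",", thousands=".")[0]

clean_and_save_timeseries(df_ts)
clean_and_save_worldwide(df_daily)
clean_and_save_german_states(df_rki)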

 

Creating the visualizations

Since we will later use Dash from Plotly to develop the dashboard, we create the desired visualizations in advance in a Jupyter notebook. Plotly is a library for creating interactive diagrams and maps for Python, R, and JavaScript. The diagrams are rendered with Plotly.js, which gives them additional functions (zoom, save as PNG, etc.) without any extra work on our part. Before we start, we have to install a few more required modules.

# Anaconda users
conda install -c plotly plotly

# pip
pip install plotly

In order to create diagrams with Plotly as quickly and easily as possible, we will use Plotly Express, which is now part of Plotly. Plotly Express is the easy-to-use high-level interface to Plotly, which works with “tidy” data and considerably simplifies the creation of diagrams. Since we are working with pandas DataFrames again, we import pandas and Plotly Express at the beginning.

import pandas as pd
import plotly.express as px

Starting with the presentation of the development of infections over time in a line chart, we will import the previously cleaned and saved dataframe.

df_ts = pd.read_csv('../data/worldwide_timeseries.csv')

Thanks to the data cleanup, creating a line chart with Plotly is very easy:

fig_ts = px.line(df_ts,
                 x="Date",
                 y="Germany")

The parameters are mostly self-explanatory. The first parameter specifies which DataFrame is used. With x='Date' and y='Germany' we determine which columns of the DataFrame are plotted: the date on the horizontal axis and the country on the vertical axis. To make the diagram understandable, we set further parameters for the title and the axis labels. For the y-axis we define a linear scale; if we want a logarithmic representation, we can set 'log' instead of 'linear'.

fig_ts.update_layout(xaxis_type='date',
                     xaxis={
                       'title': 'Datum'
                     },
                     yaxis={
                       'title': 'Infektionen',
                       'type': 'linear',
                     },
                     title_text='Infektionen in Deutschland')

To display diagrams in Jupyter notebooks, we need to tell the show() method that we are working in a notebook (Fig. 6).

fig_ts.show('notebook')

Fig. 6: Progression of infections over time

 

That’s all there is to it. The visualizations can be configured in many ways. For this I refer to Plotly’s extensive documentation.

Let us continue with the creation of further visualizations. For the dashboard, cases from Germany are to be presented. For this purpose, the likewise cleaned DataFrame is read in and then sorted in ascending order.

df_fs = pd.read_csv('../data/cases_germany_states.csv')
df_fs.sort_values(by=['Anzahl'], ascending=True, inplace=True)

The first representation is a simple horizontal bar chart (Fig. 7). The code can be seen in Listing 3.

fig_fs = px.bar(df_fs,
                x='Anzahl',
                y='Bundesland',
                hover_data=['Gestorben'],
                height=600,
                orientation='h',
                labels={'Gestorben': 'Bereits verstorben'},
                template='ggplot2')

fig_fs.update_layout(xaxis={
                       'title': 'Anzahl der Infektionen'
                     },
                     yaxis={
                       'title': '',
                     },
                     title_text='Infektionen in Deutschland')

The DataFrame is expected as the first parameter. Then we configure which columns represent the x and y axes. With hover_data we can determine which additional data a bar displays on hover. With labels we can control how a data column is labeled in the front end; since “Gestorben” (“Died”) sounds a bit blunt, we relabel it “Bereits verstorben” (“Already deceased”). The parameter orientation specifies whether we create a vertical (v) or horizontal (h) bar chart, and height sets the height to 600 pixels. Finally, we update the layout parameters, as we did for the line chart.

Fig. 7: Number of cases in Germany as a bar chart

 

To make the distribution easier to see, we create a pie chart.

fig_fs_pie = px.pie(df_fs,
                    values='Anzahl',
                    names='Bundesland',
                    title='Verteilung auf Bundesländer',
                    template='ggplot2')

The parameters are mostly self-explanatory. values defines which column of the DataFrame is used, and names, which column contains the labels. In our case these are the names of the federal states (Fig. 8).

Fig. 8: Case numbers Germany as a pie chart

 

Finally, we generate a world map with the distribution of cases per country. We import the adjusted data of the worldwide infections.

df_ww_cases = pd.read_csv('../data/Total_cases_wordlwide.csv')

Then we create a scatter-geo-plot. Simply put, this will draw a bubble for each country, the size of which corresponds to the number of cases. The code can be seen in Listing 4.

fig_geo_ww = px.scatter_geo(df_ww_cases,
                            locations="Country_Region",
                            hover_name="Country_Region",
                            hover_data=['Confirmed', 'Recovered', 'Deaths'],
                            size="Confirmed",
                            locationmode='country names',
                            text='Country_Region',
                            scope='world',
                            labels={
                              'Country_Region': 'Land',
                              'Confirmed': 'Bestätigte Fälle',
                              'Recovered': 'Wieder geheilt',
                              'Deaths': 'Verstorbene',
                            },
                            projection="equirectangular",
                            size_max=35,
                            template='ggplot2')

The parameters are somewhat more extensive, but no less easy to understand. In principle, it is a mapping of fields from the DataFrame. The parameters labels and hover_data have the same functions as before. With locations we specify which column of our DataFrame contains the locations/countries. So that Plotly Express knows how to place them on a world map, we set locationmode to country names; Plotly can then do the assignment for this DataFrame at country level without exact geo-coordinates. text determines the heading of the mouseover tooltips. The size of the bubbles is calculated from the confirmed cases in the DataFrame, which we pass to size; with size_max we define the maximum bubble size (in this case 35 pixels). With scope we control the focus of the world map. Possible values are 'usa', 'europe', 'asia', 'africa', 'north america', and 'south america'; the map is then not only focused on that region but also limited to it. The appropriate labels and other meta parameters for the display are applied when the layout is updated:

fig_geo_ww.update_layout(
    title_text='Bestätigte Infektionen weltweit',
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular'
    )
)

geo defines how the world map is rendered. The parameter projection_type is worth mentioning, since it controls the map projection. For the dashboard we use equirectangular, also known as the equidistant cylindrical (plate carrée) projection. The finished map is shown in figure 9.

Fig. 9: Distribution of cases worldwide

 

With this we have created all the necessary figures for our dashboard. In the next step we come to the dashboard itself. Since Dash comes from Plotly, we can reuse and configure our diagrams and make them interactive relatively easily.

 

Creating the dashboard with Dash from Plotly

Dash is a Python framework for building web applications. It is based on Flask, Plotly.js and React.js. Dash is very well suited for creating data visualization applications with highly customized user interfaces in Python. It is especially interesting for anyone working with data in Python. Since we have already created all diagrams with Plotly in advance, the next step to a web-based dashboard is straightforward.

We create an interactive dashboard from the data in our cleaned up CSV files – without a single line of HTML and JavaScript. We want to limit ourselves to basic functions and not go deeper into special features, as this would go beyond the scope of this article. For this I refer to the excellent documentation of Dash and the tutorial contained therein. Dash creates the website from a declarative description of the Python structure. To work with Dash, we need the Python module dash. It is installed with conda or pip:

$ conda install -c conda-forge dash
# or
$ pip install dash

For the dashboard we create a Python script called app.py. The first step is to import the required modules. Since we are importing the data with pandas in addition to Dash, we need to import the following packages:

import pandas as pd
import dash
import dash_core_components as dcc
import dash_html_components as html
# Needed later for the callback decorator
from dash.dependencies import Input, Output
# Module with helper functions
import dashboard_figures as mf

Besides the actual Dash module, we need the core components as well as the HTML components: the former contain the components we need for the diagrams, the latter the HTML building blocks. We also import Input and Output for the callback we will write later. Once the modules are imported, the dashboard can be developed (Figure 10).
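Before describing the layout, the Dash application object itself has to be created. The following is a minimal sketch; the external stylesheet URL shown here is the one from the standard Dash tutorial and is an assumption, the article's repository may use a different one.

# Minimal app setup (sketch). The stylesheet URL is taken from the
# standard Dash tutorial and is an assumption here.
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)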

Let’s start with the global data. The breakdown is relatively simple: a heading, with two rows below it, each with two columns. We place the world map on the left, with the facts as text to its right. Below that, a combo box lets us select one or more countries to display in the line chart underneath, and radio buttons let us choose between a linear and a logarithmic representation. The German infection figures follow, consisting of the bar chart and the distribution as a pie chart.

In Dash the layout is described in code. Since we have already generated all diagrams with Plotly Express, we outsource them to an external module with helper functions. I won’t go into this code in detail, because it contains the diagrams we have already created. The code is stored in the GitHub repository. Before the diagrams can be displayed, we need to import the CSV files and have the diagrams created (Listing 5).

df_ww = pd.read_csv('../data/worldwide_timeseries.csv')
df_total = pd.read_csv('../data/Total_cases_wordlwide.csv')
df_germany = pd.read_csv('../data/cases_germany_states.csv')
 
fig_geo_ww = mf.get_wordlwide_cases(df_total)
fig_germany_bar = mf.get_german_barchart(df_germany)
fig_germany_pie = mf.get_german_piechart(df_germany)
 
fig_line = mf.get_lineplot('Germany', df_ww)
 
ww_facts = mf.get_worldwide_facts(df_total)
 
fact_string = '''Weltweit gibt es insgesamt {} Infektionen.
  Davon sind bereits {} Menschen verstorben und {} gelten als geheilt.
  Somit sind offiziell {} Menschen aktiv an einer Coronainfektion erkrankt.'''

fact_string = fact_string.format(ww_facts['total'],
                                 ww_facts['deaths'],
                                 ww_facts['recovered'],
                                 ww_facts['active'])
 
countries = mf.get_country_names(df_ww)
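As an illustration of what such a helper might look like, here is a sketch of get_lineplot based on the line chart built earlier in this article; the actual implementation lives in the repository and may differ.

# Sketch of one helper in dashboard_figures.py, based on the chart built
# earlier; the real implementation in the repository may differ.
import plotly.express as px

def get_lineplot(country, df):
    fig = px.line(df, x='Date', y=country)
    fig.update_layout(xaxis={'title': 'Datum'},
                      yaxis={'title': 'Infektionen', 'type': 'linear'},
                      title_text='Infektionen: ' + country)
    return fig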

Further help functions return a list of all countries and aggregated data. For the combo box, a list of options must be created that contains all countries.

dd_options = []
for key in countries:
  dd_options.append({
    'label': key,
    'value': key
  })

That was all the preparation that was necessary. The layout of the web application consists of the components provided by Dash. The complete description of the layout can be seen in Listing 6.

app.layout = html.Div(children=[
  html.H1(children='COVID-19: Dashboard', style={'textAlign': 'center'}),
  html.H2(children='Weltweite Verteilung', style={'textAlign': 'center'}),
  # World and Facts
  html.Div(children=[
 
    html.Div(children=[
 
        dcc.Graph(figure=fig_geo_ww),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '66%'}),
 
    html.Div(children=[
 
      html.H3(children='Fakten'),
      html.P(children=fact_string)
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '33%'})
 
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
 
  # Combobox and radio buttons
  html.Div(children=[
    html.Div(children=[
      # combobox
      dcc.Dropdown(
        id='country-dropdown',
        options=dd_options,
        value=['Germany'],
        multi=True
      ),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '66%'}),
 
    html.Div(children=[
      # Radio-Buttons
      dcc.RadioItems(
        id='yaxis-type',
        options=[{'label': i, 'value': i} for i in ['Linear', 'Log']],
        value='Linear',
        labelStyle={'display':'inline-block'}
      ),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '33%'})
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
  
  # Lineplot and Facts
  html.Div(children=[
    html.Div(children=[
 
      #Line plot: Infections
      dcc.Graph(figure=fig_line, id='infections_line'),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '100%'}),
 
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
 
  # Germany
  html.H2(children='Zahlen aus Deutschland', style={'textAlign': 'center'}),
  html.Div(children=[
    html.Div(children=[
 
      # Barchart Germany
      dcc.Graph(figure=fig_germany_bar),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '50%'}),
 
    html.Div(children=[
 
      # Pie Chart Germany
      dcc.Graph(figure=fig_germany_pie),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '50%'})
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'})
])

The layout is described by Divs, or the respective wrappers for Divs, as code. There is a wrapper function for each HTML element. These functions can be nested however you like to create the desired layout. Instead of using inline styles, you can also work with your own classes and with a stylesheet. For our purposes, however, the inline styles and the external stylesheet read in at the beginning are sufficient. The approach of a declarative description for layouts has the advantage that we do not have to leave our code and do not have to be an expert in JavaScript or HTML. The focus is on dashboard development. If you look closely, you will find core components for the visualizations in addition to HTML components.

...
# Bar chart Germany
dcc.Graph(figure=fig_germany_bar),
...

The respective diagrams are inserted at these positions. For the line chart, we assign an ID so that we can access the chart later.

dcc.Graph(figure=fig_line, id='infections_line'),

Using the selected values from the combo box and the radio buttons, we can adjust which lines are displayed. The combo box provides a selection of countries; multiple selection is possible so that several countries can be shown in one diagram (see the # combobox section in Listing 6). The radio buttons control whether the y-axis uses a linear or a logarithmic scale (see the # Radio-Buttons section in Listing 6). In order to apply the selected options to the chart, we need to create an update function and annotate it with an @app.callback decorator (Listing 7).

# Interactive Line Chart
@app.callback(
  Output('infections_line', 'figure'),
  [Input('country-dropdown', 'value'), Input('yaxis-type', 'value')])
def update_graph(countries, axis_type):
  countries = countries if len(countries) > 0 else ['Germany']
  data_value = []
  for country in countries:
    data_value.append(dict(
      x= df_ww['Date'], 
      y= df_ww[country], 
      type= 'lines', 
      name= str(country)
    ))
 
  title = ', '.join(countries)
  title = 'Infektionen: ' + title
  return {
    'data': data_value,
    'layout': dict( 
      yaxis={
        'type': 'linear' if axis_type == 'Linear' else 'log'
      },
      hovermode='closest',
      title = title
    )
  }

The decorator has inputs and outputs. Output defines which element the return value is applied to and which of its properties is updated: in our case, the figure property of the line chart with the ID infections_line. Input describes which input values are passed to the function; here, these are the values from the combo box and the radio button selection. The decorated function receives these values, and we can work with them.

Since the countries arrive as a list, a line has to be configured for each of them. In our example this is done with a simple for-in loop that picks the required column from the DataFrame for each country. We then return the new configuration of the diagram; in the returned layout, we also set the y-axis to a linear or logarithmic scale, depending on the selection. All that remains is to start a development server:

if __name__ == '__main__':
  app.run_server(debug=True)

If we run the Python script, a development server starts and the application can be opened in the browser (by default at http://127.0.0.1:8050). If one or more countries are selected, they are displayed in the line chart. The development server supports hot reloading: if we change something in the code and save it, the page is automatically reloaded. With this we have created our own coronavirus/COVID-19 dashboard with interactive diagrams, all without having to write HTML, JavaScript, or CSS. Congratulations!

Fig. 10: The finished dashboard

 

 

Summary

Together we have worked through a simple data project, from cleaning up data and creating visualizations to providing a dashboard. Even with small projects, you can see that cleanup and visualization take up most of the time. If the data is poor, the end product is guaranteed not to be any better. The dashboard itself, by contrast, is created quickly.

Charts can be created very quickly with Plotly and are interactive out of the box, so they are not just static images. The ability to create dashboards quickly is especially helpful in the prototypical development of larger data visualization projects. There is no need for a large team of data scientists and developers; ideas can be tried out quickly and discarded if necessary. Plotly and Dash are more than sufficient for many purposes. If the data is cleaned up right from the start, the subsequent steps are much easier. If you are dealing with data and its visualization, you should take a look at Plotly and not ignore Dash.

One final remark: in this example I have not used predictive algorithms or models. On the one hand, the amount of data is too small; on the other hand, predictions on exponentially growing data are always difficult and should be treated with great care, otherwise wrong conclusions will be drawn.

The post Let’s visualize the coronavirus pandemic appeared first on ML Conference.

]]>