OpenAI Embeddings

Embedding vectors (embeddings for short) play a central role in processing and interpreting unstructured data such as text, images, or audio files. Embeddings convert unstructured data, no matter how complex, into a structured form that software can easily process. OpenAI offers such embeddings, and this article explains how they work and how they can be used.

Data has always played a central role in the development of software solutions. One of the biggest challenges in this area is the processing and interpretation of unstructured data such as text, images, or audio files. This is where embedding vectors (called embeddings for short) come into play – a technology that is becoming increasingly important in the development of software solutions with the integration of AI functions.


Embeddings are essentially a technique for converting unstructured data into a structure that can be easily processed by software. They are used to transform complex data such as words, sentences, or even entire documents into a vector space, with similar elements placed close to each other. These vector representations allow machines to recognize and exploit nuances and relationships in the data, which is essential for a variety of applications such as natural language processing (NLP), image recognition, and recommendation systems.

OpenAI, the company behind ChatGPT, offers models for creating embeddings for texts, among other things. At the end of January 2024, OpenAI presented new versions of these embedding models, which are more powerful and cost-effective than their predecessors. In this article, after a brief introduction to embeddings, we’ll take a closer look at the OpenAI embeddings and the recently introduced innovations, discuss how they work, and examine how they can be used in various software development projects.

Embeddings briefly explained

Imagine you’re in a room full of people and your task is to group these people based on their personality. To do this, you could start asking questions about different personality traits. For example, you could ask how open someone is to new experiences and rate the answer on a scale from 0 to 1. Each person is then assigned a number that represents their openness.

Next, you could ask about another personality trait, such as their sense of duty, and again assign a score between 0 and 1. Now each person has two numbers that together form a vector in a two-dimensional space. By asking more questions about different personality traits and rating them in a similar way, you can create a multidimensional vector for each person. In this vector space, people who have similar vectors can then be considered similar in terms of their personality.

In the world of artificial intelligence, we use embeddings to transform unstructured data into an n-dimensional vector space. Similar to how a person’s personality traits are represented in the vector space above, each point in this vector space represents an element of the original data (such as a word or phrase) in a way that is understandable and processable by computers.

OpenAI Embeddings

OpenAI embeddings extend this basic concept. Instead of using simple features like personality traits, OpenAI models use advanced algorithms and big data to achieve a much deeper and more nuanced representation of the data. The model not only analyzes individual words, but also looks at the context in which those words are used, resulting in more accurate and meaningful vector representations.

Another important difference is that OpenAI embeddings are based on sophisticated machine learning models that can learn from a huge amount of data. This means that they can recognize subtle patterns and relationships in the data that go far beyond what could be achieved by simple scaling and dimensioning, as in the initial analogy. This leads to a significantly improved ability to recognize and exploit similarities and differences in the data.

 


Individual values are not meaningful

While in the personality trait analogy, each individual value of a vector can be directly related to a specific characteristic – for example openness to new experiences or a sense of duty – this direct relationship no longer exists with OpenAI embeddings. In these embeddings, you cannot simply look at a single value of the vector in isolation and draw conclusions about specific properties of the input data. For example, a specific value in the embedding vector of a sentence cannot be used to directly deduce how friendly or not this sentence is.

The reason for this lies in the way machine learning models, especially those used to create embeddings, encode information. These models work with complex, multi-dimensional representations where the meaning of a single element (such as a word in a sentence) is determined by the interaction of many dimensions in vector space. Each aspect of the original data – be it the tone of a text, the mood of an image, or the intent behind a spoken utterance – is captured by the entire spectrum of the vector rather than by individual values within that vector.

Therefore, when working with OpenAI embeddings, it’s important to understand that the interpretation of these vectors is not intuitive or direct. You need algorithms and analysis to draw meaningful conclusions from these high-dimensional and densely coded vectors.

Comparison of vectors with cosine similarity

A central element in dealing with embeddings is measuring the similarity between different vectors. One of the most common methods for this is cosine similarity. This measure is used to determine how similar two vectors are and therefore the data they represent.

To illustrate the concept, let’s start with a simple example in two dimensions. Imagine two vectors in a plane, each represented by a point in the coordinate system. The cosine similarity between these two vectors is determined by the cosine of the angle between them. If the vectors point in the same direction, the angle between them is 0 degrees and the cosine of this angle is 1, indicating maximum similarity. If the vectors are orthogonal (i.e. the angle is 90 degrees), the cosine is 0, indicating no similarity. If they are opposite (180 degrees), the cosine is -1, indicating maximum dissimilarity.

Figure 1 – Cosine similarity

A Python Notebook to try out
Accompanying this article is a Google Colab Python Notebook which you can use to try out many of the examples shown here. Colab, short for Colaboratory, is a free cloud service offered by Google. Colab makes it possible to write and execute Python code in the browser. It’s based on Jupyter Notebooks, a popular open-source web application that makes it possible to combine code, equations, visualizations, and text in a single document-like format. The Colab service is well suited for exploring and experimenting with the OpenAI API using Python.

In practice, especially when working with embeddings, we are dealing with n-dimensional vectors. The calculation of the cosine similarity remains conceptually the same, even if the calculation is more complex in higher dimensions. Formally, the cosine similarity of two vectors A and B in an n-dimensional space is calculated by the scalar product (dot product) of these vectors divided by the product of their lengths:

Figure 2 – Calculation of cosine similarity

The normalization of vectors plays an important role in the calculation of cosine similarity. If a vector is normalized, this means that its length (norm) is set to 1. For normalized vectors, the scalar product of two vectors is directly equal to the cosine similarity since the denominators in the formula from Figure 2 are both 1. OpenAI embeddings are normalized, which means that to calculate the similarity between two embeddings, only their scalar product needs to be calculated. This not only simplifies the calculation, but also increases efficiency when processing large quantities of embeddings.
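As a small illustration (assuming NumPy is installed), the following sketch compares the general cosine similarity formula with the simplified dot product for normalized vectors:

import numpy as np

def cosine_similarity(a, b):
    # General cosine similarity for arbitrary (non-normalized) vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two example vectors (not normalized)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])
print(cosine_similarity(a, b))

# After normalization, the plain dot product yields the same value
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
print(np.dot(a_norm, b_norm))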

OpenAI Embeddings API

OpenAI offers a web API for creating embeddings. The exact structure of this API, including code examples for curl, Python and Node.js, can be found in the OpenAI reference documentation.

OpenAI does not use the LLM from ChatGPT to create embeddings, but rather specialized models. They were developed specifically for the creation of embeddings and are optimized for this task. Their development was geared towards generating high-dimensional vectors that represent the input data as well as possible. In contrast, ChatGPT is primarily optimized for generating and processing text in a conversational form. The embedding models are also more efficient in terms of memory and computing requirements than more extensive language models such as ChatGPT. As a result, they are not only faster but much more cost-effective.

New embedding models from OpenAI

Until recently, OpenAI recommended the use of the text-embedding-ada-002 model for creating embeddings. This model converts text into a sequence of floating point numbers (vectors) that represent the concepts within the content. The ada v2 model generated embeddings with a size of 1536 dimensions and delivered solid performance in benchmarks such as MIRACL and MTEB, which are used to evaluate model performance in different languages and tasks.

At the end of January 2024, OpenAI presented new, improved models for embeddings:

text-embedding-3-small: A smaller, more efficient model with improved performance compared to its predecessor. It performs better in benchmarks and is significantly cheaper.
text-embedding-3-large: A larger model that is more powerful and creates embeddings with up to 3072 dimensions. It shows the best performance in the benchmarks but is slightly more expensive than ada v2.

A new function of the two new models allows developers to adjust the size of the embeddings when generating them without significantly losing their concept-representing properties. This enables flexible adaptation, especially for applications that are limited in terms of available memory and computing power.
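As a minimal sketch of this feature, the call below requests a shortened embedding via the new dimensions parameter; the API key handling is simplified here, and the chosen size of 256 is just an arbitrary example value:

from openai import OpenAI

client = OpenAI(api_key="...")  # replace with your own API key

response = client.embeddings.create(
    input="King",
    model="text-embedding-3-large",
    dimensions=256,  # request a shortened embedding vector
)
print(len(response.data[0].embedding))  # 256 instead of the default 3072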

Readers who are interested in the details of the new models can find them in the announcement on the OpenAI blog. The exact costs of the various embedding models can be found on the OpenAI pricing page.

New embedding models
At the end of January 2024, OpenAI introduced new models for creating embeddings. All code examples and result values contained in this article already refer to the new text-embedding-3-large model.

Create embeddings with Python

In the following section, the use of embeddings is demonstrated using a few code examples in Python. The code examples are designed so that they can be tried out in Python Notebooks. They are also available in a similar form in the accompanying Google Colab notebook mentioned above.
Listing 1 shows how to create embeddings with the Python SDK from OpenAI. In addition, numpy is used to show that the embeddings generated by OpenAI are normalized.

Listing 1

from openai import OpenAI
from google.colab import userdata
import numpy as np

# Create OpenAI client
client = OpenAI(
    api_key=userdata.get('openaiKey'),
)

# Define a helper function to calculate embeddings
def get_embedding_vec(input):
  """Returns the embeddings vector for a given input"""
  return client.embeddings.create(
        input=input,
        model="text-embedding-3-large", # We use the new embeddings model here (announced end of Jan 2024)
        # dimensions=... # You could limit the number of output dimensions with the new embeddings models
    ).data[0].embedding

# Calculate the embedding vector for a sample sentence
vec = get_embedding_vec("King")
print(vec[:10])

# Calculate the magnitude of the vector. It should be 1, as
# embedding vectors from OpenAI are always normalized.
magnitude = np.linalg.norm(vec)
print(magnitude)

Similarity analysis with embeddings

In practice, OpenAI embeddings are often used for similarity analysis of texts (e.g. searching for duplicates, finding relevant text sections in relation to a customer query, and grouping texts). Embeddings are very well suited for this, as they work in a fundamentally different way than character-based comparison methods such as Levenshtein distance. While Levenshtein distance measures the similarity between texts by counting the minimum number of single-character operations (insert, delete, replace) required to transform one text into another, embeddings capture the meaning and context of words or sentences. They consider the semantic and contextual relationships between words, going far beyond a simple character-based level of comparison.

As a first example, let’s look at the following three sentences (the following examples are in English, but embeddings work analogously for other languages and cross-language comparisons are also possible without any problems):

I enjoy playing soccer on weekends.
Football is my favorite sport. Playing it on weekends with friends helps me to relax.
In Austria, people often watch soccer on TV on weekends.

In the first and second sentences, two different words are used for the same topic: soccer and football. The third sentence contains the original word soccer, but it has a fundamentally different meaning from the first two sentences. If you calculate the similarity of sentence 1 to 2, you get 0.75. The similarity of sentence 1 to 3 is only 0.51. The embeddings have therefore reflected the meaning of the sentences and not the choice of words.
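The values above were calculated with the helper function get_embedding_vec from Listing 1; a sketch of the comparison could look like this (the exact numbers may vary slightly depending on the model version):

import numpy as np

sentences = [
    "I enjoy playing soccer on weekends.",
    "Football is my favorite sport. Playing it on weekends with friends helps me to relax.",
    "In Austria, people often watch soccer on TV on weekends.",
]

# Embeddings from OpenAI are normalized, so the dot product is the cosine similarity
embeddings = [get_embedding_vec(s) for s in sentences]
print(np.dot(embeddings[0], embeddings[1]))  # approx. 0.75
print(np.dot(embeddings[0], embeddings[2]))  # approx. 0.51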

Here is another example that requires an understanding of the context in which words are used:
He is interested in Java programming.
He visited Java last summer.
He recently started learning Python programming.

In sentence 2, Java refers to a place, while sentences 1 and 3 have something to do with software development. The similarity of sentence 1 to 2 is 0.536, but that of 1 to 3 is 0.587. As expected, the different meaning of the word Java has an effect on the similarity.

The next example deals with the treatment of negations:
I like going to the gym.
I don’t like going to the gym.
I don’t dislike going to the gym.

Sentences 1 and 2 say the opposite, while sentence 3 expresses something similar to sentence 1. This is reflected in the similarities of the embeddings. Sentence 1 compared to sentence 2 yields a cosine similarity of 0.714, while sentence 1 compared to sentence 3 yields 0.773. It is perhaps surprising that there is no major difference between the embeddings. However, it’s important to remember that all three sentences are about the same topic: the question of whether you like going to the gym to work out.

The last example shows that the OpenAI embedding models, just like ChatGPT, have a certain “knowledge” of concepts and contexts built in through training on texts about the real world.

I need to get better slicing skills to make the most of my Voron.
3D printing is a worthwhile hobby.
Can I have a slice of bread?

In order to compare these sentences in a meaningful way, it’s important to know that Voron is the name of a well-known open-source project in the field of 3D printing. It’s also important to note that slicing is a term that plays an important role in 3D printing. The third sentence also mentions slicing, but in a completely different context to sentence 1. Sentence 2 mentions neither slicing nor Voron. However, the trained knowledge enables the OpenAI Embeddings model to recognize that sentences 1 and 2 have a thematic connection, but sentence 3 means something completely different. The similarity of sentence 1 and 2 is 0.333 while the comparison of sentence 1 and 3 is only 0.263.

Similarity values are not percentages

The similarity values from the comparisons shown above are the cosine similarity of the respective embeddings. Although the cosine similarity values range from -1 to 1, with 1 being the maximum similarity and -1 the maximum dissimilarity, they are not to be interpreted directly as percentages of agreement. Instead, these values should be considered in the context of their relative comparisons. In applications such as searching text sections in a knowledge base, the cosine similarity values are used to sort the text sections in terms of their similarity to a given query. It is important to see the values in relation to each other. A higher value indicates a greater similarity, but the exact meaning of the value can only be determined by comparing it with other similarity values. This relative approach makes it possible to effectively identify and prioritize the most relevant and similar text sections.

Embeddings and RAG solutions

Embeddings play a crucial role in Retrieval Augmented Generation (RAG) solutions, an approach in artificial intelligence that combines the capabilities of information retrieval and text generation. Embeddings are used in RAG systems to retrieve relevant information from large data sets or knowledge databases. It is not necessary for these databases to have been included in the original training of the embedding models. They can be internal databases that are not available on the public Internet.
With RAG solutions, queries or input texts are converted into embeddings. The cosine similarity to the existing document embeddings in the database is then calculated to identify the most relevant text sections from the database. This retrieved information is then used by a text generation model such as ChatGPT to generate contextually relevant responses or content.

Vector databases play a central role in the functioning of RAG systems. They are designed to efficiently store, index and query high-dimensional vectors. In the context of RAG solutions and similar systems, vector databases serve as storage for the embeddings of documents or pieces of data that originate from a large amount of information. When a user makes a request, this request is first transformed into an embedding vector. The vector database is then used to quickly find the vectors that correspond most closely to this query vector – i.e. those documents or pieces of information that have the highest similarity. This process of quickly finding similar vectors in large data sets is known as Nearest Neighbor Search.
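At prototype scale, the core retrieval step can be sketched without a dedicated vector database; the document texts below are placeholders, and get_embedding_vec is the helper function from Listing 1:

import numpy as np

# A tiny in-memory "knowledge base" (placeholder texts)
documents = [
    "The VX-2000 requires regular lubrication.",
    "Our support hotline is available on weekdays.",
    "The warranty covers manufacturing defects for two years.",
]
doc_embeddings = [get_embedding_vec(d) for d in documents]

query = "Which oil does the machine need?"
query_embedding = get_embedding_vec(query)

# Rank documents by cosine similarity (dot product, as embeddings are normalized)
scores = [np.dot(query_embedding, d) for d in doc_embeddings]
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")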

Challenge: Splitting documents

A detailed explanation of how RAG solutions work is beyond the scope of this article. However, the explanations regarding embeddings are hopefully helpful for getting started with further research on the topic of RAGs.

However, one specific point should be mentioned at the end of this article: a particular and often underestimated challenge in the development of RAG systems that go beyond Hello World prototypes is the splitting of longer texts. Splitting is necessary because the OpenAI embedding models are limited to just over 8,000 tokens. One token corresponds to approximately four characters in the English language (see the OpenAI tokenizer linked below).
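Using this rough rule of thumb of four characters per token, a naive splitter can be sketched in a few lines; the chunk size of 1,000 characters (roughly 250 tokens) and the sample text are arbitrary example values, and the drawbacks of exactly this kind of naive splitting are discussed next:

def split_naive(text, max_chars=1000):
    # Naively split a text into chunks of at most max_chars characters
    # (roughly max_chars / 4 tokens in English)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

sample_text = "VX-2000 requires regular lubrication to maintain its smooth operation. " * 100  # placeholder document
for chunk in split_naive(sample_text):
    print(len(chunk))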

It’s not easy to find a good strategy for splitting documents. Naive approaches such as splitting after a certain number of characters (as sketched above) can lead to the context of text sections being lost or distorted. Anaphoric references are a typical example of this. The following two sentences illustrate the problem:

VX-2000 requires regular lubrication to maintain its smooth operation.
The machine requires the DX97 oil, as specified in the maintenance section of this manual.

The machine in the second sentence is an anaphoric reference to the first sentence. If the text were split after the first sentence, the essential context would be lost, namely that the DX97 oil is necessary for the VX-2000 machine.

There are various approaches to solving this problem, which will not be discussed here to keep this article concise. However, it is essential for developers of such software systems to be aware of the problem and understand how splitting large texts affects embeddings.


Summary

Embeddings play a fundamental role in the modern AI landscape, especially in the field of natural language processing. By transforming complex, unstructured data into high-dimensional vector spaces, embeddings enable in-depth understanding and efficient processing of information. They form the basis for advanced technologies such as RAG systems and facilitate tasks such as information retrieval, context analysis, and data-driven decision-making.

OpenAI’s latest innovations in the field of embeddings, introduced at the end of January 2024, mark a significant advance in this technology. With the introduction of the new text-embedding-3-small and text-embedding-3-large models, OpenAI now offers more powerful and cost-efficient options for developers. These models not only show improved performance in standardized benchmarks, but also offer the ability to find the right balance between performance and memory requirements on a project-specific basis through customizable embedding sizes.

Embeddings are a key component in the development of intelligent systems that aim to process natural language information in a useful way.

Links and Literature:

  1. https://colab.research.google.com/gist/rstropek/f3d4521ed9831ae5305a10df84a42ecc/embeddings.ipynb
  2. https://platform.openai.com/docs/api-reference/embeddings/create
  3. https://openai.com/blog/new-embedding-models-and-api-updates
  4. https://openai.com/pricing
  5. https://platform.openai.com/tokenizer

Address Matching with NLP in Python

Discover the power of address matching in real estate data management with this comprehensive guide. Learn how to leverage natural language processing (NLP) techniques in Python, including open-source libraries like spaCy and fuzzywuzzy, to parse, clean, and match addresses. From breaking down data silos to geocoding and point-in-polygon searches, this article provides a step-by-step approach to creating a source-of-truth real estate dataset. Whether you're working in geospatial analysis, real estate data management, logistics, or compliance, accurate address matching is the key to unlocking valuable insights.

Address matching isn’t always simple; addresses often need to be parsed and standardized into a consistent format before they can be used as identifiers for matching. Address matching is an important step in the following use cases:

  1. Geospatial Analysis: Accurate address matching forms the foundation of geospatial analysis, allowing organizations to make informed decisions about locations, market trends, and resource allocation across various industries like retail and media.
  2. Real Estate Data Management: In the real estate industry, precise address matching facilitates property valuation, market analysis, and portfolio management.
  3. Logistics and Navigation: Efficient routing and delivery depend on accurate address matching.
  4. Compliance and Regulation: Many regulatory requirements mandate precise address data, such as tax reporting and census data collection.


At Cherre, the leading real estate data management company, we specialize in accurate address matching for the second use case. Whether you’re an asset manager, portfolio manager, or real estate investor, a building represents the atomic unit of all financial, legal, and operating information. However, real estate data lives in many silos, which makes having a unified view of properties difficult. Address matching is an important step in breaking down data silos in real estate. By joining disparate datasets on address, we can unlock many opportunities for further portfolio analysis.

Data Silos in Real Estate

Real estate data usually fall into the following categories: public, third party, and internal. Public data is collected by governmental agencies and made available publicly, such as land registers. The quality of public data is generally not spectacular and the data update frequency is usually delayed, but it provides the most comprehensive coverage geographically. Don’t be surprised if addresses from public data sources are misaligned and misspelled.

Third party data usually come from data vendors, whose business models focus on extracting information as datasets and monetizing those datasets. These datasets usually have good data quality and are much more timely, but are limited in geographical coverage. Addresses from data vendors are usually fairly clean compared to public data, but the same property may not carry the same address designation across different vendors. For large commercial buildings with multiple entrances and addresses, this creates an additional layer of complexity.

Lastly, internal data is information that is collected by the information technology (IT) systems of property owners and asset managers. These systems can serve various functions, from leasing to financial reporting, and are often set up to mirror the business’s organizational structures and functions. Depending on the governance standards and data practices, the quality of these datasets can vary, and data coverage only encompasses the properties in the owner’s portfolio. Addresses in these systems can vary widely: some systems record addresses at the unit level, while others designate the entire property. These systems also may not standardize addresses inherently, which makes it difficult to match property records across multiple systems.

With all these variations in data quality, coverage, and address formats, we can see the need for having standardized addresses to do basic property-level analysis.


Address Matching Using the Parse-Clean-Match Strategy

In order to match records across multiple datasets, the address parse-clean-match strategy works very well regardless of region. By breaking down addresses into their constituent pieces, we have many more options for associating properties with each other. Many of the approaches for this strategy use simple natural language processing (NLP) techniques.


Address Parsing

Before we can associate addresses with each other, we must first parse the address. Address parsing is the process of breaking down each address string into its constituent components. Components in addresses will vary by country.

In the United States and Canada, addresses are generally formatted as the following:

{street_number} {street_name}

{city}, {state_or_province} {postal_code}

{country}

In the United Kingdom, addresses are formatted very similarly as in the U.S. and Canada, with an additional optional locality designation:

{building_number} {street_name}

{locality (optional)}

{city_or_town}

{postal_code}

{country}

 

French addresses vary slightly from U.K. addresses with the order of postal code and city:

{building_number} {street_name}

{postal_code} {city}

{country}

 

German addresses take the changes in French addresses and then swap the order of street name and building number:

{street_name} {building_number} {postal_code} {city} {country}

 

Despite the slight variations across countries’ address formats, addresses generally have the same components, which makes this an easily digestible NLP problem. We can break down the process into the following steps:

  1. Tokenization: Split the address into its constituent words. This step segments the address into manageable units.
  2. Named Entity Recognition (NER): Identify entities within the address, such as street numbers, street names, cities, postal codes, and countries. This involves training or using pre-trained NER models to label the relevant parts of the address.
  3. Sequence Labeling: Use sequence labeling techniques to tag each token with its corresponding entity.

Let’s demonstrate address parsing with a sample Python code snippet using the spaCy library. SpaCy is an open-source software library containing many neural network models for NLP functions. SpaCy supports models across 23 different languages and allows data scientists to train custom models for their own datasets. We will demonstrate address parsing using one of spaCy’s out-of-the-box models for the address of a historical landmark: David Bowie’s Berlin apartment.

 

import spacy

# Load the small English NER model
# (install it first with: python -m spacy download en_core_web_sm)
model = spacy.load("en_core_web_sm")

# Address to be parsed
address = "Hauptstraße 155, 10827 Berlin"

# Tokenize and run NER
doc = model(address)

# Extract address components
street_number = ""
street_name = ""
city = ""
state = ""
postal_code = ""

# Note: mapping spaCy entity labels to address parts like this is a rough
# heuristic and will not be reliable for every address
for token in doc:
    if token.ent_type_ == "GPE":  # Geopolitical Entity (City)
        city = token.text
    elif token.ent_type_ == "LOC":  # Location (State/Province)
        state = token.text
    elif token.ent_type_ == "DATE":  # Postal Code
        postal_code = token.text
    else:
        if token.is_digit:
            street_number = token.text
        else:
            street_name += token.text + " "

# Print the parsed address components
print("Street Number:", street_number)
print("Street Name:", street_name)
print("City:", city)
print("State:", state)
print("Postal Code:", postal_code)

Now that we have a parsed address, we can now clean each address component.

Address Cleaning

Address cleaning is the process of converting parsed address components into a consistent and uniform format. This is particularly important for any public data with misspelled, misformatted, or mistyped addresses. We want to have addresses follow a consistent structure and notation, which will make further data processing much easier.

To standardize addresses, we need to standardize each component, and how the components are joined. This usually entails a lot of string manipulation. There are many open source libraries (such as libpostal) and APIs that can automate this step, but we will demonstrate the basic premise using simple regular expressions in Python.


import pandas as pd
import re

# Sample dataset with tagged address components
data = {
    'Street Name': ['Hauptstraße', 'Schloß Nymphenburg', 'Mozartweg'],
    'Building Number': ['155', '1A', '78'],
    'Postal Code': ['10827', '80638', '54321'],
    'City': ['Berlin', ' München', 'Hamburg'],
}

df = pd.DataFrame(data)

# Functions with typical necessary steps for each address component
# We uppercase all text for easier matching in the next step

def standardize_street_name(street_name):
    # Remove special characters and abbreviations, uppercase names
    standardized_name = re.sub(r'[^\w\s]', '', street_name)
    return standardized_name.upper()

def standardize_building_number(building_number):
    # Remove any non-alphanumeric characters (although exceptions exist)
    standardized_number = re.sub(r'\W', '', building_number)
    return standardized_number

def standardize_postal_code(postal_code):
    # Make sure we have consistent formatting (i.e. leading zeros)
    return postal_code.zfill(5)

def standardize_city(city):
    # Upper case the city, normalize spacing between words
    return ' '.join(word.upper() for word in city.split())

# Apply standardization functions to our DataFrame
df['Street Name'] = df['Street Name'].apply(standardize_street_name)
df['Building Number'] = df['Building Number'].apply(standardize_building_number)
df['Postal Code'] = df['Postal Code'].apply(standardize_postal_code)
df['City'] = df['City'].apply(standardize_city)

# Finally create a standardized full address (without commas)
df['Full Address'] = df['Street Name'] + ' ' + df['Building Number'] + ' ' + df['Postal Code'] + ' ' + df['City']

Address Matching

Now that our addresses are standardized into a consistent format, we can finally match addresses from one dataset to addresses in another dataset. Address matching involves identifying and associating similar or identical addresses from different datasets. When two full addresses match exactly, we can easily associate the two records through a direct string match.
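Assuming both datasets carry the standardized Full Address column created earlier, a direct match boils down to a simple join; the DataFrames below are hypothetical examples:

import pandas as pd

# Hypothetical standardized datasets from two different sources
df_public = pd.DataFrame({
    'Full Address': ['HAUPTSTRASSE 155 10827 BERLIN', 'MOZARTWEG 78 54321 HAMBURG'],
    'Parcel ID': ['P-001', 'P-002'],
})
df_vendor = pd.DataFrame({
    'Full Address': ['HAUPTSTRASSE 155 10827 BERLIN', 'GOETHEPLATZ 1 80638 MUENCHEN'],
    'Valuation': [1200000, 950000],
})

# Exact string match on the standardized address
matched = df_public.merge(df_vendor, on='Full Address', how='inner')
print(matched)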

 

When addresses don’t match, we will need to apply fuzzy matching on each address component. Below is an example of how to do fuzzy matching on one of the standardized address components for street names. We can apply the same logic to city and state as well.


from fuzzywuzzy import fuzz

# Sample list of street names from another dataset
street_addresses = [
    "Hauptstraße",
    "Schlossallee",
    "Mozartweg",
    "Bergstraße",
    "Wilhelmstraße",
    "Goetheplatz",
]

# Target address component (we are using street name)
target_street_name = "Hauptstrasse " # Note the different spelling and space 

# Similarity threshold
# Increase this number if too many false positives
# Decrease this number if not enough matches
threshold = 80

# Perform fuzzy matching
matches = []

for address in street_addresses:
    similarity_score = fuzz.partial_ratio(address, target_street_name)
    if similarity_score >= threshold:
        matches.append((address, similarity_score))

matches.sort(key=lambda x: x[1], reverse=True)

# Display matched street name
print("Target Street Name:", target_street_name)
print("Matched Street Names:")
for match in matches:
    print(f"{match[0]} (Similarity: {match[1]}%)")

Up to here, we have solved the problem for properties with the same address identifiers. But what about the large commercial buildings with multiple addresses?

Other Geospatial Identifiers

Addresses are not the only geospatial identifiers in the world of real estate. An address typically refers to the location of a structure or property, often denoting a street name and house number.  There are actually four other geographic identifiers in real estate:

 

  1. A “lot” represents a portion of land designated for specific use or ownership.
  2. A “parcel” extends this notion to a legally defined piece of land with boundaries, often associated with property ownership and taxation.
  3. A “building” encompasses the physical structures erected on these parcels, ranging from residential homes to commercial complexes.
  4. A “unit” is a sub-division within a building, typically used in multi-unit complexes or condominiums. These can be commercial complexes (like office buildings) or residential complexes (like apartments).

 

What this means is that we actually have multiple ways of identifying real estate objects, depending on the specific persona and use case. For example, leasing agents focus on the units within a building for tenants, while asset managers optimize for the financial performance of entire buildings. The nuances of these details are also codified in many real estate software systems (found in internal data), in the databases of governments (found in public data), and across databases of data vendors (found in third party data). In public data, we often encounter lots and parcels. In vendor data, we often find addresses (with or without units). In real estate enterprise resource planning systems, we often find buildings, addresses, units, and everything else in between.

In the case of large commercial properties with multiple addresses, we need to associate various addresses with each physical building. In this case, we can use geocoding and point-in-polygon searches.

Geocoding Addresses

Geocoding is the process of converting addresses into geographic coordinates. The most common form is latitude and longitude. European address geocoding requires a robust understanding of local address formats, postal codes, and administrative regions. Luckily, we have already standardized our addresses into an easily geocodable format.

Many commercial APIs exist for geocoding addresses in bulk, but we will demonstrate geocoding using a popular Python library, Geopy, to geocode addresses.

from geopy.geocoders import Nominatim

# Nominatim is a free OpenStreetMap-based geocoder; use a descriptive user_agent
geolocator = Nominatim(user_agent="my_geocoder")
location = geolocator.geocode("1 Canada Square, London")
if location is not None:  # geocode returns None if nothing is found
    print(location.latitude, location.longitude)

 

 

Now that we’ve converted our addresses into latitude and longitude, we can use point-in-polygon searches to associate addresses with buildings.

Point-in-Polygon Search

A point-in-polygon search is a technique to determine if a point is located within the boundaries of a given polygon.

The “point” in a point-in-polygon search refers to a specific geographical location defined by its latitude and longitude coordinates. We have already obtained our points by geocoding our addresses.

The “polygon” is a closed geometric shape with three or more sides, which is usually characterized by a set of vertices (points) connected by edges, forming a closed loop. Building polygons can be downloaded from open source sites like OpenStreetMap or from specific data vendors. The quality and detail of the OpenStreetMap building data may vary, and the accuracy of the point-in-polygon search depends on the precision of the building geometries.

While the concept seems complex, the code for creating this lookup is quite simple. We demonstrate a simplified example using our previous example of 1 Canada Square in London.


import json
from shapely.geometry import shape, Point

# Load the GeoJSON data
with open('building_data.geojson') as geojson_file:
    building_data = json.load(geojson_file)

# Latitude and longitude of 1 Canada Square in Canary Wharf
# (west of the prime meridian, hence the negative longitude)
lat, lon = 51.5049, -0.0195

# Create a Point geometry for 1 Canada Square
point_1_canada = Point(lon, lat)

# See if point is within any of the polygons
for feature in building_data['features']:
    building_geometry = shape(feature['geometry'])

    if point_1_canada.within(building_geometry):
        print(f"Point is within this building polygon: {feature}")
        break
else:
    print("Point is not within any building polygon in the dataset.")

Using this technique, we can properly identify all addresses associated with this property.


Summary

Addresses in real life are confusing because they are the physical manifestation of many disparate decisions in city planning throughout the centuries-long life of a city. But using addresses to match across different datasets doesn’t have to be confusing.

Using some basic NLP and geocoding techniques, we can easily associate property-level records across various datasets from different systems. Only through breaking down data silos can we have more holistic views of property behaviors in real estate.

Author Biography

Alyce Ge is a data scientist at Cherre, the industry-leading real estate data management and analytics platform. Prior to joining Cherre, Alyce held data science and analytics roles at a variety of technology companies focusing on real estate and business intelligence solutions. Alyce is a Google Cloud-certified machine learning engineer, Google Cloud-certified data engineer, and Triplebyte-certified data scientist. She earned her Bachelor of Science in Applied Mathematics from Columbia University in New York.

 

On pythonic tracks

Python has established itself as a quasi-standard in the field of machine learning over the last few years, in part due to the broad availability of libraries. Oracle was understandably not keen to simply watch this trend — after all, Java needs to remain widely used if Oracle wants to keep earning serious money with it. Some time ago, Oracle placed its own library, Tribuo, under an open source license.

In principle, Tribuo is an ML system intended to help close the feature gap between Python and Java in the field of artificial intelligence, at least to a certain extent.

According to the announcement, the product, which is licensed under the (very liberal) Apache license, can look back on several years of use within Oracle. This shows in the library’s very extensive functionality — in addition to training its own models, it also offers interfaces to various other libraries, including TensorFlow.

The author and editors are aware that machine learning cannot be covered comprehensively in an article like this. However, because of the importance of this topic, we want to show you a little bit of how you can play around with Tribuo.


Modular library structure

Machine learning applications are not usually among the tasks you run on resource-constrained systems — IoT edge systems that run ML payloads usually have a hardware accelerator, such as the SiPeed MAiX. Nevertheless, Oracle’s Tribuo library is offered in a modularized way, so in theory, developers can include only those parts of the project in their solutions that they really need. An overview of which functions are provided in the individual packages can be found under https://tribuo.org/learn/4.0/docs/packageoverview.html.

In this introductory article, we do not want to delve further into modularization, but instead, program a little on the fly. A Tribuo-based artificial intelligence application generally has the structure shown in Figure 1.

Fig. 1: Tribuo strives to be a full-stack artificial intelligence solution (Image source: Oracle).


The figure informs us that Tribuo seeks to process information from A to Z itself. On the far left is a DataSource object that collects the information to be processed by artificial intelligence and converts it into Tribuo’s own storage format called Example. These Example objects are then held in the form of a Dataset instance, which — as usual — moves towards a model that delivers predictions. A class called Evaluator can then make concrete decisions based on this information, which is usually quite general or probability-based.

An interesting aspect of the framework in this context is that many Tribuo classes come with a more or less generic system for configuration settings. In principle, an annotation is placed in front of the attributes whose behavior can be adjusted:

public class LinearSGDTrainer implements Trainer<Label>, WeightedExamples {
@Config(description="The classification objective function to use.")
private LabelObjective objective = new LogMulticlass();

The Oracle Labs Configuration and Utilities Toolkit (OLCUT) described in detail in https://github.com/oracle/OLCUT can read this information from XML files — in the case of the property we just created, the parameterization could be done according to the following scheme:

<config>
  <component name="logistic"
             type="org.tribuo.classification.sgd.linear.LinearSGDTrainer">
    <property name="objective" value="log"/>
  </component>
</config>

The point of this approach, which sounds academic at first glance, is that the behavior of ML systems is strongly dependent on the parameters contained in the various elements. By implementing OLCUT, the developer gives the user the possibility to dump these settings, or return a system to a defined state with little effort.


Tribuo integration

After these introductory considerations, it is time to conduct your first experiments with Tribuo. Even though the library source code is available on GitHub for self-compilation, it is recommended to use a ready-made package for first attempts.

Oracle supports both Maven and Gradle and even offers (partial) support for Kotlin in the Gradle case. However, we want to work with classic tools in the following steps, which is why we grab an Eclipse instance and have it create a new Maven-based project skeleton by clicking New | New Maven project.

In the first step of the wizard, the IDE asks whether you want to load templates called archetypes. Please select the Create a simple project (skip archetype selection) checkbox so that the IDE creates a primitive project skeleton. In the next step, we open the file pom.xml to add the library dependency according to the scheme in Listing 1.

 <name>tribuotest1</name>
  <dependencies>
    <dependency>
      <groupId>org.tribuo</groupId>
      <artifactId>tribuo-all</artifactId>
      <version>4.0.1</version>
      <type>pom</type>
    </dependency>
  </dependencies>
</project>

To verify successful integration, we then trigger a rebuild of the project — if you are connected to the Internet, the IDE will inform you that the required components are being downloaded from the Internet to your workstation.

Given the proximity to the classic ML ecosystem, it should not surprise anyone that the Tribuo examples are by and large provided in the form of Jupyter Notebooks — a widespread form of presentation, especially in the research field, that is not suitable for production use, at least in the author’s opinion.

Therefore, we want to rely on classical Java programs in the following steps. The first problem we want to address is classification. In the world of ML, this means that the information provided is assigned to one of a group of categories or histogram bins called classes. In the field of machine learning, a group of examples called sample data sets has become established. These are prebuilt databases that are constant and can be used to evaluate different model hierarchies and training levels. In the following steps, we want to rely on the Iris data set provided at https://archive.ics.uci.edu/ml/datasets/iris. Funnily enough, the term iris here does not refer to a part of the eye, but to a plant species.

Fortunately, the data set is available in a form that can be used directly by Tribuo. For this reason, the author working under Linux opens a terminal window in the first step, creates a new working directory, and downloads the information from the server via wget:

t@T18:~$ mkdir tribuospace
t@T18:~$ cd tribuospace/
t@T18:~/tribuospace$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data

Next, we add a class to our Eclipse project skeleton from the first step, which takes a Main method. Place in it the code from Listing 2.

import org.tribuo.DataSource;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.csv.CSVLoader;
import java.nio.file.Paths;
 
public class ClassifyWorker {
  public static void main(String[] args) {
    var irisHeaders = new String[]{"sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"};
    DataSource<Label> irisData =
      new CSVLoader<>(new LabelFactory()).loadDataSource(Paths.get("bezdekIris.data"),
        irisHeaders[4],
        irisHeaders);

This routine seems simple at first glance, but it’s a bit tricky in several respects. First, we are dealing with a class called Label — depending on the configuration of your Eclipse working environment, the IDE may even offer dozens of Label candidate classes. It’s important to make sure you choose the org.tribuo.classification.Label import shown here — a label is a cataloging category in Tribuo.

The syntax starting with var then requires a current Java version — if you want to use Tribuo (effectively), you have to use at least JDK 8, or even better, JDK 10. After all, the var syntax introduced in that version can be found in just about every code sample.

It follows from the logic that — depending on the system configuration — you may have to make extensive adjustments at this point. For example, the author working under Ubuntu 18.04 first had to provide a compatible JDK:

tamhan@TAMHAN18:~/tribuospace$ sudo apt-get install openjdk-11-jdk

Note that Eclipse sometimes cannot reach the new installation path by itself — in the package used by the author, the correct directory was /usr/lib/jvm/java-11-openjdk-amd64/bin/java.

Anyway, after successfully adjusting the Java execution configuration, we are able to compile our application — you may still need to add the NIO package to the Maven configuration, because the Tribuo library relies on this newer I/O API across the board for better performance.

Now that we have our program skeleton up and running, let’s look at it in turn to learn more about the inner workings of Tribuo applications. The first thing we have to deal with — think of Figure 1 on the far left — is data loading (Listing 3).

public static void main(String[] args) {
  try {
    var irisHeaders = new String[]{"sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"};
    DataSource<Label> irisData;

    irisData = new CSVLoader<>(new LabelFactory()).loadDataSource(
      Paths.get("/home/tamhan/tribuospace/bezdekIris.data"),
      irisHeaders[4],
      irisHeaders);

The provided dataset contains comparatively little metadata — headers and the like are missing. For this reason, we have to pass an array to the CSVLoader class that informs it about the column names of the data set. Passing irisHeaders[4] separately tells the loader which column holds the target variable of the model.

In the next step, we have to take care of splitting our dataset into a training group and a test group. Splitting the information into two groups is quite a common procedure in the field of machine learning. The test data is used to “verify” the trained model, while the actual training data is used to adjust and improve the parameters. In the case of our program, we want to make a 70-30 split between training and other data, which leads to the following code:

var splitIrisData = new TrainTestSplitter<>(irisData, 0.7, 1L);
var trainData = new MutableDataset<>(splitIrisData.getTrain());
var testData = new MutableDataset<>(splitIrisData.getTest());

Attentive readers wonder at this point why the additional parameter 1L is passed. Tribuo works internally with a random generator. As with all or at least most pseudo-random generators, it can be animated to behave more or less deterministically if you set the seed value to a constant. The constructor of the class TrainTestSplitter exposes this seed field — we pass the constant value one here to achieve a reproducible behavior of the class.

At this point, we are ready for our first training run. For training machine learning-based models, a number of established procedures have emerged, which Tribuo bundles into trainer classes. The fastest way to a runnable training system is to use the LogisticRegressionTrainer class, which inherently loads a set of predefined settings:

var linearTrainer = new LogisticRegressionTrainer();
Model<Label> linear = linearTrainer.train(trainData);

The call to the train method then causes the framework to trigger the training process and get ready to issue predictions. It follows from this logic that our next task is to request such a prediction, which is forwarded to an evaluator in the next step. Last but not least, we need to output its results to the command line:

Prediction<Label> prediction = linear.predict(testData.getExample(0));
LabelEvaluation evaluation = new LabelEvaluator().evaluate(linear,testData);
double acc = evaluation.accuracy();
System.out.println(evaluation.toString());

At this point, our program is ready for a first small test run — Figure 2 shows how the results of the Iris data set present themselves on the author’s workstation.

Fig. 2: It works: Machine learning without Python!


Learning more about elements

Having enabled our little Tribuo classifier to perform classification against the Iris data set in the first step, let’s look in detail at some advanced features and functions of the classes used.

The first interesting feature is checking the work of the TrainTestSplitter class. For this purpose, it is enough to place the following code in a conveniently accessible place in the main method:

System.out.println(String.format("data size = %d, num features = %d, num classes = %d", trainData.size(), trainData.getFeatureMap().size(), trainData.getOutputInfo().size()));

The data set exposes a set of member functions that output additional information about the tuples it contains. Executing the code here, for example, reports how many examples, how many features, and how many output classes the data set contains.

The LogisticRegressionTrainer class is also only one of several processes you can use to train the model. If we want to rely on the CART process instead, we can adapt it according to the following scheme:

var cartTrainer = new CARTClassificationTrainer();
Model<Label> tree = cartTrainer.train(trainData);
Prediction<Label> prediction = tree.predict(testData.getExample(0));
LabelEvaluation evaluation = new LabelEvaluator().evaluate(tree,testData);

If you run the program again with the modified classifier, additional information appears in the console output — this shows that Tribuo is also suitable for exploring machine learning processes. Incidentally, given the comparatively simple and unambiguous structure of the data set, both classifiers produce the same result.

Number Recognition

Another dataset widely used in the ML field is the MNIST handwriting sample. This is a collection of handwritten digits gathered from samples written by many different people. Logically, such a model can then be used to recognize postal codes — a process that helps save valuable man-hours and money when sorting letters.

Since we have just dealt with basic classification using a more or less synthetic data set, we want to work with this more realistic information in the next step. In the first step, this information must also be transferred to the workstation, which is done under Linux by calling wget:

tamhan@TAMHAN18:~/tribuospace$ wget
http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
tamhan@TAMHAN18:~/tribuospace$ wget
http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

The most interesting thing about the structure given here is that each wget command downloads two files — the image data and the label data are kept in separate archives, for the training set and the test set alike.

The MNIST data set is available in a data format known as IDX, which in many cases is compressed via GZIP and easier to handle. If you study the Tribuo library documentation available at https://tribuo.org/learn/4.0/javadoc/org/tribuo/datasource/IDXDataSource.html, you will find that with the class IDXDataSource, a loading tool intended for this purpose is available. It even decompresses the data on the fly if required.

It follows that our next task is to integrate the IDXDataSource class into our program workflow (Listing 4).

public static void main(String[] args) {
  try {
    LabelFactory myLF = new LabelFactory();
    DataSource<Label> ids1;
    ids1 = new IDXDataSource<Label>(
      Paths.get("/home/tamhan/tribuospace/t10k-images-idx3-ubyte.gz"),
      Paths.get("/home/tamhan/tribuospace/t10k-labels-idx1-ubyte.gz"), myLF);

From a technical point of view, the IDXDataSource does not differ significantly from the CSVLoader used above. Interestingly, we pass two file paths here, because the image data and the label data of the MNIST data set are stored in separate files.

By the way, developers who grew up with other IO libraries should pay attention to a little beginner’s trap with the constructor: The classes contained in Tribuo do not expect a string, but a finished Path instance. However, this can be generated with the unbureaucratic call of Paths.get().

Be that as it may, the next task of our program is the creation of a second DataSource, which takes care of the training data. Its constructor differs — logically — only in the file name for the source information (Listing 5).

DataSource<Label> ids2;
ids2 = new IDXDataSource<Label>(
  Paths.get("/home/tamhan/tribuospace/train-images-idx3-ubyte.gz"),
  Paths.get("/home/tamhan/tribuospace/train-labels-idx1-ubyte.gz"), myLF);
var trainData = new MutableDataset<>(ids2);
var testData = new MutableDataset<>(ids1);
var evaluator = new LabelEvaluator();

Most data classes included in Tribuo are, at least to some degree, generic with respect to the output type. For us this means that we must always pass a LabelFactory to tell the constructor the desired output format and which factory class should be used to create it.

The next task is to convert the information into data sets and to set up an evaluator as well as a CARTClassificationTrainer training class:

var cartTrainer = new CARTClassificationTrainer();
Model<Label> tree = cartTrainer.train(trainData);

The rest of the code — not printed here for space reasons — is carried over one-to-one from the model processing used above. If you run the program on the workstation, you will see the result shown in Figure 3. By the way, don't be surprised if Tribuo takes some time here — the MNIST dataset is somewhat larger than the tabular one used above.

Fig. 3: Not bad for a first attempt: the results of our CART classifier.


Since the accuracy we achieved is not that great after all, we want to try another training algorithm. The structure of Tribuo shown in Figure 1, built on standard interfaces, helps us make such a change with comparatively little code.

This time we want to use the LinearSGDTrainer class, which we integrate into the program with the statement import org.tribuo.classification.sgd.linear.LinearSGDTrainer;. The explicit mention of the full package name is not pedantry: besides the classification variant used here, there is also a LinearSGDTrainer intended for regression tasks, which cannot help us at this point.

Since the LinearSGDTrainer class does not come with a convenience constructor that initializes the trainer with a set of (more or less well-working) default values, we have to make some changes:

// Arguments: loss objective, gradient optimiser, epochs, logging interval, minibatch size, RNG seed.
var cartTrainer = new LinearSGDTrainer(new LogMulticlass(), SGD.getLinearDecaySGD(0.2), 10, 200, 1, 1L);
Model<Label> tree = cartTrainer.train(trainData);

At this point, this program version is also ready to use — since LinearSGD requires more computing power, it will take a little more time to process.

Excursus: Automatic configuration

LinearSGD may be a powerful machine learning algorithm — but the constructor is long and difficult to understand because of the number of parameters, especially without IntelliSense support.

Since ML projects often involve data scientists with limited Java skills, it would be nice to be able to provide the model configuration as a JSON or XML file.

This is where the OLCUT configuration system mentioned above comes in. If we include it in our solution, we can parameterize the class in JSON according to the scheme shown in Listing 6.

"name" : "cart",
"type" : "org.tribuo.classification.dtree.CARTClassificationTrainer",
"export" : "false",
"import" : "false",
"properties" : {
  "maxDepth" : "6",
  "impurity" : "gini",
  "seed" : "12345",
  "fractionFeaturesInSplit" : "0.5"
}

Even at first glance, it is noticeable that parameters such as the seed for the random generator can be addressed directly — the line "seed" : "12345" should be understandable even for someone with little Java experience.

By the way, OLCUT is not limited to the linear creation of object instances — the system is also able to create nested class structures. In the markup just shown, the attribute "impurity" : "gini" is an excellent example of this: it can be expanded according to the following scheme to generate an instance of the GiniIndex class:

"name" : "gini",
"type" : "org.tribuo.classification.dtree.impurity.GiniIndex",

Once we have such a configuration file, we can invoke an instance of the ConfigurationManager class as follows:

ConfigurationManager.addFileFormatFactory(new JsonConfigFactory());
String configFile = "example-config.json";
// Optional: read the file contents, e.g. to display the configuration.
String configContents = String.join("\n", Files.readAllLines(Paths.get(configFile)));

OLCUT is agnostic with regard to the file formats used — the actual logic for reading the data from the file system or configuration file is contributed by an adapter class. Since our code uses the JSON file format, we register an instance of JsonConfigFactory via the addFileFormatFactory method.

Next, we can also start parameterizing the elements of our machine learning application using ConfigurationManager:

var cm = new ConfigurationManager(configFile);
DataSource<Label> mnistTrain = (DataSource<Label>) cm.lookup("mnist-train");

The lookup method takes a string, which is compared against the Name attributes in the knowledge base. Provided that the configuration system finds a match at this point, it automatically invokes the class structure described in the file.
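
The same mechanism works for the trainer defined under the name cart in Listing 6 — a minimal sketch, assuming that component is present in the same configuration file (the cast is needed because lookup returns a generic configurable object):

// Look up the configured CART trainer and train it on the data source resolved above.
var trainer = (CARTClassificationTrainer) cm.lookup("cart");
var trainSet = new MutableDataset<>(mnistTrain);
Model<Label> model = trainer.train(trainSet);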

ConfigurationManager is extremely powerful in this respect. Oracle offers a comprehensive example at https://tribuo.org/learn/4.0/tutorials/configuration-tribuo-v4.html that harnesses the configuration system in Jupyter notebooks to generate a complex machine learning toolchain.

This measure, which seems pointless at first glance, is actually quite sensible. Machine learning systems live and die with the quality of the training data as well as with the parameterization — if these parameters can be conveniently addressed externally, it is easier to adapt the system to the present needs.

Grouping with synthetic data sets

In the field of machine learning, there is also a group of universal procedures that can be applied to a wide variety of concrete problems. Having dealt with classification so far, we now turn to clustering. As shown in Figure 4, the idea is to divide an unstructured mass of data into groups in order to recognize trends or associations more easily.

Fig. 4: Drawing dividing lines partitions the dataset (Image source: Wikimedia Commons/ChiRe).


Logically, we also need data for our clustering experiments. Tribuo helps us here with the ClusteringDataGenerator class, which harnesses the Gaussian distribution (from mathematics and probability theory) to generate test data sets. For now, we want to populate two test data sets according to the following scheme:

public static void main(String[] args) {
  try {
    var data = ClusteringDataGenerator.gaussianClusters(500, 1L);
    var test = ClusteringDataGenerator.gaussianClusters(500, 2L);

The number passed as the second parameter determines the seed for the random generator. Since we pass two different values here, the PRNG produces two sequences that differ from each other but both follow the Gaussian normal distribution.

In the field of clustering methods, the K-Means process is well established, so we want to use it in our Tribuo example as well. If you look carefully at the code, you will once again recognize the standardized structure:

// Arguments: number of centroids, iterations, distance measure, number of threads, RNG seed.
var trainer = new KMeansTrainer(5, 10, Distance.EUCLIDEAN, 1, 1);
var model = trainer.train(data);
var centroids = model.getCentroidVectors();
for (var centroid : centroids) {
  System.out.println(centroid);
}

Of particular interest is the Distance value passed in, which determines how the distances between elements are calculated or weighted. Note that the Tribuo library ships several Distance enums — the one we need lives in the namespace org.tribuo.clustering.kmeans.KMeansTrainer.Distance.

Now the program can be executed again — Figure 5 shows the generated center point matrices.

Fig. 5: Tribuo informs us of the whereabouts of clusters.


The red messages are progress reports: most of the classes included in Tribuo contain logic to inform the user in the console about the progress of long-running operations. Since this output also consumes processing power, many constructors offer a way to control how often it is emitted.

Be that as it may: to evaluate the results, it is helpful to know the Gaussian parameters used by ClusteringDataGenerator. Funnily enough, Oracle only reveals them in the Tribuo tutorial example; the five Gaussians are parameterized as follows:

N([ 0.0,0.0], [[1.0,0.0],[0.0,1.0]])
N([ 5.0,5.0], [[1.0,0.0],[0.0,1.0]])
N([ 2.5,2.5], [[1.0,0.5],[0.5,1.0]])
N([10.0,0.0], [[0.1,0.0],[0.0,0.1]])
N([-1.0,0.0], [[1.0,0.0],[0.0,0.1]])

Since a detailed discussion of the mathematical processes taking place in the background would go beyond the scope of this article, we would like to cede the evaluation of the results to Tribuo instead. The tool of choice is, once again, an element based on the evaluator principle. However, since we are dealing with clustering this time, the necessary code looks like this:

ClusteringEvaluator eval = new ClusteringEvaluator();
var mtTestEvaluation = eval.evaluate(model,test);
System.out.println(mtTestEvaluation.toString());

When you run the present program, you get back a human-readable result — the Tribuo library takes care of preparing the results contained in ClusteringEvaluator for convenient output on the command line or in the terminal:

Clustering Evaluation
Normalized MI = 0.8154291916732408
Adjusted MI = 0.8139169342020222

Excursus: Faster when parallelized

Artificial intelligence tasks tend to consume immense amounts of computing power — if you don’t parallelize them, you lose out.

Parts of the Tribuo library are provided by Oracle out of the box with the necessary tooling that automatically distributes the tasks to be done across multiple cores of a workstation.

The trainer presented here is an excellent example of this. As a first step, we use the following scheme to make both the training and the test data considerably larger:

var data = ClusteringDataGenerator.gaussianClusters(50000, 1L);
var test = ClusteringDataGenerator.gaussianClusters(50000, 2L);

In the next step, it is sufficient to change one value in the constructor of the KMeansTrainer class according to the following scheme — passing eight here instructs the engine to use eight processor cores simultaneously:

var trainer = new KMeansTrainer(5,10,Distance.EUCLIDEAN,8,1);

At this point, you can run the program again. When monitoring the overall CPU consumption with a tool like top, you should briefly see all cores being utilized.

Regression with Tribuo

By now, it should be obvious that one of the most important USPs of the Tribuo library is its ability to expose quite different artificial intelligence procedures through the common scheme of thought, process, and implementation shown in Figure 1. As a final task in this section, let's turn to regression — which, in the world of artificial intelligence, is the analysis of a data set to determine the relationship between the variables of an input set and an output set. This is what early neural networks in game AI did — and it is the area that the uninitiated developer most readily expects from AI.

For this task we want to use a wine quality data set: the data sample available in detail at https://archive.ics.uci.edu/ml/datasets/Wine+Quality correlates wine ratings with various chemical analysis values. The purpose of the resulting system is thus to judge the quality of a wine after being fed this chemical information.

As always, the first task is to provide the test data, which we (of course) do again via wget:

tamhan@TAMHAN18:~/tribuospace$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Our first order of business is to load the sample information we just provided — a task that is done via a CSVLoader class:

try {
  var regressionFactory = new RegressionFactory();
  var csvLoader = new CSVLoader<>(';', regressionFactory);
  var wineSource = csvLoader.loadDataSource(Paths.get("/home/tamhan/tribuospace/winequality-red.csv"), "quality");

Two things are new compared to the previous code: First, the CSVLoader is now additionally passed a semicolon. This character tells it that the submitted sample file uses a "special" format in which the individual values are separated by semicolons rather than commas. Second, we now use an instance of RegressionFactory as the factory, since the previously used LabelFactory is not suitable for regression analyses.

The wine data set is not divided into training and test data. For this reason, the TrainTestSplitter class used earlier comes into play again; we use 70 percent of the data for training and the remaining 30 percent for testing:

var splitter = new TrainTestSplitter<>(wineSource, 0.7f, 0L);
Dataset<Regressor> trainData = new MutableDataset<>(splitter.getTrain());
Dataset<Regressor> evalData = new MutableDataset<>(splitter.getTest());

In the next step we need a trainer and an evaluator (Listing 7).

var trainer = new CARTRegressionTrainer(6);
Model<Regressor> model = trainer.train(trainData);
RegressionEvaluator eval = new RegressionEvaluator();
var evaluation = eval.evaluate(model,trainData);
var dimension = new Regressor("DIM-0",Double.NaN);
System.out.printf("Evaluation (train):%n  RMSE %f%n  MAE %f%n  R^2 %f%n",
  evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));

Two things are new about this code: First, we now have to create a Regressor that marks one of the dimensions contained in the data set as the one of interest. Second, as part of model preparation we run an evaluation against the very data set used for training. Such an evaluation paints an overly optimistic picture and can hide overfitting, but used judiciously it still yields some useful reference figures.

If you run our program as-is, you will see the following output on the command line:

Evaluation (train):
RMSE 0.545205
MAE 0.406670
R^2 0.544085

RMSE (root mean squared error) and MAE (mean absolute error) both describe the quality of the model; for both, a lower value indicates a more accurate model. R^2, by contrast, is better the closer it gets to 1.
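
For reference, with y_i the observed quality, \hat{y}_i the prediction and n the number of examples, the two error metrics are defined as:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert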

As a final task, we have to take care of performing another evaluation, but it no longer gets the training data as a comparison. To do this, we simply need to adjust the values passed to the Evaluate method:

evaluation = eval.evaluate(model,evalData);
dimension = new Regressor("DIM-0",Double.NaN);
System.out.printf("Evaluation (test):%n RMSE %f%n MAE %f%n R^2 %f%n",
evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));

The reward for our effort is the screen image shown in Figure 6 — since we are no longer evaluating the original training data, the accuracy of the resulting system has deteriorated.

Fig. 6: Against real data the test delivers less accurate results.


This program behavior makes sense: if you evaluate a model against new data, the results will naturally be worse than if you short-circuit it with the data it was trained on. Note at this point, however, that overfitting is a classic antipattern, especially in financial economics, and has cost more than one algorithmic trader a lot of money.

This brings us to the end of our journey through the world of artificial intelligence with Tribuo. The library also supports anomaly detection as a fourth design pattern, which we will not discuss here; a small tutorial can be found at https://tribuo.org/learn/4.0/tutorials/anomaly-tribuo-v4.html. The main difference from the three methods presented so far is that anomaly detection works with a different set of classes.

Conclusion

Hand on heart: if you know Java well, you can learn Python — perhaps while swearing, but without real problems. It is more a question of not wanting to than of not being able to.

On the other hand, there is no question that Java payloads are much easier to integrate into various enterprise processes and enterprise toolchains than their Python-based counterparts. The use of the Tribuo library already provides a remedy since you do not have to bother with the manual brokering of values between the Python and the Java parts of the application.

If Oracle improved the documentation a bit and offered even more examples, there would be nothing at all to hold against the system. As it stands, the hurdle to entry is somewhat higher: those who have never worked with ML before will learn the basics faster with Python, because they will find more turnkey examples. On the other hand, working with Tribuo is often faster thanks to its various convenience functions. All in all, Oracle succeeds in a big way with Tribuo, which should give Java a good chance in the field of machine learning — at least from an ecosystem point of view.

The post On pythonic tracks appeared first on ML Conference.

]]>
Real-time anomaly detection with Kafka and Isolation Forests https://mlconference.ai/blog/real-time-anomaly-detection-with-kafka-and-isolation-forests/ Thu, 07 Jan 2021 08:43:20 +0000 https://mlconference.ai/?p=81108 Anomalies - or outliers - are ubiquitous in data. Be it due to measurement errors of sensors, unexpected events in the environment or faulty behaviour of a machine. In many cases, it makes sense to detect such anomalies in real time in order to be able to react immediately. The data streaming platform Apache Kafka and the Python library scikit-learn provide us with the necessary tools for this.

The post Real-time anomaly detection with Kafka and Isolation Forests appeared first on ML Conference.

]]>
Detecting anomalies in data series can be valuable in many contexts: from predictive maintenance to monitoring resource consumption and IT security. The time factor also plays a role in detection: the earlier the anomaly is detected, the better. Ideally, anomalies should be detected in real time, immediately after they occur. This can be achieved with just a few tools. In the following, we take a closer look at Kafka, Docker and Python with scikit-learn. The complete implementation of the listed code fragments can be found on GitHub.


Apache Kafka

The core of Apache Kafka is a simple idea: a log to which data can be appended at will. This continuously growing log is then called a stream. Producers can write data into the stream, while consumers read this data. This simplicity makes Kafka very powerful and usable for many purposes. For example, sensor data can be sent into the stream directly after measurement, and anomaly detection software can read and process it on the other side. But why should there even be a stream between the data source (sensors) and the anomaly detection instead of sending the data directly to the anomaly detection?

There are at least three good reasons for this. First, Kafka can run distributed in a cluster and offer high reliability (provided that it is actually running in a cluster of servers and that another server can take over in case of failure). If the anomaly detection fails, the stream can be processed after a restart and continue where it left off last. So no data will be lost.

Secondly, Kafka offers advantages when a single container for real-time anomaly detection is not working fast enough. In this case, load balancing would be needed to distribute the data. Kafka solves this problem with partitions. Each stream can be divided into several partitions (which are potentially available on several servers). A partition is then assigned to a consumer. This allows multiple consumers (also called consumer groups) to dock to the same stream and allows the load to be distributed between several consumers.

Thirdly, it should be mentioned that Kafka makes it possible to decouple functionality. If, for example, the sensor data should additionally be stored in a database, this can be realized by another consumer group that docks onto the stream and writes the data to the database. This is why companies like LinkedIn, where Kafka was originally developed, have declared it the central nervous system through which all their data is passed.


A Kafka cluster can easily be started using Docker. A Docker Compose file is well suited for this: it allows several Docker containers to be started at the same time. A Docker network is also created (here called kafka_cluster_default), which all containers join automatically and which enables their communication. The following commands start the cluster from the command line (they can also be found in the README.md of the GitHub repository for copying):

git clone https://github.com/NKDataConv/anomalie-erkennung.git
cd anomalie-erkennung/
docker-compose -f ./kafka_cluster/docker-compose.yml up -d --build

Here the Kafka Cluster consists of three components:

  1. The broker is the core component and contains and manages the stream.
  2. ZooKeeper is an independent tool that Kafka uses to coordinate several brokers. Even if only one broker is started, ZooKeeper is still needed.
  3. The schema registry is an optional component, but has established itself as the best practice in connection with Kafka.

The Schema Registry solves the following problem: in practice, the responsibility for the producer (i.e. the origin of the data) is often separate from the responsibility for the consumer (i.e. the data processing). To make this cooperation run smoothly, the Schema Registry establishes a kind of contract between the two parties. This contract contains a schema for the data; if data does not match the schema, the Schema Registry does not allow it to be added to the corresponding stream. The Schema Registry therefore checks the schema every time data is sent. The schema is also used to serialize the data in the Avro format. Serialization with Avro allows the schema to evolve, for example to add additional data fields, while backward compatibility is ensured. A further advantage of the Schema Registry in combination with Avro is that the data in the Kafka stream does not carry the schema itself, because it is stored in the Schema Registry — only the data itself ends up in the stream. When sending data in JSON format, by contrast, the names of all fields would have to be sent with every message. This makes the Avro format very efficient for Kafka.
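
To make this more concrete, a hypothetical Avro schema for the sensor readings used later in this article could look like the following (the field names are assumptions for illustration; the actual schema lives in the repository's producer):

{
  "type": "record",
  "name": "SensorReading",
  "namespace": "anomaly.detection",
  "fields": [
    { "name": "timestamp", "type": "string" },
    { "name": "cpu",       "type": "double" },
    { "name": "network",   "type": "double" },
    { "name": "disk",      "type": "double" }
  ]
}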

To test the Kafka cluster, we can send data from the command line into the stream and read it out again with a command line consumer. To do this we first create a stream. In Kafka, a single stream is also called a topic. The command for this consists of two parts. First, we have to take a detour via the Docker-Compose-File to address Kafka. The command important for Kafka starts with kafka-topics. In the following the topic test-stream is created:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-topics --create --bootstrap-server localhost:9092 --topic test-stream

Now we can send messages to the topic. To do this, messages can be sent with ENTER after the following command has been issued:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-console-producer --broker-list localhost:9092 --topic test-stream

At the same time, a consumer can be opened in a second terminal, which outputs the messages on the command line:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-console-consumer --bootstrap-server localhost:9092 --topic test-stream --from-beginning

Both processes can be aborted with CTRL + C. The Schema Registry has not yet been used here. It will only be used in the next step.

Kafka Producer with Python

There are several ways to implement a producer for Kafka. Kafka Connect allows you to connect many external tools such as databases or the file system and send their data to the stream. But there are also libraries for various programming languages. For Python, for example, there are at least three working, actively maintained libraries. To simulate data in real time, we use a time series that measures the resource usage of a server. This time series includes CPU usage and the number of bytes of network and disk read operations. This time series was recorded with Amazon CloudWatch and is available on Kaggle. An overview of the data is shown in Figure 1, with a few obvious outliers for each time series.



Fig. 1: Overview of Amazon CloudWatch data

In the time series, the values are recorded at 5-minute intervals. To save time, they are sent in the stream every 1 second. In the following, an image for the Docker Container is created. The code for the producer can be viewed and adjusted in the subfolder producer. Then the container is started as part of the existing network kafka_cluster_default.

docker build ./producer -t kafka-producer
docker run --network="kafka_cluster_default" -it kafka-producer
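
Inside the container, the producer essentially performs the following steps. The sketch below is a simplified illustration rather than the repository's exact code: it assumes the broker is reachable as broker:9092 inside the kafka_cluster_default network, sends plain JSON instead of Avro, and uses placeholder column names (cpu, network, disk) for the CloudWatch metrics; the topic name anomaly_tutorial is taken from the partitioning command later in this article.

import json
import time

import pandas as pd
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})
df = pd.read_csv("cloudwatch_resource_usage.csv")  # hypothetical file name

for _, row in df.iterrows():
    message = json.dumps({
        "timestamp": str(row["timestamp"]),
        "cpu": float(row["cpu"]),
        "network": float(row["network"]),
        "disk": float(row["disk"]),
    })
    producer.produce("anomaly_tutorial", value=message.encode("utf-8"))
    producer.poll(0)   # serve delivery callbacks
    time.sleep(1)      # one 5-minute sample per second, as described above

producer.flush()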

Now we want to turn to the other side of the stream.

Kafka Consumer

For anomaly detection it makes sense to implement two consumers: one for real-time detection and a second one for training the model. The training consumer can be stopped after training and restarted for a nightly retraining, which is useful if the data has changed — the algorithm is then trained again and a new model is saved. It is also conceivable to trigger this container at regular intervals using a cron job. In production, however, this should not be left entirely unattended; an intermediate step that checks the model makes sense at this point — more about this later.

The trained model is then stored in the Docker volume. The volume is mounted in the subfolder data. The first consumer has access to this Docker Volume. This consumer can load the trained model and evaluate the data for anomalies.

Technologically, libraries also exist for the consumer in many languages. Since Python is often the language of choice for data scientists, it is the obvious choice. It is interesting to note that the schema no longer needs to be specified for implementation on the consumer side. It is automatically obtained via the schema registry, and the message is deserialized. An overview of the containers described so far is shown in Figure 2. Before we also run the consumer, let’s take a closer look at what is algorithmically done in the containers.



Fig. 2: Overview of docker containers for anomaly detection

Isolation Forests

For the anomaly detection a colourful bouquet of algorithms exists. Furthermore, the so-called no-free-lunch theorem makes it difficult for us to choose. The main point of this theorem is not that we should take money with us into lunch break, but that there is nothing free when choosing the algorithm. In concrete terms, this means that we don’t know whether a particular algorithm works best for our data unless we test all algorithms (with all possible settings) – an impossible task. Nevertheless, we can sort the algorithms a bit.

Basically, in anomaly detection one can first distinguish between supervised and unsupervised learning. If the data indicates when an anomaly has occurred, this is referred to as supervised learning. Classification algorithms are suitable for this problem. Here, the classes “anomaly” and “no anomaly” are used. If, on the other hand, the data set does not indicate when an anomaly has occurred, one speaks of unsupervised learning. Since this is usually the case in practice, we will concentrate here on the approaches for unsupervised learning. Any prediction algorithm can be used for this problem. The next data point arriving is predicted. When the corresponding data point arrives, it is compared with the forecast. Normally the forecast should be close to the data point. However, if the forecast is far off, the data point is classified as an anomaly.

What "far off" exactly means can be determined with self-selected threshold values or with statistical tests. The Seasonal Hybrid ESD algorithm developed by Twitter, for example, takes this approach. A similar approach is pursued with autoencoders: neural networks that first reduce the dimensionality of the data and then try to restore the original data from this reduced representation. Normally this works well; if it fails for a data point, that point is classified as an anomaly. Another common approach is the use of One-Class Support Vector Machines. These draw as tight a boundary as possible around the training data, with a few data points allowed to fall outside. The share of such exceptions must be given to the algorithm as a percentage and determines how many anomalies will be detected — so there must be some prior idea of the number of anomalies. Each new data point is then checked to see whether it lies inside or outside the boundary; outside corresponds to an anomaly.


Another approach that works well is Isolation Forest. The idea behind this algorithm is that anomalies can be distinguished from the rest of the data with the help of a few dividing lines (Figure 3). Data points that can be isolated with the fewest separators are therefore the anomalies (box: “Isolation Forests”).

Isolation Forests

Isolation Forests are based on several decision trees. A single decision tree in turn consists of nodes and leaves. The decisions are made in the nodes: for each node, a random feature (that is, a column of the data — in our example CPU, network or disk) and a random split value are selected. If a data point's value in that feature is less than the split value, it is placed in the left branch, otherwise in the right branch. Each branch is followed by the next node, and the whole thing happens again: random feature and random split value. This continues until each data point ends up in a separate leaf, which isolates each data point. The procedure is repeated with the same data for several trees. These trees form the forest and give the algorithm its name.

So at this point, each tree has isolated all training data into leaves. Now we have to decide which leaves, or data points, are anomalies. Here we can observe: anomalies are easier to isolate. This is also shown in Figure 3, which uses two-dimensional data for illustration. Points that lie away from the bulk of the data can be isolated with a single split (shown as a black line on the left), which corresponds to one split in the decision tree. In contrast, data points that lie close together can only be isolated with multiple splits (shown as several dividing lines on the right).

This means that on average, anomalies require fewer splits to isolate. And this is exactly how anomalies are defined in Isolation Forests: Anomalies are the data points that are isolated with the fewest splits. Now all that remains to be done is to decide which part of the data should be classified as an anomaly. This is a parameter that the Isolation Forests need for training. Therefore, one needs a premonition of how many anomalies are found in the data set. Alternatively, an iterative approach can be used. Different values can be tried out and the result checked. Another parameter is the number of decision trees for the forest, since the number of splits is calculated using the average of several decision trees. The more decision trees, the more accurate the result. One hundred trees is a generous value here, which is sufficient for a good result. This may sound like a lot at first, but the calculations are actually within seconds. This efficiency is also an advantage of Isolation Forests.



Fig. 3: Demonstration of data splitting with Isolation Forests: left with as few splits possible, right only with several splits

Isolation Forests are easy to implement with Python and the library scikit-learn. The parameter n_estimators corresponds to the number of decision trees and contamination to the expected proportion of anomalies. The fit method is used to trigger the training process on the data. The model is then stored in the Docker Volume.

from sklearn.ensemble import IsolationForest
import joblib

iso_forest = IsolationForest(n_estimators=100, contamination=float(.02))
iso_forest.fit(x_train)
joblib.dump(iso_forest, '/data/iso_forest.joblib')

From the Docker Volume, the Anomaly Detection Consumer can then call the model and evaluate streaming data using the predict method:

iso_forest = joblib.load('/data/iso_forest.joblib')
anomaly = iso_forest.predict(df_predict)
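
How this call fits into a consumer loop is sketched below. This is a simplified illustration, not the repository's exact code: it assumes JSON messages on the topic anomaly_tutorial, whereas the real consumer deserializes Avro via the Schema Registry, and the feature columns are placeholders that must match exactly what the model was trained on (including engineered features such as business_hour, see below).

import json

import joblib
import pandas as pd
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "anomaly-prediction",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["anomaly_tutorial"])

iso_forest = joblib.load('/data/iso_forest.joblib')

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    df_predict = pd.DataFrame([record])[["cpu", "network", "disk"]]
    label = iso_forest.predict(df_predict)[0]  # -1 = anomaly, 1 = normal
    print("Anomaly" if label == -1 else "No anomaly")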

A nice feature of scikit-learn is the simplicity and consistency of its API. The Isolation Forest can be exchanged for the One-Class Support Vector Machine mentioned above with minimal effort: essentially, only the import and the constructor call change, while the data preparation and the fit/predict calls stay the same. A minimal sketch of this swap follows below.
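
The sketch below illustrates the swap; it is an assumption-laden illustration rather than the repository's code. In particular, OneClassSVM has no contamination parameter — its nu parameter plays a roughly comparable role as the expected share of outliers.

from sklearn.svm import OneClassSVM
import joblib

# x_train: the training features prepared as in the Isolation Forest example above.
one_class_svm = OneClassSVM(kernel="rbf", nu=0.02)
one_class_svm.fit(x_train)
joblib.dump(one_class_svm, '/data/one_class_svm.joblib')

With that, let us now turn to data preparation.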

Feature Engineering with time series

In Machine Learning, each column of the data set (CPU, network and disk) is also called a feature. Feature engineering is the preparation of the features for use. There are a few things to consider. Often the features are scaled to get all features in a certain value range. For example, MinMax scaling transforms all values into the range 0 to 1. The maximum of the original values then corresponds to 1, the minimum to 0, and all other values are scaled in the range 0 to 1. Scaling is, for example, a prerequisite for training neural networks. For Isolation Forests especially, scaling is not necessary.

Another important step is often the conversion of categorical features that are represented by a string. Let’s assume we have an additional column for the processor with the strings Processor_1 and Processor_2, which indicate which processor the measured value refers to. These strings can be converted to zeros and ones using one-hot encoding, where zero represents the category Processor_1 and one represents Processor_2. If there are more than two categories, it makes sense to create a separate feature (or column) for each category. This will result in a Processor_1 column, a Processor_2 column, and so on, with each column consisting of zeros and ones for the corresponding category.
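
As an illustration, pandas can do this with get_dummies (the processor column here is a hypothetical example and not part of the CloudWatch data):

import pandas as pd

# Hypothetical categorical column for illustration.
df = pd.DataFrame({"processor": ["Processor_1", "Processor_2", "Processor_1"]})

# One-hot encoding: creates the 0/1 columns processor_Processor_1 and processor_Processor_2.
df_encoded = pd.get_dummies(df, columns=["processor"])
print(df_encoded)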

Feature engineering also means to prepare the existing information in a useful way and to provide additional information. Especially for time series it is often important to extract time information from the time stamp. For example, many time series fluctuate during the course of a day. Then it might be important to prepare the hour as a feature. If the fluctuations come with the seasons, a useful feature is the month. Or you can differentiate between working days and vacation days. All this information can easily be extracted from the timestamp and often provides a significant added value to the algorithm. A useful feature in our case could be working hours. This feature is zero during the time 8-18 and otherwise one. With the Python library pandas, such features can be easily created:

df['hour'] = df.timestamp.dt.hour
df['business_hour'] = ((df.hour < 8) | (df.hour > 18)).astype("int")

It is always important that sufficient data is available. It is therefore useless to introduce the feature month if there is no complete year of training data available. Then the first appearance of the month of May would be detected as an anomaly in live operation, if it was not already present in the training data.

At the same time, one should note that "the more features, the better" does not apply here — on the contrary! If a feature is created that has nothing to do with the anomalies, the algorithm will still use it for detection and find spurious irregularities. Here, thinking carefully about the data and the problem is essential.

Trade-off in anomaly detection

Training an algorithm to correctly detect all anomalies is difficult or even impossible. You should therefore expect the algorithm to make mistakes. These errors can, however, be controlled to a certain degree. The first kind of error occurs when an anomaly is not recognized as such by the algorithm — an anomaly is missed.

The second kind of error is the reverse: an anomaly is reported, but in reality there is none — a false alarm. Both errors are of course bad.

However, depending on the scenario, one of the two errors may be more decisive. If each detected anomaly results in expensive machine maintenance, we should try to have as few false alarms as possible. On the other hand, anomaly detection could monitor important values in the health care sector. In doing so, we would rather accept a few false alarms and look after the patient more often than miss an emergency. These two errors can be weighed against each other in most anomaly detection algorithms. This is done by setting a threshold value or, in the case of isolation forests, by setting the contamination parameter. The higher this value is set, the more anomalies are expected by the algorithm and the more are detected, of course. This reduces the probability that an anomaly will be missed.

The argumentation works the other way round as well: If the parameter is set low, the few detected anomalies are most likely the ones that are actually those. But because of this, some anomalies can be missed. So here again it is important to think the problem through and find meaningful parameters by iterative trial and error.

Real-time anomaly detection

Now we have waited long enough and the producer should have written some data into the Kafka stream by now. Time to start training for anomaly detection. This happens in a second terminal with Docker using the following commands:

docker build ./Consumer_ML_Training -t kafka-consumer-training
docker run --network="kafka_cluster_default" --volume $(pwd)/data:/data:rw -it kafka-consumer-training

The training occurs on 60 percent of the data. The remaining 40 percent is used for testing the algorithm. For time series, it is important that the training data lies ahead of the test data in terms of time. Otherwise it would be an unauthorized look into a crystal ball.
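
A minimal sketch of such a time-ordered split (the file name and column names are placeholders, not the repository's code):

import pandas as pd

# Chronologically sorted feature DataFrame, prepared as described above.
df = pd.read_csv("prepared_features.csv", parse_dates=["timestamp"]).sort_values("timestamp")

split_index = int(len(df) * 0.6)
x_train = df.iloc[:split_index]   # older 60 percent for training
x_test = df.iloc[split_index:]    # newer 40 percent for evaluation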

An evaluation on the test data is stored in Docker Volume in the form of a graph together with the trained model. An evaluation is visible in Figure 4. The vertical red lines correspond to the detected anomalies. For this evaluation we waited until all data were available in the stream.



Fig. 4: Detected anomalies in the data set

If we are satisfied with the result, we can now use the following commands to start the anomaly detection:

docker build ./Consumer_Prediction -t kafka-consumer-prediction
docker run --network="kafka_cluster_default" --volume $(pwd)/data:/data:ro -it kafka-consumer-prediction

As described at the beginning, we can also use several Docker containers for evaluation to distribute the load. For this, the Kafka Topic load must be distributed over two partitions. So let’s increase the number of partitions to two:

docker-compose -f kafka_cluster/docker-compose.yml exec broker kafka-topics --alter --zookeeper zookeeper:2181 --topic anomaly_tutorial --partitions 2

Now we simply start a second consumer for anomaly detection with the known command:

docker run --network="kafka_cluster_default" --volume $(pwd)/data:/data:ro -it kafka-consumer-prediction


Kafka now automatically takes care of assigning partitions to consumers; the rebalancing can take a few seconds. An anomaly is reported in the Docker logs with "No anomaly" or "Anomaly at time *". How the anomaly is processed further is left open here. It would be particularly elegant to write the anomaly into a new Kafka topic; an ELK stack could dock onto this via Kafka Connect and visualize the anomalies.

The post Real-time anomaly detection with Kafka and Isolation Forests appeared first on ML Conference.

]]>
Let’s visualize the coronavirus pandemic https://mlconference.ai/blog/lets-visualize-the-coronavirus-pandemic/ Wed, 16 Dec 2020 14:16:20 +0000 https://mlconference.ai/?p=81043 Since February, we have been inundated in the media with diagrams and graphics on the spread of the coronavirus. The data comes from freely accessible sources and can be used by everyone. But how do you turn the source data into a data set that can be used to create something visual like a dashboard? With Python and modules like pandas, this is no magic trick.

The post Let’s visualize the coronavirus pandemic appeared first on ML Conference.

]]>
These are crazy times we have been living in since the beginning of 2020; a pandemic has turned public life upside down. News sites offer live tickers with the latest news on infections, recoveries and death rates, and there is no medium that does not use a chart for visualization. Institutes like the Robert Koch Institute (RKI) or the Johns Hopkins University provide dashboards. We live in a world dominated by data, even during a pandemic.

The good thing is that most data on the pandemic is publicly available. Johns Hopkins University, for example, makes its data available in an open GitHub repository. So what could be more obvious than to create your own dashboard with this freely accessible data? This article uses coronavirus data to illustrate how to get from data cleansing to enriching data from other sources to creating a dashboard using Plotly’s dash. First of all, an important note: The data is not interpreted or analyzed in any way. This must be left to experts such as virologists, otherwise false conclusions may be drawn. Even if data is available for almost all countries, it is not necessarily comparable; each country uses different methods for testing infections. Some countries even have too few tests, so that no uniform picture can be drawn. The data set serves only as an example.

 

First, the work

In order to be able to use the data, we need to get it in a standardized form for our purposes. The data of Johns Hopkins University is stored in a GitHub repository. Basically, it is divided into two categories: first, continuously as time-series data and second, as a daily report in a separate CSV file. For the dashboard we need both sources. With the time-series data, it is easy to create line charts and plot gradients, curves, etc. From this we later generate the temporal course of the case numbers as line charts. Furthermore, we can calculate and display the growth rates from the data.

 

Set up environment

To prepare the data we use Python and the library pandas. pandas is the Swiss army knife for Python in terms of data analysis and cleanup. Since we need to install some modules for Python, I recommend creating a multi-environment setup for Python. Personally I use Anaconda, but there are also alternatives like Virtualenv. Using isolated environments does not change anything in the system-wide installation of Python, so I strongly recommend it. Furthermore you can work with different Python versions and export the dependencies for deployment more easily. Regardless of the system used, the first step is to activate the environment. With Anaconda and the command line tool conda this works as follows:

$ conda create -n corona-dashboard python=3.8.2

This creates an environment called corona-dashboard with Python 3.8.2; the necessary modules are installed in the next step. To use the environment, we activate it with

$ conda activate corona-dashboard

We do not want to go deeper into Anaconda here, but for more information you can refer to the official documentation.

Once we have activated our environment, we install the necessary modules. In the first step these are pandas and Jupyter notebook. A Jupyter notebook is, simply put, a digital notebook. The notebooks contain markdown and code snippets that can be executed immediately. They are perfectly suited for iterative data cleansing and for developing the necessary steps. The notebooks can also be used to develop diagrams before they are transferred into the final scripts.

$ conda install pandas
$ conda install -c conda-forge notebook

In the following we will perform all steps in Jupyter notebooks. The notebooks can be found in the repository for this article on GitHub. To use them, the server must be started:

$ jupyter notebook

After it’s up and running, the browser will display the overview in the current directory. Click on new and choose your desired environment to open a new Jupyter notebook. Now we can start on cleaning up the data.

 

Cleaning time-based data

In the first step, we import pandas, which provides methods for reading and manipulating data. Here we need a parser for CSV files, for which the method read_csv is provided. At least one path to a file or a buffer is expected as a parameter. If you specify a URL to a CSV as a parameter, pandas will read and process it without any problems. To ensure consistent, traceable data, we access a downloaded file that is available in the repository.

df = pd.read_csv("time_series_covid19_confirmed_global.csv")

To check this, we use the instruction df.head() to output the first five lines of the DataFrame (fig. 1).

Fig. 1: The unadjusted DataFrame

 

To get an overview of the structure, we can print the column names with the statement df.columns. As Figure 2 shows, there is one column for each day in the table. Furthermore, there are geo-coordinates for the respective countries, and the countries are partly divided into provinces and federal states. For the time-based data, we need neither the geo-coordinates nor the column with the states, so we can remove them. We achieve this in pandas with the following method call on the DataFrame:

df.drop(columns=['Lat', 'Long', 'Province/State'], inplace = True)

The method drop expects as parameter the data we want to get rid of. In this case, there are three columns: Lat, Long and Province/State. It is important that the names with upper/lower case and any spaces are specified exactly. The second parameter inplace is used to apply the operation directly to our DataFrame. Without this parameter, pandas returns the modified DataFrame to us without changing the original. If you look at the frame with df.head(), you will see that the desired columns have been discarded.

The division of some countries into provinces or states results in multiple entries for some. An example is China. Therefore, it makes sense to group the data by country. For this, pandas provides a powerful grouping function.

df_grouped = df.groupby(['Country/Region'], as_index=False).sum()

The groupby function combines the rows according to the column specified as the grouping key. The chained .sum() adds up the values of the combined groups. The return value is a new DataFrame with the grouped and summed data, giving us the time-related data for all countries. Next, we swap rows and columns so that there is one column per country and one row per day.

df_grouped = df_grouped.set_index('Country/Region').transpose()

Before transposing, we set the index to Country/Region to get a clean frame. The downside is that after transposing, the dates end up in the index.

The next adjustment is therefore to move the date into a separate column. To do this, we reset the index, which turns the old index (the dates) into a column named index. This column must then be renamed and converted to the correct data type. With that, the cleanup is complete (Fig. 3).

df_grouped.reset_index(level=0, inplace=True)
df_grouped.rename(columns={'index': 'Date'}, inplace=True)
df_grouped['Date'] = pd.to_datetime(df_grouped['Date'])

Fig. 2: Overview of the columns in the DataFrame

 

Fig. 3: The cleaned DataFrame

 

The fact that the column axis is still labeled Country/Region won't bother us from now on, because the final CSV file is saved without an index anyway.

df_grouped.to_csv('../data/worldwide_timeseries.csv', index=False)

This means that the time-series data are adjusted and can be used for each country. If, for example, we only want to use the data for Germany, a new DataFrame can be created as a copy by selecting the desired columns.

df_germany = df_grouped[['Date', 'Germany']].copy()

Packed into a function we get the source code from Listing 1 for the cleanup of the time-series data.

def clean_and_save_timeseries(df):
    drop_columns = ['Lat',
                    'Long',
                    'Province/State']

    df.drop(columns=drop_columns, inplace=True)

    df_grouped = df.groupby(['Country/Region'], as_index=False).sum()
    df_grouped = df_grouped.set_index('Country/Region').transpose()
    df_grouped.reset_index(level=0, inplace=True)
    df_grouped.rename(columns={'index': 'Date'}, inplace=True)
    df_grouped['Date'] = pd.to_datetime(df_grouped['Date'])

    df_grouped.to_csv('../data/worldwide_timeseries.csv', index=False)

 

The function expects to receive the DataFrame to be cleaned as a parameter. Furthermore, all steps described above are applied and the CSV file is saved accordingly.

 

Clean up case numbers per country

In the repository of Johns Hopkins University you can find more CSVs with case numbers per country, divided into provinces/states. Additionally, the administrations for North America and the geographic coordinates are listed. With this data, we can generate an overview on a world map. The adjustment is less complex. As with the cleanup of the temporal data, we read the CSV into a pandas DataFrame and look at the first lines (Fig. 4).

import pandas as pd

df = pd.read_csv('04-13-2020.csv')
df.head()

Fig. 4: The original DataFrame

 

In addition to the actual case numbers, information on provinces/states, geo-coordinates and other metadata are included that we do not need in the dashboard. Therefore we remove the columns from our DataFrame.

df.drop(columns=['FIPS','Lat', 'Long_', 'Combined_Key', 'Admin2', 'Province_State'], inplace=True)

To get the summed up figures for the respective countries, we group the data by Country_Region, assign it to a new DataFrame and sum it up.

df_cases = df.groupby(['Country_Region'], as_index=False).sum()

We have thus completed the clean-up operation. We will save the cleaned CSV for later use. Packaged into a function, the steps look like those shown in Listing 2.

def clean_and_save_worldwide(df):
    drop_columns = ['FIPS',
                    'Lat',
                    'Long_',
                    'Combined_Key',
                    'Admin2',
                    'Province_State']

    df.drop(columns=drop_columns, inplace=True)

    df_cases = df.groupby(['Country_Region'], as_index=False).sum()
    df_cases.to_csv('Total_cases_wordlwide.csv')

The function receives the DataFrame and applies the steps described above. At the end a cleaned CSV file is saved.

 

Clean up case numbers for Germany

In addition to the data from Johns Hopkins University, we want to use the data from the RKI with case numbers from Germany, broken down by federal state. This will allow us to create a detailed view of the German case numbers. The RKI does not make this data available as CSV in a repository; instead, the figures are displayed in an HTML table on a separate web page and updated daily. pandas provides the function read_html() for such cases. If you pass a URL to a web page, it is loaded and parsed. All tables found are returned as DataFrames in a list. I do not recommend accessing the web page directly and reading the tables. On the one hand, there are pages (as well as those of the RKI) which prevent this, on the other hand, the requests should be kept low, especially during development. For our purposes we therefore store the website locally with a wget https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html. To ensure that the examples work consistently, this page is part of the article’s repository. pandas doesn’t care whether we read the page remotely or pass it as a file path.

import pandas as pd
df = pd.read_html('../Fallzahlen.html', decimal=',', thousands='.')

Since the numbers on the web page are formatted, we pass some additional information to read_html: the decimal separator is a comma and the thousands separator is a dot. pandas can thus infer the correct data types when reading the data. To see how many tables were found, we check the length of the list with a simple len(df). In this case it returns 1, which means that exactly one parseable table was found. We store this DataFrame in a new variable for further cleanup:

df_de = df[0]

Fig. 5: Case numbers from Germany in the unadjusted dataframe

 

Since the table (Fig. 5) has column headings that are difficult to process, we rename them. The number of columns is known, so we do this step pragmatically:

df_de.columns = ['Bundesland', 'Anzahl', 'diff', 'Pro_Tsd', 'Gestorben']

Note: The number of columns must match exactly. Should the format of the table change, this must be adjusted. This makes the DataFrame look much more usable. We don’t need the column of the previous day’s increments in the dashboard, so it will be removed.

Furthermore, the last line of the table contains the grand total. It is also not necessary and will be removed. We directly access the index (16) of the line and remove it. This gives us our final dataframe with the numbers for the individual states of Germany.

df_de.drop(columns=['diff'], index=[16], inplace=True)

We will now save this data in a new CSV for further use.

df_de.to_csv('cases_germany_states.csv')

A resulting function looks as follows:

def clean_and_save_german_states(df):
    df.columns = ['Bundesland', 'Anzahl', 'diff', 'Pro_Tsd', 'Gestorben']
    df.drop(columns=['diff'], index=[16], inplace=True)
    df.to_csv('cases_germany_states.csv')

As before, the function expects the dataframe as a transfer value. So we have cleaned up all existing data and saved it in new CSV files. In the next step, we will start developing the diagrams with Plotly.

 

Creating the visualizations

Since we will later use Dash from Plotly to develop the dashboard, we will create the desired visualizations in advance in a Jupyter notebook. Plotly is a library for creating interactive diagrams and maps for Python, R and JavaScript. The output of the diagrams is done automatically with Plotly.js. This gives all diagrams additional functions (zoom, save as PNG, etc.) without creating them yourself. Before that we have to install some more required modules.

# Anaconda users
conda install -c plotly plotly

# Pip
pip install plotly

In order to create diagrams with Plotly as quickly and easily as possible, we will use Plotly Express, which is now part of Plotly. Plotly Express is the easy-to-use high-level interface to Plotly, which works with “tidy” data and considerably simplifies the creation of diagrams. Since we are working with pandas DataFrames again, we import pandas and Plotly Express at the beginning.

import pandas as pd
import plotly.express as px

Starting with the presentation of the development of infections over time in a line chart, we will import the previously cleaned and saved dataframe.

df_ts = pd.read_csv('../data/worldwide_timeseries.csv')

Thanks to the data cleanup, creating a line chart with Plotly is very easy:

fig_ts = px.line(df_ts,
                 x="Date",
                 y="Germany")

The parameters are mostly self-explanatory. The first parameter is the DataFrame to plot. With x='Date' and y='Germany' we determine which columns of the DataFrame are used: the date on the horizontal axis and the country on the vertical axis. To make the diagram understandable, we set further parameters for the title and the axis labels. For the y-axis we define a linear scale; if we want a logarithmic representation, we can set 'log' instead of 'linear'.

fig_ts.update_layout(xaxis_type='date',
                     xaxis={
                         'title': 'Datum'
                     },
                     yaxis={
                         'title': 'Infektionen',
                         'type': 'linear',
                     },
                     title_text='Infektionen in Deutschland')
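
If, as mentioned above, a logarithmic scale is preferred, only the axis type needs to change — a minimal variant:

fig_ts.update_layout(yaxis={'title': 'Infektionen', 'type': 'log'})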

To display diagrams in Jupyter notebooks, we need to tell the show() method that we are working in a notebook (Fig. 6).

fig_ts.show('notebook')

Fig. 6: Progression of infections over time

 

That’s all there is to it. The visualizations can be configured in many ways. For this I refer to Plotly’s extensive documentation.
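
One example of such a configuration, not needed for the dashboard but handy in the notebook, is a range slider below the time axis. This minimal sketch assumes the fig_ts figure created above:

# add a range slider so the time range can be narrowed interactively
fig_ts.update_xaxes(rangeslider_visible=True)
fig_ts.show('notebook')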

Let us continue with the creation of further visualizations. For the dashboard, cases from Germany are to be presented. For this purpose, the likewise cleaned DataFrame is read in and then sorted in ascending order.

df_fs = pd.read_csv('../data/cases_germany_states.csv')
df_fs.sort_values(by=['Anzahl'], ascending=True, inplace=True)

The first representation is a simple horizontal bar chart (Fig. 7). The code can be seen in Listing 3.

fig_fs = px.bar(df_fs, x='Anzahl', 
                     y='Bundesland',
                     hover_data=['Gestorben'],
                     height=600, 
                     orientation='h',
                     labels={'Gestorben':'Bereits verstorben'},
                     template='ggplot2')
 
fig_fs.update_layout(xaxis={
                               'title': 'Anzahl der Infektionen'
                             },
                             yaxis={
                               'title': '',
                             },
                             title_text='Infektionen in Deutschland')

The DataFrame is expected as the first parameter. Then we configure which columns represent the x- and y-axis. With hover_data we determine which additional data is displayed when hovering over a bar. To make these hover labels comprehensible, labels can be used to define how a column should be labeled in the front end. Since "Gestorben" ("died") on its own sounds a bit terse, we relabel it as "Bereits verstorben" ("already deceased"). The parameter orientation specifies whether we want a vertical (v) or horizontal (h) bar chart. With height we set the height to 600 pixels. Finally, we update the layout parameters, as we did for the line chart.

Fig. 7: Number of cases in Germany as a bar chart

 

To make the distribution easier to see, we create a pie chart.

fig_fs_pie = px.pie(df_fs,
                    values='Anzahl',
                    names='Bundesland',
                    title='Verteilung auf Bundesländer',
                    template='ggplot2')

The parameters are mostly self-explanatory. values defines which column of the DataFrame is used, and names, which column contains the labels. In our case these are the names of the federal states (Fig. 8).

Fig. 8: Case numbers Germany as a pie chart

 

Finally, we generate a world map with the distribution of cases per country. We import the adjusted data of the worldwide infections.

df_ww_cases = pd.read_csv('../data/Total_cases_wordlwide.csv')

Then we create a scatter-geo-plot. Simply put, this will draw a bubble for each country, the size of which corresponds to the number of cases. The code can be seen in Listing 4.

fig_geo_ww = px.scatter_geo(df_ww_cases, 
             locations="Country_Region",
             hover_name="Country_Region",
             hover_data=['Confirmed', 'Recovered', 'Deaths'],
             size="Confirmed",
             locationmode='country names',
             text='Country_Region',
             scope='world',
             labels={
               'Country_Region':'Land',
               'Confirmed':'Bestätigte Fälle',
               'Recovered':'Wieder geheilt',
               'Deaths':'Verstorbene',
             },
             projection="equirectangular",
             size_max=35,
             template='ggplot2',)

The parameters are somewhat more extensive, but no less easy to understand. In principle, it is a mapping of the fields from the DataFrame. The parameters labels and hover_data have the same functions as before. With locations we specify which column of our DataFrame contains the locations/countries. So that Plotly Express knows how to place them on a world map, we set locationmode to 'country names'; Plotly can then do the assignment at country level without requiring exact geo-coordinates. text determines the heading of the mouseover tooltips. The size of the bubbles is calculated from the confirmed cases in the DataFrame, which we pass to size. We can define the maximum size of the bubbles with size_max (in this case 35 pixels). With scope we control the focus of the world map; possible values are 'usa', 'europe', 'asia', 'africa', 'north america', and 'south america'. The map is then not only focused on that region but also limited to it. The appropriate labels and other meta parameters for the display are applied when the layout is updated:

fig_geo_ww.update_layout(
    title_text = 'Bestätigte Infektionen weltweit',
    title_x = 0.5,
    geo = dict(
        showframe = False,
        showcoastlines = True,
        projection_type = 'equirectangular'
    )
)

geo defines the appearance of the world map. The parameter projection_type is worth mentioning, since it controls the projection used for the map. For the dashboard we use equirectangular, also known as the plate carrée projection. The finished map is shown in Figure 9.
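
As a side note, the same scatter-geo call can be restricted to a single region via scope. The following sketch (not part of the dashboard) reuses the df_ww_cases DataFrame from above and shows only Europe:

fig_geo_eu = px.scatter_geo(df_ww_cases,
                            locations='Country_Region',
                            locationmode='country names',
                            size='Confirmed',
                            size_max=35,
                            scope='europe',
                            template='ggplot2')
fig_geo_eu.show('notebook')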

Fig. 9: Distribution of cases worldwide

 

With this we have created all the visualizations needed for our dashboard. In the next step we come to the dashboard itself. Since Dash comes from Plotly, we can reuse, configure, and add interactivity to our diagrams relatively easily.

 

Creating the dashboard with Dash from Plotly

Dash is a Python framework for building web applications. It is based on Flask, Plotly.js and React.js. Dash is very well suited for creating data visualization applications with highly customized user interfaces in Python. It is especially interesting for anyone working with data in Python. Since we have already created all diagrams with Plotly in advance, the next step to a web-based dashboard is straightforward.

We create an interactive dashboard from the data in our cleaned-up CSV files – without a single line of HTML or JavaScript. We will limit ourselves to basic functions and not go deeper into special features, as this would go beyond the scope of this article; for that, I refer to the excellent Dash documentation and the tutorial contained therein. Dash creates the website from a declarative description written in Python. To work with Dash, we need the Python module dash. It is installed with conda or pip:

$ conda install -c conda-forge dash
# or
$ pip install dash
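
Before we build the actual dashboard, a minimal "hello world" app illustrates the basic structure of every Dash application: create an app object, assign a layout, start the server. This is only an illustrative sketch, not part of the project code:

import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash(__name__)
app.layout = html.Div(children=[
    html.H1('Hello Dash'),
    dcc.Markdown('Layouts are described as nested Python components.')
])

if __name__ == '__main__':
    app.run_server(debug=True)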

For the dashboard we create a Python script called app.py. The first step is to import the required modules. Since we are importing the data with pandas in addition to Dash, we need to import the following packages:

import pandas as pd
import dash
import dash_core_components as dcc
import dash_html_components as html
# Input and Output are needed for the callback in Listing 7
from dash.dependencies import Input, Output
# Module with help functions
import dashboard_figures as mf

Besides the actual module for Dash, we need the core components as well as the HTML components. The former contain all important components we need for the diagrams. HTML components contain the HTML parts. Once the modules are installed, the dashboard can be developed (Figure 10).
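
The listings below assume an app object that is created right after the imports. In the repository this happens roughly as follows; the concrete stylesheet may differ, and Dash also picks up CSS files placed in an assets/ folder next to app.py automatically:

# example stylesheet from the Dash tutorial; treat the URL as a placeholder
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)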

Let's start with the global data. The breakdown is relatively simple: a heading, with two rows below it, each with two columns. On the left we place the world map, with the facts as text to its right. Below that we want a combo box to select one or more countries for display in the line chart underneath, plus the choice between a linear and a logarithmic representation. The figures for Germany are somewhat more extensive and comprise the bar chart and the distribution as a pie chart.

In Dash the layout is described in code. Since we have already generated all diagrams with Plotly Express, we outsource them to an external module with helper functions. I won’t go into this code in detail, because it contains the diagrams we have already created. The code is stored in the GitHub repository. Before the diagrams can be displayed, we need to import the CSV files and have the diagrams created (Listing 5).

df_ww = pd.read_csv('../data/worldwide_timeseries.csv')
df_total = pd.read_csv('../data/Total_cases_wordlwide.csv')
df_germany = pd.read_csv('../data/cases_germany_states.csv')
 
fig_geo_ww = mf.get_wordlwide_cases(df_total)
fig_germany_bar = mf.get_german_barchart(df_germany)
fig_germany_pie = mf.get_german_piechart(df_germany)
 
fig_line = mf.get_lineplot('Germany', df_ww)
 
ww_facts = mf.get_worldwide_facts(df_total)
 
fact_string = '''Weltweit gibt es insgesamt {} Infektionen. 
  Davon sind bereits {} Menschen verstorben und {} gelten als geheilt. 
  Somit sind offiziell {} Menschen aktiv an einer Coronainfektion erkrankt.'''
 
fact_string = fact_string.format(ww_facts['total'], 
                                             ww_facts['deaths'], 
                                             ww_facts['recovered'], 
                                             ww_facts['active'])
 
countries = mf.get_country_names(df_ww)

Further help functions return a list of all countries and aggregated data. For the combo box, a list of options must be created that contains all countries.

dd_options = []
for key in countries:
    dd_options.append({
        'label': key,
        'value': key
    })
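
The same options list can also be written as a one-line list comprehension; both variants produce identical dictionaries for the Dropdown component:

dd_options = [{'label': key, 'value': key} for key in countries]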

That was all the preparation that was necessary. The layout of the web application consists of the components provided by Dash. The complete description of the layout can be seen in Listing 6.

app.layout = html.Div(children=[
  html.H1(children='COVID-19: Dashboard', style={'textAlign': 'center'}),
  html.H2(children='Weltweite Verteilung', style={'textAlign': 'center'}),
  # World and Facts
  html.Div(children=[
 
    html.Div(children=[
 
        dcc.Graph(figure=fig_geo_ww),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '66%'}),
 
    html.Div(children=[
 
      html.H3(children='Fakten'),
      html.P(children=fact_string)
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '33%'})
 
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
 
  # Combobox and Checkbox
  html.Div(children=[
    html.Div(children=[
      # combobox
      dcc.Dropdown(
        id='country-dropdown',
        options=dd_options,
        value=['Germany'],
        multi=True
      ),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '66%'}),
 
    html.Div(children=[
      # Radio-Buttons
      dcc.RadioItems(
        id='yaxis-type',
        options=[{'label': i, 'value': i} for i in ['Linear', 'Log']],
        value='Linear',
        labelStyle={'display':'inline-block'}
      ),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '33%'})
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
  
  # Lineplot and Facts
  html.Div(children=[
    html.Div(children=[
 
      #Line plot: Infections
      dcc.Graph(figure=fig_line, id='infections_line'),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '100%'}),
 
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'}),
 
  # Germany
  html.H2(children='Zahlen aus Deutschland', style={'textAlign': 'center'}),
  html.Div(children=[
    html.Div(children=[
 
      # Barchart Germany
      dcc.Graph(figure=fig_germany_bar),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '50%'}),
 
    html.Div(children=[
 
      # Pie Chart Germany
      dcc.Graph(figure=fig_germany_pie),
 
    ], style={'display': 'flex', 
                 'flexDirection': 'column',
                 'width': '50%'})
  ], style={'display': 'flex',
               'flexDirection': 'row', 
               'flexwrap': 'wrap',
               'width': '100%'})
])

The layout is described by Divs, or the respective wrappers for Divs, as code. There is a wrapper function for each HTML element. These functions can be nested however you like to create the desired layout. Instead of using inline styles, you can also work with your own classes and with a stylesheet. For our purposes, however, the inline styles and the external stylesheet read in at the beginning are sufficient. The approach of a declarative description for layouts has the advantage that we do not have to leave our code and do not have to be an expert in JavaScript or HTML. The focus is on dashboard development. If you look closely, you will find core components for the visualizations in addition to HTML components.

...
# Bar chart Germany
dcc.Graph(figure=fig_germany_bar),
....

The respective diagrams are inserted at these positions. For the line chart, we assign an ID so that we can access the chart later.

dcc.Graph(figure=fig_line, id='infections_line'),

Using the selected values from the combo box and radio buttons, we can adjust the display of the lines. In the combo box we want to provide a selection of countries. Multiple selection is also possible if you want to be able to see several countries in one diagram (see section # combo box in Listing 6). Furthermore, the radio buttons control the division of the y-axis according to linear or logarithmic display (see section # Radio-Buttons in Listing 6). In order to apply the selected options to the chart, we need to create an update function. We annotate this function with an @app.callback decorator (Listing 7).

# Interactive Line Chart
@app.callback(
  Output('infections_line', 'figure'),
  [Input('country-dropdown', 'value'), Input('yaxis-type', 'value')])
def update_graph(countries, axis_type):
  countries = countries if len(countries) > 0 else ['Germany']
  data_value = []
  for country in countries:
    data_value.append(dict(
      x= df_ww['Date'], 
      y= df_ww[country], 
      type= 'lines', 
      name= str(country)
    ))
 
  title = ', '.join(countries)
  title = 'Infektionen: ' + title
  return {
    'data': data_value,
    'layout': dict( 
      yaxis={
        'type': 'linear' if axis_type == 'Linear' else 'log'
      },
      hovermode='closest',
      title = title
    )
  }

The decorator has inputs and outputs. Output defines which element the return value is applied to and which property of that element is updated. In our case we want to update the figure property of the line chart with the ID infections_line. Input describes which input values the function needs; here, these are the values from the combo box and the selection made via the radio buttons. The decorated function receives these values as arguments, and we can work with them.

Since the countries are returned as a list, a line has to be configured for each selected country. In our example this is implemented with a simple for-in loop that picks the required columns from the DataFrame for every country. Finally, we return the new configuration of the diagram; in the returned layout we also set the scale of the y-axis, depending on whether linear or logarithmic was selected. Last of all, we start a development server:

if __name__ == '__main__':
    app.run_server(debug=True)

If we start the Python script, a server is started and the application can be opened in the browser. If one or more countries are selected, they are displayed in the line chart. The development server supports hot reloading: if we change something in the code and save it, the page is automatically reloaded. With this we have created our own coronavirus/COVID-19 dashboard with interactive diagrams. All without having to write HTML, JavaScript, or CSS. Congratulations!

Fig. 10: The finished dashboard

 

 

Summary

Together we have worked through a simple data project, from cleaning up data and creating visualizations to providing a dashboard. Even with small projects, you can see that the cleanup and visualization part takes up most of the time. If the data is poor, the end product is guaranteed not to be any better. The dashboard itself, in the end, is created quickly.

Charts can be created very quickly with Plotly and are interactive out of the box, so they are not just static images. The ability to create dashboards quickly is especially helpful in the prototypical development of larger data visualization projects. There is no need for a large team of data scientists and developers; ideas can be tried out quickly and discarded if necessary. Plotly and Dash are more than sufficient for many purposes. If the data is cleaned up right from the start, all subsequent steps become much easier. If you are dealing with data and its visualization, you should take a look at Plotly and not ignore Dash.

One final remark: in this example I have not used any predictive algorithms or models. On the one hand, the amount of data is too small. On the other hand, predictions on exponentially growing data are always difficult and should be treated with great care, otherwise wrong conclusions will be drawn.

The post Let’s visualize the coronavirus pandemic appeared first on ML Conference.

]]>
Tutorial: Explainable Machine Learning with Python and SHAP https://mlconference.ai/blog/tutorial-explainable-machine-learning-with-python-and-shap/ Tue, 11 Feb 2020 10:26:32 +0000 https://mlconference.ai/?p=16398 Machine learning algorithms can cause the “black box” problem, which means we don’t always know exactly what they are predicting. This may lead to unwanted consequences. In the following tutorial, Natalie Beyer will show you how to use the SHAP (SHapley Additive exPlanations) package in Python to get closer to explainable machine learning results.

The post Tutorial: Explainable Machine Learning with Python and SHAP appeared first on ML Conference.

]]>
In this tutorial, you will learn how to use the SHAP package in Python applied to a practical example step by step.

Motivation

Machine learning is used in a lot of contexts nowadays. We get offers for different products, recommendations on what to watch tonight, and much more. Sometimes the predictions fit our needs and we buy or watch what was offered; sometimes we get the wrong predictions. And sometimes those predictions concern more sensitive contexts than watching a show or buying a certain product, for example when an algorithm that is supposed to automate hiring decisions discriminates against a group. Amazon's recruiters used an algorithm that systematically rejected women before inviting them to job interviews.

To make sure that we know what the algorithms we use actually do, we have to take a closer look at what we are actually predicting. New methods of explainable machine learning make it possible to explore which factors the algorithm relied on to arrive at its predictions. Those methods can lead to a better understanding of what the algorithm is actually doing and whether it emphasizes columns that should not contain much information.

Example

To get a clearer picture of explainable AI, we will go through an example. The dataset used consists of Kickstarter projects and can be downloaded here. Kickstarter is a crowdfunding platform where people can upload a video or description of their planned projects. Anyone who would like to support a project can donate money to it. In this example, I would like to guide you through a machine learning algorithm that predicts whether a given project is going to be successful or not. The interesting part is that we are going to take a look at why the algorithm came to a certain decision.


This explainable machine learning example will be in Python. So, at first we need to import a few packages (Listing 1). pandas, NumPy, scikit-learn, and Matplotlib are frequently used in data science projects. CatBoost is a great tree-based algorithm that can deal excellently with categorical data and has good performance even with the default settings. SHAP is the package by Scott M. Lundberg that implements the approach we will use to interpret the machine learning results.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import catboost as catboost
from catboost import CatBoostClassifier, Pool, cv
import shap

Used versions of the packages:

  • pandas 0.25.0
  • NumPy 1.16.4
  • Matplotlib 3.0.3
  • scikit-learn 0.19.1
  • CatBoost 0.18.1
  • SHAP 0.28.3

Let’s take a look at the downloaded dataset (Figure 1) with kickstarter.head():


Figure 1

The first column is the identification number of each project. The name column is the name of the Kickstarter project. category classifies each project into one of 159 different categories, which can be summed up into 15 main categories. Next is the currency of the project. The column deadline represents the last possible date to support the project. pledged describes the amount of money that was given in order to support the project. state is the state of the project after the deadline date. backers is the number of supporters of the given project. The last column contains the country in which the project was launched.

We are just going to use the states failed and successful, as the other states like canceled do not seem to be very interesting (Listing 2).

kickstarter["state"] = kickstarter["state"].replace({"failed": 0, "successful": 1})

First machine learning model

We are going to start with a machine learning model that takes the following columns as the feature vector (Listing 3):

kickstarter_first = kickstarter[
    [
        "category",
        "main_category",
        "currency",
        "deadline",
        "goal",
        "launched",
        "backers",
        "country",
        "state",
    ]
]

The last column is going to be our target column, therefore y. All the other columns are the feature vector, therefore X (Listing 4).

X = kickstarter_first[kickstarter_first.columns[:-1]]
y = kickstarter_first[kickstarter_first.columns[-1:]]

We are going to split the dataset with the result of having 10% of the dataset as the test dataset, and 90% as the training dataset (Listing 5).

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

As our classifier, I chose CatBoost, as it can deal very well with categorical data (Listing 6). We are going to use the default settings of the algorithm; 150 iterations are enough for our purposes.

model = CatBoostClassifier(
    random_seed=42, logging_level="Silent", iterations=150
)

In order to use CatBoost properly, we need to define which columns are categorical (Listing 7). In our case, those are all columns that have the type object.

categorical_features_indices = np.where(X.dtypes == np.object)[0]
X.dtypes

Figure 2

We can see in Figure 2 that all columns but goal and backers are object columns and should be treated as categorical.

After fitting the model, we see a pretty good result (Listing 8):

model.fit(
    X_train,
    y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_test, y_test),
)

Figure 3

With this first model, we are able to classify 93% of our test dataset correctly (Figure 3).
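
If you prefer an explicit number to reading it off the evaluation output, the accuracy on the test set can also be queried directly; the same call is used for the second model later in this tutorial:

model.score(X_test, y_test)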

Let’s not get too excited and check out what we are actually predicting.

With the package SHAP, we are able to see which factors were mostly responsible for the predictions (Listing 9).

shap_values = model.get_feature_importance(
    Pool(X_test, label=y_test, cat_features=categorical_features_indices),
    type="ShapValues",
)
shap_values = shap_values[:, :-1]


shap.summary_plot(shap_values, X_test, plot_type="bar")

Figure 4

With this bar plot (Figure 4), we can see that the column backers is contributing the most to the prediction!


Oh no! We have put an approximation of the target column (state failed or successful) into our model: if your Kickstarter project has a lot of backers, then it is almost certainly successful, so the backers column practically reveals the outcome we want to predict.

Let’s give it another go. This time we are just going to use the columns that are not going to reveal too much information.

Second machine learning model

In the extended dataset kickstarter_extended = kickstarter.copy(), we are going to implement some feature engineering. Looking through the data, one can see that some projects are using special characters in their name. We are going to implement a new column number_special_character_name that is going to count the number of special characters per name (Listing 10).

kickstarter_extended[
    "number_special_character_name"
] = kickstarter_extended.name.str.count('[-()"#/@;:<>{}`+=~|.!?,]')
kickstarter_extended["word_count"] = kickstarter_extended["name"].str.split().map(len)

Also, we are going to change the deadline and launched columns from the type object to datetime, replacing the original columns. This is done in order to get the new column delta_days, which contains the number of days between the launched date and the deadline date (Listing 11).

kickstarter_extended["deadline"] = pd.to_datetime(kickstarter_extended["deadline"])
kickstarter_extended["launched"] = pd.to_datetime(kickstarter_extended["launched"])

kickstarter_extended["delta_days"] = (
    kickstarter_extended["deadline"] - kickstarter_extended["launched"]
).dt.days

It is also interesting to see whether projects are more successful in certain months. Therefore, we build the new column launched_month. We do the same for the day of the week and the year (Listing 12).

kickstarter_extended["launched_month"] = kickstarter_extended["launched"].dt.month
kickstarter_extended[
    "day_of_week_launched"
] = kickstarter_extended.launched.dt.dayofweek
kickstarter_extended["year_launched"] = kickstarter_extended.launched.dt.year
kickstarter_extended.drop(["deadline", "launched"], inplace=True, axis=1)

The new dataset kickstarter_extended now consists of the following columns (Listing 13):

kickstarter_extended = kickstarter_extended[
    [
        "ID",
        "category",
        "main_category",
        "currency",
        "goal",
        "country",
        "number_special_character_name",
        "word_count",
        "delta_days",
        "launched_month",
        "day_of_week_launched",
        "year_launched",
        "state",
    ]
]

Again, building the test and training dataset (Listing 14).

X = kickstarter_extended[kickstarter_extended.columns[:-1]]
y = kickstarter_extended[kickstarter_extended.columns[-1:]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

Initializing the new model and setting the categorical columns. Afterwards, fitting the model (Listing 15).

model = CatBoostClassifier(
    random_seed=42, logging_level="Silent", iterations=150
)
categorical_features_indices = np.where(X_train.dtypes == np.object)[0]

model.fit(
    X_train,
    y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_test, y_test),
)

model.score(X_test, y_test)

Figure 5

The current model is a little bit worse than the first try (Figure 5), but the assumption is that we are now actually predicting on a more sensible data basis. A quick look at the bar plot, generated by Listing 16 and containing the current feature importances, tells us that goal is in fact the most informative column now (Figure 6).

shap_values_ks = model.get_feature_importance(
    Pool(X_test, label=y_test, cat_features=categorical_features_indices),
    type="ShapValues",
)
shap_values_ks = shap_values_ks[:, :-1]

shap.summary_plot(shap_values_ks, X_test, plot_type="bar")

Figure 6

Until now, the SHAP package has not shown anything that other algorithm libraries cannot do; displaying feature importances has been available in XGBoost and CatBoost for several versions already.
But now let's get SHAP to shine. We enter shap.summary_plot(shap_values_ks, X_test) and receive the following summary plot (Figure 7):


Figure 7

In this summary plot, the order of the columns still represents the amount of information the column contributes to the prediction. Each dot in the visualization represents one prediction. The color reflects the actual feature value of the data point: if the actual value in the dataset was high, the color is pink; blue indicates a low actual value. Grey represents categorical values, which cannot be scaled as high or low (the package maintainers are working on this). The x-axis represents the SHAP value, i.e. the impact on the model output. A model output of 1 equates to the prediction successful; 0 is the prediction that the project is going to fail.

Let’s take a look at the first row of the summary_plot. If a Kickstarter project owner set the goal high (pink dots) the model output was likely 0 (negative SHAP value, not successful). It totally makes sense: if you set the bar for the money goal too high, you cannot reach it. On the other hand, if you set it very low, you are likely to achieve it by asking just a few of your friends. The column word_count also shows a clear relationship: few words in the name description indicate a negative impact on the model output, in the sense that it is likely a failed project. Maybe more words in the name deliver more information, so that potential supporters already get interested after reading just the title. You can see that the other columns are showing a more complex picture as there are pink dots in a mainly blue area and the other way around.

The great thing about the SHAP package is that it gives the opportunity to dive even deeper into the exploration of our model. Namely, it will give us the feature contributions for every single prediction (Listing 17).

shap_values = model.get_feature_importance(
    Pool(X_test, label=y_test, cat_features=categorical_features_indices),
    type="ShapValues",
)
expected_value = shap_values[0, -1]
shap_values = shap_values[:, :-1]
shap.initjs()  

shap.force_plot(expected_value, shap_values[10, :], X_test.iloc[10, :])

Figure 8

In the force plot (Figure 8), we can see the row at position 10 of our test dataset. This was a correct prediction of a successful project. Features that are pink contribute to the model output being higher, that means predicting a success of the Kickstarter project. Blue parts of the visualization indicate a lower model output, predicting a failed project. So the biggest block here is the feature ‘category’, which in this case is Tabletop Games. Therefore, with this particular set of information, the project being a Tabletop Game is the most informative feature for the model. Also, the short period of 28 days of the project being online contributes towards the prediction of success.

Another example is row 33161 of the test dataset, which was a correct prediction of a failed project. As we can see in the force plot (Figure 9), generated by Listing 18, the biggest block is the feature goal. Apparently, the set goal of $25,000 was too high.

shap.force_plot(expected_value, shap_values[33161, :], X_test.iloc[33161, :])

Figure 9

So, now we have gotten a better look at our model with this Kickstarter dataset. One could also explore the false predictions, i.e. the false positives and false negatives, to gain an even deeper understanding of the model and see which features it concentrated on that led to an incorrect output. There are also many other visualizations, such as interaction values. Check out the documentation if you are interested.
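
One of these additional visualizations is the dependence plot, which shows how the SHAP value of a single feature changes with its actual value. Here is a sketch for the goal column, reusing the objects created above:

shap.dependence_plot("goal", shap_values_ks, X_test)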

Outlook

The SHAP package is also useful in other machine learning tasks. For example, image recognition tasks. In Figure 10, you can see which pixels contributed to which model output.


Figure 10. Source: SHAP

SHAP is giving us the opportunity to better understand the model and which features contributed to which prediction. The package allows us to check whether we are taking just features into account which make sense. It is the first step towards preventing models from predicting things based on wrong input features. Thus, machine learning becomes less of a “black box”. This way, we are getting closer to explainable machine learning.

The post Tutorial: Explainable Machine Learning with Python and SHAP appeared first on ML Conference.

]]>
Deep Learning: Not only in Python https://mlconference.ai/blog/deep-learning-not-only-in-python-ml-conference/ Tue, 28 Jan 2020 13:17:04 +0000 https://mlconference.ai/?p=16229 Although there are powerful and comprehensive machine learning solutions for the JVM with frameworks such as DL4J, it may be necessary to use TensorFlow in practice. This can, for example, be the case if a certain algorithm exists only in a TensorFlow implementation and the effort to port the algorithm into another framework is too high. Although you interact with TensorFlow via a Python API, the underlying engine is written in C++. Using the TensorFlow Java wrapper library, you can train and inference TensorFlow models from the JVM without having to rely on Python. Existing interfaces, data sources, and infrastructures can be integrated with TensorFlow without leaving the JVM.

The post Deep Learning: Not only in Python appeared first on ML Conference.

]]>

AI and deep learning are certainly hot topics at the moment and despite some initial setbacks, e.g. in the field of self-driving cars, the potential of deep learning is far from exhausted. But there are still many areas of IT in which the topic is only just gaining momentum. It is therefore particularly important to investigate how deep learning systems can be implemented on the JVM, as Java (both the language and the platform) is still the dominant technology in the enterprise sector.

TensorFlow is one of the most important frameworks in the field of deep learning. Despite the increasing popularity of Keras, TensorFlow is still hard to ignore, especially as the AI top dog Google continues to drive its development forward. This article shows how TensorFlow can be used on the JVM to train TensorFlow models and run inference with them.

What is the combination of TensorFlow and JVM suitable for?

DL4J is the only professional deep learning framework that is really at home on the JVM, so if you would like to use deep learning on the JVM, DL4J is usually the best choice. TensorFlow – like many machine learning frameworks – is mainly used with Python. However, there are reasons to use TensorFlow within a JVM context:

  • You want to use a process which has an implementation in TensorFlow, but not in DL4J, and the porting effort is too high.
  • You are working with a data science team that is used to working with TensorFlow and Python, but the target infrastructure runs on the JVM.
  • The data necessary for the training lies within a Java infrastructure (databases, custom data formats, APIs), and to reach it from Python, existing interface code would have to be ported from Java.

The JVM TensorFlow combination is therefore always useful if an existing Java environment is available and, for personnel or project related reasons, TensorFlow has to be used for deep learning (see the box: “TensorFlow and JVM – always a good idea?”).

TensorFlow and JVM – always a good idea?

Although there may be good reasons for this combination, it is also important to mention what may speak against it. Especially the choice of TensorFlow should be well considered:

  • TensorFlow is not a suitable framework for deep learning or machine learning beginners.
  • TensorFlow is not user-friendly: The API changes quickly and in the mass of instructions it is often not clear which path is the best.
  • TensorFlow isn’t better just because it is made by Google: Deep learning is math, and math is the same for everyone. TensorFlow does not create “smarter” AIs than other frameworks. It is also not faster than the alternatives (but also not “dumber” or slower).

If you want to get into deep learning and stay on the JVM, the use of DL4J is absolutely recommended. Especially for professional enterprise projects, DL4J is a good choice. But also, if you want to look over the fence and try out a bit of Python, it is worth trying out the TensorFlow alternatives. Here, you are currently better off with Keras, thanks to a much more convenient API.

How does TensorFlow work?

Before you start to use a new framework, it is important to take a look at what happens under the hood (see the box: “TensorFlow cheat sheet”). When thinking of TensorFlow, the first things that come to mind are AI and neural networks. But from a technical point of view, TensorFlow is mainly a framework that can execute complex, iterative, parallel calculations on tensors – and that, if possible, GPU-accelerated. Although deep learning is the main field of application for TensorFlow, it can also be used for any other calculation.

A TensorFlow program – or better: the configuration of a calculation – is always structured like a graph in TensorFlow. The nodes of the graph represent operations, such as adding or multiplying, but also loading and saving. Everything that TensorFlow does takes place in the nodes of a previously defined calculation graph. The nodes (operations) of the graph are connected by edges through which the data flows in the form of tensors. Hence the name TensorFlow.

All calculations in TensorFlow take place in a so-called session. In the session, either a finished graph is loaded, or a new graph is created piece by piece by API calls. Special nodes in the graph can contain variables. In order for the graph to work, these must be initialized. Once this has happened and a session with a finished, initialized graph exists, TensorFlow interacts only by calling operations in the graph. What is calculated depends on which output nodes of the graph are queried. Thus, not the entire graph is executed, but only the operations that provide input for the queried node, then their input nodes, and so on, back to the input operations, which must be filled with the necessary input tensors. The important thing with TensorFlow is that all operations are automatically differentiated for the user – this is needed for the training of neural networks. However, the user can safely ignore this detail, since it happens automatically.
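
To make this concrete, here is a minimal sketch of the graph/session model using the TensorFlow 1.x Python API this article is based on: two placeholders, one add operation, executed inside a session.

import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
total = tf.add(a, b, name="total")

with tf.Session() as sess:
    # only the operations needed to produce "total" are executed
    print(sess.run(total, feed_dict={a: 2.0, b: 3.0}))  # 5.0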

Usually, the graph is defined via a Python API. It can be represented graphically with auxiliary programs (Fig. 1), but such representations only serve for debugging; the graph is not programmed graphically as in a visual programming language such as LabVIEW.


Fig. 1: A (small) section of a TensorFlow graph: The numbers on the edges indicate the size of the tensor flowing through them, the arrows indicate the direction.

Although, in most examples, Python is used to interact with TensorFlow, the actual engine is written in C/C++. Therefore, you can use TensorFlow with any language that can call C functions. Thus, you can also perform calculations in TensorFlow from the JVM.

TensorFlow cheat sheet

  • Tensor: The basis for calculations in TensorFlow. A tensor is actually an object from linear algebra, but for our purposes it is completely sufficient to consider a tensor as a multidimensional array (mostly from float or double values, sometimes also char or boolean). TensorFlow uses tensors for everything. All data that TensorFlow consumes, produces, and uses internally is packaged in tensors – hence the name.
  • Graph: The definition of TensorFlow calculation procedures is usually stored in a file called graph.pb in a ProtoBuf binary format, similar to a Java .class file.
  • Training: When training a machine learning method, data and expected results are presented to the algorithm over and over again, whereupon the algorithm adjusts the internal parameters of the model to improve the result. Sometimes this is called “learning”, although it has little to do with human learning.
  • Inference: Depending on the application, you may want to use a machine learning process to classify, predict, translate, create content, and much more. All these applications are summarized under the term inference. Inference therefore means as much as “using a procedure to obtain a result”. This is what we want to do most of the time in live use after training. During inference, a procedure does not learn.
  • Model: the learned parameters of a machine learning procedure, for example, a neural net. This is the result of the learning process and necessary to obtain results (the variable state of the graph, so to speak). It is distributed over several files and stored in one *.index and several *.data files, for example, *.data-0000-of-0001. The first number indicates the consecutive number of the file, the second the total number.
  • Session: the context in which TensorFlow is executed, such as a running JVM instance. In order to use TensorFlow, we need to create a session in which a graph is loaded that is initialized with a model. In Java, a JVM instance must be started in which classes are loaded that are instantiated with constructor parameters.

TensorFlow training and inference with Python

The training of a TensorFlow model with Python (box: “tf.data or feeding?”) can be separated into the following steps:

  • Create the graph, either via several API calls that compose the graph or through loading a *.pb file that contains the graph
  • Create a session for the graph
  • Initialize the graph variables, either by calling a special operation in the graph that fills them with default values or by loading a pre-trained model

After these three steps, we have an executable TensorFlow session with a functioning model. If we want to (further) train it, the following three steps are always executed in a loop until the model has learned enough – either by defining a fixed number of training steps beforehand or by waiting until the training error drops below a certain level:

  • Package input data in arrays and assign them to the input tensors
  • Select output node and pack into a list
  • Execute the session: a special command causes the session to perform the necessary operations to generate the selected output

But where does the training take place? It is done by executing the correct output nodes. There is no difference between training and inference for TensorFlow, mathematical operations are simply performed in the calculation graph. We speak of training if these lead to a neural network that learns a better weighting to solve a problem. However, the API calls for training and any other type of usage are the same.

Our input consists of the data that is to be learned (for example, an image as a two-dimensional tensor and the label “dog” or “cat” in the form of an integer ID in a zero-dimensional tensor) during training. By running the correct nodes, TensorFlow updates some variables in the graph to improve the prediction. The main difference between training and inference is that we periodically save the current state of the graph variables – which are constantly changing – while this is useless for the inference because they remain constant.

tf.data or feeding?

There are two possibilities to load training data into the graph when you train a TensorFlow model in Python: the tf.data API or so-called "feeding", i.e. passing individual data for each calculation step. The tf.data API is implemented internally in C, integrated directly into the graph, and therefore very fast – but also complicated to use and very difficult to debug. The feeding method is easy to use and understand, but it requires Python code at runtime; Python then usually becomes the bottleneck for the expensive graphics card, and valuable GPU capacity goes unused. But which approach do we take in Java? Fortunately, Java is orders of magnitude faster than Python, so here we get the best of both worlds: we can use the easy-to-understand feeding method and still get full performance. That is why we leave the tf.data API out of this article, we just don't need it.

The TensorFlow Java API

Now we can call all operations that are necessary in Python for training or inference via JNI, since TensorFlow is implemented internally in C/C++. Fortunately, we no longer have to bother wrapping the low-level C API with JNI, as Google has already done this for us. The necessary libraries are, as usual, available on Maven Central. There are four different artifacts, all in the group org.tensorflow:

  • tensorflow: A metapackage with dependencies on libtensorflow and libtensorflow_jni; in order to avoid confusion, it should not be used.
  • libtensorflow: The API against which you program in Java; this is the compile and runtime dependency and the central entry point.
  • libtensorflow_jni: Contains the native CPU dependencies for libtensorflow; this artifact is needed at runtime when using a machine without GPU; it contains native code for Windows, Linux and Mac; TensorFlow is completely included, you don’t have to install Python or TensorFlow on the running system.
  • libtensorflow_jni_gpu: The GPU equivalent to libtensorflow_jni; you should use this dependency if you use a computer with NVIDIA GPU and Cuda and CuDNN are installed correctly; it only works under Windows and Linux, there is no GPU support for TensorFlow under macOS.

The version numbers of the Java wrappers correspond to the version of the included TensorFlow release; here we should always use the newest stable release. We only have to pay attention if the code is supposed to be executed on a computer with a GPU (box: "Selecting the GPU to be used"). Not every TensorFlow version supports every CUDA and cuDNN version (CUDA is NVIDIA's platform for general-purpose computing on GPUs, cuDNN is a CUDA-based library for neural networks). We must ensure that the CUDA and TensorFlow versions match. Currently, all TensorFlow versions from 1.13 on support the same CUDA version: 10.0. With a Java-based solution, we already have a great advantage over Python software when installing the finished product: thanks to Maven, our resulting artifact already includes all dependencies. Neither Python nor TensorFlow nor any Python libraries have to be pre-installed, nor do installations have to be managed with a tool like Anaconda.

You should not use the top-level dependency tensorflow, it is better to directly use libtensorflow and one of the *_jni implementations. The reason for this is that the tensorflow artifact has a dependency on libtensorflow_jni (the CPU variant). If we now add libtensorflow_jni_gpu, the CPU-native code is still used and one wonders why everything runs so slowly despite the GPU. The Gradle dependencies for the TensorFlow training on the GPU look like this:

compile "org.tensorflow:libtensorflow:1.14.0"
runtimeOnly "org.tensorflow:libtensorflow_jni_gpu:1.14.0"

The required Java API for training and inference is simple and manageable. Only four classes are important: Graph, Session, Tensor and Tensors. We can now see how to use them correctly by rebuilding the Python-typical training steps in Java.

TensorFlow training in Java

The first step in training is to define the graph. Unfortunately, we have to make the first but only compromise right at the beginning. A graph can also be built step by step using the Java API, but for many node types, the Python API automatically generates many necessary helper nodes that are required for the frictionless use of the graph. In order to build this in Java, we would need a very detailed knowledge of the Python API internals. This step must therefore be done once in advance in Python. We then store the resulting graph file as a Java resource in order to then load it back into the JVM. Saving the current graph in Python is very easy:

with open(filename, 'wb') as f:
  f.write(tf.get_default_graph().as_graph_def().SerializeToString())

Important: Even though the method used here is called SerializeToString(), the result is still a binary file. For convenience, we should also save the initialized variables at this point. Initializing the variables in the graph from the JVM would be easy, but if we always use the procedure shown here, it becomes easier to do transfer learning with complex models later on. In transfer learning, an already existing state of a model is trained further and adapted (Listing 1).

Listing 1

# This Python command creates a node for initialization
init_op = tf.global_variables_initializer()
# The saver is an auxiliary class that stores a model in Python.
saver = tf.train.Saver()
# Save is a graph operation
# and can only be executed in one session
with tf.Session() as sess:
  # Initializing Variables
  sess.run(init_op)
  # Save state
  save_path = saver.save(sess, filename)
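
For orientation, the following sketch shows how the nodes that are later referenced from the JVM (inputs, labels, prediction, total_loss, accuracy, optimize) might be given their names when the graph is defined in Python. The tiny network is purely illustrative; the real architecture depends on the project, only the explicit name arguments matter here:

import tensorflow as tf

# illustrative only: a small classifier whose node names match the Java code below
inputs = tf.placeholder(tf.float32, shape=[None, 784], name="inputs")
labels = tf.placeholder(tf.int32, shape=[None], name="labels")

logits = tf.layers.dense(inputs, 10)
prediction = tf.argmax(logits, axis=1, output_type=tf.int32, name="prediction")

loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
total_loss = tf.identity(loss, name="total_loss")
accuracy = tf.reduce_mean(tf.cast(tf.equal(prediction, labels), tf.float32),
                          name="accuracy")
optimize = tf.train.AdamOptimizer().minimize(loss, name="optimize")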

Now we have saved the graph and the model and can train it in Java and execute the graph. For the sake of brevity, the following examples are in Kotlin but can be transferred to any JVM language:

// create an empty graph
val graph = Graph()
// load the *.pb file - either from a file or from resources
val graphDefBytes = javaClass.getResource(resourceName).readBytes()
// reconstruct the graph from the file
graph.importGraphDef(graphDefBytes)

Now we have loaded the TensorFlow graph into the JVM. In order to do something with it, we need a session:

val session = Session(graph)

We only have to load the latest version of the variable before we can really get started. This can be either the file initially saved in Python or the last state of a previous training, for example, in order to continue a training. The loading of variables is only an operation in the TensorFlow graph, and a string packed into a tensor is needed for this operation. The string contains the name of the *.index file without the suffix, so foo instead of foo.index.

Here, we need the Tensors class for the first time. This class contains helper functions to package Java data types into Tensor objects, automatically making sure that the tensor gets the correct shape. Important for every Tensor object: it holds memory that has been allocated outside the JVM, so it must be closed manually, which is why it implements the AutoCloseable interface. In Java, a separate try { … } finally { tensor.close(); } block must be created for each tensor. Fortunately, this is much easier in Kotlin with use:

Tensors.create(path).use { pathTensor ->
  session.runner().feed("save/Const", pathTensor)
                  .addTarget("save/restore_all")
                  .run()
}

Here we can see all necessary parts of a TensorFlow action on the JVM:

  • A runner is created for the session; this class has a builder API that defines what is supposed to be executed.
  • The input node for the loading and saving (“save/Const”) is filled with the tensor which contains the file name.
  • The target node is defined as the target for loading.
  • The action is executed.

The trick for all operations is to know their names. But since we build the graph ourselves beforehand and can define the name of a node at creation, we can choose them for ourselves. Exceptions are the nodes for loading and saving, which always have the here stated names.

Selecting the GPU to be used

Sometimes we don't want to block all GPUs on systems with multiple GPUs, for example, to run multiple trainings in parallel. For this, we could configure the TensorFlow graph, which normally allocates the GPU or GPUs automatically, so that only one GPU is used. This has the big disadvantage, though, that the graph is then "hard wired" to a certain GPU and can only be used on this GPU. It is much more convenient to show or hide the GPUs via an environment variable before starting the JVM. This can easily be done with the environment variable CUDA_VISIBLE_DEVICES, which takes a comma-separated list of the CUDA devices that should be visible in the current shell. Caution: the device indices start at 0. The following console command, for example, makes only the GPU with index 2 visible to TensorFlow (or other deep learning frameworks):

export CUDA_VISIBLE_DEVICES=2

Now we have already seen all the operations needed to interact with TensorFlow from the JVM. Carrying out a training step is now very easy. Let’s assume that our input is an array of loaded images. The black and white values of the pixels are converted to float values in the range 0-1. Each image belongs to a class defined by an int value, for example, 0 = dog, 1 = cat. Then the input for a batch (multiple images are always trained at once) is a float[][] array, which contains the images, and an int[] array, which contains the classes to learn. A training step can now be executed as follows (Listing 2).

Listing 2

fun train(inputs: Array<FloatArray>, labels: IntArray) {
  withResources {
    val results: List<Tensor<*>> = session.runner()
      .feed("inputs", Tensors.create(inputs).use())
      .feed("labels", Tensors.create(labels).use())
      .fetch("total_loss:0")
      .fetch("accuracy:0")
      .fetch("prediction")
      .addTarget("optimize").run().useAll()
    val trainingError = results[0].floatValue()
    val accuracy      = results[1].floatValue()
    val prediction    = results[2].intValue()
  }
}

We see the same pattern again: a runner is created, the inputs are packaged into tensors, the target ("optimize") is selected, and the action is executed. But now there is something new: we get values back. The names of the nodes that are to be returned are defined with fetch. Some of the names carry a suffix such as ":0": these are nodes with multiple outputs, and the :0 suffix means that the output with index 0 of that node should be returned.

The output is a list of Tensor objects. These can be converted into various primitive types and arrays to make the result available. Important: The Tensor objects created by the API also have to be closed. Normally, the entries in the list would have to be iterated and closed in a finally block. However, this is very inconvenient and difficult to read. Therefore, it is useful to define an extended use API in Kotlin, with which several objects within a block are marked with use or useAll (for lists of Closables), which are then closed safely (Listing 3).

Listing 3

class Resources : AutoCloseable {
  private val resources = mutableListOf<AutoCloseable>()

  fun <T: AutoCloseable> T.use(): T {
    resources += this
    return this
  }

  fun <T: Collection<AutoCloseable>> T.useAll(): T {
    resources.addAll(this)
    return this
  }

  override fun close() {
    var exception: Exception? = null
    for (resource in resources.reversed()) {
      try {
        resource.close()
      } catch (closeException: Exception) {
        if (exception == null) {
          exception = closeException
        } else {
          exception.addSuppressed(closeException)
        }
      }
    }
    if (exception != null) throw exception
  }
}

inline fun <T> withResources(block: Resources.() -> T): T = 
  Resources().use(block)

This useful trick allows you to close all tensors within a TensorFlow call conveniently and safely. Inference in Java now becomes really easy. Remember: every action on the TensorFlow graph is performed by filling input nodes with input tensors and querying the correct output nodes. For our example above this means: the code remains the same, except that we do not set the inputs for the correct solutions (labels), which makes sense because we do not know them yet. In the output, we do not query the nodes for the error calculation and the update of the neural net (total_loss:0, accuracy:0, optimize), so no learning takes place; instead, we only query the result (prediction). Since the input of the solutions is not necessary for calculating the result, everything works just as before: there is no error, because the part of the graph that trains the neural net simply remains inactive.

Practical experiences

The method presented here is not only an interesting experiment, but the author has already used it successfully in several commercial projects. Thereby, several advantages have emerged in practical use:

    • The Java API is fast and efficient: There is no performance loss compared to the pure Python application. On the contrary: Since Java is much faster than Python for tasks like data import and pre-processing, it is even easier to implement a high-performance training process.
    • The training runs absolutely stable over several days, Google’s Java implementation has proven to be very reliable.
    • The deployment of the finished product is much easier than that of Python-based products, since only a Java runtime environment and the correct CUDA drivers need to be present – all dependencies are part of the Java TensorFlow library.
    • TensorFlow’s low-level persistence API (as presented here) is easier to use than many of the “official” methods, such as estimators.

The only real drawback is that part of the project is still Python-based – the definition of the graph. So you need a team that is at least partly at home in the Python world.

The post Deep Learning: Not only in Python appeared first on ML Conference.

]]>