Address Matching with NLP in Python
https://mlconference.ai/blog/address-matching-with-nlp-in-python/

Discover the power of address matching in real estate data management with this comprehensive guide. Learn how to leverage natural language processing (NLP) techniques using Python, including open-source libraries like spaCy and fuzzywuzzy, to parse, clean, and match addresses. From breaking down data silos to geocoding and point-in-polygon searches, this article provides a step-by-step approach to creating a source-of-truth real estate dataset. Whether you work in geospatial analysis, real estate data management, logistics, or compliance, accurate address matching is the key to unlocking valuable insights.

Address matching isn’t always simple in data; we often need to parse and standardize addresses into a consistent format first before we can use them as identifiers for matching. Address matching is an important step in the following use cases:

  1. Geospatial Analysis: Accurate address matching forms the foundation of geospatial analysis, allowing organizations to make informed decisions about locations, market trends, and resource allocation across various industries like retail and media.
  2. Real Estate Data Management: In the real estate industry, precise address matching facilitates property valuation, market analysis, and portfolio management.
  3. Logistics and Navigation: Efficient routing and delivery depend on accurate address matching.
  4. Compliance and Regulation: Many regulatory requirements mandate precise address data, such as tax reporting and census data collection.


Cherre is the leading real estate data management company, and we specialize in accurate address matching for the second use case. Whether you're an asset manager, portfolio manager, or real estate investor, a building represents the atomic unit of all financial, legal, and operating information. However, real estate data lives in many silos, which makes having a unified view of properties difficult. Address matching is an important step in breaking down data silos in real estate. By joining disparate datasets on address, we can unlock many opportunities for further portfolio analysis.

Data Silos in Real Estate

Real estate data usually falls into the following categories: public, third party, and internal. Public data is collected by governmental agencies and made available publicly, such as land registers. The quality of public data is generally not spectacular and updates usually arrive with a delay, but it provides the most comprehensive geographic coverage. Don't be surprised if addresses from public data sources are misaligned and misspelled.

Third party data usually comes from data vendors, whose business models focus on extracting information as datasets and monetizing those datasets. These datasets usually have good data quality and are much more timely, but are limited in geographical coverage. Addresses from data vendors are usually fairly clean compared to public data, but the same property may carry different address designations across vendors. For large commercial buildings with multiple entrances and addresses, this creates an additional layer of complexity.

Lastly, internal data is information collected by the information technology (IT) systems of property owners and asset managers. These systems can cover various functions, from leasing to financial reporting, and are often set up to mirror the business's organizational structures. Depending on governance standards and data practices, the quality of these datasets can vary, and their coverage only encompasses the properties in the owner's portfolio. Addresses in these systems can also vary widely: some systems record addresses at the unit level, while others designate the entire property. These systems may not standardize addresses inherently, which makes it difficult to match property records across multiple systems.

With all these variations in data quality, coverage, and address formats, we can see the need for having standardized addresses to do basic property-level analysis.


Address Matching Using the Parse-Clean-Match Strategy

In order to match records across multiple datasets, the address parse-clean-match strategy works very well regardless of region. By breaking down addresses into their constituent pieces, we have many more options for associating properties with each other. Many of the approaches for this strategy use simple natural language processing (NLP) techniques.


Address Parsing

Before we can associate addresses with each other, we must first parse the address. Address parsing is the process of breaking down each address string into its constituent components. Components in addresses will vary by country.

In the United States and Canada, addresses are generally formatted as the following:

{street_number} {street_name}

{city}, {state_or_province} {postal_code}

{country}

In the United Kingdom, addresses are formatted very similarly to those in the U.S. and Canada, with an additional optional locality designation:

{building_number} {street_name}

{locality (optional)}

{city_or_town}

{postal_code}

{country}

 

French addresses differ slightly from U.K. addresses in the order of the postal code and city:

{building_number} {street_name}

{postal_code} {city}

{country}

 

German addresses follow the French pattern but swap the order of street name and building number:

{street_name} {building_number}

{postal_code} {city}

{country}

 

Despite the slight variations across countries’ address formats, addresses generally have the same components, which makes this an easily digestible NLP problem. We can break down the process into the following steps:

  1. Tokenization: Split the address into its constituent words. This step segments the address into manageable units.
  2. Named Entity Recognition (NER): Identify entities within the address, such as street numbers, street names, cities, postal codes, and countries. This involves training or using pre-trained NER models to label the relevant parts of the address.
  3. Sequence Labeling: Use sequence labeling techniques to tag each token with its corresponding entity.

Let's demonstrate address parsing with a sample Python code snippet using the spaCy library. SpaCy is an open-source software library containing many neural network models for NLP functions. It supports models across 23 different languages and allows data scientists to train custom models for their own datasets. We will demonstrate address parsing using one of spaCy's out-of-the-box models for the address of a historical landmark: David Bowie's Berlin apartment.

 

import spacy

# Load the NER spaCy model
model = spacy.load("en_core_web_sm")

# Address to be parsed
address = "Hauptstraße 155, 10827 Berlin"

# Tokenize and run NER
doc = model(address)

# Extract address components
street_number = ""
street_name = ""
city = ""
state = ""
postal_code = ""

for token in doc:
    if token.is_punct:
        # Skip punctuation such as the comma after the street number
        continue
    if token.ent_type_ == "GPE":  # Geopolitical Entity (City)
        city = token.text
    elif token.ent_type_ == "LOC":  # Location (State/Province)
        state = token.text
    elif token.ent_type_ == "DATE":  # Postal Code
        postal_code = token.text
    else:
        if token.is_digit:
            street_number = token.text
        else:
            street_name += token.text + " "

# Remove the trailing space from the accumulated street name
street_name = street_name.strip()

# Print the parsed address components
print("Street Number:", street_number)
print("Street Name:", street_name)
print("City:", city)
print("State:", state)
print("Postal Code:", postal_code)

Now that we have a parsed address, we can clean each address component.

Address Cleaning

Address cleaning is the process of converting parsed address components into a consistent and uniform format. This is particularly important for any public data with misspelled, misformatted, or mistyped addresses. We want to have addresses follow a consistent structure and notation, which will make further data processing much easier.

To standardize addresses, we need to standardize each component, and how the components are joined. This usually entails a lot of string manipulation. There are many open source libraries (such as libpostal) and APIs that can automate this step, but we will demonstrate the basic premise using simple regular expressions in Python.


import pandas as pd
import re

# Sample dataset with tagged address components
data = {
    'Street Name': ['Hauptstraße', 'Schloß Nymphenburg', 'Mozartweg'],
    'Building Number': ['155', '1A', '78'],
    'Postal Code': ['10827', '80638', '54321'],
    'City': ['Berlin', ' München', 'Hamburg'],
}

df = pd.DataFrame(data)

# Functions with typical necessary steps for each address component
# We uppercase all text for easier matching in the next step

def standardize_street_name(street_name):
    # Remove special characters and abbreviations, uppercase names
    standardized_name = re.sub(r'[^\w\s]', '', street_name)
    return standardized_name.upper()

def standardize_building_number(building_number):
    # Remove any non-alphanumeric characters (although exceptions exist)
    standardized_number = re.sub(r'\W', '', building_number)
    return standardized_number

def standardize_postal_code(postal_code):
    # Make sure we have consistent formatting (i.e. leading zeros)
    return postal_code.zfill(5)

def standardize_city(city):
    # Upper case the city, normalize spacing between words
    return ' '.join(word.upper() for word in city.split())

# Apply standardization functions to our DataFrame
df['Street Name'] = df['Street Name'].apply(standardize_street_name)
df['Building Number'] = df['Building Number'].apply(standardize_building_number)
df['Postal Code'] = df['Postal Code'].apply(standardize_postal_code)
df['City'] = df['City'].apply(standardize_city)

# Finally create a standardized full address (without commas)
df['Full Address'] = df['Street Name'] + ' ' + df['Building Number'] + ' ' + df['Postal Code'] + ' ' + df['City']

Address Matching

Now that our addresses are standardized into a consistent format, we can finally match addresses from one dataset to addresses in another dataset. Address matching involves identifying and associating similar or identical addresses from different datasets. When two full addresses match exactly, we can easily associate the two together through a direct string match.
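As a minimal sketch of an exact match, assume two hypothetical DataFrames, df_a and df_b, that each already carry the standardized 'Full Address' column from the previous step (the DataFrame names and the sample rows below are made up for illustration):

import pandas as pd

# Two hypothetical datasets, each with a standardized 'Full Address' column
df_a = pd.DataFrame({
    'Full Address': ['HAUPTSTRASSE 155 10827 BERLIN', 'MOZARTWEG 78 54321 HAMBURG'],
    'valuation': [1200000, 450000],
})
df_b = pd.DataFrame({
    'Full Address': ['HAUPTSTRASSE 155 10827 BERLIN', 'GOETHEPLATZ 1 80638 MUENCHEN'],
    'lease_count': [12, 3],
})

# An inner join keeps only the records whose full addresses match exactly
exact_matches = df_a.merge(df_b, on='Full Address', how='inner')
print(exact_matches)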

 

When addresses don't match exactly, we will need to apply fuzzy matching on each address component. Below is an example of fuzzy matching on one of the standardized address components, the street name. We can apply the same logic to city and state as well.


from fuzzywuzzy import fuzz

# Sample list of street names from another dataset
street_addresses = [
    "Hauptstraße",
    "Schlossallee",
    "Mozartweg",
    "Bergstraße",
    "Wilhelmstraße",
    "Goetheplatz",
]

# Target address component (we are using street name)
target_street_name = "Hauptstrasse " # Note the different spelling and space 

# Similarity threshold
# Increase this number if too many false positives
# Decrease this number if not enough matches
threshold = 80

# Perform fuzzy matching
matches = []

for address in street_addresses:
    similarity_score = fuzz.partial_ratio(address, target_street_name)
    if similarity_score >= threshold:
        matches.append((address, similarity_score))

matches.sort(key=lambda x: x[1], reverse=True)

# Display matched street name
print("Target Street Name:", target_street_name)
print("Matched Street Names:")
for match in matches:
    print(f"{match[0]} (Similarity: {match[1]}%)")

Up to here, we have solved the problem for properties with the same address identifiers. But what about the large commercial buildings with multiple addresses?

Other Geospatial Identifiers

Addresses are not the only geospatial identifiers in the world of real estate. An address typically refers to the location of a structure or property, often denoting a street name and house number.  There are actually four other geographic identifiers in real estate:

 

  1. A “lot” represents a portion of land designated for specific use or ownership.
  2. A “parcel” extends this notion to a legally defined piece of land with boundaries, often associated with property ownership and taxation.
  3. A “building” encompasses the physical structures erected on these parcels, ranging from residential homes to commercial complexes.
  4. A “unit” is a sub-division within a building, typically used in multi-unit complexes or condominiums. These can be commercial complexes (like office buildings) or residential complexes (like apartments).

 

What this means is that we actually have multiple ways of identifying real estate objects, depending on the specific persona and use case. For example, leasing agents focus on the units within a building for tenants, while asset managers optimize for the financial performance of entire buildings. The nuances of these details are also codified in many real estate software systems (found in internal data), in the databases of governments (found in public data), and across databases of data vendors (found in third party data). In public data, we often encounter lots and parcels. In vendor data, we often find addresses (with or without units). In real estate enterprise resource planning systems, we often find buildings, addresses, units, and everything else in between.

In the case of large commercial properties with multiple addresses, we need to associate various addresses with each physical building. In this case, we can use geocoding and point-in-polygon searches.

Geocoding Addresses

Geocoding is the process of converting addresses into geographic coordinates. The most common form is latitude and longitude. European address geocoding requires a robust understanding of local address formats, postal codes, and administrative regions. Luckily, we have already standardized our addresses into an easily geocodable format.

Many commercial APIs exist for geocoding addresses in bulk, but we will demonstrate the basic approach using a popular Python library, Geopy.

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my_geocoder")
location = geolocator.geocode("1 Canada Square, London")
print(location.latitude, location.longitude)
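To geocode a whole column of standardized addresses rather than a single string, Geopy's RateLimiter helper can throttle the requests so that the free Nominatim service is not overwhelmed. The sketch below assumes the df DataFrame with the 'Full Address' column built in the cleaning step; the one-second delay reflects Nominatim's usage policy, which you should check before geocoding at scale.

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="my_geocoder")

# Send at most one request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Assumes the standardized DataFrame from the cleaning step
df['location'] = df['Full Address'].apply(geocode)
df['latitude'] = df['location'].apply(lambda loc: loc.latitude if loc else None)
df['longitude'] = df['location'].apply(lambda loc: loc.longitude if loc else None)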

 

 

Now that we’ve converted our addresses into latitude and longitude, we can use point-in-polygon searches to associate addresses with buildings.

Point-in-Polygon Search

A point-in-polygon search is a technique to determine if a point is located within the boundaries of a given polygon.

The “point” in a point-in-polygon search refers to a specific geographical location defined by its latitude and longitude coordinates. We have already obtained our points by geocoding our addresses.

The “polygon” is a closed geometric shape with three or more sides, which is usually characterized by a set of vertices (points) connected by edges, forming a closed loop. Building polygons can be downloaded from open source sites like OpenStreetMap or from specific data vendors. The quality and detail of the OpenStreetMap building data may vary, and the accuracy of the point-in-polygon search depends on the precision of the building geometries.

While the concept seems complex, the code for creating this lookup is quite simple. We demonstrate it with a simplified example using 1 Canada Square in London from the geocoding example above.


import json
from shapely.geometry import shape, Point

# Load the GeoJSON data
with open('building_data.geojson') as geojson_file:
    building_data = json.load(geojson_file)

# Latitude and Longitude of 1 Canada Square in Canary Wharf
lat, lon = 51.5049, -0.0195  # Canary Wharf lies just west of the Greenwich meridian, so the longitude is negative

# Create a Point geometry for 1 Canada Square
point_1_canada = Point(lon, lat)

# See if point is within any of the polygons
for feature in building_data['features']:
    building_geometry = shape(feature['geometry'])

    if point_1_canada.within(building_geometry):
        print(f"Point is within this building polygon: {feature}")
        break
else:
    print("Point is not within any building polygon in the dataset.")

Using this technique, we can properly identify all addresses associated with this property.


Summary

Addresses in real life are confusing because they are the physical manifestation of many disparate decisions in city planning throughout the centuries-long life of a city. But using addresses to match across different datasets doesn’t have to be confusing.

Using some basic NLP and geocoding techniques, we can easily associate property-level records across various datasets from different systems. Only through breaking down data silos can we have more holistic views of property behaviors in real estate.

Author Biography

Alyce Ge is a data scientist at Cherre, the industry-leading real estate data management and analytics platform. Prior to joining Cherre, Alyce held data science and analytics roles at a variety of technology companies focusing on real estate and business intelligence solutions. Alyce is a Google Cloud-certified machine learning engineer, a Google Cloud-certified data engineer, and a Triplebyte-certified data scientist. She earned her Bachelor of Science in Applied Mathematics from Columbia University in New York.

 

AI is a Human Endeavor
https://mlconference.ai/blog/ai-human-endeavor/

As AI advances, calls for regulation are increasing. But viable regulatory policies will require a broad public debate. We spoke with Mhairi Aitken, Ethics Fellow at the British Alan Turing Institute, about the current discussions on risks, AI regulation, and visions of shiny robots with glowing brains.

devmio: Could you please introduce yourself to our readers and a bit about why you are concerned with machine learning and artificial intelligence?

Mhairi Aitken: My name is Mhairi Aitken, I’m an ethics fellow at the Alan Turing Institute. The Alan Turing Institute is the UK’s National Institute for AI and data science and as an ethics fellow, I look at the ethical and social considerations around AI and data science. I work in the public policy program where our work is mostly focused on uses of AI within public policy and government, but also in relation to policy and government responses to AI as in regulation of AI and data science. 

devmio: For our readers who may be unfamiliar with the Alan Turing Institute, can you tell us a little bit about it? 

Mhairi Aitken: The national institute is publicly funded, but our research is independent. We have three main aims of our work. First, advancing world-class research and applying that to national and global challenges. 

Second, building skills for the future. That applies both to technical skills and training the next generation of AI and data scientists, and also to developing skills around ethical and social considerations and regulation.

Third, part of our mission is to drive an informed public conversation. We have a role in engaging with the public, as well as policymakers and a wide range of stakeholders to ensure that there’s an informed public conversation around AI and the complex issues surrounding it and clear up some misunderstandings often present in public conversations around AI.


devmio: In your talk at Devoxx UK, you said that it’s important to demystify AI. What exactly is the myth surrounding AI?

Mhairi Aitken: There’s quite a few different misconceptions. Maybe one of the biggest ones is that AI is something that is technically super complex and not something everyday people can engage with. That’s a really important myth to debunk because often there’s a sense that AI isn’t something people can easily engage with or discuss. 

As AI is already embedded in all our individual lives and is having impacts across society, it’s really important that people feel able to engage in those discussions and that they have a say and influence the way AI shapes their lives. 

On the other hand, there are unfounded and unrealistic fears about what risks it might bring into our lives. There’s lots of imagery around AI that gets repeated, of shiny robots with glowing brains and this idea of superintelligence. These widespread narratives around AI come back again and again, and are very present within the public discourse. 

That’s a distraction and it creates challenges for public engagement and having an informed public discussion to feed into policy and regulation. We need to focus on the realities of what AI is and in most cases, it’s a lot less exciting than superintelligence and shiny robots.

devmio: You said that AI is not just a complex technical topic, but something we are all concerned with. However, many of these misconceptions stem from the problem that the core technology is often not well understood by laymen. Isn’t that a problem?

Mhairi Aitken: Most of the players in big tech are pushing this idea of AI being something about superintelligence, something far-fetched, and that's closing down the discussions. It's creating the sense that AI is something more difficult to explain, or more difficult to grasp, than it actually is in order to have an informed conversation. We need to do a lot more work in that space and give people the confidence to engage in meaningful discussions around AI.

And yes, it’s important to enable enough of a technical understanding of what these systems are, how they’re built and how they operate. But it’s also important to note that people don’t need to have a technical understanding to engage in discussions around how systems are designed, how they’re developed, in what contexts they’re deployed, or what purposes they are used for. 

Those are political, economic, and cultural decisions made by people and organizations. Those are all things that should be open for public debate. That’s why, when we talk about AI, it’s really important to talk about it as a human endeavor. It’s something which is created by people and is shaped by decisions of organizations and people. 

That’s important because it means that everyone’s voices need to be heard within those discussions, particularly communities who are potentially impacted by these technologies. But if we present it as something very complex which requires a deep technical understanding to engage with, then we are shutting down those discussions. That’s a real worry for me.


devmio: If the topic of superintelligence as an existential threat to humanity is a distraction from the real problems of AI that is being pushed by Big Tech, then what are those problems?

Mhairi Aitken: A lot of the AI systems that we interact with on a daily basis are opaque systems that make decisions about people’s lives, in everything from policing to immigration, social care and housing, or algorithms that make decisions about what information we see on social media. 

Those systems rely on or are trained on data sets, which contain biases. This often leads to biased or discriminatory outcomes and impacts. Because the systems are often not transparent in the ways that they’re used or have been developed, it makes it very difficult for people to contest decisions that are having meaningful impacts on their lives. 

In particular, marginalized communities, who are typically underrepresented within development processes, are most likely to be impacted by the ways these systems are deployed. This is a really, really big concern. We need to find ways of increasing diversity and inclusiveness within design and development processes to ensure that a diverse set of voices and experiences are reflected, so that we’re not just identifying harms when they occur in the real world, but anticipating them earlier in the process and finding ways to mitigate and address them.

At the moment, there are also particular concerns and risks that we really need to focus on concerning generative AI. For example, misinformation, disinformation, and the ways generative AI can lead to increasingly realistic images, as well as deep fake videos and synthetic voices or clone voices. These technologies are leading to the creation of very convincing fake content, raising real concerns for potential spread of misinformation that might impact political processes. 

It’s not just becoming increasingly hard to spot that something is fake. It’s also a widespread concern that it is increasingly difficult to know what is real. But we need to have access to trustworthy and accurate information about the world for a functioning democracy. When we start to question everything as potentially fake, it’s a very dangerous place in terms of interference in political and democratic processes.

I could go on, but there are very real, concrete examples of the harms AI is already causing today, and they disproportionately impact marginalized groups. A lot of the narratives of existential risk we currently see are coming from Big Tech and are mostly being pushed by privileged or affluent people. When we think about AI or how we address the risks around AI, it's important that we don't center the voices of Big Tech, but the voices of impacted communities.

devmio: A lot of misinformation is already on the internet and social media without the addition of AI and generative AI. So potential misuse on a large scale is of a big concern for democracies. How can western societies regulate AI, either on an EU-level or a global scale? How do we regulate a new technology while also allowing for innovation?

Mhairi Aitken: There definitely needs to be clear and effective regulation around AI. But I think that the dichotomy between regulation and innovation is false. For a start, we don’t just want any innovation. We want responsible and safe innovation that leads to societal benefits. Regulation is needed to make sure that happens and that we’re not allowing or enabling dangerous and harmful innovation practices.

Also, regulation provides the conditions for certainty and confidence for innovation. The industry needs to have confidence in the regulatory environment and needs to know what the limitations and boundaries are. I don’t think that regulation should be seen as a barrier to innovation. It provides the guardrails, clarity, and certainty that is needed. 

Regulation is really important and there are some big conversations around that at the moment. The EU AI Act is likely to set an international standard of what regulation will look like in this regard. It’s going to have a big impact in the same way that GDPR had with data protection. Soon, any organization that’s operating in the EU, or that may export an AI product to the EU, is going to have to comply with the EU AI Act. 

We need international collaboration on this.

devmio: The EU AI Act was drafted before ChatGPT and other LLMs became publicly available. Is the regulation still up to date? How is an institution like the EU supposed to catch up to the incredible advancements in AI?

Mhairi Aitken: It’s interesting that over the last few months, developments with large language models have forced us to reconsider some elements of what was being proposed and developed, particularly around general purpose AI. Foundation models like large language models that aren’t designed for a particular purpose can be deployed in a wide range of contexts. Different AI models or systems are built on top of them as a foundation.

That’s posed some specific challenges around regulation. Some of this is still being worked out. There are big challenges for the EU, not just in relation to foundation models. AI encompasses so many things and is used across all industries, across all sectors in all contexts, which poses a big challenge. 

The UK approach to regulating AI has been quite different from that proposed in the EU: the UK set out a pro-innovation approach to regulation, a set of principles intended to equip existing UK regulatory bodies to grapple with the challenges of AI. It recognized that AI is already being used across all industries and sectors. That means that all regulators have to deal with how to regulate AI in their sectors.

In recent weeks and months in the UK, we have seen an increasing emphasis on regulation and AI, and increased attention to the importance of developing effective regulation. But I have some concerns that this change of emphasis has, at least in part, come from Big Tech. We've seen this in the likes of Sam Altman on his tour of Europe, speaking to European regulators and governments. Many voices talking about the existential risk AI poses come from Silicon Valley. This is now beginning to have an influence on policy discussions and regulatory discussions, which is worrying. It's a positive thing that we're having these discussions about regulation and AI, but we need those discussions to focus on real risks and impacts.

devmio: The idea of existential threat posed by AI often comes from a vision of self-conscious AI, something often called strong AI or artificial general intelligence (AGI). Do you believe AGI will ever be possible?

Mhairi Aitken: No, I don’t believe AGI will ever be possible. And I don’t believe the claims being made about an existential threat. These claims are a deliberate distraction from the discussions of regulation of current AI practices. The claim is that the technology and AI itself poses a risk to humanity and therefore, needs regulation. At the same time, companies and organizations are making decisions about that technology. That’s why I think this narrative is being pushed, but it’s never going to be real. AGI belongs in the realm of sci-fi. 

There are huge advancements in AI technologies and what they’re going to be capable of doing in the near future is going to be increasingly significant. But they are still always technologies that do what they are programmed to do. We can program them to do an increasing number of things and they do it with an increasing degree of sophistication and complexity. But they’re still only doing what they’re programmed for, and I don’t think that will ever change. 

I don’t think it will ever happen that AI will develop its own intentions, have consciousness, or a sense of itself. That is not going to emerge or be developed in what is essentially a computer program. We’re not going to get to consciousness through statistics. There’s a leap there and I have never seen any compelling evidence to suggest that could ever happen.

We’re creating systems that act as though they have consciousness or intelligence, but this is an illusion. It fuels a narrative that’s convenient for Big Tech because it deflects away from their responsibility and suggests that this isn’t about a company’s decisions.

devmio: Sometimes it feels like the discussions around AI are a big playing field for societal discourse in general. It is a playing field for a modern society to discuss its general state, its relation to technology, its conception of what it means to be human, and even metaphysical questions about God-like AI. Is there some truth to this?

Mhairi Aitken: There’s lots of discussions about potential future scenarios and visions of the future. I think it’s incredibly healthy to have discussions about what kind of future we want and about the future of humanity. To a certain extent this is positive.

But the focus has to be on the decisions we make as societies, and not hypothetical far-fetched scenarios of super intelligent computers. These conversations that focus on future risks have a large platform. But we are only giving a voice to Big Tech players and very privileged voices with significant influence in these discussions. Whereas, these discussions should happen at a much wider societal level. 

The conversations we should be having are about how we harness the value of AI as a set of tools and technologies. How do we benefit from them to maximize value across society and minimize the risks of technologies? We should be having conversations with civil society groups and charities, members of the public, and particularly with impacted communities and marginalized communities.

We should be asking what their issues are, how AI can find creative solutions, and where we could use these technologies to bring benefit and advocate for the needs of community groups, rather than being driven by commercial for-profit business models. These models are creating new dependencies on exploitative data practices without really considering if this is the future we want.

devmio: In the Alan Turing Institute’s strategy document, it says that the institute will make great leaps in AI development in order to change the world for the better. How can AI improve the world?

Mhairi Aitken: There are lots of brilliant things that AI can do in the area of medicine and healthcare that would have positive impacts. For example, there are real opportunities for AI to be used in developing diagnostic tools. If the tools are designed responsibly and for inclusive practices, they can have a lot of benefits. There’s also opportunities for AI in relation to the environment and sustainability in terms of modeling or monitoring environments and finding creative solutions to problems.

One area that really excites me is where AI can be used by communities, civil society groups, and charities. At the moment, there’s an emphasis on large language models. But actually, when we think about smaller AI, there’s real opportunities if we see them as tools and technologies that we can harness to process complex information or automate mundane tasks. In the hands of community groups or charities, this can provide valuable tools to process information about communities, advocate for their needs, or find creative solutions.

devmio: Do you have examples of AI used in the community setting?

Mhairi Aitken: For example, community environment initiatives or sustainability initiatives can use AI to monitor local environments, or identify plant and animal species in their areas through image recognition technologies. It can also be used for processing complex information, finding patterns, classifying information, and making predictions or recommendations from information. It can be useful for community groups to process information about aspects of community life and develop evidence needed to advocate for their needs, better services, or for political responses.

A lot of big innovation is in commercially-driven development. This leads to commercial products instead of being about how these tools can be used for societal benefit on a smaller scale. This changes our framing and helps us think about who we’re developing these technologies for and how this relates to different kinds of visions of the future that benefit from this technology.

devmio: What do you think is needed to reach this point?

Mhairi Aitken: We need much more open public conversations and demands about transparency and accountability relating to AI. That’s why it’s important to counter the sensational unrealistic narrative and make sure that we focus on regulation, policy and public conversation. All of us must focus on the here and now and the decisions of companies leading the way in order to hold them accountable. We must ensure meaningful and honest dialogue as well as transparency about what’s actually happening.

devmio: Thank you for taking the time to talk with us and we hope you succeed with your mission to inform the public.

ChatGPT and Artificial General Intelligence: The Illusion of Understanding
https://mlconference.ai/blog/chatgpt-artificial-general-intelligence-illusion-of-understanding/

The introduction of ChatGPT in late 2022 touched off a debate over the merits of artificial intelligence which continues to rage today.

Upon its release, ChatGPT immediately drew praise from tech experts and the media as “mind blowing” and the “next big disruptor,” while a recent Microsoft report praised GPT-4, the latest iteration of OpenAI’s tool, for its ability to solve novel and difficult tasks with “human-level performance” in advanced careers such as coding, medicine, and law. Google responded to the competition by launching its own AI-based chatbot and service, Bard.

On the flip side, ChatGPT has been roundly criticized for its inability to answer simple logic questions or work backwards from a desired solution to the steps needed to achieve it. Teachers and school administrators voiced fears that students would use the tool to cheat, while political conservatives complained that Chat generates answers with a liberal bias. Elon Musk, Apple co-founder Steve Wozniak, and others signed an open letter recommending a six-month pause in AI development, noting “Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable.”

The one factor missing from virtually all these comments – regardless of whether they regard ChatGPT as a huge step forward or a threat to humanity – is a recognition that no matter how impressive, ChatGPT merely gives the illusion of understanding. It is simply manipulating symbols and code samples which it has pulled from the Internet without any understanding of what they mean. And because it has no true understanding, it is neither good nor bad. It is simply a tool which can be manipulated by humans to achieve certain outcomes, depending on the intentions of the users.

It is that difference that distinguishes ChatGPT, and all other AI for that matter, from AGI – artificial general intelligence, defined as the ability of an intelligent agent to understand or learn any intellectual task that a human can. While ChatGPT undoubtedly represents a major advance in self-learning AI, it is important to recognize that it only seems to understand. Like all other AI to date, it is completely reliant on datasets and machine learning. ChatGPT simply appears more intelligent because it depends on bigger and more sophisticated datasets.

 


While some experts continue to argue that at some point in the future, AI will morph into AGI, that outcome seems highly unlikely. Because today’s AI is entirely dependent on massive data sets, there is no way to create a dataset big enough for the resulting system to cope with completely unanticipated situations. In short, AI has no common sense and we simply can’t store enough examples to handle every possible situation. Further, AI, unlike humans, is unable to merge information from multiple senses. So while it might be possible to stitch language and image processing applications together, researchers have not found a way to integrate them in the same seamless way that a child integrates vision, language, and hearing.

For today’s AI to advance to something approaching real human-like intelligence, it must have three essential components of consciousness: an internal mental model of surroundings with the entity at the center; a perception of time which allows for a prediction of future outcome(s) based on current actions; and an imagination so that multiple potential actions can be considered and their outcomes evaluated and chosen. Just like the average three-year-old child, it must be able to explore, experiment, and learn about real objects, interpreting everything it knows in the context of everything else it knows.

To get there, researchers must shift their reliance on ever-expanding datasets to a more biologically plausible system modelled on the human brain, with algorithms that enable it to build abstract “things” with limitless connections and context.

While we know a fair amount about the brain's structure, we still don't know what fraction of our DNA defines the brain or even how much DNA defines the structure of its neocortex, the part of the brain we use to think. If we presume that generalized intelligence is a direct outgrowth of the structure defined by our DNA, and that this structure could be defined by as little as one percent of that DNA, then it is clear that the emergence of AGI depends not on more computing power or larger data sets but on working out the fundamental AGI algorithms.

With that in mind, it seems highly likely that a broader context that is actually capable of understanding and learning gradually could emerge if all of today’s AI systems could be built on a common underlying data structure that allowed their algorithms to begin interacting with each other. As these systems become more advanced, they would slowly begin to work together to create a more general intelligence that approaches the threshold for human-level intelligence, enabling AGI to emerge. To make that happen, though, our approach must change. Bigger and better data sets don’t always win the day.

AI Alignment
https://mlconference.ai/blog/ai-alignment/

At least since the arrival of ChatGPT, many people have become fearful that we are losing control over technology and that we can no longer anticipate the consequences it may have. AI Alignment deals with this problem and with technical approaches to solving it.

Two positions can be identified in the AI discourse. First, “We’ll worry about that later, when the time comes” and second, “This is a problem for nerds who have no ethical values anyway”. Both positions are misguided, as the problem has existed for a long time and, moreover, there are certainly ways of setting boundaries for AI. Rather, there is a lack of consensus on what those boundaries should be.

AI Alignment [1] is concerned with aligning AI to desired goals. The first challenge here is to agree on these goals in the first place. The next difficulty is that it is not (yet?) possible to give these goals directly and explicitly to an AI system.

For example, Amazon developed a system several years ago that helps select suitable applicants for open positions ([2], [3]). For this, resumes of accepted and rejected applicants were used to train an AI system. Although the resumes contained no explicit information about gender, male applicants were systematically preferred. We will discuss how this came about in more detail later. But first, this raises several questions: Is this desirable, or at least acceptable? And if not, how do you align the AI system so that it behaves as you want it to? In other words, how do you successfully engage in AI alignment?

 


For some people, AI Alignment is an issue that will become more important in the future, when machines are so intelligent and powerful that they might think the world would be better without humans [4]. Nuclear war provoked by supervillains is mentioned as another way AI could prove fatal. Whether these fears could ever become realistic remains speculation.

The claims being discussed as part of the EU's emerging AI regulation are more realistic. Depending on the risk realistically inherent in an AI system, different regulations may apply. This is shown in Figure 1, which is based on a presentation for the EU [5]. Four risk levels, from "no risk" to "unacceptable risk", are distinguished. In this context, a system with no significant risk only comes with the recommendation of a "Code of Conduct", while a social credit system, as applied in China [6], is simply not allowed. However, this scheme only comes into effect if there is no specific law.

 

Fig. 1: Regulation based on outgoing risk, adapted from [5]

 

Alignment in Machine Learning Systems

A machine learning system is trained using sample data. It learns to mimic this sample data. In the best and most desirable case, the system can generalize beyond this sample data and recognizes an abstract pattern behind it. If this succeeds, the system can also react meaningfully to data that it has never seen before. Only then can we speak of learning or even a kind of understanding that goes beyond memorization.

This also happened in the example of Amazon’s applicant selection, as shown in a simplified form in Figure 2.

 

Fig. 2: How to learn from examples, also known as supervised learning

 

Here is another example. We use images of dogs and cats as sample data for a system, training it to distinguish between them. In the best case, after training, the system also recognizes cats that are not contained in the training data set. It has learned an abstract pattern of cats, which is still based on the given training data, however.

Therefore, this system can only reproduce what already exists. It is descriptive or representative, but hardly normative. In the Amazon example, it replicates past decisions. These decisions seemed to be that men simply had a better chance of being accepted. So, at least the abstract model would be accurate. Alternatively, perhaps there were just more examples of male applicants, or some other unfortunate circumstance caused the abstract model not to be a good generalization of the example data.

At its best, however, such an approach is analytical in nature. It shows the patterns of our sample data and their backgrounds, meaning that men performed better on job applications. If that matches our desired orientation, there is no further problem. But what if it doesn’t? That’s what we’re assuming and Amazon was of that opinion as well, since they scrapped the system.

 


Pre-assumptions, aka: Priors

It has long been understood how to provide a machine learning system with additional information about our desired alignment, beyond the sample data. This is used to give the system world or domain knowledge in order to guide and potentially simplify or accelerate training. You support the learning process by specifying the domain in which to look for abstract patterns in the data. Therefore, a good abstract pattern can be learned even if the sample data describes it inadequately. In machine learning, data being an inadequate description of the desired abstract model is the rule rather than the exception. Yann LeCun, a celebrity on the scene, vividly elaborates on this in a Twitter thread [7].

This kind of prior assumption is also called a prior. An illustrative example of a prior is linearity. As an explanation, let's take another application example. For car insurance, estimating accident risk is crucial. For this estimation, characteristics of the drivers and vehicles to be insured are collected. These characteristics are correlated with existing data on accident frequency in a machine learning model. The method used for this is called supervised learning, and it is the same as described above.

For this purpose, let us assume that the accident frequency increases linearly with increased distance driven. The more one drives, the more accidents occur. This domain knowledge can be incorporated into the training process. This way, you can hope for a simpler model and potentially even less complex training. In the simplest case, linear regression [8] can be used here, which produces a reasonable model even with little training data or effort. Essentially, training consists of choosing the parameters for a straight line, slope, and displacement, to best fit the training data. Because of its simplicity, the advantage of this model is its good explainability and low resource requirement. This is because a linear relationship, “one-to-one”,  is intellectually easy, and a straight-line equation can be calculated on a modern computer with extremely little effort.

However, it is also possible to describe the pattern contained in the training data and correct it normatively. For this, let us assume that the relationship between age and accident risk is clearly superlinear. Driving ability does not decline in proportion to age, but at a much faster rate. Or, to put it another way, the risk of accidents increases disproportionately with age. That's how it is in the world, and that's what the data reflects. Let's assume that we don't want to give up on this important influence completely. However, we equally want to avoid excessive age discrimination. Therefore, we decide to allow at most a linear dependence. We can support the model and align it with our needs. This relationship is illustrated in Figure 3. The simplest way to implement this is the aforementioned linear regression.

 

Fig. 3: Normative alignment of training outcomes
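As a minimal sketch of this idea, the snippet below fits an ordinary linear regression on synthetic insurance data in which accident risk actually grows quadratically with age. Because the model is linear by construction, the learned age effect can never exceed a linear dependence. The data, feature choice, and scikit-learn usage are illustrative assumptions, not part of any real insurance model.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data: accident risk grows quadratically with age (superlinear),
# plus a linear effect of yearly distance driven and some noise
n = 1000
age = rng.uniform(18, 90, n)
distance = rng.uniform(1000, 50000, n)
risk = 0.002 * age**2 + 0.00001 * distance + rng.normal(0, 0.5, n)

X = np.column_stack([age, distance])

# A plain linear model can, by construction, express at most a linear
# dependence on age -- our normative cap on how strongly age may count
model = LinearRegression().fit(X, risk)

print("Learned coefficients (age, distance):", model.coef_)
print("Intercept:", model.intercept_)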

 

Now, you could also argue that models usually have not only one input, but many, which act in combination on the prediction. Moreover, in our example, the linear relationship between distance driven and accident frequency does not need to be immediately plausible. Don’t drivers with little driving experience have a higher risk? In that case, you could imagine a partial linear relationship. In the beginning, the risk decreases in relation to the distance driven, but then it increases again after a certain point and remains linear. There are also tools for these kinds of complex correlations. In the deep learning field, TensorFlow Lattice [9] offers the possibility of specifying a separate set of alignments for each individual influencing factor. This is also possible in a nonlinear or only partially linear way. 

In addition to these relatively simple methods, there are other ways to exert influence. These include the learning algorithms you choose, the sample data selected, and, especially in deep learning, the neural network's architecture and learning approach. These interventions in the training process are technically challenging and must be performed sparingly and under supervision. Otherwise, depending on the training data, it may become impossible to train a good model with the desired priors.

 

Is all this not enough? Causal Inference

The field of classical machine learning is often accused of falling short. People say that these techniques are suitable for fitting straight lines and curves to sample data, but not for producing intelligent systems that behave as we want them to. In a Twitter thread by Pedro Domingos [10], typical representatives of a more radical position, such as Gary Marcus and Judea Pearl, also weigh in. They agree that without modeling causality (Causal Inference), there will be no really intelligent system and no AI Alignment.

In general, this movement can be accused of criticizing existing approaches but not having any executable systems to show for themselves. Nevertheless, Causal Inference has been a hyped topic for a while now and you should at least be aware of this critical position.

 


ChatGPT, or why 2023 is a special year for AI and AI Alignment.

Regardless of whether someone welcomes current developments in AI or is more fearful or dismissive of them, one thing seems certain: 2023 will be a special year in the history of AI. For the first time, an AI-based system, ChatGPT [11], managed to create a veritable boom of enthusiasm among a broad mass of the population. ChatGPT is a kind of chatbot that you can converse with about any topic, and not just in English. For a general introduction to ChatGPT, there are other articles available.

ChatGPT is simply the most prominent example of a variety of systems already in use in many places. They all share the same challenge: how do we ensure that the system does not issue inappropriate responses? One obvious approach is to check each response from the system for appropriateness. To do this, we can train a system using sample data. This data consists of pairs of texts and a categorization of whether they match our alignment or not. The operation of this kind of system is shown in Figure 4. OpenAI, the producer of ChatGPT, offers this functionality already trained and directly usable as an API [12].

This approach can be applied to any AI setting. The system’s output is not directly returned, but first checked for your desired alignment. When in doubt, a new output can be generated by the same system, another system can be consulted, or the output can be denied completely. ChatGPT is a system that works with probabilities and is able to give any number of different answers to the same input. Most AI systems cannot do this and must choose one of the other options.
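The following sketch shows this general pattern of checking an output before returning it. The moderation_check function is a stand-in for whatever classifier or moderation API is actually used, for example the trained categorizer described above; its name and the placeholder rule inside it are assumptions for illustration, not a real endpoint.

from typing import Callable

def moderation_check(text: str) -> bool:
    """Hypothetical classifier: returns True if the text matches our alignment.
    In practice this would call a trained moderation model or an external API."""
    banned_terms = ["some undesirable phrase"]  # placeholder rule
    return not any(term in text.lower() for term in banned_terms)

def aligned_generate(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> str:
    """Wrap a probabilistic generator: retry until an acceptable answer appears,
    otherwise refuse to answer."""
    for _ in range(max_attempts):
        answer = generate(prompt)
        if moderation_check(answer):
            return answer
    return "Sorry, I cannot answer that."

# Usage with any text generator, e.g. a large language model client:
# print(aligned_generate(my_llm_client.complete, "Tell me about AI alignment."))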

As mentioned at the beginning, we as a society still need to clarify which systems we consider risky. Where do we want to demand transparency or even regulation? Technically, this is already possible for a system like ChatGPT by inserting a kind of watermark [13] into generated text. This works by having the model prefer words from a restricted list; the probability that a human would produce this specific combination by chance is extremely low. This can be used to establish the machine as the author. Additionally, the risk of plagiarism is greatly reduced because the machine, imperceptibly to us, does not write exactly like a human. In fact, OpenAI is considering using these watermarks in ChatGPT [14]. There are also methods that work without watermarks to find out whether a text comes from a particular language model [15]. This only requires access to the model under suspicion. The obvious weakness is that you need to know or guess which model is under suspicion.

 

Fig. 4: A moderation system filters out undesirable categories

 

Conclusion

As AI systems become more intelligent, the areas where they can be used become more important and therefore, riskier. On the one hand, this is an issue that affects us directly today. On the other hand, an AI that wipes out humanity is just material for a science fiction movie.

However, steering these systems toward specific goals can only be achieved indirectly. This is done by selecting the sample data and priors that are introduced into these systems. Therefore, it may also be useful to subject the system's results to further scrutiny. These are issues that are already being discussed at both the policy and the technical level. Neither group, those who see AI as a huge problem and those who think no one cares, is correct.

 

Links & References

[1] https://en.wikipedia.org/wiki/AI_alignment 

[2] https://www.bbc.com/news/technology-45809919

[3] https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G

[4] https://www.derstandard.de/story/2000142763807/chatgpt-so-koennte-kuenstliche-intelligenz-die-menschheit-ausloeschen

[5] https://www.ceps.eu/wp-content/uploads/2021/04/AI-Presentation-CEPS-Webinar-L.-Sioli-23.4.21.pdf

[6] https://en.wikipedia.org/wiki/Social_Credit_System

[7] https://twitter.com/ylecun/status/1591463668612730880?t=eyUG-2osacHHE3fDMDgO3g

[8] https://en.wikipedia.org/wiki/Linear_regression

[9] https://www.tensorflow.org/lattice/overview

[10] https://twitter.com/pmddomingos/status/1576665689326116864

[11] https://openai.com/blog/chatgpt/

[12] https://openai.com/blog/new-and-improved-content-moderation-tooling/

[13] https://arxiv.org/abs/2301.10226 and https://twitter.com/tomgoldsteincs/status/1618287665006403585

[14] https://www.businessinsider.com/openai-chatgpt-ceo-sam-altman-responds-school-plagiarism-concerns-bans-2023-1

[15] https://arxiv.org/abs/2301.11305

The post AI Alignment appeared first on ML Conference.

]]>
Scalable Programming https://mlconference.ai/blog/scalable-programming/ Mon, 22 Aug 2022 08:13:11 +0000 https://mlconference.ai/?p=85032 Java continuously introduces new, useful features. For instance, Java 8 introduced the Stream API, one of the biggest highlights of the past few years. But is aggregating data with the Stream API a panacea? In this article, I’d like to explore if there’s a better alternative for certain cases from a complexity perspective.

The post Scalable Programming appeared first on ML Conference.

]]>
Some of you have probably used the following code in a program in order to spontaneously measure a logic’s runtime:

long start = System.currentTimeMillis();
doSomething();
long time = System.currentTimeMillis() - start;

Clearly, it’s easy to implement and you can quickly check the code’s speed. But there are also some disadvantages. First, the measured values can contain uncertainties as they can be influenced by other processes running on the same machine. Second, you can’t compare the readings with other readings taken from different environments. Declaring that one solution is faster than the other isn’t helpful if they were measured on different machines with different CPUs and RAM. Third, it’s difficult to estimate how the runtime will grow when working with larger amounts of data in the future. It’s become much easier to filter and aggregate data since the Stream API was introduced in Java 8. The Stream API even opens up the possibility to parallelize processing [1]. But do these solutions continue to perform when you need to work with 10 or 100 times the amount of data? Is there a measurement that we can use to answer this question?

Time complexity

Time complexity is a measure for roughly estimating the time efficiency of an algorithm. It focuses on how the runtime increases as the input gets longer. For example, if you iterate over a list of n elements with a for loop, then n and the runtime have a linear relationship. If you nest two for loops that are each executed n times, the runtime grows quadratically with n.
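As a small illustration (the method names are invented for this example), the first loop below scales linearly with the input length, while the nested loops scale quadratically:

// Runtime grows linearly with n: each element is visited exactly once.
static long sum(int[] values) {
  long total = 0;
  for (int v : values) {
    total += v;
  }
  return total;
}

// Runtime grows quadratically with n: every element is compared with every other element.
static int countDuplicatePairs(int[] values) {
  int pairs = 0;
  for (int i = 0; i < values.length; i++) {
    for (int j = i + 1; j < values.length; j++) {
      if (values[i] == values[j]) {
        pairs++;
      }
    }
  }
  return pairs;
}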

Big O notation is a way to represent the relationship between the input length and the runtime. A linear relationship is represented by O(n) and a quadratic relationship by O(n²), where n is the input’s length. If the runtime is constant and independent of the input’s length, then we write O(1). Figure 1 shows typical big O values and how the runtime grows with them as the input’s length increases.

Fig. 1: The relationship between runtime and input length per time complexity.

There are two important rules for representation using big O-notation:

  • Only the term with the highest degree is considered. For example: If the time complexity is n + nlogn + n², simply write O(n²), as the term has the strongest effect on runtime.
  • The coefficient is not considered. For example, the time complexity of 2n², 3n², and ½n² is equal to O(n²).

It’s important to emphasize that time complexity only focuses on scalability. Especially when n is a smaller value, one algorithm may have a longer runtime even if it has a better time complexity than others.

Space complexity

In addition to time complexity, there’s another measure for representing an algorithm’s efficiency: space complexity. It looks at how memory requirements grow as the input’s length increases. When you copy a list with n elements into a new list, the space complexity is O(n) because the need for additional memory increases linearly when you work with a larger input list. If an algorithm only needs a constant amount of memory, regardless of the input length, then the space complexity is O(1).
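A brief sketch of the difference (the methods are illustrative fragments and assume the usual java.util imports):

// Space complexity O(n): the copy needs memory proportional to the input length.
static List<Integer> copyOf(List<Integer> input) {
  return new ArrayList<>(input);
}

// Space complexity O(1): a single accumulator suffices, regardless of the input length.
static long sumOf(List<Integer> input) {
  long sum = 0;
  for (int value : input) {
    sum += value;
  }
  return sum;
}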

There’s often a trade-off relationship between time complexity and space complexity. Depending on the case, when comparing multiple algorithms, it’s important to consider if runtime or memory is more important.

Binary search

As shown in Figure 1, an algorithm with time complexity O(logn) has better time performance than O(n). Binary search is one of the algorithms with this time complexity. It’s applicable when you want to search for a target value from a sorted list. In each operation, the algorithm compares if the target value is in the left or right half of the search area. For example, imagine a dictionary. You probably won’t start on the first page of the dictionary to find the word you’re looking for. You’ll open up to a page in the middle of the book and start searching from there.

Fig. 2: Binary search sequence

Figure 2 shows how the binary search proceeds when searching for the target value 7 in a list of eleven elements. The element marked in red represents the middle of the current operation’s search area.  If the number of elements in the search area is an even number, then it takes the “left” element in the middle. In each operation, you compare if the target value (7, in this case) is less than or greater than the middle. Cut the search area in half until you reach the target value.

log₂n is the maximum number of comparison operations needed to find the target value with binary search, where n is the length of the input list. Let’s take n = 8 as an example. The length of the search area starts with 8 and decreases to 4 after the first operation. After the second operation, it is halved again to 2, and after the third operation, there’s just one value left in the search area. From this example, we can conclude that the number of operations needed is at most the logarithm of 8 to base 2 (log₂8 = 3), because 2³ = 8. In big O notation, we omit the base and write only O(logn).
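To make the procedure concrete, here is a minimal iterative implementation for a sorted int array; in practice you would use the JDK methods mentioned below instead of writing this yourself:

// Returns the index of target in the sorted array, or -1 if it is not present.
static int binarySearch(int[] sorted, int target) {
  int low = 0;
  int high = sorted.length - 1;
  while (low <= high) {
    int mid = low + (high - low) / 2; // middle of the current search area, avoids int overflow
    if (sorted[mid] == target) {
      return mid;
    } else if (sorted[mid] < target) {
      low = mid + 1;   // target can only be in the right half
    } else {
      high = mid - 1;  // target can only be in the left half
    }
  }
  return -1;
}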

In Java, implementations of binary search are found in the methods java.util.Arrays.binarySearch [2] and java.util.Collections.binarySearch [3]. If you work with an array, you can use the method in the class java.util.Arrays; if you work with a list, the method in the class java.util.Collections is applicable.

Sorting algorithm

There are several kinds of sorting algorithms, each with different time and space complexities. In practice, the sorting algorithms typically used are Quicksort, Mergesort, and their variants. On average, the time complexity of these two methods is O(nlogn) [4], [5]. There are also sorting algorithms with better time complexities, but these often impose restrictions on the arrangement of the input list or require special hardware.

The methods for sorting in Java are implemented in java.util.Arrays.sort [2] and java.util.Collections.sort [3]. Since Java 8, the List interface also provides the sort method [6], while the Stream API has the intermediate sorted operation [1]. According to Java documentation, these methods are implemented by default with Quicksort, Timsort, or Mergesort. But this can vary depending on the JDK vendor.

 

Task 1: Searching in a sorted list

The first task is finding the target value in an already sorted list. One potential solution is using the contains method of the List interface (Listing 1).

Listing 1

// input
List<Integer> list = List.of(12, 15, 19, 20, 21, 24);
int target = 19;
// solution
boolean answer = list.contains(target);

This solution’s time complexity is O(n): in the worst case, it searches the whole list until it reaches the end. Another solution is to take advantage of the fact that the input list is already sorted, so binary search is applicable (Listing 2). Collections.binarySearch returns an integer greater than or equal to 0 if the target value is in the list. The space complexity of both solutions is O(1), since they only need a constant amount of memory for the result, regardless of the input values.

Listing 2

// input
List<Integer> list = List.of(12, 15, 19, 20, 21, 24);
int target = 19;
// solution
boolean answer = Collections.binarySearch(list, target) >= 0;

I generated test data with 10³ and 10⁴ elements and used it to compare the runtime for the two solutions. The target values for the search are selected at regular intervals and the runtime was measured multiple times for each target value. The tests were run on a Windows 10 PC with Intel Core i7-1065G7 CPU 1.30GHz and 32 GB RAM. I used the Amazon Corretto 11.0.11 JDK and runtimes were measured with the Java Microbenchmark Harness [7].

Figure 3 shows the results for each length of input as a box plot. Each box plot contains the measurements of calls that were executed with different target values. A box plot graphically represents the distribution of measurement results: the median, the two quartiles (whose interval contains the middle 50% of the data), and the minimum and maximum values of the data (Fig. 4). You can see in Figure 3 that the solution’s runtime in Listing 1 is more scattered than the binary search in Listing 2. This is because the runtime of the solution in Listing 1 heavily depends on where the target value is located in the list. This tendency becomes clearer when comparing results between the test cases n = 10³ and n = 10⁴. Between the two cases, the worst-case runtime of Listing 1 increased significantly compared to Listing 2.

Fig. 3: Running times of the respective solutions for task 1

Fig. 4: Box Plot

Task 2: Searching a range of values in a sorted list

The next task is counting the occurrence of values in a sorted list greater than or equal to a and less than b (a ≤ xᵢ < b), where xᵢ is the respective value in the input list. The requirements are that the input values a and b must always satisfy a ≤ b and there must not be any duplicates in the input list. An intuitive idea is using the intermediate operation filter in the Stream API to collect just the elements in a specific range of values, and ultimately, count the number of elements with the terminal operation count (Listing 3).

Listing 3

// input
List<Integer> list = List.of(12, 15, 19, 20, 21, 24);
int a = 14, b = 19;


// solution
long answer = list.stream().mapToInt(Integer::intValue)
.filter(value -> a <= value && value < b).count();

The time complexity of this solution is O(n), because you must iterate once through the whole list, checking each element to see if its value is in the range. But is it possible to use binary search for this task too? What if we could determine the following two pieces of information:

        • Position of the value a in the input list, if included. Otherwise, the position in the input list where you can insert the value a.
        • Position of the value b in the input list, if included. Otherwise, the position in the input list where you can insert the value b.

The difference between the two calculated positions is the number of elements between the two thresholds. In this solution, the binary search is performed twice, but since we don’t consider coefficients in big O notation, the time complexity is still O(logn). As in Task 1, the space complexity of both solutions is O(1). Listing 4 shows a sample implementation of this solution. Be aware that this code will not work if the input list contains duplicates: as described in the documentation of Collections.binarySearch [3], the method does not guarantee which occurrence will be found if the target value is included more than once in the list.

Listing 4

// input
List<Integer> list = List.of(12, 15, 19, 20, 21, 24);
int a = 14, b = 19;

// solution
int lower = Collections.binarySearch(list, a);
int upper = Collections.binarySearch(list, b);
lower = lower < 0 ? ~lower : lower;
upper = upper < 0 ? ~upper : upper;
int answer = upper - lower;

Collections.binarySearch returns an integer greater than or equal to 0 if the target value is in the list. Otherwise, it returns the negative value -(insertion point) - 1. The insertion point is the position in the list where the target value would have to be inserted in order to keep the list sorted. To calculate the insertion point back from the return value -(insertion point) - 1, you can simply use the bitwise NOT operator ~.
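A short worked example (the values are made up): if the target is missing and its insertion point is 2, binarySearch returns -(2) - 1 = -3; since ~x equals -x - 1 in two’s complement, ~(-3) yields 2 again.

int searchResult = -3;               // returned by binarySearch: -(insertion point) - 1
int insertionPoint = ~searchResult;
System.out.println(insertionPoint);  // prints 2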

Just like with Task 1, Figure 5 plots the running times of the two solutions as a box plot, measured with different input lengths and target values. Again, it’s easy to see that the solution in Listing 4 with binary search has more stable running times than the one in Listing 3.

Fig. 5: Running times of the respective solutions for Task 2

Task 3: Find the largest value in an unsorted list

Now the task is finding the largest value in an unsorted list consisting of integers. One possible solution using the Stream API is to use IntStream and its terminal operation max [8] (Listing 5).

Listing 5

// input
List<Integer> list = List.of(23, 18, 15, 38, 8, 24);

// solution
OptionalInt answer = list.stream().mapToInt(Integer::intValue).max();

This solution has time complexity O(n) and space complexity O(1). A different idea is to sort the list in descending order and return the first value in the list (Listing 6). As previously mentioned, Java provides several ways of sorting a list. To sort in descending order, you must specify a comparator in Java that compares backwards, since by default, the list is sorted in ascending order. You must also not use an immutable list, except when working with the intermediate sorted operation in the Stream API, because the sort methods will process the list directly. For instance, the List.of method returns an immutable list.

Listing 6

// input
List<Integer> list = Arrays.asList(23, 18, 15, 38, 8, 24);

// solution
list.sort(Collections.reverseOrder());
int answer = list.get(0);

This solution has time complexity O(nlogn). However, the solution’s space complexity depends on the method used in the implementation of the sort method. As previously seen in Figure 1, the time complexity O(nlogn) is worse than O(n). In fact, you can see in Figure 6 that as the length of the input list n increases, the runtime of the solution from Listing 6 with sorting increases dramatically—more than that of Listing 5. However, in the next task, we will see that in certain cases, sorting the list is a good idea.

Fig. 6: Average runtimes of the respective solutions for Task 3

Task 4: Find the largest k elements in an unsorted list

In the last task, we saw that sorting isn’t necessary if you only want to know the largest value of an unsorted list. What about needing the k largest values from the list? So, if k = 3, then you must find the three largest values from the list (assuming that k is less than the input length). In this case, it’s no longer enough to iterate through the input list once. But the solution with sorting will continue to work (Listing 7).

Listing 7

// input
List<Integer> list = Arrays.asList(23, 18, 15, 38, 8, 24);
int k = 3;

// solution
list.sort(Collections.reverseOrder());
List<Integer> answer = list.subList(0, k);

This solution can easily be optimized with a priority queue. A priority queue is implemented in Java with a binary heap [9] and is an abstract data structure that can be used to query the smallest value (or largest, depending on which comparator is specified) in the queue. Generally, the time complexity for adding and deleting values is O(logn), and for querying the smallest value it is O(1), where n is the length of the priority queue. In our case, we add the elements of the input list to the priority queue one by one and remove the smallest value from the queue whenever its size exceeds k. Lastly, we transfer the elements remaining in the priority queue into a list. Listing 8 shows a sample implementation of this solution. As a small optimization, the priority queue is instantiated with an initial capacity of k+1, since it contains at most k+1 elements at any time. This solution’s time complexity is O(nlogk), since you insert the n elements of the input list into the priority queue one after another, but the priority queue’s size is limited to k. The space complexity is O(k), because you temporarily keep k elements in the priority queue so that you can eventually create the result list. Figure 7 shows the average running times of the respective solutions when measured with different lengths of the input list n. The larger the difference between n and k, the larger its effect on the runtime.

Fig. 7: Average runtimes of the corresponding solutions for Task 4

Listing 8

// input
List<Integer> list = List.of(23, 18, 15, 38, 8, 24);
int k = 3;

// solution
Queue<Integer> queue = new PriorityQueue<>(k + 1);
for (int v : list) {
  queue.offer(v);
  if (queue.size() > k) {
    queue.poll();
  }
}
List<Integer> answer = Stream.generate(queue::poll)
  .takeWhile(Objects::nonNull).collect(Collectors.toList());

Conclusion

In this article, I summarized the ideas of time and space complexities, and—in particular—I compared how time complexity affects runtime when working with a large amount of data. It’s good practice to keep the two measures in mind and consider other criteria like code readability or maintainability during trade-offs. The Stream API is a very powerful tool for smaller data sets. But basically, the time complexity is O(n) if you filter or search over the entire input and don’t prematurely terminate. If there’s a possibility of the input growing in the future, then from the beginning you should consider if there’s a better solution from the point of view of both complexities.

Links & Literature

[1] https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/stream/Stream.html

[2] https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/Arrays.html

[3] https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/Collections.html

[4] https://www.inf.hs-flensburg.de/lang/algorithmen/sortieren/quick/quick.htm

[5] https://www.inf.hs-flensburg.de/lang/algorithmen/sortieren/merge/mergen.htm

[6] https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/List.html

[7] https://github.com/openjdk/jmh

[8] https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/stream/IntStream.html

[9] https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/PriorityQueue.html

The post Scalable Programming appeared first on ML Conference.

]]>
Why are we doing this anyway? https://mlconference.ai/blog/modularization-and-cognitive-psychology/ Tue, 07 Jun 2022 14:51:39 +0000 https://mlconference.ai/?p=84001 Modularization is frequently discussed, but after some time, the speakers realize that they don’t mean the same thing. Over the last fifty years, computer science has given us a number of good explanations about what modularization is all about—but is that really enough to come to the same conclusions and arguments?

The post Why are we doing this anyway? appeared first on ML Conference.

]]>
I didn’t learn the real reason why modularization is so important until I started studying cognitive psychology. Therefore, in this article, I will bring modularization and cognitive psychology together, giving you the crucial arguments at hand about why modularization actually helps us with software development.

Parnas is still right!

In the last 20 to 30 years, we’ve developed many large software systems in Java, C++, C#, and PHP. These systems hold a lot of business value but they frustrate development teams because they can only be developed with steadily increasing effort. David Parnas’ 50-year-old recipe for finding a way out of this situation is called modularization. It is said that if we have a modular architecture, then we have independent units that can be understood and quickly developed by small teams. Additionally, modular architecture gives us the possibility to deploy individual modules separately, making our architecture scalable. These are precisely the arguments that architects and developers discuss, and yet we always disagree on what exactly we mean by modular, modules, modular architectures, and modularization.

In my doctoral thesis, I dealt with the question of how to structure software systems so people—or our human brains—can find their way around them. This is especially important since development teams spend a lot of time reading and understanding existing code. Fortunately, cognitive psychology has identified various mechanisms that our brain uses to grasp complex structures. One of them provides a perfect explanation for modularization: It’s called chunking. With chunking as a basis, we can describe modularization much better than we can with design principles and heuristics, which are often used as justifications [1]. Additionally, cognitive psychology provides us with two other mechanisms: Hierarchies and schemata, bringing further vital cues for modularization.

Chunking ➔ Modularization

In order for people to manage the amount of information they confront, they must select and group partial information together into larger units. In cognitive psychology, the construction of higher-order abstractions increasingly grouped together is called chunking (Fig. 1). By storing partial information as higher-order knowledge units, short-term memory is freed up and more information can be absorbed.

Fig. 1: Chunking

 

As an example, consider a person working with a telegraph for the first time. At first, they hear transmitted Morse code as short and long tones; they process them as separate units of knowledge. But after a while, they’ll be able to combine the sounds into letters—new units of knowledge—so that they can more quickly understand what’s being transmitted. Some time later, individual letters become words, representing larger units of knowledge, and finally, they become whole sentences.

Developers and architects automatically apply chunking when they’re exposed to new software. Program text is read in detail, and the lines are grouped into knowledge units and are, therefore, retained. Little by little, the knowledge units are summarized on and on until they achieve an understanding of the program text and structures it embodies.
This approach to programs is called bottom-up program comprehension and is typically used by development teams when they’re unfamiliar with a software system and its application domain and need to gain understanding. Development teams are more likely to use top-down program comprehension when they have knowledge of the application domain and software system. Top-down program comprehension mainly uses two structure-building processes: forming hierarchies and building schemata. We will introduce these in the following sections. 

 


Another form of chunking can be seen in experts. They don’t store new knowledge units individually in short-term memory, but summarize them directly by activating previously-stored knowledge units. However, knowledge units can only be built from other knowledge units that fit together in a way that makes sense to the subject. During experiments with experts and novices, the two groups were presented with word groups from the expert’s knowledge area. Experts were able to remember five times as many terms as the beginners, but only if the word groups contained meaningfully related terms.

These findings have also been verified with developers and architects. Chunking works for software systems as well, but only if the software system’s structure represents meaningfully connected units. Program units that group operations or functions together arbitrarily, so that it isn’t obvious to the development team why they belong together, don’t facilitate chunking. The bottom line is that chunking can only be used when meaningful relationships exist between the chunks.

Modules as coherent units

Therefore, it’s essential that modularization and modular architectures consist of building blocks such as classes, components, modules, and layers grouped together in meaningfully related elements. There are several design principles in computer science that aim to satisfy the requirement of coherent units:

  • Information Hiding:  In 1972, David Parnas was the first person to require that a module should hide exactly one design decision and the data structure for this design decision should be encapsulated in the module (encapsulation and locality). Parnas named this principle Information Hiding [2].
  • Separation of Concerns: In his book "A Discipline of Programming" [3]—which is still worth reading today—Dijkstra wrote that different parts of a larger task should, if possible, be represented in different elements of the solution. This is about decomposing large knowledge units that have multiple tasks. In the refactoring movement, units with too many responsibilities resurfaced as code smells under the name God Class.
  • Cohesion: In the 1970s, Myers elaborated his ideas about design and introduced the cohesion measurement for evaluating cohesion in modules [4]. Coad and Yourdon extended the concept for object orientation [5].
  • Responsibility-driven Design: In the same vein as Information Hiding and cohesion,  Rebecca Wirfs-Brock’s heuristic concept aims to create classes by competencies: A class is a design unit that should satisfy exactly one responsibility and combine only one role [6].
  • Single Responsibility Principle (SRP): The first of Robert Martin’s SOLID principles states that each class should perform just one defined task. Only functions that directly contribute to fulfilling this task should be present in a class. The effect of focusing on one task is that there should never be more than one reason to change a class. At the architectural level, Robert Martin adds the Common Closure Principle: classes should be local to their parent building blocks, so changes will always affect either all of these classes or none of them [7].

All of these principles want to promote chunking through a unit’s internal cohesion. But modularity has even more to offer. According to Parnas, a module should also form a capsule for the inner implementation with its interface.

Modules with modular interfaces

Chunking can be heavily supported by interfaces, if the interfaces—what a surprise—form meaningful units. The unit of knowledge needed for chunking can be prepared in the module’s interface so well that development teams don’t need to gather the chunk by analyzing the inside of the module anymore.

A good coherent interface results when you apply the principles in the last section to the design of the module’s interior as well as its interface [1], [7], [8]:

  • Explicit and encapsulating interface: Modules should make their interfaces explicit. In other words, the module’s task must be clearly identifiable, and internal implementation is abstracted from it.
  • Delegating interfaces and the Law of Demeter: Since interfaces are capsules, services offered in them must be made to enable delegation. True delegation occurs when services at an interface completely take over tasks. Services that return internals to the caller, which then must make further calls to get to its destination, violate the Law of Demeter.
  • Explicit dependencies: By a module’s interface, you should be able to directly recognize which other modules it communicates with. If you fulfill this requirement, then development teams will know which other modules they need to understand or create to work with the module, without having to look into its implementation. Dependency injection fits directly with this basic principle as it causes all dependencies to be injected in a module via the interface.

The goal of all of these principles is interfaces that support chunking. If they’re met, then interfaces will process a unit of knowledge faster. If the basic principles of coupling are also met, then we’ve gained a lot for chunking in program comprehension.

Modules with loose coupling

In order to understand and change an architecture’s module, development teams need an overview of the module itself and its neighboring modules. All modules that the target module works together with are important. The more dependencies there are from one module to another (Fig. 2), the more difficult it becomes to analyze individual participants with the limited capacity of short-term memory and to form suitable knowledge units. Chunking is much easier when there are fewer modules and dependencies in play.

Fig. 2: Strongly coupled classes (left) and packages/directories (right)

 

In computer science, the loose coupling principle starts here [9], [10], [11]. Coupling refers to the degree of dependency between a software system’s modules. The more dependencies in a system, the stronger the coupling. If a system’s modules were developed in accordance with the principles seen in the previous two sections on units and interfaces, then the system should automatically consist of loosely coupled modules. A module performing one related task needs fewer other modules than a module performing many different tasks. If the interface is created in a delegating way according to the Law of Demeter, then the caller only needs this interface. They do not have to move from interface to interface, finally completing their tasks with lots of additional coupling.
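A minimal Java sketch of the difference (all types and methods are invented for this example): the first method digs through the module’s internals and is coupled to three types, while the second relies on a delegating interface and only depends on Order.

import java.math.BigDecimal;

// Hypothetical types, for illustration only.
interface Address { String zipCode(); }
interface Customer { Address address(); }

interface Order {
  Customer customer();       // exposes internals
  String shippingZipCode();  // delegating alternative
}

class ShippingCostCalculator {

  // Violates the Law of Demeter: the caller navigates Order -> Customer -> Address
  // and is coupled to all three types.
  BigDecimal costViaInternals(Order order) {
    return rateForZip(order.customer().address().zipCode());
  }

  // Delegating interface: Order answers the question itself,
  // so the caller only depends on Order.
  BigDecimal costViaDelegation(Order order) {
    return rateForZip(order.shippingZipCode());
  }

  private BigDecimal rateForZip(String zip) {
    return zip.startsWith("1") ? new BigDecimal("4.90") : new BigDecimal("6.90");
  }
}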

Up until now, chunking has helped us look at modularization for the inside and outside of a module and its relationship. Excitingly, the next cognitive mechanism also plays into understanding modularization.

Modularization through patterns

The most efficient cognitive mechanism people use to structure complex relationships are schemata. A schema can be understood as a concept consisting of a combination of abstract and concrete knowledge. On the abstract level, a schema consists of typical properties of the relationships it schematically depicts. On the concrete level, a schema contains a set of examples that represent prototypical manifestations of the schema. For example, each of us has a teacher schema that describes abstract features of teachers, including images of our own teachers as prototypical characteristics.

If we have a schema for a connection in our life, then we can process the questions and problems we’re dealing with much faster than we would without a schema. Let’s look at an example. During an experiment, chess masters and beginners were shown game positions on a chessboard for about five seconds. When it came to sensible game piece placement, the chess masters were able to reconstruct the positions of more than twenty pieces. They saw schemata of positions they knew and stored them in their short-term memory. But the weaker players could only reproduce the position of four or five pieces. The beginners had to memorize the chess pieces’ positions individually. But when the pieces were randomly presented on the chessboard to the experts and laymen, the masters no longer had an advantage. They couldn’t use schemata and thus, they couldn’t more easily remember the game pieces’ distribution, which was meaningless to them.


 

The design and architecture patterns widely used in software development exploit the strength of the human brain to work with schemata. If developers and architects already worked with a pattern and formed a schema from it, then they can recognize and understand program texts and structures designed according to these patterns more quickly. Constructing schemata provides decisive speed advantages for understanding complex structures. This is also why patterns found their way into software development years ago.

Figure 3 shows an anonymized blackboard image I developed with a team to record their patterns. On the right side of the image, the source code in the architecture analysis tool Sotograph is divided into pattern categories—you’ll see there are a lot of green relationships and a few red ones. The red relationships go from bottom to top against the layering created by the patterns. The low amount of red relationships is a very good result and testifies to the fact that the development team uses patterns consistently.

Fig. 3: Class level pattern = pattern language

 

It’s also exciting to see what proportion of the source code can be assigned to patterns and how many patterns the system ultimately contains. If 80% or more of the source code can be assigned to patterns, then I say that the system has a pattern language. Here, the development team created its own language to make it easier to discuss architecture.

Using patterns in the source code is especially important for modular architecture. Remember: for chunking, it’s crucial that we find meaningfully related units that have a common task. How can the modules’ tasks be described if not with patterns? Modularization is deepened and improved with extensive use of patterns if you can recognize which pattern the respective module belongs to and if the patterns are used consistently.

Hierarchies ➔ Modularization

The third cognitive mechanism, hierarchies, also plays an important role in perceiving and understanding complex structures and storing knowledge. People can absorb knowledge well, reproduce it, and navigate it if it’s available in hierarchical structures. Research about learning related word categories, organizing learning materials, text comprehension, text analysis, and text reproduction shows that hierarchies are beneficial. When reproducing lists of terms and texts, the subjects’ memory performance was significantly higher when they were offered decision trees with categorical subordination. Subjects learned content significantly faster with the help of hierarchical chapter structures or thought maps. If hierarchical structures were not available, test subjects tried to arrange the text hierarchically themselves. From these studies, cognitive psychology draws the conclusion that hierarchically ordered content is easier for people to learn and process, and content can be retrieved more efficiently from a hierarchical structure.

Hierarchy formation is supported in programming languages by the containment relationship: classes are contained in packages or directories, packages/directories are contained in other packages/directories, and these finally in projects, modules, and build artifacts. These hierarchies fit our cognitive mechanisms. If the hierarchies are based on the architecture’s patterns, they support us not only through their hierarchical structure but also through the architecture patterns themselves.

Let’s have a look at a bad example and a good example: Imagine that a team specified that a system should consist of four modules, which, in turn, contain some submodules (Fig 4).

Fig. 4: Architecture with four modules

 

This structure provides the development team with an architectural pattern of four top-level modules, each containing additional modules. Now, imagine that this system is implemented in Java and organized in a single Eclipse project due to its size. In this case, you’d expect that the architectural pattern of four modules with submodules should be reflected in the system’s package tree.

Figure 5 shows the anonymized package tree of a Java system where the development team made this exact statement: “Four modules with submodules, that’s our architecture!”.

The diagram in Figure 5 shows packages and arrows. The arrows go from the parent package to its children.

Fig. 5: A poorly implemented planned architectural pattern

 

In fact, the four modules can be found in the package tree. In Figure 5, they’re marked in the module’s colors as seen in Figure 4 (green, orange, purple, and blue). However, two of the modules are distributed over the package tree and their submodules are actually partially sorted under foreign upper packages. This implementation in the package tree isn’t consistent with the pattern that the architecture specified. It leads to confusion for developers and architects. Introducing one package root node each for the orange and purple components would solve this.

 

Figure 6 shows a better mapping of the architecture pattern to the pattern tree. In this system, the architectural pattern is symmetrically transferable to the package tree. Here, developers can quickly navigate using the hierarchical structure and benefit from the architectural pattern.

Fig. 6: A well-implemented architectural pattern

 

If the contained relationship is used correctly, it supports our cognitive mechanism hierarchies. This doesn’t apply to all other kinds of relationships: we can link random classes and interfaces in a source code base by usage relationship and/or inheritance relationship. By doing this, we create intertwined structures (cycles) that aren’t hierarchical in any way. It takes some discipline and effort to use the usage and inheritance relationships hierarchically. If the development team pursues this goal from the beginning, usually, the results are almost cycle-free. If the value of being cycle-free isn’t clear from the beginning, then structures like the one in Figure 7 will emerge.

Fig. 7: Cycle of 242 classes

 

But the desire to achieve freedom from cycles is not an end in itself! It’s not about satisfying some technical structure idea of “cycles must be avoided”. Instead, the goal is to design a modular architecture.


If you make sure that individual building blocks in your design are modular (meaning, they are each responsible for just one task) then cycle-free design and architecture usually emerge of their own accord. A module providing basic functionality should never need functionality from the modules that build upon it. If the tasks are clearly distributed, then it’s obvious which module must use which other module to fulfill its task. A reverse, cyclic relationship won’t arise in the first place.

Summary: Modularization rules

The three cognitive mechanisms of chunking, schemata, and hierarchies give us the background knowledge to use modularization in our discussions clearly and unambiguously. A well-modularized architecture contains modules that facilitate chunking, hierarchies, and schemata. In summary, we can establish the following rules. The modules in a modular architecture must:

  1. form a cohesive, coherent whole within them that’s responsible for exactly one clearly defined task (unit as a chunk),
  2. form an explicit, minimal, and delegating capsule to the outside (interface as a chunk),
  3. be designed according to uniform patterns throughout (pattern consistency) and
  4. be minimally, loosely, and cycle-free coupled with other modules (coupling for chunk separation and hierarchies).

If these mechanisms and their implementation in architecture are clear to the development team, then an important foundation for modularization has been laid.

 

Links & Literature

[1] This article is a revised excerpt from my book: Lilienthal, Carola: “Durable Software Architectures. Analyzing, Limiting, and Reducing Technical Debt”, dpunkt.verlag, 2019.

[2] Parnas, David Lorge: “On the Criteria to be Used in Decomposing Systems into Modules”; in: Communications of the ACM (15/12), 1972

[3] Dijkstra, Edsger Wybe: “A Discipline of Programming”; Prentice Hall, 1976

[4] Myers, Glenford J.: “Composite/Structured Design”; Van Nostrand Reinhold, 1978

[5] Coad, Peter; Yourdon, Edward: “OOD: Objektorientiertes Design”; Prentice Hall, 1994

[6] Wirfs-Brock, Rebecca; McKean, Alan: “Object Design: Roles, Responsibilities, and Collaborations”; Pearson Education, 2002

[7] Martin, Robert Cecil: “Agile Software Development, Principles, Patterns, and Practices”; Prentice Hall International, 2013

[8] Bass, Len; Clements, Paul; Kazman, Rick: “Software Architecture in Practice”; Addison-Wesley, 2012

[9] Booch, Grady: “Object-Oriented Analysis and Design with Applications”; Addison Wesley Longman Publishing Co., 2004

[10] Gamma, Erich; Helm, Richard; Johnson, Ralph E.; Vlissides, John: “Design Patterns. Elements of Reusable Object-Oriented Software”; Addison-Wesley, 1994

[11] Züllighoven, Heinz: “Object-Oriented Construction Handbook”; Morgan Kaufmann Publishers, 2005

The post Why are we doing this anyway? appeared first on ML Conference.

]]>
The Trends Shaping Natural Language Processing in 2021 https://mlconference.ai/blog/the-trends-shaping-natural-language-processing-in-2021/ Tue, 09 Nov 2021 11:26:21 +0000 https://mlconference.ai/?p=82593 According to research from last year, 2020 had an impact on business globally, but NLP was a bright spot for technology investments. That momentum has carried through to this year, but even still, we are just at the tip of the iceberg in terms of what NLP has to offer.

The post The Trends Shaping Natural Language Processing in 2021 appeared first on ML Conference.

]]>
For the last five years, natural language processing (NLP) has played a vital role in analyzing documents and conversations. Its business applications power everything from reviewing patent applications, summarizing and linking scientific papers, accelerating clinical trials, optimizing global supply chains, improving customer support, to recommending sports news. As the technology becomes widespread, much is evolving from enterprise investments in NLP software to common use cases.

According to research from last year, 2020 had an impact on business globally, but NLP was a bright spot for technology investments. That momentum has carried through to this year, but even still, we are just at the tip of the iceberg in terms of what NLP has to offer. As such, it’s important to keep tabs on how the industry is evolving, and new research from Gradient Flow achieves that.

The 2021 NLP Industry Survey Report analysis explores organizations with years of history deploying NLP applications in production compared to those just getting started, responses from Technical Leaders versus general practitioners, company size, scale of documents, and geographic regions. The contrast of respondents reflects a comprehensive image of the industry at large, as well as what’s to come.

Following the money is a sure way to keep the pulse on growth, and if finances are any indication, NLP is trending upward — and fast. Similar to last year, NLP budgets are increasing significantly; something that’s continued despite pandemic-driven IT spending setbacks. In fact, 60% of Tech Leaders indicated that their NLP budgets grew by at least 10%, a nearly 10% increase from last year (compared to 53% in 2020). Even more significant, 33% reported a 30% increase, and 15% said their budget more than doubled. This will only grow as the economy continues to stabilize.

As budgets grow — especially in mature organizations, classified as those that have had NLP in production for at least 2 years — there are several use cases driving the uptick in investments. More than half of Tech Leaders singled out named entity recognition (NER) as the primary use case for NLP, followed by document classification (46%). While this is not surprising, instances of more difficult tasks such as Question Answering are moving to the forefront, showing the potential of NLP to become more user-friendly over the next several years.

Not surprisingly, the top three data sources for NLP projects are text fields in databases, files (PDFs, docx, etc.), and online content. Progress is being made in extracting information from all of these data sources. For example, in the healthcare industry, using NLP to extract and normalize a patient’s history, diagnoses, labs, procedures, social determinants, and treatment plan is repeatedly proving useful in improving diagnosis and care. Adding information obtained from natural language documents and notes to other data sources – like structured data or medical imaging – is valuable for providing a more comprehensive picture of each patient.

The healthcare and pharmaceutical industries have been on the forefront of artificial intelligence (AI) and NLP, so their use cases vary slightly from overall industry practices. This is why entity linking / knowledge graphs (41%) and de-identification (39%), along with NER and document classification, were among the top use cases, typical of a highly regulated industry. Financial services is another area that’s gaining traction, as NLP has the ability to parse textual data, while understanding the nuances of industry jargon, numbers, different currencies, and company names and products.

NLP technology doesn’t come without challenges. Accuracy remains a top priority among all NLP practitioners, and with the need to constantly update and tune models, the barriers to entry are still high. Many companies are reliant on data scientists to build models and prevent them from degrading over time. Others default to cloud services, and report that they can be expensive, require data sharing with the cloud provider, and often make it hard or impossible to tune models (hurting accuracy). While 83% of respondents indicated they use at least one of the major cloud providers for NLP, this is mostly done in addition to also using NLP libraries. Tech Leaders cited difficulty tuning models and cost as primary challenges with cloud NLP services.

Fortunately, more tools are becoming widely available to even the playing field. Libraries like Spark NLP, the most popular library among survey respondents and currently used by 53% of the healthcare industry, is democratizing NLP through free offerings, pre-trained models, and no data-sharing requirements. NLP libraries popular within the Python ecosystem — Hugging Face, spaCy, Natural Language Toolkit (NLTK), Gensim, and Flair — are also being used by a majority of practitioners.

Between the numerous NLP libraries and cloud services, growing investments, and innovative new use cases, there is certainly cause to get excited about what’s next for NLP. By tracking and understanding the common practices and roadblocks that exist, we can apply these lessons across the AI industry, and keep moving the recent research breakthroughs into real-world systems that put them to good use.

The post The Trends Shaping Natural Language Processing in 2021 appeared first on ML Conference.

]]>
On pythonic tracks https://mlconference.ai/blog/on-pythonic-tracks/ Mon, 15 Mar 2021 13:00:26 +0000 https://mlconference.ai/?p=81377 Python has established itself as a quasi-standard in the field of machine learning over the last few years, in part due to the broad availability of libraries. It is logical that Oracle did not really like to watch this trend — after all, Java has to be widely used if it wants to earn serious money with its product. Some time ago, Oracle placed its own library Tribuo under an open source license.

The post On pythonic tracks appeared first on ML Conference.

]]>
In principle, Tribuo is an ML system intended to help close the feature gap between Python and Java in the field of artificial intelligence, at least to a certain extent.

According to the announcement, the product, which is licensed under the (very liberal) Apache license, can look back on a history of several years of use within Oracle. This can be seen in the fact that the library offers very extensive functions — in addition to the generation of “own” models, there are also interfaces for various other libraries, including TensorFlow.

The author and editors are aware that a complete introduction to machine learning cannot fit into an article like this. However, because of the importance of the topic, we want to show you a little bit of how you can play around with Tribuo.


 

Modular library structure

Machine learning applications are not usually among the tasks you run on resource-constrained systems — IoT edge systems that run ML payloads usually have a hardware accelerator, such as the SiPeed MAiX. Nevertheless, Oracle’s Tribuo library is offered in a modularized way, so in theory, developers can include only those parts of the project in their solutions that they really need. An overview of which functions are provided in the individual packages can be found under https://tribuo.org/learn/4.0/docs/packageoverview.html.

In this introductory article, we do not want to delve further into modularization, but instead, program a little on the fly. A Tribuo-based artificial intelligence application generally has the structure shown in Figure 1.

Fig. 1: Tribuo strives to be a full-stack artificial intelligence solution (Image source: Oracle).


The figure informs us that Tribuo seeks to process information from A to Z itself. On the far left is a DataSource object that collects the information to be processed by artificial intelligence and converts it into Tribuo’s own storage format called Example. These Example objects are then held in the form of a Dataset instance, which — as usual — moves towards a model that delivers predictions. A class called Evaluator can then make concrete decisions based on this information, which is usually quite general or probability-based.

An interesting aspect of the framework in this context is that many Tribuo classes come with a more or less generic system for configuration settings. In principle, an annotation is placed in front of each attribute whose value is to be configurable:

public class LinearSGDTrainer implements Trainer<Label>, WeightedExamples {
  @Config(description="The classification objective function to use.")
  private LabelObjective objective = new LogMulticlass();

The Oracle Labs Configuration and Utilities Toolkit (OLCUT) described in detail in https://github.com/oracle/OLCUT can read this information from XML files — in the case of the property we just created, the parameterization could be done according to the following scheme:

<config>
  <component name="logistic"
             type="org.tribuo.classification.sgd.linear.LinearSGDTrainer">
    <property name="objective" value="log"/>

The point of this approach, which sounds academic at first glance, is that the behavior of ML systems is strongly dependent on the parameters contained in the various elements. By implementing OLCUT, the developer gives the user the possibility to dump these settings, or return a system to a defined state with little effort.


Tribuo integration

After these introductory considerations, it is time to conduct your first experiments with Tribuo. Even though the library source code is available on GitHub for self-compilation, it is recommended to use a ready-made package for first attempts.

Oracle supports both Maven and Gradle and, in the special case of Gradle, even offers (partial) support for Kotlin. However, we want to work with classic tools in the following steps, which is why we grab an Eclipse instance and have it create a new Maven-based project skeleton by clicking New | New Maven project.

In the first step of the generator, the IDE asks whether you want to load templates called archetypes. Please select the Create a simple project (skip archetype selection) checkbox so that the IDE creates a primitive project skeleton. In the next step, we open the file pom.xml and add the library as a dependency according to the scheme in Listing 1.

 <name>tribuotest1</name>
  <dependencies>
    <dependency>
      <groupId>org.tribuo</groupId>
      <artifactId>tribuo-all</artifactId>
      <version>4.0.1</version>
      <type>pom</type>
    </dependency>
  </dependencies>
</project>

To verify successful integration, we then trigger a rebuild of the project — if you are connected to the Internet, the IDE will inform you that the required components are being downloaded to your workstation.

Given the proximity to the classic ML ecosystem, it should not surprise anyone that the Tribuo examples are by and large provided in the form of Jupyter notebooks — a form of presentation that is widespread, especially in the research field, but that is not suitable for production use, at least in the author’s opinion.

Therefore, we want to rely on classical Java programs in the following steps. The first problem we want to address is classification. In the world of ML, this means that each piece of input information is assigned to one of a group of categories, called classes. In the field of machine learning, a set of standard example collections, so-called sample data sets, has become established. These are prebuilt, unchanging databases that can be used to evaluate different model hierarchies and training levels. In the following steps, we want to rely on the Iris data set provided at https://archive.ics.uci.edu/ml/datasets/iris. Funnily enough, the term iris does not refer to a part of the eye here, but to a genus of plants.

Fortunately, the data set is available in a form that can be used directly by Tribuo. For this reason, the author working under Linux opens a terminal window in the first step, creates a new working directory, and downloads the information from the server via wget:

t@T18:~$ mkdir tribuospace
t@T18:~$ cd tribuospace/
t@T18:~/tribuospace$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data

Next, we add a class to our Eclipse project skeleton from the first step, which takes a Main method. Place in it the code from Listing 2.

import org.tribuo.DataSource;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.csv.CSVLoader;
import java.nio.file.Paths;

public class ClassifyWorker {
  public static void main(String[] args) {
    var irisHeaders = new String[]{"sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"};
    DataSource<Label> irisData =
      new CSVLoader<>(new LabelFactory()).loadDataSource(Paths.get("bezdekIris.data"),
        irisHeaders[4],
        irisHeaders);

This routine seems simple at first glance, but it’s a bit tricky in several respects. First, we are dealing with a class called Label — depending on the configuration of your Eclipse working environment, the IDE may even offer dozens of Label candidate classes. It’s important to make sure you choose the org.tribuo.classification.Label import shown here — a label is a cataloging category in Tribuo.

The syntax starting with var then requires a reasonably current Java version — Tribuo itself needs at least JDK 8, but you are better off with JDK 10 or newer. After all, the var syntax introduced in that version can be found in just about every code sample.

It follows from the logic that — depending on the system configuration — you may have to make extensive adjustments at this point. For example, the author working under Ubuntu 18.04 first had to provide a compatible JDK:

tamhan@TAMHAN18:~/tribuospace$ sudo apt-get install openjdk-11-jdk

Note that Eclipse sometimes cannot find the new installation by itself — in the package used by the author, the correct path was /usr/lib/jvm/java-11-openjdk-amd64/bin/java.

Anyway, after successfully adjusting the Java execution configuration, we are able to compile our application — you may still need to add the NIO package to the Maven configuration, because the Tribuo library relies on this newer I/O library throughout for better performance.

Now that we have our program skeleton up and running, let’s look at it in turn to learn more about the inner workings of Tribuo applications. The first thing we have to deal with — think of Figure 1 on the far left — is data loading (Listing 3).

public static void main(String[] args) {
  try {
    var irisHeaders = new String[]{"sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"};
    DataSource<Label> irisData;

    irisData = new CSVLoader<>(new LabelFactory()).loadDataSource(
      Paths.get("/home/tamhan/tribuospace/bezdekIris.data"),
      irisHeaders[4],
      irisHeaders);

The provided dataset contains comparatively little metadata — you will look for headers and the like in vain. For this reason, we have to pass an array to the CSVLoader class that informs it about the column names of the data set. Passing irisHeaders[4] separately identifies the column that serves as the model’s target variable.

In the next step, we have to take care of splitting our dataset into a training group and a test group. Splitting the information into two groups is quite a common procedure in the field of machine learning. The test data is used to “verify” the trained model, while the actual training data is used to adjust and improve the parameters. In the case of our program, we want to make a 70-30 split between training and other data, which leads to the following code:

var splitIrisData = new TrainTestSplitter<>(irisData, 0.7, 1L);
var trainData = new MutableDataset<>(splitIrisData.getTrain());
var testData = new MutableDataset<>(splitIrisData.getTest());

Attentive readers may wonder at this point why the additional parameter 1L is passed. Tribuo works internally with a random number generator. As with all, or at least most, pseudo-random generators, it can be made to behave deterministically if you set the seed value to a constant. The constructor of the class TrainTestSplitter exposes this seed field — we pass the constant value one here to achieve reproducible behavior of the class.

At this point, we are ready for our first training run. For training machine learning models, a number of standard procedures have emerged that developers usually reach for first. The fastest way to a runnable training system is the LogisticRegressionTrainer class, which comes with a set of sensible default settings:

var linearTrainer = new LogisticRegressionTrainer();
Model<Label> linear = linearTrainer.train(trainData);

The call to the train method makes the framework run the training process and prepares the model to issue predictions. Our next task is therefore to request such a prediction, which is then forwarded to an evaluator. Last but not least, we output its results to the command line:

Prediction<Label> prediction = linear.predict(testData.getExample(0));
LabelEvaluation evaluation = new LabelEvaluator().evaluate(linear, testData);
double acc = evaluation.accuracy();   // overall accuracy as a single number
System.out.println(evaluation.toString());

At this point, our program is ready for a first small test run — Figure 2 shows how the results of the Iris data set present themselves on the author’s workstation.

Fig. 2: It works: Machine learning without Python!


Learning more about elements

Having enabled our little Tribuo classifier to perform classification against the Iris data set in the first step, let’s look in detail at some advanced features and functions of the classes used.

The first interesting feature is inspecting what the TrainTestSplitter class produced. For this purpose, it is enough to place the following code at a convenient point in the main method:

System.out.println(String.format("data size = %d, num features = %d, num classes = %d",
  trainData.size(), trainData.getFeatureMap().size(), trainData.getOutputInfo().size()));

The dataset exposes a set of member functions that provide additional information about the examples it contains. Executing this code reports how many examples, how many features, and how many output classes the dataset holds.
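On the author's machine — with the 70-30 split of the 150-example Iris set and the seed fixed to one — the output should read roughly as follows (the exact figures depend on your split):

data size = 105, num features = 4, num classes = 3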

The LogisticRegressionTrainer class is also only one of several trainers you can use to build the model. If we want to rely on the CART algorithm instead, we adapt the code according to the following scheme:

var cartTrainer = new CARTClassificationTrainer();
Model<Label> tree = cartTrainer.train(trainData);
Prediction<Label> prediction = tree.predict(testData.getExample(0));
LabelEvaluation evaluation = new LabelEvaluator().evaluate(tree,testData);

If you run the program again with the modified classifier, you get additional console output — which shows that Tribuo is also well suited for experimenting with different machine learning methods. Incidentally, given the comparatively simple and unambiguous structure of the dataset, both classifiers arrive at the same result.

Number Recognition

Another dataset widely used in the ML field is the MNIST handwriting sample — a large collection of handwritten digits derived from NIST data. Logically, a model trained on it can then be used to recognize postal codes — a process that saves valuable man-hours and money when sorting letters.

Having just dealt with basic classification on a more or less synthetic dataset, we now want to work with this more realistic information. First, it has to be transferred to the workstation, which is done under Linux by calling wget:

tamhan@TAMHAN18:~/tribuospace$ wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz \
  http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
tamhan@TAMHAN18:~/tribuospace$ wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz \
  http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

The most interesting thing about the commands shown here is that each wget call downloads two files — the image data and the matching labels come in separate archives, and the training set and the test set are downloaded separately as well.

The MNIST dataset is distributed in a data format known as IDX, usually compressed with GZIP to make it easier to handle. If you study the Tribuo library documentation available at https://tribuo.org/learn/4.0/javadoc/org/tribuo/datasource/IDXDataSource.html, you will find that the IDXDataSource class provides a loader intended exactly for this purpose. It even decompresses the data on the fly if required.

It follows that our next task is to integrate the IDXDataSource class into our program workflow (Listing 4).

public static void main(String[] args) {
  try {
    LabelFactory myLF = new LabelFactory();
    DataSource<Label> ids1;
    ids1 = new IDXDataSource<Label>(
      Paths.get("/home/tamhan/tribuospace/t10k-images-idx3-ubyte.gz"),
      Paths.get("/home/tamhan/tribuospace/t10k-labels-idx1-ubyte.gz"), myLF);

From a technical point of view, the IDXDataSource does not differ significantly from the CSVLoader used above. One difference is that we pass two file paths here, because the IDX format keeps the image data and the label information in separate files.

By the way, developers who grew up with other I/O libraries should watch out for a little beginner's trap in the constructor: the Tribuo classes do not expect a string, but a ready-made Path instance. Fortunately, this can be generated with a simple call to Paths.get().

Be that as it may, the next task of our program is the creation of a second DataSource, which takes care of the training data. Its constructor differs — logically — only in the file name for the source information (Listing 5).

DataSource<Label> ids2;
ids2 = new IDXDataSource<Label>(
  Paths.get("/home/tamhan/tribuospace/train-images-idx3-ubyte.gz"),
  Paths.get("/home/tamhan/tribuospace/train-labels-idx1-ubyte.gz"), myLF);
// train on the 60,000-example training files, evaluate on the 10,000-example test files
var trainData = new MutableDataset<>(ids2);
var testData = new MutableDataset<>(ids1);
var evaluator = new LabelEvaluator();

Most data classes included in Tribuo are, at least to some degree, generic with respect to the output type. For us this means that we always have to pass a LabelFactory so that the constructor knows the desired output type and which factory class to use to create it.

The next task is to convert the information into datasets and to create an evaluator as well as a CARTClassificationTrainer training class:

var cartTrainer = new CARTClassificationTrainer();
Model<Label> tree = cartTrainer.train(trainData);

The rest of the code — not printed here for space reasons — is a one-to-one copy of the model evaluation used above. If you run the program on the workstation, you will see the result shown in Figure 3. Don't be surprised if Tribuo takes some time here — the MNIST dataset is considerably larger than the Iris table used above.

Fig. 3: Not bad for a first attempt: the results of our CART classifier.
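For completeness, a minimal sketch of the omitted evaluation steps — simply reusing the classes from the Iris example above — could look like this:

// Hedged sketch of the omitted evaluation code, analogous to the Iris example
Prediction<Label> prediction = tree.predict(testData.getExample(0));
LabelEvaluation evaluation = new LabelEvaluator().evaluate(tree, testData);
System.out.println(evaluation.toString());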


Since the achieved accuracy is not that great after all, we want to try another training algorithm here. The structure of Tribuo shown in Figure 1, built on standard interfaces, helps us make such a change with comparatively little code.

This time we want to use the LinearSGDTrainer class, which we integrate into the program with the import org.tribuo.classification.sgd.linear.LinearSGDTrainer;. The explicit mention of the full package name is not pedantry — besides the classification variant used here, there is also a LinearSGDTrainer intended for regression tasks, which cannot help us at this point.

Since the LinearSGDTrainer class does not come with a convenience constructor that initializes the trainer with a set of (more or less well-working) default values, we have to make some changes:

var sgdTrainer = new LinearSGDTrainer(new LogMulticlass(), SGD.getLinearDecaySGD(0.2), 10, 200, 1, 1L);
Model<Label> sgdModel = sgdTrainer.train(trainData);

At this point, this program version is also ready to use — since LinearSGD requires more computing power, it will take a little more time to process.

Excursus: Automatic configuration

LinearSGD may be a powerful machine learning algorithm — but the constructor is long and difficult to understand because of the number of parameters, especially without IntelliSense support.

Since ML systems are usually built together with a data scientist who may have limited Java skills, it would be nice to be able to provide the model configuration as a JSON or XML file.

This is where the OLCUT configuration system mentioned above comes in. If we include it in our solution, we can parameterize the class in JSON according to the scheme shown in Listing 6.

"name" : "cart",
"type" : "org.tribuo.classification.dtree.CARTClassificationTrainer",
"export" : "false",
"import" : "false",
"properties" : {
  "maxDepth" : "6",
  "impurity" : "gini",
  "seed" : "12345",
  "fractionFeaturesInSplit" : "0.5"
}

Even at first glance, it is noticeable that parameters like the seed for the random generator can be addressed directly — the line “seed” : “12345” should be understandable even for someone with little Java experience.

By the way, OLCUT is not limited to the linear creation of object instances — the system is also able to create nested class structures. In the markup we just printed, the attribute “impurity” : “gini” is an excellent example of this — it can be expressed according to the following scheme to generate an instance of the GiniIndex class:

"name" : "gini",
"type" : "org.tribuo.classification.dtree.impurity.GiniIndex",

Once we have such a configuration file, we can invoke an instance of the ConfigurationManager class as follows:

ConfigurationManager.addFileFormatFactory(new JsonConfigFactory());
String configFile = "example-config.json";
// print the configuration file for inspection
System.out.println(String.join("\n", Files.readAllLines(Paths.get(configFile))));

OLCUT is agnostic about the file format used — the actual logic for reading the configuration data from the file system is contributed by an adapter class. Since our code uses the JSON format, we register an instance of JsonConfigFactory via the addFileFormatFactory method.

Next, we can also start parameterizing the elements of our machine learning application using ConfigurationManager:

var cm = new ConfigurationManager(configFile);
DataSource<Label> mnistTrain = (DataSource<Label>) cm.lookup("mnist-train");

The lookup method takes a string, which is compared against the name attributes in the configuration. Provided the configuration system finds a match, it automatically instantiates the class structure described in the file.
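To tie this back to Listing 6: assuming the trainer entry named cart is present in example-config.json, a minimal sketch (the variable names are illustrative) could look like this:

// Hedged sketch: the same lookup mechanism instantiates the CART trainer from Listing 6
var cartFromConfig = (CARTClassificationTrainer) cm.lookup("cart");
Model<Label> configuredModel = cartFromConfig.train(trainData);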

ConfigurationManager is extremely powerful in this respect. Oracle offers a comprehensive example at https://tribuo.org/learn/4.0/tutorials/configuration-tribuo-v4.html that harnesses the configuration system in a Jupyter notebook to assemble a complex machine learning toolchain.

This measure, which seems pointless at first glance, is actually quite sensible. Machine learning systems live and die with the quality of the training data as well as with the parameterization — if these parameters can be conveniently addressed externally, it is easier to adapt the system to the present needs.

Grouping with synthetic data sets

In the field of machine learning, there is a group of universal procedures that can be applied to very different concrete problems. Having dealt with classification so far, we now turn to clustering. As Figure 4 shows, the idea is to inscribe groups into an unstructured mass of data in order to recognize trends or associations more easily.

Fig. 4: The inscription of the dividing lines divides the dataset (Image source: Wikimedia Commons/ChiRe).


Naturally, we also need data for our clustering experiments. Tribuo helps us at this point with the ClusteringDataGenerator class, which uses the Gaussian distribution known from probability theory to generate test datasets. For now, we want to populate two test datasets according to the following scheme:

public static void main(String[] args) {
  try {
    var data = ClusteringDataGenerator.gaussianClusters(500, 1L);
    var test = ClusteringDataGenerator.gaussianClusters(500, 2L);

The value passed as the second parameter determines the seed for the random number generator. Since we pass two different values here, the PRNG produces two different sequences of numbers — both of which, however, follow the Gaussian normal distribution.

Among clustering methods, the K-Means algorithm is particularly well-established, which is reason enough to use it in our Tribuo example. If you look carefully at the code, you will notice the familiar standardized structure once again:

var trainer = new KMeansTrainer(5, 10, Distance.EUCLIDEAN, 1, 1);
var model = trainer.train(data);
var centroids = model.getCentroidVectors();
for (var centroid : centroids) {
  System.out.println(centroid);
}

Of particular interest is the Distance value passed in, which determines how the distances between elements are calculated and weighted. Be careful here: the Tribuo library ships with several Distance enums — the one we need is the nested enum org.tribuo.clustering.kmeans.KMeansTrainer.Distance.
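Assuming Tribuo 4.0, the matching import is simply the nested type mentioned above:

import org.tribuo.clustering.kmeans.KMeansTrainer.Distance;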

Now the program can be executed again — Figure 5 shows the generated center point matrices.

Fig. 5: Tribuo informs us of the whereabouts of clusters.


The red messages are progress reports: most classes included in Tribuo contain logic to keep the user informed about long-running processes in the console. Since this output also consumes processing power, many constructors let you influence how frequently it is produced.

Be that as it may: to evaluate the results, it helps to know the Gaussian parameters used by ClusteringDataGenerator. Funnily enough, Oracle only reveals them in the Tribuo tutorial example. Each line below describes one cluster as N(mean vector, covariance matrix); the five generators are parameterized as follows:

N([ 0.0,0.0], [[1.0,0.0],[0.0,1.0]])
N([ 5.0,5.0], [[1.0,0.0],[0.0,1.0]])
N([ 2.5,2.5], [[1.0,0.5],[0.5,1.0]])
N([10.0,0.0], [[0.1,0.0],[0.0,0.1]])
N([-1.0,0.0], [[1.0,0.0],[0.0,0.1]])

Since a detailed discussion of the underlying mathematics would go beyond the scope of this article, we leave the evaluation of the results to Tribuo instead. The tool of choice is, once again, an element based on the evaluator principle — but since we are dealing with clustering this time, the necessary code looks like this:

ClusteringEvaluator eval = new ClusteringEvaluator();
var mtTestEvaluation = eval.evaluate(model,test);
System.out.println(mtTestEvaluation.toString());

When you run the present program, you get back a human-readable result — the Tribuo library takes care of preparing the results contained in ClusteringEvaluator for convenient output on the command line or in the terminal:

Clustering Evaluation
Normalized MI = 0.8154291916732408
Adjusted MI = 0.8139169342020222

Excursus: Faster when parallelized

Artificial intelligence tasks tend to consume immense amounts of computing power — if you don’t parallelize them, you lose out.

Parts of the Tribuo library are provided by Oracle out of the box with the necessary tooling that automatically distributes the tasks to be done across multiple cores of a workstation.

The trainer presented here is an excellent example of this. As a first step, we use the following scheme to make both the training and the test data considerably larger:

var data = ClusteringDataGenerator.gaussianClusters(50000, 1L);
var test = ClusteringDataGenerator.gaussianClusters(50000, 2L);

In the next step, it is enough to change one value in the KMeansTrainer constructor according to the following scheme — passing eight here instructs the engine to use eight processor cores at the same time:

var trainer = new KMeansTrainer(5,10,Distance.EUCLIDEAN,8,1);

At this point, you can run the program again. If you monitor the overall CPU consumption with a tool like top, you should briefly see all cores being utilized.

Regression with Tribuo

By now, it should be obvious that one of the most important selling points of the Tribuo library is its ability to expose quite different machine learning procedures through the common scheme of thought, process, and implementation shown in Figure 1. As a final task in this section, let's turn to regression — in the world of artificial intelligence, the analysis of a dataset to determine the relationship between input variables and an output variable. This is what many early neural networks, for example in game AI, were built for — and it is the kind of prediction that developers new to the field most readily associate with AI.

For this task we want to use a wine quality dataset: the data, described in detail at https://archive.ics.uci.edu/ml/datasets/Wine+Quality, correlates wine ratings with various chemical analysis values. The purpose of the resulting system is thus to give an accurate judgment about the quality rating of a wine after being fed this chemical information.

As always, the first task is to provide the test data, which we (of course) do again via wget:

tamhan@TAMHAN18:~/tribuospace$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Our first order of business is to load the sample information we just provided — a task that is done via a CSVLoader class:

try {
var regressionFactory = new RegressionFactory();
var csvLoader = new CSVLoader<>(';',regressionFactory);
var wineSource = csvLoader.loadDataSource(Paths.get("/home/tamhan/tribuospace/winequality-red.csv"),"quality");

Two things are new compared to the previous code: first, the CSVLoader is now additionally passed a semicolon. This character tells it that the sample file uses a “special” format in which the individual values are not separated by commas. The second special feature is that we now use an instance of RegressionFactory as the factory — the LabelFactory used previously is not suitable for regression analyses.

The wine dataset is not divided into training and test data. For this reason, the TrainTestSplitter class from before returns to service; we use 70 percent of the data for training and 30 percent for evaluation:

var splitter = new TrainTestSplitter<>(wineSource, 0.7f, 0L);
Dataset<Regressor> trainData = new MutableDataset<>(splitter.getTrain());
Dataset<Regressor> evalData = new MutableDataset<>(splitter.getTest());

In the next step we need a trainer and an evaluator (Listing 7).

var trainer = new CARTRegressionTrainer(6);
Model<Regressor> model = trainer.train(trainData);
RegressionEvaluator eval = new RegressionEvaluator();
var evaluation = eval.evaluate(model,trainData);
var dimension = new Regressor("DIM-0",Double.NaN);
System.out.printf("Evaluation (train):%n  RMSE %f%n  MAE %f%n  R^2 %f%n",
  evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));

Two things are new about this code: first, we create a Regressor instance that names the output dimension we want to evaluate. Second, we evaluate the model against the very dataset it was trained on. Such an evaluation yields overly optimistic numbers and can hide overfitting, but it does provide useful reference values if interpreted carefully.

If you run our program as-is, you will see the following output on the command line:

Evaluation (train):
RMSE 0.545205
MAE 0.406670
R^2 0.544085

RMSE (root mean squared error) and MAE (mean absolute error) both describe the prediction error of the model; for both, a lower value indicates a more accurate model. R^2, by contrast, is better the closer it gets to 1.
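For reference, with y_i denoting the true quality rating, \hat{y}_i the model's prediction, and n the number of examples, the two error measures are defined as:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i - \hat{y}_i\right\rvert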

As a final task, we perform another evaluation, this time no longer against the training data. To do this, we simply adjust the values passed to the evaluate method:

evaluation = eval.evaluate(model, evalData);
dimension = new Regressor("DIM-0", Double.NaN);
System.out.printf("Evaluation (test):%n  RMSE %f%n  MAE %f%n  R^2 %f%n",
  evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));

The reward for our effort is the screen image shown in Figure 6 — since we are no longer evaluating the original training data, the accuracy of the resulting system has deteriorated.

Fig. 6: Against real data the test delivers less accurate results.


This behavior makes sense: if you evaluate a model against data it has never seen, you will naturally get worse results than if you short-circuit it with the information it was trained on. Note at this point, however, that overtraining is a classic antipattern, especially in quantitative finance, and has cost more than one algorithmic trader a lot of money.

This brings us to the end of our journey through the world of artificial intelligence with Tribuo. The library also supports anomaly detection as a fourth task type, which we will not discuss here. At https://tribuo.org/learn/4.0/tutorials/anomaly-tribuo-v4.html you can find a small tutorial — the main difference from the three methods presented so far is that anomaly detection works with a different set of classes.

Conclusion

Hand on heart: if you know Java well, you can learn Python — grumbling perhaps, but without any real problems. It is more a question of not wanting to than of not being able to.

On the other hand, there is no question that Java payloads are much easier to integrate into enterprise processes and toolchains than their Python-based counterparts. The Tribuo library helps here, because you no longer have to broker values manually between the Python and Java parts of an application.

If Oracle improved the documentation a bit and offered even more examples, there would be absolutely nothing wrong with the system. Admittedly, the hurdle to entry is somewhat higher: those who have never worked with ML before will learn the basics faster with Python, because they will find more turnkey examples. On the other hand, working with Tribuo is often faster thanks to its various convenience functions. All in all, Oracle has succeeded with Tribuo in a big way, which should give Java a good chance in the field of machine learning — at least from an ecosystem point of view.

The post On pythonic tracks appeared first on ML Conference.

]]>
Explainability – a promising next step in scientific machine learning https://mlconference.ai/blog/explainability-a-promising-next-step-in-scientific-machine-learning/ Tue, 05 May 2020 09:07:11 +0000 https://mlconference.ai/?p=17040 With the emergence of deep neural networks, the question has arisen how machine learning models can be not only accurate but also explainable. In this article, you will learn more about explainability and what elements it consists of, and why we need expert knowledge to interpret machine learning results to avoid making the right decisions for the wrong reasons.

The post Explainability – a promising next step in scientific machine learning appeared first on ML Conference.

]]>
Machine learning has become an integral part of our daily life – whether it be an essential component of all social media services or a simple helper for personal optimization. For some time now, also most areas of science have been influenced by machine learning in one way or another, as it opens up possibilities to derive findings and discoveries primarily from data.

Probably the most common objective has always been the prediction accuracy of the models. However, with the rise of complex models such as deep neural networks, another goal has come to the fore for scientific applications: explainability. This means that machine learning models should be designed in such a way that they not only provide accurate estimates, but also allow an understanding of why specific decisions are made and why the model operates in a certain way. With this demand to move away from non-transparent black-box models, new fields of research have emerged, such as explainable artificial intelligence (XAI) and theory-guided (or informed) machine learning.

From transparency to explainability

Explainability is not a discrete state that either exists or does not exist, but rather a property that helps results become more trustworthy, lets models be improved in a more targeted way, and allows scientific insights to be gained that did not exist before. Key elements of explainability are transparency and interpretability. Transparency is comparatively easy to achieve: the creator describes and motivates the machine learning process. Even deep neural networks, often referred to as complete black boxes, are at least transparent in the sense that the relation between input and output can be written down in mathematical terms. The problem is usually not that the model is inaccessible, but that models are often too complex to fully understand how they work and how decisions are made. That is exactly where interpretability comes into play: it is achieved by transferring abstract and complex processes into a domain that a human can understand.

A visualization tool often used in the sciences is the heatmap. Heatmaps highlight parts of the input data which are salient, important, or sensitive to occlusions, depending on which method is used. They are displayed in the same space as the input, so when analyzing images, the heatmaps are themselves images of the same size. They can also be applied to other data as long as it lives in a human-understandable domain. One of the most prominent methods is layer-wise relevance propagation for neural networks, which is applied after the model has been learned. It takes the learned weights and the activations produced for a given input and propagates the output back into the input space. A different principle is pursued by model-agnostic approaches such as LIME (local interpretable model-agnostic explanations), which can be used with all kinds of models — even non-transparent ones. The idea behind these approaches is to perturb the inputs and analyze how the output changes in response.

Nevertheless, domain knowledge is essential to achieve explainability for an intended application. Even if the processes in a model can be explained from a purely mathematical point of view, integrating knowledge from the respective application is indispensable, not least to assess whether the results are meaningful.

Explainable machine learning in the natural sciences

The possibilities this opens up in the natural sciences are wide-ranging. In the biosciences, for example, the identification of individual whales from photographs plays an important role in analyzing their migration across time and space. Identification by an expert is accurate and based on specific features such as scars and shape. Machine learning methods can automate this process and are therefore in great demand, which is why this task has been approached as a Kaggle challenge. Before such a tool is actually used in practice, the quality of the model can be assessed by analyzing the derived heatmaps (Fig. 1).

This way it can be checked whether the model looks at relevant features in the image rather than insignificant ones like the water. In doing so, the so-called Clever Hans effect — making the right decisions for the wrong reasons — can be ruled out. It could occur, for example, if by chance a whale was always photographed with a mountain in the background and the identification algorithm falsely took this to be a feature of the whale. Human-understandable interpretations and their assessment by an expert are therefore essential for scientific applications, as they allow us to judge whether the models operate as expected.


Figure 1: Heatmaps derived by Grad-CAM, an interpretation tool utilizing the gradients in a neural network. All four original images show the same whale at different points in time.

Much more far-reaching, however, is the application of explainable machine learning when the methods do not merely confirm what we expect, but give us new scientific insights. A prominent approach is presented, for example, by Iten et al., in which physical principles are derived automatically from observational data without any prior knowledge. The idea behind this is that the learned representation of the neural network is much simpler than the input data, and explanatory factors of the system, such as physical parameters, are captured in a few interpretable elements such as individual neurons.

In combination with expert knowledge, techniques such as neural networks can thus recognize patterns that help us to encounter things that were previously unknown to us.

The post Explainability – a promising next step in scientific machine learning appeared first on ML Conference.

]]>
Tutorial: Explainable Machine Learning with Python and SHAP https://mlconference.ai/blog/tutorial-explainable-machine-learning-with-python-and-shap/ Tue, 11 Feb 2020 10:26:32 +0000 https://mlconference.ai/?p=16398 Machine learning algorithms can cause the “black box” problem, which means we don’t always know exactly what they are predicting. This may lead to unwanted consequences. In the following tutorial, Natalie Beyer will show you how to use the SHAP (SHapley Additive exPlanations) package in Python to get closer to explainable machine learning results.

The post Tutorial: Explainable Machine Learning with Python and SHAP appeared first on ML Conference.

]]>
In this tutorial, you will learn how to use the SHAP package in Python applied to a practical example step by step.

Motivation

Machine learning is used in a lot of contexts nowadays. We get offers for different products, recommendations on what to watch tonight, and much more. Sometimes the predictions fit our needs and we buy or watch what was offered; sometimes we get the wrong predictions. And sometimes those predictions occur in more sensitive contexts than watching a show or buying a certain product — for example, when an algorithm that is supposed to automate hiring decisions discriminates against a group. Amazon's recruiters used an algorithm that was systematically rejecting women before inviting them to job interviews.

To make sure that we know what the algorithms we use actually do, we have to take a closer look at what we are predicting. New methods of explainable machine learning make it possible to explore which factors the algorithm relied on to arrive at its predictions. Those methods can lead to a better understanding of what the algorithm is actually doing and whether it emphasizes columns that should not contain much information.

Example

To get a clearer picture of explainable AI, we will go through an example. The dataset used consists of Kickstarter projects and can be downloaded here. Kickstarter is a crowdfunding platform where people can upload a video or description of their planned projects. If you would like to support a project, you can donate money to it. In this example, I would like to guide you through a machine learning algorithm that predicts whether a given project is going to be successful or not. The interesting part is that we are going to take a look at why the algorithm came to a certain decision.


This explainable machine learning example will be in Python. So, first we need to import a few packages (Listing 1). pandas, NumPy, scikit-learn and Matplotlib are frequently used in data science projects. CatBoost is a great tree-based algorithm that deals excellently with categorical data and performs well even with its default settings. SHAP is the package by Scott M. Lundberg that implements the approach for interpreting machine learning outcomes.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import catboost as catboost
from catboost import CatBoostClassifier, Pool, cv
import shap

Used versions of the packages:

  • pandas 0.25.0
  • NumPy 1.16.4
  • Matplotlib 3.0.3
  • scikit-learn 0.19.1
  • CatBoost 0.18.1
  • SHAP 0.28.3

Let’s take a look at the downloaded dataset (Figure 1) with kickstarter.head():


Figure 1

The first column is the identification number of each project. The name column is the name of the Kickstarter project. category classifies each project into one of 159 different categories, which can be summed up into 15 main categories. Next is the currency of the project. The column deadline represents the last possible date to support the project. pledged describes the amount of money that was given in order to support the project. state is the state of the project after the deadline date. backers is the number of supporters of the given project. The last column consists of the country in which the project was launched.

We are just going to use the states failed and successful, as the other states like canceled do not seem to be very interesting (Listing 2).

kickstarter["state"] = kickstarter["state"].replace({"failed": 0, "successful": 1})

First machine learning model

We are going to start with a machine learning model that takes the following columns as the feature vector (Listing 3):

kickstarter_first = kickstarter[
    [
        "category",
        "main_category",
        "currency",
        "deadline",
        "goal",
        "launched",
        "backers",
        "country",
        "state",
    ]
]

The last column is going to be our target column, therefore y. All the other columns are the feature vector, therefore X (Listing 4).

X = kickstarter_first[kickstarter_first.columns[:-1]]
y = kickstarter_first[kickstarter_first.columns[-1:]]

We are going to split the dataset with the result of having 10% of the dataset as the test dataset, and 90% as the training dataset (Listing 5).

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

As our classifier, I chose CatBoost, as it can deal very well with categorical data (Listing 6). We are going to use the algorithm's default settings; 150 iterations are enough for our purposes.

model = CatBoostClassifier(
    random_seed=42, logging_level="Silent", iterations=150
)

In order to use CatBoost properly, we need to define which columns are categorical (Listing 7). In our case, those are all columns that have the type object.

categorical_features_indices = np.where(X.dtypes == np.object)[0]
X.dtypes

Figure 2

We can see in Figure 2 that all columns but goal and backers are object columns and should be treated as categorical.

After fitting the model, we see a pretty good result (Listing 8):

model.fit(
    X_train,
    y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_test, y_test),
)

Figure 3

With this first model, we are able to classify 93% of our test dataset correctly (Figure 3).

Let’s not get too excited and check out what we are actually predicting.

With the package SHAP, we are able to see which factors were mostly responsible for the predictions (Listing 9).

shap_values = model.get_feature_importance(
    Pool(X_test, label=y_test, cat_features=categorical_features_indices),
    type="ShapValues",
)
shap_values = shap_values[:, :-1]


shap.summary_plot(shap_values, X_test, plot_type="bar")

Figure 4

With this bar plot (Figure 4), we can see that the column backers is contributing the most to the prediction!


Oh no! We have put an approximation of the target column (status failed or successful) into our model. If your Kickstarter project has a lot of backers, then it is most likely going to be successful.

Let’s give it another go. This time we are just going to use the columns that are not going to reveal too much information.

Second machine learning model

In the extended dataset kickstarter_extended = kickstarter.copy(), we are going to implement some feature engineering. Looking through the data, one can see that some projects are using special characters in their name. We are going to implement a new column number_special_character_name that is going to count the number of special characters per name (Listing 10).

kickstarter_extended[
    "number_special_character_name"
] = kickstarter_extended.name.str.count('[-()"#/@;:<>{}`+=~|.!?,]')
kickstarter_extended["word_count"] = kickstarter_extended["name"].str.split().map(len)

Also, we are going to convert the deadline and launched columns from the type object to datetime, replacing the original columns. This is done in order to get the new column delta_days, which consists of the number of days between the “launched” date and the “deadline” date (Listing 11).

kickstarter_extended["deadline"] = pd.to_datetime(kickstarter_extended["deadline"])
kickstarter_extended["launched"] = pd.to_datetime(kickstarter_extended["launched"])

kickstarter_extended["delta_days"] = (
    kickstarter_extended["deadline"] - kickstarter_extended["launched"]
).dt.days

It is also interesting to see whether projects are more successful in certain months. Therefore, we are building the new column launched_month. The same for day of week and year (Listing 12).

kickstarter_extended["launched_month"] = kickstarter_extended["launched"].dt.month
kickstarter_extended[
    "day_of_week_launched"
] = kickstarter_extended.launched.dt.dayofweek
kickstarter_extended["year_launched"] = kickstarter_extended.launched.dt.year
kickstarter_extended.drop(["deadline", "launched"], inplace=True, axis=1)

The new dataset kickstarter_extended now consists of the following columns (Listing 13):

kickstarter_extended = kickstarter_extended[
    [
        "ID",
        "category",
        "main_category",
        "currency",
        "goal",
        "country",
        "number_special_character_name",
        "word_count",
        "delta_days",
        "launched_month",
        "day_of_week_launched",
        "year_launched",
        "state",
    ]
]

Again, building the test and training dataset (Listing 14).

X = kickstarter_extended[kickstarter_extended.columns[:-1]]
y = kickstarter_extended[kickstarter_extended.columns[-1:]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

Initializing the new model and setting the categorical columns. Afterwards, fitting the model (Listing 15).

model = CatBoostClassifier(
    random_seed=42, logging_level="Silent", iterations=150
)
categorical_features_indices = np.where(X_train.dtypes == np.object)[0]

model.fit(
    X_train,
    y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_test, y_test),
)

model.score(X_test, y_test)

Figure 5

The current model is a little bit worse than the first try (Figure 5), but now we are predicting from features that do not leak the outcome. A quick look at the bar plot, generated by Listing 16 and containing the current feature importances, tells us that goal is now in fact the most informative column (Figure 6).

shap_values_ks = model.get_feature_importance(
    Pool(X_test, label=y_test, cat_features=categorical_features_indices),
    type="ShapValues",
)
shap_values_ks = shap_values_ks[:, :-1]

shap.summary_plot(shap_values_ks, X_test, plot_type="bar")

Figure 6

Until now, the SHAP package has not shown anything that other libraries cannot do — feature importances have been available in XGBoost and CatBoost for some versions already.
But now let’s get SHAP to shine. We enter shap.summary_plot(shap_values_ks, X_test) and receive the following summary plot (Figure 7):


Figure 7

In this summary plot, the order of the columns still represents how much information each column contributes to the prediction. Each dot in the visualization represents one prediction. The color encodes the actual feature value of that data point: if the value in the dataset was high, the color is pink; blue indicates a low value. Grey represents categorical values, which cannot be scaled to high or low (the package maintainers are working on it). The x-axis shows the SHAP value, which is the impact on the model output. A model output of 1 equates to the prediction successful; 0 means the project is predicted to fail.

Let’s take a look at the first row of the summary_plot. If a Kickstarter project owner set the goal high (pink dots) the model output was likely 0 (negative SHAP value, not successful). It totally makes sense: if you set the bar for the money goal too high, you cannot reach it. On the other hand, if you set it very low, you are likely to achieve it by asking just a few of your friends. The column word_count also shows a clear relationship: few words in the name description indicate a negative impact on the model output, in the sense that it is likely a failed project. Maybe more words in the name deliver more information, so that potential supporters already get interested after reading just the title. You can see that the other columns are showing a more complex picture as there are pink dots in a mainly blue area and the other way around.

The great thing about the SHAP package is that it gives the opportunity to dive even deeper into the exploration of our model. Namely, it will give us the feature contributions for every single prediction (Listing 17).

shap_values = model.get_feature_importance(
    Pool(X_test, label=y_test, cat_features=categorical_features_indices),
    type="ShapValues",
)
expected_value = shap_values[0, -1]
shap_values = shap_values[:, :-1]
shap.initjs()  

shap.force_plot(expected_value, shap_values[10, :], X_test.iloc[10, :])

Figure 8

In the force plot (Figure 8), we can see the row at position 10 of our test dataset. This was a correct prediction of a successful project. Features that are pink contribute to the model output being higher, that means predicting a success of the Kickstarter project. Blue parts of the visualization indicate a lower model output, predicting a failed project. So the biggest block here is the feature ‘category’, which in this case is Tabletop Games. Therefore, with this particular set of information, the project being a Tabletop Game is the most informative feature for the model. Also, the short period of 28 days of the project being online contributes towards the prediction of success.

Another example is row 33161 of the test dataset, which was a correct prediction of a failed project. As we can see in the force plot (Figure 9), generated by Listing 18, the biggest block is the feature goal. Apparently, the set goal of $25,000 was too high.

shap.force_plot(expected_value, shap_values[33161, :], X_test.iloc[33161, :])

Figure 9

So now we have gotten a better look at our model with this Kickstarter dataset. One could also explore the false predictions — the false positives and false negatives — to gain an even deeper understanding of the model and see which features it concentrated on that led to an incorrect output. There are also many other visualizations, such as interaction values. Check out the documentation if you are interested.

Outlook

The SHAP package is also useful in other machine learning tasks. For example, image recognition tasks. In Figure 10, you can see which pixels contributed to which model output.


Figure 10. Source: SHAP

SHAP is giving us the opportunity to better understand the model and which features contributed to which prediction. The package allows us to check whether we are taking just features into account which make sense. It is the first step towards preventing models from predicting things based on wrong input features. Thus, machine learning becomes less of a “black box”. This way, we are getting closer to explainable machine learning.

The post Tutorial: Explainable Machine Learning with Python and SHAP appeared first on ML Conference.

]]>