Advanced ML Development
Generative AI & Large Language Models (LLMs)
ML Business & Strategy
MLOps
Tools, APIs & Frameworks

Real Estate Insights: A Guide to Address Matching with NLP in Python ML Basics & Principles

Comprehensive Strategies for Geospatial Analysis, Data Management, and more

Alyce Ge
Feb 2, 2024

Discover the power of address matching in real estate data management with this comprehensive guide. Learn how to leverage natural language processing (NLP) techniques using Python, including open-source libraries like SpaCy and fuzzywuzzy, to parse, clean, and match addresses. From breaking down data silos to geocoding and point-in-polygon searches, this article provides a step-by-step approach to creating a Source-of-Truth Real Estate Dataset. Whether you're in geospatial analysis, real estate data management, logistics, or compliance, accurate address matching is the key to unlocking valuable insights.

Address matching isn’t always simple in data; we often need to parse and standardize addresses into a consistent format first before we can use them as identifiers for matching. Address matching is an important step in the following use cases:

  1. Geospatial Analysis: Accurate address matching forms the foundation of geospatial analysis, allowing organizations to make informed decisions about locations, market trends, and resource allocation across various industries like retail and media.
  2. Real Estate Data Management: In the real estate industry, precise address matching facilitates property valuation, market analysis, and portfolio management.
  3. Logistics and Navigation: Efficient routing and delivery depend on accurate address matching.
  4. Compliance and Regulation: Many regulatory requirements mandate precise address data, such as tax reporting and census data collection.

Stay up to date

Learn more about MLCON

 

Cherre is the leading real estate data management company, we specialize in accurate address matching for the second use case. Whether you’re an asset manager, portfolio manager, or real estate investor, a building represents the atomic unit of all financial, legal, and operating information. However, real estate data lives in many silos, which makes having a unified view of properties difficult. Address matching is an important step in breaking down data silos in real estate. By joining disparate datasets on address, we can unlock many opportunities for further portfolio analysis.

Data Silos in Real Estate

Real estate data usually fall into the following categories: public, third party, and internal. Public data is collected by governmental agencies and made available publicly, such as land registers. The quality of public data is generally not spectacular and the data update frequency is usually delayed, but it provides the most comprehensive coverage geographically. Don’t be surprised if addresses from public data sources are misaligned and misspelled.

Third party data usually come from data vendors, whose business models focus on extracting information as datasets and monetizing those datasets. These datasets usually have good data quality and are much more timely, but limited in geographical coverage. Addresses from data vendors are usually fairly clean compared to public data, but may not be the same address designation across different vendors. For large commercial buildings with multiple entrances and addresses, this creates an additional layer of complexity.

Lastly, internal data is information that is collected by the information technology (I.T.) systems of property owners and asset managers. These can incorporate various functions, from leasing to financial reporting, and are often set up to represent the business organizational structures and functions. Depending on the governance standards, and data practices, the quality of these datasets can vary and data coverage only encompasses the properties in the owner’s portfolio. Addresses in these systems can vary widely, some systems are designated at the unit-level, while others designate the entire property. These systems also may not standardize addresses inherently, which makes it difficult to match property records across multiple systems.

With all these variations in data quality, coverage, and address formats, we can see the need for having standardized addresses to do basic property-level analysis.

[track_display_in_blog headline="NEW & PRACTICAL ENDEAVORS FOR ML" text="Machine Learning Principles" textcolor="white" backgroundimage="https://mlconference.ai/wp-content/uploads/2024/02/MLC_Global24_Website_Blog1.jpg" icon="https://mlconference.ai/wp-content/uploads/2019/10/MLC_Singapur20_Trackicons_MLPrinciples_250x250_54073_rot_v1.png" btnlink="machine-learning-principles" btntext="Learn more"]

Address Matching Using the Parse-Clean-Match Strategy

In order to match records across multiple datasets, the address parse-clean-match strategy works very well regardless of region. By breaking down addresses into their constituent pieces, we have many more options for associating properties with each other. Many of the approaches for this strategy use simple natural language processing (NLP) techniques.

NEW & PRACTICAL ENDEAVORS FOR ML

Machine Learning Principles

Address Parsing

Before we can associate addresses with each other, we must first parse the address. Address parsing is the process of breaking down each address string into its constituent components. Components in addresses will vary by country.

In the United States and Canada, addresses are generally formatted as the following:

{street_number} {street_name}

{city}, {state_or_province} {postal_code}

{country}

In the United Kingdom, addresses are formatted very similarly as in the U.S. and Canada, with an additional optional locality designation:

{building_number} {street_name}

{locality (optional)}

{city_or_town}

{postal_code}

{country}

 

French addresses vary slightly from U.K. addresses with the order of postal code and city:

{building_number} {street_name}

{postal_code} {city}

{country}

 

German addresses take the changes in French addresses and then swap the order of street name and building number:

{street_name} {building_number} {postal_code} {city} {country}

 

Despite the slight variations across countries’ address formats, addresses generally have the same components, which makes this an easily digestible NLP problem. We can break down the process into the following steps:

  1. Tokenization: Split the address into its constituent words. This step segments the address into manageable units.
  2. Named Entity Recognition (NER): Identify entities within the address, such as street numbers, street names, cities, postal codes, and countries. This involves training or using pre-trained NER models to label the relevant parts of the address.
  3. Sequence Labeling: Use sequence labeling techniques to tag each token with its corresponding entity

Let’s demonstrate address parsing with a sample Python code snippet using the spaCy library. SpaCy is an open-source software library containing many neural network models for NLP functions. SpaCy supports models across 23 different languages and allows for data scientists to train custom models for their own datasets. We will demonstrate address parsing using one of SpaCy’s out-of-the-box models for the address of a historical landmark: David Bowie’s Berlin apartment.

 

import spacy

# Load the NER spaCy model
model = spacy.load("en_core_web_sm")

# Address to be parsed
address = "Hauptstraße 155, 10827 Berlin"

# Tokenize and run NER
doc = model(address)

# Extract address components
street_number = ""
street_name = ""
city = ""
state = ""
postal_code = ""

for token in doc:
    if token.ent_type_ == "GPE":  # Geopolitical Entity (City)
        city = token.text
    elif token.ent_type_ == "LOC":  # Location (State/Province)
        state = token.text
    elif token.ent_type_ == "DATE":  # Postal Code
        postal_code = token.text
    else:
        if token.is_digit:
            street_number = token.text
        else:
            street_name += token.text + " "

# Print the parsed address components
print("Street Number:", street_number)
print("Street Name:", street_name)
print("City:", city)
print("State:", state)
print("Postal Code:", postal_code)

Now that we have a parsed address, we can now clean each address component.

Address Cleaning

Address cleaning is the process of converting parsed address components into a consistent and uniform format. This is particularly important for any public data with misspelled, misformatted, or mistyped addresses. We want to have addresses follow a consistent structure and notation, which will make further data processing much easier.

To standardize addresses, we need to standardize each component, and how the components are joined. This usually entails a lot of string manipulation. There are many open source libraries (such as libpostal) and APIs that can automate this step, but we will demonstrate the basic premise using simple regular expressions in Python.


import pandas as pd
import re

# Sample dataset with tagged address components
data = {
    'Street Name': ['Hauptstraße', 'Schloß Nymphenburg', 'Mozartweg'],
    'Building Number': ['155', '1A', '78'],
    'Postal Code': ['10827', '80638', '54321'],
    'City': ['Berlin', ' München', 'Hamburg'],
}

df = pd.DataFrame(data)

# Functions with typical necessary steps for each address component
# We uppercase all text for easier matching in the next step

def standardize_street_name(street_name):
    # Remove special characters and abbreviations, uppercase names
    standardized_name = re.sub(r'[^\w\s]', '', street_name)
    return standardized_name.upper()

def standardize_building_number(building_number):
    # Remove any non-alphanumeric characters (although exceptions exist)
    standardized_number = re.sub(r'\W', '', building_number)
    return standardized_number

def standardize_postal_code(postal_code):
    # Make sure we have consistent formatting (i.e. leading zeros)
    return postal_code.zfill(5)

def standardize_city(city):
    # Upper case the city, normalize spacing between words
    return ' '.join(word.upper() for word in city.split())

# Apply standardization functions to our DataFrame
df['Street Name'] = df['Street Name'].apply(standardize_street_name)
df['Building Number'] = df['Building Number'].apply(standardize_building_number)
df['Postal Code'] = df['Postal Code'].apply(standardize_postal_code)
df['City'] = df['City'].apply(standardize_city)

# Finally create a standardized full address (without commas)
df[‘Full Address’] = df['Street Name'] + ' ' + df['Building Number'] + ' ' + df['Postal Code'] + ' ' + df['City']

Address Matching

Now that our addresses are standardized into a consistent format, we can finally match addresses from one dataset to address in another dataset. Address matching involves identifying and associating similar or identical addresses from different datasets. When two full addresses match exactly, we can easily associate the two together through a direct string match.

 

When addresses don’t match, we will need to apply fuzzy matching on each address component. Below is an example of how to do fuzzy matching on one of the standardized address components for street names. We can apply the same logic to city and state as well.


from fuzzywuzzy import fuzz

# Sample list of street names from another dataset
street_addresses = [
    "Hauptstraße",
    "Schlossallee",
    "Mozartweg",
    "Bergstraße",
    "Wilhelmstraße",
    "Goetheplatz",
]

# Target address component (we are using street name)
target_street_name = "Hauptstrasse " # Note the different spelling and space 

# Similarity threshold
# Increase this number if too many false positives
# Decrease this number if not enough matches
threshold = 80

# Perform fuzzy matching
matches = []

for address in street_addresses:
    similarity_score = fuzz.partial_ratio(address, target_street_name)
    if similarity_score >= threshold:
        matches.append((address, similarity_score))

matches.sort(key=lambda x: x[1], reverse=True)

# Display matched street name
print("Target Street Name:", target_street_name)
print("Matched Street Names:")
for match in matches:
    print(f"{match[0]} (Similarity: {match[1]}%)")

Up to here, we have solved the problem for properties with the same address identifiers. But what about the large commercial buildings with multiple addresses?

Other Geospatial Identifiers

Addresses are not the only geospatial identifiers in the world of real estate. An address typically refers to the location of a structure or property, often denoting a street name and house number.  There are actually four other geographic identifiers in real estate:

 

  1. A “lot” represents a portion of land designated for specific use or ownership.
  2. A “parcel” extends this notion to a legally defined piece of land with boundaries, often associated with property ownership and taxation.
  3. A “building” encompasses the physical structures erected on these parcels, ranging from residential homes to commercial complexes.
  4. A “unit” is a sub-division within a building, typically used in multi-unit complexes or condominiums. These can be commercial complexes (like office buildings) or residential complexes (like apartments).

 

What this means is that we actually have multiple ways of identifying real estate objects, depending on the specific persona and use case. For example, leasing agents focus on the units within a building for tenants, while asset managers optimize for the financial performance of entire buildings. The nuances of these details are also codified in many real estate software systems (found in internal data), in the databases of governments (found in public data), and across databases of data vendors (found in third party data). In public data, we often encounter lots and parcels. In vendor data, we often find addresses (with or without units). In real estate enterprise resource planning systems, we often find buildings, addresses, units, and everything else in between.

In the case of large commercial properties with multiple addresses, we need to associate various addresses with each physical building. In this case, we can use geocoding and point-in-polygon searches.

Geocoding Addresses

Geocoding is the process of converting addresses into geographic coordinates. The most common form is latitude and longitude. European address geocoding requires a robust understanding of local address formats, postal codes, and administrative regions. Luckily, we have already standardized our addresses into an easily geocodable format.

Many commercial APIs exist for geocoding addresses in bulk, but we will demonstrate geocoding using a popular Python library, Geopy, to geocode addresses.

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my_geocoder")
location = geolocator.geocode("1 Canada Square, London")
print(location.latitude, location.longitude)

 

 

Now that we’ve converted our addresses into latitude and longitude, we can use point-in-polygon searches to associate addresses with buildings.

Point-in-Polygon Search

A point-in-polygon search is a technique to determine if a point is located within the boundaries of a given polygon.

The “point” in a point-in-polygon search refers to a specific geographical location defined by its latitude and longitude coordinates. We have already obtained our points by geocoding our addresses.

The “polygon” is a closed geometric shape with three or more sides, which is usually characterized by a set of vertices (points) connected by edges, forming a closed loop. Building polygons can be downloaded from open source sites like OpenStreetMap or from specific data vendors. The quality and detail of the OpenStreetMap building data may vary, and the accuracy of the point-in-polygon search depends on the precision of the building geometries.

While the concept seems complex, the code for creating this lookup is quite simple. We demonstrate a simplified example using our previous example of 1 Canada Square in London.


import json
from shapely.geometry import shape, Point

# Load the GeoJSON data
with open('building_data.geojson') as geojson_file:
    building_data = json.load(geojson_file)

# Latitude and Longitude of 1 Canada Square in Canary Wharf
lat, lon = 51.5049, 0.0195

# Create a Point geometry for 1 Canada Square
point_1_canada = Point(lon, lat)

# See if point is within any of the polygons
for feature in building_data['features']:
    building_geometry = shape(feature['geometry'])

    if point_1_canada.within(building_geometry):
        print(f"Point is within this building polygon: {feature}")
        break
else:
    print("Point is not within any building polygon in the dataset.")

Using this technique, we can properly identify all addresses associated with this property.

Stay up to date

Learn more about MLCON

 

Summary

Addresses in real life are confusing because they are the physical manifestation of many disparate decisions in city planning throughout the centuries-long life of a city. But using addresses to match across different datasets doesn’t have to be confusing.

Using some basic NLP and geocoding techniques, we can easily associate property-level records across various datasets from different systems. Only through breaking down data silos can we have more holistic views of property behaviors in real estate.

Author Biography

Alyce Ge is data scientist at Cherre, the industry-leading real estate data management and analytics platform. Prior to joining Cherre, Alyce held data science and analytics roles for a variety of technology companies focusing on real estate and business intelligence solutions. Alyce is a Google Cloud-certified machine learning engineer, Google Cloud-certified data engineer, and Triplebyte certified data scientist. She earned her Bachelor of Science in Applied Mathematics from Columbia University in New York.

 

Top Articles About ML Basics & Principles

Behind the Tracks