MLOps Best Practices: Scaling Machine Learning in Production
https://mlconference.ai/blog/mlops/
The Conference for Machine Learning Innovation

Keeping an Eye on AI
https://mlconference.ai/blog/keeping-an-eye-on-ai/
Fri, 01 Apr 2022 11:45:52 +0000

Your machine learning model is trained and finally running in production. But that was the easy part. Now, the real challenge is reliably running your machine learning system in production. For this, monitoring systems are essential. But while monitoring machine learning models, you must consider some challenges that go beyond traditional DevOps metrics.

The post Keeping an Eye on AI appeared first on ML Conference.

We’ve all been there: we’ve spent weeks or even months working on our ML model. We collected and processed data, tested different model architectures, and spent a lot of time fine-tuning our model’s hyperparameters. Our model is ready! Maybe we need a few more tweaks to improve performance more, but it’s ready for the real world. Finally, we put our model into production, and sit back and relax. Three weeks later, we get an angry call from our customer because our model makes predictions that don’t have anything to do with reality. A look at the log reveals no errors. In fact, everything still looks good.

However, since we haven’t established continuous monitoring for our model, we don’t know if and when our model’s predictions change. We have to hope that they’ll always be just as good as on day one. But if our infrastructure is made out of only duct tape and hope, then we’ll find lots of errors only in production.

What is MLOps, anyway?

We want to create infrastructure and processes that combine the development of machine learning systems with the system’s operation. That’s the goal that MLOps is pursuing. This is closely interwoven with DevOps’ goals and definition. But the goal here isn’t just developing and operating software systems, but developing and operating machine learning systems.

As a data scientist, I have to ask myself why I should worry about the operational part of ML systems at all. Technically, my goal is just to train the best possible ML model. While this is a worthy goal, I have to keep in mind the system’s overall context and the business of the company I’m operating in. According to frequently cited estimates, about 90% of all ML models are never deployed in a production environment [1], [2]. This means they never reach a user or customer. Provocatively speaking: 90% of machine learning projects are useless for our business. In the end, a model is only useful if it generates added value for my users or processes.

Besides, when developing a machine learning system, I always keep the following quote from Andrew Ng in mind: “You deployed your model in production? Congratulations. You’re halfway done with your project.” [3]

Challenges in production

Our machine learning project isn’t over when we bring our first model into a production environment. After all, even a model that delivered excellent results in training and testing faces many challenges once it arrives in production. Perhaps the best-known challenge is a change in the distribution of the data that our model receives as input. A whole range of events can trigger this, but they’re often grouped under the term “concept drift”. The data set used to train our model is the only part of reality our model can perceive. This is one of the reasons why collecting as much data as we can is so critical for a well-functioning, robust model. When more data is available to the model, it can represent a bigger part of the real world with higher fidelity. But the training data is always just a static snapshot of reality, while the world around it is constantly changing. If our model isn’t trained with new data, it has no way to update its outdated basic assumptions about reality. This leads to a decline in the model’s performance.

How does concept drift happen? Data can change gradually. For example, sensors get less accurate over a long time due to wear and tear and show increasing deviations from the actual value. Another example of slow change is customer behavior and customer preferences. Imagine a model that makes recommendations for products in a fashion shop. If it isn’t updated and keeps recommending winter clothes to customers in the summer, then customer satisfaction in our shop system will significantly decrease. Recurring events such as seasons or holidays can also have an impact if we want to use our model to predict sales figures. But concept drift can also happen abruptly: If COVID-19 brings global air traffic to a screeching halt, then our carefully trained models for predicting daily passenger traffic will produce poor results. Or if the sales department launches an Instagram promotion without prior notice and doubles the sales of our vitamin supplement, that’s a great result, but it’s not something our model is good at predicting.

Another challenge is both technical and organizational. In many companies and projects, there is one team developing a machine learning model (data scientists) and another team bringing the models into production and supporting them (software engineers/DevOps engineers). The data science team spends a lot of time conceptualizing and selecting model architectures, feature engineering, and training the model. When the fully trained model is handed over to the software development team, it’s often implemented again. This implementation might differ from the actual, intended implementation. Even with small differences, this can lead to significant and unexpected effects. The separation of data science teams and software engineering teams can also lead to entirely different problems. Even if the data science team takes the model all the way to production, it’s often still used by different applications developed by other teams. These applications send input data to the model and receive predictions from it. If the data changes in the applications due to changes or errors, then the input data will fall short of the model’s expectations.

But challenges or threats to the model can even arise outside the organization itself. While actual security problems related to ML models have been rare up until now, it’s still possible for third parties to actively try and find vulnerabilities in the model as part of adversarial attacks. This danger shouldn’t be ignored, especially in fraud detection models.

Monitoring: DevOps vs. MLOps

To address challenges in a production model, we must first be aware of them. For this, we need monitoring. There are two approaches for monitoring traditional production software. First, we monitor business metrics (KPIs, key performance indicators). These can be metrics such as customer satisfaction, revenue, or how long customers stay on our site. Second, we monitor service and infrastructure metrics to gain comprehensive insights into the state of our system. We also collect these metrics when we monitor a machine learning system. But this still isn’t sufficient. Unlike conventional software systems, ML systems behave partly non-deterministically, and their performance depends significantly on the distribution of input data and the time of the most recent training. We need to collect further data. Because model hyperparameters, input feature selection, and architecture are not determined during deployment, but already in the training pipeline, the smallest error can lead to radically different system behavior that traditional software testing wouldn’t catch. This is especially true in systems where models are constantly iterated and improved upon.

Monitoring for ML systems

First, we need to gather the traditional metrics that we’d monitor for any other software system. There are well-established best practices here. Google’s site reliability engineering manual [4] can serve as a reference. I’d like to mention the following as examples of DevOps metrics:

  • Latency: How long does our model take to predict a new value? How long does our preprocessing pipeline take?
  • Resource consumption: What are the CPU/GPU and RAM utilization of our model server? Do we need more servers?
  • Events: When does the model server receive an HTTP request? When is a specific function called? When is an exception thrown?
  • Calls: How often is a model requested?

For these metrics, it is worth looking at both the individual values and the trend of the values over time. A change in either is a reason to take a closer look at our model and the whole system. Let’s imagine that we’re looking at the number of predictions per hour. If these deviate considerably in our daily or weekly comparison (by 40%, for instance), then it would certainly make sense to send an e-mail to the team members involved. In addition to these classic metrics, we also need to keep an eye on metrics specific to machine learning. In any case, we need to collect and store the predictions of the model. Both the individual predicted values and the value distribution are of interest. If our system predicts numerical values, we have to check their numerical stability. If it predicts “NaN” (not a number) or “infinity”, then this might warrant alerting our data scientists. Does the model that should predict tomorrow’s user number return “-15” or “cat”? Then something is very wrong.
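The checks described above are simple enough to sketch in a few lines of Python. This is only an illustration, not a finished monitoring component: the function names, the plausibility bounds, and the 40% tolerance are assumptions chosen for the example and would have to be set per use case.

```python
import math


def check_prediction(value, lower=0.0, upper=1_000_000.0):
    """Return a list of problems found in a single numeric prediction.

    `lower` and `upper` are hypothetical plausibility bounds; a predicted
    user count, for instance, can never be negative.
    """
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return [f"not a number: {value!r}"]
    if math.isnan(value):
        return ["NaN"]
    if math.isinf(value):
        return ["infinite"]
    if not lower <= value <= upper:
        return [f"out of range [{lower}, {upper}]: {value}"]
    return []


def volume_deviates(current, reference, tolerance=0.4):
    """True if the current call count differs from the reference count
    (e.g. the same hour yesterday) by more than `tolerance`, relatively."""
    if reference == 0:
        return current != 0
    return abs(current - reference) / reference > tolerance
```

A prediction of “-15” or “cat” yields a non-empty problem list, which the calling service could turn into a log entry or an alert.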

But the distribution of predictions is also of interest. Does the distribution match our expected values? We should also check what the minima and maxima of our predicted values are. Are they within an expected, reasonable range? What do the median, mean, and standard deviation look like over an hour or a day? If we find a large discrepancy between the distribution of predicted classes and the distribution of classes actually observed, then we have a prediction bias. In that case, the label distribution of our training data (the values to be predicted) may differ from the distribution in the real world. This is a clear sign that we need to re-train our model.
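As a rough sketch of these two ideas, the following Python snippet computes summary statistics for a window of numeric predictions and the per-class gap between predicted and observed class shares. The function names are made up for this example, and what counts as a “large” gap is left open because it depends on the application.

```python
import statistics
from collections import Counter


def summarize(predictions):
    """Summary statistics for one window (e.g. one hour) of numeric predictions."""
    return {
        "min": min(predictions),
        "max": max(predictions),
        "mean": statistics.fmean(predictions),
        "median": statistics.median(predictions),
        # The sample standard deviation needs at least two values.
        "stdev": statistics.stdev(predictions) if len(predictions) > 1 else 0.0,
    }


def class_shares(labels):
    """Relative frequency of each class label."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def prediction_bias(predicted, observed):
    """Per-class difference: predicted share minus observed share."""
    p, o = class_shares(predicted), class_shares(observed)
    return {
        label: p.get(label, 0.0) - o.get(label, 0.0)
        for label in sorted(set(p) | set(o), key=str)
    }
```

A positive value means the model predicts a class more often than it actually occurs; values close to zero for all classes indicate that predictions and reality are in line.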

When we monitor our model’s outputs, it’s only logical that we monitor the inputs too. What data does our model receive? We can determine the distributions and statistical key figures. Naturally, this is easier for structured data. But we can also determine various characteristic values for texts or images. For example, we can look at the length of texts or even word distributions within them. For images, we can capture their size or maybe even their brightness values. This can also help us to find errors in our pipeline and data pre-processing. Monitoring input also makes it possible to infer problems in data sources. For example, some columns in the database may no longer be populated with data due to a bug. The data’s definition or format in some columns might change, or the columns could be renamed. If we don’t notice these changes, then our models will still assume the original definitions and formats. To avoid this, we need to monitor the distribution of values of each feature extracted from a database table for a significant shift. You can detect a shift in the distribution of input data and predictions by applying statistical tests such as the chi-square test (for categorical values) or the Kolmogorov-Smirnov test (for continuous values). If the tests detect a significant deviation in the distribution of the data, this may indicate a change in the data structures or in the underlying data itself. In that case, the model may need to be re-trained.
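To make the Kolmogorov-Smirnov test mentioned above more concrete, here is a minimal sketch using only the Python standard library. It computes the two-sample KS statistic (the largest gap between the two empirical distribution functions) and compares it with the usual large-sample critical value. The function names are assumptions for this example; in practice a statistics library would normally be used, and it would also report an exact p-value.

```python
import math
from bisect import bisect_right


def ks_statistic(baseline, production):
    """Two-sample KS statistic D: largest gap between the empirical CDFs."""
    if not baseline or not production:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(baseline), sorted(production)
    d = 0.0
    # The largest gap is always attained at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def critical_value(n, m, alpha=0.05):
    """Large-sample approximation: c(alpha) * sqrt((n + m) / (n * m))."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def drift_detected(baseline, production, alpha=0.05):
    """True if the feature distribution shifted significantly at level alpha."""
    d = ks_statistic(baseline, production)
    return d > critical_value(len(baseline), len(production), alpha)
```

Here `baseline` would be the values of one feature in the training data and `production` the values of the same feature collected over a recent time window.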

There are a few things to keep in mind when monitoring input data. If you’re working with particularly large data, it isn’t feasible to record it completely. In this case, it might be more practical to process the input independently of the model first using deterministic code, and then log a preprocessed and compressed version of the input data. Of course, you must take special care to create and test the preprocessing. We also need to use identical preprocessing during training and in the production environment. Otherwise, there could be major discrepancies between the training data and real data. Caution is also necessary when processing particularly sensitive business data or personal data. Naturally, we want to collect as little sensitive data as we can. But we also need to collect data to improve and debug the models. This opens up a large range of interesting problems. There are some exciting approaches for solving these challenges, such as Differential Privacy [5] for Machine Learning.
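What such a deterministic, compressed log record could look like is sketched below for text input. The chosen characteristics (character count, word count, mean word length) and the function names are assumptions for illustration only; which values are meaningful, and which are acceptable from a privacy point of view, depends on the use case.

```python
import json


def summarize_text_input(text):
    """Reduce a raw text input to a few deterministic characteristics.

    The raw text itself is not part of the record, so less (potentially
    sensitive) data has to be stored.
    """
    words = text.split()
    mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "mean_word_len": round(mean_len, 2),
    }


def to_log_line(record):
    """Serialize with sorted keys so identical records give identical lines."""
    return json.dumps(record, sort_keys=True)
```

The same two functions would have to be used when building the training baseline, so that the logged production values can be compared with it.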

Finally, I’d like to mention one area that’s very easy to monitor, but still gets neglected often: tracking model versions. In our machine learning system, we must be able to always know which model is currently active in the production environment. Not only do we need to version our models, but in the best case, we also should be able to trace the complete experimental history that led to our model’s creation. We can use a tool like MLflow [6] for this. Even though there are some approaches that help debug machine learning models, right now deep learning models should be considered black boxes when it comes to the explainability and traceability of predictions. To shed some light on this, we need to be able to understand which features and data were used to train the model. Was there a bias introduced during training? Did someone mislabel data? Is there a bug in our data cleaning pipeline? These are all things that, in the worst case, we only discover in production. To have any chance of debugging in these cases, we need to version not just the model itself, but everything that led to its creation, monitor the currently active versions and make them transparent to all stakeholders. We use tools like DVC [7] or Feast [8] to track our data or features. But it’s also important that we version and test code in the data pipeline that we use to collect and process data.

The toolbox

There are many specialized tools in the field of MLOps, like Sacred [9] for managing experiments or Kubeflow [10] for training models and managing data pipelines. But for monitoring machine learning systems, we can use the DevOps toolbox. We’ll use Elasticsearch [11], a classic tool from the Elastic Stack, to capture, collect, and evaluate logs and input data. Elasticsearch is a document-oriented search and analytics engine commonly used to store logs from applications or containers. These logs can be visualized with Kibana [12] and used for error diagnosis.

For MLOps, we use Elasticsearch not just for logs, but to also store our model’s pre-processed inputs for further processing. We can also store transaction data and contexts for our users’ actions in Elasticsearch. In addition to basic data like timestamps and accessed pages, this can also be explicit actions from users. Both the interactions of our users performed before the model delivered them a prediction and their reactions to the predictions are of interest. We can use the time series store Prometheus [13] together with the visualization tool Grafana [14] to capture operational metrics and model predictions.

Fig. 1: Example architecture for monitoring ML systems

 

What can an architecture for monitoring ML systems look like? (Fig. 1) Our model is made available via a model server. This can be either a custom software solution or a standard tool like Seldon Core [15]. Input data is delivered to the model server as usual. But at the same time, it is also logged in Elasticsearch and visualized with Grafana or Kibana if required. Application logs and error messages from our model server are also stored in Elasticsearch. Now our model performs inferences and calculates predictions. These are returned to the requesting applications or users. Predictions are also stored in Prometheus. Standard metrics such as utilization, inference duration, and confidence are stored there too. These can also be visualized with Grafana or Kibana.

We’ll provide our own microservice for statistical analysis to collect statistical data such as distributions and standard deviations. It pulls data from Elasticsearch and Prometheus, performs analysis, and returns the data to Prometheus. If there are major deviations, then our microservice can send notifications to our team. In parallel, we can provide a drift detector as a second microservice, which handles detecting data drift and concept drift. For this, it obtains the model’s input and output data from Elasticsearch and Prometheus. To compare the production data to a baseline, we need to make the model’s training data available to the microservice. Ideally, we’ll provide this via a feature store like Feast. Metrics on drift are also stored in Prometheus. In this example, the drift detector and the service for statistically evaluating metrics are custom developments, since the tool selection in this category is currently very sparse.

An evolutionary approach

Of course, this is just an example architecture, because building machine learning monitoring infrastructures, and MLOps pipelines in general, always needs to be an evolutionary process. Deployment cannot happen with a one-off, big-bang approach. Each company has its own unique set of challenges and needs when developing machine learning strategies. Furthermore, there are different users in machine learning systems, different organizational structures, and existing application landscapes.

One possible approach here could be to implement an initial prototype without ML components. After that, you should start building the infrastructure and the simplest models, which you continuously monitor. The big advantage of this strategy is that you can start early with collecting input data and especially with collecting ground truth labels, i.e., the real, correct results. It is fairly common for ground truth labels to become available a long time after the prediction was made. For example, if we want to predict a company’s quarterly results, then they will only be available after three months. Therefore, you should start collecting data and labels early.
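Because labels arrive late, predictions have to be stored under some identifier so that they can be matched with the ground truth later. The following sketch shows this matching step; the dictionary layout, the identifiers, and the choice of the mean absolute error as metric are assumptions made for the example.

```python
def join_ground_truth(predictions, labels):
    """Match stored predictions with ground truth labels that arrived later.

    Both arguments map a prediction ID to a numeric value. Returns the
    matched (prediction, label) pairs and the IDs still waiting for a label.
    """
    pairs = {
        pid: (pred, labels[pid])
        for pid, pred in predictions.items()
        if pid in labels
    }
    pending = [pid for pid in predictions if pid not in labels]
    return pairs, pending


def mean_absolute_error(pairs):
    """Average absolute deviation over all matched pairs, or None if empty."""
    if not pairs:
        return None
    return sum(abs(pred - truth) for pred, truth in pairs.values()) / len(pairs)
```

Run periodically, such a job gives a delayed but real measurement of model quality, in contrast to the drift metrics, which are only indirect indicators.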

With the help of your infrastructure, collected data, and simple models, you can begin moving towards more complicated models in small, incremental steps. After that, you can lift these into a production environment until you have achieved your desired level of predictive accuracy and performance. By doing this, your monitoring infrastructure gives you a constant overview of how your model is performing, and when it’s time to roll out a new model. And you can also determine when you have to roll back your model to its previous state after finding a bug in your feature pipeline that slipped into production.

Links & Literature

[1] https://www.redapt.com/blog/why-90-of-machine-learning-models-never-make-it-to-production

[2] https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/

[3] A Chat with Andrew on MLOps: https://www.youtube.com/watch?v=06-AZXmwHjo

[4] https://landing.google.com/sre/book.html

[5] Deep Learning with Differential Privacy: https://arxiv.org/abs/1607.00133

[6] https://mlflow.org

[7] https://dvc.org

[8] https://feast.dev

[9] https://github.com/IDSIA/sacred

[10] https://www.kubeflow.org/

[11] https://www.elastic.co/de/elasticsearch/

[12] https://www.elastic.co/de/kibana/

[13] https://prometheus.io

[14] https://grafana.com

[15] https://www.seldon.io/tech/products/core/

Tools & Processes for MLOps
https://mlconference.ai/blog/tools-and-processes-for-mlops/
Wed, 26 May 2021 10:47:53 +0000

Training a machine learning model is getting easier. But building and training the model is also the easy part. The real challenge is getting a machine learning system into production and running it reliably. In the field of software development, we have gained a significant insight in this regard: DevOps is no longer just nice to have, but absolutely necessary. So why not use DevOps tools and processes for machine learning projects as well?

The post Tools & Processes for MLOps appeared first on ML Conference.

When we want to use our familiar tools and workflows from software development for data science and machine learning projects, we quickly run into problems. Data science and machine learning model building follow a different process than the classic software development process, which is fairly linear.

When I create a branch in software development, I have a clear goal in mind of what the outcome of that branch will be: I want to fix a bug, develop a user story, or revise a component. I start working on this defined task. Then, once I upload my code to the version control system, automated tests run – and one or more team members perform a code review. Then I usually do another round to incorporate the review comments. When all issues are fixed, my branch is integrated into the main branch and the CI/CD pipeline starts running; a normal development process. In summary, the majority of the branches I create are eventually integrated in and deployed to a production environment.

In the area of machine learning and data science, things are different. Instead of a linear and almost “mechanical” development process, the process here is very much driven by experiments. Experiments can fail; that is the nature of an experiment. I also often start an experiment precisely with the goal of disproving a hypothesis. Now, any training of a machine learning model is an experiment and an attempt to achieve certain results with a specific model and algorithm configuration and data set. If we imagine that, for a better overview, we manage each of these experiments in a separate branch, we will end up with a large number of branches very quickly. Since the majority of my experiments will not produce the desired result, I will discard many branches. Only a few of my experiments will ever make it into a production environment. But still, I want to have an overview of what experiments I have already done and what the results were so that I can reproduce and reuse them in the future.

But that’s not the only difference between traditional software development and machine learning model development. Another difference is behavior over time.

ML models deteriorate over time

Classic software works just as well after a month as it did on day one. Of course, there may be changes in memory and computational capacity requirements, and of course bugs will occur, but the basic behavioral characteristics of the production software do not change. With machine learning models, it’s different. For these, the quality decreases over time. A model that operates in a production environment and is not re-trained will degrade over time and never achieve as good a predictive accuracy as it did on day one.

Concept drift is to blame [1]. The world outside our machine learning system changes and so does the data that our model receives as input values. Different types of concept drift occur: data can change gradually, for example, when a sensor becomes less accurate over a long period of time due to wear and tear and shows an ever-increasing deviation from the actual measured value. Cyclical events such as seasons or holidays can also have an effect if we want to predict sales figures with our model.

But concept drift can also occur very abruptly: If global air traffic is brought to a standstill by COVID-19, then our carefully trained model for predicting daily passenger traffic will deliver poor results. Or if the sales department launches an Instagram promotion without notice that leads to a doubling of buyers of our vitamin supplement, that’s a good result, but not something our model is good at predicting.

There are two ways to counteract this deterioration in prediction quality: either we enable our model to actively retrain itself in the production environment, or we update our model frequently, ideally as often as we can. We may also have made a necessary adjustment to an algorithm or introduced a new model that needs to be rolled out as quickly as possible.

So in our machine learning workflow, our goal is not just to deliver models to the user. Instead, our goal must be to build infrastructure that quickly informs our team when a model is providing incorrect predictions and enables the team to lift a new, better model into production environments as quickly as possible.

MLOps as DevOps for Machine Learning

We have seen that data science and machine learning model building require a different process than traditional, “linear” software development. It is also necessary that we achieve a high iteration speed in the development of machine learning models, in order to counteract concept drift. For this reason, it is necessary that we create a machine learning workflow and a machine learning platform to help us with these two requirements. This is a set of tools and processes that are to our machine learning workflow what DevOps is to software development: A process that enables rapid but controlled iteration in development supported by continuous integration, continuous delivery, and continuous deployment. This allows us to quickly and continuously bring high-quality machine learning systems into production, monitor their performance, and respond to changes. We call this process MLOps [2] or CD4ML (Continuous Delivery for Machine Learning) [3].

MLOps also provides us with other benefits: Through reproducible pipelines and versioned data, we create consistency and repeatability in the training process as well as in production environments. These are necessary prerequisites to implement business-critical ML use cases and to establish trust in the new technology among all stakeholders.

In the enterprise environment, we have a whole set of requirements that need to be implemented and adhered to in addition to the actual use case. There are privacy, data security, reproducibility, explainability, non-discrimination, and various compliance policies that may differ from company to company. If we leave these additional challenges for each team member to solve individually, we will create redundant, inconsistent and simply unnecessary processes. A unified machine learning workflow can provide a structure that addresses all of these issues, making each team member’s job easier.

Due to the experimental and iterative nature of machine learning, each step in the process that can be automated has a significant positive impact on the overall run time of the process from data to productive model. A machine learning platform allows data scientists and software engineers to focus on the critical aspects of the workflow and delegate the routine tasks to the automated workflows. But what sub-steps and tools can a platform for MLOps be built from?

Components of an MLOps pipeline

An MLOps workflow can be roughly divided into three areas:

  1. Data pipeline and feature management
  2. Experiment management and model development
  3. Deployment and monitoring

In the following, I describe the individual areas and present a selection of tools that are suitable for implementing the workflow. Of course, this selection is neither exhaustive nor representative: the entire landscape is developing rapidly, so any overview can only be a snapshot.

Fig. 1: MLOps workflow

Data pipeline and feature management

As hackneyed as slogans like “data is the new oil” may seem, they have a kernel of truth: The first step in any machine learning and data science workflow is to collect and prepare data.

Centralized access to raw data

Companies with modern data warehouses or data lakes have a distinct advantage when developing machine learning products. Without a centralized point to collect and store raw data, finding appropriate data sources and ensuring access to that data is one of the most difficult steps in the lifecycle of a machine learning project in larger organizations.

Centralized access can be implemented here in the form of a Hadoop-based platform. However, for smaller data volumes, a relational database such as Postgres [4] or MySQL [5], or a document database based on the Elastic (ELK) Stack [6] is also perfectly adequate. The major cloud providers also provide their own products for centralized raw data management: Amazon Redshift [7], Google BigQuery [8] or Microsoft Azure Cosmos DB [9].

In any case, it is necessary that we first archive a canonical form of our original data before applying any transformation to it. This gives us an unmodified dataset of original data that we can use as a starting point for processing.

Even at this point in the workflow, it is important to rely on good documentation and to document the sources of the data, its meaning, and where it is stored. Even though this step seems simple, it is still of utmost importance. Invalid data, the wrong naming of a column of data, or a misconfigured scraping job can lead to a lot of frustration and wasted time.

Data Transformation

Rarely will we train our machine learning model directly on raw data. Instead, we generate features from the raw data. In the context of machine learning, a feature is one or more processed data attributes that an algorithm uses to make predictions. This could be a temperature value, for example, but in the case of deep learning applications also highly abstract features in images. To extract features from raw data, we will apply various transformations. We will typically define these transformations in code, although some ETL tools also allow us to define them using a graphical interface. Our transformations will either be run as batch jobs on larger sets of data, or we will define them as long-running streaming applications that continuously transform data.

We also need to split our dataset into training and testing datasets. To train a machine learning model, we need a set of training data. To test how our model performs with previously unknown data, we need another structurally identical set of test data. Therefore, we split our original transformed data set into two data sets. These must not overlap, meaning that no data point may appear in both of them. A common split here is to use 70 percent of the dataset for training and 30 percent for testing the model.

The exact split of the data sets depends on the context. For time-series data, sequential slices from the series should be chosen, while for image processing, random images from the data set should be chosen since they have no sequential relation to each other.

For non-sequential data, the individual data points can also be placed in a (pseudo-)random order. We also want to perform this process in a reproducible and automated manner rather than manually. A pleasantly usable tool for management and coordination here is Apache Airflow [10]. Here, according to the “pipeline as code” principle, one can define various pipelines in the form of a data flow graph, connect a wide variety of systems, and thus perform the desired transformations.
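The two splitting strategies described above can be sketched as follows. This is a minimal illustration using only the standard library; the function names and the fixed seed are assumptions, and the 70/30 ratio is simply the common split mentioned in the text.

```python
import random


def random_split(items, train_share=0.7, seed=42):
    """Shuffle a copy with a fixed seed, then cut it into two parts.

    Suitable for data without sequential order (e.g. images). The fixed
    seed makes the split reproducible across runs.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def sequential_split(series, train_share=0.7):
    """Keep the order: train on the earlier part, test on the later part.

    Suitable for time-series data, where shuffling would leak future
    information into the training set.
    """
    cut = round(len(series) * train_share)
    return list(series[:cut]), list(series[cut:])
```

Both functions cut one list at a single position, so the two resulting sets cannot share an element of the input list. Duplicate data points within the raw data would still have to be removed beforehand.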

Feature repositories

Many machine learning models and systems within a company use the same or at least similar features. Now, once we have extracted a feature from our raw data, there is a high probability that this feature can be useful for other applications as well. Therefore, it can be useful not to have to implement feature extraction again for each application. For this, we can store known features in a feature store. This can be done either in a dedicated component (such as Feast) [11], or in well-documented database tables populated by appropriate transformations. These transformations can be mapped automatically using Apache Airflow.

Data versioning

In addition to code versioning, data versioning is useful in a machine learning context. This allows us to increase the reproducibility of our experiments and to validate our models and their predictions by retracing the exact state of a training dataset that was used at a given time. Tools such as DVC [12] or Pachyderm [13] can be used for this purpose.

Experiment management and model development

In order to deploy an optimal model into production, we need to create a process that enables the development of that optimal model. To do this, we need to capture and visualize information that enables the decision of what the optimal model is, since in most cases this decision is made by a human and not automated.

Since the data science process is very experiment-driven, multiple experiments are run in parallel, often by different people at the same time. And most will not be deployed in a production environment. The experimental approach in this phase of research is very different from the “traditional” software development process, as we can expect that the code for these experiments will be discarded in the majority of cases, and only some experiments will reach a production status.

Experiment management and visualizations

Running hundreds or even thousands of iterations on the way to an optimally trained ML model is not uncommon. In the process, a large number of parameters defining each experiment, along with the results of that experiment, accumulate. Often, this metadata is stored in Excel spreadsheets or, in the worst case, in the heads of team members. However, to establish reproducibility, avoid time-consuming repeated experiments, and enable collaboration, this data should be captured automatically. Possible tools here are MLflow Tracking [14] or Sacred [15]. To visualize the output metrics, either classical dashboards like Grafana [16] or specialized tools like TensorBoard [17] can be used. TensorBoard can also be used independently of TensorFlow; for example, PyTorch provides a compatible logging library [18]. However, there is still much room for optimization and experimentation here. For example, combining these with other tools from the DevOps environment, such as Jenkins [19] and Terraform [20], would also be conceivable.
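To make the idea of automatic capture concrete, here is a minimal sketch that appends each run to a JSON Lines file. It deliberately does not use the MLflow or Sacred APIs; the helper names `log_run`, `load_runs`, and `best_run` and the record layout are assumptions for illustration only.

```python
import json
import time
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one experiment record (parameters and results) as a JSON line."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def load_runs(log_path):
    """Read all recorded runs back into a list of dicts."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def best_run(runs, metric, higher_is_better=True):
    """Pick the run with the best value for the given metric."""
    pick = max if higher_is_better else min
    return pick(runs, key=lambda run: run["metrics"][metric])
```

A dedicated tracking tool adds a user interface, artifact storage, and multi-user access on top of this basic pattern.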

Version control for models

In addition to the results of our experiments, the trained models themselves can also be captured and versioned. This allows us to more easily roll back to a previous model in a production environment. Models can be versioned in several ways: In the simplest variant, we export our trained model in serialized form, for example as a Python .pkl file. We then record this in a suitable version control system (Git, DVC), depending on its size.
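The simplest variant can be sketched as a directory of numbered pickle files. The file naming scheme `model_v<N>.pkl` and the helper names are assumptions for this example; a real setup would more likely rely on Git, DVC, or a model registry.

```python
import pickle
from pathlib import Path


def save_version(registry_dir, model):
    """Store the model as the next numbered version and return that number."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    existing = [int(p.stem.split("_v")[1]) for p in registry.glob("model_v*.pkl")]
    version = max(existing, default=0) + 1
    with (registry / f"model_v{version}.pkl").open("wb") as handle:
        pickle.dump(model, handle)
    return version


def load_version(registry_dir, version):
    """Load a specific version, for example to roll back in production."""
    # Only unpickle files from trusted sources: loading can execute arbitrary code.
    with (Path(registry_dir) / f"model_v{version}.pkl").open("rb") as handle:
        return pickle.load(handle)
```

Rolling back then amounts to loading an earlier version number instead of the latest one.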

Another option is to provide a central model registry. For example, the MLflow Model Registry [21] or the model registry of one of the major cloud providers can be used here. Also, the model can be packaged in a Docker container and managed in a private Docker Registry [22].

Infrastructure for distributed training

Smaller ML models can usually still be trained on one’s own laptop with reasonable effort. However, as soon as the models become larger and more complex, this is no longer possible and a central on-premise or cloud server becomes necessary for training. For automated training in such an environment, it makes sense to build a model training pipeline.

This pipeline is executed with training data at specific times or on demand. It receives the configuration that defines a training cycle, for example the model type, the hyperparameters, and the features to use. The pipeline can obtain the training dataset automatically from the feature store and distribute it to all model variants to be trained in parallel. Once training is complete, the model files, original configuration, learned parameters, metadata, and timings are captured in the experiment and model tracking tools. One possible tool for building such a pipeline is Kubeflow [23]. Kubeflow provides a number of useful features for automated training and for (cost-)efficient resource management based on Kubernetes.
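The configuration that defines a training cycle can be represented as a small immutable record, with one record per model variant to be trained in parallel. This sketch is independent of Kubeflow; the class `TrainingConfig`, the function `expand_grid`, and the example values in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TrainingConfig:
    model_type: str
    features: tuple
    hyperparameters: tuple  # sorted (name, value) pairs, so the config is hashable


def expand_grid(model_type, features, grid):
    """Create one TrainingConfig per combination of hyperparameter values."""
    names = sorted(grid)
    configs = []
    for values in product(*(grid[name] for name in names)):
        pairs = tuple(zip(names, values))
        configs.append(TrainingConfig(model_type, tuple(features), pairs))
    return configs


# Hypothetical example: 2 learning rates x 3 depths = 6 model variants.
variants = expand_grid(
    "gradient_boosting",
    ["age", "mileage"],
    {"learning_rate": [0.1, 0.01], "max_depth": [3, 5, 7]},
)
```

Each resulting config could then be handed to one training job and stored alongside the trained model for later reference.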

Deployment and monitoring

Unless our machine learning project is purely a proof of concept or an academic project, we will eventually need to lift our model into a production environment. And that’s not all: once it gets there, we’ll need to monitor it and deploy a new version as needed. Also, in most cases, we will have not just one, but a whole set of models in our production environments.

Deploy models

On a technical level, any model training pipeline must produce an artifact that can be deployed into a production environment. The prediction results may be bad, but the model itself must be in a packaged state that allows it to be deployed directly into a production environment. This is a familiar idea from software development: continuous delivery. This packaging can be done in two different ways.

In the first variant, our model is deployed as a separate service and accessed by the rest of our systems via an interface. Here, the deployment can be done, for example, with TensorFlow Serving [24] or in a Docker container with a matching (Python) web server.

The alternative is to embed the model in our existing system. Here, the model files are loaded and input and output data are routed within the existing system. The problem is that the model must be in a compatible format, or a conversion must be performed before deployment. If such a conversion is performed, automated tests are essential: they must ensure that the original and the converted model deliver identical prediction results.
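Such an automated check can be sketched as a function that runs both models on the same inputs and reports any disagreement. The function name `find_mismatches` is hypothetical, and the small default tolerances are an assumption: conversions between runtimes often introduce tiny floating-point differences. Setting both tolerances to 0 enforces strictly identical results.

```python
import math


def find_mismatches(original_predict, converted_predict, inputs,
                    rel_tol=1e-6, abs_tol=1e-9):
    """Return (input, expected, actual) for every input where the models disagree."""
    mismatches = []
    for x in inputs:
        expected = original_predict(x)
        actual = converted_predict(x)
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((x, expected, actual))
    return mismatches
```

In a deployment pipeline, a non-empty result would fail the build and block the converted model from being released.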

Monitoring

A data science project does not end with the deployment of a model into production. Even a production model has to face many challenges. The distribution of our input values may be different in the real world than in the training data. Also, value distributions can change slowly over time or due to singular events. This then requires retraining with the changed data.
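One way to quantify such a change in a single numeric feature is the Population Stability Index (PSI), which compares binned frequencies of the training data with those observed in production. PSI is used here as one common choice and is not prescribed by the approach described above; the bin count, the smoothing constant `eps`, and the function name are assumptions for this sketch.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between two non-empty samples, using equal-width bins over `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A value of 0 means the binned distributions match. A frequently quoted rule of thumb treats values above roughly 0.2 as a significant shift, but the right threshold depends on the application and should be validated on your own data.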

Also, despite intensive testing, errors may have crept in during the previous steps. For this reason, infrastructure should be provided to continuously collect data on model performance. The input values and the resulting predictions of the model should also be recorded, as far as this is compatible with the applicable data protection regulations. On the other hand, if privacy considerations are only introduced at this point, one has to ask how a sufficient amount of training data could be collected without questionable privacy practices.

Here’s a sampling of basic information we should be collecting about our machine learning system in production:

  • How many times did the model make a prediction?
  • How long does it take the model to perform a prediction?
  • What is the distribution of the input data?
  • What features were used to make the prediction?
  • What results were predicted and what real results were observed later in the system?

Tools such as Logstash [25] or Prometheus [26] can be used to collect our monitoring data. To get a quick overview of the performance of the model, it is recommended to set up a dashboard that visualizes the most important metrics and to set up an automatic notification system that alerts the team in case of strong deviations so that appropriate countermeasures can be taken if needed.
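Before wiring up a full monitoring stack, the basic information listed above can be collected with a thin wrapper around the prediction function. This sketch keeps everything in memory and does not use the Logstash or Prometheus APIs; the class `MonitoredModel` and its record layout are hypothetical.

```python
import time


class MonitoredModel:
    """Wrap a predict function and record basic monitoring data per call."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self.records = []  # one dict per prediction

    def predict(self, features):
        start = time.perf_counter()
        prediction = self._predict_fn(features)
        latency = time.perf_counter() - start
        self.records.append({
            "features": features,      # which inputs were used
            "prediction": prediction,  # what was predicted
            "latency_s": latency,      # how long the prediction took
            "observed": None,          # real outcome, filled in later
        })
        return prediction

    def record_outcome(self, index, observed):
        """Attach the real result once it becomes known."""
        self.records[index]["observed"] = observed

    def summary(self):
        count = len(self.records)
        total = sum(r["latency_s"] for r in self.records)
        return {"prediction_count": count,
                "mean_latency_s": total / count if count else 0.0}
```

In production, the records would be shipped to a log or metrics system instead of a Python list, and storing raw input features is subject to the data protection considerations mentioned earlier.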

Challenges on the road to MLOps

Companies face numerous staffing and team-related challenges on the road to a successful machine learning strategy. There is also the financial challenge of attracting experienced software engineers and data scientists. But even if we manage to assemble a good team, we need to enable its members to work together in the best possible way to bring out the strengths of each of them. Generally speaking, data scientists feel very comfortable using various statistical tools, machine learning algorithms, and Jupyter notebooks. However, they are often less familiar with the version control software and testing tools that are widely used in software engineering. While software engineers are familiar with these tools, they often lack the expertise to choose the right algorithm for a problem or to extract the last five percent of predictive accuracy from a model through skillful optimizations. Our workflows and processes must be designed to support both groups as well as possible and enable smooth collaboration.

In terms of technological challenges, we face a broad and dynamic technology landscape that is constantly evolving. In light of this confusing situation, we are often faced with the question of how to get started with new machine learning initiatives.

How do I get started with MLOps?

Building MLOps workflows must always be an evolutionary process and cannot be done in a one-time “big bang” approach. Each company has its own unique set of challenges and needs when developing its machine learning strategies. In addition, there are different users of machine learning systems, different organizational structures, and existing application landscapes. One possible approach here may be to create an initial prototype without ML components. Then, one should start building the infrastructure and simplest models.

From this starting point, the infrastructure created can then be used to move forward in small and incremental steps to more complicated models and lift them into production environments until the desired level of predictive accuracy and performance has been achieved. Short development cycles for machine learning models, in the range of days rather than weeks or even months, enable faster response to changing circumstances and data. However, such short iteration cycles can only be achieved with a high degree of automation.

Developing, putting into production, and keeping machine learning models productive is itself a complex and iterative process with many inherent challenges. Even on a small or experimental scale, many companies find it difficult to implement these processes cleanly and without failures. The data science and machine learning development process is particularly challenging in that it requires careful balancing of the iterative, exploratory components and the more linear engineering components.

The tools and processes for MLOps presented in this article are an attempt to provide structure to these development processes.

Although there are no proven and standardized processes for MLOps, we can take many lessons learned from “traditional” software engineering. In many teams there is a division between data scientists and software engineers. A pragmatic approach is common here: data scientists develop a model in a Jupyter notebook and throw this notebook over the fence to software engineering, which then takes the model into production with a DevOps approach.

If we think back a few years, “throwing it over the fence” is exactly the problem that gave rise to DevOps. This is exactly how Werner Vogels (CTO at Amazon) described the separation between development and operations in his famous interview in 2006 [27]: “The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon.” Then came the phrase that looks good on DevOps conference T-shirts, coffee mugs and motivational posters: “You build it, you run it.” As naturally as development and operations belong together today, we must also make the collaboration between data science and DevOps a matter of course.

Links & Literature

[1] Tsymbal, Alexey: “The problem of concept drift. Definitions and related work. Technical Report”: https://www.scss.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf

[2] https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

[3] https://martinfowler.com/articles/cd4ml.html

[4] https://www.postgresql.org

[5] https://www.mysql.com

[6] https://www.elastic.co/de/elasticsearch/

[7] https://aws.amazon.com/redshift/

[8] https://cloud.google.com/bigquery

[9] https://docs.microsoft.com/de-de/azure/cosmos-db/

[10] https://airflow.apache.org

[11] https://feast.dev

[12] https://dvc.org

[13] https://www.pachyderm.com

[14] https://www.mlflow.org/docs/latest/tracking.html

[15] https://github.com/IDSIA/sacred

[16] https://grafana.com

[17] https://www.tensorflow.org/tensorboard

[18] https://pytorch.org/docs/stable/tensorboard.html

[19] https://www.jenkins.io

[20] https://www.terraform.io

[21] https://www.mlflow.org/docs/latest/model-registry.html

[22] https://docs.docker.com/registry/deploying/

[23] https://www.kubeflow.org

[24] https://www.tensorflow.org/tfx/guide/serving

[25] https://www.elastic.co/de/logstash

[26] https://prometheus.io

[27] https://queue.acm.org/detail.cfm?id=1142065

 

The post Tools & Processes for MLOps appeared first on ML Conference.

]]>
Continuous Delivery for Machine Learning https://mlconference.ai/blog/continuous-delivery-for-machine-learning/ Tue, 14 Apr 2020 11:54:24 +0000 https://mlconference.ai/?p=16976 In modern software development, we’ve grown to expect that new software features and enhancements will simply appear incrementally, on any given day. This applies to consumer applications such as mobile, web, and desktop apps, as well as modern enterprise software. We’re no longer tolerant of big, disruptive software deployments. ThoughtWorks has been a pioneer in Continuous Delivery (CD), a set of principles and practices that improve the throughput of delivering software to production in a safe and reliable way.

The post Continuous Delivery for Machine Learning appeared first on ML Conference.

]]>
 

 

As organizations move to become more “data-driven” or “AI-driven”, it’s increasingly important to incorporate data science and data engineering approaches into the software development process to avoid silos that hinder efficient collaboration and alignment. However, this integration also brings new challenges when compared to traditional software development. These include:

A higher number of changing artifacts: Not only do we have to manage the software code artifacts, but also the data sets, the machine learning models, and the parameters and hyperparameters used by such models. All these artifacts have to be managed, versioned, and promoted through different stages until they’re deployed to production. It’s harder to achieve versioning, quality control, reliability, repeatability, and auditability in that process.

Size and portability: Training data and machine learning models usually come in volumes that are orders of magnitude higher than the size of the software code. As such they require different tools that are able to handle them efficiently. These tools impede the use of a single unified format to share those artifacts along the path to production, which can lead to a “throw over the wall” attitude between different teams.

Different skills and working processes in the workforce: To develop machine learning applications, experts with complementary skills are necessary, and they sometimes have contradicting goals, approaches, and working processes:

  • Data Scientists look into the data, extract features and try to find models which best fit the data to achieve the predictive and prescriptive insights they seek out. They prefer a scientific approach by defining hypotheses and verifying or rejecting them based on the data. They need tools for data wrangling, parallel experimentation, rapid prototyping, data visualization, and for training multiple models at scale.
  • Developers and machine learning engineers aim for a clear path to incorporate and use the models in a real application or service. They want to ensure that these models run as reliably, securely, efficiently, and scalably as possible.
  • Data engineers do the work needed to ensure that the right data is always up-to-date and accessible in the required amount, shape, speed, and granularity, as well as with high quality and minimal cost.
  • Business representatives define the outcomes to guide the data scientists’ research and exploration, and the KPIs to evaluate if the machine learning system is achieving the desired results with the desired quality levels.

Continuous Delivery for Machine Learning (CD4ML) is the technical approach to solve these challenges, bringing these groups together to develop, deliver, and continuously improve machine learning applications.

 


Figure 1: Continuous Delivery for Machine Learning (CD4ML) integrates the development processes and workflows of different roles with different skill sets for the development of machine learning applications


The Continuous Intelligence Cycle

In the first article of The Intelligent Enterprise series, we introduced the Continuous Intelligence cycle (see figure 2).


Figure 2: The Continuous Intelligence Cycle

This is a fundamental cycle of transforming data into information, insights and actions that support an organization as it moves towards data-driven decision making. In traditional organizations, this cycle relies on legacy systems (e.g. data warehouses, ERP systems) and human decision making. In these organizations, the process is slow and contains many friction points: machine learning applications are often developed in isolation and never leave the proof of concept phase. If they make it into production, this is often a one-time ad-hoc process that makes it difficult to update and re-train them, leading to stale and outdated models.

Intelligent Enterprises implement ways to speed up the Continuous Intelligence cycle and remove the different friction points along the way. CD4ML is the technical approach to accelerate the value generation of machine learning applications as part of the Continuous Intelligence cycle. It enables you to move away from offline or bench models and manual deployments, to automate the end-to-end process of gathering information and insights from data, to productionize decisions and actions based on those insights, and to collect more data to measure the outcomes once actions have been taken. This allows the Continuous Intelligence cycle to run faster and to produce higher-quality outcomes at lower risk by allowing feedback to be incorporated into the process.


What is CD4ML?

To understand CD4ML, we need to first understand Continuous Delivery (CD) and where its principles originated. Continuous Delivery, as Jez Humble and David Farley defined it in their seminal book, is: “… a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time”, which can be achieved if you “…create a repeatable, reliable process for releasing software, automate almost everything and build quality in.” 

They also state: “Continuous Delivery is the ability to get changes of all types — including new features, configuration changes, bug fixes, and experiments — into production, or into the hands of users, safely and quickly in a sustainable way.”

Changes to machine learning models are just another type of change that needs to be managed and released into production. Besides the code, it requires our CD toolset to be extended so that it can handle new types of artifacts. What’s more, the whole process of producing software in short cycles becomes more complex because there is more variety in the team’s skill sets (data scientists, data engineers, developers and machine learning engineers), with each following different workflows.

ThoughtWorks has extended the Continuous Delivery approach to overcome these challenges and make it applicable to machine learning applications, and calls this new approach Continuous Delivery for Machine Learning (CD4ML). It allows us to extend the Continuous Delivery definition to incorporate the new elements required to speed up the Continuous Intelligence cycle:

Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles.

This definition contains all the basic principles:

Software engineering approach. It enables teams to efficiently produce high quality software.

Cross-functional team. Experts with different skill sets and workflows across data engineering, data science, development, operations, and other knowledge areas are working together in a collaborative way emphasizing the skills and strengths of each team member.

Producing software based on code, data, and machine learning models. All artifacts of the software production process (code, data, models, parameters) require different tools and workflows and must be managed accordingly.

Small and safe increments. The release of software artifacts is divided into small increments, which provides visibility and control around the levels of variance of the outcomes, adding safety into the process.

Reproducible and reliable software release. The process of releasing software into production is reliable and reproducible, leveraging automation as much as possible. This means that all artifacts (code, data, models, parameters) are versioned appropriately.

Software release at any time. It’s important that the software can be delivered into production at any time. Even if organizations don’t want to deliver software all the time, being ready for release makes the decision about when to release a business decision instead of a technical one.

Short adaptation cycles. Short cycles mean development cycles on the order of days or even hours, not weeks, months, or even years. To achieve this, you want to automate the process, with quality safeguards built in. This creates a feedback loop that enables you to adapt your models as you learn from their behavior in production.

 

How it all works together

CD4ML aims to automate the end-to-end machine learning lifecycle and ensures a continuous and frictionless process from data capture, modeling, experimentation, and governance, to production deployment. Figure 3 gives an overview of the whole process.


Figure 3: Continuous Delivery for Machine Learning in action

Starting at the left side of the cycle, data scientists work on data they discover and access from data sources. They wrangle the data, perform feature extraction, split the data into training and test data, build data models and experiment with all of them. They write code to train the models (often in Python or R) and tune them by choosing parameters and hyperparameters.

As these models are trained, the data scientists are constantly evaluating them. This means looking at the model’s error rate, the confusion matrix, the number of false positives and false negatives, or running certain test scripts — for example, for chatbots. The tests should be as automated as possible with the help of test environments, test scripts or test programs.
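For a binary classifier, the evaluation figures mentioned here can be computed with a few lines of plain Python. The function names and the convention that the positive class is labeled `1` are assumptions for this sketch; libraries such as scikit-learn offer equivalent, more complete utilities.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def error_rate(counts):
    """Share of wrong predictions among all predictions."""
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total if total else 0.0
```

Running these checks automatically after every training run turns the evaluation into a repeatable test instead of a manual inspection.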

Once a good model is found, it’s ready to be productionized. The model has to be adapted to the production environment. This could mean containerization of the model code or even transforming it to a high-performance language like Java or C++ — either manually or using automatic transformation tools. The productionized version of the model has to be tested again in conjunction with other components of the overall architecture before it can be deployed to production.

 

In production, we have to observe and monitor how the model behaves “in the wild”. Metrics like usage, model input, model output, and possible model bias are important information about the model performance. This data can be fed back to the first stage of the process to enable further improvement: the whole Continuous Intelligence cycle starts again.

The transportation of the artifacts (source code, executables, training, and test data or model parameters) between the different process stages is controlled via pipelines that are executed by a CD orchestration tool. Every artifact is versioned, enabling reproducibility and auditability, so prior versions can be rebuilt or redeployed if required. The CD orchestration tool ensures the smooth and frictionless operation of the whole process and also allows governance and compliance, so certain quality standards and fairness checks are built into the process.

 

CD4ML in Action

We want to demonstrate the approach in practice based on a real client project delivered by ThoughtWorks. In fact, our current notion of CD4ML first emerged several years ago when we first applied Continuous Delivery to a user-facing machine learning application. You can read about it in detail here.

Our challenge was to build a price estimation engine for a leading European online car marketplace. The engine needed to be able to give a realistic estimate for anybody looking to buy or sell a car. That price estimate would be based on past car sales within the marketplace. As the market for used cars is constantly changing, the price estimation model has to be continuously re-trained on new data. A perfect case for CD4ML.


Figure 4: A CD4ML end-to-end process in a real-world example

Figure 4 shows the overall CD4ML flow for this specific case. The data scientists train the model using data from the marketplace — such as car specs, asking price and actual sales price. The model then predicts a price based on the car model, age, mileage, engine type, equipment, etc.

Before training a model, there’s a lot of data cleanup work to be done: detecting outliers, wrong listings, or dirty data. This is the first quality gate to be automated — is there enough good data to even provide a prediction model for a certain car model?

Once the trained model can make sufficiently accurate price estimates, it’s exported as a productionizable artifact, such as a JAR or a pickle file. This is the second quality gate: is the model’s error rate acceptable?

This prediction model is then transformed into a format matching the target platform, then packaged, wrapped, and integrated into a deployable artifact — a prediction service JAR containing a web server or a container image that can be readily deployed into a production environment. This deployment artifact is now tested again, this time in an end-to-end fashion: is it still producing the same results as the original, non-integrated prediction model? Does it behave correctly in a production environment, for instance, does it adhere to contracts specified by other consuming services? This is the third quality gate.

If all three quality gates succeed, a new re-trained price prediction service is deployed and released. Importantly, all of those steps should be automated so that re-training to reflect the latest market changes happens without manual intervention as long as all quality gates are satisfied.
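The sequence of quality gates can be sketched as an ordered list of named checks, where deployment happens only if every check passes. The function name `run_quality_gates`, the gate names, and all numbers and thresholds in the usage lines are hypothetical values chosen for illustration, not figures from the project.

```python
def run_quality_gates(gates):
    """Run (name, check) pairs in order and stop at the first failing gate."""
    for name, check in gates:
        if not check():
            return {"deploy": False, "failed_gate": name}
    return {"deploy": True, "failed_gate": None}


# Hypothetical measurements produced by earlier pipeline steps.
clean_listings = 12_000
model_error_rate = 0.04
end_to_end_matches = True

decision = run_quality_gates([
    ("enough_good_data", lambda: clean_listings >= 1_000),
    ("acceptable_error_rate", lambda: model_error_rate <= 0.05),
    ("end_to_end_equivalence", lambda: end_to_end_matches),
])
```

Returning the name of the failed gate makes it easy to see in the pipeline log why a re-trained model was not released.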

Finally, the live price prediction is continuously monitored: how do the sellers react to the price recommendations? How much is the listing price deviating from the suggestion? How close is the price prediction to the final buying price of the respective vehicle? Is the overall conversion and user experience being impacted, for instance by rising complaints or direct positive feedback? In some cases, it makes sense to deploy the new model next to the old version to compare their performance. All this new data then informs the next iteration of training the prediction model, either directly through new data from cars that were sold or by tweaking the model’s hyperparameters based on user feedback, which closes the Continuous Intelligence cycle.
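Comparing a new model with the old one on the same sold vehicles can be sketched with a simple error metric. Mean absolute error is used here as one plausible choice; the function names and the result layout are assumptions, and the article does not state which metric the project used.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual prices."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def compare_models(current_preds, candidate_preds, sale_prices):
    """Compare the live model and a candidate on the same final sale prices."""
    current = mean_absolute_error(current_preds, sale_prices)
    candidate = mean_absolute_error(candidate_preds, sale_prices)
    return {
        "current_mae": current,
        "candidate_mae": candidate,
        "candidate_is_better": candidate < current,
    }
```

Both models must be scored on the same vehicles for the comparison to be meaningful, which is why the function takes a single list of sale prices.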

Opportunities of CD4ML and the road ahead

Adopting Continuous Delivery for Machine Learning creates new opportunities to become an Intelligent Enterprise. By automating the end-to-end process from experimentation to deployment, to monitoring in production, CD4ML becomes a strategic enabler to the business. It creates a technological capability that yields a competitive advantage. It allows your organization to incorporate learning and feedback into the process, towards a path of continuous improvement.

This approach also breaks down the silos between different teams and skill sets, shifting towards a cross-functional and collaborative structure to deliver value. It allows you to rethink your organizational structures and technology landscape to create teams and systems aligned to business outcomes. In subsequent articles in the series, we’ll explore how to bring product thinking into the data and machine learning world, as well as the importance of creating a culture that supports Continuous Intelligence.

Another key opportunity to implement CD4ML successfully is to apply platform thinking at the data infrastructure level. This enables teams to quickly build and release new machine learning and insight products without having to reinvent or duplicate efforts to build common components from scratch. We’ll dedicate an entire article to the technical components, tools, techniques, and automation infrastructure that can help you to implement CD4ML.

Finally, leveraging automation and open standards, CD4ML can provide the means to build a robust data and architecture governance process within the organization. It allows introducing processes to check fairness, bias, compliance, or other quality attributes within your models on their path to production. Like Continuous Delivery for software development, CD4ML allows you to manage the risks of releasing changes to production at speed, in a safe and reliable fashion.

All in all, Continuous Delivery for Machine Learning moves the development of such applications from proof-of-concept programming to professional state-of-the-art software engineering.

 

This article was first published on ThoughtWorks.com

The post Continuous Delivery for Machine Learning appeared first on ML Conference.

]]>