MLOps, LLMOps & Pipelines: Scaling ML in Production

Model Context Protocol Servers and Security: What You Need to Know

rdsouza@sandsmedia.com — Mon, 13 Oct 2025 06:46:59 +0000

Use of Model Context Protocol (MCP) has grown exponentially in the past year. Since its initial launch by Anthropic as a way to manage autonomous AI agents, MCP has become the de facto standard for connecting AI application components. With MCP, users can create AI agents that can move assets, alter data, and execute business processes – with or without human oversight. MCP relies on servers to connect and manage interaction between agents, processes, and data.

But like many developer-focused projects built to scale rapidly, MCP does not include much in the way of security out of the box. Given the kinds of data AI is often given access to, a lack of robust security can pose a massive risk – and with so much power available, MCP servers are a prime, high-risk target for threat actors.

This is a pattern we have seen before with application programming interfaces, or APIs. So, what can we learn from the ways we’ve learned to manage and harden APIs, and how can this be applied to MCP before potential threats turn into real risks?

Learning from API history

APIs have become the preferred way for developers to build their software. Using APIs, you can connect your software builds to third-party tools, or to cloud services, and create that experience faster. Using microservices, you can add more functionality into your components and APIs without taking down the entire application to achieve your goal. And from a security perspective, having those feature-rich APIs from vendors means that you can get more data on what is happening, then use that insight to automate some of your security operations.

The logic then, with APIs, is the same as it is now with MCP servers – getting that integration to work fast makes things easier for developers. However, it also introduces new classes of risk. For example, a simple error in an application component attached to one API can affect the rest of the application, either leading to interruptions in performance or security vulnerabilities. While APIs made it faster to build, they also made it easier to scale up that mistake over time. Now, imagine this same potential risk amplified by autonomous agents acting at machine speed.

But MCP servers are not just APIs – they provide the operational backbone for agentic AI. Legacy APIs are deterministic and act in the same way time after time. MCPs don’t have that degree of control. Instead, they operate differently based on context in order to empower their large language models to take action. The protocol often assumes that both the requestor and the object requested are benign, so requests are not validated before they are acted on.

From a security perspective, this is a significant miss that can lead to unintended consequences. An attacker could trigger the MCP to leak data or move data to an unauthorised location. It could also trigger a workflow that should not be allowed, or attempt to sabotage operations altogether.

There are three security issues here: weak authentication, prompt injection, and broad authorisation. In response, regulators and standards bodies in the EU and the US are pushing organisations to address these risks directly, through the EU AI Act and NIST’s AI Risk Management Framework respectively. Developers should be aware that MCP security is something they will have to address sooner rather than later.

How to secure MCP deployments

To address this problem, developers and security teams must work together. The challenge is how to effectively solve these problems before deployment, but the opportunity is that any effort will improve your overall long-term approach. Doing so should reduce management overhead and simplify security over time.

To start, you must understand how to approach authentication and credential management. We all know that multi-factor authentication should be in place, but this should go one step further when it comes to MCP. Rather than using static tokens, look at how you can use short-lived tokens and credentials that rotate over time. This limits how long a stolen token remains useful to an attacker. On the security side, you should monitor for token misuse and revoke any credentials that you don’t actually need.
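
As a minimal sketch of the short-lived token idea, assuming a shared HMAC secret and a five-minute lifetime: a token can carry its own expiry, so a stolen copy stops working quickly. The function names, token format, and lifetime are illustrative assumptions and are not part of MCP.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical secret; in practice it would come from a secrets manager and be rotated.
SECRET = b"rotate-me-regularly"
TOKEN_TTL_SECONDS = 300  # assumed five-minute lifetime


def issue_token(agent_id: str, now: Optional[float] = None) -> str:
    """Return a signed token that embeds its own expiry time."""
    issued = time.time() if now is None else now
    payload = json.dumps({"sub": agent_id, "exp": issued + TOKEN_TTL_SECONDS}).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    signature = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{signature}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the agent id if the signature is valid and the token has not expired."""
    try:
        body, signature = token.rsplit(".", 1)
    except ValueError:
        return None  # no separator, so this is not one of our tokens
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None  # tampered or signed with a different secret
    claims = json.loads(base64.urlsafe_b64decode(body.encode()))
    current = time.time() if now is None else now
    return claims["sub"] if current < claims["exp"] else None
```

A production deployment would more likely rely on an established standard such as OAuth 2.0 access tokens than on a hand-rolled format; the point here is only that expiry is checked on every request.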

Permissions and access aren’t just a concern for humans, though. The biggest reason agentic AI has surged in popularity is that it can act on its own – a defining feature that opens up potential problems when the AI agents have free rein. How much are you willing to allow your systems to act independently, and how many processes will require a human to be in the loop? This is a business risk decision, not one that developers should make on their own.

After defining those boundaries for agentic AI, you can look at tightening permissions around who gets to ask questions or provide prompts. Prompt injection is a proven attack approach, and attackers will use it if they can get access. To prevent this, use input validation and sanitisation at every layer. You can also route all prompt queries through a proxy to remove malicious requests before they reach the MCP server. This allows you to control all potential inputs into your system, even if you don’t have a standardised approach to follow.
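
The screening step such a proxy performs can be sketched roughly as follows. The length limit and the deny-list patterns are assumptions chosen for illustration; a deny-list on its own is easy to evade and would be only one layer among several.

```python
import re
from dataclasses import dataclass

MAX_PROMPT_CHARS = 4000  # assumed limit, tune per deployment

# Illustrative deny-list only; real injection attempts are far more varied.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]


@dataclass
class Verdict:
    allowed: bool
    reason: str
    prompt: str  # sanitised prompt, empty when the request is rejected


def screen_prompt(raw: str) -> Verdict:
    """Sanitise a prompt and decide whether it may be forwarded to the MCP server."""
    # Drop non-printable control characters, which can be used to hide instructions.
    cleaned = "".join(ch for ch in raw if ch.isprintable() or ch in "\n\t").strip()
    if not cleaned:
        return Verdict(False, "empty prompt", "")
    if len(cleaned) > MAX_PROMPT_CHARS:
        return Verdict(False, "prompt too long", "")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(cleaned):
            return Verdict(False, f"matched deny-list pattern: {pattern.pattern}", "")
    return Verdict(True, "ok", cleaned)
```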

You should also look at reducing permissions to control the impact of any attack getting through. If you have broad permissions and poor multi-tenancy controls, an attack will create more issues and affect more systems than one that is locked down. MCP servers have little to no standard authorisation controls in place, as this is not included in the protocol by default. As such, it’s crucial to ensure you have a robust and well-defined approach to managing access and permissions in place before you connect up any MCP server to your sensitive data.

Ensuring that security and development teams collaborate to enforce least-privilege access and role-based authorisation is critical. You should also isolate your contexts and tenants to reduce the potential impacts from any successful attack. In practice, this helps you contain any potential breach to a single workflow or user, rather than affecting your entire organisation.
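
A deny-by-default check that combines least privilege with tenant isolation might look like the sketch below. The role names, tool names, and the `Caller` type are hypothetical; MCP itself does not define them.

```python
from dataclasses import dataclass

# Hypothetical mapping of roles to the only tools each role may invoke.
ROLE_ALLOWED_TOOLS = {
    "reader": {"search_docs"},
    "support_agent": {"search_docs", "create_ticket"},
}


@dataclass(frozen=True)
class Caller:
    agent_id: str
    role: str
    tenant_id: str


def authorize(caller: Caller, tool: str, resource_tenant: str) -> bool:
    """Allow a call only if the role lists the tool and the resource is in the caller's tenant."""
    allowed_tools = ROLE_ALLOWED_TOOLS.get(caller.role, set())  # unknown roles get nothing
    return tool in allowed_tools and caller.tenant_id == resource_tenant
```

Because unknown roles map to an empty set, a misconfigured agent fails closed rather than open.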

Knowing you are in control

For developers, working with security teams on controls for MCP deployments is another task to add to the list. But with more and more AI application deployments happening, and so much demand for agentic AI systems, getting MCP deployments right from the start is an essential skill to develop.

Unfortunately, many of the static controls that worked for APIs don’t work for MCP servers. Instead, your security controls have to work in real time, just like your prompting and responses. Furthermore, your entire company will have to develop a sense of what is needed for security around AI systems. It’s not just the responsibility of the security team or you as the person who developed the application, but the whole organisation. This mindset shift helps the business innovate safely and deliver real impact.

Attacks on AI systems are already happening. From bad actors looking out for LLM service accounts that they can hijack and resell proxy access to, dubbed LLMjacking, to training data containing API keys, credentials, and user accounts that could be pillaged for access, AI systems are already under threat. As we consider how to innovate and move faster with agentic AI, we can’t underestimate its inherent risks. New standards, like MCP, should have security by design baked into them from the start – but when they don’t, other guardrails should be put in place.

As autonomous agents become inseparable from business operations, the MCP servers that run these services will be targeted. Putting strong security principles in place, in collaboration with security teams, will be a necessary investment if agentic AI services are to deliver. For developers, thinking about this at the start will make your businesses more secure – and your life easier – in the long run, too.

Frequently Asked Questions (FAQ)

Q: What is Model Context Protocol (MCP)?
A: Model Context Protocol (MCP) is a standard used to connect AI application components, allowing autonomous AI agents to perform operations like altering data, moving assets, and executing business processes. It relies on servers to manage the interaction between these agents, processes, and data.

Q: What is the main security risk associated with MCP?
A: The main risk is that MCP does not include much security out of the box. Given the powerful actions MCP servers can perform and the sensitive data they can access, this lack of built-in security makes them a prime, high-risk target for attackers.

Q: How does MCP security differ from traditional API security?
A: Traditional APIs are deterministic, meaning they act the same way every time. MCPs are different; they operate based on context to empower large language models. The protocol often assumes requests are benign and doesn’t validate them, which can lead to unintended consequences if attacked.

Q: What are the three primary security threats to MCP servers?
A: The three main security issues identified are weak authentication, prompt injection, and broad authorization.

Q: How can you defend against weak authentication in MCP deployments?
A: Instead of using static tokens, you should implement short-lived tokens and credentials that rotate over time to prevent attackers from stealing and reusing them. It is also crucial to monitor for token misuse and revoke unnecessary credentials.

Q: What are the recommended methods to prevent prompt injection attacks?
A: To prevent prompt injection, you should use input validation and sanitization at every layer. Another effective strategy is to route all prompt queries through a proxy to filter out malicious requests before they reach the MCP server.

Q: Why is managing permissions and authorization so critical for MCP?
A: It is critical because MCP servers have few to no standard authorization controls by default. Without enforcing least-privilege access and isolating tenants, a successful attack could impact your entire organization rather than being contained to a single user or workflow.

Q: Who is responsible for securing MCP within an organization?
A: Securing AI systems and MCP deployments is the responsibility of the entire organization, not just the security team or the application’s developer. It requires a collaborative mindset shift across the business to innovate safely.

The post Model Context Protocol Servers and Security: What You Need to Know appeared first on ML Conference.

AI Security in Focus: Managing Identity, Model Drift, and LLMOps Risks

adharaneedharan@sandsmedia.com — Mon, 01 Sep 2025 15:28:10 +0000

The AI Triple Threat: Why Identity Security Must be the Cornerstone of AI Adoption

by David Higgins

AI brings new possibilities, but with it, new risks. This article looks at the three threats that AI brings and the best strategies to use identity security and keep cybersecurity at the forefront of digital strategies.

A series of recent high-profile breaches has demonstrated that the UK remains highly exposed to increasingly sophisticated cyber threats. This vulnerability is growing as artificial intelligence becomes more deeply embedded in day-to-day business operations. From driving innovation to enabling faster decision-making, AI is now integral to how organisations deliver value and stay competitive. Yet, its transformative potential comes with risks that too many organisations have yet to fully address.

CyberArk’s latest research shows that AI now presents a complex “triple threat”. It is being exploited as an attack vector, deployed as a defensive tool and, perhaps most concerning, introducing critical new security gaps. This dynamic threat landscape demands that organisations place identity security at the centre of any AI strategy if they wish to build resilience for the future.

AI is enhancing familiar threats

AI has raised the bar for traditional attack methods. Phishing, which remains the most common entry point for identity breaches, has evolved beyond poorly worded emails to sophisticated scams that use AI-generated deepfakes, cloned voices and authentic-looking messages. Nearly 70% of UK organisations fell victim to successful phishing attacks last year, with more than a third reporting multiple incidents. This shows that even robust training and technical safeguards can be circumvented when attackers use AI to mimic trusted contacts and exploit human psychology.

It is no longer enough to assume that conventional perimeter defences can stop such threats. Organisations must adapt by layering in stronger identity verification processes and building a culture where suspicious activity is flagged and investigated without hesitation.

AI as a defensive asset

While AI is strengthening attackers’ capabilities, it is also transforming how defenders operate. Nearly nine in ten UK organisations now use AI and large language models to monitor network behaviour, identify emerging threats and automate repetitive tasks that previously consumed hours of manual effort. In many security operations centres, AI has become an essential force multiplier that allows small teams to handle a vast and growing workload.

Almost half of organisations expect AI to be the biggest driver of cybersecurity spending in the coming year. This reflects a growing recognition that human analysts alone cannot keep up with the scale and speed of modern attacks. However, AI-powered defence must be deployed responsibly. Over-reliance without sufficient human oversight can lead to blind spots and false confidence. Security teams must ensure AI tools are trained on high-quality data, tested rigorously, and reviewed regularly to avoid drift or unexpected bias.

AI is expanding the attack surface

The third element of the triple threat is the rapid growth in machine identities and AI agents. As employees embrace new AI tools to boost productivity, the number of non-human accounts accessing critical data has surged, now outnumbering human users by a ratio of 100 to one. Many of these machine identities have elevated privileges but operate with minimal governance. Weak credentials, shared secrets and inconsistent lifecycle management create opportunities for attackers to compromise systems with little resistance.

Shadow AI is compounding this challenge. Research indicates that over a third of employees admit to using unauthorised AI applications, often to automate tasks or generate content quickly. While the productivity gains are real, the security consequences are significant. Unapproved tools can process confidential data without proper safeguards, leaving organisations exposed to data leaks, regulatory non-compliance and reputational damage.

Addressing this risk requires more than technical controls alone. Organisations should establish clear policies on acceptable AI use, educate staff on the risks of bypassing security, and provide approved, secure alternatives that meet business needs without creating hidden vulnerabilities.

Putting identity security at the centre

Securing AI-driven businesses demands that identity security be embedded into every layer of the organisation’s digital strategy. This means achieving real-time visibility of all identities, whether human, machine or AI agent, applying least privilege principles consistently, and continuously monitoring for abnormal access behaviours that may indicate compromise.

Forward-looking organisations are already adapting their identity and access management frameworks to handle the unique demands of AI. This includes adopting just-in-time access for machine identities, implementing privilege escalation monitoring and ensuring that all AI agents are treated with the same rigour as human accounts.

AI promises enormous value for organisations ready to embrace it responsibly. However, without strong identity security, that promise can quickly turn into a liability. The companies that succeed will be those that understand that building resilience is not optional, but foundational to long-term growth and innovation.

In an era where adversaries are equally empowered by AI, one principle holds true: securing AI begins and ends with securing identity.

———————————————————————————————————————————————————————————————-

Managing Model Drift in LLMs for the Safe Use of AI

by João Freitas

Implementing an effective LLMOps framework can help enterprises ensure that output from their LLMs stays free of model drift and AI hallucinations. This article explains how to create a successful LLMOps strategy, manage model drift, and ensure customer trust and satisfaction.

The number of business professionals using AI continues to grow as both sanctioned and unsanctioned use skyrocket, and organizations deploy commercially available LLMs internally. Given the increasing adoption of LLMs, organizations must ensure outputs from these models are trustworthy and repeatable over time. LLMs have become business-critical systems in modern enterprises, and any potential failure of these systems can rapidly harm customer trust, violate regulations and damage an organization’s reputation.

Foundational AI models are expensive to train and run, and in most business contexts, there is minimal return on investment for companies that invest millions in building their own models. With this cost in mind, organizations instead choose to rely on LLMs developed by third parties, which must be managed in the same way other enterprise systems are managed.

However, organizations must be on guard for model drift and AI hallucinations when using these third-party models, and implement standardized processes to remediate these issues. This specialized space, called LLMOps, is emerging as organizations adopt dedicated platforms that extend traditional MLOps and observability frameworks to meet the unique challenges posed by widespread LLM use.

But what does a suitable LLMOps framework look like?

Forming the bedrock of LLMOps

It’s clear that organizations need LLMOps to mitigate the risk of hallucinations or model drift, but the practical aspects of an LLMOps framework can be less apparent. Several crucial considerations must form the bedrock of an organization’s LLMOps practices.

When any publicly available LLM is adopted by an organization, the first step in managing its use is to establish clear guardrails for the systems and data it can access. Approved use cases for the LLM must also be made clear across teams to strike the right balance: enabling innovation without ever exposing sensitive data or systems to a third-party provider, or crossing data permissions boundaries.

Similarly, organizations must set up a good level of observability around any LLMs to detect issues with latency or inaccurate outputs before they can escalate into issues that directly affect engineering teams. Both of these steps can improve organizational security around LLM usage to reduce the risk exposure often associated with the adoption of new tools.

To maintain the long-term accuracy and trustworthiness of LLM outputs, organizations must implement safeguards to reduce bias and ensure fairness in any outputs generated. LLMs are prone to bias, which is present in the data they were trained with. For example, LLMs often refer to developers as “he” rather than using a gender-neutral term. While this may seem innocuous, it can be a sign of other biases within the LLM, which can ultimately affect hiring decisions or internal company policies, often to the detriment of one or more groups.

It is also vital for organizations to test the LLMs they use for degradation over time due to changes in the data. This is necessary to ensure the model aligns with the data in their environment and provides an additional layer of security against AI hallucinations.
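
One simple way to test for degradation is to replay a fixed "golden set" of prompts on a schedule and compare the pass rate with a stored baseline. The sketch below assumes a hypothetical `ask_model` callable that wraps whichever LLM is in use, and a substring match as the pass criterion; both are illustrative simplifications.

```python
from typing import Callable, Sequence, Tuple


def pass_rate(ask_model: Callable[[str], str],
              golden_set: Sequence[Tuple[str, str]]) -> float:
    """Fraction of golden prompts whose answer contains the expected text."""
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    hits = sum(
        1 for prompt, expected in golden_set
        if expected.lower() in ask_model(prompt).lower()
    )
    return hits / len(golden_set)


def has_degraded(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a drop below the baseline that exceeds the assumed tolerance."""
    return current < baseline - tolerance
```

In practice the pass criterion would usually be richer than a substring match (for example, a rubric or a human review sample), but the comparison against a baseline stays the same.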

The final pillar of an effective LLMOps framework is for the organization to proactively address risks related to the generation of incorrect sensitive data, such as generating incorrect pricing. Sensitive, business-critical decisions cannot be wholly given over to LLMs. Instead, responsible LLMOps will keep human oversight for critical operations.

When successfully adopted, LLMOps will enable LLMs to scale as more users within an organization adopt tools with guardrails in place. LLMOps will also keep LLMs performing well so they never become blockers to innovation or cause operational slowdowns.

However, LLMOps is not a one-and-done process. Instead, LLMs must be constantly monitored and retrained on up-to-date datasets to avoid model drift over time.

How LLMOps prevents model drift

With a vast number of organizations using commercially available LLMs, there is a growing risk of model drift influencing LLM-generated outputs as time goes on. The primary cause of model drift is a model basing its responses on outdated data. For example, an organization using GPT-1 would only receive answers based on that model’s training data, which comes from pre-2018, while GPT-4 has been trained on data up to 2023.

So, how can enterprises use LLMOps to combat model drift?

There are five strategies organizations can employ, depending on their datasets and computational resources:

Use the latest version of an LLM to account for more recent data, helping to ensure that any generated outputs will be up to date and reducing the chance of AI hallucinations where the LLM tries to fill gaps in its training data.
Fine-tune pre-trained LLMs to respond to a specific topic, improving the accuracy of outputs without the major investment of training a proprietary model.
Adjust parameters for responses and adjust the weighting of responses to enable an LLM to give more importance to certain tokens over others in response generation.
Use Retrieval-Augmented Generation (RAG) to enhance the LLM’s case-specific knowledge and factual accuracy by retrieving relevant information from external knowledge sources during inference.
Pass sufficient, industry-focused context to the model to ensure users get better responses to questions and more relevant answers for the enterprise’s specific industry.
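
To make the RAG strategy above more concrete, here is a deliberately simplified sketch. It assumes a keyword-overlap retriever in place of the embedding search and vector store a real system would normally use, and the function names and prompt wording are illustrative.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case a text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Return up to top_k documents that share the most tokens with the question."""
    query_tokens = tokenize(question)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked if score > 0][:top_k]


def build_prompt(question: str, documents: List[str]) -> str:
    """Prepend retrieved context so the model answers from current, in-house data."""
    context = retrieve(question, documents)
    context_block = "\n".join(f"- {doc}" for doc in context) or "- (no relevant context found)"
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )
```

The resulting prompt is what gets sent to the LLM, so the knowledge base can be updated without retraining the model.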

Successful LLMOps is continuous

While enterprises can adopt LLMOps to manage how teams use LLMs, they cannot treat it as a one-off process.

Preventing model drift requires constant supervision of AI-generated outputs and regular retraining of LLMs as an organization’s internal datasets evolve. Given the potentially damaging business impact of incorrect results, mitigating hallucination risk is crucial to the success of a modern organization.

Through the creation of an effective LLMOps strategy, organizations will be able to improve customer trust, ensure their regulatory compliance and protect their reputation, all while making their operations more efficient.
———————————————————————————————————————————————————————————————————————–

FREQUENTLY ASKED QUESTIONS

What are the three main AI-related cybersecurity threats?

AI presents a triple threat: it serves as an attack vector, a defensive tool, and introduces new identity-related vulnerabilities. These roles increase the complexity and risk in enterprise cybersecurity strategies.

How has AI changed traditional phishing attacks?

AI has enabled highly convincing phishing scams using deepfakes, cloned voices, and realistic messages. These attacks bypass human training and technical safeguards, making identity verification critical.

How is AI used as a cybersecurity defense mechanism?

AI and LLMs help monitor networks, detect emerging threats, and automate repetitive security tasks. However, over-reliance without human oversight can result in blind spots and biased decisions.

What risks are introduced by machine identities in AI systems?

Machine identities now outnumber human users 100 to 1, often with high privileges and little governance. Poor credential management and lifecycle policies make them a major attack surface.

What is Shadow AI and why is it dangerous?

Shadow AI refers to employees using unauthorized AI tools without IT approval. This exposes sensitive data and creates compliance and reputational risks.

How can organizations secure AI-driven environments?

By embedding identity security into digital strategies: applying least privilege, monitoring access behavior, and managing both human and machine identities equally.

What are the foundational pillars of an LLMOps strategy?

LLMOps includes guardrails for LLM access, observability for latency and accuracy, bias reduction, data alignment, and human oversight for critical decisions.

What causes model drift in LLMs and how can it be mitigated?

Model drift results from outdated training data. It can be addressed through updated LLM versions, fine-tuning, RAG, parameter adjustments, and industry-specific prompts.

Why is continuous monitoring critical in LLMOps?

LLMOps must be an ongoing process. Regular retraining and supervision are required to ensure accuracy, prevent hallucinations, and uphold customer trust.

The post AI Security in Focus: Managing Identity, Model Drift, and LLMOps Risks appeared first on ML Conference.

The Expanding Scope of Observability for AI Systems

adharaneedharan@sandsmedia.com — Tue, 12 Aug 2025 14:19:59 +0000

As organizations accelerate their adoption of AI-powered tools, ranging from CodeBots to agentic AI, observability is rapidly shifting from a technical afterthought to a strategic business enabler. In our last article, “Observability in the Era of CodeBots, AI Assistants, and AI Agents”, we briefly touched upon key enhancements in the observability space. Here we continue that discussion: the stakes are high for the next steps in observability, as AI systems are expected to act autonomously, make complex decisions, and interact with humans and other agents in ways that are often opaque. Without robust observability, organizations risk not only technical debt and operational inefficiency, but also ethical lapses, compliance violations, and loss of user trust.

Join us at MLCon New York to attend Garima Bajpai’s keynote & workshop LIVE!

Keynote: Charting the Way Forward for AI-Native Software Organizations

Workshop: Operationalizing AI Workshop – Leadership Sprints

The Expanding Scope of Observability

The traditional boundaries of observability (metrics, logs, and traces) are being redrawn. In the AI era, observability must encompass:

Intent and Outcome Alignment: Did the AI system achieve what was intended, and can we explain how it got there?
Model and Data Drift: Are models behaving consistently as data and environments evolve?
Autonomous Decision Auditing: Can we trace and audit the rationale behind AI agent decisions?
Human-AI Interaction Quality: How effectively are developers and end-users collaborating with AI assistants?

Fig. 1: The expanding scope of observability

In the following sections, we expand on each of these questions and outline the next steps.

Intent and Outcome Alignment

AI alignment refers to ensuring that an AI system’s goals, actions, and behaviors are consistent with human intentions, values, and ethical principles. Achieving intent and outcome alignment means the system not only delivers the desired results but does so for the right reasons, avoiding unintended consequences such as bias or reward hacking. For example, if an AI is designed to assist with customer queries, alignment ensures it provides accurate, helpful responses rather than hallucinating or misleading users. Regular outcome auditing is essential: this involves evaluating real-world results to check for disparities or unintended effects, ensuring the AI’s outputs match the original intent and are explainable.

Observability is foundational for intent and outcome alignment because it makes the AI’s decision-making transparent and traceable, allowing stakeholders to explain, verify, and correct its behavior as needed. Key capabilities include:

Intent tracing and validation: Mechanisms to explicitly track the mapping from user intent to system objectives and emergent behaviors, allowing for validation that intent is preserved through each stage of the AI’s operation.
Robust logging of agent interactions: Especially for agentic AI, detailed logs of external actions, tool invocations, and inter-agent communications are necessary to detect misuse or unintended consequences.
Automated anomaly and misalignment detection: Integration of anomaly detection systems that can flag when observed behaviors deviate from expected, aligned patterns—potentially using machine learning to recognize subtle forms of misalignment.
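
As a rough illustration of logging agent interactions with the user's intent attached, the sketch below writes one JSON event per tool call and flags calls that fall outside what the intent is expected to need. The intent names, tool names, and the `EXPECTED_TOOLS` mapping are hypothetical, and a plain set lookup stands in for a real anomaly detector.

```python
import json
import time
from typing import Any, Dict, List

# Hypothetical mapping from a declared user intent to the tools an agent should need.
EXPECTED_TOOLS = {
    "answer_customer_query": {"search_docs", "draft_reply"},
}


def log_tool_call(sink: List[str], trace_id: str, intent: str, agent: str,
                  tool: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Append one structured event per tool invocation, tagged with the originating intent."""
    event = {
        "timestamp": time.time(),
        "trace_id": trace_id,
        "intent": intent,
        "agent": agent,
        "tool": tool,
        "arguments": arguments,
        # Simple alignment flag: was this tool expected for the stated intent?
        "aligned": tool in EXPECTED_TOOLS.get(intent, set()),
    }
    sink.append(json.dumps(event, sort_keys=True))
    return event


def misaligned_events(sink: List[str]) -> List[Dict[str, Any]]:
    """Return the logged events whose tool did not match the declared intent."""
    return [event for event in map(json.loads, sink) if not event["aligned"]]
```

Here `sink` is just a list of JSON lines; in a real pipeline it would be a log shipper or tracing backend.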

Model and Data Drift

Model and data drift refer to the phenomenon where machine learning models gradually lose predictive accuracy as the data and environments they operate in evolve. This happens because the statistical properties of the input data or the relationships between features and target variables change over time, making the model’s original assumptions less valid. There are two primary types:

Data drift (covariate shift): The distribution of input features changes, but the relationship between inputs and outputs may remain the same.
Concept drift: The relationship between inputs and outputs changes, often due to shifts in the underlying process generating the data.

As data and environments evolve, observability is essential to ensure models behave consistently and maintain their predictive power. Advanced observability features—especially automated, real-time drift detection and diagnostics—are critical for robust, production-grade machine learning systems.

Drift detection: Observability tools can implement statistical tests and divergence measures (e.g., Population Stability Index, KL Divergence, KS Test) to compare incoming data distributions with those seen during training, flagging significant deviations.
Automated drift detection and alerting: Real-time, automated identification of both data and concept drift, with configurable thresholds and notifications.
Granular performance monitoring: Tracking model accuracy, precision, recall, and other metrics across different data segments and time windows to pinpoint where drift is occurring.
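
The Population Stability Index mentioned above can be computed in a few lines. This sketch assumes fixed bin edges chosen from the training data and a small epsilon to avoid taking the logarithm of zero. The 0.25 alert threshold is a commonly cited rule of thumb, not a universal constant.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; values outside the edges fall into the first or last bin."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for value in values:
        idx = 0
        while idx < len(counts) - 1 and value >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [count / len(values) for count in counts]


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               edges: Sequence[float], eps: float = 1e-6) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    psi = 0.0
    for e, a in zip(bin_fractions(expected, edges), bin_fractions(actual, edges)):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        psi += (a - e) * math.log(a / e)
    return psi


PSI_ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb threshold for significant drift
```

Comparing a training sample with a window of live inputs against `PSI_ALERT_THRESHOLD` gives a simple trigger for the automated alerting described above.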

Autonomous Decision Auditing

Tracing and auditing the rationale behind AI agent decisions, especially in autonomous or agentic AI systems, is both possible and increasingly necessary, but it presents significant technical and organizational challenges. With the right combination of observability, explainability, and compliance tools, such auditing is feasible, and getting it right is of the utmost importance.

As AI systems grow in complexity and autonomy, advanced observability features, such as real-time monitoring, detailed logging, and integrated explainable AI (XAI), are essential for ensuring transparency, accountability, and trust. These include:

Decision provenance tracking, recording the sequence of transformations and inferences leading to each decision.
Automated bias and fairness checks at both data and outcome levels, with alerts for detected issues.
Integration of XAI tools for on-demand explanation of individual decisions, especially in high-stakes or regulated environments.
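
Decision provenance tracking can be sketched as an append-only, hash-chained record of each step that led to a decision, so that later tampering becomes detectable. The step names and record layout below are assumptions for illustration, not a standard format.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry in a chain


def _digest(step: str, detail: Dict[str, Any], prev_hash: str) -> str:
    """Hash a step together with the hash of the step before it."""
    body = {"step": step, "detail": detail, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_step(chain: List[Dict[str, Any]], step: str, detail: Dict[str, Any]) -> Dict[str, Any]:
    """Record one transformation or inference, linked to the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"step": step, "detail": detail, "prev_hash": prev_hash,
             "hash": _digest(step, detail, prev_hash)}
    chain.append(entry)
    return entry


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["step"], entry["detail"], entry["prev_hash"]):
            return False
        prev_hash = entry["hash"]
    return True
```

An auditor can replay the chain to see which inputs, retrievals, and model outputs preceded a given decision, and `verify_chain` shows whether the record was altered afterwards.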

Human-AI Interaction Quality

Developers and end-users are collaborating with AI assistants with increasing effectiveness, but the quality of these interactions varies widely depending on the application, the clarity of communication, and the feedback mechanisms in place. Observability in the context of human-AI interaction means having comprehensive visibility into both the AI’s internal decision-making processes and the dynamics of user-AI exchanges.

This enables:

Multimodal Analytics: Ability to combine quantitative metrics (e.g., error rates, session lengths) with qualitative data (e.g., sentiment analysis, user feedback) for a holistic view of interaction quality.
Integration with Human-in-the-Loop and Human-in-the-Lead Systems: Seamless handoff and tracking between AI and human agents, ensuring continuity and accountability in complex workflows.
Automated Feedback Impact Analysis: Tools that automatically correlate user feedback with subsequent changes in AI behavior or performance, quantifying the value of human input.

Effective human-AI collaboration depends on robust observability, which empowers developers and end-users to monitor, understand, and continuously improve interaction quality.

Key Challenges Ahead

Complexity and Scale: AI-powered systems introduce unprecedented complexity. Multi-agent workflows, dynamic model updates, and real-time adaptation all multiply the points of failure and uncertainty. Observability solutions must scale horizontally and adapt to changing system topologies.
Data Privacy and Security: With observability comes the collection of sensitive telemetry—prompt data, user interactions, model outputs. Ensuring privacy, compliance (e.g., GDPR, HIPAA), and secure handling of observability data is paramount.
Semantic Gaps: Traditional observability tools lack the semantic understanding needed for AI systems. For example, tracing a hallucination or bias back to its root cause requires context-aware instrumentation and domain-specific metrics.
Standardization and Interoperability: Fragmentation remains a challenge. While projects like OpenTelemetry’s GenAI SIG are making strides, the ecosystem is still maturing. Vendor lock-in, proprietary data formats, and inconsistent APIs can hinder unified observability across diverse AI stacks.

Best Practices: Building AI-Aware Observability

Design for Explainability: Instrument AI systems with explainability hooks—capture not just what happened, but why. Integrate model interpretability tools (e.g., SHAP, LIME) into observability pipelines to surface feature importances, decision paths, and confidence scores.
Embrace Open Standards: Adopt open-source, community-driven observability frameworks (OpenTelemetry, LangSmith, Langfuse) to ensure interoperability and future proofing. Contribute to evolving standards for LLMs and agentic workflows.
Feedback Loops and Continuous Learning: Observability should not be passive. Establish automated feedback loops—use observability data to retrain models, refine prompts, and adapt agent strategies in near real-time. This enables self-healing and continuous improvement.
Cross-Disciplinary Collaboration: Break down silos between developers, data scientists, MLOps, and security teams. Define shared observability goals and metrics that span the full lifecycle—from data ingestion to model deployment to end-user interaction.
Ethics and Governance: Instrument for ethical guardrails: monitor for bias, fairness, and compliance violations. Enable rapid detection and remediation of unintended consequences.

The Road Ahead: From Observability to Business Enablement

The evolution of observability in the AI era is not just about better dashboards or faster debugging. It’s about empowering organizations to:

Build Trust: Transparent, explainable AI systems foster user and stakeholder confidence.
Accelerate Innovation: Rapid feedback cycles and robust monitoring enable faster iteration and safer experimentation.
Unlock Business Value: Observability becomes a lever for optimizing AI-driven business processes, reducing downtime, and uncovering new opportunities.

Conclusion: Closing the Strategic Gap

AI is rewriting the rules of software engineering. To harness its full potential, organizations must invest in next-generation observability—one that is AI-native, explainable, and deeply integrated across the stack. Leaders who prioritize observability will be best positioned to navigate complexity, drive responsible innovation, and close the strategic gap in the era of CodeBots, AI Assistants, and AI Agents.

References

Evaluating Human-AI Collaboration: A Review and Methodological Framework https://arxiv.org/html/2407.19098v1
https://galileo.ai/blog/human-evaluation-metrics-ai
Auditing of Automated Decision Systems https://ieeeusa.org/assets/public-policy/positions/ai/AIAudits0224.pdf
How Model Observability Provides a 360° View of Models in Production https://www.datarobot.com/blog/how-model-observability-provides-a-360-view-of-models-in-production/
Observability in the Era of CodeBots, AI Assistants, and AI Agents https://devm.io/devops/ai-observability-agents

FREQUENTLY ASKED QUESTIONS

Why is observability now a strategic business enabler in the AI era?

As organizations adopt CodeBots, AI assistants, and agentic AI, systems make opaque, autonomous decisions at scale. Without robust observability, teams risk technical debt, operational inefficiency, ethical lapses, compliance violations, and loss of user trust. The article argues observability must evolve from a technical afterthought to a strategic capability.

What expands the scope of observability beyond metrics, logs, and traces?

The article identifies four new focal areas: intent and outcome alignment, model and data drift, autonomous decision auditing, and human‑AI interaction quality. These dimensions reflect the behaviors of AI systems, not just infrastructure signals.

What is “intent and outcome alignment,” and why does it matter?

Alignment ensures an AI system’s goals, actions, and behaviors reflect human intentions and ethical principles. It means delivering desired results for the right reasons—avoiding bias, hallucinations, or reward hacking—and requires regular outcome auditing to verify that outputs match intent and remain explainable.

Which observability capabilities support intent alignment?

The text calls for intent tracing and validation to map user goals to system objectives and emergent behaviors. It also stresses robust logging of agent interactions (external actions, tool calls, inter‑agent messages) and automated anomaly/misalignment detection that flags deviations from expected patterns.

How do model drift and data drift differ?

Data drift (covariate shift) occurs when input feature distributions change while input‑output relationships may remain stable. Concept drift changes the relationship between inputs and outputs due to shifts in the generating process, eroding model assumptions and performance over time.

What drift monitoring features belong in production‑grade observability?

The article recommends statistical tests such as PSI, KL Divergence, and the KS Test to compare live vs. training distributions. It also calls for real‑time, automated drift detection with thresholds/alerts and granular performance tracking (e.g., accuracy, precision, recall) across segments and time windows.

What does autonomous decision auditing require for agentic AI?

Auditing needs decision‑provenance tracking to record the sequence of transformations and inferences leading to each decision. It should include automated bias/fairness checks with alerts and integrate XAI tools for on‑demand explanations, particularly in regulated or high‑stakes contexts.

How does observability improve human‑AI interaction quality?

By combining quantitative signals (error rates, session length) with qualitative insights (sentiment analysis, user feedback), teams gain a holistic view of interactions. Observability should support human‑in‑the‑loop/“in the lead” handoffs and track how feedback changes system behavior over time.

What key challenges complicate AI‑aware observability?

The article highlights complexity and scale (multi‑agent workflows, real‑time adaptation), privacy/security requirements for sensitive telemetry, and semantic gaps in traditional tools. It also notes fragmentation and limited interoperability despite progress from efforts like OpenTelemetry’s GenAI SIG.

Which best practices does the article recommend to build AI‑aware observability?

Instrument for explainability (e.g., SHAP, LIME), adopt open standards (OpenTelemetry, LangSmith, Langfuse), and close the loop by using observability data to retrain models and refine prompts. Cross‑disciplinary collaboration and ethics/governance monitoring (bias, fairness, compliance) are emphasized as ongoing practices.

The post The Expanding Scope of Observability for AI Systems appeared first on ML Conference.

Keeping an Eye on AI

abairakdar — Fri, 01 Apr 2022 11:45:52 +0000

We’ve all been there: we’ve spent weeks or even months working on our ML model. We collected and processed data, tested different model architectures, and spent a lot of time fine-tuning our model’s hyperparameters. Our model is ready! Maybe we need a few more tweaks to improve performance more, but it’s ready for the real world. Finally, we put our model into production, and sit back and relax. Three weeks later, we get an angry call from our customer because our model makes predictions that don’t have anything to do with reality. A look at the log reveals no errors. In fact, everything still looks good.

However, since we haven’t established continuous monitoring for our model, we don’t know if and when our model’s predictions change. We have to hope that they’ll always be just as good as on day one. But if our infrastructure is made out of only duct tape and hope, then we’ll find lots of errors only in production.

What is MLOps, anyway?

We want to create infrastructure and processes that combine the development of machine learning systems with the system’s operation. That’s the goal that MLOps is pursuing. This is closely interwoven with DevOps’ goals and definition. But the goal here isn’t just developing and operating software systems, but developing and operating machine learning systems.

As a data scientist, I have to ask myself why I should worry about the operational part of ML systems at all. Technically, my goal is just to train the best possible ML model. While this is a worthy goal, I have to keep in mind the system’s overall context and the business of the company I’m operating in. About 90% of all ML models are never deployed in a production environment [1], [2]. This means they never reach a user or customer. Provocatively speaking: 90% of machine learning projects are useless for our business. Of course, in the end, a model is only useful if it generates added value for my users or processes.

Besides, when developing a machine learning system, I always keep the following quote from Andrew Ng in mind: “You deployed your model in production? Congratulations. You’re halfway done with your project.” [3]

Productive challenges

Our machine learning project isn’t over when we bring our first model into a production environment. After all, even a model that delivered excellent results in training and testing faces many challenges once it arrives in production. Perhaps the best-known challenge is a change in the distribution of the data that our model receives as input. A whole range of events can trigger this, but they’re often grouped under the term “concept drift”. The data set used to train our model is the only part of reality our model can perceive. This is one of the reasons why collecting as much data as we can is so critical for a well-functioning, robust model. When more data is available to the model, it can represent a bigger part of the real world with higher fidelity. But the training data is always just a static snapshot of reality, while the world around it is constantly changing. If our model isn’t trained with new data, it has no way to update its outdated basic assumptions about reality. This leads to a decline in the model’s performance.

How does concept drift happen? Data can change gradually. For example, sensors get less accurate over a long time due to wear and tear and show increasing deviations from the actual value. Another example of slow changes is customer behavior and customer preferences. Imagine a model that makes recommendations for products in a fashion shop. If it isn’t updated and keeps recommending winter clothes to customers in the summer, then customer satisfaction in our shop system will significantly decrease. Recurring events such as seasons or holidays can also have an impact if we want to use our model to predict sales figures. But concept drift can also happen abruptly: If COVID-19 brings global air traffic to a screeching halt, then our carefully trained models for predicting daily passenger traffic will produce poor results. Or if the sales department launches an Instagram promotion without prior notice and doubles the sales of our vitamin supplement, that’s a great result, but it’s not something our model is good at predicting.

Another challenge is both technical and organizational. In many companies and projects, there is one team developing a machine learning model (data scientists) and another team bringing the models into production and supporting them (software engineers/DevOps engineers). The data science team spends a lot of time conceptualizing and selecting model architectures, feature engineering, and training the model. When the fully trained model is handed over to the software development team, it’s often implemented again. This implementation might differ from the actual, intended implementation. Even with small differences, this can lead to significant and unexpected effects. The separation of data science teams and software engineering teams can also lead to entirely different problems. Even if the data science team takes the model all the way to production, it’s often still used by different applications developed by other teams. These applications send input data to the model and receive predictions from it. If the data changes in the applications due to changes or errors, then the input data will fall short of the model’s expectations.

But challenges or threats to the model can even arise outside the organization itself. While actual security problems related to ML models have been rare up until now, it’s still possible for third parties to actively try and find vulnerabilities in the model as part of adversarial attacks. This danger shouldn’t be ignored, especially in fraud detection models.

Monitoring: DevOps vs. MLOps

To address any challenges in a production model, we must first be aware of them. For this, we need monitoring. There are two approaches for monitoring traditional production software. First, we monitor business metrics (KPIs, or key performance indicators). These can be metrics such as customer satisfaction, revenue, or how long customers stay on our site. Second, we monitor service and infrastructure metrics to gain comprehensive insights into the state of our system. We also collect these metrics when we monitor a machine learning system. But this still isn’t sufficient. Unlike conventional software systems, ML systems behave partly non-deterministically and their performance depends significantly on the distribution of input data and the time of the most recent training. We need to collect further data. Because model hyperparameters, input feature selection, and architecture are not determined during deployment, but already in the training pipeline, the smallest error can lead to radically different system behavior that traditional software testing wouldn’t catch. This is especially true in systems where models are constantly iterated and improved upon.

Monitoring for ML systems

First, we need to gather the traditional metrics that we’d monitor for any other software system. There are well-established best practices here. Google’s site reliability engineering manual [4] can serve as a reference. I’d like to mention the following as examples of DevOps metrics:

Latency: How long does our model take to predict a new value? How long does our preprocessing pipeline take?
Resource consumption: What are the CPU/GPU and RAM utilization of our model server? Do we need more servers?
Events: When does the model server receive an HTTP request? When is a specific function called? When is an exception thrown?
Calls: How often is a model requested?

For these metrics, it is worth looking at both the individual values and the trend of the values over time. If we see changes in either, that is a reason to take a closer look at our model and the whole system. Let’s imagine that we’re looking at the number of predictions per hour. If these deviate considerably in our daily or weekly comparison (by 40%, for instance), then it would certainly make sense to send an e-mail to the team members involved. In addition to these classic metrics, we also need to keep an eye on metrics specific to machine learning. In any case, we need to collect and store the predictions of the model. Both the individual predicted values and the value distribution are of interest. If our system predicts numerical values, we have to check their numerical stability. If it predicts “NaN” (not a number) or “infinity”, then this might warrant alerting our data scientists. Does your model that should predict tomorrow’s user number return “-15” or “cat”? Then something is very wrong.
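
Both checks from the paragraph above fit in a few lines of Python. This is only a sketch: the function names are made up for this example, the 40% tolerance is the value mentioned above, and the valid range (no negative user numbers) is an assumption about the use case.

```python
import math

def volume_deviates(current_count, reference_count, tolerance=0.40):
    """True if the request count differs from the reference by more than `tolerance`."""
    if reference_count == 0:
        return current_count > 0
    return abs(current_count - reference_count) / reference_count > tolerance

def invalid_predictions(predictions, lower=0.0, upper=float("inf")):
    """Return predictions that are non-numeric, NaN, infinite, or out of range."""
    bad = []
    for p in predictions:
        is_number = isinstance(p, (int, float)) and not isinstance(p, bool)
        if not is_number or math.isnan(p) or math.isinf(p) or not lower <= p <= upper:
            bad.append(p)
    return bad

# Hypothetical numbers: this hour vs. the same hour last week.
if volume_deviates(current_count=55, reference_count=100):
    print("Prediction volume changed by more than 40%, notify the team")

# A user-count model should never return negative values, text, NaN or infinity.
print(invalid_predictions([120, -15, "cat", float("nan"), float("inf")]))
```

In a real setup, the alert would go to e-mail or a chat channel instead of standard output, and the reference count would come from the metrics store.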

But the distribution of predictions is also of interest. Does the distribution match our expected values? We should also check what the minima and maxima of our predicted values are. Are they within an expected, reasonable range? What do the median, mean, and standard deviations look like over an hour or a day? If we find a large discrepancy between the predicted and observed classes, then we have a prediction bias. Then maybe the label distribution of our training data—the predicted values—is different from the distribution in the real world. This is a clear sign that we need to re-train our model.
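
As a rough illustration, the following sketch computes the key figures for one time window of predictions and a simple indicator for prediction bias. The function names are hypothetical, and using the largest gap between predicted and observed class shares is just one possible way to quantify the discrepancy.

```python
import statistics
from collections import Counter

def summarize(predictions):
    """Key figures for one time window (e.g. an hour or a day) of numeric predictions."""
    return {
        "min": min(predictions),
        "max": max(predictions),
        "mean": statistics.fmean(predictions),
        "median": statistics.median(predictions),
        "stdev": statistics.pstdev(predictions),
    }

def class_share_gap(predicted, observed):
    """Largest absolute difference between predicted and observed class shares."""
    p, o = Counter(predicted), Counter(observed)
    classes = set(p) | set(o)
    return max(abs(p[c] / len(predicted) - o[c] / len(observed)) for c in classes)

# Hypothetical window of numeric predictions.
print(summarize([1, 2, 3, 4]))

# Hypothetical classifier output vs. the labels that were observed later.
gap = class_share_gap(["a", "a", "a", "b"], ["a", "b", "b", "b"])
print(f"largest class share gap: {gap:.2f}")
```

Which gap counts as “large” depends on the use case, so the threshold for triggering a re-training is deliberately left open here.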

When we monitor our model’s outputs, it’s only logical that we monitor the inputs too. What data does our model receive? We can determine the distributions and statistical key figures. Naturally, this is easier for structured data. But we can also determine various characteristic values for texts or images. For example, we can look at the length of texts or even word distributions within them. For images, we can capture their size or maybe even their brightness values. This can also help us to find errors in our pipeline and data pre-processing. Monitoring input makes it possible to infer problems in data sources. Some columns in the database may no longer be populated with data due to a bug. The data’s definition or format in some columns might change, or the columns could be renamed. If we don’t notice these changes, then our models will still assume the original definitions and formats. To avoid this, we need to monitor the distribution of values of each feature extracted from a database table for a significant shift. You can detect a shift in the distribution of input data and predictions by applying statistical tests like the chi-square test or the Kolmogorov-Smirnov test. If the tests detect a significant deviation in the distribution of the data, this may be an indicator of a change in the data structures. This requires a re-training of the model.
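
To show what such a test looks like, here is a small, dependency-free sketch of the two-sample Kolmogorov-Smirnov test for one numeric feature. It uses the large-sample approximation for the critical value, where the constant 1.358 corresponds to a significance level of 5%. The function names are made up for this example. In a real project you would probably use an established statistics library instead.

```python
import bisect
import math

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_drift(sample_a, sample_b, c_alpha=1.358):
    """Large-sample test; c_alpha=1.358 corresponds to a 5% significance level."""
    n, m = len(sample_a), len(sample_b)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    d = ks_statistic(sample_a, sample_b)
    return d > critical, d, critical

# Hypothetical feature values: training data vs. shifted production data.
training = list(range(200))
production = [x + 100 for x in training]
drifted, d, critical = ks_drift(training, production)
print(f"drift={drifted} D={d:.3f} critical={critical:.3f}")
```

The test only works for numeric features. For categorical features, the chi-square test mentioned above is the more natural choice.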

There are a few things to keep in mind when monitoring input data. If you’re working with particularly large data, it isn’t feasible to record it completely. In this case, it might be more practical to process the input independently of the model first using deterministic code, and then log a preprocessed and compressed version of the input data. Of course, you must take special care to create and test the preprocessing. We also need to use identical preprocessing during training and in the production environment. Otherwise, there could be major discrepancies between the training data and real data. Caution is also necessary when processing particularly sensitive business data or personal data. Naturally, we want to collect as little sensitive data as we can. But we also need to collect data to improve and debug the models. This opens up a large range of interesting problems. There are some exciting approaches for solving these challenges, such as Differential Privacy [5] for Machine Learning.
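
A deterministic reduction of this kind could look like the sketch below for text inputs. The selection of fields is an assumption and depends entirely on the use case. Note that a hash only helps to recognize identical inputs. It is not an anonymization technique and does not replace approaches like differential privacy.

```python
import hashlib
import json

def compact_text_record(request_id, text):
    """Deterministically reduce a raw text input to a small, loggable summary."""
    words = text.lower().split()
    return {
        "request_id": request_id,
        "n_chars": len(text),
        "n_words": len(words),
        "n_unique_words": len(set(words)),
        # A digest lets us spot duplicate inputs without storing the raw text.
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }

# Hypothetical request; the JSON line is what we would send to the log store.
record = compact_text_record("req-1", "The cat sat on the mat")
print(json.dumps(record, sort_keys=True))
```

Because the function has no hidden state, the same code can run in the training pipeline and in production, which keeps both sides comparable.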

Finally, I’d like to mention one area that’s very easy to monitor, but still gets neglected often: tracking model versions. In our machine learning system, we must be able to always know which model is currently active in the production environment. Not only do we need to version our models, but in the best case, we also should be able to trace the complete experimental history that led to our model’s creation. We can use a tool like MLflow [6] for this. Even though there are some approaches that help debug machine learning models, right now deep learning models should be considered black boxes when it comes to the explainability and traceability of predictions. To shed some light on this, we need to be able to understand which features and data were used to train the model. Was there a bias introduced during training? Did someone mislabel data? Is there a bug in our data cleaning pipeline? These are all things that, in the worst case, we only discover in production. To have any chance of debugging in these cases, we need to version not just the model itself, but everything that led to its creation, monitor the currently active versions and make them transparent to all stakeholders. We use tools like DVC [7] or Feast [8] to track our data or features. But it’s also important that we version and test code in the data pipeline that we use to collect and process data.
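
The idea of versioning everything that led to a model can be illustrated without any special tooling. The following sketch is not the API of MLflow, DVC, or Feast, but a hypothetical stand-in based on hashes: it derives one version ID from the model artifact, the training data, the pipeline code, and the hyperparameters.

```python
import hashlib
import json

def fingerprint(data: bytes) -> str:
    """Short content hash used as an identifier for an artifact."""
    return hashlib.sha256(data).hexdigest()[:12]

def build_manifest(model_bytes, training_data_bytes, pipeline_code_bytes, params):
    """Describe a deployed model by everything that led to its creation."""
    manifest = {
        "model": fingerprint(model_bytes),
        "training_data": fingerprint(training_data_bytes),
        "pipeline_code": fingerprint(pipeline_code_bytes),
        "params": params,
    }
    # The version ID changes as soon as any of the ingredients changes.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["version_id"] = fingerprint(canonical)
    return manifest

# Hypothetical artifacts; in practice these bytes would be read from files.
active = build_manifest(b"model-weights", b"training-rows", b"def clean(): ...", {"lr": 0.1})
print(json.dumps(active, indent=2))
```

Exposing the active manifest, for example through a status endpoint of the model server, makes it transparent to all stakeholders which model is currently in production.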

The toolbox

There are many specialized tools in the field of MLOps like Sacred [9] for managing experiments or Kubeflow [10] for training models and managing data pipelines. But for monitoring machine learning systems, we can use the DevOps toolbox. We’ll use Elasticsearch [11], a classic tool from the Elastic Stack, to capture, collect, and evaluate logs and input data. Elasticsearch is a distributed search and analytics engine that stores JSON documents and is commonly used to store logs from applications or containers. These logs can be visualized with Kibana [12] and used for error diagnosis.

For MLOps, we use Elasticsearch not just for logs, but to also store our model’s pre-processed inputs for further processing. We can also store transaction data and contexts for our users’ actions in Elasticsearch. In addition to basic data like timestamps and accessed pages, this can also be explicit actions from users. Both the interactions of our users performed before the model delivered them a prediction and their reactions to the predictions are of interest. We can use the time series store Prometheus [13] together with the visualization tool Grafana [14] to capture operational metrics and model predictions.

Fig. 1: Example architecture for monitoring ML systems

What can an architecture for monitoring ML systems look like? (Fig. 1) Our model is made available via a model server. This can be either a custom software solution or a standard tool like Seldon Core [15]. Input data is delivered to the model server as usual. But at the same time, it is also logged in Elasticsearch and visualized with Grafana or Kibana if required. Application logs and error messages from our model server are also stored in Elasticsearch. Now our model performs inferences and calculates predictions. These are returned to the requesting applications or users. Predictions are also stored in Prometheus. Standard metrics such as utilization, inference duration, and confidence are stored there too. These can also be visualized with Grafana or Kibana.

We’ll provide our own microservice for statistical analysis to collect statistical data such as distributions and standard deviations. It pulls data from Elasticsearch and Prometheus, performs analysis, and returns the data to Prometheus. If there are major deviations, then our microservice can send notifications to our team. In parallel, we can provide a drift detector as a second microservice, which handles detecting data drift and concept drift. For this, it obtains the model’s input and output data from Elasticsearch and Prometheus. To compare the production data to a baseline, we need to make the model’s training data available to the microservice. Ideally, we’ll provide this via a feature store like Feast. Metrics on drift are also stored in Prometheus. In this example, the drift detector and the service for statistically evaluating metrics are custom developments, since the tool selection in this category is currently very sparse.

An evolutionary approach

Of course, this is just an example architecture, because building machine learning monitoring infrastructures and MLOps pipelines in general always needs to be an evolutionary process. Deployment cannot happen with a one-off, big-bang approach. Each company has its own unique set of challenges and needs when developing machine learning strategies. Furthermore, there are different users in machine learning systems, different organizational structures, and existing application landscapes.

One possible approach here could be to implement an initial prototype without ML components. After that, you should start building the infrastructure and the simplest models, which you continuously monitor. The big advantage of this strategy is that you can start early with the collection of input data and especially with the collection of ground truth labels, i.e., the real, correct results. It is fairly common for ground truth labels to become available a long time after the prediction was made. For example, if we want to predict a company’s quarterly results, then they will only be available after three months. Therefore, you should start collecting data and labels early.
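
Because labels arrive late, predictions have to be stored with an ID so that they can be matched with the ground truth later. A minimal in-memory sketch of this idea follows. The class and method names are invented for this example, and a real system would persist both sides in a database or feature store.

```python
class PredictionLog:
    """Keeps predictions until their ground truth labels arrive."""

    def __init__(self):
        self._predictions = {}
        self._labels = {}

    def record_prediction(self, prediction_id, value):
        self._predictions[prediction_id] = value

    def record_label(self, prediction_id, value):
        self._labels[prediction_id] = value

    def accuracy(self):
        """Accuracy over predictions whose label has arrived; None if there are none."""
        matched = [pid for pid in self._predictions if pid in self._labels]
        if not matched:
            return None
        hits = sum(self._predictions[pid] == self._labels[pid] for pid in matched)
        return hits / len(matched)

    def pending(self):
        """IDs of predictions that are still waiting for their label."""
        return sorted(pid for pid in self._predictions if pid not in self._labels)

# Hypothetical quarterly forecasts: labels for Q1 and Q2 arrive months later.
log = PredictionLog()
log.record_prediction("q1", "up")
log.record_prediction("q2", "down")
log.record_prediction("q3", "up")
log.record_label("q1", "up")
log.record_label("q2", "up")
print(log.accuracy(), log.pending())
```

The earlier this kind of bookkeeping starts, the sooner there is enough labeled production data to judge whether a new model is actually better than the current one.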

With the help of your infrastructure, collected data, and simple models, you can begin moving towards more complicated models in small, incremental steps. After that, you can lift these into a production environment until you have achieved your desired level of predictive accuracy and performance. By doing this, your monitoring infrastructure gives you a constant overview of how your model is performing, and when it’s time to roll out a new model. And you can also determine when you have to roll back your model to its previous state after finding a bug in your feature pipeline that slipped into production.

Links & Literature

[1] https://www.redapt.com/blog/why-90-of-machine-learning-models-never-make-it-to-production

[2] https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/

[3] A Chat with Andrew on MLOps: https://www.youtube.com/watch?v=06-AZXmwHjo

[4] https://landing.google.com/sre/book.html

[5] Deep Learning with Differential Privacy: https://arxiv.org/abs/1607.00133

[6] https://mlflow.org

[7] https://dvc.org

[8] https://feast.dev

[9] https://github.com/IDSIA/sacred

[10] https://www.kubeflow.org/

[11] https://www.elastic.co/de/elasticsearch/

[12] https://www.elastic.co/de/kibana/

[13] https://prometheus.io

[14] https://grafana.com

[15] https://www.seldon.io/tech/products/core/

The post Keeping an Eye on AI appeared first on ML Conference.

Tools & Processes for MLOps

kraheel — Wed, 26 May 2021 10:47:53 +0000

When we want to use our familiar tools and workflows from software development for data science and machine learning projects, we quickly run into problems. Data science and machine learning model building follow a different process than the classic software development process, which is fairly linear.

When I create a branch in software development, I have a clear goal in mind of what the outcome of that branch will be: I want to fix a bug, develop a user story, or revise a component. I start working on this defined task. Then, once I upload my code to the version control system, automated tests run – and one or more team members perform a code review. Then I usually do another round to incorporate the review comments. When all issues are fixed, my branch is integrated into the main branch and the CI/CD pipeline starts running; a normal development process. In summary, the majority of the branches I create are eventually integrated in and deployed to a production environment.

In the area of machine learning and data science, things are different. Instead of a linear and almost “mechanical” development process, the process here is very much driven by experiments. Experiments can fail; that is the nature of an experiment. I also often start an experiment precisely with the goal of disproving a thesis. Now, any training of a machine learning model is an experiment and an attempt to achieve certain results with a specific model and algorithm configuration and data set. If we imagine that for a better overview we manage each of these experiments in a separate branch, we will get very many branches very quickly. Since the majority of my experiments will not produce the desired result, I will discard many branches. Only a few of my experiments will ever make it into a production environment. But still, I want to have an overview of what experiments I have already done and what the results were so that I can reproduce and reuse them in the future.

But that’s not the only difference between traditional software development and machine learning model development. Another difference is behavior over time.

ML models deteriorate over time

Classic software works just as well after a month as it did on day one. Of course, there may be changes in memory and computational capacity requirements, and of course bugs will occur, but the basic behavioral characteristics of the production software do not change. With machine learning models, it’s different. For these, the quality decreases over time. A model that operates in a production environment and is not re-trained will degrade over time and never achieve as good a predictive accuracy as it did on day one.

Concept drift is to blame [1]. The world outside our machine learning system changes and so does the data that our model receives as input values. Different types of concept drift occur: data can change gradually, for example, when a sensor becomes less accurate over a long period of time due to wear and tear and shows an ever-increasing deviation from the actual measured value. Cyclical events such as seasons or holidays can also have an effect if we want to predict sales figures with our model.

But concept drift can also occur very abruptly: If global air traffic is brought to a standstill by COVID-19, then our carefully trained model for predicting daily passenger traffic will deliver poor results. Or if the sales department launches an Instagram promotion without notice that leads to a doubling of buyers of our vitamin supplement, that’s a good result, but not something our model is good at predicting.

There are two ways to counteract this deterioration in prediction quality: either we enable our model to actively retrain itself in the production environment, or we update our model frequently, ideally as often as we possibly can. We may also have made a necessary adjustment to an algorithm or introduced a new model that needs to be rolled out as quickly as possible.

So in our machine learning workflow, our goal is not just to deliver models to the user. Instead, our goal must be to build infrastructure that quickly informs our team when a model is providing incorrect predictions and enables the team to lift a new, better model into production environments as quickly as possible.

MLOps as DevOps for Machine Learning

We have seen that data science and machine learning model building require a different process than traditional, “linear” software development. It is also necessary that we achieve a high iteration speed in the development of machine learning models, in order to counteract concept drift. For this reason, it is necessary that we create a machine learning workflow and a machine learning platform to help us with these two requirements. This is a set of tools and processes that are to our machine learning workflow what DevOps is to software development: A process that enables rapid but controlled iteration in development supported by continuous integration, continuous delivery, and continuous deployment. This allows us to quickly and continuously bring high-quality machine learning systems into production, monitor their performance, and respond to changes. We call this process MLOps [2] or CD4ML (Continuous Delivery for Machine Learning) [3].

MLOps also provides us with other benefits: Through reproducible pipelines and versioned data, we create consistency and repeatability in the training process as well as in production environments. These are necessary prerequisites to implement business-critical ML use cases and to establish trust in the new technology among all stakeholders.

In the enterprise environment, we have a whole set of requirements that need to be implemented and adhered to in addition to the actual use case. There are privacy, data security, reproducibility, explainability, non-discrimination, and various compliance policies that may differ from company to company. If we leave these additional challenges for each team member to solve individually, we will create redundant, inconsistent and simply unnecessary processes. A unified machine learning workflow can provide a structure that addresses all of these issues, making each team member’s job easier.

Due to the experimental and iterative nature of machine learning, each step in the process that can be automated has a significant positive impact on the overall run time of the process from data to productive model. A machine learning platform allows data scientists and software engineers to focus on the critical aspects of the workflow and delegate the routine tasks to the automated workflows. But what sub-steps and tools can a platform for MLOps be built from?

Components of an MLOps pipeline

An MLOps workflow can be roughly divided into three areas:

Data pipeline and feature management
Experiment management and model development
Deployment and monitoring

In the following, I describe the individual areas and present a selection of tools that are suitable for implementing the workflow. Of course, this selection is neither exhaustive nor representative: the entire landscape is developing rapidly, so any overview can only be a snapshot.

Fig. 1: MLOps workflow

Data pipeline and feature management

As hackneyed as slogans like “data is the new oil” may seem, they have a kernel of truth: The first step in any machine learning and data science workflow is to collect and prepare data.

Centralized access to raw data

Companies with modern data warehouses or data lakes have a distinct advantage when developing machine learning products. Without a centralized point to collect and store raw data, finding appropriate data sources and ensuring access to that data is one of the most difficult steps in the lifecycle of a machine learning project in larger organizations.

Centralized access can be implemented here in the form of a Hadoop-based platform. However, for smaller data volumes, a relational database such as Postgres [4] or MySQL [5], or a document database based on the ELK stack [6], is also perfectly adequate. The major cloud providers also offer their own products for centralized raw data management: Amazon Redshift [7], Google BigQuery [8], or Microsoft Azure Cosmos DB [9].

In any case, it is necessary that we first archive a canonical form of our original data before applying any transformation to it. This gives us an unmodified dataset of original data that we can use as a starting point for processing.

Even at this point in the workflow, it is important to rely on good documentation and to document the sources of the data, its meaning, and where it is stored. Even though this step seems simple, it is still of utmost importance. Invalid data, the wrong naming of a column of data, or a misconfigured scraping job can lead to a lot of frustration and wasted time.

Data Transformation

Rarely will we train our machine learning model directly on raw data. Instead, we generate features from the raw data. In the context of machine learning, a feature is one or more processed data attributes that an algorithm uses to make predictions. This could be a temperature value, for example, but in the case of deep learning applications also highly abstract features in images. To extract features from raw data, we will apply various transformations. We will typically define these transformations in code, although some ETL tools also allow us to define them using a graphical interface. Our transformations will either be run as batch jobs on larger sets of data, or we will define them as long-running streaming applications that continuously transform data.

We also need to split our dataset into training and testing datasets. To train a machine learning model, we need a set of training data. To test how our model performs with previously unseen data, we need another, structurally identical set of test data. Therefore, we split our transformed data set into two data sets. These must not overlap, meaning that no data point may appear in both. A common split is to use 70 percent of the dataset for training and 30 percent for testing the model.

The exact split of the data sets depends on the context. For time-series data, sequential slices from the series should be chosen, while for image processing, random images from the data set should be chosen since they have no sequential relation to each other.
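The two splitting strategies can be sketched in a few lines of Python. This is an illustration using only the standard library; the function names, the fixed seed, and the 70 percent default are assumptions chosen for the example.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows reproducibly, then cut at the fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sequential_split(rows, train_fraction=0.7):
    """Keep the order: the earliest part trains, the latest part tests."""
    rows = list(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```

Because both functions cut one list at a single index, no row can end up in both sets. For time-series data, `sequential_split` avoids testing on observations that are older than the training data.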

For non-sequential data, the individual data points can also be placed in a (pseudo-)random order. We also want to perform this process in a reproducible and automated manner rather than manually. A convenient tool for managing and coordinating these steps is Apache Airflow [10]. Following the “pipeline as code” principle, it lets us define pipelines as data flow graphs, connect a wide variety of systems, and thus perform the desired transformations.
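To illustrate the “pipeline as code” idea without depending on a running scheduler, here is a small sketch in plain Python. It deliberately does not use Airflow's API. It only shows the underlying principle of declaring tasks and their dependencies as a graph and executing them in a valid order. The task names and the shared context dictionary are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical tasks; each one reads from and writes to a shared context dict.
def extract(ctx):
    ctx["raw"] = [3, 1, 2]


def transform(ctx):
    ctx["features"] = [x * 10 for x in ctx["raw"]]


def split(ctx):
    ctx["train"], ctx["test"] = ctx["features"][:2], ctx["features"][2:]


TASKS = {"extract": extract, "transform": transform, "split": split}
# Each task maps to the set of tasks it depends on.
DEPENDENCIES = {"extract": set(), "transform": {"extract"}, "split": {"transform"}}


def run_pipeline(tasks, dependencies):
    """Run every task once, in an order that respects the dependencies."""
    ctx, order = {}, []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name](ctx)
        order.append(name)
    return ctx, order
```

A workflow tool adds scheduling, retries, and monitoring on top of this basic idea, but the pipeline definition itself remains versionable code.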

Feature repositories

Many machine learning models and systems within a company use the same or at least similar features. Once we have extracted a feature from our raw data, there is a high probability that this feature can be useful for other applications as well. Therefore, it can be useful not to have to implement feature extraction again for each application. For this, we can store known features in a feature store. This can be done either in a dedicated component (such as Feast [11]) or in well-documented database tables populated by appropriate transformations. These transformations can be automated using Apache Airflow.

Data versioning

In addition to code versioning, data versioning is useful in a machine learning context. This allows us to increase the reproducibility of our experiments and to validate our models and their predictions by retracing the exact state of a training dataset that was used at a given time. Tools such as DVC [12] or Pachyderm [13] can be used for this purpose.
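The core idea behind data versioning can be sketched with a content hash: if we record a fingerprint of the training data alongside each experiment, we can later verify that we are looking at exactly the same dataset. The following snippet is a simplified illustration using only the standard library and is not how DVC or Pachyderm work internally. The function name is an assumption.

```python
import hashlib


def dataset_fingerprint(lines):
    """Return a SHA-256 hex digest over an iterable of text lines."""
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab"] and ["a", "b"] differ
    return digest.hexdigest()
```

Any change to a single value in the data yields a different fingerprint, which makes silent modifications of a training dataset detectable.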

Experiment management and model development

In order to deploy an optimal model into production, we need to create a process that enables the development of that optimal model. To do this, we need to capture and visualize the information needed to decide which model is optimal, since in most cases this decision is made by a human rather than automated.

Since the data science process is very experiment-driven, multiple experiments are run in parallel, often by different people at the same time. And most will not be deployed in a production environment. The experimental approach in this phase of research is very different from the “traditional” software development process, as we can expect that the code for these experiments will be discarded in the majority of cases, and only some experiments will reach a production status.

Experiment management and visualizations

Running hundreds or even thousands of iterations on the way to an optimally trained ML model is not uncommon. In the process, the parameters that define each experiment and the results of that experiment accumulate quickly. Often, this metadata is stored in Excel spreadsheets or, in the worst case, in the heads of team members. However, to establish reproducibility, avoid time-consuming duplicate experiments, and enable effective collaboration, this data should be captured automatically. Possible tools here are MLflow Tracking [14] or Sacred [15]. To visualize the output metrics, either classical dashboards like Grafana [16] or specialized tools like TensorBoard [17] can be used. TensorBoard can also be used for this purpose independently of TensorFlow; for example, PyTorch provides a compatible logging library [18]. However, there is still much room for optimization and experimentation here. For example, combining other tools from the DevOps environment such as Jenkins [19] and Terraform [20] would also be conceivable.
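What automatic capture means at its simplest can be shown with a few lines of Python that write one JSON record per run. This is a hand-rolled sketch and not the API of MLflow or Sacred. The function names and the record layout are assumptions made for the example.

```python
import json
import time


def log_run(stream, params, metrics):
    """Append one experiment run as a JSON line to a writable text stream."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(lines, metric, higher_is_better=True):
    """Pick the logged run with the best value for the given metric."""
    runs = [json.loads(line) for line in lines if line.strip()]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda run: run["metrics"][metric])
```

Passing an open file as `stream` gives an append-only log that can be shared and queried by the whole team. Dedicated tracking tools add a UI, artifact storage, and comparison views on top of the same principle.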

Version control for models

In addition to the results of our experiments, the trained models themselves can also be captured and versioned. This allows us to more easily roll back to a previous model in a production environment. Models can be versioned in several ways: In the simplest variant, we export our trained model in serialized form, for example as a Python .pkl file. We then record this in a suitable version control system (Git, DVC), depending on its size.
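The simplest variant can be sketched as follows: we serialize the model with `pickle` and derive a short version identifier from the serialized bytes, so that the file name or commit message can carry it. The helper names and the 12-character identifier are assumptions for this illustration. The "model" in the usage below is just a dictionary standing in for a trained estimator.

```python
import hashlib
import pickle


def export_model(model):
    """Serialize a model and derive a version id from the bytes."""
    payload = pickle.dumps(model)
    version = hashlib.sha256(payload).hexdigest()[:12]
    return version, payload


def load_model(payload):
    # Only unpickle payloads from a trusted source: pickle can execute code.
    return pickle.loads(payload)
```

For example, `version, payload = export_model({"weights": [0.5, -1.25], "bias": 0.1})` yields bytes that can be written to a `.pkl` file and tracked in the version control system of our choice.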

Another option is to provide a central model registry. For example, the MLflow Model Registry [21] or the model registry of one of the major cloud providers can be used here. Also, the model can be packaged in a Docker container and managed in a private Docker Registry [22].

Infrastructure for distributed training

Smaller ML models can usually still be trained on one’s own laptop with reasonable effort. However, as soon as the models become larger and more complex, this is no longer possible and a central on-premise or cloud server becomes necessary for training. For automated training in such an environment, it makes sense to build a model training pipeline.

This pipeline is executed with training data at specific times or on demand. It receives the configuration data that defines a training cycle, for example the model type, the hyperparameters, and the features used. The pipeline can obtain the training data set automatically from the feature store and distribute it to all model variants to be trained in parallel. Once training is complete, the model files, original configuration, learned parameters, metadata, and timings are captured in the experiment and model tracking tools. One possible tool for building such a pipeline is Kubeflow [23]. Kubeflow provides a number of useful features for automated training and for (cost-)efficient resource management based on Kubernetes.
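The configuration data for a training cycle could look like the output of the following sketch, which expands model types and hyperparameter values into one configuration per model variant. The function name and the dictionary layout are assumptions for illustration. A real pipeline would pass each configuration to a separate training job.

```python
from itertools import product


def expand_grid(model_types, hyperparameters, features):
    """Build one training configuration per combination of settings."""
    names = sorted(hyperparameters)  # stable order for reproducible output
    configs = []
    for model_type in model_types:
        for values in product(*(hyperparameters[name] for name in names)):
            configs.append({
                "model_type": model_type,
                "hyperparameters": dict(zip(names, values)),
                "features": list(features),
            })
    return configs
```

Two model types combined with two learning rates and one tree depth, for instance, produce four configurations that can be trained in parallel.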

Deployment and monitoring

Unless our machine learning project is purely a proof of concept or an academic project, we will eventually need to lift our model into a production environment. And that’s not all: once it gets there, we’ll need to monitor it and deploy a new version as needed. In many cases, we will also have not just one model but a whole set of models in our production environments.

Deploy models

On a technical level, any model training pipeline must produce an artifact that can be deployed into a production environment. The prediction results may be bad, but the model itself must be in a packaged state that allows it to be deployed directly into a production environment. This is a familiar idea from software development: continuous delivery. This packaging can be done in two different ways.

Either our model is deployed as a separate service and accessed by the rest of our systems via an interface. Here, the deployment can be done, for example, with TensorFlow Serving [24] or in a Docker container with a matching (Python) web server.

An alternative way of deployment is to embed it in our existing system. Here, the model files are loaded and input and output data are routed within the existing system. The problem here is that the model must be in a compatible format or a conversion must be performed before deployment. If such a conversion is performed, automated tests are essential. Here it must be ensured that both the original and the converted model deliver identical prediction results.
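Such an automated test can be sketched as a function that runs both models on the same inputs and reports every disagreement. Note one assumption that goes beyond the text: since floating-point results can differ minimally between runtimes, the sketch compares with a small tolerance instead of demanding bit-identical values. The function name and the tolerance defaults are illustrative.

```python
import math


def find_mismatches(original, converted, inputs, rel_tol=1e-6, abs_tol=1e-9):
    """Return the inputs on which the two models disagree beyond the tolerance."""
    mismatches = []
    for x in inputs:
        a, b = original(x), converted(x)
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((x, a, b))
    return mismatches
```

In a deployment pipeline, a non-empty result would fail the build, so that a faulty conversion never reaches production.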

Monitoring

A data science project does not end with the deployment of a model into production. Even a production model has to face many challenges. The distribution of our input values may be different in the real world than the one represented in the training data. Value distributions can also change slowly over time or abruptly due to singular events. This then requires retraining with the changed data.

Also, despite intensive testing, errors may have crept in during the previous steps. For this reason, infrastructure should be provided to continuously collect data on model performance. The input values and the resulting predictions of the model should also be recorded, as far as this is compatible with the applicable data protection regulations. On the other hand, if privacy considerations are only introduced at this point, one has to ask how a sufficient amount of training data could be collected without questionable privacy practices.

Here’s a sampling of basic information we should be collecting about our machine learning system in production:

How many times did the model make a prediction?
How long does it take the model to perform a prediction?
What is the distribution of the input data?
What features were used to make the prediction?
What results were predicted and what real results were observed later in the system?

Tools such as Logstash [25] or Prometheus [26] can be used to collect our monitoring data. To get a quick overview of the performance of the model, it is recommended to set up a dashboard that visualizes the most important metrics and to set up an automatic notification system that alerts the team in case of strong deviations so that appropriate countermeasures can be taken if needed.
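Before wiring up a full monitoring stack, the metrics from the list above can be collected with a thin wrapper around the model. The following sketch assumes a model that is a plain callable taking a single numeric input. The class name and the summary fields are hypothetical, and exporting the values to a tool such as Prometheus is left out.

```python
import time
from statistics import mean


class PredictionLog:
    """Wraps a model callable and records basic production metrics."""

    def __init__(self, model):
        self.model = model
        self.latencies = []  # seconds per prediction
        self.inputs = []     # raw input values, if privacy rules allow
        self.outputs = []

    def predict(self, x):
        start = time.perf_counter()
        y = self.model(x)
        self.latencies.append(time.perf_counter() - start)
        self.inputs.append(x)
        self.outputs.append(y)
        return y

    def summary(self):
        return {
            "count": len(self.outputs),
            "mean_latency_s": mean(self.latencies) if self.latencies else 0.0,
            "input_min": min(self.inputs, default=None),
            "input_max": max(self.inputs, default=None),
            "input_mean": mean(self.inputs) if self.inputs else None,
        }
```

Comparing the input statistics from `summary()` with those of the training data gives a first, coarse indication of whether the real-world distribution has shifted.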

Challenges on the road to MLOps

Companies face numerous challenges in staffing and challenges within their teams on the road to a successful machine learning strategy. There is also the financial challenge of attracting experienced software engineers and data scientists. But even if we manage to assemble a good team, we need to enable them to work together in the best possible way to bring out the strengths of each team member. Generally speaking, data scientists feel very comfortable using various statistical tools, machine learning algorithms, and Jupyter notebooks. However, they are often less familiar with version control software and testing tools that are widely used in software engineering. While software engineers are familiar with these tools, they often lack the expertise to choose the algorithm for a problem or to extract the last five percent of predictive accuracy from a model through skillful optimizations. Our workflows and processes must be designed to support both groups as best as possible and enable smooth collaboration.

In terms of technological challenges, we face a broad and dynamic technology landscape that is constantly evolving. In light of this confusing situation, we are often faced with the question of how to get started with new machine learning initiatives.

How do I get started with MLOps?

Building MLOps workflows must always be an evolutionary process and cannot be done in a one-time “big bang” approach. Each company has its own unique set of challenges and needs when developing its machine learning strategies. In addition, there are different users of machine learning systems, different organizational structures, and existing application landscapes. One possible approach here may be to create an initial prototype without ML components. Then, one should start building the infrastructure and simplest models.

From this starting point, the infrastructure created can then be used to move forward in small and incremental steps to more complicated models and lift them into production environments until the desired level of predictive accuracy and performance has been achieved. Short development cycles for machine learning models, in the range of days rather than weeks or even months, enable faster response to changing circumstances and data. However, such short iteration cycles can only be achieved with a high degree of automation.

Developing, putting into production, and keeping machine learning models productive is itself a complex and iterative process with many inherent challenges. Even on a small or experimental scale, many companies find it difficult to implement these processes cleanly and without failures. The data science and machine learning development process is particularly challenging in that it requires careful balancing of the iterative, exploratory components and the more linear engineering components.

The tools and processes for MLOps presented in this article are an attempt to provide structure to these development processes.

Although there are no proven and standardized processes for MLOps yet, we can take many lessons learned from “traditional” software engineering. In many teams there is a division between data scientists and software engineers. This leads to a pragmatic approach: data scientists develop a model in a Jupyter notebook and throw this notebook over the fence to software engineering, which then brings the model into production with a DevOps approach.

If we think back a few years, “throwing it over the fence” is exactly the problem that gave rise to DevOps. This is exactly how Werner Vogels (CTO at Amazon) described the separation between development and operations in his famous interview in 2006 [27]: “The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon.” Then came the phrase that looks good on DevOps conference T-shirts, coffee mugs and motivational posters: “You build it, you run it.” As naturally as development and operations belong together today, we must also make the collaboration between data science and DevOps a matter of course.

Links & Literature

[1] Tsymbal, Alexey: “The problem of concept drift. Definitions and related work. Technical Report”: https://www.scss.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf

[2] https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

[3] https://martinfowler.com/articles/cd4ml.html

[4] https://www.postgresql.org

[5] https://www.mysql.com

[6] https://www.elastic.co/de/elasticsearch/

[7] https://aws.amazon.com/redshift/

[8] https://cloud.google.com/bigquery

[9] https://docs.microsoft.com/de-de/azure/cosmos-db/

[10] https://airflow.apache.org

[11] https://feast.dev

[12] https://dvc.org

[13] https://www.pachyderm.com

[14] https://www.mlflow.org/docs/latest/tracking.html

[15] https://github.com/IDSIA/sacred

[16] https://grafana.com

[17] https://www.tensorflow.org/tensorboard

[18] https://pytorch.org/docs/stable/tensorboard.html

[19] https://www.jenkins.io

[20] https://www.terraform.io

[21] https://www.mlflow.org/docs/latest/model-registry.html

[22] https://docs.docker.com/registry/deploying/

[23] https://www.kubeflow.org

[24] https://www.tensorflow.org/tfx/guide/serving

[25] https://www.elastic.co/de/logstash

[26] https://prometheus.io

[27] https://queue.acm.org/detail.cfm?id=1142065

The post Tools & Processes for MLOps appeared first on ML Conference.

Continuous Delivery for Machine Learning

kuebra — Tue, 14 Apr 2020 11:54:24 +0000

As organizations move to become more “data-driven” or “AI-driven”, it’s increasingly important to incorporate data science and data engineering approaches into the software development process to avoid silos that hinder efficient collaboration and alignment. However, this integration also brings new challenges when compared to traditional software development. These include:

A higher number of changing artifacts: Not only do we have to manage the software code artifacts, but also the data sets, the machine learning models, and the parameters and hyperparameters used by such models. All these artifacts have to be managed, versioned, and promoted through different stages until they’re deployed to production. This makes it harder to achieve versioning, quality control, reliability, repeatability, and auditability in that process.

Size and portability: Training data and machine learning models usually come in volumes that are orders of magnitude higher than the size of the software code. As such they require different tools that are able to handle them efficiently. These tools impede the use of a single unified format to share those artifacts along the path to production, which can lead to a “throw over the wall” attitude between different teams.

Different skills and working processes in the workforce: To develop machine learning applications, experts with complementary skills are necessary, and they sometimes have contradicting goals, approaches, and working processes:

Data Scientists look into the data, extract features and try to find models which best fit the data to achieve the predictive and prescriptive insights they seek out. They prefer a scientific approach by defining hypotheses and verifying or rejecting them based on the data. They need tools for data wrangling, parallel experimentation, rapid prototyping, data visualization, and for training multiple models at scale.
Developers and machine learning engineers aim for a clear path to incorporate and use the models in a real application or service. They want to ensure that these models run as reliably, securely, efficiently, and scalably as possible.
Data engineers do the work needed to ensure that the right data is always up-to-date and accessible in the required amount, shape, speed, and granularity, as well as with high quality and minimal cost.
Business representatives define the outcomes to guide the data scientists’ research and exploration, and the KPIs to evaluate if the machine learning system is achieving the desired results with the desired quality levels.

Continuous Delivery for Machine Learning (CD4ML) is the technical approach to solve these challenges, bringing these groups together to develop, deliver, and continuously improve machine learning applications.

Figure 1: Continuous Delivery for Machine Learning (CD4ML) integrates the development processes and workflows of different roles and skill sets for building machine learning applications

The Continuous Intelligence Cycle

In the first article of The Intelligent Enterprise series, we introduced the Continuous Intelligence cycle (see figure 2).

Figure 2: The Continuous Intelligence Cycle

This is a fundamental cycle of transforming data into information, insights and actions that support an organization as it moves towards data-driven decision making. In traditional organizations, this cycle relies on legacy systems (e.g. data warehouses, ERP systems) and human decision making. In these organizations, the process is slow and contains many friction points: machine learning applications are often developed in isolation and never leave the proof of concept phase. If they make it into production, this is often a one-time ad-hoc process that makes it difficult to update and re-train them, leading to stale and outdated models.

Intelligent Enterprises implement ways to speed up the Continuous Intelligence cycle and remove the different friction points along the way. CD4ML is the technical approach to accelerate the value generation of machine learning applications as part of the Continuous Intelligence cycle. It enables you to move away from offline or bench models and manual deployments: to automate the end-to-end process of gathering information and insights from data, to productionize decisions and actions based on those insights, and to collect more data to measure the outcomes once actions have been taken. This allows the Continuous Intelligence cycle to run faster and to produce higher-quality outcomes at lower risk by incorporating feedback into the process.

What is CD4ML?

To understand CD4ML, we need to first understand Continuous Delivery (CD) and where its principles originated. Continuous Delivery, as Jez Humble and David Farley defined it in their seminal book, is: “… a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time”, which can be achieved if you “…create a repeatable, reliable process for releasing software, automate almost everything and build quality in.”

They also state: “Continuous Delivery is the ability to get changes of all types — including new features, configuration changes, bug fixes, and experiments — into production, or into the hands of users, safely and quickly in a sustainable way.”

Changes to machine learning models are just another type of change that needs to be managed and released into production. Besides the code, it requires our CD toolset to be extended so that it can handle new types of artifacts. What’s more, the whole process of producing software in short cycles becomes more complex because there is more variety in the team’s skill sets (data scientists, data engineers, developers and machine learning engineers), with each following different workflows.

ThoughtWorks has developed the Continuous Delivery approach further to overcome these challenges and make it applicable to machine learning applications, and calls this new approach Continuous Delivery for Machine Learning (CD4ML). It allows us to extend the Continuous Delivery definition to incorporate the new elements required to speed up the Continuous Intelligence cycle:

Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles.

This definition contains all the basic principles:

Software engineering approach. It enables teams to efficiently produce high quality software.

Cross-functional team. Experts with different skill sets and workflows across data engineering, data science, development, operations, and other knowledge areas are working together in a collaborative way emphasizing the skills and strengths of each team member.

Producing software based on code, data, and machine learning models. All artifacts of the software production process (code, data, models, parameters) require different tools and workflows and must be managed accordingly.

Small and safe increments. The release of software artifacts is divided into small increments, which provides visibility and control around the levels of variance of the outcomes, adding safety to the process.

Reproducible and reliable software release. The process of releasing software into production is reliable and reproducible, leveraging automation as much as possible. This means that all artifacts (code, data, models, parameters) are versioned appropriately.

Software release at any time. It’s important that the software can be delivered into production at any time. Even if organizations don’t want to deliver software all the time, being ready for release makes the decision about when to release a business decision instead of a technical one.

Short adaptation cycles. Short cycles mean that development cycles are in the order of days or even hours, not weeks, months, or even years. To achieve this, you want to automate the process, with quality safeguards built in. This creates a feedback loop that enables you to adapt your models as you learn from their behavior in production.

How it all works together

CD4ML aims to automate the end-to-end machine learning lifecycle and ensures a continuous and frictionless process from data capture, modeling, experimentation, and governance, to production deployment. Figure 3 gives an overview of the whole process.

Figure 3: Continuous Delivery for Machine Learning in action

Starting at the left side of the cycle, data scientists work on data they discover and access from data sources. They wrangle the data, perform feature extraction, split the data into training and test data, build data models and experiment with all of them. They write code to train the models (often in Python or R) and tune them by choosing parameters and hyperparameters.

As these models are trained, the data scientists are constantly evaluating them. This means looking at the model’s error rate, the confusion matrix, the number of false positives and false negatives, or running certain test scripts — for example, for chatbots. The tests should be as automated as possible with the help of test environments, test scripts or test programs.
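For a binary classifier, the evaluation figures mentioned above can be computed with a short, dependency-free sketch like the following. The function names and the convention that the label `1` marks the positive class are assumptions for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def error_rate(counts):
    """Share of all predictions that were wrong."""
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total if total else 0.0
```

Wrapped in a test script, such functions allow the evaluation to run automatically after every training run instead of being checked by hand.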

Once a good model is found, it’s ready to be productionized. The model has to be adapted to the production environment. This could mean containerization of the model code or even transforming it to a high-performance language like Java or C++ — either manually or using automatic transformation tools. The productionized version of the model has to be tested again in conjunction with other components of the overall architecture before it can be deployed to production.

In production, we have to observe and monitor how the model behaves “in the wild”. Metrics like usage, model input, model output, and possible model bias are important information about the model performance. This data can be fed back to the first stage of the process to enable further improvement: the whole Continuous Intelligence cycle starts again.

The transportation of the artifacts (source code, executables, training, and test data or model parameters) between the different process stages is controlled via pipelines that are executed by a CD orchestration tool. Every artifact is versioned, enabling reproducibility and auditability, so prior versions can be rebuilt or redeployed if required. The CD orchestration tool ensures the smooth and frictionless operation of the whole process and also allows governance and compliance, so certain quality standards and fairness checks are built into the process.

CD4ML in Action

We want to demonstrate the approach in practice based on a real client project delivered by ThoughtWorks. In fact, our current notion of CD4ML emerged several years ago when we first applied Continuous Delivery to a user-facing machine learning application. You can read about it in detail here.

Our challenge was to build a price estimation engine for a leading European online car marketplace. The engine needed to be able to give a realistic estimate for anybody looking to buy or sell a car. That price estimate would be based on past car sales within the marketplace. As the market for used cars is constantly changing, the price estimation model has to be continuously re-trained on new data. A perfect case for CD4ML.

Figure 4: A CD4ML end-to-end process in a real-world example

Figure 4 shows the overall CD4ML flow for this specific case. The data scientists train the model using data from the marketplace — such as car specs, asking price and actual sales price. The model then predicts a price based on the car model, age, mileage, engine type, equipment, etc.

Before training a model, there’s a lot of data cleanup work to be done: detecting outliers, wrong listings, or dirty data. This is the first quality gate to be automated — is there enough good data to even provide a prediction model for a certain car model?

Once the trained model can make sufficiently accurate price estimates, it’s exported as a productionizable artifact, such as a JAR or a pickle file. This is the second quality gate: is the model’s error rate acceptable?

This prediction model is then transformed into a format matching the target platform, then packaged, wrapped, and integrated into a deployable artifact — a prediction service JAR containing a web server or a container image that can be readily deployed into a production environment. This deployment artifact is now tested again, this time in an end-to-end fashion: is it still producing the same results as the original, non-integrated prediction model? Does it behave correctly in a production environment, for instance, does it adhere to contracts specified by other consuming services? This is the third quality gate.

If all three quality gates succeed, a new re-trained price prediction service is deployed and released. Importantly, all of those steps should be automated so that re-training to reflect the latest market changes happens without manual intervention as long as all quality gates are satisfied.
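The control flow described here is simple enough to sketch: run the gates in order, stop at the first failure, and deploy only when all of them pass. The gate names, the lambdas, and the `deploy` callback are placeholders for whatever a real pipeline tool would run.

```python
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_pipeline(gates: List[Gate], deploy: Callable[[], None]) -> str:
    """Run gates in order; deploy only if every gate passes."""
    for name, check in gates:
        if not check():
            return f"stopped at gate: {name}"
    deploy()
    return "deployed"


released = []  # records deployments so the example is observable
gates = [
    ("data quality", lambda: True),
    ("model error", lambda: True),
    ("end-to-end", lambda: True),
]
result = run_pipeline(gates, deploy=lambda: released.append("price-service"))
```

Because a failed gate returns before `deploy` is called, a bad re-training run leaves the currently released service untouched.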

Finally, the live price prediction is continuously monitored: how do the sellers react to the price recommendations? How much is the listing price deviating from the suggestion? How close is the price prediction to the final buying price of the respective vehicle? Is the overall conversion and user experience being impacted, for instance by rising complaints or direct positive feedback? In some cases, it makes sense to deploy the new model next to the old version to compare their performance. All this new data then informs the next iteration of training the prediction model, either directly through new data from cars that were sold or by tweaking the model’s hyperparameters based on user feedback, which closes the Continuous Intelligence cycle.
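To make the side-by-side comparison concrete, the sketch below measures how far final sale prices deviate from each model's suggestion and reports which model was closer. The metric, the data, and the function names are illustrative assumptions. A real setup would also track the other signals listed above, such as conversion and complaints.

```python
def mean_relative_deviation(pairs):
    """pairs: (suggested_price, final_sale_price) tuples for sold cars."""
    if not pairs:
        raise ValueError("need at least one sold car")
    return sum(abs(final - suggested) / suggested for suggested, final in pairs) / len(pairs)


def compare_models(old_pairs, new_pairs):
    """Summarize both models' deviation and whether the new one is closer."""
    old_dev = mean_relative_deviation(old_pairs)
    new_dev = mean_relative_deviation(new_pairs)
    return {"old": old_dev, "new": new_dev, "new_is_better": new_dev < old_dev}


# Made-up monitoring data: (suggested price, final sale price)
old_model_pairs = [(10000, 11000), (20000, 18000)]
new_model_pairs = [(10000, 10500), (20000, 19000)]
report = compare_models(old_model_pairs, new_model_pairs)
```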

Opportunities of CD4ML and the road ahead

Adopting Continuous Delivery for Machine Learning creates new opportunities to become an Intelligent Enterprise. By automating the end-to-end process from experimentation to deployment, to monitoring in production, CD4ML becomes a strategic enabler to the business. It creates a technological capability that yields a competitive advantage. It allows your organization to incorporate learning and feedback into the process, towards a path of continuous improvement.

This approach also breaks down the silos between different teams and skill sets, shifting towards a cross-functional and collaborative structure to deliver value. It allows you to rethink your organizational structures and technology landscape to create teams and systems aligned to business outcomes. In subsequent articles in the series, we’ll explore how to bring product thinking into the data and machine learning world, as well as the importance of creating a culture that supports Continuous Intelligence.

Another key opportunity to implement CD4ML successfully is to apply platform thinking at the data infrastructure level. This enables teams to quickly build and release new machine learning and insight products without having to reinvent or duplicate efforts to build common components from scratch. We’ll dedicate an entire article to the technical components, tools, techniques, and automation infrastructure that can help you to implement CD4ML.

Finally, by leveraging automation and open standards, CD4ML can provide the means to build a robust data and architecture governance process within the organization. It allows you to introduce checks for fairness, bias, compliance, or other quality attributes of your models on their path to production. Like Continuous Delivery for software development, CD4ML allows you to manage the risks of releasing changes to production at speed, in a safe and reliable fashion.

All in all, Continuous Delivery for Machine Learning moves the development of such applications from proof-of-concept programming to professional state-of-the-art software engineering.

This article was first published on ThoughtWorks.com

The post Continuous Delivery for Machine Learning appeared first on ML Conference.