ML Conference https://mlconference.ai/ The Conference for Machine Learning Innovation Fri, 08 May 2026 12:14:50 +0000 en-US hourly 1 https://wordpress.org/?v=6.7.2 https://mlconference.ai/wp-content/uploads/2025/09/cropped-favicon-32x32.png ML Conference https://mlconference.ai/ 32 32 AI Architecture: Scan vs Seek https://mlconference.ai/blog/ai-architecture-scan-vs-seek/ Fri, 08 May 2026 12:13:54 +0000 https://mlconference.ai/?p=1080357 Most AI development tools rely on a “scan” approach—dumping large chunks of code into a model and hoping it finds what matters. This article argues for a fundamentally different architecture: “seek,” where AI retrieves only the most relevant knowledge on demand. See why this shift is more efficient and how it unlocks deeper organizational intelligence.

The post AI Architecture: Scan vs Seek appeared first on ML Conference.

I’ve been thinking about this framing for a while, and I think it captures the fundamental architectural split in AI tooling better than anything else I’ve come up with. There are two ways to give an AI the context it needs. The industry picked one. I think they picked wrong.

How every tool works today

The pattern is the same everywhere. Your AI tool scans your codebase — files, directory structure, maybe some git history. It stuffs as much as it can into the context window and sends the whole thing to the LLM. Hopes the model finds the relevant parts.

Cursor calls it “codebase indexing.” Copilot calls it “code referencing.” Claude Code reads files on demand. The implementation varies, but the architecture is identical: dump everything in, let the model sort it out.

I call this the Scan approach. And it has problems I don’t think are fixable within the paradigm.

Why scanning breaks

Context windows are finite. A medium-sized project has millions of tokens of source code. You can’t fit it all. So the tool has to guess which files matter — and it guesses wrong constantly. I’ve watched tools include entire test directories when the task is about production code, or load a database migration file when the engineer is working on a frontend component.

More fundamentally: scanning is O(n). As your codebase grows, the problem gets worse. More files to index. More irrelevant context diluting the relevant parts. More tokens wasted on code the model doesn’t need for the current task.

But here’s the thing that really gets me: scanning can only see code. Your codebase contains source files. It doesn’t contain why you chose your architecture. It doesn’t contain the error pattern that burned two engineers last month. It doesn’t contain the fact that your frontend team prefers composition over inheritance, or that the one person who understands the billing pipeline just went on leave.

No amount of codebase scanning will surface this knowledge. It doesn’t live in files. It lives in conversations, decisions, and people’s heads.

The alternative I keep coming back to

What if instead of dumping everything in and hoping, you sent only what’s relevant — and gave the AI tools to find more when it needed to?

This is what I think of as the Seek approach. It works in layers:

Always-present: A small set of high-signal knowledge that matters for every interaction. Your team’s rules. The structural flows in your system. These are injected automatically because they always apply. A few hundred tokens, not thousands.

Context-aware: What the AI has learned while working in this specific context. Decisions it made. Patterns it discovered. Errors it hit. This is the AI’s working memory for the current task — and it persists across sessions.

On-demand: Everything else. The full organizational knowledge base, searchable by the AI when it needs it. Error patterns from six months ago. Team expertise maps. Deployment runbooks. The AI doesn’t carry this — it reaches for it when the task demands it.
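The three layers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular tool is built: the class `SeekContext`, its fields, and the naive keyword search are all hypothetical names and stand-ins introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class SeekContext:
    """Hypothetical three-layer context store; every name here is illustrative."""
    always_present: list                                # rules sent with every prompt
    context_aware: list = field(default_factory=list)   # notes learned during this task
    knowledge_base: dict = field(default_factory=dict)  # searched only on demand

    def build_prompt(self, task: str) -> str:
        # Only the two small layers travel with every request.
        parts = self.always_present + self.context_aware + [f"Task: {task}"]
        return "\n".join(parts)

    def search(self, query: str) -> list:
        # Naive keyword match standing in for a real retrieval backend.
        q = query.lower()
        return [text for title, text in self.knowledge_base.items()
                if q in title.lower() or q in text.lower()]


ctx = SeekContext(always_present=["Rule: prefer composition over inheritance."])
ctx.context_aware.append("Decision: retry billing calls at most twice.")
ctx.knowledge_base["billing timeout incident"] = "Root cause: connection pool exhausted."

prompt = ctx.build_prompt("Fix the checkout bug")  # small layers only
hits = ctx.search("billing")                       # fetched when the task demands it
```

The knowledge base entry never enters the prompt unless the model explicitly searches for it, which is the point of the on-demand layer.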

The math that convinced me

Scan:

[200K token context window]
├── 50K: source files (maybe relevant, maybe not)
├── 30K: conversation history
├── 10K: system prompt
└── 110K: remaining capacity (shrinks every turn)

Seek:

[200K token context window]
├── 2K: rules that always apply
├── 3K: knowledge from this context
├── 10K: system prompt
└── 185K: available for actual work

With the numbers above, the seek model leaves about 92% of the context window (185K of 200K tokens) for the current task. The scan model wastes 25-50% on context that might not be relevant.
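Plugging the token counts from the two diagrams into a few lines of Python gives the share of the window that stays free in each layout. The dictionary labels and the helper `free_share` are names made up for this sketch.

```python
WINDOW = 200_000  # tokens, as in the diagrams above

scan = {"source files": 50_000, "conversation history": 30_000, "system prompt": 10_000}
seek = {"always-present rules": 2_000, "context knowledge": 3_000, "system prompt": 10_000}


def free_share(used: dict, window: int = WINDOW) -> float:
    """Fraction of the window still available after the listed items."""
    return (window - sum(used.values())) / window


scan_free = free_share(scan)  # 110K of 200K tokens remain
seek_free = free_share(seek)  # 185K of 200K tokens remain
print(f"scan: {scan_free:.1%} free, seek: {seek_free:.1%} free")
```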

But the efficiency difference, honestly, isn’t the most important part. The most important part is what you can represent.

What seek can surface that scan can’t

Knowledge type                  | In files? | In a seek system?
--------------------------------|-----------|------------------
Current source code             | Yes       | Yes (file tools)
Why you chose this architecture | No        | Yes
Known error patterns            | No        | Yes
Team conventions                | Partially | Yes
Who knows what                  | No        | Yes
Past incidents                  | No        | Yes
What was done last week         | No        | Yes
Git history context             | Partially | Yes

A scan system gives the AI your code. A seek system gives the AI your organization’s knowledge. These are fundamentally different products masquerading as the same category.

The self-priming insight

The part that took me the longest to figure out: the best source of organizational knowledge is the AI’s own conversations.

When an engineer explains to the AI why they’re choosing a particular approach, that’s a decision being made. When they discover a coupling between services while debugging, that’s an insight being created. When they fix a bug and explain the root cause, that’s an error pattern being documented.

These moments happen every day. The knowledge is right there — fresh, contextualized, structured. In a scan system, it evaporates when the session ends. In a seek system, it’s captured, stored, and available to the entire team.

No documentation sprints. No wiki maintenance. The knowledge just accumulates because people use the tool.

The compounding difference

This is the part that keeps me up at night, because I think the implications are bigger than most people realize. Scan systems are stateless. The 1,000th session is exactly as informed as the 1st. Seek systems compound. The 1,000th session has access to everything the organization learned in the first 999.

Without compounding, your team’s effective knowledge equals the smartest person in the room. With it, your team’s effective knowledge equals the sum of everything anyone ever learned.

The difference between scan and seek isn’t a feature. It’s an architecture. And architecture is hard to change once you’ve committed.

The Basics of Machine Learning https://mlconference.ai/blog/the-basics-of-machine-learning/ Tue, 14 Apr 2026 11:19:15 +0000 https://mlconference.ai/?p=1080273 Fully trained models are everywhere, and AI is almost synonymous with prompt engineering. But how does machine learning actually work and how are models trained? This article will address these questions.

The post The Basics of Machine Learning appeared first on ML Conference.

AI models have made great strides in recent years, and AI is often perceived as quasi-magical. In this image, AI models are black boxes that somehow continuously learn from the data they have been given and, on this basis, somehow respond to queries with answers (Fig. 1).

Fig. 1: Naive view of AI models

This image is not wrong and it’s largely sufficient at the level of prompt engineering. But it’s vague, of course, and often contains the word “somehow.” This article lifts the hood and takes a look at how machine learning works in detail. I use the terms ML and AI synonymously, with AI referring to “large” models in all their vagueness.

Let’s begin by defining ML to give the topic some structure and distinguish it from similar, related fields: “Machine learning is the training of a model using statistical methods to predict the values of dependent variables based on input variables.”

Let’s look at the individual parts of this definition. The starting point for a machine learning project is typically the dependent variables, i.e., the values that the model is supposed to deliver. These could be sales and profit forecasts for the next quarter, or even a poem in dialect on a given topic, or the next move in a chess game. It’s important to have a clear idea of what the model should ultimately deliver because this determines all other aspects.

The input variables (or independent variables) are the variables that the model is supposed to derive its values from. There are often many possibilities for this, and the selection is a creative, technical process. For example, should previous business figures be used in weekly, monthly, or quarterly granularity to predict sales? How far back should the figures go? Should employee data be used, and is this even legally permissible? What about figures from competitors, etc.? Experience is needed, and different variables are often tested and lessons learned along the way.

The third important aspect is machine learning’s predictive nature. It involves predicting the dependent variables for new, unknown contexts. For example, a facial recognition model evaluates a live video feed it has never seen, and a sales model predicts figures for future quarters. Questions about the limits on prediction quality for new, unknown input data, and about how to verify that quality, arise in many areas of ML.

Fourth, ML means tackling problems with statistical methods. This is by no means the only option, and sometimes it’s clearly a poor choice. When a bank calculates a new account balance based on deposits and withdrawals, this must be done using deterministic algorithms, not statistical approximation. For many other problems, it works well to find and implement heuristics through careful thinking – plenty of video game “AIs” work this way. ML is not about completely “solving” a problem domain, but rather about approximations. For some variables – lottery numbers, for example – statistical methods are fundamentally unsuitable. At the very beginning of an ML project, you should check if statistically validated approximate solutions are suitable for the domain and if they’re the approach of choice.

Fifth, once you’ve decided to solve a problem statistically, you must choose a model. This is a formula for calculating the dependent variables from the input variables, but it contains parameters that can be adjusted. The model can be very simple (in an oversimplified extreme case, a constant percentage growth in sales from quarter to quarter, with the percentage as the only parameter), or it can be as complicated as you like, e.g., a deep neural network with millions of parameters. The model choice is a human decision based on experience and domain requirements. Deep neural networks are just one of many options and are not automatically the best.

Sixth, ML ultimately means training the model, i.e., optimizing the parameter values using statistical methods. This is where the “learning” in ML happens, and it is done by iteratively adjusting the parameters. In the simplest case, training uses existing training data with known results (supervised learning). But there are also approaches for when such data is not available or not sufficient (unsupervised learning, reinforcement learning) – more on that later.

A simple example

Let’s consider a simple example. The mathematics and fundamentals are the same as for deep neural networks and other large models, but a small example is clearer.

Let’s assume there is a variable y that we want to predict as a function of a variable x (I find this easier than giving the variables pseudo-illustrative technical names). y is therefore the dependent variable and x the only independent variable. We have sample data for the relationship between the variables; Figure 2 shows 200 data points.

Fig. 2: An example data set as a basis for training

Let’s also assume that we have examined the domain in detail and decided that we want to predict the relationship between y and x statistically rather than algorithmically. This makes the problem a candidate for ML. The next step is to find a model whose parameters we can then adjust to the specific values. For this, we choose a third-degree polynomial: ŷ(x) = a·x³ + b·x² + c·x + d. Here, ŷ denotes the predicted value for the variable y. The function has four parameters, a, b, c, and d, as well as the independent variable x. Choosing this model means that we assume that the relationship between x and y can be described well or well enough by this polynomial (with a suitable choice of parameter values).
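Written as code, the model is a single function of x and the four parameters. The function name `predict` and the parameter values in the example call are chosen here purely for illustration.

```python
def predict(x: float, a: float, b: float, c: float, d: float) -> float:
    """Third-degree polynomial model: y_hat = a*x^3 + b*x^2 + c*x + d."""
    return a * x**3 + b * x**2 + c * x + d


# Illustrative parameter values, used only to exercise the function.
y_hat = predict(2.0, a=1.0, b=0.0, c=-1.0, d=0.5)  # 8 - 2 + 0.5 = 6.5
```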

Choosing a third-degree polynomial is not mandatory. For example, based on our knowledge of the domain, we could have decided that b and d are always 0 and removed the corresponding terms from the model. Or we could have assumed a fifth-degree polynomial or a sine curve. Or a deep neural network.

The choice of model is often a trade-off between complexity, comprehensibility, and accuracy, and a good choice is based on an understanding of the domain or assumptions about it. In practice, you often start with one model, learn more about the domain, and start over with a newly trained model.

Training the model

Now we need to find “good” values for the parameters a, b, c, and d, i.e., values with which the model makes the “best possible” predictions for ŷ. This trains the model. Parameter values are changed iteratively until they fit. This must be strictly separated from model usage, in which the parameters are fixed and the model always processes new input values. We start with random values:

  • a = 0.15434
  • b = 0.75297
  • c = -0.08099
  • d = 0.57356

Figure 3 shows the predictions in light green and the data points in purple. The predictions have no relation to the data, which was to be expected with random parameter values. If the iterative improvement of the values works, then the starting values should not matter.

Fig. 3: The initial parameter values are not optimal yet

Before we iteratively “improve” the parameter values using any algorithm, we need to specify what “good” means. In our example, we want the prediction to be as accurate as possible on average—a plausible and very common goal. But we could also optimize it so that the maximum error is as small as possible, for example.

Mathematically, our choice means that the mean squared error (MSE) is as small as possible: for each data point, we take the difference between the actual and predicted values, square this difference (to become independent of the sign and because it dampens the influence of small differences), and calculate the mean value across all data points (Fig. 4).

Fig. 4: Definition of mean square error
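Spelled out as a formula, the verbal definition above reads as follows, where N is the number of data points and (x_i, y_i) is the i-th data point (both symbols are introduced here for the notation):

```latex
\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl( \hat{y}(x_i) - y_i \bigr)^2
\]
```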

The smaller the MSE, the better our model fits the actual data. A value of exactly 0 would mean a perfect match, which, of course, does not occur in real problems. Negative values cannot occur because the MSE is a mean of squares.
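A minimal Python sketch of this calculation for the cubic model might look as follows. The three data points are invented for the example and lie exactly on y = x³ + x + 1, so the matching parameters give an MSE of 0, and shifting d by 1 gives an MSE of 1.

```python
def mse(xs, ys, params):
    """Mean squared error of the cubic model on the given data points."""
    a, b, c, d = params
    squared_errors = [
        ((a * x**3 + b * x**2 + c * x + d) - y) ** 2
        for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / len(squared_errors)


xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 11.0]                        # invented data on y = x^3 + x + 1

perfect = mse(xs, ys, (1.0, 0.0, 1.0, 1.0))  # parameters match the data
off = mse(xs, ys, (1.0, 0.0, 1.0, 0.0))      # every prediction is 1 too low
```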

Figure 5 shows the magnitude of the square error for each data point (as blue bars at the bottom of the diagram) for the random initial values of the model. There is a range in which the predictions are quite close to the real data, but especially for small values of x, they are far apart. The mean square error is MSE = 2.318 – so the mathematical description fits with the observation that the predictions are still terrible.

Fig. 5: The square error varies for different data points

With these preparations in place, we can now enter the training loop (Fig. 6):

  • We apply the model with the current parameter values to our training data to calculate the current MSE (we have already done this).
  • Then, based on this, we calculate changes for the parameters that reduce the MSE.
  • Finally, we change the model parameters accordingly to start a new run with the new parameter values.

Fig. 6: The training loop used to optimize the model parameters

The second step, calculating the parameter changes that reduce the error, requires some mathematics. This is explained separately at the end of the article. The rest of the procedure still makes sense even if this step is considered a black box.

In any case, this black box delivers the following changes for the parameters in the first step:

  • a: 0.15434 ↦ 0.17299
  • b: 0.75297 ↦ 0.72049
  • c: -0.08099 ↦ -0.0687
  • d: 0.57356 ↦ 0.54846

We will start the second iteration (epoch) with these values. Figure 7 shows how the models—the colored curves—continue to converge on the training data. After 2,000 epochs, the model curve fits the training data well, at least visually.

Fig. 7: The model is getting closer and closer to the training data
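The three steps of the training loop can be sketched as plain gradient descent. This is an assumption about the optimizer, not a statement of what was used to produce the figures: the learning rate, the epoch count, and the noise-free target data are all invented for this example. Only the random starting values are taken from the article, and the gradient expressions in step 2 are the chain-rule results explained in the Autograd section.

```python
def train(xs, ys, params, learning_rate=0.1, epochs=5000):
    """Plain gradient descent on the MSE of the cubic model (assumed optimizer)."""
    a, b, c, d = params
    n = len(xs)
    for _ in range(epochs):
        # Step 1: apply the model and collect the residuals y_hat - y.
        residuals = [a * x**3 + b * x**2 + c * x + d - y for x, y in zip(xs, ys)]
        # Step 2: average the gradient of the squared error for each parameter.
        grad_a = sum(2 * r * x**3 for r, x in zip(residuals, xs)) / n
        grad_b = sum(2 * r * x**2 for r, x in zip(residuals, xs)) / n
        grad_c = sum(2 * r * x for r, x in zip(residuals, xs)) / n
        grad_d = sum(2 * r for r in residuals) / n
        # Step 3: move each parameter a small step against its gradient.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
        c -= learning_rate * grad_c
        d -= learning_rate * grad_d
    return a, b, c, d


xs = [i / 10 for i in range(-10, 11)]   # 21 points in [-1, 1]
ys = [2 * x**3 - x + 0.5 for x in xs]   # hypothetical noise-free target
a, b, c, d = train(xs, ys, (0.15434, 0.75297, -0.08099, 0.57356))
```

Because the invented data contains no noise, the parameters approach the generating values 2, 0, -1, and 0.5. With real, noisy data the MSE would level off above zero instead.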

Convergence

This raises the question: When do we consider the training to be complete? The training loop does not have an automatic end; instead, an explicit criterion is needed to terminate the training.

In practice, this is trickier than it seems at first glance. Let’s first take a look at how the MSE develops over the training epochs (Table 1).

The first thing that stands out is that the error decreases over the course of training and converges toward a stable value. This is not a given for more complicated problems, such as if the model does not properly fit the problem or the feedback mechanism for parameter values is too coarse. Achieving convergence during training is a major milestone for many ML projects.

Epoch | MSE
------|------------
0     | 2.31801
100   | 0.08127
1000  | 0.00217
3000  | 0.000038603
5000  | 0.000037943
10000 | 0.000037942

Table 1: Development of the MSE with iterations

Secondly, the MSE becomes small, but it converges to a value greater than zero. This is expected and a good thing, because the model is a simplification of reality. Real data always has some form of noise; for example, even good weather forecasts cannot predict the temperature to two decimal places. If the error becomes too small when training a model, this is suspicious and may indicate that the model has too many parameters and is also fitting the noise in the training data (overfitting). In our example, however, the value the MSE converges to fits our understanding of the (fictional) domain.

If we knew in advance that the MSE would converge to 0.00003794, we could specify a slightly larger value as the termination criterion. But in practice, we generally don’t know in advance how accurate the model predictions will be, so such an absolute threshold value is out of the question.

Training is often terminated when the relative change in the MSE becomes small enough, e.g., when it changes by less than 0.0001 percent over 100 epochs. This value must also be adapted to the problem at hand: a criterion that is too coarse can terminate training before the possible accuracy is reached, but a criterion that is too fine can prolong training unnecessarily or even lead to an infinite loop due to the finite accuracy of floating-point arithmetic. This is another instance where experience and trial and error are important.
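Such a relative criterion could be checked as in the following sketch. The function name `has_converged` and its arguments are hypothetical, and the default tolerance of 1e-6 corresponds to the 0.0001 percent mentioned above.

```python
def has_converged(mse_history, window=100, rel_tol=1e-6):
    """True once the MSE changed by less than rel_tol (relative) over `window` epochs."""
    if len(mse_history) <= window:
        return False                 # not enough epochs to compare yet
    old, new = mse_history[-window - 1], mse_history[-1]
    if old == 0.0:
        return True                  # already a perfect fit, nothing left to improve
    return abs(old - new) / old < rel_tol


flat = has_converged([1.0] * 101)                               # no change over 100 epochs
falling = has_converged([1.0 / (k + 1) for k in range(200)])    # still halving
too_short = has_converged([1.0] * 50)                           # history shorter than window
```

In a training loop, the MSE of each epoch would be appended to the history and the loop would stop as soon as the function returns `True`, ideally combined with a maximum epoch count as a safeguard.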

Validation

The fully trained model in our example is ŷ(x) = 0.00041841 + 0.98694384 x + 0.00022286 x² – 0.14406674 x³ and it fits the training data well visually (Fig. 8). But for real-world applications, it’s usually important to know how good the model’s predictions are.

Fig. 8: The converged model fits the training data well from a purely visual perspective

Assuming that the deviations between the data and the model are normally distributed (often a plausible approximation, the justification of which is beyond the scope of this article), the standard deviation is the square root of the MSE, i.e., σ ≈ 0.0062 in our example. This means that for roughly two-thirds of the training data, the true values are within ±0.0062 of the model value.

That doesn’t automatically make it good enough. It depends on the domain and context, and assessment is a technical decision. If the accuracy is lower (or higher!) than is technically plausible, this is cause for reflection. You can use a different model but perhaps you’ve learned something new about the domain from the data. It’s important to check the accuracy for plausibility.

So far, we’ve looked at the accuracy of the model on the data we used to train it. It’s also important to consider how well the model can predict values for new, unknown inputs. After all, these predictions are the reason we do ML in the first place.

To do this, you can split the available data and use 80 percent for training, reserving the other 20 percent for subsequent validation. Fortunately, we did this from the start: only 200 of the 250 available data points were used for training. It is crucial that the validation data is never used for training in any way.

The 50 retained data points are plotted as green dots in Figure 9 and visually fit well with the model’s predictions. We verify this quantitatively by comparing the accuracy of the model for training and validation data. The MSE on the training data is 3.79 · 10⁻⁵, and on the validation data it is 3.71 · 10⁻⁵. The two values are of the same order of magnitude, so it is plausible that the training data is representative and that the model will also fit unknown data.

Fig. 9: Separate validation data can be used to check the model quality
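An 80/20 split of this kind might be implemented as below. The helper `split`, the fixed seed, and the placeholder data are assumptions made for this sketch; the article does not say how its data was divided.

```python
import random


def split(points, train_fraction=0.8, seed=0):
    """Shuffle a copy of the data and split it into training and validation parts."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


points = [(i / 250, (i / 250) ** 2) for i in range(250)]   # placeholder data
train_set, validation_set = split(points)
```

Shuffling before splitting matters when the data is ordered, for example by x: otherwise the validation set would cover only one end of the value range.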

Figure 10 illustrates how a parameter set can be implausible despite good convergence. Here, the same model has been trained with only four data points and has an MSE of less than 10⁻¹², meaning that the prediction accuracy on the training data is phenomenal. Apart from the fact that this is obviously too little training data for meaningful fitting, it only covers part of the value range of x, which is easy to overlook in this diagram.

Fig. 10: Too little or unrepresentative training data can lead to overfitting

During the initial validation step, it may become apparent that model deviations from the training data are significantly lower than expected based upon the technically expected noise of the values. In a joint diagram with the entire available data set, it’s obvious that the parameter values are poor (Fig. 11).

Fig. 11: In this extreme case, the mismatch is obvious when compared to the entire training data set

But this effect can also be much more subtle. Figure 12 shows a curve representing the predictions of a model that was trained with the six points marked in green. The MSE on the training data is 1.6 · 10⁻⁵, which is within a plausible range. The result also seems plausible when looking at the plot.

Fig. 12: The trained model appears to be a good fit for the data as a whole

But if you calculate the prediction accuracy on the validation data, you get an MSE of 1.4 · 10⁻⁴. This is an entire order of magnitude higher than on the training data and is a strong indication that something went wrong during training. It doesn’t just mean that the model is slightly less accurate; it calls the whole process into question. Both values should have been similar, but they aren’t, so something must have gone fundamentally wrong somewhere. Conceptual debugging is necessary.

Autograd

We’ve now walked through how ML model training works using an example, from choosing a model to validating the result. We only omitted the details of how the parameters are adjusted to minimize the error. Let’s make up for that now.

This section is a little more mathematical than the rest of the article, but you can certainly do ML without getting into this level of detail. However, I think it’s good to understand all the steps.

First, we select a parameter, a, and a single data point (x, y). Then we have the question of optimization: How much and in which direction should we change a so that the square error SE at this point becomes smaller?

As a reminder, here are the formulas again:

ŷ = a·x³ + b·x² + c·x + d
SE = (ŷ – y)²

So far, we’ve considered these quantities as functions of x, ŷ(x) = …. But we can just as easily consider them as functions of a, without changing the formulas: ŷ(a) = a x³ + … and SE(a) = (ŷ(a) – y)² . We now treat a as the variable and x as a parameter like any other. This is purely a change of perspective, without us having carried out any further analysis, but it is a first step towards investigating how a influences the error.

Figure 13 shows an example of a section of such a function SE(a) with fixed values for x, b, c, and d. For each value of a, the slope of the function gives an indication of the direction and magnitude of the change in a required to get closer to the minimum of SE. If this reminds you of derivatives, that isn’t a coincidence.

Fig. 13: Example excerpt from a curve SE(a) – depending on the position, a must be selected larger or smaller in order to get closer to the minimum

To do this, we need the slope of SE at point a (for the current parameter values). An easy way is to approximate the slope by calculating SE for a second, closely adjacent value of a and dividing the deltas (“difference quotient”) (Fig. 14).

Fig. 14: The difference quotient of error and parameter is an approximation for the differential quotient

The advantage of this brute force approach is that you don’t need to know anything about the underlying function. The disadvantage is that you have to calculate the entire function a second time – and for the slope depending on b another time, for c yet another time, and so on. This is very expensive for large models with thousands or millions of parameters. But it is a potential approach that can benefit greatly from the parallelism of graphics cards.
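The difference quotient for parameter a can be sketched as follows. The function names, the step size h, and the sample values are assumptions for illustration. For these values, SE(a) = (8a – 5)², so the true slope at a = 1 is 48, and the approximation should land very close to it.

```python
def squared_error(a, x, y, b=0.0, c=0.0, d=0.0):
    """Squared error of the cubic model at a single data point (x, y)."""
    y_hat = a * x**3 + b * x**2 + c * x + d
    return (y_hat - y) ** 2


def slope_wrt_a(a, x, y, h=1e-6, **fixed):
    """Forward difference quotient: (SE(a + h) - SE(a)) / h."""
    return (squared_error(a + h, x, y, **fixed) - squared_error(a, x, y, **fixed)) / h


approx = slope_wrt_a(a=1.0, x=2.0, y=5.0)   # close to the true slope of 48
```

Note that each slope costs one extra evaluation of the error function, which is exactly the expense described above once there are many parameters.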

Many ML frameworks like PyTorch and TensorFlow take a different, often much more efficient approach: they record the arithmetic operations performed when calculating the error and apply the chain rule to this record to determine the gradients with respect to the various parameters (automatic differentiation, or autograd).

Figure 15 shows this for our example. The derivative of SE with respect to a is, according to the chain rule, the derivative of SE with respect to ŷ multiplied by the derivative of ŷ with respect to a. The former is 2 (ŷ – y), the latter is x³, so that the total term is 2 (ŷ – y) · x³.

Fig. 15: Derivation of the error according to parameter a using the chain rule

This calculation is much cheaper than evaluating the entire model again, and it allows for a number of optimizations. For example, ŷ (or even the difference ŷ – y) has already been calculated to determine the error and can be reused, and x³ is constant across all epochs.

Similarly, the dependence of the error on parameters b, c, and d can be calculated. Averaging these values across all training data yields, for each parameter, the slope of the MSE, from which the parameter corrections for the next epoch are derived.
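For reference, applying the same chain-rule step to each parameter of ŷ = a·x³ + b·x² + c·x + d gives the expressions below. The last line shows the average over all N data points for parameter a; the index i and the symbol N are notation introduced here.

```latex
\[
\begin{aligned}
\frac{\partial\,\mathrm{SE}}{\partial a} &= 2(\hat{y} - y)\cdot x^3, &
\frac{\partial\,\mathrm{SE}}{\partial b} &= 2(\hat{y} - y)\cdot x^2, \\
\frac{\partial\,\mathrm{SE}}{\partial c} &= 2(\hat{y} - y)\cdot x, &
\frac{\partial\,\mathrm{SE}}{\partial d} &= 2(\hat{y} - y), \\
\frac{\partial\,\mathrm{MSE}}{\partial a} &= \frac{1}{N}\sum_{i=1}^{N} 2\,\bigl(\hat{y}(x_i) - y_i\bigr)\, x_i^3 .
\end{aligned}
\]
```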

Conclusion

This article on machine learning basics used a simple example to show how step-by-step optimization of model parameters can converge over many epochs. It highlighted several fundamental challenges and introduced some statistical tools for recognizing and handling them.

The algorithms and statistical concepts used in this simple example are largely the same as those used for large and complex models and AIs. The rest of the series will pick up on this.

My main concern in this first part was to remove the quasi-magical aura surrounding machine learning. The mathematics and algorithms are not overly complex and basic validation procedures apply on both a large and small scale.

Building APIs for an Agentic World https://mlconference.ai/blog/building-apis-for-an-agentic-world/ Wed, 11 Mar 2026 14:25:47 +0000 https://mlconference.ai/?p=1080152 This article provides a comprehensive guide for senior software engineers, technical leads, and product managers on designing robust and effective APIs for agentic AI systems. It balances foundational principles with practical considerations, serving as both a reference manual and the core material for the how to build APIs for an agentic world. What are Agentic AI Systems?

The post Building APIs for an Agentic World appeared first on ML Conference.

Agentic AI systems represent a significant evolution in artificial intelligence. Unlike traditional AI applications that might perform a single, predefined task, agentic systems are autonomous or semi-autonomous AIs that can:

  • Maintain context across multiple interactions
  • Break down complex goals into actionable steps
  • Use tools and external resources dynamically
  • Adapt their approach based on changing conditions
  • Make decisions with varying degrees of autonomy

Their core capabilities are:

  • Multi-Step Planning: Agentic systems decompose complex objectives into sequential or parallel tasks, creating and executing plans that may span minutes, hours, or days.
  • Dynamic Tool Use: These systems can discover, select, and invoke appropriate tools or functions based on current needs and context, rather than following pre-programmed workflows.
  • Persistent Memory: Unlike stateless applications, agentic systems maintain both short-term working memory and long-term knowledge stores that inform future decisions.
  • Goal-Oriented Behavior: Agents operate with explicit or implicit objectives, continuously evaluating progress and adjusting strategies to achieve desired outcomes.
  • Environmental Awareness: Advanced agentic systems can perceive and respond to changes in their operating environment, including user feedback, system constraints, and external events.

Examples range from sophisticated customer service agents that can resolve multi-turn queries across various systems, to research agents autonomously searching and synthesizing information, to industrial automation agents optimizing complex workflows.

How Agentic Systems Differ from Traditional Applications

Traditional applications follow predictable request-response patterns with well-defined input-output relationships. Agentic systems introduce several paradigm shifts:

  • From Stateless to Stateful: Traditional APIs assume each request is independent. Agentic systems require persistent state management across extended interactions.
  • From Predetermined to Dynamic: While traditional systems execute fixed workflows, agents make runtime decisions about which operations to perform and in what sequence.
  • From Single-Step to Multi-Horizon: Traditional APIs optimize for single-request latency. Agentic systems must support long-running processes that may involve hundreds of API calls.
  • From Human-Driven to Agent-Driven: Traditional interfaces are designed for human users making deliberate requests. Agentic systems may generate thousands of API calls autonomously, requiring different patterns for rate limiting, error handling, and resource management.

Why APIs are Critical for Agentic Systems

APIs are not just a convenience for agentic systems; they are a fundamental necessity. They serve as the nervous system, enabling these intelligent entities to:

  • Enable Modularity and Interoperability: Decouple the core AI logic from external capabilities, allowing agents to interact with a diverse ecosystem of services (e.g., databases, external APIs, IoT devices) without needing to be rebuilt for each integration.
  • Bridge AI Capabilities with Real-World Actions: Provide the structured interface through which an agent’s internal reasoning translates into tangible actions in the digital or physical world. Without well-defined APIs, an agent’s intelligence remains confined to its internal processing.

Core API Requirements for Agentic Systems

Designing APIs for agentic systems demands specific considerations that go beyond traditional API design. These APIs must inherently support the dynamic, stateful, and long-horizon nature of agentic workflows.

Key requirements include:

    1. Dynamic Tool/Function Calling: Agents must be able to discover, understand, and invoke a wide array of external tools or functions on demand. The API needs to facilitate this dynamic binding and execution.
    2. Memory Management (Read/Write/Query): Agents require robust mechanisms to store, retrieve, and query various forms of memory, from short-term contextual information (e.g., current conversation state) to long-term factual knowledge. The API must provide interfaces for these memory operations.
    3. Long-Term Agent State Tracking: An agent’s “mind” or internal state (its current goal, progress, accumulated knowledge, and internal variables) needs to be persistently tracked and accessible. APIs must support reading and updating this complex, evolving state.
    4. Multi-Turn, Long-Horizon Workflows: Agentic tasks often span multiple interactions, require revisiting past states, and can take significant time to complete. The API design must accommodate these prolonged, asynchronous, and often interruptible processes.
    5. Workflow Orchestration: Support for complex, multi-step processes with branching logic, error recovery, and progress tracking.
    6. Observability and Control: Comprehensive monitoring, logging, and intervention capabilities to ensure safe and effective agent operation.

Foundational REST API Design for Agentic Systems

While agentic systems introduce unique challenges, standard architectural patterns like REST (Representational State Transfer) provide a solid foundation. We’ll leverage REST principles, adapting them to the specific needs of AI agents.

Fig. 1: Foundational REST API Design for Agentic Systems

Review of REST Principles for AI Agents Context

REST is an architectural style for distributed hypermedia systems. For API design, its core principles translate to:

  • Resources: Everything is a resource (e.g., an agent, a task, a memory entry). Resources are identified by unique Uniform Resource Identifiers (URIs).
  • URIs: Uniform Resource Identifiers (e.g., /agents/{agent_id}, /tasks/{task_id}) are used to identify resources. They should be intuitive and hierarchical.
  • HTTP Methods: Standard HTTP methods (GET, POST, PUT, DELETE, PATCH) map directly to CRUD (Create, Read, Update, Delete) operations on resources.
      • GET: Retrieve a resource or collection.
      • POST: Create a new resource or perform a non-idempotent operation.
      • PUT: Update/replace an existing resource (idempotent).
      • DELETE: Remove a resource.
      • PATCH: Partially update an existing resource.
  • Statelessness: Each request from client to server must contain all the information needed to understand the request. The server should not store any client context between requests.
      • Nuance for Agent State: While the API interaction itself should be stateless, the agent system being controlled via the API is inherently stateful. The API's role is to provide endpoints for managing (reading, writing, updating) that externalized agent state, not to store session state within the API gateway itself.
  • Representational State Transfer: Resources are represented using standard formats (like JSON or XML). The client manipulates the resource’s state by transferring representations.

Designing Agent-Centric Resources

Applying REST principles to agentic systems means modeling agents, their tasks, and their tools as discoverable and manipulable resources.

1. Agents as Resources

  • Represent the agent’s identity, high-level configuration, and meta-information.
  • URI Example: /agents/{agent_id}
  • Example Usage:
      • GET /agents/{agent_id}: Retrieve the profile and current high-level status of a specific agent. Response might include agent_id, name, description, owner, status (e.g., "active", "paused", "error").
      • POST /agents: Create a new agent instance.
      • PUT /agents/{agent_id}: Update an agent's configuration.

2. Tasks/Goals as Resources

  • These represent the specific objectives given to an agent and their progress.
  • URI Example: /agents/{agent_id}/tasks/{task_id} or a top-level /goals/{goal_id} if tasks are shared across agents.
  • Example Usage:
      • POST /agents/{agent_id}/tasks: Submit a new task to an agent. The request payload would describe the task. The response might return a task_id.
      • GET /agents/{agent_id}/tasks/{task_id}: Query the status and results of a specific task.
      • GET /agents/{agent_id}/tasks?status=in_progress: List all tasks for an agent, with filtering capability.
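
To make the task endpoints above concrete, here is a minimal in-memory sketch of the collection behind them. It is not tied to any framework; the names TaskStore, create, get, and list_tasks are hypothetical and only mirror the three calls listed above.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Task:
    task_id: str
    agent_id: str
    description: str
    status: str = "pending"


class TaskStore:
    """Hypothetical in-memory stand-in for the /agents/{agent_id}/tasks collection."""

    def __init__(self):
        self._tasks = {}
        self._ids = count(1)

    def create(self, agent_id, description):
        # Mirrors POST /agents/{agent_id}/tasks: the caller gets back a task_id.
        task = Task(f"task-{next(self._ids)}", agent_id, description)
        self._tasks[task.task_id] = task
        return task

    def get(self, agent_id, task_id):
        # Mirrors GET /agents/{agent_id}/tasks/{task_id}; None stands in for a 404.
        task = self._tasks.get(task_id)
        return task if task is not None and task.agent_id == agent_id else None

    def list_tasks(self, agent_id, status=None):
        # Mirrors GET /agents/{agent_id}/tasks?status=...; status=None means no filter.
        return [
            t for t in self._tasks.values()
            if t.agent_id == agent_id and (status is None or t.status == status)
        ]
```

Scoping every lookup by agent_id keeps the nested URI honest: a task ID that belongs to another agent behaves as if it does not exist.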
    

3. Tools/Functions as Resources (or Discoverable Endpoints):

  • While tools are invoked by agents, the tools themselves can be managed as resources for discovery and administration.
  • URI Example: /tools/{tool_name} or a more general /functions/
  • Example Usage:
      • GET /tools: Retrieve a list of all available tools that agents can use.
      • GET /tools/{tool_name}: Get detailed information about a specific tool, including its capabilities, required parameters, and usage instructions (metadata for the agent).
      • POST /tools/{tool_name}/invoke: Invoke a tool directly. This style is less RESTful, but a dedicated invocation endpoint (or a broader /actions resource) is often preferred in practice. A more RESTful approach might be for the agent to directly call a service endpoint that represents the tool, e.g., POST /calendar/events to create a calendar event, with the agent acting as the orchestrator.

Data Formats and Schemas (JSON, Protobufs)

Consistent data formats and rigorous schemas are paramount for reliable agent-API interaction. Agents need to understand the structure of data they send and receive. Using schema definitions (e.g., JSON Schema, or Protobuf .proto files) for agent state, memory entries, and tool inputs/outputs is crucial.

  • JSON (JavaScript Object Notation): Widely adopted for its human-readability and flexibility, making it a common choice for REST APIs.
  • Protobufs (Protocol Buffers): A language-neutral, platform-neutral, extensible mechanism for serializing structured data. Protobufs offer better performance and smaller message sizes, which can be critical for high-volume agent interactions.

This ensures that:

  • Agents can correctly parse responses and formulate requests.
  • API contracts are clear and enforceable.
  • Evolution of APIs is managed with backward compatibility.

Imagine a task resource representing an agent’s objective. Its JSON schema might look like this:

{
  "type": "object",
  "properties": {
    "task_id": {
      "type": "string",
      "description": "Unique identifier for the task."
    },
    "description": {
      "type": "string",
      "description": "A natural language description of the task."
    },
    "status": {
      "type": "string",
      "enum": ["pending", "in_progress", "completed", "failed", "paused"],
      "description": "The current status of the task."
    },
    "priority": {
      "type": "integer",
      "minimum": 1,
      "maximum": 5,
      "description": "Priority of the task, 1 (highest) to 5 (lowest)."
    },
    "assigned_agent_id": {
      "type": "string",
      "description": "ID of the agent assigned to this task, if any."
    },
    "parameters": {
      "type": "object",
      "description": "Additional parameters specific to the task type, e.g., target URL for a 'web_scrape' task."
    },
    "results": {
      "type": "object",
      "description": "Output or results once the task is completed or has partial results."
    },
    "created_at": {
      "type": "string",
      "format": "date-time",
      "description": "Timestamp when the task was created."
    },
    "last_updated_at": {
      "type": "string",
      "format": "date-time",
      "description": "Timestamp of the last status or data update."
    }
  },
  "required": ["task_id", "description", "status"]
}

This schema defines the structure, data types, and constraints for a task resource, ensuring both human developers and AI agents can reliably interact with it.
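
As a rough illustration of enforcing that contract, the sketch below hand-checks a few of the schema's constraints (required fields, the status enum, the priority range) using only the standard library. The function name validate_task is hypothetical, and a real service would more likely rely on a full JSON Schema validator than on checks like these.

```python
ALLOWED_STATUSES = {"pending", "in_progress", "completed", "failed", "paused"}
REQUIRED_FIELDS = ("task_id", "description", "status")
STRING_FIELDS = ("task_id", "description", "status", "assigned_agent_id")


def validate_task(task):
    """Return a list of human-readable problems; an empty list means the task passed."""
    errors = []
    for name in REQUIRED_FIELDS:
        if name not in task:
            errors.append(f"missing required field: {name}")
    for name in STRING_FIELDS:
        if name in task and not isinstance(task[name], str):
            errors.append(f"{name} must be a string")
    status = task.get("status")
    if isinstance(status, str) and status not in ALLOWED_STATUSES:
        errors.append(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    if "priority" in task:
        priority = task["priority"]
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if not isinstance(priority, int) or isinstance(priority, bool):
            errors.append("priority must be an integer")
        elif not 1 <= priority <= 5:
            errors.append("priority must be between 1 and 5")
    return errors
```

Returning every problem at once, rather than failing on the first, gives an agent enough information to repair its request in a single retry.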

Asynchronous Interactions and Webhooks

Many agentic tasks are long-running and cannot be completed within a single synchronous API request-response cycle. This necessitates asynchronous communication patterns.

Synchronous Request, Asynchronous Processing:

The most common RESTful approach for long-running tasks.

    1. The client (e.g., an external system or another agent) sends a POST request to initiate a task.
    2. The API immediately responds with a 202 Accepted status code, indicating that the request has been received and will be processed. The response body includes a job ID and a status URL.
    3. The client can then poll the status URL (GET /tasks/{job_id}/status) to check for completion or updates.
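
The client side of these steps can be sketched as a small polling loop. This is only an illustration: fetch_status is a hypothetical caller-supplied function that performs the GET on the status URL and returns the parsed JSON body, and the doubling delay between polls is an assumption, not something the pattern itself requires.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_done(fetch_status, max_attempts=10, initial_delay=1.0,
                    max_delay=30.0, sleep=time.sleep):
    """Call fetch_status() until the task reaches a terminal status.

    The sleep function is injectable so the loop can be exercised
    without actually waiting.
    """
    delay = initial_delay
    for attempt in range(max_attempts):
        body = fetch_status()
        if body.get("status") in TERMINAL_STATUSES:
            return body
        if attempt < max_attempts - 1:
            sleep(delay)
            # Wait longer after each unfinished poll, up to max_delay.
            delay = min(delay * 2, max_delay)
    raise TimeoutError(f"task still running after {max_attempts} polls")
```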

 

Webhooks for Asynchronous Notifications:

Polling can be inefficient. For truly event-driven and long-horizon workflows, webhooks are a superior alternative.

    1. When initiating a task via POST, the client includes a callback_url parameter in the request.
    2. Once the task’s status changes (e.g., from in_progress to completed or failed), the API server makes an outgoing POST request to the provided callback_url, sending the updated task status and results. This pushes information to the client instead of requiring constant pulling.

Example Flow (Webhooks):

Client initiates task: POST /agents/{agent_id}/tasks

Request Body:

{
  "description": "Generate a summary of Q3 financial reports.",
  "parameters": { "report_ids": ["R123", "R456"] },
  "callback_url": "https://your-service.com/api/agent-callbacks"
}

API responds immediately: HTTP/1.1 202 Accepted

Response Body:

{
  "job_id": "xyz123",
  "status_url": "/tasks/xyz123/status",
  "message": "Task initiated successfully."
}

Agent processes task (long-running).

Task completes.

API sends webhook notification to client: POST https://your-service.com/api/agent-callbacks

Request Body:

{
  "job_id": "xyz123",
  "status": "completed",
  "results": {
    "summary_url": "https://storage.example.com/summaries/q3.pdf"
  },
  "last_updated_at": "2025-07-20T18:30:00Z"
}

This webhook-based approach is essential for supporting multi-turn, long-horizon workflows where agents might need to react to external events or signal completion of a lengthy process.
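
On the receiving end, the client needs some logic to apply the notification. The sketch below works on the payload shape from the example flow above; apply_callback and the jobs dictionary are hypothetical, and the guard against duplicate or out-of-order deliveries is an assumption about how webhooks tend to behave, not something the flow above specifies.

```python
from datetime import datetime


def parse_timestamp(value):
    # fromisoformat() only accepts a trailing "Z" from Python 3.11, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def apply_callback(jobs, payload):
    """Apply a webhook payload to a local job table; return True if it changed anything."""
    job_id = payload["job_id"]
    incoming = parse_timestamp(payload["last_updated_at"])
    current = jobs.get(job_id)
    if current is not None and parse_timestamp(current["last_updated_at"]) >= incoming:
        return False  # duplicate or stale notification, ignore it
    jobs[job_id] = {
        "status": payload["status"],
        "results": payload.get("results"),
        "last_updated_at": payload["last_updated_at"],
    }
    return True
```

Comparing last_updated_at keeps the handler idempotent: replaying the same notification leaves the job table unchanged.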

Foundational Design Principles

Modular and Composable Endpoints

Principle

Design each endpoint to perform a single, well-defined function that can be combined with other endpoints to create complex behaviors.

Implementation Strategy

  • Create fine-grained endpoints that map to atomic operations.
  • Ensure endpoints can be called in any logical sequence without breaking system consistency.
  • Design resource representations that include necessary context for subsequent operations.
  • Avoid endpoints that assume specific calling patterns or sequences.

Example Structure

POST /agents/{agentId}/memory/store
GET /agents/{agentId}/memory/query
PUT /agents/{agentId}/goals/{goalId}
POST /agents/{agentId}/tools/invoke
GET /agents/{agentId}/state

Each endpoint handles a specific aspect of agent operation, allowing agents to compose these operations into complex workflows.

Agent State Management

Principle

Provide explicit, structured management of agent state that supports both persistence and efficient access patterns.

Core State Categories

  • Working Memory: Current context, active tasks, and immediate operational data.
  • Long-Term Memory: Historical information, learned patterns, and persistent knowledge.
  • Goal State: Current objectives, priorities, and success metrics.
  • Execution State: Workflow progress, pending operations, and error conditions.

Design Patterns

  • Use consistent state schema across all endpoints.
  • Support both full state retrieval and incremental updates.
  • Implement versioning for state objects to handle concurrent modifications.
  • Provide query capabilities for complex state structures.
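
One possible way to implement the versioning point is optimistic concurrency: every update names the version it was based on and is rejected if the state has moved on. The sketch below assumes that approach; AgentStateStore and VersionConflict are hypothetical names, and the store is in-memory only.

```python
class VersionConflict(Exception):
    """Raised when an update is based on an outdated version of the state."""


class AgentStateStore:
    def __init__(self):
        self._states = {}  # agent_id -> (version, state dict)

    def read(self, agent_id):
        version, state = self._states.get(agent_id, (0, {}))
        return version, dict(state)  # copy, so callers cannot mutate stored state

    def update(self, agent_id, expected_version, changes):
        """Apply an incremental update if the caller saw the latest version."""
        version, state = self._states.get(agent_id, (0, {}))
        if version != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, current is {version}"
            )
        self._states[agent_id] = (version + 1, {**state, **changes})
        return version + 1
```

A caller that hits VersionConflict would typically re-read the state, reapply its change, and try again.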

Function Registry Architecture

Principle

Implement dynamic function discovery and invocation through a centralized registry that agents can query and utilize at runtime.

Registry Components

  • Function Catalog: Available functions with descriptions, parameters, and return types.
  • Capability Matching: Logic to help agents discover relevant functions for specific tasks.
  • Dynamic Binding: Runtime function invocation with parameter validation and result handling.
  • Version Management: Support for evolving function interfaces without breaking existing agents.

API Pattern

GET /functions                    # Discover available functions
GET /functions/{functionId}       # Get detailed function specification
POST /functions/{functionId}/invoke # Execute function with parameters
GET /functions/search?capability={capability} # Find functions by capability
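
Behind those four endpoints, the registry itself can be quite small. The following is a minimal sketch under stated assumptions: FunctionSpec, FunctionRegistry, and their fields are hypothetical, capability matching is a plain tag lookup, and parameter validation only checks that required names are present.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FunctionSpec:
    function_id: str
    description: str
    capabilities: set
    required_params: tuple
    handler: Callable


class FunctionRegistry:
    def __init__(self):
        self._functions = {}

    def register(self, spec):
        self._functions[spec.function_id] = spec

    def search(self, capability):
        # Backs GET /functions/search?capability=...; returns matching IDs, sorted.
        return sorted(
            fid for fid, spec in self._functions.items()
            if capability in spec.capabilities
        )

    def invoke(self, function_id, **params):
        # Backs POST /functions/{functionId}/invoke with basic parameter validation.
        spec = self._functions[function_id]
        missing = [p for p in spec.required_params if p not in params]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        return spec.handler(**params)
```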

Transparent Operations

Principle

All agent operations should be observable, auditable, and explainable through comprehensive logging and state tracking.

Transparency Requirements

  • Decision Logging: Record the reasoning behind agent decisions and actions.
  • Execution Tracking: Monitor progress through complex workflows with detailed timestamps.
  • State Changes: Log all modifications to agent state with before/after snapshots.
  • External Interactions: Track all tool usage and external API calls with full context.

Implementation Approach

  • Embed logging capabilities directly into core API operations.
  • Use structured logging formats that support automated analysis.
  • Provide query APIs for accessing historical operation data.
  • Implement configurable logging levels for different operational needs.
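
A small sketch of what structured logging with a query path could look like, assuming JSON lines as the format. The functions log_event and query_events are hypothetical, and a plain list stands in for whatever log store a real system would use.

```python
import json
from datetime import datetime, timezone


def log_event(sink, agent_id, event_type, **details):
    """Append one structured record to sink as a JSON line and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "event_type": event_type,  # e.g. "decision", "state_change", "tool_call"
        "details": details,
    }
    sink.append(json.dumps(record, sort_keys=True))
    return record


def query_events(sink, agent_id=None, event_type=None):
    """Filter historical records; None means 'do not filter on this field'."""
    events = [json.loads(line) for line in sink]
    return [
        e for e in events
        if (agent_id is None or e["agent_id"] == agent_id)
        and (event_type is None or e["event_type"] == event_type)
    ]
```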

Fail-Safe Design

Principle

Build robust error handling, recovery mechanisms, and safety constraints directly into API design.

Fail-Safe Components

  • Circuit Breakers: Prevent cascading failures by stopping operations when error rates exceed thresholds.
  • Timeout Management: Implement configurable timeouts for all operations with graceful degradation.
  • Recovery Mechanisms: Provide APIs for agents to recover from partial failures and resume operations.
  • Safety Constraints: Enforce operational boundaries and prevent potentially harmful actions.
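
To illustrate the circuit breaker component, here is a deliberately simplified sketch. It trips on a count of consecutive failures rather than on an error rate, and it allows one trial call after a cool-down; CircuitBreaker, CircuitOpenError, and the parameter names are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock  # injectable so the cool-down can be tested
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: allow one trial call. A single failure re-opens.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```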

Error Response Strategy

  • Return structured error objects with sufficient context for agent decision-making.
  • Include recovery suggestions and alternative approaches in error responses.
  • Implement progressive backoff strategies for retryable operations.
  • Provide clear distinctions between temporary and permanent failures.
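
The sketch below shows one way these points could fit together: a structured error body that marks a failure as temporary or permanent, and the decision an agent might derive from it. The field names (code, temporary, suggestion) and the functions make_error and next_action are assumptions chosen for illustration.

```python
def make_error(code, message, *, temporary, suggestion=None):
    """Build a structured error body; 'temporary' separates retryable failures."""
    return {
        "error": {
            "code": code,
            "message": message,
            "temporary": temporary,
            "suggestion": suggestion,
        }
    }


def next_action(error_body, attempt, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Decide what to do after a failed call.

    Returns ("retry", delay_in_seconds) or ("give_up", None). The delay
    doubles with each attempt (attempt starts at 0) and is capped at max_delay.
    """
    error = error_body["error"]
    if not error["temporary"] or attempt >= max_attempts:
        return ("give_up", None)
    return ("retry", min(max_delay, base_delay * 2 ** attempt))
```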

The post Building APIs for an Agentic World appeared first on ML Conference.

]]>
Is Cursor Evolving into a Developer AI Cloud Platform? https://mlconference.ai/blog/cursor-ai-developer-cloud-platform/ Fri, 20 Feb 2026 09:53:47 +0000 https://mlconference.ai/?p=1079949 This article explores the recent features released in the Cursor ecosystem, including tab completion models, Plan mode, security support, local agents, remote coding agents, bug review agents, and system prompts and AGENTS.md support, all of which are changing software engineering. It also covers the ability to switch seamlessly between different AI models and to integrate with classic IDEs through Cursor CLI.

The post Is Cursor Evolving into a Developer AI Cloud Platform? appeared first on ML Conference.

]]>
In just a few years, Cursor AI, the first product from Anysphere, has gained huge traction in the software market, and Anysphere’s valuation has risen to $9 billion in recent months.

Nowadays, Cursor AI is considered the leader in the developer AI tools market and offers multiple ways to increase software developer productivity. Many companies are reviewing their policies regarding paid developer AI tools and how to keep up with the fast-paced evolution of LLM capabilities.

A few months ago, Anysphere released Cursor CLI and the Cursor Background Agents API (also referred to as the Cloud Agents API), which offer new ways to interact with models in your pipeline workflows. With a single company subscription, you can manage Cursor usage for all your engineers, and with the new User API Keys, you can control access to Cursor from your pipelines. Let’s explore these new products in this article.

Autocompleting your code with Cursor AI Tab model

When you start using Cursor AI, the desktop IDE, with your Java repository, its ability to predict your next steps and autocomplete code comes from Cursor Tab, a specialized model that works with your Java classes, records, or interfaces. The Tab model can autocomplete missing parts such as imports, getter/setter methods, and initial logic associated with a method signature. And this is the magic: for methods with good naming, good Javadoc, and good signatures, the Tab model can sometimes predict the logic inside the method. For example, if you have to create a test, it can suggest a few ideas for implementing it. Another nice use case is repetitive work within a class: the Tab model can autocomplete the next action based on the software engineer’s previous actions.

Improving the planning phase with a new Plan mode

Traditionally, Cursor AI has included two development modes, Ask and Agent. The first is dedicated to answering questions about code or other topics, like: ‘Can you provide functional alternatives to this Java method?’ Agent mode is designed to delegate a task to be executed by models, like: ‘Can you refactor this method using alternative 3 provided before and verify changes with ./mvnw clean verify?’ But sometimes, when you are using Agent mode and the task is a bit more complex than usual, it requires some planning, just like in the real world, and Cursor recently added this feature.

Fig. 1: New Cursor Plan mode

Using Plan mode, Cursor AI analyses the user prompt in detail and designs a specific plan to solve your problem.

Fig. 2: Following a Cursor plan mode in action

Data privacy ensured

One of the most common aspects of AI tooling that generates questions is everything related to security. When you use models, you send your corporate code in the HTTP requests to Cursor and later to the different models that you use, so Cursor needs to implement safe policies to protect your code. For this purpose, Cursor is SOC 2 Type 2 certified, the main architectural components are audited, and it has clear agreements with main model providers.

Fig. 3: Following a Cursor plan mode in action

Users can configure the data privacy options in the IDE:

Fig. 4: Following a Cursor plan mode in action

Alternatively, the options can be configured centrally in the dashboard:

Fig. 5: Following a Cursor plan mode in action

Using Cursor capabilities from the terminal, the CLI alternative

Not everyone feels comfortable with every Java IDE; after all, it is a tool you use for several hours every day. If that is your case, you could consider using Cursor CLI. With this motivation in mind, Anysphere expanded its offering to software developers by providing a CLI tool to interact with the Cursor platform for your development activities.

Installing Cursor Agent, the Cursor CLI, is super easy. Open a terminal and execute:

curl https://cursor.com/install -fsS | bash

Once you have installed it, type:

cursor-agent

Fig.6: Cursor Agent installation screen

With this alternative to Cursor AI, developers now have more options for their daily work. On the other hand, you could use Cursor AI in cloud dev environments with the help of Devcontainers without any issue.

Review your Pull requests with BugBot

If your team uses pull requests to merge features into the main branch, you might consider enriching the PR experience with BugBot, which reviews pull requests and identifies bugs, security issues, and code quality problems.

Fig.7: Cursor BugBot configuration

This solution could be a good complement for the manual review or the automatic static code analysis.

Delegating development tasks to Cursor background agents API

Recently, Cursor released a new product named Cursor Background Agents API, which allows organizations to handle the full lifecycle of AI-powered coding agents to work on your GitHub repositories and create PRs in a programmatic way. This new service is organized into 3 different sets of endpoints:

  • Agent Management (launch an agent, follow up, delete an agent)
  • Agent Information (get agent status, list agents, and retrieve the agent conversation)
  • General (list models, list keys, and list GitHub repositories)

This idea is awesome because using a REST API client, it is easy to integrate with this new cloud service and you could enrich your pipelines with new automation or delegate some tasks from local to the cloud.

Example of a potentially enhanced data pipeline:

Fig. 8: Automation workflow scenario with Cursor Background Agent API

To use this solution, you need to generate an API KEY from your dashboard:

Fig. 9: API Key generation example

Once you have the API key, you can launch a remote agent from your terminal like this:

curl -X 'POST' \
  'https://api.cursor.com/v0/agents' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d '{
    "prompt": {
      "text": "Create a Java Hello World program and verify the results when compile and execute it"
    },
    "source": {
      "repository": "https://github.com/your-org/your-repo",
      "ref": "main"
    },
    "target": {
      "autoCreatePr": true
    },
    "model": "Default"
  }'
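
If you would rather launch the agent from a pipeline script than from curl, the same request can be assembled in Python. This is only a sketch that mirrors the curl call above with the standard library: the function name build_launch_request is hypothetical, the endpoint and body fields are copied from the curl example, and nothing here has been checked against the service beyond that.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
API_URL = "https://api.cursor.com/v0/agents"


def build_launch_request(token, prompt_text, repository, ref="main",
                         auto_create_pr=True, model="Default"):
    """Build, but do not send, the POST request that launches an agent."""
    body = {
        "prompt": {"text": prompt_text},
        "source": {"repository": repository, "ref": ref},
        "target": {"autoCreatePr": auto_create_pr},
        "model": model,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it. Keeping construction separate from sending makes the payload easy to inspect in a pipeline before any agent is launched.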

Or you could generate a Java Http Client using an OpenAPI generator from the source: https://cursor.com/docs-static/background-agents-openapi.yaml

If you need to understand the details about the different endpoints, you could review the following online resources:

Note: This service is in Beta phase.

Cursor rules homogenize model responses to your user prompts in Java.

Cursor rules define a set of instructions given to an AI model that determine how it should behave. This idea is valuable because not every organization implements solutions in the same way, so being able to define guidelines about different aspects of your development is a great help. Imagine one company with a functional programming culture that wants to use newer Java features such as lambdas, records, pattern matching, and sealed classes, and another organization that is not interested in this style: each could instruct models to return answers with its own preferences in mind.

Cursor provides a way to create Cursor rules by taking ideas from the repository, or you could use specialized Cursor rules defined in ready-to-use repositories from GitHub or websites.

Cursor adopted the AGENTS.md initiative

Recently, Cursor adopted the new AGENTS.md initiative so that Cursor products understand the AGENTS.md file, which describes how models should work with a particular Git repository. Most repositories include a README.md file, which helps software engineers understand the repository’s purpose, how to get started, and how to contribute. AGENTS.md closes the loop because it is designed for models: it adds the information they need, such as build steps, testing approach, and conventions that might clutter a README or aren’t relevant to human contributors.

Conclusions

Cursor continues adding new useful capabilities for software development, and the team behind the different products/services releases with high cadence. Every new feature added to their products enriches the Software Development Life Cycle (SDLC) in different aspects. Analyzing the different DORA metrics with the different Cursor products/services, the positive impact is clear:

Deployment Frequency (How often a team releases code to production) | Lead Time for Changes (The time it takes for a code commit to be deployed into production) | Change Failure Rate (The percentage of deployments that result in a failure in the production environment) | Mean Time to Recover (The average time it takes to restore service after a production failure)
Cursor Tab Model X X
Plan Mode X X
Plan Agent X X X
Background Agents X X X
Bug Bot X X

If you are a new user of the Cursor product, you could take out a subscription and experiment with the autocomplete features from the Cursor Tab Model. Later, use Ask mode to ask questions about different alternatives to implement a feature, then refactor the code from Ask mode using Agent mode, and finally go a bit further and try to solve a complete feature from scratch by creating a plan based on Plan mode to see the results. We are living in a new age of software development.

The post Is Cursor Evolving into a Developer AI Cloud Platform? appeared first on ML Conference.

]]>
Can MCP Enable Truly Cooperative AI Agents? https://mlconference.ai/blog/can-mcp-enable-truly-cooperative-ai-agents/ Thu, 15 Jan 2026 15:34:10 +0000 https://mlconference.ai/?p=1079767 The Model Context Protocol (MCP) is a new technical standard designed to solve the biggest challenge facing AI agents: their inability to work together. MCP provides a universal "handshake" that allows agents from different providers (like OpenAI, Google, and Anthropic) to discover each other's skills, share necessary data, and collaborate on complex tasks. This breakthrough enables true multi-agent orchestration, where specialized agents can hand off sub-tasks automatically, finally paving the way for a "chorus" of cooperative AI.

The post Can MCP Enable Truly Cooperative AI Agents? appeared first on ML Conference.

]]>
Picture this: you give a single voice command and, minutes later, an OpenAI-powered writing agent drafts an event brochure, a Gemini spreadsheet agent reconciles supplier invoices, and a Claude negotiation agent emails final quotes, each working from the same facts, updating the same timeline, and handing off sub-tasks automatically.

That choreography is still rare, but in 2025 it finally feels within reach because the industry is rallying around a new wiring standard called the Model Context Protocol (MCP). MCP provides a universal handshake that lets any AI agent advertise what it can do, discover what others can do, and stream precisely the context each step requires.

Let’s unpack why interoperability has held agents back, how MCP fixes the plumbing, and why Microsoft’s embrace of the protocol may accelerate an era of truly cooperative AI.

Understanding AI Agents and the Interoperability Gap

AI agents aren’t merely smarter chatbots. They perceive their environment, break an objective into smaller goals, choose the best tool for each step, and learn from the outcome so the next attempt is better. GitHub’s new Agent Mode in Visual Studio Code is a case in point: it refactors multi-file codebases, issues terminal commands, and patches runtime errors until tests pass—often without another engineer touching a keyboard.

Yet autonomy creates a new problem: isolation. Enterprises already deploy multiple brand-specific agents: think Claude for coding, Gemini for analytics, and ChatGPT for customer support. Each is effective in its own sandbox yet blind to the others’ memories. This means end users juggle three conversations instead of one, while institutional knowledge fragments.

It’s estimated that 85% of enterprises will operate more than one agent this year, but with nothing like the inter-agent coherence we expect from human teams.

Traditional REST or GraphQL APIs were meant to be glue, but they assume the user knows the exact endpoint and schema. Agents, by contrast, can explore the tools they can access and find resources that can sharpen their reasoning. What if those tools were other AI agents?

Model Context Protocol: A Universal Language for Agents

MCP was introduced last year and has been refined since. It represents a radical step forward in the potential capabilities of AI agents.

Think of MCP as a universal language for AI cognition. An application can attach itself as an MCP server advertising three things:

  • Tools it can execute (for example, create_invoice or run_sql_query).
  • Read-only resources it can share (say, a PDF or a database schema).
  • Reusable prompt templates.

An MCP client, typically the AI agent, starts by asking the server what capabilities exist, then decides which to invoke as reasoning unfolds. Discovery is baked in, so a client that meets a new server at runtime adapts automatically. Connect a Sentry MCP server to an incident-management agent and, with no new code, that agent learns it can pull stack traces and link them to remediation steps.
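
For a feel of what that discovery step looks like on the wire, here is a rough sketch of the messages involved. MCP is built on JSON-RPC 2.0, and tools/list and tools/call are the protocol's method names for discovery and invocation. Everything else is simplified for illustration: the helper functions are hypothetical, transport and the initial handshake are left out, and the customer_id argument is invented.

```python
from itertools import count

_request_ids = count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request message with a fresh id."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request():
    # Discovery: ask the server which tools it offers.
    return jsonrpc_request("tools/list")


def call_tool_request(name, arguments):
    # Invocation: call one discovered tool with its arguments.
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


def tool_names(list_tools_result):
    """Extract tool names from the 'result' part of a tools/list response."""
    return [tool["name"] for tool in list_tools_result["tools"]]
```

Because the client learns tool names at runtime from the tools/list result, nothing in this code needs to change when a different server is attached.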

Want to make a change? Replace Sentry with Datadog, and the conversation pattern hardly changes, because the agent discovers the new server’s capabilities in the same way.

Another breakthrough is Context Protocols. MCP messages can carry arbitrary chunks of text or embeddings, so an agent can request ‘customer 12345’s order notes’ and receive only the paragraphs its model can digest, trimming token costs while protecting privacy. Where REST asks, ‘What function do you want to run?’, MCP first asks, ‘What do you already know, and what extra context will sharpen your reasoning?’.

An AI agent automating cloud optimization could communicate with other agents to prioritise resource deployments, making the process much more efficient. It could go much deeper than tracking historic usage and identifying ‘peak times’: it would understand which deadlines and projects are high priority, and allocate resources based on that context.

Microsoft’s Bold Bet on MCP

Microsoft detected the MCP upside early on. They’ve partnered with Anthropic to release an official C# SDK, letting any .NET service become an MCP server or client with a few annotations. GitHub has now rolled MCP into Agent Mode for every Visual Studio Code user, instantly opening a marketplace of servers, from Playwright for browser automation to Notion for documentation, in one update.

MCP Everywhere in Copilot Studio

MCP has been declared generally available inside Copilot Studio, Microsoft’s low-code canvas for business agents. Makers can now drag an MCP connector onto the canvas, point it at an Azure API Management gateway, and grant an AI agent controlled access to any tool the organisation has registered, with Azure API Center acting as a private catalogue of trusted servers.

Multi-Agent Orchestration

Most eye-catching, though, was multi-agent orchestration. Instead of scripting a single super-Copilot, builders can link specialised agents, like sales, legal, and DevOps, so they delegate tasks to one another. A Copilot Studio agent might pull CRM data, hand it to a Microsoft 365 agent to draft a Word proposal, then trigger another agent to schedule Outlook follow-ups, all without human nudging.

A Converging Protocol Landscape

Interoperability isn’t a Microsoft-only crusade. Google has unveiled the open Agent-to-Agent (A2A) protocol aimed at secure information exchange between agents, signalling that the majors prefer convergence over yet another standards war. Microsoft promptly added A2A bridging in Copilot Studio for agents that already speak MCP, betting on a layered approach akin to the web’s TCP/IP stack.

Tooling and Runtime Support

Support is rippling outward. Visual Studio, JetBrains IDEs, and Eclipse now auto-discover local MCP servers, while Windows maintains a per-machine registry so desktop apps can publish capabilities without magic ports. Azure AI Foundry rounded things off by exposing an MCP endpoint for every model it hosts, meaning a freshly fine-tuned proprietary model can drop into agent workflows with no glue code.

Towards Truly Cooperative Agents

Once agents share a protocol, new patterns emerge. A travel-booking agent can store your seat preference and hand it to a finance agent reconciling expenses, no fragile database sync required. Agents wired together can open tickets, fetch logs, and suggest patches inside the same Slack thread, turning multi-step incidents into single conversations.

There’s a clear appetite for this level of interoperability, as protocol-level interoperability could be the top enabler for scaling agentic AI. A bank would be far more willing to let a Gemini-powered compliance agent vet loan documents when it can rely on an MCP handshake to fetch them from a GPT-powered classifier, with OAuth scopes and audit trails enforced end-to-end.

The ‘Internet of Agents’ Vision

There’s a lot of chatter about how MCP could enable an ‘Internet of Agents’. Just as HTTP, TCP, and DNS let millions of web servers cooperate without sharing code, MCP (plus A2A) could let agents publish their tool catalogues and subscribe to others’. A personal health agent might grant a nutrition agent read-only access to biometric data and, in return, call its meal-planning tool. Capability scopes embedded in MCP metadata would lock the contract, and either agent could be swapped out without rewriting the rest of the system.

For developers, the payoff is simplicity. Instead of importing SDKs for Salesforce, ServiceNow, and Confluence, they register those systems as MCP servers. At reasoning time, the agent decides which tool to call, and when a new SaaS vendor ships an MCP server, integration is instantaneous. Software begins to resemble a colony of cooperating experts rather than a brittle monolith of APIs.

Conclusion

The Model Context Protocol tackles a deceptively mundane yet existential question: how can thinking machines share what they know? MCP frees agents from their silos without forcing developers to rewrite the internet.

If the vision holds, tomorrow’s users will no longer pick an ‘OpenAI agent’ or a ‘Google agent.’ They will state a goal, and a chorus of cooperative agents will decide, negotiate, and execute behind the scenes. The real question may no longer be whether MCP can enable truly cooperative agents, but what new kinds of work and creativity will emerge once the walls between AI agents finally fall.

The post Can MCP Enable Truly Cooperative AI Agents? appeared first on ML Conference.

]]>
Model Context Protocol Servers and Security: What You Need to Know https://mlconference.ai/blog/model-context-protocol-servers-and-security-what-you-need-to-know/ Mon, 13 Oct 2025 06:46:59 +0000 https://mlconference.ai/?p=108419 Model Context Protocol doesn’t include all the necessary security right out of the box. This article will walk you through how to secure yourself against common attack vectors, including weak authentication, prompt injection, and broad authorization to keep your data safe from bad actors.

The post Model Context Protocol Servers and Security: What You Need to Know appeared first on ML Conference.

]]>
Use of Model Context Protocol (MCP) has grown exponentially in the past year. Since its initial launch by Anthropic as a way to manage autonomous AI agents, MCP has become the de facto standard for connecting AI application components. With MCP, users can create AI agents that can move assets, alter data, and execute business processes – with or without human oversight. MCP relies on servers to connect and manage interaction between agents, processes, and data.

But like many developer-focused projects built to scale rapidly, MCP does not include much in the way of security out of the box. Given the kinds of data AI is often given access to, a lack of robust security can pose a massive risk – and with so much power available, MCP servers are a prime, high-risk target for threat actors.

This is a pattern we have seen before with application programming interfaces, or APIs. So, what can we learn from the ways we’ve learned to manage and harden APIs, and how can this be applied to MCP before potential threats turn into real risks?

Learning from API history

APIs have become the preferred way for developers to build their software. Using APIs, you can connect your software builds to third-party tools, or to cloud services, and create that experience faster. Using microservices, you can add more functionality into your components and APIs without taking down the entire application to achieve your goal. And from a security perspective, having those feature-rich APIs from vendors means that you can get more data on what is happening, then use that insight to automate some of your security operations.

The logic then, with APIs, is the same as it is now with MCP servers – getting that integration to work fast makes things easier for developers. However, it also introduces new classes of risk. For example, a simple error in an application component attached to one API can affect the rest of the application, either leading to interruptions in performance or security vulnerabilities. While APIs made it faster to build, they also made it easier to scale up that mistake over time. Now, imagine this same potential risk amplified by autonomous agents acting at machine speed.

But MCP servers are not just APIs – they provide the operational backbone for agentic AI. Legacy APIs are deterministic and act in the same way time after time. MCPs don’t have that degree of control. Instead, they operate differently based on context in order to empower their large language models to take action. The protocol often assumes that both the requestor and the object requested are benign, so requests are not validated before they are acted on.

From a security perspective, this is a significant miss that can lead to unintended consequences. An attacker could trigger the MCP to leak data or move data to an unauthorised location. It could also trigger a workflow that should not be allowed, or attempt to sabotage operations altogether.

There are three security issues here: weak authentication, prompt injection, and broad authorization. In response, regulators in the EU and the US have stepped up to require organisations to address these risks directly under the auspices of both the EU AI Act and NIST’s AI Risk Management Framework. Developers should be aware that MCP security is something they will have to address sooner rather than later.

How to secure MCP deployments

To address this problem, developers and security teams must work together. The challenge is how to effectively solve these problems before deployment, but the opportunity is that any effort will improve your overall long-term approach. Doing so should reduce management overhead and simplify security over time.

To start, you must understand how to approach authentication and credential management. We all know that multi-factor authentication should be in place, but this should go one step further when it comes to MCP. Rather than using static tokens, look at how you can use short-lived tokens and credentials that rotate over time. This prevents attackers from stealing your tokens and using them for other attacks. On the security side, you should monitor for token misuse and revoke any credentials that you don’t actually need.
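
As a rough illustration of the short-lived credential idea, the sketch below issues an HMAC-signed token with an expiry and rejects it once that expiry passes. It uses only the Python standard library; the function names, the five-minute default, and the hard-coded `SECRET` are assumptions for the example, and a real deployment would more likely rely on an established token standard and a secrets manager.

```python
import base64
import hashlib
import hmac
import json
import time

# Assumption: in practice this key comes from a secrets manager and is rotated.
SECRET = b"demo-secret-rotate-me"

def issue_token(subject, ttl_seconds=300, now=None):
    """Return a signed token that expires ttl_seconds after 'now'."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": subject, "exp": now + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload)
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    return (body + b"." + sig).decode()

def verify_token(token, now=None):
    """Return the claims if the signature is valid and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        body, sig = token.encode().rsplit(b".", 1)
    except ValueError:
        return None  # no signature part at all
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or signed with another key
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] <= now:
        return None  # expired: a stolen token stops working here
    return claims
```

A stolen token is only useful until its `exp` time, which is what limits the value of credential theft compared with a static token.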

Permissions and access aren’t just a concern for humans, though. The biggest reason agentic AI has surged in popularity is that it can act on its own – a defining feature that opens up potential problems when the AI agents have free rein. How much are you willing to allow your systems to act independently, and how many processes will require a human to be in the loop? This is a business risk conversation rather than a decision that developers should make on their own.

After defining those boundaries for agentic AI, you can look at tightening permissions around who gets to ask questions or provide prompts. Prompt injection is a proven attack approach, and attackers will use it if they can get access. To prevent this, use input validation and sanitisation at every layer. You can also route all prompt queries through a proxy to remove malicious requests before they reach the MCP server. This allows you to control all potential inputs into your system, even if you don’t have a standardised approach to follow.
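
One piece of such a proxy can be sketched as a validation step that checks every requested tool call against an allowlist before forwarding it. The tool names, argument rules, and `validate_call` function below are hypothetical, and this only covers structured arguments; it does not by itself detect injection hidden in free-form text.

```python
import re

# Hypothetical allowlist: tool name -> {argument name: validator function}.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id": re.compile(r"[A-Z0-9-]{1,20}").fullmatch},
    "search_docs": {"query": lambda v: len(v) <= 200 and v.isprintable()},
}

def validate_call(tool, arguments):
    """Return (ok, reason); reject unknown tools and malformed arguments."""
    schema = ALLOWED_TOOLS.get(tool)
    if schema is None:
        return False, f"unknown tool: {tool}"
    if set(arguments) != set(schema):
        return False, "unexpected or missing arguments"
    for name, check in schema.items():
        value = arguments[name]
        if not isinstance(value, str) or not check(value):
            return False, f"invalid value for {name}"
    return True, "ok"

print(validate_call("lookup_order", {"order_id": "A-123"}))
print(validate_call("lookup_order", {"order_id": "1; DROP TABLE"}))
```

Rejecting by default, and accepting only what the allowlist describes, keeps the proxy useful even when no standard validation scheme exists for the servers behind it.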

You should also look at reducing permissions to control the impact of any attack getting through. If you have broad permissions and poor multi-tenancy controls, an attack will create more issues and affect more systems than one that is locked down. MCP servers have little to no standard authorisation controls in place, as this is not included in the protocol by default. As such, it’s crucial to ensure you have a robust and well-defined approach to managing access and permissions in place before you connect up any MCP server to your sensitive data.

Ensuring that security and development teams collaborate to enforce least-privilege access and role-based authorisation is critical. You should also isolate your contexts and tenants to reduce the potential impacts from any successful attack. In practice, this helps you contain any potential breach to a single workflow or user, rather than affecting your entire organisation.
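
A deny-by-default check that combines role-based authorisation with tenant isolation might look like the sketch below. The roles, the `Caller` type, and the `authorize` function are illustrative assumptions, not part of MCP itself.

```python
from dataclasses import dataclass

# Hypothetical role -> permitted tools mapping; anything absent is denied.
ROLE_TOOLS = {
    "support_agent": {"lookup_order"},
    "finance_agent": {"lookup_order", "create_invoice"},
}

@dataclass(frozen=True)
class Caller:
    agent_id: str
    role: str
    tenant: str

def authorize(caller, tool, resource_tenant):
    """Allow a call only if the role grants the tool and the tenant matches."""
    if tool not in ROLE_TOOLS.get(caller.role, set()):
        return False  # least privilege: the role must explicitly grant the tool
    return caller.tenant == resource_tenant  # isolation: no cross-tenant access

support = Caller("a1", "support_agent", "acme")
print(authorize(support, "lookup_order", "acme"))    # role and tenant match
print(authorize(support, "create_invoice", "acme"))  # tool not granted to role
print(authorize(support, "lookup_order", "globex"))  # wrong tenant
```

If an agent in this setup is compromised, the damage is bounded by one role’s tool set and one tenant’s data.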

Knowing you are in control

For developers, working with security teams on controls for MCP deployments is another task to add to the list. But with more and more AI application deployments happening, and so much demand for agentic AI systems, getting MCP deployments right from the start is an essential skill to develop.

Unfortunately, many of the previous static controls that worked around areas like APIs don’t work for MCP servers. Instead, your security controls have to work in real-time, just like your prompting and responses. Furthermore, your entire company will have to develop a sense of what is needed for security around AI systems. It’s not just the responsibility of the security team or you as the person who developed the application, but the whole organisation. This mindset shift helps the business innovate safely and deliver on impact.

Attacks on AI systems are already happening. From bad actors looking out for LLM service accounts that they can hijack and resell proxy access to, dubbed LLMjacking, to training data containing API keys, credentials, and user accounts that could be pillaged for access, AI systems are already under threat. As we consider how to innovate and move faster with agentic AI, we can’t underestimate its inherent risks. New standards, like MCP, should have security by design baked into them from the start – but when they don’t, other guardrails should be put in place.

As autonomous agents become inseparable from business operations, the MCP servers that run these services will be targeted. Putting strong security principles in place in collaboration with security teams will be a necessary investment if agentic AI services are to deliver. For developers, thinking about this at the start will make your businesses more secure – and your life easier – in the long run, too.

The post Model Context Protocol Servers and Security: What You Need to Know appeared first on ML Conference.

]]>
AI Security in Focus: Managing Identity, Model Drift, and LLMOps Risks https://mlconference.ai/blog/mlops/ai-security-identity-llmops-model-drift/ Mon, 01 Sep 2025 15:28:10 +0000 https://mlconference.ai/?p=108302 In this blog, we share two in-depth articles: one explores the AI triple threat and why identity security must be the cornerstone of adoption, while the other looks at LLMOps and how to manage model drift for the safe use of large language models. Together, they highlight the dual challenge of securing AI while ensuring its reliability.

The post AI Security in Focus: Managing Identity, Model Drift, and LLMOps Risks appeared first on ML Conference.

]]>
The AI Triple Threat: Why Identity Security Must be the Cornerstone of AI Adoption

by David Higgins

AI brings new possibilities, but with it, new risks. This article looks at the three threats that AI brings and the best strategies to use identity security and keep cybersecurity at the forefront of digital strategies.

A series of recent high-profile breaches has demonstrated that the UK remains highly exposed to increasingly sophisticated cyber threats. This vulnerability is growing as artificial intelligence becomes more deeply embedded in day-to-day business operations. From driving innovation to enabling faster decision-making, AI is now integral to how organisations deliver value and stay competitive. Yet, its transformative potential comes with risks that too many organisations have yet to fully address.

CyberArk’s latest research shows that AI now presents a complex “triple threat”. It is being exploited as an attack vector, deployed as a defensive tool and, perhaps most concerning, introducing critical new security gaps. This dynamic threat landscape demands that organisations place identity security at the centre of any AI strategy if they wish to build resilience for the future.

AI is enhancing familiar threats

AI has raised the bar for traditional attack methods. Phishing, which remains the most common entry point for identity breaches, has evolved beyond poorly worded emails to sophisticated scams that use AI-generated deepfakes, cloned voices and authentic-looking messages. Nearly 70% of UK organisations fell victim to successful phishing attacks last year, with more than a third reporting multiple incidents. This shows that even robust training and technical safeguards can be circumvented when attackers use AI to mimic trusted contacts and exploit human psychology.

It is no longer enough to assume that conventional perimeter defences can stop such threats. Organisations must adapt by layering in stronger identity verification processes and building a culture where suspicious activity is flagged and investigated without hesitation.

AI as a defensive asset

While AI is strengthening attackers’ capabilities, it is also transforming how defenders operate. Nearly nine in ten UK organisations now use AI and large language models to monitor network behaviour, identify emerging threats and automate repetitive tasks that previously consumed hours of manual effort. In many security operations centres, AI has become an essential force multiplier that allows small teams to handle a vast and growing workload.

Almost half of organisations expect AI to be the biggest driver of cybersecurity spending in the coming year. This reflects a growing recognition that human analysts alone cannot keep up with the scale and speed of modern attacks. However, AI-powered defence must be deployed responsibly. Over-reliance without sufficient human oversight can lead to blind spots and false confidence. Security teams must ensure AI tools are trained on high-quality data, tested rigorously, and reviewed regularly to avoid drift or unexpected bias.

AI is expanding the attack surface

The third element of the triple threat is the rapid growth in machine identities and AI agents. As employees embrace new AI tools to boost productivity, the number of non-human accounts accessing critical data has surged, now outnumbering human users by a ratio of 100 to one. Many of these machine identities have elevated privileges but operate with minimal governance. Weak credentials, shared secrets and inconsistent lifecycle management create opportunities for attackers to compromise systems with little resistance.

Shadow AI is compounding this challenge. Research indicates that over a third of employees admit to using unauthorised AI applications, often to automate tasks or generate content quickly. While the productivity gains are real, the security consequences are significant. Unapproved tools can process confidential data without proper safeguards, leaving organisations exposed to data leaks, regulatory non-compliance and reputational damage.

Addressing this risk requires more than technical controls alone. Organisations should establish clear policies on acceptable AI use, educate staff on the risks of bypassing security, and provide approved, secure alternatives that meet business needs without creating hidden vulnerabilities.

Putting identity security at the centre

Securing AI-driven businesses demands that identity security be embedded into every layer of the organisation’s digital strategy. This means achieving real-time visibility of all identities, whether human, machine or AI agent, applying least privilege principles consistently, and continuously monitoring for abnormal access behaviours that may indicate compromise.

Forward-looking organisations are already adapting their identity and access management frameworks to handle the unique demands of AI. This includes adopting just-in-time access for machine identities, implementing privilege escalation monitoring and ensuring that all AI agents are treated with the same rigour as human accounts.

AI promises enormous value for organisations ready to embrace it responsibly. However, without strong identity security, that promise can quickly turn into a liability. The companies that succeed will be those that understand that building resilience is not optional, but foundational to long-term growth and innovation.

In an era where adversaries are equally empowered by AI, one principle holds true: securing AI begins and ends with securing identity.

———————————————————————————————————————————————————————————————-

Managing Model Drift in LLMs for the Safe Use of AI

by João Freitas

Implementing an effective LLMOps framework can help enterprises keep the output of their LLMs free of model drift and AI hallucinations. This article explains how to create a successful LLMOps strategy, manage model drift, and ensure customer trust and satisfaction.

The number of business professionals using AI continues to grow as both sanctioned and unsanctioned use skyrocket, and organizations deploy commercially available LLMs internally. Given the increasing adoption of LLMs, organizations must ensure outputs from these models are trustworthy and repeatable over time. LLMs have become business-critical systems in modern enterprises, and any potential failure of these systems can rapidly harm customer trust, violate regulations and damage an organization’s reputation.

Foundational AI models are expensive to train and run, and in most business contexts, there is minimal return on investment for companies that invest millions in building their models. With this cost in mind, organizations instead choose to rely on LLMs developed by third parties, which must be managed in the same way other enterprise systems are managed.

However, organizations must be on guard for model drift and AI hallucinations when using these third-party models, and implement standardized processes to remediate these issues. This specialized space, called LLMOps, is emerging as organizations adopt dedicated platforms that extend traditional MLOps and observability frameworks to meet the unique challenges posed by widespread LLM use.

But what does a suitable LLMOps framework look like?

Forming the bedrock of LLMOps

It’s clear that organizations need LLMOps to mitigate the risk of hallucinations or model drift, but the practical aspects of an LLMOps framework can be less apparent. Several crucial considerations must form the bedrock of an organization’s LLMOps practices.

When any publicly available LLM is adopted by an organization, the first step in managing its use is to establish clear guardrails for the systems and data it can access. Approved use cases for the LLM must also be made clear across teams, so that they can innovate without ever exposing sensitive data or systems to a third-party provider or crossing data permissions boundaries.

Similarly, organizations must set up a good level of observability around any LLMs to detect issues with latency or inaccurate outputs before they can escalate into issues that directly affect engineering teams. Both of these steps can improve organizational security around LLM usage to reduce the risk exposure often associated with the adoption of new tools.

To maintain the long-term accuracy and trustworthiness of LLM outputs, organizations must implement safeguards to reduce bias and ensure fairness in any outputs generated. LLMs are prone to bias, which is present in the data they were trained with. For example, LLMs often refer to developers as “he” rather than using a gender-neutral term. While this may seem innocuous, it can be a sign of other biases within the LLM, which can ultimately affect hiring decisions or internal company policies, often to the detriment of one or more groups.

It is also vital for organizations to test the LLMs they use for degradation over time due to changes in the data. This is necessary to ensure the model aligns with the data in their environment and provides an additional layer of security against AI hallucinations.
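
One simple way to test for degradation is to re-run a fixed set of questions with known answers on a schedule and compare the score with a stored baseline. The sketch below assumes a `model` that is any callable from question to answer; the golden set, the stub models, and the 5% tolerance are made up for illustration.

```python
def exact_match_rate(model, golden_set):
    """Fraction of golden questions the model answers exactly (case-insensitive)."""
    hits = sum(
        1 for question, expected in golden_set
        if model(question).strip().lower() == expected.strip().lower()
    )
    return hits / len(golden_set)

def has_degraded(baseline, current, tolerance=0.05):
    """Flag a drop larger than the tolerance relative to the stored baseline."""
    return (baseline - current) > tolerance

# Hypothetical golden set and stub models standing in for real LLM calls.
golden = [("2+2", "4"), ("capital of France", "Paris")]
model_v1 = {"2+2": "4", "capital of France": "Paris"}.get
model_v2 = {"2+2": "4", "capital of France": "Lyon"}.get

baseline = exact_match_rate(model_v1, golden)
current = exact_match_rate(model_v2, golden)
print(baseline, current, has_degraded(baseline, current))
```

Exact match is a deliberately strict metric; free-form answers usually need a looser comparison, but the baseline-versus-current structure stays the same.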

The final pillar of an effective LLMOps framework is for the organization to proactively address risks related to the generation of incorrect sensitive data, such as generating incorrect pricing. Sensitive, business-critical decisions cannot be wholly given over to LLMs. Instead, responsible LLMOps will keep human oversight for critical operations.

When successfully adopted, LLMOps will enable LLMs to scale as more users within an organization adopt tools with guardrails in place. LLMOps will also keep LLMs performing well so they never become blockers to innovation or cause operational slowdowns.

However, LLMOps is not a one-and-done process. Instead, LLMs must be constantly monitored and retrained on up-to-date datasets to avoid model drift over time.

How LLMOps prevent model drift

With a vast number of organizations using commercially available LLMs, there is a growing risk of model drift influencing LLM-generated outputs as time goes on. The primary cause of model drift is a model basing its responses on outdated data. For example, an organization using GPT-1 would only receive answers based on that model’s training data, which comes from pre-2018, while GPT-4 has been trained on data up to 2023.

So, how can enterprises use LLMOps to combat model drift?

There are five strategies organizations can employ, depending on their datasets and computational resources:

  • Use the latest version of an LLM to account for more recent data, helping to ensure that any generated outputs are up to date and reducing the chance of AI hallucinations where the LLM tries to fill gaps in its training data.
  • Fine-tune pre-trained LLMs to respond to a specific topic, improving the accuracy of outputs without the major investment of training a proprietary model.
  • Adjust parameters for responses and adjust the weighting of responses to enable an LLM to give more importance to certain tokens over others in response generation.
  • Use Retrieval-Augmented Generation (RAG) to enhance the LLM’s case-specific knowledge and factual accuracy by retrieving relevant information from external knowledge sources during inference.
  • Pass sufficient, industry-focused context to the model to ensure users get better responses to questions and more relevant answers for the enterprise’s specific industry.
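
To make the RAG strategy above concrete, here is a toy sketch of the retrieve-then-prompt step. It ranks documents by simple term overlap, which is a stand-in for the embedding search a real system would typically use; the documents, function names, and prompt wording are all assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, documents, k=2):
    """Return up to k documents, ranked by query-term occurrences."""
    query_terms = set(tokenize(query))
    scored = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def build_prompt(query, documents, k=2):
    """Assemble a prompt that grounds the model in the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "Refunds are processed within 14 days.",
    "Our office is closed on public holidays.",
    "Refund requests need an order number.",
]
print(build_prompt("How long do refunds take?", docs))
```

Because the knowledge lives in `docs` rather than in the model’s weights, updating a document changes the answer immediately, with no retraining.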

Successful LLMOps is continuous

While enterprises can adopt LLMOps to manage how teams use LLMs, they cannot treat it as a one-off process.

Preventing model drift requires constant supervision of AI-generated outputs and regular retraining of LLMs as an organization’s internal datasets evolve. Given the potentially damaging business impact of incorrect results, mitigating hallucination risk is crucial to the success of a modern organization.

Through the creation of an effective LLMOps strategy, organizations will be able to improve customer trust, ensure their regulatory compliance and protect their reputation, all while making their operations more efficient.

The post AI Security in Focus: Managing Identity, Model Drift, and LLMOps Risks appeared first on ML Conference.

]]>
The Expanding Scope of Observability for AI Systems https://mlconference.ai/blog/the-expanding-scope-of-observability/ Tue, 12 Aug 2025 14:19:59 +0000 https://mlconference.ai/?p=108224 Observability is undergoing a big change in the era of AI and must now cover model and data drift, autonomous decision making, intent and outcome alignment, and more. This article gives some best practices for the next-generation of AI-native observability, examines potential challenges, and looks towards the future of achieving observability's full potential.

The post The Expanding Scope of Observability for AI Systems appeared first on ML Conference.

]]>
As organizations accelerate their adoption of AI-powered tools, ranging from CodeBots to agentic AI, observability is rapidly shifting from a technical afterthought to a strategic business enabler. In our last article, “Observability in the Era of CodeBots, AI Assistants, and AI Agents”, we briefly touched upon key enhancements in the observability space. Continuing from there, the stakes are high for the next steps in observability, where AI systems are predicted to act autonomously, make complex decisions, and interact with humans and other agents in ways that are often opaque. Without robust observability, organizations risk not only technical debt and operational inefficiency, but also ethical lapses, compliance violations, and loss of user trust.

Join us at MLCon New York to attend Garima Bajpai’s keynote & workshop LIVE!

Keynote: Charting the Way Forward for AI-Native Software Organizations

Workshop: Operationalizing AI Workshop – Leadership Sprints

The Expanding Scope of Observability

The traditional boundaries of observability—metrics, logs, and traces—are being redrawn. In the AI era, observability must encompass:

 

Fig. 1: The expanding scope of observability

  • Intent and Outcome Alignment: Did the AI system achieve what was intended, and can we explain how it got there?
  • Model and Data Drift: Are models behaving consistently as data and environments evolve?
  • Autonomous Decision Auditing: Can we trace and audit the rationale behind AI agent decisions?
  • Human-AI Interaction Quality: How effectively are developers and end-users collaborating with AI assistants?

In the next section, we’ll expand on each of the specific questions and outline the next steps.

Intent and Outcome Alignment

AI alignment refers to ensuring that an AI system’s goals, actions, and behaviors are consistent with human intentions, values, and ethical principles. Achieving intent and outcome alignment means the system not only delivers the desired results but does so for the right reasons, avoiding unintended consequences such as bias or reward hacking. For example, if an AI is designed to assist with customer queries, alignment ensures it provides accurate, helpful responses rather than hallucinating or misleading users. Regular outcome auditing is essential: it involves evaluating real-world results to check for disparities or unintended effects, ensuring the AI’s outputs match the original intent and are explainable.

Observability is foundational for intent and outcome alignment because it makes the AI’s decision-making transparent and traceable, allowing stakeholders to explain, verify, and correct its behavior as needed.

  • Intent tracing and validation: Mechanisms to explicitly track the mapping from user intent to system objectives and emergent behaviors, allowing for validation that intent is preserved through each stage of the AI’s operation.
  • Robust logging of agent interactions: Especially for agentic AI, detailed logs of external actions, tool invocations, and inter-agent communications are necessary to detect misuse or unintended consequences.
  • Automated anomaly and misalignment detection: Integration of anomaly detection systems that can flag when observed behaviors deviate from expected, aligned patterns—potentially using machine learning to recognize subtle forms of misalignment.
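
As a small illustration of intent tracing and interaction logging, the sketch below ties every tool invocation and inter-agent message to the user intent that started it. The record layout and function names are assumptions; a production system would more likely emit these events through its existing tracing stack.

```python
import json
import time
import uuid

def new_trace(user_intent):
    """Start a trace that ties all later events back to the original intent."""
    return {"trace_id": str(uuid.uuid4()), "intent": user_intent, "events": []}

def log_event(trace, kind, **details):
    """Append a timestamped event (tool call, agent message, ...) to the trace."""
    event = {"ts": time.time(), "kind": kind, **details}
    trace["events"].append(event)
    return event

def tools_used(trace):
    """List the tools invoked under this intent, in order."""
    return [e["tool"] for e in trace["events"] if e["kind"] == "tool_call"]

trace = new_trace("refund order 42")
log_event(trace, "tool_call", tool="lookup_order", args={"order_id": "42"})
log_event(trace, "agent_message", to="finance_agent", summary="refund approved")
print(json.dumps(trace, indent=2))
```

With the intent stored next to the actions, a reviewer or an anomaly detector can ask whether the tools in `tools_used(trace)` are plausible for that intent.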

Model and Data Drift

Model and data drift refer to the phenomenon where machine learning models gradually lose predictive accuracy as the data and environments they operate in evolve. This happens because the statistical properties of the input data or the relationships between features and target variables change over time, making the model’s original assumptions less valid. There are two primary types:

  • Data drift (covariate shift): The distribution of input features changes, but the relationship between inputs and outputs may remain the same.
  • Concept drift: The relationship between inputs and outputs changes, often due to shifts in the underlying process generating the data.

As data and environments evolve, observability is essential to ensure models behave consistently and maintain their predictive power. Advanced observability features—especially automated, real-time drift detection and diagnostics—are critical for robust, production-grade machine learning systems.

  • Drift detection: Observability tools can implement statistical tests (e.g., Population Stability Index, KL Divergence, KS Test) to compare incoming data distributions with those seen during training, flagging significant deviations.
  • Automated drift detection and alerting: Real-time, automated identification of both data and concept drift, with configurable thresholds and notifications.
  • Granular performance monitoring: Tracking model accuracy, precision, recall, and other metrics across different data segments and time windows to pinpoint where drift is occurring.
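
Of the statistical tests mentioned above, the Population Stability Index is simple enough to sketch with the standard library. The version below bins one feature using the training range and compares bin fractions; the bin count, the epsilon for empty bins, and the 0.25 alert level are conventional rules of thumb assumed here, not values taken from any particular tool.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(1000)]  # synthetic training feature
live_same = list(train)                 # no drift
live_shifted = [x + 5 for x in train]   # distribution moved to the right

print(psi(train, live_same))
print(psi(train, live_shifted) > 0.25)  # above the common alert threshold
```

Running this per feature on a schedule, and alerting when the score crosses the chosen threshold, is one way to implement the automated alerting described in the list.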

Autonomous Decision Auditing

Tracing and auditing the rationale behind AI agent decisions, especially in autonomous or agentic AI systems, is both possible and increasingly necessary, but it presents significant technical and organizational challenges. With the right combination of observability, explainability, and compliance tools, such auditing is feasible.

As AI systems grow in complexity and autonomy, advanced observability features such as real-time monitoring, detailed logging, and integrated XAI are essential for ensuring transparency, accountability, and trust.

  • Decision provenance tracking, recording the sequence of transformations and inferences leading to each decision.
  • Automated bias and fairness checks at both data and outcome levels, with alerts for detected issues.
  • Integration of XAI tools for on-demand explanation of individual decisions, especially in high-stakes or regulated environments.
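
One possible shape for decision provenance tracking is an append-only trail in which each step carries a hash of the previous one, so later edits become detectable. This is a sketch, not a reference design: the `ProvenanceTrail` class, the step kinds, and the loan example are all hypothetical.

```python
import hashlib
import json


def _digest(body):
    """Stable SHA-256 over a JSON-serializable step body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class ProvenanceTrail:
    """Append-only list of decision steps, each linked to the previous by hash."""

    def __init__(self, decision_id):
        self.decision_id = decision_id
        self.steps = []

    def add_step(self, kind, detail):
        prev_hash = self.steps[-1]["hash"] if self.steps else ""
        body = {"kind": kind, "detail": detail, "prev": prev_hash}
        self.steps.append({**body, "hash": _digest(body)})

    def verify(self):
        """Recompute the chain; False means a step was altered or reordered."""
        prev_hash = ""
        for step in self.steps:
            body = {"kind": step["kind"], "detail": step["detail"], "prev": prev_hash}
            if step["prev"] != prev_hash or step["hash"] != _digest(body):
                return False
            prev_hash = step["hash"]
        return True


trail = ProvenanceTrail("loan-123")
trail.add_step("input", {"income": 52000, "requested": 10000})
trail.add_step("inference", {"model": "risk-v2", "score": 0.31})
trail.add_step("decision", {"outcome": "approve"})
intact = trail.verify()

trail.steps[1]["detail"]["score"] = 0.90  # simulate tampering after the fact
tampered_ok = trail.verify()
print(intact, tampered_ok)
```

An auditor can replay the steps in order and rely on `verify()` to confirm the record was not rewritten afterwards.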

Human-AI Interaction Quality

Developers and end-users are collaborating with AI assistants with increasing effectiveness, but the quality of these interactions varies widely depending on the application, the clarity of communication, and the feedback mechanisms in place. Observability in the context of human-AI interaction means having comprehensive visibility into both the AI’s internal decision-making processes and the dynamics of user-AI exchanges.

This enables:

  • Multimodal Analytics: Ability to combine quantitative metrics (e.g., error rates, session lengths) with qualitative data (e.g., sentiment analysis, user feedback) for a holistic view of interaction quality.
  • Integration with Human-in-the-Loop and Human-in-the-Lead Systems: Seamless handoff and tracking between AI and human agents, ensuring continuity and accountability in complex workflows.
  • Automated Feedback Impact Analysis: Tools that automatically correlate user feedback with subsequent changes in AI behavior or performance, quantifying the value of human input.

Effective human-AI collaboration depends on robust observability, which empowers developers and end-users to monitor, understand, and continuously improve interaction quality.

Key Challenges Ahead

  • Complexity and Scale: AI-powered systems introduce unprecedented complexity. Multi-agent workflows, dynamic model updates, and real-time adaptation all multiply the points of failure and uncertainty. Observability solutions must scale horizontally and adapt to changing system topologies.
  • Data Privacy and Security: With observability comes the collection of sensitive telemetry—prompt data, user interactions, model outputs. Ensuring privacy, compliance (e.g., GDPR, HIPAA), and secure handling of observability data is paramount.
  • Semantic Gaps: Traditional observability tools lack the semantic understanding needed for AI systems. For example, tracing a hallucination or bias back to its root cause requires context-aware instrumentation and domain-specific metrics.
  • Standardization and Interoperability: Fragmentation remains a challenge. While projects like OpenTelemetry’s GenAI SIG are making strides, the ecosystem is still maturing. Vendor lock-in, proprietary data formats, and inconsistent APIs can hinder unified observability across diverse AI stacks.
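
On the privacy point, one common mitigation is to redact obvious identifiers before telemetry leaves the application. The sketch below assumes that prompts and outputs arrive as plain strings under the hypothetical keys `prompt` and `output`; the two patterns are deliberately simple and would not catch every kind of personal data.

```python
import re

# Deliberately simple patterns; real deployments need broader PII coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{9,}\b")  # account-like digit runs


def redact(text):
    """Replace e-mail addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


def redact_record(record, text_fields=("prompt", "output")):
    """Return a copy of a telemetry record with its free-text fields redacted."""
    return {
        key: redact(value) if key in text_fields and isinstance(value, str) else value
        for key, value in record.items()
    }


record = {
    "prompt": "Contact jane.doe@example.com, account 1234567890",
    "output": "Done.",
    "latency_ms": 412,
}
clean = redact_record(record)
print(clean["prompt"])
```

Redacting at the point of collection keeps raw identifiers out of every downstream store, which is easier to reason about than cleaning up afterwards.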

Best Practices: Building AI-Aware Observability

  • Design for Explainability: Instrument AI systems with explainability hooks—capture not just what happened, but why. Integrate model interpretability tools (e.g., SHAP, LIME) into observability pipelines to surface feature importances, decision paths, and confidence scores.
  • Embrace Open Standards: Adopt open-source, community-driven observability frameworks (e.g., OpenTelemetry, Langfuse) and LLM-focused platforms that interoperate with them (e.g., LangSmith) to ensure interoperability and future-proofing. Contribute to evolving standards for LLMs and agentic workflows.
  • Feedback Loops and Continuous Learning: Observability should not be passive. Establish automated feedback loops—use observability data to retrain models, refine prompts, and adapt agent strategies in near real-time. This enables self-healing and continuous improvement.
  • Cross-Disciplinary Collaboration: Break down silos between developers, data scientists, MLOps, and security teams. Define shared observability goals and metrics that span the full lifecycle—from data ingestion to model deployment to end-user interaction.
  • Ethics and Governance: Instrument for ethical guardrails: monitor for bias, fairness, and compliance violations. Enable rapid detection and remediation of unintended consequences.
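
As one example of an ethical guardrail that can be monitored continuously, the demographic parity gap compares the rate of positive predictions across groups. The sketch below is only one of several fairness metrics; the group labels, the toy data, and any alert threshold are assumptions for illustration.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Share of positive (== 1) predictions per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rates(preds, groups)
gap = demographic_parity_gap(preds, groups)
print(rates, gap)
```

A gap of zero means every group receives positive predictions at the same rate; an observability pipeline could track this value per time window and alert when it grows.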

The Road Ahead: From Observability to Business Enablement

The evolution of observability in the AI era is not just about better dashboards or faster debugging. It’s about empowering organizations to:

  • Build Trust: Transparent, explainable AI systems foster user and stakeholder confidence.
  • Accelerate Innovation: Rapid feedback cycles and robust monitoring enable faster iteration and safer experimentation.
  • Unlock Business Value: Observability becomes a lever for optimizing AI-driven business processes, reducing downtime, and uncovering new opportunities.

Conclusion: Closing the Strategic Gap

AI is rewriting the rules of software engineering. To harness its full potential, organizations must invest in next-generation observability—one that is AI-native, explainable, and deeply integrated across the stack. Leaders who prioritize observability will be best positioned to navigate complexity, drive responsible innovation, and close the strategic gap in the era of CodeBots, AI Assistants, and AI Agents.

FREQUENTLY ASKED QUESTIONS

  • Why is observability now a strategic business enabler in the AI era?

As organizations adopt CodeBots, AI assistants, and agentic AI, systems make opaque, autonomous decisions at scale. Without robust observability, teams risk technical debt, operational inefficiency, ethical lapses, compliance violations, and loss of user trust. The article argues observability must evolve from a technical afterthought to a strategic capability.

  • What expands the scope of observability beyond metrics, logs, and traces?

The article identifies four new focal areas: intent and outcome alignment, model and data drift, autonomous decision auditing, and human‑AI interaction quality. These dimensions reflect the behaviors of AI systems, not just infrastructure signals.

  • What is “intent and outcome alignment,” and why does it matter?

Alignment ensures an AI system’s goals, actions, and behaviors reflect human intentions and ethical principles. It means delivering desired results for the right reasons—avoiding bias, hallucinations, or reward hacking—and requires regular outcome auditing to verify that outputs match intent and remain explainable.

  • Which observability capabilities support intent alignment?

The text calls for intent tracing and validation to map user goals to system objectives and emergent behaviors. It also stresses robust logging of agent interactions (external actions, tool calls, inter‑agent messages) and automated anomaly/misalignment detection that flags deviations from expected patterns.

  • How do model drift and data drift differ?

Data drift (covariate shift) occurs when input feature distributions change while input‑output relationships may remain stable. Concept drift changes the relationship between inputs and outputs due to shifts in the generating process, eroding model assumptions and performance over time.

  • What drift monitoring features belong in production‑grade observability?

The article recommends statistical tests such as PSI, KL Divergence, and the KS Test to compare live vs. training distributions. It also calls for real‑time, automated drift detection with thresholds/alerts and granular performance tracking (e.g., accuracy, precision, recall) across segments and time windows.

  • What does autonomous decision auditing require for agentic AI?

Auditing needs decision‑provenance tracking to record the sequence of transformations and inferences leading to each decision. It should include automated bias/fairness checks with alerts and integrate XAI tools for on‑demand explanations, particularly in regulated or high‑stakes contexts.

  • How does observability improve human‑AI interaction quality?

By combining quantitative signals (error rates, session length) with qualitative insights (sentiment analysis, user feedback), teams gain a holistic view of interactions. Observability should support human‑in‑the‑loop/“in the lead” handoffs and track how feedback changes system behavior over time.

  • What key challenges complicate AI‑aware observability?

The article highlights complexity and scale (multi‑agent workflows, real‑time adaptation), privacy/security requirements for sensitive telemetry, and semantic gaps in traditional tools. It also notes fragmentation and limited interoperability despite progress from efforts like OpenTelemetry’s GenAI SIG.

  • Which best practices does the article recommend to build AI‑aware observability?

Instrument for explainability (e.g., SHAP, LIME), adopt open standards (OpenTelemetry, LangSmith, Langfuse), and close the loop by using observability data to retrain models and refine prompts. Cross‑disciplinary collaboration and ethics/governance monitoring (bias, fairness, compliance) are emphasized as ongoing practices.

]]>
Are AI Tools Hurting Developer Productivity? https://mlconference.ai/blog/ai-developer-productivity-tools/ Wed, 30 Jul 2025 12:23:58 +0000 https://mlconference.ai/?p=108170 A recent study [1] suggests that developers may become less productive when using AI tools. We've asked our experts to weigh in: Is this a temporary setback, a methodological flaw, or a sign of things to come?

The post Are AI Tools Hurting Developer Productivity? appeared first on ML Conference.

]]>
 

Sebastian Springer:

Lately, there have been several studies highlighting the negative aspects of AI: AI makes us less productive, less creative… I believe it really depends on how we use the tools. The same could be said about search engines or platforms like Stack Overflow. If I rely on such channels for every aspect of my work, I’d become less productive as well. With modern AI tools, the risk is naturally greater, since they’re much more integrated into our work environments and are far more intuitive to use.

On the topic of productivity: Personally, I feel more productive thanks to tools like Copilot, mainly because I use them for repetitive tasks. That said, there are situations where writing a good prompt takes significantly longer than writing the code myself. And of course, working with AI tools comes with the risk of being distracted from the actual problem or heading in the wrong direction. In other cases, the suggestions the AI offers – without any manual prompt – are exactly what I need.

In general, I think: Whether AI makes us unproductive, uncreative, or even dumb – it’s a technology that’s established itself in the market, and one we simply can’t ignore. So, we should focus on leveraging its strengths. And if we already know it has downsides (as almost every technology does), we should try to avoid those pitfalls as much as possible. Besides, AI is in good company: People once claimed that steam engines would never be economical, newspapers would overwhelm us mentally, and written information in general was dangerous – let alone the internet, which supposedly makes people stupid and causes crime to skyrocket. There’s always a grain of truth in every accusation, but in the end, it all comes down to how we deal with it.


Paul Dubs:

Based on my experience with AI tooling for development, which I discussed in a keynote at the JAX conference in May, the impact on productivity is highly dependent on how these tools are used and the developer’s experience level with them. The study actually supports what I’ve observed: there’s a significant learning curve with AI development tools. The one developer in that study who had substantial prior experience with Cursor was notably faster, an anomaly that proves the point. Like any tool, you need to know how to use it effectively to see productivity gains.


During my keynote, I described using agentic AI coding tools as “playing chess with a pigeon”: they would destroy the game and claim victory. Claude Code struggled to navigate projects properly and would even sabotage its own progress by resetting the Git state. The Claude 3.5 / 3.7 models used in the study weren’t well-suited for larger changes or project navigation. However, things changed dramatically with Claude 4’s release at the end of May. Even the smaller, faster Sonnet model became quite capable when used correctly. I now use Roo Code, a Visual Studio Code plugin that allows me to create specialized prompts for different tasks: debugging, programming, documentation, and language-specific work. This customization has made me considerably more productive.

The productivity gains aren’t uniform across all project types. I’m much more productive on greenfield (new) projects. For brownfield (existing) projects requiring major changes, I need to provide extensive additional context, often directly referencing the specific files the AI needs to work with. When I handle the navigation burden myself, the AI can be quite effective. There’s an important caveat: using AI tools creates a knowledge and memory gap. Since I’m not writing every line myself, it feels like delegating to someone else and doing a quick review. When I return to AI-generated code later, I need to reread it because I don’t fully remember the implementation details. It’s similar to working on a project where multiple developers touch every piece: you lose that intimate familiarity with the codebase.

The study’s findings align with my experience: developers unfamiliar with AI tools often see productivity losses, while those with significant experience can achieve net gains. The outlier in the study who was more productive validates this. Success with AI coding tools requires understanding their limitations, using them appropriately for the task at hand, and accepting the trade-off between speed and deep code familiarity.


Christoph Henkelmann:

The issue with AI-assisted coding is the same as with many current AI debates: it’s dominated by hype and quick dismissals, rather than a nuanced understanding. Yes, AI tools can deliver massive productivity gains – but only if you actually learn how to use them. This means understanding the basics of LLMs, knowing your domain, and practicing with the tools until you develop a sense for when they help and when they don’t. Most people just install something like Cursor and expect miracles. Naturally, this leads to disappointment. “Vibe coding” might get you a prototype, but real productivity comes from what Paul Dubs calls “omega coding”: deep domain knowledge, familiarity with your tools, and persistent practice. These tools don’t replace thinking; they amplify skill. Managers hoping for instant results will see the opposite at first: initial productivity drops, much like switching to a new IDE. But if you invest the time to learn and adapt, the gains are real and substantial. Most don’t (or better: aren’t given the time to do so), which is why recent studies show lackluster results.


Melanie Bauer:

As an informatics student, I spend a lot of time researching and learning about new tools and topics, especially in the field of software development. AI tools have made this process significantly easier and faster for me. For example, when I have a question, I can get direct and precise answers without having to scroll through extensive documentation.

That’s why tools like GitHub Copilot, Cursor, and ChatGPT have become a regular part of my workflow as a future software developer. Of course, at the end of the day, AI doesn’t think for me, and I am still responsible for reviewing and validating the generated output. But overall, I’ve noticed a clear increase in my productivity, especially when it comes to routine tasks, reducing the ramp-up time when learning new technologies, or understanding code snippets and programming concepts by having them broken down and explained step by step.


Rainer Hahnekamp:

Based on my experience, the use of AI in software development can be divided into three levels:

  1. Code Completion in the IDE: Here, AI offers valuable support by suggesting small code snippets that boost productivity without taking control away from the developer.
  2. Automated Code Generation: In this area – where the AI generates larger code blocks or even entire files – I’ve found that the time required to correct and adapt the output often outweighs the immediate benefit. Still, I see this as an investment in learning how to work with AI effectively. While it may currently slow things down, I’m confident that the technology will improve – and when it does, I want to be ready to make the most of it.
  3. AI-Supported Research and Conceptual Work: Using AI as a sparring partner for brainstorming, idea generation, and problem-solving has proven extremely helpful. It supports creativity and often leads to productive insights.

Personally, I can’t confirm a loss in productivity – quite the opposite. While I haven’t read the details of the referenced study, I suspect the reasons lie in the current lack of best practices and of the necessary intuition for using AI effectively. And, of course, to be transparent, this statement reflects my personal opinion, but the wording was created with the assistance of AI 😉.


Pieter Buteneers:

I use the following AI tools:

  • Cursor, my go-to tool (the agents are a big step forward)
  • ChatGPT (GPT-4.1), if Cursor makes mistakes
  • Claude 4 Sonnet, if neither of the above knows the answer, which happens once every 2-3 weeks or so

In terms of advantages, I started using TypeScript (TS) instead of Python, and AI really helps me understand the syntax and convert Python code into TS much faster. It makes fewer errors in TS than in Python, allows me to write more code, writes unit tests for me, and lets me adopt new packages and technologies much faster. It also helps me with the DevOps side of things, where I am a real noob. Overall, it makes me about two times faster.

I use it a lot to brainstorm ideas and figure out best and bad practices, but it comes with a huge list of caveats. There is a lot of code duplication since it doesn’t know your entire codebase, so your code becomes hard to maintain and turns into ‘spaghetti’ fast. Cursor often fixes a bug by just writing some code to cover an edge case without going deep into the underlying problem, so you think you have a fix when it is really just an ugly patch on code you don’t understand. Ultimately, you still need a senior dev to tell you what good code practices are. I spend more time debugging than writing code, so tests are even more important.


Tam Hanna:

At Tamoggemon Holdings, we currently use AI systems mainly for menial tasks. Using them to write stock correspondence (think cover letters, etc.) has proven to be a significant performance booster, allowing us to refocus on more productive tasks. As for line work (EE or SW), we have so far not seen the systems as a valid replacement for classic manual work.


Rainer Stropek:

After gaining extensive experience with modern AI tools, I can’t imagine my daily work as a developer without their support. My productivity has noticeably increased because I have consistently aligned my entire workflow around collaboration with AI. This goes far beyond classic code completion: autocomplete suggestions are convenient but often too generic and sometimes break my flow. Chat agents, by default, start every conversation from scratch. To work efficiently with them, one must formulate complete, consistent requirements in the prompt, provide context, document the architecture, establish coding guidelines, and supply meaningful test data with expected results. This level of diligence would be advisable anyway; working with AI makes it essential.

Spec-Driven Development instead of Vibe Coding
Many developers underestimate prompting and context management. A few buzzwords are only enough as long as the goal remains vague. As soon as I face concrete customer requirements, I rely on Spec-Driven Development:

  1. I invest significant time in detailing the requirements.
  2. The AI questions and discusses the specification with me.
  3. Only once a sufficient level of maturity is reached do I let the AI implement the solution and review the result.

It’s crucial to create clarity before I let the AI write code.

From Coder to AI Orchestrator
My role is shifting. Instead of primarily writing code, I define work packages that I delegate to AI agents. This is similar to delegating to human team members. I see my future in the role of a product developer with a strong focus on requirements engineering and software architecture – structuring complex requirements in a way that makes them executable by AIs.

Limitations of Today’s AI Systems (especially in large projects)
Despite larger context windows and advanced retrieval (e.g., using MCP servers or function tools integrated in IDEs), AI still lacks a holistic overview of large projects. Humans remain responsible for slicing and documenting tasks so they can be worked on without requiring knowledge of the entire project. If this is done successfully, project size becomes almost irrelevant to the use of AI.

What Companies Need to Do Now
The tool landscape is evolving in months, not years. Instead of committing to a single tool long-term, companies should:

  • Allocate budgets and create space for teams to experiment with various AI tools.
  • Deploy pilot groups to quickly gain hands-on experience.
  • Embrace usage-based pricing models and make their cost-benefit ratio transparent.

From my perspective, those who don’t start building practical experience now risk falling behind. AI is no longer just a nice-to-have add-on; it is fundamentally changing the way we develop software. Those who ignore these new ways of working risk losing productivity and, in the medium term, competitiveness. Now is the time to sharpen specifications, rethink roles, and encourage experimental team setups.


Christian Weyer:

“Never trust a study you didn’t fake yourself” 😉.
Just kidding, of course.
But seriously: At Thinktecture, we’ve seen an unprecedented productivity boost across the team. Personally, I feel significantly more creative – which directly translates into being faster and producing better results.

The key? I don’t let AI tools disrupt my natural flow. Instead, I deliberately configure them to fit my individual thinking and working style. Tools like GitHub Copilot, Windsurf, Cursor, or Cline all offer great ways to customize the experience with your own guardrails.

Maybe many developers don’t yet fully leverage these configuration options – or don’t even know they exist. Used right, these tools amplify productivity instead of hindering it.


Veikko Krypczyk:

In my experience, artificial intelligence can be meaningfully applied throughout all phases of the software development process – from early ideation and UI design to architectural decisions and the implementation of complex algorithms. AI is by no means flawless, but it acts as a virtual work partner that can complete many tasks faster, more diversely, and sometimes even more creatively than would be possible alone.

The actual productivity gain strongly depends on two key factors: the quality of the prompts and the critical evaluation of the generated content. Those who can formulate clearly and have solid domain knowledge will greatly benefit from AI tools – whether it’s generating boilerplate code, writing test cases, supporting refactoring, or systematically exploring technical options.

Of course, AI outputs should never be accepted without reflection. It remains essential for developers to understand, question, and, if necessary, improve the generated suggestions. Domain expertise is not replaced by AI – quite the opposite: it becomes even more crucial to ensure the quality of the outcomes.

My conclusion: when used properly, AI enhances efficiency and broadens perspectives – both individually and in team processes. I find working with AI tools inspiring, more efficient, and often more focused, as they help offload routine work and spark creative thinking. I only experience a loss in productivity when AI is treated as an autopilot rather than as a co-pilot.


Links & Literature

[1] https://arxiv.org/abs/2507.09089

 

 

Top 10 FAQs About AI Coding Tools & Developer Productivity

 

1. Do AI coding tools like GitHub Copilot and Cursor really improve developer productivity?
Yes, when used correctly. Many developers see faster coding and fewer repetitive tasks with tools like GitHub Copilot, Cursor, and Claude. However, beginners may initially experience slower workflows while learning to use them effectively.

2. Why do some developers become less productive with AI development tools?
A lack of training and experience with AI-powered coding assistants can cause slower progress at first. Without understanding prompt writing, debugging AI output, or configuring tools properly, productivity can drop.

3. What is the learning curve for GitHub Copilot, Cursor, and similar AI coding assistants?
Most developers need time to master AI-assisted development. Success comes from learning prompt engineering, adapting workflows, and knowing when to trust AI suggestions versus manual coding.

4. Can AI coding assistants replace human software developers?
No. AI tools can speed up tasks like code completion, boilerplate generation, and prototyping, but human expertise is essential for architecture design, problem-solving, and ensuring high-quality code.

5. How can developers get the most out of AI coding tools?
Use AI tools for repetitive coding, quick prototypes, and brainstorming. Always review AI-generated code, write clear prompts, and combine AI with strong coding fundamentals for the best results.

6. What are common problems with AI-generated code?
Developers often face duplicated code, messy “spaghetti code,” shallow bug fixes, and the need for extra debugging. Writing unit tests and applying good coding practices remains essential.

7. What is ‘spec-driven development’ and how does it help AI-assisted coding?
Spec-driven development involves writing detailed software specifications before using AI tools. This approach helps ensure that AI-generated code matches the project’s goals and reduces wasted time on rework.

8. What are the best AI coding tools for developers in 2025?
Popular options include GitHub Copilot, Cursor, Claude 4 Sonnet, Roo Code, ChatGPT (GPT-4.1), Windsurf, and Cline. Many developers use a combination of these for different coding tasks.

9. How do AI coding assistants perform in greenfield vs. brownfield projects?
AI assistants tend to be more effective in greenfield (brand-new) projects, where they can help build from scratch. Brownfield (existing) projects often require more manual guidance and context-setting.

10. How should companies prepare before rolling out AI-powered coding tools?
Run pilot programs, give developers time to experiment, avoid locking into one tool too soon, and provide training on prompt engineering and AI best practices for software development.

]]>
MCP vs A2A: Architecting AI Agent Communication for Enterprise https://mlconference.ai/blog/mcp-vs-a2a-ai-agent-communication-enterprise/ Mon, 21 Jul 2025 12:10:07 +0000 https://mlconference.ai/?p=108095 The AI landscape is shifting towards collaborative, specialized agents. This article provides an essential comparative analysis of emerging AI agent communication protocols: Anthropic's Model Context Protocol (MCP) and Google's Agent-to-Agent (A2A). For ML/AI developers and technical leaders, understanding these frameworks is crucial for building scalable, secure, and composable AI systems. We delve into the architecture, benefits, and challenges of each protocol, guiding you to make informed decisions for your next-gen enterprise AI infrastructure and AI tool development.

The post MCP vs A2A: Architecting AI Agent Communication for Enterprise appeared first on ML Conference.

]]>
The field of AI is undergoing a significant architectural shift. We are moving from standalone AI systems that provide isolated capabilities toward interconnected ecosystems of specialized agents that collaborate to solve complex problems. This evolution mirrors the historical development of human organizations, where specialization and communication allowed for more sophisticated collective capabilities.

As AI systems grow more capable and autonomous, the need for standardized communication mechanisms becomes increasingly critical. Without established protocols, organizations face challenges including:

  1. Technical Fragmentation: Teams developing separate integration methods for each agent pairing, leading to duplicated effort and inconsistent implementations.
  2. Security Vulnerabilities: Ad-hoc communication systems often lack robust authentication, authorization, and data protection mechanisms.
  3. Limited Composability: Without standardized interfaces, combining capabilities from different AI systems becomes prohibitively complex.
  4. Governance Challenges: Tracking information flow, maintaining audit trails, and ensuring accountability becomes difficult when agent communication occurs through diverse, non-standardized channels.

AI Agent communication protocols aim to address these challenges by providing structured frameworks that define how agents advertise capabilities, request services, exchange information, and coordinate activities. These protocols serve as the foundational infrastructure upon which sophisticated multi-agent systems can be built.

In this evolving landscape, two significant protocols have emerged as potential industry standards: Model Context Protocol (MCP) developed by Anthropic and Agent-to-Agent (A2A) recently introduced by Google. Each brings a distinct perspective on how AI Agent communication should be structured, secured, and integrated into enterprise workflows.

Understanding the architectural foundations, benefits, limitations, and optimal use cases for each protocol is essential for organizations planning their AI infrastructure investments. This comparative analysis will help technical leaders make informed decisions about which protocol—or combination of protocols—best suits their specific requirements and use cases.

What is MCP?

Model Context Protocol (MCP) represents Anthropic’s approach to establishing a standardized framework for AI Agent communication and operation. At its philosophical core, MCP recognizes that as AI systems grow in complexity and capability, they require a consistent, structured way to interact with external tools, data sources, and services.

MCP emerged from the practical challenges faced by developers building sophisticated AI applications. Without standardization, each team was forced to develop custom integration methods for connecting their AI systems with external capabilities—resulting in duplicated effort, inconsistent implementations, and limited interoperability between systems.

The protocol addresses these challenges by providing a unified method for structuring the context in which AI models operate. It defines clear patterns for how information should be organized, how models should access external resources and tools, and how outputs should be formatted. This standardization allows for better interoperability between different AI systems, regardless of their underlying architecture or training methodology.

Rather than focusing on direct agent-to-agent communication, MCP emphasizes the importance of structured context—ensuring that AI systems have access to the information and capabilities they need in a consistent, well-organized format. This approach treats tools and data sources as extensions of the model’s capabilities, allowing for dynamic composition of functionality without requiring extensive pre-programming.

By providing this standardized interface for context management, MCP aims to reduce ecosystem fragmentation, enable more flexible AI deployments, and facilitate safer, more reliable AI systems that can work together coherently while maintaining alignment with human intentions and values.

MCP Architecture

MCP’s architecture is built around a hierarchical context structure that organizes information and capabilities into clearly defined components. This architecture follows several key design principles that prioritize clarity, security, and flexibility while maintaining clear separation between different types of contextual information.

Core Architectural Components:

  1. MCP Host: The “brain” of the system—an AI application using a Large Language Model (LLM) at its core that processes information and makes decisions based on available context. The host is the primary consumer of the capabilities and information provided through the MCP framework.
  2. MCP Client: Software components that maintain 1:1 connections with MCP servers. These clients serve as intermediaries between hosts and servers, facilitating standardized communication while abstracting away the complexities of server interactions.
  3. MCP Server: Lightweight programs that expose specific capabilities through the standardized protocol. Each server is responsible for a discrete set of functionalities, promoting separation of concerns and allowing for modular composition of capabilities.
  4. Local Data Sources: Files, databases, and services on the local machine that MCP servers can securely access. These provide the foundation for contextual information from the immediate environment.
  5. Remote Data Sources: External systems available over the internet (typically through APIs) that MCP servers can connect to, expanding the potential information sources available to AI systems.

Figure 1 MCP Architecture (source: Tahir[1])
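The relationship between host, client, and server can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the MCP SDK: the class names `SketchHost`, `SketchClient`, and `SketchServer`, the `read_file` tool, and the in-memory `notes` data source are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SketchServer:
    """Hypothetical stand-in for an MCP server: a small, named set of capabilities."""
    name: str
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def list_tools(self) -> List[str]:
        return sorted(self.tools)

    def call_tool(self, tool: str, **arguments: str) -> str:
        return self.tools[tool](**arguments)


@dataclass
class SketchClient:
    """Hypothetical stand-in for an MCP client: wraps exactly one server (1:1)."""
    server: SketchServer

    def list_tools(self) -> List[str]:
        return self.server.list_tools()

    def call_tool(self, tool: str, **arguments: str) -> str:
        return self.server.call_tool(tool, **arguments)


@dataclass
class SketchHost:
    """Hypothetical stand-in for an MCP host: keeps one client per connected server."""
    clients: Dict[str, SketchClient] = field(default_factory=dict)

    def connect(self, server: SketchServer) -> None:
        self.clients[server.name] = SketchClient(server)

    def available_tools(self) -> Dict[str, List[str]]:
        return {name: client.list_tools() for name, client in self.clients.items()}


# A local data source: an in-memory dict standing in for files on disk.
notes = {"todo.txt": "ship the release"}
files = SketchServer("files", {"read_file": lambda path: notes[path]})

host = SketchHost()
host.connect(files)
result = host.clients["files"].call_tool("read_file", path="todo.txt")
```

The host never touches the data source directly. It only sees what the server chooses to expose, which is the separation of concerns the component list describes.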

Context Structure and Information Flow:

The MCP architecture implements a controlled information flow where context passes through defined pathways. When an MCP Host needs to access external information or capabilities, it connects to appropriate MCP Servers through MCP Clients. The servers then mediate access to various data sources, ensuring that information is properly formatted and permissions are appropriately handled.

This structured flow ensures that all processing occurs within well-defined boundaries, making it easier to track how information moves through the system and maintain security and accountability. The architecture explicitly distinguishes between:

  • Tools: Model-controlled actions that allow the AI to perform operations such as fetching data, writing to a database, or interacting with external systems.
  • Resources: Application-controlled data such as files, JSON objects, and attachments that can be incorporated into the AI’s context.
  • Prompts: User-controlled predefined templates (similar to slash commands in modern IDEs) that provide standardized ways to formulate certain types of requests.
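One way to picture the distinction is to tag each capability with the party that controls it. The sketch below is an assumption-laden illustration rather than part of the protocol: the `Capability` type, the `may_invoke` check, and the example names (`query_posts`, `blog/posts.json`, `summarize-post`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ControlledBy(Enum):
    MODEL = "model"              # tools: the model decides when to act
    APPLICATION = "application"  # resources: the application decides what to attach
    USER = "user"                # prompts: the user picks a predefined template


@dataclass(frozen=True)
class Capability:
    kind: str
    name: str
    controlled_by: ControlledBy


# Hypothetical capabilities for a blog-search integration.
CAPABILITIES = [
    Capability("tool", "query_posts", ControlledBy.MODEL),
    Capability("resource", "blog/posts.json", ControlledBy.APPLICATION),
    Capability("prompt", "summarize-post", ControlledBy.USER),
]


def may_invoke(actor: ControlledBy, capability: Capability) -> bool:
    """Allow an actor to trigger only the capabilities it controls."""
    return capability.controlled_by is actor
```

Keeping the controlling party explicit is what makes it possible to reason about who initiated a given action when reviewing a session later.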

Figure 2 is a sequence diagram showing the information flow between different components in a system that uses MCP to retrieve blog data, specifically SQL-related blog posts, for a user. This type of flow would be useful in a plugin-style AI integration where the AI needs to interact with external data sources via a protocol like MCP but requires explicit user permission and intelligent capability discovery.

Figure 2 MCP Workflow Example (source: Gökhan Ayrancıoğlu[2])

Protocol Implementation:

MCP messages follow JSON-RPC 2.0, and the protocol is designed to be transport-agnostic: the specification defines a stdio transport for servers running locally and an HTTP-based transport for remote ones. On top of this, the protocol defines standardized messages for listing the tools a server offers, invoking a tool, and returning its result, ensuring consistent interaction patterns regardless of the specific tools or data sources being accessed.
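To make the message formats concrete, here is a minimal sketch of a tool-invocation round trip. It assumes the JSON-RPC 2.0 framing and the `tools/call` method name used by the MCP specification; the tool name `search_posts`, its `topic` argument, and the `handle` function are hypothetical stand-ins, and no real transport or server is involved.

```python
import json
from typing import Any, Dict


def make_tool_call(request_id: int, name: str, arguments: Dict[str, Any]) -> str:
    """Serialize a tool-invocation request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def handle(raw: str) -> str:
    """Hypothetical server-side handler that only understands tools/call."""
    request = json.loads(raw)
    if request["method"] != "tools/call":
        # -32601 is the JSON-RPC 2.0 "Method not found" error code.
        return json.dumps({
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": -32601, "message": "Method not found"},
        })
    topic = request["params"]["arguments"]["topic"]
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {
            # Stub payload; a real server would query its data source here.
            "content": [{"type": "text", "text": f"stub result for topic {topic}"}],
            "isError": False,
        },
    })


request = make_tool_call(1, "search_posts", {"topic": "SQL"})
response = json.loads(handle(request))
```

Because the response carries the same `id` as the request, a client can match results to calls even when several are in flight over one connection.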

Recent developments in the protocol have expanded support for remote MCP servers (over Server-Sent Events) and integration with authentication mechanisms like OAuth, making it suitable for enterprise deployments where security and distributed access are essential requirements.

This architecture aims to create a standardized environment for AI processing, where information sources are clearly delineated, tools are discoverable and consistently invocable, and outputs adhere to predictable formats, enabling safer multi-agent interactions and clearer accountability.

MCP Benefits

MCP offers several substantial advantages that make it particularly valuable for organizations implementing enterprise-grade AI systems. These benefits directly address common challenges in AI development and deployment, providing tangible improvements in development efficiency, system flexibility, and organizational collaboration.

Ecosystem Standardization and Reduced Fragmentation:

One of MCP’s most significant benefits is the reduction in ecosystem fragmentation. Before standardized protocols, every team building AI applications had to develop custom integrations for connecting their systems with tools and data sources. This resulted in duplicated effort, inconsistent implementations, and limited interoperability.

MCP addresses this challenge by providing a standardized way to connect AI systems with external capabilities. This standardization significantly reduces development overhead and creates a more cohesive AI ecosystem where components can be easily shared and reused. Organizations can develop MCP servers once and leverage them across multiple AI applications, maximizing return on development investments.

Dynamic Composability of Capabilities:

MCP enables dynamic composability of AI systems. Agents and applications can discover and use new tools without pre-programming, allowing for more flexible and adaptable AI deployments. This composability means that organizations can incrementally enhance their AI capabilities by adding new MCP servers without needing to modify existing applications.

For example, a company might initially deploy an AI assistant with access to document search capabilities through an MCP server. Later, they could add financial analysis capabilities by deploying a new MCP server—and the assistant would be able to leverage these capabilities without requiring major modifications to its core implementation.
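The document-search and financial-analysis scenario can be sketched as follows. This is a simplified model under stated assumptions, not an MCP client: the `Assistant` class, the `docs` and `finance` server names, and both tools are hypothetical, and servers are represented as plain dictionaries of functions.

```python
from typing import Callable, Dict, List

ToolFn = Callable[[str], str]


class Assistant:
    """Hypothetical assistant that discovers tools at request time."""

    def __init__(self) -> None:
        self._servers: Dict[str, Dict[str, ToolFn]] = {}

    def attach(self, server_name: str, tools: Dict[str, ToolFn]) -> None:
        """Register a new server; no other assistant code changes."""
        self._servers[server_name] = dict(tools)

    def discover(self) -> List[str]:
        """List every tool currently reachable, qualified by server name."""
        return sorted(
            f"{server}.{tool}"
            for server, tools in self._servers.items()
            for tool in tools
        )

    def run(self, qualified_name: str, query: str) -> str:
        server, tool = qualified_name.split(".", 1)
        return self._servers[server][tool](query)


assistant = Assistant()
assistant.attach("docs", {"search": lambda q: f"documents matching {q!r}"})
before = assistant.discover()

# Later: a second team deploys a finance server.
assistant.attach("finance", {"analyze": lambda q: f"analysis of {q!r}"})
after = assistant.discover()
```

The `Assistant` class is identical before and after the second `attach` call. Only the set of discoverable tools grows, which is the property the composability argument relies on.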

Enhanced Tool Integration and Context Management:

MCP provides a consistent framework for integrating external tools and capabilities into AI systems. This consistency makes it easier for developers to add new functionalities to their AI applications and for end-users to understand how to interact with those capabilities.

The protocol’s structured approach to context management ensures that models have access to the information they need in a well-organized format. This reduces the risk of context confusion and helps maintain consistent performance across different implementations. The clear separation between different types of contextual information (tools, resources, and prompts) also facilitates better governance and security practices.

Support for Enterprise Collaboration and Workflows:

The protocol aligns well with enterprise organizational structures, where different teams often maintain specialized services and capabilities. Teams can own specific services (such as vector databases, knowledge bases, or analytical tools) and expose them via MCP for other teams to use. This supports organizational separation of concerns while enabling cross-functional collaboration through standardized interfaces.

This alignment with enterprise workflows makes MCP particularly valuable for large organizations with diverse AI initiatives across multiple departments. It provides a common language for AI capabilities while respecting organizational boundaries and governance requirements.

Foundation for Self-Evolving Agent Systems:

MCP lays the groundwork for self-evolving agents that can grow more capable over time without requiring constant reprogramming. As new tools become available, for instance through the planned MCP registry, agents can discover and incorporate these capabilities dynamically, allowing for continuous improvement without manual intervention.

This foundation for evolving capabilities is especially valuable as organizations move toward more autonomous AI systems that need to adapt to changing requirements and opportunities.

These benefits collectively enable organizations to implement AI systems that are more interoperable, more easily extended, and better integrated into existing enterprise workflows and technology stacks.

MCP Challenges

Despite its numerous advantages, implementing MCP presents several significant challenges that organizations need to carefully consider. Understanding these limitations is essential for realistic planning and effective risk management when adopting the protocol.

Authentication and Security Framework Limitations:

One notable limitation of MCP in its current form is its relatively basic authentication mechanisms. While recent updates have improved OAuth integration, MCP lacks the comprehensive authentication frameworks that are essential for secure enterprise deployments across organizational boundaries.

This limitation becomes particularly significant when implementing MCP in environments where security is a critical concern, especially when AI systems need to access sensitive information or perform operations with potential security implications. Organizations implementing MCP in such environments will need to develop additional security layers to complement the protocol’s native capabilities.

Remote Server Management Complexity:

Although MCP has expanded to support remote MCP servers (over Server-Sent Events), managing these remote connections securely and reliably presents additional complexity. Organizations deploying MCP across distributed environments need to develop strategies for handling connection failures, latency issues, and security considerations.

This distributed architecture introduces potential points of failure that must be carefully managed, especially for mission-critical AI applications. Implementing robust monitoring, error handling, and recovery mechanisms becomes essential when deploying MCP at scale across distributed infrastructures.
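As one example of such a recovery mechanism, a call to a remote server can be wrapped in a retry loop with exponential backoff. The sketch below is a generic pattern rather than anything MCP prescribes: `call_with_retry` and `flaky_remote_call` are hypothetical names, and the failure is simulated with Python's built-in `ConnectionError`.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an operation on ConnectionError, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Simulated remote call that drops the first two attempts.
calls = {"count": 0}


def flaky_remote_call() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated dropped connection")
    return "ok"


delays: List[float] = []
outcome = call_with_retry(flaky_remote_call, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example testable without real waiting. In production the default `time.sleep` would apply, and adding jitter to the delay is a common refinement.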

Registry Development and Tool Discovery Maturity:

The planned MCP Registry for discovering and verifying MCP servers is still in development. Until this component is fully realized and mature, organizations face challenges in implementing dynamic tool discovery—one of the protocol’s key promised benefits.

Without a robust registry system, organizations must develop interim solutions for tool discovery and verification, potentially limiting the dynamic composition capabilities that make MCP valuable. This gap between the current implementation and the full vision for MCP requires pragmatic planning for organizations adopting the protocol today.

Connection Lifecycle Management:

MCP is still refining how it handles the distinction between stateful (long-lived) and stateless (short-lived) connections. This distinction is important for different types of AI applications, and the current[3] implementation may not fully address all use cases, particularly those requiring sophisticated state management across extended interaction sessions.

Organizations implementing MCP need to carefully consider their connection lifecycle requirements and may need to develop custom solutions for cases that fall outside the protocol’s current capabilities in this area.

Multi-Agent Coordination Limitations:

While MCP excels at connecting individual AI systems with tools and data, it provides less robust support for direct agent-to-agent communication in multi-agent systems where state is not necessarily shared. This limitation becomes apparent in complex agent ecosystems where multiple autonomous agents need to coordinate their activities directly.

For sophisticated multi-agent architectures, organizations may need to complement MCP with additional protocols or custom solutions to enable effective agent-to-agent communication, particularly when those agents operate across organizational boundaries or vendor environments.

Implementation Complexity and Learning Curve:

Adopting MCP requires investment in understanding and implementing the protocol’s specifications. For organizations with existing AI infrastructure, this may require significant refactoring of current systems to comply with MCP’s structural requirements.

This implementation complexity represents a real cost that must be factored into adoption planning. Organizations should expect to invest in developer training, refactoring existing code, and establishing new development practices aligned with the protocol’s requirements.

These challenges highlight the importance of careful planning when implementing MCP, particularly for organizations with complex security requirements or those building sophisticated multi-agent systems.

MCP Main Use Cases

MCP is particularly well-suited for several key application areas where its structured approach to context management delivers significant value. Understanding these optimal use cases helps organizations identify where MCP can provide the greatest return on implementation investment.

AI-Enhanced Development Environments:

MCP has gained significant traction in AI-enhanced coding environments and integrated development environments (IDEs). Tools like Cursor and Zed leverage MCP to provide developers with AI assistants that have rich access to contextual information, including code repositories, documentation, ticket systems, and development resources.

In these environments, MCP excels at:

  • Pulling in relevant code context from the current project
  • Accessing GitHub issues, documentation, and APIs
  • Enabling interaction with development tools and services
  • Maintaining appropriate context during extended coding sessions

The protocol’s standardized approach to context management makes it particularly effective for integrating AI capabilities into development workflows, allowing developers to work with AI assistance that truly understands their project context.

Enterprise Knowledge Management Systems:

MCP provides significant value in enterprise environments where AI needs to access, process, and reason over large volumes of organizational knowledge. The protocol’s clear structure for differentiating between various information sources helps maintain information integrity when AI systems need to reference multiple documents, databases, and knowledge bases simultaneously.

These knowledge management applications benefit from MCP’s ability to:

  • Access diverse document repositories with appropriate permissions
  • Query enterprise databases while maintaining security boundaries
  • Incorporate real-time information from organizational systems
  • Maintain clear provenance for information incorporated into analyses

This capability makes MCP ideal for implementing corporate knowledge assistants, document processing systems, and intelligent search applications that need to work across diverse information sources while maintaining appropriate security and governance.

Tool-Augmented Agents and Automated Workflows:

Organizations implementing AI Agents that need to leverage external tools benefit significantly from MCP’s standardized tool interface. Agents can autonomously invoke tools to search the web, query databases, perform calculations, or interact with enterprise systems through a consistent, well-defined interface.

This standardization makes it easier to:

  • Expand agent capabilities by adding new tools without changing the agent’s core implementation
  • Chain multiple tools together into sophisticated workflows
  • Maintain clear audit trails of tool invocations and results
  • Implement governance controls around tool access and usage

For example, a research assistant agent might use MCP to access scholarly databases, statistical analysis tools, and citation management systems—combining these capabilities dynamically based on specific research requests.
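Chaining, audit trails, and access controls can be combined in a thin wrapper around tool invocation. The following is a minimal sketch under stated assumptions: `AuditedToolbox`, its allowlist, and the `search`, `summarize`, and `delete_db` tools are hypothetical and are not defined by MCP itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class AuditedToolbox:
    """Hypothetical wrapper that enforces an allowlist and records every call."""
    tools: Dict[str, Callable[[str], str]]
    allowed: Set[str]
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def invoke(self, name: str, payload: str) -> str:
        if name not in self.allowed:
            # Denied calls are logged too, so the trail shows attempted access.
            self.audit_log.append((name, payload, "DENIED"))
            raise PermissionError(f"tool {name!r} is not allowed")
        result = self.tools[name](payload)
        self.audit_log.append((name, payload, result))
        return result

    def chain(self, names: List[str], payload: str) -> str:
        """Feed each tool's output into the next tool in the list."""
        for name in names:
            payload = self.invoke(name, payload)
        return payload


toolbox = AuditedToolbox(
    tools={
        "search": lambda q: f"papers on {q}",
        "summarize": lambda text: f"summary of {text}",
        "delete_db": lambda _: "deleted",
    },
    allowed={"search", "summarize"},
)
report = toolbox.chain(["search", "summarize"], "MCP")
```

Because every invocation passes through `invoke`, the audit log and the allowlist cannot be bypassed by a chained workflow, which is the governance property the list above points to.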

Domain-Specific AI Applications:

MCP provides an excellent foundation for building domain-specific AI applications that require access to specialized data sources or tools. In fields like finance, healthcare, or legal services, MCP allows developers to create AI systems that can interact with domain-specific resources through a standardized interface.

This standardization reduces the development effort required to build and maintain specialized applications by:

  • Providing a consistent pattern for integrating domain-specific tools
  • Enabling clear separation between the AI model and domain-specific resources
  • Facilitating compliance with domain-specific regulations through structured access controls
  • Allowing for modular updates to capabilities as domain requirements evolve

For instance, a healthcare AI assistant might use MCP to access medical terminology databases, electronic health record systems, and clinical decision support tools—all through a consistent interface that maintains appropriate clinical governance.

Self-Evolving Agent Systems:

The protocol enables the creation of self-evolving agents that can grow more capable over time without requiring constant reprogramming. These systems can:

  • Dynamically discover new tools via the registry
  • Combine MCP with computer vision for UI interactions
  • Chain multiple MCP servers for complex workflows (e.g., research → fact-check → report-writing)
  • Adapt to new information sources and capabilities as they become available

This capability is particularly valuable for organizations looking to build AI systems that can grow more sophisticated over time, adapting to changing requirements without requiring constant developer intervention.

These use cases highlight MCP’s strengths as a foundational layer for context-aware AI systems, particularly in environments where structured access to diverse information sources and tools is a key requirement.

[1] https://medium.com/@tahirbalarabe2/what-is-model-context-protocol-mcp-architecture-overview-c75f20ba4498

[2] https://gokhana.medium.com/what-is-model-context-protocol-mcp-how-does-mcp-work-97d72a11af8a

The post MCP vs A2A: Architecting AI Agent Communication for Enterprise appeared first on ML Conference.

]]>