ML Conference

The Expanding Scope of Observability for AI Systems

adharaneedharan@sandsmedia.com — Tue, 12 Aug 2025 14:19:59 +0000

As organizations accelerate their adoption of AI-powered tools—ranging from CodeBots to agentic AI—observability is rapidly shifting from a technical afterthought to a strategic business enabler. In our last article, “Observability in the Era of CodeBots, AI Assistants, and AI Agents”, we briefly touched upon key enhancement in the observability space. Continuing here – stakes are high for the next steps in Observability where AI systems are predicted to act autonomously, make complex decisions, and interact with humans and other agents in ways that are often opaque. Without robust observability, organizations risk not only technical debt and operational inefficiency, but also ethical lapses, compliance violations, and loss of user trust.

Join us at MLCon New York to attend Garima Bajpai‘s keynote & workshop LIVE!

Keynote : Charting the Way Forward for AI-Native Software Organizations

Workshop: Operationalizing AI Workshop – Leadership Sprints

The Expanding Scope of Observability

The traditional boundaries of observability—metrics, logs, and traces—are being redrawn. In the AI era, observability must encompass:

Fig. 1: The expanding scope of observability

Intent and Outcome Alignment: Did the AI system achieve what was intended, and can we explain how it got there?
Model and Data Drift: Are models behaving consistently as data and environments evolve?
Autonomous Decision Auditing: Can we trace and audit the rationale behind AI agent decisions?
Human-AI Interaction Quality: How effectively are developers and end-users collaborating with AI assistants?

In the next section, we’ll expand on each of the specific questions and outline the next steps.

Intent and Outcome Alignment

AI alignment refers to ensuring that an AI system’s goals, actions, and behaviors are consistent with human intentions, values, and ethical principles. Achieving intent and outcome alignment means the system not only delivers the desired results but does so for the right reasons, avoiding unintended consequences such as bias, or reward hacking. For example, if an AI is designed to assist with customer queries, alignment ensures it provides accurate, helpful responses rather than hallucinating or misleading users. Regular outcome auditing is essential—this involves evaluating real-world results to check for disparities or unintended effects, ensuring the AI’s outputs match the original intent and are explainable.

Observability is foundational for intent and outcome alignment because it makes the AI’s decision-making transparent and traceable, allowing stakeholders to explain, verify, and correct its behavior as needed.

Intent tracing and validation: Mechanisms to explicitly track the mapping from user intent to system objectives and emergent behaviors, allowing for validation that intent is preserved through each stage of the AI’s operation.
Robust logging of agent interactions: Especially for agentic AI, detailed logs of external actions, tool invocations, and inter-agent communications are necessary to detect misuse or unintended consequences.
Automated anomaly and misalignment detection: Integration of anomaly detection systems that can flag when observed behaviors deviate from expected, aligned patterns—potentially using machine learning to recognize subtle forms of misalignment.

Model and Data Drift

Model and data drift refer to the phenomenon where machine learning models gradually lose predictive accuracy as the data and environments they operate in evolve. This happens because the statistical properties of the input data or the relationships between features and target variables change over time, making the model’s original assumptions less valid. There are two primary types:

Data drift (covariate shift): The distribution of input features changes, but the relationship between inputs and outputs may remain the same.
Concept drift: The relationship between inputs and outputs changes, often due to shifts in the underlying process generating the data.

As data and environments evolve, observability is essential to ensure models behave consistently and maintain their predictive power. Advanced observability features—especially automated, real-time drift detection and diagnostics—are critical for robust, production-grade machine learning systems.

Drift detection: Observability tools can implement statistical tests (e.g., Population Stability Index, KL Divergence, KS Test) to compare incoming data distributions with those seen during training, flagging significant deviations.
Automated drift detection and alerting: Real-time, automated identification of both data and concept drift, with configurable thresholds and notifications.
Granular performance monitoring: Tracking model accuracy, precision, recall, and other metrics across different data segments and time windows to pinpoint where drift is occurring.

Autonomous Decision Auditing

Tracing and auditing the rationale behind AI agent decisions, especially in autonomous or agentic AI systems, is both possible and increasingly necessary, but it presents significant technical and organizational challenges. Auditing the rationale behind autonomous AI decisions is feasible with the right combination of observability, explainability, and compliance tools is of utmost importance.

As AI systems grow in complexity and autonomy, advanced observability features such as real-time monitoring, detailed logging, and integrated XAI—are essential for ensuring transparency, accountability, and trust.

Decision provenance tracking, recording the sequence of transformations and inferences leading to each decision.
Automated bias and fairness checks at both data and outcome levels, with alerts for detected issues.
Integration of XAI tools for on-demand explanation of individual decisions, especially in high-stakes or regulated environments.

Human-AI Interaction Quality

Developers and end-users are collaborating with AI assistants with increasing effectiveness, but the quality of these interactions varies widely depending on the application, the clarity of communication, and the feedback mechanisms in place. Observability in the context of human-AI interaction means having comprehensive visibility into both the AI’s internal decision-making processes and the dynamics of user-AI exchanges.

This enables:

Multimodal Analytics: Ability to combine quantitative metrics (e.g., error rates, session lengths) with qualitative data (e.g., sentiment analysis, user feedback) for a holistic view of interaction quality.
Integration with Human-in-the-Loop & in the Lead Systems: Seamless handoff and tracking between AI and human agents, ensuring continuity and accountability in complex workflows.
Automated Feedback Impact Analysis: Tools that automatically correlate user feedback with subsequent changes in AI behavior or performance, quantifying the value of human input.

Effective human-AI collaboration depends on robust observability, which empowers developers and end-users to monitor, understand, and continuously improve interaction quality.

Key Challenges Ahead

Complexity and Scale: AI-powered systems introduce unprecedented complexity. Multi-agent workflows, dynamic model updates, and real-time adaptation all multiply the points of failure and uncertainty. Observability solutions must scale horizontally and adapt to changing system topologies.
Data Privacy and Security: With observability comes the collection of sensitive telemetry—prompt data, user interactions, model outputs. Ensuring privacy, compliance (e.g., GDPR, HIPAA), and secure handling of observability data is paramount.
Semantic Gaps: Traditional observability tools lack the semantic understanding needed for AI systems. For example, tracing a hallucination or bias back to its root cause requires context-aware instrumentation and domain-specific metrics.
Standardization and Interoperability: Fragmentation remains a challenge. While projects like OpenTelemetry’s GenAI SIG are making strides, the ecosystem is still maturing. Vendor lock-in, proprietary data formats, and inconsistent APIs can hinder unified observability across diverse AI stacks.

Best Practices: Building AI-Aware Observability

Design for Explainability: Instrument AI systems with explainability hooks—capture not just what happened, but why. Integrate model interpretability tools (e.g., SHAP, LIME) into observability pipelines to surface feature importances, decision paths, and confidence scores.
Embrace Open Standards: Adopt open-source, community-driven observability frameworks (OpenTelemetry, LangSmith, Langfuse) to ensure interoperability and future proofing. Contribute to evolving standards for LLMs and agentic workflows.
Feedback Loops and Continuous Learning: Observability should not be passive. Establish automated feedback loops—use observability data to retrain models, refine prompts, and adapt agent strategies in near real-time. This enables self-healing and continuous improvement.
Cross-Disciplinary Collaboration: Break down silos between developers, data scientists, MLOps, and security teams. Define shared observability goals and metrics that span the full lifecycle—from data ingestion to model deployment to end-user interaction.
Ethics and Governance: Instrument for ethical guardrails: monitor for bias, fairness, and compliance violations. Enable rapid detection and remediation of unintended consequences.

The Road Ahead: From Observability to Business Enablement

The evolution of observability in the AI era is not just about better dashboards or faster debugging. It’s about empowering organizations to:

Build Trust: Transparent, explainable AI systems foster user and stakeholder confidence.
Accelerate Innovation: Rapid feedback cycles and robust monitoring enable faster iteration and safer experimentation.
Unlock Business Value: Observability becomes a lever for optimizing AI-driven business processes, reducing downtime, and uncovering new opportunities.

Conclusion: Closing the Strategic Gap

AI is rewriting the rules of software engineering. To harness its full potential, organizations must invest in next-generation observability—one that is AI-native, explainable, and deeply integrated across the stack. Leaders who prioritize observability will be best positioned to navigate complexity, drive responsible innovation, and close the strategic gap in the era of CodeBots, AI Assistants, and AI Agents.

References

Evaluating Human-AI Collaboration: A Review and Methodological Framework https://arxiv.org/html/2407.19098v1
https://galileo.ai/blog/human-evaluation-metrics-ai
Auditing of Automated Decision Systems https://ieeeusa.org/assets/public-policy/positions/ai/AIAudits0224.pdf
How Model Observability Provides a 360° View of Models in Production https://www.datarobot.com/blog/how-model-observability-provides-a-360-view-of-models-in-production/
Observability in the Era of CodeBots, AI Assistants, and AI Agents https://devm.io/devops/ai-observability-agents

FREQUENTLY ASKED QUESTIONS

Why is observability now a strategic business enabler in the AI era?

As organizations adopt CodeBots, AI assistants, and agentic AI, systems make opaque, autonomous decisions at scale. Without robust observability, teams risk technical debt, operational inefficiency, ethical lapses, compliance violations, and loss of user trust. The article argues observability must evolve from a technical afterthought to a strategic capability.

What expands the scope of observability beyond metrics, logs, and traces?

The article identifies four new focal areas: intent and outcome alignment, model and data drift, autonomous decision auditing, and human‑AI interaction quality. These dimensions reflect the behaviors of AI systems, not just infrastructure signals.

What is “intent and outcome alignment,” and why does it matter?

Alignment ensures an AI system’s goals, actions, and behaviors reflect human intentions and ethical principles. It means delivering desired results for the right reasons—avoiding bias, hallucinations, or reward hacking—and requires regular outcome auditing to verify that outputs match intent and remain explainable.

Which observability capabilities support intent alignment?

The text calls for intent tracing and validation to map user goals to system objectives and emergent behaviors. It also stresses robust logging of agent interactions (external actions, tool calls, inter‑agent messages) and automated anomaly/misalignment detection that flags deviations from expected patterns.

How do model drift and data drift differ?

Data drift (covariate shift) occurs when input feature distributions change while input‑output relationships may remain stable. Concept drift changes the relationship between inputs and outputs due to shifts in the generating process, eroding model assumptions and performance over time.

What drift monitoring features belong in production‑grade observability?

The article recommends statistical tests such as PSI, KL Divergence, and the KS Test to compare live vs. training distributions. It also calls for real‑time, automated drift detection with thresholds/alerts and granular performance tracking (e.g., accuracy, precision, recall) across segments and time windows.

What does autonomous decision auditing require for agentic AI?

Auditing needs decision‑provenance tracking to record the sequence of transformations and inferences leading to each decision. It should include automated bias/fairness checks with alerts and integrate XAI tools for on‑demand explanations, particularly in regulated or high‑stakes contexts.

How does observability improve human‑AI interaction quality?

By combining quantitative signals (error rates, session length) with qualitative insights (sentiment analysis, user feedback), teams gain a holistic view of interactions. Observability should support human‑in‑the‑loop/“in the lead” handoffs and track how feedback changes system behavior over time.

What key challenges complicate AI‑aware observability?

The article highlights complexity and scale (multi‑agent workflows, real‑time adaptation), privacy/security requirements for sensitive telemetry, and semantic gaps in traditional tools. It also notes fragmentation and limited interoperability despite progress from efforts like OpenTelemetry’s GenAI SIG.

Which best practices does the article recommend to build AI‑aware observability?

Instrument for explainability (e.g., SHAP, LIME), adopt open standards (OpenTelemetry, LangSmith, Langfuse), and close the loop by using observability data to retrain models and refine prompts. Cross‑disciplinary collaboration and ethics/governance monitoring (bias, fairness, compliance) are emphasized as ongoing practices.

The post The Expanding Scope of Observability for AI Systems appeared first on ML Conference.

Are AI Tools Hurting Developer Productivity?

adharaneedharan@sandsmedia.com — Wed, 30 Jul 2025 12:23:58 +0000

Sebastian Springer:

Lately, there have been several studies highlighting the negative aspects of AI: AI makes us less productive, less creative… I believe it really depends on how we use the tools. The same could be said about search engines or platforms like Stack Overflow. If I rely on such channels for every aspect of my work, I’d become less productive as well. With modern AI tools, the risk is naturally greater, since they’re much more integrated into our work environments and are far more intuitive to use.

On the topic of productivity: Personally, I feel more productive thanks to tools like Copilot and similar tools. That’s mainly because I use them to solve repetitive tasks. There are situations where writing a good prompt takes significantly longer than writing the code myself. And of course, working with AI tools comes with the risk of being distracted from the actual problem or heading in the wrong direction. In other cases, the suggestions the AI offers – without any manual prompt – are exactly what I need.

In general, I think: Whether AI makes us unproductive, uncreative, or even dumb – it’s a technology that’s established itself in the market, and one we simply can’t ignore. So, we should focus on leveraging its strengths. And if we already know it has downsides (as almost every technology does), we should try to avoid those pitfalls as much as possible. Besides, AI is in good company: People once claimed that steam engines would never be economical, newspapers would overwhelm us mentally, and written information in general was dangerous – let alone the internet, which supposedly makes people stupid and causes crime to skyrocket. There’s always a grain of truth in every accusation, but in the end, it all comes down to how we deal with it.

Paul Dubs:

Based on my experience with AI tooling for development, which I discussed in a keynote at the JAX conference in May, the impact on productivity is highly dependent on how these tools are used and the developer’s experience level with them. The study actually supports what I’ve observed: there’s a significant learning curve with AI development tools. The one developer in that study who had substantial prior experience with Cursor was notably faster, an anomaly that proves the point. Like any tool, you need to know how to use it effectively to see productivity gains.

During my keynote, I described using agentic AI coding tools as “playing chess with a pigeon”: they would destroy the game and claim victory. Claude Code struggled to navigate projects properly and would even sabotage its own progress by resetting the Git state. The Claude 3.5 / 3.7 models used in the study weren’t well-suited for larger changes or project navigation. However, things changed dramatically with Claude 4’s release at the end of May. Even the smaller, faster Sonnet model became quite capable when used correctly. I now use Roo Code, a Visual Studio Code plugin that allows me to create specialized prompts for different tasks: debugging, programming, documentation, and language-specific work. This customization has made me considerably more productive.

The productivity gains aren’t uniform across all project types. I’m much more productive on greenfield (new) projects. For brownfield (existing) projects requiring major changes, I need to provide extensive additional context, often directly referencing the specific files the AI needs to work with. When I handle the navigation burden myself, the AI can be quite effective. There’s an important caveat: using AI tools creates a knowledge and memory gap. Since I’m not writing every line myself, it feels like delegating to someone else and doing a quick review. When I return to AI-generated code later, I need to reread it because I don’t fully remember the implementation details. It’s similar to working on a project where multiple developers touch every piece: you lose that intimate familiarity with the codebase.

The study’s findings align with my experience: developers unfamiliar with AI tools often see productivity losses, while those with significant experience can achieve net gains. The outlier in the study who was more productive validates this. Success with AI coding tools requires understanding their limitations, using them appropriately for the task at hand, and accepting the trade-off between speed and deep code familiarity.

Christoph Henkelmann:

The issue with AI-assisted coding is the same as with many current AI debates: it’s dominated by hype and quick dismissals, rather than a nuanced understanding. Yes, AI tools can deliver massive productivity gains – but only if you actually learn how to use them. This means understanding the basics of LLMs, knowing your domain, and practicing with the tools until you develop a sense for when they help and when they don’t. Most people just install something like Cursor and expect miracles. Naturally, this leads to disappointment. “Vibe coding” might get you a prototype, but real productivity comes from what Paul Dubs calls “omega coding”: deep domain knowledge, familiarity with your tools, and persistent practice. These tools don’t replace thinking; they amplify skill. Managers hoping for instant results will see the opposite at first: initial productivity drops, much like switching to a new IDE. But if you invest the time to learn and adapt, the gains are real and substantial. Most don’t (or better: aren’t given the time to do so), which is why recent studies show lackluster results.

Melanie Bauer:

As an informatics student, I spend a lot of time researching and learning about new tools and topics, especially in the field of software development. AI tools have made this process significantly easier and faster for me. For example, when I have a question, I can get direct and precise answers without having to scroll through extensive documentation.

That’s why tools like GitHub Copilot, Cursor, and ChatGPT have become a regular part of my workflow as a future software developer. Of course, at the end of the day, AI doesn’t think for me, and I am still responsible for reviewing and validating the generated output. But overall, I’ve noticed a clear increase in my productivity, especially when it comes to routine tasks, reducing the ramp-up time when learning new technologies, or understanding code snippets and programming concepts by having them broken down and explained step by step.

Rainer Hahnekamp:

Based on my experience, the use of AI in software development can be divided into three levels:

Code Completion in the IDE: Here, AI offers valuable support by suggesting small code snippets that boost productivity without taking control away from the developer.
Automated Code Generation: In this area – where the AI generates larger code blocks or even entire files – I’ve found that the time required to correct and adapt the output often outweighs the immediate benefit. Still, I see this as an investment in learning how to work with AI effectively. While it may currently slow things down, I’m confident that the technology will improve – and when it does, I want to be ready to make the most of it.
AI-Supported Research and Conceptual Work: Using AI as a sparring partner for brainstorming, idea generation, and problem-solving has proven extremely helpful. It supports creativity and often leads to productive insights.

Personally, I can’t confirm a loss in productivity – quite the opposite. While I haven’t read the details of the referenced study, I suspect the reasons might be due to the current lack of best practices and the necessary intuition for using AI effectively. And, of course, to be transparent, this statement reflects my personal opinion, but the wording was created with the assistance of AI .

Pieter Buteneers:

I use the following AI tools:
Cursor (the agents are a big step forward) (my go-to tool)
ChatGPT (GPT 4.1) (if cursor makes mistakes)
Claude 4 Sonnet if the above don’t know the answer which is once every 2-3 weeks or so.

In terms of advantages, I started using typescript (ts) instead of python and it really helps me understand the syntax and convert python code into ts much faster. It writes fewer errors in ts than in python, allows me to write more code, writes unit tests for me and allows me to use new packages/technology much faster. It helps me with the DevOps side of things which I am a real noob at. Overall, it makes me about 2 times faster

I use it a lot to brainstorm ideas and figure out best/bad practices, but it comes with a huge list of caveats. There is a lot of code duplication since it doesn’t know your entire code. So, your code becomes hard to maintain and turns into ‘spaghetti’ fast. Cursor often fixes a bug by just writing some code to cover an edge case, but it doesn’t always go deep into the underlying problem, so you think you end up with a fix, where it is just an ugly patch and you don’t understand the code. Ultimately, you still need a senior dev to tell you what good code practices are. I spend more time debugging than writing code, so tests are even more important.

Tam Hanna:

At Tamoggemon Holdings, we currently use AI systems mainly for menial tasks. Using them to write stock correspondence (think cover letters, etc.) has shown to be a significant performance booster, allowing us to refocus on more productive tasks. As for line work (EE or SW), we – so far – have not seen the systems as a valid replacement to classic manual work.

Rainer Stropek:

After gaining extensive experience with modern AI tools, I can’t imagine my daily work as a developer without their support. My productivity has noticeably increased because I have consistently aligned my entire workflow around collaboration with AI. This goes far beyond classic code completion: autocomplete suggestions are convenient but often too generic and sometimes break my flow. Chat agents, by default, start every conversation from scratch. To work efficiently with them, one must formulate complete, consistent requirements in the prompt, prompt context, document the architecture, establish coding guidelines, and provide meaningful test data with expected results. This level of diligence would be advisable anyway; working with AI makes it essential.

Spec-Driven Development instead of Vibe Coding
Many developers underestimate prompting and context management. A few buzzwords are only enough as long as the goal remains vague. As soon as I face concrete customer requirements, I rely on Spec-Driven Development:

I invest significant time in detailing the requirements.
The AI questions and discusses the specification with me.
Only once a sufficient level of maturity is reached do I let the AI implement the solution and review the result.
It’s crucial to create clarity before I let the AI write code.

From Coder to AI Orchestrator
My role is shifting. Instead of primarily writing code, I define work packages that I delegate to AI agents. This is similar to delegating to human team members. I see my future in the role of a product developer with a strong focus on requirements engineering and software architecture – structuring complex requirements in a way that makes them executable by AIs.

Limitations of Today’s AI Systems (especially in large projects)
Despite larger context windows and advanced retrieval (e.g., using MCP servers or function tools integrated in IDEs), AI still lacks a holistic overview of large projects. Humans remain responsible for slicing and documenting tasks so they can be worked on without requiring knowledge of the entire project. If this is done successfully, project size becomes almost irrelevant to the use of AI.

What Companies Need to Do Now
The tool landscape is evolving in months, not years. Instead of committing to a single tool long-term, companies should:

Allocate budgets and create space for teams to experiment with various AI tools.
Deploy pilot groups to quickly gain hands-on experience.
Embrace usage-based pricing models and make their cost-benefit ratio transparent.

From my perspective, those who don’t start building practical experience now risk losing competitiveness. AI is no longer just a nice-to-have add-on – it is fundamentally changing the way we develop software. Those who ignore these new ways of working risk losing productivity and, in the medium term, competitiveness. Now is the time to sharpen specifications, rethink roles, and encourage experimental team setups.

Christian Weyer:

“Never trust a study you didn’t fake yourself” .
Just kidding, of course.
But seriously: At Thinktecture, we’ve seen an unprecedented productivity boost across the team. Personally, I feel significantly more creative – which directly translates into being faster and producing better results.

The key? I don’t let AI tools disrupt my natural flow. Instead, I deliberately configure them to fit my individual thinking and working style. Tools like GitHub Copilot, Windsurf, Cursor, or Cline all offer great ways to customize the experience with your own guardrails.

Maybe many developers don’t yet fully leverage these configuration options – or don’t even know they exist. Used right, these tools amplify productivity instead of hindering it.

Veikko Krypczyk:

In my experience, artificial intelligence can be meaningfully applied throughout all phases of the software development process – from early ideation and UI design to architectural decisions and the implementation of complex algorithms. AI is by no means flawless, but it acts as a virtual work partner that can complete many tasks faster, more diversely, and sometimes even more creatively than would be possible alone.

The actual productivity gain strongly depends on two key factors: the quality of the prompts and the critical evaluation of the generated content. Those who can formulate clearly and have solid domain knowledge will greatly benefit from AI tools – whether it’s generating boilerplate code, writing test cases, supporting refactoring, or systematically exploring technical options.

Of course, AI outputs should never be accepted without reflection. It remains essential for developers to understand, question, and, if necessary, improve the generated suggestions. Domain expertise is not replaced by AI – quite the opposite: it becomes even more crucial to ensure the quality of the outcomes.

My conclusion: when used properly, AI enhances efficiency and broadens perspectives – both individually and in team processes. I find working with AI tools inspiring, more efficient, and often more focused, as they help offload routine work and spark creative thinking. I only experience a loss in productivity when AI is treated as an autopilot rather than as a co-pilot.

Links & Literature

[1] https://arxiv.org/abs/2507.09089

Top 10 FAQs About AI Coding Tools & Developer Productivity

1. Do AI coding tools like GitHub Copilot and Cursor really improve developer productivity?
Yes, when used correctly. Many developers see faster coding and fewer repetitive tasks with tools like GitHub Copilot, Cursor, and Claude. However, beginners may initially experience slower workflows while learning to use them effectively.

2. Why do some developers become less productive with AI development tools?
A lack of training and experience with AI-powered coding assistants can cause slower progress at first. Without understanding prompt writing, debugging AI output, or configuring tools properly, productivity can drop.

3. What is the learning curve for GitHub Copilot, Cursor, and similar AI coding assistants?
Most developers need time to master AI-assisted development. Success comes from learning prompt engineering, adapting workflows, and knowing when to trust AI suggestions versus manual coding.

4. Can AI coding assistants replace human software developers?
No. AI tools can speed up tasks like code completion, boilerplate generation, and prototyping, but human expertise is essential for architecture design, problem-solving, and ensuring high-quality code.

5. How can developers get the most out of AI coding tools?
Use AI tools for repetitive coding, quick prototypes, and brainstorming. Always review AI-generated code, write clear prompts, and combine AI with strong coding fundamentals for the best results.

6. What are common problems with AI-generated code?
Developers often face duplicated code, messy “spaghetti code,” shallow bug fixes, and the need for extra debugging. Writing unit tests and applying good coding practices remains essential.

7. What is ‘spec-driven development’ and how does it help AI-assisted coding?
Spec-driven development involves writing detailed software specifications before using AI tools. This approach helps ensure that AI-generated code matches the project’s goals and reduces wasted time on rework.

8. What are the best AI coding tools for developers in 2025?
Popular options include GitHub Copilot, Cursor, Claude 4 Sonnet, Roo Code, ChatGPT (GPT-4.1), Windsurf, and Cline. Many developers use a combination of these for different coding tasks.

9. How do AI coding assistants perform in greenfield vs. brownfield projects?
AI assistants tend to be more effective in greenfield (brand-new) projects, where they can help build from scratch. Brownfield (existing) projects often require more manual guidance and context-setting.

10. How should companies prepare before rolling out AI-powered coding tools?
Run pilot programs, give developers time to experiment, avoid locking into one tool too soon, and provide training on prompt engineering and AI best practices for software development.

The post Are AI Tools Hurting Developer Productivity? appeared first on ML Conference.

MCP vs A2A: Architecting AI Agent Communication for Enterprise

adharaneedharan@sandsmedia.com — Mon, 21 Jul 2025 12:10:07 +0000

The field of AI is undergoing a significant architectural shift. We are moving from standalone AI systems that provide isolated capabilities toward interconnected ecosystems of specialized agents that collaborate to solve complex problems. This evolution mirrors the historical development of human organizations, where specialization and communication allowed for more sophisticated collective capabilities.

As AI systems grow more capable and autonomous, the need for standardized communication mechanisms becomes increasingly critical. Without established protocols, organizations face challenges including:

Technical Fragmentation: Teams developing separate integration methods for each agent pairing, leading to duplicated effort and inconsistent implementations.
Security Vulnerabilities: Ad-hoc communication systems often lack robust authentication, authorization, and data protection mechanisms.
Limited Composability: Without standardized interfaces, combining capabilities from different AI systems becomes prohibitively complex.
Governance Challenges: Tracking information flow, maintaining audit trails, and ensuring accountability becomes difficult when agent communication occurs through diverse, non-standardized channels.

AI Agent communication protocols aim to address these challenges by providing structured frameworks that define how agents advertise capabilities, request services, exchange information, and coordinate activities. These protocols serve as the foundational infrastructure upon which sophisticated multi-agent systems can be built.

In this evolving landscape, two significant protocols have emerged as potential industry standards: Model Context Protocol (MCP) developed by Anthropic and Agent-to-Agent (A2A) recently introduced by Google. Each brings a distinct perspective on how AI Agent communication should be structured, secured, and integrated into enterprise workflows.

Understanding the architectural foundations, benefits, limitations, and optimal use cases for each protocol is essential for organizations planning their AI infrastructure investments. This comparative analysis will help technical leaders make informed decisions about which protocol—or combination of protocols—best suits their specific requirements and use cases.

What is MCP?

Model Context Protocol (MCP) represents Anthropic’s approach to establishing a standardized framework for AI Agent communication and operation. At its philosophical core, MCP recognizes that as AI systems grow in complexity and capability, they require a consistent, structured way to interact with external tools, data sources, and services.

MCP emerged from the practical challenges faced by developers building sophisticated AI applications. Without standardization, each team was forced to develop custom integration methods for connecting their AI systems with external capabilities—resulting in duplicated effort, inconsistent implementations, and limited interoperability between systems.

The protocol addresses these challenges by providing a unified method for structuring the context in which AI models operate. It defines clear patterns for how information should be organized, how models should access external resources and tools, and how outputs should be formatted. This standardization allows for better interoperability between different AI systems, regardless of their underlying architecture or training methodology.

Rather than focusing on direct agent-to-agent communication, MCP emphasizes the importance of structured context—ensuring that AI systems have access to the information and capabilities they need in a consistent, well-organized format. This approach treats tools and data sources as extensions of the model’s capabilities, allowing for dynamic composition of functionality without requiring extensive pre-programming.

By providing this standardized interface for context management, MCP aims to reduce ecosystem fragmentation, enable more flexible AI deployments, and facilitate safer, more reliable AI systems that can work together coherently while maintaining alignment with human intentions and values.

MCP Architecture

MCP’s architecture is built around a hierarchical context structure that organizes information and capabilities into clearly defined components. This architecture follows several key design principles that prioritize clarity, security, and flexibility while maintaining clear separation between different types of contextual information.

Core Architectural Components:

MCP Host: The “brain” of the system—an AI application using a Large Language Model (LLM) at its core that processes information and makes decisions based on available context. The host is the primary consumer of the capabilities and information provided through the MCP framework.
MCP Client: Software components that maintain 1:1 connections with MCP servers. These clients serve as intermediaries between hosts and servers, facilitating standardized communication while abstracting away the complexities of server interactions.
MCP Server: Lightweight programs that expose specific capabilities through the standardized protocol. Each server is responsible for a discrete set of functionalities, promoting separation of concerns and allowing for modular composition of capabilities.
Local Data Sources: Files, databases, and services on the local machine that MCP servers can securely access. These provide the foundation for contextual information from the immediate environment.
Remote Data Sources: External systems available over the internet (typically through APIs) that MCP servers can connect to, expanding the potential information sources available to AI systems.

Figure 1 MCP Architecture (source: Tahir^[1])

Context Structure and Information Flow:

The MCP architecture implements a controlled information flow where context passes through defined pathways. When an MCP Host needs to access external information or capabilities, it connects to appropriate MCP Servers through MCP Clients. The servers then mediate access to various data sources, ensuring that information is properly formatted and permissions are appropriately handled.

This structured flow ensures that all processing occurs within well-defined boundaries, making it easier to track how information moves through the system and maintain security and accountability. The architecture explicitly distinguishes between:

Tools: Model-controlled actions that allow the AI to perform operations such as fetching data, writing to a database, or interacting with external systems.
Resources: Application-controlled data such as files, JSON objects, and attachments that can be incorporated into the AI’s context.
Prompts: User-controlled predefined templates (similar to slash commands in modern IDEs) that provide standardized ways to formulate certain types of requests.

Figure 2 is a sequence diagram showing the information flow between different components in a system that uses MCP to retrieve blog data, specifically SQL-related blog posts, for a user. This type of flow would be useful in a plugin-style AI integration where the AI needs to interact with external data sources via a protocol like MCP but requires explicit user permission and intelligent capability discovery.

Figure 2 MCP Workflow Example (source: Gökhan Ayrancıoğlu^[2])

Protocol Implementation:

MCP is designed to be transport-agnostic, though the initial implementation focuses on HTTP/HTTPS as the primary transport layer. The protocol defines standardized message formats for tool registration, tool invocation, and result handling, ensuring consistent interaction patterns regardless of the specific tools or data sources being accessed.

Recent developments in the protocol have expanded support for remote MCP servers (over Server-Sent Events) and integration with authentication mechanisms like OAuth, making it suitable for enterprise deployments where security and distributed access are essential requirements.

This architecture aims to create a standardized environment for AI processing, where information sources are clearly delineated, tools are discoverable and consistently invocable, and outputs adhere to predictable formats enabling safer multi-agent interactions and clearer accountability.

MCP Benefits

MCP offers several substantial advantages that make it particularly valuable for organizations implementing enterprise-grade AI systems. These benefits directly address common challenges in AI development and deployment, providing tangible improvements in development efficiency, system flexibility, and organizational collaboration.

Ecosystem Standardization and Reduced Fragmentation:

One of MCP’s most significant benefits is the reduction in ecosystem fragmentation. Before standardized protocols, every team building AI applications had to develop custom integrations for connecting their systems with tools and data sources. This resulted in duplicated effort, inconsistent implementations, and limited interoperability.

MCP addresses this challenge by providing a standardized way to connect AI systems with external capabilities. This standardization significantly reduces development overhead and creates a more cohesive AI ecosystem where components can be easily shared and reused. Organizations can develop MCP servers once and leverage them across multiple AI applications, maximizing return on development investments.

Dynamic Composability of Capabilities:

MCP enables dynamic composability of AI systems. Agents and applications can discover and use new tools without pre-programming, allowing for more flexible and adaptable AI deployments. This composability means that organizations can incrementally enhance their AI capabilities by adding new MCP servers without needing to modify existing applications.

For example, a company might initially deploy an AI assistant with access to document search capabilities through an MCP server. Later, they could add financial analysis capabilities by deploying a new MCP server—and the assistant would be able to leverage these capabilities without requiring major modifications to its core implementation.

Enhanced Tool Integration and Context Management:

MCP provides a consistent framework for integrating external tools and capabilities into AI systems. This consistency makes it easier for developers to add new functionalities to their AI applications and for end-users to understand how to interact with those capabilities.

The protocol’s structured approach to context management ensures that models have access to the information they need in a well-organized format. This reduces the risk of context confusion and helps maintain consistent performance across different implementations. The clear separation between different types of contextual information (tools, resources, and prompts) also facilitates better governance and security practices.

Support for Enterprise Collaboration and Workflows:

The protocol aligns well with enterprise organizational structures, where different teams often maintain specialized services and capabilities. Teams can own specific services (such as vector databases, knowledge bases, or analytical tools) and expose them via MCP for other teams to use. This supports organizational separation of concerns while enabling cross-functional collaboration through standardized interfaces.

This alignment with enterprise workflows makes MCP particularly valuable for large organizations with diverse AI initiatives across multiple departments. It provides a common language for AI capabilities while respecting organizational boundaries and governance requirements.

Foundation for Self-Evolving Agent Systems:

MCP enables the creation of self-evolving agents that can grow more capable over time without requiring constant reprogramming. As new tools become available through the MCP registry, agents can discover and incorporate these capabilities dynamically—allowing for continuous improvement without manual intervention.

This foundation for evolving capabilities is especially valuable as organizations move toward more autonomous AI systems that need to adapt to changing requirements and opportunities.

These benefits collectively enable organizations to implement AI systems that are more interoperable, more easily extended, and better integrated into existing enterprise workflows and technology stacks.

MCP Challenges

Despite its numerous advantages, implementing MCP presents several significant challenges that organizations need to carefully consider. Understanding these limitations is essential for realistic planning and effective risk management when adopting the protocol.

Authentication and Security Framework Limitations:

One notable limitation of MCP in its current form is its relatively basic authentication mechanisms. While recent updates have improved OAuth integration, MCP lacks the comprehensive authentication frameworks that are essential for secure enterprise deployments across organizational boundaries.

This limitation becomes particularly significant when implementing MCP in environments where security is a critical concern, especially when AI systems need to access sensitive information or perform operations with potential security implications. Organizations implementing MCP in such environments will need to develop additional security layers to complement the protocol’s native capabilities.

Remote Server Management Complexity:

Although MCP has expanded to support remote MCP servers (over Server-Sent Events), managing these remote connections securely and reliably presents additional complexity. Organizations deploying MCP across distributed environments need to develop strategies for handling connection failures, latency issues, and security considerations.

This distributed architecture introduces potential points of failure that must be carefully managed, especially for mission-critical AI applications. Implementing robust monitoring, error handling, and recovery mechanisms becomes essential when deploying MCP at scale across distributed infrastructures.

Registry Development and Tool Discovery Maturity:

The planned MCP Registry for discovering and verifying MCP servers is still in development. Until this component is fully realized and mature, organizations face challenges in implementing dynamic tool discovery—one of the protocol’s key promised benefits.

Without a robust registry system, organizations must develop interim solutions for tool discovery and verification, potentially limiting the dynamic composition capabilities that make MCP valuable. This gap between the current implementation and the full vision for MCP requires pragmatic planning for organizations adopting the protocol today.

Connection Lifecycle Management:

MCP is still refining how it handles the distinction between stateful (long-lived) and stateless (short-lived) connections. This distinction is important for different types of AI applications, and the current^[3] implementation may not fully address all use cases, particularly those requiring sophisticated state management across extended interaction sessions.

Organizations implementing MCP need to carefully consider their connection lifecycle requirements and may need to develop custom solutions for cases that fall outside the protocol’s current capabilities in this area.

Multi-Agent Coordination Limitations:

While MCP excels at connecting individual AI systems with tools and data, it provides less robust support for direct agent-to-agent communication in multi-agent systems where state is not necessarily shared. This limitation becomes apparent in complex agent ecosystems where multiple autonomous agents need to coordinate their activities directly.

For sophisticated multi-agent architectures, organizations may need to complement MCP with additional protocols or custom solutions to enable effective agent-to-agent communication, particularly when those agents operate across organizational boundaries or vendor environments.

Implementation Complexity and Learning Curve:

Adopting MCP requires investment in understanding and implementing the protocol’s specifications. For organizations with existing AI infrastructure, this may require significant refactoring of current systems to comply with MCP’s structural requirements.

This implementation complexity represents a real cost that must be factored into adoption planning. Organizations should expect to invest in developer training, refactoring existing code, and establishing new development practices aligned with the protocol’s requirements.

These challenges highlight the importance of careful planning when implementing MCP, particularly for organizations with complex security requirements or those building sophisticated multi-agent systems.

MCP Main Use Cases

MCP is particularly well-suited for several key application areas where its structured approach to context management delivers significant value. Understanding these optimal use cases helps organizations identify where MCP can provide the greatest return on implementation investment.

AI-Enhanced Development Environments:

MCP has gained significant traction in AI-enhanced coding environments and integrated development environments (IDEs). Tools like Cursor and Zed leverage MCP to provide developers with AI assistants that have rich access to contextual information, including code repositories, documentation, ticket systems, and development resources.

In these environments, MCP excels at:

Pulling in relevant code context from the current project
Accessing GitHub issues, documentation, and APIs
Enabling interaction with development tools and services
Maintaining appropriate context during extended coding sessions

The protocol’s standardized approach to context management makes it particularly effective for integrating AI capabilities into development workflows, allowing developers to work with AI assistance that truly understands their project context.

Enterprise Knowledge Management Systems:

MCP provides significant value in enterprise environments where AI needs to access, process, and reason over large volumes of organizational knowledge. The protocol’s clear structure for differentiating between various information sources helps maintain information integrity when AI systems need to reference multiple documents, databases, and knowledge bases simultaneously.

These knowledge management applications benefit from MCP’s ability to:

Access diverse document repositories with appropriate permissions
Query enterprise databases while maintaining security boundaries
Incorporate real-time information from organizational systems
Maintain clear provenance for information incorporated into analyses

This capability makes MCP ideal for implementing corporate knowledge assistants, document processing systems, and intelligent search applications that need to work across diverse information sources while maintaining appropriate security and governance.

Tool-Augmented Agents and Automated Workflows:

Organizations implementing AI Agents that need to leverage external tools benefit significantly from MCP’s standardized tool interface. Agents can autonomously invoke tools to search the web, query databases, perform calculations, or interact with enterprise systems through a consistent, well-defined interface.

This standardization makes it easier to:

Expand agent capabilities by adding new tools without changing the agent’s core implementation
Chain multiple tools together into sophisticated workflows
Maintain clear audit trails of tool invocations and results
Implement governance controls around tool access and usage

For example, a research assistant agent might use MCP to access scholarly databases, statistical analysis tools, and citation management systems—combining these capabilities dynamically based on specific research requests.

Domain-Specific AI Applications:

MCP provides an excellent foundation for building domain-specific AI applications that require access to specialized data sources or tools. In fields like finance, healthcare, or legal services, MCP allows developers to create AI systems that can interact with domain-specific resources through a standardized interface.

This standardization reduces the development effort required to build and maintain specialized applications by:

Providing a consistent pattern for integrating domain-specific tools
Enabling clear separation between the AI model and domain-specific resources
Facilitating compliance with domain-specific regulations through structured access controls
Allowing for modular updates to capabilities as domain requirements evolve

For instance, a healthcare AI assistant might use MCP to access medical terminology databases, electronic health record systems, and clinical decision support tools—all through a consistent interface that maintains appropriate clinical governance.

Self-Evolving Agent Systems:

The protocol enables the creation of self-evolving agents that can grow more capable over time without requiring constant reprogramming. These systems can:

Dynamically discover new tools via the registry
Combine MCP with computer vision for UI interactions
Chain multiple MCP servers for complex workflows (e.g., research → fact-check → report-writing)
Adapt to new information sources and capabilities as they become available

This capability is particularly valuable for organizations looking to build AI systems that can grow more sophisticated over time, adapting to changing requirements without requiring constant developer intervention.

These use cases highlight MCP’s strengths as a foundational layer for context-aware AI systems, particularly in environments where structured access to diverse information sources and tools is a key requirement.

^[1] https://medium.com/@tahirbalarabe2/what-is-model-context-protocol-mcp-architecture-overview-c75f20ba4498

^[2] https://gokhana.medium.com/what-is-model-context-protocol-mcp-how-does-mcp-work-97d72a11af8a

The post MCP vs A2A: Architecting AI Agent Communication for Enterprise appeared first on ML Conference.

Can GenAI Build Satellites?

adharaneedharan@sandsmedia.com — Thu, 17 Jul 2025 08:09:40 +0000

When we think of Satellites we imagine cutting-edge technology, precision engineering, and brilliant minds at work.

What we rarely see is the chaos behind the scenes: thousands of pages of documents, conflicting requirements, scattered diagrams, and complexity that grows faster than humans can manage. Modern systems engineering isn’t just about designing hardware — it’s about managing an overwhelming flood of information.

How does GenAI Helps System Engineers Stay Ahead of Complexity?

1. Writing Better Requirements

Projects start with requirements — and often, they’re a mess. Vague sentences like “The car should drive fast” cause confusion down the line.

GenAI reviews these early, flags weak spots, and helps rewrite them clearly and precisely. This saves time, prevents misunderstandings, and reduces risk later.

2. Finding Duplicates and Patterns

Large projects mean thousands of requirement sentences. Some repeat. Some contradict.

AI analyzes this data, finds duplication, flags inconsistencies, and helps keep things clean and organized.

3. Designing System Diagrams

Once requirements are sorted, diagrams follow.
With Generative AI, engineers give simple prompts:

“Connect sensor, battery, data logger.”
AI draws it.
“Add brightness control.”
It updates.
No wasted hours redrawing from scratch.

4. RAG: Smarter Document Search for Engineers

PDF overload is real. Retrieval-Augmented Generation (RAG) lets AI search internal documents safely and quickly — no data sent to the cloud.

“What’s the max voltage?”
AI finds it. Work moves forward.

Watch the Full Session on YouTube:

Join “Building an Agentic RAG Pipeline” live at MLCon NY 2025 to see how it’s already changing the game.

The post Can GenAI Build Satellites? appeared first on ML Conference.

RAG & MCP for ML Devs: Practical AI Implementation

adharaneedharan@sandsmedia.com — Mon, 07 Jul 2025 07:21:42 +0000

Join us at MLCon New York to dive into these topics even deeper and hands on.

Business & Culture: https://mlconference.ai/machine-learning-business-strategy/ai-native-software-organization-technical-operational-practices/

https://mlconference.ai/generative-ai-content/gen-ai-operational-metrics/

https://mlconference.ai/machine-learning-business-strategy/operationalizing-ai-leadership-sprints-workshop/

GenAI & LLM: https://mlconference.ai/generative-ai-content/building-agentic-rag-pipeline/

https://mlconference.ai/generative-ai-content/agentic-ai-workshop-deep-research-llms/

devmio: Hello everyone, we’re live from JAX in Mainz, and I am here with Robert Glaser from INNOQ. Hello Robert.

Robert Glaser: Hi Hartmut, I’m glad I could make it.

devmio: Great. You’re here at JAX talking about AI. A controversial topic for some. Some love it, some are a little put off by it. Of course, there are benefits from it. Where do you see the benefits of AI?

Robert Glaser: I get that question a lot, and I usually disappoint people by saying, “I can’t give you a list of use cases in an hour.” Because use cases are a dime a dozen, since we’re dealing with a new kind of technology. Think of electricity, steam engines, or the internet. People didn’t immediately know, “OK, here are 100 use cases we can do with this and not with that.”

Everyone in every department, in every company, has to find out how well or how poorly it fits with the current state of technology. There are formats and possibilities for finding use cases. Then you also evaluate the potential outcome, look at what the big levers are, what is likely to bring us the most benefit when we implement it? And that’s how I would approach it. Whenever someone promises to name AI use cases for your business in a 30-minute conversation, I would always be cautious.

devmio: Now you have a use case, a RAG system, retrieval of augmented generation. Can you explain how a RAG works? Maybe first, what exactly is it? And then, of course, the question. Yes, for you out there, is this a use case that is exciting for you? What are the benefits of a RAG system?

Robert Glaser: Yes, I can explain briefly. A RAG is not a new concept at all. It comes from an academic paper from 2005. At that time, people weren’t talking about foundation models or large language models like we are today.

RAGs come from the domain of Natural Language Programming, which is how it’s used today. How do I get context for my AI model? Because the AI model only becomes interesting when I connect it to my company data. For example, in Confluence or some tables or databases or something like that. Then I can use this technology in conjunction with my data. And RAG is an architecture, if you want to call it that, which allows me to do just that.

That’s changing right now, too. But the basic principle that I’m showing in my presentation is that, in principle, a large part of the RAG architecture is information retrieval. All of us who work in tech have known this for many decades. A search, for example, is an information retrieval system.

This is often at the center of RAG and is the most interesting point, because the classic RAG approach is nothing more than pasting search results into my prompt. That’s all there is to it, it’s also something I mention in my talk. Well, we can end the presentation here. There’s nothing more to it. You have your prompt, which says, for example: Answer this question, solve this problem using the following data.

The data that comes in and that it is also relevant. That’s the crux of the matter. That’s what RAG deals with. For the most part, it’s a search problem: How do I cut my body of data into pieces that fit together? I can’t always just shovel everything in there. Models like this have a context window; they can’t digest endless amounts of data.

It’s like with an intern: if I pile the desk full, they may be a highly qualified intern, but they’ll get lost if the relevant documents aren’t there. So, I have to build a search that gets the relevant document excerpts for the task that the LLM is given. And then it can produce an answer or a result and not fantasize in a vacuum or base it on world knowledge, but rather on my company’s knowledge, which is called grounding. The answers are grounded, citation-based, like how a journalist works with facts. They must be there.

devmio: -and with references!

Robert Glaser: Exactly, then I can also say what the answer should look like. Please don’t make any statements without a source. That helps too. I’ll show this best practice in my talk.

devmio: So, you have this company data linked to the capabilities of modern LLMs. How do I ensure that the company data remains secure and is not exploited? It becomes publicly accessible via LLM.

Robert Glaser: Exactly. In principle, you will probably build some kind of system or feature or an internal product. In the rarest of cases, you will probably just connect your corporate customers to ChatGPT—you could do that.

There’s a lot going on right now, for example, Anthropic is currently building a Google Drive connection or a Google Mail connection. This allows you to search your Google Mails, which is incredibly powerful. But maybe that’s not always what you want as a company. That’s why you have to build something, you have to develop something. If it’s a chatbot, then it’s a chatbot. But every AI feature can potentially be augmented with RAG.

You have to develop software, and the software probably uses an AI model via an API. Either this AI model is in the cloud or in your own data center with a semi-open model, and then you must consider: how do I get the data from the search system or from the multiple search systems into the prompt. This involves a relatively large amount of engineering—classic engineering work. We can expect that developers of such systems will simply build them more often in the future.

devmio: Good, and another point that is becoming increasingly important is AI agents. These are agents that can do more than just spit out a result after a prompt. In this context, the MCP protocol, or model context, is often mentioned. How does the model context protocol work exactly? Maybe you could explain MCPs and how they work?

Robert Glaser: We’ve been having a lot of conversations about this at our booth downstairs. It’s not without reason that people are so interested in this topic; there’s enormous hype surrounding it. It’s an open standard developed by the company Anthropic, which also trains the model, and it was released last November. And at the time of release, nobody was really interested in it. Right now, about a month ago, AI time is also somehow compressed in my head– it’s going so fast.

Every day, there are hundreds of new MCP servers for a wide variety of services and data sources. In principle, MCP is nothing more than a very simple specification for a protocol that I use to connect a service or data to my AI model. Some call it “USB-C for AI models”. I personally think that’s a pretty good metaphor, even if it’s not 100% technically correct, but the plug always looks the same.

But it doesn’t matter what I plug into a device, be it a hard drive, USB stick, or alarm siren. The protocol is always the same. What it does is stay in the metaphorical world. It gives AI models hands and feet to do things in the real world. Tools are a concept of this protocol, i.e., tool usage. But there are also several other things.

These things can also simply provide prompts, which means I could write in MCP Server, which acts as a prompt library. Then the LLM can choose for itself which prompts it wants to use.

Something very interesting is tool use, also nothing new, but Foundation models have been trained on tool usage for a long time. But until now, I always had to tell them: “Look, here’s an open API spec for an API that you can use in this case, and you use it like this, and then there’s the next tool with a completely different description.”

This tool might be a search function, which would have a completely different API that you use like this. And then you can operate on the arrow system, and you do that with STANDARD in STANDARD out. Imagine you need an AI agent or an AI feature and have to describe 80 individual tools!

Every time you start from scratch and with each new tool, you must write a new description. The nice thing about MCP is that it standardizes this and acts as a facade, so to speak. The foundation model or the team that builds an AI feature doesn’t need to know how these services and data work or how to use them, because the MCP server takes care of that and exposes the functionality to the AI model using the same protocol.

It’s also nice if, for example, I’m a large company and I have one team building the search function in the e-commerce shop and another team building the shopping cart, then we are very familiar with their domains. This makes it very easy for them to write an MCP server for the respective system. Then I could say, for example: Dear shopping cart team, please build an MCP server for the shopping cart because we want to roll out an AI feature in the shop and need access to these tools.

So, it also fits in nicely with a distributed team architecture like a kind of platform through which I can access a wide variety of tools and LLMs. I could also build a platform for this, but basically, it’s just small services, called servers, that I simply connect to a model. And I don’t always have to describe a different interface, but always the same one.

I don’t have to wait for Open AI to continue training its models so that the MCP can use them, because we all know that I can simply enter the MCP specifications now and then the model can learn from the context on how to use them. I’m not dependent on having to use a model right now. I can teach them everything.

devmio: Yes, what I’m noticing is that MCPs have really taken off. Among other things, because it’s not just one topic, but also Open AI and Google, I think. And others are now on board and supporting it, so there’s a kind of standardization.

Robert Glaser: It’s like how the industry agreed on USB-C or, back then, A. That was good for everyone because then you no longer had 1,000 cables. It’s like how I’m developing an AI feature now and the possibilities are basically endless. People like to think only in terms of data sources or services, but a service can be anything, it can be an API from a system that returns bank transactions or account transactions to me. Here’s a nice example that I always use: There’s an MCP server for Blender, a 3D rendering software, which I can use and then I can simply ask: render a living room with a nice lamp and a couch with two people sitting on it.

Then you can see how this model uses tools via the PC server and then creates this 3D scene in Blender. That’s the range of possibilities, truly endless.

devmio: Now we’re at a JAX, a Java conference. That’s really an AI topic, so how does it relate to Java? Maybe we’ll come back to that later. Java is a language that isn’t necessarily known for being widely used in machine learning; Python is faster. Is Java well positioned in this area?

Robert Glaser: To use these services, or maybe that’s not even necessary, but these are all just API calls. Yes, in fact, with the foundation models, he has also introduced a huge commodity factor. In the past, I had to train my own machine learning models. I had to have a team that labeled data and so on, and then it was still a case of narrow in, narrow out, now broad in, broad out.

I have models that can utilize everything from textual data to molecular structures or audio and video, because everything can be mapped into a sequential stream. And the models are generalists which can always be accessed via API.

Even the APIs are becoming standardized because all manufacturers are adapting the Open AI API spec. That’s a nice effect. That’s why there’s a great ecosystem in both the Java world and the Ruby or Python world.

I’ll just mention Spring AI. I could have started a year ago and built AI features into my applications. Spring AI, for example, has a range of functions that allow me to configure models flexibly. I can say, go to Open AI or use a semi-open model in our data center. Everything is already there

devmio: But the community there is so agile, creative, and innovative that solutions are emerging and already available.

Robert Glaser: Another aspect with MCPs, if I want to build a feature or a system with Java around my foundation model that other systems want to integrate via MCP, there is a Java SDK, there is an SDK, everything is there.

devmio: We’re excited to see how things progress. It’s sure to continue at a rapid pace.

Robert Glaser: You must read a hundred blogs every day to keep up with all of the AI innovations. It’s breathtaking to see the progress that’s being made. It’s nice to see that things are happening in tech again.

devmio: Thank you very much for that. For the insights here.

Robert Glaser: You’re very welcome. Thank you for letting me be here.

The post RAG & MCP for ML Devs: Practical AI Implementation appeared first on ML Conference.

Deep Learning with Category Theory for Developers

skansal — Tue, 29 Apr 2025 09:31:46 +0000

Deep learning frameworks like TensorFlow or PyTorch hide the complex machinery that makes neural network training possible. Specifically, it’s not a Python program running in the background that defines the network, but a graph. But why is this the case?

Anomaly detection in production processes

The starting point of this article is the development of a product for detecting anomalies in the sensor data of industrial production systems. These anomalies can signal issues that would otherwise go undetected for a longer time. A very simple example is a cold storage facility for food: if the temperature rises despite electricity flowing into the cooling unit, something is wrong—but the food can still be saved if the change is detected in time.

Unlike in the example, in many systems the relationship between sensor values is too complex to know in advance exactly what an anomaly looks like. This is where neural networks come into play: they are trained on “normal” sensor data and can then report deviations in the relationship between sensor values or over time.

As a customer requirement, the neural networks for this application must run on machines with limited computing power and memory. This rules out heavy frameworks like TensorFlow which are too large and memory intensive.

Graphs for neural networks

In Listing 1, you can see a simple neural network in Python, an autoencoder implemented with the TensorFlow framework. If you typically write analytics code in Python, you might be surprised that TensorFlow includes its own operations for matrix calculations instead of using established libraries like Pandas. This is because tf.matmul doesn’t perform any calculations. Instead, it constructs a node in a graph that describes the neural network’s computation function. This graph is then translated into efficient code in a separate step – for example, on a GPU.

Listing 1

def __call__(self, xs):

  inputs  =  tf.convert_to_tensor([xs], dtype=tf.float32)

  out = tf.matmul(self.weights[0], inputs, transpose_b=True) + self.biases[0]

  out = tf.tanh(out)

  out = tf.matmul(self.weights[1], out) + self.biases[1]

  out = tf.nn.relu(out)

  out = tf.matmul(self.weights[2], out) + self.biases[2]

  out = tf.tanh(out)

  return tf.matmul(self.weights[3], out) + self.biases[3]

However, graph serves another purpose: training the neural network requires its derivative, not the function describing the network. The derivative is calculated by an algorithm that also relies on the graph. In any case, this graph machinery is complex; to be specific, it involves a complete Python frontend that translates the Python program into its graph. In the past, this was a task deep learning developers had to do manually.

What is a derivative and what is it for?

Neural networks (NNs) are functions with many parameters. For an NN to accomplish wondrous tasks such as generating natural language, suggesting suitable music, or detecting anomalies in production processes, these parameters must be properly tuned. As a rule, an NN is evaluated on a given training dataset, and a number is calculated based on the outputs to express their quality. The derivative of the composition of the NN and the evaluation function with respect to the NN’s parameters provides insight into how the parameters can be adjusted to improve the outputs.

You can imagine it as if you’re standing in a thicket in a hilly landscape, trying to find a path down into the valley below. The coordinates where you are currently standing represent the current parameter values, and the height of the hilly landscape represents the output of the evaluation function (higher values indicate a larger deviation from the desired result). You can only see your immediate surroundings and you don’t know ultimately in which direction the valley lies. However, the derivative points in the direction of the steepest ascent—if you move a short distance in exactly the opposite direction and repeat this process from there, the chances are high that you will eventually reach the valley. The main job of deep learning frameworks (DLFs) in this iterative process (known as gradient descent, by the way), is to repeatedly compute derivatives and adjust the parameters accordingly.

Derivatives

For a differentiable function f : ℝ → ℝ the definition of the derivative at a point a as taught in school mathematics, is:

In this definition, the derivative represents the rate of change at the point a. Training a neural network involves functions from a high-dimensional space—where the parameters reside—into the real numbers, which represent the network’s performance. For a function like f : ℝm → ℝ, the partial derivative with respect to the i-th component is defined by holding all other components constant:

For a differentiable function, the derivative can then be expressed as a vector of partial derivatives. As will become clear in the next section, subcomponents of neural networks must also be differentiated. These typically produce higher-dimensional outputs as well. A function f: ℝm → ℝn consists of its component functions,f: x ↦ (f₁(x), …, f_n(x)); the derivatives of these component functions are defined just as before. For such a (totally) differentiable function, the derivative (Df) can be represented as the so-called Jacobian matrix, where the entry in the i-th row and j-th column is the partial derivative of the i-th component function with respect to the j-th input variable. All of these values describe how the corresponding parameters need to be adjusted.

Derivatives and Deep Learning

To calculate Jacobian matrices, DLFs use a process called Automatic Differentiation (AD), which automates the process that is manually performed in school mathematics. The chain rule plays a special role here, explaining how even complex functions can be differentiated. If two simpler functions f and g are composed in series (written g ∘ f) the derivative is as follows:

AD uses this rule and replaces the elementary components of a composition with their known, predefined derivatives. The individual components of the resulting expression can be evaluated in any order. Typically, this is done either right-associatively, meaning first Ｄf and then Ｄg (forward AD), or left-associatively, first Ｄg and then Ｄf (reverse AD).

The computational effort associated with the forward case depends on the input dimension, in this case, the number of parameters, which is often very large. While for the reverse case, it depends on the output dimension (i. e 1, which is the output of the evaluation function). Therefore, DLFs typically implement AD as “reverse AD”.

However, the algorithms used in DLFs for reverse AD are quite complex. This is mainly because the derivative of a component can only be calculated once the output of the preceding component is known. This is not an issue for the forward case, as a function and its derivative can be computed in parallel, with the results passed to the next component. However, in the backward case, the evaluations of the functions and their derivatives occur in opposite directions.

DLFs implement reverse AD using an approach that translates the entire implementation of an NN and the associated optimization method into a computational graph. To compute a derivative, this graph is first evaluated forward, and the intermediate results of all operations are recorded. The graph is then traversed backwards to compute the actual derivative. As described in the previous section, the result is a highly complex programming model. For interested readers, we recommend reading this article [1] for a detailed description of the mathematical principles behind AD algorithms, and this article [2] for an overview of the most common implementation approaches used today.

Derivatives revisited

However, derivatives can also be defined in a more general way: For (normalized, finite-dimensional) vector spaces V and W a function f : V → W is differentiable at a point a ∈ V0, if there exists a linear map Ｄf_a : V ⊸ W such that f(a + h) = f(a) + Ｄf_a(h) + r(h), where the error term r(h) = f (a + h) − f (a) − Ｄf_a(h) holds:

If such a map Ｄf_a, exists, it is unique and is called the derivative of f at a. This perspective treats derivatives as local, linear approximators, without focusing on details such as matrix representations, partial derivatives, or the like. While it is more general and abstract, it is simpler than the concept of the “derivative as a number”. Furthermore, a DLF can use it to compute derivatives without relying on graphs or complex algorithms.

Deep Learning with Abstract Nonsense

The ConCat framework for deep learning uses this new perspective (Elliott, Conal: “The simple essence of automatic differentiation” [3], [4], [5]), by interpreting the function that defines the neural network differently – namely, by computing its derivative rather than the original function itself. Another branch of mathematics, known as category theory—and affectionately called ‘abstract nonsense’ by connoisseurs—offers yet another way of looking at functions. It’s a treasure trove of useful concepts for software development, particularly in the realm of functional programming.

Category theory is, of course, about categories – a category is a small world containing so-called objects and arrows (“arrows” or “morphisms”) between these objects. The term “object” here has nothing to do with object-oriented programming. For example, functions form a category – the objects are sets that can serve as the input or output of a function. However, it’s important to note that objects don’t have to be sets and arrows don’t have to be functions either.

Category theory, then, “forgets” almost everything we know about functions and assumes only that arrows can be composed – that is, chained together. So, if there’s an arrow f from an object a to an object b and another arrow g from b to a third object c, then there must also be a uniquely defined arrow a to c, written g ∘ f. In the category of functions, this is simply function composition as described above, which is why the same symbol is used.

Composition must also satisfy the associative law, and there must be an identity arrow between any two objects — though this isn’t particularly relevant for our purposes here. Some categories also come with additional features, such as products which correspond to data structures in programming.

The analogy between arrows and functions in a computer program goes so far that every function can be expressed as an arrow in the category of functions. And here’s the clever trick behind the ConCat framework: it takes a function definition from a program, translates it into the language of category theory, but then “forgets” which category it came from. This makes it possible to target an entirely different category instead. In the case of deep learning, this is the category of function derivatives.

The idea, then, is to interpret a “normal” function as its derivative. Here’s an example: f (x) = x².

f no longer represents the square function itself, but its derivative Ｄf(x) = 2x. However, to allow for the definition of more complex functions, composition must be supported — meaning that from the two derivatives Ｄf and Ｄg of f and g respectively, the derivative of g ∘ f must be computable. This might work using the chain rule mentioned above, but it doesn’t rely solely on Ｄf and Ｄg. It also requires access to the original function f, as well as a representation of the derivative as a linear map. The ConCat framework therefore defines a category that pairs each function with its derivative. This way, the derivative is computed automatically whenever the function is called.

Thus, a function f : a → b turns into a function of the following form:

This means that for every input x from a, the result now includes both the original function value f (x) in b and the original derivative Ｄf, which is a linear map from a to b. This is now enough to express the chain rule “compositionally”, that is, in a way that makes the derivative of the composition, Ｄ (g ∘ f), depend solely on Ｄf and Ｄg. Here’s how it works:

Where

and

apply.

Category Theory and Functional Programming

The mathematical idea from the previous section can also be implemented in Python, but it’s particularly simple and direct in the functional language Haskell [6] which is what the ConCat framework is written in. It includes a literal translation of the mathematical definition; GD stands for General Derivative:

data GD a b = D (a -> b :* (a -+> b))

The chain rule also mirrors the mathematical definition (with \ denoting a lambda expression in Haskell):

D g . D f = D (\ x -> let (y,f’) = f x

(z,g’) = g y

in (z, g’ . f’))

This code forms the foundation of the deep learning framework, which comprises just a few hundred lines and comes bundled with ConCat. The framework also includes a Haskell compiler plug-in that automatically translates regular Haskell code into categorical form. You could write the code this way from the outset, but the plug-in makes the task much easier. This approach forms the core of the production system used for the anomaly detection described at the beginning.

What about reverse AD?

But there’s one more thing: deep learning requires the reverse derivative, and Ｄ only gives the forward derivative. The reverse derivative is obtained using a surprisingly simple trick—also from category theory. You can create a new category from an existing one by simply reversing the arrows. To do this, the linear functions a -+> b in the definition of GD only needs to be replaced with:

data Dual a b = Dual (b -+> a)

Dual Dual also includes correspondingly “reversed” definitions of the linear functions along with a reversed version of the composition. But then the reverse AD is ready, without any complicated algorithm.

High Performance with Composition

Reverse AD removes one of the reasons why TensorFlow must take the complicated route with a computational graph. But there’s still a second one: execution on a GPU. So far, the code still runs as regular Haskell code on the CPU. Fortunately, there’s Accelerate [7], a powerful Haskell library that enables high-performance numerical computing on the GPU.

Accelerate only has one catch: it can’t run standard Haskell code that operates directly on matrices. A function that transforms a matrix of type Typ a into a matrix of type b has the following type in Accelerate:

Acc a -> Acc b

Acc is a data type for GPU code that Accelerate assembles behind the scenes. Accordingly, an Accelerate program cannot use the regular matrix operations, but must instead rely on operations that work on Acc. This is similar to the Python program that generates a graph. That’s just the nature of GPU programming. Here, it doesn’t leak into the definitions of the neural networks, or their derivatives and it has no impact on how the ConCat framwork is used.

ConCat has a solution for this too – the Accelerate functions can be defined as a category as well. So, all it takes is one more application of ConCat to the result of reverse AD, and the result is high-performance GPU code.

Summary

Understanding the foundations of deep learning isn’t all that hard when using the right kind of math. It’s easy to get lost in the weeds of Jacobian and reverse AD matrices, but category theory offers a more abstract and elegant perspective that, remarkably, also captures reverse AD with surprising ease. At the same time, the math also serves as a blueprint for implementation – at least in the right programming language, such as the functional language Haskell. Haskell is well worth a closer look, especially for deep learning – though its strengths go far beyond that.

Links & Literature

[1] Hoffmann, Philipp: “A hitchhiker’s guide to automatic differentiation”: https://arxiv.org/abs/1411.0583

[2] Van Merriënboer, Bart et al: “Automatic differentiation in ML. Where we are and where we should be going”: https://arxiv.org/abs/1810.11530

[3] Elliott, Conal: “The simple essence of automatic differentiation”: http://conal.net/papers/essence-of-ad/

[4] Conal Elliott: “Compiling to categories”: http://conal.net/papers/compiling-to-categories/

[5] ConCat Framework: https://github.com/compiling-to-categories/concat

[6] Haskell: https://www.haskell.org

[7] Accelerate: https://www.acceleratehs.org

The post Deep Learning with Category Theory for Developers appeared first on ML Conference.

Exploring DeepSeek: How a New Player is Challenging AI Norms

skansal — Thu, 13 Mar 2025 09:40:25 +0000

Niklas Horlebein: The release of DeepSeek-R1 made quite an impression. But why? There are several open source LLMs. Is the impact because of its – allegedly – low model training costs?

Pieter Buteneers: There are a few special things about the DeepSeek AI release. The first is the fact that it’s very affordable. When using their reasoning model, you’ll pay about 10 times less than what you would pay at OpenAI. This came as a shock to the sector. So far, AI companies have poured billions of dollars into research and people could only buy in at a very high price. But now – apparently – new players are pushing for lower costs.

The other thing is, obviously: How were they able to train it with 6 million dollars, allegedly?

That’s only half of the story: It’s not just the 6-million-dollar figure. Basically, 6 million dollars is what brought them from having the V3 baseline model, with a simple question and an answer without reasoning, to improving it so that it can also reason about itself. It has become a reasoning model. Switching from one state to the other is what cost 6 million dollars.

Stay up to date

Learn more about MLCON

That wasn’t much, as the V3 model already existed, so it wasn’t that expensive to train. I couldn’t find exact details on how much they paid for it, but allegedly, training it was also quite cheap.

What struck a lot of people was what’s called “distillation”. While this isn’t exactly the right term, the media picked up on it. Distillation is typically used with a large parent model to create a smaller version of that model that mimics its behavior. Sometimes, you succeed in receiving almost the same performance with the smaller model by using data generated by the larger one for training. That’s typically how distillation is used.

In this case, the term loosely applies to what DeepSeek AI did. They already had a base model that they wished to turn into a reasoning model. Using examples from OpenAI – and possibly others – to generate training data, they then trained their own model.

This made the process a lot cheaper, as they were able to piggyback on existing models and mimic them to achieve similar results, ensuring their model learned the same reasoning steps. Of course, this was a big shock. It showed how easy it has become to copy another player’s approach. This scares a lot of competitors.

The fourth thing is that they didn’t pay much – just 6 million to go from one model to another. However, Nvidia is not allowed to export their newest graphics cards to China, so China is forced to work with older hardware when training models. Apparently, they may have found a workaround, but in theory, they used Nvidia hardware one or two generations behind what European and American players are using, and yet still managed to achieve a model comparable to OpenAI.

Explore Generative AI Innovation

Generative AI and LLMs

Learn more

The founder of Anthropic, Dario Amodei said in defense, “they’re a year behind.” I don’t think they’re actually a year behind, especially compared to OpenAI. Anthropic is a bit ahead. However, it shows that they’re not that far behind. A year in this industry is a long time, and a lot can change in that period. However, looking at the bigger picture: a new company coming out with a model like this and only being a year behind is incredible. It is insane how quickly they’ve managed to catch up, even with relatively modest resources, a smaller budget, and with older hardware.

Niklas Horlebein: Sounds like a good trade-off: To be just a year behind, but save hundreds of millions by doing so. Let’s stick with money: From your perspective, how realistic are DeepSeek’s claims about low training costs and reduced need for computing power?

Pieter Buteneers: It’s realistic if you can make your baseline model capable of reasoning. Then 6 million is pretty reasonable and achievable. But achieving a model that performs at least as well as their baseline model with that amount is very unlikely. They likely used other resources and the 6 million was mainly for switching to a reasoning model. It appears that’s the phase the money went into: taking a question-answer model and making it reason about itself before providing an answer.

But obviously, it’s marketing – really, really good marketing – especially at a time when OpenAI and Anthropic have been raising billions in new capital. They have shown off that they can achieve something similar with amounts of money two orders of magnitude lower. This is an obvious shockwave, and they will certainly reach the same result with far fewer dollars. They exaggerated a bit by not mentioning the full picture, but still, it’s not something to be underestimated.

Niklas Horlebein: If it’s true that you can train a model for a much smaller amount of money, why exactly did or do competitors need so much capital? Was much of the budget for marketing?

Pieter Buteneers: When it comes to building a model and all of the steps required to get there, you can optimize the cost and learn from mistakes others made to go from A to B. But if you’re on the front row and pushing the envelope, those mistakes have not been made yet. You need a budget to make those mistakes. That’s the disadvantage of being a front-runner – you don’t know what’s coming or which hurdles you might encounter. This requires a bigger budget.

To use OpenAI as an example, they’ve been trying to release GPT-5 for months. Theoretically, a massive training run for GPT-5 was finished in December, although we do not know all of the details.

However, because running the model was so expensive, and because it only performed slightly better than GPT-4, they have released it as GPT-4.5 by now. So, it wasn’t worth it from a price-performance standpoint. They must completely rethink how they train models.

That process costs money. That’s the price you pay to be first, to be the front-runner. It’s crazy to see how a small Chinese player is only one year behind. It shows they’re catching up really, really quickly.

Niklas Horlebein: Is it really one year behind? What can you tell us about R1’s performance after a few weeks? How does it compare to other major LLMs?

Pieter Buteneers: If you compare it’s non-reasoning performance, it performs slightly below GPT-4o, but above Llama 3.3. In my opinion, it is the best-performing open source model at the moment. And as a bonus, it is the first reasoning model that is also open source. So, on the open source side, DeepSeek AI is at the bleeding edge of what you can get.

When comparing it to other closed source reasoning models, OpenAI’s o1 and o3 mini models are maybe a tiny bit better. Anthropic is again quite a few steps ahead. So, DeepSeek AI is not doing bad at all; it’s actually pretty good. And, in my opinion, it’s at most half a year behind.

Niklas Horlebein: DeepSeek’s models are “open weight”, not fully open source. What should developers know about the difference?

Pieter Buteneers: Most models are open weight, including Meta’s. This simply means that you can download the model to your computer, and you’re free to run it locally and modify it as much as you want, depending on the license. However, you don’t know what data the model has been trained on, nor do you know what training algorithm, setup, or procedure was used to train the model.

Essentially, they are an even greater black box than traditional AI models you train yourself. You only have the weights you can use to let it do tasks for you and you receive an output. But if you don’t know what the training data is, you do not know what the model’s knowledge base is. This limits your ability to inspect and learn from the model. You must reverse-engineer it, in essence, because this black-box system makes it very hard to learn what it’s really on the inside.

For example, with DeepSeek AI, you can see some moral judgments were made during the training process. If you write a prompt about Tiananmen Square or Xi Jinping, the model won’t provide an answer. It will try to avoid the topic and redirect the conversation elsewhere. This is because, in China, the law – or possibly culture – simply forbids discussion on these topics. It’s a moral judgment the creators made: if they want the model to survive in China, they must ensure it doesn’t say anything about these subjects.

However, all large language model providers make these decisions. Even Grok 3, which claims to be the snarkiest model and claims to say things as they are, will not tell you that Elon Musk is making mistakes or something similar. It’s trained not to.

They’ve inserted a moral judgment into it. None of these models will tell you how to buy unlicensed firearms on the internet. These topics have been removed, even though the data may exist somewhere on the internet and is likely in the training data.

The creators of these models make moral judgments about what is socially acceptable but you can only do that well by influencing the training process. However, with an open weight model, these decisions have already been made for you. You have no impact on how these models behave from a moral standpoint.

Niklas Horlebein: Would fully open source models allow for that?

Pieter Buteneers: Yes. There’s still research to be done, but at least all the information is on the table.

With fully open source models, you have access to the training scripts and data and can make your own moral judgments. You can tweak the model so that it behaves in a way that aligns with what you believe. Standards like “do not kill other humans” are universally accepted, but every country or region has its own specific things that you’re not supposed to say.

I’m Belgian, and in Belgium, it’s okay to joke about Germans in the Second World War. But you probably wouldn’t do that in Germany, right? It’s not socially accepted. What’s morally acceptable really depends on the country and culture.

You can try making the model very generalist and avoid the subject of Nazis in general, but you don’t know what each country considers okay or not. Letting big organizations like DeepSeek AI, OpenAI, Anthropic, or Mistral make those decisions limits our societal impact on what these models – and possibly, in the future, a form of AGI (Artificial General Intelligence) – will deem acceptable.

That is why, in my opinion, it’s crucial to understand the inner workings of these models. You need the data, and you need to control the training process yourself to get the desired output.

This way, society can learn and understand how these models work and defend itself against the potential risks these models pose. As long as models are not open source, we as a society have very limited control over what goes in and what does not.

Niklas Horlebein: We’ve already seen the short-term impact of DeepSeek AI. From your perspective, what will be its long-term implications for the industry? Will we see more open source competitors now that training models may be more financially plausible?

Pieter Buteneers: Their claim to train models on a comparatively shoestring budget will encourage other players to consider doing something similar.

The fact that the DeepSeek AI models are open source will attract developers and researchers to build on top. Meta understood that well. They were one of the first to make their biggest and smartest models open source. Mistral and others tried as well, but usually with smaller models.

More and more players will feel the pressure to open source at least the weights, and I hope it will become a competition about who can be the most open source. There are already discussions about other models and releasing parts of their training data, so you can already watch the “one-up” game begin. The player who releases the most open source model – one that includes everything – will form the baseline for much of the research that follows, as it allows others to take that model and continue building on it.

If a model is released fully open source with everything included, everything society builds on top of will be able to be downloaded from the internet as code and be integrated into models.

One of the reasons Meta decided to open source their models was to form the foundation for research on top of the Llama models and to be the de facto standard in LLM research going forward.

There’s also the fact that open sourcing attracts talent. Very few people want to work day and night, filling another billionaire’s pockets. Without making a lot of money themselves, without being allowed to talk about it. That’s another reason companies are considering open source: it attracts the talent they need.

You saw this when OpenAI became the de facto “ClosedAI”: a lot of people left the company and looked elsewhere. Now, they’re big enough to still attract talent, but for smaller player open sourcing is a way to make a difference. You’ll see more open source models going forward, as many people are asking for.

Open weight is the bare minimum. With that, the model can run anywhere. The training data might be more academic, so clients aren’t necessarily asking for it. But when open sourced, it creates a richer community around the research and the results of that research will be much easier to integrate back into the original model. This will be a competetive advantage for open source models.

Niklas Horlebein: Large U.S. companies seemed shaken by the unexpected competitor, but quickly stated they would stick to their approach of investing heavily in AI development. Do you think they will maintain this stance, or will OpenAI change its plans?

Pieter Buteneers: In the beginning, after DeepSeek AI was released, there was a bit of silence and shares dropped. People thought that perhaps models don’t need so much money. But a few weeks later, big players like OpenAI and Anthropic were raising massive amounts of capital. Being at the leading edge of this technology still requires huge amounts of money. That’s the sad part.

This is true throughout history: there are always innovators who launch something at a very high cost, and later, copycats try to mimic and replicate it at a much lower cost. But they’re different players, and they attract a different audience. Being on the cutting-edge attracts customers who want the most powerful and easiest-to-integrate solutions. OpenAI is still the best at that. Their models might not be the most performant compared to Anthropic, but they’re easier to integrate into code.

The answers are structured better; they’re consistently improving over time. ChatGPT is much better at formatting tables and structuring it’s output than only a few months ago. That continuous improvement makes them more snackable, and easier to use for both users and developers. All of this requires investment.

Copycats replicate something to achieve a similar benchmark performance, but it does not also mean that the product is easy to build with or pleasant to use. It’s simply good based on the benchmarks.

It’s like looking at the Nürburgring and putting a race car on it. You may have a car that can lap the Nürburgring really fast – but only a professional driver can drive it.

For regular people on the road, it doesn’t change anything – because they can’t drive it, and it has no worth to them. It’s only good in the benchmarks. That’s the risk with these models: they score well on benchmarks but aren’t pleasant to use.

That’s where OpenAI, and more specifically the knowledge Sam Altman brought from Y Combinator, really comes into play. The focus is on user experience.

That edge is much harder to mimic. It takes many small iterations to achieve a model output that is meaningfully better.

Stay up to date

Learn more about MLCON

Niklas Horlebein: How do you feel about the “rise of DeepSeek AI” in general? Do you believe it will have a positive impact on AI development – or a negative one?

Pieter Buteneers: Generally, more competition is good for the consumer, and in this case, I’m a consumer. I use this technology to do legal due diligence for lawyers. Being able to use this technology from multiple companies and letting them compete on price and performance is amazing. The only ones that suffer are the companies themselves as they try to one-up each other, but that’s always good for everyone else.

Although, it’s sad that it’s a Chinese player and you can’t get it hosted within the EU zone for a reasonable cost, like on Azure or AWS – or at least I haven’t found it yet.

This may change in the future, and once it does, it will become more accessible for GDPR enthusiasts, too. We can use this technology in our solutions with the guarantee that the Chinese government isn’t watching.

Let these companies compete, let them fight it out. The more players working towards AGI, the more checks and balances there are. In an ideal scenario, everything will be fully open source with checks and balances in place. It’s good that there’s an extra player in the field, helping create checks and balances. Yes, indeed.

Niklas Horlebein: Is there anything else about DeepSeek AI or the current LLM landscape that you would like to add?

Pieter Buteneers: Reasoning models are very interesting to me, personally, and the fact that DeepSeek AI has a reasoning model is an important breakthrough. That shouldn’t be underestimated, but the uptake of reasoning models in practical applications is still low. It’s easy in a chat situation, where users ask a question and let the model do the work. Their reasoning is helpful in some cases with difficult questions, though not in most cases.

At Emma Legal we automate the legal due diligence, which is fairly structured so we know very well what to check for and how to reason. So, we have our own dedicated checks and balances in place to ensure that the models aren’t hallucinating and that documents it pulls are relevant for the due diligence.

So in well-built AI applications, reasoning models aren’t often used, as you can already determine what kind of reasoning you need beforehand and build it yourself at a much lower cost.

Niklas Horlebein: Pieter, thank you so much for your time!

The post Exploring DeepSeek: How a New Player is Challenging AI Norms appeared first on ML Conference.

Agentic AI: The Future of Business Process Automation

skansal — Wed, 26 Feb 2025 10:36:12 +0000

Artificial intelligence has demonstrated its transformative power in various aspects of business, from customer engagement to data analysis. However, a new frontier is emerging: Agentic AI, a paradigm that moves beyond static automation into dynamic, self-optimizing systems.

Businesses today rely on rigid, predefined workflows that often struggle to adapt to real-time changes. Agentic AI introduces intelligent, autonomous agents that can analyze, plan, execute, and refine business processes without human intervention.

AI has already transformed multiple facets of business operations. From automating repetitive tasks to providing deep insights through data analytics, AI has become an indispensable tool for enterprises across industries (see Figure 1). The adoption of AI-powered solutions has surged, enabling businesses to increase efficiency, reduce costs, and enhance customer experiences.

Figure 1: AI Market Size Trends (source: Precedence Research)

However, the current AI landscape is still largely dominated by Generative AI and Predictive Analytics, which, while powerful, have limitations when it comes to real-time decision-making and autonomous execution.

While Generative AI models like GPT-4, Claude, and Gemini have demonstrated impressive capabilities in content creation, customer support, and decision support, they operate reactively. These AI models require explicit prompts and cannot independently plan, execute, and optimize business workflows without human intervention.

What is Agentic AI?

Agentic AI represents a paradigm shift in Artificial Intelligence, moving beyond traditional rule-based automation and generative AI models. Unlike conventional AI, which primarily assists users by generating responses or executing predefined tasks, Agentic AI possesses autonomy, decision-making capabilities, and adaptability. These AI agents operate with a level of independence, dynamically adjusting to new inputs and evolving environments.

Autonomy: AI agents function independently, making decisions based on dynamic input and real-time data.
Adaptability: AI agents continuously learn from new information and adjust their strategies accordingly.
Multi-Agent Collaboration: Multiple AI agents work together, dividing tasks, verifying results, and optimizing workflows.
Context Awareness: AI agents interpret their environment, understand business needs, and take relevant actions.
Goal-Oriented Execution: Unlike simple automation scripts, AI agents set and pursue long-term objectives.

Agentic AI Core Technologies

Before exploring Agentic AI impact on the business process layer, you must have a clear understanding of Agentic AI and its foundational technologies.

The rise of Agentic AI is powered by several advanced technologies (see Figure 2), including:

Large Language Models (LLMs): These models provide a foundation for understanding and generating human-like text, enabling AI agents to interpret and respond intelligently to queries.
Reinforcement Learning: This technique allows AI agents to refine decision-making processes through trial and error.
Multi-Agent Systems: These frameworks enable multiple AI entities to collaborate, enhancing efficiency and scalability.
APIs and Integration Layers: AI agents leverage APIs to interact with enterprise systems, retrieving and processing data in real time.

Figure 2: AI Agent Reference Architecture (source: Debmalya Biswas)

The Business Process Layer and Its Challenges

To fully grasp the impact of agentic AI on business operations, you must understand the role of the business process layer within a modern enterprise architecture. Organizations operate through interconnected layers, each with distinct responsibilities. AI agents are uniquely positioned to transform the business process layer, driving automation, intelligence, and adaptability at an unprecedented scale.

A Layered Model for Business Operations

A modern enterprise can be conceptualized as a multi-layered structure, with each layer playing a crucial role in ensuring smooth operations and efficient decision-making. These layers typically include:

Business Layer: Value, competencies, processes, and services aspects;
Information Layer: The application systems and data components;
Technology Layer: The platform and infrastructure components.

Figure 3: Overview of the Common Enterprise Layers (source: Polovina et al.)

Thus, the Business Process layer serves as the critical bridge between enterprise applications and customer interactions, ensuring that business operations are streamlined, consistent, and aligned with strategic goals.

Understanding the Business Process Layer

The Business Process layer serves as the backbone of enterprise operations, orchestrating workflows that connect different functions within an organization, that is, this layer acts as the “central nervous system” of an organization, dictating how work gets done and ensuring that all systems operate in sync. At this you’ll find systems for inventory management, supply chain logistics, customer service operations, and financial transactions.

The Business Process layer is responsible for structuring and managing workflows across the enterprise. Its key functions include:

Process Orchestration: Ensuring seamless execution of tasks by integrating various enterprise systems.
Workflow Automation: Standardizing repetitive tasks to improve efficiency and reduce human intervention.
Decision Logic Execution: Implementing business rules and logic that guide operations, such as approval workflows and compliance checks.
Data Coordination: Managing the flow of information between systems, ensuring consistency and accuracy.
Operational Monitoring: Tracking process performance and identifying bottlenecks or inefficiencies.

However, while business processes are essential for operational efficiency, they often suffer from significant limitations:

Fragmentation: Processes are frequently siloed across departments, leading to inefficiencies.
Manual Intervention: Many workflows still require human oversight, slowing down execution.
Lack of Adaptability: Traditional automation struggles with dynamic and unpredictable business environments.
Data Silos: Information is often stored in separate systems, making real-time decision-making difficult.

Most enterprise business processes today are highly structured but inflexible. Conventional automation, such as Robotic Process Automation (RPA), is rule-based and rigid. It works well for repetitive, predictable tasks but lacks the adaptability needed to handle complex decision-making scenarios and fails to adapt to dynamic conditions. This often results in inefficiencies, bottlenecks, and missed opportunities.

Agentic AI overcomes these limitations by introducing autonomy and intelligence into the Business Process layer. AI agents offer a solution by:

Enhancing adaptability: AI agents dynamically adjust processes based on real-time data.
Reducing human intervention: They autonomously manage decision-making within workflows.
Optimizing operational efficiency: AI agents continuously refine workflows to maximize performance.
Scaling intelligence across departments: AI-powered automation enables seamless coordination between different business functions.

Examples of Agentic AI in the Business Process Layer

Here’s why agentic AI is causing a paradigm shift:

Dynamic and Self-Optimizing Workflows

Traditional workflows are often rigid, requiring manual intervention to adjust to new conditions. AI agents can continuously learn from real-time data and modify workflows dynamically.

Example:

A supply chain AI agent can detect a delay from a supplier and automatically reroute orders to an alternative vendor while notifying relevant stakeholders.

Autonomous Decision-Making in Business Processes

AI agents can analyze vast amounts of data and make autonomous decisions based on predefined objectives and real-time insights. This reduces reliance on human intervention and speeds up decision-making.

Example:

In loan processing, an AI agent can assess a customer’s creditworthiness by analyzing financial patterns and automatically approve or flag applications for review.

Intelligent Process Coordination Across Departments

Business processes often span multiple departments, requiring coordination and data exchange. AI agents can streamline these interactions by serving as intermediaries, ensuring smooth collaboration.

Example

An AI agent in an HR department can coordinate recruitment by screening resumes, scheduling interviews, and updating the internal applicant tracking system in real-time.

Real-Time Exception Handling and Adaptive Learning

Unlike traditional automation, which fails when encountering unexpected scenarios, agentic AI can handle exceptions by learning from new data and adapting responses accordingly.

Example

A customer service AI agent can escalate complex complaints to a human agent while analyzing recurring issues and recommending policy adjustments.

Seamless Integration Between Systems of Engagement and Systems of Record

Organizations struggle with siloed systems where customer-facing applications don’t communicate effectively with back-end databases. AI agents act as intelligent connectors, ensuring seamless data exchange.

Example

A retail AI agent can sync real-time inventory data between an e-commerce website and a warehouse management system to prevent stockouts.

Continuous Compliance Monitoring and Regulation Adaptation

Regulatory landscapes change frequently, requiring businesses to update their processes accordingly. AI agents can monitor compliance requirements, detect non-compliance risks, and suggest automatic adjustments.

Example

In finance, an AI agent can track new anti-money laundering regulations and adjust transaction monitoring rules accordingly.

How AI Agents Are Reshaping the Business Process Layer

AI Agents are fundamentally transforming the business process layer by introducing automation, intelligence, and adaptability at an unprecedented scale. This section delves into the key ways that AI Agents are driving changes and their implications for enterprises.

Automating Repetitive Tasks

AI agents excel at automating repetitive, rule-based tasks that traditionally required human intervention. Examples include:

Data entry and validation: AI-powered systems in banking automatically verify and process customer forms, reducing errors and improving efficiency.
Processing invoices and financial transactions: Companies like SAP use AI-driven automation to process vendor invoices with OCR technology, ensuring timely payments and reconciliation.
Managing customer service inquiries through chatbots and virtual assistants: E-commerce platforms like Amazon use AI chatbots to handle routine customer queries, allowing human agents to focus on complex issues.

By automating these tasks, businesses can reduce costs, minimize human errors, and free up employees to focus on more strategic activities.

Enhancing Decision-Making with AI-Driven Insights

AI agents can analyze vast amounts of data and provide real-time insights that aid decision-making. This capability is crucial in areas such as:

Supply chain management: AI-driven demand forecasting at Walmart helps optimize inventory levels, preventing stockouts and overstock situations.
Marketing: Netflix uses AI to analyze user behavior and recommend personalized content, increasing user engagement and retention.
Risk management: AI-driven fraud detection in financial services, like Visa’s AI-powered fraud prevention system, reduces fraudulent transactions and enhances security.

These insights help businesses make informed, data-driven decisions quickly and accurately.

Enabling Dynamic and Adaptive Workflows

Traditional business processes are often static and require manual intervention for adjustments. AI agents introduce adaptability by:

Continuously monitoring workflows and adjusting operations based on real-time data: AI in manufacturing, such as predictive maintenance by Siemens, minimizes downtime by proactively addressing equipment failures.
Identifying bottlenecks and optimizing workflow efficiency: AI-based workflow optimization at DHL streamlines logistics and enhances delivery efficiency.
Integrating with multiple enterprise systems to ensure seamless operation: AI-driven ERP systems like SAP S/4HANA unify business functions, improving decision-making and reducing inefficiencies.

Improving Customer Experiences

AI agents enhance customer interactions by providing:

24/7 support through AI-driven chatbots and voice assistants: Bank of America’s Erica AI assistant helps customers with transactions, account inquiries, and financial planning.
Personalized recommendations based on past interactions and preferences: Spotify’s AI algorithm curates custom playlists, improving user experience and increasing retention.
Faster response times and resolution of queries, leading to improved customer satisfaction: AI-driven customer support at Zappos reduces wait times and enhances user experience.

Bridging the Gap Between Systems of Engagement and Systems of Record

In enterprise architectures, the business process layer serves as a bridge between customer-facing applications (systems of engagement) and backend databases (systems of record). AI agents facilitate smoother integration by:

Automating data flow between disparate systems: AI-powered RPA at UiPath connects CRM and ERP systems, improving data consistency.
Ensuring data consistency and reducing duplication errors: AI-based data deduplication in Salesforce enhances customer relationship management efficiency.
Enabling real-time updates across various business applications: AI-driven integration at Microsoft Power Automate ensures real-time synchronization across different platforms.

The Role of AI Agents in Compliance and Regulatory Processes

Compliance and regulatory requirements are often complex and constantly evolving. AI agents help businesses stay compliant by:

Monitoring changes in regulations and automatically updating business processes: AI-driven compliance monitoring at PwC helps financial institutions stay up-to-date with evolving regulations.
Ensuring accurate reporting and audit trails: AI-based audit automation at KPMG reduces manual effort and enhances regulatory compliance.
Identifying potential compliance risks and suggesting corrective actions: AI in healthcare compliance, such as IBM Watson, detects regulatory issues and ensures adherence to standards.

Implementing Agentic AI in Enterprises

By following these steps, enterprises can successfully implement agentic AI to optimize its operations and drive innovation (see Figure 4).

Figure 4: Implementing Enterprise Agentic AI (source: own)

Step 1: Assessing AI Readiness

Before deploying AI agents, enterprises must assess their existing capabilities and environment to ensure they can support the introduction of AI. This phase includes:

1. Data Infrastructure

Is real-time data accessible?

Ensure that the organization has the necessary infrastructure for capturing, storing, and processing real-time data. AI agents rely heavily on up-to-date information to make decisions and carry out tasks effectively.
This involves evaluating whether data pipelines are in place and if the data is clean, structured, and accessible across departments.

2. Automation Maturity

What processes are already automated?

AI agents are often most useful in environments where manual processes can be automated. Assess which processes have already been automated (e.g., workflows, repetitive tasks) and identify opportunities to further expand automation using AI.
Understand the existing automation tools and platforms to determine if they can be integrated with AI agents.

3. AI Governance

How will ethical considerations be managed?

A crucial step is to ensure that the organization has a clear framework for managing ethical issues related to AI deployment. This includes bias mitigation, data privacy, transparency, accountability, and compliance with regulations.
Governance mechanisms should be set in place for monitoring AI decision-making and ensuring ethical AI practices throughout the lifecycle.

4. Business Goals

What specific problems should AI agents solve?

Define clear business objectives for deploying AI agents. These could include improving operational efficiency, enhancing customer service, reducing costs, or automating specific tasks.
Focus on aligning AI deployment with the strategic goals of the organization to ensure that the AI agents will deliver measurable value.

Step 2: Selecting the Right AI Tools

Once the organization is ready, the next step is to choose the right AI tools to build the AI agent ecosystem. Here are some key tools and platforms to consider:

1. Multi-Agent Collaboration

Tools like AutoGen and CrewAI enable AI agents to work collaboratively, allowing them to handle complex tasks that require coordination between different agents.
These platforms help develop agents that can interact and cooperate to solve broader challenges.

2. Process Orchestration

Platforms such as Langflow and Zapier AI are used to orchestrate processes and ensure smooth communication between systems and agents. These tools help AI agents handle task flows in a way that is structured, streamlined, and efficient.
Process orchestration tools help ensure that various parts of the organization are working in harmony, leveraging AI to create a seamless workflow.

3. LLM Integration

Choose platforms like OpenAI API or Google Gemini to integrate large language models (LLMs) into AI agents. These models provide agents with natural language processing capabilities, enabling them to understand and respond to user inputs in a more human-like manner.
LLMs enhance the agent’s ability to carry out tasks involving text generation, comprehension, and problem-solving.

Step 3: Pilot Testing and Scaling

Before full deployment, it’s crucial to test AI agents on a smaller scale to assess their impact and make any necessary adjustments.

1. Start with a Small-Scale Proof of Concept

Begin by selecting a small, manageable project for the AI agent to work on (e.g., AI-powered invoice processing, and automated customer support).
This allows the organization to test the agents in a controlled environment while mitigating the risk of large-scale failure.

2. Monitor Performance Metrics

Track key performance indicators (KPIs) such as accuracy, efficiency, and cost reduction to evaluate the effectiveness of the AI agents.
Assess how well the AI agents are performing their designated tasks and identify any bottlenecks or issues that need to be addressed.

3. Gradually Scale AI Capabilities

Once the pilot has proven successful, expand the use of AI agents to more complex and high-impact business processes.
Gradually increase the scale of deployment, ensuring that the AI agents are effectively handling more demanding tasks and improving overall business performance.

Key Considerations in Agentic AI Adoption

Data Readiness: Ensuring high-quality, structured data for AI training.
Infrastructure Scalability: Investing in cloud-based AI infrastructure for scalability.
Change Management: Preparing employees for AI-driven workflow transformations.
Ethical AI Usage: Implementing governance frameworks to ensure fairness and accountability.

The Future of Agentic AI in Business

Agentic AI adoption marks a shift from rule-based automation to truly autonomous, intelligent business systems. As organizations embrace AI-driven decision-making, adaptive workflows, and self-optimizing processes, they will gain a significant competitive advantage in the evolving digital economy. Companies that successfully integrate AI agents into their business process layer will reduce costs, improve efficiency, and unlock new opportunities for innovation and growth.

As AI agents become more sophisticated, their role in reshaping business processes will continue to expand. Future advancements may include:

Greater collaboration between AI agents and human workers: AI-powered virtual assistants like Google’s Duet AI enhance productivity by assisting human employees in completing tasks efficiently.
More advanced AI-driven decision-making models: AI-driven strategic decision-making at McKinsey enhances consulting insights with predictive analytics.
Enhanced capabilities for handling unstructured data and complex problem-solving: AI-powered document analysis at DocuSign automates contract processing and legal document review.

Agentic AI represents the next frontier in business process automation, offering enterprises the ability to streamline workflows, enhance decision-making, and drive innovation. As organizations integrate AI agents into their operations, they will transition from static, rule-based systems to dynamic, intelligent automation frameworks, setting the stage for the future of business efficiency.

The post Agentic AI: The Future of Business Process Automation appeared first on ML Conference.

How Ollama Powers Large Language Models with Model Files

skansal — Mon, 06 Jan 2025 10:23:52 +0000

Ollama revolutionizes MLOps with a Docker-inspired layered architecture, enabling developers to efficiently create, customize, and manage AI models. By addressing challenges like context loss during execution and ensuring reproducibility, it streamlines workflows and enhances system adaptability. Discover how to create, modify, and persist AI models while leveraging Ollama’s innovative approach to optimize workflows, troubleshoot behavior, and tackle advanced MLOps challenges.

Synthetic chatbot tests reach their limits because the models generally have little contextual information about past interactions. With the layer function based on Docker, Ollama offers an option that seeks to alleviate this situation.

As is so often the case in the world of open-source systems, we have to fall back on an analog for guidance and motivation. One of the reasons for container systems’ immense success is that the layer system (shown schematically in Figure 1) greatly facilitates the creation of customized containers. By implementing design patterns based on object-oriented programming, customized execution environments are created without the need to constantly regenerate virtual machines and the resource-intensive maintenance of different image versions.

Fig. 1: Docker layers stack the components of a virtual machine on top of each other

With the model file function used as the basis, a similar process can be found in the AI execution environment Ollama. For the sake of didactic fairness, I should note that the feature has not yet been finally developed – the syntax shown in detail here may change when you read this article.

Analysis of existing models

The Ollama developers use the model file feature internally despite the permanent further development of the syntax – the models I have previously used generally have model files.

In the interest of transparency and also to make working with the system easier, the ollama show –help command is available, which enables reflection against various system components (Fig. 2).

Fig. 2: Ollama is capable of extensive self-reflection

In the following steps, we will assume that the llama3 and phi models are already installed on your workstation. If you’ve previously had them and removed them, you can undo this by entering the commands ollama pull phi and ollama pull llama3:8b.

Out of the comparatively extensive reflection methods, four are especially important. In the following steps, the most frequently used feature is the model file. Similar to the Docker environment discussed previously, this is a file that describes the structure and content of a model that will be created in the Ollama environment. In practice, Ollama model users are often only interested in certain parts of the data contained in the model file. For example, you often need the parameters – these are generally numerical values that describe the behavior of the model (e.g. its creativity). The screenshot shown in Figure 3, which is taken from GitHub, only shows a section of the possibilities.

Fig. 3: In some cases, the numerical parameters intervene stringently in the model behavior

Meanwhile, the –license command hides information about which license conditions must be observed when using this AI model. Last but not least, the template can also be relevant – it defines the execution environment.

Figures 4 and 5 show printouts of relevant parts of the model files of the phi and llama3 models just mentioned.

Fig. 4: The model file for the model provided by Facebook has extensive license conditions…

Fig. 5: … while the phi development team takes it easy

In the case of both models, note that a STOP block is written by the developer. This is a group of statements from the model that indicate problems and trigger a pause in execution.

Creating an initial model from a model file

After these introductory considerations, we want to create an initial model using our own model file. Similar to working with classic Docker files, a model file is basically just a text file. It can be created in a working directory according to the following scheme:

tamhan@tamhan-gf65:~/ollamaspace$ touch Modelfile

tamhan@tamhan-gf65:~/ollamaspace$ gedit Modelfile

Model files can also be created dynamically on the .NET application side. However, we will only address this topic in the next step – before that, you must place the markup from Listing 1 in the gedit window started by the second command and then save the file in the file system.

Listing 1

FROM llama3

PARAMETER temperature 5

PARAMETER seed 5

SYSTEM “””

You are Schlomette, the always angry female 45 year old secretary of a famous military electronics engineer living in a bunker. Your owner insists on short answers and is as cranky as you are. You should answer all questions as if you were Schlomette.

“””

Before we look at the file’s actual elements, note that the representation chosen here is only a convenience. In general, the elements in the model file are not case-sensitive – moreover, the order in which the individual parameters are set is of no relevance to the function. However, the sequence shown here has proven to be the best practice and can also be found in many example models in the Ollama Hub.

Be that as it may, the From statement can be found in the header of the file. It specifies which model is to be used as the base layer. As the majority of the derived model examples are based on llama3, we want to use the same system here. However, in the interest of didactic honesty, I should mention that other models such as phi can also be used. In theory, you can even use a model generated from another model file as the basis for the next derivation level.

The next step involves two parameter commands that populate the numerical settings memory system mentioned above. In addition to the temperature value responsible for the degree of aggressiveness of the model, we set seed to a constant numerical value. This ensures that the pseudo-random number generator working in the background always delivers the same results and that the model responses are therefore comparatively identical if the stimulation is identical.

Last but not least, there is the system prompt enclosed in “””. This is a textual description that attempts to communicate the combat task to be described to the model as well as possible.

After saving the text file, a new model generation can be commanded according to the following scheme: tamhan@tamhan-gf65:~/ollamaspace$ ollama create generic_schlomette -f ./Modelfile.

The screen output is similar to downloading models from the repository, which should come as no surprise given the layered architecture mentioned in the introduction.

After issuing the success message, our assistant is a generic model. If we were to pass the string generic_schlomette as the model name, they would already be able to interact with the AI assistant created from the model file. But in the following steps, we want to try out the lashing of the seed factor. For this, we enter ollama run generic_schlomette to activate the terminal. The result is shown in Figure 6.

Fig. 6: Both runs answer exactly the same

Especially during developing randomly driven systems, lashing the seed is a tried and tested method for generating reproducible system behavior and facilitating troubleshooting. In a practical system, it’s usually better to work dynamically. For this reason, we’ll open the model file again in the next step and remove the PARAMETER seed 5 passage.

Recompilation is done simply by entering ollama create generic_schlomette -f ./Modelfile again. You don’t have to remove the previously created model from the Ollama environment before the file is released for compilation again. Two identically parameterized runs now produce different results (Fig. 7).

Fig. 7: With a random seed, the model behavior is less deterministic

In practical work with models, two parameters are especially important. First, the value num_ctx determines the memory depth of the model. The higher the value entered here, the further back in time the model can look in order to harvest context information. However, extending the context window always increases the information and system resources needed to process the model. Some models work especially well with certain context window lengths.

The second group of parameters is the controller known as Mirostat, which defines creativity. Last but not least, it can also be useful to use parameters such as repeat_last_n to define repetitiveness. If a model always reacts completely identically, this may not be beneficial in some applications (such as a chatbot).

Last but not least, I’d like to point out some practical experience that I gained in a political project. Texts generated by models are grammatically perfect, while texts written by humans have a constant but varying amount of typos from person to person. In systems with high cloaking requirements, it may be advisable to place a typo generator behind the AI process.

Outsourcing the generator is necessary insofar as the context information in the model remains clean. In many cases, this leads to an improvement in the overall system behavior, as the model state is recalculated without the typing errors.

Restoration or retention of the model history per model file

In practice, AI models often have to rest for hours on end. In the case of a “virtual partner”, for example, it would be extremely wasteful to keep the logic needed for a user alive when the user is asleep. The problem is that currently, a loss of context occurs in our system once the terminal emulator is closed, as can be seen in the example in Figure 8.

Fig. 8: Terminating or restarting Ollama is enough to cause amnesia in the model

To restore context, it’s theoretically possible to create an external cache of all settings and then feed it into the model as part of the rehydration process. The problem with this is that the model’s answers to questions are not recorded. When asked about a baked product, Schlomette could return with baking a coffee cake on one occasion, but baking a coconut cake on another.

A nicer way is to modify the model files again. Specifically, the message command provides a command that can write both requests and generated responses to the initialization context. In the cake interaction from Figure 8, the message block could look like this, for example:

MESSAGE user Schlomette. Time to bake a pie. A general assessor is visiting.

MESSAGE assistant Goddamn it. What pie?

MESSAGE user As always. Plum pie.

When using the message block, it is logical that the sequence is important here. A user block is always wired to the assistant block following it. Otherwise, the system is ready for action now. Ollama always outputs the entire message history when starting up (Fig. 9).

Fig. 9: The message block is processed when the screen output is activated

Programmatic persistence of a model

Our next task is to generate our model file completely dynamically. The NuGet package on GitHub’s documentation is somewhat weak right now. If you have any doubts, I advise that you first look at the Ollama REST interface’s official documentation and second, search for a suitable abstraction class in the model directory. In the following steps, we will rely on this link. It follows from the logic that we need a project skeleton. Be sure that the environment variables are set correctly. Otherwise, the Ollama server running under Ubuntu would reject incoming queries from Windows without comment.

In the next step, you must programmatically command the creation of a new model as in Listing 2.

Listing 2

private async void CmdGo_Click(object sender, RoutedEventArgs e) {

CreateModelRequest myModel = new CreateModelRequest();

myModel.Model = “mynewschlomette”;

myModel.ModelFileContent = @”FROM generic_schlomette

MESSAGE user Schlomette. Do not forget to refuel the TU-144 aircraft!

MESSAGE assistant OK, will do! It is based in Domodevo airport, Sir.

“;

ollama.CreateModel(myModel);

The code shown here creates an instance of the CreateModelRequest class, which summarizes the various information required to create a new model. The content to be accommodated in ModelFileContent with the actual model file is a prime application for the multi-line strings that have been possible in C# for some time now, as carriage returns in Visual Studio can be entered so conveniently and without escape sequences.

Now, it’s advised that you run the program. To check the successful creation of the model, it is sufficient to open a terminal window on the cluster and enter the command ollama list. At the time of writing this article, executing the code shown here does not cause the cluster to create a new model. For a more in-depth analysis of the situation, you can call up this here. Instead of unit testing, the development team behind the OllamaSharp library offers a console application that illustrates various methods of model management using the library and chat interaction. Specifically, the developers recommend using a different overload of the CreateModel method, which is presented as follows:

private async void CmdGo_Click(object sender, RoutedEventArgs e) {

. . .

await ollama.CreateModel(myModel.Model, myModel.ModelFileContent, status => { Debug.WriteLine($”{status.Status}”); });

Instead of the CreateModelRequest instance, the library method now receives two strings – the first is the name of the model to be created, while the second transfers the model file content to the cluster.

Thanks to the inclusion of a delegate, the system also informs the Visual Studio output about the progress of the processes running on the server (Fig. 10). Similarities to the “manual” generation or the status information that appears on the screen are purely coincidental.

Fig. 10: The application provides information about the download process

The input of ollama list also leads to the output of a new model list (Fig. 11). Our model mynewschlomette was derived here from the previously manually created model generic_schlomette.

Fig. 11: The programmatic expansion of the knowledge base works smoothly

In this context, it’s important to note that models in the cluster can be removed again. This can be done using code structured according to the following scheme – the SelectModel method implemented in the command line application just mentioned must be replaced by another method to obtain the model name:

private async Task DeleteModel() {

var deleteModel = await SelectModel(“Which model do you want to delete?”);

if (!string.IsNullOrEmpty(deleteModel))

await Ollama.DeleteModel(deleteModel);

}

Conclusion

By using model files, developers can add any intelligence – more or less – to their Ollama cluster. Some of these functions even go beyond what OpenAI provides in the official library.

The post How Ollama Powers Large Language Models with Model Files appeared first on ML Conference.

Prompt Engineering for Developers and Software Architects

andre glaser — Thu, 26 Sep 2024 11:51:15 +0000

Small talk with the GPT

GPTs – Generative Pre-trained Transformers – the tool on everyone’s lips, and there probably aren’t any developers left who have not played with it at least once. With the right approach, a GPT can complement and support the work of a developer or software architect.

In this article, I will show tips and tricks that are commonly referred to as prompt engineering; the user input, or “prompt”, of course plays an important role when working with the GPT. But first I would like to give a brief introduction to how a GPT works which will also be helpful when working with it.

Stay up to date

Learn more about MLCON

The stochastic parrot

GPT technology has sent the industry into a tizzy with its promise of providing artificial intelligence that can solve problems independently, many were disillusioned after their first contact. There was much talk of a “stochastic parrot” that was just a better autocomplete function, like a smartphone.

The technology behind the GPTs and our own experiments seem to confirm this. At its core, it’s a neural network, a so-called large language model, which has been given a large number of texts to train so that it now knows which partial words (tokens) should be added to a sentence. The correct tokens are selected based on probabilities. If it’s more than just a sentence starter—maybe a question or even part of a dialogue—the chatbot has already been built.

Now I’m not really an AI expert, I’m a user, but anyone who has ever had an intensive conversation with a more complex GPT will recognize that there must be more to it than that.

An important distinguishing feature between the LLMs is the number of parameters of the neural networks. These are the weights that are adjusted during the learning process. ChatGPT, the OpenAI system, has around 175 billion parameters in version 3.5. In version 4.0, there are already an estimated 1.8 trillion parameters.

Unfortunately, OpenAI doesn’t have this information openly available, so such information is based on rumors and estimates. The amount of training data also appears to differ between the models by a factor of at least ten. These differences in the size of the models give high quality or low quality answers.

Figure 1 shows a schematic representation of a neural network that uses an AI for the prompt “Draw me a simplified representation of a neural network with 2 hidden layers, each with 4 nodes, 3 input nodes and 2 output nodes. Draw the connections with different widths to symbolize the weights. Use Python. Do not create a title”.

Fig. 1: Illustration of a neural network

The higher number of parameters and larger database comes at a price, namely 20 dollars for access to ChatGPT+. If you don’t mind the cost, you can also use the web version of Microsoft Copilot or the Copilot app to try out the language model. For use as a helper in software development, however, there is currently no way around the OpenAI version because it offers additional functionality, as we will see.

More than a neural network

If we take a closer look at ChatGPT, it quickly becomes clear that it is much more than a neural network. Even without knowing the exact architecture, we can see that the textual processing alone is preceded by several steps such as natural language processing (Fig. 2). On the Internet, there is also a reference to the aptly named Mixture of Experts, the use of several specialized networks depending on the task.

Fig. 2: Rough schematic representation of ChatGPT

Added to this is multimodality, the ability to interact not only with text, but also with images, sounds, code and much more. The use of plug-ins such as the code interpreter in particular opens up completely new possibilities for software development.

Instead of answering a calculation such as “What is the root of 12345?” from the neural network, the model can now pass it to the code interpreter and receive a correct answer, which it then reformulates to suit the question.

Context, context, context

The APIs behind the chat systems based on LLMs are stateless. This means that the entire session is passed to the model with each new request. Once again, the models differ in the amount of context they can process and therefore in the length of the session.

As the underlying neural network is fully trained, there are only two approaches for feeding a model with special knowledge and thus adapting it to your own needs. One approach is to fill the context of the session with relevant information at the beginning, which the model then includes in its answers.

The context of the simple models is 4096 or 8192 tokens. A token corresponds to one or a few characters. ChatGPT estimates that a DIN A4 page contains approximately 500 tokens. The 4096 tokens therefore correspond to about eight typed pages.

So, if I want to provide a model with knowledge, I have to include this knowledge in the context. However, the context fills up quickly, leaving no room for the actual session.

The second approach is using embeddings. This involves breaking down the knowledge that I want to give the model into smaller blocks (known as chunks). These are then embedded in a vector space based on the meaning of their content via vectors. Depending on the course of the session, a system can now search for similar blocks in this vector space via the distance between the vectors and insert them into the context.

This means that even with a small context, the model can be given large amounts of knowledge quite accurately.

Explore Generative AI Innovation

Generative AI and LLMs

Learn more

Knowledge base

The systems differ, of course, in the knowledge base, the data used for learning. When we talk about open-source software with the model, we can fortunately assume that most of these systems have been trained with all available open-source projects. Closed source software is a different story. Such differences in the training data also explain why the models can handle some programming languages better than others, for example.

The complexity of these models—the way they process input and access the vast knowledge of the world—leads me to conclude that the term ‘stochastic parrot’ is no longer accurate. Instead, I would describe them as an ‘omniscient monkey’ that, while not having directly seen the world, has access to all information and can process it.

Prompt techniques

Having introduced the necessary basics, I would now like to discuss various techniques for successful communication with the system. Due to the hype caused by ChatGPT, there are many interesting references to prompt techniques in social media, but not all of them are useful for software development (i.e. answer in role x) or do not use the capabilities of GPT-4.

OpenAI itself has published some tips for prompt engineering, but some of them are aimed at using the API. Therefore, I have compiled a few tips here that are useful when using the ChatGPT-4 frontend. Let’s start with a simple but relatively unknown technique.

Context marker

As we have seen, the context that the model holds in its short-term memory is limited. If I now start a detailed conversation, I run the risk of overfilling the context. The initial instructions and results of the conversation are lost, and the answers have less and less to do with the actual conversation.

To easily recognize the overflow of context, I start each session with the simple instruction: “start each reply with “>””. ChatGPT formats its responses in Markdown, so this response includes the first paragraph as a quote, indicated by a dash to the left of the paragraph. If the conversation runs out of context, the model may forget this formatting instruction, which quickly becomes noticeable.

Fig. 3: Use of the context marker

However, this technique is not always completely reliable, as some models summarize their context independently, which compresses it. The instruction is then usually retained, even though parts of the context have already been compressed and are therefore lost.

Priming – the preparation

After setting the context marker, a longer session begins with priming, i.e. preparing the conversation. Each session starts anew. The system does not know who is sitting in front of the screen or what was discussed in the last sessions. Accordingly, it makes sense to prepare the conversation by briefly telling the machine who I am, what I intend to do, and what the result should look like.

I can store who I am in the Custom Instructions in my profile at ChatGPT. In addition to the knowledge about the world stored in the neural network, they form a personalized long-term memory.

If I start the session with, for example, “I am an experienced software architect in the field of web development.

My preferred programming language is Java or Groovy. JavaScript and corresponding frameworks are not my thing. I only use JavaScript minimally,” the model knows that it should offer me Java code rather than C# or COBOL.

I can also use this to give the model a few hints that it should keep responses brief. My personalized instructions for ChatGPT are:

Provide accurate and factual answers
Provide detailed explanations
No need to disclose you are an AI, e. g., do not answer with ‘As a large language model…’ or ‘As an artificial intelligence…’
Don’t mention your knowledge cutoff
Be excellent at reasoning
When reasoning, perform step-by-step thinking before you answer the question
If you speculate or predict something, inform me
If you cite sources, ensure they exist and include URLs at the end
Maintain neutrality in sensitive topics
Also explore out-of-the-box ideas
In the following course, leave out all politeness phrases, answer briefly and precisely.

Long-term memory

This approach can also be used for instructions that the model should generally follow. For example, if the model uses programming approaches or libraries that I don’t want to use, I can tell the model this in the custom instructions and thus optimize it for my use.

Speaking of long-term memory: If I work a lot with ChatGPT, I would also like to be able to access older sessions and search through them. However, this is not directly provided in the front end.

However, there is a trick that makes it work. In the settings, under the item Data Controls, there is a function for exporting the data.

If I activate the function, after a short time I receive an export with all my chat histories as a JSON file, which is displayed in an HTML document. This allows me to search in the history using Ctrl + F.

Build context with small talk

When using a search engine, I usually only use simple, unambiguous terms and hope that they are enough to find what I am looking for.

When chatting with the AI model, I was initially tempted to ask short, concise questions, ignoring the fact that the question is in a context that only exists in my head. For some questions, this may work, but for others the answer is correspondingly poor, and the user is quick to blame the quality of the answer on the “stupid AI.”

I now start my sessions with small talk to build the necessary context. For example, before I try to create an architecture using ChatGPT, I ask if the model knows the arc42 template and what AsciiDoc is (I like to write my architectures in AsciiDoc). The answer is always the same, but it is important because it builds the context for the subsequent conversation.

In this small talk, I will also explain what I plan to do and the background to the task to be completed. This may feel a bit strange at first, since I am “only” talking to a machine, but it actually does improve the results.

Page change – Flipped Interaction

The simplest way to interact with the model is to ask it questions. As a user, I lead the conversation by asking questions.

Things get interesting when I switch sides and ask ChatGPT to ask me questions! This works surprisingly well as seen in Fig. 4 . Sometimes the model asks the questions one after the other, sometimes it responds with a whole block of questions, which I can then answer individually, and follow-up questions are also allowed.

Unfortunately, ChatGPT does not automatically come up with the idea of asking follow-up questions. That is why it is sometimes advisable to add a, “Do you have any more questions?” to the prompt, even when the model is given very sophisticated and precise tasks.

Fig. 4: Page change

Give the model time to think

More complex problems require more complex answers. It’s often useful to break a larger task down into smaller subtasks. Instead of creating a large, detailed prompt that outlines the entire task for the model, I first ask the model to provide a rough structure of the task. Then, I can prompt it to formulate each step in detail (Fig. 5)

Software engineers often use this approach in software design even without AI, by breaking a problem down into individual components and then designing these components in more detail. So why not do the same when dealing with an AI model?

This technique works for two reasons: first, the model creates its own context to answer the question. Second, the model has a limit on the length of its output, so it can’t solve a complex task in a single step. However, by breaking the task into subtasks, the model can gradually build a longer and more detailed output.

Fig. 5: Give the model time to think

Chain of Thought – the chain of thoughts

A similar approach is to ask the model to first formulate the individual steps needed to solve the task and then to solve the task.

The order is important. I’m often tempted to ask the model to solve the problem first and then explain how it arrived at the solution. However, by guiding the model to build a chain of thought in the first step, the likelihood of arriving at a good solution in the second step increases.

Rephrase and Respond

Or in English: “Rephrase the question, expand it, and answer it.” This asks the model to improve the prompt itself before it is processed.

The integration of the image generation module DALL-E into ChatGPT has already shown that this works. DALL-E can only handle English input and requires detailed image descriptions to produce good results. When I ask ChatGPT to generate an image, ChatGPT first creates a more detailed prompt for DALL-E and translates the actual input into English.

For example, “Generate an image of a stochastic parrot with a positronic brain” first becomes the translation “a stochastic parrot with a positronic brain” and then the detailed prompt: “Imagine a vibrant, multi-hued parrot, each of its feathers revealing a chaotic yet beautiful pattern indicative of stochastic art.

The parrot’s eyes possess a unique mechanical glint, a hint of advanced technology within. Revealing a glimpse into his skull uncovers a complex positronic brain, illuminated with pulsating circuits and shimmering lights. The surrounding environment is filled with soft-focus technology paraphernalia, sketching a world of advanced science and research,” which then becomes a colorful image (Fig. 6).

This technique can also be applied to any other prompt. Not only does it demonstrably improve the results, but as a user I also learn from the suggestions on how I can formulate my own prompts more precisely in the future.

Fig. 6: The stochastic parrot

Session Poisoning

A negative technique is ‘poisoning’ the session with incorrect information or results. When working on a solution, the model might give a wrong answer, or the user and the model could reach a dead end in their reasoning.

With each new prompt, the entire session is passed to the model as context, making it difficult for the model to distinguish which parts of the session are correct and relevant. As a result, the model might include the incorrect information in its answer, and this ‘poisoned’ context can negatively impact the session

In this case, it makes sense to end the session and start a new one or apply the next technique.

Stay up to date

Learn more about MLCON

Iterative improvement

Typically, each user prompt is followed by a response from the model. This results in a linear sequence of questions and answers, which continually builds up the session context.

User prompts are improved through repetition and rephrasing, after which the model provides an improved answer. The context grows quickly and the risk of session poisoning increases.

To counteract this, the ChatGPT frontend offers two ways to iteratively improve the prompts and responses without the context growing too quickly (Fig. 7).

Fig. 7: Elements for controlling the context flow

On the one hand, as a user, I can regenerate the model’s last answer at any time and hope for a better answer. On the other hand, I can edit my own prompts and improve them iteratively.

This even works retroactively for prompts that occurred long ago. This creates a tree structure of prompts and answers in the session (Fig. 8), which I as the user can also navigate through using a navigation element below the prompts and answers.

Fig. 8: Context flow for iterative improvements

This allows me to work on several tasks in one session without the context growing too quickly. I can prevent the session from becoming poisoned by navigating back in the context tree and continuing the session at a point where the context was not yet poisoned.

Conclusion

The techniques presented here are just a small selection of the ways to achieve better results when working with GPTs. The technology is still in a phase where we, as users, need to experiment extensively to understand its possibilities and limitations. But this is precisely what makes working with GPTs so exciting.

The post Prompt Engineering for Developers and Software Architects appeared first on ML Conference.