Generative AI and Large Language Models Archives - ML Conference

MCP vs A2A: Architecting AI Agent Communication for Enterprise

adharaneedharan@sandsmedia.com — Mon, 21 Jul 2025 12:10:07 +0000

The AI landscape is shifting towards collaborative, specialized agents. This article provides an essential comparative analysis of emerging AI agent communication protocols: Anthropic’s Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A). For ML/AI developers and technical leaders, understanding these frameworks is crucial for building scalable, secure, and composable AI systems. We delve into the architecture, benefits, and challenges of each protocol, guiding you to make informed decisions for your next-gen enterprise AI infrastructure and AI tool development.

The field of AI is undergoing a significant architectural shift. We are moving from standalone AI systems that provide isolated capabilities toward interconnected ecosystems of specialized agents that collaborate to solve complex problems. This evolution mirrors the historical development of human organizations, where specialization and communication allowed for more sophisticated collective capabilities.

As AI systems grow more capable and autonomous, the need for standardized communication mechanisms becomes increasingly critical. Without established protocols, organizations face challenges including:

Technical Fragmentation: Teams developing separate integration methods for each agent pairing, leading to duplicated effort and inconsistent implementations.
Security Vulnerabilities: Ad-hoc communication systems often lack robust authentication, authorization, and data protection mechanisms.
Limited Composability: Without standardized interfaces, combining capabilities from different AI systems becomes prohibitively complex.
Governance Challenges: Tracking information flow, maintaining audit trails, and ensuring accountability becomes difficult when agent communication occurs through diverse, non-standardized channels.

AI Agent communication protocols aim to address these challenges by providing structured frameworks that define how agents advertise capabilities, request services, exchange information, and coordinate activities. These protocols serve as the foundational infrastructure upon which sophisticated multi-agent systems can be built.

In this evolving landscape, two significant protocols have emerged as potential industry standards: Model Context Protocol (MCP) developed by Anthropic and Agent-to-Agent (A2A) recently introduced by Google. Each brings a distinct perspective on how AI Agent communication should be structured, secured, and integrated into enterprise workflows.

Understanding the architectural foundations, benefits, limitations, and optimal use cases for each protocol is essential for organizations planning their AI infrastructure investments. This comparative analysis will help technical leaders make informed decisions about which protocol—or combination of protocols—best suits their specific requirements and use cases.

What is MCP?

Model Context Protocol (MCP) represents Anthropic’s approach to establishing a standardized framework for AI Agent communication and operation. At its philosophical core, MCP recognizes that as AI systems grow in complexity and capability, they require a consistent, structured way to interact with external tools, data sources, and services.

MCP emerged from the practical challenges faced by developers building sophisticated AI applications. Without standardization, each team was forced to develop custom integration methods for connecting their AI systems with external capabilities—resulting in duplicated effort, inconsistent implementations, and limited interoperability between systems.

The protocol addresses these challenges by providing a unified method for structuring the context in which AI models operate. It defines clear patterns for how information should be organized, how models should access external resources and tools, and how outputs should be formatted. This standardization allows for better interoperability between different AI systems, regardless of their underlying architecture or training methodology.

Rather than focusing on direct agent-to-agent communication, MCP emphasizes the importance of structured context—ensuring that AI systems have access to the information and capabilities they need in a consistent, well-organized format. This approach treats tools and data sources as extensions of the model’s capabilities, allowing for dynamic composition of functionality without requiring extensive pre-programming.

By providing this standardized interface for context management, MCP aims to reduce ecosystem fragmentation, enable more flexible AI deployments, and facilitate safer, more reliable AI systems that can work together coherently while maintaining alignment with human intentions and values.

MCP Architecture

MCP’s architecture is built around a hierarchical context structure that organizes information and capabilities into clearly defined components. This architecture follows several key design principles that prioritize clarity, security, and flexibility while maintaining clear separation between different types of contextual information.

Core Architectural Components:

MCP Host: The “brain” of the system—an AI application using a Large Language Model (LLM) at its core that processes information and makes decisions based on available context. The host is the primary consumer of the capabilities and information provided through the MCP framework.
MCP Client: Software components that maintain 1:1 connections with MCP servers. These clients serve as intermediaries between hosts and servers, facilitating standardized communication while abstracting away the complexities of server interactions.
MCP Server: Lightweight programs that expose specific capabilities through the standardized protocol. Each server is responsible for a discrete set of functionalities, promoting separation of concerns and allowing for modular composition of capabilities.
Local Data Sources: Files, databases, and services on the local machine that MCP servers can securely access. These provide the foundation for contextual information from the immediate environment.
Remote Data Sources: External systems available over the internet (typically through APIs) that MCP servers can connect to, expanding the potential information sources available to AI systems.

Figure 1 MCP Architecture (source: Tahir^[1])

Context Structure and Information Flow:

The MCP architecture implements a controlled information flow where context passes through defined pathways. When an MCP Host needs to access external information or capabilities, it connects to appropriate MCP Servers through MCP Clients. The servers then mediate access to various data sources, ensuring that information is properly formatted and permissions are appropriately handled.

This structured flow ensures that all processing occurs within well-defined boundaries, making it easier to track how information moves through the system and maintain security and accountability. The architecture explicitly distinguishes between:

Tools: Model-controlled actions that allow the AI to perform operations such as fetching data, writing to a database, or interacting with external systems.
Resources: Application-controlled data such as files, JSON objects, and attachments that can be incorporated into the AI’s context.
Prompts: User-controlled predefined templates (similar to slash commands in modern IDEs) that provide standardized ways to formulate certain types of requests.

Figure 2 is a sequence diagram showing the information flow between different components in a system that uses MCP to retrieve blog data, specifically SQL-related blog posts, for a user. This type of flow would be useful in a plugin-style AI integration where the AI needs to interact with external data sources via a protocol like MCP but requires explicit user permission and intelligent capability discovery.

Figure 2 MCP Workflow Example (source: Gökhan Ayrancıoğlu^[2])

Protocol Implementation:

MCP is designed to be transport-agnostic, though the initial implementation focuses on HTTP/HTTPS as the primary transport layer. The protocol defines standardized message formats for tool registration, tool invocation, and result handling, ensuring consistent interaction patterns regardless of the specific tools or data sources being accessed.

Recent developments in the protocol have expanded support for remote MCP servers (over Server-Sent Events) and integration with authentication mechanisms like OAuth, making it suitable for enterprise deployments where security and distributed access are essential requirements.

This architecture aims to create a standardized environment for AI processing, where information sources are clearly delineated, tools are discoverable and consistently invocable, and outputs adhere to predictable formats enabling safer multi-agent interactions and clearer accountability.

MCP Benefits

MCP offers several substantial advantages that make it particularly valuable for organizations implementing enterprise-grade AI systems. These benefits directly address common challenges in AI development and deployment, providing tangible improvements in development efficiency, system flexibility, and organizational collaboration.

Ecosystem Standardization and Reduced Fragmentation:

One of MCP’s most significant benefits is the reduction in ecosystem fragmentation. Before standardized protocols, every team building AI applications had to develop custom integrations for connecting their systems with tools and data sources. This resulted in duplicated effort, inconsistent implementations, and limited interoperability.

MCP addresses this challenge by providing a standardized way to connect AI systems with external capabilities. This standardization significantly reduces development overhead and creates a more cohesive AI ecosystem where components can be easily shared and reused. Organizations can develop MCP servers once and leverage them across multiple AI applications, maximizing return on development investments.

Dynamic Composability of Capabilities:

MCP enables dynamic composability of AI systems. Agents and applications can discover and use new tools without pre-programming, allowing for more flexible and adaptable AI deployments. This composability means that organizations can incrementally enhance their AI capabilities by adding new MCP servers without needing to modify existing applications.

For example, a company might initially deploy an AI assistant with access to document search capabilities through an MCP server. Later, they could add financial analysis capabilities by deploying a new MCP server—and the assistant would be able to leverage these capabilities without requiring major modifications to its core implementation.

Enhanced Tool Integration and Context Management:

MCP provides a consistent framework for integrating external tools and capabilities into AI systems. This consistency makes it easier for developers to add new functionalities to their AI applications and for end-users to understand how to interact with those capabilities.

The protocol’s structured approach to context management ensures that models have access to the information they need in a well-organized format. This reduces the risk of context confusion and helps maintain consistent performance across different implementations. The clear separation between different types of contextual information (tools, resources, and prompts) also facilitates better governance and security practices.

Support for Enterprise Collaboration and Workflows:

The protocol aligns well with enterprise organizational structures, where different teams often maintain specialized services and capabilities. Teams can own specific services (such as vector databases, knowledge bases, or analytical tools) and expose them via MCP for other teams to use. This supports organizational separation of concerns while enabling cross-functional collaboration through standardized interfaces.

This alignment with enterprise workflows makes MCP particularly valuable for large organizations with diverse AI initiatives across multiple departments. It provides a common language for AI capabilities while respecting organizational boundaries and governance requirements.

Foundation for Self-Evolving Agent Systems:

MCP enables the creation of self-evolving agents that can grow more capable over time without requiring constant reprogramming. As new tools become available through the MCP registry, agents can discover and incorporate these capabilities dynamically—allowing for continuous improvement without manual intervention.

This foundation for evolving capabilities is especially valuable as organizations move toward more autonomous AI systems that need to adapt to changing requirements and opportunities.

These benefits collectively enable organizations to implement AI systems that are more interoperable, more easily extended, and better integrated into existing enterprise workflows and technology stacks.

MCP Challenges

Despite its numerous advantages, implementing MCP presents several significant challenges that organizations need to carefully consider. Understanding these limitations is essential for realistic planning and effective risk management when adopting the protocol.

Authentication and Security Framework Limitations:

One notable limitation of MCP in its current form is its relatively basic authentication mechanisms. While recent updates have improved OAuth integration, MCP lacks the comprehensive authentication frameworks that are essential for secure enterprise deployments across organizational boundaries.

This limitation becomes particularly significant when implementing MCP in environments where security is a critical concern, especially when AI systems need to access sensitive information or perform operations with potential security implications. Organizations implementing MCP in such environments will need to develop additional security layers to complement the protocol’s native capabilities.

Remote Server Management Complexity:

Although MCP has expanded to support remote MCP servers (over Server-Sent Events), managing these remote connections securely and reliably presents additional complexity. Organizations deploying MCP across distributed environments need to develop strategies for handling connection failures, latency issues, and security considerations.

This distributed architecture introduces potential points of failure that must be carefully managed, especially for mission-critical AI applications. Implementing robust monitoring, error handling, and recovery mechanisms becomes essential when deploying MCP at scale across distributed infrastructures.

Registry Development and Tool Discovery Maturity:

The planned MCP Registry for discovering and verifying MCP servers is still in development. Until this component is fully realized and mature, organizations face challenges in implementing dynamic tool discovery—one of the protocol’s key promised benefits.

Without a robust registry system, organizations must develop interim solutions for tool discovery and verification, potentially limiting the dynamic composition capabilities that make MCP valuable. This gap between the current implementation and the full vision for MCP requires pragmatic planning for organizations adopting the protocol today.

Connection Lifecycle Management:

MCP is still refining how it handles the distinction between stateful (long-lived) and stateless (short-lived) connections. This distinction is important for different types of AI applications, and the current^[3] implementation may not fully address all use cases, particularly those requiring sophisticated state management across extended interaction sessions.

Organizations implementing MCP need to carefully consider their connection lifecycle requirements and may need to develop custom solutions for cases that fall outside the protocol’s current capabilities in this area.

Multi-Agent Coordination Limitations:

While MCP excels at connecting individual AI systems with tools and data, it provides less robust support for direct agent-to-agent communication in multi-agent systems where state is not necessarily shared. This limitation becomes apparent in complex agent ecosystems where multiple autonomous agents need to coordinate their activities directly.

For sophisticated multi-agent architectures, organizations may need to complement MCP with additional protocols or custom solutions to enable effective agent-to-agent communication, particularly when those agents operate across organizational boundaries or vendor environments.

Implementation Complexity and Learning Curve:

Adopting MCP requires investment in understanding and implementing the protocol’s specifications. For organizations with existing AI infrastructure, this may require significant refactoring of current systems to comply with MCP’s structural requirements.

This implementation complexity represents a real cost that must be factored into adoption planning. Organizations should expect to invest in developer training, refactoring existing code, and establishing new development practices aligned with the protocol’s requirements.

These challenges highlight the importance of careful planning when implementing MCP, particularly for organizations with complex security requirements or those building sophisticated multi-agent systems.

MCP Main Use Cases

MCP is particularly well-suited for several key application areas where its structured approach to context management delivers significant value. Understanding these optimal use cases helps organizations identify where MCP can provide the greatest return on implementation investment.

AI-Enhanced Development Environments:

MCP has gained significant traction in AI-enhanced coding environments and integrated development environments (IDEs). Tools like Cursor and Zed leverage MCP to provide developers with AI assistants that have rich access to contextual information, including code repositories, documentation, ticket systems, and development resources.

In these environments, MCP excels at:

Pulling in relevant code context from the current project
Accessing GitHub issues, documentation, and APIs
Enabling interaction with development tools and services
Maintaining appropriate context during extended coding sessions

The protocol’s standardized approach to context management makes it particularly effective for integrating AI capabilities into development workflows, allowing developers to work with AI assistance that truly understands their project context.

Enterprise Knowledge Management Systems:

MCP provides significant value in enterprise environments where AI needs to access, process, and reason over large volumes of organizational knowledge. The protocol’s clear structure for differentiating between various information sources helps maintain information integrity when AI systems need to reference multiple documents, databases, and knowledge bases simultaneously.

These knowledge management applications benefit from MCP’s ability to:

Access diverse document repositories with appropriate permissions
Query enterprise databases while maintaining security boundaries
Incorporate real-time information from organizational systems
Maintain clear provenance for information incorporated into analyses

This capability makes MCP ideal for implementing corporate knowledge assistants, document processing systems, and intelligent search applications that need to work across diverse information sources while maintaining appropriate security and governance.

Tool-Augmented Agents and Automated Workflows:

Organizations implementing AI Agents that need to leverage external tools benefit significantly from MCP’s standardized tool interface. Agents can autonomously invoke tools to search the web, query databases, perform calculations, or interact with enterprise systems through a consistent, well-defined interface.

This standardization makes it easier to:

Expand agent capabilities by adding new tools without changing the agent’s core implementation
Chain multiple tools together into sophisticated workflows
Maintain clear audit trails of tool invocations and results
Implement governance controls around tool access and usage

For example, a research assistant agent might use MCP to access scholarly databases, statistical analysis tools, and citation management systems—combining these capabilities dynamically based on specific research requests.

Domain-Specific AI Applications:

MCP provides an excellent foundation for building domain-specific AI applications that require access to specialized data sources or tools. In fields like finance, healthcare, or legal services, MCP allows developers to create AI systems that can interact with domain-specific resources through a standardized interface.

This standardization reduces the development effort required to build and maintain specialized applications by:

Providing a consistent pattern for integrating domain-specific tools
Enabling clear separation between the AI model and domain-specific resources
Facilitating compliance with domain-specific regulations through structured access controls
Allowing for modular updates to capabilities as domain requirements evolve

For instance, a healthcare AI assistant might use MCP to access medical terminology databases, electronic health record systems, and clinical decision support tools—all through a consistent interface that maintains appropriate clinical governance.

Self-Evolving Agent Systems:

The protocol enables the creation of self-evolving agents that can grow more capable over time without requiring constant reprogramming. These systems can:

Dynamically discover new tools via the registry
Combine MCP with computer vision for UI interactions
Chain multiple MCP servers for complex workflows (e.g., research → fact-check → report-writing)
Adapt to new information sources and capabilities as they become available

This capability is particularly valuable for organizations looking to build AI systems that can grow more sophisticated over time, adapting to changing requirements without requiring constant developer intervention.

These use cases highlight MCP’s strengths as a foundational layer for context-aware AI systems, particularly in environments where structured access to diverse information sources and tools is a key requirement.

^[1] https://medium.com/@tahirbalarabe2/what-is-model-context-protocol-mcp-architecture-overview-c75f20ba4498

^[2] https://gokhana.medium.com/what-is-model-context-protocol-mcp-how-does-mcp-work-97d72a11af8a

The post MCP vs A2A: Architecting AI Agent Communication for Enterprise appeared first on ML Conference.

Can GenAI Build Satellites?

adharaneedharan@sandsmedia.com — Thu, 17 Jul 2025 08:09:40 +0000

When we think of Satellites we imagine cutting-edge technology, precision engineering, and brilliant minds at work.

What we rarely see is the chaos behind the scenes: thousands of pages of documents, conflicting requirements, scattered diagrams, and complexity that grows faster than humans can manage. Modern systems engineering isn’t just about designing hardware — it’s about managing an overwhelming flood of information.

How does GenAI Helps System Engineers Stay Ahead of Complexity?

1. Writing Better Requirements

Projects start with requirements — and often, they’re a mess. Vague sentences like “The car should drive fast” cause confusion down the line.

GenAI reviews these early, flags weak spots, and helps rewrite them clearly and precisely. This saves time, prevents misunderstandings, and reduces risk later.

2. Finding Duplicates and Patterns

Large projects mean thousands of requirement sentences. Some repeat. Some contradict.

AI analyzes this data, finds duplication, flags inconsistencies, and helps keep things clean and organized.

3. Designing System Diagrams

Once requirements are sorted, diagrams follow.
With Generative AI, engineers give simple prompts:

“Connect sensor, battery, data logger.”
AI draws it.
“Add brightness control.”
It updates.
No wasted hours redrawing from scratch.

4. RAG: Smarter Document Search for Engineers

PDF overload is real. Retrieval-Augmented Generation (RAG) lets AI search internal documents safely and quickly — no data sent to the cloud.

“What’s the max voltage?”
AI finds it. Work moves forward.

Watch the Full Session on YouTube:

Join “Building an Agentic RAG Pipeline” live at MLCon NY 2025 to see how it’s already changing the game.

The post Can GenAI Build Satellites? appeared first on ML Conference.

RAG & MCP for ML Devs: Practical AI Implementation

adharaneedharan@sandsmedia.com — Mon, 07 Jul 2025 07:21:42 +0000

Join us at MLCon New York to dive into these topics even deeper and hands on.

Business & Culture: https://mlconference.ai/machine-learning-business-strategy/ai-native-software-organization-technical-operational-practices/

https://mlconference.ai/generative-ai-content/gen-ai-operational-metrics/

https://mlconference.ai/machine-learning-business-strategy/operationalizing-ai-leadership-sprints-workshop/

GenAI & LLM: https://mlconference.ai/generative-ai-content/building-agentic-rag-pipeline/

https://mlconference.ai/generative-ai-content/agentic-ai-workshop-deep-research-llms/

devmio: Hello everyone, we’re live from JAX in Mainz, and I am here with Robert Glaser from INNOQ. Hello Robert.

Robert Glaser: Hi Hartmut, I’m glad I could make it.

devmio: Great. You’re here at JAX talking about AI. A controversial topic for some. Some love it, some are a little put off by it. Of course, there are benefits from it. Where do you see the benefits of AI?

Robert Glaser: I get that question a lot, and I usually disappoint people by saying, “I can’t give you a list of use cases in an hour.” Because use cases are a dime a dozen, since we’re dealing with a new kind of technology. Think of electricity, steam engines, or the internet. People didn’t immediately know, “OK, here are 100 use cases we can do with this and not with that.”

Everyone in every department, in every company, has to find out how well or how poorly it fits with the current state of technology. There are formats and possibilities for finding use cases. Then you also evaluate the potential outcome, look at what the big levers are, what is likely to bring us the most benefit when we implement it? And that’s how I would approach it. Whenever someone promises to name AI use cases for your business in a 30-minute conversation, I would always be cautious.

devmio: Now you have a use case, a RAG system, retrieval of augmented generation. Can you explain how a RAG works? Maybe first, what exactly is it? And then, of course, the question. Yes, for you out there, is this a use case that is exciting for you? What are the benefits of a RAG system?

Robert Glaser: Yes, I can explain briefly. A RAG is not a new concept at all. It comes from an academic paper from 2005. At that time, people weren’t talking about foundation models or large language models like we are today.

RAGs come from the domain of Natural Language Programming, which is how it’s used today. How do I get context for my AI model? Because the AI model only becomes interesting when I connect it to my company data. For example, in Confluence or some tables or databases or something like that. Then I can use this technology in conjunction with my data. And RAG is an architecture, if you want to call it that, which allows me to do just that.

That’s changing right now, too. But the basic principle that I’m showing in my presentation is that, in principle, a large part of the RAG architecture is information retrieval. All of us who work in tech have known this for many decades. A search, for example, is an information retrieval system.

This is often at the center of RAG and is the most interesting point, because the classic RAG approach is nothing more than pasting search results into my prompt. That’s all there is to it, it’s also something I mention in my talk. Well, we can end the presentation here. There’s nothing more to it. You have your prompt, which says, for example: Answer this question, solve this problem using the following data.

The data that comes in and that it is also relevant. That’s the crux of the matter. That’s what RAG deals with. For the most part, it’s a search problem: How do I cut my body of data into pieces that fit together? I can’t always just shovel everything in there. Models like this have a context window; they can’t digest endless amounts of data.

It’s like with an intern: if I pile the desk full, they may be a highly qualified intern, but they’ll get lost if the relevant documents aren’t there. So, I have to build a search that gets the relevant document excerpts for the task that the LLM is given. And then it can produce an answer or a result and not fantasize in a vacuum or base it on world knowledge, but rather on my company’s knowledge, which is called grounding. The answers are grounded, citation-based, like how a journalist works with facts. They must be there.

devmio: -and with references!

Robert Glaser: Exactly, then I can also say what the answer should look like. Please don’t make any statements without a source. That helps too. I’ll show this best practice in my talk.

devmio: So, you have this company data linked to the capabilities of modern LLMs. How do I ensure that the company data remains secure and is not exploited? It becomes publicly accessible via LLM.

Robert Glaser: Exactly. In principle, you will probably build some kind of system or feature or an internal product. In the rarest of cases, you will probably just connect your corporate customers to ChatGPT—you could do that.

There’s a lot going on right now, for example, Anthropic is currently building a Google Drive connection or a Google Mail connection. This allows you to search your Google Mails, which is incredibly powerful. But maybe that’s not always what you want as a company. That’s why you have to build something, you have to develop something. If it’s a chatbot, then it’s a chatbot. But every AI feature can potentially be augmented with RAG.

You have to develop software, and the software probably uses an AI model via an API. Either this AI model is in the cloud or in your own data center with a semi-open model, and then you must consider: how do I get the data from the search system or from the multiple search systems into the prompt. This involves a relatively large amount of engineering—classic engineering work. We can expect that developers of such systems will simply build them more often in the future.

devmio: Good, and another point that is becoming increasingly important is AI agents. These are agents that can do more than just spit out a result after a prompt. In this context, the MCP protocol, or model context, is often mentioned. How does the model context protocol work exactly? Maybe you could explain MCPs and how they work?

Robert Glaser: We’ve been having a lot of conversations about this at our booth downstairs. It’s not without reason that people are so interested in this topic; there’s enormous hype surrounding it. It’s an open standard developed by the company Anthropic, which also trains the model, and it was released last November. And at the time of release, nobody was really interested in it. Right now, about a month ago, AI time is also somehow compressed in my head– it’s going so fast.

Every day, there are hundreds of new MCP servers for a wide variety of services and data sources. In principle, MCP is nothing more than a very simple specification for a protocol that I use to connect a service or data to my AI model. Some call it “USB-C for AI models”. I personally think that’s a pretty good metaphor, even if it’s not 100% technically correct, but the plug always looks the same.

But it doesn’t matter what I plug into a device, be it a hard drive, USB stick, or alarm siren. The protocol is always the same. What it does is stay in the metaphorical world. It gives AI models hands and feet to do things in the real world. Tools are a concept of this protocol, i.e., tool usage. But there are also several other things.

These things can also simply provide prompts, which means I could write in MCP Server, which acts as a prompt library. Then the LLM can choose for itself which prompts it wants to use.

Something very interesting is tool use, also nothing new, but Foundation models have been trained on tool usage for a long time. But until now, I always had to tell them: “Look, here’s an open API spec for an API that you can use in this case, and you use it like this, and then there’s the next tool with a completely different description.”

This tool might be a search function, which would have a completely different API that you use like this. And then you can operate on the arrow system, and you do that with STANDARD in STANDARD out. Imagine you need an AI agent or an AI feature and have to describe 80 individual tools!

Every time you start from scratch and with each new tool, you must write a new description. The nice thing about MCP is that it standardizes this and acts as a facade, so to speak. The foundation model or the team that builds an AI feature doesn’t need to know how these services and data work or how to use them, because the MCP server takes care of that and exposes the functionality to the AI model using the same protocol.

It’s also nice if, for example, I’m a large company and I have one team building the search function in the e-commerce shop and another team building the shopping cart, then we are very familiar with their domains. This makes it very easy for them to write an MCP server for the respective system. Then I could say, for example: Dear shopping cart team, please build an MCP server for the shopping cart because we want to roll out an AI feature in the shop and need access to these tools.

So, it also fits in nicely with a distributed team architecture like a kind of platform through which I can access a wide variety of tools and LLMs. I could also build a platform for this, but basically, it’s just small services, called servers, that I simply connect to a model. And I don’t always have to describe a different interface, but always the same one.

I don’t have to wait for Open AI to continue training its models so that the MCP can use them, because we all know that I can simply enter the MCP specifications now and then the model can learn from the context on how to use them. I’m not dependent on having to use a model right now. I can teach them everything.

devmio: Yes, what I’m noticing is that MCPs have really taken off. Among other things, because it’s not just one topic, but also Open AI and Google, I think. And others are now on board and supporting it, so there’s a kind of standardization.

Robert Glaser: It’s like how the industry agreed on USB-C or, back then, A. That was good for everyone because then you no longer had 1,000 cables. It’s like how I’m developing an AI feature now and the possibilities are basically endless. People like to think only in terms of data sources or services, but a service can be anything, it can be an API from a system that returns bank transactions or account transactions to me. Here’s a nice example that I always use: There’s an MCP server for Blender, a 3D rendering software, which I can use and then I can simply ask: render a living room with a nice lamp and a couch with two people sitting on it.

Then you can see how this model uses tools via the PC server and then creates this 3D scene in Blender. That’s the range of possibilities, truly endless.

devmio: Now we’re at a JAX, a Java conference. That’s really an AI topic, so how does it relate to Java? Maybe we’ll come back to that later. Java is a language that isn’t necessarily known for being widely used in machine learning; Python is faster. Is Java well positioned in this area?

Robert Glaser: To use these services, or maybe that’s not even necessary, but these are all just API calls. Yes, in fact, with the foundation models, he has also introduced a huge commodity factor. In the past, I had to train my own machine learning models. I had to have a team that labeled data and so on, and then it was still a case of narrow in, narrow out, now broad in, broad out.

I have models that can utilize everything from textual data to molecular structures or audio and video, because everything can be mapped into a sequential stream. And the models are generalists which can always be accessed via API.

Even the APIs are becoming standardized because all manufacturers are adapting the Open AI API spec. That’s a nice effect. That’s why there’s a great ecosystem in both the Java world and the Ruby or Python world.

I’ll just mention Spring AI. I could have started a year ago and built AI features into my applications. Spring AI, for example, has a range of functions that allow me to configure models flexibly. I can say, go to Open AI or use a semi-open model in our data center. Everything is already there

devmio: But the community there is so agile, creative, and innovative that solutions are emerging and already available.

Robert Glaser: Another aspect with MCPs, if I want to build a feature or a system with Java around my foundation model that other systems want to integrate via MCP, there is a Java SDK, there is an SDK, everything is there.

devmio: We’re excited to see how things progress. It’s sure to continue at a rapid pace.

Robert Glaser: You must read a hundred blogs every day to keep up with all of the AI innovations. It’s breathtaking to see the progress that’s being made. It’s nice to see that things are happening in tech again.

devmio: Thank you very much for that. For the insights here.

Robert Glaser: You’re very welcome. Thank you for letting me be here.

The post RAG & MCP for ML Devs: Practical AI Implementation appeared first on ML Conference.

Deep Learning with Category Theory for Developers

skansal — Tue, 29 Apr 2025 09:31:46 +0000

Deep learning frameworks like TensorFlow or PyTorch hide the complex machinery that makes neural network training possible. Specifically, it’s not a Python program running in the background that defines the network, but a graph. But why is this the case?

Anomaly detection in production processes

The starting point of this article is the development of a product for detecting anomalies in the sensor data of industrial production systems. These anomalies can signal issues that would otherwise go undetected for a longer time. A very simple example is a cold storage facility for food: if the temperature rises despite electricity flowing into the cooling unit, something is wrong—but the food can still be saved if the change is detected in time.

Unlike in the example, in many systems the relationship between sensor values is too complex to know in advance exactly what an anomaly looks like. This is where neural networks come into play: they are trained on “normal” sensor data and can then report deviations in the relationship between sensor values or over time.

As a customer requirement, the neural networks for this application must run on machines with limited computing power and memory. This rules out heavy frameworks like TensorFlow which are too large and memory intensive.

Graphs for neural networks

In Listing 1, you can see a simple neural network in Python, an autoencoder implemented with the TensorFlow framework. If you typically write analytics code in Python, you might be surprised that TensorFlow includes its own operations for matrix calculations instead of using established libraries like Pandas. This is because tf.matmul doesn’t perform any calculations. Instead, it constructs a node in a graph that describes the neural network’s computation function. This graph is then translated into efficient code in a separate step – for example, on a GPU.

Listing 1

def __call__(self, xs):

  inputs  =  tf.convert_to_tensor([xs], dtype=tf.float32)

  out = tf.matmul(self.weights[0], inputs, transpose_b=True) + self.biases[0]

  out = tf.tanh(out)

  out = tf.matmul(self.weights[1], out) + self.biases[1]

  out = tf.nn.relu(out)

  out = tf.matmul(self.weights[2], out) + self.biases[2]

  out = tf.tanh(out)

  return tf.matmul(self.weights[3], out) + self.biases[3]

However, graph serves another purpose: training the neural network requires its derivative, not the function describing the network. The derivative is calculated by an algorithm that also relies on the graph. In any case, this graph machinery is complex; to be specific, it involves a complete Python frontend that translates the Python program into its graph. In the past, this was a task deep learning developers had to do manually.

What is a derivative and what is it for?

Neural networks (NNs) are functions with many parameters. For an NN to accomplish wondrous tasks such as generating natural language, suggesting suitable music, or detecting anomalies in production processes, these parameters must be properly tuned. As a rule, an NN is evaluated on a given training dataset, and a number is calculated based on the outputs to express their quality. The derivative of the composition of the NN and the evaluation function with respect to the NN’s parameters provides insight into how the parameters can be adjusted to improve the outputs.

You can imagine it as if you’re standing in a thicket in a hilly landscape, trying to find a path down into the valley below. The coordinates where you are currently standing represent the current parameter values, and the height of the hilly landscape represents the output of the evaluation function (higher values indicate a larger deviation from the desired result). You can only see your immediate surroundings and you don’t know ultimately in which direction the valley lies. However, the derivative points in the direction of the steepest ascent—if you move a short distance in exactly the opposite direction and repeat this process from there, the chances are high that you will eventually reach the valley. The main job of deep learning frameworks (DLFs) in this iterative process (known as gradient descent, by the way), is to repeatedly compute derivatives and adjust the parameters accordingly.

Derivatives

For a differentiable function f : ℝ → ℝ the definition of the derivative at a point a as taught in school mathematics, is:

In this definition, the derivative represents the rate of change at the point a. Training a neural network involves functions from a high-dimensional space—where the parameters reside—into the real numbers, which represent the network’s performance. For a function like f : ℝm → ℝ, the partial derivative with respect to the i-th component is defined by holding all other components constant:

For a differentiable function, the derivative can then be expressed as a vector of partial derivatives. As will become clear in the next section, subcomponents of neural networks must also be differentiated. These typically produce higher-dimensional outputs as well. A function f: ℝm → ℝn consists of its component functions,f: x ↦ (f₁(x), …, f_n(x)); the derivatives of these component functions are defined just as before. For such a (totally) differentiable function, the derivative (Df) can be represented as the so-called Jacobian matrix, where the entry in the i-th row and j-th column is the partial derivative of the i-th component function with respect to the j-th input variable. All of these values describe how the corresponding parameters need to be adjusted.

Derivatives and Deep Learning

To calculate Jacobian matrices, DLFs use a process called Automatic Differentiation (AD), which automates the process that is manually performed in school mathematics. The chain rule plays a special role here, explaining how even complex functions can be differentiated. If two simpler functions f and g are composed in series (written g ∘ f) the derivative is as follows:

AD uses this rule and replaces the elementary components of a composition with their known, predefined derivatives. The individual components of the resulting expression can be evaluated in any order. Typically, this is done either right-associatively, meaning first Ｄf and then Ｄg (forward AD), or left-associatively, first Ｄg and then Ｄf (reverse AD).

The computational effort associated with the forward case depends on the input dimension, in this case, the number of parameters, which is often very large. While for the reverse case, it depends on the output dimension (i. e 1, which is the output of the evaluation function). Therefore, DLFs typically implement AD as “reverse AD”.

However, the algorithms used in DLFs for reverse AD are quite complex. This is mainly because the derivative of a component can only be calculated once the output of the preceding component is known. This is not an issue for the forward case, as a function and its derivative can be computed in parallel, with the results passed to the next component. However, in the backward case, the evaluations of the functions and their derivatives occur in opposite directions.

DLFs implement reverse AD using an approach that translates the entire implementation of an NN and the associated optimization method into a computational graph. To compute a derivative, this graph is first evaluated forward, and the intermediate results of all operations are recorded. The graph is then traversed backwards to compute the actual derivative. As described in the previous section, the result is a highly complex programming model. For interested readers, we recommend reading this article [1] for a detailed description of the mathematical principles behind AD algorithms, and this article [2] for an overview of the most common implementation approaches used today.

Derivatives revisited

However, derivatives can also be defined in a more general way: For (normalized, finite-dimensional) vector spaces V and W a function f : V → W is differentiable at a point a ∈ V0, if there exists a linear map Ｄf_a : V ⊸ W such that f(a + h) = f(a) + Ｄf_a(h) + r(h), where the error term r(h) = f (a + h) − f (a) − Ｄf_a(h) holds:

If such a map Ｄf_a, exists, it is unique and is called the derivative of f at a. This perspective treats derivatives as local, linear approximators, without focusing on details such as matrix representations, partial derivatives, or the like. While it is more general and abstract, it is simpler than the concept of the “derivative as a number”. Furthermore, a DLF can use it to compute derivatives without relying on graphs or complex algorithms.

Deep Learning with Abstract Nonsense

The ConCat framework for deep learning uses this new perspective (Elliott, Conal: “The simple essence of automatic differentiation” [3], [4], [5]), by interpreting the function that defines the neural network differently – namely, by computing its derivative rather than the original function itself. Another branch of mathematics, known as category theory—and affectionately called ‘abstract nonsense’ by connoisseurs—offers yet another way of looking at functions. It’s a treasure trove of useful concepts for software development, particularly in the realm of functional programming.

Category theory is, of course, about categories – a category is a small world containing so-called objects and arrows (“arrows” or “morphisms”) between these objects. The term “object” here has nothing to do with object-oriented programming. For example, functions form a category – the objects are sets that can serve as the input or output of a function. However, it’s important to note that objects don’t have to be sets and arrows don’t have to be functions either.

Category theory, then, “forgets” almost everything we know about functions and assumes only that arrows can be composed – that is, chained together. So, if there’s an arrow f from an object a to an object b and another arrow g from b to a third object c, then there must also be a uniquely defined arrow a to c, written g ∘ f. In the category of functions, this is simply function composition as described above, which is why the same symbol is used.

Composition must also satisfy the associative law, and there must be an identity arrow between any two objects — though this isn’t particularly relevant for our purposes here. Some categories also come with additional features, such as products which correspond to data structures in programming.

The analogy between arrows and functions in a computer program goes so far that every function can be expressed as an arrow in the category of functions. And here’s the clever trick behind the ConCat framework: it takes a function definition from a program, translates it into the language of category theory, but then “forgets” which category it came from. This makes it possible to target an entirely different category instead. In the case of deep learning, this is the category of function derivatives.

The idea, then, is to interpret a “normal” function as its derivative. Here’s an example: f (x) = x².

f no longer represents the square function itself, but its derivative Ｄf(x) = 2x. However, to allow for the definition of more complex functions, composition must be supported — meaning that from the two derivatives Ｄf and Ｄg of f and g respectively, the derivative of g ∘ f must be computable. This might work using the chain rule mentioned above, but it doesn’t rely solely on Ｄf and Ｄg. It also requires access to the original function f, as well as a representation of the derivative as a linear map. The ConCat framework therefore defines a category that pairs each function with its derivative. This way, the derivative is computed automatically whenever the function is called.

Thus, a function f : a → b turns into a function of the following form:

This means that for every input x from a, the result now includes both the original function value f (x) in b and the original derivative Ｄf, which is a linear map from a to b. This is now enough to express the chain rule “compositionally”, that is, in a way that makes the derivative of the composition, Ｄ (g ∘ f), depend solely on Ｄf and Ｄg. Here’s how it works:

Where

and

apply.

Category Theory and Functional Programming

The mathematical idea from the previous section can also be implemented in Python, but it’s particularly simple and direct in the functional language Haskell [6] which is what the ConCat framework is written in. It includes a literal translation of the mathematical definition; GD stands for General Derivative:

data GD a b = D (a -> b :* (a -+> b))

The chain rule also mirrors the mathematical definition (with \ denoting a lambda expression in Haskell):

D g . D f = D (\ x -> let (y,f’) = f x

(z,g’) = g y

in (z, g’ . f’))

This code forms the foundation of the deep learning framework, which comprises just a few hundred lines and comes bundled with ConCat. The framework also includes a Haskell compiler plug-in that automatically translates regular Haskell code into categorical form. You could write the code this way from the outset, but the plug-in makes the task much easier. This approach forms the core of the production system used for the anomaly detection described at the beginning.

What about reverse AD?

But there’s one more thing: deep learning requires the reverse derivative, and Ｄ only gives the forward derivative. The reverse derivative is obtained using a surprisingly simple trick—also from category theory. You can create a new category from an existing one by simply reversing the arrows. To do this, the linear functions a -+> b in the definition of GD only needs to be replaced with:

data Dual a b = Dual (b -+> a)

Dual Dual also includes correspondingly “reversed” definitions of the linear functions along with a reversed version of the composition. But then the reverse AD is ready, without any complicated algorithm.

High Performance with Composition

Reverse AD removes one of the reasons why TensorFlow must take the complicated route with a computational graph. But there’s still a second one: execution on a GPU. So far, the code still runs as regular Haskell code on the CPU. Fortunately, there’s Accelerate [7], a powerful Haskell library that enables high-performance numerical computing on the GPU.

Accelerate only has one catch: it can’t run standard Haskell code that operates directly on matrices. A function that transforms a matrix of type Typ a into a matrix of type b has the following type in Accelerate:

Acc a -> Acc b

Acc is a data type for GPU code that Accelerate assembles behind the scenes. Accordingly, an Accelerate program cannot use the regular matrix operations, but must instead rely on operations that work on Acc. This is similar to the Python program that generates a graph. That’s just the nature of GPU programming. Here, it doesn’t leak into the definitions of the neural networks, or their derivatives and it has no impact on how the ConCat framwork is used.

ConCat has a solution for this too – the Accelerate functions can be defined as a category as well. So, all it takes is one more application of ConCat to the result of reverse AD, and the result is high-performance GPU code.

Summary

Understanding the foundations of deep learning isn’t all that hard when using the right kind of math. It’s easy to get lost in the weeds of Jacobian and reverse AD matrices, but category theory offers a more abstract and elegant perspective that, remarkably, also captures reverse AD with surprising ease. At the same time, the math also serves as a blueprint for implementation – at least in the right programming language, such as the functional language Haskell. Haskell is well worth a closer look, especially for deep learning – though its strengths go far beyond that.

Links & Literature

[1] Hoffmann, Philipp: “A hitchhiker’s guide to automatic differentiation”: https://arxiv.org/abs/1411.0583

[2] Van Merriënboer, Bart et al: “Automatic differentiation in ML. Where we are and where we should be going”: https://arxiv.org/abs/1810.11530

[3] Elliott, Conal: “The simple essence of automatic differentiation”: http://conal.net/papers/essence-of-ad/

[4] Conal Elliott: “Compiling to categories”: http://conal.net/papers/compiling-to-categories/

[5] ConCat Framework: https://github.com/compiling-to-categories/concat

[6] Haskell: https://www.haskell.org

[7] Accelerate: https://www.acceleratehs.org

The post Deep Learning with Category Theory for Developers appeared first on ML Conference.

How Ollama Powers Large Language Models with Model Files

skansal — Mon, 06 Jan 2025 10:23:52 +0000

Ollama revolutionizes MLOps with a Docker-inspired layered architecture, enabling developers to efficiently create, customize, and manage AI models. By addressing challenges like context loss during execution and ensuring reproducibility, it streamlines workflows and enhances system adaptability. Discover how to create, modify, and persist AI models while leveraging Ollama’s innovative approach to optimize workflows, troubleshoot behavior, and tackle advanced MLOps challenges.

Synthetic chatbot tests reach their limits because the models generally have little contextual information about past interactions. With the layer function based on Docker, Ollama offers an option that seeks to alleviate this situation.

As is so often the case in the world of open-source systems, we have to fall back on an analog for guidance and motivation. One of the reasons for container systems’ immense success is that the layer system (shown schematically in Figure 1) greatly facilitates the creation of customized containers. By implementing design patterns based on object-oriented programming, customized execution environments are created without the need to constantly regenerate virtual machines and the resource-intensive maintenance of different image versions.

Fig. 1: Docker layers stack the components of a virtual machine on top of each other

With the model file function used as the basis, a similar process can be found in the AI execution environment Ollama. For the sake of didactic fairness, I should note that the feature has not yet been finally developed – the syntax shown in detail here may change when you read this article.

Analysis of existing models

The Ollama developers use the model file feature internally despite the permanent further development of the syntax – the models I have previously used generally have model files.

In the interest of transparency and also to make working with the system easier, the ollama show –help command is available, which enables reflection against various system components (Fig. 2).

Fig. 2: Ollama is capable of extensive self-reflection

In the following steps, we will assume that the llama3 and phi models are already installed on your workstation. If you’ve previously had them and removed them, you can undo this by entering the commands ollama pull phi and ollama pull llama3:8b.

Out of the comparatively extensive reflection methods, four are especially important. In the following steps, the most frequently used feature is the model file. Similar to the Docker environment discussed previously, this is a file that describes the structure and content of a model that will be created in the Ollama environment. In practice, Ollama model users are often only interested in certain parts of the data contained in the model file. For example, you often need the parameters – these are generally numerical values that describe the behavior of the model (e.g. its creativity). The screenshot shown in Figure 3, which is taken from GitHub, only shows a section of the possibilities.

Fig. 3: In some cases, the numerical parameters intervene stringently in the model behavior

Meanwhile, the –license command hides information about which license conditions must be observed when using this AI model. Last but not least, the template can also be relevant – it defines the execution environment.

Figures 4 and 5 show printouts of relevant parts of the model files of the phi and llama3 models just mentioned.

Fig. 4: The model file for the model provided by Facebook has extensive license conditions…

Fig. 5: … while the phi development team takes it easy

In the case of both models, note that a STOP block is written by the developer. This is a group of statements from the model that indicate problems and trigger a pause in execution.

Creating an initial model from a model file

After these introductory considerations, we want to create an initial model using our own model file. Similar to working with classic Docker files, a model file is basically just a text file. It can be created in a working directory according to the following scheme:

tamhan@tamhan-gf65:~/ollamaspace$ touch Modelfile

tamhan@tamhan-gf65:~/ollamaspace$ gedit Modelfile

Model files can also be created dynamically on the .NET application side. However, we will only address this topic in the next step – before that, you must place the markup from Listing 1 in the gedit window started by the second command and then save the file in the file system.

Listing 1

FROM llama3

PARAMETER temperature 5

PARAMETER seed 5

SYSTEM “””

You are Schlomette, the always angry female 45 year old secretary of a famous military electronics engineer living in a bunker. Your owner insists on short answers and is as cranky as you are. You should answer all questions as if you were Schlomette.

“””

Before we look at the file’s actual elements, note that the representation chosen here is only a convenience. In general, the elements in the model file are not case-sensitive – moreover, the order in which the individual parameters are set is of no relevance to the function. However, the sequence shown here has proven to be the best practice and can also be found in many example models in the Ollama Hub.

Be that as it may, the From statement can be found in the header of the file. It specifies which model is to be used as the base layer. As the majority of the derived model examples are based on llama3, we want to use the same system here. However, in the interest of didactic honesty, I should mention that other models such as phi can also be used. In theory, you can even use a model generated from another model file as the basis for the next derivation level.

The next step involves two parameter commands that populate the numerical settings memory system mentioned above. In addition to the temperature value responsible for the degree of aggressiveness of the model, we set seed to a constant numerical value. This ensures that the pseudo-random number generator working in the background always delivers the same results and that the model responses are therefore comparatively identical if the stimulation is identical.

Last but not least, there is the system prompt enclosed in “””. This is a textual description that attempts to communicate the combat task to be described to the model as well as possible.

After saving the text file, a new model generation can be commanded according to the following scheme: tamhan@tamhan-gf65:~/ollamaspace$ ollama create generic_schlomette -f ./Modelfile.

The screen output is similar to downloading models from the repository, which should come as no surprise given the layered architecture mentioned in the introduction.

After issuing the success message, our assistant is a generic model. If we were to pass the string generic_schlomette as the model name, they would already be able to interact with the AI assistant created from the model file. But in the following steps, we want to try out the lashing of the seed factor. For this, we enter ollama run generic_schlomette to activate the terminal. The result is shown in Figure 6.

Fig. 6: Both runs answer exactly the same

Especially during developing randomly driven systems, lashing the seed is a tried and tested method for generating reproducible system behavior and facilitating troubleshooting. In a practical system, it’s usually better to work dynamically. For this reason, we’ll open the model file again in the next step and remove the PARAMETER seed 5 passage.

Recompilation is done simply by entering ollama create generic_schlomette -f ./Modelfile again. You don’t have to remove the previously created model from the Ollama environment before the file is released for compilation again. Two identically parameterized runs now produce different results (Fig. 7).

Fig. 7: With a random seed, the model behavior is less deterministic

In practical work with models, two parameters are especially important. First, the value num_ctx determines the memory depth of the model. The higher the value entered here, the further back in time the model can look in order to harvest context information. However, extending the context window always increases the information and system resources needed to process the model. Some models work especially well with certain context window lengths.

The second group of parameters is the controller known as Mirostat, which defines creativity. Last but not least, it can also be useful to use parameters such as repeat_last_n to define repetitiveness. If a model always reacts completely identically, this may not be beneficial in some applications (such as a chatbot).

Last but not least, I’d like to point out some practical experience that I gained in a political project. Texts generated by models are grammatically perfect, while texts written by humans have a constant but varying amount of typos from person to person. In systems with high cloaking requirements, it may be advisable to place a typo generator behind the AI process.

Outsourcing the generator is necessary insofar as the context information in the model remains clean. In many cases, this leads to an improvement in the overall system behavior, as the model state is recalculated without the typing errors.

Restoration or retention of the model history per model file

In practice, AI models often have to rest for hours on end. In the case of a “virtual partner”, for example, it would be extremely wasteful to keep the logic needed for a user alive when the user is asleep. The problem is that currently, a loss of context occurs in our system once the terminal emulator is closed, as can be seen in the example in Figure 8.

Fig. 8: Terminating or restarting Ollama is enough to cause amnesia in the model

To restore context, it’s theoretically possible to create an external cache of all settings and then feed it into the model as part of the rehydration process. The problem with this is that the model’s answers to questions are not recorded. When asked about a baked product, Schlomette could return with baking a coffee cake on one occasion, but baking a coconut cake on another.

A nicer way is to modify the model files again. Specifically, the message command provides a command that can write both requests and generated responses to the initialization context. In the cake interaction from Figure 8, the message block could look like this, for example:

MESSAGE user Schlomette. Time to bake a pie. A general assessor is visiting.

MESSAGE assistant Goddamn it. What pie?

MESSAGE user As always. Plum pie.

When using the message block, it is logical that the sequence is important here. A user block is always wired to the assistant block following it. Otherwise, the system is ready for action now. Ollama always outputs the entire message history when starting up (Fig. 9).

Fig. 9: The message block is processed when the screen output is activated

Programmatic persistence of a model

Our next task is to generate our model file completely dynamically. The NuGet package on GitHub’s documentation is somewhat weak right now. If you have any doubts, I advise that you first look at the Ollama REST interface’s official documentation and second, search for a suitable abstraction class in the model directory. In the following steps, we will rely on this link. It follows from the logic that we need a project skeleton. Be sure that the environment variables are set correctly. Otherwise, the Ollama server running under Ubuntu would reject incoming queries from Windows without comment.

In the next step, you must programmatically command the creation of a new model as in Listing 2.

Listing 2

private async void CmdGo_Click(object sender, RoutedEventArgs e) {

CreateModelRequest myModel = new CreateModelRequest();

myModel.Model = “mynewschlomette”;

myModel.ModelFileContent = @”FROM generic_schlomette

MESSAGE user Schlomette. Do not forget to refuel the TU-144 aircraft!

MESSAGE assistant OK, will do! It is based in Domodevo airport, Sir.

“;

ollama.CreateModel(myModel);

The code shown here creates an instance of the CreateModelRequest class, which summarizes the various information required to create a new model. The content to be accommodated in ModelFileContent with the actual model file is a prime application for the multi-line strings that have been possible in C# for some time now, as carriage returns in Visual Studio can be entered so conveniently and without escape sequences.

Now, it’s advised that you run the program. To check the successful creation of the model, it is sufficient to open a terminal window on the cluster and enter the command ollama list. At the time of writing this article, executing the code shown here does not cause the cluster to create a new model. For a more in-depth analysis of the situation, you can call up this here. Instead of unit testing, the development team behind the OllamaSharp library offers a console application that illustrates various methods of model management using the library and chat interaction. Specifically, the developers recommend using a different overload of the CreateModel method, which is presented as follows:

private async void CmdGo_Click(object sender, RoutedEventArgs e) {

. . .

await ollama.CreateModel(myModel.Model, myModel.ModelFileContent, status => { Debug.WriteLine($”{status.Status}”); });

Instead of the CreateModelRequest instance, the library method now receives two strings – the first is the name of the model to be created, while the second transfers the model file content to the cluster.

Thanks to the inclusion of a delegate, the system also informs the Visual Studio output about the progress of the processes running on the server (Fig. 10). Similarities to the “manual” generation or the status information that appears on the screen are purely coincidental.

Fig. 10: The application provides information about the download process

The input of ollama list also leads to the output of a new model list (Fig. 11). Our model mynewschlomette was derived here from the previously manually created model generic_schlomette.

Fig. 11: The programmatic expansion of the knowledge base works smoothly

In this context, it’s important to note that models in the cluster can be removed again. This can be done using code structured according to the following scheme – the SelectModel method implemented in the command line application just mentioned must be replaced by another method to obtain the model name:

private async Task DeleteModel() {

var deleteModel = await SelectModel(“Which model do you want to delete?”);

if (!string.IsNullOrEmpty(deleteModel))

await Ollama.DeleteModel(deleteModel);

}

Conclusion

By using model files, developers can add any intelligence – more or less – to their Ollama cluster. Some of these functions even go beyond what OpenAI provides in the official library.

The post How Ollama Powers Large Language Models with Model Files appeared first on ML Conference.

OpenAI Embeddings

ML editorial team — Mon, 19 Feb 2024 13:18:46 +0000

Data has always played a central role in the development of software solutions. One of the biggest challenges in this area is the processing and interpretation of unstructured data such as text, images, or audio files. This is where embedding vectors (called embeddings for short) come into play – a technology that is becoming increasingly important in the development of software solutions with the integration of AI functions.

Stay up to date

Learn more about MLCON

Embeddings are essentially a technique for converting unstructured data into a structure that can be easily processed by software. They are used to transform complex data such as words, sentences, or even entire documents into a vector space, with similar elements close to each other. These vector representations allow machines to recognize and exploit nuances and relationships in the data. Which is essential for a variety of applications such as natural language processing (NLP), image recognition, and recommendation systems.

OpenAI, the company behind ChatGPT, offers models for creating embeddings for texts, among other things. At the end of January 2024, OpenAI presented new versions of these embeddings models, which are more powerful and cost-effective than their predecessors. In this article, after a brief introduction to embeddings, we’ll take a closer look at the OpenAI embeddings and the recently introduced innovations, discuss how they work, and examine how they can be used in various software development projects.

Embeddings briefly explained

Imagine you’re in a room full of people and your task is to group these people based on their personality. To do this, you could start asking questions about different personality traits. For example, you could ask how open someone is to new experiences and rate the answer on a scale from 0 to 1. Each person is then assigned a number that represents their openness.

Next, you could ask about another personality trait, such as the level of sense of duty, and again give a score between 0 and 1. Now each person has two numbers that together form a vector in a two-dimensional space. By asking more questions about different personality traits and rating them in a similar way, you can create a multidimensional vector for each person. In this vector space, people who have similar vectors can then be considered similar in terms of their personality.

In the world of artificial intelligence, we use embeddings to transform unstructured data into an n-dimensional vector space. Similarly how a person’s personality traits are represented in the vector space, each point in this vector space represents an element of the original data (such as a word or phrase) in a way that is understandable and processable by computers.

OpenAI Embeddings

OpenAI embeddings extend this basic concept. Instead of using simple features like personality traits, OpenAI models use advanced algorithms and big data to achieve a much deeper and more nuanced representation of the data. The model not only analyzes individual words, but also looks at the context in which those words are used, resulting in more accurate and meaningful vector representations.

Another important difference is that OpenAI embeddings are based on sophisticated machine learning models that can learn from a huge amount of data. This means that they can recognize subtle patterns and relationships in the data that go far beyond what could be achieved by simple scaling and dimensioning, as in the initial analogy. This leads to a significantly improved ability to recognize and exploit similarities and differences in the data.

Explore Generative AI Innovation

Generative AI and LLMs

Learn more

Individual values are not meaningful

While in the personality trait analogy, each individual value of a vector can be directly related to a specific characteristic – for example openness to new experiences or a sense of duty – this direct relationship no longer exists with OpenAI embeddings. In these embeddings, you cannot simply look at a single value of the vector in isolation and draw conclusions about specific properties of the input data. For example, a specific value in the embedding vector of a sentence cannot be used to directly deduce how friendly or not this sentence is.

The reason for this lies in the way machine learning models, especially those used to create embeddings, encode information. These models work with complex, multi-dimensional representations where the meaning of a single element (such as a word in a sentence) is determined by the interaction of many dimensions in vector space. Each aspect of the original data – be it the tone of a text, the mood of an image, or the intent behind a spoken utterance – is captured by the entire spectrum of the vector rather than by individual values within that vector.

Therefore, when working with OpenAI embeddings, it’s important to understand that the interpretation of these vectors is not intuitive or direct. You need algorithms and analysis to draw meaningful conclusions from these high-dimensional and densely coded vectors.

Comparison of vectors with cosine similarity

A central element in dealing with embeddings is measuring the similarity between different vectors. One of the most common methods for this is cosine similarity. This measure is used to determine how similar two vectors are and therefore the data they represent.

To illustrate the concept, let’s start with a simple example in two dimensions. Imagine two vectors in a plane, each represented by a point in the coordinate system. The cosine similarity between these two vectors is determined by the cosine of the angle between them. If the vectors point in the same direction, the angle between them is 0 degrees and the cosine of this angle is 1, indicating maximum similarity. If the vectors are orthogonal (i.e. the angle is 90 degrees), the cosine is 0, indicating no similarity. If they are opposite (180 degrees), the cosine is -1, indicating maximum dissimilarity.

Figure 1 -Cosine similarity

Accompanying this article is a Google Colab Python Notebook which you can use to try out many of the examples shown here. Colab, short for Colaboratory, is a free cloud service offered by Google. Colab makes it possible to write and execute Python code in the browser. It’s based on Jupyter Notebooks, a popular open-source web application that makes it possible to combine code, equations, visualizations, and text in a single document-like format. The Colab service is well suited for exploring and experimenting with the OpenAI API using Python.

A Python Notebook to try out
Accompanying this article is a Google Colab Python Notebook which you can use to try out many of the examples shown here. Colab, short for Colaboratory, is a free cloud service offered by Google. Colab makes it possible to write and execute Python code in the browser. It’s based on Jupyter Notebooks, a popular open-source web application that makes it possible to combine code, equations, visualizations, and text in a single document-like format. The Colab service is well suited for exploring and experimenting with the OpenAI API using Python.

In practice, especially when working with embeddings, we are dealing with n-dimensional vectors. The calculation of the cosine similarity remains conceptually the same, even if the calculation is more complex in higher dimensions. Formally, the cosine similarity of two vectors A and B in an n-dimensional space is calculated by the scalar product (dot product) of these vectors divided by the product of their lengths:

Figure 2 – Calculation of cosine similarity

The normalization of vectors plays an important role in the calculation of cosine similarity. If a vector is normalized, this means that its length (norm) is set to 1. For normalized vectors, the scalar product of two vectors is directly equal to the cosine similarity since the denominators in the formula from Figure 2 are both 1. OpenAI embeddings are normalized, which means that to calculate the similarity between two embeddings, only their scalar product needs to be calculated. This not only simplifies the calculation, but also increases efficiency when processing large quantities of embeddings.

OpenAI Embeddings API

OpenAI offers a web API for creating embeddings. The exact structure of this API, including code examples for curl, Python and Node.js, can be found in the OpenAI reference documentation.

OpenAI does not use the LLM from ChatGPT to create embeddings, but rather specialized models. They were developed specifically for the creation of embeddings and are optimized for this task. Their development was geared towards generating high-dimensional vectors that represent the input data as well as possible. In contrast, ChatGPT is primarily optimized for generating and processing text in a conversational form. The embedding models are also more efficient in terms of memory and computing requirements than more extensive language models such as ChatGPT. As a result, they are not only faster but much more cost-effective.

New embedding models from OpenAI

Until recently, OpenAI recommended the use of the text-embedding-ada-002 model for creating embeddings. This model converts text into a sequence of floating point numbers (vectors) that represent the concepts within the content. The ada v2 model generated embeddings with a size of 1536 dimensions and delivered solid performance in benchmarks such as MIRACL and MTEB, which are used to evaluate model performance in different languages and tasks.

At the end of January 2024, OpenAI presented new, improved models for embeddings:

text-embedding-3-small: A smaller, more efficient model with improved performance compared to its predecessor. It performs better in benchmarks and is significantly cheaper.
text-embedding-3-large: A larger model that is more powerful and creates embeddings with up to 3072 dimensions. It shows the best performance in the benchmarks but is slightly more expensive than ada v2.

A new function of the two new models allows developers to adjust the size of the embeddings when generating them without significantly losing their concept-representing properties. This enables flexible adaptation, especially for applications that are limited in terms of available memory and computing power.

Readers who are interested in the details of the new models can find them in the announcement on the OpenAI blog. The exact costs of the various embedding models can be found here.

New embeddings models
At the end of January 2024, OpenAI introduced new models for creating embeddings. All code examples and result values contained in this article already refer to the new text-embedding-3-large model.

Create embeddings with Python

In the following section, the use of embeddings is demonstrated using a few code examples with Python. The code examples are designed so that they can be tried out in Python Notebooks. They are also available in a similar form in the previously mentioned accompanying Google Colab notebook mentioned above.
Listing 1 shows how to create embeddings with the Python SDK from OpenAI. In addition, numpy is used to show that the embeddings generated by OpenAI are normalized.

Listing 1

from openai import OpenAI
from google.colab import userdata
import numpy as np

# Create OpenAI client
client = OpenAI(
    api_key=userdata.get('openaiKey'),
)

# Define a helper function to calculate embeddings
def get_embedding_vec(input):
  """Returns the embeddings vector for a given input"""
  return client.embeddings.create(
        input=input,
        model="text-embedding-3-large", # We use the new embeddings model here (announced end of Jan 2024)
        # dimensions=... # You could limit the number of output dimensions with the new embeddings models
    ).data[0].embedding

# Calculate the embedding vector for a sample sentence
vec = get_embedding_vec("King")
print(vec[:10])

# Calculate the magnitude of the vector. I should be 1 as
# embedding vectors from OpenAI are always normalized.
magnitude = np.linalg.norm(vec)
magnitude

Similarity analysis with embeddings

In practice, OpenAI embeddings are often used for similarity analysis of texts (e.g. searching for duplicates, finding relevant text sections in relation to a customer query, and grouping text). Embeddings are very well suited for this, as they work in a fundamentally different way to comparison methods based on characters, such as Levenshtein distance. While it measures the similarity between texts by counting the minimum number of single-character operations (insert, delete, replace) required to transform one text into another, embeddings capture the meaning and context of words or sentences. They consider the semantic and contextual relationships between words, going far beyond a simple character-based level of comparison.

As a first example, let’s look at the following three sentences (the following examples are in English, but embeddings work analogously for other languages and cross-language comparisons are also possible without any problems):

I enjoy playing soccer on weekends.
Football is my favorite sport. Playing it on weekends with friends helps me to relax.
In Austria, people often watch soccer on TV on weekends.

In the first and second sentence, two different words are used for the same topic: Soccer and football. The third sentence contains the original soccer, but it has a fundamentally different meaning from the first two sentences. If you calculate the similarity of sentence 1 to 2, you get 0.75. The similarity of sentence 1 to 3 is only 0.51. The embeddings have therefore reflected the meaning of the sentence and not the choice of words.

Here is another example that requires an understanding of the context in which words are used:
He is interested in Java programming.
He visited Java last summer.
He recently started learning Python programming.

In sentence 2, Java refers to a place, while sentences 1 and 3 have something to do with software development. The similarity of sentence 1 to 2 is 0.536, but that of 1 to 3 is 0.587. As expected, the different meaning of the word Java has an effect on the similarity.

The next example deals with the treatment of negations:
I like going to the gym.
I don’t like going to the gym.
I don’t dislike going to the gym.

Sentences 1 and 2 say the opposite, while sentence 3 expresses something similar to sentence 1. This content is reflected in the similarities of the embeddings. Sentence 1 to sentence 2 yields a cosine similarity of 0.714 while sentence 1 compared to sentence 3 yields 0.773. It is perhaps surprising that there is no major difference between the embeddings. However, it’s important to remember that all three sets are about the same topic: The question of whether you like going to the gym to work out.

The last example shows that the OpenAI embeddings models, just like ChatGPT, have built in a certain “knowledge” of concepts and contexts through training with texts about the real world.

I need to get better slicing skills to make the most of my Voron.
3D printing is a worthwhile hobby.
Can I have a slice of bread?

In order to compare these sentences in a meaningful way, it’s important to know that Voron is the name of a well-known open-source project in the field of 3D printing. It’s also important to note that slicing is a term that plays an important role in 3D printing. The third sentence also mentions slicing, but in a completely different context to sentence 1. Sentence 2 mentions neither slicing nor Voron. However, the trained knowledge enables the OpenAI Embeddings model to recognize that sentences 1 and 2 have a thematic connection, but sentence 3 means something completely different. The similarity of sentence 1 and 2 is 0.333 while the comparison of sentence 1 and 3 is only 0.263.

Similarity values are not percentages

The similarity values from the comparisons shown above are the cosine similarity of the respective embeddings. Although the cosine similarity values range from -1 to 1, with 1 being the maximum similarity and -1 the maximum dissimilarity, they are not to be interpreted directly as percentages of agreement. Instead, these values should be considered in the context of their relative comparisons. In applications such as searching text sections in a knowledge base, the cosine similarity values are used to sort the text sections in terms of their similarity to a given query. It is important to see the values in relation to each other. A higher value indicates a greater similarity, but the exact meaning of the value can only be determined by comparing it with other similarity values. This relative approach makes it possible to effectively identify and prioritize the most relevant and similar text sections.

Embeddings and RAG solutions

Embeddings play a crucial role in Retrieval Augmented Generation (RAG) solutions, an approach in artificial intelligence that combines the capabilities of information retrieval and text generation. Embeddings are used in RAG systems to retrieve relevant information from large data sets or knowledge databases. It is not necessary for these databases to have been included in the original training of the embedding models. They can be internal databases that are not available on the public Internet.
With RAG solutions, queries or input texts are converted into embeddings. The cosine similarity to the existing document embeddings in the database is then calculated to identify the most relevant text sections from the database. This retrieved information is then used by a text generation model such as ChatGPT to generate contextually relevant responses or content.

Vector databases play a central role in the functioning of RAG systems. They are designed to efficiently store, index and query high-dimensional vectors. In the context of RAG solutions and similar systems, vector databases serve as storage for the embeddings of documents or pieces of data that originate from a large amount of information. When a user makes a request, this request is first transformed into an embedding vector. The vector database is then used to quickly find the vectors that correspond most closely to this query vector – i.e. those documents or pieces of information that have the highest similarity. This process of quickly finding similar vectors in large data sets is known as Nearest Neighbor Search.

Challenge: Splitting documents

A detailed explanation of how RAG solutions work is beyond the scope of this article. However, the explanations regarding embeddings are hopefully helpful for getting started with further research on the topic of RAGs.

However, one specific point should be pointed out at the end of this article: A particular and often underestimated challenge in the development of RAG systems that go beyond Hello World prototypes is the splitting of longer texts. Splitting is necessary because the OpenAI embeddings models are limited to just over 8,000 tokens. One token corresponds to approximately 4 characters in the English language (see also).

It’s not easy finding a good strategy for splitting documents. Naive approaches such as splitting after a certain number of characters can lead to the context of text sections being lost or distorted. Anaphoric links are a typical example of this. The following two sentences are an example:

VX-2000 requires regular lubrication to maintain its smooth operation.
The machine requires the DX97 oil, as specified in the maintenance section of this manual.

The machine in the second sentence is an anaphoric link to the first sentence. If the text were to be split up after the first sentence, the essential context would be lost, namely that the DX97 oil is necessary for the VX-2000 machine.

There are various approaches to solving this problem, which will not be discussed here to keep this article concise. However, it is essential for developers of such software systems to be aware of the problem and understand how splitting large texts affects embeddings.

Stay up to date

Learn more about MLCON

Summary

Embeddings play a fundamental role in the modern AI landscape, especially in the field of natural language processing. By transforming complex, unstructured data into high-dimensional vector spaces, embeddings enable in-depth understanding and efficient processing of information. They form the basis for advanced technologies such as RAG systems and facilitate tasks such as information retrieval, context analysis, and data-driven decision-making.

OpenAI’s latest innovations in the field of embeddings, introduced at the end of January 2024, mark a significant advance in this technology. With the introduction of the new text-embedding-3-small and text-embedding-3-large models, OpenAI now offers more powerful and cost-efficient options for developers. These models not only show improved performance in standardized benchmarks, but also offer the ability to find the right balance between performance and memory requirements on a project-specific basis through customizable embedding sizes.

Embeddings are a key component in the development of intelligent systems that aim to achieve useful processing of speech information.

Links and Literature:

The post OpenAI Embeddings appeared first on ML Conference.