Exploring DeepSeek: How a New Player is Challenging AI Norms https://mlconference.ai/blog/deepseek-ai-challenging-llm-market/ Thu, 13 Mar 2025 09:40:25 +0000 https://mlconference.ai/?p=107624 In this interview, Dr. Pieter Buteneers shares his thoughts on the emerging AI company DeepSeek. We explore the technology behind it, discuss its potential impact on AI development, and touch upon the opportunities and challenges of this innovation.

Niklas Horlebein: The release of DeepSeek-R1 made quite an impression. But why? There are several open source LLMs. Is the impact because of its – allegedly – low model training costs?

Pieter Buteneers: There are a few special things about the DeepSeek AI release. The first is the fact that it’s very affordable. When using their reasoning model, you’ll pay about 10 times less than what you would pay at OpenAI. This came as a shock to the sector. So far, AI companies have poured billions of dollars into research and people could only buy in at a very high price. But now – apparently – new players are pushing for lower costs.

The other thing is, obviously: How were they able to train it with 6 million dollars, allegedly?

That’s only half of the story: It’s not just the 6-million-dollar figure. Basically, 6 million dollars is what brought them from having the V3 baseline model, with a simple question and an answer without reasoning, to improving it so that it can also reason about itself. It has become a reasoning model. Switching from one state to the other is what cost 6 million dollars.


That wasn’t much, as the V3 model already existed, so it wasn’t that expensive to train. I couldn’t find exact details on how much they paid for it, but allegedly, training it was also quite cheap.

What struck a lot of people was what’s called “distillation”. While this isn’t exactly the right term, the media picked up on it. Distillation is typically used with a large parent model to create a smaller version of that model that mimics its behavior. Sometimes, you succeed in receiving almost the same performance with the smaller model by using data generated by the larger one for training. That’s typically how distillation is used.

In this case, the term loosely applies to what DeepSeek AI did. They already had a base model that they wished to turn into a reasoning model. Using examples from OpenAI – and possibly others – to generate training data, they then trained their own model.

This made the process a lot cheaper, as they were able to piggyback on existing models and mimic them to achieve similar results, ensuring their model learned the same reasoning steps. Of course, this was a big shock. It showed how easy it has become to copy another player’s approach. This scares a lot of competitors.

The fourth thing is that they didn’t pay much – just 6 million to go from one model to another. However, Nvidia is not allowed to export their newest graphics cards to China, so China is forced to work with older hardware when training models. Apparently, they may have found a workaround, but in theory, they used Nvidia hardware one or two generations behind what European and American players are using, and yet still managed to achieve a model comparable to OpenAI.


Anthropic's founder, Dario Amodei, said defensively: "they're a year behind." I don't think they're actually a year behind, especially compared to OpenAI. Anthropic is a bit ahead. However, it shows that they're not that far behind. A year in this industry is a long time, and a lot can change in that period. Looking at the bigger picture, though: a new company coming out with a model like this and only being a year behind is incredible. It is insane how quickly they've managed to catch up, even with relatively modest resources, a smaller budget, and older hardware.

Niklas Horlebein: Sounds like a good trade-off: To be just a year behind, but save hundreds of millions by doing so. Let’s stick with money: From your perspective, how realistic are DeepSeek’s claims about low training costs and reduced need for computing power?

Pieter Buteneers: It's realistic if you can make your baseline model capable of reasoning. Then 6 million is pretty reasonable and achievable. But achieving a model that performs at least as well as their baseline model with that amount is very unlikely. They likely used other resources and the 6 million was mainly for switching to a reasoning model. It appears that's the phase the money went into: taking a question-answer model and making it reason about itself before providing an answer.

But obviously, it’s marketing – really, really good marketing – especially at a time when OpenAI and Anthropic have been raising billions in new capital. They have shown off that they can achieve something similar with amounts of money two orders of magnitude lower. This is an obvious shockwave, and they will certainly reach the same result with far fewer dollars. They exaggerated a bit by not mentioning the full picture, but still, it’s not something to be underestimated.

Niklas Horlebein: If it’s true that you can train a model for a much smaller amount of money, why exactly did or do competitors need so much capital? Was much of the budget for marketing?

Pieter Buteneers: When it comes to building a model and all of the steps required to get there, you can optimize the cost and learn from mistakes others made to go from A to B. But if you’re on the front row and pushing the envelope, those mistakes have not been made yet. You need a budget to make those mistakes. That’s the disadvantage of being a front-runner – you don’t know what’s coming or which hurdles you might encounter. This requires a bigger budget.

To use OpenAI as an example, they’ve been trying to release GPT-5 for months. Theoretically, a massive training run for GPT-5 was finished in December, although we do not know all of the details.

However, because running the model was so expensive, and because it only performed slightly better than GPT-4, they have released it as GPT-4.5 by now. So, it wasn’t worth it from a price-performance standpoint. They must completely rethink how they train models.

That process costs money. That’s the price you pay to be first, to be the front-runner. It’s crazy to see how a small Chinese player is only one year behind. It shows they’re catching up really, really quickly.

Niklas Horlebein: Is it really one year behind? What can you tell us about R1’s performance after a few weeks? How does it compare to other major LLMs?

Pieter Buteneers: If you compare its non-reasoning performance, it performs slightly below GPT-4o, but above Llama 3.3. In my opinion, it is the best-performing open source model at the moment. And as a bonus, it is the first reasoning model that is also open source. So, on the open source side, DeepSeek AI is at the bleeding edge of what you can get.

When comparing it to other closed source reasoning models, OpenAI’s o1 and o3 mini models are maybe a tiny bit better. Anthropic is again quite a few steps ahead. So, DeepSeek AI is not doing bad at all; it’s actually pretty good. And, in my opinion, it’s at most half a year behind.

Niklas Horlebein: DeepSeek’s models are “open weight”, not fully open source. What should developers know about the difference?

Pieter Buteneers: Most models are open weight, including Meta’s. This simply means that you can download the model to your computer, and you’re free to run it locally and modify it as much as you want, depending on the license. However, you don’t know what data the model has been trained on, nor do you know what training algorithm, setup, or procedure was used to train the model.

Essentially, they are an even greater black box than traditional AI models you train yourself. You only have the weights, which you can use to let the model do tasks for you, and you receive an output. But if you don't know what the training data is, you do not know what the model's knowledge base is. This limits your ability to inspect and learn from the model. You must reverse-engineer it, in essence, because this black-box system makes it very hard to learn what is really going on inside.

For example, with DeepSeek AI, you can see some moral judgments were made during the training process. If you write a prompt about Tiananmen Square or Xi Jinping, the model won’t provide an answer. It will try to avoid the topic and redirect the conversation elsewhere. This is because, in China, the law – or possibly culture – simply forbids discussion on these topics. It’s a moral judgment the creators made: if they want the model to survive in China, they must ensure it doesn’t say anything about these subjects.

However, all large language model providers make these decisions. Even Grok 3, which claims to be the snarkiest model and claims to say things as they are, will not tell you that Elon Musk is making mistakes or something similar. It’s trained not to.

They’ve inserted a moral judgment into it. None of these models will tell you how to buy unlicensed firearms on the internet. These topics have been removed, even though the data may exist somewhere on the internet and is likely in the training data.

The creators of these models make moral judgments about what is socially acceptable but you can only do that well by influencing the training process. However, with an open weight model, these decisions have already been made for you. You have no impact on how these models behave from a moral standpoint.

Niklas Horlebein: Would fully open source models allow for that? 

Pieter Buteneers: Yes. There’s still research to be done, but at least all the information is on the table.

With fully open source models, you have access to the training scripts and data and can make your own moral judgments. You can tweak the model so that it behaves in a way that aligns with what you believe. Standards like “do not kill other humans” are universally accepted, but every country or region has its own specific things that you’re not supposed to say.

I’m Belgian, and in Belgium, it’s okay to joke about Germans in the Second World War. But you probably wouldn’t do that in Germany, right? It’s not socially accepted. What’s morally acceptable really depends on the country and culture.

You can try making the model very generalist and avoid the subject of Nazis in general, but you don’t know what each country considers okay or not. Letting big organizations like DeepSeek AI, OpenAI, Anthropic, or Mistral make those decisions limits our societal impact on what these models – and possibly, in the future, a form of AGI (Artificial General Intelligence) – will deem acceptable.

That is why, in my opinion, it’s crucial to understand the inner workings of these models. You need the data, and you need to control the training process yourself to get the desired output.

This way, society can learn and understand how these models work and defend itself against the potential risks these models pose. As long as models are not open source, we as a society have very limited control over what goes in and what does not.

Niklas Horlebein: We’ve already seen the short-term impact of DeepSeek AI. From your perspective, what will be its long-term implications for the industry? Will we see more open source competitors now that training models may be more financially plausible?

Pieter Buteneers: Their claim to train models on a comparatively shoestring budget will encourage other players to consider doing something similar. 

The fact that the DeepSeek AI models are open source will attract developers and researchers to build on top. Meta understood that well. They were one of the first to make their biggest and smartest models open source. Mistral and others tried as well, but usually with smaller models. 

More and more players will feel the pressure to open source at least the weights, and I hope it will become a competition about who can be the most open source. There are already discussions about other models and releasing parts of their training data, so you can already watch the “one-up” game begin. The player who releases the most open source model – one that includes everything – will form the baseline for much of the research that follows, as it allows others to take that model and continue building on it.

If a model is released fully open source with everything included, everything society builds on top of it can be downloaded from the internet as code and integrated back into the model.

One of the reasons Meta decided to open source their models was to form the foundation for research on top of the Llama models and to be the de facto standard in LLM research going forward.

There's also the fact that open sourcing attracts talent. Very few people want to work day and night filling another billionaire's pockets without making a lot of money themselves and without being allowed to talk about their work. That's another reason companies are considering open source: it attracts the talent they need.

You saw this when OpenAI became the de facto "ClosedAI": a lot of people left the company and looked elsewhere. Now, they're big enough to still attract talent, but for smaller players, open sourcing is a way to make a difference. You'll see more open source models going forward, as many people are asking for them.

Open weight is the bare minimum. With that, the model can run anywhere. The training data might be more academic, so clients aren't necessarily asking for it. But when open sourced, it creates a richer community around the research, and the results of that research will be much easier to integrate back into the original model. This will be a competitive advantage for open source models.

Niklas Horlebein: Large U.S. companies seemed shaken by the unexpected competitor, but quickly stated they would stick to their approach of investing heavily in AI development. Do you think they will maintain this stance, or will OpenAI change its plans?

Pieter Buteneers: In the beginning, after DeepSeek AI was released, there was a bit of silence and shares dropped. People thought that perhaps models don’t need so much money. But a few weeks later, big players like OpenAI and Anthropic were raising massive amounts of capital. Being at the leading edge of this technology still requires huge amounts of money. That’s the sad part.

This is true throughout history: there are always innovators who launch something at a very high cost, and later, copycats try to mimic and replicate it at a much lower cost. But they’re different players, and they attract a different audience. Being on the cutting-edge attracts customers who want the most powerful and easiest-to-integrate solutions. OpenAI is still the best at that. Their models might not be the most performant compared to Anthropic, but they’re easier to integrate into code.

The answers are structured better; they're consistently improving over time. ChatGPT is much better at formatting tables and structuring its output than only a few months ago. That continuous improvement makes them more snackable and easier to use for both users and developers. All of this requires investment.

Copycats replicate something to achieve a similar benchmark performance, but that does not mean the product is also easy to build with or pleasant to use. It's simply good based on the benchmarks.

It’s like looking at the Nürburgring and putting a race car on it. You may have a car that can lap the Nürburgring really fast – but only a professional driver can drive it.

For regular people on the road, it doesn’t change anything – because they can’t drive it, and it has no worth to them. It’s only good in the benchmarks. That’s the risk with these models: they score well on benchmarks but aren’t pleasant to use.

That’s where OpenAI, and more specifically the knowledge Sam Altman brought from Y Combinator, really comes into play. The focus is on user experience.

That edge is much harder to mimic. It takes many small iterations to achieve a model output that is meaningfully better.


Niklas Horlebein: How do you feel about the “rise of DeepSeek AI” in general? Do you believe it will have a positive impact on AI development – or a negative one?

Pieter Buteneers: Generally, more competition is good for the consumer, and in this case, I’m a consumer. I use this technology to do legal due diligence for lawyers. Being able to use this technology from multiple companies and letting them compete on price and performance is amazing. The only ones that suffer are the companies themselves as they try to one-up each other, but that’s always good for everyone else.

It's a pity, though, that it's a Chinese player and you can't get it hosted within the EU zone for a reasonable cost, like on Azure or AWS – or at least I haven't found it yet.

This may change in the future, and once it does, it will become more accessible for GDPR enthusiasts, too. We can use this technology in our solutions with the guarantee that the Chinese government isn’t watching.

Let these companies compete, let them fight it out. The more players working towards AGI, the more checks and balances there are. In an ideal scenario, everything will be fully open source with checks and balances in place. It’s good that there’s an extra player in the field, helping create checks and balances. Yes, indeed.

Niklas Horlebein: Is there anything else about DeepSeek AI or the current LLM landscape that you would like to add?

Pieter Buteneers: Reasoning models are very interesting to me, personally, and the fact that DeepSeek AI has a reasoning model is an important breakthrough. That shouldn’t be underestimated, but the uptake of reasoning models in practical applications is still low. It’s easy in a chat situation, where users ask a question and let the model do the work. Their reasoning is helpful in some cases with difficult questions, though not in most cases. 

At Emma Legal we automate legal due diligence, which is fairly structured, so we know very well what to check for and how to reason. So, we have our own dedicated checks and balances in place to ensure that the models aren't hallucinating and that the documents they pull are relevant for the due diligence.

So in well-built AI applications, reasoning models aren’t often used, as you can already determine what kind of reasoning you need beforehand and build it yourself at a much lower cost.

Niklas Horlebein: Pieter, thank you so much for your time!

How Ollama Powers Large Language Models with Model Files https://mlconference.ai/blog/ollama-large-language-models/ Mon, 06 Jan 2025 10:23:52 +0000 https://mlconference.ai/?p=107211 The rise of large language models (LLMs) has transformed the landscape of artificial intelligence, enabling advanced text generation, and even the ability to write code. At the heart of these systems are foundation models built on transformer models and fine-tuned with vast amounts of data. Ollama offers a unique approach to managing these AI models with its model file feature, which simplifies the training process and the customization of generative AI systems.

Whether it's configuring training data, optimizing neural networks, or enhancing conversational AI, Ollama provides tools to improve how large language models work. From managing virtual machines to ensuring a fault-tolerant infrastructure, Ollama is revolutionizing text generation and generating content for diverse AI applications.

Ollama revolutionizes MLOps with a Docker-inspired layered architecture, enabling developers to efficiently create, customize, and manage AI models. By addressing challenges like context loss during execution and ensuring reproducibility, it streamlines workflows and enhances system adaptability. Discover how to create, modify, and persist AI models while leveraging Ollama’s innovative approach to optimize workflows, troubleshoot behavior, and tackle advanced MLOps challenges.

Synthetic chatbot tests reach their limits because the models generally have little contextual information about past interactions. With the layer function based on Docker, Ollama offers an option that seeks to alleviate this situation.

As is so often the case in the world of open-source systems, we can fall back on an analogy for guidance and motivation. One of the reasons for container systems' immense success is that the layer system (shown schematically in Figure 1) greatly facilitates the creation of customized containers. By implementing design patterns borrowed from object-oriented programming, customized execution environments can be created without constantly regenerating virtual machines or maintaining different image versions at great expense.

Fig. 1: Docker layers stack the components of a virtual machine on top of each other

With the model file function used as the basis, a similar process can be found in the AI execution environment Ollama. For the sake of didactic fairness, I should note that the feature has not been finalized yet – the syntax shown in detail here may have changed by the time you read this article.

Analysis of existing models

The Ollama developers use the model file feature internally despite the ongoing evolution of the syntax – the models I have used so far generally ship with model files.

In the interest of transparency and also to make working with the system easier, the ollama show --help command is available, which enables reflection against various system components (Fig. 2).

Fig. 2: Ollama is capable of extensive self-reflection

In the following steps, we will assume that the llama3 and phi models are already installed on your workstation. If you have removed them at some point, you can reinstall them by entering the commands ollama pull phi and ollama pull llama3:8b.

Out of the comparatively extensive reflection methods, four are especially important. In the following steps, the most frequently used feature is the model file. Similar to the Docker environment discussed previously, this is a file that describes the structure and content of a model that will be created in the Ollama environment. In practice, Ollama model users are often only interested in certain parts of the data contained in the model file. For example, you often need the parameters – these are generally numerical values that describe the behavior of the model (e.g. its creativity). The screenshot shown in Figure 3, which is taken from GitHub, only shows a section of the possibilities.

Fig. 3: Some of the numerical parameters intervene strongly in the model's behavior

Meanwhile, the --license flag reveals which license conditions must be observed when using the AI model. Last but not least, the template can also be relevant – it defines the prompt template that frames the model's input and output at execution time.
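Put together, the reflection calls for one of the installed models look roughly like this – a minimal sketch using the flag names reported by ollama show --help at the time of writing (the syntax may have changed since then):

```
ollama show llama3:8b --modelfile    # the complete model file
ollama show llama3:8b --parameters   # only the numerical parameters
ollama show llama3:8b --license      # the license conditions
ollama show llama3:8b --template     # the prompt template
```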

Figures 4 and 5 show printouts of relevant parts of the model files of the phi and llama3 models just mentioned.

Fig. 4: The model file for the model provided by Facebook has extensive license conditions…

Fig. 5: … while the phi development team takes it easy

In the case of both models, note that the developers have defined a STOP block. This is a set of stop sequences: as soon as the model emits one of them, the runtime stops generating further output.

Creating an initial model from a model file

After these introductory considerations, we want to create an initial model using our own model file. Similar to working with classic Docker files, a model file is basically just a text file. It can be created in a working directory according to the following scheme:

tamhan@tamhan-gf65:~/ollamaspace$ touch Modelfile

tamhan@tamhan-gf65:~/ollamaspace$ gedit Modelfile

Model files can also be created dynamically on the .NET application side. However, we will address that topic later – before then, you must place the markup from Listing 1 in the gedit window started by the second command and save the file to the file system.

Listing 1

FROM llama3
PARAMETER temperature 5
PARAMETER seed 5
SYSTEM """
You are Schlomette, the always angry female 45 year old secretary of a famous military electronics engineer living in a bunker. Your owner insists on short answers and is as cranky as you are. You should answer all questions as if you were Schlomette.
"""

Before we look at the file's actual elements, note that the notation chosen here is only a convention. In general, the elements in the model file are not case-sensitive – moreover, the order in which the individual parameters are set has no relevance to the function. However, the sequence shown here has proven to be best practice and can also be found in many example models in the Ollama Hub.

Be that as it may, the From statement can be found in the header of the file. It specifies which model is to be used as the base layer. As the majority of the derived model examples are based on llama3, we want to use the same system here. However, in the interest of didactic honesty, I should mention that other models such as phi can also be used. In theory, you can even use a model generated from another model file as the basis for the next derivation level.

The next step involves two PARAMETER commands that populate the numerical settings memory mentioned above. In addition to the temperature value, which controls how much randomness – and thus creativity – the model's output shows, we set seed to a constant numerical value. This ensures that the pseudo-random number generator working in the background always delivers the same results and that the model responses are therefore largely identical if the stimulation is identical.

Last but not least, there is the system prompt enclosed in """. This is a textual description that attempts to communicate the task the model is supposed to perform as clearly as possible.

After saving the text file, a new model generation can be commanded according to the following scheme: tamhan@tamhan-gf65:~/ollamaspace$ ollama create generic_schlomette -f ./Modelfile.

The screen output is similar to downloading models from the repository, which should come as no surprise given the layered architecture mentioned in the introduction.

After the success message appears, our assistant is available as a generic model. By passing the string generic_schlomette as the model name, you could already interact with the AI assistant created from the model file. But in the following steps, we first want to try out the effect of pinning the seed. To do this, we enter ollama run generic_schlomette to start an interactive terminal session. The result is shown in Figure 6.

Fig. 6: Both runs answer exactly the same

Especially when developing randomness-driven systems, pinning the seed is a tried and tested method for producing reproducible system behavior and facilitating troubleshooting. In a production system, however, it's usually better to work with a dynamic seed. For this reason, we'll open the model file again in the next step and remove the PARAMETER seed 5 line.

Recompilation is done simply by entering ollama create generic_schlomette -f ./Modelfile again. You don't have to remove the previously created model from the Ollama environment before recompiling the file. Two identically parameterized runs now produce different results (Fig. 7).

Fig. 7: With a random seed, the model behavior is less deterministic

In practical work with models, two parameters are especially important. First, the value num_ctx determines the memory depth of the model. The higher the value entered here, the further back in time the model can look in order to harvest context information. However, extending the context window always increases the amount of information to be processed and the system resources required. Some models work especially well with certain context window lengths.

The second group of parameters is the controller known as Mirostat, which defines creativity. Last but not least, it can also be useful to use parameters such as repeat_last_n to define repetitiveness. If a model always reacts completely identically, this may not be beneficial in some applications (such as a chatbot).
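As a sketch, a model file tuned along these lines could look as follows – the concrete values are only illustrative assumptions and need to be adapted to the model and use case:

```
FROM llama3
PARAMETER temperature 1
PARAMETER num_ctx 4096
PARAMETER mirostat 2
PARAMETER mirostat_tau 4.0
PARAMETER repeat_last_n 128
```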

Last but not least, I'd like to pass on some practical experience that I gained in a political project. Texts generated by models are grammatically perfect, while texts written by humans contain typos at a rate that varies from person to person. In systems with high cloaking requirements – where the output should not be recognizable as machine-generated – it may be advisable to place a typo generator behind the AI process.

Outsourcing the generator is necessary insofar as the context information in the model remains clean. In many cases, this leads to an improvement in the overall system behavior, as the model state is recalculated without the typing errors.
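The idea can be sketched in a few lines of Python – a hypothetical post-processing step that sits outside the model, so the clean answer stays in the conversation context while only the text shown to the reader is degraded:

```python
import random

def add_typos(text: str, rate: float = 0.02) -> str:
    """Swap adjacent letters with a small probability to mimic human typing errors."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

clean_answer = "Goddamn it. What pie?"   # stays unchanged in the model context
print(add_typos(clean_answer))           # only the displayed text may contain typos
```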

Restoration or retention of the model history per model file

In practice, AI models often have to rest for hours on end. In the case of a “virtual partner”, for example, it would be extremely wasteful to keep the logic needed for a user alive when the user is asleep. The problem is that currently, a loss of context occurs in our system once the terminal emulator is closed, as can be seen in the example in Figure 8.

Fig. 8: Terminating or restarting Ollama is enough to cause amnesia in the model

To restore context, it's theoretically possible to create an external cache of all settings and then feed it into the model as part of the rehydration process. The problem with this is that the model's answers to questions are not recorded. When asked about a baked product, Schlomette might answer with a coffee cake on one occasion, but a coconut cake on another.

A nicer way is to modify the model file again. Specifically, the MESSAGE instruction lets you write both requests and generated responses into the initialization context. For the cake interaction from Figure 8, the message block could look like this, for example:

MESSAGE user Schlomette. Time to bake a pie. A general assessor is visiting.

MESSAGE assistant Goddamn it. What pie?

MESSAGE user As always. Plum pie.

When using the message block, the sequence naturally matters: a user block is always wired to the assistant block that follows it. Apart from that, the system is now ready for action. Ollama always outputs the entire message history when starting up (Fig. 9).

Fig. 9: The message block is processed when the screen output is activated
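Putting the pieces together, a model file that bakes this conversation history into a new model could look like the following sketch (model and message names are taken from the earlier examples; the target model name is just an illustration):

```
FROM generic_schlomette

MESSAGE user Schlomette. Time to bake a pie. A general assessor is visiting.
MESSAGE assistant Goddamn it. What pie?
MESSAGE user As always. Plum pie.
```

A subsequent ollama create schlomette_with_history -f ./Modelfile then produces a model that already remembers the baking conversation at startup.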

Programmatic persistence of a model

Our next task is to generate our model file completely dynamically. The documentation of the NuGet package on GitHub is somewhat weak right now. If in doubt, I advise you to first look at the official documentation of the Ollama REST interface and then search for a suitable abstraction class in the model directory. In the following steps, we will rely on the OllamaSharp library. It goes without saying that we need a project skeleton. Make sure the environment variables are set correctly; otherwise, the Ollama server running under Ubuntu will silently reject incoming queries from Windows.

In the next step, you must programmatically command the creation of a new model as in Listing 2.

Listing 2

private async void CmdGo_Click(object sender, RoutedEventArgs e) {
  CreateModelRequest myModel = new CreateModelRequest();
  myModel.Model = "mynewschlomette";
  myModel.ModelFileContent = @"FROM generic_schlomette
    MESSAGE user Schlomette. Do not forget to refuel the TU-144 aircraft!
    MESSAGE assistant OK, will do! It is based in Domodevo airport, Sir.
  ";
  ollama.CreateModel(myModel);
}

The code shown here creates an instance of the CreateModelRequest class, which bundles the various pieces of information required to create a new model. The actual model file content placed in ModelFileContent is a prime application for the verbatim multi-line strings that C# has supported for some time, as line breaks can be entered conveniently and without escape sequences in Visual Studio.

Now run the program. To check that the model was created successfully, it is sufficient to open a terminal window on the cluster and enter the command ollama list. At the time of writing this article, executing the code shown here does not cause the cluster to create a new model. A more in-depth analysis of the situation can be found here. Instead of unit tests, the development team behind the OllamaSharp library offers a console application that illustrates various methods of model management and chat interaction using the library. Specifically, the developers recommend using a different overload of the CreateModel method, which is presented as follows:

private async void CmdGo_Click(object sender, RoutedEventArgs e) {
  . . .
  await ollama.CreateModel(myModel.Model, myModel.ModelFileContent, status => { Debug.WriteLine($"{status.Status}"); });
}

Instead of the CreateModelRequest instance, the library method now receives two strings – the first is the name of the model to be created, while the second transfers the model file content to the cluster.

Thanks to the inclusion of a delegate, the system also informs the Visual Studio output about the progress of the processes running on the server (Fig. 10). Similarities to the “manual” generation or the status information that appears on the screen are purely coincidental.

Fig. 10: The application provides information about the download process

The input of ollama list also leads to the output of a new model list (Fig. 11). Our model mynewschlomette was derived here from the previously manually created model generic_schlomette.

Fig. 11: The programmatic expansion of the knowledge base works smoothly

In this context, it’s important to note that models in the cluster can be removed again. This can be done using code structured according to the following scheme – the SelectModel method implemented in the command line application just mentioned must be replaced by another method to obtain the model name:

private async Task DeleteModel() {
  var deleteModel = await SelectModel("Which model do you want to delete?");
  if (!string.IsNullOrEmpty(deleteModel))
    await Ollama.DeleteModel(deleteModel);
}

Conclusion

By using model files, developers can add any intelligence – more or less – to their Ollama cluster. Some of these functions even go beyond what OpenAI provides in the official library.

Maximizing Machine Learning with Data Lakehouse and Databricks: A Guide to Enhanced AI Workflows https://mlconference.ai/blog/data-lakehouse-databricks-ml-performance/ Mon, 18 Mar 2024 10:31:51 +0000 https://mlconference.ai/?p=87350 In today’s rapidly evolving data landscape, leveraging a Data Lakehouse architecture is becoming a key strategy for enhancing machine learning workflows. Databricks, a leader in unified data analytics, provides a robust platform that integrates seamlessly with the data lakehouse model to enable data engineers, data scientists, and Machine learning (ml) developers to collaborate more effectively. In this article, we explore how Databricks empowers organizations to streamline data processing, accelerate model development, and unlock the full potential of artificial intelligence (AI) by providing a centralized data repository. This solution not only improves scalability and efficiency but also facilitates end-to-end machine learning pipelines from data ingestion to model deployment.

Demystify the power of DataBricks Lakehouse! This comprehensive guide dives into setting up, running, and optimizing machine learning experiments on this industry-leading platform. Whether you’re a seasoned data scientist or just getting started, this hands-on approach will equip you with the skills to unlock the full potential of DataBricks.

DataBricks is known as the Data Lake House. This is a combination of a data warehouse and data lake. This article will take a closer look at what this means in practice and how you can start your first experiments with DataBricks.{.preface}

You should know that the DataBricks platform is a spin-off of the Apache Spark project. As with many open source projects, the idea behind it was to combine open source technology with quality of life improvements.

DataBricks in particular clearly focuses on ease of use and a flat learning curve. Hardly any developer can resist the temptation to use an inexpensive, turnkey product instead of assembling a technically sophisticated system themselves – especially for projects with a short lifespan.


Commissioning DataBricks

DataBricks currently runs exclusively on resources provided by cloud providers. At the time of writing, the company supports at least the "Big Three". Interestingly, in the [FAQ] seen in **Figure 1**, they explicitly admit that they don't currently provide the option of hosting the DataBricks system locally.

Fig. 1: If you want to host DataBricks locally, you’re out of luck.{.caption}

Interestingly, DataBricks has a close relationship with all three cloud providers. In many cases, you don't have to pay separate AWS or cloud costs when purchasing a commercial DataBricks product. Instead, you pay DataBricks directly, and DataBricks settles the costs with the provider.

For newcomers, there is the DataBricks Community Edition, a light version provided in collaboration with Amazon AWS. It's completely free to use, but only allows 15 GB of data volume and is limited in terms of some convenience functions, such as scheduling and the REST API. Still, this edition should be enough for our first experiments.

So let’s call up the [DataBricks Community Edition log-in page] in the browser of our choice. After clicking on the sign-up link, DataBricks takes you to the fully-fledged log-in portal, where you can register for a free 14-day trial of the platform’s full version. In order to use the Community Edition, you must first fully complete the registration process.

In the second step, be sure not to choose a cloud provider in the window shown in **Figure 2**. Instead, click the Get started with Community Edition link at the bottom to continue the registration process for the Community Edition.

Databricks cloud provider selection screen with options for AWS, Microsoft Azure, and Google Cloud Platform, along with a button to continue and a link to Community Edition.

Fig. 2: Care is needed when activating the Community Edition.{.caption}

In the next step, you need to solve a captcha to prove that you're a human user. The confirmation message seen in **Figure 3** is shared between the commercial and the Community Edition, so don't be put off by the reference to the free trial phase.

Databricks email verification screen prompting users to check their email to start their trial, with links to an administration guide and a quickstart guide for deploying the first workspace.

Fig. 3: Community Edition users also see this message.{.caption}

Entering a valid e-mail address is especially important. DataBricks will send a confirmation email. Clicking the link in the email lets you set a password. Then you'll find yourself in the product's start interface, [which you can also reach later here](https://community.cloud.databricks.com/).


Working through the Quickstart notebook

In many respects, commercial companies are interested in flattening the learning curve for potential customers. This can be seen in DataBricks' guide. The Quickstart tutorial section is prominently placed on the homepage, offering the Start Tutorial link.

Click it to command the web interface to change mode. Your efforts will be rewarded with a user interface similar to several Python notebook systems.

The visual similarities are no coincidence. DataBricks relies on the IPython engine in the background and is more or less compatible with standalone product versions.

Creating the cluster is especially important here. Let me explain. The developer creates the intelligence needed to complete the machine learning task in the notebooks.

But actually executing this intelligence requires computing power that normally far exceeds the resources available behind the average developer's browser window. Interestingly, DataBricks clusters are available in two versions. The all-purpose class is a classic cloud VM that (started manually and/or on a schedule) is available to a group of users for collaborative work.

System number two is the job cluster. This is a dedicated cluster created for a batch task. It is automatically terminated after a successful or failed job processing. It’s important to note that the administrator isn’t able to keep a job cluster alive after the batch process finishes.

Be that as it may, in the next step, we place our mouse pointer on the far left to expand the menu. DataBricks offers two different operating modes by default.

We want to choose Data Science and Engineering. In the next step, open the Compute menu. Here, we can manage the computing power sources in our account.

Activate the All-Purpose-Compute tab and click the Create Compute option to make a new cluster element. You can freely choose a name. I opted for SUSTest1.

It’s important that several Runtime versions are available. In the following, we opt for the 7.3 LTS option (Scala 2.12, Spark 3.0.1).

As free Community Edition users, we don’t have the option of choosing different cluster hardware sizes. Our system only ever has 15 GB of memory and deactivates after two hours of inactivity.

So, all you need to do to start the configuration process is click the Create Cluster button. Then, click the compute element again to switch to the overview table. This lists all of your account’s compute resources side-by-side.

Generating the compute resources will take some time. To the far left of the table, as seen in **Figure 4**, there is a rotating circle symbol to show that our cluster is still being provisioned.

Databricks compute configuration screen showing options for all-purpose compute and job compute, with a button to create new compute resources and a list of existing resources labeled 'SUSTest1'.

Fig. 4: If the circle is rotating, the cluster isn't ready for action yet.{.caption}

The start process can take up to five minutes. Once the work is done, a green tick symbol will appear, as seen in **Figure 5**. As a free version user, you cannot assume that your cluster runs in perpetuity. If you notice strange behavior in DataBricks, it makes sense to check the cluster status.

Screenshot of Databricks' 'Compute' tab showcasing an active all-purpose compute resource named 'SUSTest1'. This compute resource is used in a scalable machine learning (ML) pipeline within a data lakehouse architecture. The platform streamlines data processing and analytics workflows, supporting collaboration and efficient compute management.

Fig. 5: The green tick means it's ready for action.{.caption}

Once our work is done, we can return to the notebook. The Connect option is available in the top right-hand corner. Click it and select the cluster to establish a connection. Then click the Run All icon next to it to instruct all commands in the notebook to execute. The system then executes the commands in the individual cells in real time, as seen in **Figure 6**. Be sure to scroll down and view the results.

Screenshot showing a Databricks notebook executing PySpark commands for a machine learning (ML) workflow within a data lakehouse architecture. The code reads a CSV file, saves it using Delta format, creates a Delta table, and runs a SQL query on the 'diamonds' dataset. This demonstrates scalable data processing and streamlined pipelines for analytics and collaboration.

Fig. 6: The environment provides real-time information about the operations performed in each cell.{.caption}

Due to the architectural decision to build DataBricks entirely on IPython notebooks, we must deliver the commands to be executed in the form of notebooks. Interestingly, the notebook as a whole can be kept in one programming language, while individual command cells can use other languages. A foreign-language command element is created by clicking the respective language bubble, as shown in **Figure 7**.

Screenshot of a Databricks notebook displaying a PySpark command to read a CSV file from a dataset, process it with Delta format, and overwrite it into a Delta table. A dropdown menu shows options to change the notebook cell language, including Markdown, Python, SQL, Scala, and R. This is part of a machine learning (ML) workflow in a scalable data lakehouse architecture.

Fig. 7: DataBricks allows the use of insular languages.{.caption}
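As a small illustration: in an otherwise Python-based notebook, a single cell can be switched to SQL with a magic command at the top of the cell – a minimal sketch against the diamonds table created later in the tutorial:

```
%sql
SELECT count(*) FROM diamonds
```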

Using the menu option File | Export | HTML, the DataBricks notebook can also be exported as an HTML file after its commands are successfully processed. The majority of the mark-up is lost, but the resulting file presents the results in a way that’s easier for management to understand and digest.

Alternatively, you can click the blue Publish button to generate a globally valid link that lets any user view the fully-fledged notebook. By default, these links stay valid for six months. Please note that publishing a new version invalidates all existing links.

Commercial version owners can also run their notebooks regularly like a cron job with the scheduling option. The user interface in **Figure 8** is used for this. Other job scheduling system users will feel right at home. However, be aware that this function requires a job cluster, which isn’t included and cannot be created in the free Community Edition at the time of writing this.

Fig. 8: DataBricks in scheduling mode.{.caption}

 

Last but not least, you can also stop the cluster using the menu at the top right. In the Community Edition, this is only a courtesy to the company; in commercial use, however, it's highly recommended since it reduces overall costs.

Different data tables for optimizing performance

One of NoSQL databases’ basic characteristics is that in many cases, they soften the ACID criteria. The lower consistency quality is usually offset by a greatly reduced database administration effort. Sometimes, this results in impressive performance increases compared to a classic relational database. When working with DataBricks, we deal with a group of different table types that differ in terms of performance and data storage type.

The most important difference concerns external tables and managed tables. A managed table lives entirely in the DataBricks cluster. The development team understands this to mean that the database server handles management of the actual information and the provision of metadata and access features.

There’s also the unmanaged or external table. This table represents a kind of “wrapper” around an external data source. Using this design pattern is recommended if you frequently use sample databases or information already available elsewhere in the system in an accessible form.

Since our DataBricks sample is based on a diamond dataset, using external tables is recommended. Redundant duplication of resources would only waste memory space in our cluster without bringing any significant benefit here.
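For contrast, a managed table – one whose data DataBricks stores and manages itself – would be created roughly like this (the table name is just a placeholder):

```
-- Managed table: no external path, the data is copied into DataBricks-managed storage
CREATE TABLE diamonds_managed AS
SELECT * FROM diamonds;
```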

However, a careful look at the instructions created in the example notebook shows two different procedures. The first table is created with the following snippet:

 

```

DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
```

Besides the call to DROP TABLE, which ensures that the notebook can be run repeatedly without errors, creating the new table uses more or less standard SQL commands. We use _USING csv_ to tell the runtime that we want to use the CSV engine.

If you scroll further down in the example, you’ll see that the table is created again, but in a two-stage process. In the first step, there’s now a Python island in the notebook that interacts with the diamond sample information in the URL /databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv according to the following:

```
%python
diamonds = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")
diamonds.write.format("delta").mode("overwrite").save("/delta/diamonds")
```

The DataBricks development team provides aspiring data science experimenters with a dozen or so widely used sample datasets. These can be accessed directly from the DataBricks runtime using friendly URLs. Additional information about available data sources [can be found here](https://docs.databricks.com/dbfs/databricks-datasets.html).
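To get a quick overview directly from a notebook, the datasets mounted under /databricks-datasets can be listed with the built-in dbutils helper – a minimal sketch:

```
%python
display(dbutils.fs.ls("/databricks-datasets"))
```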

In the second step, there's a snippet of SQL code that specifies USING DELTA instead of the previously used USING csv. This instructs the DataBricks backend to back the existing data with the Delta database engine.

```
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING DELTA LOCATION '/delta/diamonds/'
```

Delta is an open source storage engine based on Apache Parquet. Normally, it's preferable to use the Delta table because it delivers better results in terms of both the ACID criteria and performance, especially when large amounts of data need to be processed.
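Once the Delta table exists, it can be queried like any other table; a simple aggregation over the diamonds sample data might look like this sketch:

```
SELECT color, avg(price) AS avg_price
FROM diamonds
GROUP BY color
ORDER BY color;
```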

DataBricks is more – Focus on machine learning

Until now, we operated the DataBricks runtime in engineering mode. It’s optimized for the needs of ordinary data scientists who want to perform various types of analyses. But the user interface has a special mode specifically for machine learning (**Fig. 9** shows the mode switcher) that focuses on relevant functions.

Fig. 9: This option lets you change the personality of the DataBricks interface.{.caption}

In principle, the workflow in **Figure 10** is always used. Anyone implementing this workflow in an in-house application will sooner or later work with the AutoML environment. In theory, this is available from Runtime version 9.1 onwards, but it only becomes really feature-complete when at least version 10.4 LTS ML is available on the cluster. Since this is one of the USPs of the DataBricks platform, we can assume that the product is under constant further development.

It's advisable to check whether the cluster in question is running the latest version of the product. For data engineering, DataBricks also offers a dedicated tutorial in the Guide: Training section on the home screen, which makes it easier to get started. Click the Start guide option again to load the tutorial notebook in editable form.

Fig. 10: If you want to use the ML functions in DataBricks, you should familiarize yourself with this workflow.{.caption}

Due to the higher demands on the required DataBricks Runtime, you should switch to the Compute section and delete the previously created cluster. Then click the Create Compute option again and make sure to select the ML heading in the DataBricks Runtime Version field (see **Fig. 11**) in the first step.

Fig. 11: ML-capable variants of the DataBricks runtime appear in a separate section in the backend.{.caption}

Just for fun, we’ll use the latest version 12.0 ML and name the cluster “SUSTestML”. It takes some time after clicking the Create Cluster button, since the cloud resources aren’t immediately provided.

During cluster generation, we can return to the notebook to get an overview of the elements. In the first step, we see the inclusion of the following libraries, abbreviated here. They are familiar to every Python developer:

```

import mlflow
import numpy as np
import pandas as pd
import sklearn.datasets
. . .
from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
. . .
```

In many respects, DataBricks builds on what ML developers are familiar with from working with standard Python scripts. Some libraries naturally have optimizations to make them run more efficiently on the DataBricks hardware. In general, however, a locally functioning Python script will continue to work without any problems after being moved to the DataBricks cluster. For the actual monitoring of the learning process, DataBricks relies on MLflow, which is available here [6].

For this reason, the rest of the notebook is standard ML code, although it’s elegantly integrated into the user interface. For example, there is a flyout in which the application provides information about various parameters that were created during the parameterization of the model:

```
with mlflow.start_run(run_name='gradient_boost') as run:
  model = sklearn.ensemble.GradientBoostingClassifier(random_state=0)
  model.fit(X_train, y_train)
  . . .
```

It’s also interesting to note that the results of the individual optimization runs are not only displayed in the user interface. The Python code that lives in the notebook can also access them programmatically. In this way, it can perform a kind of reflection to find the most suitable parameters and/or model architectures.

In the case of the example notebook provided by DataBricks, this is illustrated in the following snippet, which applies an SQL query to the results available in the mlflow.search_runs field:

```

best_run = mlflow.search_runs(
  order_by=['metrics.test_auc DESC', 'start_time DESC'],
  max_results=10,
).iloc[0]
print('Best Run')
print('AUC: {}'.format(best_run["metrics.test_auc"]))
print('Num Estimators: {}'.format(best_run["params.n_estimators"]))
```

AutoML, for the second time

The duality of UI-based and programmatic control also continues with the AutoML library mentioned above. The user interface shown in **Figure 12**, which allows graphical parameterization of ML runs, is probably the most prominent marketing argument.

Fig. 12: AutoML allows the graphical configuration of modeling{.caption}

On the other hand, there is also a programmatic API, which DataBricks illustrates with a group of example notebooks. Here we want to use the example notebook provided here [7], which we load into a browser window in the first step. Then click on the Import Notebook button at the top right and copy the URL to the clipboard.

Next, open the menu of your DataBricks instance and select the Workspace | Users option. Next to your email address, there is a downward-pointing arrow that opens a context menu. Select the import option there and then enter the URL to load the sample notebook into your DataBricks instance.

The actual body of the model couldn't be any easier. In the first step, we mainly load test data, but we also create a schema element that informs the engine about the data types of the model information to be processed:

```
from pyspark.sql.types import DoubleType, StringType, StructType, StructField

schema = StructType([
  StructField("age", DoubleType(), False),
  . . .
  StructField("income", StringType(), False)
])

input_df = spark.read.format("csv").schema(schema).load("/databricks-datasets/adult/adult.data")
```
The actual classification run then also takes place with a single line:
```

from databricks import automl
summary = automl.classify(train_df, target_col="income", timeout_minutes=30)
```

 

If you want to carry out inference later, you can do this with both Pandas and Spark.
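As a rough sketch of how that inference step could look, assuming the AutoML summary exposes the best trial’s MLflow run ID as summary.best_trial.mlflow_run_id and that a held-back Spark DataFrame named test_df exists (both names are assumptions, not part of the snippet above), the trained model can be loaded via MLflow and applied either to a Pandas DataFrame or, wrapped as a UDF, to a Spark DataFrame:

```
import mlflow
from pyspark.sql.functions import struct

# Assumption: `summary` comes from automl.classify(...) above and `test_df` is a
# held-back Spark DataFrame; adjust both names to whatever your notebook defines.
run_id = summary.best_trial.mlflow_run_id
model_uri = f"runs:/{run_id}/model"
feature_cols = [c for c in test_df.columns if c != "income"]

# Pandas: load the best model as a generic pyfunc model and predict locally
model = mlflow.pyfunc.load_model(model_uri)
predictions = model.predict(test_df.select(feature_cols).toPandas())

# Spark: wrap the same model as a UDF and score the Spark DataFrame at scale
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri)
scored_df = test_df.withColumn("prediction", predict_udf(struct(*feature_cols)))
```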


The multitool for ML professionals

Although there are hundreds of pages yet to be written about DataBricks, we’ll end our experiments with this brief overview. DataBricks is a tool that is completely focused on data scientists and machine learning experts and is not really suitable for beginners due to the very steep learning curve. Much like the infamous Squirrel Busters, DataBricks is a product that will find you when you need it.

The post Maximizing Machine Learning with Data Lakehouse and Databricks: A Guide to Enhanced AI Workflows appeared first on ML Conference.

]]>
OpenAI Embeddings https://mlconference.ai/blog/openai-embeddings-technology-2024/ Mon, 19 Feb 2024 13:18:46 +0000 https://mlconference.ai/?p=87274 Embedding vectors (or embeddings) play a central role in the processing and interpretation of unstructured data such as text, images, or audio files. Embeddings convert unstructured data, no matter how complex, into a structured form that software can easily process. OpenAI offers such embeddings, and this article will go over how they work and how they can be used.

The post OpenAI Embeddings appeared first on ML Conference.

]]>
Data has always played a central role in the development of software solutions. One of the biggest challenges in this area is the processing and interpretation of unstructured data such as text, images, or audio files. This is where embedding vectors (called embeddings for short) come into play – a technology that is becoming increasingly important in the development of software solutions with the integration of AI functions.


Embeddings are essentially a technique for converting unstructured data into a structure that can be easily processed by software. They are used to transform complex data such as words, sentences, or even entire documents into a vector space, with similar elements close to each other. These vector representations allow machines to recognize and exploit nuances and relationships in the data, which is essential for a variety of applications such as natural language processing (NLP), image recognition, and recommendation systems.

OpenAI, the company behind ChatGPT, offers models for creating embeddings for texts, among other things. At the end of January 2024, OpenAI presented new versions of these embeddings models, which are more powerful and cost-effective than their predecessors. In this article, after a brief introduction to embeddings, we’ll take a closer look at the OpenAI embeddings and the recently introduced innovations, discuss how they work, and examine how they can be used in various software development projects.

Embeddings briefly explained

Imagine you’re in a room full of people and your task is to group these people based on their personality. To do this, you could start asking questions about different personality traits. For example, you could ask how open someone is to new experiences and rate the answer on a scale from 0 to 1. Each person is then assigned a number that represents their openness.

Next, you could ask about another personality trait, such as the level of sense of duty, and again give a score between 0 and 1. Now each person has two numbers that together form a vector in a two-dimensional space. By asking more questions about different personality traits and rating them in a similar way, you can create a multidimensional vector for each person. In this vector space, people who have similar vectors can then be considered similar in terms of their personality.

In the world of artificial intelligence, we use embeddings to transform unstructured data into an n-dimensional vector space. Just as a person’s personality traits are represented in the vector space, each point in this vector space represents an element of the original data (such as a word or phrase) in a way that is understandable and processable by computers.

OpenAI Embeddings

OpenAI embeddings extend this basic concept. Instead of using simple features like personality traits, OpenAI models use advanced algorithms and big data to achieve a much deeper and more nuanced representation of the data. The model not only analyzes individual words, but also looks at the context in which those words are used, resulting in more accurate and meaningful vector representations.

Another important difference is that OpenAI embeddings are based on sophisticated machine learning models that can learn from a huge amount of data. This means that they can recognize subtle patterns and relationships in the data that go far beyond what could be achieved by simple scaling and dimensioning, as in the initial analogy. This leads to a significantly improved ability to recognize and exploit similarities and differences in the data.

 


Individual values are not meaningful

While in the personality trait analogy, each individual value of a vector can be directly related to a specific characteristic – for example openness to new experiences or a sense of duty – this direct relationship no longer exists with OpenAI embeddings. In these embeddings, you cannot simply look at a single value of the vector in isolation and draw conclusions about specific properties of the input data. For example, a specific value in the embedding vector of a sentence cannot be used to directly deduce how friendly or not this sentence is.

The reason for this lies in the way machine learning models, especially those used to create embeddings, encode information. These models work with complex, multi-dimensional representations where the meaning of a single element (such as a word in a sentence) is determined by the interaction of many dimensions in vector space. Each aspect of the original data – be it the tone of a text, the mood of an image, or the intent behind a spoken utterance – is captured by the entire spectrum of the vector rather than by individual values within that vector.

Therefore, when working with OpenAI embeddings, it’s important to understand that the interpretation of these vectors is not intuitive or direct. You need algorithms and analysis to draw meaningful conclusions from these high-dimensional and densely coded vectors.

Comparison of vectors with cosine similarity

A central element in dealing with embeddings is measuring the similarity between different vectors. One of the most common methods for this is cosine similarity. This measure is used to determine how similar two vectors are and therefore the data they represent.

To illustrate the concept, let’s start with a simple example in two dimensions. Imagine two vectors in a plane, each represented by a point in the coordinate system. The cosine similarity between these two vectors is determined by the cosine of the angle between them. If the vectors point in the same direction, the angle between them is 0 degrees and the cosine of this angle is 1, indicating maximum similarity. If the vectors are orthogonal (i.e. the angle is 90 degrees), the cosine is 0, indicating no similarity. If they are opposite (180 degrees), the cosine is -1, indicating maximum dissimilarity.

Figure 1 – Cosine similarity


A Python Notebook to try out
Accompanying this article is a Google Colab Python Notebook which you can use to try out many of the examples shown here. Colab, short for Colaboratory, is a free cloud service offered by Google. Colab makes it possible to write and execute Python code in the browser. It’s based on Jupyter Notebooks, a popular open-source web application that makes it possible to combine code, equations, visualizations, and text in a single document-like format. The Colab service is well suited for exploring and experimenting with the OpenAI API using Python.

In practice, especially when working with embeddings, we are dealing with n-dimensional vectors. The calculation of the cosine similarity remains conceptually the same, even if the calculation is more complex in higher dimensions. Formally, the cosine similarity of two vectors A and B in an n-dimensional space is calculated by the scalar product (dot product) of these vectors divided by the product of their lengths:

Figure 2 – Calculation of cosine similarity
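For readers who cannot see the figure, this is the standard definition of cosine similarity, written out in LaTeX notation:

```
\text{cosine\_similarity}(A, B)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
```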

The normalization of vectors plays an important role in the calculation of cosine similarity. If a vector is normalized, this means that its length (norm) is set to 1. For normalized vectors, the scalar product of two vectors is directly equal to the cosine similarity since the denominators in the formula from Figure 2 are both 1. OpenAI embeddings are normalized, which means that to calculate the similarity between two embeddings, only their scalar product needs to be calculated. This not only simplifies the calculation, but also increases efficiency when processing large quantities of embeddings.
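A small NumPy sketch (no OpenAI call involved) illustrates why normalization makes the plain dot product sufficient:

```
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Full cosine similarity: dot product divided by the product of the vector lengths
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize both vectors to length 1
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)

# For normalized vectors, the plain dot product yields the same value
print(cos_sim, np.dot(a_norm, b_norm))  # both print 0.96
```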

OpenAI Embeddings API

OpenAI offers a web API for creating embeddings. The exact structure of this API, including code examples for curl, Python and Node.js, can be found in the OpenAI reference documentation.

OpenAI does not use the LLM from ChatGPT to create embeddings, but rather specialized models. They were developed specifically for the creation of embeddings and are optimized for this task. Their development was geared towards generating high-dimensional vectors that represent the input data as well as possible. In contrast, ChatGPT is primarily optimized for generating and processing text in a conversational form. The embedding models are also more efficient in terms of memory and computing requirements than more extensive language models such as ChatGPT. As a result, they are not only faster but much more cost-effective.

New embedding models from OpenAI

Until recently, OpenAI recommended the use of the text-embedding-ada-002 model for creating embeddings. This model converts text into a sequence of floating point numbers (vectors) that represent the concepts within the content. The ada v2 model generated embeddings with a size of 1536 dimensions and delivered solid performance in benchmarks such as MIRACL and MTEB, which are used to evaluate model performance in different languages and tasks.

At the end of January 2024, OpenAI presented new, improved models for embeddings:

text-embedding-3-small: A smaller, more efficient model with improved performance compared to its predecessor. It performs better in benchmarks and is significantly cheaper.
text-embedding-3-large: A larger model that is more powerful and creates embeddings with up to 3072 dimensions. It shows the best performance in the benchmarks but is slightly more expensive than ada v2.

A new function of the two new models allows developers to adjust the size of the embeddings when generating them without significantly losing their concept-representing properties. This enables flexible adaptation, especially for applications that are limited in terms of available memory and computing power.

Readers who are interested in the details of the new models can find them in the announcement on the OpenAI blog. The exact costs of the various embedding models can be found here.

New embeddings models
At the end of January 2024, OpenAI introduced new models for creating embeddings. All code examples and result values contained in this article already refer to the new text-embedding-3-large model.

Create embeddings with Python

In the following section, the use of embeddings is demonstrated with a few Python code examples. The code examples are designed so that they can be tried out in Python notebooks, and they are also available in similar form in the accompanying Google Colab notebook mentioned above.
Listing 1 shows how to create embeddings with the Python SDK from OpenAI. In addition, numpy is used to show that the embeddings generated by OpenAI are normalized.

Listing 1

from openai import OpenAI
from google.colab import userdata
import numpy as np

# Create OpenAI client
client = OpenAI(
    api_key=userdata.get('openaiKey'),
)

# Define a helper function to calculate embeddings
def get_embedding_vec(input):
  """Returns the embeddings vector for a given input"""
  return client.embeddings.create(
        input=input,
        model="text-embedding-3-large", # We use the new embeddings model here (announced end of Jan 2024)
        # dimensions=... # You could limit the number of output dimensions with the new embeddings models
    ).data[0].embedding

# Calculate the embedding vector for a sample sentence
vec = get_embedding_vec("King")
print(vec[:10])

# Calculate the magnitude of the vector. It should be 1 as
# embedding vectors from OpenAI are always normalized.
magnitude = np.linalg.norm(vec)
magnitude

Similarity analysis with embeddings

In practice, OpenAI embeddings are often used for similarity analysis of texts (e.g. searching for duplicates, finding relevant text sections in relation to a customer query, and grouping text). Embeddings are very well suited for this, as they work in a fundamentally different way from character-based comparison methods such as the Levenshtein distance. While the latter measures the similarity between texts by counting the minimum number of single-character operations (insert, delete, replace) required to transform one text into another, embeddings capture the meaning and context of words or sentences. They consider the semantic and contextual relationships between words, going far beyond a simple character-based level of comparison.

As a first example, let’s look at the following three sentences (the following examples are in English, but embeddings work analogously for other languages and cross-language comparisons are also possible without any problems):

I enjoy playing soccer on weekends.
Football is my favorite sport. Playing it on weekends with friends helps me to relax.
In Austria, people often watch soccer on TV on weekends.

In the first and second sentence, two different words are used for the same topic: Soccer and football. The third sentence contains the original soccer, but it has a fundamentally different meaning from the first two sentences. If you calculate the similarity of sentence 1 to 2, you get 0.75. The similarity of sentence 1 to 3 is only 0.51. The embeddings have therefore reflected the meaning of the sentence and not the choice of words.
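Such a comparison can be reproduced with the get_embedding_vec helper from Listing 1; the printed values may differ slightly from the figures quoted above, but the ordering stays the same:

```
import numpy as np

s1 = "I enjoy playing soccer on weekends."
s2 = "Football is my favorite sport. Playing it on weekends with friends helps me to relax."
s3 = "In Austria, people often watch soccer on TV on weekends."

# get_embedding_vec is the helper function defined in Listing 1
v1, v2, v3 = (np.array(get_embedding_vec(s)) for s in (s1, s2, s3))

# OpenAI embeddings are normalized, so the dot product equals the cosine similarity
print("Sentence 1 vs 2:", np.dot(v1, v2))
print("Sentence 1 vs 3:", np.dot(v1, v3))
```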

Here is another example that requires an understanding of the context in which words are used:
He is interested in Java programming.
He visited Java last summer.
He recently started learning Python programming.

In sentence 2, Java refers to a place, while sentences 1 and 3 have something to do with software development. The similarity of sentence 1 to 2 is 0.536, but that of 1 to 3 is 0.587. As expected, the different meaning of the word Java has an effect on the similarity.

The next example deals with the treatment of negations:
I like going to the gym.
I don’t like going to the gym.
I don’t dislike going to the gym.

Sentences 1 and 2 say the opposite, while sentence 3 expresses something similar to sentence 1. This is reflected in the similarities of the embeddings. Sentence 1 compared to sentence 2 yields a cosine similarity of 0.714, while sentence 1 compared to sentence 3 yields 0.773. It is perhaps surprising that there is no major difference between the embeddings. However, it’s important to remember that all three sentences are about the same topic: the question of whether you like going to the gym to work out.

The last example shows that the OpenAI embeddings models, just like ChatGPT, have built in a certain “knowledge” of concepts and contexts through training with texts about the real world.

I need to get better slicing skills to make the most of my Voron.
3D printing is a worthwhile hobby.
Can I have a slice of bread?

In order to compare these sentences in a meaningful way, it’s important to know that Voron is the name of a well-known open-source project in the field of 3D printing. It’s also important to note that slicing is a term that plays an important role in 3D printing. The third sentence also mentions slicing, but in a completely different context to sentence 1. Sentence 2 mentions neither slicing nor Voron. However, the trained knowledge enables the OpenAI Embeddings model to recognize that sentences 1 and 2 have a thematic connection, but sentence 3 means something completely different. The similarity of sentence 1 and 2 is 0.333 while the comparison of sentence 1 and 3 is only 0.263.

Similarity values are not percentages

The similarity values from the comparisons shown above are the cosine similarity of the respective embeddings. Although the cosine similarity values range from -1 to 1, with 1 being the maximum similarity and -1 the maximum dissimilarity, they are not to be interpreted directly as percentages of agreement. Instead, these values should be considered in the context of their relative comparisons. In applications such as searching text sections in a knowledge base, the cosine similarity values are used to sort the text sections in terms of their similarity to a given query. It is important to see the values in relation to each other. A higher value indicates a greater similarity, but the exact meaning of the value can only be determined by comparing it with other similarity values. This relative approach makes it possible to effectively identify and prioritize the most relevant and similar text sections.

Embeddings and RAG solutions

Embeddings play a crucial role in Retrieval Augmented Generation (RAG) solutions, an approach in artificial intelligence that combines the capabilities of information retrieval and text generation. Embeddings are used in RAG systems to retrieve relevant information from large data sets or knowledge databases. It is not necessary for these databases to have been included in the original training of the embedding models. They can be internal databases that are not available on the public Internet.
With RAG solutions, queries or input texts are converted into embeddings. The cosine similarity to the existing document embeddings in the database is then calculated to identify the most relevant text sections from the database. This retrieved information is then used by a text generation model such as ChatGPT to generate contextually relevant responses or content.

Vector databases play a central role in the functioning of RAG systems. They are designed to efficiently store, index and query high-dimensional vectors. In the context of RAG solutions and similar systems, vector databases serve as storage for the embeddings of documents or pieces of data that originate from a large amount of information. When a user makes a request, this request is first transformed into an embedding vector. The vector database is then used to quickly find the vectors that correspond most closely to this query vector – i.e. those documents or pieces of information that have the highest similarity. This process of quickly finding similar vectors in large data sets is known as Nearest Neighbor Search.
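Stripped of any real vector database, this retrieval step can be sketched in a few lines. The document_texts list below is only a stand-in for your own knowledge base, and get_embedding_vec is the helper from Listing 1:

```
import numpy as np

# Stand-in knowledge base; in a real RAG system these text sections and their
# precomputed embeddings would live in a vector database
document_texts = [
    "You can top up your account by sending a text message with the amount.",
    "Your monthly bill is available in the My Account section.",
    "To change your profile picture, open the settings page.",
]
doc_vectors = np.array([get_embedding_vec(t) for t in document_texts])

query = "How do I top up my account?"
query_vec = np.array(get_embedding_vec(query))

# Cosine similarity reduces to a dot product because the embeddings are normalized
scores = doc_vectors @ query_vec

# Best matches first; these sections would then be handed to the chat model as context
for i in np.argsort(scores)[::-1][:2]:
    print(round(float(scores[i]), 3), document_texts[i])
```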

Challenge: Splitting documents

A detailed explanation of how RAG solutions work is beyond the scope of this article. However, the explanations regarding embeddings are hopefully helpful for getting started with further research on the topic of RAGs.

However, one specific point should be pointed out at the end of this article: a particular and often underestimated challenge in the development of RAG systems that go beyond Hello World prototypes is the splitting of longer texts. Splitting is necessary because the OpenAI embeddings models are limited to just over 8,000 tokens. One token corresponds to approximately 4 characters in the English language (see also the OpenAI tokenizer [5]).

It’s not easy to find a good strategy for splitting documents. Naive approaches such as splitting after a certain number of characters can lead to the context of text sections being lost or distorted. Anaphoric references are a typical case of this problem, as the following two sentences show:

VX-2000 requires regular lubrication to maintain its smooth operation.
The machine requires the DX97 oil, as specified in the maintenance section of this manual.

The machine in the second sentence is an anaphoric link to the first sentence. If the text were to be split up after the first sentence, the essential context would be lost, namely that the DX97 oil is necessary for the VX-2000 machine.

There are various approaches to solving this problem, which will not be discussed here to keep this article concise. However, it is essential for developers of such software systems to be aware of the problem and understand how splitting large texts affects embeddings.
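Purely for illustration, the naive character-based approach, extended with a small overlap as a partial mitigation, could look like the sketch below; the chunk sizes are arbitrary, and the overlap softens but does not solve the anaphora problem described above:

```
def split_text(text, chunk_size=2000, overlap=200):
    """Naive character-based splitting with an overlap between neighboring chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

manual_text = (
    "VX-2000 requires regular lubrication to maintain its smooth operation. "
    "The machine requires the DX97 oil, as specified in the maintenance section of this manual."
)
for chunk in split_text(manual_text, chunk_size=80, overlap=20):
    print(repr(chunk))
```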


Summary

Embeddings play a fundamental role in the modern AI landscape, especially in the field of natural language processing. By transforming complex, unstructured data into high-dimensional vector spaces, embeddings enable in-depth understanding and efficient processing of information. They form the basis for advanced technologies such as RAG systems and facilitate tasks such as information retrieval, context analysis, and data-driven decision-making.

OpenAI’s latest innovations in the field of embeddings, introduced at the end of January 2024, mark a significant advance in this technology. With the introduction of the new text-embedding-3-small and text-embedding-3-large models, OpenAI now offers more powerful and cost-efficient options for developers. These models not only show improved performance in standardized benchmarks, but also offer the ability to find the right balance between performance and memory requirements on a project-specific basis through customizable embedding sizes.

Embeddings are a key component in the development of intelligent systems that aim to process natural language information in a useful way.

Links and Literature:

  1. https://colab.research.google.com/gist/rstropek/f3d4521ed9831ae5305a10df84a42ecc/embeddings.ipynb
  2. https://platform.openai.com/docs/api-reference/embeddings/create
  3. https://openai.com/blog/new-embedding-models-and-api-updates
  4. https://openai.com/pricing
  5. https://platform.openai.com/tokenizer

The post OpenAI Embeddings appeared first on ML Conference.

]]>
Building a Proof of Concept Chatbot with OpenAIs API, PHP and Pinecone https://mlconference.ai/blog/building-chatbot-openai-api-php-pinecone/ Thu, 04 Jan 2024 08:50:31 +0000 https://mlconference.ai/?p=87014 We leveraged OpenAI's API and PHP to develop a proof-of-concept chatbot that seamlessly integrates with Pinecone, a vector database, to enhance our homepage's search functionality and empower our customers to find answers more effectively. In this article, we’ll explain our steps so far to accomplish this.

The post Building a Proof of Concept Chatbot with OpenAIs API, PHP and Pinecone appeared first on ML Conference.

]]>

The team at Three.ie recognized that customers were having difficulty finding answers to basic questions on our website. To improve the user experience, we decided to use AI to build a chatbot that makes finding information more efficient and user-friendly. Building the chatbot posed several challenges, such as effectively managing the expanding context of each chat session and maintaining high-quality data. This article details our journey from concept to implementation and how we overcame these challenges. Anyone interested in AI, data management, and customer experience improvements should find valuable insights in this article.

While the chatbot project is still in progress, this article outlines the steps taken and key takeaways from the journey thus far. Stay tuned for subsequent installments and the project’s resolution.


Identifying the Problem

Hi there, I’m a Senior PHP Developer at Three.ie, a company in the telecom industry. Today, I’d like to address a problem our customers face: locating answers to basic questions on our website. Information such as bill details and how to top up is available, but it isn’t easy to find because it’s tucked away within our forums.

![community-page.png](community-page.png) {.figure}

Community Page {.caption}

The AI Solution

The rise of AI chatbots and the impressive capabilities of GPT-3 presented us with an opportunity to tackle this issue head-on. The idea was simple: why not leverage AI to create a more user-friendly way for customers to find the information they need? Our tool of choice for this task was OpenAI’s API, which we planned to integrate into a chat interface.

To make this chatbot truly useful, it needed access to the right data, and that’s where Pinecone came in. Using this vector database to store the embeddings we generated with the OpenAI API, we built an efficient search system for our chatbot.

This laid the groundwork for our proof of concept: a simple yet effective solution to a problem faced by many businesses. Let’s dive deeper into how we brought this concept to life.

![chat-poc.png](chat-poc.png) {.figure}

First POC {.caption}

Challenges and AI’s Role

With our proof of concept in place, the next step was to ensure the chatbot was interacting with the right data and providing the most accurate search results possible. While Pinecone served as an excellent solution for storing data and enabling efficient search during the early stages, we realized that in the long term it might not be the most cost-effective choice for a full-fledged product.

Pinecone is an excellent solution, easy to integrate and straightforward to use, but the free tier only allows you to have a single pod with a single project. We would need to create small indexes separated across multiple products, and the starting plan costs around $70/month/pod. Keeping the project within budget was a priority, and we knew that continuing with Pinecone would soon become difficult, since we wanted to split our data.

The initial data used in the chatbot was extracted directly from our website and stored in separate files. This setup allowed us to create embeddings and feed them to our chatbot. To streamline this process, we developed a ‘data import’ script. The script works by taking a file, adding it to the database, creating an embedding using the content, and finally it stores the embedding in Pinecone, using the database ID as a reference.

Unfortunately, we faced a hurdle with the structure and quality of our data. Some of the extracted data was not well-structured, which led to issues with the chatbot’s responses. To address this challenge, we once again turned to AI, this time to enhance our data quality. Employing the GPT-3.5 model, we optimized the content of each file before generating the vector. By doing so, we were able to harness the power of AI not only for answering customer queries but also for improving the quality of our data.

As the process grew more complex, the need for more efficient automation became evident. To reduce the time taken by the data import script, we incorporated queues and utilized parallel processing. This allowed us to manage the increasingly complex data import process more effectively and keep the system efficient.

![data-ingress-flow.png](data-ingress-flow.png) {.figure}

Data Ingress Flow {.caption}

Data Integration

With our data stored and the API ready to handle chats, the next step was to bring everything together. The initial plan was to use Pinecone to retrieve the top three results matching the customer’s query. For instance, if a user inquired, “How can I top up by text message?”, we would generate an embedding for this question and then use Pinecone to fetch the three most relevant records. These matches were determined based on cosine similarity, ensuring the retrieved information was highly pertinent to the user’s query.

Cosine similarity is a key part of our search algorithm. Think of it like this: imagine each question and answer is a point in space. Cosine similarity measures how close these points are to each other. For example, if a user asks, “How do I top up my account?”, and we have a database entry that says, “Top up your account by going to Settings”, these two are closely related and would have a high cosine similarity score, close to 1. On the other hand, if the database entry says something about “changing profile picture”, the score would be low, closer to 0, indicating they’re not related.

This way, we can quickly find the best matches to a customer’s query, making the chatbot’s answers more relevant and useful.

For those who understand a bit of math, this is how cosine similarity works. You represent each sentence as a vector in multi-dimensional space. The cosine similarity is calculated as the dot product of two vectors divided by the product of their magnitudes. Mathematically, it looks like this:

![cosine-formula.png](cosine-formula.png) {.figure}

Cosine Similarity  {.caption}

This formula gives us a value between -1 and 1. A value close to 1 means the sentences are very similar, and a value close to -1 means they are dissimilar. Zero means they are not related.

![simplified-workflow.png](simplified-workflow.png) {.figure}

Simplified Workflow {.caption}

Next, we used these top three records as a context in the OpenAI chat API. We merged everything together: the chat history, Three’s base prompt instructions, the current question, and the top three contexts.

![vector-comparison-logic.png](vector-comparison-logic.png) {.figure}

Vector Comparison Logic {.caption}

Initially, this approach was fantastic and provided accurate and informative answers. However, there was a looming issue: we were using OpenAI’s first 4k-token model, and the entire context was sent with every request. Furthermore, the context was treated as “history” for the following message, meaning that each new message added the boilerplate text plus three more contexts. As you can imagine, this led to rapid growth of the context.

To manage this complexity, we decided to keep track of the context. We started storing each message from the user (along with the chatbot’s responses) and the selected contexts. As a result, each chat session now had two separate artifacts: messages and contexts. This ensured that if a user’s next message related to the same context, it wouldn’t be duplicated and we could keep track of what had been used before.

Progress so Far

To put it simply, our system starts with manual input of questions and answers (Q&A), which is then enhanced by our AI. To ensure efficient data handling, we use queues to store data quickly. In the chat, when a user asks a question, we add a “context group” that includes all the data we got from Pinecone. To maintain system organization and efficiency, older messages are removed from longer chats.

 

 

 

![chat-workflow.png](chat-workflow.png) {.figure}

 

Chat Workflow {.caption}


Automating Data Collection

Acknowledging the manual input as a bottleneck, we set out to streamline the process through automation. I started by trying out scrapers written in different languages like PHP and Python. However, to be honest, none of them were good enough, and we faced issues with both speed and accuracy. While this component of the system is still in its formative stages, we’re committed to overcoming this challenge. We are currently evaluating the possibility of utilizing an external service to manage this task, aiming to streamline and simplify the overall process.

While working towards data automation, I dedicated my efforts to improving our existing system. I developed a backend admin page, replacing the manual data input process with a streamlined interface. This admin panel provides additional control over the chatbot, enabling adjustments to parameters like the ‘temperature’ setting and initial prompt, further optimizing the customer experience.  So, although we have challenges ahead, we’re making improvements every step of the way.

 


A Week of Intense Progress

The week was a whirlwind of AI-fueled excitement, and we eagerly jumped in. After sending an email to my department, the feedback came flooding in. Our team was truly collaborative: a skilled designer supplied Figma templates and a copywriter crafted the app’s text. We even had volunteers who stress-tested our tool with unconventional prompts. It felt like everything was coming together quickly.

However, this initial enthusiasm came to a screeching halt when security concerns became the new focus. A recent data breach at OpenAI, unrelated to our project, shifted our priorities. Though frustrating, it necessitated a comprehensive security check of all projects, causing a temporary halt to our progress.

The breach occurred during a specific nine-hour window on March 20, between 1 a.m. and 10 a.m. Pacific Time. OpenAI confirmed that around 1.2% of active ChatGPT Plus subscribers had their data compromised during this period. They were using the Redis client library (redis-py), which allowed them to maintain a pool of connections between their Python server and Redis. This meant they didn’t need to query the main database for every request, but it became a point of vulnerability.

In the end, it’s good to put security at the forefront and not treat it as an afterthought, especially in the wake of a data breach. While the delay is frustrating, we all agree that making sure our project is secure is worth the wait. Now, our primary focus is to meet all security guidelines before progressing further.

The Move to Microsoft Azure

In just one week, the board made a big decision to move from OpenAI and Pinecone to Microsoft’s Azure. At first glance, it looks like a smart choice, as Azure is known for solid security, but it is not as plug-and-play as our previous setup.

What stood out in Azure was having our own dedicated GPT-3.5 Turbo model. Unlike OpenAI, where the general GPT-3.5 model is shared, Azure gives you a model exclusive to your company. You can train it, fine-tune it, all in a very secure environment, a big plus for us.

The hard part? Setting up the data storage was not an easy feat. Everything in Azure is different from what we were used to. So, we are now investing time to understand these new services, a learning curve we’re currently climbing.

Azure Cognitive Search

In our move to Microsoft Azure, security was a key focus. We looked into using Azure Cognitive Search for our data management. Azure offers advanced security features like end-to-end encryption and multi-factor authentication. This aligns well with our company’s heightened focus on safeguarding customer data.

The idea was simple: you upload your data into Azure, create an index, and then you can search it just like a database. You define what’s called “fields” for indexing and then Azure Cognitive Search organizes it for quick searching. But the truth is, setting it up wasn’t easy because creating the indexes was more complex than we thought. So, we didn’t end up using it in our project. It’s a powerful tool, but difficult to implement. This was the idea:

![azure-structure.png](azure-structure.png) {.figure}

Azure Structure {.caption}

The Long Road of Discovery

So, what did we really learn from this whole experience? First, improving the customer journey isn’t a walk in the park; it’s a full-on challenge. AI brings a lot of potential to the table, but it’s not a magic fix. We’re still deep in the process of getting this application ready for the public, and it’s still a work in progress.

One of the most crucial points for me has been the importance of clear objectives. Knowing exactly what you aim to achieve can steer the project in the right direction from the start. Don’t wait around — get a proof of concept (POC) out as fast as you can. Test the raw idea before diving into complexities.

Also, don’t try to solve issues that haven’t cropped up yet, this is something we learned the hard way. Transitioning to Azure seemed like a move towards a more robust infrastructure. But it ended up complicating things and setting us back significantly. The added layers of complexity postponed our timeline for future releases. Sometimes, ‘better’ solutions can end up being obstacles if they divert you from your main goal.

 


In summary, this project has been a rollercoaster of both challenges and valuable lessons learned. We’re optimistic about the future, but caution has become our new mantra. We’ve come to understand that a straightforward approach is often the most effective, and introducing unnecessary complexities can lead to unforeseen problems. With these lessons in hand, we are in the process of recalibrating our strategies and setting our sights on the next development phase.

Although we have encountered setbacks, particularly in the area of security, these experiences have better equipped us for the journey ahead. The ultimate goal remains unchanged: to provide an exceptional experience for our customers. We are fully committed to achieving this goal, one carefully considered step at a time.

Stay tuned for further updates as we continue to make progress. This project is far from complete, and we are excited to share the next chapters of our story with you.

The post Building a Proof of Concept Chatbot with OpenAIs API, PHP and Pinecone appeared first on ML Conference.

]]>
Take Control of ML Projects https://mlconference.ai/blog/take-control-of-ml-projects/ Mon, 11 Jul 2022 10:33:17 +0000 https://mlconference.ai/?p=84602 The decision to move Elasticsearch to proprietary licensing awakened a sleeping giant. The open source community rapidly flexed its muscle to ensure a true open source option for fast and scalable search and analytics—which many users depend on for ML projects—would continue to be available. The result is OpenSearch, a community-driven hard fork of Elasticsearch 7.10.2, built with Apache Lucene and available under the fully open source Apache 2.0 license.

The post Take Control of ML Projects appeared first on ML Conference.

]]>
OpenSearch brings users the same enterprise-grade core features and advanced add-ons as its predecessor. Key benefits include a horizontally-scalable distributed architecture ready to handle thousands of nodes and petabytes of data, high availability, extremely fast and powerful text search, and analytics with faceting and aggregations. OpenSearch also features a rich ecosystem with language-specific clients for Python, Node.js, Java, and more, and it supports data shippers such as Logstash, Beats, and Fluentd.

Migrating from Elasticsearch to OpenSearch lets you keep using the same powerful capabilities your organization is already accustomed to, while safeguarding your technology against future lock-in and the limitations of a solution that is no longer truly open source. At the same time, making the leap to OpenSearch ensures that organizations are positioned to take advantage of all new features introduced by the open source community as the technology evolves.

To successfully migrate from Elasticsearch to OpenSearch, on cloud or on-prem systems or through a managed platform, follow these eight steps:

Make sure you’re running Elasticsearch 7.10.2, and upgrade if necessary

Enterprises should be running Elasticsearch 7.10.2 for maximum compatibility before migrating to OpenSearch. Upgrade to client libraries compatible with Elasticsearch 7.10.2, and be sure to use OpenSearch versions of libraries when available (all of which also work with Elasticsearch clusters). If the existing cluster is on a version newer than 7.12.0, downgrade to 7.10.2 via reindex. Also be on the alert for potential breaking changes or the need to re-index (between v5.6 and v6.8) that can occur when upgrading between Elasticsearch versions.
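One quick way to confirm the version of an existing cluster is a request against its root endpoint. The sketch below uses Python with the requests library; the host and credentials are placeholders:

```
import requests

# Placeholder endpoint and credentials for the existing Elasticsearch cluster
resp = requests.get(
    "https://elasticsearch.example.com:9200",
    auth=("elastic", "changeme"),
)
print(resp.json()["version"]["number"])  # should report 7.10.2 before you migrate
```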


Build a migration testing environment

Create an Elasticsearch test cluster that emulates your production environment as closely as possible. Run the same Elasticsearch version client libraries, and any other data shippers such as Logstash or Fluentd. Benchmark the test environment’s search and indexing performance based on realistic data. Next, create a test OpenSearch cluster with an equivalent number and types of nodes, for a fair and simple comparison.

Check your tool and client libraries for OpenSearch compatibility

It’s crucial to verify the interoperability of all tools and libraries prior to upgrading. For example, recent builds of tools like Logstash and others include version checks that make them incompatible with OpenSearch. While the community rapidly develops open source versions of popular tools and clients for use with OpenSearch – and many are already available and production-ready – it still pays to implement a deliberate compatibility strategy. 

The tenets of such a strategy: first, use clients and tools provided by OpenSearch whenever possible. Where OpenSearch-specific options aren’t available, use tool or client versions compatible with Elasticsearch OSS 7.10.2. As a last alternative, use the OpenSearch compatibility setting to override version issues, using either the opensearch.yaml or cluster-wide settings like this:

In the opensearch.yaml (Restarting the OpenSearch cluster is necessary to change the opensearch.yaml):

compatibility.override_main_response_version: true

In the cluster settings:

PUT _cluster/settings
{
  "persistent": {
    "compatibility": {
      "override_main_response_version": true
    }
  }
}

Compatibility verification example: Filebeat

Here we’ll check that the Filebeat module we have running on an Apache HTTP server is compatible with our OpenSearch cluster. First, we’ll point the Filebeat configuration to the OpenSearch endpoint:

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["search.cxxxxxxxxxxx.cnodes.io:9200"]
  # Protocol - either `http` (default) or `https`.
  protocol: "https"
  # Authentication credentials - either API key or username/password.
  #api_key: "id:api_key"
  username: "icopensearch"
  password: "***************************"

 

Make sure that the OpenSearch cluster can receive logs. Unfortunately (in this example), the non-OSS version of Filebeat cannot connect to OpenSearch and fails to send logs to the cluster, even with the compatibility version check enabled:

2021-11-30T23:28:32.514Z        ERROR  
[publisher_pipeline_output]     pipeline/output.go:154  Failed to 
connect to 
backoff(elasticsearch(https://search.xxxxxxxxxxxxxxxxxxxxxx.cnodes
.io:9200)): Connection marked as failed because the onConnect
callback failed: could not connect to a compatible version of
Elasticsearch: 400 Bad Request:
{"error":{"root_cause":[{"type":"invalid_index_name_exception","re
ason":"Invalid index name [_license], must not start with
'_'.","index":"_license","index_uuid":"_na_"}],"type":"invalid_ind
ex_name_exception","reason":"Invalid index name [_license], must
not start with
 '_'.","index":"_license","index_uuid":"_na_"},"status":400}


Replacing non-OSS Filebeat with the open source version 7.10.2 solves these compatibility issues. You can verify that you are receiving the filebeat logs by checking for agent.type: filebeat

In addition to ensuring that all tools and clients function with OpenSearch, monitor tools and clients to see that performance in the OpenSearch cluster is similar to Elasticsearch.

Back up your data

Before going ahead with the bulk of the migration, be sure to back up all important data. While the migration to OpenSearch shouldn’t cause data loss, it never hurts to play it safe. Backups are especially crucial when performing rolling, restart, or other in-place upgrades. With Elasticsearch snapshots, you can back up your data to a filesystem repository or to cloud repositories such as S3, GCS, or Microsoft Azure.
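As a hedged sketch of that step via the snapshot API, where the repository name, snapshot name, endpoint, and credentials are all placeholders, registering a filesystem repository and taking a snapshot could look like this:

```
import requests

ES = "https://elasticsearch.example.com:9200"   # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials

# Register a snapshot repository backed by a shared filesystem
# (the location must be listed under path.repo in elasticsearch.yml)
requests.put(f"{ES}/_snapshot/migration_backup", auth=AUTH, json={
    "type": "fs",
    "settings": {"location": "/mnt/es_backups"},
})

# Snapshot all indices and wait until the snapshot has finished
requests.put(
    f"{ES}/_snapshot/migration_backup/pre_opensearch?wait_for_completion=true",
    auth=AUTH,
)
```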

Migrate data

Migrating from Elasticsearch to OpenSearch can be done in a few different ways which vary in ease of migration, required downtime, level of compatibility, etc.

Migrating with reindex provides the highest level of compatibility and we will be focusing on reindex migration for the rest of the document.

Migrate data via reindex

To begin, identify all indices you’ll migrate to OpenSearch (don’t migrate system indices). Then copy all needed index mappings, settings, and templates, and apply them to your OpenSearch cluster. Make as few changes as possible in the interest of a seamless migration.

For example, the following code takes the index sample_http_responses and copies and applies settings and mappings to OpenSearch:

PUT sample_http_responses
{
  "mappings": {
    "properties": {
      "@timestamp": {"type": "date"},
      "http_1xx": {"type": "long"},
      "http_2xx": {"type": "long"},
      "http_3xx": {"type": "long"},
      "http_4xx": {"type": "long"},
      "http_5xx": {"type": "long"},
      "status_code": {"type": "long"}
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "1"
    }
  }
}

 

Prior to reindexing, you’ll ideally want to stop any new indexing to the source index. Not possible for your use case? Then perform incremental reindexing to handle newer documents, if you have a timestamp or incremental id available to facilitate that strategy.

You’ll also need to whitelist your remote cluster endpoint in OpenSearch’s settings before beginning a reindex. Edit the opensearch.yaml file, and add the whitelist config for the remote IP. It’s also possible to list <ip-addr>:<port> configurations you’d like to whitelist.

reindex.remote.whitelist: "xxx.xxx.xxx.xxx:9200"

Next, submit the reindex request, specifying remote endpoint details such as ssl parameters and remote credentials. The following submits the reindexing operation as an async request, a useful technique since moving a lot of data can lengthen completion time. 

To avoid overloading the remote cluster, it’s also possible to throttle the number of requests per second.

 


POST _reindex?wait_for_completion=false&requests_per_second=-1
{
  "source": {
    "remote": {
      "host": "http://xxx.xxx.xxx.xxx:9200",
      "socket_timeout": "2m",
      "connect_timeout": "60s"
    },
    "index": "sample_http_responses"
  },
  "dest": {
    "index": "sample_http_responses"
  }
}

 

The _tasks/<task-id> endpoint lets you check the reindex operation’s progress.

 

{
  "completed" : true,
  "task" : {
    "node" : "o-qCCzE-RZOK1_nDS3ItmA",
    "id" : 1858,
    "type" : "transport",
    "action" : "indices:data/write/reindex",
    "status" : {
      "total" : 1000,
      "updated" : 0,
      "created" : 1000,
      "deleted" : 0,
      "batches" : 1,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : """reindex from [host=xxx.xxx.xxx.xxx port=9200 query={
  "match_all" : {
    "boost" : 1.0
 }

Migrate dashboards and visualizations

Exporting dashboards and visualizations from Kibana as saved objects, and reimporting them into OpenSearch Dashboards, offers the simplest approach for migrating these items. Maintaining the same index names across reindexed data will allow dashboards to work seamlessly post-migration.

First, migrate index patterns by selecting them under Kibana > Stack Management > Saved Objects, and exporting them as ndjson objects.

 

Next, import the index patterns from the OpenSearch Dashboard under Stack Management > Saved Objects.

 

Now migrate dashboards and visualizations using the same process, and check that they function correctly.

 

Validate the functionality and performance of your new OpenSearch cluster

With the migration to OpenSearch complete, check that everything is working as it should. Verify the functionality of search queries and aggregations your applications depend upon, and check performance versus the previous Elasticsearch cluster. 

 

Do it all again in production

With your migration strategy now successfully proven out in test environments, it’s time to repeat the steps above, and to migrate your production Elasticsearch environment to realize the benefits of fully open source OpenSearch. 

The post Take Control of ML Projects appeared first on ML Conference.

]]>
Python Developers live in Visual Studio Code https://mlconference.ai/blog/python-developers-live-in-visual-studio-code/ Thu, 10 Feb 2022 14:47:32 +0000 https://mlconference.ai/?p=83282 With over 18 million monthly users, VS Code has become one of the most popular and fastest-growing text editors in the world. To learn more about why over 3.7 million of them find VS Code to be the perfect habitat for Python development and data science work, keep on reading!

The post Python Developers live in Visual Studio Code appeared first on ML Conference.

]]>
The Python programming language was created during the late 1980s by Guido van Rossum. By 2003, it consistently ranked among the most popular programming languages in the world. According to the PYPL (PopularitY of Programming Language) index [1], which is generated by analyzing the frequency of coding tutorial searches on Google, Python is now the most popular language in the world. This is no surprise after having grown 15.4% in the last 5 years. [Fig. 1]

Fig. 1: Python popularity growth

With hundreds of packages, including Pandas, NumPy, Matplotlib, and scikit-learn (which provide tabular data manipulation, numerical computing, data visualization, and machine learning algorithms for predictive data analysis, respectively), Python has become the go-to language for data science work. Powerful frameworks for building apps, such as Flask and Django, that are lightning-fast, scalable, and flexible make it one of the most compelling options for web development. Python’s growth and coverage across multiple coding use cases has continued to skyrocket and shows no indication of slowing down any time soon.
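To give a flavor of why this stack clicks, here is a minimal, self-contained sketch using scikit-learn’s bundled Iris dataset (no real-world project or data is implied):

```
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Tabular data handling with Pandas
iris = load_iris(as_frame=True)
df = iris.frame

# A simple predictive model with scikit-learn
X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], df["target"], random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```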

VS Code – The Perfect Habitat for Python Developers

At Microsoft’s Developer Division, our mission is to enable every developer to achieve more. This year, to continue supporting the quickly growing Python community, we increased our sponsorship of the Python Software Foundation [2] to the top new visionary level. Goals of the PSF include providing grants and resources for further development and adoption of Python as well as expanding Python outreach by funding the Python Ambassador Program.

On top of supporting the Python community at large, we aim to support Python users right here at home in VS Code! With over 18 million monthly users, VS Code has become one of the most popular and fastest-growing text editors in the world. To learn more about why over 3.7 million of them find VS Code to be the perfect habitat for Python development and data science work, keep on reading!


How to Follow Along

First things first, you will need to install VS Code. Once you have VS Code installed, you can search for and install extensions through the VS Code Extensions Marketplace. The family of extensions you’ll need for the ultimate Python coding experience include the Python, Pylance, Jupyter, and Azure Machine Learning extensions. [Fig. 2]

Fig. 2: VS Code Extension Marketplace

Python Extension

The Python extension builds on top of VS Code’s already powerful code editor. By providing additional support for environment handling, debugging, testing, linting and formatting, the Python extension capabilities are here to supercharge your Python development work. Our recent extension startup changes have also made great strides in performance improvements so you can get coding sooner.


Environment Handling

Get started easily with any of your favorite environments such as pyenv, pipenv, Conda, and Poetry. The extension will automatically detect Python interpreters that are installed in standard locations and the environment you choose will power the IntelliSense, auto-completions, linting, formatting, and any other language-related feature other than debugging.

Debugging

Print statements to check states of variables are a thing of the past. Easily debug different types of Python applications (e.g., multi-threaded, web, and remote applications) by setting breakpoints, inspecting data, and utilizing the debug console as you step through your code. On top of that, there is no starting and stopping with the Python debugger. If you make changes to your code after the debugger execution has already hit a breakpoint, you no longer need to restart the debugger as auto-reload exists for Python scripts, Django and Flask! [Fig. 3]

Fig. 3: Debug Python file in terminal

Linting

Linting analyzes your code for potential errors, highlighting areas where problems should be corrected so you don’t have to manually parse the code yourself. The currently supported linters include Pylint, pycodestyle, Flake8, mypy, pydocstyle, prospector, and pylama. Simply make sure the linter of your choice is installed in the active Python interpreter. [Fig. 4]

Fig. 4: Linter support

Pylance

Your default Python editing experience has been upgraded with the bundling of our Pylance extension, Visual Studio Code’s most robust and performant Python language server. Its rich editing features include completions, auto-imports, function signature help, docstrings, contextual highlighting, and more!

With auto-imports you can say goodbye to your workflow being interrupted to import necessary modules. As you are constructing your code, Pylance will provide smart import suggestions and insert them at the top of your file for you. The function signature help provides information on parameters as well as return types so that you no longer have to hunt down external documentation and leave the context of your code editor.

You can even refactor your code at the speed of light by tapping into Pylance’s extraction features. You can highlight lines of code and pick either “Extract Method” [Fig 5] or “Extract Variable” [Fig. 6] to have Pylance do the heavy lifting to turn them into new variables or functions.

Fig. 5: Pylance’s Extract Method

Fig. 6: Pylance’s Extract Variable

Don’t forget about contextual document highlighting! Double-clicking on variables will present other instances of the variable to you such that none can slip by you. [Fig. 7]

Fig. 7: Pylance’s Contextual Document Highlighting

Jupyter Notebooks

A Jupyter Notebook is an interactive programming and computational document that supports mixing executable code, equations, visualizations, and narrative text. Jupyter Notebooks [Fig. 8] can contain markdown and code cells, where code cells have two major components: input and output. You can write code in the input area of a cell, and after running the cell the result will show up in the output area just below. [Fig. 9]

Fig. 8: VS Code Notebook

Fig. 9: Histogram in VS Code Notebooks

Jupyter Notebooks have quickly become the de facto tool for data science. The ability to run chunks of code at a time and out of order makes them very exploratory in nature which is incredibly conducive to data exploration. The ability to see outputs and visualizations in a hassle-free manner paired with narrative text, makes Jupyter Notebooks the perfect location to tell a story with data. Outside of data science though, they are also a great tool for teaching or learning new languages, general code experimentation, and building quick prototypes.

This past year, our very own implementation of Jupyter Notebooks got a major overhaul by being fully integrated with Visual Studio Code. On top of a new modern design, you can now benefit from faster load times, innate source control and diffing capabilities, full notebook debugging, customizable theming, and more!

Variable Explorer and Data Viewer

Additional features such as the Variable Explorer [Fig. 10] and Data Viewer help you keep track of the state of your variables and take a deeper look at the tabular data you might be working with. To access the Variable Explorer, simply click on the Variables icon in your notebook toolbar. [Fig. 11] To access the Data Viewer, click on the icon to the left of the tabular variable you would like to inspect.

Fig. 10: Variable Explorer in Notebook Toolbar

Fig. 11: Variable Explorer

The Data Viewer [Fig. 12] provides a spreadsheet-like view of your data, and its filtering capabilities allow you to run quick checks. It speeds up identifying data quality issues and deciding which steps must be taken to properly clean the data.

Fig. 12: Data Viewer

Debugging

VS Code allows you to debug your notebook in multiple ways. For a “debugging-lite” experience, you can opt for Run by Line. As you’ve likely already guessed based on the name, Run by Line [Fig. 13] allows you to run through your cell, one line at a time. When Run by Line is enabled, the Variable Explorer will open alongside with it so you can keep track of the state of your variables as you iterate and quickly resolve small code issues.

Fig. 13: Run by Line

With our most recent revamp, you can now take advantage of the same full debugging experience [Fig. 14] enabled by the Python extension in notebooks. Note that to use these debugging features in VS Code today, you’ll need ipykernel [3] version 6.0 or higher in the environment you select to run your notebooks.

Fig. 14: Debugging in Jupyter Notebooks

Custom Notebook Diffing

Under the hood, Jupyter Notebooks are JSON files. The segments in a JSON file are rendered as cells, each composed of three components: input, output, and metadata. Comparing changes in a notebook using line-based diffing is difficult and hard to parse. The rich diffing editor for notebooks lets you easily see the changes for each component of a cell.
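Roughly speaking, a single code cell in that JSON looks like the following (shown here as a Python dictionary for readability; the contents are illustrative and simplified relative to the full notebook schema):

cell = {
    "cell_type": "code",
    "source": ["import pandas as pd\n", "df = pd.read_csv('data.csv')\n", "df.head()"],
    "outputs": [],            # filled in once the cell has been executed
    "metadata": {},           # e.g. tags or collapsed state
    "execution_count": None,
}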

You can even customize which types of changes are displayed in your diffing view. In the top right, select the overflow menu item in the toolbar to choose which cell components to include; input changes are always shown. [Fig. 15]

Fig. 15: Customized Diffing View

Interactive Window

If you like the idea of notebooks but are used to working with scripts, we have just the feature for you! The Interactive Window is a hybrid between a notebook and a script. When working in a Python file, you can create cell-like code segments with the ‘#%%’ delimiter. Running these faux cells in the Interactive Window [Fig. 16] lets you break a longer Python script into smaller, more comprehensible chunks and see their results to the right, as opposed to inline as a notebook would. You can also run code directly in the Interactive Window itself, using it as a scratch pad to try out slightly tweaked code before inserting it into your more finalized Python script.

Fig. 16: Interactive Window
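A minimal sketch of such a script, assuming nothing beyond the ‘#%%’ markers and pandas being installed, could look like this:

#%% Load and prepare some illustrative data
import pandas as pd
df = pd.DataFrame({"height_cm": [172, 180, 165], "weight_kg": [68, 82, 59]})

#%% Explore a derived column in isolation
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
df.describe()

Each ‘#%%’ block can be sent to the Interactive Window on its own, which is exactly what makes this workflow feel like a lightweight notebook inside a plain script.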

TensorBoard and PyTorch Profiler

If you are using PyTorch or TensorFlow, you can look forward to our TensorBoard dashboard integration, which helps you visualize datasets, train models, spot-check model predictions, view model architecture, analyze a model’s loss and accuracy over time, and profile your code to understand where it is slowest. In addition to the TensorBoard integration, we’ve also embedded the PyTorch Profiler in VS Code so you can monitor your PyTorch models in one convenient location. VS Code is also currently the only tool that lets you jump directly to your source code file from the PyTorch Profiler!
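As a hedged sketch of how such profiler traces are typically produced on the PyTorch side (the toy model, data, and log directory below are made up for illustration), a training loop can be wrapped in torch.profiler and the resulting trace opened from the profiler integration afterwards:

import torch
import torch.nn as nn
import torch.profiler  # available in PyTorch 1.8.1 and later

# Hypothetical toy model, optimizer, and data, just to produce a trace
model = nn.Linear(64, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./logs/profiler"),
    record_shapes=True,
) as prof:
    for step in range(6):
        x = torch.randn(32, 64)
        y = torch.randint(0, 10, (32,))
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # tell the profiler that one training step has finished

The trace files written to the log directory can then be visualized through the TensorBoard/PyTorch Profiler integration (the torch-tb-profiler plugin is typically needed to render them).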

If you’re working in notebooks, the Variable Explorer lets you inspect PyTorch and TensorFlow data types, and the Data Viewer lets you slice data so you can get a solid understanding of any 3D or higher-dimensional data. As a reminder, you can access the Data Viewer through the Variable Explorer or during a Python debugging session. Once a debugging session has started, right-click the tensor you would like to examine more closely and select “View Value in Data Viewer”. [Fig. 17]

Fig. 17: Data viewer

Azure Machine Learning

While many data science and machine learning tasks can be completed successfully on your local machine, sometimes you just need a bit more power! If you’re interested in scaling your training and inferencing workloads, the Azure Machine Learning extension [4] has you covered.

You can search for the Azure Machine Learning extension in the Visual Studio Code marketplace, sign into your Azure account [5], and create a machine learning workspace [6] to organize and manage your resources. Through the Azure Machine Learning extension, you can create a compute resource and seamlessly connect to it without requiring SSH or additional network configuration. Once connected to the compute, you can continue using your favorite VS Code capabilities (notebooks, debugger, terminal), import your local project, and scale your model training while leveraging GPU resources.

The Azure Machine Learning extension [Fig. 18] also provides enhanced language support (completions based on your Azure resources) and generated templates that you can use to author and check-in reproducible, shareable configuration files. Once created and deployed, the results of your work (e.g., creating an environment, training a model) can be viewed from directly within Visual Studio Code through the extension tree view; you no longer have to context-switch between the editor and the browser to manage your machine learning resources.

Fig. 18: Azure Machine Learning

Additional Notable Mentions

While the next two extensions are not exclusive to Python itself, they are incredibly powerful aids in your development experience.

Remote – SSH

The Remote – SSH extension lets you use any remote machine with an SSH server as your development environment. It effectively runs VS Code on the remote machine, so you have access to the extensions and files on that machine. With this extension you can develop on the same operating system you deploy to, use larger, faster, or more specialized hardware than your local machine, and switch between remote development environments without altering anything locally.

Live Share

The Live Share extension allows you to collaboratively edit and debug with others in real time, regardless of what programming languages are being used. You can forget the archaic method of sending files back and forth between you and your coworkers. Live Share allows you to instantly (and securely) share your current project, and then as needed, share debugging sessions, terminal instances, localhost web apps, voice calls, and more! Guests invited to your sessions will have your editor context mirrored on their machine so you can start collaborating productively immediately without needing to clone any repos or install any SDKs.

As a guest joining a Live Share session, however, you will still have all your personal editor preferences (e.g., keybindings, theme) honored, as well as your own cursor, so you can seamlessly jump into a session and work together or independently in the same file.

Welcome Home Pythonistas

Regardless of what you are trying to achieve with Python, VS Code is the place for you! We hope you come try out the Python experience in VS Code and that it becomes your new home for Python development and data science work! Let us know what you think!

For a much more detailed and in-depth walkthrough of the mentioned extensions and features, please visit our Visual Studio Code Documentation.

How to Contact Us

Follow our Twitter handles @code for any Visual Studio Code product updates and @pythonvscode for Python and Jupyter product announcements! For any feature suggestions or issues don’t hesitate to file issues on our VS Code, VS Code Python, VS Code Jupyter, or VS Code Pylance GitHub Repositories! As always, we encourage and welcome the community to participate and contribute to our open-source tools!

Links & Literature

[1] https://pypl.github.io/PYPL.html

[2] https://www.python.org/psf/ 

[3] https://pypi.org/project/ipykernel/ 

[4] https://docs.microsoft.com/en-us/Azure/machine-learning/how-to-setup-vs-code 

[5] https://docs.microsoft.com/en-us/dotnet/azure/create-azure-account

[6] https://docs.microsoft.com/en-us/Azure/machine-learning/how-to-setup-vs-code 

The post Python Developers live in Visual Studio Code appeared first on ML Conference.

]]>
What is Data Annotation and how is it used in Machine Learning? https://mlconference.ai/blog/data-annotation-ml/ Tue, 12 Oct 2021 12:24:15 +0000 https://mlconference.ai/?p=82363 What is data annotation? And how is data annotation applied in ML? In this article, we are delving deep to answer these key questions. Data annotation is valuable to ML and has contributed immensely to some of the cutting-edge technologies we enjoy today. Data annotators, or the invisible workers in the ML workforce, are needed more now than ever before.

The post What is Data Annotation and how is it used in Machine Learning? appeared first on ML Conference.

]]>
Modern businesses operate in highly competitive markets, and finding new business opportunities is harder than ever. Customer experiences are constantly changing, finding the right talent to work on common business goals is an enormous challenge, and yet businesses still have to perform the way the market demands. So what are these companies doing to create a sustainable competitive advantage? This is where Artificial Intelligence (AI) solutions come in and are prioritized. With AI, it is easier to automate business processes and smooth decision-making. But what exactly defines a successful Machine Learning (ML) project? The answer is simple: the quality of the training datasets that feed your ML algorithms.

With that in mind, what makes for a high-quality training dataset? Data annotation. What is data annotation? And how is data annotation applied in ML?

In this article, we delve deep to answer these key questions. It is particularly helpful if:

  • You are seeking to understand what data annotation is in ML and why it is so important.
  • You are a data scientist curious to know the various data annotation types out there and their unique applications.
  • You want to produce high-quality datasets for your ML model’s top performance, and have no idea where to find professional data annotation services.
  • You have huge chunks of unlabeled data, have no time to gather, organize, and label it, and are in dire need of a data labeler to do the job for you so you can ultimately meet the training and deployment goals for your models.

What is Data Annotation?

In ML, data annotation refers to the process of labeling data in a manner that machines can recognize, whether through computer vision or natural language processing (NLP). In other words, data labeling teaches the ML model to interpret its environment, make decisions, and take action in the process.

Data scientists use massive amounts of datasets when building an ML model, carefully customizing them according to the model training needs. Thus, machines are able to recognize data annotated in different, understandable formats such as images, texts, and videos.

This explains why AI and ML companies are after such annotated data to feed into their ML algorithms, training them to learn and recognize recurring patterns and eventually use those patterns to make precise estimations and predictions.

The data annotation types

Data annotation comes in different types, each serving different and unique use cases. Although the field is broad, there are a few common annotation types found in popular machine learning projects; we look at them in this section to give you the gist of the field:

Semantic Annotation

Semantic annotation entails annotating concepts within text, such as names, objects, or people. Data annotators use semantic annotation in their ML projects to train chatbots and improve search relevance.

Image and Video Annotation

Image annotation enables machines to interpret the content of pictures. Data experts use various forms of image annotation, ranging from bounding boxes drawn on images to assigning a meaning to every individual pixel, a process called semantic segmentation. This type of annotation is commonly used in image recognition models for tasks like facial recognition and recognizing and blocking sensitive content.

Video annotation, on the other hand, uses bounding boxes or polygons on video content. The process is simple: developers use video annotation tools to place these bounding boxes, or stitch together video frames, to track the movement of annotated objects. Either way, this type of data comes in handy when developing computer vision models for localization and object tracking tasks.
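To make this concrete, a single annotated image or video is often stored as a simple record. The sketch below is hypothetical and only loosely follows common formats such as COCO; all field names are illustrative:

# Hypothetical bounding-box annotation for one image ([x, y, width, height] in pixels)
image_annotation = {
    "image": "street_0001.jpg",
    "width": 1920,
    "height": 1080,
    "objects": [
        {"label": "car",        "bbox": [412, 530, 238, 117]},
        {"label": "pedestrian", "bbox": [901, 498,  64, 172]},
    ],
}

# A video annotation can extend the same idea with per-frame boxes,
# so an object's position can be tracked across frames
video_annotation = {
    "video": "crossing_0007.mp4",
    "tracks": [
        {"label": "car", "boxes": {0: [412, 530, 238, 117], 1: [418, 531, 238, 117]}},
    ],
}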

Text categorization

Text categorization, also called text classification or text tagging, is where a set of predefined categories is assigned to documents. Using this type of annotation, a document’s paragraphs or sentences can be tagged by topic, making it easier for users to search for information within a document, an application, or a website.
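As a small, hypothetical illustration of both ideas, semantically annotated text is commonly stored as labeled spans, while categorized documents simply carry a list of topic tags (all names below are made up):

# Semantic annotation: entities marked inside a sentence
semantic_annotation = {
    "text": "Maria flew to Berlin on Tuesday.",
    "entities": [
        {"span": "Maria",   "label": "PERSON"},
        {"span": "Berlin",  "label": "LOCATION"},
        {"span": "Tuesday", "label": "DATE"},
    ],
}

# Text categorization: whole documents tagged with predefined categories
categorized_documents = [
    {"document": "invoice_2021_003.txt",   "categories": ["finance", "billing"]},
    {"document": "support_ticket_118.txt", "categories": ["customer support"]},
]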

Why is Data Annotation so Important in ML

Whether you think of search engines improving the quality of their results, facial recognition software being developed, or self-driving automobiles being created, all of these are made real through data annotation. Living examples include how Google manages to serve results based on the user’s geographical location or sex, how Samsung and Apple have improved the security of their smartphones with facial unlocking software, how Tesla brought semi-autonomous self-driving cars to the market, and so on.

Annotated data is valuable in ML for giving accurate predictions and estimations about our living environments. As mentioned above, machines can recognize recurring patterns, make decisions, and take action as a result. In other words, machines are shown understandable patterns and told what to look for – in images, video, text, or audio. Once trained, an ML algorithm can find similar patterns in any new dataset fed into it.

Data Labeling in ML

In ML, a data label, also called a tag, is an element that identifies raw data (images, videos, or text) and adds one or more informative labels to put into context what an ML model can learn from it. For example, a tag can indicate what words were said in an audio file, or what objects are contained in a photo.

Data labeling helps ML models learn from the numerous examples they are given. For example, a model will easily spot a bird or a person in an unlabeled image if it has previously seen enough labeled examples of images containing a car, a bird, or a person.
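Staying with the examples above, a label is often nothing more than a tag attached to a raw file. A hypothetical labeled dataset might look like this (file names and tags are illustrative):

# Audio clips paired with what was said in them
labeled_audio = [
    {"file": "clip_001.wav", "transcript": "turn on the lights"},
    {"file": "clip_002.wav", "transcript": "what is the weather today"},
]

# Photos paired with the objects they contain
labeled_images = [
    {"file": "img_001.jpg", "objects": ["bird"]},
    {"file": "img_002.jpg", "objects": ["car", "person"]},
]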

Conclusion

Data annotation is valuable to ML and has contributed immensely to some of the cutting-edge technologies we enjoy today. Data annotators, the invisible workers in the ML workforce, are needed now more than ever. The growth of the AI and ML industry as a whole depends on the continued creation of the nuanced datasets needed to solve some of ML’s complex problems.

There is no better “fuel” for training ML algorithms than annotated data – in images, videos, or text – and that is how we arrive at some of the autonomous ML models we can proudly build.

Now you understand why data annotation is essential in ML, what its common types are, and where to find data annotators to do the job for you. You are in a position to make informed choices for your enterprise and level up your operations.

The post What is Data Annotation and how is it used in Machine Learning? appeared first on ML Conference.

]]>
Neuroph and DL4J https://mlconference.ai/blog/neuroph-and-dl4j/ Tue, 14 Sep 2021 11:31:34 +0000 https://mlconference.ai/?p=82227 In this article, we would like to show how neural networks, specifically the multilayer perceptron of two Java frameworks, can be used to detect blood cells in images.

The post Neuroph and DL4J appeared first on ML Conference.

]]>
Microscopic blood counts include an analysis of the six types of white blood cells: rod-nucleated (band) and segment-nucleated (mature) neutrophils, eosinophils, basophilic granulocytes, monocytes, and lymphocytes. Based on the number, maturity, and distribution of these white blood cells, you can obtain valuable information about possible diseases. However, here we will not focus on the handling of the blood smears, but on the recognition of the blood cells.

For the tests described, a Bresser Trino microscope with a MikrOkular was used, connected to a computer (HP Z600). The program presented in this article was used for image analysis. The software is based on neural networks using the Java frameworks Neuroph and Deep Learning for Java (DL4J). Staining of the smears for the microscope was done with Löffler solution.

 

Training data

For neural network training, the images of the blood cells were centered, converted to grayscale format, and normalized. After preparation, the images looked as shown in Figure 1.

Fig. 1: The JPG images have a size of 100 x 100 pixels and show (from left to right) lymphocyte (ly), basophil (bg), eosinophil (eog), monocyte (mo), rod-nucleated (young) neutrophil (sng), segment-nucleated (mature) neutrophil (seg); the cell types were used for neural network training.

 

A dataset of 663 images with 6 labels – ly, bg, eog, mo, sng, seg – was compiled for training. For Neuroph, the imageLabels shown in Listing 1 were set.

List<String> imageLabels = new ArrayList<>();
imageLabels.add("ly");
imageLabels.add("bg");
imageLabels.add("eog");
imageLabels.add("mo");
imageLabels.add("sng");
imageLabels.add("seg");

After that, the directory for the input data looks like Figure 2.

Fig. 2: The directory for the input data

 

For DL4J the directory for the input data (your data location) is composed differently (Fig. 3).

Fig. 3: Directory for the input data for DL4J.

 

Most of the images in the dataset came from our own photographs, but there were also images from open and free internet sources. In addition, the dataset contained each image multiple times, as the images were also rotated by 90, 180, and 270 degrees and stored.

 


Neuroph MLP Network

The main dependencies for the Neuroph project in pom.xml are shown in Listing 2.

<dependency>
  <groupId>org.neuroph</groupId>
  <artifactId>neuroph-core</artifactId>
  <version>2.96</version>
</dependency>   
<dependency>
  <groupId>org.neuroph</groupId>
  <artifactId>neuroph-imgrec</artifactId>
  <version>2.96</version>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.17</version>
</dependency>

A multilayer perceptron was set up with the parameters shown in Listing 3.

private static final double LEARNINGRATE = 0.05;
private static final double MAXERROR = 0.05;
private static final int HIDDENLAYERS = 13;
 
// Create, train, and save the network
Map<String, FractionRgbData> map;
try {
  map = ImageRecognitionHelper.getFractionRgbDataForDirectory(new File(imageDir), new Dimension(10, 10));
  dataSet = ImageRecognitionHelper.createRGBTrainingSet(imageLabels, map);
  // create the neural network
  List<Integer> hiddenLayers = new ArrayList<>();
  hiddenLayers.add(HIDDENLAYERS);
  NeuralNetwork nnet = ImageRecognitionHelper.createNewNeuralNetwork("leukos", new Dimension(10, 10), ColorMode.COLOR_RGB, imageLabels, hiddenLayers, TransferFunctionType.SIGMOID);
  // set learning rule parameters
  BackPropagation mb = (BackPropagation) nnet.getLearningRule();
  mb.setLearningRate(LEARNINGRATE);
  mb.setMaxError(MAXERROR);
  nnet.learn(dataSet); // train on the prepared RGB training set before saving
  nnet.save("leukos.net");
} catch (IOException ex) {
  Logger.getLogger(Neuroph.class.getName()).log(Level.SEVERE, null, ex);
}

 

Example

The implementation of a test can look like the one shown in Listing 4.

HashMap<String, Double> output;
String fileName = "leukos112.seg";
NeuralNetwork nnetTest = NeuralNetwork.createFromFile("leukos.net");
// get the image recognition plugin from the neural network
ImageRecognitionPlugin imageRecognition = (ImageRecognitionPlugin) nnetTest.getPlugin(ImageRecognitionPlugin.class);
output = imageRecognition.recognizeImage(new File(fileName));

 

Client

A simple Swing interface was developed for graphical cell recognition. An example of the recognition of a lymphocyte is shown in Figure 4.

Fig. 4: The program recognizes a lymphocyte and highlights it

 

DL4J MLP network

The main dependencies for the DL4J project in pom.xml are shown in Listing 5.

<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-core</artifactId>
  <version>1.0.0-beta4</version>
</dependency>
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native-platform</artifactId>
  <version>1.0.0-beta4</version>
</dependency>

Again, a multilayer perceptron was used with the parameters shown in Listing 6.

protected static int height = 100;
protected static int width = 100;
protected static int channels = 1;
protected static int batchSize = 20;
 
protected static long seed = 42;
protected static Random rng = new Random(seed);
protected static int epochs = 100;
protected static boolean save = true;

// DataSet
String dataLocalPath = "your data location";
ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
File mainPath = new File(dataLocalPath);
FileSplit fileSplit = new FileSplit(mainPath, NativeImageLoader.ALLOWED_FORMATS, rng);
int numExamples = toIntExact(fileSplit.length());
numLabels = Objects.requireNonNull(fileSplit.getRootDir().listFiles(File::isDirectory)).length;
int maxPathsPerLabel = 18;
BalancedPathFilter pathFilter = new BalancedPathFilter(rng, labelMaker, numExamples, numLabels, maxPathsPerLabel);
// training / test split
double splitTrainTest = 0.8;
InputSplit[] inputSplit = fileSplit.sample(pathFilter, splitTrainTest, 1 - splitTrainTest);
InputSplit trainData = inputSplit[0];
InputSplit testData = inputSplit[1];

// Build and train the network
MultiLayerNetwork network = lenetModel();
network.init();
ImageRecordReader trainRR = new ImageRecordReader(height, width, channels, labelMaker);
trainRR.initialize(trainData, null);
DataSetIterator trainIter = new RecordReaderDataSetIterator(trainRR, batchSize, 1, numLabels);
// 'scaler' was used without being declared in the original listing; a typical choice is a 0-1 image scaler
DataNormalization scaler = new ImagePreProcessingScaler(0, 1);
scaler.fit(trainIter);
trainIter.setPreProcessor(scaler);
network.fit(trainIter, epochs);

 

LeNet Model

This model is a type of feedforward convolutional neural network for image processing (Listing 8).

private MultiLayerNetwork lenetModel() {
  /*
   * Revised LeNet model approach developed by ramgo2; achieves slightly above random
   * Reference: https://gist.github.com/ramgo2/833f12e92359a2da9e5c2fb6333351c5
   */
  MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .l2(0.005)
    .activation(Activation.RELU)
    .weightInit(WeightInit.XAVIER)
    .updater(new AdaDelta())
    .list()
    .layer(0, convInit("cnn1", channels, 50, new int[]{5, 5}, new int[]{1, 1}, new int[]{0, 0}, 0))
    .layer(1, maxPool("maxpool1", new int[]{2, 2}))
    .layer(2, conv5x5("cnn2", 100, new int[]{5, 5}, new int[]{1, 1}, 0))
    .layer(3, maxPool("maxpool2", new int[]{2, 2}))
    .layer(4, new DenseLayer.Builder().nOut(500).build())
    .layer(5, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
      .nOut(numLabels)
      .activation(Activation.SOFTMAX)
      .build())
    .setInputType(InputType.convolutional(height, width, channels))
    .build();
 
  return new MultiLayerNetwork(conf);
 
}

 

Example

A test of the Lenet model might look like the one shown in Listing 9.

trainIter.reset();
DataSet testDataSet = trainIter.next();
List<String> allClassLabels = trainRR.getLabels();
int labelIndex;
int[] predictedClasses;
String expectedResult;
String modelPrediction;
int n = allClassLabels.size();
System.out.println("n = " + n);
for (int i = 0; i < n; i = i + 1) {
  labelIndex = testDataSet.getLabels().argMax(1).getInt(i);
  System.out.println("labelIndex=" + labelIndex);
  INDArray ia = testDataSet.getFeatures();
  predictedClasses = network.predict(ia);
  expectedResult = allClassLabels.get(labelIndex);
  modelPrediction = allClassLabels.get(predictedClasses[i]);
  System.out.println("For a single example that is labeled " + expectedResult + " the model predicted " + modelPrediction);
}

 

Results

After a few test runs, the results shown in Table 1 are obtained.

Leukos                                  Neuroph   DL4J
Lymphocytes (ly)                           87       85
Basophils (bg)                             96       63
Eosinophils (eog)                          93       54
Monocytes (mo)                             86       60
Rod nuclear neutrophils (sng)              71       46
Segment nucleated neutrophils (seg)        92       81

Table 1: Results of leukocyte counting (N-success/N-samples in %).

 

As can be seen, the results using Neuroph are better than those using DL4J, but it is important to note that the results depend on the quality of the input data and on the network topology. We plan to investigate this further in the near future.

However, with these results, we have already been able to show that image recognition can be used for medical purposes with not one, but two sound and potentially complementary Java frameworks.

 

Acknowledgments

At this point, we would like to thank Mr. A. Klinger (Management Devoteam GmbH Germany) and Ms. M. Steinhauer (Bioinformatician) for their support.

The post Neuroph and DL4J appeared first on ML Conference.

]]>
Top 5 reasons to attend ML Conference https://mlconference.ai/blog/top-5-reasons-to-attend-ml-conference/ Tue, 20 Jul 2021 11:33:51 +0000 https://mlconference.ai/?p=82083 So you’ve decided to attend ML Conference but you don’t know how to break it to your boss that it is a win-win situation? Don’t worry, we’ve got you covered. Follow 4 simple steps and use these 5 arguments to show why your organization needs to invest in ML Conference!

The post Top 5 reasons to attend ML Conference appeared first on ML Conference.

]]>
1. Let your boss know why you want to go to ML Conference

Tell him that there are over 25 expert speakers and industry experts addressing current trends and best practices.

Tell your boss to take a look at the conference tracks to get a better idea of what this conference is all about.

 

2. Tell him what’s in it for him

You have the chance to gain key knowledge and skills for this new era of Machine Learning. Turn your ideas into best practices during the workshops and meet people who can help you with that. You’ll learn what it means to build up an ML-first mindset through numerous real-world examples, and you can put them into practice in your company. At ML Conference you will develop a deep understanding of your data, as well as of the latest tools and technologies.

 

3. Show him that you’ve done your homework: Book your ticket now and save money.

If you book your ticket now, your boss will save money on the early bird ticket. Plus, you will have an additional 10% discount for a group of 3 people or more.

 

4. Assure your boss that you will network with top industry experts

In addition to the valuable knowledge you will get from top-notch industry experts, you’ll also have the chance to connect and network with people at the top of their careers. ML Conference offers an expo reception and a networking event.

 

The post Top 5 reasons to attend ML Conference appeared first on ML Conference.

]]>