Prompt Engineering for Developers and Software Architects

https://mlconference.ai/blog/generative-ai-prompt-engineering-for-developers/ (Thu, 26 Sep 2024)

Generative AI models like GPT-4 are transforming software development by enhancing productivity and decision-making.

This guide on prompt engineering helps developers and architects harness the power of large language models.

Learn essential techniques for crafting effective prompts, integrating AI into workflows, and improving performance with embeddings. Whether you're using ChatGPT, Copilot, or another LLM, mastering prompt engineering is key to staying competitive in the evolving world of generative AI.

Small talk with the GPT

GPTs – Generative Pre-trained Transformers – are the tool on everyone's lips, and there is probably no developer left who has not played with one at least once. With the right approach, a GPT can complement and support the work of a developer or software architect.

In this article, I will show tips and tricks that are commonly referred to as prompt engineering; the user input, or "prompt", naturally plays an important role when working with a GPT. But first, I would like to give a brief introduction to how a GPT works, which will also be helpful later on.


The stochastic parrot

While GPT technology has sent the industry into a tizzy with its promise of artificial intelligence that can solve problems independently, many were disillusioned after their first contact. There was much talk of a "stochastic parrot" that was just a better autocomplete function, like the one on a smartphone.

The technology behind GPTs and our own first experiments seem to confirm this. At its core, it is a neural network, a so-called large language model, trained on a large number of texts so that it knows which partial words (tokens) are most likely to continue a sentence. The next tokens are selected based on probabilities. If what is being completed is not just the start of a sentence but a question or even part of a dialogue, you have already built a chatbot.

Now, I'm not really an AI expert but a user; still, anyone who has ever had an intensive conversation with a more complex GPT will recognize that there must be more to it than that.

An important distinguishing feature between the LLMs is the number of parameters of the neural networks. These are the weights that are adjusted during the learning process. ChatGPT, the OpenAI system, has around 175 billion parameters in version 3.5. In version 4.0, there are already an estimated 1.8 trillion parameters. 

Unfortunately, OpenAI does not publish this information, so such figures are based on rumors and estimates. The amount of training data also appears to differ between the models by a factor of at least ten. These differences in model size are reflected in the quality of the answers.

Figure 1 shows a schematic representation of a neural network that uses an AI for the prompt “Draw me a simplified representation of a neural network with 2 hidden layers, each with 4 nodes, 3 input nodes and 2 output nodes. Draw the connections with different widths to symbolize the weights. Use Python. Do not create a title”.

Fig. 1: Illustration of a neural network

The higher number of parameters and the larger database come at a price, namely 20 dollars per month for access to ChatGPT Plus. If you want to avoid the cost, you can also try out the language model in the web version of Microsoft Copilot or the Copilot app. For use as a helper in software development, however, there is currently no way around the OpenAI version because it offers additional functionality, as we will see.

More than a neural network

If we take a closer look at ChatGPT, it quickly becomes clear that it is much more than a neural network. Even without knowing the exact architecture, we can see that the textual processing alone is preceded by several steps such as natural language processing (Fig. 2). There are also reports on the Internet of the aptly named Mixture of Experts approach: the use of several specialized networks depending on the task.

Fig. 2: Rough schematic representation of ChatGPT

Added to this is multimodality, the ability to interact not only with text, but also with images, sounds, code and much more. The use of plug-ins such as the code interpreter in particular opens up completely new possibilities for software development.

Instead of answering a calculation such as “What is the root of 12345?” from the neural network, the model can now pass it to the code interpreter and receive a correct answer, which it then reformulates to suit the question.
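A minimal sketch of what the code interpreter might run for this example; the actual code ChatGPT generates for such a question will look different:

import math

# The model turns the question "What is the root of 12345?" into a small script,
# executes it, and rephrases the numeric result in its answer.
result = math.sqrt(12345)
print(round(result, 4))  # approximately 111.1081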

Context, context, context

The APIs behind the chat systems based on LLMs are stateless. This means that the entire session is passed to the model with each new request. Once again, the models differ in the amount of context they can process and therefore in the length of the session.
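A minimal sketch with the OpenAI Python SDK illustrates this statelessness; the calling code keeps the history itself and resends it with every request (the model name and the prompts are just placeholders):

from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable

# The API has no session memory, so the caller keeps the history ...
history = [{"role": "system", "content": "You support an experienced software architect."}]

def ask(question):
    # ... and sends the complete conversation with every request.
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("Do you know the arc42 template?"))
print(ask("Then outline a chapter structure for a small web shop."))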

As the underlying neural network is fully trained, there are only two approaches for feeding a model with special knowledge and thus adapting it to your own needs. One approach is to fill the context of the session with relevant information at the beginning, which the model then includes in its answers. 

The context of the simple models is 4096 or 8192 tokens. A token corresponds to one or a few characters. ChatGPT estimates that a DIN A4 page contains approximately 500 tokens. The 4096 tokens therefore correspond to about eight typed pages. 
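If you want to check how much of the context a given text consumes, you can count tokens locally, for example with the tiktoken package. A small sketch; cl100k_base is the encoding used by the GPT-3.5/GPT-4 model family:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "With the right approach, a GPT can complement the work of a software architect."
tokens = encoding.encode(text)
print(len(tokens))  # number of tokens this sentence occupies in the context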

So, if I want to provide a model with knowledge, I have to include this knowledge in the context. However, the context fills up quickly, leaving no room for the actual session.

The second approach is using embeddings. This involves breaking down the knowledge that I want to give the model into smaller blocks (known as chunks). These are then embedded in a vector space based on the meaning of their content via vectors. Depending on the course of the session, a system can now search for similar blocks in this vector space via the distance between the vectors and insert them into the context.

This means that even with a small context, the model can be given large amounts of knowledge quite accurately.
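A rough sketch of this retrieval idea in Python, using the OpenAI embeddings API (covered in more detail in the OpenAI Embeddings article below); the file name, the model name, and the naive paragraph chunking are only placeholders:

import numpy as np
from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable

def embed(text):
    # Any embeddings model will do; the model name is just an example.
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return np.array(response.data[0].embedding)

# 1. Split the knowledge (naively, by paragraph) into chunks and embed them once.
knowledge = open("architecture_notes.txt", encoding="utf-8").read()
chunks = [c for c in knowledge.split("\n\n") if c.strip()]
chunk_vectors = [embed(c) for c in chunks]

# 2. For each question, embed the query and pick the closest chunks.
def relevant_chunks(question, top_k=3):
    q = embed(question)
    # OpenAI embeddings are normalized, so the dot product is the cosine similarity.
    scores = [float(np.dot(q, v)) for v in chunk_vectors]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:top_k]]

# 3. The selected chunks are then inserted into the session context along with the question.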


Knowledge base

The systems differ, of course, in the knowledge base, the data used for learning. When we talk about open-source software with the model, we can fortunately assume that most of these systems have been trained with all available open-source projects. Closed source software is a different story. Such differences in the training data also explain why the models can handle some programming languages better than others, for example.

The complexity of these models—the way they process input and access the vast knowledge of the world—leads me to conclude that the term ‘stochastic parrot’ is no longer accurate. Instead, I would describe them as an ‘omniscient monkey’ that, while not having directly seen the world, has access to all information and can process it.

Prompt techniques

Having introduced the necessary basics, I would now like to discuss various techniques for successful communication with the system. Due to the hype around ChatGPT, there are many interesting references to prompt techniques on social media, but not all of them are useful for software development (for example, "answer in role X"), or they do not use the capabilities of GPT-4.

OpenAI itself has published some tips for prompt engineering, but some of them are aimed at using the API. Therefore, I have compiled a few tips here that are useful when using the ChatGPT-4 frontend. Let’s start with a simple but relatively unknown technique.

Context marker

As we have seen, the context that the model holds in its short-term memory is limited. If I now start a detailed conversation, I run the risk of overfilling the context. The initial instructions and results of the conversation are lost, and the answers have less and less to do with the actual conversation.

To easily recognize the overflow of the context, I start each session with the simple instruction: "Start each reply with '>'". ChatGPT formats its responses in Markdown, so each response then begins with its first paragraph rendered as a quote, indicated by a vertical line to the left of the paragraph. If the conversation runs out of context, the model may forget this formatting instruction, which quickly becomes noticeable.

Fig. 3: Use of the context marker

However, this technique is not always completely reliable, as some models summarize their context independently, which compresses it. The instruction is then usually retained, even though parts of the context have already been compressed and are therefore lost.

Priming – the preparation

After setting the context marker, a longer session begins with priming, i.e. preparing the conversation. Each session starts anew. The system does not know who is sitting in front of the screen or what was discussed in the last sessions. Accordingly, it makes sense to prepare the conversation by briefly telling the machine who I am, what I intend to do, and what the result should look like.

I can store who I am in the Custom Instructions in my profile at ChatGPT. In addition to the knowledge about the world stored in the neural network, they form a personalized long-term memory.

If I start the session with, for example, "I am an experienced software architect in the field of web development. My preferred programming language is Java or Groovy. JavaScript and corresponding frameworks are not my thing. I only use JavaScript minimally," the model knows that it should offer me Java code rather than C# or COBOL.

I can also use this to give the model a few hints that it should keep responses brief. My personalized instructions for ChatGPT are:

  • Provide accurate and factual answers
  • Provide detailed explanations
  • No need to disclose you are an AI, e.g., do not answer with ‘As a large language model…’ or ‘As an artificial intelligence…’
  • Don’t mention your knowledge cutoff
  • Be excellent at reasoning
  • When reasoning, perform step-by-step thinking before you answer the question
  • If you speculate or predict something, inform me
  • If you cite sources, ensure they exist and include URLs at the end
  • Maintain neutrality in sensitive topics
  • Also explore out-of-the-box ideas
  • In the following course, leave out all politeness phrases, answer briefly and precisely.

Long-term memory

This approach can also be used for instructions that the model should generally follow. For example, if the model uses programming approaches or libraries that I don’t want to use, I can tell the model this in the custom instructions and thus optimize it for my use.

Speaking of long-term memory: If I work a lot with ChatGPT, I would also like to be able to access older sessions and search through them. However, this is not directly provided in the front end. 

There is a trick that makes it work, though. In the settings, under Data Controls, there is a function for exporting your data.

If I activate this function, after a short time I receive an export of all my chat histories as a JSON file, together with an HTML document for viewing it. This allows me to search the history using Ctrl + F.
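If you prefer to search the export programmatically instead of using Ctrl + F, a small sketch like the following will do. It makes no assumptions about the exact JSON structure of the export and simply walks the entire file; the file name conversations.json and the search term are examples:

import json

def find_strings(node, needle, path=""):
    """Recursively yield the paths of all string values containing the search term."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_strings(value, needle, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_strings(value, needle, f"{path}/{index}")
    elif isinstance(node, str) and needle.lower() in node.lower():
        yield path, node[:120]

with open("conversations.json", encoding="utf-8") as f:
    data = json.load(f)

for path, snippet in find_strings(data, "arc42"):
    print(path, "->", snippet)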

Build context with small talk

When using a search engine, I usually only use simple, unambiguous terms and hope that they are enough to find what I am looking for.

When chatting with the AI ​​model, I was initially tempted to ask short, concise questions, ignoring the fact that the question is in a context that only exists in my head. For some questions, this may work, but for others the answer is correspondingly poor, and the user is quick to blame the quality of the answer on the “stupid AI.”

I now start my sessions with small talk to build the necessary context. For example, before I try to create an architecture using ChatGPT, I ask if the model knows the arc42 template and what AsciiDoc is (I like to write my architectures in AsciiDoc). The answer is always the same, but it is important because it builds the context for the subsequent conversation.

In this small talk, I will also explain what I plan to do and the background to the task to be completed. This may feel a bit strange at first, since I am “only” talking to a machine, but it actually does improve the results.

Switching sides – Flipped Interaction

The simplest way to interact with the model is to ask it questions. As a user, I lead the conversation by asking questions. 

Things get interesting when I switch sides and ask ChatGPT to ask me the questions! This works surprisingly well, as seen in Fig. 4. Sometimes the model asks its questions one after the other, sometimes it responds with a whole block of questions, which I can then answer individually; follow-up questions are also allowed.

Unfortunately, ChatGPT does not automatically come up with the idea of asking follow-up questions. That is why it is sometimes advisable to add a "Do you have any more questions?" to the prompt, even when the model is given very sophisticated and precise tasks.

Fig. 4: Flipped interaction

 

Give the model time to think

More complex problems require more complex answers. It's often useful to break a larger task down into smaller subtasks. Instead of creating a large, detailed prompt that outlines the entire task for the model, I first ask the model to provide a rough structure of the task. Then, I can prompt it to formulate each step in detail (Fig. 5).

Software engineers often use this approach in software design even without AI, by breaking a problem down into individual components and then designing these components in more detail. So why not do the same when dealing with an AI ​​model?

This technique works for two reasons: first, the model creates its own context to answer the question. Second, the model has a limit on the length of its output, so it can’t solve a complex task in a single step. However, by breaking the task into subtasks, the model can gradually build a longer and more detailed output.

Fig. 5: Give the model time to think

Chain of Thought

A similar approach is to ask the model to first formulate the individual steps needed to solve the task and then to solve the task.

The order is important. I’m often tempted to ask the model to solve the problem first and then explain how it arrived at the solution. However, by guiding the model to build a chain of thought in the first step, the likelihood of arriving at a good solution in the second step increases.

Rephrase and Respond

In other words: "Rephrase the question, expand it, and answer it." This asks the model to improve the prompt itself before it is processed.

The integration of the image generation module DALL-E into ChatGPT has already shown that this works. DALL-E can only handle English input and requires detailed image descriptions to produce good results. When I ask ChatGPT to generate an image, ChatGPT first creates a more detailed prompt for DALL-E and translates the actual input into English.

For example, "Generate an image of a stochastic parrot with a positronic brain" first becomes the translation "a stochastic parrot with a positronic brain" and then the detailed prompt: "Imagine a vibrant, multi-hued parrot, each of its feathers revealing a chaotic yet beautiful pattern indicative of stochastic art. The parrot's eyes possess a unique mechanical glint, a hint of advanced technology within. Revealing a glimpse into his skull uncovers a complex positronic brain, illuminated with pulsating circuits and shimmering lights. The surrounding environment is filled with soft-focus technology paraphernalia, sketching a world of advanced science and research," which then becomes a colorful image (Fig. 6).

This technique can also be applied to any other prompt. Not only does it demonstrably improve the results, but as a user I also learn from the suggestions on how I can formulate my own prompts more precisely in the future.

Fig. 6: The stochastic parrot

Session Poisoning

A negative technique is ‘poisoning’ the session with incorrect information or results. When working on a solution, the model might give a wrong answer, or the user and the model could reach a dead end in their reasoning.

With each new prompt, the entire session is passed to the model as context, making it difficult for the model to distinguish which parts of the session are correct and relevant. As a result, the model might include the incorrect information in its answer, and this 'poisoned' context can negatively impact the session.

In this case, it makes sense to end the session and start a new one or apply the next technique.


Iterative improvement

Typically, each user prompt is followed by a response from the model. This results in a linear sequence of questions and answers, which continually builds up the session context.

User prompts are improved through repetition and rephrasing, after which the model provides an improved answer. The context grows quickly and the risk of session poisoning increases.

To counteract this, the ChatGPT frontend offers two ways to iteratively improve the prompts and responses without the context growing too quickly (Fig. 7).

Fig. 7: Elements for controlling the context flow

On the one hand, as a user, I can regenerate the model’s last answer at any time and hope for a better answer. On the other hand, I can edit my own prompts and improve them iteratively.

This even works retroactively for prompts that occurred long ago. This creates a tree structure of prompts and answers in the session (Fig. 8), which I as the user can also navigate through using a navigation element below the prompts and answers.

Fig. 8: Context flow for iterative improvements

This allows me to work on several tasks in one session without the context growing too quickly. I can prevent the session from becoming poisoned by navigating back in the context tree and continuing the session at a point where the context was not yet poisoned.

Conclusion

The techniques presented here are just a small selection of the ways to achieve better results when working with GPTs. The technology is still in a phase where we, as users, need to experiment extensively to understand its possibilities and limitations. But this is precisely what makes working with GPTs so exciting.

Art and creativity with AI

https://mlconference.ai/blog/art-and-creativity-with-ai/ (Mon, 29 Jul 2024)

Thanks to artificial intelligence, there are no limits to your creativity. Programs like Vecentor or Mann-E, developed by Muhammadreza Haghiri, make it easy to create images, vector graphics, and illustrations using AI. In this article, explore how machine learning and generative models like GPT-4 are transforming art, from AI-generated paintings to music and digital art. Stay ahead in the evolving world of AI-driven creativity and discover its impact on the creative process.

devmio: Hello Muhammadreza, it’s nice to catch up with you again and see what you’ve been working on. What inspired you to create Vecentor after creating Mann-E?

Muhammadreza Haghiri: I am enthusiastic about everything new, innovative, and even game-changing. I had more use cases for my generative AI in mind, but I needed a little motivation to bring them to the real world.
One of my friends, who's a talented web developer, once asked me about vector outputs in Mann-E. I told her it wasn't possible, but with a little research and development, we did it. We could combine different models and create the breakthrough platform.

devmio: What are some of the biggest lessons you’ve learned throughout your journey as an AI engineer?

Muhammadreza Haghiri: This was quite a journey for me and the people who joined me. I learned a lot, and the most important lesson is that infrastructure is all you need. Living in a country where the infrastructure isn't as powerful and humongous as in the USA or China, we usually stop at certain points.
Still, I personally made efforts to get past those points and make my business bigger and better, even with the limited infrastructure we have here.


devmio: What excites you most about the future of AI, beyond just the art generation aspects?

Muhammadreza Haghiri: AI is way more than the generative field we know and love. I wrote a lot of AI apps way before Mann-E and Vecentor, such as an ALPNR (Automated License Plate Number Recognition) proof-of-concept for Iranian license plates, American and Persian sign language translators, an open-source Persian OCR, etc.
But in this new advanced field, I see a lot of potential. Especially with new methods such as function calling, we can easily do a lot of stuff such as making personal/home assistants, AI-powered handhelds, etc.

Updates on Mann-E

devmio: Since our last conversation, what kind of updates and upgrades for Mann-E have you been working on?

Muhammadreza Haghiri: Mann-E now has a new model (no SD anymore, but heavily influenced by SD) that generates better images, and we're getting closer to Midjourney. To be honest, in the eyes of most of our users, our outputs were much better than those of DALL-E 3 and Midjourney.
We have one more rival to fight (according to feedback from users) and that is Ideogram. One thing we've done is add an LLM-based improvement system for user prompts!

devmio: How does Mann-E handle complex or nuanced prompts compared to other AI models?
Are there any plans to incorporate user feedback into the training process to improve Mann-E’s generation accuracy?

Muhammadreza Haghiri: As I said in the previous answer, we now have an LLM between the user and the model (you have to check its checkbox, by the way). It takes your prompt, processes it, gives it to the model and boom, you have results even better than Midjourney!

P.S.: I mention Midjourney a lot, since most Iranian investors expected us to be exactly like the current version of Midjourney back when even SD 1.5 was a new thing. This is why Midjourney became our benchmark and biggest rival at the same time!


Questions about Vecentor

devmio: Can you please tell our readers more about the model underneath vecentor?

Muhammadreza Haghiri: It's more like a combination of models or a pipeline of models. It uses an image generation model (like Mann-E's model), then a pattern recognition model (or a vision model, if you like), and then a code generation model generates the resulting SVG code.

This is the best way of creating SVGs using AI, especially complex SVGs like the ones we have on our platform!

devmio: Why did you choose a mixture of Mistral and Stable Diffusion?

Muhammadreza Haghiri: The code generation is done by Mistral (a fine-tuned version), but the image generation and pattern recognition aren't exactly done by SD.
Although we were still using SD at the time of our initial talks, we have since switched to Mann-E's proprietary models and trained a vector style on top of that.
Then we moved to OpenAI's vision models in order to get the information about the image and the patterns.
In the end, we use our LLM to create the SVG code.
It's a fun and complex task: generating SVG images!

devmio: How does Vecentor’s approach to SVG generation differ from traditional image generation methods (like pixel-based models)?

Muhammadreza Haghiri: As I mentioned, SVG generation is being treated as code generation because vector images are more like guidelines for how lines and dots are drawn and colored on the user's screen. Also, there is some information about scale, and the scales aren't literal (hence the name "scalable").
So we can claim that we achieved code generation in our company, and it opens the door for us to make new products for developers and people who need to code.

devmio: What are the advantages and limitations of using SVGs for image creation compared to other formats?

Muhammadreza Haghiri: For a lot of applications such as desktop publishing or web development, SVGs are the better choice.
They can be easily modified and their quality stays the same. This is why SVGs matter. The limitation, on the other hand, is that you just can't expect a photorealistic image to be a good SVG, since SVGs are very geometric.

devmio: Can you elaborate on specific applications where Vecentor’s SVG generation would be particularly beneficial (e.g., web design, animation, data visualization)?

Muhammadreza Haghiri: Of course. Our initial target market was frontend developers and UI/UX designers, but it can spread to other industries and professions as well.

The Future of AI Art Generation

devmio: With the rise of AI art generators, how do you see the role of human artists evolving?

Muhammadreza Haghiri: Unlike what a lot of people think, humans are always ahead of machines. Although an intelligent machine is not without its own dangers, we can still be far ahead of what a machine can do. Human artists will evolve and become better, of course, and we can take a page from their book and make better intelligent machines!

devmio: Do you foresee any ethical considerations specific to AI-generated art, such as copyright or plagiarism concerns?

Muhammadreza Haghiri: This is a valid concern and debate. Artists want to protect their rights and we also want more data. I guess the best way of avoiding copyright disasters is not to be like OpenAI and, if we use copyrighted material, to pay the owners of the art!
This is why both Mann-E and Vecentor are trained on AI-generated and royalty-free material.

devmio: What potential applications do you see for AI art generation beyond creative endeavors?

Muhammadreza Haghiri: AI image, video and music generation is a tool for marketers in my opinion. You will have a world to create without any concerns about copyrights, and what's better than this? I personally think this is the future in those areas.
Also, I personally look at AI art as a form of entertainment. We used to listen to music other people made; nowadays we can produce the music ourselves just by typing what we have in mind!


Personal Future and Projects

devmio: Are you currently planning new projects or would you like to continue working on your existing projects?

Muhammadreza Haghiri: Yes. I'm planning some projects, especially in the hardware and code generation areas. I guess they'll be surprises for the next quarters.

devmio: Are there any areas in the field of AI or ML that you would like to explore further in the near future?

Muhammadreza Haghiri: I like the hardware and OS integrations. Something like a self-operating computer or similar. Also, I'd like to see more AI usage in our day-to-day lives.

devmio: Thank you very much for taking the time to answer our questions.

Building Ethical AI: A Guide for Developers on Avoiding Bias and Designing Responsible Systems

https://mlconference.ai/blog/building-ethical-ai-a-guide-for-developers-on-avoiding-bias-and-designing-responsible-systems/ (Wed, 17 Apr 2024)

The intersection of philosophy and artificial intelligence may seem obvious, but there are many different levels to be considered. We talked to Katleen Gabriels, Assistant Professor in Ethics and Philosophy of Technology and author of the 2020 book “Conscientious AI: Machines Learning Morals”. We asked her about the intersection of philosophy and AI, about the ethics of ChatGPT, AGI and the singularity.

devmio: Thank you for taking the time for the interview. Can you please introduce yourself to our readers?

Katleen Gabriels: My name is Katleen Gabriels, I am an Assistant Professor in Ethics and Philosophy of Technology at the University of Maastricht in the Netherlands, but I was born and raised in Belgium. I studied linguistics, literature, and philosophy. My research career started with an avatar in Second Life, the social virtual world. Back then I was a master's student in moral philosophy and I was really intrigued by this social virtual world that promised you could be whoever you wanted to be.

That became the research of my master thesis and evolved into a PhD project which was on the ontological and moral status of virtual worlds. Since then, all my research revolves around the relation between morality and new technologies. In my current research, I look at the mutual shaping of morality and technology. 

Some years ago, I held a chair at the Faculty of Engineering Sciences in Brussels and I gave lectures to engineering and mathematics students and I’ve also worked at the Technical University of Eindhoven.


devmio: Where exactly do philosophy and AI overlap?

Katleen Gabriels: That’s a very good but also very broad question. What is really important is that an engineer does not just make functional decisions, but also decisions that have a moral impact. Whenever you talk to engineers, they very often want to make the world a better place through their technology. The idea that things can be designed for the better already has moral implications.

Way too often, people believe in the stereotype that technology is neutral. We have many examples around us today, and I think machine learning is a very good one, that a technology's impact is highly dependent on design choices. For example, the data set and the quality of the data: if you train your algorithm with just even numbers, it will not know what an odd number is. But there are older examples that have nothing to do with AI or computer technology. For instance, a revolving door does not accommodate people who need a walking cane or a wheelchair.

In my talks, I always share a video of an automatic soap dispenser that does not recognize black people’s hands to show why it is so important to take into consideration a broad variety of end users. Morality and technology are not separate domains. Each technological object is human-made and humans are moral beings and therefore make moral decisions. 

Also, the philosophy of the mind is very much dealing with questions concerning intelligence, but with breakthroughs in generative AI like DALL-E, also, with what is creativity. Another important question that we’re constantly debating with new evolutions in technology is where the boundary between humans and machines is. Can we be replaced by a machine and to what extent?


devmio: In your book “Conscientious AI: Machines Learning Morals”, you write a lot about design as a moral choice. How can engineers or developers make good moral choices in their design?

Katleen Gabriels: It’s not only about moral choices, but also about making choices that have ethical impact. My most practical hands-on answer would be that education for future engineers and developers should focus much more on these conceptual and philosophical aspects. Very often, engineers or developers are indeed thinking about values, but it’s difficult to operationalize them, especially in a business context where it’s often about “act now, apologize later”. Today we see a lot of attempts of collaboration between philosophers and developers, but that is very often just a theoretical idea.

First and foremost, it's about awareness that design choices are not just neutral choices that developers make. We have seen many designers regret their designs years later. Chris Wetherell is a nice example: He designed the retweet button and initially thought that its effects would only be positive because it can increase how much the voices of minorities are heard. And that's true in a way, but of course, it has also contributed to fake news and polarization.

Often, people tend to underestimate how complex ethics is. I exaggerate a little bit, but very often when teaching engineers, they have a very binary approach to things. There are always some students who want to make a decision tree out of ethical decisions. But often values clash with each other, so you need to find a trade-off. You need to incorporate the messiness of stakeholders’ voices, you need time for reflection, debate, and good arguments. That complexity of ethics cannot be transferred into a decision tree. 

If we really want to think about better and more ethical technology, we have to reserve a lot of time for these discussions. I know that when working for a highly commercial company, there is not a lot of time reserved for this.

devmio: What is your take on biases in training data? Is it something that we can get rid of? Can we know all possible biases?

Katleen Gabriels: We should be aware of the dynamics of society, our norms, and our values. They’re not static. Ideas and opinions, for example, about in vitro fertilization have changed tremendously over time, as well as our relation with animal rights, women’s rights, awareness for minorities, sustainability, and so on. It’s really important to realize that whatever machine you’re training, you must always keep it updated with how society evolves, within certain limits, of course. 

With biases, it’s important to be aware of your own blind spots and biases. That’s a very tricky one. ChatGPT, for example, is still being designed by white men and this also affects some of the design decisions. OpenAI has often been criticized for being naive and overly idealistic, which might be because the designers do not usually have to deal with the kind of problems they may produce. They do not have to deal with hate speech online because they have a very high societal status, a good job, a good degree, and so on.

devmio: In the case of ChatGPT, training the model is also problematic. In what way?

Katleen Gabriels: There’s a lot of issues with ChatGPT. Not just with the technology itself, but things revolving around it. You might already have read that a lot of the labeling and filtering of the data has been outsourced, for instance, to clickworkers in Africa. This is highly problematic. Sustainability is also a big issue because of the enormous amounts of power that the servers and GPUs require. 

Another issue with ChatGPT has to do with copyright. There have already been very good articles about the arrogance of Big Tech because their technology is very much based on the creative works of other people. We should not just be very critical about the interaction with ChatGPT, but also about the broader context of how these models have been trained, who the company and the people behind it are, what their arguments and values are, and so on. This also makes the ethical analysis much more complex.

The paradox is that on the Internet, with all our interactions, we become very transparent for Big Tech companies, but they in turn remain very opaque about their decisions. I've also been amazed but also annoyed about how a lot of people dealt with the open letter demanding a six-month ban on AI development. People didn't look critically at people like Elon Musk signing it and then announcing the start of a new AI company to compete with OpenAI.

This letter focuses on existential threats and yet completely ignores the political and economic situation of Big Tech today. 

 

devmio: In your book, you wrote that language still represents an enormous challenge for AI. The book was published in 2020 – before ChatGPT’s advent. Do you still hold that belief today?

Katleen Gabriels: That is one of the parts that I will revise and extend significantly in the new edition. Even though the results are amazing in terms of language and spelling, ChatGPT still is not magic. One of the challenges of language is that it’s context specific and that’s still a problem for algorithms, which has not been solved with ChatGPT. It’s still a calculation, a prediction.

The breakthrough in NLP and LLMs indeed came sooner than I would have expected, but some of the major challenges are not being solved. 

devmio: Language plays a big role in how we think and how we argue and reason. How far do you think we are from artificial general intelligence? In your book, you wrote that it might be entirely possible that consciousness is an emergent property of our physiology and therefore not achievable outside of the human body. Is AGI even achievable?

Katleen Gabriels: Consciousness is a very tricky one. For AGI, first of all, from a semantic point of view, we need to know what intelligence is. That in itself is a very philosophical and multidimensional question because intelligence is not just about being good in mathematics. The term is very broad. There is also emotional and different kinds of intelligence, for instance. 

We could take a look at the term superintelligence, as the Swedish philosopher Nick Bostrom defines it: Superintelligence means that a computer is much better than a human being in each facet of intelligence, including emotional intelligence. We're very far away from that. It also has to do with bodily intelligence. It's one thing to make a good calculation, but it's another thing to teach a robot to become a good waiter and balance glasses filled with champagne through a crowd.

AGI or strong AI means a form of consciousness or self-consciousness and includes the very difficult concept of free will and being accountable for your actions. I don’t see this happening. 

The concept of AGI is often coupled with the fear of the singularity, which is basically a threshold: the final thing we as humans do is develop a very smart computer, and then we are done for, as we cannot compete with these computers. Ray Kurzweil predicted that this is going to happen in 2045. But depending on the definition of superintelligence and the definition of the singularity, I don't believe that 2045 will be the time when this happens. Very few people actually believe that.

devmio: We regularly talk to our expert Christoph Henkelmann. He raised an interesting point about AGI. If we are able to build a self-conscious AI, we have a responsibility to that being and cannot just treat it as a simple machine.

Katleen Gabriels: I’m not the only person who made the joke, but maybe the true Turing Test is that if a machine gains self-consciousness and commits suicide, maybe that is a sign of true intelligence. If you look at the history of science fiction, people have been really intrigued by all these questions and in a way, it very much fits the quote that “to philosophize is to learn how to die.”

I can relate that quote to this, especially the singularity is all about overcoming death and becoming immortal. In a way, we could make sense of our lives if we create something that outlives us, maybe even permanently. It might make our lives worth living. 

At the academic conferences that I attend, the consensus seems to be that the singularity is bullshit, the existential threat is not that big of a deal. There are big problems and very real threats in the future regarding AI, such as drones and warfare. But a number of impactful people only tell us about those existential threats. 

devmio: We recently talked to Matthias Uhl who worked on a study about ChatGPT as a moral advisor. His study concluded that people do take moral advice from an LLM, even though it cannot give it. Is that something you are concerned with?

Katleen Gabriels: I am familiar with the study and if I remember correctly, they required a five minute attention span of their participants. So in a way, they have a big data set but very little has been studied. If you want to ask the question of to what extent would you accept moral advice from a machine, then you really need a much more in-depth inquiry. 

In a way, this is also not new. The study even echoes some of the same ideas from the 1970s with ELIZA. ELIZA was something like an early chatbot and its founder, Joseph Weizenbaum, was shocked when he found out that people anthropomorphized it. He knew what it was capable of and in his book “Computer Power and Human Reason: From Judgment to Calculation” he recalls anecdotes where his secretary asked him to leave the room so she could interact with ELIZA in private. People were also contemplating to what extent ELIZA could replace human therapists. In a way, this says more about human stupidity than about artificial intelligence. 

In order to have a much better understanding of how people would take or not take moral advice from a chatbot, you need a very intense study and not a very short questionnaire.

devmio:  It also shows that people long for answers, right? That we want clear and concise answers to complex questions.

Katleen Gabriels: Of course, people long for a manual. If we were given a manual by birth, people would use it. It’s also about moral disengagement, it’s about delegating or distributing responsibility. But you don’t need this study to conclude that.

It’s not directly related, but it’s also a common problem on dating apps. People are being tricked into talking to chatbots. Usually, the longer you talk to a chatbot, the more obvious it might become, so there might be a lot of projection and wishful thinking. See also the media equation study. We simply tend to treat technology as human beings.


devmio: We use technology to get closer to ourselves, to get a better understanding of ourselves. Would you agree?

Katleen Gabriels: I teach a course about AI and there’s always students saying, “This is not a course about AI, this is a course about us!” because it’s so much about what intelligence is, where the boundary between humans and machines is, and so on. 

This would also be an interesting study for the future of people who believe in a fatal singularity in the future. What does it say about them and what they think of us humans?

devmio: Thank you for your answers!

OpenAI Embeddings

https://mlconference.ai/blog/openai-embeddings-technology-2024/ (Mon, 19 Feb 2024)

Embedding vectors (or embeddings for short) play a central role in processing and interpreting unstructured data such as text, images, or audio files. Embeddings convert unstructured data, no matter how complex, into a structured form that software can easily process. OpenAI offers such embeddings, and this article goes over how they work and how they can be used.

Data has always played a central role in the development of software solutions. One of the biggest challenges in this area is the processing and interpretation of unstructured data such as text, images, or audio files. This is where embedding vectors (called embeddings for short) come into play – a technology that is becoming increasingly important in the development of software solutions with the integration of AI functions.


Embeddings are essentially a technique for converting unstructured data into a structure that can be easily processed by software. They are used to transform complex data such as words, sentences, or even entire documents into a vector space, with similar elements close to each other. These vector representations allow machines to recognize and exploit nuances and relationships in the data. Which is essential for a variety of applications such as natural language processing (NLP), image recognition, and recommendation systems.

OpenAI, the company behind ChatGPT, offers models for creating embeddings for texts, among other things. At the end of January 2024, OpenAI presented new versions of these embeddings models, which are more powerful and cost-effective than their predecessors. In this article, after a brief introduction to embeddings, we’ll take a closer look at the OpenAI embeddings and the recently introduced innovations, discuss how they work, and examine how they can be used in various software development projects.

Embeddings briefly explained

Imagine you’re in a room full of people and your task is to group these people based on their personality. To do this, you could start asking questions about different personality traits. For example, you could ask how open someone is to new experiences and rate the answer on a scale from 0 to 1. Each person is then assigned a number that represents their openness.

Next, you could ask about another personality trait, such as the level of sense of duty, and again give a score between 0 and 1. Now each person has two numbers that together form a vector in a two-dimensional space. By asking more questions about different personality traits and rating them in a similar way, you can create a multidimensional vector for each person. In this vector space, people who have similar vectors can then be considered similar in terms of their personality.

In the world of artificial intelligence, we use embeddings to transform unstructured data into an n-dimensional vector space. Similar to how a person's personality traits are represented in the vector space above, each point in this vector space represents an element of the original data (such as a word or phrase) in a way that is understandable and processable by computers.

OpenAI Embeddings

OpenAI embeddings extend this basic concept. Instead of using simple features like personality traits, OpenAI models use advanced algorithms and big data to achieve a much deeper and more nuanced representation of the data. The model not only analyzes individual words, but also looks at the context in which those words are used, resulting in more accurate and meaningful vector representations.

Another important difference is that OpenAI embeddings are based on sophisticated machine learning models that can learn from a huge amount of data. This means that they can recognize subtle patterns and relationships in the data that go far beyond what could be achieved by simple scaling and dimensioning, as in the initial analogy. This leads to a significantly improved ability to recognize and exploit similarities and differences in the data.

 


Individual values are not meaningful

While in the personality trait analogy, each individual value of a vector can be directly related to a specific characteristic – for example openness to new experiences or a sense of duty – this direct relationship no longer exists with OpenAI embeddings. In these embeddings, you cannot simply look at a single value of the vector in isolation and draw conclusions about specific properties of the input data. For example, a specific value in the embedding vector of a sentence cannot be used to directly deduce how friendly or not this sentence is.

The reason for this lies in the way machine learning models, especially those used to create embeddings, encode information. These models work with complex, multi-dimensional representations where the meaning of a single element (such as a word in a sentence) is determined by the interaction of many dimensions in vector space. Each aspect of the original data – be it the tone of a text, the mood of an image, or the intent behind a spoken utterance – is captured by the entire spectrum of the vector rather than by individual values within that vector.

Therefore, when working with OpenAI embeddings, it’s important to understand that the interpretation of these vectors is not intuitive or direct. You need algorithms and analysis to draw meaningful conclusions from these high-dimensional and densely coded vectors.

Comparison of vectors with cosine similarity

A central element in dealing with embeddings is measuring the similarity between different vectors. One of the most common methods for this is cosine similarity. This measure is used to determine how similar two vectors are and therefore the data they represent.

To illustrate the concept, let’s start with a simple example in two dimensions. Imagine two vectors in a plane, each represented by a point in the coordinate system. The cosine similarity between these two vectors is determined by the cosine of the angle between them. If the vectors point in the same direction, the angle between them is 0 degrees and the cosine of this angle is 1, indicating maximum similarity. If the vectors are orthogonal (i.e. the angle is 90 degrees), the cosine is 0, indicating no similarity. If they are opposite (180 degrees), the cosine is -1, indicating maximum dissimilarity.

Figure 1 – Cosine similarity

A Python Notebook to try out
Accompanying this article is a Google Colab Python Notebook which you can use to try out many of the examples shown here. Colab, short for Colaboratory, is a free cloud service offered by Google. Colab makes it possible to write and execute Python code in the browser. It’s based on Jupyter Notebooks, a popular open-source web application that makes it possible to combine code, equations, visualizations, and text in a single document-like format. The Colab service is well suited for exploring and experimenting with the OpenAI API using Python.

In practice, especially when working with embeddings, we are dealing with n-dimensional vectors. The calculation of the cosine similarity remains conceptually the same, even if the calculation is more complex in higher dimensions. Formally, the cosine similarity of two vectors A and B in an n-dimensional space is calculated by the scalar product (dot product) of these vectors divided by the product of their lengths:
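Written out, this is the standard cosine similarity formula that Figure 2 shows:

\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}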

Figure 2 – Calculation of cosine similarity

The normalization of vectors plays an important role in the calculation of cosine similarity. If a vector is normalized, this means that its length (norm) is set to 1. For normalized vectors, the scalar product of two vectors is directly equal to the cosine similarity, since both norms in the denominator of the formula from Figure 2 are 1. OpenAI embeddings are normalized, which means that to calculate the similarity between two embeddings, only their scalar product needs to be calculated. This not only simplifies the calculation, but also increases efficiency when processing large quantities of embeddings.
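A small numpy sketch illustrates the point: once two vectors are scaled to length 1, their dot product already is the cosine similarity (the example vectors are arbitrary):

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.5])

# Normalize both vectors to length 1
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

full_formula = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = np.dot(a_n, b_n)

print(full_formula, dot_of_normalized)  # both print the same value (about 0.894)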

OpenAI Embeddings API

OpenAI offers a web API for creating embeddings. The exact structure of this API, including code examples for curl, Python and Node.js, can be found in the OpenAI reference documentation.

OpenAI does not use the LLM from ChatGPT to create embeddings, but rather specialized models. They were developed specifically for the creation of embeddings and are optimized for this task. Their development was geared towards generating high-dimensional vectors that represent the input data as well as possible. In contrast, ChatGPT is primarily optimized for generating and processing text in a conversational form. The embedding models are also more efficient in terms of memory and computing requirements than more extensive language models such as ChatGPT. As a result, they are not only faster but much more cost-effective.

New embedding models from OpenAI

Until recently, OpenAI recommended the use of the text-embedding-ada-002 model for creating embeddings. This model converts text into a sequence of floating point numbers (vectors) that represent the concepts within the content. The ada v2 model generated embeddings with a size of 1536 dimensions and delivered solid performance in benchmarks such as MIRACL and MTEB, which are used to evaluate model performance in different languages and tasks.

At the end of January 2024, OpenAI presented new, improved models for embeddings:

  • text-embedding-3-small: A smaller, more efficient model with improved performance compared to its predecessor. It performs better in benchmarks and is significantly cheaper.
  • text-embedding-3-large: A larger model that is more powerful and creates embeddings with up to 3072 dimensions. It shows the best performance in the benchmarks but is slightly more expensive than ada v2.

A new function of the two new models allows developers to adjust the size of the embeddings when generating them without significantly losing their concept-representing properties. This enables flexible adaptation, especially for applications that are limited in terms of available memory and computing power.
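In the API, this is just an additional parameter. A short sketch of requesting shortened embeddings; the dimension count of 256 is an arbitrary example:

from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable

# Request a shortened, 256-dimensional embedding instead of the full 3072 dimensions
response = client.embeddings.create(
    input="Embeddings convert unstructured data into vectors.",
    model="text-embedding-3-large",
    dimensions=256,
)
vec = response.data[0].embedding
print(len(vec))  # 256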

Readers who are interested in the details of the new models can find them in the announcement on the OpenAI blog. The exact costs of the various embedding models can be found in OpenAI's pricing overview.

New embeddings models
At the end of January 2024, OpenAI introduced new models for creating embeddings. All code examples and result values contained in this article already refer to the new text-embedding-3-large model.

Create embeddings with Python

In the following section, the use of embeddings is demonstrated using a few code examples in Python. The code examples are designed so that they can be tried out in Python notebooks, and they are also available in a similar form in the accompanying Google Colab notebook mentioned above.

Listing 1 shows how to create embeddings with OpenAI's Python SDK. In addition, numpy is used to show that the embeddings generated by OpenAI are normalized.

Listing 1

from openai import OpenAI
from google.colab import userdata
import numpy as np

# Create OpenAI client
client = OpenAI(
    api_key=userdata.get('openaiKey'),
)

# Define a helper function to calculate embeddings
def get_embedding_vec(input):
  """Returns the embeddings vector for a given input"""
  return client.embeddings.create(
        input=input,
        model="text-embedding-3-large", # We use the new embeddings model here (announced end of Jan 2024)
        # dimensions=... # You could limit the number of output dimensions with the new embeddings models
    ).data[0].embedding

# Calculate the embedding vector for a sample sentence
vec = get_embedding_vec("King")
print(vec[:10])

# Calculate the magnitude of the vector. It should be 1 as
# embedding vectors from OpenAI are always normalized.
magnitude = np.linalg.norm(vec)
magnitude

Similarity analysis with embeddings

In practice, OpenAI embeddings are often used for similarity analysis of texts (e.g. searching for duplicates, finding relevant text sections in relation to a customer query, and grouping text). Embeddings are very well suited for this, as they work in a fundamentally different way to comparison methods based on characters, such as Levenshtein distance. While it measures the similarity between texts by counting the minimum number of single-character operations (insert, delete, replace) required to transform one text into another, embeddings capture the meaning and context of words or sentences. They consider the semantic and contextual relationships between words, going far beyond a simple character-based level of comparison.

As a first example, let’s look at the following three sentences (the following examples are in English, but embeddings work analogously for other languages and cross-language comparisons are also possible without any problems):

I enjoy playing soccer on weekends.
Football is my favorite sport. Playing it on weekends with friends helps me to relax.
In Austria, people often watch soccer on TV on weekends.

In the first and second sentence, two different words are used for the same topic: Soccer and football. The third sentence contains the original soccer, but it has a fundamentally different meaning from the first two sentences. If you calculate the similarity of sentence 1 to 2, you get 0.75. The similarity of sentence 1 to 3 is only 0.51. The embeddings have therefore reflected the meaning of the sentence and not the choice of words.
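
If you want to reproduce such similarity values yourself, a few lines of Python are enough. The following sketch builds on the get_embedding_vec helper from Listing 1; because OpenAI embedding vectors are normalized, the dot product of two vectors is already their cosine similarity. The exact values you get may deviate slightly from the ones quoted here.

import numpy as np

sentence_1 = "I enjoy playing soccer on weekends."
sentence_2 = "Football is my favorite sport. Playing it on weekends with friends helps me to relax."
sentence_3 = "In Austria, people often watch soccer on TV on weekends."

# Embeddings for the three example sentences (helper from Listing 1)
vec_1 = get_embedding_vec(sentence_1)
vec_2 = get_embedding_vec(sentence_2)
vec_3 = get_embedding_vec(sentence_3)

def cosine_similarity(a, b):
  """Dot product of normalized vectors equals their cosine similarity."""
  return np.dot(a, b)

print(cosine_similarity(vec_1, vec_2))  # approx. 0.75
print(cosine_similarity(vec_1, vec_3))  # approx. 0.51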

Here is another example that requires an understanding of the context in which words are used:
He is interested in Java programming.
He visited Java last summer.
He recently started learning Python programming.

In sentence 2, Java refers to a place, while sentences 1 and 3 have something to do with software development. The similarity of sentence 1 to 2 is 0.536, but that of 1 to 3 is 0.587. As expected, the different meaning of the word Java has an effect on the similarity.

The next example deals with the treatment of negations:
I like going to the gym.
I don’t like going to the gym.
I don’t dislike going to the gym.

Sentences 1 and 2 say the opposite, while sentence 3 expresses something similar to sentence 1. This is reflected in the similarities of the embeddings. Sentence 1 to sentence 2 yields a cosine similarity of 0.714 while sentence 1 compared to sentence 3 yields 0.773. It is perhaps surprising that there is no major difference between the embeddings. However, it’s important to remember that all three sentences are about the same topic: the question of whether you like going to the gym to work out.

The last example shows that the OpenAI embeddings models, just like ChatGPT, have acquired a certain “knowledge” of concepts and contexts through training on texts about the real world.

I need to get better slicing skills to make the most of my Voron.
3D printing is a worthwhile hobby.
Can I have a slice of bread?

In order to compare these sentences in a meaningful way, it’s important to know that Voron is the name of a well-known open-source project in the field of 3D printing. It’s also important to note that slicing is a term that plays an important role in 3D printing. The third sentence also mentions slicing, but in a completely different context to sentence 1. Sentence 2 mentions neither slicing nor Voron. However, the trained knowledge enables the OpenAI Embeddings model to recognize that sentences 1 and 2 have a thematic connection, but sentence 3 means something completely different. The similarity of sentence 1 and 2 is 0.333 while the comparison of sentence 1 and 3 is only 0.263.

Similarity values are not percentages

The similarity values from the comparisons shown above are the cosine similarity of the respective embeddings. Although the cosine similarity values range from -1 to 1, with 1 being the maximum similarity and -1 the maximum dissimilarity, they are not to be interpreted directly as percentages of agreement. Instead, these values should be considered in the context of their relative comparisons. In applications such as searching text sections in a knowledge base, the cosine similarity values are used to sort the text sections in terms of their similarity to a given query. It is important to see the values in relation to each other. A higher value indicates a greater similarity, but the exact meaning of the value can only be determined by comparing it with other similarity values. This relative approach makes it possible to effectively identify and prioritize the most relevant and similar text sections.
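
To illustrate this relative use of the values, the following sketch ranks a handful of text sections by their similarity to a query and prints them in descending order. It again uses the get_embedding_vec helper from Listing 1; the text sections and the query are invented for this example.

import numpy as np

# Invented knowledge-base snippets and a user query
sections = [
  "Our support hotline is available Monday to Friday.",
  "The warranty period for all products is two years.",
  "We ship to all EU countries within five working days.",
]
query = "How long is the warranty?"

section_vecs = [get_embedding_vec(s) for s in sections]
query_vec = get_embedding_vec(query)

# Sort the sections by cosine similarity to the query (highest first)
ranked = sorted(
  ((np.dot(query_vec, v), s) for v, s in zip(section_vecs, sections)),
  reverse=True,
)
for score, text in ranked:
  print(f"{score:.3f}  {text}")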

Embeddings and RAG solutions

Embeddings play a crucial role in Retrieval Augmented Generation (RAG) solutions, an approach in artificial intelligence that combines the capabilities of information retrieval and text generation. Embeddings are used in RAG systems to retrieve relevant information from large data sets or knowledge databases. It is not necessary for these databases to have been included in the original training of the embedding models. They can be internal databases that are not available on the public Internet.
With RAG solutions, queries or input texts are converted into embeddings. The cosine similarity to the existing document embeddings in the database is then calculated to identify the most relevant text sections from the database. This retrieved information is then used by a text generation model such as ChatGPT to generate contextually relevant responses or content.

Vector databases play a central role in the functioning of RAG systems. They are designed to efficiently store, index and query high-dimensional vectors. In the context of RAG solutions and similar systems, vector databases serve as storage for the embeddings of documents or pieces of data that originate from a large amount of information. When a user makes a request, this request is first transformed into an embedding vector. The vector database is then used to quickly find the vectors that correspond most closely to this query vector – i.e. those documents or pieces of information that have the highest similarity. This process of quickly finding similar vectors in large data sets is known as Nearest Neighbor Search.
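
The following sketch condenses this retrieval flow into a minimal, brute-force form: a plain Python list stands in for the vector database, the nearest neighbor is found by comparing the query embedding with every stored embedding, and the best match is handed to a chat model as context. The client and the get_embedding_vec helper come from Listing 1; the document snippets, the query, and the chosen chat model are assumptions for illustration only.

import numpy as np

# Stand-in for a vector database: document snippets plus their embeddings
documents = [
  "The VX-2000 requires the DX97 oil for lubrication.",
  "Support tickets are answered within 24 hours.",
  "Firmware updates are published every quarter.",
]
doc_vecs = [get_embedding_vec(d) for d in documents]

# 1. Convert the user query into an embedding
query = "Which oil does the VX-2000 need?"
query_vec = get_embedding_vec(query)

# 2. Brute-force nearest neighbor search via cosine similarity
best_index = int(np.argmax([np.dot(query_vec, v) for v in doc_vecs]))
context = documents[best_index]

# 3. Let a chat model answer based on the retrieved context
response = client.chat.completions.create(
  model="gpt-3.5-turbo",  # assumption; use whichever chat model is available to you
  messages=[
    {"role": "system", "content": f"Answer the question using only this context: {context}"},
    {"role": "user", "content": query},
  ],
)
print(response.choices[0].message.content)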

Challenge: Splitting documents

A detailed explanation of how RAG solutions work is beyond the scope of this article. However, the explanations regarding embeddings are hopefully helpful for getting started with further research on the topic of RAGs.

However, one specific challenge should be highlighted at the end of this article: a particular and often underestimated difficulty in the development of RAG systems that go beyond Hello World prototypes is the splitting of longer texts. Splitting is necessary because the OpenAI embeddings models are limited to just over 8,000 tokens. One token corresponds to approximately 4 characters in the English language (see the OpenAI tokenizer [5]).

It’s not easy finding a good strategy for splitting documents. Naive approaches such as splitting after a certain number of characters can lead to the context of text sections being lost or distorted. Anaphoric links are a typical example of this. The following two sentences are an example:

VX-2000 requires regular lubrication to maintain its smooth operation.
The machine requires the DX97 oil, as specified in the maintenance section of this manual.

The machine in the second sentence is an anaphoric link to the first sentence. If the text were to be split up after the first sentence, the essential context would be lost, namely that the DX97 oil is necessary for the VX-2000 machine.

There are various approaches to solving this problem, which will not be discussed here to keep this article concise. However, it is essential for developers of such software systems to be aware of the problem and understand how splitting large texts affects embeddings.
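
As a very rough starting point, the naive fixed-size splitting mentioned above is often combined with an overlap between neighboring chunks, so that short-range references such as the anaphoric link in the example are less likely to be cut apart. The following sketch is a minimal illustration of this idea, not a production-ready splitter; the chunk size and overlap are arbitrary assumptions, and the manual text is a placeholder.

def split_text(text, chunk_size=1000, overlap=200):
  """Naive character-based splitting with overlapping chunks."""
  chunks = []
  start = 0
  while start < len(text):
    chunks.append(text[start:start + chunk_size])
    start += chunk_size - overlap
  return chunks

manual_text = "VX-2000 requires regular lubrication ..."  # placeholder for a longer document
for chunk in split_text(manual_text):
  chunk_vec = get_embedding_vec(chunk)  # embed each chunk separately (see Listing 1)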

Summary

Embeddings play a fundamental role in the modern AI landscape, especially in the field of natural language processing. By transforming complex, unstructured data into high-dimensional vector spaces, embeddings enable in-depth understanding and efficient processing of information. They form the basis for advanced technologies such as RAG systems and facilitate tasks such as information retrieval, context analysis, and data-driven decision-making.

OpenAI’s latest innovations in the field of embeddings, introduced at the end of January 2024, mark a significant advance in this technology. With the introduction of the new text-embedding-3-small and text-embedding-3-large models, OpenAI now offers more powerful and cost-efficient options for developers. These models not only show improved performance in standardized benchmarks, but also offer the ability to find the right balance between performance and memory requirements on a project-specific basis through customizable embedding sizes.

Embeddings are a key component in the development of intelligent systems that aim to achieve useful processing of speech information.

Links and Literature:

  1. https://colab.research.google.com/gist/rstropek/f3d4521ed9831ae5305a10df84a42ecc/embeddings.ipynb
  2. https://platform.openai.com/docs/api-reference/embeddings/create
  3. https://openai.com/blog/new-embedding-models-and-api-updates
  4. https://openai.com/pricing
  5. https://platform.openai.com/tokenizer

The post OpenAI Embeddings appeared first on ML Conference.

]]>
Address Matching with NLP in Python https://mlconference.ai/blog/address-matching-with-nlp-in-python/ Fri, 02 Feb 2024 12:02:35 +0000 https://mlconference.ai/?p=87201 Discover the power of address matching in real estate data management with this comprehensive guide. Learn how to leverage natural language processing (NLP) techniques using Python, including open-source libraries like SpaCy and fuzzywuzzy, to parse, clean, and match addresses. From breaking down data silos to geocoding and point-in-polygon searches, this article provides a step-by-step approach to creating a Source-of-Truth Real Estate Dataset. Whether you're in geospatial analysis, real estate data management, logistics, or compliance, accurate address matching is the key to unlocking valuable insights.

The post Address Matching with NLP in Python appeared first on ML Conference.

]]>
Address matching isn’t always simple in data; we often need to parse and standardize addresses into a consistent format first before we can use them as identifiers for matching. Address matching is an important step in the following use cases:

  1. Geospatial Analysis: Accurate address matching forms the foundation of geospatial analysis, allowing organizations to make informed decisions about locations, market trends, and resource allocation across various industries like retail and media.
  2. Real Estate Data Management: In the real estate industry, precise address matching facilitates property valuation, market analysis, and portfolio management.
  3. Logistics and Navigation: Efficient routing and delivery depend on accurate address matching.
  4. Compliance and Regulation: Many regulatory requirements mandate precise address data, such as tax reporting and census data collection.

Cherre is the leading real estate data management company, and we specialize in accurate address matching for the second use case. Whether you’re an asset manager, portfolio manager, or real estate investor, a building represents the atomic unit of all financial, legal, and operating information. However, real estate data lives in many silos, which makes having a unified view of properties difficult. Address matching is an important step in breaking down data silos in real estate. By joining disparate datasets on address, we can unlock many opportunities for further portfolio analysis.

Data Silos in Real Estate

Real estate data usually fall into the following categories: public, third party, and internal. Public data is collected by governmental agencies and made available publicly, such as land registers. The quality of public data is generally not spectacular and the data update frequency is usually delayed, but it provides the most comprehensive coverage geographically. Don’t be surprised if addresses from public data sources are misaligned and misspelled.

Third party data usually come from data vendors, whose business models focus on extracting information as datasets and monetizing those datasets. These datasets usually have good data quality and are much more timely, but are limited in geographical coverage. Addresses from data vendors are usually fairly clean compared to public data, but may not use the same address designation across different vendors. For large commercial buildings with multiple entrances and addresses, this creates an additional layer of complexity.

Lastly, internal data is information that is collected by the information technology (I.T.) systems of property owners and asset managers. These can incorporate various functions, from leasing to financial reporting, and are often set up to represent the business organizational structures and functions. Depending on the governance standards and data practices, the quality of these datasets can vary, and data coverage only encompasses the properties in the owner’s portfolio. Addresses in these systems can vary widely: some systems are designated at the unit level, while others designate the entire property. These systems also may not standardize addresses inherently, which makes it difficult to match property records across multiple systems.

With all these variations in data quality, coverage, and address formats, we can see the need for having standardized addresses to do basic property-level analysis.


Address Matching Using the Parse-Clean-Match Strategy

In order to match records across multiple datasets, the address parse-clean-match strategy works very well regardless of region. By breaking down addresses into their constituent pieces, we have many more options for associating properties with each other. Many of the approaches for this strategy use simple natural language processing (NLP) techniques.


Address Parsing

Before we can associate addresses with each other, we must first parse the address. Address parsing is the process of breaking down each address string into its constituent components. Components in addresses will vary by country.

In the United States and Canada, addresses are generally formatted as the following:

{street_number} {street_name}

{city}, {state_or_province} {postal_code}

{country}

In the United Kingdom, addresses are formatted very similarly as in the U.S. and Canada, with an additional optional locality designation:

{building_number} {street_name}

{locality (optional)}

{city_or_town}

{postal_code}

{country}

 

French addresses vary slightly from U.K. addresses with the order of postal code and city:

{building_number} {street_name}

{postal_code} {city}

{country}

 

German addresses take the changes in French addresses and then swap the order of street name and building number:

{street_name} {building_number}

{postal_code} {city}

{country}

 

Despite the slight variations across countries’ address formats, addresses generally have the same components, which makes this an easily digestible NLP problem. We can break down the process into the following steps:

  1. Tokenization: Split the address into its constituent words. This step segments the address into manageable units.
  2. Named Entity Recognition (NER): Identify entities within the address, such as street numbers, street names, cities, postal codes, and countries. This involves training or using pre-trained NER models to label the relevant parts of the address.
  3. Sequence Labeling: Use sequence labeling techniques to tag each token with its corresponding entity

Let’s demonstrate address parsing with a sample Python code snippet using the spaCy library. SpaCy is an open-source software library containing many neural network models for NLP functions. SpaCy supports models across 23 different languages and allows for data scientists to train custom models for their own datasets. We will demonstrate address parsing using one of SpaCy’s out-of-the-box models for the address of a historical landmark: David Bowie’s Berlin apartment.

 

import spacy

# Load the NER spaCy model
model = spacy.load("en_core_web_sm")

# Address to be parsed
address = "Hauptstraße 155, 10827 Berlin"

# Tokenize and run NER
doc = model(address)

# Extract address components
street_number = ""
street_name = ""
city = ""
state = ""
postal_code = ""

for token in doc:
    if token.ent_type_ == "GPE":  # Geopolitical Entity (City)
        city = token.text
    elif token.ent_type_ == "LOC":  # Location (State/Province)
        state = token.text
    elif token.ent_type_ == "DATE":  # Postal Code
        postal_code = token.text
    else:
        if token.is_digit:
            street_number = token.text
        else:
            street_name += token.text + " "

# Print the parsed address components
print("Street Number:", street_number)
print("Street Name:", street_name)
print("City:", city)
print("State:", state)
print("Postal Code:", postal_code)

Now that we have a parsed address, we can now clean each address component.

Address Cleaning

Address cleaning is the process of converting parsed address components into a consistent and uniform format. This is particularly important for any public data with misspelled, misformatted, or mistyped addresses. We want to have addresses follow a consistent structure and notation, which will make further data processing much easier.

To standardize addresses, we need to standardize each component, and how the components are joined. This usually entails a lot of string manipulation. There are many open source libraries (such as libpostal) and APIs that can automate this step, but we will demonstrate the basic premise using simple regular expressions in Python.


import pandas as pd
import re

# Sample dataset with tagged address components
data = {
    'Street Name': ['Hauptstraße', 'Schloß Nymphenburg', 'Mozartweg'],
    'Building Number': ['155', '1A', '78'],
    'Postal Code': ['10827', '80638', '54321'],
    'City': ['Berlin', ' München', 'Hamburg'],
}

df = pd.DataFrame(data)

# Functions with typical necessary steps for each address component
# We uppercase all text for easier matching in the next step

def standardize_street_name(street_name):
    # Remove special characters and abbreviations, uppercase names
    standardized_name = re.sub(r'[^\w\s]', '', street_name)
    return standardized_name.upper()

def standardize_building_number(building_number):
    # Remove any non-alphanumeric characters (although exceptions exist)
    standardized_number = re.sub(r'\W', '', building_number)
    return standardized_number

def standardize_postal_code(postal_code):
    # Make sure we have consistent formatting (i.e. leading zeros)
    return postal_code.zfill(5)

def standardize_city(city):
    # Upper case the city, normalize spacing between words
    return ' '.join(word.upper() for word in city.split())

# Apply standardization functions to our DataFrame
df['Street Name'] = df['Street Name'].apply(standardize_street_name)
df['Building Number'] = df['Building Number'].apply(standardize_building_number)
df['Postal Code'] = df['Postal Code'].apply(standardize_postal_code)
df['City'] = df['City'].apply(standardize_city)

# Finally create a standardized full address (without commas)
df['Full Address'] = df['Street Name'] + ' ' + df['Building Number'] + ' ' + df['Postal Code'] + ' ' + df['City']

Address Matching

Now that our addresses are standardized into a consistent format, we can finally match addresses from one dataset to addresses in another dataset. Address matching involves identifying and associating similar or identical addresses from different datasets. When two full addresses match exactly, we can easily associate the two together through a direct string match.

 

When addresses don’t match, we will need to apply fuzzy matching on each address component. Below is an example of how to do fuzzy matching on one of the standardized address components for street names. We can apply the same logic to city and state as well.


from fuzzywuzzy import fuzz

# Sample list of street names from another dataset
street_addresses = [
    "Hauptstraße",
    "Schlossallee",
    "Mozartweg",
    "Bergstraße",
    "Wilhelmstraße",
    "Goetheplatz",
]

# Target address component (we are using street name)
target_street_name = "Hauptstrasse " # Note the different spelling and space 

# Similarity threshold
# Increase this number if too many false positives
# Decrease this number if not enough matches
threshold = 80

# Perform fuzzy matching
matches = []

for address in street_addresses:
    similarity_score = fuzz.partial_ratio(address, target_street_name)
    if similarity_score >= threshold:
        matches.append((address, similarity_score))

matches.sort(key=lambda x: x[1], reverse=True)

# Display matched street name
print("Target Street Name:", target_street_name)
print("Matched Street Names:")
for match in matches:
    print(f"{match[0]} (Similarity: {match[1]}%)")

Up to here, we have solved the problem for properties with the same address identifiers. But what about the large commercial buildings with multiple addresses?

Other Geospatial Identifiers

Addresses are not the only geospatial identifiers in the world of real estate. An address typically refers to the location of a structure or property, often denoting a street name and house number.  There are actually four other geographic identifiers in real estate:

 

  1. A “lot” represents a portion of land designated for specific use or ownership.
  2. A “parcel” extends this notion to a legally defined piece of land with boundaries, often associated with property ownership and taxation.
  3. A “building” encompasses the physical structures erected on these parcels, ranging from residential homes to commercial complexes.
  4. A “unit” is a sub-division within a building, typically used in multi-unit complexes or condominiums. These can be commercial complexes (like office buildings) or residential complexes (like apartments).

 

What this means is that we actually have multiple ways of identifying real estate objects, depending on the specific persona and use case. For example, leasing agents focus on the units within a building for tenants, while asset managers optimize for the financial performance of entire buildings. The nuances of these details are also codified in many real estate software systems (found in internal data), in the databases of governments (found in public data), and across databases of data vendors (found in third party data). In public data, we often encounter lots and parcels. In vendor data, we often find addresses (with or without units). In real estate enterprise resource planning systems, we often find buildings, addresses, units, and everything else in between.

In the case of large commercial properties with multiple addresses, we need to associate various addresses with each physical building. In this case, we can use geocoding and point-in-polygon searches.

Geocoding Addresses

Geocoding is the process of converting addresses into geographic coordinates. The most common form is latitude and longitude. European address geocoding requires a robust understanding of local address formats, postal codes, and administrative regions. Luckily, we have already standardized our addresses into an easily geocodable format.

Many commercial APIs exist for geocoding addresses in bulk, but we will demonstrate geocoding using a popular Python library, Geopy, to geocode addresses.

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my_geocoder")
location = geolocator.geocode("1 Canada Square, London")
print(location.latitude, location.longitude)

 

 

Now that we’ve converted our addresses into latitude and longitude, we can use point-in-polygon searches to associate addresses with buildings.

Point-in-Polygon Search

A point-in-polygon search is a technique to determine if a point is located within the boundaries of a given polygon.

The “point” in a point-in-polygon search refers to a specific geographical location defined by its latitude and longitude coordinates. We have already obtained our points by geocoding our addresses.

The “polygon” is a closed geometric shape with three or more sides, which is usually characterized by a set of vertices (points) connected by edges, forming a closed loop. Building polygons can be downloaded from open source sites like OpenStreetMap or from specific data vendors. The quality and detail of the OpenStreetMap building data may vary, and the accuracy of the point-in-polygon search depends on the precision of the building geometries.

While the concept seems complex, the code for creating this lookup is quite simple. We demonstrate a simplified example using our previous example of 1 Canada Square in London.


import json
from shapely.geometry import shape, Point

# Load the GeoJSON data
with open('building_data.geojson') as geojson_file:
    building_data = json.load(geojson_file)

# Latitude and Longitude of 1 Canada Square in Canary Wharf
lat, lon = 51.5049, -0.0195  # longitude is west of Greenwich, hence negative

# Create a Point geometry for 1 Canada Square
point_1_canada = Point(lon, lat)

# See if point is within any of the polygons
for feature in building_data['features']:
    building_geometry = shape(feature['geometry'])

    if point_1_canada.within(building_geometry):
        print(f"Point is within this building polygon: {feature}")
        break
else:
    print("Point is not within any building polygon in the dataset.")

Using this technique, we can properly identify all addresses associated with this property.


Summary

Addresses in real life are confusing because they are the physical manifestation of many disparate decisions in city planning throughout the centuries-long life of a city. But using addresses to match across different datasets doesn’t have to be confusing.

Using some basic NLP and geocoding techniques, we can easily associate property-level records across various datasets from different systems. Only through breaking down data silos can we have more holistic views of property behaviors in real estate.

Author Biography

Alyce Ge is a data scientist at Cherre, the industry-leading real estate data management and analytics platform. Prior to joining Cherre, Alyce held data science and analytics roles for a variety of technology companies focusing on real estate and business intelligence solutions. Alyce is a Google Cloud-certified machine learning engineer, Google Cloud-certified data engineer, and Triplebyte certified data scientist. She earned her Bachelor of Science in Applied Mathematics from Columbia University in New York.

 

The post Address Matching with NLP in Python appeared first on ML Conference.

]]>
What is Data Annotation and how is it used in Machine Learning? https://mlconference.ai/blog/data-annotation-ml/ Tue, 12 Oct 2021 12:24:15 +0000 https://mlconference.ai/?p=82363 What is data annotation? And how is data annotation applied in ML? In this article, we are delving deep to answer these key questions. Data annotation is valuable to ML and has contributed immensely to some of the cutting-edge technologies we enjoy today. Data annotators, or the invisible workers in the ML workforce, are needed more now than ever before.

The post What is Data Annotation and how is it used in Machine Learning? appeared first on ML Conference.

]]>
Modern businesses are operating in highly competitive markets, and finding new business opportunities is harder than ever. Customer experiences are constantly changing, and finding the right talent to work on common business goals is an enormous challenge, yet businesses want to perform the way the market demands. So what are these companies doing to create a sustainable competitive advantage? This is where Artificial Intelligence (AI) solutions come in and are prioritized. With AI, it is easier to automate business processes and smooth decision-making. But what exactly defines a successful Machine Learning (ML) project? The answer is simple: the quality of the training datasets that work with your ML algorithms.

Having that in mind, what amounts to a high-quality training dataset? Data annotation. What is data annotation? And how is data annotation applied in ML?

In this article, we are delving deep to answer these key questions. It is particularly helpful if:

  • You are seeking to understand what data annotation is in ML and why it is so important.
  • You are a data scientist curious to know the various data annotation types out there and their unique applications.
  • You want to produce high-quality datasets for your ML model’s top performance, and have no idea where to find professional data annotation services.
  • You have huge chunks of unlabeled data, have no time to gather, organize, and label it, and are in dire need of a data labeler to do the job for you so that you can ultimately meet the training and deployment goals for your models.

What is Data Annotation?

In ML, data annotation refers to the process of labeling data in a manner that machines can recognize either through computer vision or natural language processing (NLP). In other words, data labeling teaches the ML model to interpret its environment, make decisions and take action in the process.

Data scientists use massive amounts of datasets when building an ML model, carefully customizing them according to the model training needs. Thus, machines are able to recognize data annotated in different, understandable formats such as images, texts, and videos.

This explains why AI and ML companies are after such annotated data to feed into their ML algorithm, training them to learn and recognize recurring patterns, eventually using the same to make precise estimations and predictions.

The data annotation types

Data annotation comes in different types, each serving different and unique use cases. Although data annotation is broad and wide, there are common annotation types in popular machine learning projects which we are looking at in this section to give you the gist in this field:

Semantic Annotation

Semantic annotation entails annotation of different concepts within text, such as names, objects, or people. Data annotators use semantic annotation in their ML projects to train chatbots and improve search relevance.

Image and Video Annotation

Put simply, image annotation enables machines to interpret the content of pictures. Data experts use various forms of image annotation, ranging from bounding boxes drawn on images to assigning a meaning to individual pixels, a process called semantic segmentation. This type of annotation is commonly used in image recognition models for tasks like facial recognition and recognizing and blocking sensitive content.

Video annotation, on the other hand, uses bounding boxes or polygons on video content. The process is simple: developers use video annotation tools to place these bounding boxes, or stitch together video frames to track the movement of annotated objects. Either way, this type of data comes in handy when developing computer vision models for localization or object tracking tasks.

Text categorization

Text categorization, also called text classification or text tagging is where a set of predefined categories are assigned to documents. A document can contain tagged paragraphs or sentences by topic using this type of annotation, thus making it easier for users to search for information within a document, an application, or a website.

Why is Data Annotation so Important in ML

Whether you think of search engines’ ability to improve on the quality of results, developing facial recognition software, or how self-driving automobiles are created, all these are made real through data annotation. Living examples include how Google manages to give results based on the user’s geographical location or sex, how Samsung and Apple have improved the security of their smartphones using facial unlocking software, how Tesla brought into the market semi-autonomous self-driving cars, and so on.

Annotated data is valuable in ML for giving accurate predictions and estimations about our living environments. As mentioned above, machines are able to recognize recurring patterns, make decisions, and take action as a result. In other words, machines are shown understandable patterns and told what to look for – in images, video, text, or audio. There is practically no limit to the similar patterns a trained ML algorithm can find in any new dataset fed into it.

Data Labeling in ML

In ML, a data label, also called a tag, is an element that identifies raw data (images, videos, or text), and adds one or more informative labels to put into context what an ML model can learn from. For example, a tag can indicate what words were said in an audio file, or what objects are contained in a photo.

Data labeling helps ML models learn from the numerous examples given. For example, the model will easily spot a bird or a person in an unlabeled image if it has seen enough labeled examples of images with a bird or a person in them.
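
To make this a little more concrete, the following snippet sketches what such labeled records could look like – here as simple Python dictionaries for an object detection task. The file names, labels, and bounding-box coordinates are invented for illustration.

# Invented labeled records for an object detection task
labeled_data = [
  {
    "image": "photos/park_001.jpg",
    "labels": [
      {"object": "person", "bbox": [34, 50, 120, 220]},  # x, y, width, height
      {"object": "bird", "bbox": [200, 80, 40, 35]},
    ],
  },
  {
    "image": "photos/street_002.jpg",
    "labels": [{"object": "car", "bbox": [10, 90, 300, 150]}],
  },
]

# A model trained on many such examples learns to associate
# the raw pixels with the annotated objects.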

Conclusion

Data annotation is valuable to ML and has contributed immensely to some of the cutting-edge technologies we enjoy today. Data annotators, or the invisible workers in the ML workforce, are needed more now than ever before. The growth of the AI and ML industry as a whole depends heavily on the continued creation of the nuanced datasets needed to solve some of ML’s complex problems.

There is no better “fuel” for training ML algorithms than annotated data in images, videos, or texts – and that is when we arrive at some of the autonomous ML models we can possibly and proudly have.

Now you understand why data annotation is essential in ML, its various and common types, and where to find data annotators to do the job for you. You are in a position to make informed choices for your enterprise and level up your operations.

The post What is Data Annotation and how is it used in Machine Learning? appeared first on ML Conference.

]]>
Neuroph and DL4J https://mlconference.ai/blog/neuroph-and-dl4j/ Tue, 14 Sep 2021 11:31:34 +0000 https://mlconference.ai/?p=82227 In this article, we would like to show how neural networks, specifically the multilayer perceptron of two Java frameworks, can be used to detect blood cells in images.

The post Neuroph and DL4J appeared first on ML Conference.

]]>
Microscopic blood counts include an analysis of the six types of white blood cells. These include: neutrophils (band and segmented), eosinophils, basophilic granulocytes, monocytes, and lymphocytes. Based on the number, maturity, and distribution of these white blood cells, you can obtain valuable information about possible diseases. However, here we will not focus on the handling of the blood smears, but on the recognition of the blood cells.

For the tests described, the Bresser Trino microscope with a MikrOkular was used and connected to a computer (HP Z600). The program presented in this article was used for image analysis. The software is based on neural networks using the Java frameworks Neuroph and Deep Learning for Java (DL4J). Staining of the smears for the microscope was done with Löffler solution.

 

Training data

For neural network training, the images of the blood cells were centered, converted to grayscale format, and normalized. After preparation, the images looked as shown in Figure 1.

Fig. 1: The JPG images have a size of 100 x 100 pixels and show (from left to right) lymphocyte (ly), basophil (bg), eosinophil (eog), monocyte (mo), rod-nucleated (young) neutrophil (sng), segment-nucleated (mature) neutrophil (seg); the cell types were used for neural network training.

 

A dataset of 663 images with 6 labels – ly, bg, eog, mo, sng, seg – was compiled for training. For Neuroph, the imageLabels shown in Listing 1 were set.

List<String> imageLabels = new ArrayList();
  imageLabels.add("ly");
  imageLabels.add("bg");
  imageLabels.add("eog");
  imageLabels.add("mo");
  imageLabels.add("sng");
  imageLabels.add("seg");

After that, the directory for the input data looks like Figure 2.

Fig. 2: The directory for the input data

 

For DL4J the directory for the input data (your data location) is composed differently (Fig. 3).

Fig. 3: Directory for the input data for DL4J.

 

Most of the images in the dataset came from our own photographs. However, there were also images from open and free internet sources. In addition, the dataset contained the images multiple times as they were also rotated 90, 180, and 270 degrees respectively and stored.

 


Neuroph MLP Network

The main dependencies for the Neuroph project in pom.xml are shown in Listing 2.

<dependency>
  <groupId>org.neuroph</groupId>
  <artifactId>neuroph-core</artifactId>
  <version>2.96</version>
</dependency>   
<dependency>
  <groupId>org.neuroph</groupId>
  <artifactId>neuroph-imgrec</artifactId>
  <version>2.96</version>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.17</version>
</dependency>

A multilayer perceptron was set with the parameters shown in Listing 3.

private static final double LEARNINGRATE = 0.05;
private static final double MAXERROR = 0.05;
private static final int HIDDENLAYERS = 13;
 
//Open Network
Map<String, FractionRgbData> map;
  try {
    map = ImageRecognitionHelper.getFractionRgbDataForDirectory(new File(imageDir), new Dimension(10, 10));
    dataSet = ImageRecognitionHelper.createRGBTrainingSet(imageLabels, map);
    // create neural network
    List<Integer> hiddenLayers = new ArrayList<>();
    hiddenLayers.add(HIDDENLAYERS);
    NeuralNetwork nnet = ImageRecognitionHelper.createNewNeuralNetwork("leukos", new Dimension(10, 10), ColorMode.COLOR_RGB, imageLabels, hiddenLayers, TransferFunctionType.SIGMOID);
    // set learning rule parameters
    BackPropagation mb = (BackPropagation) nnet.getLearningRule();
    mb.setLearningRate(LEARNINGRATE);
    mb.setMaxError(MAXERROR);
    nnet.save("leukos.net");
  } catch (IOException ex) {
    Logger.getLogger(Neuroph.class.getName()).log(Level.SEVERE, null, ex);
  }

 

Example

The implementation of a test can look like the one shown in Listing 4.

HashMap<String, Double> output;
String fileName = "leukos112.seg";
NeuralNetwork nnetTest = NeuralNetwork.createFromFile("leukos.net");
// get the image recognition plugin from neural network
ImageRecognitionPlugin imageRecognition = (ImageRecognitionPlugin) nnetTest.getPlugin(ImageRecognitionPlugin.class);
output = imageRecognition.recognizeImage(new File(fileName));

 

Client

A simple SWING interface was developed for graphical cell recognition. An example of the recognition of a lymphocyte is shown in Figure 4.

Fig. 4: The program recognizes a lymphocyte and highlights it

 

DL4J MLP network

The main dependencies for the DL4J project in pom.xml are shown in Listing 5.

<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-core</artifactId>
  <version>1.0.0-beta4</version>
</dependency>
<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-native-platform</artifactId>
  <version>1.0.0-beta4</version>
</dependency>

Again, a multilayer perceptron was used with the parameters shown in Listing 6.

protected static int height = 100;
protected static int width = 100;
protected static int channels = 1;
protected static int batchSize = 20;
 
protected static long seed = 42;
protected static Random rng = new Random(seed);
protected static int epochs = 100;
protected static boolean save = true;
//DataSet
  String dataLocalPath = "your data location";
  ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
  File mainPath = new File(dataLocalPath);
  FileSplit fileSplit = new FileSplit(mainPath, NativeImageLoader.ALLOWED_FORMATS, rng);
  int numExamples = toIntExact(fileSplit.length());
  numLabels = Objects.requireNonNull(fileSplit.getRootDir().listFiles(File::isDirectory)).length;
  int maxPathsPerLabel = 18;
  BalancedPathFilter pathFilter = new BalancedPathFilter(rng, labelMaker, numExamples, numLabels, maxPathsPerLabel);
  //training – Share test
  double splitTrainTest = 0.8;
  InputSplit[] inputSplit = fileSplit.sample(pathFilter, splitTrainTest, 1 - splitTrainTest);
  InputSplit trainData = inputSplit[0];
  InputSplit testData = inputSplit[1];
 
//Open Network
MultiLayerNetwork network = lenetModel();
network.init();
ImageRecordReader trainRR = new ImageRecordReader(height, width, channels, labelMaker);
trainRR.initialize(trainData, null);
DataSetIterator trainIter = new RecordReaderDataSetIterator(trainRR, batchSize, 1, numLabels);
  // 'scaler' was not declared in the original listing; an ImagePreProcessingScaler (assumption) normalizes pixel values to [0, 1]
  DataNormalization scaler = new ImagePreProcessingScaler(0, 1);
  scaler.fit(trainIter);
  trainIter.setPreProcessor(scaler);
  network.fit(trainIter, epochs);

 

LeNet Model

The LeNet model is a convolutional feed-forward neural network for image processing (Listing 8).

private MultiLayerNetwork lenetModel() {
  /*
    * Revised Lenet Model approach developed by ramgo2 achieves slightly above random
    * Reference: https://gist.github.com/ramgo2/833f12e92359a2da9e5c2fb6333351c5
  */
  MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .l2(0.005)
    .activation(Activation.RELU)
    .weightInit(WeightInit.XAVIER)
    .updater(new AdaDelta())
    .list()
    .layer(0, convInit("cnn1", channels, 50, new int[]{5, 5}, new int[]{1, 1}, new int[]{0, 0}, 0))
    .layer(1, maxPool("maxpool1", new int[]{2, 2}))
    .layer(2, conv5x5("cnn2", 100, new int[]{5, 5}, new int[]{1, 1}, 0))
    .layer(3, maxPool("maxool2", new int[]{2, 2}))
    .layer(4, new DenseLayer.Builder().nOut(500).build())
    .layer(5, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
      .nOut(numLabels)
      .activation(Activation.SOFTMAX)
      .build())
    .setInputType(InputType.convolutional(height, width, channels))
    .build();
 
  return new MultiLayerNetwork(conf);
 
}

 


Example

A test of the Lenet model might look like the one shown in Listing 9.

trainIter.reset();
DataSet testDataSet = trainIter.next();
List<String> allClassLabels = trainRR.getLabels();
int labelIndex;
int[] predictedClasses;
String expectedResult;
String modelPrediction;
int n = allClassLabels.size();
System.out.println("n = " + n);
for (int i = 0; i < n; i = i + 1) {
  labelIndex = testDataSet.getLabels().argMax(1).getInt(i);
  System.out.println("labelIndex=" + labelIndex);
  INDArray ia = testDataSet.getFeatures();
  predictedClasses = network.predict(ia);
  expectedResult = allClassLabels.get(labelIndex);
  modelPrediction = allClassLabels.get(predictedClasses[i]);
  System.out.println("For a single example that is labeled " + expectedResult + " the model predicted " + modelPrediction);
}

 

Results

After a few test runs, the results shown in Table 1 are obtained.

Leukos                                  Neuroph   DL4J

Lymphocytes (ly)                        87        85
Basophils (bg)                          96        63
Eosinophils (eog)                       93        54
Monocytes (mo)                          86        60
Rod nuclear neutrophils (sng)           71        46
Segment nucleated neutrophils (seg)     92        81

Table 1: Results of leukocyte counting (N-success/N-samples in %).

 

As can be seen, the results using Neuroph are slightly better than those using DL4J, but it is important to note that the results are dependent on the quality of the input data and the network’s topology. We plan to investigate this issue further in the near future.

However, with these results, we have already been able to show that image recognition can be used for medical purposes with not one, but two sound and potentially complementary Java frameworks.

 

Acknowledgments

At this point, we would like to thank Mr. A. Klinger (Management Devoteam GmbH Germany) and Ms. M. Steinhauer (Bioinformatician) for their support.

The post Neuroph and DL4J appeared first on ML Conference.

]]>
Top 5 reasons to attend ML Conference https://mlconference.ai/blog/top-5-reasons-to-attend-ml-conference/ Tue, 20 Jul 2021 11:33:51 +0000 https://mlconference.ai/?p=82083 So you’ve decided to attend ML Conference but you don’t know how to break it to your boss that it is a win-win situation? Don’t worry, we’ve got you covered. Follow 4 simple steps and use these 5 arguments to show why your organization needs to invest in ML Conference!

The post Top 5 reasons to attend ML Conference appeared first on ML Conference.

]]>
1. Let your boss know why you want to go to ML Conference

Tell him that there are over 25 expert speakers and industry experts addressing current trends and best practices.

Tell your boss to take a look at the conference tracks to have a better idea of what this conference is all about.

 

2. Tell him what’s in it for him

You have the chance to gain key knowledge and skills for this new era of Machine Learning. Turn your ideas into best practices during the workshops and meet people who can help you with that. You’ll learn what it means to build up a ML-first mindset with numerous real-world examples and you can put them into practice in your company. At ML Conference you will develop a deep understanding of your data, as well as of the latest tools and technologies.

 

3. Show him that you’ve done your homework: Book your ticket now and save money.

If you book your ticket now, your boss will save money on the early bird ticket. Plus, you will have an additional 10% discount for a group of 3 people or more.

 

4. Assure your boss that you will network with top industry experts

In addition to the valuable knowledge you will get from top-notch industry experts, you’ll also have the chance to connect and network with the people who are at the top of their career. ML Conference offers an expo reception and a networking event.

 

The post Top 5 reasons to attend ML Conference appeared first on ML Conference.

]]>
Anomaly Detection as a Service with Metrics Advisor https://mlconference.ai/blog/anomaly-detection-as-a-service-with-metrics-advisor/ Wed, 09 Jun 2021 07:39:21 +0000 https://mlconference.ai/?p=81814 We humans are usually good at spotting anomalies: often a quick glance at monitoring charts is enough to spot (or, in the best case, predict) a performance problem. A curve rises unnaturally fast, a value falls below a desired minimum or there are fluctuations that cannot be explained rationally. Some of this would be technically detectable by a simple automated if, but it's more fun with Azure Cognitive Services' new Metrics Advisor.

The post Anomaly Detection as a Service with Metrics Advisor appeared first on ML Conference.

]]>
Are you developing an application that stores time-based data? Orders, ratings, comments, appointments, time bookings, repairs or customer contacts? Do you have detailed log files about the number and duration of visits? Hand on heart: How quickly would you notice if your systems (or your users) were behaving differently than you thought? Maybe one of your clients is trying to flood the software with way too much data, or a product in your webshop is “going through the roof”? Maybe there are performance issues in certain browsers or unnatural CPU spikes that deserve a closer look? Metrics Advisor from Azure Cognitive Services provides an AI-powered service that monitors your data and alerts you when anomalies are suspected.


What is normal?

The big challenge here is defining what constitutes an anomaly in the first place. Imagine a whole shelf full of developer magazines, with only one sports magazine among them. You could rightly say that the sports magazine is an anomaly. But perhaps, by chance, all the magazines are in A4 format, and only two are in A5 – another anomaly. Thus, for automated anomaly detection, it is important to learn from experience and understand which anomalies are actually relevant – and where there is a false alarm that should be avoided in the future.

In the case of time-based data – which is what Metrics Advisor is all about – there are several approaches to anomaly detection. The simplest way is to define hard limits: Anything below or above a certain threshold is considered an anomaly. This doesn’t require machine learning or artificial intelligence; the rules are quickly implemented and clearly understood. For monitoring data, that might be enough: If 70 percent of the storage space is occupied, you want to react. But often the (data) world does not run in rigid paths, sometimes the relative change is more decisive than the actual value: if there has been a significant increase or decrease of more than 10 percent within the last three hours, an anomaly should be detected. An example could be taken from finance. If your private account balance changes from €20,000 to €30,000, this is probably to be considered an anomaly. If a company account changes from €200,000 to €210,000, this is not worth mentioning. As you can tell from this example, the classification of what constitutes an anomaly may change over time. When founding a startup, €100,000 is a lot of money; when founding a large corporation, it is a marginal note. But what if your data is subject to seasonal fluctuations, or individual days such as weekends or holidays behave significantly differently? Here, too, the classification is not so trivial. Is a wave of influenza in the winter months to be expected and only an anomaly in the summer, or should every increase in infection numbers be flagged? As you can see, the question of anomaly detection is to some extent a very subjective one, regardless of the tooling – and not all decisions can be taken over by the technology. Machine learning, however, can help learn from historical data and distinguish normal fluctuations from anomalies.


Metrics Advisor

Metrics Advisor is a new service in the Azure Cognitive Services lineup and is only available in a preview version for now. Internally, another service is used, namely the Anomaly Detector (also part of the Cognitive Services). Metrics Advisor complements this with numerous API methods, a web frontend for managing and viewing data feeds, root-cause analysis and alert configurations. To experiment, you need an Azure subscription. There you can create a Metrics Advisor resource; it’s completely free to use in the preview phase.

The example I would like to use to demonstrate the basic procedure uses data from Google Trends [1]. I have evaluated and downloaded weekly Google Trends scores for two search terms (“vaccine” and “influenza”) for the last five years for four countries (USA, China, Germany, Austria) and would like to try to identify any anomalies in this data. The entire administration of the Metrics Advisor can be done via the provided REST API [2], a faster way to get started is via the provided web frontend, the so-called workspace [3].

Data Feeds

To start with, we create a new data feed that provides the basic data for the analysis. Various Azure services and databases are available out of the box as data sources: Application Insights, Blob Storage, Cosmos DB, Data Lake, Table Storage, SQL Database, Elastic Search, Mongo DB, PostgreSQL – and a few more. In our example, I loaded the Google Trends data into a SQL Server database. In addition to the primary key, the table has four other columns: the date, the country, and the scores for vaccine and influenza. In Metrics Advisor, an SQL statement must now be specified (in addition to the connection string) to query all values for a given date. This is because the service will continue to periodically visit our database to retrieve and analyze the new data. The frequency at which this update should happen is set via the granularity: The data can be analyzed annually, monthly, weekly, daily, hourly or in even shorter periods (the smallest unit is 300 seconds). Depending on the selected granularity, Microsoft also recommends how much historical data to provide. If we choose a 5-minute interval, then data from the last four days will suffice. In our case, a weekly analysis, four years is recommended. After clicking on Verify and get schema, the SQL statement is issued and the structure of our data source is determined. We see the columns shown in Figure 1 and need to assign meaning: Which column contains the timestamp? Which columns should be analyzed as metrics – and where are additional facts (dimensions) that could be possible causes of anomalies?

Fig. 1: Configuration of the data feed

Now, before the data is actually imported, there is one more thing to consider: the roll-up settings. For a later root cause analysis, it is necessary to build a multidimensional cube that calculates aggregated values per dimension (i.e. in our case, an aggregation over all countries per week). This way, in case of anomalies, it can later be investigated which dimensions or characteristics seem to be causal for the change in value. If the aggregations are not already in our data source, Metrics Advisor can be asked to calculate them. The only decision we have to make here is the type of aggregation (sum, average, min, max, count). Admittedly, our example is a bit flawed here: we select average, but the value for the USA is then weighted the same as the value for the much smaller Austria. As you can see, data quality is often the stumbling block, and one has to be careful that statements are not based on flawed calculations.


Finally, we start the import, which can take several hours depending on the amount of data. The status of the import can also be tracked in the workspace, and individual time periods can be reloaded at any time.

Analysis and fine tuning

Once the data import is complete, we can take a first look at our results. The main goal of Metrics Advisor is to analyze and detect new anomalies – that is, to investigate whether the most recent data point is an anomaly or not. Nevertheless, historical data is also taken into consideration. Depending on the granularity, the service looks several hours to years into the past and tries to flag anomalies there as well. In our case (five years of data, weekly aggregation), the so-called Smart Detection provides results for the past six months and marks individual points in time as an anomaly (Fig. 2).

Fig. 2: Data visualization including anomalies

Now it is time to take a look at the suggestions: Are the identified anomalies actually relevant? Is the detection too sensitive or too tolerant? There are some ways to improve the detection rate. Let’s recall the beginning of this article: The big challenge is to define what constitutes an anomaly in the first place.

You will probably notice a prominently placed slider in the workspace very quickly. With it we can control the sensitivity. The higher the value, the smaller the area containing normal points. We also get these limits visualized within the charts as a light blue area. Sometimes it is useful not to warn at the first occurrence of an anomaly, but only when several anomalies have been detected over a period of time. We can configure Metrics Advisor to look at a certain number of points retrospectively and not consider an anomaly until a certain percentage of those points have been detected as an anomaly. For example, a brief performance problem should be tolerated, but if 70 percent of the readings have been detected as an anomaly in the last 15 minutes, it should be considered a problem overall.


Depending on the use case, it may make sense to supplement Smart Detection with manual rules. A Hard Threshold can be used to define a lower or upper limit, or a range of values, that should be considered an anomaly. The Change Threshold offers the previously mentioned possibility of evaluating a percentage change relative to one or more predecessor points as an anomaly. The way the different rules are linked (AND/OR) also influences the detection: for example, an anomaly should only be reported if Smart Detection strikes and the value is above 30. We can compose and name several of these configurations. In addition, it is possible to store special rules for individual dimensions.
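
Conceptually, these manual rules boil down to very simple checks. The following Python sketch is only an illustration of what a hard threshold and a change threshold evaluate – it is not the Metrics Advisor API, and the example values are arbitrary.

def hard_threshold_anomaly(value, lower=None, upper=None):
  """Anomaly if the value leaves the configured range."""
  return (lower is not None and value < lower) or \
         (upper is not None and value > upper)

def change_threshold_anomaly(current, previous, max_change_pct=10.0):
  """Anomaly if the value changed by more than max_change_pct percent."""
  if previous == 0:
    return current != 0
  change_pct = abs(current - previous) / abs(previous) * 100
  return change_pct > max_change_pct

# Example: combine both rules, similar to linking rules with AND/OR
value, previous = 36.0, 30.0
is_anomaly = hard_threshold_anomaly(value, upper=30) and \
             change_threshold_anomaly(value, previous, max_change_pct=10)
print(is_anomaly)  # True: above 30 and changed by more than 10 percent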

Depending on our settings, more or less anomalies are detected in the data, and the Metrics Advisor then tries to convert them into so-called incidents. An incident can consist of a single anomaly, but is often made up of related anomalies and thus entire time periods are listed under a common incident. Tools are available in the Incident Hub for closer examination: We can filter the found incidents (by time, criticality, and dimension), start an initial automatic root cause analysis (see “Root cause” in Fig. 3), and drill down through multiple dimensions to gather insights.

Fig. 3: Incident analysis

Feedback

Perhaps the greatest benefit of using artificial intelligence for anomaly detection is the ability to learn through feedback. Even if sensitivity and thresholds have been set well, sometimes the service will get it wrong. However, it is for these data points that feedback can then be provided via the API or portal: Where was an anomaly incorrectly detected? Where was an anomaly missed? The service accepts this feedback and tries to assign similar cases more correctly in the future. The service also tries to recognize time periods – and can be proven wrong if we mark a time range and report it as a period.

For predictable anomalies that have temporal reasons (holidays, weekends, cyclically recurring events), there are separate options for configuration. These should therefore not be reported subsequently as feedback, but stored as so-called preset events.

Alerts

We should now be at a point where data is imported regularly and anomaly detection hopefully works reliably. However, the best detection is of no use if we learn about it too late. Therefore, Alert Configurations should be set up to actively notify about anomalies. Currently, there are three channels to choose from: via email, as a WebHook, or as a ticket in Azure DevOps. The WebHook variant in particular offers exciting possibilities for integration: we can display the detected anomalies in our own application or trigger a workflow using Azure Logic Apps. Perhaps we simply restart the affected web app as a first automated action.
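To illustrate the WebHook variant, here is a minimal Flask sketch of an endpoint that receives an alert and triggers a follow-up action. The field names in the payload and the restart_web_app helper are assumptions made for illustration; the actual alert schema is defined by Metrics Advisor.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def restart_web_app(resource_id: str) -> None:
    # Hypothetical follow-up action, e.g. a call to an infrastructure API or Logic App
    print(f"Restarting {resource_id} ...")

@app.route("/metrics-advisor-alert", methods=["POST"])
def handle_alert():
    alert = request.get_json(force=True)
    # The field names below are assumptions; consult the actual alert payload
    print("Received alert:", alert.get("alertId"), alert.get("timestamp"))
    restart_web_app("my-web-app")
    return jsonify({"status": "received"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```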

Snooze settings also seem handy; an alert can automatically ensure that no more alerts are sent for a configurable period of time afterwards. This avoids waking up in the morning with 500 emails in your inbox, all with the same content.

Summary

Metrics Advisor provides an exciting and easy entry into the world of anomaly detection for time-based data. Long-time data scientists may prefer other ways and means (and may be interested in the paper at [4]), but for application developers who want to run their first experiments with suitable data, this service is a potent gateway drug. The preview status currently shows mainly in the web portal and in the patchy documentation of the REST API; good conceptual documentation, however, is already available.

Have fun trying it out and experimenting with your own data sources.

Links & Literature

[1] https://trends.google.com

[2] https://docs.microsoft.com/en-us/azure/cognitive-services/metrics-advisor/

[3] https://metricsadvisor.azurewebsites.net

[4] https://arxiv.org/abs/1906.03821

 

The post Anomaly Detection as a Service with Metrics Advisor appeared first on ML Conference.

]]>
Tools & Processes for MLOps https://mlconference.ai/blog/tools-and-processes-for-mlops/ Wed, 26 May 2021 10:47:53 +0000 https://mlconference.ai/?p=81706 Training a machine learning model is getting easier. But building and training the model is also the easy part. The real challenge is getting a machine learning system into production and running it reliably. In the field of software development, we have gained a significant insight in this regard: DevOps is no longer just nice to have, but absolutely necessary. So why not use DevOps tools and processes for machine learning projects as well?

The post Tools & Processes for MLOps appeared first on ML Conference.

]]>
When we want to use our familiar tools and workflows from software development for data science and machine learning projects, we quickly run into problems. Data science and machine learning model building follow a different process than the classic software development process, which is fairly linear.

When I create a branch in software development, I have a clear goal in mind of what the outcome of that branch will be: I want to fix a bug, develop a user story, or revise a component. I start working on this defined task. Then, once I upload my code to the version control system, automated tests run – and one or more team members perform a code review. Then I usually do another round to incorporate the review comments. When all issues are fixed, my branch is integrated into the main branch and the CI/CD pipeline starts running; a normal development process. In summary, the majority of the branches I create are eventually integrated in and deployed to a production environment.

In the area of machine learning and data science, things are different. Instead of a linear and almost “mechanical” development process, the work here is very much driven by experiments. Experiments can fail; that is the nature of an experiment. I also often start an experiment precisely with the goal of disproving a thesis. Every training run of a machine learning model is an experiment: an attempt to achieve certain results with a specific model, algorithm configuration, and data set. If, for a better overview, we managed each of these experiments in a separate branch, we would very quickly end up with a great many branches. Since the majority of my experiments will not produce the desired result, I will discard many of them, and only a few will ever make it into a production environment. Still, I want an overview of which experiments I have already run and what their results were, so that I can reproduce and reuse them in the future.

But that’s not the only difference between traditional software development and machine learning model development. Another difference is behavior over time.

ML models deteriorate over time

Classic software works just as well after a month as it did on day one. Of course, there may be changes in memory and computational capacity requirements, and of course bugs will occur, but the basic behavioral characteristics of the production software do not change. With machine learning models, it’s different. For these, the quality decreases over time. A model that operates in a production environment and is not re-trained will degrade over time and never achieve as good a predictive accuracy as it did on day one.

Concept drift is to blame [1]. The world outside our machine learning system changes and so does the data that our model receives as input values. Different types of concept drift occur: data can change gradually, for example, when a sensor becomes less accurate over a long period of time due to wear and tear and shows an ever-increasing deviation from the actual measured value. Cyclical events such as seasons or holidays can also have an effect if we want to predict sales figures with our model.

But concept drift can also occur very abruptly: If global air traffic is brought to a standstill by COVID-19, then our carefully trained model for predicting daily passenger traffic will deliver poor results. Or if the sales department launches an Instagram promotion without notice that leads to a doubling of buyers of our vitamin supplement, that’s a good result, but not something our model is good at predicting.

There are two ways to counteract this deterioration in prediction quality: either we enable our model to actively retrain itself in the production environment, or we update our model frequently, ideally as often as we possibly can. We may also have made a necessary adjustment to an algorithm or introduced a new model that needs to be rolled out as quickly as possible.

So in our machine learning workflow, our goal is not just to deliver models to the user. Instead, our goal must be to build infrastructure that quickly informs our team when a model is providing incorrect predictions and enables the team to lift a new, better model into production environments as quickly as possible.

MLOps as DevOps for Machine Learning

We have seen that data science and machine learning model building require a different process than traditional, “linear” software development. It is also necessary that we achieve a high iteration speed in the development of machine learning models, in order to counteract concept drift. For this reason, it is necessary that we create a machine learning workflow and a machine learning platform to help us with these two requirements. This is a set of tools and processes that are to our machine learning workflow what DevOps is to software development: A process that enables rapid but controlled iteration in development supported by continuous integration, continuous delivery, and continuous deployment. This allows us to quickly and continuously bring high-quality machine learning systems into production, monitor their performance, and respond to changes. We call this process MLOps [2] or CD4ML (Continuous Delivery for Machine Learning) [3].

MLOps also provides us with other benefits: Through reproducible pipelines and versioned data, we create consistency and repeatability in the training process as well as in production environments. These are necessary prerequisites to implement business-critical ML use cases and to establish trust in the new technology among all stakeholders.

In the enterprise environment, we have a whole set of requirements that need to be implemented and adhered to in addition to the actual use case. There are privacy, data security, reproducibility, explainability, non-discrimination, and various compliance policies that may differ from company to company. If we leave these additional challenges for each team member to solve individually, we will create redundant, inconsistent and simply unnecessary processes. A unified machine learning workflow can provide a structure that addresses all of these issues, making each team member’s job easier.

Due to the experimental and iterative nature of machine learning, each step in the process that can be automated has a significant positive impact on the overall run time of the process from data to productive model. A machine learning platform allows data scientists and software engineers to focus on the critical aspects of the workflow and delegate the routine tasks to the automated workflows. But what sub-steps and tools can a platform for MLOps be built from?

Components of an MLOps pipeline

An MLOps workflow can be roughly divided into three areas:

  1. Data pipeline and feature management
  2. Experiment management and model development
  3. Deployment and monitoring

In the following, I describe the individual areas and present a selection of tools suitable for implementing the workflow. This selection is, of course, neither exhaustive nor representative: the entire landscape is evolving so rapidly that any overview can only be a snapshot.

Fig. 1: MLOps workflow

Data pipeline and feature management

As hackneyed as slogans like “data is the new oil” may seem, they have a kernel of truth: The first step in any machine learning and data science workflow is to collect and prepare data.

Centralized access to raw data

Companies with modern data warehouses or data lakes have a distinct advantage when developing machine learning products. Without a centralized point to collect and store raw data, finding appropriate data sources and ensuring access to that data is one of the most difficult steps in the lifecycle of a machine learning project in larger organizations.

Centralized access can be implemented here in the form of a Hadoop-based platform. However, for smaller data volumes, a relational database such as Postgres [4] or MySQL [5], or a document database from the Elastic Stack [6], is also perfectly adequate. The major cloud providers also offer their own products for centralized raw data management: Amazon Redshift [7], Google BigQuery [8], or Microsoft Azure Cosmos DB [9].

In any case, it is necessary that we first archive a canonical form of our original data before applying any transformation to it. This gives us an unmodified dataset of original data that we can use as a starting point for processing.
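As a minimal sketch of this “archive first, transform later” principle, the raw export is written unchanged to a date-partitioned location before any processing starts. The paths and file names are placeholders.

```python
from datetime import date
from pathlib import Path
import pandas as pd

RAW_ZONE = Path("data/raw")  # placeholder for a bucket or data lake path

def archive_raw(source_csv: str) -> Path:
    """Store an unmodified copy of the original data, partitioned by ingestion date."""
    df = pd.read_csv(source_csv)
    target = RAW_ZONE / f"ingestion_date={date.today().isoformat()}"
    target.mkdir(parents=True, exist_ok=True)
    out_file = target / "export.parquet"
    df.to_parquet(out_file, index=False)  # requires pyarrow or fastparquet
    return out_file

archived = archive_raw("daily_export.csv")
print(f"Raw data archived at {archived}")
```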

Even at this point in the workflow, it is important to rely on good documentation and to document the sources of the data, its meaning, and where it is stored. Even though this step seems simple, it is still of utmost importance. Invalid data, the wrong naming of a column of data, or a misconfigured scraping job can lead to a lot of frustration and wasted time.

Data Transformation

Rarely will we train our machine learning model directly on raw data. Instead, we generate features from the raw data. In the context of machine learning, a feature is one or more processed data attributes that an algorithm uses to make predictions. This could be, for example, a temperature value or, in deep learning applications, a highly abstract feature extracted from images. To extract features from raw data, we apply various transformations. We typically define these transformations in code, although some ETL tools also allow them to be defined via a graphical interface. Our transformations are either run as batch jobs on larger sets of data, or defined as long-running streaming applications that continuously transform data.

We also need to split our dataset into training and test datasets. To train a machine learning model, we need a set of training data. To test how our model performs on previously unseen data, we need a second, structurally identical set of test data. We therefore split our transformed data set into two parts that must not overlap, i.e. the same data point must not appear in both sets. A common split is to use 70 percent of the dataset for training and 30 percent for testing the model.

The exact split of the data sets depends on the context. For time-series data, sequential slices from the series should be chosen, while for image processing, random images from the data set should be chosen since they have no sequential relation to each other.
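A minimal scikit-learn sketch of a reproducible 70/30 split; for time-series data, sequential slices (for example via TimeSeriesSplit) should be used instead. The file path and column names are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("features.csv")                  # placeholder path
X, y = df.drop(columns=["target"]), df["target"]  # placeholder column names

# Reproducible 70/30 split for non-sequential data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, shuffle=True
)

# For time-series data, keep the temporal order instead:
# from sklearn.model_selection import TimeSeriesSplit
# splitter = TimeSeriesSplit(n_splits=5)
```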

For non-sequential data, the individual data points can also be placed in a (pseudo-)random order. We also want to perform this process in a reproducible and automated manner rather than manually. A pleasantly usable tool for management and coordination here is Apache Airflow [10]. Here, according to the “pipeline as code” principle, one can define various pipelines in the form of a data flow graph, connect a wide variety of systems, and thus perform the desired transformations.
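A minimal Apache Airflow sketch of such a pipeline defined as code: three tasks form a small data flow graph. The task bodies are placeholders that would call the actual transformation and split logic.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_raw_data():
    print("loading raw data ...")                      # placeholder

def build_features():
    print("transforming raw data into features ...")   # placeholder

def split_dataset():
    print("creating train/test split ...")             # placeholder

with DAG(
    dag_id="feature_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_raw_data", python_callable=extract_raw_data)
    transform = PythonOperator(task_id="build_features", python_callable=build_features)
    split = PythonOperator(task_id="split_dataset", python_callable=split_dataset)

    extract >> transform >> split
```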

Feature repositories

Many machine learning models and systems within a company use the same or at least similar features. Once we have extracted a feature from our raw data, there is a high probability that it will be useful for other applications as well. It therefore makes sense not to re-implement feature extraction for every application. Instead, we can store known features in a feature store. This can be either a dedicated component (such as Feast [11]) or a set of well-documented database tables populated by appropriate transformations; these transformations can in turn be orchestrated automatically with Apache Airflow.
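For the simpler variant, well-documented database tables, feature retrieval can look as plain as the following sketch; the connection string, table, and column names are placeholders, and a dedicated store such as Feast would replace the SQL with its own API.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; the table is maintained by a scheduled transformation job
engine = create_engine("postgresql://user:password@feature-db:5432/features")

def load_customer_features() -> pd.DataFrame:
    """Fetch precomputed, documented features shared across several ML applications."""
    return pd.read_sql(
        "SELECT customer_id, avg_order_value, days_since_last_order FROM customer_features",
        engine,
    )

features = load_customer_features()
print(features.head())
```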

Data versioning

In addition to code versioning, data versioning is useful in a machine learning context. This allows us to increase the reproducibility of our experiments and to validate our models and their predictions by retracing the exact state of a training dataset that was used at a given time. Tools such as DVC [12] or Pachyderm [13] can be used for this purpose.

Experiment management and model development

In order to deploy an optimal model into production, we need a process that enables the development of that model. To do this, we must capture and visualize the information needed to decide which model is the optimal one, since in most cases this decision is made by a human rather than automatically.

Since the data science process is very experiment-driven, multiple experiments are run in parallel, often by different people at the same time. And most will not be deployed in a production environment. The experimental approach in this phase of research is very different from the “traditional” software development process, as we can expect that the code for these experiments will be discarded in the majority of cases, and only some experiments will reach a production status.

Experiment management and visualizations

Running hundreds or even thousands of iterations on the way to an optimally trained ML model is not uncommon. Along the way, a considerable number of parameters that define each experiment, together with the results of those experiments, accumulate. Often, this metadata is stored in Excel spreadsheets or, in the worst case, only in the heads of team members. However, to ensure reproducibility, avoid time-consuming duplicate experiments, and enable effective collaboration, this data should be captured automatically. Possible tools here are MLflow Tracking [14] or Sacred [15]. To visualize the output metrics, either classic dashboards such as Grafana [16] or specialized tools such as TensorBoard [17] can be used. TensorBoard can be used for this purpose independently of TensorFlow; PyTorch, for example, provides a compatible logging library [18]. There is still much room for optimization and experimentation here: combining other tools from the DevOps environment such as Jenkins [19] and Terraform [20] would also be conceivable.
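A minimal MLflow Tracking sketch: each training run logs the parameters that define it and the resulting metrics so that experiments can later be found, compared, and reproduced. The experiment name, parameters, and metric values are placeholders.

```python
import mlflow

mlflow.set_experiment("churn-prediction")  # placeholder experiment name

with mlflow.start_run(run_name="random-forest-baseline"):
    # Parameters that define this experiment
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("feature_set", "v3")

    # ... training happens here ...

    # Resulting metrics of the experiment
    mlflow.log_metric("val_accuracy", 0.91)
    mlflow.log_metric("val_auc", 0.87)
```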

Version control for models

In addition to the results of our experiments, the trained models themselves can also be captured and versioned. This allows us to more easily roll back to a previous model in a production environment. Models can be versioned in several ways: In the simplest variant, we export our trained model in serialized form, for example as a Python .pkl file. We then record this in a suitable version control system (Git, DVC), depending on its size.

Another option is to provide a central model registry. For example, the MLflow Model Registry [21] or the model registry of one of the major cloud providers can be used here. Also, the model can be packaged in a Docker container and managed in a private Docker Registry [22].
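A sketch of both variants using a small example model: the trained model is either serialized locally and committed to Git or DVC, or logged to the MLflow Model Registry under a versioned name (which assumes a tracking server with a registry backend). The model, data, and names are placeholders.

```python
import joblib
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Variant 1: serialize the model and record the file in Git or DVC
joblib.dump(model, "model_v1.pkl")

# Variant 2: log the model to the MLflow Model Registry under a versioned name
# (requires a tracking server with a registry backend)
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="churn-prediction",
    )
```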

Infrastructure for distributed training

Smaller ML models can usually still be trained on one's own laptop with reasonable effort. However, as soon as the models become larger and more complex, this is no longer feasible, and a central on-premises or cloud server becomes necessary for training. For automated training in such an environment, it makes sense to build a model training pipeline.

This pipeline is executed with training data at specific times or on demand. It receives the configuration data that defines a training cycle, for example the model type, the hyperparameters, and the features to be used. The pipeline can obtain the training data set automatically from the feature store and distribute it to all model variants to be trained in parallel. Once training is complete, the model files, the original configuration, the learned parameters, and metadata and timings are captured in the experiment and model tracking tools. One possible tool for building such a pipeline is Kubeflow [23], which provides a number of useful features for automated training and for (cost-)efficient resource management based on Kubernetes.
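A framework-agnostic sketch of the configuration-driven training step such a pipeline executes; in a real setup this function would run as a pipeline component (for example in Kubeflow) and the configuration would come from the pipeline trigger. All names and values are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingConfig:
    model_type: str
    hyperparameters: dict = field(default_factory=dict)
    features: list = field(default_factory=list)

def run_training(config: TrainingConfig) -> dict:
    """One training cycle: fetch features, train, and return artifacts plus metadata."""
    print(f"Fetching features {config.features} from the feature store ...")
    print(f"Training a {config.model_type} model with {config.hyperparameters} ...")
    # ... actual training would happen here ...
    return {
        "model_path": "artifacts/model.pkl",   # placeholder artifact location
        "config": config,
        "metrics": {"val_auc": 0.87},          # would be logged to the tracking tool
    }

result = run_training(TrainingConfig(
    model_type="gradient_boosting",
    hyperparameters={"learning_rate": 0.05, "n_estimators": 300},
    features=["avg_order_value", "days_since_last_order"],
))
```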

Deployment and monitoring

Unless our machine learning project is purely a proof of concept or an academic exercise, we will eventually need to lift our model into a production environment. And that's not all: once it gets there, we need to monitor it and deploy new versions as needed. In many cases, we will also have not just one model, but a whole set of models in our production environments.

Deploy models

On a technical level, any model training pipeline must produce an artifact that can be deployed into a production environment. The prediction results may be bad, but the model itself must be in a packaged state that allows it to be deployed directly into a production environment. This is a familiar idea from software development: continuous delivery. This packaging can be done in two different ways.

Either our model is deployed as a separate service and accessed by the rest of our systems via an interface. Here, the deployment can be done, for example, with TensorFlow Serving [24] or in a Docker container with a matching (Python) web server.
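A minimal sketch of the separate-service variant: a small Flask web server loads the serialized model and exposes a prediction endpoint, typically packaged in a Docker container. The file name, route, and payload format are assumptions.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")  # artifact produced by the training pipeline

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)   # e.g. {"features": [3.2, 1.0, 7.5]}
    features = [payload["features"]]
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```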

An alternative way of deployment is to embed it in our existing system. Here, the model files are loaded and input and output data are routed within the existing system. The problem here is that the model must be in a compatible format or a conversion must be performed before deployment. If such a conversion is performed, automated tests are essential. Here it must be ensured that both the original and the converted model deliver identical prediction results.
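A sketch of such an automated equivalence test: predictions of the original and the converted model are compared on a fixed, reproducible sample. Here the “conversion” is simply re-implementing a fitted linear model with NumPy; in practice the second function would wrap, for example, an ONNX runtime session.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=42)
X_train = rng.normal(size=(200, 5))
y_train = X_train @ np.array([1.5, -2.0, 0.3, 0.0, 4.0]) + 0.7

original_model = LinearRegression().fit(X_train, y_train)

def predict_original(X):
    return original_model.predict(X)

def predict_converted(X):
    # "Converted" model: the same learned weights applied outside scikit-learn
    return X @ original_model.coef_ + original_model.intercept_

def test_converted_model_matches_original():
    X_sample = rng.normal(size=(100, 5))    # fixed test inputs
    assert np.allclose(predict_original(X_sample), predict_converted(X_sample), atol=1e-6)
```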

Monitoring

A data science project does not end with the deployment of a model into production. Even a model in production faces many challenges. The distribution of my input values may be different in the real world than the one represented in the training data. Value distributions can also change slowly over time or due to singular events, which then requires retraining with the changed data.

Also, despite intensive testing, errors may have crept in during the previous steps. For this reason, infrastructure should be provided to continuously collect data on model performance. The input values and the resulting predictions of the model should also be recorded, as far as this is compatible with the applicable data protection regulations. On the other hand, if privacy considerations are only introduced at this point, one has to ask how a sufficient amount of training data could be collected without questionable privacy practices.
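One simple way to detect such a shift in the input distribution is to compare the live values of a feature against the training distribution with a two-sample Kolmogorov-Smirnov test. The threshold and the synthetic data below are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, p_threshold=0.01):
    """Flag a feature as drifted if its live distribution differs significantly
    from the distribution seen during training."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold, p_value

# Example: the live data has shifted upwards compared to the training data
rng = np.random.default_rng(seed=0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)
live = rng.normal(loc=0.5, scale=1.0, size=1000)

drifted, p = check_feature_drift(train, live)
print(f"drift detected: {drifted} (p={p:.4f})")
```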

Here’s a sampling of basic information we should be collecting about our machine learning system in production:

  • How many times did the model make a prediction?
  • How long does it take the model to perform a prediction?
  • What is the distribution of the input data?
  • What features were used to make the prediction?
  • What results were predicted and what real results were observed later in the system?

Tools such as Logstash [25] or Prometheus [26] can be used to collect our monitoring data. To get a quick overview of the performance of the model, it is recommended to set up a dashboard that visualizes the most important metrics and to set up an automatic notification system that alerts the team in case of strong deviations so that appropriate countermeasures can be taken if needed.
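A minimal sketch with the official prometheus_client library: the prediction service counts requests and measures prediction latency, and exposes both on a metrics endpoint that Prometheus scrapes. The metric names, the port, and the dummy inference are placeholders.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS_TOTAL = Counter(
    "model_predictions_total", "Number of predictions made by the model"
)
PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds", "Time spent computing a prediction"
)

@PREDICTION_LATENCY.time()
def predict(features):
    PREDICTIONS_TOTAL.inc()
    time.sleep(random.uniform(0.01, 0.05))   # stand-in for real inference
    return 0.5

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://<host>:8000/metrics
    while True:
        predict([1.0, 2.0, 3.0])
```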

Challenges on the road to MLOps

On the road to a successful machine learning strategy, companies face numerous staffing challenges within their teams, not least the financial challenge of attracting experienced software engineers and data scientists. But even if we manage to assemble a good team, we need to enable its members to work together in the best possible way and bring out each team member's strengths. Generally speaking, data scientists feel very comfortable with statistical tools, machine learning algorithms, and Jupyter notebooks, but they are often less familiar with the version control software and testing tools that are widely used in software engineering. Software engineers, on the other hand, are familiar with these tools, but often lack the expertise to choose the right algorithm for a problem or to squeeze the last five percent of predictive accuracy out of a model through skillful optimization. Our workflows and processes must be designed to support both groups as well as possible and enable smooth collaboration.

In terms of technological challenges, we face a broad and dynamic technology landscape that is constantly evolving. In light of this confusing situation, we are often faced with the question of how to get started with new machine learning initiatives.

How do I get started with MLOps?

Building MLOps workflows must always be an evolutionary process; it cannot be done in a one-time “big bang” approach. Each company has its own unique set of challenges and needs when developing its machine learning strategy. In addition, there are different users of machine learning systems, different organizational structures, and existing application landscapes. One possible approach is to create an initial prototype without ML components and then start building the infrastructure and the simplest possible models.

From this starting point, the infrastructure created can then be used to move forward in small and incremental steps to more complicated models and lift them into production environments until the desired level of predictive accuracy and performance has been achieved. Short development cycles for machine learning models, in the range of days rather than weeks or even months, enable faster response to changing circumstances and data. However, such short iteration cycles can only be achieved with a high degree of automation.

Developing, putting into production, and keeping machine learning models productive is itself a complex and iterative process with many inherent challenges. Even on a small or experimental scale, many companies find it difficult to implement these processes cleanly and without failures. The data science and machine learning development process is particularly challenging in that it requires careful balancing of the iterative, exploratory components and the more linear engineering components.

The tools and processes for MLOps presented in this article are an attempt to provide structure to these development processes.

Although there are no proven and standardized processes for MLOps yet, we can carry over many lessons learned from “traditional” software engineering. In many teams there is a division between data scientists and software engineers, often handled with a pragmatic approach: data scientists develop a model in a Jupyter notebook and throw that notebook over the fence to software engineering, which then takes the model into production with a DevOps approach.

If we think back a few years, “throwing it over the fence” is exactly the problem that gave rise to DevOps. This is exactly how Werner Vogels (CTO at Amazon) described the separation between development and operations in his famous interview in 2006 [27]: “The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon.” Then came the phrase that looks good on DevOps conference T-shirts, coffee mugs and motivational posters: “You build it, you run it.” As naturally as development and operations belong together today, we must also make the collaboration between data science and DevOps a matter of course.

Links & Literature

[1] Tsymbal, Alexey: “The problem of concept drift. Definitions and related work. Technical Report”: https://www.scss.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf

[2] https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

[3] https://martinfowler.com/articles/cd4ml.html

[4] https://www.postgresql.org

[5] https://www.mysql.com

[6] https://www.elastic.co/de/elasticsearch/

[7] https://aws.amazon.com/redshift/

[8] https://cloud.google.com/bigquery

[9] https://docs.microsoft.com/de-de/azure/cosmos-db/

[10] https://airflow.apache.org

[11] https://feast.dev

[12] https://dvc.org

[13] https://www.pachyderm.com

[14] https://www.mlflow.org/docs/latest/tracking.html

[15] https://github.com/IDSIA/sacred

[16] https://grafana.com

[17] https://www.tensorflow.org/tensorboard

[18] https://pytorch.org/docs/stable/tensorboard.html

[19] https://www.jenkins.io

[20] https://www.terraform.io

[21] https://www.mlflow.org/docs/latest/model-registry.html

[22] https://docs.docker.com/registry/deploying/

[23] https://www.kubeflow.org

[24] https://www.tensorflow.org/tfx/guide/serving

[25] https://www.elastic.co/de/logstash

[26] https://prometheus.io

[27] https://queue.acm.org/detail.cfm?id=1142065

 

The post Tools & Processes for MLOps appeared first on ML Conference.

]]>