Blog | ML Conference - The Conference for Machine Learning Innovation https://mlconference.ai/blog/ The Conference for Machine Learning Innovation Mon, 16 Dec 2024 12:00:22 +0000 en-US hourly 1 https://wordpress.org/?v=6.5.2 Prompt Engineering for Developers and Software Architects https://mlconference.ai/blog/generative-ai-prompt-engineering-for-developers/ Thu, 26 Sep 2024 11:51:15 +0000 https://mlconference.ai/?p=88123 Generative AI models like GPT-4 are transforming software development by enhancing productivity and decision-making.

This guide on prompt engineering helps developers and architects harness the power of large language models.

Learn essential techniques for crafting effective prompts, integrating AI into workflows, and improving performance with embeddings. Whether you're using ChatGPT, Copilot, or another LLM, mastering prompt engineering is key to staying competitive in the evolving world of generative AI.

The post Prompt Engineering for Developers and Software Architects appeared first on ML Conference.

]]>
Small talk with the GPT

GPTs – Generative Pre-trained Transformers – the tool on everyone’s lips, and there probably aren’t any developers left who have not played with it at least once. With the right approach, a GPT can complement and support the work of a developer or software architect.

In this article, I will show tips and tricks that are commonly referred to as prompt engineering; the user input, or “prompt”, of course plays an important role when working with the GPT. But first I would like to give a brief introduction to how a GPT works which will also be helpful when working with it. 

Stay up to date

Learn more about MLCON

 

The stochastic parrot

GPT technology has sent the industry into a tizzy with its promise of providing artificial intelligence that can solve problems independently, many were disillusioned after their first contact. There was much talk of a “stochastic parrot” that was just a better autocomplete function, like a smartphone.

The technology behind the GPTs and our own experiments seem to confirm this. At its core, it’s a neural network, a so-called large language model, which has been given a large number of texts to train so that it now knows which partial words (tokens) should be added to a sentence. The correct tokens are selected based on probabilities. If it’s more than just a sentence starter—maybe a question or even part of a dialogue—the chatbot has already been built.

Now I’m not really an AI expert, I’m a user, but anyone who has ever had an intensive conversation with a more complex GPT will recognize that there must be more to it than that.

An important distinguishing feature between the LLMs is the number of parameters of the neural networks. These are the weights that are adjusted during the learning process. ChatGPT, the OpenAI system, has around 175 billion parameters in version 3.5. In version 4.0, there are already an estimated 1.8 trillion parameters. 

Unfortunately, OpenAI doesn’t have this information openly available, so such information is based on rumors and estimates. The amount of training data also appears to differ between the models by a factor of at least ten. These differences in the size of the models give high quality or low quality answers.

Figure 1 shows a schematic representation of a neural network that uses an AI for the prompt “Draw me a simplified representation of a neural network with 2 hidden layers, each with 4 nodes, 3 input nodes and 2 output nodes. Draw the connections with different widths to symbolize the weights. Use Python. Do not create a title”.

Illustration of a neural network

Fig. 1: Illustration of a neural network

The higher number of parameters and larger database comes at a price, namely 20 dollars for access to ChatGPT+. If you don’t mind the cost, you can also use the web version of Microsoft Copilot or the Copilot app to try out the language model. For use as a helper in software development, however, there is currently no way around the OpenAI version because it offers additional functionality, as we will see.

More than a neural network

If we take a closer look at ChatGPT, it quickly becomes clear that it is much more than a neural network. Even without knowing the exact architecture, we can see that the textual processing alone is preceded by several steps such as natural language processing (Fig. 2). On the Internet, there is also a reference to the aptly named Mixture of Experts, the use of several specialized networks depending on the task.

Rough schematic representation of ChatGPT

Fig. 2: Rough schematic representation of ChatGPT

Added to this is multimodality, the ability to interact not only with text, but also with images, sounds, code and much more. The use of plug-ins such as the code interpreter in particular opens up completely new possibilities for software development.

Instead of answering a calculation such as “What is the root of 12345?” from the neural network, the model can now pass it to the code interpreter and receive a correct answer, which it then reformulates to suit the question.

Context, context, context

The APIs behind the chat systems based on LLMs are stateless. This means that the entire session is passed to the model with each new request. Once again, the models differ in the amount of context they can process and therefore in the length of the session.

As the underlying neural network is fully trained, there are only two approaches for feeding a model with special knowledge and thus adapting it to your own needs. One approach is to fill the context of the session with relevant information at the beginning, which the model then includes in its answers. 

The context of the simple models is 4096 or 8192 tokens. A token corresponds to one or a few characters. ChatGPT estimates that a DIN A4 page contains approximately 500 tokens. The 4096 tokens therefore correspond to about eight typed pages. 

So, if I want to provide a model with knowledge, I have to include this knowledge in the context. However, the context fills up quickly, leaving no room for the actual session.

The second approach is using embeddings. This involves breaking down the knowledge that I want to give the model into smaller blocks (known as chunks). These are then embedded in a vector space based on the meaning of their content via vectors. Depending on the course of the session, a system can now search for similar blocks in this vector space via the distance between the vectors and insert them into the context.

This means that even with a small context, the model can be given large amounts of knowledge quite accurately.

Explore Generative AI Innovation

Generative AI and LLMs

Knowledge base

The systems differ, of course, in the knowledge base, the data used for learning. When we talk about open-source software with the model, we can fortunately assume that most of these systems have been trained with all available open-source projects. Closed source software is a different story. Such differences in the training data also explain why the models can handle some programming languages better than others, for example.

The complexity of these models—the way they process input and access the vast knowledge of the world—leads me to conclude that the term ‘stochastic parrot’ is no longer accurate. Instead, I would describe them as an ‘omniscient monkey’ that, while not having directly seen the world, has access to all information and can process it.

Prompt techniques

Having introduced the necessary basics, I would now like to discuss various techniques for successful communication with the system. Due to the hype caused by ChatGPT, there are many interesting references to prompt techniques in social media, but not all of them are useful for software development (i.e. answer in role x) or do not use the capabilities of GPT-4. 

OpenAI itself has published some tips for prompt engineering, but some of them are aimed at using the API. Therefore, I have compiled a few tips here that are useful when using the ChatGPT-4 frontend. Let’s start with a simple but relatively unknown technique.

Context marker

As we have seen, the context that the model holds in its short-term memory is limited. If I now start a detailed conversation, I run the risk of overfilling the context. The initial instructions and results of the conversation are lost, and the answers have less and less to do with the actual conversation.

To easily recognize the overflow of context, I start each session with the simple instruction: “start each reply with “>””. ChatGPT formats its responses in Markdown, so this response includes the first paragraph as a quote, indicated by a dash to the left of the paragraph. If the conversation runs out of context, the model may forget this formatting instruction, which quickly becomes noticeable.

Use of the context marker

Fig. 3: Use of the context marker

However, this technique is not always completely reliable, as some models summarize their context independently, which compresses it. The instruction is then usually retained, even though parts of the context have already been compressed and are therefore lost.

Priming – the preparation

After setting the context marker, a longer session begins with priming, i.e. preparing the conversation. Each session starts anew. The system does not know who is sitting in front of the screen or what was discussed in the last sessions. Accordingly, it makes sense to prepare the conversation by briefly telling the machine who I am, what I intend to do, and what the result should look like.

I can store who I am in the Custom Instructions in my profile at ChatGPT. In addition to the knowledge about the world stored in the neural network, they form a personalized long-term memory.

If I start the session with, for example, “I am an experienced software architect in the field of web development. 

My preferred programming language is Java or Groovy. JavaScript and corresponding frameworks are not my thing. I only use JavaScript minimally,” the model knows that it should offer me Java code rather than C# or COBOL.

I can also use this to give the model a few hints that it should keep responses brief. My personalized instructions for ChatGPT are:

  • Provide accurate and factual answers
  • Provide detailed explanations
  • No need to disclose you are an AI, e. g., do not answer with ‘As a large language model…’ or ‘As an artificial intelligence…’
  • Don’t mention your knowledge cutoff
  • Be excellent at reasoning
  • When reasoning, perform step-by-step thinking before you answer the question
  • If you speculate or predict something, inform me
  • If you cite sources, ensure they exist and include URLs at the end
  • Maintain neutrality in sensitive topics
  • Also explore out-of-the-box ideas
  • In the following course, leave out all politeness phrases, answer briefly and precisely.

Long-term memory

This approach can also be used for instructions that the model should generally follow. For example, if the model uses programming approaches or libraries that I don’t want to use, I can tell the model this in the custom instructions and thus optimize it for my use.

Speaking of long-term memory: If I work a lot with ChatGPT, I would also like to be able to access older sessions and search through them. However, this is not directly provided in the front end. 

However, there is a trick that makes it work. In the settings, under the item Data Controls, there is a function for exporting the data. 

If I activate the function, after a short time I receive an export with all my chat histories as a JSON file, which is displayed in an HTML document. This allows me to search in the history using Ctrl + F.

Build context with small talk

When using a search engine, I usually only use simple, unambiguous terms and hope that they are enough to find what I am looking for.

When chatting with the AI ​​model, I was initially tempted to ask short, concise questions, ignoring the fact that the question is in a context that only exists in my head. For some questions, this may work, but for others the answer is correspondingly poor, and the user is quick to blame the quality of the answer on the “stupid AI.”

I now start my sessions with small talk to build the necessary context. For example, before I try to create an architecture using ChatGPT, I ask if the model knows the arc42 template and what AsciiDoc is (I like to write my architectures in AsciiDoc). The answer is always the same, but it is important because it builds the context for the subsequent conversation.

In this small talk, I will also explain what I plan to do and the background to the task to be completed. This may feel a bit strange at first, since I am “only” talking to a machine, but it actually does improve the results.

Page change – Flipped Interaction

The simplest way to interact with the model is to ask it questions. As a user, I lead the conversation by asking questions. 

Things get interesting when I switch sides and ask ChatGPT to ask me questions! This works surprisingly well as seen in Fig. 4 . Sometimes the model asks the questions one after the other, sometimes it responds with a whole block of questions, which I can then answer individually, and follow-up questions are also allowed.

Unfortunately, ChatGPT does not automatically come up with the idea of ​​asking follow-up questions. That is why it is sometimes advisable to add a, “Do you have any more questions?” to the prompt, even when the model is given very sophisticated and precise tasks.

Page change

Fig. 4: Page change

 

Give the model time to think

More complex problems require more complex answers. It’s often useful to break a larger task down into smaller subtasks. Instead of creating a large, detailed prompt that outlines the entire task for the model, I first ask the model to provide a rough structure of the task. Then, I can prompt it to formulate each step in detail (Fig. 5)

Software engineers often use this approach in software design even without AI, by breaking a problem down into individual components and then designing these components in more detail. So why not do the same when dealing with an AI ​​model?

This technique works for two reasons: first, the model creates its own context to answer the question. Second, the model has a limit on the length of its output, so it can’t solve a complex task in a single step. However, by breaking the task into subtasks, the model can gradually build a longer and more detailed output.

Give the model time to think

Fig. 5: Give the model time to think

Chain of Thought – the chain of thoughts

A similar approach is to ask the model to first formulate the individual steps needed to solve the task and then to solve the task.

The order is important. I’m often tempted to ask the model to solve the problem first and then explain how it arrived at the solution. However, by guiding the model to build a chain of thought in the first step, the likelihood of arriving at a good solution in the second step increases.

Rephrase and Respond

Or in English: “Rephrase the question, expand it, and answer it.” This asks the model to improve the prompt itself before it is processed.

The integration of the image generation module DALL-E into ChatGPT has already shown that this works. DALL-E can only handle English input and requires detailed image descriptions to produce good results. When I ask ChatGPT to generate an image, ChatGPT first creates a more detailed prompt for DALL-E and translates the actual input into English.

For example, “Generate an image of a stochastic parrot with a positronic brain” first becomes the translation “a stochastic parrot with a positronic brain” and then the detailed prompt: “Imagine a vibrant, multi-hued parrot, each of its feathers revealing a chaotic yet beautiful pattern indicative of stochastic art. 

The parrot’s eyes possess a unique mechanical glint, a hint of advanced technology within. Revealing a glimpse into his skull uncovers a complex positronic brain, illuminated with pulsating circuits and shimmering lights. The surrounding environment is filled with soft-focus technology paraphernalia, sketching a world of advanced science and research,” which then becomes a colorful image (Fig. 6).

This technique can also be applied to any other prompt. Not only does it demonstrably improve the results, but as a user I also learn from the suggestions on how I can formulate my own prompts more precisely in the future.

The stochastic parrot

Fig. 6: The stochastic parrot

Session Poisoning

A negative technique is ‘poisoning’ the session with incorrect information or results. When working on a solution, the model might give a wrong answer, or the user and the model could reach a dead end in their reasoning.

With each new prompt, the entire session is passed to the model as context, making it difficult for the model to distinguish which parts of the session are correct and relevant. As a result, the model might include the incorrect information in its answer, and this ‘poisoned’ context can negatively impact the session

In this case, it makes sense to end the session and start a new one or apply the next technique.

Stay up to date

Learn more about MLCON

 

Iterative improvement

Typically, each user prompt is followed by a response from the model. This results in a linear sequence of questions and answers, which continually builds up the session context.

User prompts are improved through repetition and rephrasing, after which the model provides an improved answer. The context grows quickly and the risk of session poisoning increases.

To counteract this, the ChatGPT frontend offers two ways to iteratively improve the prompts and responses without the context growing too quickly (Fig. 7).

Elements for controlling the context flow

Fig. 7: Elements for controlling the context flow

On the one hand, as a user, I can regenerate the model’s last answer at any time and hope for a better answer. On the other hand, I can edit my own prompts and improve them iteratively.

This even works retroactively for prompts that occurred long ago. This creates a tree structure of prompts and answers in the session (Fig. 8), which I as the user can also navigate through using a navigation element below the prompts and answers.

Context flow for iterative improvements

Fig. 8: Context flow for iterative improvements

This allows me to work on several tasks in one session without the context growing too quickly. I can prevent the session from becoming poisoned by navigating back in the context tree and continuing the session at a point where the context was not yet poisoned.

Conclusion

The techniques presented here are just a small selection of the ways to achieve better results when working with GPTs. The technology is still in a phase where we, as users, need to experiment extensively to understand its possibilities and limitations. But this is precisely what makes working with GPTs so exciting.

The post Prompt Engineering for Developers and Software Architects appeared first on ML Conference.

]]>
Art and creativity with AI https://mlconference.ai/blog/art-and-creativity-with-ai/ Mon, 29 Jul 2024 14:04:22 +0000 https://mlconference.ai/?p=87968 Thanks to artificial intelligence, there are no limits to your creativity. Programs like Vecentor or Mann-E, developed by Muhammadreza Haghiri, make it easy to create images, vector graphics, and illustrations using AI. In this article, explore how machine learning and generative models like GPT-4 are transforming art, from AI-generated paintings to music and digital art. Stay ahead in the evolving world of AI-driven creativity and discover its impact on the creative process.

The post Art and creativity with AI appeared first on ML Conference.

]]>
devmio: Hello Muhammadreza, it’s nice to catch up with you again and see what you’ve been working on. What inspired you to create vecentor after creating Mann-E?

Muhammadreza Haghiri: I am enthusiastic about everything new, innovative and even game-changing. I had more use-cases for my generative AI in my mind but I needed a little motivation to bring them to the real world.
One of my friends, who’s a talented web developer, once asked me about vector outputs in Mann-E. I told her it’s not possible, but with a little research and development, we did it. We could combine different models and then, create the breakthrough platform.

devmio: What are some of the biggest lessons you’ve learned throughout your journey as an AI engineer?

Muhammadreza Haghiri: This was quite a journey for me and people who joined me. Learned a lot, and the most important one is that infrastructure is all you need. living in a country where infrastructure isn’t as powerful and humongous as USA or China, we usually stop at certain points.
Although I personally made efforts to get past those points and make my business bigger and better, even with the limited infrastructure we have here.

Stay up to date

Learn more about MLCON

 

devmio: What excites you most about the future of AI, beyond just the art generation aspects?

Muhammadreza Haghiri: AI is way more than the generative field we know and love. I wrote a lot of AI apps way before Mann-E and Vecentor. Such as ALPNR (Automated License Plate Number Recognition) proof-of-concept for Iranian license plates, American and Persian sign language translators, OSS Persian OCR, etc.
But in this new advanced field, I see a lot of potentials. Specially with these new methods such as function calling, we easily can do a lot of stuff such as making personal/home assistants, AI powered handhelds, etc.

Updates on Mann-E

devmio: Since our last conversation, what kind of updates and upgrades for Mann-E have you been working on?

Muhammadreza Haghiri: Mann-E is now having a new model (no SD anymore, but heavily influenced by SD), generating better images and we’re getting closer to Midjourney. To be honest, in eyes of most of our users, our outputs were much better than Dall-E 3 and Midjourney.
We have one more rival to fight (according to the feedback from users) and that is Ideogram. One thing we’ve done is that we’ve added an LLM improvement system for user prompts!

devmio: How does Mann-E handle complex or nuanced prompts compared to other AI models?
Are there any plans to incorporate user feedback into the training process to improve Mann-E’s generation accuracy?

Muhammadreza Haghiri: As I said in the previous answer, we now have an LLM in the middle of user and model (you have to check its checkbox by the way) and it takes your prompt, processes it, gives it to the model and boom, you have results even better that Midjourney!

P.S: I mention Midjourney a lot, since most of Iranian investors expected us to be exactly like current version of midjourney when even SD 1.5 was a new thing, this is why Midjourney became our benchmark and biggest rival at the same time!

Explore Generative AI Innovation

Generative AI and LLMs

Questions about vecentor:

devmio: Can you please tell our readers more about the model underneath vecentor?

Muhammadreza Haghiri: It’s more like a combination of models or a pipeline of models. It uses an image generation model (like Mann-E’s model), then a pattern recognition model (or a vision model if you mind) and then, a code generation model generates the resulting SVG code.

This is the best way of creating SVG’s using AI, specially complex SVG’s like what we have on our platform!

devmio: Why did you choose a mixture of Mistral and Stable Diffusion?

Muhammadreza Haghiri: The code generation is done by Mistral (a finetuned version), but image generation and pattern recognition aren’t exactly done by SD.
Although at the time of our initial talks, we were still using SD, but we just switched to Mann-E’s proprietary models and trained a vector style on top of that.
Then, we just moved to OpenAI’s vision models in order to get the information about the image and the patterns.
At the end, we use our LLM in order to create the SVG code.
It’s a fun and complex task of generation of SVG images!

devmio: How does Vecentor’s approach to SVG generation differ from traditional image generation methods (like pixel-based models)?

Muhammadreza Haghiri: As I mentioned, SVG generation is being treated as code generation because vector images are more like guidelines of how lines and dots are drawn and colored on the user’s screen. Also there are some information of scales and the scales aren’t literal (hence the name “scalable”).
So we can claim that we achieved code generation in our company and it opens the doors for us to make new products for developers and people who need to code.

devmio: What are the advantages and limitations of using SVGs for image creation compared to other formats?

Muhammadreza Haghiri: For a lot of applications such as desktop publications or web development, SVG’s are better choice.
They can be easily modified and their quality will be the same. This is why SVG’s matter. The limitations on the other hand are that you just can’t expect a photo realistic image be a good SVG, since they’re made very very geometric.

devmio: Can you elaborate on specific applications where Vecentor’s SVG generation would be particularly beneficial (e.g., web design, animation, data visualization)?

Muhammadreza Haghiri: Of course. Our initial target market was for frontend developers and UI/UX designers. But it can also be spread to the other industries and professions as well. 

The Future of AI Art Generation

devmio: With the rise of AI art generators, how do you see the role of human artists evolving?

Muhammadreza Haghiri: Unlike what a lot of people think, humans are always ahead of machines. Although an intelligent machine is not without its own dangers, but we still can be far ahead of what a machine can do. Human artists will evolve and will become better of course, and we can take a page from their books and make better intelligent machines!

devmio: Do you foresee any ethical considerations specific to AI-generated art, such as copyright or plagiarism concerns?

Muhammadreza Haghiri: This is a valid concern and debate. Artists want to protect their rights and we also want more data. I guess the best way of getting rid of copyright disasters is that not being like OpenAI and if we use copyrighted material, we pay the owners of the art!
This is why both Mann-E and Vecentor are trained of AI generated and also royalty free material.

devmio: What potential applications do you see for AI art generation beyond creative endeavors?

Muhammadreza Haghiri: AI image, video and music generation is a tool for marketers in my opinion. You will have a world to create without any concerns about copyrights and what’s better than this? I personally think this is the future in those areas.
Also, I personally look at AI art as a form of entertainment. We used to listen to the music other people made, nowadays we can produce the music ourselves just by typing what we have in our mind!

Stay up to date

Learn more about MLCON

 

Personal Future and Projects

devmio: Are you currently planning new projects or would you like to continue working on your existing projects?

Muhammadreza Haghiri: Yes. I’m planning for some projects, specially in Hardware and Code Generation areas. I guess they’ll be surprises for next quarters

devmio: Are there any areas in the field of AI or ML that you would like to explore further in the near future?

Muhammadreza Haghiri: I like the hardware and OS integrations. Something like a self operating computer or similar stuff. Also, I like to see more AI usage in our day to day lives.

devmio: Thank you very much for taking the time to answer our questions.

The post Art and creativity with AI appeared first on ML Conference.

]]>
Building Ethical AI: A Guide for Developers on Avoiding Bias and Designing Responsible Systems https://mlconference.ai/blog/building-ethical-ai-a-guide-for-developers-on-avoiding-bias-and-designing-responsible-systems/ Wed, 17 Apr 2024 06:19:44 +0000 https://mlconference.ai/?p=87456 The intersection of philosophy and artificial intelligence may seem obvious, but there are many different levels to be considered. We talked to Katleen Gabriels, Assistant Professor in Ethics and Philosophy of Technology and author of the 2020 book “Conscientious AI: Machines Learning Morals”. We asked her about the intersection of philosophy and AI, about the ethics of ChatGPT, AGI and the singularity.

The post Building Ethical AI: A Guide for Developers on Avoiding Bias and Designing Responsible Systems appeared first on ML Conference.

]]>
devmio: Thank you for taking the time for the interview. Can you please introduce yourself to our readers?

Katleen Gabriels: My name is Katleen Gabriels, I am an Assistant Professor in Ethics and Philosophy of Technology at the University of Maastricht in the Netherlands, but I was born and raised in Belgium. I studied linguistics, literature, and philosophy. My research career started as an avatar in Second Life and the social virtual world. Back then I was a master student in moral philosophy and I was really intrigued by this social virtual world that promised that you could be whoever you want to be. 

That became the research of my master thesis and evolved into a PhD project which was on the ontological and moral status of virtual worlds. Since then, all my research revolves around the relation between morality and new technologies. In my current research, I look at the mutual shaping of morality and technology. 

Some years ago, I held a chair at the Faculty of Engineering Sciences in Brussels and I gave lectures to engineering and mathematics students and I’ve also worked at the Technical University of Eindhoven.

Stay up to date

Learn more about MLCON

 

devmio: Where exactly do philosophy and AI overlap?

Katleen Gabriels: That’s a very good but also very broad question. What is really important is that an engineer does not just make functional decisions, but also decisions that have a moral impact. Whenever you talk to engineers, they very often want to make the world a better place through their technology. The idea that things can be designed for the better already has moral implications.

Way too often, people believe in the stereotype that technology is neutral. We have many examples around us today, and I think machine learning is a very good one, that a technology’s impact is highly dependent on design choices. For example, the data set and the quality of the data: If you train your algorithms with just even numbers, it will not know what an uneven number is. But there are older examples that have nothing to do with AI or computer technology. For instance, a revolving door does not include people who need a walking cane or a wheelchair.

In my talks, I always share a video of an automatic soap dispenser that does not recognize black people’s hands to show why it is so important to take into consideration a broad variety of end users. Morality and technology are not separate domains. Each technological object is human-made and humans are moral beings and therefore make moral decisions. 

Also, the philosophy of the mind is very much dealing with questions concerning intelligence, but with breakthroughs in generative AI like DALL-E, also, with what is creativity. Another important question that we’re constantly debating with new evolutions in technology is where the boundary between humans and machines is. Can we be replaced by a machine and to what extent?

Explore Generative AI Innovation

Generative AI and LLMs

devmio: In your book “Conscientious AI: Machines Learning Morals”, you write a lot about design as a moral choice. How can engineers or developers make good moral choices in their design?

Katleen Gabriels: It’s not only about moral choices, but also about making choices that have ethical impact. My most practical hands-on answer would be that education for future engineers and developers should focus much more on these conceptual and philosophical aspects. Very often, engineers or developers are indeed thinking about values, but it’s difficult to operationalize them, especially in a business context where it’s often about “act now, apologize later”. Today we see a lot of attempts of collaboration between philosophers and developers, but that is very often just a theoretical idea.

First and foremost, it’s about awareness that design choices are not just neutral choices that developers make. We have had many designers with regrets about their designs years later. Chris Wetherell is a nice example: He designed the retweet button and initially thought that the effects of it would only be positive because it can increase how much the voices of minorities are heard. And that’s true in a way, but of course, it has also contributed to fake news to polarization.

Often, people tend to underestimate how complex ethics is. I exaggerate a little bit, but very often when teaching engineers, they have a very binary approach to things. There are always some students who want to make a decision tree out of ethical decisions. But often values clash with each other, so you need to find a trade-off. You need to incorporate the messiness of stakeholders’ voices, you need time for reflection, debate, and good arguments. That complexity of ethics cannot be transferred into a decision tree. 

If we really want to think about better and more ethical technology, we have to reserve a lot of time for these discussions. I know that when working for a highly commercial company, there is not a lot of time reserved for this.

devmio: What is your take on biases in training data? Is it something that we can get rid of? Can we know all possible biases?

Katleen Gabriels: We should be aware of the dynamics of society, our norms, and our values. They’re not static. Ideas and opinions, for example, about in vitro fertilization have changed tremendously over time, as well as our relation with animal rights, women’s rights, awareness for minorities, sustainability, and so on. It’s really important to realize that whatever machine you’re training, you must always keep it updated with how society evolves, within certain limits, of course. 

With biases, it’s important to be aware of your own blind spots and biases. That’s a very tricky one. ChatGPT, for example, is still being designed by white men and this also affects some of the design decisions. OpenAI has often been criticized for being naive and overly idealistic, which might be because the designers do not usually have to deal with the kind of problems they may produce. They do not have to deal with hate speech online because they have a very high societal status, a good job, a good degree, and so on.

devmio: In the case of ChatGPT, training the model is also problematic. In what way?

Katleen Gabriels: There’s a lot of issues with ChatGPT. Not just with the technology itself, but things revolving around it. You might already have read that a lot of the labeling and filtering of the data has been outsourced, for instance, to clickworkers in Africa. This is highly problematic. Sustainability is also a big issue because of the enormous amounts of power that the servers and GPUs require. 

Another issue with ChatGPT has to do with copyright. There have already been very good articles about the arrogance of Big Tech because their technology is very much based on the creative works of other people. We should not just be very critical about the interaction with ChatGPT, but also about the broader context of how these models have been trained, who the company and the people behind it are, what their arguments and values are, and so on. This also makes the ethical analysis much more complex.

The paradox is that on the Internet, with all our interactions, we become very transparent for Big Tech companies, but they in turn remain very opaque about their decisions. I’ve also been amazed and but annoyed about how a lot of people dealt with the open letter demanding a six-month ban on AI development. People didn’t look critically at people like Elon Musk signing it and then announcing the start of a new AI company to compete with OpenAI.

This letter focuses on existential threats and yet completely ignores the political and economic situation of Big Tech today. 

 

devmio: In your book, you wrote that language still represents an enormous challenge for AI. The book was published in 2020 – before ChatGPT’s advent. Do you still hold that belief today?

Katleen Gabriels: That is one of the parts that I will revise and extend significantly in the new edition. Even though the results are amazing in terms of language and spelling, ChatGPT still is not magic. One of the challenges of language is that it’s context specific and that’s still a problem for algorithms, which has not been solved with ChatGPT. It’s still a calculation, a prediction.

The breakthrough in NLP and LLMs indeed came sooner than I would have expected, but some of the major challenges are not being solved. 

devmio: Language plays a big role in how we think and how we argue and reason. How far do you think we are from artificial general intelligence? In your book, you wrote that it might be entirely possible, that consciousness might be an emergent property of our physiology and therefore not achievable outside of the human body. Is AGI even achievable?

Katleen Gabriels: Consciousness is a very tricky one. For AGI, first of all, from a semantic point of view, we need to know what intelligence is. That in itself is a very philosophical and multidimensional question because intelligence is not just about being good in mathematics. The term is very broad. There is also emotional and different kinds of intelligence, for instance. 

We could take a look at the term superintelligence, as the Swedish philosopher Nick Bostrom defines it: Superintelligence means that a computer is much better than a human being and each facet of intelligence, including emotional intelligence. We’re very far away from that. It also has to do with bodily intelligence. It’s one thing to make a good calculation, but it’s another thing to teach a robot to become a good waiter and balance glasses filled with champagne through a crowd. 

AGI or strong AI means a form of consciousness or self-consciousness and includes the very difficult concept of free will and being accountable for your actions. I don’t see this happening. 

The concept of AGI is often coupled with the fear of the singularity, which is basically a threshold: The final thing we as humans do, is develop a very smart computer and then we are done for as we cannot compete with these computers. Ray Kurzweil predicted that this is going to happen in 2045. But depending on the definition of superintelligence and the definition of singularity, I don’t believe that 2045 will be the time when this happens. Very few people actually believe that.

devmio: We regularly talk to our expert Christoph Henkelmann. He raised an interesting point about AGI. If we are able to build a self-conscious AI, we have a responsibility to that being and cannot just treat it as a simple machine.

Katleen Gabriels: I’m not the only person who made the joke, but maybe the true Turing Test is that if a machine gains self-consciousness and commits suicide, maybe that is a sign of true intelligence. If you look at the history of science fiction, people have been really intrigued by all these questions and in a way, it very much fits the quote that “to philosophize is to learn how to die.”

I can relate that quote to this, especially the singularity is all about overcoming death and becoming immortal. In a way, we could make sense of our lives if we create something that outlives us, maybe even permanently. It might make our lives worth living. 

At the academic conferences that I attend, the consensus seems to be that the singularity is bullshit, the existential threat is not that big of a deal. There are big problems and very real threats in the future regarding AI, such as drones and warfare. But a number of impactful people only tell us about those existential threats. 

devmio: We recently talked to Matthias Uhl who worked on a study about ChatGPT as a moral advisor. His study concluded that people do take moral advice from a LLM, even though it cannot give it. Is that something you are concerned with?

Katleen Gabriels: I am familiar with the study and if I remember correctly, they required a five minute attention span of their participants. So in a way, they have a big data set but very little has been studied. If you want to ask the question of to what extent would you accept moral advice from a machine, then you really need a much more in-depth inquiry. 

In a way, this is also not new. The study even echoes some of the same ideas from the 1970s with ELIZA. ELIZA was something like an early chatbot and its founder, Joseph Weizenbaum, was shocked when he found out that people anthropomorphized it. He knew what it was capable of and in his book “Computer Power and Human Reason: From Judgment to Calculation” he recalls anecdotes where his secretary asked him to leave the room so she could interact with ELIZA in private. People were also contemplating to what extent ELIZA could replace human therapists. In a way, this says more about human stupidity than about artificial intelligence. 

In order to have a much better understanding of how people would take or not take moral advice from a chatbot, you need a very intense study and not a very short questionnaire.

devmio:  It also shows that people long for answers, right? That we want clear and concise answers to complex questions.

Katleen Gabriels: Of course, people long for a manual. If we were given a manual by birth, people would use it. It’s also about moral disengagement, it’s about delegating or distributing responsibility. But you don’t need this study to conclude that.

It’s not directly related, but it’s also a common problem on dating apps. People are being tricked into talking to chatbots. Usually, the longer you talk to a chatbot, the more obvious it might become, so there might be a lot of projection and wishful thinking. See also the media equation study. We simply tend to treat technology as human beings.

Stay up to date

Learn more about MLCON

 

devmio: We use technology to get closer to ourselves, to get a better understanding of ourselves. Would you agree?

Katleen Gabriels: I teach a course about AI and there’s always students saying, “This is not a course about AI, this is a course about us!” because it’s so much about what intelligence is, where the boundary between humans and machines is, and so on. 

This would also be an interesting study for the future of people who believe in a fatal singularity in the future. What does it say about them and what they think of us humans?

devmio: Thank you for your answers!

The post Building Ethical AI: A Guide for Developers on Avoiding Bias and Designing Responsible Systems appeared first on ML Conference.

]]>
Maximizing Machine Learning with Data Lakehouse and Databricks: A Guide to Enhanced AI Workflows https://mlconference.ai/blog/data-lakehouse-databricks-ml-performance/ Mon, 18 Mar 2024 10:31:51 +0000 https://mlconference.ai/?p=87350 In today’s rapidly evolving data landscape, leveraging a Data Lakehouse architecture is becoming a key strategy for enhancing machine learning workflows. Databricks, a leader in unified data analytics, provides a robust platform that integrates seamlessly with the data lakehouse model to enable data engineers, data scientists, and Machine learning (ml) developers to collaborate more effectively. In this article, we explore how Databricks empowers organizations to streamline data processing, accelerate model development, and unlock the full potential of artificial intelligence (AI) by providing a centralized data repository. This solution not only improves scalability and efficiency but also facilitates end-to-end machine learning pipelines from data ingestion to model deployment.

The post Maximizing Machine Learning with Data Lakehouse and Databricks: A Guide to Enhanced AI Workflows appeared first on ML Conference.

]]>
Demystify the power of DataBricks Lakehouse! This comprehensive guide dives into setting up, running, and optimizing machine learning experiments on this industry-leading platform. Whether you’re a seasoned data scientist or just getting started, this hands-on approach will equip you with the skills to unlock the full potential of DataBricks.

DataBricks is known as the Data Lake House. This is a combination of a data warehouse and data lake. This article will take a closer look at what this means in practice and how you can start your first experiments with DataBricks.{.preface}

You should know that the DataBricks platform is a spin-off of the Apache Spark project. As with many open source projects, the idea behind it was to combine open source technology with quality of life improvements.

DataBricks in particular obviously focuses on ease of use and a flat learning curve. Developers should resist the temptation to use an inexpensive, turnkey product instead of a  technically innovative system, especially for projects with a short lifespan.

Stay up to date

Learn more about MLCON

 

Commissioning DataBricks

DataBricks is currently used exclusively in resources or implementations from cloud providers. At the time of writing this, the company at least supports the “Big Three”. Interestingly, in the [FAQ] seen in **Figure 1**, they explicitly admit that they don’t currently provide the option of locally hosting the DataBricks system.

Fig. 1: If you want to host DataBricks locally, you’re out of luck.{.caption}

Interestingly, DataBricks has a close relationship with all three cloud providers. In many cases, you don’t have to pay separate AWS or cloud costs when purchasing a commercial DataBricks product. Instead, payment is made directly to DataBricks and the provider settles the costs.

For newcomers, there is the DataBricks Community Edition, a light version provided in collaboration with Amazon AWS. It’s completely free to use, but only allows 15 GB of data volume and is limited in terms of some convenience functions, scheduling (and the REST API). But this function should be enough for our first attempts.

So let’s call up the [DataBricks Community Edition log-in page] in the browser of our choice. After clicking on the sign-up link, DataBricks takes you to the fully-fledged log-in portal, where you can register for a free 14-day trial of the platform’s full version. In order to use the Community Edition, you must first fully complete the registration process.

In the second step, be sure not to choose a cloud provider in the window shown in **Figure 2**. Instead, click the Get started with Community Edition link at the bottom to continue the registration process for the Community Edition.

Databricks cloud provider selection screen with options for AWS, Microsoft Azure, and Google Cloud Platform, along with a button to continue and a link to Community Edition.

Fig. 2: Care is needed when activating the Community Edition.{.caption}

In the next step, you need to solve a captcha to identify yourself as a human user. The confirmation message seen in **Figure 3** is divided between the commercial and community edition. Don’t get anxious about the reference to the free trial phase.

Databricks email verification screen prompting users to check their email to start their trial, with links to an administration guide and a quickstart guide for deploying the first workspace.

Fig. 3: Community Edition users also see this message.{.caption}

Entering a valid e-mail address is especially important. DataBricks will send a confirmation email. Clicking the link in the email lets you set a password. Then you’ll find yourself in the product’s start interface, [which can be activated later here](https://community.cloud.databricks.com/).

MYRIAD OF TOOLS & FRAMEWORKS

Tools, APIs & Frameworks

Working through the Quickstart notebook

In many respects, commercial companies are interested in flattening the learning curve for potential customers. This can be seen in DataBrick’s guide. The Quickstart tutorial section is prominently placed on the homepage, offering the Start Tutorial link.

Click it to command the web interface to change mode. Your efforts will be rewarded with a user interface similar to several Python notebook systems.

The visual similarities are no coincidence. DataBricks relies on the IPython engine in the background and is more or less compatible with standalone product versions.

Creating the cluster is especially important here. Let me explain. The developer creates the intelligence needed to complete the machine learning task in the notebooks.

But the actual execution of this intelligence requires computing power that normally far exceeds the available computing resources behind Schlomo Normaldevveloper’s browser window. Interestingly, DataBricks’ clusters are available in two versions. The all-purpose class is a classic cloud VM that (manually started and/or scheduled) is also available to a user rotation for collaboratively finishing battle tasks.

System number two is the job cluster. This is a dedicated cluster created for a batch task. It is automatically terminated after a successful or failed job processing. It’s important to note that the administrator isn’t able to keep a job cluster alive after the batch process finishes.

Be that as it may, in the next step, we place our mouse pointer on the far left to expand the menu. DataBricks offers two different operating modes by default.

We want to choose Data Science and Engineering. In the next step, open the Compute menu. Here, we can manage the computing power sources in our account.

Activate the All-Purpose-Compute tab and click the Create Compute option to make a new cluster element. You can freely choose a name. I opted for SUSTest1.

It’s important that several Runtime versions are available. In the following, we opt for the 7.3 LTS option (Scala 2.12, Spark 3.0.1).

As free Community Edition users, we don’t have the option of choosing different cluster hardware sizes. Our system only ever has 15 GB of memory and deactivates after two hours of inactivity.

So, all you need to do to start the configuration process is click the Create Cluster button. Then, click the compute element again to switch to the overview table. This lists all of your account’s compute resources side-by-side.

Generating the compute resources will take some time. To the far left of the table, as seen in **Figure 4**, there is a rotating circle symbol to show that our cluster is in progress.

Databricks compute configuration screen showing options for all-purpose compute and job compute, with a button to create new compute resources and a list of existing resources labeled 'SUSTest1'.

Fig. 4: If the circle is rotating, the cluster isn’t ready for combat yet.{.caption}

The start process can take up to five minutes. Once the work is done, a green tick symbol will appear, as seen in **Figure 5**. As a free version user, you cannot assume that your cluster is running ad perpetuum. If you notice strange behavior in the DataBricks, it makes sense to check the cluster status.

Screenshot of Databricks' 'Compute' tab showcasing an active all-purpose compute resource named 'SUSTest1'. This compute resource is used in a scalable machine learning (ML) pipeline within a data lakehouse architecture. The platform streamlines data processing and analytics workflows, supporting collaboration and efficient compute management.

Fig. 5: The green tick mean it’s ready for action.{.caption}

Once our work is done, we can return to the notebook. The Connect option is available in the top right-hand corner. Click it and select the cluster to establish a connection. Then click 

the Run All icon next to it to instruct all commands in the notebook to execute. In the following, the system will execute commands in individual cells in real-time, as seen in **Figure 5**. Be sure to scroll down and view the results.

Screenshot showing a Databricks notebook executing PySpark commands for a machine learning (ML) workflow within a data lakehouse architecture. The code reads a CSV file, saves it using Delta format, creates a Delta table, and runs a SQL query on the 'diamonds' dataset. This demonstrates scalable data processing and streamlined pipelines for analytics and collaboration.

Fig. 6: The environment provides real-time information about operations performed

Focus on the cell.{.caption}

Due to the architectural decision to build DataBricks as a whole on IPython notebooks, we  must deliver the commands to be executed in the form of notebooks. Interestingly, the notebook as a whole can be kept in one programming language, while individual command cells can offer other languages. A foreign-language command element is created by clicking the respective language bubble, as shown in **Figure 7**.

Screenshot of a Databricks notebook displaying a PySpark command to read a CSV file from a dataset, process it with Delta format, and overwrite it into a Delta table. A dropdown menu shows options to change the notebook cell language, including Markdown, Python, SQL, Scala, and R. This is part of a machine learning (ML) workflow in a scalable data lakehouse architecture.

Fig. 7: DataBricks allows the use of insular languages.{.caption}

Using the menu option File | Export | HTML, the DataBricks notebook can also be exported as an HTML file after its commands are successfully processed. The majority of the mark-up is lost, but the resulting file presents the results in a way that’s easier for management to understand and digest.

Alternatively, you can click the blue Publish button to generate a globally valid link that lets any user view the fully-fledged notebook. By default, these links stay valid for six months. Please note that publishing a new version invalidates all existing links.

Commercial version owners can also run their notebooks regularly like a cron job with the scheduling option. The user interface in **Figure 8** is used for this. Other job scheduling system users will feel right at home. However, be aware that this function requires a job cluster, which isn’t included and cannot be created in the free Community Edition at the time of writing this.

DataBricks in scheduling mode

Fig. 8: DataBricks in scheduling mode.{.caption}

 

Last but not least, you can also stop the cluster using the menu at the top right. This is only a courtesy to the company for the Community Edition. But, it’s highly recommended for commercial use since it reduces overall costs.

Different data tables for optimizing performance

One of NoSQL databases’ basic characteristics is that in many cases, they soften the ACID criteria. The lower consistency quality is usually offset by a greatly reduced database administration effort. Sometimes, this results in impressive performance increases compared to a classic relational database. When working with DataBricks, we deal with a group of different table types that differ in terms of performance and data storage type.

The most important difference concerns external tables and managed tables. A managed table lives entirely in the DataBricks cluster. The development team understands this to mean that the database server handles management of the actual information and the provision of metadata and access features.

There’s also the unmanaged or external table. This table represents a kind of “wrapper” around an external data source. Using this design pattern is recommended if you frequently use sample databases or information already available elsewhere in the system in an accessible form.

Since our sample from DataBricks is based on a diamond information set, using external tables is recommended. Redundant duplication of resources will only waste memory space in our cluster, without bringing any significant benefits here.

However, a careful look at the instructions created in the example notebook shows two different procedures. The first table is created with the following snippet:

 

```

DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
```

Besides the call to DROP TABLE, which is always needed to initialize the cluster, creating the new table uses standard SQL commands, more or less.We use _Using csv_ to tell the Runtime we want to use the CSV engine.

If you scroll further down in the example, you’ll see that the table is created again, but in a two-stage process. In the first step, there’s now a Python island in the notebook that interacts with the diamond sample information in the URL /databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv according to the following:

```
%python
diamonds = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")
diamonds.write.format("delta").mode("overwrite").save("/delta/diamonds")
```

The DataBricks development team provides aspiring data science experimenters with a dozen or so widely used sample datasets. These can be accessed directly from the DataBricks runtime using friendly URLs. Additional information about available data sources [can be found here](https://docs.databricks.com/dbfs/databricks-datasets.html).

In the second step, there’s a snippet of SQL code that delivers Using Delta instead of the previously used Using CSV. This instructs the DataBricks backend to animate the existing element with the Delta database engine.

```
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING DELTA LOCATION '/delta/diamonds/'
```

Delta is an open source database engine based on Apache Parquet. Its . Normally, it’s always preferable to use the Apache Spark table because it delivers better results in terms of both ACID criteria and performance, especially when large amounts of data need to be processed.

DataBricks is more – Focus on machine learning

Until now, we operated the DataBricks runtime in engineering mode. It’s optimized for the needs of ordinary data scientists who want to perform various types of analyses. But the user interface has a special mode specifically for machine learning (**Fig. 9** shows the mode switcher) that focuses on relevant functions.

This option lets you change the personality of the DataBricks interface.

Fig. 9: This option lets you change the personality of the DataBricks interface.{.caption}

In principle, the workflow in **Figure 10** is always used. Anyone implementing this workflow in an in-house application will always work with the Auto-ML working environment sooner or later. In theory, this is available from version 9.1 at the end of Runtime, but it’s only really feature-complete when at least version 10.4 LTS ML is available on the cluster. But since this is one of the USPs of the DataBricks platform, we can assume that the product is under constant further development.

It’s advised that you check if the cluster in question is running the product’s latest version. For data engineering, DataBricks also offers a dedicated tutorial in the Guide: Training section from the home screen. This makes it easier to get started. Click the Start guide option again to load the notebook for this tutorial as “to be edited”.

ML functions in DataBricks workflow.

Fig. 10: If you want to use the ML functions in DataBricks, you should familiarize yourself with this workflow.{.caption}

Due to higher demands on the aforementioned required Data Bricks Runtime, you should switch to the Compute section and delete the previously created cluster. Then, click the Create Compute option again and delete the previously created cluster. Click the Create Compute option again and make sure to click the ML heading in the DataBricks Runtime Version field (see **Fig. 11**) in the first step.

ML-capable variants of the DataBricks runtime appear in a separate section in the backend.

Fig. 11: ML-capable variants of the DataBricks runtime appear in a separate section in the backend.{.caption}

Just for fun, we’ll use the latest version 12.0 ML and name the cluster “SUSTestML”. It takes some time after clicking the Create Cluster button, since the cloud resources aren’t immediately provided.

During cluster generation, we can return to the notebook to get an overview of the elements. In the first step, we see the inclusion of the following libraries, abbreviated here. They are familiar to every Python developer:

```

import mlflow
import numpy as np
import pandas as pd
import sklearn.datasets
. . .
from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
. . .
```

In many respects, DataBricks is based on what ML developers are familiar with from working with standard Python scripts. Some libraries naturally have optimizations to make them run more efficiently on the DataBricks hardware. In general, however, a locally functioning Python script will continue to work without any problems after being moved to the DataBricks cluster. For the actual monitoring of the learning process, Data Bricks relies on MLFlow, which is available here [6].

For this reason, the rest of the notebook is standard ML code, although it’s elegantly integrated into the user interface. For example, there is a flyout in which the application provides information about various parameters that were created during the parameterization of the model:

```
with mlflow.start_run(run_name='gradient_boost') as run:
  model = sklearn.ensemble.GradientBoostingClassifier(random_state=0)
  model.fit(X_train, y_train)
  . . .
```

It’s also interesting to note that the results of the individual optimization runs are not only displayed in the user interface. The Python code that lives in the notebook can also access them programmatically. In this way, it can perform a kind of reflection to find the most suitable parameters and/or model architectures.

In the case of the example notebook provided by DataBricks, this is illustrated in the following snippet, which applies an SQL query to the results available in the mlflow.search_runs field:

“`

best_run = mlflow.search_runs(
  order_by=['metrics.test_auc DESC', 'start_time DESC'],
  max_results=10,
).iloc[0]
print('Best Run')
print('AUC: {}'.format(best_run["metrics.test_auc"]))
print('Num Estimators: {}'.format(best_run["params.n_estimators"]))
```

 AutoML, for the second time

The duality of control via the user interface and programmatic control also continues in the case of the AutoML library mentioned above. The user interface shown in Figure 12, which allows graphical parameterization of ML runs, is probably the most common marketing argument.

AutoML allows the graphical configuration of modeling

Fig. 12: AutoML allows the graphical configuration of modeling{.caption}

On the other hand, there is also a programmatic API that illustrates DataBricks in the form of a group of notebooks. Here we want to use the example notebook provided here [7], which we load into a browser window in the first step. Then click on the Import Notebook button at the top right and copy the URL to the clipboard.

Next, open the menu of your DataBricks instance and select the Workspace File Users option. Next to your email address, there is a downward pointing arrow, which allows you to open a context menu. Select the import option there and then enter the URL to load the sample notebook into your DataBricks instance.

The actual body of the model couldn’t be any easier. In the first step, we mainly load test data, but we also create a schema element that informs the engine about the type or data type of the model information to be processed:

```

from pyspark.sql.types import DoubleType, StringType, StructType, StructField

schema = StructType([
  StructField("age", DoubleType(), False),
  . . .
  StructField("income", StringType(), False)
])
input_df = 
spark.read.format("csv").schema(schema).load("/databricks-datasets/adult/adult.data")
```
The actual classification run then also takes place with a single line:
```

from databricks import automl
summary = automl.classify(train_df, target_col="income", timeout_minutes=30)
```

 

If you want to carry out interference later, you can do this with both Pandas and Spark.

Stay up to date

Learn more about MLCON

 

The multitool for ML professionals

Although there are hundreds of pages yet to be written about DataBricks, we’ll end our experiments with this brief overview. DataBricks is a tool that is completely focused on data scientists and machine learning experts and is not really suitable for beginners due to the very steep learning curve. Much like the infamous Squirrel Busters, DataBricks is a product that will find you when you need it.

The post Maximizing Machine Learning with Data Lakehouse and Databricks: A Guide to Enhanced AI Workflows appeared first on ML Conference.

]]>
OpenAI Embeddings https://mlconference.ai/blog/openai-embeddings-technology-2024/ Mon, 19 Feb 2024 13:18:46 +0000 https://mlconference.ai/?p=87274 Embedding vectors (or embeddings) play a central role in the challenges of processing and interpretation of unstructured data such as text, images, or audio files. Embeddings take unstructured data and convert it to structured, no matter how complex, so they can be easily processed by software. OpenAI offers such embeddings, and this article will go over how they work and how they can be used.

The post OpenAI Embeddings appeared first on ML Conference.

]]>
Data has always played a central role in the development of software solutions. One of the biggest challenges in this area is the processing and interpretation of unstructured data such as text, images, or audio files. This is where embedding vectors (called embeddings for short) come into play – a technology that is becoming increasingly important in the development of software solutions with the integration of AI functions.

Stay up to date

Learn more about MLCON

 

 

Embeddings are essentially a technique for converting unstructured data into a structure that can be easily processed by software. They are used to transform complex data such as words, sentences, or even entire documents into a vector space, with similar elements close to each other. These vector representations allow machines to recognize and exploit nuances and relationships in the data. Which is essential for a variety of applications such as natural language processing (NLP), image recognition, and recommendation systems.

OpenAI, the company behind ChatGPT, offers models for creating embeddings for texts, among other things. At the end of January 2024, OpenAI presented new versions of these embeddings models, which are more powerful and cost-effective than their predecessors. In this article, after a brief introduction to embeddings, we’ll take a closer look at the OpenAI embeddings and the recently introduced innovations, discuss how they work, and examine how they can be used in various software development projects.

Embeddings briefly explained

Imagine you’re in a room full of people and your task is to group these people based on their personality. To do this, you could start asking questions about different personality traits. For example, you could ask how open someone is to new experiences and rate the answer on a scale from 0 to 1. Each person is then assigned a number that represents their openness.

Next, you could ask about another personality trait, such as the level of sense of duty, and again give a score between 0 and 1. Now each person has two numbers that together form a vector in a two-dimensional space. By asking more questions about different personality traits and rating them in a similar way, you can create a multidimensional vector for each person. In this vector space, people who have similar vectors can then be considered similar in terms of their personality.

In the world of artificial intelligence, we use embeddings to transform unstructured data into an n-dimensional vector space. Similarly how a person’s personality traits are represented in the vector space, each point in this vector space represents an element of the original data (such as a word or phrase) in a way that is understandable and processable by computers.

OpenAI Embeddings

OpenAI embeddings extend this basic concept. Instead of using simple features like personality traits, OpenAI models use advanced algorithms and big data to achieve a much deeper and more nuanced representation of the data. The model not only analyzes individual words, but also looks at the context in which those words are used, resulting in more accurate and meaningful vector representations.

Another important difference is that OpenAI embeddings are based on sophisticated machine learning models that can learn from a huge amount of data. This means that they can recognize subtle patterns and relationships in the data that go far beyond what could be achieved by simple scaling and dimensioning, as in the initial analogy. This leads to a significantly improved ability to recognize and exploit similarities and differences in the data.

 

Explore Generative AI Innovation

Generative AI and LLMs

Individual values are not meaningful

While in the personality trait analogy, each individual value of a vector can be directly related to a specific characteristic – for example openness to new experiences or a sense of duty – this direct relationship no longer exists with OpenAI embeddings. In these embeddings, you cannot simply look at a single value of the vector in isolation and draw conclusions about specific properties of the input data. For example, a specific value in the embedding vector of a sentence cannot be used to directly deduce how friendly or not this sentence is.

The reason for this lies in the way machine learning models, especially those used to create embeddings, encode information. These models work with complex, multi-dimensional representations where the meaning of a single element (such as a word in a sentence) is determined by the interaction of many dimensions in vector space. Each aspect of the original data – be it the tone of a text, the mood of an image, or the intent behind a spoken utterance – is captured by the entire spectrum of the vector rather than by individual values within that vector.

Therefore, when working with OpenAI embeddings, it’s important to understand that the interpretation of these vectors is not intuitive or direct. You need algorithms and analysis to draw meaningful conclusions from these high-dimensional and densely coded vectors.

Comparison of vectors with cosine similarity

A central element in dealing with embeddings is measuring the similarity between different vectors. One of the most common methods for this is cosine similarity. This measure is used to determine how similar two vectors are and therefore the data they represent.

To illustrate the concept, let’s start with a simple example in two dimensions. Imagine two vectors in a plane, each represented by a point in the coordinate system. The cosine similarity between these two vectors is determined by the cosine of the angle between them. If the vectors point in the same direction, the angle between them is 0 degrees and the cosine of this angle is 1, indicating maximum similarity. If the vectors are orthogonal (i.e. the angle is 90 degrees), the cosine is 0, indicating no similarity. If they are opposite (180 degrees), the cosine is -1, indicating maximum dissimilarity.

Figure 1 -Cosine similarity

Accompanying this article is a Google Colab Python Notebook which you can use to try out many of the examples shown here. Colab, short for Colaboratory, is a free cloud service offered by Google. Colab makes it possible to write and execute Python code in the browser. It’s based on Jupyter Notebooks, a popular open-source web application that makes it possible to combine code, equations, visualizations, and text in a single document-like format. The Colab service is well suited for exploring and experimenting with the OpenAI API using Python.

 

A Python Notebook to try out
Accompanying this article is a Google Colab Python Notebook which you can use to try out many of the examples shown here. Colab, short for Colaboratory, is a free cloud service offered by Google. Colab makes it possible to write and execute Python code in the browser. It’s based on Jupyter Notebooks, a popular open-source web application that makes it possible to combine code, equations, visualizations, and text in a single document-like format. The Colab service is well suited for exploring and experimenting with the OpenAI API using Python.

In practice, especially when working with embeddings, we are dealing with n-dimensional vectors. The calculation of the cosine similarity remains conceptually the same, even if the calculation is more complex in higher dimensions. Formally, the cosine similarity of two vectors A and B in an n-dimensional space is calculated by the scalar product (dot product) of these vectors divided by the product of their lengths:

Figure 2 – Calculation of cosine similarity

The normalization of vectors plays an important role in the calculation of cosine similarity. If a vector is normalized, this means that its length (norm) is set to 1. For normalized vectors, the scalar product of two vectors is directly equal to the cosine similarity since the denominators in the formula from Figure 2 are both 1. OpenAI embeddings are normalized, which means that to calculate the similarity between two embeddings, only their scalar product needs to be calculated. This not only simplifies the calculation, but also increases efficiency when processing large quantities of embeddings.

OpenAI Embeddings API

OpenAI offers a web API for creating embeddings. The exact structure of this API, including code examples for curl, Python and Node.js, can be found in the OpenAI reference documentation.

OpenAI does not use the LLM from ChatGPT to create embeddings, but rather specialized models. They were developed specifically for the creation of embeddings and are optimized for this task. Their development was geared towards generating high-dimensional vectors that represent the input data as well as possible. In contrast, ChatGPT is primarily optimized for generating and processing text in a conversational form. The embedding models are also more efficient in terms of memory and computing requirements than more extensive language models such as ChatGPT. As a result, they are not only faster but much more cost-effective.

New embedding models from OpenAI

Until recently, OpenAI recommended the use of the text-embedding-ada-002 model for creating embeddings. This model converts text into a sequence of floating point numbers (vectors) that represent the concepts within the content. The ada v2 model generated embeddings with a size of 1536 dimensions and delivered solid performance in benchmarks such as MIRACL and MTEB, which are used to evaluate model performance in different languages and tasks.

At the end of January 2024, OpenAI presented new, improved models for embeddings:

text-embedding-3-small: A smaller, more efficient model with improved performance compared to its predecessor. It performs better in benchmarks and is significantly cheaper.
text-embedding-3-large: A larger model that is more powerful and creates embeddings with up to 3072 dimensions. It shows the best performance in the benchmarks but is slightly more expensive than ada v2.

A new function of the two new models allows developers to adjust the size of the embeddings when generating them without significantly losing their concept-representing properties. This enables flexible adaptation, especially for applications that are limited in terms of available memory and computing power.

Readers who are interested in the details of the new models can find them in the announcement on the OpenAI blog. The exact costs of the various embedding models can be found here.

New embeddings models
At the end of January 2024, OpenAI introduced new models for creating embeddings. All code examples and result values contained in this article already refer to the new text-embedding-3-large model.

Create embeddings with Python

In the following section, the use of embeddings is demonstrated using a few code examples with Python. The code examples are designed so that they can be tried out in Python Notebooks. They are also available in a similar form in the previously mentioned accompanying Google Colab notebook mentioned above.
Listing 1 shows how to create embeddings with the Python SDK from OpenAI. In addition, numpy is used to show that the embeddings generated by OpenAI are normalized.

Listing 1

from openai import OpenAI
from google.colab import userdata
import numpy as np

# Create OpenAI client
client = OpenAI(
    api_key=userdata.get('openaiKey'),
)

# Define a helper function to calculate embeddings
def get_embedding_vec(input):
  """Returns the embeddings vector for a given input"""
  return client.embeddings.create(
        input=input,
        model="text-embedding-3-large", # We use the new embeddings model here (announced end of Jan 2024)
        # dimensions=... # You could limit the number of output dimensions with the new embeddings models
    ).data[0].embedding

# Calculate the embedding vector for a sample sentence
vec = get_embedding_vec("King")
print(vec[:10])

# Calculate the magnitude of the vector. I should be 1 as
# embedding vectors from OpenAI are always normalized.
magnitude = np.linalg.norm(vec)
magnitude

Similarity analysis with embeddings

In practice, OpenAI embeddings are often used for similarity analysis of texts (e.g. searching for duplicates, finding relevant text sections in relation to a customer query, and grouping text). Embeddings are very well suited for this, as they work in a fundamentally different way to comparison methods based on characters, such as Levenshtein distance. While it measures the similarity between texts by counting the minimum number of single-character operations (insert, delete, replace) required to transform one text into another, embeddings capture the meaning and context of words or sentences. They consider the semantic and contextual relationships between words, going far beyond a simple character-based level of comparison.

As a first example, let’s look at the following three sentences (the following examples are in English, but embeddings work analogously for other languages and cross-language comparisons are also possible without any problems):

I enjoy playing soccer on weekends.
Football is my favorite sport. Playing it on weekends with friends helps me to relax.
In Austria, people often watch soccer on TV on weekends.

In the first and second sentence, two different words are used for the same topic: Soccer and football. The third sentence contains the original soccer, but it has a fundamentally different meaning from the first two sentences. If you calculate the similarity of sentence 1 to 2, you get 0.75. The similarity of sentence 1 to 3 is only 0.51. The embeddings have therefore reflected the meaning of the sentence and not the choice of words.

Here is another example that requires an understanding of the context in which words are used:
He is interested in Java programming.
He visited Java last summer.
He recently started learning Python programming.

In sentence 2, Java refers to a place, while sentences 1 and 3 have something to do with software development. The similarity of sentence 1 to 2 is 0.536, but that of 1 to 3 is 0.587. As expected, the different meaning of the word Java has an effect on the similarity.

The next example deals with the treatment of negations:
I like going to the gym.
I don’t like going to the gym.
I don’t dislike going to the gym.

Sentences 1 and 2 say the opposite, while sentence 3 expresses something similar to sentence 1. This content is reflected in the similarities of the embeddings. Sentence 1 to sentence 2 yields a cosine similarity of 0.714 while sentence 1 compared to sentence 3 yields 0.773. It is perhaps surprising that there is no major difference between the embeddings. However, it’s important to remember that all three sets are about the same topic: The question of whether you like going to the gym to work out.

The last example shows that the OpenAI embeddings models, just like ChatGPT, have built in a certain “knowledge” of concepts and contexts through training with texts about the real world.

I need to get better slicing skills to make the most of my Voron.
3D printing is a worthwhile hobby.
Can I have a slice of bread?

In order to compare these sentences in a meaningful way, it’s important to know that Voron is the name of a well-known open-source project in the field of 3D printing. It’s also important to note that slicing is a term that plays an important role in 3D printing. The third sentence also mentions slicing, but in a completely different context to sentence 1. Sentence 2 mentions neither slicing nor Voron. However, the trained knowledge enables the OpenAI Embeddings model to recognize that sentences 1 and 2 have a thematic connection, but sentence 3 means something completely different. The similarity of sentence 1 and 2 is 0.333 while the comparison of sentence 1 and 3 is only 0.263.

Similarity values are not percentages

The similarity values from the comparisons shown above are the cosine similarity of the respective embeddings. Although the cosine similarity values range from -1 to 1, with 1 being the maximum similarity and -1 the maximum dissimilarity, they are not to be interpreted directly as percentages of agreement. Instead, these values should be considered in the context of their relative comparisons. In applications such as searching text sections in a knowledge base, the cosine similarity values are used to sort the text sections in terms of their similarity to a given query. It is important to see the values in relation to each other. A higher value indicates a greater similarity, but the exact meaning of the value can only be determined by comparing it with other similarity values. This relative approach makes it possible to effectively identify and prioritize the most relevant and similar text sections.

Embeddings and RAG solutions

Embeddings play a crucial role in Retrieval Augmented Generation (RAG) solutions, an approach in artificial intelligence that combines the capabilities of information retrieval and text generation. Embeddings are used in RAG systems to retrieve relevant information from large data sets or knowledge databases. It is not necessary for these databases to have been included in the original training of the embedding models. They can be internal databases that are not available on the public Internet.
With RAG solutions, queries or input texts are converted into embeddings. The cosine similarity to the existing document embeddings in the database is then calculated to identify the most relevant text sections from the database. This retrieved information is then used by a text generation model such as ChatGPT to generate contextually relevant responses or content.

Vector databases play a central role in the functioning of RAG systems. They are designed to efficiently store, index and query high-dimensional vectors. In the context of RAG solutions and similar systems, vector databases serve as storage for the embeddings of documents or pieces of data that originate from a large amount of information. When a user makes a request, this request is first transformed into an embedding vector. The vector database is then used to quickly find the vectors that correspond most closely to this query vector – i.e. those documents or pieces of information that have the highest similarity. This process of quickly finding similar vectors in large data sets is known as Nearest Neighbor Search.

Challenge: Splitting documents

A detailed explanation of how RAG solutions work is beyond the scope of this article. However, the explanations regarding embeddings are hopefully helpful for getting started with further research on the topic of RAGs.

However, one specific point should be pointed out at the end of this article: A particular and often underestimated challenge in the development of RAG systems that go beyond Hello World prototypes is the splitting of longer texts. Splitting is necessary because the OpenAI embeddings models are limited to just over 8,000 tokens. One token corresponds to approximately 4 characters in the English language (see also).

It’s not easy finding a good strategy for splitting documents. Naive approaches such as splitting after a certain number of characters can lead to the context of text sections being lost or distorted. Anaphoric links are a typical example of this. The following two sentences are an example:

VX-2000 requires regular lubrication to maintain its smooth operation.
The machine requires the DX97 oil, as specified in the maintenance section of this manual.

The machine in the second sentence is an anaphoric link to the first sentence. If the text were to be split up after the first sentence, the essential context would be lost, namely that the DX97 oil is necessary for the VX-2000 machine.

There are various approaches to solving this problem, which will not be discussed here to keep this article concise. However, it is essential for developers of such software systems to be aware of the problem and understand how splitting large texts affects embeddings.

Stay up to date

Learn more about MLCON

 

 

Summary

Embeddings play a fundamental role in the modern AI landscape, especially in the field of natural language processing. By transforming complex, unstructured data into high-dimensional vector spaces, embeddings enable in-depth understanding and efficient processing of information. They form the basis for advanced technologies such as RAG systems and facilitate tasks such as information retrieval, context analysis, and data-driven decision-making.

OpenAI’s latest innovations in the field of embeddings, introduced at the end of January 2024, mark a significant advance in this technology. With the introduction of the new text-embedding-3-small and text-embedding-3-large models, OpenAI now offers more powerful and cost-efficient options for developers. These models not only show improved performance in standardized benchmarks, but also offer the ability to find the right balance between performance and memory requirements on a project-specific basis through customizable embedding sizes.

Embeddings are a key component in the development of intelligent systems that aim to achieve useful processing of speech information.

Links and Literature:

  1. https://colab.research.google.com/gist/rstropek/f3d4521ed9831ae5305a10df84a42ecc/embeddings.ipynb
  2. https://platform.openai.com/docs/api-reference/embeddings/create
  3. https://openai.com/blog/new-embedding-models-and-api-updates
  4. https://openai.com/pricing
  5. https://platform.openai.com/tokenizer

The post OpenAI Embeddings appeared first on ML Conference.

]]>
Address Matching with NLP in Python https://mlconference.ai/blog/address-matching-with-nlp-in-python/ Fri, 02 Feb 2024 12:02:35 +0000 https://mlconference.ai/?p=87201 Discover the power of address matching in real estate data management with this comprehensive guide. Learn how to leverage natural language processing (NLP) techniques using Python, including open-source libraries like SpaCy and fuzzywuzzy, to parse, clean, and match addresses. From breaking down data silos to geocoding and point-in-polygon searches, this article provides a step-by-step approach to creating a Source-of-Truth Real Estate Dataset. Whether you're in geospatial analysis, real estate data management, logistics, or compliance, accurate address matching is the key to unlocking valuable insights.

The post Address Matching with NLP in Python appeared first on ML Conference.

]]>
Address matching isn’t always simple in data; we often need to parse and standardize addresses into a consistent format first before we can use them as identifiers for matching. Address matching is an important step in the following use cases:

  1. Geospatial Analysis: Accurate address matching forms the foundation of geospatial analysis, allowing organizations to make informed decisions about locations, market trends, and resource allocation across various industries like retail and media.
  2. Real Estate Data Management: In the real estate industry, precise address matching facilitates property valuation, market analysis, and portfolio management.
  3. Logistics and Navigation: Efficient routing and delivery depend on accurate address matching.
  4. Compliance and Regulation: Many regulatory requirements mandate precise address data, such as tax reporting and census data collection.

Stay up to date

Learn more about MLCON

 

Cherre is the leading real estate data management company, we specialize in accurate address matching for the second use case. Whether you’re an asset manager, portfolio manager, or real estate investor, a building represents the atomic unit of all financial, legal, and operating information. However, real estate data lives in many silos, which makes having a unified view of properties difficult. Address matching is an important step in breaking down data silos in real estate. By joining disparate datasets on address, we can unlock many opportunities for further portfolio analysis.

Data Silos in Real Estate

Real estate data usually fall into the following categories: public, third party, and internal. Public data is collected by governmental agencies and made available publicly, such as land registers. The quality of public data is generally not spectacular and the data update frequency is usually delayed, but it provides the most comprehensive coverage geographically. Don’t be surprised if addresses from public data sources are misaligned and misspelled.

Third party data usually come from data vendors, whose business models focus on extracting information as datasets and monetizing those datasets. These datasets usually have good data quality and are much more timely, but limited in geographical coverage. Addresses from data vendors are usually fairly clean compared to public data, but may not be the same address designation across different vendors. For large commercial buildings with multiple entrances and addresses, this creates an additional layer of complexity.

Lastly, internal data is information that is collected by the information technology (I.T.) systems of property owners and asset managers. These can incorporate various functions, from leasing to financial reporting, and are often set up to represent the business organizational structures and functions. Depending on the governance standards, and data practices, the quality of these datasets can vary and data coverage only encompasses the properties in the owner’s portfolio. Addresses in these systems can vary widely, some systems are designated at the unit-level, while others designate the entire property. These systems also may not standardize addresses inherently, which makes it difficult to match property records across multiple systems.

With all these variations in data quality, coverage, and address formats, we can see the need for having standardized addresses to do basic property-level analysis.

[track_display_in_blog headline="NEW & PRACTICAL ENDEAVORS FOR ML" text="Machine Learning Principles" textcolor="white" backgroundimage="https://mlconference.ai/wp-content/uploads/2024/02/MLC_Global24_Website_Blog1.jpg" icon="https://mlconference.ai/wp-content/uploads/2019/10/MLC_Singapur20_Trackicons_MLPrinciples_250x250_54073_rot_v1.png" btnlink="machine-learning-principles" btntext="Learn more"]

Address Matching Using the Parse-Clean-Match Strategy

In order to match records across multiple datasets, the address parse-clean-match strategy works very well regardless of region. By breaking down addresses into their constituent pieces, we have many more options for associating properties with each other. Many of the approaches for this strategy use simple natural language processing (NLP) techniques.

NEW & PRACTICAL ENDEAVORS FOR ML

Machine Learning Principles

Address Parsing

Before we can associate addresses with each other, we must first parse the address. Address parsing is the process of breaking down each address string into its constituent components. Components in addresses will vary by country.

In the United States and Canada, addresses are generally formatted as the following:

{street_number} {street_name}

{city}, {state_or_province} {postal_code}

{country}

In the United Kingdom, addresses are formatted very similarly as in the U.S. and Canada, with an additional optional locality designation:

{building_number} {street_name}

{locality (optional)}

{city_or_town}

{postal_code}

{country}

 

French addresses vary slightly from U.K. addresses with the order of postal code and city:

{building_number} {street_name}

{postal_code} {city}

{country}

 

German addresses take the changes in French addresses and then swap the order of street name and building number:

{street_name} {building_number} {postal_code} {city} {country}

 

Despite the slight variations across countries’ address formats, addresses generally have the same components, which makes this an easily digestible NLP problem. We can break down the process into the following steps:

  1. Tokenization: Split the address into its constituent words. This step segments the address into manageable units.
  2. Named Entity Recognition (NER): Identify entities within the address, such as street numbers, street names, cities, postal codes, and countries. This involves training or using pre-trained NER models to label the relevant parts of the address.
  3. Sequence Labeling: Use sequence labeling techniques to tag each token with its corresponding entity

Let’s demonstrate address parsing with a sample Python code snippet using the spaCy library. SpaCy is an open-source software library containing many neural network models for NLP functions. SpaCy supports models across 23 different languages and allows for data scientists to train custom models for their own datasets. We will demonstrate address parsing using one of SpaCy’s out-of-the-box models for the address of a historical landmark: David Bowie’s Berlin apartment.

 

import spacy

# Load the NER spaCy model
model = spacy.load("en_core_web_sm")

# Address to be parsed
address = "Hauptstraße 155, 10827 Berlin"

# Tokenize and run NER
doc = model(address)

# Extract address components
street_number = ""
street_name = ""
city = ""
state = ""
postal_code = ""

for token in doc:
    if token.ent_type_ == "GPE":  # Geopolitical Entity (City)
        city = token.text
    elif token.ent_type_ == "LOC":  # Location (State/Province)
        state = token.text
    elif token.ent_type_ == "DATE":  # Postal Code
        postal_code = token.text
    else:
        if token.is_digit:
            street_number = token.text
        else:
            street_name += token.text + " "

# Print the parsed address components
print("Street Number:", street_number)
print("Street Name:", street_name)
print("City:", city)
print("State:", state)
print("Postal Code:", postal_code)

Now that we have a parsed address, we can now clean each address component.

Address Cleaning

Address cleaning is the process of converting parsed address components into a consistent and uniform format. This is particularly important for any public data with misspelled, misformatted, or mistyped addresses. We want to have addresses follow a consistent structure and notation, which will make further data processing much easier.

To standardize addresses, we need to standardize each component, and how the components are joined. This usually entails a lot of string manipulation. There are many open source libraries (such as libpostal) and APIs that can automate this step, but we will demonstrate the basic premise using simple regular expressions in Python.


import pandas as pd
import re

# Sample dataset with tagged address components
data = {
    'Street Name': ['Hauptstraße', 'Schloß Nymphenburg', 'Mozartweg'],
    'Building Number': ['155', '1A', '78'],
    'Postal Code': ['10827', '80638', '54321'],
    'City': ['Berlin', ' München', 'Hamburg'],
}

df = pd.DataFrame(data)

# Functions with typical necessary steps for each address component
# We uppercase all text for easier matching in the next step

def standardize_street_name(street_name):
    # Remove special characters and abbreviations, uppercase names
    standardized_name = re.sub(r'[^\w\s]', '', street_name)
    return standardized_name.upper()

def standardize_building_number(building_number):
    # Remove any non-alphanumeric characters (although exceptions exist)
    standardized_number = re.sub(r'\W', '', building_number)
    return standardized_number

def standardize_postal_code(postal_code):
    # Make sure we have consistent formatting (i.e. leading zeros)
    return postal_code.zfill(5)

def standardize_city(city):
    # Upper case the city, normalize spacing between words
    return ' '.join(word.upper() for word in city.split())

# Apply standardization functions to our DataFrame
df['Street Name'] = df['Street Name'].apply(standardize_street_name)
df['Building Number'] = df['Building Number'].apply(standardize_building_number)
df['Postal Code'] = df['Postal Code'].apply(standardize_postal_code)
df['City'] = df['City'].apply(standardize_city)

# Finally create a standardized full address (without commas)
df[‘Full Address’] = df['Street Name'] + ' ' + df['Building Number'] + ' ' + df['Postal Code'] + ' ' + df['City']

Address Matching

Now that our addresses are standardized into a consistent format, we can finally match addresses from one dataset to address in another dataset. Address matching involves identifying and associating similar or identical addresses from different datasets. When two full addresses match exactly, we can easily associate the two together through a direct string match.

 

When addresses don’t match, we will need to apply fuzzy matching on each address component. Below is an example of how to do fuzzy matching on one of the standardized address components for street names. We can apply the same logic to city and state as well.


from fuzzywuzzy import fuzz

# Sample list of street names from another dataset
street_addresses = [
    "Hauptstraße",
    "Schlossallee",
    "Mozartweg",
    "Bergstraße",
    "Wilhelmstraße",
    "Goetheplatz",
]

# Target address component (we are using street name)
target_street_name = "Hauptstrasse " # Note the different spelling and space 

# Similarity threshold
# Increase this number if too many false positives
# Decrease this number if not enough matches
threshold = 80

# Perform fuzzy matching
matches = []

for address in street_addresses:
    similarity_score = fuzz.partial_ratio(address, target_street_name)
    if similarity_score >= threshold:
        matches.append((address, similarity_score))

matches.sort(key=lambda x: x[1], reverse=True)

# Display matched street name
print("Target Street Name:", target_street_name)
print("Matched Street Names:")
for match in matches:
    print(f"{match[0]} (Similarity: {match[1]}%)")

Up to here, we have solved the problem for properties with the same address identifiers. But what about the large commercial buildings with multiple addresses?

Other Geospatial Identifiers

Addresses are not the only geospatial identifiers in the world of real estate. An address typically refers to the location of a structure or property, often denoting a street name and house number.  There are actually four other geographic identifiers in real estate:

 

  1. A “lot” represents a portion of land designated for specific use or ownership.
  2. A “parcel” extends this notion to a legally defined piece of land with boundaries, often associated with property ownership and taxation.
  3. A “building” encompasses the physical structures erected on these parcels, ranging from residential homes to commercial complexes.
  4. A “unit” is a sub-division within a building, typically used in multi-unit complexes or condominiums. These can be commercial complexes (like office buildings) or residential complexes (like apartments).

 

What this means is that we actually have multiple ways of identifying real estate objects, depending on the specific persona and use case. For example, leasing agents focus on the units within a building for tenants, while asset managers optimize for the financial performance of entire buildings. The nuances of these details are also codified in many real estate software systems (found in internal data), in the databases of governments (found in public data), and across databases of data vendors (found in third party data). In public data, we often encounter lots and parcels. In vendor data, we often find addresses (with or without units). In real estate enterprise resource planning systems, we often find buildings, addresses, units, and everything else in between.

In the case of large commercial properties with multiple addresses, we need to associate various addresses with each physical building. In this case, we can use geocoding and point-in-polygon searches.

Geocoding Addresses

Geocoding is the process of converting addresses into geographic coordinates. The most common form is latitude and longitude. European address geocoding requires a robust understanding of local address formats, postal codes, and administrative regions. Luckily, we have already standardized our addresses into an easily geocodable format.

Many commercial APIs exist for geocoding addresses in bulk, but we will demonstrate geocoding using a popular Python library, Geopy, to geocode addresses.

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my_geocoder")
location = geolocator.geocode("1 Canada Square, London")
print(location.latitude, location.longitude)

 

 

Now that we’ve converted our addresses into latitude and longitude, we can use point-in-polygon searches to associate addresses with buildings.

Point-in-Polygon Search

A point-in-polygon search is a technique to determine if a point is located within the boundaries of a given polygon.

The “point” in a point-in-polygon search refers to a specific geographical location defined by its latitude and longitude coordinates. We have already obtained our points by geocoding our addresses.

The “polygon” is a closed geometric shape with three or more sides, which is usually characterized by a set of vertices (points) connected by edges, forming a closed loop. Building polygons can be downloaded from open source sites like OpenStreetMap or from specific data vendors. The quality and detail of the OpenStreetMap building data may vary, and the accuracy of the point-in-polygon search depends on the precision of the building geometries.

While the concept seems complex, the code for creating this lookup is quite simple. We demonstrate a simplified example using our previous example of 1 Canada Square in London.


import json
from shapely.geometry import shape, Point

# Load the GeoJSON data
with open('building_data.geojson') as geojson_file:
    building_data = json.load(geojson_file)

# Latitude and Longitude of 1 Canada Square in Canary Wharf
lat, lon = 51.5049, 0.0195

# Create a Point geometry for 1 Canada Square
point_1_canada = Point(lon, lat)

# See if point is within any of the polygons
for feature in building_data['features']:
    building_geometry = shape(feature['geometry'])

    if point_1_canada.within(building_geometry):
        print(f"Point is within this building polygon: {feature}")
        break
else:
    print("Point is not within any building polygon in the dataset.")

Using this technique, we can properly identify all addresses associated with this property.

Stay up to date

Learn more about MLCON

 

Summary

Addresses in real life are confusing because they are the physical manifestation of many disparate decisions in city planning throughout the centuries-long life of a city. But using addresses to match across different datasets doesn’t have to be confusing.

Using some basic NLP and geocoding techniques, we can easily associate property-level records across various datasets from different systems. Only through breaking down data silos can we have more holistic views of property behaviors in real estate.

Author Biography

Alyce Ge is data scientist at Cherre, the industry-leading real estate data management and analytics platform. Prior to joining Cherre, Alyce held data science and analytics roles for a variety of technology companies focusing on real estate and business intelligence solutions. Alyce is a Google Cloud-certified machine learning engineer, Google Cloud-certified data engineer, and Triplebyte certified data scientist. She earned her Bachelor of Science in Applied Mathematics from Columbia University in New York.

 

The post Address Matching with NLP in Python appeared first on ML Conference.

]]>
Building a Proof of Concept Chatbot with OpenAIs API, PHP and Pinecone https://mlconference.ai/blog/building-chatbot-openai-api-php-pinecone/ Thu, 04 Jan 2024 08:50:31 +0000 https://mlconference.ai/?p=87014 We leveraged OpenAI's API and PHP to develop a proof-of-concept chatbot that seamlessly integrates with Pinecone, a vector database, to enhance our homepage's search functionality and empower our customers to find answers more effectively. In this article, we’ll explain our steps so far to accomplish this.

The post Building a Proof of Concept Chatbot with OpenAIs API, PHP and Pinecone appeared first on ML Conference.

]]>
[lwptoc]

The team at Three.ie, recognized that customers were having difficulty finding answers to basic questions on our website. To improve the user experience, we decided to utilize AI to create a more efficient and user-friendly experience with a chatbot. Building the chatbot posed several challenges, such as effectively managing the expanding context of each chat session and maintaining high-quality data. This article details our journey from concept to implementation and how we overcome these challenges. Anyone interested in AI, data management, and customer experience improvements should find valuable insights in this article. 

While the chatbot project is still in progress, this article outlines the steps taken and key takeaways from the journey thus far. Stay tuned for subsequent installments and the project’s resolution.

Stay up to date

Learn more about MLCON

 

Identifying the Problem

Hi there, I’m a Senior PHP Developer at Three.ie, a company in the Telecom Industry. Today, I’d like to address the problem of our customers’ challenge with locating answers to basic questions on our website. Information like understanding bill details, how to top up, and more relevant information is available but isn’t easy to find, because it’s tucked away within our forums.

![community-page.png](community-page.png) {.caption}

Community Page {.caption}

The AI Solution

The rise of AI chatbots and the impressive capabilities of GPT-3 presented us with an opportunity to tackle this issue head-on. The idea was simple, why not leverage AI to create a more user-friendly way for customers to find the information they need? Our tool of choice for this task was OpenAI’s API, which we planned to integrate into a chat interface.

To make this chatbot truly useful, it needed access to the right data and that’s where Pinecone came in. Using this vector database, we were able to generate embeddings from the OpenAI API, creating an efficient search system for our chatbot.

This laid groundwork for our proof of concept: a simple yet effective solution for a problem faced by many businesses. Let’s dive deeper into how we brought this concept to life.

![chat-poc.png](chat-poc.png) {.figure}

First POC {.caption}

Challenges and AI’s Role

With our proof of concept in place, the next step was to ensure the chatbot was interacting with the right data and providing the most accurate search results possible. While Pinecone served as an excellent solution for storing data and enabling efficient search during the early stages. In the long term, we realized it might not be the most cost-effective choice for a full-fledged product. 

While Pinecone is an excellent solution easy to integrate and straightforward to use. The free tier only allows you to have a single pod with a single project. We would need to create small indexes but separated into multiple products. The  starting plan costs around $70/month/pod. Aiming to keep the project within budget was a priority, and we knew that continuing with Pinecone would soon become difficult, since we wanted to split our data.

The initial data used in the chatbot was extracted directly from our website and stored in separate files. This setup allowed us to create embeddings and feed them to our chatbot. To streamline this process, we developed a ‘data import’ script. The script works by taking a file, adding it to the database, creating an embedding using the content, and finally it stores the embedding in Pinecone, using the database ID as a reference.

Unfortunately, we faced a hurdle with the structure and quality of our data. Some of the extracted data was not well-structured, which led to issues with the chatbot’s responses. To address this challenge, we once again turned to AI, this time to enhance our data quality. Employing the GPT-3.5 model, we optimized the content of each file before generating the vector. By doing so, we were able to harness the power of AI not only for answering customer queries but also for improving the quality of our data.

As the process grew more complex, the need for more efficient automation became evident. To reduce the time taken by the data import script, we incorporated queues and utilized parallel processing. This allowed us to manage the increasingly complex data import process more effectively and keep the system efficient.

![data-ingress-flow.png](data-ingress-flow.png) {.figure}

Data Ingress Flow {.caption}

Data Integration

With our data stored and the API ready to handle chats, the next step was to bring everything together. The initial plan was to use Pinecone to retrieve the top three results matching the customer’s query. For instance, if a user inquired, “How can I top up by text message?”, we would generate an embedding for this question and then use Pinecone to fetch the three most relevant records. These matches were determined based on cosine similarity, ensuring the retrieved information was highly pertinent to the user’s query.

Cosine similarity is a key part of our search algorithm. Think of it like this: imagine each question and answer is a point in space. Cosine similarity measures how close these points are to each other. For example, if a user asks, “How do I top up my account?”, and we have a database entry that says, “Top up your account by going to Settings”, these two are closely related and would have a high cosine similarity score, close to 1. On the other hand, if the database entry says something about “changing profile picture”, the score would be low, closer to 0, indicating they’re not related.

This way, we can quickly find the best matches to a customer’s query, making the chatbot’s answers more relevant and useful.

For those who understand a bit of math, this is how cosine similarity works. You represent each sentence as a vector in multi-dimensional space. The cosine similarity is calculated as the dot product of two vectors divided by the product of their magnitudes. Mathematically, it looks like this:

![cosine-formula.png](cosine-formula.png) {.figure}

Cosine Similarity  {.caption}

This formula gives us a value between -1 and 1. A value close to 1 means the sentences are very similar, and a value close to -1 means they are dissimilar. Zero means they are not related.

![simplified-workflow.png](simplified-workflow.png) {.figure}

Simplified Workflow {.caption}

Next, we used these top three records as a context in the OpenAI chat API. We merged everything together: the chat history, Three’s base prompt instructions, the current question, and the top three contexts.

![vector-comparison-logic.png](vector-comparison-logic.png) {.figure}

Vector Comparison Logic {.caption}

Initially, this approach was fantastic and provided accurate and informative answers.  However, there was a looming issue, as we were using OpenAI’s first 4k model, and the entire context was sent for every request. Furthermore, the context was treated as “history” for the following message, meaning that each new message added the boilerplate text plus three more contexts. As you can imagine, this led to rapid growth of the context.

To manage this complexity, we decided to keep track of the context. We started storing each message from the user (along with the chatbot’s responses) and the selected contexts. As a result, each chat session now had two separate artifacts: messages and contexts. This ensured that if a user’s next message related to the same context, it wouldn’t be duplicated and we could keep track of what had been used before.

Progress so Far

To put it simply, our system starts with a manual input of questions and answers (Q&A)  which is then enhanced by our AI.  To ensure efficient data handling we use queues to store data quickly. In the chat, when a user asks a question, we add a “context group” that includes all the data we got from Pinecone. To maintain system organization and efficiency, older messages are removed from longer chats.

 

 

 

![chat-workflow.png](chat-workflow.png) {.figure}

 

Chat Workflow {.caption}

![chat-workflow.png](chat-workflow.png) {.figure}

Chat Workflow {.caption}

Automating Data Collection

Acknowledging the manual input as a bottleneck, we set out to streamline the process through automation. I started by trying out scrappers using different languages like PHP and Python. However, to be honest, none of them were good enough and we faced issues with both speed and accuracy. While this component of the system is still in its formative stages, we’re committed to overcoming this challenge. We are currently evaluating the possibility of utilizing an external service to manage this task, aiming to streamline and simplify the overall process.

While working towards data automation, I dedicated my efforts to improving our existing system. I developed a backend admin page, replacing the manual data input process with a streamlined interface. This admin panel provides additional control over the chatbot, enabling adjustments to parameters like the ‘temperature’ setting and initial prompt, further optimizing the customer experience.  So, although we have challenges ahead, we’re making improvements every step of the way.

 

#

RETHINK YOUR APPROACHES

Business & Strategy

A Week of Intense Progress

The week was a whirlwind of AI-fueled excitement, and we eagerly jumped in. After sending an email to my department, the feedback came flooding in. Our team was truly collaborative: a skilled designer supplied Figma templates and a copywriter crafted the app’s text. We even had volunteers who stress-tested our tool with unconventional prompts. It felt like everything was coming together quickly.

However, this initial enthusiasm came to a screeching halt due to security concerns becoming the new focus. A recent data breach at OpenAI, unrelated to our project, shifted our priorities. Though frustrating, it necessitated a comprehensive security check of all projects, causing a temporary halt to our progress.

The breach occurred during a specific nine-hour window on March 20, between 1 a.m. and 10 a.m. Pacific Time. OpenAI confirmed that around 1.2% of active ChatGPT Plus subscribers had their data compromised during this period. They were using the Redis client library (redis-py), which allowed them to maintain a pool of connections between their Python server and Redis. This meant they didn’t need to query the main database for every request, but it became a point of vulnerability.

In the end, it’s good to put security at the forefront and not treat it as an afterthought, especially in the wake of a data breach. While the delay is frustrating, we all agree that making sure our project is secure is worth the wait. Now, our primary focus is to meet all security guidelines before progressing further.

The Move to Microsoft Azure

In just one week, the board made a big decision to move from OpenAI and Pinecone to Microsoft’s Azure.  At first glance, it looks like a smart choice as Azure is known for solid security but the plug-and-play aspect can be difficult.

What stood out in Azure was having our own dedicated GPT-3.5 Turbo model. Unlike OpenAI, where the general GPT-3.5 model is shared, Azure gives you a model exclusive to your company. You can train it, fine-tune it, all in a very secure environment, a big plus for us.

The hard part? Setting up the data storage was not an easy feat. Everything in Azure is different from what we were used to. So, we are now investing time to understand these new services, a learning curve we’re currently climbing.

Azure Cognitive Search

In our move to Microsoft Azure, security was a key focus. We looked into using Azure Cognitive Search for our data management. Azure offers advanced security features like end-to-end encryption and multi-factor authentication. This aligns well with our company’s heightened focus on safeguarding customer data.

The idea was simple: you upload your data into Azure, create an index, and then you can search it just like a database. You define what’s called “fields” for indexing and then Azure Cognitive Search organizes it for quick searching. But the truth is, setting it up wasn’t easy because creating the indexes was more complex than we thought. So, we didn’t end up using it in our project. It’s a powerful tool, but difficult to implement. This was the idea:

![azure-structure.png](azure-structure.png) {.figure}

Azure Structure {.caption}

The Long Road of Discovery

So, what did we really learn from this whole experience? First, improving the customer journey isn’t a walk in the park; it’s a full-on challenge. AI brings a lot of potential to the table, but it’s not a magic fix. We’re still deep in the process of getting this application ready for the public, and it’s still a work in progress.

One of the most crucial points for me has been the importance of clear objectives. Knowing exactly what you aim to achieve can steer the project in the right direction from the start. Don’t wait around — get a proof of concept (POC) out as fast as you can. Test the raw idea before diving into complexities.

Also, don’t try to solve issues that haven’t cropped up yet, this is something we learned the hard way. Transitioning to Azure seemed like a move towards a more robust infrastructure. But it ended up complicating things and setting us back significantly. The added layers of complexity postponed our timeline for future releases. Sometimes, ‘better’ solutions can end up being obstacles if they divert you from your main goal.

 

Stay up to date

Learn more about MLCON

 

In summary, this project has been a rollercoaster of both challenges and valuable lessons learned. We’re optimistic about the future, but caution has become our new mantra. We’ve come to understand that a straightforward approach is often the most effective, and introducing unnecessary complexities can lead to unforeseen problems. With these lessons in hand, we are in the process of recalibrating our strategies and setting our sights on the next development phase.

Although we have encountered setbacks, particularly in the area of security, these experiences have better equipped us for the journey ahead. The ultimate goal remains unchanged: to provide an exceptional experience for our customers. We are fully committed to achieving this goal, one carefully considered step at a time.

Stay tuned for further updates as we continue to make progress. This project is far from complete, and we are excited to share the next chapters of our story with you.

The post Building a Proof of Concept Chatbot with OpenAIs API, PHP and Pinecone appeared first on ML Conference.

]]>
Talk of the AI Town: The Uprising of Collaborative Agents https://mlconference.ai/blog/the-uprising-of-collaborative-agents/ Mon, 04 Dec 2023 08:51:12 +0000 https://mlconference.ai/?p=86940 This article aims to delve into the capabilities and limitations of OpenAI’s models, examine the functionalities of agents like Baby AGI, and discuss potential future advancements in this rapidly evolving field.

The post Talk of the AI Town: The Uprising of Collaborative Agents appeared first on ML Conference.

]]>
Introduction:

Open AI’s release of ChatGPT and GPT-4 has sparked a Cambrian explosion of new products and projects, shifting the landscape of artificial intelligence significantly. These models have both quantitatively and qualitatively advanced beyond their language modeling predecessors. Similarly to how the deep learning model called AlexNet significantly improved on the ImageNet benchmark for computer vision back in 2012. More importantly, these models exhibit a capability, the ability to perform many different tasks such as machine translation or when given a few examples of the task: few-shot learning. Unlike humans, most language models require large supervised datasets before they can be expected to perform a specific task. This plasticity of “intelligence” that GPT-3 was capable of opened up new possibilities in the field of AI. It is a system capable of problem-solving which enables the implementation of many long-imagined AI applications.

Even the successor model to GPT-3, GPT-4, is still just a language model at the end of the day and still quite far from Artificial General Intelligence. In general, the ”prompt to single response“ formulation of language models is much too limited to perform complex multi-step tasks. For an AI to be generally intelligent, it must seek out information, remember, learn, and interact with the world in steps. There have recently been many projects on GitHub that have essentially created self-talking loops and prompting structures on top of OpenAI’s APIs for the GPT-3.5 and GPT-4. These are models that form a system that can plan, generate code, debug, and execute programs. These systems in theory have the potential to be much more general and approach what many people think of when they hear “AI”.

Stay up to date

Learn more about MLCON

 

The concept of systems that intelligently interact in their environment is not completely new, and has been heavily researched in a field of AI called reinforcement learning. The influential textbook “Artificial Intelligence: A Modern Approach” by Russell and Norvig covers many different structures for how to build intelligent “agents” – entities capable of perceiving their environment and acting to achieve specific objectives. While I don’t believe Russel and Norvig imagined that these agent structures would be mostly language model-based. They did describe how they would perform their various steps with plain English sentences and questions as they were mostly for illustrative purposes. Since we now have language models capable of functionally understanding the steps and questions they use, it is much easier to implement many of these structures as real programs today.

While I haven’t seen any projects using prompts inspired by the AI: AMA textbook for their agents, the open-source community has been leveraging GPT 3.5 and GPT-4 to develop agent or agent-like programs using similar ideas. Examples of such programs include Baby AGI, AutoGPT, and MetaGPT. While these agents are not designed to interact with a game or simulated environment like traditional RL agents, They do typically generate code, detect errors, and alter their behavior accordingly.  So in a sense, they are interacting with and perceiving the “environment” of programming, and are significantly more capable than anything before. 

This article aims to delve into the capabilities and limitations of OpenAI’s models, examine the functionalities of agents like Baby AGI, and discuss potential future advancements in this rapidly evolving field.

Understanding the Capabilities of GPT-3.5 and GPT-4:

GPT-3.5 and GPT-4 are important milestones not only in natural language processing but also in the field of AI. Their ability to generate contextually appropriate, coherent responses to a myriad of prompts has reshaped our expectations of what a language model can achieve. However, to fully appreciate their potential and constraints, it’s necessary to delve deeper into their implementation.

One significant challenge these models face is the problem of hallucination. Hallucination refers to instances where a language model generates outputs that seem plausible but are entirely fabricated or not grounded in the input data. Hallucination is a challenge in Chat GPT as these models are fundamentally outputting the probability distribution of the next word, and that probability distribution is sampled in a weighted random fashion. This leads to the generation of responses that are statistically likely but not necessarily accurate or truthful. The limitation of relying on maximum likelihood sampling in language models is that it prioritizes coherence over veracity, leading to creative but potentially misleading outputs. This essentially limits the ability of the model to reason and make logical deductions when the output pattern is very unlikely. While they can exhibit some degree of reasoning and common sense, they don’t yet match human-level reasoning capabilities. This is because they are limited to statistical patterns present in their training data, rather than a thorough understanding of the underlying concepts.

To quantitatively assess these models’ reasoning capabilities, researchers use a range of tasks including logical puzzles, mathematical operations, and exercises that require understanding causal relationships. [https://arxiv.org/abs/2304.03439] While OpenAI does boast about GPT-4’s ability to pass many aptitude tests including the Bar exam. The model struggles to show the same capabilities with out-of-distribution logical puzzles, which can be expected when you consider the statistical nature of the models.

To be fair to these models, the role of language in human reasoning is underappreciated by the general public. Humans also use language generation as a form of reasoning, making connections, and drawing inferences through linguistic patterns. If the brain area that is responsible for language is damaged, research has shown that reasoning is impaired: [https://www.frontiersin.org/articles/10.3389/fpsyg.2015.01523/full]. Therefore, just because language models are mostly statistical next-word generators, we shouldn’t disregard their reasoning capabilities entirely. While it has limitations, it is something that can be taken advantage of in systems. A genuine potential of language models exists to replicate certain reasoning processes and this theory of the link between reasoning and language explains their capabilities.

While GPT-3.5 and GPT-4 have made significant strides in natural language processing, there is still work to do. Ongoing research is focused on enhancing these abilities and tackling these challenges. It is important for systems today to work around these limitations and take advantage of language models’ strengths as we explore their potential applications and continue to push AI’s boundaries.

THE PECULIARITIES OF ML SYSTEMS

Machine Learning Advanced Developments

Exploring Collaborative Agent Systems: BabyAGI, HuggingFace, and MetaGPT:

BabyAGI, created by Yohei Nakajima, serves as an interesting proof-of-concept in the domain of agents. The main idea behind it consists of creating three “sub-agents”: the Task Creator, Task Prioritizer, and Task Executor.  By making the sub-agents have specific roles and collaborating by way of a task management system, BabyAGI can reason better and achieve many more tasks than a single prompt alone, hence creating the ”collaborative agent system” concept.  While I do not believe the collaborative agent strategy BabyAGI implements is a completely novel concept.  It is one of the early successful experiments built on top of GPT-4 with code we can easily understand. In BabyAGI, the Task Creator initiates the process by setting the goal and formulating the task list. The Task Prioritizer then rearranges the tasks based on their significance in achieving the goal, and finally, the Task Executor carries out the tasks one by one. The output of each task is stored in a vector database, which can look up data by similarity, for future reference serving as a type of memory for the Task Executor.

Fig 1. A high-level description of the BabyAGI framework

HuggingFace’s Transformers Agents, is another substantial agent framework. It has gained popularity for its ability to leverage the library of pre-trained models on HuggingFace. By leveraging the StarCoder model, the Transformers Agent can string together many different models available on HuggingFace to accomplish various tasks. It can solve a range of visual, audio, and natural language processing functionalities. However, HuggingFace agents lack error recovery mechanisms, often requiring external intervention to troubleshoot issues and continue with the task.

Fig 2. Example of HuggingFace’s Transformers Agent

MetaGPT adopts a unique approach by emulating a virtual company where different agents play specific roles. Each virtual agent within MetaGPT has its own thoughts, allowing them to contribute their perspectives and expertise to the collaborative process. This approach recognizes the collective intelligence of human communities and seeks to replicate it in AI systems.

 

Fig. 3. The Software Company structure of MetaGPT

BabyAGI, Transformers, and MetaGPT, with their own strengths and limitations, collectively exemplify the evolution of collaborative agent systems. Although many feel that their capabilities are underwhelming, by integrating the principles of intelligent agent frameworks with advanced language models, their authors have made significant progress in creating AI systems that can collaborate, reason, and solve complex tasks.

 

A Deeper Dive into the Original BabyAGI:

BabyAGI presents an intuitive collaborative agent system operating within a loop, comprising three key agents: the Task Creator, Task Prioritizer, and Task Executor, each playing a unique role in the collaborative process. Let’s examine the prompts of each sub-agent.

Fig.4 Original task creator agent prompt

The process initiates with the Task Creator, responsible for defining the goal and initiating the task list. This agent in essence sets the direction for the collaborative system. It generates a list of tasks, providing a roadmap outlining the essential steps for goal attainment.

Fig 5. Original task prioritizer agent prompt

Once the tasks are established, they are passed on to the Task Prioritizer. This agent reorders tasks based on their importance for goal attainment, optimizing the system’s approach by focusing on the most critical steps. Ensuring the system maintains efficiency by directing its attention to the most consequential tasks.

Fig 6. Original task executor agent prompt

 

The Task Executor then takes over following task prioritization. This agent executes tasks one by one according to the prioritized order. As you may notice in the prompt, it is only just hallucinating and performing the tasks. The output of this prompt, the result of completing the task, is appended to the task object being completed and stored in a vector database.

An intriguing aspect of BabyAGI is the incorporation of a vector database, where the task object, including the Task Executor’s output, is stored. The reason this is important is that language models are static. They can’t learn from anything other than the prompt. Using a vector database to look up similar tasks allows the system to maintain a type of memory of its experiences, both problems and solutions, which helps improve the agent’s performance when confronted with similar tasks in the future.

Vector databases work by efficiently indexing the internal state of the language model.  For OpenAI’s text-embedding-ada-002 model, this internal state is a vector of length 1536. It is trained to produce similar vectors for semantically similar inputs, even if they use completely different words. In the BabyAGI system, the ability to look up similar tasks and append them to the context of the prompt is used as a way for the model to have memories of its previous experiences performing similar tasks.

As mentioned above, the vanilla version of BabyAGI operates predominantly in a hallucinating mode as it lacks external interaction. Additional tools, such as functions for saving text, interacting with databases, executing Python scripts, or even searching the web, were later integrated into the system, extending BabyAGI’s capabilities.

While BabyAGI is capable of breaking down large goals into small tasks and essentially working forever on them, it still has many limitations. Unless the task creator explicitly adds a check if a task is done, the system will tend to generate an endless stream of tasks, even after achieving the initial goal. Moreover, BabyAGI executes tasks sequentially, which slows it down significantly. Future iterations of BabyAGI, such as BabyDeerAGI, have implemented features to address these limitations, exploring parallel execution capabilities for independent tasks and more tools.

In essence, BabyAGI serves as a great introduction and starting point in the realm of collaborative agent systems. Its architecture enables planning, prioritization, and execution. It lays the groundwork for many other developers to create new systems to address the limitations and expand what’s possible.

Stay up to date

Learn more about MLCON

 

The Rise of Role-Playing Collaborative Agent Systems:

 

While not every project claims BabyAGI as its inspiration, many similar multi-role agent systems exist in projects such as MetaGPT and AutoGen. These projects are bringing a new wave of innovation into this space. Much like how BabyAGI used multiple “Agents” to manage tasks, these frameworks go a step further. This is by trying to make many different agents with distinct roles that work together to accomplish the goal. In MetaGPT the agents are working together inside a virtual company, complete with a CEO, CTO, designers, testers, and programmers. People experimenting with this framework today can get this virtual company to create various types of simple utility software and simple games successfully. Though I would say they are rarely visually pleasing.

AutoGen is going about things slightly differently but in a similar vein to the framework I’ve been working on over at my company Xpress AI. 

AutoGen has a user proxy agent that interacts with the user and can create tasks for one or more assistant agents. The tool is more of a library than a standalone project so you will have to create a configuration of user proxies and assistants to accomplish the tasks you may have. I think that this is the future of how we will interact with agents. We will need those many conversation threads to interact with each other to expand the capabilities of the base model.

Why Collaborative Agents Systems are more effective

A language model is intelligent enough only by necessity. To predict the next work accurately, it has had to learn how to be rudimentarily intelligent. There is only a fixed amount of computation that can happen inside the various transformer layers inside the particular model. By giving the model a different starting point, it can put more computation and therefore thinking into its original response. Giving different roles to these specific agents helps them get out of the specific rut of wanting to be self-consistent. You can imagine how we can possibly go to an even larger scale on this idea to create AI systems closer to AGI.

Even in human society, it can be argued that we currently have various Superhuman intelligences in place. The stock market, for example, can allocate resources better than any one person could ever hope to. Take the scientific community, the paper review and publishing process are also helping humanity reach new levels of intelligence.

Even these systems need time to think or process the information. LLMs unfortunately only have a fixed amount of processing power. The future AI systems will have to include ways for the agent to think for itself, similar to how they can leverage functions today, but internally to give them the ability to apply an arbitrary amount of computation to achieve a task. Roles are one way to approach this, but it would be more effective if each agent in these simulated virtual organizations were able to individually apply arbitrary amounts of computation to their responses. Also, a system where each agent could learn from their mistakes, similar to humans, is required to really escape the cognitive limitations of the underlying language model. Without these capabilities, which have been known to the AI community as fundamental capabilities for a long time, we can’t reasonably expect these systems to be the foundation of an AGI.

Addressing Limitations and Envisioning Future Prospects:

Collaborative agent systems exhibit promising potential. However, they are still far from being truly general intelligence. Learning about these limitations can give clues to possible solutions that can pave the way for more sophisticated and capable systems. 

One limitation of BabyAGI in particular lies in the lack of active perception. The Executor Agent in BabyAGI nearly always assumes that the system is in the perfect state to accomplish the task, or that the previous task was completed successfully.  Since the world is not perfect it often fails to achieve the task. BabyAGI is not alone in this problem. The lack of perception greatly affects the practicality and efficacy of these systems for real-world tasks.

Error recovery mechanisms in these systems also need improvement. While a tool-enabled version of BabyAGI does often generate error-fixing tasks, the Task Prioritizer’s prioritization may not always be optimal. Causing the executor to miss the chance to easily fix the issue. Advanced prioritization algorithms, taking into account error severity and its impact on goal attainment are being worked on. The latest versions of BabyAGI have task dependency tracking which does help, but I don’t believe we have fully fixed this issue yet.

Task completion is another challenge in collaborative agent systems like BabyAGI. A robust review mechanism assessing the state of task completion and adjusting the task list accordingly could address the issue of endless task generation, enhancing the overall efficiency of the system. Since MetaGPT has managers that check the results of the individual contributors, they are more likely to detect that the task has been completed, although this way of working is quite inefficient.

Parallel execution of independent tasks offers another potential area of improvement. Leveraging multi-threading or distributed computing techniques could lead to significant speedups and more efficient resource utilization. BabyDeerAGI specifically uses dependency tracking to create independent threads of executors, while MetaGPT uses the company structure to perform work in parallel. Both are interesting approaches to the problem and, perhaps, the two approaches could be combined. 

The lack of the ability to learn from experience is another fundamental limitation. As far as I know, none of the current systems utilize fine-tuning of LLMs to form long-term memories. In theory, it isn’t a complicated process but in practice gathering the data necessary, in a way that doesn’t fundamentally make the model worse, is an open problem. Training models on model-generated outputs or training on already encountered data seems to cause the models to overfit quickly; often requiring careful hand-tuning of the training hyper-parameters. To make agents that can learn from experience, a sophisticated algorithm is required, not just to perform the training, but also to gather the correct data. This process is probably similar to the limbic system in our brains, for example.

While the current crop of agent systems has various limitations, there are still many open opportunities to address them with software and structure to create even more advanced applications. Enhancing active task execution, improving error recovery mechanisms, implementing efficient review mechanisms, and exploring parallel execution capabilities can boost the overall performance of these systems. 

Conclusion:

The emergence of open-source collaborative agent systems is creating a transformative era in AI. We are very close to a world where humans and AI can collaborate to solve the world’s problems. Similar to the idea of how companies or the market formed by many independent rational actors form a superhuman intelligence, the development of collaborative agent systems that have many independent sub-agents that communicate, collaborate, and reason together seems to enhance the capabilities of the language model alone to accomplish tasks, paving the way for the creation of more versatile applications.

Looking ahead, I think AI powered by collaborative agent systems has the potential to revolutionize industries such as healthcare, finance, education, and more. However, we must not forget the important sentence from an IBM manual: “A computer can never be held accountable”. In a future where we have human-level AIs that we can work hand-in-hand to tackle complex problems, it becomes increasingly important to ensure accountability measures are in place. The responsibility and accountability for their actions still ultimately lie with the humans who design, deploy, and use them. 

This journey towards AGI is thrilling, and collaborative agent systems play an integral role in this transformative era of artificial intelligence.

 

The post Talk of the AI Town: The Uprising of Collaborative Agents appeared first on ML Conference.

]]>
AI is a Human Endeavor https://mlconference.ai/blog/ai-human-endeavor/ Tue, 29 Aug 2023 09:00:46 +0000 https://mlconference.ai/?p=86755 As AI advances, calls for regulation are increasing. But viable regulatory policies will require a broad public debate. We spoke with Mhairi Aitken, Ethics Fellow at the British Alan Turing Institute, about the current discussions on risks, AI regulation, and visions of shiny robots with glowing brains.

The post AI is a Human Endeavor appeared first on ML Conference.

]]>
devmio: Could you please introduce yourself to our readers and a bit about why you are concerned with machine learning and artificial intelligence?

Mhairi Aitken: My name is Mhairi Aitken, I’m an ethics fellow at the Alan Turing Institute. The Alan Turing Institute is the UK’s National Institute for AI and data science and as an ethics fellow, I look at the ethical and social considerations around AI and data science. I work in the public policy program where our work is mostly focused on uses of AI within public policy and government, but also in relation to policy and government responses to AI as in regulation of AI and data science. 

devmio: For our readers who may be unfamiliar with the Alan Turing Institute, can you tell us a little bit about it? 

Mhairi Aitken: The national institute is publicly funded, but our research is independent. We have three main aims of our work. First, advancing world-class research and applying that to national and global challenges. 

Second, building skills for the future. That’s both going to technical skills and training the next generation of AI and data scientists, but also to developing skills around ethical and social considerations and regulation. 

Third, part of our mission is to drive an informed public conversation. We have a role in engaging with the public, as well as policymakers and a wide range of stakeholders to ensure that there’s an informed public conversation around AI and the complex issues surrounding it and clear up some misunderstandings often present in public conversations around AI.

NEW & PRACTICAL ENDEAVORS FOR ML

Machine Learning Principles

devmio: In your talk at Devoxx UK, you said that it’s important to demystify AI. What exactly is the myth surrounding AI?

Mhairi Aitken: There’s quite a few different misconceptions. Maybe one of the biggest ones is that AI is something that is technically super complex and not something everyday people can engage with. That’s a really important myth to debunk because often there’s a sense that AI isn’t something people can easily engage with or discuss. 

As AI is already embedded in all our individual lives and is having impacts across society, it’s really important that people feel able to engage in those discussions and that they have a say and influence the way AI shapes their lives. 

On the other hand, there are unfounded and unrealistic fears about what risks it might bring into our lives. There’s lots of imagery around AI that gets repeated, of shiny robots with glowing brains and this idea of superintelligence. These widespread narratives around AI come back again and again, and are very present within the public discourse. 

That’s a distraction and it creates challenges for public engagement and having an informed public discussion to feed into policy and regulation. We need to focus on the realities of what AI is and in most cases, it’s a lot less exciting than superintelligence and shiny robots.

devmio: You said that AI is not just a complex technical topic, but something we are all concerned with. However, many of these misconceptions stem from the problem that the core technology is often not well understood by laymen. Isn’t that a problem?

Mhairi Aitken: Most of the players in big tech are pushing this idea of AI being something about superintelligence, something far-fetched, that’s closing down the discussions. It’s creating that sense that AI is something more difficult to explain, or more difficult to grasp, then it actually is, in order to have an informed conversation. We need to do a lot more work in that space and give people the confidence to engage in meaningful discussions around AI. 

And yes, it’s important to enable enough of a technical understanding of what these systems are, how they’re built and how they operate. But it’s also important to note that people don’t need to have a technical understanding to engage in discussions around how systems are designed, how they’re developed, in what contexts they’re deployed, or what purposes they are used for. 

Those are political, economic, and cultural decisions made by people and organizations. Those are all things that should be open for public debate. That’s why, when we talk about AI, it’s really important to talk about it as a human endeavor. It’s something which is created by people and is shaped by decisions of organizations and people. 

That’s important because it means that everyone’s voices need to be heard within those discussions, particularly communities who are potentially impacted by these technologies. But if we present it as something very complex which requires a deep technical understanding to engage with, then we are shutting down those discussions. That’s a real worry for me.

Stay tuned!
Learn more about ML Conference:

devmio: If the topic of superintelligence as an existential threat to humanity is a distraction from the real problems of AI that is being pushed by Big Tech, then what are those problems?

Mhairi Aitken: A lot of the AI systems that we interact with on a daily basis are opaque systems that make decisions about people’s lives, in everything from policing to immigration, social care and housing, or algorithms that make decisions about what information we see on social media. 

Those systems rely on or are trained on data sets, which contain biases. This often leads to biased or discriminatory outcomes and impacts. Because the systems are often not transparent in the ways that they’re used or have been developed, it makes it very difficult for people to contest decisions that are having meaningful impacts on their lives. 

In particular, marginalized communities, who are typically underrepresented within development processes, are most likely to be impacted by the ways these systems are deployed. This is a really, really big concern. We need to find ways of increasing diversity and inclusiveness within design and development processes to ensure that a diverse set of voices and experiences are reflected, so that we’re not just identifying harms when they occur in the real world, but anticipating them earlier in the process and finding ways to mitigate and address them.

At the moment, there are also particular concerns and risks that we really need to focus on concerning generative AI. For example, misinformation, disinformation, and the ways generative AI can lead to increasingly realistic images, as well as deep fake videos and synthetic voices or clone voices. These technologies are leading to the creation of very convincing fake content, raising real concerns for potential spread of misinformation that might impact political processes. 

It’s not just becoming increasingly hard to spot that something is fake. It’s also a widespread concern that it is increasingly difficult to know what is real. But we need to have access to trustworthy and accurate information about the world for a functioning democracy. When we start to question everything as potentially fake, it’s a very dangerous place in terms of interference in political and democratic processes.

I could go on, but there are very real concrete examples of how AI is already having presented harms today and they disproportionately impact marginalized groups. A lot of the narratives of existential risk we currently see are coming from Big Tech and are mostly being pushed by privileged or affluent people. When we think about AI or how we address the risks around AI, it’s important that we shouldn’t center around the voices of Big Tech, but the voices of impacted communities. 

devmio: A lot of misinformation is already on the internet and social media without the addition of AI and generative AI. So potential misuse on a large scale is of a big concern for democracies. How can western societies regulate AI, either on an EU-level or a global scale? How do we regulate a new technology while also allowing for innovation?

Mhairi Aitken: There definitely needs to be clear and effective regulation around AI. But I think that the dichotomy between regulation and innovation is false. For a start, we don’t just want any innovation. We want responsible and safe innovation that leads to societal benefits. Regulation is needed to make sure that happens and that we’re not allowing or enabling dangerous and harmful innovation practices.

Also, regulation provides the conditions for certainty and confidence for innovation. The industry needs to have confidence in the regulatory environment and needs to know what the limitations and boundaries are. I don’t think that regulation should be seen as a barrier to innovation. It provides the guardrails, clarity, and certainty that is needed. 

Regulation is really important and there are some big conversations around that at the moment. The EU AI Act is likely to set an international standard of what regulation will look like in this regard. It’s going to have a big impact in the same way that GDPR had with data protection. Soon, any organization that’s operating in the EU, or that may export an AI product to the EU, is going to have to comply with the EU AI Act. 

We need international collaboration on this.

devmio: The EU AI Act was drafted before ChatGPT and other LLMs became publicly available. Is the regulation still up to date? How is an institution like the EU supposed to catch up to the incredible advancements in AI?

Mhairi Aitken: It’s interesting that over the last few months, developments with large language models have forced us to reconsider some elements of what was being proposed and developed, particularly around general purpose AI. Foundation models like large language models that aren’t designed for a particular purpose can be deployed in a wide range of contexts. Different AI models or systems are built on top of them as a foundation.

That’s posed some specific challenges around regulation. Some of this is still being worked out. There are big challenges for the EU, not just in relation to foundation models. AI encompasses so many things and is used across all industries, across all sectors in all contexts, which poses a big challenge. 

The UK-approach to regulation of AI has been quite different to that proposed in the EU: The UK set out a pro-innovation approach to regulation, which was a set of principles intended to equip existing UK regulatory bodies to grapple the challenges of AI. It recognized that AI is already being used across all industries and sectors. That means that all regulators have to deal with how to regulate AI in their sectors. 

In recent weeks and months in the UK we have seen an increasing emphasis on regulation and AI, and increased attention at the importance of developing effective regulation. But I have some concerns that this change of emphasis has, at least in part, come from Big Tech. We’ve seen this in the likes of Sam Altman on his tour of Europe, speaking to European regulators and governments. Many voices talking about the existential risk AI poses come from Silicon Valley. This is now beginning to have an influence on policy discussions and regulatory discussions, which is worrying. It’s a positive thing that we’re having these discussions about regulation and AI, but we need those discussions to focus on real risks and impacts. 

devmio: The idea of existential threat posed by AI often comes from a vision of self-conscious AI, something often called strong AI or artificial general intelligence (AGI). Do you believe AGI will ever be possible?

Mhairi Aitken: No, I don’t believe AGI will ever be possible. And I don’t believe the claims being made about an existential threat. These claims are a deliberate distraction from the discussions of regulation of current AI practices. The claim is that the technology and AI itself poses a risk to humanity and therefore, needs regulation. At the same time, companies and organizations are making decisions about that technology. That’s why I think this narrative is being pushed, but it’s never going to be real. AGI belongs in the realm of sci-fi. 

There are huge advancements in AI technologies and what they’re going to be capable of doing in the near future is going to be increasingly significant. But they are still always technologies that do what they are programmed to do. We can program them to do an increasing number of things and they do it with an increasing degree of sophistication and complexity. But they’re still only doing what they’re programmed for, and I don’t think that will ever change. 

I don’t think it will ever happen that AI will develop its own intentions, have consciousness, or a sense of itself. That is not going to emerge or be developed in what is essentially a computer program. We’re not going to get to consciousness through statistics. There’s a leap there and I have never seen any compelling evidence to suggest that could ever happen.

We’re creating systems that act as though they have consciousness or intelligence, but this is an illusion. It fuels a narrative that’s convenient for Big Tech because it deflects away from their responsibility and suggests that this isn’t about a company’s decisions.

devmio: Sometimes it feels like the discussions around AI are a big playing field for societal discourse in general. It is a playing field for a modern society to discuss its general state, its relation to technology, its conception of what it means to be human, and even metaphysical questions about God-like AI. Is there some truth to this?

Mhairi Aitken: There’s lots of discussions about potential future scenarios and visions of the future. I think it’s incredibly healthy to have discussions about what kind of future we want and about the future of humanity. To a certain extent this is positive.

But the focus has to be on the decisions we make as societies, and not hypothetical far-fetched scenarios of super intelligent computers. These conversations that focus on future risks have a large platform. But we are only giving a voice to Big Tech players and very privileged voices with significant influence in these discussions. Whereas, these discussions should happen at a much wider societal level. 

The conversations we should be having are about how we harness the value of AI as a set of tools and technologies. How do we benefit from them to maximize value across society and minimize the risks of technologies? We should be having conversations with civil society groups and charities, members of the public, and particularly with impacted communities and marginalized communities.

We should be asking what their issues are, how AI can find creative solutions, and where we could use these technologies to bring benefit and advocate for the needs of community groups, rather than being driven by commercial for-profit business models. These models are creating new dependencies on exploitative data practices without really considering if this is the future we want.

devmio: In the Alan Turing Institute’s strategy document, it says that the institute will make great leaps in AI development in order to change the world for the better. How can AI improve the world?

Mhairi Aitken: There are lots of brilliant things that AI can do in the area of medicine and healthcare that would have positive impacts. For example, there are real opportunities for AI to be used in developing diagnostic tools. If the tools are designed responsibly and for inclusive practices, they can have a lot of benefits. There’s also opportunities for AI in relation to the environment and sustainability in terms of modeling or monitoring environments and finding creative solutions to problems.

One area that really excites me is where AI can be used by communities, civil society groups, and charities. At the moment, there’s an emphasis on large language models. But actually, when we think about smaller AI, there’s real opportunities if we see them as tools and technologies that we can harness to process complex information or automate mundane tasks. In the hands of community groups or charities, this can provide valuable tools to process information about communities, advocate for their needs, or find creative solutions.

devmio: Do you have examples of AI used in the community setting?

Mhairi Aitken: For example, community environment initiatives or sustainability initiatives can use AI to monitor local environments, or identify plant and animal species in their areas through image recognition technologies. It can also be used for processing complex information, finding patterns, classifying information, and making predictions or recommendations from information. It can be useful for community groups to process information about aspects of community life and develop evidence needed to advocate for their needs, better services, or for political responses.

A lot of big innovation is in commercially-driven development. This leads to commercial products instead of being about how these tools can be used for societal benefit on a smaller scale. This changes our framing and helps us think about who we’re developing these technologies for and how this relates to different kinds of visions of the future that benefit from this technology.

devmio: What do you think is needed to reach this point?

Mhairi Aitken: We need much more open public conversations and demands about transparency and accountability relating to AI. That’s why it’s important to counter the sensational unrealistic narrative and make sure that we focus on regulation, policy and public conversation. All of us must focus on the here and now and the decisions of companies leading the way in order to hold them accountable. We must ensure meaningful and honest dialogue as well as transparency about what’s actually happening.

devmio: Thank you for taking the time to talk with us and we hope you succeed with your mission to inform the public.

The post AI is a Human Endeavor appeared first on ML Conference.

]]>
Using OpenAI’S CLIP Model on the iPhone: Semantic Search For Your Own Pictures https://mlconference.ai/blog/openai-clip-model-iphone/ Wed, 02 Aug 2023 08:35:24 +0000 https://mlconference.ai/?p=86676 The iPhone Photos app supports text-based searches, but is quite limited. When I wanted to search for a photo of “my girlfriend taking a selfie at the beach,” it didn’t return any results, even though I was certain there was such a photo in my album. This prompted me to take action. Eventually, I integrated OpenAI’s CLIP model into the iPhone.

The post Using OpenAI’S CLIP Model on the iPhone: Semantic Search For Your Own Pictures appeared first on ML Conference.

]]>
OpenAI’s CLIP Model

I first encountered the CLIP model in early 2022 while experimenting with the AI drawing model. CLIP (Contrastive Language-Image Pre-Training) is a model proposed by OpenAI in 2021. CLIP can encode images and text into representations that can be compared in the same space. CLIP is the basis for many text-to-image models (e.g. Stable Diffusion) to calculate the distance between the generated image and the prompt during training.

 

OpenAI’s CLIP model

Fig. 1: OpenAI’s CLIP model, source: https://openai.com/blog/clip/

 

As shown above, the CLIP model consists of two components: Text Encoder and Image Encoder. Let’s take the ViT-B-32 (different models have different output vector sizes) version as an example:

 

  • Text Encoder can encode any text (length <77 tokens) into a 1×512 dimensional vector.
  • Image Encoder can encode any image into a 1×512 dimensional vector.

 

By calculating the distance or cosine similarity between the two vectors, we can compare the similarity between a piece of text and an image.

Stay up to date

Learn more about MLCON

 

Image Search on a Server

I found this to be quite fascinating, as it was the first time images and text could be compared in this way. Based on this principle, I quickly set up an image search tool on a server. First, process all images through CLIP, obtaining their image vectors, they should be a list of 1×512 vectors.

 

  # get all images list.  

img_lst = glob.glob(‘imgs/*jpg’) 

img_features = [] 

# calculate vector for every image.  

for img_path in img_lst: 

image = preprocess(Image.open(img_path)).unsqueeze(0).to(device) 

image_features = model.encode_image(image) 

img_features.append(image_features)

 

Then, given the search text query, calculate its text vector (with a size of 1×512) and compare similarity with each image vector in a for-loop.

 

  text_query = ‘lonely’  

# tokenize the query then put it into the CLIP model.  

text = clip.tokenize([text_query]).to(device)  

text_feature = model.encode_text(text)  

# compare vector similary with each image vector  

sims_lst = []  

for img_feature in img_features:  

sim = cosin_similarity(text_feature, img_feature)  

sims_lst.append(sim.item())  

 

Finally, display the top K results in order. Here I return the top3 ranked image files, and display the most relevant result.

 

  K = 3 

# sort by score with np.argsort  

sims_lst_np = np.array(sims_lst) 

idxs = np.argsort(sims_lst_np)[-K:] 

# display the most relevant result.  

imagedisplay(filename=img_lst[idxs[-1]])

 

I discovered that its image search results were far superior to those of Google, here are the top 3 results when I search for the keyword “lonely”:

 

Integrating CLIP into iOS with Swift

After marveling at the results, I wondered: Is there a way to bring CLIP to mobile devices? After all, the place where I store the most photos is neither my MacBook Air nor my server, but rather my iPhone.

To port a large GPU-based model to the iPhone, operator support and execution efficiency are the two most critical factors.

1. Operator Support

Fortunately, in December 2022, Apple demonstrated the feasibility of porting Stable Diffusion to iOS, proving that the deep learning operators needed for CLIP are supported in iOS 16.0.

 

Fig. 2: Pictures generated by Stable Diffusion

2. Execution Efficiency

Even with operator support, if the execution efficiency is extremely slow (for example, calculating vectors for 10,000 images takes half an hour, or searching takes 1 minute), porting CLIP to mobile devices would lose its meaning. These factors can only be determined through hands-on experimentation.

I exported the Text Encoder and Image Encoder to the CoreML model using the coremltools library. The final models has a total file size of 300MB. Then, I started writing Swift code.

I use Swift to load the Text/Image Encoder models and calculate all the image vectors. When users input a search keyword, the model first calculates the text vector and then computes its cosine similarity with each of the image vectors individually.

The core code is as follows:

 

  // load the Text/Image Encoder model. 

let text_encoder = try MLModel(contentsOf: TextEncoderURL, configuration: config) 

let image_encoder = try MLModel(contentsOf: ImageEncoderURL, configuration: config) 

// given a prompt/photo, calculate the CLIP vector for it. 

let text_feature = text_encoder.encode(“a dog”) 

let image_feature = image_encoder.encode(“a dog”) 

// compute the cosine similarity. 

let sim = cosin_similarity(img_feature, text_feature) 

 

As a SwiftUI beginner, I found that Swift doesn’t have a specific implementation for cosine similarity. Therefore, I used Accelerate to write one myself, the code below is a Swift translation of cosine similarity from Wikipedia.

 

  import Accelerate  

func cosine_similarity(A: MLShapedArray<Float32>, B: MLShapedArray<Float32>) -> Float {  

let magnitude = vDSP.rootMeanSquare(A.scalars) * vDSP.rootMeanSquare(B.scalars)  

let dotarray = vDSP.dot(A.scalars, B.scalars)  

return  dotarray / magnitude  

} 

 

The reason I split Text Encoder and Image Encoder into two models is because, when actually using this Photos search app, your input text will always change, but the content of the Photos library is fixed. So all your image vectors can be computed once and saved in advance. Then, the text vector is computed for each of your searches.

Furthermore, I implemented multi-core parallelism when calculating similarity, significantly increasing search speed: a single search for less than 10,000 images takes less than 1 second. Thus, real-time text searching from tens of thousands of Photos library becomes possible.

Below is a flowchart of how Queryable works:

 

Fig. 3: How the app works

Performance

But, compared to the search function of the iPhone Photos, how much does the CLIP-based album search capability improve? The answer is: overwhelmingly better. With CLIP, you can search for a scene in your mind, a tone, an object, or even an emotion conveyed by the image.

 

Fig. 4: Search for a scene, an object, a tone or the meaning related to the photo with Queryable.

 

To use Queryable, you need to first build the index, which will traverse your album, calculate all the image vectors and store them. This takes place only once, the total time required for building the index depends on the number of your photos, the speed of indexing is ~2000 photos per minute on the iPhone 12 mini. When you have new photos, you can manually update the index, which is very fast.

In the latest version, you have the option to grant the app access to the network in order to download photos stored on iCloud. This will only occur when the photo is included in your search results, the original version is stored on iCloud, and you have navigated to the details page and clicked the download icon. Once you grant the permissions, you can close the app, reopen it, and the photos will be automatically downloaded from iCloud.

3. Any requirements for the device?

  • iOS 16.0 or above
  • iPhone 11 (A13 chip) or later models

The time cost for a search also depends on your number of photos: for <10,000 photos it takes less than 1s. For me, an iPhone 12 mini user with 35,000 photos, each search takes about 2.8s.

Q&A on Queryable

1.On Privacy and security issues.

Queryable is designed as an OFFLINE app that does not require a network connection and will never request network access, thereby avoiding privacy issues.

2. What if my pictures are stored on iCloud?

Due to the inability to connect to a network, Queryable can only use the cache of the low-definition version of your local Photos album. However, the CLIP model itself resizes the input image to a very small size (e.g. ViT-B-32 is 224×224), so if your image is stored on iCloud, it actually does not affect search accuracy except that you cannot view its original image in search result.

The post Using OpenAI’S CLIP Model on the iPhone: Semantic Search For Your Own Pictures appeared first on ML Conference.

]]>