Measure to Master: Evaluating Your GenAI Applications
Thank you to everyone who attended our webinar on evaluating GenAI applications.
We had an interesting discussion, especially about common errors when building with LLMs and how to systematically improve on them. We hope you find the insights shared valuable and informative. A special thanks to our partners at Weights & Biases for their collaboration. For those who couldn't make it, the video of the event and the presentation slides are available below. We look forward to seeing you at our future webinars.
Agenda:
First lecture:
Building confidence in AI applications, by Scott Condron, Product at Weights & Biases
Abstract:
Since the release of GPT-4, there has been a rush to build on top of LLMs and harness their power for many use cases, but many teams are learning that it's not as simple as editing a prompt.
Participants will learn about some common pitfalls when building with LLMs, how to systematically improve their apps, and how to manage their workflow in a sane way.
Weights & Biases built the tools that helped OpenAI train GPT-4, and has recently released new tooling to help you apply the same capabilities to your own applications.
We will discuss:
– Logging and versioning LLM interactions and surrounding data from development to production.
– Experimenting with prompting techniques, model changes, and parameters.
– Evaluating your models and measuring your progress.
Join Scott to learn about the new toolkit, Weave, and how to use it to build AI applications robustly.
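As a rough illustration of the "evaluating your models and measuring your progress" point above, here is a minimal, framework-agnostic sketch of an evaluation loop over a small labeled dataset. The `call_llm` placeholder, the toy dataset, and the exact-match scorer are assumptions for illustration only; this is not Weave's API, but it is the kind of logging-and-scoring workflow the toolkit is designed to take off your hands.

```python
import json
from typing import Callable

# Hypothetical stand-in for a real model call (e.g. an API client).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model or API client here")

# A tiny labeled evaluation set; in practice this would be versioned
# alongside prompts, model choices, and parameters.
EVAL_SET = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What is the capital of France?", "expected": "Paris"},
]

def exact_match(expected: str, output: str) -> bool:
    # Deliberately simple scorer; real apps often need fuzzier checks
    # or an LLM-as-judge.
    return expected.strip().lower() in output.strip().lower()

def run_eval(model_fn: Callable[[str], str]) -> float:
    """Run the model over the eval set, log every interaction, and
    return an aggregate score so runs can be compared over time."""
    results = []
    for row in EVAL_SET:
        output = model_fn(row["question"])
        results.append({**row, "output": output,
                        "correct": exact_match(row["expected"], output)})
    # Logging each interaction (here just printed as JSON) is what makes
    # regressions visible when prompts, models, or parameters change.
    print(json.dumps(results, indent=2))
    return sum(r["correct"] for r in results) / len(results)

# Example: score = run_eval(call_llm)  # once call_llm is wired to a real model
```

Each scored run becomes a comparable data point, which is what makes it possible to tell whether a prompt tweak, model swap, or parameter change actually helped.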
Second lecture:
Evaluating RAG is Harder than You Think, by Amnon Catav, Senior ML Engineer at Pinecone
Abstract:
Evaluating an LLM output is a complex topic, but when building a RAG system, assessing the system output becomes even more intricate. RAG evaluation must gauge the system's ability to parse and retrieve context, as well as the LLM's capacity to extract relevant information and reason about it. RAG evaluation is a trending research topic; however, there are not yet any standard datasets, metrics, or evaluation methods. In this talk, we will walk through our journey at Pinecone evaluating RAG systems, the metrics we researched, and why we believe this is only the beginning of evaluating knowledgeable AI.
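To make the retrieval half of this concrete, here is a hedged sketch of one simple retrieval-side metric: recall@k over retrieved document IDs against hand-labeled relevant documents. The data format, IDs, and values are hypothetical, and this is just one of many possible metrics, not Pinecone's evaluation methodology from the talk.

```python
from typing import List, Set

def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the labeled relevant documents that appear in the top-k
    retrieved results; this measures the retriever in isolation, before the
    LLM ever sees the context."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Hypothetical labeled queries: for each query, the document IDs a human
# judged relevant, and what the retriever actually returned (in rank order).
labeled_queries = [
    {"relevant": {"doc-17", "doc-42"},
     "retrieved": ["doc-42", "doc-3", "doc-17", "doc-88"]},
    {"relevant": {"doc-5"},
     "retrieved": ["doc-9", "doc-12", "doc-88", "doc-5"]},
]

scores = [recall_at_k(q["retrieved"], q["relevant"], k=3) for q in labeled_queries]
print(f"mean recall@3 = {sum(scores) / len(scores):.2f}")
```

A full RAG evaluation would pair a retrieval metric like this with generation-side checks such as faithfulness to the retrieved context and answer relevance, which is exactly where the lack of standard datasets and metrics becomes apparent.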
Third lecture:
LLM framework – enhancing prompt output quality while shortening the development cycle, by Gal Naamani, Data Science Team Lead at Fiverr
Abstract:
Fourth lecture:
Evaluating Dicta-LM 2.0: Into the process of evaluating generative Hebrew LLMs, by Shaltiel Shmidman, Lead Developer at DICTA's Deep Learning Lab
Abstract:
Evaluating the output of GenAI models involves a layered approach, blending automated metrics with human (or AI) judgment. This presentation will delve into the varied methodologies employed to assess our latest AI model, discussing the specific tasks on which the model was evaluated, as well as introducing the new Open LLM Leaderboard on HuggingFace for assessing GenAI models on their Hebrew capabilities.