
Measure to Master: Evaluating Your GenAI Applications

Thank you to everyone who attended our webinar on evaluating GenAI applications.

We had an interesting discussion, especially about common errors teams run into when building with LLMs and how to systematically improve their applications. We hope you find the insights shared valuable and informative. A special thanks to our partners at Weights & Biases for their collaboration. For those who couldn't make it, the video of the event and the presentation slides are available below. We look forward to seeing you at our future webinars.

Agenda:

First lecture:

Building confidence in AI applications, by Scott Condron, Product at Weights & Biases

Abstract:

Since the release of GPT-4, there has been a rush to build on top of LLMs and harness their power for many use cases, but many teams are learning that it's not as simple as editing a prompt.

Participants will learn about some common pitfalls when building with LLMs, how to systematically improve their apps, and how to manage their workflow in a sane way.

Weights & Biases built the tools that helped OpenAI train GPT-4, and it recently released a new toolkit to help you bring the same practices to your own LLM applications.
We will discuss:
– Logging and versioning LLM interactions and surrounding data from development to production.
– Experimenting with prompting techniques, model changes, and parameters.
– Evaluating your models and measuring your progress.

Join Scott to learn about the new toolkit, Weave, and how to use it to build robust AI applications.
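As a flavor of what logging and versioning LLM interactions with Weave can look like, here is a minimal sketch; the project name, the function, and the stubbed model call are illustrative placeholders rather than code from the talk:

```python
import weave

# Illustrative project name; replace with your own Weights & Biases project.
weave.init("genai-evals-demo")

# Decorating a function with @weave.op records its inputs, outputs, and
# latency in Weave, so each LLM interaction is logged and traceable
# from development through production.
@weave.op()
def answer_question(question: str) -> str:
    # Placeholder for a real LLM call (OpenAI, Anthropic, a local model, etc.).
    return f"Stubbed answer to: {question}"

if __name__ == "__main__":
    print(answer_question("How should we evaluate our GenAI application?"))
```

Once calls are logged this way, the same traces can feed the prompt experiments and evaluations the talk covers.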

Second lecture:

Evaluating RAG is Harder than You Think, by Amnon Catav, Senior ML Engineer at Pinecone

Abstract:

Evaluating an LLM's output is a complex topic, but when building a RAG system, assessing the system's output becomes even more intricate. RAG evaluation must gauge the system's ability to parse and retrieve context, as well as the LLM's capacity to extract relevant information and reason about it. RAG evaluation is a trending research topic; however, there are as yet no standard datasets, metrics, or evaluation methods. In this talk, we will walk through our journey at Pinecone evaluating RAG systems, the metrics we researched, and why we believe this is only the beginning of evaluating knowledgeable AI.
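To make the retrieval half of RAG evaluation concrete, here is a small, self-contained sketch that computes recall@k over toy data; the data and function names are illustrative and are not the metrics or code presented in the talk:

```python
from typing import Dict, List

def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for doc_id in relevant if doc_id in top_k) / len(relevant)

# Toy per-query data: ranked retrieved document ids and gold relevant ids.
retrieved_by_query: Dict[str, List[str]] = {
    "q1": ["d3", "d7", "d1", "d9"],
    "q2": ["d2", "d8", "d5", "d4"],
}
relevant_by_query: Dict[str, List[str]] = {
    "q1": ["d1", "d3"],
    "q2": ["d6"],
}

scores = [
    recall_at_k(retrieved_by_query[q], relevant_by_query[q], k=3)
    for q in retrieved_by_query
]
print(f"mean recall@3 = {sum(scores) / len(scores):.2f}")
```

A metric like this only covers retrieval; judging whether the LLM then extracts the right information and reasons about it faithfully requires separate, often LLM- or human-judged, measures, which is part of why RAG evaluation is harder than it looks.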

Third lecture:

LLM framework – enhancing prompt output quality while shortening the development cycle, by Gal Naamani, Data Science Team Lead at Fiverr

Abstract:

As we integrate LLMs into our workflow, the efficacy of our prompts becomes paramount. The quality of an LLM's output is substantially influenced by the prompt. Recognizing this, our team has developed a tool for building and systematically evaluating LLM prompts.
The tool assesses a prompt, suggests improvements and tailored adjustments, and helps explain prompt behavior. By using it, our team achieves higher LLM output quality and efficiency while also shortening the development cycle of prompts.
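The tool described above is internal to Fiverr; purely as an illustration of what systematic prompt assessment can look like, here is a hedged sketch that scores two prompt variants against a small labeled set. The prompt templates, data, and stubbed LLM call are hypothetical and are not the team's actual framework:

```python
from typing import Callable, Dict, List

# Illustrative labeled examples: review text and the expected sentiment label.
EXAMPLES: List[Dict[str, str]] = [
    {"text": "The delivery was late and the seller ignored my messages.", "label": "negative"},
    {"text": "Great communication and a fast turnaround!", "label": "positive"},
]

# Two hypothetical prompt variants to compare.
PROMPT_VARIANTS: Dict[str, str] = {
    "v1_terse": "Classify the sentiment of this review as positive or negative:\n{text}",
    "v2_guided": (
        "You are a support analyst. Read the review below and answer with exactly "
        "one word, 'positive' or 'negative'.\nReview: {text}"
    ),
}

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned answer for the sketch.
    return "negative" if "late" in prompt else "positive"

def accuracy(template: str, examples: List[Dict[str, str]],
             llm: Callable[[str], str]) -> float:
    # Fill the template per example, query the model, and compare to the label.
    hits = sum(
        llm(template.format(text=ex["text"])).strip().lower() == ex["label"]
        for ex in examples
    )
    return hits / len(examples)

for name, template in PROMPT_VARIANTS.items():
    print(f"{name}: accuracy = {accuracy(template, EXAMPLES, call_llm):.2f}")
```

Running every prompt change against the same labeled set is what turns prompt editing from guesswork into a measurable development cycle.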

Fourth lecture:

Evaluating Dicta-LM 2.0: Into the process of evaluating generative Hebrew LLMs, by Shaltiel Shmidman, Lead Developer at DICTA's Deep Learning Lab

Abstract:

Evaluating the output of GenAI models involves a layered approach, blending automated metrics with human (or AI) judgment. This presentation will delve into the varied methodologies employed to assess our latest AI model, discussing the specific tasks on which it was evaluated, as well as introducing the new Open LLM Leaderboard on HuggingFace for assessing GenAI models on their Hebrew capabilities.
