Speaker

Hosted session(s)

Modern AI agents don’t improve through traditional testing; they improve through structured experimentation. In this session, we’ll explore how Business Central teams apply evaluation‑driven development and iterative “hill climbing” techniques to systematically increase agent accuracy using offline evals, online experiments, and LLM‑based scoring.
We’ll walk through a real‑world use case from the Expense Agent, showing how teams move from baseline performance to production‑ready quality by identifying failure modes, introducing targeted prompt and tool changes, and running repeatable evaluation loops to measure the impact of each change. You’ll learn how centralized experiment tracking, ground‑truth datasets, and LLM judges enable teams to automatically accept or reject changes, climbing the performance curve one step at a time.
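The accept-or-reject loop described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the Expense Agent's actual pipeline: the names `accuracy` and `hill_climb` are hypothetical, agents are modeled as plain functions from input text to output text, and exact-match scoring stands in for an LLM judge.

```python
from typing import Callable, Sequence

Agent = Callable[[str], str]
Example = tuple[str, str]  # (input, expected output) from a ground-truth dataset


def accuracy(agent: Agent, dataset: Sequence[Example]) -> float:
    """Fraction of examples the agent answers correctly.

    Assumption: exact string match stands in for an LLM-based judge.
    """
    if not dataset:
        raise ValueError("dataset must not be empty")
    hits = sum(1 for question, expected in dataset if agent(question) == expected)
    return hits / len(dataset)


def hill_climb(
    baseline: Agent,
    candidates: Sequence[tuple[str, Agent]],
    dataset: Sequence[Example],
    min_gain: float = 0.0,
) -> tuple[Agent, float, list[tuple[str, float, bool]]]:
    """Evaluate candidate changes in order, keeping only those that beat the current best.

    Returns the best agent, its score, and a history of (name, score, accepted)
    entries, which plays the role of a simple experiment log.
    """
    best, best_score = baseline, accuracy(baseline, dataset)
    history = [("baseline", best_score, True)]
    for name, candidate in candidates:
        score = accuracy(candidate, dataset)
        accepted = score > best_score + min_gain  # reject ties and regressions
        history.append((name, score, accepted))
        if accepted:
            best, best_score = candidate, score
    return best, best_score, history
```

In this sketch each candidate is measured against the same fixed dataset, so a change is kept only when it improves on the best score seen so far; a `min_gain` above zero would guard against accepting changes whose gain is within noise.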

A technical deep dive into evaluating coding agents on real-world Business Central tasks. Results show resolution rates of up to ~70% but reveal gaps in reliability and in handling complex scenarios. Learn how the benchmark is built and what the results mean for AL development in practice.