Virtual Paper Review – AI Agent Benchmarks

January 14 @ 6:00 pm – 7:30 pm

AI Agent Benchmarks

For our first paper review of 2026, we will have Tom Plunkett lead us through papers that define benchmarks used to evaluate Agentic AI.

This will be an hour-long deep dive into one agentic AI benchmark, the Tau benchmark. We’ll start with the 2024 Tau benchmark paper, then cover the 2025 Tau2 benchmark paper. Finally, we’ll look at the tau2-bench GitHub repository and at running the tau2 benchmark with example agents from the Retail, Telecom, and Airline domains.

Links:

  • τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan. https://arxiv.org/abs/2406.12045
  • τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan. https://arxiv.org/abs/2506.07982

Details:

Date:
January 14
Time:
6:00 pm – 7:30 pm