What if the most valuable asset in AI agent development wasn’t the model, but how you evaluate it?
That's the idea behind Aditya Palnitkar's talk at the Data Innovation Summit 2025! Aditya is a Software Engineer at Meta and one of the early builders of Facebook Watch. He has seen how scale, precision, and product impact converge in AI and ML systems. Now he's applying those lessons, treating evaluation as the foundation for smarter, faster, and more reliable AI agents.
In this interview, Aditya shares his journey from recommendation systems used by billions to designing evaluations that shape the behavior of advanced AI. He discusses the surprising reliability of LLM-based judges and how teams can avoid common evaluation pitfalls. He also explains why simulation and synthetic data might hold the key to the next leap in agent performance.
Keep reading to explore how a rigorous approach to evaluation can help teams unlock higher performance and make better decisions. It can also lead to AI agents that thrive in the real world!
Hyperight: Aditya, you’ve had an impressive journey. Can you tell us a bit about your professional background and what your current focus is at Meta?

Aditya Palnitkar: Thank you! I’ve been at Meta since 2015, and it’s been quite a journey. I joined as one of the first engineers working on what eventually became Facebook Watch.
At the time, it was just a small team focused on building a new video surface from scratch. Over the years, I helped grow that into a full-fledged product that became one of the largest drivers of usage on the Facebook app.
Currently, my focus is on building evaluations for AI agents. I’m especially interested in how we can use synthetic data generated through scalable evaluations. It can help fine-tune and improve almost any component of an AI agent—from RAG, to tool routing, to final response generation.
Hyperight: Can you give us a sneak peek into your talk at the Data Innovation Summit 2025, “Supercharging AI Agents with Evaluations”? What’s the message you want attendees to walk away with?
Aditya Palnitkar: The one core idea I'd like everyone to walk away with is how central evaluations are to AI agent development. Here's a provocative statement to reel you in: the evaluation set is possibly the most valuable form of IP your team can create, perhaps even more valuable than the AI agent code itself.
This is especially true in today's world of rapid improvements in foundation model capabilities. These advances have raised concerns about the viability of smaller teams building products on top of them. In many cases, though, foundation models cannot match the capabilities of custom-built agents, purely because the teams building those custom agents put so much effort into distilling their domain knowledge into an evaluation system.
Hyperight: Aditya, you’ve worked on recommendation models that serve over a billion users daily. What key lessons from that experience now shape how you approach AI agent evaluation at Meta?
Aditya Palnitkar: One lesson I've seen play out over and over throughout my career is Goodhart's law – when a metric becomes a target, it ceases to be a good measure. In the world of recommendations, we had to regularly up-level the metric we targeted. We moved from simplistic metrics tracking user engagement to much more relevant ones measuring user satisfaction and retention.
The same applies to AI agent evaluation, especially given how limited most evaluation datasets tend to be in the LLM world. Overfitting your AI agent to a limited, never-refreshed dataset can lead to a lot of wasted effort. You have to keep refreshing your evaluation sets and your metrics. This applies both to the accuracy of measurement and to what you choose to measure.
Hyperight: Evaluation often gets overlooked in early development. Why do you think it’s such a critical part of building high-performing AI agents?
Aditya Palnitkar: Early development is when you likely need the most signal on important decisions and tradeoffs. Which frameworks do you use to build your AI agent? Which foundation model do you build your stack on?
With a good set of evaluations, you can quickly and confidently navigate such decisions. These decisions can be costly to change later on. Evaluations are a great way to bring trade-offs into focus. For example, you can use a reasoning model for your AI agent, even if it increases your latency by 10%, as long as you are confident in the improvement it brings to your AI agent’s performance, as determined by your evals.
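To make that kind of eval-driven decision concrete, here is a minimal sketch of comparing two model configurations on the same evaluation set and weighing quality against latency. The model names and the `run_agent` and `judge` helpers are hypothetical placeholders for illustration, not any specific Meta tooling.

```python
import time
from statistics import mean

def run_agent(model: str, prompt: str) -> str:
    """Hypothetical: call the AI agent built on `model` and return its response."""
    raise NotImplementedError  # replace with your own agent stack

def judge(prompt: str, response: str) -> float:
    """Hypothetical: score a response between 0.0 and 1.0 (e.g. via an LLM judge)."""
    raise NotImplementedError  # replace with your own evaluation logic

def evaluate(model: str, eval_set: list[str]) -> dict:
    """Run every eval prompt through the agent and report mean quality and latency."""
    scores, latencies = [], []
    for prompt in eval_set:
        start = time.perf_counter()
        response = run_agent(model, prompt)
        latencies.append(time.perf_counter() - start)
        scores.append(judge(prompt, response))
    return {"model": model, "quality": mean(scores), "latency_s": mean(latencies)}

# Example: compare a baseline against a (hypothetical) reasoning model and only
# adopt the reasoning model if the quality gain justifies the extra latency.
# results = [evaluate(m, eval_set) for m in ("baseline-model", "reasoning-model")]
```

The point is simply that a question like "is a 10% latency hit worth it?" becomes an empirical one once both numbers come from the same eval set.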
Hyperight: In your work, you use LLMs as scalable judges. Can you share how reliable they’ve become, and where you still see limitations?
Aditya Palnitkar: LLM judges have become surprisingly reliable in terms of how well they adhere to the guidelines provided to them. In fact, when the guidelines are well written, it is not difficult to achieve up to 90% match rates for LLM judges against human raters on many commonly measured dimensions.
One limitation that LLM judges still cannot overcome is that they cannot evaluate against domain knowledge that is not captured in any data source available to you. This knowledge often exists only as tribal or institutional knowledge inside the minds of subject matter experts. In such cases, it is the AI agent developer's responsibility to act as the bridge and make sure all of that domain knowledge is written down for LLM judges to use. That is the only way to scale up your evaluation pipeline.
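As a rough illustration of the pattern Aditya describes, a guideline-driven LLM judge and its match rate against human raters might be wired up along these lines. The guideline text, the `call_llm` helper, and the data shapes are assumptions for illustration, not a description of Meta's internal pipeline.

```python
from dataclasses import dataclass

GUIDELINES = """You are grading an AI agent's answer.
Reply with exactly one word: PASS if the answer is accurate, complete, and follows
the user's request; FAIL otherwise."""

@dataclass
class Example:
    prompt: str
    agent_answer: str
    human_label: str  # "PASS" or "FAIL", assigned by a human rater

def call_llm(system_prompt: str, user_message: str) -> str:
    """Hypothetical stand-in for whatever LLM client your stack uses."""
    raise NotImplementedError

def llm_judge(example: Example) -> str:
    """Ask the judge model for a verdict under the written guidelines."""
    user_message = (
        f"User request:\n{example.prompt}\n\n"
        f"Agent answer:\n{example.agent_answer}\n\nVerdict:"
    )
    verdict = call_llm(GUIDELINES, user_message).strip().upper()
    return "PASS" if verdict.startswith("PASS") else "FAIL"

def match_rate(examples: list[Example]) -> float:
    """Fraction of examples on which the LLM judge agrees with the human rater."""
    agreements = [llm_judge(ex) == ex.human_label for ex in examples]
    return sum(agreements) / len(agreements)
```

Everything the judge needs to know has to live in those written guidelines, which is exactly why undocumented tribal knowledge is the bottleneck.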
Hyperight: For teams just starting to build out evaluation systems for AI agents, what common pitfalls should they look out for?
Aditya Palnitkar: One of the biggest pitfalls I’ve seen, especially for teams new to building evaluation systems for AI agents, is focusing too heavily on offline metrics. These metrics don’t fully capture real-world performance. It’s easy to over-optimize for things like accuracy or BLEU scores, but those often don’t reflect how the system behaves in live, user-facing environments. You can end up with a model that looks great on paper but completely misses the mark in practice.
Hyperight: Aditya, you’ve been part of the AI/ML journey for a decade. How has your perspective on “what matters” in ML systems changed over the years?
Aditya Palnitkar: That’s a great question! Over the last decade, my perspective on what matters in ML systems has evolved quite a bit. Early on, I was very focused on model performance—metrics like precision, recall, and AUC. It felt like the main goal was always to ship a better model, and we poured a lot of energy into tweaking features and architectures to gain marginal improvements.
But as we scaled Facebook Watch and our recommendation systems started serving billions of users, I started to see how much of the impact came from everything around the model: data quality, feedback loops, system reliability, and how well the ML system aligned with product goals.
So, while the models are still important, what really matters is how well the entire system works together to create the desired user experience. That shift, from optimizing for model metrics to optimizing for system impact, is probably the biggest change in my thinking.
Hyperight: In your opinion, what does the future of AI agent evaluation look like over the next few years? Are there trends or technologies you’re excited about?
Aditya Palnitkar: We are increasingly seeing AI agents evaluated not just by simple LLM judges using basic prompts with few-shot examples. They are now being evaluated by LLM judges that are themselves complex systems, almost qualifying as agents in their own right.
For instance, the best and latest accuracy verification agents work by segmenting long-form text into chunks and verifying each chunk against ground truth retrieved from web search or from internal RAG results.
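A bare-bones version of that chunk-and-verify pattern could look like the following sketch, with the retrieval and support-checking steps left as hypothetical placeholders.

```python
def split_into_chunks(text: str) -> list[str]:
    """Naive sentence-level chunking; real systems use smarter segmentation."""
    return [s.strip() for s in text.split(".") if s.strip()]

def retrieve_evidence(claim: str) -> list[str]:
    """Hypothetical: fetch supporting passages via web search or an internal RAG index."""
    raise NotImplementedError

def is_supported(claim: str, evidence: list[str]) -> bool:
    """Hypothetical: ask an LLM or entailment model whether the evidence supports the claim."""
    raise NotImplementedError

def verify_accuracy(long_form_answer: str) -> float:
    """Return the fraction of chunks supported by retrieved ground truth."""
    chunks = split_into_chunks(long_form_answer)
    results = [is_supported(c, retrieve_evidence(c)) for c in chunks]
    return sum(results) / len(results) if results else 1.0
```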
Another trend I’m excited about is simulation-based evaluation. We’re starting to see more use of synthetic environments or user simulators to stress-test agents in a variety of scenarios. It’s not perfect, but it’s a big step toward evaluating agents more like we’d evaluate humans: based on their ability to generalize, adapt, and act under uncertainty.
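A user-simulator stress test of the kind Aditya mentions can likewise be sketched as a simple loop: an LLM role-plays users with different personas, the agent under test responds, and a judge decides whether each simulated user's goal was met. All three calls below are hypothetical placeholders.

```python
def simulated_user_turn(persona: str, history: list[str]) -> str:
    """Hypothetical: an LLM role-plays a user with the given persona and goal."""
    raise NotImplementedError

def agent_turn(history: list[str]) -> str:
    """Hypothetical: the AI agent under test responds to the conversation so far."""
    raise NotImplementedError

def goal_achieved(persona: str, history: list[str]) -> bool:
    """Hypothetical: a judge decides whether the simulated user's goal was met."""
    raise NotImplementedError

def stress_test(personas: list[str], max_turns: int = 6) -> float:
    """Run one simulated conversation per persona and report the success rate."""
    successes = 0
    for persona in personas:
        history: list[str] = []
        for _ in range(max_turns):
            history.append(simulated_user_turn(persona, history))
            history.append(agent_turn(history))
            if goal_achieved(persona, history):
                successes += 1
                break
    return successes / len(personas)
```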

If you’re ready to take your AI systems beyond just building better models, don’t miss Aditya’s talk at the Data Innovation Summit 2025! In “Supercharging AI Agents with Evaluations,” he’ll unpack why evaluation might be your team’s most valuable IP. Learn how scalable, domain-aware testing can unlock more reliable agents!
Whether you work with LLMs, build autonomous workflows, or scale AI in high-stakes environments, this talk offers insights on avoiding common traps. Rethink how your team measures progress, and learn how smarter evaluations can lead to smarter agents!