Why AI Evaluation & Observability Matters

Exploring the garbage in, garbage out rule of AI, Galileo's ChainPoll eval technique, and why trust will define the next era of GenAI.

Welcome to AI Confidential, your biweekly breakdown of the most interesting developments in confidential AI.

Today we’re exploring:

  • The garbage in, garbage out rule of AI

  • The FBI’s new guidance on protecting AI training data

  • Trending open source projects you should know about

Also mentioned in this issue: Atin Sanyal, Satya Nadella, Mark Russinovich, Jason Clinton, Nelly Porter, Daniel Rohrer, James Kaplan, John Willis, João Moura, Jake Broekhuizen, Chester Chen, Galileo, AIMultiple, Gartner, PwC, Meta, Microsoft Build, ServiceNow, Quantum Machines, and The PyTorch Foundation.

Let’s dive in!

This week on the AI Confidential Podcast, we’re joined by Atin Sanyal, Co-founder and CTO of Galileo, a reliability platform built for the GenAI era.

Atin and his team launched Galileo in 2021, fueled by a simple but stubborn truth: 

Garbage in = garbage out still rules in AI, and we need to do something about it.

No matter how advanced a model is, poor-quality data will sabotage results—especially in high-stakes enterprise environments.

As large language models grow more powerful, their outputs become less predictable.

Today’s GenAI tools aren’t just responding to prompts. They’re constantly making decisions and interacting with complex systems, so even the smallest change can cascade into failures that traditional observability tools just can’t catch.

And that’s where evaluation and observability come into play.

Using their own eval technique called ChainPoll, Galileo turns subjective evaluation into measurable outcomes by using multiple LLMs to judge each other through consensus scoring.
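
To make that concrete, here’s a minimal sketch of consensus scoring with multiple LLM judges. It is not Galileo’s actual ChainPoll implementation; the judge prompt, the `chainpoll_score` helper, and the stub judges are all illustrative stand-ins for real LLM calls.

```python
from typing import Callable, Sequence

def chainpoll_score(
    question: str,
    answer: str,
    judges: Sequence[Callable[[str], str]],
    polls_per_judge: int = 3,
) -> float:
    """Return the fraction of judge verdicts that say the answer is supported.

    Each judge is any callable that maps a prompt string to a text completion,
    e.g. a thin wrapper around your LLM provider of choice.
    """
    prompt = (
        "Think step by step, then answer YES or NO on the final line.\n"
        "Is the following answer correct and fully supported?\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )
    votes = []
    for judge in judges:
        for _ in range(polls_per_judge):  # poll each judge repeatedly
            verdict = judge(prompt)
            # The last line of the chain-of-thought carries the YES/NO verdict.
            votes.append(verdict.strip().splitlines()[-1].upper().startswith("YES"))
    return sum(votes) / len(votes)

# Toy usage with stub judges standing in for real LLM calls.
agree = lambda _prompt: "Step-by-step reasoning...\nYES"
dissent = lambda _prompt: "Step-by-step reasoning...\nNO"
print(chainpoll_score("What is 2 + 2?", "4", judges=[agree, agree, dissent]))  # ≈ 0.67
```

The score is simply the fraction of YES verdicts, which is what turns an otherwise subjective judgment into a number you can track over time.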

What we like most about Atin’s approach, though, is his focus on data quality and confidentiality.

In a world where LLMs are becoming commoditized, he argues the competitive edge lies in sensitive, internal knowledge, like the proprietary data that fuels meaningful use cases. 

Without it, even the most capable model can’t deliver enterprise-grade reliability. 

In this episode, you’ll also hear Atin break down:

  • Why prompt injection is a growing security risk

  • Why confidential RAG is exploding across industries 

  • What evaluation agents are—and why they get smarter over time

  • How Galileo helps enterprises evolve their own AI quality metrics

  • Why trust, not model power, will define the next era of GenAI

Chatting with Atin was incredibly fun, and we learned so much about a growing area of AI. The episode is live now. Listen here.

P.S. Catch Atin speaking LIVE at the pre-summit workshop for the Confidential Computing Summit on June 16th. Get your ticket before it’s too late, and use code AICNEWS25 for a discount.

Keeping it Confidential

According to research from AIMultiple, so far in 2025, which LLM has the lowest hallucination rate?

  1. Claude 3.7 Sonnet

  2. OpenAI GPT-4.5 

  3. Llama 4 Scout

  4. DeepSeek V3

See the answer at the bottom.

Code for Thought

Important AI news in <2 minutes

💰 88% of senior executives plan to increase spending on AI agents, a recent PwC survey found. 

🏢 48% of executives predict the uptick in AI agents will expand headcount requirements to address the change in workflows, the same PwC survey found.  

🛡️ The FBI, NSA, and other global security agencies released new guidance on best practices for protecting the data used to train AI systems. 

🕒 AI agents will slash response times to cyberattacks by 50% in just two years, according to Gartner.

🟢 Meta got the green light from the Irish Data Protection Commission (DPC) to train GenAI models using public data from Facebook and Instagram users.

Community Roundup

Updates involving OPAQUE and our partners

The countdown is nearly over: OPAQUE’s 2025 Confidential Computing Summit is just one week away!

From June 16–18 in San Francisco, we’re bringing together the leaders defining the next era of secure, enterprise AI.

🎤 Listen to keynotes from:

🎓 Arrive early for the pre-summit workshop on June 16, The New NORMAL: Normalizing AI Enterprise Architecture, featuring:

There’s still time to register—but not for long. 

We can’t wait to see you there! 

OPAQUE in the wild

Last month, Mark Russinovich took the stage at Microsoft Build to share the latest insights and innovations in Microsoft Azure, and OPAQUE got a shout-out.

During his talk, Mark shared how companies like ServiceNow are using OPAQUE to build multi-GPU agentic flows that draw on multiple data sources while still protecting sensitive data.

Open source spotlight 

⚛️ Quantum Machines’ QUAlibrate is a framework that slashes quantum-computer calibration times from hours to minutes, addressing one of the field’s most critical bottlenecks. 

🔧 Meta’s synthetic data kit automates dataset creation for fine-tuning, using Llama models and advanced curation techniques. 

🔥 The PyTorch Foundation expanded its offerings to include vLLM, an inference engine designed for the efficient deployment of LLMs at scale.

Quotable

🤖 “In the future, we believe every organization is going to have people and agents working together. That means the systems that you are using today ubiquitously for things like identity management or endpoint security will need to extend to agents as well.” 

Satya Nadella, CEO of Microsoft, during the Microsoft Build 2025 keynote

Trivia answer: OpenAI GPT-4.5 

In the benchmark of 60 questions across 16 LLMs, OpenAI GPT-4.5 had the lowest hallucination rate of the group at 15%. Claude 3.7 Sonnet followed at 17%, Llama 4 Scout at 23%, and DeepSeek V3 at a whopping 38%.

Stay confidential!

- Your friends at OPAQUE

