Few AI questions bother me as much as: “How do we get our users to trust AI?” (as if it’s users that have the problem, and not our AI systems). Newsflash: users distrust AI for all kinds of valid reasons, such as:
- A consistent media flood of sensational hallucinations from the big AI chatbots.
- Widespread fear of job loss, especially due to lack of proper communication from leadership – and relentless overhyping of “digital teammates.” (This would require an entire piece to do justice to, but the fear is real, and often out of proportion to the reality.)
- Inaccurate output that raises legitimate questions about what AI to trust, and when.
“Building cool AI applications is one thing – building trust in them is another”
That last bullet is the topic of today’s missive: How do we evaluate AI performance? In our day to day, we may “trust” in certain things, but trust is always relative. You wouldn’t trust your dentist to do heart surgery; you might not trust your landscaper to look after your ailing cat. Trust in AI is also relative: it all depends on how we design our AI use cases (our AI accuracy tolerance varies by industry, task and role). “Trust” also depends on how we evaluate AI systems – and how we optimize them.
During a monster search of AI vendors, I fell deep into content on RAG and agentic evaluation. I challenged all vendors I found with a grueling question on RAG and LLM evaluation, but only one of them had a good answer (Galileo, via their “Evaluation Intelligence” platform). After that, I kept coming back to Galileo’s YouTube channel, due to their blistering pace of educational content, and the coherence of their evaluation tools (I wasn’t the first; George Lawton wrote a good LLM evaluation piece including Galileo back in 2023).
But much has changed since then. In a recent webinar, Unveiling the Latest Innovations in Evaluation Intelligence, Quique Lores, Galileo’s Head of Product, nailed the disconnect between AI sandbox excitement on the one hand, and enterprise-grade trust on the other:
As we keep hearing, gen AI has opened up unprecedented possibilities. The things that developers like yourselves in the audience are building just keep blowing our minds every single day.
But, Lores added:
Again, as we keep hearing, building cool applications is one thing, but building trust in them, being able to sleep at night knowing that they work well, is a whole different problem. And in this era of non-deterministic behavior, where LLMs keep hallucinating and agents keep going off the rails, the risks are high – and really hard to measure. And that’s why we think that AI has a big measuring problem… A lot of companies are simply not investing in that last 10% of optimization, because of a lack of visibility and measurement.
What we see this leading to is slower time to market, and a higher risk to brands. This comes at a time when competition is fiercer than ever, and the race to ship AI is accelerating. There’s no time to waste and no chance to run risks, and that’s exactly why we built Galileo.
Bearing down on RAG evaluation – why does context adherence matter?
A week after this webinar, I hopped on an interview with Galileo CTO (and co-founder) Atin Sanyal. Why Galileo? Because they were the only vendor to address a particular set of LLM accuracy concerns, via a metric they call “context adherence.” They also have another important metric called “completeness.” I’ll get back to those shortly, but we should note: these particular metrics pertain to the evaluation of RAG (Retrieval Augmented Generation).
Why am I obsessed with RAG evaluation? Because enterprise AI architectures often rely on RAG, and the customer-specific data provided in the context window, to mitigate LLM inaccuracies and improve output results. RAG also allows vendors to intentionally keep customer data out of the LLM (even when LLM fine tuning is utilized, or custom LLMs are deployed, RAG can still be needed for “real-time” data updates at time of inference). Good examples of RAG usage: enterprise search, and many types of digital assistants/co-pilots/chatbots, which answer users’ queries on specialized documents/topics.
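For readers who haven’t built one of these pipelines, here is a minimal sketch of the RAG pattern just described: retrieve customer-specific passages, then instruct the model to answer only from that retrieved context. It assumes a toy keyword-overlap retriever standing in for a real vector store; the document set, function names, and prompt wording are mine, purely for illustration.

```python
# Minimal sketch of the RAG pattern: retrieve customer-specific passages,
# then ask the LLM to answer from that context rather than its training data.
# The keyword-overlap retriever is a placeholder; real systems use a vector store.

DOCS = [
    "Medical leave requests must now be approved by HR within 10 business days.",
    "Employees accrue 1.5 vacation days per month of service.",
    "Expense reports are reimbursed within two pay cycles.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_terms & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Instruct the model to answer only from the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}\n"
    )

if __name__ == "__main__":
    question = "How is medical leave approved?"
    context = retrieve(question, DOCS)
    print(build_prompt(question, context))  # this prompt would be sent to the LLM
```

In a real deployment, the prompt produced here is what gets sent to the LLM, which is exactly where the adherence questions below come in.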
Of course, in the case of LLM-powered agents, RAG is not the only tool the agent might utilize (planning/orchestration agents choose between a range of tools to complete an action), so agentic evaluation is a different exercise, one that needs additional metrics for agentic performance. (For now, I’ll focus on RAG, and see how far we get). As Sanyal told me, Galileo started down this path before RAG came onto the scene:
We’re squarely in the AI evaluations and observability space, and we got in around four years ago. We started primarily on language models and unstructured data. For a while, we were the only evaluation tool for unstructured AI, like deep learning models. Before the LLM moment, people were fine tuning mostly BERT-style models for entity detection.
We initially started building out algorithms that auto-detect the major issues that are caused in language modeling workflows. Data quality and label quality were a very big problem, because labeling was expensive, so we built out algorithms that auto-detect these issues, and that became our hit product.
Over the years, the workflows have changed. People are using these larger models, and mostly querying them through APIs. LLM agents and those things have become a bigger problem, but we’ve still kind of maintained that, ‘Hey, we’re going to build a platform that almost acts like a co-pilot for your evals as you’re building your app, but also we’ll work on new and algorithmic ways to auto-detect them,’ so that we can give you leading signals on, ‘Hey, this is what’s wrong, and this is why.’ And then you can take customized actions…
Remember my pesky LLM question? Here’s an excerpt from that fun email I sent to AI evaluation vendors.
“My question really pertains to RAG evaluation benchmarks, but it also pertains to agents… What I don’t understand is: why isn’t there a benchmark that homes in on the tension points of the LLM deciding its own internal training data is better for the answer, rather than whatever the context window via RAG provides? I would call this benchmark something like ‘prompt adherence,’ because it really comes down to whether the LLM ignores the prompt instructions to utilize the RAG-provided context to answer the question.
This is a big issue in enterprise AI use cases, because in most cases, the AI vendors want the LLM to provide answers or, perhaps, take actions, based on the customer’s data input via the context window.
Even if the LLM responds correctly to the query based on its own training, that’s still a prompt adherence failure – if you want the LLM to use the context window data. This is especially important for compliance-related questions, where new data via retrieval may be vitally important.”
Some vendors have an adjacent benchmark they call “faithfulness,” but faithfulness is an attribute of LLMs that is worthy of scientific study; it’s not an intuitive way to study their retrieval habits. However, Galileo has a metric called “context adherence.” This turns out to be close to what I was looking for. As Sanyal explains:
Context adherence is giving you a quantitative measure of: how much did the LLM generation stick to the [RAG] context? When the score is low, when it starts freelancing and outputting its own thing, we’ve done innovation where we can literally highlight the sentences that were used in the context, in the reasoning of the generation. So that points to the specific citations. If that comes under the highlighted area, we are able to highlight those citations as well.
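To make that concrete, here is a deliberately crude sketch of what a context-adherence-style check is getting at. This is not Galileo’s method (their metric is model-based, and as Sanyal describes, it highlights the specific supporting sentences and citations); the token-overlap scoring, threshold, and toy data below are my own illustrative assumptions.

```python
# Crude illustration of a context-adherence-style check (NOT Galileo's method):
# flag answer sentences with little lexical support in the retrieved context.
import re

def sentence_support(sentence: str, context: str) -> float:
    """Fraction of the sentence's words that also appear in the context."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    ctx_words = set(re.findall(r"[a-z']+", context.lower()))
    return len(words & ctx_words) / len(words) if words else 0.0

def context_adherence(answer: str, context: str, threshold: float = 0.6):
    """Score each answer sentence; low-scoring sentences suggest 'freelancing'."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    scores = [(s, sentence_support(s, context)) for s in sentences]
    adhered = [s for s, sc in scores if sc >= threshold]
    overall = len(adhered) / len(scores) if scores else 0.0
    return overall, scores

context = "Medical leave requests must now be approved by HR within 10 business days."
answer = ("Medical leave requests are approved by HR within 10 business days. "
          "Employees may also take unlimited sabbaticals.")
overall, per_sentence = context_adherence(answer, context)
print(overall)  # 0.5: half the answer has no support in the context
for sent, score in per_sentence:
    print(f"{score:.2f}  {sent}")
```

Even this crude version shows the shape of the signal: the unsupported sentence is exactly the one a human reviewer would want flagged.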
Better AI evaluation -> the mechanics of AI trust
Sanyal’s reference to citations is really important here. Another advantage to utilizing RAG is getting a big leg up on explainability. LLM output from RAG typically has much better source references, showing users exactly where the output was pulled from. No, the output might not perfectly answer the question, but you can click into it, reducing that “black box” LLM output mystery – and improving trust. And now, with RAG evaluation tools, you can improve the context that shows up – and which sources are cited. That’s what this is really about: improving the mechanics of AI trust.
If the LLM is reliably pulling from RAG data, you can then optimize that data through various “advanced RAG” techniques such as “re-ranking” the “chunks” of relevant data from your RAG context. Yes, that’s where Galileo’s “completeness” metric comes in (more on that soon).
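For the curious, re-ranking itself is a simple idea: over-retrieve candidate chunks with a cheap first pass, then re-score them with a better relevance model before assembling the prompt. The sketch below uses a trivial keyword-overlap scorer as a stand-in; in practice that second pass is usually a cross-encoder or a hosted re-ranking service, and the function names here are hypothetical.

```python
# Sketch of chunk re-ranking: over-retrieve, then re-score with a better model.
# score_relevance is a trivial stand-in; real pipelines use a cross-encoder or
# a re-ranking endpoint for this second pass.

def score_relevance(query: str, chunk: str) -> float:
    """Stand-in scorer: keyword overlap (replace with a cross-encoder in practice)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Keep only the top_k chunks after re-scoring the candidates."""
    return sorted(chunks, key=lambda ch: score_relevance(query, ch), reverse=True)[:top_k]

candidates = [
    "Vacation accrual is 1.5 days per month.",
    "Medical leave now requires HR approval within 10 business days.",
    "The cafeteria reopens Monday.",
    "Medical leave forms are on the HR portal.",
]
print(rerank("how do I get medical leave approved", candidates, top_k=2))
```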
But what about “prompt adherence?” I used that term as an umbrella for both context and prompt. LLM vendors often utilize prompt instructions to guide/plead/beseech the model to utilize the RAG context (I know of one that even explicitly says ‘use the context to answer the question,’ but this is better described as ‘guidance’ than a foolproof guarantee of LLM behavior). Galileo, however, separates this out into two metrics: context adherence and “instruction adherence.” Sanyal:
The other side of the RAG hallucination coin is instruction adherence, where we’ve often seen that people give a lot of prompt, a lot of context, a lot of instructions, like, ‘Here’s ten instructions, like the ten commandments that you have to follow.’ This is kind of a sister metric to context adherence. It works in the same way, but it measures, ‘Hey, did you follow the instructions or not?’ And that’s also very critical.
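One common way to automate a check like this, purely as an assumption on my part rather than anything Galileo has described, is an LLM-as-judge pass: list the instructions and ask a second model to grade the response against each one. A minimal sketch, with hypothetical instructions, that only builds the judge prompt:

```python
# Sketch of an instruction-adherence check via an LLM-as-judge prompt (an assumption,
# not Galileo's implementation): list each instruction and ask a judge model whether
# the response followed it. Here we only construct the judge prompt.

INSTRUCTIONS = [
    "Answer only from the provided context.",
    "Cite the policy document for every claim.",
    "Keep the answer under 100 words.",
]

def build_judge_prompt(instructions: list[str], response: str) -> str:
    """Ask a judge model to grade the response against each instruction."""
    numbered = "\n".join(f"{i + 1}. {inst}" for i, inst in enumerate(instructions))
    return (
        "For each instruction below, answer PASS or FAIL with a one-line reason, "
        "based only on the response shown.\n\n"
        f"Instructions:\n{numbered}\n\nResponse:\n{response}\n"
    )

print(build_judge_prompt(INSTRUCTIONS, "Medical leave is approved by HR [Policy 4.2]."))
```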
Galileo’s approach is better than the one I envisioned; it’s better to separate these two metrics out. Then you can bear down on “completeness.” That RAG metric helps you see which part of the context was utilized by the LLM. One particular citation may be the most relevant to a user query, but not “ranked” properly in the answer.
Galileo has useful info on these types of metrics on their site, including “chunk attribution.” On “completeness,” Galileo says:
If Context Adherence is your precision metric for RAG, Completeness is your recall. In other words, it tries to answer the question: ‘Out of all the information in the context that’s pertinent to the question, how much was covered in the answer?’
Consider a bot that fields employee questions about employment policy. Perhaps the medical leave information provided by the RAG system is downplaying an important change in how to get your medical leave approved. “Completeness” can automatically flag that, and pinpoint a re-ranking opportunity. Completeness strikes me as a potent metric, because there are plenty of actions you can take to improve what is being provided. Sanyal agrees: “Amongst all the metrics, the completeness metric is easiest to critique, and get to very high accuracy with some feedback.”
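Staying with that medical-leave example, here is a rough recall-style sketch of what a completeness check measures, namely how much of the pertinent context actually made it into the answer. Again, this is not Galileo’s implementation; the facts list, threshold, and function names are illustrative assumptions only.

```python
# Crude recall-style "completeness" illustration (not Galileo's implementation):
# of the context facts pertinent to the question, how many show up in the answer?
import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def covered(fact: str, answer: str, threshold: float = 0.5) -> bool:
    """Treat a fact as covered if most of its words appear in the answer."""
    fact_words = _words(fact)
    return len(fact_words & _words(answer)) / len(fact_words) >= threshold if fact_words else True

def completeness(pertinent_facts: list[str], answer: str) -> float:
    """Fraction of pertinent context facts that the answer actually covers."""
    if not pertinent_facts:
        return 1.0
    return sum(covered(f, answer) for f in pertinent_facts) / len(pertinent_facts)

facts = [
    "Medical leave must be approved by HR.",
    "Approval requests now go through the new HR portal.",  # the important change
]
answer = "Medical leave must be approved by HR."
print(completeness(facts, answer))  # 0.5: the policy change was left out of the answer
```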
What about results? Fair question, but before we go there, a word about Galileo’s overall approach. Galileo’s AI evaluation is not intended as an after-the-fact assessment, but as something you use throughout the process of building your AI app or agent. So they have three main modules: evaluate, observe, and protect. You can adjust these based on your development priorities, including accuracy, cost, reasoning, and latency (Galileo also utilizes smaller language models if lower cost is the priority).
Galileo’s new “foundation model” platform, Luna, has a couple of interesting features: auto-generated custom metrics, and their variation on reinforcement learning, “continuous learning with human feedback” (CLHF). Though reinforcement learning from human feedback took a back seat in the AI field for a little while, due to cost/scaling concerns, it has proven important in the new reasoning models – and, in my view, it plays a role in increasing AI trust (if the model doesn’t ‘learn’ from what you think about its results, how is that a good thing?).
My take – on accuracy, trust, and a better AI result
When I give my AI accuracy stump speech, I now have more company: McKinsey’s 2024 generative AI study found inaccuracy to be the number one concern of enterprise leaders. I’ve seen raw accuracy rates from LLM output as low as 50 to 60 percent – that’s not viable for enterprise AI.
But accuracy thresholds are also tied to use case. Example: if a new AI model can bump your predictive scenarios from 70 to 75 percent accuracy, that’s a potentially huge gain. Investments in accuracy do require a cost/benefit analysis – more sophisticated AI architectures are often more expensive. Though not always: rightsizing or distilling your model, for example, can bring costs down.
For enterprise use cases, you usually need the 90 percent range. Sanyal told me that initial use of Galileo often has users in the 75 to 80 percent accuracy range. But with improvements based on the evaluations, customers are getting into that 90 to 95 percent range. This opens up plenty of additional enterprise use cases (though not all!).
On the other hand, if you have an agentic process that involves several different task completions, and they all land in that 90 percent range, the math of good results can deteriorate, due to the compounding error problem (an issue for another time).
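A quick back-of-the-envelope calculation shows why the math deteriorates: if an agent chains five roughly independent steps and each lands at 90 percent, the whole run succeeds only about 59 percent of the time.

```python
# Back-of-the-envelope compounding error: five roughly independent steps at 90% each.
step_accuracy = 0.90
steps = 5
print(step_accuracy ** steps)  # ~0.59, so the full chain succeeds only ~59% of the time
```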
Now that we have the RAG part sorted, we can get to agents. One interesting/cautionary note: Sanyal told me that their highest accuracy rates are achieved by customers that also fine tune the model itself. As I noted, some vendors/customers are reluctant to get into model tuning, in order to keep costs down, or to keep customer data out of an external model. However, the economics of building/maintaining custom models should continue to improve. In that case, custom models could become central, as Sage has learned in finance: Sage has had to build its own LLMs because ‘GPT doesn’t recognize accounting terms’.
In non-mathematical scenarios, only a human can judge whether an LLM output was indeed the proper one. Still, it’s important to pursue the thoughtful automation of as many metrics as possible. But getting human feedback on output matters also; Galileo’s CLHF efforts are worth tracking.
I haven’t covered all of Galileo’s measurements here. They have other metrics for agents, including tool selection, tool error rate and task completion. That’s fodder for a future piece. I believe we can learn more from distrust in AI than from a vague trust that AI vendors will sort this out. Opt for the latter – and you run the risk of alienating users with substandard output and force-feeding (see: Gemini into Google Workspace).
AI evaluation tools are a step toward a better result. Whether that result contributes to “trust” is debatable, but for now, a better AI result sounds good. The AI tools I personally use are imperfect, but I trust them precisely because I know their limitations – and I’ll keep testing them to see what’s next. But enterprises need something that scales better than my outlier sandbox. Galileo is hardly the only player here; they were just one standout from this search. Casting a wide net is the only sensible move here.
End note: thanks to Vijay Vijayasankar for field views that factored into this article. Also: thanks to the AI Makerspace community – their YouTube videos typically have a good balance of developer enthusiasm and evaluative metrics.