An Approach to Testing RAG in Salesforce

Why Retrieval Augmented Generation (RAG) demos dazzle but deployments falter, and how a structured, Salesforce-native testing framework can help bridge the gap.

TL;DR

RAG in Salesforce often dazzles in demos but can struggle in production, surfacing irrelevant knowledge, hallucinating, and frustrating agents. Often, out-of-the-box thumbs up/down feedback is too subjective and inconsistent to guide improvements. A structured evaluation framework is needed. Using an LLM-as-a-Judge approach, teams can measure RAG outputs for context quality, faithfulness, and relevance, pinpointing whether issues lie in retrieval, prompts, or content. With help from tools like the Einstein Prompt Testing Utility, organisations can turn flashy demos into trusted, well-adopted systems fit for a production environment.

From ‘Demo Magic’ to a Disappointing Reality

A familiar story: the demo was flawless. A showcase of Retrieval Augmented Generation (RAG)-powered features in Salesforce Service Cloud (Einstein Service Replies, Work Summaries, Case Summaries) surfaced answers instantly, summarised cases neatly, and suggested on-point replies.

The audience was impressed; this seemed like it would deliver on the promise of reduced handle times and increased first-point-of-contact (FPoC) case resolution.

But when it hit real agents and real cases, things unravelled.

  • Irrelevant knowledge surfaced.
  • Hallucinations crept into responses.
  • Agents became confused, frustrated, and skeptical.

The result? Minimal improvement in handle times and falling adoption.

The system wasn’t broken; it was generating responses as it was designed to. What was missing was a systematic approach for testing and deploying RAG systems in a complex, imperfect environment.

Why Testing RAG in Salesforce Matters

In Salesforce Service Cloud, RAG isn’t just a feature; it sits at the very centre of customer-facing workflows. Agents rely on it during critical interactions, with customer wait times and satisfaction on the line.

When done well, RAG delivers transformative value:

  • Surfacing relevant knowledge / similar issues needed to resolve cases
  • Suggesting context-aware, grounded service replies
  • Generating case or work summaries to reduce overhead / after-call work

But inaccuracy brings serious consequences:

  • Irrelevant outputs that frustrate agents
  • Hallucinations that erode customer trust
  • Potential data exposure, whether sensitive or internal
  • Agent frustration, resulting in tool abandonment

Testing isn’t optional; it’s essential to building adoption and trust!

Challenges with the Usual Approach

Of course, Salesforce already includes feedback mechanisms such as thumbs up/down on responses. But in practice, at the time of writing, these aren’t enough. They provide sentiment, not diagnosis. Here’s why that matters…

  • Purely human-centric feedback – entirely subjective.
  • No diagnostic insights – feedback doesn’t pinpoint whether the issue lies in retrieval, prompt engineering, or generation.
  • Inconsistent evaluation – feedback across agents and scenarios lacks standardisation.
  • No benchmarks for improvement – without controlled datasets, measuring progress is impossible.

In other words, user feedback is too rudimentary to support systematic testing and enhancement, and gathering it is time- and resource-intensive. It’s often not feasible to take significant agent time away from the contact centre to dedicate to the development and deployment of RAG tooling in Salesforce.

What’s needed is a structured, analytical, and repeatable evaluation method: one where we can iteratively develop the RAG system and, in effect, its hyperparameters (chunking strategy, knowledge ingestion, filtering, ranking, prompt engineering, model selection, etc.), measuring each iteration of changes against a controlled set of realistic customer interactions to quantify whether those changes are having a positive impact.
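To make that iteration concrete, it helps to treat each combination of hyperparameters as a versioned configuration that every evaluation run is tagged with. Here is a minimal sketch, assuming hypothetical field names (none of these are Salesforce metadata types):

```python
from dataclasses import dataclass

# Hypothetical record of one pipeline configuration; the field names are
# illustrative only. Every configuration is scored against the same
# controlled dataset so iterations can be compared like for like.
@dataclass(frozen=True)
class RagConfig:
    chunking: str = "512 tokens, 64-token overlap"   # knowledge chunking strategy
    retriever: str = "hybrid"                        # vector, keyword or hybrid search
    top_k: int = 5                                   # chunks passed into the grounded prompt
    ranking: str = "reciprocal-rank-fusion"          # fusion / re-ranking configuration
    prompt_template: str = "Service_Replies_v3"      # prompt engineering iteration
    model: str = "model-under-test"                  # generation model selection
```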

An Approach to Testing RAG

LLM-as-a-Judge Framework

One such method is the LLM-as-a-Judge approach, where one AI model evaluates the quality of another’s outputs. AWS describes this in their RAG evaluation post, presenting reliable metrics and structured workflows for assessing RAG systems via Amazon Bedrock. This workflow applies equally well to Salesforce’s RAG capabilities, offering a systematic way to measure and improve the system’s performance.

  1. Set up and monitor the RAG pipeline (build the prompts, configure the search index and retrievers in Data Cloud, and enable the features being evaluated, e.g. Service Replies).
  2. Create an evaluation dataset using representative customer interactions (cases/voice calls/messaging sessions).
  3. Define evaluation prompts and metrics to assess the quality of the generated output from the grounded prompts.
  4. Compute scores using an LLM-as-a-Judge model.
  5. Review results and optimise retrieval, the knowledge base, or prompts based on the findings.
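A minimal sketch of steps 2–5 as a loop follows, assuming two placeholder callables: one that wraps the grounded prompt template invocation for a record, and one that returns the metric scores described in the next section. Neither is a real Salesforce or AWS API.

```python
# Minimal sketch of steps 2-5 of the workflow.
# generate_reply(case) is assumed to return {"reply": ..., "context": ...} by
# invoking the grounded prompt template for a record; judge(case, output) is
# assumed to return the metric scores described in the next section.
def evaluate_dataset(eval_cases, generate_reply, judge):
    results = []
    for case in eval_cases:                 # step 2: representative interactions
        output = generate_reply(case)       # the pipeline from step 1, run per record
        scores = judge(case, output)        # steps 3-4: LLM-as-a-Judge scoring
        results.append({"case_id": case["id"], **scores})

    # Step 5: aggregate so changes to retrieval, knowledge or prompts can be compared.
    summary = {
        metric: sum(r[metric] for r in results) / len(results)
        for metric in ("context_quality", "faithfulness", "answer_relevance")
    }
    return results, summary
```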

Key Evaluation Metrics

AWS highlights several essential quantitative metrics that align well with service use cases:

  • Context Quality – Are the retrieved documents/articles sufficiently relevant and complete?
  • Faithfulness – Does the generated response stay true to retrieved context (avoiding hallucinations)?
  • Answer Relevance – Does the generated answer directly address the customer’s query?

These metrics provide a robust, objective basis for diagnosing system performance and guiding improvement.
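To make these metrics operational, the judge model needs a prompt that defines them explicitly and returns structured scores. The wording, scale, and JSON keys below are our own illustration, not taken from AWS or Salesforce documentation:

```python
import json

# Illustrative judge prompt; the wording, scale and JSON keys are assumptions.
JUDGE_PROMPT = """You are evaluating a generated customer-service reply.

Customer query:
{query}

Retrieved grounding (knowledge article snippets):
{context}

Generated reply:
{reply}

Score each metric from 0 to 100 with a one-sentence rationale:
- context_quality: are the retrieved articles relevant and complete for this query?
- faithfulness: does the reply stay within the retrieved grounding, with no invented facts?
- answer_relevance: does the reply directly address the customer's query?

Respond with JSON only, using the keys:
context_quality, faithfulness, answer_relevance, rationale."""


def judge_reply(query: str, context: str, reply: str, call_judge_llm) -> dict:
    """call_judge_llm is a placeholder for whichever evaluator model is used;
    it should accept the filled-in prompt and return the raw completion text."""
    completion = call_judge_llm(
        JUDGE_PROMPT.format(query=query, context=context, reply=reply)
    )
    return json.loads(completion)  # structured scores for aggregation and reporting
```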

A Worked Example

To see the process in action, here’s a worked example of using this approach to measure the quality of RAG in a Salesforce Service Cloud context (Einstein Service Replies), showing how a typical Service Reply looks when evaluated with this framework.

Query (Customer Utterance during a messaging interaction)

(Unauthenticated user) “Hi, I’m having trouble accessing my account.” 

Grounded Service Reply Outputs (generated via an Agentforce grounded service replies prompt)

Response 1

“Thanks for getting in touch. You should start by verifying your identity and resetting your password. Go to the login page and select Forgot Password, then enter the email address on your account. Use the password reset link we send to create a new, strong password. Once signed in, you may reconfigure two-factor authentication in your account settings if you have it enabled.”

  • Intent: account_recovery_guidance
  • Source: Knowledge Article Version (ID: ka0Hn000000elXhIAI, snippet words 1–120)

Response 2

“I can help you regain access. Please use Forgot Password on the login page and enter your registered email. Follow the time-sensitive reset link to set a new password with a mix of letters, numbers, and symbols. After you regain access, visit account settings to enable or reconfigure two-factor authentication for added security.”

  • Intent: account_recovery_steps
  • Source: Knowledge Article Version (ID: ka0Hn000000elXhIAI, snippet words 1–140)

Within the resolved Prompt Template, the following (abridged) knowledge grounding was retrieved, along with the vector / hybrid search scores returned from the retriever.

Quality Assessment Results (RAG Evaluation)

Using the LLM-as-a-Judge method, the following metrics were computed for the responses to our example customer’s query, based on the knowledge grounding that was retrieved (a sketch of how the verdict can be captured as structured data follows the scores).

Context Quality (82/100)

  • Retrieved context included a highly relevant KB article on password resets.
  • Noise introduced by irrelevant articles on returns, billing, and privacy reduced precision.
  • Gaps: no guidance for common failure scenarios (e.g. expired reset link, no access to email, 2FA lockouts).

Faithfulness (95/100)

  • Both responses closely matched the grounding article: Forgot Password flow, time-sensitive link, strong password creation, and 2FA setup.
  • Minor omission: Response 1 did not explicitly note that the reset link expires.
  • No hallucinated content detected.

Relevancy (92/100)

  • Both responses directly addressed the account access issue.
  • Guidance was actionable and concise, with 2FA reconfiguration noted as a helpful, if slightly tangential, addition.
  • Missing: edge cases such as no access to registered email or reset email not arriving.
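It pays to persist each judge verdict as a structured record rather than free text, so scores can be aggregated and tracked across iterations. Here is a sketch of how the verdict above might be captured; the field names are our own convention, and the notes simply paraphrase the findings listed above:

```python
# How the judge verdict for this worked example might be stored.
evaluation_record = {
    "record_id": "example-messaging-session",
    "prompt_template": "Service_Replies",
    "scores": {
        "context_quality": 82,
        "faithfulness": 95,
        "answer_relevance": 92,
    },
    "notes": {
        "context_quality": "Relevant password-reset article retrieved; noise from "
                           "returns/billing/privacy articles; no coverage of expired "
                           "links, lost email access or 2FA lockouts.",
        "faithfulness": "Both replies follow the grounding article; Response 1 omits "
                        "that the reset link expires; no hallucinations detected.",
        "answer_relevance": "Both replies address the access issue; edge cases such as "
                            "no access to the registered email are not covered.",
    },
}
```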

Evaluation Insights: Areas to Improve

Critically, this analysis provides guidance on where we should look to improve the performance of the Service Replies:

  • Knowledge Gaps: Expand KB coverage for scenarios where users lack access to registered email, or face reset email/time-out issues.
  • Prompt Enhancements: Add constraints to the prompt, ensuring responses always include troubleshooting and alternative recovery paths.
  • Retriever Optimisation: Filter out noise from unrelated domains (billing, returns, privacy), and prioritise authentication-tagged content.

This example shows how the approach helps identify where the RAG pipeline can be improved, whether by enriching the knowledge base, refining prompts, or optimising retrieval.

A Utility for RAG Evaluation in Salesforce

To put these ideas into practice, we’ve developed an Einstein Prompt Testing Utility, a Salesforce-native tool that empowers teams to systematically test RAG in a Salesforce sandbox environment.

🔗 Explore the repo here: ziipline/einstein-service-prompt-testing-utility

What It Does

  • Multi-template capability – supports Service Replies, Case Summaries, and Work Summaries.
  • Batch processing – automatically runs records through selected templates and captures the results of each generation (see the sketch after this list).
  • Real organisational data – tests using historical cases, messaging sessions, or voice calls available in the sandbox.
  • Automated scoring – faithfulness, relevancy, and context quality evaluated quantitatively.
  • Interactive UI – intuitive Lightning Web Component interface for selecting prompts and data sets.
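To give a flavour of the batch step from outside the utility, here is a rough sketch of invoking a prompt template per record over REST from Python. The endpoint path, API version, and payload shape are assumptions and should be checked against the Prompt Template Connect API documentation for your org; the utility itself performs this step natively via Apex and LWC.

```python
import requests

# Assumed endpoint and payload shape for prompt template generations; verify
# both against the Prompt Template Connect API documentation for your org.
def generate_for_record(instance_url: str, access_token: str,
                        template_api_name: str, case_id: str) -> dict:
    url = (f"{instance_url}/services/data/v61.0/einstein/"
           f"prompt-templates/{template_api_name}/generations")
    payload = {
        "isPreview": False,
        "inputParams": {
            "valueMap": {
                # The input name depends on how the prompt template defines it.
                "Input:Case": {"value": {"id": case_id}},
            }
        },
    }
    response = requests.post(url, json=payload,
                             headers={"Authorization": f"Bearer {access_token}"})
    response.raise_for_status()
    return response.json()  # contains the generated text to pass to the judge
```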

Why It Matters

This tool transforms subjective feedback into a repeatable, controlled testing environment, so teams can:

  • Diagnose root causes – whether retrieval, prompt, or model-related.
  • Benchmark performance across LLM versions, retrievers, and templates.
  • Track performance improvements over time.
  • Deliver trusted experiences to agents in production.

Lessons Learned

From our experience of experimentation and deployments, several lessons stand out:

  1. Retrieval isn’t a black box — explore custom retrievers, ensemble techniques, fusion ranking configuration, chunk enrichment, or hybrid search (a fusion-ranking sketch follows this list).
  2. Chunking strategy matters — poor chunking ruins context relevance.
  3. Model choice is important — grounding alone doesn’t compensate for weaker LLMs.
  4. Prompt engineering delivers results — clarity and structure outperform ad-hoc wording.
  5. Continuous experimentation is recommended — monitor metrics, iterate, and keep on top of Salesforce release notes.
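On the first point, fusion ranking is worth a closer look. Below is a minimal reciprocal rank fusion (RRF) sketch, one common way of combining keyword and vector result lists in a hybrid retriever; the article IDs are invented, and k = 60 is the constant typically used in the RRF literature.

```python
# Reciprocal rank fusion: combine ranked result lists from different retrievers
# (e.g. keyword and vector search) into a single ordering.
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: fuse a keyword ranking and a vector ranking of knowledge chunks.
fused = reciprocal_rank_fusion([
    ["ka_password_reset", "ka_billing", "ka_returns"],      # keyword search order
    ["ka_password_reset", "ka_2fa_lockout", "ka_billing"],  # vector search order
])
print(fused)  # ka_password_reset ranks first, since both retrievers rank it highly
```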

From ‘Demo Magic’ to Trusted Adoption

The demo may look impressive, but success and adoption depend on trust. Real-world RAG evaluation in Salesforce is less about showmanship and more about repeatable, reliable performance.

Testing RAG in Salesforce isn’t about achieving perfect scores and passing all test scripts; it’s about building confidence among agents and delivering good experiences for customers.

With tools like the Einstein Prompt Testing Utility providing a practical way to embed a testing framework for RAG solutions, organisations can move beyond subjective feedback as a means of measuring success. Instead, they can institute a robust, Salesforce-native approach to RAG evaluation.

This is how RAG evolves from impressive demos to a reliable solution in production.

“The leaders in service AI aren’t those with the flashiest demos, but those with the most trusted, well-tested systems.”

Closing Thoughts

With the right frameworks in place, and a clear plan for embedding RAG systems, organisations can bridge the gap between “demo magic” and a performant, production-ready system, giving agents confidence and customers better outcomes – actually delivering the ROI promised in the business case!

Have you experimented with testing RAG in your Salesforce environment? What challenges did you encounter? Struggling to see results in your Salesforce GenAI rollout? Get in Touch.