Implementing Self-RAG with Copilot Studio: Advanced RAG Techniques for Better AI Responses

Following our previous implementation of Naive RAG in Copilot Studio, this time we’ll explore “Self-RAG,” one of the Advanced RAG techniques.


Self-RAG

Self-RAG is a methodology developed around October 2023, designed to improve response quality and reduce hallucinations.

Here’s a high-level overview of how Self-RAG works:

  1. Determine whether information retrieval is necessary (if not, generate a response directly)
  2. If retrieval is needed, fetch multiple documents and evaluate their relevance to the question
  3. Generate responses based on the relevant documents
  4. Evaluate each response and synthesize the final answer
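
To make this control flow concrete, here is a minimal Python sketch of the loop. The helper functions (needs_retrieval, build_query, retrieve, grade_relevance, generate, grade_answer) are hypothetical stand-ins for the prompt actions and the AI Search call we build in Copilot Studio below, and the retry cap is an extra safety measure rather than part of the original method.

```python
# Minimal sketch of the Self-RAG loop described above (not the Copilot Studio topic itself).
# All helper functions are hypothetical stand-ins for the prompt actions and
# the AI Search retrieval implemented later in this article.

def self_rag(question: str, max_retries: int = 3) -> str:
    # Step 1: decide whether external retrieval is needed at all.
    if not needs_retrieval(question):
        return generate(question, docs=None)             # answer directly from the model

    excluded_keywords: list[str] = []                     # queries that failed previously
    for _ in range(max_retries):                          # cap retries to avoid infinite loops
        query = build_query(question, excluded_keywords)  # Step 2: craft a search query
        docs = retrieve(query)                            #         fetch candidate documents
        if not grade_relevance(question, docs):           #         keep only relevant documents
            excluded_keywords.append(query)               # remember the failed query and retry
            continue

        answer = generate(question, docs)                 # Step 3: grounded response generation
        if grade_answer(question, answer):                # Step 4: does it resolve the question?
            return answer
        excluded_keywords.append(query)                   # otherwise retry from the search phase

    return "Sorry, I could not find a reliable answer."   # graceful fallback after max_retries
```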

Ideally, Self-RAG involves fine-tuning LLMs to create separate “critic” and “generator” models. However, since that level of customization isn’t feasible in our context, we’ll use GPT-4o for all of these functions.

Implementing Self-RAG in Copilot Studio

Drawing inspiration from two reference implementations, I’ve designed the following workflow. Since current models have significantly larger input token limits than those available when Self-RAG was first proposed (the GPT-3.5 era), we can now generate responses for multiple documents in a single call.
Diagram showing Self-RAG implementation workflow in Copilot Studio

Implementation

Since our main focus is on building Self-RAG in Copilot Studio, we’ll prioritize the implementation over optimizing accuracy (such as prompt engineering).

Trigger and Variable Declaration

Select the “On Redirect” trigger and declare the variable “excluded_keywords.” This variable stores query terms that should be avoided when generating search queries (based on previous searches that failed to find relevant documents).
Screenshot showing trigger setup and variable declaration

Determining Search Necessity

Next, implement the “search necessity evaluation” component.
Screenshot showing search necessity evaluation flow
The evaluation is implemented using a prompt action.
Screenshot showing prompt action configuration
Configure the prompt as shown, specify JSON as the output format, and select GPT-4o as the model.
Screenshot showing prompt settings and model selection
Your task is to determine whether a user's question requires external knowledge retrieval or not. Use the following criteria to make your decision:
### Decision Criteria:
1. **No retrieval required**:
   - If the question can be answered confidently using only your pre-existing knowledge, classify it as "no retrieval required."
   - Examples: Definitions, general knowledge, basic calculations, or simple reasoning tasks.

2. **Retrieval required**:
   - Classify the question as "retrieval required" if it meets any of the following conditions:
     - The question requires up-to-date information (e.g., recent events or news).
     - The question relates to specific domain knowledge (e.g., legal, medical, or technical details) that may not be fully covered by your internal knowledge.
     - The question explicitly references external resources (e.g., specific documents, websites, or datasets).
     - Your internal knowledge alone is insufficient to provide a comprehensive or accurate answer.

### Output Format:
Provide your answer in the following format:
- **"search_required": "yes"** (if retrieval is needed)
- **"search_required": "no"** (if retrieval is not needed)
Here is the user's question:
Question:  {question}

Respond with the required output format only, without any additional explanation or context.
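
For readers who want to reproduce this step outside Copilot Studio, a roughly equivalent GPT-4o call with the OpenAI Python SDK could look like the sketch below. The prompt is abbreviated, the model name is a placeholder for your own deployment, and wrapping the output as a single JSON object is our assumption.

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NECESSITY_PROMPT = """Your task is to determine whether a user's question requires
external knowledge retrieval or not. (Decision criteria abbreviated -- see the prompt above.)
Respond only with JSON: {{"search_required": "yes"}} or {{"search_required": "no"}}.
Question: {question}"""

def needs_retrieval(question: str) -> bool:
    """Ask GPT-4o whether the question needs external retrieval (JSON mode)."""
    response = client.chat.completions.create(
        model="gpt-4o",                           # placeholder: use your own deployment name
        response_format={"type": "json_object"},  # force a JSON object in the reply
        messages=[{"role": "user", "content": NECESSITY_PROMPT.format(question=question)}],
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("search_required") == "yes"
```
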
If the evaluation determines that search is unnecessary, GPT-4o generates a direct response and the process ends.
Screenshot showing direct response generation flow
If search is deemed necessary, the system generates search queries,
Screenshot showing search query generation
and performs the search using AI Search.
Screenshot showing AI Search implementation
Note: The Retrieval topic is a simple component that executes searches against AI Search using the received query and returns the results.
Screenshot showing Retrieval topic configuration
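
For reference, the query-generation step and the Retrieval topic could be approximated outside Copilot Studio as below. The Azure AI Search endpoint, index name, key, and the `content` field are placeholders, and `build_query` only illustrates how excluded_keywords can be folded into the query instruction; for brevity it does not make the actual GPT-4o call.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient  # pip install azure-search-documents

# Placeholder connection details -- replace with your own Azure AI Search values.
search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-query-key>"),
)

def build_query(question: str, excluded_keywords: list[str]) -> str:
    """Illustrative only: compose the instruction that GPT-4o would turn into a search query."""
    avoid = ", ".join(excluded_keywords) or "none"
    instruction = (
        f"Generate a concise search query for the question: {question}\n"
        f"Avoid these previously unsuccessful terms: {avoid}"
    )
    # `instruction` would be sent to GPT-4o (as in the previous sketch) to obtain the query;
    # we return the raw question here so the sketch runs without an extra LLM call.
    return question

def retrieve(query: str, top: int = 5) -> list[str]:
    """Equivalent of the Retrieval topic: run the query against AI Search and return text."""
    results = search_client.search(search_text=query, top=top)
    return [doc["content"] for doc in results]  # assumes an index field named 'content'
```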

Relevance Evaluation

Next, evaluate the relevance of the retrieved documents.
Screenshot showing relevance evaluation flow
Configure the prompt as shown, specify JSON as the output format, and use GPT-4o as the model. Note: while we’re using a binary yes/no evaluation here, a scoring system with thresholds might be more effective.
Screenshot showing prompt configuration for relevance evaluation
You are an evaluator tasked with determining the relevance of a retrieved document to a user question.
This assessment does not require overly strict criteria, but the goal is to exclude clearly irrelevant documents.
If the document directly answers the user question, provides supporting information, or includes keywords/semantic meaning clearly related to the question, grade it as relevant.
If the document is unrelated, off-topic, or too vague to establish a clear connection to the user question, grade it as not relevant.

Respond with a binary score:
Output yes if the document is relevant.
Output no if the document is irrelevant.

# user question : {question}
# documents : 
{docs}
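
A standalone approximation of this relevance gate is sketched below. It reuses `client` and `json` from the earlier sketch via a small `chat_json` helper; the JSON schema ({"relevant": "yes"/"no"}) is our assumption, since the Copilot Studio prompt above asks for a plain yes/no.

```python
def chat_json(prompt: str) -> dict:
    """Send a prompt to GPT-4o in JSON mode and parse the reply (client/json from the earlier sketch)."""
    response = client.chat.completions.create(
        model="gpt-4o",                           # placeholder model/deployment name
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

def grade_relevance(question: str, docs: list[str]) -> bool:
    """Binary relevance gate mirroring the prompt above (JSON schema is an assumption)."""
    prompt = (
        "You are an evaluator tasked with determining the relevance of retrieved documents "
        'to a user question. Respond with JSON: {"relevant": "yes"} or {"relevant": "no"}.\n'
        f"# user question : {question}\n"
        "# documents :\n" + "\n---\n".join(docs)
    )
    return chat_json(prompt).get("relevant") == "yes"
```
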
If the documents are deemed irrelevant, add the unsuccessful search query to the excluded_keywords variable and restart from query generation.
Screenshot showing handling of irrelevant document cases

If relevance is confirmed, proceed to response generation.

Response Generation and Answer Validation

Next, generate responses from documents deemed relevant and evaluate whether these responses adequately answer the original question.
Screenshot showing response generation and validation flow
First, generate responses from the documents,
Screenshot showing response generation configuration
Then evaluate the generated responses.
Screenshot showing response evaluation setup
Configure the prompt as shown, specify JSON as the output format, and use GPT-4o as the model.
Screenshot showing prompt configuration for response evaluation
You are an evaluator tasked with assessing whether a generated answer appropriately addresses or resolves a user's question.
If the answer directly resolves the question, provides accurate and sufficient information, or effectively addresses the intent behind the question, grade it as yes.
If the answer is incomplete, vague, inaccurate, off-topic, or fails to address the intent of the user question, grade it as no.

Respond with a binary score:
Output yes if the answer resolves the question.
Output no if the answer does not resolve the question.

# User question : {question}
# LLM generation answer : {generation}
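
The answer check follows the same pattern as the relevance gate; a compact equivalent using the `chat_json` helper from the previous sketch (again, the exact JSON schema is an assumption) could be:

```python
def grade_answer(question: str, generation: str) -> bool:
    """Check whether the generated answer actually resolves the user's question."""
    prompt = (
        "You are an evaluator tasked with assessing whether a generated answer appropriately "
        "addresses or resolves a user's question. "
        'Respond with JSON: {"resolved": "yes"} or {"resolved": "no"}.\n'
        f"# User question : {question}\n"
        f"# LLM generation answer : {generation}"
    )
    return chat_json(prompt).get("resolved") == "yes"
```
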
If the response is judged to resolve the question, present it to the user and end the process.
Screenshot showing relevant response handling
If the response is judged not to resolve the question, add the unsuccessful search query to the excluded_keywords variable and restart from query generation. Note: while we could simply retry response generation, we restart from the search phase because the document retrieval itself may have been suboptimal.
Screenshot showing handling of irrelevant responses

This completes the topic implementation.

Important Note: Without setting a maximum iteration limit for the “search query retry” process, there’s a risk of entering an infinite loop. While we’ve omitted this for this demonstration, it’s crucial to implement such safeguards in production environments.
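
In plain code this safeguard is just a counter; in Copilot Studio it would typically be a numeric topic variable incremented on each retry and checked in a condition node. A minimal sketch (the limit of 3 is arbitrary):

```python
MAX_RETRIES = 3    # arbitrary cap -- tune for your scenario
retry_count = 0    # in Copilot Studio this would be a numeric topic variable

while True:
    # ... generate a query, retrieve, grade relevance, generate and grade the answer ...
    # (a successful answer would `break` out of the loop here)
    retry_count += 1
    if retry_count >= MAX_RETRIES:
        answer = "Sorry, I couldn't find relevant information."  # graceful fallback
        break
```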

Optional: Integration with Conversational Boosting

Finally, as in our previous implementation, complete the setup by connecting to the system topic’s “Conversational boosting.”
Screenshot showing Conversational boosting integration

Testing Results

First, we tested with the same question that worked in our previous implementation. The system provided an accurate response.
Screenshot showing successful response to previously answered question
Next, we tried the question that failed in our previous implementation. After several search iterations,
Screenshot showing multiple search iterations
the system successfully generated a response.
Screenshot showing successful response generation
As a bonus, the system also handled questions that don’t require search effectively.
Screenshot showing successful handling of non-search questions

These results suggest improved accuracy compared to our previous implementation. In the next article, I’d like to experiment with CRAG and other advanced techniques.

