Quantitative Evaluation for Putting RAG into Production

山﨑祐太

CEO

2025-1-6

#AI

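Overview
RAG Evaluation Metrics
Implementing the Evaluation Metrics
Preparation
Hit Rate
MRR
Faithfulness
Answer Relevancy
Code to Aggregate the Metrics
Automatically Generating the Evaluation Dataset
Summary
References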

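Overview

Building document search systems and chatbots with RAG is common practice as of 2025. Putting such a system into production, however, requires proper quantitative evaluation and improvement driven by the resulting metrics.

This article presents quantitative evaluation metrics for RAG and explains the steps needed to take a RAG system into production.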

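RAG Evaluation Metrics

RAG is evaluated from the following two perspectives.

Retrieval Evaluation
Response Evaluation

Retrieval evaluation measures whether the chunks retrieved when RAG generates an answer are appropriate, using metrics such as Hit Rate and MRR (Mean Reciprocal Rank). It is close to the quantitative evaluation used in general search technology.

Response evaluation measures whether the content of the LLM's answer is appropriate.

The RAGAS paper proposes evaluating from the following three aspects.

Faithfulness: whether the answer is consistent with the information in the retrieved chunks
Answer Relevance: whether the answer is appropriate as a response to the question
Context Relevance: whether each chunk is relevant to the question

The OpenAI Cookbook publishes an implementation based on llama_index that uses the following evaluation metrics.

Hit Rate
MRR
Faithfulness
Relevancy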

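Implementing the Evaluation Metrics

Preparation

To prepare the data, download the Paul Graham essay dataset.

mkdir -p data/paul_graham/
curl "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -o "data/paul_graham/paul_graham_essay.txt"

Using the code below as a base, we build a RAG pipeline for quantitative evaluation. Note that the code reads from ../data/paul_graham/, so it assumes the script is run from a directory that sits next to data/.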

import asyncio
import os

import nest_asyncio
import pandas as pd
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.evaluation import (
    BatchEvalRunner,
    EmbeddingQAFinetuneDataset,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

os.environ["OPENAI_API_KEY"] = (
    "sk-proj-xxx"
)
nest_asyncio.apply()

def main():
    PERSIST_DIR = "persist"
    MODEL_NAME = "gpt-4o-mini"
    QA_DATASET_PATH = "qa_dataset.json"

    documents = SimpleDirectoryReader("../data/paul_graham/").load_data()

    llm = OpenAI(model=MODEL_NAME)

    # Create document nodes with a chunk size of 512
    node_parser = SentenceSplitter.from_defaults(chunk_size=512)
    nodes = node_parser.get_nodes_from_documents(documents)

    # Assign node IDs manually so they stay the same across runs
    for idx, node in enumerate(nodes):
        node.id_ = f"node_{idx}"

    # Load the VectorStoreIndex if it was already persisted; otherwise create it
    if not os.path.exists(PERSIST_DIR):
        os.makedirs(PERSIST_DIR)
        vector_index = VectorStoreIndex(nodes)
        vector_index.storage_context.persist(PERSIST_DIR)
    else:
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        vector_index = load_index_from_storage(storage_context)

    query_engine = vector_index.as_query_engine()

    response_vector = query_engine.query("What did the author do growing up?")
    print(response_vector.response)

if __name__ == "__main__":
    main()

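The code above does the minimum preparation, such as creating the Documents and the VectorStoreIndex. At the end, it asks the RAG pipeline what Paul Graham did growing up and prints the result to standard output.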

The author worked on writing short stories and programming, particularly on an IBM 1401 computer in 9th grade using an early version of Fortran.

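Hit Rate

Hit Rate measures whether the retrieval system was able to retrieve a relevant chunk.

In llama_index, the implementation works as follows.

It receives a list of retrieved chunk IDs and a list of expected chunk IDs
If at least one ID matches, the score is 1.0, otherwise 0.0, and the scores are averaged over multiple queries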
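The two steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the llama_index source; the function names `hit_rate` and `mean_hit_rate` and the sample node IDs are hypothetical.

```python
def hit_rate(expected_ids: list[str], retrieved_ids: list[str]) -> float:
    """Return 1.0 if at least one expected chunk ID was retrieved, else 0.0."""
    return 1.0 if set(expected_ids) & set(retrieved_ids) else 0.0


def mean_hit_rate(samples: list[tuple[list[str], list[str]]]) -> float:
    """Average the per-query scores over (expected_ids, retrieved_ids) pairs."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(hit_rate(expected, retrieved) for expected, retrieved in samples) / len(samples)


# Hypothetical data: three queries, each with (expected IDs, retrieved IDs).
samples = [
    (["node_3"], ["node_3", "node_7"]),  # hit
    (["node_5"], ["node_1", "node_2"]),  # miss
    (["node_9"], ["node_4", "node_9"]),  # hit
]
print(mean_hit_rate(samples))  # two hits out of three queries
```

Because the score only checks for an overlap, it ignores where in the result list the relevant chunk appears; that ordering is what MRR captures.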

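MRR

MRR looks at the rank at which a relevant document appears among the top-k retrieved chunks and averages the reciprocal of that rank over queries. By default, llama_index computes the reciprocal of the rank of the first relevant document retrieved. As an option, it also offers a calculation that sums the reciprocal ranks of all relevant documents, not just the first, and divides the sum by the number of relevant documents.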
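Both calculations can be sketched as below. This is an illustrative re-implementation, not the library code. The function names are hypothetical, and the second variant assumes that "the number of relevant documents" means the relevant documents actually found in the retrieved list.

```python
def reciprocal_rank(expected_ids: list[str], retrieved_ids: list[str]) -> float:
    """Default variant: 1 / rank of the first relevant document, or 0.0 if none."""
    expected = set(expected_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def granular_reciprocal_rank(expected_ids: list[str], retrieved_ids: list[str]) -> float:
    """Optional variant: mean reciprocal rank over every relevant document found.

    Assumption: the divisor counts relevant documents present in retrieved_ids.
    """
    expected = set(expected_ids)
    ranks = [rank for rank, doc_id in enumerate(retrieved_ids, start=1) if doc_id in expected]
    if not ranks:
        return 0.0
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def mean_reciprocal_rank(samples: list[tuple[list[str], list[str]]]) -> float:
    """Average the default reciprocal rank over all queries."""
    return sum(reciprocal_rank(e, r) for e, r in samples) / len(samples)


# Hypothetical queries: relevant doc at rank 1, at rank 2, and not retrieved.
queries = [
    (["a"], ["a", "b"]),
    (["a"], ["b", "a"]),
    (["a"], ["b", "c"]),
]
print(mean_reciprocal_rank(queries))                            # (1 + 1/2 + 0) / 3
print(granular_reciprocal_rank(["a", "c"], ["b", "a", "c"]))    # (1/2 + 1/3) / 2
```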

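Faithfulness

Faithfulness measures whether the answer is consistent with the information in the retrieved chunks.

llama_index computes the metric in the following steps.

The LLM is asked to answer YES or NO on whether the generated answer is supported by the contexts
Across the multiple chunks for one query, if at least one YES is returned, the final answer for that query is YES
YES counts as 1.0 and NO as 0.0, and the results are averaged over multiple queries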
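The aggregation in the steps above can be sketched without calling an LLM by starting from per-chunk verdict strings. This is a simplified illustration; the function names are hypothetical, and treating any verdict that starts with "YES" as a pass is an assumption of this sketch.

```python
def query_is_faithful(chunk_verdicts: list[str]) -> bool:
    """A query passes if at least one chunk-level verdict is YES."""
    # Assumption: a verdict counts as YES when it starts with "YES" (case-insensitive).
    return any(verdict.strip().upper().startswith("YES") for verdict in chunk_verdicts)


def faithfulness_score(per_query_verdicts: list[list[str]]) -> float:
    """Map each query to 1.0 (YES) or 0.0 (NO) and average over all queries."""
    if not per_query_verdicts:
        raise ValueError("per_query_verdicts must not be empty")
    passed = [1.0 if query_is_faithful(v) else 0.0 for v in per_query_verdicts]
    return sum(passed) / len(passed)


# Hypothetical verdicts for four queries, one string per retrieved chunk.
verdicts = [
    ["NO", "YES"],   # passes: one chunk supports the answer
    ["NO", "NO"],    # fails: no chunk supports the answer
    ["YES"],         # passes
    ["yes", "YES"],  # passes
]
print(faithfulness_score(verdicts))  # three of four queries pass
```

The "any chunk" rule makes the metric lenient: an answer passes as long as one retrieved chunk supports it, even if the other chunks are unrelated.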

The default llama_index prompt is shown below. If you build RAG in Japanese, it is better to rewrite the prompt in Japanese, including the few-shot examples.

DEFAULT_EVAL_TEMPLATE = PromptTemplate(
    "Please tell if a given piece of information "
    "is supported by the context.\n"
    "You need to answer with either YES or NO.\n"
    "Answer YES if any of the context supports the information, even "
    "if most of the context is unrelated. "
    "Some examples are provided below. \n\n"
    "Information: Apple pie is generally double-crusted.\n"
    "Context: An apple pie is a fruit pie in which the principal filling "
    "ingredient is apples. \n"
    "Apple pie is often served with whipped cream, ice cream "
    "('apple pie à la mode'), custard or cheddar cheese.\n"
    "It is generally double-crusted, with pastry both above "
    "and below the filling; the upper crust may be solid or "
    "latticed (woven of crosswise strips).\n"
    "Answer: YES\n"
    "Information: Apple pies tastes bad.\n"
    "Context: An apple pie is a fruit pie in which the principal filling "
    "ingredient is apples. \n"
    "Apple pie is often served with whipped cream, ice cream "
    "('apple pie à la mode'), custard or cheddar cheese.\n"
    "It is generally double-crusted, with pastry both above "
    "and below the filling; the upper crust may be solid or "
    "latticed (woven of crosswise strips).\n"
    "Answer: NO\n"
    "Information: {query_str}\n"
    "Context: {context_str}\n"
    "Answer: "
)

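Answer Relevancy

Answer Relevancy measures how relevant the answer is to the query.

The default llama_index prompt is shown below. Note that this template asks for a score from 0 to 2 and appears to correspond to AnswerRelevancyEvaluator, while the aggregation code later in this article uses RelevancyEvaluator, which is a separate class. Check which prompt your evaluator actually uses.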

DEFAULT_EVAL_TEMPLATE = PromptTemplate(
    "Your task is to evaluate if the response is relevant to the query.\n"
    "The evaluation should be performed in a step-by-step manner by answering the following questions:\n"
    "1. Does the provided response match the subject matter of the user's query?\n"
    "2. Does the provided response attempt to address the focus or perspective "
    "on the subject matter taken on by the user's query?\n"
    "Each question above is worth 1 point. Provide detailed feedback on response according to the criteria questions above  "
    "After your feedback provide a final result by strictly following this format: '[RESULT] followed by the integer number representing the total score assigned to the response'\n\n"
    "Query: \n {query}\n"
    "Response: \n {response}\n"
    "Feedback:"
)
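The prompt above asks the LLM to finish its feedback with "[RESULT]" followed by an integer from 0 to 2. One way to turn that feedback into a number is sketched below. The function name, the regular expression, and the normalization to the 0.0 to 1.0 range are assumptions of this sketch, not the llama_index parser.

```python
import re
from typing import Optional

MAX_SCORE = 2  # the prompt defines two questions worth 1 point each
RESULT_PATTERN = re.compile(r"\[RESULT\]\s*(\d+)")


def parse_relevancy_score(feedback: str) -> Optional[float]:
    """Extract the integer after "[RESULT]" and normalize it to the range 0.0-1.0.

    Returns None when the marker is missing or the score is out of range,
    so the caller can decide how to treat unparseable feedback.
    """
    match = RESULT_PATTERN.search(feedback)
    if match is None:
        return None
    score = int(match.group(1))
    if score > MAX_SCORE:
        return None
    return score / MAX_SCORE


# Hypothetical feedback text in the format the prompt requests.
feedback = (
    "1. The response matches the subject matter of the query.\n"
    "2. The response addresses the perspective of the query.\n"
    "[RESULT] 2"
)
print(parse_relevancy_score(feedback))
```

Returning None instead of 0.0 for unparseable feedback keeps formatting failures separate from genuinely irrelevant answers.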

Code to Aggregate the Metrics

import asyncio
import os

import nest_asyncio
import pandas as pd
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.evaluation import (
    BatchEvalRunner,
    EmbeddingQAFinetuneDataset,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

os.environ["OPENAI_API_KEY"] = "sk-proj-xxx"
nest_asyncio.apply()

async def get_metrics(retriever_evaluator, dataset) -> dict[str, float]:
    eval_results = await retriever_evaluator.aevaluate_dataset(dataset)

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    return {"Hit Rate": float(hit_rate), "MRR": float(mrr)}

def get_eval_results(key: str, eval_results: dict) -> float:
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    return score

async def main():
    PERSIST_DIR = "persist"
    MODEL_NAME = "gpt-4o-mini"
    QA_DATASET_PATH = "qa_dataset.json"

    documents = SimpleDirectoryReader("../data/paul_graham/").load_data()

    llm = OpenAI(model=MODEL_NAME)

    # Build index with a chunk_size of 512
    node_parser = SentenceSplitter.from_defaults(chunk_size=512)
    nodes = node_parser.get_nodes_from_documents(documents)

    # Ensure the node ids are same for each run
    for idx, node in enumerate(nodes):
        node.id_ = f"node_{idx}"

    if not os.path.exists(PERSIST_DIR):
        os.makedirs(PERSIST_DIR)
        vector_index = VectorStoreIndex(nodes)
        # persist the index
        vector_index.storage_context.persist(PERSIST_DIR)
    else:
        # rebuild storage context
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        # load index
        vector_index = load_index_from_storage(storage_context)

    if not os.path.exists(QA_DATASET_PATH):
        qa_dataset = generate_question_context_pairs(
            nodes,
            llm=llm,
            num_questions_per_chunk=2,
        )
        qa_dataset.save_json(QA_DATASET_PATH)
    else:
        qa_dataset = EmbeddingQAFinetuneDataset.from_json(QA_DATASET_PATH)

    retriever = vector_index.as_retriever(similarity_top_k=2)

    retriever_evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )

    metrics = await get_metrics(retriever_evaluator, qa_dataset)
    print(metrics)

    queries = list(qa_dataset.queries.values())
    query_engine = vector_index.as_query_engine()

    # Faithfulness & Relevancy Evaluator
    faithfulness = FaithfulnessEvaluator(llm=llm)
    relevancy = RelevancyEvaluator(llm=llm)

    batch_eval_queries = queries
    runner = BatchEvalRunner(
        {"faithfulness": faithfulness, "relevancy": relevancy},
    )
    eval_results = await runner.aevaluate_queries(
        query_engine, queries=batch_eval_queries
    )

    print("Batch Evaluation Results")
    faithfulness_score = get_eval_results("faithfulness", eval_results)
    relevancy_score = get_eval_results("relevancy", eval_results)
    print(f"Faithfulness Score: {faithfulness_score}")
    print(f"Relevancy Score: {relevancy_score}")

if __name__ == "__main__":
    asyncio.run(main())

Running the code above produces the following results.

{'Hit Rate': 0.7950819672131147, 'MRR': 0.6516393442622951}
Batch Evaluation Results
Faithfulness Score: 0.9098360655737705
Relevancy Score: 0.9590163934426229

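FaithfulnessEvaluator and RelevancyEvaluator each handle the evaluation of a single query, so passing them to BatchEvalRunner, which runs the evaluation as a batch over multiple queries, is all it takes to get a quantitative evaluation.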

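Automatically Generating the Evaluation Dataset

Here the evaluation dataset was generated automatically with generate_question_context_pairs, which llama_index provides. Judging from the implementation on GitHub, it simply has the LLM generate questions for each chunk with the prompt below and records that chunk's ID as the relevant chunk.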

DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
"""

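Because it is hard in practice to spend much on the dataset needed for quantitative evaluation ahead of production, having an LLM generate it is a quick and easy way to get started. One possible approach is to start with purely LLM-generated data and then, as you continuously collect user feedback, mix in some manually written QA pairs for evaluation.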
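The sketch below illustrates both ideas with plain dictionaries: building question-to-chunk pairs per chunk, then mixing in manually written pairs. It is a simplified model, not the llama_index implementation. The dictionary layout (`queries`, `corpus`, `relevant_docs`), all function names, and the stub question generator that stands in for the LLM call are assumptions.

```python
from typing import Callable


def build_qa_dataset(
    chunks: dict[str, str],
    generate_questions: Callable[[str, int], list[str]],
    num_questions_per_chunk: int = 2,
) -> dict:
    """Generate questions per chunk and record that chunk as the relevant one."""
    queries: dict[str, str] = {}
    relevant_docs: dict[str, list[str]] = {}
    for node_id, text in chunks.items():
        questions = generate_questions(text, num_questions_per_chunk)
        for i, question in enumerate(questions):
            query_id = f"{node_id}_q{i}"
            queries[query_id] = question
            relevant_docs[query_id] = [node_id]  # the source chunk is the ground truth
    return {"queries": queries, "corpus": dict(chunks), "relevant_docs": relevant_docs}


def add_manual_pairs(dataset: dict, manual_pairs: list[tuple[str, list[str]]]) -> dict:
    """Return a copy of the dataset with hand-written (question, node_ids) pairs added."""
    merged = {key: dict(value) for key, value in dataset.items()}
    for i, (question, node_ids) in enumerate(manual_pairs):
        unknown = [node_id for node_id in node_ids if node_id not in merged["corpus"]]
        if unknown:
            raise ValueError(f"unknown node ids: {unknown}")
        query_id = f"manual_{i}"
        merged["queries"][query_id] = question
        merged["relevant_docs"][query_id] = list(node_ids)
    return merged


def stub_generate_questions(text: str, n: int) -> list[str]:
    """Stand-in for the LLM call; a real pipeline would prompt a model here."""
    return [f"Question {i + 1} about: {text[:20]}" for i in range(n)]


chunks = {"node_0": "I wrote short stories.", "node_1": "I programmed an IBM 1401."}
generated = build_qa_dataset(chunks, stub_generate_questions)
combined = add_manual_pairs(generated, [("Which computer did the author use?", ["node_1"])])
print(len(generated["queries"]), len(combined["queries"]))
```

Manual pairs can point at more than one chunk, which the auto-generated pairs never do, so they also help test queries whose answers span several chunks.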

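Summary

Building document search systems and chatbots with RAG is now widespread. In production, however, proper quantitative evaluation and metric-driven improvement are essential. This article introduced Hit Rate, MRR, Faithfulness, and Relevancy as RAG evaluation metrics and provided an implementation example for evaluating them.

In a RAG project, it is important to build an evaluation pipeline like the one above as early as possible, so that you can verify accuracy improvements from different techniques.

We focus on developing solutions that use the latest AI technology, and we help solve data utilization problems and improve operational efficiency. If you are interested in RAG-based chatbots, please feel free to contact us.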
