RAGOpt provides metrics to evaluate retrieval quality, generation performance, and operational costs in your RAG pipeline.

Metric Categories

Retrieval Metrics

Document retrieval quality, relevance, and ranking

Generation Metrics

Response quality, safety, and alignment

Full Pipeline Metrics

Cost and latency across the entire pipeline

Quick Start

from rag_opt.eval.eval import RAGEvaluator
from rag_opt import init_chat_model, init_embeddings
from rag_opt.dataset import TrainDataset
# Import paths for init_vectorstore and RAGWorkflow may differ; adjust for your installation
from rag_opt import init_vectorstore, RAGWorkflow

# Setup RAG components
llm = init_chat_model(model="gpt-3.5-turbo", model_provider="openai", api_key=OPENAI_API_KEY)
embeddings = init_embeddings(model="all-MiniLM-L6-v2", model_provider="huggingface", api_key=HUGGINGFACE_API_KEY)
vector_store = init_vectorstore(provider="faiss", embeddings=embeddings)
rag = RAGWorkflow(
    embeddings=embeddings,
    vector_store=vector_store,
    llm=llm,
    retrieval_config={
        "search_type": "hybrid",
        "k": 3
    },
)

# Generate the evaluation dataset used by the evaluator
train_dataset = TrainDataset.from_json("./rag_dataset.json")
eval_dataset = rag.get_batch_answers(train_dataset)


# Setup evaluator
evaluator = RAGEvaluator(
    evaluator_llm=llm,
    evaluator_embedding=embeddings
)
results = evaluator.evaluate(eval_dataset, normalize=True, return_tensor=False)

# Print results
for metric_name, result in results.items():
    print(f"{metric_name}: {result.value}")

print(f"\nOverall Score: {evaluator.compute_objective_score(results)}")

How It Works

Each metric returns a MetricResult with:
  • name: Metric identifier
  • value: Aggregated score (0-1 scale)
  • category: RETRIEVAL, GENERATION, or FULL
  • metadata: Individual scores and details
  • error: Error message if the metric failed
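
These fields can be inspected directly on each result, for example to skip failed metrics or to group results by category. A minimal sketch, reusing the results dict from the Quick Start and assuming MetricCategory is importable from rag_opt.eval (as in the Custom Metrics example below; adjust the import path if it differs in your install):

from rag_opt.eval import MetricCategory  # import path assumed; adjust if needed

for metric_name, result in results.items():
    if result.error:
        # Skip metrics that failed and report why
        print(f"{metric_name} failed: {result.error}")
        continue
    print(f"{metric_name} ({result.category}): {result.value:.3f}")
    print(f"  details: {result.metadata}")  # per-item scores and extra details

# Keep only the retrieval-quality results
# (RETRIEVAL member assumed from the category values listed above)
retrieval_results = {
    name: r for name, r in results.items()
    if r.category == MetricCategory.RETRIEVAL
}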

LLM-Based vs Non-LLM Metrics

LLM-Based (SafetyMetric, AlignmentMetric, ContextPrecision)
  • Use an LLM judge for quality assessment
  • Require an llm parameter at initialization
  • Batch-process calls for efficiency
Non-LLM (CostMetric, LatencyMetric, MRR, NDCG)
  • Direct calculation without LLM calls
  • Faster and deterministic
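
The distinction shows up at construction time: LLM-based metrics take the judge LLM, non-LLM metrics do not. A minimal sketch, reusing llm and embeddings from the Quick Start and mirroring the constructors from the Selective Evaluation example below (the is_llm_based flag is assumed to be exposed on built-in metrics, as it is on BaseMetric subclasses):

from rag_opt.eval import ContextPrecision, MRR

# LLM-based: needs a judge LLM at construction time
context_precision = ContextPrecision(llm=llm)

# Non-LLM: computed directly from the dataset, no LLM calls
mrr = MRR(embedding_model=embeddings)

# is_llm_based assumed to be available on built-in metrics (see Custom Metrics below)
print(context_precision.is_llm_based, mrr.is_llm_based)  # expected: True False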

Custom Metrics

# Import paths for BaseMetric, MetricCategory, and EvaluationDataset may differ; adjust for your installation
from rag_opt.eval import BaseMetric, MetricCategory, EvaluationDataset

class CustomMetric(BaseMetric):
    is_llm_based: bool = False  # Set to True if this metric needs an LLM judge
    category: MetricCategory = MetricCategory.GENERATION
    name: str = "my_custom_metric"

    def __init__(self):
        super().__init__(
            negate=False,      # True if lower values are better
            worst_value=0.0
        )

    def _evaluate(self, dataset: EvaluationDataset, **kwargs) -> list[float]:
        """Your evaluation logic here: return one score per dataset item."""
        return [0.85 for _ in dataset.items]

evaluator.add_metric(CustomMetric(), weight=0.5)
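
Once registered, re-running the evaluator includes the custom metric in the results dict under its name (a minimal sketch, reusing the evaluator and eval_dataset from the Quick Start and assuming results are keyed by metric name, as in the print loops above):

results = evaluator.evaluate(eval_dataset, normalize=True, return_tensor=False)
print(results["my_custom_metric"].value)  # aggregated score produced by CustomMetric._evaluate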

Advanced Usage

Selective Evaluation

from rag_opt.eval import RAGEvaluator, ContextPrecision, MRR, NDCG, SafetyMetric, AlignmentMetric
from rag_opt import init_chat_model, init_embeddings


llm = init_chat_model(model="gpt-3.5-turbo", model_provider="openai", api_key=OPENAI_API_KEY)
embeddings = init_embeddings(model="all-MiniLM-L6-v2", model_provider="huggingface", api_key=HUGGINGFACE_API_KEY)

# Retrieval only
retrieval_evaluator = RAGEvaluator(
    metrics=[ContextPrecision(llm=llm), MRR(embedding_model=embeddings), NDCG(embedding_model=embeddings)]
)

# Generation only
generation_evaluator = RAGEvaluator(
    metrics=[SafetyMetric(llm), AlignmentMetric(llm)]
)

# Evaluate the same eval_dataset (generated in the Quick Start) with each evaluator
results = retrieval_evaluator.evaluate(eval_dataset, normalize=True, return_tensor=False)
results2 = generation_evaluator.evaluate(eval_dataset, normalize=True, return_tensor=False)

# Print results
for metric_name, result in results.items():
    print(f"{metric_name}: {result.value}")

for metric_name, result in results2.items():
    print(f"{metric_name}: {result.value}")

print(f"\nOverall Score: {evaluator.compute_objective_score(results)}")