Documentation Index
Fetch the complete documentation index at: https://ragopt.aboneda.com/llms.txt
Use this file to discover all available pages before exploring further.
RAGOpt provides metrics to evaluate retrieval quality, generation performance, and operational costs in your RAG pipeline.
Metric Categories
- Retrieval Metrics: document retrieval quality, relevance, and ranking
- Generation Metrics: response quality, safety, and alignment
- Full Pipeline Metrics: cost and latency across the entire pipeline
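The concrete metric classes used later on this page map onto these categories roughly as follows. This is a sketch only; the import path mirrors the Selective Evaluation example below and may differ in your version.

from rag_opt.eval import (
    ContextPrecision, MRR, NDCG,     # Retrieval: relevance and ranking quality
    SafetyMetric, AlignmentMetric,   # Generation: response quality, safety, alignment
    CostMetric, LatencyMetric,       # Full pipeline: cost and latency end to end
)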
Quick Start
from rag_opt.eval import RAGEvaluator
from rag_opt import init_chat_model, init_embeddings
from rag_opt.dataset import TrainDataset
# init_vectorstore and RAGWorkflow are assumed to live in the top-level
# package as well; adjust the import paths to your installation if needed
from rag_opt import init_vectorstore, RAGWorkflow

# Set up the RAG pipeline
llm = init_chat_model(model="gpt-3.5-turbo", model_provider="openai", api_key=OPENAI_API_KEY)
embeddings = init_embeddings(model="all-MiniLM-L6-v2", model_provider="huggingface", api_key=HUGGINGFACE_API_KEY)
vector_store = init_vectorstore(provider="faiss", embeddings=embeddings)

rag = RAGWorkflow(
    embeddings=embeddings,
    vector_store=vector_store,
    llm=llm,
    retrieval_config={
        "search_type": "hybrid",
        "k": 3
    },
)

# Generate the evaluation dataset (to be used by the evaluator)
train_dataset = TrainDataset.from_json("./rag_dataset.json")
eval_dataset = rag.get_batch_answers(train_dataset)

# Set up the evaluator
evaluator = RAGEvaluator(
    evaluator_llm=llm,
    evaluator_embedding=embeddings
)
results = evaluator.evaluate(eval_dataset, normalize=True, return_tensor=False)

# Print results
for metric_name, result in results.items():
    print(f"{metric_name}: {result.value}")

print(f"\nOverall Score: {evaluator.compute_objective_score(results)}")
How It Works
Metrics return a MetricResult with:
- name: Metric identifier
- value: Aggregated score (0-1 scale)
- category: RETRIEVAL, GENERATION, or FULL
- metadata: Individual scores and details
- error: Error message if failed
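As a minimal sketch, assuming results is the dictionary returned by evaluator.evaluate(...) in the Quick Start above, each field can be read directly (field names as listed; the truthiness check on error is an assumption):

for metric_name, result in results.items():
    if result.error:
        # the metric failed to produce a score
        print(f"{metric_name} failed: {result.error}")
        continue
    print(f"{metric_name} [{result.category}]: {result.value:.3f}")
    # metadata carries the individual per-item scores and details
    print(result.metadata)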
LLM-Based vs Non-LLM Metrics
LLM-Based (SafetyMetric, AlignmentMetric, ContextPrecision)
- Use an LLM judge for quality assessment
- Require an llm parameter at initialization
- Batch process for efficiency
Non-LLM (CostMetric, LatencyMetric, MRR, NDCG)
- Direct calculation without LLM calls
- Faster and deterministic
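The difference mainly shows up at construction time. A short sketch follows; the constructor arguments mirror the Selective Evaluation example below, and the no-argument CostMetric and LatencyMetric constructors are an assumption.

# LLM-based metrics need the judge LLM when they are created
safety = SafetyMetric(llm)
precision = ContextPrecision(llm=llm)

# Non-LLM metrics are computed directly, without any LLM calls
mrr = MRR(embedding_model=embeddings)
ndcg = NDCG(embedding_model=embeddings)
cost = CostMetric()        # assumed no-arg constructor
latency = LatencyMetric()  # assumed no-arg constructor

evaluator = RAGEvaluator(
    metrics=[safety, precision, mrr, ndcg, cost, latency]
)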
Custom Metrics
# BaseMetric, MetricCategory, and EvaluationDataset are assumed to be
# importable from rag_opt.eval; adjust the path to your installation
from rag_opt.eval import BaseMetric, MetricCategory, EvaluationDataset

class CustomMetric(BaseMetric):
    is_llm_based: bool = False  # set to True if this metric requires an LLM
    category: MetricCategory = MetricCategory.GENERATION
    name: str = "my_custom_metric"

    def __init__(self):
        super().__init__(
            negate=False,     # True if lower is better
            worst_value=0.0
        )

    def _evaluate(self, dataset: EvaluationDataset, **kwargs) -> list[float]:
        """Your evaluation logic here: return one score per dataset item."""
        return [0.85 for _ in dataset.items]

evaluator.add_metric(CustomMetric(), weight=0.5)
Advanced Usage
Selective Evaluation
from rag_opt.eval import RAGEvaluator, ContextPrecision, MRR, NDCG, SafetyMetric, AlignmentMetric
from rag_opt import init_chat_model, init_embeddings

llm = init_chat_model(model="gpt-3.5-turbo", model_provider="openai", api_key=OPENAI_API_KEY)
embeddings = init_embeddings(model="all-MiniLM-L6-v2", model_provider="huggingface", api_key=HUGGINGFACE_API_KEY)

# Retrieval only
retrieval_evaluator = RAGEvaluator(
    metrics=[ContextPrecision(llm=llm), MRR(embedding_model=embeddings), NDCG(embedding_model=embeddings)]
)

# Generation only
generation_evaluator = RAGEvaluator(
    metrics=[SafetyMetric(llm), AlignmentMetric(llm)]
)

retrieval_results = retrieval_evaluator.evaluate(eval_dataset, normalize=True, return_tensor=False)
generation_results = generation_evaluator.evaluate(eval_dataset, normalize=True, return_tensor=False)

# Print results
for metric_name, result in retrieval_results.items():
    print(f"{metric_name}: {result.value}")
for metric_name, result in generation_results.items():
    print(f"{metric_name}: {result.value}")

print(f"\nOverall Score: {retrieval_evaluator.compute_objective_score(retrieval_results)}")