Retrieval metrics evaluate how well your embedding model, vector store, and reranker find relevant information for your LLM.
Quick Reference
| Metric | Measures | LLM Required | Best For |
|---|---|---|---|
| ContextPrecision | Relevant docs / Total retrieved | ✅ | Reducing noise |
| ContextRecall | Found info / All relevant info | ✅ | Avoiding gaps |
| MRR | Position of first relevant result | ⚠️ Optional | Reranker quality |
| NDCG | Overall ranking quality | ❌ | Full ranking optimization |
1. ContextPrecision
“How many retrieved documents are actually relevant?”
from rag_opt.eval import RAGEvaluator, ContextPrecision
from rag_opt import init_chat_model
# Assumes OPENAI_API_KEY is defined, e.g. loaded from your environment
llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)
metric = ContextPrecision(llm=llm, limit_contexts=5)
evaluator = RAGEvaluator(metrics=[metric])
eval_dataset = ...  # evaluation dataset from your RAG pipeline
results = evaluator.evaluate(eval_dataset, return_tensor=False)
Score Guide: 0.8+ = Excellent | 0.6-0.8 = Good | 0.4-0.6 = Fair | < 0.4 = Poor
Options:
- limit_contexts: Evaluate only the top N contexts (default: all)
- prompt: Custom relevance judgment prompt
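The score is LLM-judged, but the arithmetic behind it follows the "relevant docs / total retrieved" idea from the quick-reference table. A minimal hand computation, with hypothetical relevance labels standing in for the LLM's judgments:

```python
# Hypothetical relevance judgments for 5 retrieved contexts (1 = relevant, 0 = not).
# In ContextPrecision these judgments come from the LLM; they are hard-coded here
# purely to show the arithmetic.
relevance = [1, 0, 1, 1, 0]

precision = sum(relevance) / len(relevance)  # 3 relevant out of 5 retrieved
print(f"context precision: {precision:.2f}")  # 0.60
```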
2. ContextRecall
“Did I retrieve ALL the information needed?”
from rag_opt.eval import RAGEvaluator, ContextRecall
from rag_opt import init_chat_model
# Assumes OPENAI_API_KEY is defined, e.g. loaded from your environment
llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)
metric = ContextRecall(llm=llm)
evaluator = RAGEvaluator(metrics=[metric])
eval_dataset = ...  # rag.get_batch_answers(train_dataset)
results = evaluator.evaluate(eval_dataset, return_tensor=False)
Score Guide: 0.8+ = Excellent | 0.6-0.8 = Good | 0.4-0.6 = Fair | < 0.4 = Poor
⚠️ Warning: Low recall means required information is missing from the retrieved contexts, which leads to hallucinations and incomplete answers.
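Recall asks the opposite question of precision: of everything the answer needed, how much did retrieval actually surface? A toy computation with hypothetical ground-truth facts (in ContextRecall the LLM decides which facts are covered by the retrieved contexts):

```python
# Hypothetical facts the reference answer requires, and which of them the judge
# would find supported by the retrieved contexts (hard-coded here for illustration).
required_facts = ["release date", "author", "license", "supported platforms"]
facts_found = {"release date", "author", "license"}

recall = len(facts_found) / len(required_facts)
print(f"context recall: {recall:.2f}")  # 3 of 4 required facts retrieved -> 0.75
```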
3. MRR (Mean Reciprocal Rank)
“How quickly do I find the first relevant result?”
from rag_opt.eval import RAGEvaluator, MRR
from rag_opt import init_embeddings
# An embedding model is required for MRR
# Assumes HUGGINGFACE_API_KEY is defined, e.g. loaded from your environment
embeddings = init_embeddings(model="all-MiniLM-L6-v2",
                             model_provider="huggingface",
                             api_key=HUGGINGFACE_API_KEY)
metric = MRR(embedding_model=embeddings)
evaluator = RAGEvaluator(metrics=[metric])
eval_dataset = ...  # rag.get_batch_answers(train_dataset)
results = evaluator.evaluate(eval_dataset, return_tensor=False)
Score Guide: 0.8+ = Top result | 0.5-0.8 = Top 2-3 | 0.3-0.5 = Position 3-5 | < 0.3 = Buried deep
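Only the position of the first relevant context matters for MRR. A hand computation over three hypothetical queries, with relevance assumed to be already decided (in the snippet above, rag_opt derives it from embedding similarity):

```python
# Rank (1-based) of the first relevant context per query; None = nothing relevant retrieved.
first_relevant_rank = [1, 3, None]

reciprocal_ranks = [1.0 / rank if rank is not None else 0.0 for rank in first_relevant_rank]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(f"MRR: {mrr:.3f}")  # (1 + 1/3 + 0) / 3 ≈ 0.444
```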
4. NDCG (Normalized Discounted Cumulative Gain)
“How close is my ranking to ideal?”
from rag_opt.eval import RAGEvaluator, NDCG
from rag_opt import init_embeddings
# An embedding model is required for NDCG
# Assumes HUGGINGFACE_API_KEY is defined, e.g. loaded from your environment
embeddings = init_embeddings(model="all-MiniLM-L6-v2",
                             model_provider="huggingface",
                             api_key=HUGGINGFACE_API_KEY)
metric = NDCG(embedding_model=embeddings)
evaluator = RAGEvaluator(metrics=[metric])
eval_dataset = ...  # rag.get_batch_answers(train_dataset)
results = evaluator.evaluate(eval_dataset, return_tensor=False)
Score Guide: 0.8+ = Near ideal | 0.6-0.8 = Good | 0.4-0.6 = Fair | < 0.4 = Poor
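NDCG divides the discounted gain of the actual ranking by that of the ideal ordering of the same contexts, so 1.0 means the retriever already ranked them perfectly. A minimal sketch with hypothetical graded relevance scores (rag_opt derives relevance from embedding similarity; the gain formula here is the common linear variant):

```python
import math

# Hypothetical graded relevance of the retrieved contexts, in retrieval order.
relevance = [3, 0, 2, 1]

def dcg(scores):
    # Discounted cumulative gain: later positions are discounted by log2(position + 1).
    return sum(s / math.log2(i + 2) for i, s in enumerate(scores))

ndcg = dcg(relevance) / dcg(sorted(relevance, reverse=True))
print(f"NDCG: {ndcg:.3f}")  # ≈ 0.930: good, but not the ideal ordering
```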
MRR vs NDCG:
- Use MRR when users look at only the first result
- Use NDCG when overall ranking quality matters
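The practical difference shows up when two rankings both put a relevant context first but differ further down. In the toy comparison below, with hypothetical binary relevance labels, MRR scores both rankings 1.0 while NDCG penalizes the one that buries the second relevant context:

```python
import math

def dcg(scores):
    return sum(s / math.log2(i + 2) for i, s in enumerate(scores))

def ndcg(scores):
    return dcg(scores) / dcg(sorted(scores, reverse=True))

ranking_a = [1, 1, 0]  # both relevant contexts near the top
ranking_b = [1, 0, 1]  # second relevant context pushed to the bottom

print(f"NDCG A: {ndcg(ranking_a):.2f}")  # 1.00, already the ideal order
print(f"NDCG B: {ndcg(ranking_b):.2f}")  # ≈ 0.92, despite an identical MRR
```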
Evaluate All Together
from rag_opt.eval import RAGEvaluator, ContextPrecision, ContextRecall, MRR, NDCG
from rag_opt import init_chat_model, init_embeddings
# Assumes OPENAI_API_KEY and HUGGINGFACE_API_KEY are defined, e.g. loaded from your environment
llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)
embeddings = init_embeddings(model="all-MiniLM-L6-v2",
                             model_provider="huggingface",
                             api_key=HUGGINGFACE_API_KEY)
metrics = [
    ContextPrecision(llm=llm, limit_contexts=5),
    ContextRecall(llm=llm),
    MRR(embedding_model=embeddings),
    NDCG(embedding_model=embeddings),
]
evaluator = RAGEvaluator(metrics=metrics)
eval_dataset = ...  # rag.get_batch_answers(train_dataset)
results = evaluator.evaluate(eval_dataset, return_tensor=False)
for name, value in results.items():
    print(f"{name}: {value:.3f}")