Full pipeline metrics evaluate the operational side of your RAG system: cost and latency across the entire workflow.
Note: These metrics are non-LLM based and are calculated directly from measured values.
Quick Reference
| Metric | Measures | Best For | Negate |
|---|---|---|---|
| CostMetric | Total $ per query | Budget monitoring | ✅ Minimize |
| LatencyMetric | Total seconds per query | UX optimization | ✅ Minimize |
CostMetric
Tracks total cost per query across all RAG components.
```python
from rag_opt.eval import RAGEvaluator, CostMetric
from rag_opt import init_chat_model

llm = init_chat_model(
    model="gpt-3.5-turbo",
    model_provider="openai",
    api_key=OPENAI_API_KEY,
)

# NOTE: the dataset should be generated by a RAG workflow whose components
# are loaded from a config YAML file, so each item carries per-component cost information.
cost_metric = CostMetric(llm)
evaluator = RAGEvaluator(metrics=[cost_metric])
results = evaluator.evaluate(dataset)

# Mean cost
print(f"Average: ${results[0].value}")

# Per-query costs
costs = results[0].metadata["scores"]
print(f"Total: ${sum(costs)}")
print(f"Range: ${min(costs)} - ${max(costs)}")
```
Cost breakdown per query:

- `item.cost.embedding` - Embedding API cost
- `item.cost.vectorstore` - Vector search cost
- `item.cost.reranker` - Reranking cost
- `item.cost.llm` - LLM generation cost
- `item.cost.total` - Sum of all costs
Default worst value: $0.20 per query
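To see where the money goes, you can drill into a single item's cost breakdown. This is a minimal sketch, assuming the dataset is indexable and its items expose the `cost` fields listed above; adjust the attribute access to match your dataset class.

```python
# Hypothetical per-component cost inspection for the first dataset item.
# Assumes items expose the cost fields documented above.
item = dataset[0]

breakdown = {
    "embedding": item.cost.embedding,
    "vectorstore": item.cost.vectorstore,
    "reranker": item.cost.reranker,
    "llm": item.cost.llm,
}
for component, cost in breakdown.items():
    share = cost / item.cost.total if item.cost.total else 0.0
    print(f"{component:12s} ${cost:.4f} ({share:.0%} of total)")
```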
LatencyMetric
Measures total time from query to response.
```python
from rag_opt.eval import RAGEvaluator, LatencyMetric
from rag_opt import init_chat_model

llm = init_chat_model(
    model="gpt-3.5-turbo",
    model_provider="openai",
    api_key=OPENAI_API_KEY,
)

# rag = RAGWorkflow(...) built elsewhere; get_batch_answers runs the workflow
# over the training queries and produces an evaluation dataset with timing per item.
train_dataset = TrainDataset.from_json("./rag_dataset.json")
eval_dataset = rag.get_batch_answers(train_dataset)

latency_metric = LatencyMetric(llm)
evaluator = RAGEvaluator(metrics=[latency_metric])
results = evaluator.evaluate(eval_dataset)

# Mean latency
print(f"Average: {results[0].value:.2f}s")

# Percentiles
latencies = sorted(results[0].metadata["scores"])
p95 = latencies[int(len(latencies) * 0.95)]
p99 = latencies[int(len(latencies) * 0.99)]
print(f"P95: {p95:.2f}s | P99: {p99:.2f}s")
```
Latency breakdown per query:

- `item.latency.embedding` - Embedding time
- `item.latency.retrieval` - Vector search time
- `item.latency.reranking` - Reranking time
- `item.latency.generation` - LLM generation time
- `item.latency.total` - Sum of all times
Default worst value: 7.0 seconds per query
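To find the dominant stage, break a single item's latency down by component. A minimal sketch, assuming items of the evaluated dataset expose the `latency` fields listed above:

```python
# Hypothetical per-stage latency breakdown for one evaluated item.
# Assumes items expose the latency fields documented above.
item = eval_dataset[0]

stages = {
    "embedding": item.latency.embedding,
    "retrieval": item.latency.retrieval,
    "reranking": item.latency.reranking,
    "generation": item.latency.generation,
}
slowest = max(stages, key=stages.get)
for stage, seconds in stages.items():
    print(f"{stage:10s} {seconds:.3f}s")
print(f"Slowest: {slowest} ({stages[slowest]:.3f}s of {item.latency.total:.3f}s total)")
```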
Custom Worst Values
Override defaults based on your constraints:
```python
from rag_opt.eval import CostMetric, LatencyMetric

class CustomCostMetric(CostMetric):
    @property
    def worst_value(self):
        return 0.5  # $0.50 worst case

class CustomLatencyMetric(LatencyMetric):
    @property
    def worst_value(self):
        return 3.0  # 3s worst case for a real-time app
```
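The subclasses are drop-in replacements for the built-in metrics. A minimal usage sketch, reusing the `llm` and `eval_dataset` setup from the examples above and assuming results are returned in the same order as the metrics list:

```python
# Evaluate with the tighter worst-case bounds defined above.
evaluator = RAGEvaluator(metrics=[CustomCostMetric(llm), CustomLatencyMetric(llm)])
results = evaluator.evaluate(eval_dataset)

print(f"Cost: ${results[0].value}")         # scored against the $0.50 worst case
print(f"Latency: {results[1].value:.2f}s")  # scored against the 3.0s worst case
```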