Full pipeline metrics evaluate operational aspects of your RAG system—cost and performance across the entire workflow.
Note: These metrics are non-LLM based and are computed directly from measured values.

Quick Reference

| Metric        | Measures                 | Best For          | Negate      |
|---------------|--------------------------|-------------------|-------------|
| CostMetric    | Total $ per query        | Budget monitoring | ✅ Minimize |
| LatencyMetric | Total seconds per query  | UX optimization   | ✅ Minimize |
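
Both metrics can be scored in a single evaluation pass. The snippet below is a minimal sketch that reuses the RAGEvaluator and init_chat_model APIs shown in the sections that follow, assumes an eval_dataset built as in those examples, and assumes results come back in the same order as the metrics list.

from rag_opt.eval import RAGEvaluator, CostMetric, LatencyMetric
from rag_opt import init_chat_model

llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)

# Evaluate cost and latency together; both are "minimize" metrics
evaluator = RAGEvaluator(metrics=[CostMetric(llm), LatencyMetric(llm)])
results = evaluator.evaluate(eval_dataset)  # eval_dataset built as in the examples below

cost_result, latency_result = results  # assumes result order matches the metrics list
print(f"Average cost: ${cost_result.value}")
print(f"Average latency: {latency_result.value:.2f}s")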

CostMetric

Tracks total cost per query across all RAG components.
from rag_opt.eval import RAGEvaluator, CostMetric
from rag_opt import init_chat_model

llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)

# NOTE: `dataset` should be generated by a RAG workflow whose components are
# loaded from a config YAML file that includes per-item cost information
cost_metric = CostMetric(llm)
evaluator = RAGEvaluator(metrics=[cost_metric])
results = evaluator.evaluate(dataset)

# Mean cost
print(f"Average: ${results[0].value}")

# Per-query costs
costs = results[0].metadata["scores"]
print(f"Total: ${sum(costs)}")
print(f"Range: ${min(costs)} - ${max(costs)}")
Cost breakdown per query (see the sketch after this list):
  • item.cost.embedding - Embedding API cost
  • item.cost.vectorstore - Vector search cost
  • item.cost.reranker - Reranking cost
  • item.cost.llm - LLM generation cost
  • item.cost.total - Sum of all costs
Default worst value: $0.20 per query
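
To see which component dominates spend, the per-item breakdown can be aggregated directly. This is a minimal sketch that assumes the evaluation dataset is iterable and that each item exposes the cost fields listed above.

# Aggregate component costs across the evaluation dataset
component_totals = {"embedding": 0.0, "vectorstore": 0.0, "reranker": 0.0, "llm": 0.0}
for item in dataset:  # assumes the dataset yields items with a `cost` breakdown
    component_totals["embedding"] += item.cost.embedding
    component_totals["vectorstore"] += item.cost.vectorstore
    component_totals["reranker"] += item.cost.reranker
    component_totals["llm"] += item.cost.llm

# Print components from most to least expensive
for component, total in sorted(component_totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{component}: ${total:.4f}")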

LatencyMetric

Measures total time from query to response.

from rag_opt.eval import RAGEvaluator, LatencyMetric
from rag_opt import init_chat_model

llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)

# Generate answers for the training queries with your RAG workflow
train_dataset = TrainDataset.from_json("./rag_dataset.json")
eval_dataset = rag.get_batch_answers(train_dataset)  # rag = RAGWorkflow(...)

latency_metric = LatencyMetric(llm)
evaluator = RAGEvaluator(metrics=[latency_metric])
results = evaluator.evaluate(eval_dataset)

# Mean latency
print(f"Average: {results[0].value:.2f}s")

# Percentiles
latencies = sorted(results[0].metadata["scores"])
p95 = latencies[int(len(latencies) * 0.95)]
p99 = latencies[int(len(latencies) * 0.99)]
print(f"P95: {p95:.2f}s | P99: {p99:.2f}s")
Latency breakdown per query (see the sketch after this list):
  • item.latency.embedding - Embedding time
  • item.latency.retrieval - Vector search time
  • item.latency.reranking - Reranking time
  • item.latency.generation - LLM generation time
  • item.latency.total - Sum of all times
Default worst value: 7.0 seconds per query
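
To locate the bottleneck stage, the per-item latency breakdown can be averaged per component. This is a minimal sketch that assumes eval_dataset is non-empty and iterable, with each item exposing the latency fields listed above.

# Average latency per pipeline stage across the evaluation dataset
stages = ["embedding", "retrieval", "reranking", "generation"]
stage_totals = {stage: 0.0 for stage in stages}
n_items = 0
for item in eval_dataset:  # assumes items carry a per-stage `latency` breakdown
    for stage in stages:
        stage_totals[stage] += getattr(item.latency, stage)
    n_items += 1

for stage in stages:
    print(f"{stage}: {stage_totals[stage] / n_items:.3f}s avg")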

Custom Worst Values

Override defaults based on your constraints:
from rag_opt.eval import CostMetric, LatencyMetric

class CustomCostMetric(CostMetric):
    @property
    def worst_value(self):
        return 0.5  # $0.50 worst case

class CustomLatencyMetric(LatencyMetric):
    @property
    def worst_value(self):
        return 3.0  # 3s worst case for a real-time app
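
The custom subclasses drop into the evaluator the same way as the built-ins. The snippet below is a sketch that reuses the llm and eval_dataset objects from the earlier examples and assumes results are returned in the same order as the metrics list.

# Score the pipeline using the tighter worst-case bounds
evaluator = RAGEvaluator(metrics=[CustomCostMetric(llm), CustomLatencyMetric(llm)])
results = evaluator.evaluate(eval_dataset)
print(f"Cost: ${results[0].value} | Latency: {results[1].value:.2f}s")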