Full pipeline metrics evaluate the operational side of your RAG system: cost and latency across the entire workflow.
Note: These metrics are non-LLM based and are calculated directly from measured values.
Quick Reference
| Metric | Measures | Best For | Negate |
|---|---|---|---|
| CostMetric | Total $ per query | Budget monitoring | ✅ Minimize |
| LatencyMetric | Total seconds per query | UX optimization | ✅ Minimize |
CostMetric
Tracks total cost per query across all RAG components.
```python
from rag_opt.eval import RAGEvaluator, CostMetric
from rag_opt import init_chat_model

llm = init_chat_model(
    model="gpt-3.5-turbo",
    model_provider="openai",
    api_key=OPENAI_API_KEY,
)

# NOTE: the dataset should be generated by a RAG workflow whose components
# are loaded from a config YAML file, so each item carries per-component cost information.
cost_metric = CostMetric(llm)
evaluator = RAGEvaluator(metrics=[cost_metric])
results = evaluator.evaluate(dataset)

# Mean cost
print(f"Average: ${results[0].value}")

# Per-query costs
costs = results[0].metadata["scores"]
print(f"Total: ${sum(costs)}")
print(f"Range: ${min(costs)} - ${max(costs)}")
```
Cost breakdown per query:

- `item.cost.embedding` - Embedding API cost
- `item.cost.vectorstore` - Vector search cost
- `item.cost.reranker` - Reranking cost
- `item.cost.llm` - LLM generation cost
- `item.cost.total` - Sum of all costs
Default worst value: $0.20 per query
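To see where the money goes, you can drill into a single item's cost breakdown. This is a minimal sketch, assuming the dataset is indexable and its items expose the `cost` fields listed above; adjust the attribute access to match your dataset class.

```python
# Hypothetical per-component cost inspection for the first dataset item.
# Assumes items expose the cost fields documented above.
item = dataset[0]

breakdown = {
    "embedding": item.cost.embedding,
    "vectorstore": item.cost.vectorstore,
    "reranker": item.cost.reranker,
    "llm": item.cost.llm,
}
for component, cost in breakdown.items():
    share = cost / item.cost.total if item.cost.total else 0.0
    print(f"{component:12s} ${cost:.4f} ({share:.0%} of total)")
```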
LatencyMetric
Measures total time from query to response.
```python
from rag_opt.eval import RAGEvaluator, LatencyMetric
from rag_opt import init_chat_model

llm = init_chat_model(
    model="gpt-3.5-turbo",
    model_provider="openai",
    api_key=OPENAI_API_KEY,
)

# rag = RAGWorkflow(...) built elsewhere; get_batch_answers runs the workflow
# over the training queries and produces an evaluation dataset with timing per item.
train_dataset = TrainDataset.from_json("./rag_dataset.json")
eval_dataset = rag.get_batch_answers(train_dataset)

latency_metric = LatencyMetric(llm)
evaluator = RAGEvaluator(metrics=[latency_metric])
results = evaluator.evaluate(eval_dataset)

# Mean latency
print(f"Average: {results[0].value:.2f}s")

# Percentiles
latencies = sorted(results[0].metadata["scores"])
p95 = latencies[int(len(latencies) * 0.95)]
p99 = latencies[int(len(latencies) * 0.99)]
print(f"P95: {p95:.2f}s | P99: {p99:.2f}s")
```
Latency breakdown per query:

- `item.latency.embedding` - Embedding time
- `item.latency.retrieval` - Vector search time
- `item.latency.reranking` - Reranking time
- `item.latency.generation` - LLM generation time
- `item.latency.total` - Sum of all times
Default worst value: 7.0 seconds per query
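To find the dominant stage, break a single item's latency down by component. A minimal sketch, assuming items of the evaluated dataset expose the `latency` fields listed above:

```python
# Hypothetical per-stage latency breakdown for one evaluated item.
# Assumes items expose the latency fields documented above.
item = eval_dataset[0]

stages = {
    "embedding": item.latency.embedding,
    "retrieval": item.latency.retrieval,
    "reranking": item.latency.reranking,
    "generation": item.latency.generation,
}
slowest = max(stages, key=stages.get)
for stage, seconds in stages.items():
    print(f"{stage:10s} {seconds:.3f}s")
print(f"Slowest: {slowest} ({stages[slowest]:.3f}s of {item.latency.total:.3f}s total)")
```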
Custom Worst Values
Override defaults based on your constraints:
```python
from rag_opt.eval import CostMetric, LatencyMetric

class CustomCostMetric(CostMetric):
    @property
    def worst_value(self):
        return 0.5  # $0.50 worst case

class CustomLatencyMetric(LatencyMetric):
    @property
    def worst_value(self):
        return 3.0  # 3s worst case for a real-time app
```
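The subclasses are drop-in replacements for the built-in metrics. A minimal usage sketch, reusing the `llm` and `eval_dataset` setup from the examples above and assuming results are returned in the same order as the metrics list:

```python
# Evaluate with the tighter worst-case bounds defined above.
evaluator = RAGEvaluator(metrics=[CustomCostMetric(llm), CustomLatencyMetric(llm)])
results = evaluator.evaluate(eval_dataset)

print(f"Cost: ${results[0].value}")         # scored against the $0.50 worst case
print(f"Latency: {results[1].value:.2f}s")  # scored against the 3.0s worst case
```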