Generation metrics evaluate the quality of your LLM’s responses using an LLM-as-a-judge approach.

Quick Reference

Metric              Measures                Key Question                    Score Range
SafetyMetric        Factual grounding       Is it true?                     0-1
AlignmentMetric     Usefulness & clarity    Is it helpful?                  0-1
ResponseRelevancy   Query-answer match      Does it answer the question?    0-1

SafetyMetric (Faithfulness)

Measures: Whether responses are grounded in retrieved contexts, preventing hallucinations.

Basic Usage

from rag_opt.eval import SafetyMetric, RAGEvaluator
from rag_opt import init_chat_model

# OPENAI_API_KEY is assumed to be defined elsewhere (e.g. loaded from the environment)
llm = init_chat_model(
    model="gpt-3.5-turbo",
    model_provider="openai",
    api_key=OPENAI_API_KEY,
)

metric = SafetyMetric(llm=llm)
evaluator = RAGEvaluator(metrics=[metric])
eval_dataset = ...  # e.g. rag.get_batch_answers(train_dataset)
results = evaluator.evaluate(eval_dataset)

print(f"Safety Score: {results[0].value:.3f}")

Configuration

from rag_opt.dataset import EvaluationDataset, EvaluationDatasetItem, GroundTruth
from rag_opt.eval import SafetyMetric, RAGEvaluator
from rag_opt import init_chat_model

llm = init_chat_model(
    model="gpt-3.5-turbo",
    model_provider="openai",
    api_key=OPENAI_API_KEY,
)

# Custom prompt (must include {contexts}, {question}, {answer})
custom_prompt = """
Verify if this answer is fully supported by the contexts.

Contexts: {contexts}
Question: {question}
Answer: {answer}

Rate faithfulness (0-100):
- 100: Completely grounded
- 50: Partially grounded
- 0: Fabricated or contradicts contexts

Return only a number.
"""
metric = SafetyMetric(llm=llm, prompt=custom_prompt)
evaluator = RAGEvaluator(metrics=[metric])
eval_dataset = EvaluationDataset(
    items=[
        EvaluationDatasetItem(
            question="what is my name",
            answer="John",
            contexts=["My name is John"],
            ground_truth=GroundTruth(answer="John", contexts=["My name is John"]),
        )
    ]
)
results = evaluator.evaluate(eval_dataset, return_tensor=False)
for _, result in results.items():
    print(f"Score: {result.value:.3f}")

AlignmentMetric (Helpfulness)

Measures: Whether responses are useful, detailed, and clear.

Basic Usage

from rag_opt.eval import AlignmentMetric, RAGEvaluator

llm = ...           # chat model, as in the examples above
eval_dataset = ...  # EvaluationDataset, as in the examples above
metric = AlignmentMetric(llm=llm)
evaluator = RAGEvaluator(metrics=[metric])
results = evaluator.evaluate(eval_dataset, return_tensor=False)

print(f"results: {results}")

ResponseRelevancy

Measures: Whether the response actually answers the question asked.

Basic Usage

from rag_opt.eval import ResponseRelevancy, RAGEvaluator

llm = ...           # chat model, as in the examples above
eval_dataset = ...  # EvaluationDataset, as in the examples above
metric = ResponseRelevancy(llm=llm)
evaluator = RAGEvaluator(metrics=[metric])
results = evaluator.evaluate(eval_dataset, return_tensor=False)

print(f"results: {results}")

When to Use Each Metric

1. SafetyMetric
  • Factual accuracy is critical
  • Compliance/legal requirements apply
  • Working in high-stakes domains
2. AlignmentMetric
  • User experience is the priority
  • Optimizing for helpfulness
  • Balancing detail vs. brevity
3. ResponseRelevancy
  • Ensuring on-topic responses
  • Building Q&A systems
  • Quality assurance for chatbots
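
These concerns are rarely exclusive. Since RAGEvaluator takes a list of metrics, running all three in a single pass is a natural pattern. A minimal sketch, assuming the same llm and eval_dataset as in the examples above, and assuming the keys of the results dict identify each metric:

from rag_opt.eval import SafetyMetric, AlignmentMetric, ResponseRelevancy, RAGEvaluator

evaluator = RAGEvaluator(
    metrics=[
        SafetyMetric(llm=llm),       # factual grounding
        AlignmentMetric(llm=llm),    # usefulness & clarity
        ResponseRelevancy(llm=llm),  # query-answer match
    ]
)
results = evaluator.evaluate(eval_dataset, return_tensor=False)
for name, result in results.items():
    print(f"{name}: {result.value:.3f}")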