Generation Metrics
Generation metrics evaluate the quality of your LLM's responses using an LLM-as-a-judge approach.
Quick Reference
| Metric | Measures | Key Question | Score Range |
|---|---|---|---|
| SafetyMetric | Factual grounding | Is it true? | 0-1 |
| AlignmentMetric | Usefulness & clarity | Is it helpful? | 0-1 |
| ResponseRelevancy | Query-answer match | Does it answer the question? | 0-1 |
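All three metrics share the same evaluator workflow, so they can be scored in a single pass. A minimal sketch, assuming the per-metric APIs shown in the sections below and that evaluate returns one result per metric:

from rag_opt.eval import SafetyMetric, AlignmentMetric, ResponseRelevancy, RAGEvaluator
from rag_opt import init_chat_model

llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)
# One evaluator can run several metrics over the same dataset
evaluator = RAGEvaluator(metrics=[
    SafetyMetric(llm=llm),
    AlignmentMetric(llm=llm),
    ResponseRelevancy(llm=llm),
])
results = evaluator.evaluate(eval_dataset, return_tensor=False)  # eval_dataset built as shown below
for name, result in results.items():  # assumes keys identify each metric
    print(f"{name}: {result.value:.3f}")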
SafetyMetric (Faithfulness)
Measures: Whether responses are grounded in retrieved contexts, preventing hallucinations.
Basic Usage
from rag_opt.eval import SafetyMetric, RAGEvaluator
from rag_opt import init_chat_model

llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)  # your OpenAI API key

metric = SafetyMetric(llm=llm)
evaluator = RAGEvaluator(metrics=[metric])

eval_dataset = ...  # rag.get_batch_answers(train_dataset)
results = evaluator.evaluate(eval_dataset)
print(f"Safety Score: {results[0].value:.3f}")
Configuration
from rag_opt.dataset import EvaluationDataset, EvaluationDatasetItem, GroundTruth
from rag_opt.eval import SafetyMetric, RAGEvaluator
from rag_opt import init_chat_model

llm = init_chat_model(model="gpt-3.5-turbo",
                      model_provider="openai",
                      api_key=OPENAI_API_KEY)
# Custom prompt (must include {contexts}, {question}, {answer})
custom_prompt = """
Verify if this answer is fully supported by the contexts.
Contexts: {contexts}
Question: {question}
Answer: {answer}
Rate faithfulness (0-100):
- 100: Completely grounded
- 50: Partially grounded
- 0: Fabricated or contradicts contexts
Return only a number.
"""
metric = SafetyMetric(llm=llm, prompt=custom_prompt)
evaluator = RAGEvaluator(metrics=[metric])
eval_dataset = EvaluationDataset(
    items=[
        EvaluationDatasetItem(
            question="what is my name",
            answer="John",
            contexts=["My name is John"],
            ground_truth=GroundTruth(answer="John", contexts=["My name is John"]),
        )
    ]
)
results = evaluator.evaluate(eval_dataset, return_tensor=False)
for _, result in results.items():
    print(f"Score: {result.value:.3f}")
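Because scores are normalized to the 0-1 range, they are easy to gate on. A minimal sketch of a pass/fail check; the 0.8 cutoff is an illustrative choice, not a library default:

SAFETY_THRESHOLD = 0.8  # illustrative cutoff; tune for your domain
results = evaluator.evaluate(eval_dataset, return_tensor=False)
for _, result in results.items():
    if result.value < SAFETY_THRESHOLD:
        raise ValueError(f"Safety score {result.value:.3f} is below {SAFETY_THRESHOLD}")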
AlignmentMetric (Helpfulness)
Measures: Whether responses are useful, detailed, and clear.
Basic Usage
from rag_opt.eval import AlignmentMetric, RAGEvaluator

llm = ...  # any chat model from init_chat_model, as above
metric = AlignmentMetric(llm=llm)
evaluator = RAGEvaluator(metrics=[metric])
results = evaluator.evaluate(eval_dataset, return_tensor=False)  # reuses eval_dataset from above
print(f"results: {results}")
ResponseRelevancy
Measures: Whether the response actually answers the question asked.
Basic Usage
from rag_opt.eval import ResponseRelevancy, RAGEvaluator

metric = ResponseRelevancy(llm=llm)  # llm initialized as above
evaluator = RAGEvaluator(metrics=[metric])
results = evaluator.evaluate(eval_dataset, return_tensor=False)
print(f"results: {results}")
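To see the metric react, you can score a deliberately off-topic answer. A minimal sketch reusing the dataset classes from the SafetyMetric example; the near-zero expectation is an assumption about the judge's behavior, not a guarantee:

from rag_opt.dataset import EvaluationDataset, EvaluationDatasetItem, GroundTruth

off_topic = EvaluationDataset(
    items=[
        EvaluationDatasetItem(
            question="What is the capital of France?",
            answer="I enjoy long walks on the beach.",  # ignores the question
            contexts=["Paris is the capital of France."],
            ground_truth=GroundTruth(answer="Paris",
                                     contexts=["Paris is the capital of France."]),
        )
    ]
)
results = evaluator.evaluate(off_topic, return_tensor=False)
for _, result in results.items():
    print(f"Relevancy: {result.value:.3f}")  # expect a low score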
When to Use Each Metric
Pick the metrics that match your use case (see the sketch after this list).
1. Use SafetyMetric when:
- Factual accuracy is critical
- Compliance or legal requirements apply
- You are working in high-stakes domains
2. Use AlignmentMetric when:
- User experience is the priority
- You are optimizing for helpfulness
- You need to balance detail against brevity
3. Use ResponseRelevancy when:
- You need to ensure on-topic responses
- You are building Q&A systems
- You are doing quality assurance for chatbots
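As a rough pattern, the metric mix can be selected per scenario. A minimal sketch with a hypothetical build_evaluator helper (not part of rag_opt), assuming the metric and evaluator imports shown above:

def build_evaluator(use_case: str, llm) -> RAGEvaluator:
    # Hypothetical helper: maps a scenario to a metric suite
    if use_case == "high_stakes":  # compliance, legal, medical
        metrics = [SafetyMetric(llm=llm)]
    elif use_case == "chatbot_qa":  # on-topic, helpful answers
        metrics = [ResponseRelevancy(llm=llm), AlignmentMetric(llm=llm)]
    else:  # default: measure everything
        metrics = [SafetyMetric(llm=llm),
                   AlignmentMetric(llm=llm),
                   ResponseRelevancy(llm=llm)]
    return RAGEvaluator(metrics=metrics)

evaluator = build_evaluator("chatbot_qa", llm)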