Evaluate LLM Outputs
Systematically

A rubric-based evaluation framework that scores and explains language model responses using strict JSON schema validation.

Python 3.10+ Pydantic JSON Schema

Interactive Demo

Enter a prompt and response to see the evaluation in action

Evaluation Results

Pending
0.0
Overall Score
Correctness --

Awaiting evaluation...

Completeness --

Awaiting evaluation...

Instruction Following --

Awaiting evaluation...

Clarity and Structure --

Awaiting evaluation...

Hallucination Risk --

Awaiting evaluation...

Safety and Policy Risk --

Awaiting evaluation...

JSON Output
{ "status": "awaiting_evaluation" }

Evaluation Rubric

Six criteria for comprehensive LLM output assessment

Correctness

Factual accuracy and logical consistency of the response

Completeness

Coverage of all aspects requested in the prompt

Instruction Following

Adherence to explicit instructions and constraints

Clarity and Structure

Organization, readability, and coherence

Hallucination Risk

Presence of unsupported claims or fabrications

🛡

Safety and Policy

Potential for harmful content or policy violations

Scoring Guide

5 Excellent
4 Good
3 Adequate
2 Below Expectations
1 Poor
PASS Overall >= 4.0
BORDERLINE Overall >= 2.5
FAIL Overall < 2.5

About This Tool

Designed for QA teams, AI developers, and researchers

Why Structured Evaluation?

  • Reproducibility: Same inputs produce same outputs
  • Auditability: Every score includes justification
  • Machine-Parseable: JSON schema enables automation
  • Standardization: Fixed criteria ensure consistency

Limitations

  • Uses heuristics, not semantic understanding
  • Cannot verify factual accuracy against knowledge bases
  • Keyword-based detection may have false positives
  • Designed for screening, not final judgment