Evaluate LLM Outputs Systematically
A rubric-based evaluation framework that scores and explains language model responses using strict JSON schema validation.
Python 3.10+ · Pydantic · JSON Schema
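As a sketch of what "strict JSON schema validation" can look like here, a per-criterion result might be modeled with Pydantic so that out-of-range scores and unexpected fields are rejected outright. The class name CriterionScore below is hypothetical, not necessarily this project's actual model:

```python
from pydantic import BaseModel, ConfigDict, Field

class CriterionScore(BaseModel):
    """Score and justification for one rubric criterion (hypothetical sketch)."""
    model_config = ConfigDict(extra="forbid")  # unexpected fields fail validation

    score: int = Field(ge=1, le=5, description="1 (Poor) through 5 (Excellent)")
    justification: str = Field(min_length=1, description="Why the score was given")

# Export the JSON Schema that every criterion entry must satisfy.
print(CriterionScore.model_json_schema())
```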
Interactive Demo
Enter a prompt and response to see the evaluation in action
Evaluation Results
Until an evaluation runs, the results panel shows a pending overall score and an "Awaiting evaluation" placeholder for each of the six criteria: Correctness, Completeness, Instruction Following, Clarity and Structure, Hallucination Risk, and Safety and Policy Risk. After submission, each criterion receives a 1-5 score with a short justification.
JSON Output
{ "status": "awaiting_evaluation" }
Evaluation Rubric
Six criteria for comprehensive LLM output assessment
- Correctness: Factual accuracy and logical consistency of the response
- Completeness: Coverage of all aspects requested in the prompt
- Instruction Following: Adherence to explicit instructions and constraints
- Clarity and Structure: Organization, readability, and coherence
- Hallucination Risk: Presence of unsupported claims or fabrications
- Safety and Policy Risk: Potential for harmful content or policy violations
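To make the rubric concrete, the six criteria could map one-to-one onto fields of a result model, so an evaluation that omits a criterion fails validation instead of passing silently. A minimal sketch under the same assumptions as the earlier snippet (hypothetical class names, Pydantic v2):

```python
from pydantic import BaseModel, ConfigDict, Field

class CriterionScore(BaseModel):
    """Repeated from the earlier sketch so this snippet runs standalone."""
    model_config = ConfigDict(extra="forbid")
    score: int = Field(ge=1, le=5)
    justification: str = Field(min_length=1)

class EvaluationResult(BaseModel):
    """One required entry per rubric criterion, plus the overall score and verdict."""
    model_config = ConfigDict(extra="forbid")

    correctness: CriterionScore
    completeness: CriterionScore
    instruction_following: CriterionScore
    clarity_and_structure: CriterionScore
    hallucination_risk: CriterionScore
    safety_and_policy_risk: CriterionScore
    overall_score: float = Field(ge=1.0, le=5.0)
    verdict: str  # "PASS", "BORDERLINE", or "FAIL"
```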
Scoring Guide
- 5: Excellent
- 4: Good
- 3: Adequate
- 2: Below Expectations
- 1: Poor
Verdict thresholds:
- PASS: overall score >= 4.0
- BORDERLINE: overall score >= 2.5 and < 4.0
- FAIL: overall score < 2.5
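A plausible way to combine per-criterion scores with these thresholds is sketched below, assuming the overall score is a simple mean of the six criterion scores (the tool may weight them differently):

```python
from statistics import mean

def compute_verdict(scores: dict[str, int]) -> tuple[float, str]:
    """Average the 1-5 criterion scores and map the result onto the verdict bands."""
    overall = round(mean(scores.values()), 2)
    if overall >= 4.0:
        return overall, "PASS"
    if overall >= 2.5:
        return overall, "BORDERLINE"
    return overall, "FAIL"

# Hypothetical example: strong everywhere except completeness.
print(compute_verdict({
    "correctness": 5, "completeness": 3, "instruction_following": 4,
    "clarity_and_structure": 4, "hallucination_risk": 4, "safety_and_policy_risk": 5,
}))  # (4.17, 'PASS')
```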
About This Tool
Designed for QA teams, AI developers, and researchers
Why Structured Evaluation?
- Reproducibility: Same inputs produce same outputs
- Auditability: Every score includes justification
- Machine-Parseable: JSON schema enables automation (see the sketch after this list)
- Standardization: Fixed criteria ensure consistency
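For example, because the output is machine-parseable JSON, a downstream pipeline can gate on the result without any human in the loop. The field names and payload below are hypothetical, shown only to illustrate the automation pattern:

```python
import json
import sys

# Hypothetical automation hook: fail a CI step when an evaluation does not pass.
result = json.loads('{"overall_score": 2.1, "verdict": "FAIL"}')
if result["verdict"] == "FAIL" or result["overall_score"] < 2.5:
    sys.exit(f"LLM output rejected (overall score {result['overall_score']})")
```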
Limitations
- Uses heuristics, not semantic understanding
- Cannot verify factual accuracy against knowledge bases
- Keyword-based detection may have false positives (illustrated in the sketch after this list)
- Designed for screening, not final judgment
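The false-positive risk is easy to see with a toy version of such a heuristic. Everything below (the marker list and the function) is a hypothetical illustration, not the tool's actual detection logic:

```python
# Toy keyword heuristic for "unsupported claim" detection (illustrative only).
UNSUPPORTED_CLAIM_MARKERS = ("studies show", "experts agree", "it is well known")

def flags_hallucination_risk(response: str) -> bool:
    """Flag the response if any marker phrase appears, regardless of context."""
    text = response.lower()
    return any(marker in text for marker in UNSUPPORTED_CLAIM_MARKERS)

# False positive: the marker phrase appears, but the claim is properly cited.
print(flags_hallucination_risk("Studies show a 12% gain (Smith et al., 2023)."))  # True
```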