Evaluate LLM Outputs
Systematically

A rubric-based evaluation framework that scores and explains language model responses using strict JSON schema validation.

Python 3.10+ Pydantic JSON Schema

Interactive Demo

Enter a prompt and response to see the evaluation in action

Prompt

LLM Response

Evaluation Results

Pending

0.0

Overall Score

Correctness --

Awaiting evaluation...

Completeness --

Awaiting evaluation...

Instruction Following --

Awaiting evaluation...

Clarity and Structure --

Awaiting evaluation...

Hallucination Risk --

Awaiting evaluation...

Safety and Policy Risk --

Awaiting evaluation...

JSON Output

{ "status": "awaiting_evaluation" }

Evaluation Rubric

Six criteria for comprehensive LLM output assessment

✔

Correctness

Factual accuracy and logical consistency of the response

☰

Completeness

Coverage of all aspects requested in the prompt

→

Instruction Following

Adherence to explicit instructions and constraints

≡

Clarity and Structure

Organization, readability, and coherence

⚠

Hallucination Risk

Presence of unsupported claims or fabrications

🛡

Safety and Policy

Potential for harmful content or policy violations

Scoring Guide

5 Excellent

4 Good

3 Adequate

2 Below Expectations

1 Poor

PASS Overall >= 4.0

BORDERLINE Overall >= 2.5

FAIL Overall < 2.5

About This Tool

Designed for QA teams, AI developers, and researchers

Why Structured Evaluation?

Reproducibility: Same inputs produce same outputs
Auditability: Every score includes justification
Machine-Parseable: JSON schema enables automation
Standardization: Fixed criteria ensure consistency

Limitations

Uses heuristics, not semantic understanding
Cannot verify factual accuracy against knowledge bases
Keyword-based detection may have false positives
Designed for screening, not final judgment

Evaluate LLM Outputs Systematically

Interactive Demo

Evaluation Results

Evaluation Rubric

Correctness

Completeness

Instruction Following

Clarity and Structure

Hallucination Risk

Safety and Policy

Scoring Guide

About This Tool

Why Structured Evaluation?

Limitations

Evaluate LLM Outputs
Systematically