January 15, 2026

How AI Immortal Judges: The Rubric

A transparent look at how AI judges evaluate arguments in the battle arena, including the rubric system and why we make all scoring visible.

Tags: AI, Judging, Transparency, Arena

When a dispute arises in the AI Immortal arena, both sides present their arguments and evidence. But how does the AI judge actually decide who wins? This post breaks down our judging system and explains why transparency matters.


The Five-Dimensional Rubric


AI judges evaluate arguments across five key dimensions, each scored from 0 to 10 (a rough sketch of how these scores might be represented follows the list):


**Accuracy**: Are the factual claims correct? Do cited sources actually say what the argument claims? We verify claims against evidence and check for misrepresentation.


**Evidence Quality**: How strong is the supporting evidence? Primary sources score higher than secondary ones, and recent evidence generally beats outdated evidence. Multiple independent sources strengthen the score.


**Reasoning**: Does the logical structure hold up? We check for fallacies, non-sequiturs, and gaps in reasoning. Strong arguments connect premises to conclusions without leaps.


**Completeness**: Does the argument address all relevant aspects? Arguments that ignore obvious counterpoints or cherry-pick evidence score lower than those that engage with complexity.


**Clarity**: Can a reasonable reader follow the argument? Obfuscation and jargon lower scores. Clear, direct reasoning scores higher.
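
To make the structure concrete, here is a minimal sketch of how a per-dimension score sheet could be represented and aggregated. The dimension names and the 0-to-10 range come from the rubric above; the `RubricScore` class, field names, and equal weighting are illustrative assumptions, not the production implementation.

```python
from dataclasses import dataclass, field

# The five rubric dimensions, each scored 0-10 (from the rubric above).
DIMENSIONS = ("accuracy", "evidence_quality", "reasoning", "completeness", "clarity")


@dataclass
class RubricScore:
    """One judge's score sheet for a single argument (illustrative shape only)."""
    scores: dict[str, int]                                      # dimension -> 0..10
    rationales: dict[str, str] = field(default_factory=dict)    # dimension -> short explanation

    def validate(self) -> None:
        # Every dimension must be present and within the 0-10 range.
        for dim in DIMENSIONS:
            value = self.scores.get(dim)
            if value is None or not 0 <= value <= 10:
                raise ValueError(f"{dim} must be scored 0-10, got {value!r}")

    def total(self, weights: dict[str, float] | None = None) -> float:
        # Equal weighting is an assumption; the real system may weight dimensions differently.
        weights = weights or {dim: 1.0 for dim in DIMENSIONS}
        return sum(self.scores[dim] * weights[dim] for dim in DIMENSIONS)


# Example: an argument that is clear and well reasoned but weak on evidence.
score = RubricScore(
    scores={"accuracy": 7, "evidence_quality": 3, "reasoning": 8, "completeness": 6, "clarity": 9},
    rationales={"evidence_quality": "Relies on a single outdated secondary source."},
)
score.validate()
print(score.total())  # 33.0 with equal weights
```

The per-dimension rationales are what make the score sheet useful for learning, which is the point of the next section.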


Why We Show Our Work


Every judge decision includes the full rubric breakdown, so you can see exactly why an argument won or lost (a sketch of such a decision record follows the three points below). This serves multiple purposes:


First, it helps humans learn. When you see that your argument scored low on "Evidence Quality" because you relied on a single outdated source, you know what to improve next time.


Second, it makes the system auditable. If a judge decision seems wrong, you can examine the rubric, challenge specific scores, and potentially trigger a re-evaluation. Bad judge calls don't hide in black boxes.


Third, it creates training data. When humans disagree with judge decisions, those disagreements teach us where the AI reasoning breaks down. The system improves through challenge.
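
As an illustration of what "showing our work" could look like as data, here is a hypothetical decision record that publishes per-dimension scores and rationales alongside the verdict. The field names (`verdict`, `rationales`, `judge_confidence`, and so on) are assumptions made for the sketch; the actual payload may differ.

```python
# Hypothetical shape of a published judge decision. Everything that explains
# the outcome (scores, rationales, confidence) is exposed rather than hidden.
decision = {
    "dispute_id": "example-123",
    "verdict": "side_a",                       # which side the judge favored
    "scores": {
        "side_a": {"accuracy": 8, "evidence_quality": 7, "reasoning": 8, "completeness": 6, "clarity": 7},
        "side_b": {"accuracy": 6, "evidence_quality": 4, "reasoning": 7, "completeness": 5, "clarity": 8},
    },
    "rationales": {
        "side_b": {
            "evidence_quality": "Two of the three citations do not support the stated claim.",
        },
    },
    "judge_confidence": 0.72,                  # feeds the human-review escalation discussed below
    "challengeable": True,                     # any individual score can be challenged for re-evaluation
}
```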


Limitations and Failure Modes


Our judges aren't perfect. Known issues include:


- Bias toward conventional sources over newer evidence

- Difficulty with domain-specific technical arguments

- Tendency to favor longer, more detailed responses even when conciseness would be better

- Challenges with detecting subtle forms of evidence manipulation


We track these failure modes and work to address them. But we also believe transparency about limitations is more valuable than false confidence.


Human Override


Judges make recommendations, not final rulings. Humans can challenge decisions, and sufficient challenges trigger human review. The system is designed to defer to human judgment when confidence is low or stakes are high.
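
Here is a minimal sketch of the kind of deferral rule described above: the judge's verdict stands as a recommendation, and low confidence, high stakes, or enough challenges routes the dispute to a human. The function name and thresholds are illustrative assumptions, not the values the system actually uses.

```python
def needs_human_review(
    judge_confidence: float,
    stakes: str,
    challenge_count: int,
    *,
    confidence_floor: float = 0.75,   # assumed threshold, not a production value
    challenge_threshold: int = 3,     # assumed number of challenges that forces review
) -> bool:
    """Return True when an AI judge recommendation should be escalated to a human."""
    if stakes == "high":
        return True
    if judge_confidence < confidence_floor:
        return True
    if challenge_count >= challenge_threshold:
        return True
    return False


# A confident verdict on a routine dispute stays with the AI judge...
print(needs_human_review(0.9, "routine", challenge_count=0))   # False
# ...but repeated challenges escalate it regardless of confidence.
print(needs_human_review(0.9, "routine", challenge_count=4))   # True
```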


Future Improvements


We're exploring:


- Multi-judge panels for high-stakes disputes

- Specialized judges for different domains

- Community feedback loops where repeated judge errors trigger retraining

- Judge confidence scores that trigger automatic human review


The goal isn't perfect AI judgment. It's a system where AI augments human evaluation, handles the routine cases, and explicitly flags uncertainty for human attention.


Try It Yourself


The best way to understand the rubric is to use it. Enter the arena, submit an argument, and see how it scores. Study the rubric breakdown. Refine and resubmit. The system learns, and so do you.