Benchmarks
Open benchmark suites that any legal AI system can run against the standards. Each result is recorded against a pinned standards release, so results stay comparable and trust survives engine updates.
Gold Proof Set
Manually verified reasoning examples used as a regression baseline. Detects logical degradation between engine versions.
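A regression check against a gold set could look roughly like the sketch below. It is an illustration only: the GoldExample type, the find_regressions helper, the example ids, and the conclusion labels are all hypothetical and not part of any published suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldExample:
    """One manually verified example (hypothetical shape)."""
    example_id: str
    expected_conclusion: str


def find_regressions(gold, engine_outputs):
    """Return ids of gold examples whose conclusion the engine no longer reproduces.

    engine_outputs maps example_id -> conclusion; a missing id counts as a regression.
    """
    regressions = []
    for example in gold:
        actual = engine_outputs.get(example.example_id)
        if actual != example.expected_conclusion:
            regressions.append(example.example_id)
    return regressions


# Hypothetical gold set and outputs from two engine versions.
gold = [
    GoldExample("g-001", "liable"),
    GoldExample("g-002", "not liable"),
]
outputs_v1 = {"g-001": "liable", "g-002": "not liable"}
outputs_v2 = {"g-001": "liable", "g-002": "liable"}

print(find_regressions(gold, outputs_v1))
print(find_regressions(gold, outputs_v2))
```

Comparing only final conclusions is the simplest possible check; a fuller suite would presumably compare intermediate reasoning steps as well.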
Temporal Drift
Probes status-at-time correctness: whether an authority's status is reported correctly as of a given date, across statute amendments, repeals, and supersession events.
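One way to picture a status-at-time probe is a lookup over a dated event history. The sketch below assumes a hypothetical StatusEvent record and status_at helper; the dates and status labels are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StatusEvent:
    """A dated change in an authority's status (hypothetical shape)."""
    effective: date
    status: str  # e.g. "in_force", "amended", "repealed"


def status_at(events, as_of: date) -> Optional[str]:
    """Return the status set by the latest event effective on or before as_of.

    Returns None if no event had taken effect yet on that date.
    """
    current = None
    for event in sorted(events, key=lambda e: e.effective):
        if event.effective <= as_of:
            current = event.status
        else:
            break
    return current


# Invented history for a single provision.
history = [
    StatusEvent(date(2010, 1, 1), "in_force"),
    StatusEvent(date(2018, 7, 1), "amended"),
    StatusEvent(date(2022, 3, 15), "repealed"),
]

print(status_at(history, date(2015, 6, 1)))
print(status_at(history, date(2023, 1, 1)))
```

A probe would then ask the engine the same question for dates on either side of each event and compare its answer with this lookup.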
Jurisdiction Swap
Holds facts constant and varies the jurisdiction. Confirms outputs change where the governing law differs and nowhere else.
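A swap check can be sketched as a field-level diff between two outputs, compared with the set of fields expected to differ. The field names, jurisdictions "J1" and "J2", and both helpers below are hypothetical.

```python
def changed_fields(a: dict, b: dict) -> set:
    """Return the keys whose values differ between two engine outputs."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


def check_swap(output_a: dict, output_b: dict, expected_to_change: set) -> dict:
    """Compare the actual diff with the fields expected to vary by jurisdiction."""
    actual = changed_fields(output_a, output_b)
    return {
        "missing_changes": expected_to_change - actual,
        "unexpected_changes": actual - expected_to_change,
    }


# Same facts, two invented jurisdictions.
output_j1 = {
    "facts": "lease signed, rent unpaid",
    "limitation_years": 6,
    "authority": "J1 Limitation Act (hypothetical)",
}
output_j2 = {
    "facts": "lease signed, rent unpaid",
    "limitation_years": 3,
    "authority": "J2 Civil Code (hypothetical)",
}

report = check_swap(output_j1, output_j2, {"limitation_years", "authority"})
print(report)
```

Both sets in the report are empty when the outputs changed in exactly the expected fields; a non-empty "unexpected_changes" would suggest the swap leaked into something that should have stayed fixed.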
Citation Replay
Verifies every cited authority resolves to a stable AuthorityRef and survives republication.
Authority Resolution
Tests resolver accuracy for ambiguous, partial, and historical citations.
Defeater Handling
Confirms exceptions, conflicts, rebuttals, and superior authority are modeled rather than collapsed.
Prompt Stability
Surface-form perturbations should not flip a conclusion when underlying facts and authorities are unchanged.
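As a rough sketch of the idea, a stability check can run several surface variants of one prompt through an engine and confirm that a single conclusion comes back. The two toy engines and the is_stable helper below are hypothetical stand-ins, not a real system.

```python
def is_stable(engine, variants) -> bool:
    """True if every surface variant yields the same conclusion."""
    return len({engine(v) for v in variants}) == 1


def normalizing_engine(prompt: str) -> str:
    """Toy engine that ignores case and extra whitespace."""
    text = " ".join(prompt.lower().split())
    return "breach" if "missed the payment deadline" in text else "no breach"


def brittle_engine(prompt: str) -> str:
    """Toy engine that matches the raw string and so reacts to surface form."""
    return "breach" if "missed the payment deadline" in prompt else "no breach"


# Same facts, different casing and spacing.
variants = [
    "The tenant missed the payment deadline.",
    "THE TENANT MISSED THE PAYMENT DEADLINE.",
    "The tenant   missed the payment   deadline.",
]

print(is_stable(normalizing_engine, variants))
print(is_stable(brittle_engine, variants))
```

Real perturbations would go beyond casing and spacing (paraphrase, reordering, formatting), but the pass criterion stays the same: one conclusion across all variants.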