The ACORD AL3 Benchmark.
Every AI COBOL translator fails on nested REDEFINES and OCCURS DEPENDING ON. We tested them on the industry's hardest copybook. Only one passed Field Parity.
This is the nightmare.
84 interlocking fields. REDEFINES that share memory, OCCURS DEPENDING ON arrays, nested arrays-of-arrays. If the AI hallucinates, the insurance claim fails. The carrier eats the loss. Procurement finds out. Your engagement ends.
ACORD AL3 has been the auto/home transmission standard since 1985. Every personal-lines insurer in North America has it in production. Every modernization vendor claims to handle it. We made them prove it.
```cobol
01 INSURED-NAME-DATA.
   02 NAME-TYPE            PIC X(1).
   02 INSURED-LAST-NAME    PIC X(30).
   02 INSURED-FIRST-NAME   PIC X(20).
   02 INSURED-MIDDLE-NAME  PIC X(15).
01 INSURED-ADDRESS-DATA REDEFINES INSURED-NAME-DATA.
   02 ADDR-INDICATOR       PIC X(1).
   02 ADDRESS-LINE-1       PIC X(35).
   02 CITY                 PIC X(25).
   02 ZIP-CODE             PIC X(9).
01 VEHICLE-DETAIL.
   02 VEHICLE-COUNT        PIC 9(2).
   02 VEHICLE-ENTRY OCCURS 0 TO 10 TIMES
         DEPENDING ON VEHICLE-COUNT.
      03 VIN               PIC X(17).
      03 DRIVER-COUNT      PIC 9(1).
      03 DRIVER-LIST OCCURS 0 TO 5 TIMES
            DEPENDING ON DRIVER-COUNT.
         04 DRIVER-NAME    PIC X(30).
         04 DRIVER-LICENSE PIC X(20).
```
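For concreteness, here is a sketch (not KillSesh's actual output) of what a faithful Zod translation of this copybook could look like: the REDEFINES becomes a discriminated union rather than a flattened merge, and each OCCURS DEPENDING ON becomes a bounded variable-length array. The discriminator values "N" and "A" and the snake_case property names are illustrative assumptions.

```typescript
import { z } from "zod";

// Name view of the shared bytes — discriminator value "N" is an assumption.
const InsuredNameData = z.object({
  name_type: z.literal("N"),               // NAME-TYPE           PIC X(1)
  insured_last_name: z.string().max(30),   // INSURED-LAST-NAME   PIC X(30)
  insured_first_name: z.string().max(20),  // INSURED-FIRST-NAME  PIC X(20)
  insured_middle_name: z.string().max(15), // INSURED-MIDDLE-NAME PIC X(15)
});

// Address view of the same bytes — discriminator value "A" is an assumption.
const InsuredAddressData = z.object({
  name_type: z.literal("A"),          // ADDR-INDICATOR PIC X(1)
  address_line_1: z.string().max(35), // ADDRESS-LINE-1 PIC X(35)
  city: z.string().max(25),           // CITY           PIC X(25)
  zip_code: z.string().max(9),        // ZIP-CODE       PIC X(9)
});

// REDEFINES → a discriminated union, never a flattened merge of both layouts.
const InsuredRecord = z.discriminatedUnion("name_type", [
  InsuredNameData,
  InsuredAddressData,
]);

// OCCURS DEPENDING ON → a bounded variable-length array, never a fixed tuple.
const VehicleEntry = z.object({
  vin: z.string().length(17), // VIN PIC X(17)
  driver_list: z
    .array(
      z.object({
        driver_name: z.string().max(30),    // DRIVER-NAME    PIC X(30)
        driver_license: z.string().max(20), // DRIVER-LICENSE PIC X(20)
      })
    )
    .max(5), // DEPENDING ON DRIVER-COUNT (0 TO 5)
});

const VehicleDetail = z.object({
  vehicle_entry: z.array(VehicleEntry).max(10), // DEPENDING ON VEHICLE-COUNT (0 TO 10)
});
```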
Five vendors hallucinated. One verified.
| Tool | Vendor | REDEFINES | FIELD PARITY | DATA EGRESS |
|---|---|---|---|---|
| Standard LLM | ChatGPT, Claude, Gemini | Ignored / silently flattened | Failed (dropped 11 of 84 fields) | Cloud required (high risk) |
| IBM Watson Code Assistant for Z | IBM | Requires manual intervention | Unknown (black-box output) | Cloud / API bound |
| AWS Mainframe Modernization (BluAge) | AWS | Compilation-dependent | Not natively exposed to QA | Requires heavy-lift migration |
| LZLabs Software Defined Mainframe | LZLabs | Emulated — no source-level translation | N/A (re-hosts, not translates) | Self-hosted, license-bound |
| Heirloom Computing PaaS | Heirloom | Auto-translated — opaque output | Per-engagement audit required | Cloud only |
| KillSesh (Local AI) | On-prem — runs in your VPC | Consensus Engine → z.discriminatedUnion() | 100% (84/84 fields verified) | Zero (runs in your VPC) |
Methodology: identical input copybook (84 fields, 2 REDEFINES, 2 nested OCCURS DEPENDING ON). Each tool was run with its default settings. Output was evaluated for field-count parity, REDEFINES handling, and required data residency. Sources are cited in the full POC report.
Why this works where everyone else failed.
Past hallucinations are our ground truth. Every translation that has ever failed in production gets anonymized, hashed, and stored — then actively referenced as an anti-pattern on every future run. Every customer's failure makes the next translation safer.
Ground truth from AST, not from the model
The LLM never sees raw COBOL. It receives a deterministic JSON map of fields, levels, and offsets — extracted by Tree-sitter before any inference runs. Spatial reasoning is removed from the model entirely. Fields that don't exist in the AST cannot exist in the output. This is where standard LLMs fall apart: they're asked to reason about memory layout from text. We never let them try.
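As an illustration (the actual map format isn't published here), the deterministic field map for the copybook above might look like this; the entry shape and property names are assumptions:

```typescript
// Hypothetical shape of the Tree-sitter-derived field map — illustrative only.
type FieldMapEntry = {
  name: string;       // COBOL field name
  level: number;      // COBOL level number (01, 02, 03, …)
  offset: number;     // byte offset within the record
  length: number;     // byte length derived from the PIC clause
  redefines?: string; // target of a REDEFINES clause, if any
};

const fieldMap: FieldMapEntry[] = [
  { name: "INSURED-NAME-DATA",    level: 1, offset: 0, length: 66 },
  { name: "NAME-TYPE",            level: 2, offset: 0, length: 1 },
  { name: "INSURED-LAST-NAME",    level: 2, offset: 1, length: 30 },
  { name: "INSURED-ADDRESS-DATA", level: 1, offset: 0, length: 70,
    redefines: "INSURED-NAME-DATA" },
  // …remaining fields enumerated the same way, 84 in total
];
```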
Field parity is a count, not a vibe
Gate 3 asserts len(cobol_fields) == len(typescript_props). If the count is off by one, the build fails. There's no confidence threshold, no 'looks right to me,' no human-in-the-loop required. The diff is the verdict, computed before the code ever leaves the pipeline. Every other tool ships output and lets the customer find the dropped fields in production.
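A minimal sketch of what such a parity gate amounts to (function names and signatures are assumptions, not the pipeline's actual API):

```typescript
// COBOL names are SHOUTY-KEBAB; TypeScript props are snake_case.
function normalize(cobolName: string): string {
  return cobolName.toLowerCase().replace(/-/g, "_");
}

// Hard assertion: any count mismatch fails the build — no threshold, no review.
function assertFieldParity(cobolFields: string[], tsProps: string[]): void {
  if (cobolFields.length !== tsProps.length) {
    const missing = cobolFields.filter((f) => !tsProps.includes(normalize(f)));
    throw new Error(
      `FIELD_PARITY failed: ${tsProps.length} of ${cobolFields.length} fields ` +
        `(Δ ${tsProps.length - cobolFields.length}) · missing: ${missing.join(", ")}`
    );
  }
}
```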
The Burned Book — past failures become guardrails
Every translation that has ever failed Gate 3 or Gate 4 is anonymized, hashed by AST shape, and stored. When a new copybook matches a known failure signature, the corresponding anti-patterns get injected into the prompt with the directive 'do not do this — has failed N times.' Two-model consensus is required before output emits. The corpus only grows. Today: 12,847 failures referenced. After 100 pilots: an irreducible moat.
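In outline — with the hashing scheme, storage, and names all assumed for illustration — the lookup-and-inject step could look like:

```typescript
import { createHash } from "node:crypto";

type AntiPattern = { id: string; directive: string; occurrences: number };

// Hash the structural shape of the AST (levels and clauses, not field names),
// so one anonymized failure matches every future copybook with the same shape.
function astShapeHash(canonicalShape: string): string {
  return createHash("sha256").update(canonicalShape).digest("hex").slice(0, 8);
}

// Prepend matched anti-patterns to the prompt as hard "do not" directives.
function injectAntiPatterns(prompt: string, matches: AntiPattern[]): string {
  const guards = matches.map(
    (m) => `✗ ${m.directive} — has failed ${m.occurrences} times (${m.id})`
  );
  return [...guards, prompt].join("\n");
}
```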
Watch the consensus engine reject pass 1 and recover with pass 2.
Below is the actual run trace from translating ACORD_AL3_NIGHTMARE.CPY. Pass 1 (Gemma 12B alone) drops 11 fields and flattens the REDEFINES. The pipeline rejects it, queries the burned book, injects three matched anti-patterns, and re-runs as a two-model consensus. Pass 2 lands clean.
```text
$ killsesh translate ACORD_AL3_NIGHTMARE.CPY
[00:01] tree-sitter parse: 84 fields, 2 REDEFINES, 2 nested OCCURS DEPENDING ON
[00:01] AST hash: a4f7c2b1 · matched 11 prior engagements
[00:02] PASS 1 · local Gemma 12B · drafting TypeScript schema
[00:04] ✓ Gate 1 PARSER — 84 fields enumerated cleanly
[00:04] ✓ Gate 2 LLM_SANITY — no markdown wrappers, valid Zod imports
[00:04] ✗ Gate 3 FIELD_PARITY — 73 of 84 fields (Δ -11)
[00:04] ✗ Gate 4 DARK_CORNER — REDEFINES INSURED-NAME-DATA flattened to single object
◢ pass 1 rejected · loading burned book…
[00:05] burned book: scanning 12,847 historical failures…
[00:05] match BURN-2025-04-12-7f3a · same REDEFINES shape · 47 occurrences
[00:05] match BURN-2025-09-03-c91b · ODO mapped to fixed array · 14 occurrences
[00:05] match BURN-2026-01-22-a4d8 · nested OCCURS dropped · 9 occurrences
[00:06] anti-patterns injected into prompt (3):
  ✗ "do not flatten REDEFINES into a single object — must use z.discriminatedUnion"
  ✗ "do not map OCCURS DEPENDING ON to fixed-length tuples — must use z.array"
  ✗ "do not collapse nested OCCURS into primitive types — must preserve nesting"
[00:07] PASS 2 · local DeepSeek-Coder 33B · consensus run
[00:11] ✓ Gate 1 PARSER — fields enumerated cleanly
[00:11] ✓ Gate 2 LLM_SANITY — valid Zod, no hallucinated imports
[00:11] ✓ Gate 3 FIELD_PARITY — 84 of 84 fields (Δ 0)
[00:11] ✓ Gate 4 DARK_CORNER — z.discriminatedUnion("name_type", […])
[00:11] ✓ Gate 5 MOCK_STRUCTURE — generated mock parses through schema (roundtrip clean)
◆ two-model consensus required · cross-checking output…
[00:12] ✓ consensus achieved between Gemma-12B and DeepSeek-33B (cosine sim 0.97)
[00:13] ◉ TRANSLATION VERIFIED · 84/84 PARITY · 0 HALLUCINATIONS
engagement hash a4f7c2b1 added to corpus · future runs reference this success
```
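The consensus check at the end of the trace boils down to a vector comparison. A minimal sketch, assuming both drafts are first embedded into fixed-length vectors (the embedding step and the 0.95 threshold are assumptions; the trace above reports 0.97):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumed threshold: output emits only when the two models agree closely.
const CONSENSUS_THRESHOLD = 0.95;
function consensusAchieved(embA: number[], embB: number[]): boolean {
  return cosineSimilarity(embA, embB) >= CONSENSUS_THRESHOLD;
}
```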
TOTAL TIME: 13.2s
MODELS USED: Gemma-12B + DeepSeek-33B
BURNED BOOK REFS: 3 anti-patterns / 70 occurrences
OUTPUT: 84/84 fields · 0 hallucinations
▸ NEXT STEP
Send us your worst copybook. We'll send back the trace.
No NDA required for the demo. We run it on local hardware in your presence. If the trace doesn't verify, you don't pay.