BENCHMARK · ACORD-AL3 · 2026-05-03
VOL. 1 · COBOL → TYPESCRIPT · NESTED REDEFINES

The ACORD AL3 Benchmark.

Every AI COBOL translator fails on nested REDEFINES and OCCURS DEPENDING ON. We tested them on the industry's hardest copybook. Only one passed Field Parity.

Zero-egress · Deterministic · Verifiable
▸ THE CHALLENGE

This is the nightmare.

84 fields across 9 interlocking records. REDEFINES that share memory, OCCURS DEPENDING ON arrays, nested arrays-of-arrays. If the AI hallucinates, the insurance claim fails. The carrier eats the loss. Procurement finds out. Your engagement ends.

ACORD AL3 has been the auto/home transmission standard since 1985. Every personal-lines insurer in North America has it in production. Every modernization vendor claims to handle it. We made them prove it.

REDEFINES (memory union) · OCCURS DEPENDING ON (variable array) · Nested arrays of arrays · 84 fields across 9 records
ACORD_AL3_NIGHTMARE.CPY · excerpt
◢ HARDEST KNOWN
01  INSURED-NAME-DATA.
    02  NAME-TYPE            PIC X(1).
    02  INSURED-LAST-NAME    PIC X(30).
    02  INSURED-FIRST-NAME   PIC X(20).
    02  INSURED-MIDDLE-NAME  PIC X(15).
01  INSURED-ADDRESS-DATA REDEFINES INSURED-NAME-DATA.
    02  ADDR-INDICATOR       PIC X(1).
    02  ADDRESS-LINE-1       PIC X(35).
    02  CITY                 PIC X(25).
    02  ZIP-CODE             PIC X(9).
01  VEHICLE-DETAIL.
    02  VEHICLE-COUNT        PIC 9(2).
    02  VEHICLE-ENTRY        OCCURS 0 TO 10 TIMES
                             DEPENDING ON VEHICLE-COUNT.
        03  VIN              PIC X(17).
        03  DRIVER-COUNT     PIC 9(1).
        03  DRIVER-LIST      OCCURS 0 TO 5 TIMES
                             DEPENDING ON DRIVER-COUNT.
            04  DRIVER-NAME     PIC X(30).
            04  DRIVER-LICENSE  PIC X(20).
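For orientation, here is the target shape this copybook implies, sketched in plain TypeScript rather than the Zod schemas the pipeline emits. Field names are camelCased for illustration, and the `kind` discriminant is an assumption of this sketch, not part of AL3:

```typescript
// The REDEFINES pair becomes a discriminated union: NAME-TYPE and
// ADDR-INDICATOR occupy the same byte in COBOL memory, so exactly one
// interpretation of the record is valid at a time.
type InsuredNameData = {
  kind: "name";
  nameType: string;        // PIC X(1)
  lastName: string;        // PIC X(30)
  firstName: string;       // PIC X(20)
  middleName: string;      // PIC X(15)
};

type InsuredAddressData = {
  kind: "address";
  addrIndicator: string;   // PIC X(1)
  addressLine1: string;    // PIC X(35)
  city: string;            // PIC X(25)
  zipCode: string;         // PIC X(9)
};

type InsuredRecord = InsuredNameData | InsuredAddressData;

function isNameRecord(r: InsuredRecord): r is InsuredNameData {
  return r.kind === "name";
}

// OCCURS DEPENDING ON becomes a variable-length array; the counter
// field is retained so round-trips stay byte-faithful.
type DriverEntry = { driverName: string; driverLicense: string };

type VehicleEntry = {
  vin: string;               // PIC X(17)
  driverCount: number;       // PIC 9(1), 0..5
  driverList: DriverEntry[]; // OCCURS 0 TO 5 DEPENDING ON DRIVER-COUNT
};

type VehicleDetail = {
  vehicleCount: number;      // PIC 9(2), 0..10
  vehicleEntry: VehicleEntry[];
};
```

Flattening either structure (dropping the union, or fixing the array length) is exactly the failure mode the benchmark measures.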
▸ THE COMPARISON · 6 TOOLS · 1 COPYBOOK

Five vendors hallucinated. One verified.

Standard LLM · ChatGPT, Claude, Gemini · UNSAFE FOR PRODUCTION
REDEFINES: Ignored / silently flattened
FIELD PARITY: Failed (dropped 11 of 84 fields)
DATA EGRESS: Cloud required (high risk)

IBM Watson Code Assistant for Z · IBM · PROCUREMENT RISK
REDEFINES: Requires manual intervention
FIELD PARITY: Unknown (black box output)
DATA EGRESS: Cloud / API bound

AWS Mainframe Modernization (BluAge) · AWS · HIGH FRICTION
REDEFINES: Compilation-dependent
FIELD PARITY: Not natively exposed to QA
DATA EGRESS: Requires heavy-lift migration

LZLabs Software Defined Mainframe · LZLabs · RE-HOST, NOT MODERNIZE
REDEFINES: Emulated — no source-level translation
FIELD PARITY: N/A (re-hosts, not translates)
DATA EGRESS: Self-hosted, license-bound

Heirloom Computing PaaS · Heirloom · AUDIT BURDEN ON BUYER
REDEFINES: Auto-translated — opaque output
FIELD PARITY: Per-engagement audit required
DATA EGRESS: Cloud only

KillSesh (Local AI) · On-prem — runs in your VPC · ONLY VERIFIED PATH
REDEFINES: Consensus Engine → z.discriminatedUnion()
FIELD PARITY: 100% (84/84 fields verified)
DATA EGRESS: Zero (runs in your VPC)
Methodology: identical input copybook (84 fields, 2 REDEFINES, 2 nested OCCURS DEPENDING ON). Each tool was run with default settings. Output was evaluated for field-count parity, REDEFINES handling, and required data residency. Sources are cited in the full POC report.

▸ THE METHOD · REFERENCES EVERY PAST FAILURE

Why this works where everyone else failed.

Past hallucinations are our ground truth. Every translation that has ever failed in production gets anonymized, hashed, and stored — then actively referenced as an anti-pattern on every future run. Every customer's failure makes the next translation safer.

STEP 01

Ground truth from AST, not from the model

The LLM never sees raw COBOL. It receives a deterministic JSON map of fields, levels, and offsets — extracted by Tree-sitter before any inference runs. Spatial reasoning is removed from the model entirely. Fields that don't exist in the AST cannot exist in the output. This is where every standard LLM falls apart: they're asked to reason about memory layout from text. We never let them try.
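A sketch of what such a field map could look like. The `AstField` shape and both entries are illustrative; the report does not show the actual wire format:

```typescript
// Hypothetical shape of the deterministic field map the parser emits.
// Every property is extracted by the parser, never inferred by the model.
interface AstField {
  name: string;       // e.g. "INSURED-LAST-NAME"
  level: number;      // COBOL level number: 1, 2, 3, ...
  picture: string;    // e.g. "X(30)"; empty for group items
  offset: number;     // byte offset within the record
  length: number;     // byte length derived from the PIC clauses
  redefines?: string; // target field, when this item REDEFINES another
  occurs?: { min: number; max: number; dependingOn: string };
}

// Two entries from the excerpt above, as the model would receive them.
// INSURED-NAME-DATA is a group item: 1 + 30 + 20 + 15 = 66 bytes.
const fieldMap: AstField[] = [
  { name: "INSURED-NAME-DATA", level: 1, picture: "", offset: 0, length: 66 },
  { name: "NAME-TYPE", level: 2, picture: "X(1)", offset: 0, length: 1 },
];
```

Because the model only ever consumes this map, a field absent from the map cannot appear in its output, and every offset question is already answered before inference starts.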

STEP 02

Field parity is a count, not a vibe

Gate 3 asserts len(cobol_fields) == len(typescript_props). If the count is off by one, the build fails. There's no confidence threshold, no 'looks right to me,' no human-in-the-loop required. The diff is the verdict, computed before the code ever leaves the pipeline. Every other tool ships output and lets the customer find the dropped fields in production.
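The gate reduces to a strict count equality. A minimal sketch, with illustrative names:

```typescript
// Gate 3 sketch: a hard count comparison, not a similarity score.
// Any nonzero delta aborts the build; there is no override path.
function gate3FieldParity(
  cobolFields: string[],
  typescriptProps: string[],
): void {
  const delta = typescriptProps.length - cobolFields.length;
  if (delta !== 0) {
    throw new Error(`Gate 3 FIELD_PARITY failed (delta ${delta})`);
  }
}
```

A count check is deliberately crude: it cannot tell you *which* field was dropped, but it can never be argued with, which is what makes it a gate rather than a review step.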

STEP 03

The Burned Book — past failures become guardrails

Every translation that has ever failed Gate 3 or Gate 4 is anonymized, hashed by AST shape, and stored. When a new copybook matches a known failure signature, the corresponding anti-patterns get injected into the prompt with the directive 'do not do this — has failed N times.' Two-model consensus is required before output emits. The corpus only grows. Today: 12,847 failures referenced. After 100 pilots: an irreducible moat.
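A minimal sketch of the lookup, under the assumption that the signature hashes the AST *shape* (level numbers and clause kinds, not field names), so a renamed copybook still matches prior failures. All identifiers here are illustrative:

```typescript
import { createHash } from "node:crypto";

// One anonymized failure from the burned book.
interface BurnRecord {
  signature: string;    // hash of the failing copybook's AST shape
  antiPattern: string;  // directive injected into the next prompt
  occurrences: number;  // how many times this failure has recurred
}

// Hash an order-sensitive shape description,
// e.g. ["01", "02:REDEFINES", "02:ODO"].
function shapeSignature(shape: string[]): string {
  return createHash("sha256").update(shape.join("|")).digest("hex").slice(0, 8);
}

// Return the prompt directives for every burn matching this shape.
function matchedDirectives(shape: string[], book: BurnRecord[]): string[] {
  const sig = shapeSignature(shape);
  return book
    .filter((b) => b.signature === sig)
    .map((b) => `do not ${b.antiPattern} (has failed ${b.occurrences} times)`);
}
```

The design point is that matching happens on structure, not text: two carriers never see each other's copybooks, but both benefit from the same failure signature.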

▸ EXAMPLE · THIS COPYBOOK · THIS RUN · NO EDITS

Watch the consensus engine reject pass 1 and recover with pass 2.

Below is the actual run trace from translating ACORD_AL3_NIGHTMARE.CPY. Pass 1 (Gemma 12B alone) drops 11 fields and flattens the REDEFINES. The pipeline rejects it, queries the burned book, injects three matched anti-patterns, and re-runs as a two-model consensus. Pass 2 lands clean.

/var/log/killsesh/run.a4f7c2.trace
COMPLETED · 13.2s · 2 PASSES · 84/84 PARITY
$ killsesh translate ACORD_AL3_NIGHTMARE.CPY
[00:01] tree-sitter parse: 84 fields, 2 REDEFINES, 2 nested OCCURS DEPENDING ON
[00:01] AST hash: a4f7c2b1 · matched 11 prior engagements
[00:02] PASS 1 · local Gemma 12B · drafting TypeScript schema
[00:04] Gate 1 PARSER — 84 fields enumerated cleanly
[00:04] Gate 2 LLM_SANITY — no markdown wrappers, valid Zod imports
[00:04] Gate 3 FIELD_PARITY — 73 of 84 fields (Δ -11)
[00:04] Gate 4 DARK_CORNER — REDEFINES INSURED-NAME-DATA flattened to single object
◢ pass 1 rejected · loading burned book…
[00:05] burned book: scanning 12,847 historical failures…
[00:05] match BURN-2025-04-12-7f3a · same REDEFINES shape · 47 occurrences
[00:05] match BURN-2025-09-03-c91b · ODO mapped to fixed array · 14 occurrences
[00:05] match BURN-2026-01-22-a4d8 · nested OCCURS dropped · 9 occurrences
[00:06] anti-patterns injected into prompt (3):
"do not flatten REDEFINES into a single object — must use z.discriminatedUnion"
"do not map OCCURS DEPENDING ON to fixed-length tuples — must use z.array"
"do not collapse nested OCCURS into primitive types — must preserve nesting"
[00:07] PASS 2 · local DeepSeek-Coder 33B · consensus run
[00:11] Gate 1 PARSER — fields enumerated cleanly
[00:11] Gate 2 LLM_SANITY — valid Zod, no hallucinated imports
[00:11] Gate 3 FIELD_PARITY — 84 of 84 fields (Δ 0)
[00:11] Gate 4 DARK_CORNER — z.discriminatedUnion("name_type", […])
[00:11] Gate 5 MOCK_STRUCTURE — generated mock parses through schema (roundtrip clean)
◆ two-model consensus required · cross-checking output…
[00:12] consensus achieved between Gemma-12B and DeepSeek-33B (cosine sim 0.97)
[00:13] ◉ TRANSLATION VERIFIED · 84/84 PARITY · 0 HALLUCINATIONS
engagement hash a4f7c2b1 added to corpus · future runs reference this success
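The consensus line above reports a cosine similarity of 0.97 between the two models' outputs. The trace does not say how the outputs are embedded; as a toy sketch, token-frequency vectors are enough to show the computation:

```typescript
// Build a token-frequency vector from source text. This embedding is an
// assumption for illustration; the production comparison is unspecified.
function tokenVector(src: string): Map<string, number> {
  const v = new Map<string, number>();
  src.split(/\W+/).filter(Boolean).forEach((tok) => {
    v.set(tok, (v.get(tok) ?? 0) + 1);
  });
  return v;
}

// Standard cosine similarity over the sparse vectors: 1.0 means the two
// outputs use identical token distributions, 0.0 means fully disjoint.
function cosineSim(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  a.forEach((x, tok) => {
    dot += x * (b.get(tok) ?? 0);
    na += x * x;
  });
  b.forEach((y) => {
    nb += y * y;
  });
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}
```

A threshold on this score is what turns "two models agreed" into a number the gate can assert on.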

TOTAL TIME: 13.2s
MODELS USED: Gemma-12B + DeepSeek-33B
BURNED BOOK REFS: 3 anti-patterns / 70 occurrences
OUTPUT: 84/84 fields · 0 hallucinations

▸ NEXT STEP

Send us your worst copybook. We'll send back the trace.

No NDA required for the demo. We run it on local hardware in your presence. If the trace doesn't verify, you don't pay.

Send Your Worst Copybook