BENCHMARK · ACORD-AL3 · 2026-05-03
VOL. 1 · COBOL → TYPESCRIPT · NESTED REDEFINES

The ACORD AL3 Benchmark.

Every AI COBOL translator fails on nested REDEFINES and OCCURS DEPENDING ON. We tested them on the industry's hardest copybook. Only one passed Field Parity.

Zero-egress · Deterministic · Verifiable
▸ THE CHALLENGE

This is the nightmare.

84 fields across 9 interlocking records. REDEFINES that share memory, OCCURS DEPENDING ON arrays, nested arrays-of-arrays. If the AI hallucinates, the insurance claim fails. The carrier eats the loss. Procurement finds out. Your engagement ends.

ACORD AL3 has been the auto/home transmission standard since 1985. Every personal-lines insurer in North America has it in production. Every modernization vendor claims to handle it. We made them prove it.

REDEFINES (memory union) · OCCURS DEPENDING ON (variable array) · Nested arrays of arrays · 84 fields across 9 records
ACORD_AL3_NIGHTMARE.CPY · excerpt
◢ HARDEST KNOWN
01  INSURED-NAME-DATA.
    02  NAME-TYPE            PIC X(1).
    02  INSURED-LAST-NAME    PIC X(30).
    02  INSURED-FIRST-NAME   PIC X(20).
    02  INSURED-MIDDLE-NAME  PIC X(15).
01  INSURED-ADDRESS-DATA REDEFINES INSURED-NAME-DATA.
    02  ADDR-INDICATOR       PIC X(1).
    02  ADDRESS-LINE-1       PIC X(35).
    02  CITY                 PIC X(25).
    02  ZIP-CODE             PIC X(9).
01  VEHICLE-DETAIL.
    02  VEHICLE-COUNT        PIC 9(2).
    02  VEHICLE-ENTRY        OCCURS 0 TO 10 TIMES
                             DEPENDING ON VEHICLE-COUNT.
        03  VIN              PIC X(17).
        03  DRIVER-COUNT     PIC 9(1).
        03  DRIVER-LIST      OCCURS 0 TO 5 TIMES
                             DEPENDING ON DRIVER-COUNT.
            04  DRIVER-NAME     PIC X(30).
            04  DRIVER-LICENSE  PIC X(20).
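For orientation, here is the target shape this copybook implies, sketched in plain TypeScript rather than the Zod schemas the pipeline emits. Field names are camelCased for illustration, and the `kind` discriminant is an assumption of this sketch, not part of AL3:

```typescript
// The REDEFINES pair becomes a discriminated union: NAME-TYPE and
// ADDR-INDICATOR occupy the same byte in COBOL memory, so exactly one
// interpretation of the record is valid at a time.
type InsuredNameData = {
  kind: "name";
  nameType: string;        // PIC X(1)
  lastName: string;        // PIC X(30)
  firstName: string;       // PIC X(20)
  middleName: string;      // PIC X(15)
};

type InsuredAddressData = {
  kind: "address";
  addrIndicator: string;   // PIC X(1)
  addressLine1: string;    // PIC X(35)
  city: string;            // PIC X(25)
  zipCode: string;         // PIC X(9)
};

type InsuredRecord = InsuredNameData | InsuredAddressData;

function isNameRecord(r: InsuredRecord): r is InsuredNameData {
  return r.kind === "name";
}

// OCCURS DEPENDING ON becomes a variable-length array; the counter
// field is retained so round-trips stay byte-faithful.
type DriverEntry = { driverName: string; driverLicense: string };

type VehicleEntry = {
  vin: string;               // PIC X(17)
  driverCount: number;       // PIC 9(1), 0..5
  driverList: DriverEntry[]; // OCCURS 0 TO 5 DEPENDING ON DRIVER-COUNT
};

type VehicleDetail = {
  vehicleCount: number;      // PIC 9(2), 0..10
  vehicleEntry: VehicleEntry[];
};
```

Flattening either structure (dropping the union, or fixing the array length) is exactly the failure mode the benchmark measures.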
▸ THE COMPARISON · 6 TOOLS · 1 COPYBOOK

Five vendors hallucinated. One verified.

Standard LLM · ChatGPT, Claude, Gemini · UNSAFE FOR PRODUCTION
REDEFINES: Ignored / silently flattened
FIELD PARITY: Failed (dropped 11 of 84 fields)
DATA EGRESS: Cloud required (high risk)

IBM Watson Code Assistant for Z · IBM · PROCUREMENT RISK
REDEFINES: Requires manual intervention
FIELD PARITY: Unknown (black box output)
DATA EGRESS: Cloud / API bound

AWS Mainframe Modernization (BluAge) · AWS · HIGH FRICTION
REDEFINES: Compilation-dependent
FIELD PARITY: Not natively exposed to QA
DATA EGRESS: Requires heavy-lift migration

LZLabs Software Defined Mainframe · LZLabs · RE-HOST, NOT MODERNIZE
REDEFINES: Emulated — no source-level translation
FIELD PARITY: N/A (re-hosts, not translates)
DATA EGRESS: Self-hosted, license-bound

Heirloom Computing PaaS · Heirloom · AUDIT BURDEN ON BUYER
REDEFINES: Auto-translated — opaque output
FIELD PARITY: Per-engagement audit required
DATA EGRESS: Cloud only

KillSesh (Local AI) · On-prem — runs in your VPC · ONLY VERIFIED PATH
REDEFINES: Consensus Engine → z.discriminatedUnion()
FIELD PARITY: 100% (84/84 fields verified)
DATA EGRESS: Zero (runs in your VPC)
Methodology: identical input copybook (84 fields, 2 REDEFINES, 2 nested OCCURS DEPENDING ON). Each tool was run with default settings. Output was evaluated for field-count parity, REDEFINES handling, and required data residency. Sources are cited in the full POC report.

▸ THE METHOD · REFERENCES EVERY PAST FAILURE

Why this works where everyone else failed.

Past hallucinations are our ground truth. Every translation that has ever failed in production gets anonymized, hashed, and stored — then actively referenced as an anti-pattern on every future run. Every customer's failure makes the next translation safer.

STEP 01

Ground truth from AST, not from the model

The LLM never sees raw COBOL. It receives a deterministic JSON map of fields, levels, and offsets — extracted by Tree-sitter before any inference runs. Spatial reasoning is removed from the model entirely. Fields that don't exist in the AST cannot exist in the output. This is where every standard LLM falls apart: they're asked to reason about memory layout from text. We never let them try.
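A sketch of what such a field map could look like. The `AstField` shape and both entries are illustrative; the report does not show the actual wire format:

```typescript
// Hypothetical shape of the deterministic field map the parser emits.
// Every property is extracted by the parser, never inferred by the model.
interface AstField {
  name: string;       // e.g. "INSURED-LAST-NAME"
  level: number;      // COBOL level number: 1, 2, 3, ...
  picture: string;    // e.g. "X(30)"; empty for group items
  offset: number;     // byte offset within the record
  length: number;     // byte length derived from the PIC clauses
  redefines?: string; // target field, when this item REDEFINES another
  occurs?: { min: number; max: number; dependingOn: string };
}

// Two entries from the excerpt above, as the model would receive them.
// INSURED-NAME-DATA is a group item: 1 + 30 + 20 + 15 = 66 bytes.
const fieldMap: AstField[] = [
  { name: "INSURED-NAME-DATA", level: 1, picture: "", offset: 0, length: 66 },
  { name: "NAME-TYPE", level: 2, picture: "X(1)", offset: 0, length: 1 },
];
```

Because the model only ever consumes this map, a field absent from the map cannot appear in its output, and every offset question is already answered before inference starts.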

STEP 02

Field parity is a count, not a vibe

Gate 3 asserts len(cobol_fields) == len(typescript_props). If the count is off by one, the build fails. There's no confidence threshold, no 'looks right to me,' no human-in-the-loop required. The diff is the verdict, computed before the code ever leaves the pipeline. Every other tool ships output and lets the customer find the dropped fields in production.
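The gate reduces to a strict count equality. A minimal sketch, with illustrative names:

```typescript
// Gate 3 sketch: a hard count comparison, not a similarity score.
// Any nonzero delta aborts the build; there is no override path.
function gate3FieldParity(
  cobolFields: string[],
  typescriptProps: string[],
): void {
  const delta = typescriptProps.length - cobolFields.length;
  if (delta !== 0) {
    throw new Error(`Gate 3 FIELD_PARITY failed (delta ${delta})`);
  }
}
```

A count check is deliberately crude: it cannot tell you *which* field was dropped, but it can never be argued with, which is what makes it a gate rather than a review step.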

STEP 03

The Burned Book — past failures become guardrails

Every translation that has ever failed Gate 3 or Gate 4 is anonymized, hashed by AST shape, and stored. When a new copybook matches a known failure signature, the corresponding anti-patterns get injected into the prompt with the directive 'do not do this — has failed N times.' Two-model consensus is required before output emits. The corpus only grows. Today: 12,847 failures referenced. After 100 pilots: an irreducible moat.
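A minimal sketch of the lookup, under the assumption that the signature hashes the AST *shape* (level numbers and clause kinds, not field names), so a renamed copybook still matches prior failures. All identifiers here are illustrative:

```typescript
import { createHash } from "node:crypto";

// One anonymized failure from the burned book.
interface BurnRecord {
  signature: string;    // hash of the failing copybook's AST shape
  antiPattern: string;  // directive injected into the next prompt
  occurrences: number;  // how many times this failure has recurred
}

// Hash an order-sensitive shape description,
// e.g. ["01", "02:REDEFINES", "02:ODO"].
function shapeSignature(shape: string[]): string {
  return createHash("sha256").update(shape.join("|")).digest("hex").slice(0, 8);
}

// Return the prompt directives for every burn matching this shape.
function matchedDirectives(shape: string[], book: BurnRecord[]): string[] {
  const sig = shapeSignature(shape);
  return book
    .filter((b) => b.signature === sig)
    .map((b) => `do not ${b.antiPattern} (has failed ${b.occurrences} times)`);
}
```

The design point is that matching happens on structure, not text: two carriers never see each other's copybooks, but both benefit from the same failure signature.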

▸ EXAMPLE · THIS COPYBOOK · THIS RUN · NO EDITS

Watch the consensus engine reject pass 1 and recover with pass 2.

Below is the actual run trace from translating ACORD_AL3_NIGHTMARE.CPY. Pass 1 (Gemma 12B alone) drops 11 fields and flattens the REDEFINES. The pipeline rejects it, queries the burned book, injects three matched anti-patterns, and re-runs as a two-model consensus. Pass 2 lands clean.

/var/log/killsesh/run.a4f7c2.trace
COMPLETED · 13.2s · 2 PASSES · 84/84 PARITY
$ killsesh translate ACORD_AL3_NIGHTMARE.CPY
[00:01] tree-sitter parse: 84 fields, 2 REDEFINES, 2 nested OCCURS DEPENDING ON
[00:01] AST hash: a4f7c2b1 · matched 11 prior engagements
[00:02] PASS 1 · local Gemma 12B · drafting TypeScript schema
[00:04] Gate 1 PARSER — 84 fields enumerated cleanly
[00:04] Gate 2 LLM_SANITY — no markdown wrappers, valid Zod imports
[00:04] Gate 3 FIELD_PARITY — 73 of 84 fields (Δ -11)
[00:04] Gate 4 DARK_CORNER — REDEFINES INSURED-NAME-DATA flattened to single object
◢ pass 1 rejected · loading burned book…
[00:05] burned book: scanning 12,847 historical failures…
[00:05] match BURN-2025-04-12-7f3a · same REDEFINES shape · 47 occurrences
[00:05] match BURN-2025-09-03-c91b · ODO mapped to fixed array · 14 occurrences
[00:05] match BURN-2026-01-22-a4d8 · nested OCCURS dropped · 9 occurrences
[00:06] anti-patterns injected into prompt (3):
"do not flatten REDEFINES into a single object — must use z.discriminatedUnion"
"do not map OCCURS DEPENDING ON to fixed-length tuples — must use z.array"
"do not collapse nested OCCURS into primitive types — must preserve nesting"
[00:07] PASS 2 · local DeepSeek-Coder 33B · consensus run
[00:11] Gate 1 PARSER — fields enumerated cleanly
[00:11] Gate 2 LLM_SANITY — valid Zod, no hallucinated imports
[00:11] Gate 3 FIELD_PARITY — 84 of 84 fields (Δ 0)
[00:11] Gate 4 DARK_CORNER — z.discriminatedUnion("name_type", […])
[00:11] Gate 5 MOCK_STRUCTURE — generated mock parses through schema (roundtrip clean)
◆ two-model consensus required · cross-checking output…
[00:12] consensus achieved between Gemma-12B and DeepSeek-33B (cosine sim 0.97)
[00:13] ◉ TRANSLATION VERIFIED · 84/84 PARITY · 0 HALLUCINATIONS
engagement hash a4f7c2b1 added to corpus · future runs reference this success
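The consensus line above reports a cosine similarity of 0.97 between the two models' outputs. The trace does not say how the outputs are embedded; as a toy sketch, token-frequency vectors are enough to show the computation:

```typescript
// Build a token-frequency vector from source text. This embedding is an
// assumption for illustration; the production comparison is unspecified.
function tokenVector(src: string): Map<string, number> {
  const v = new Map<string, number>();
  src.split(/\W+/).filter(Boolean).forEach((tok) => {
    v.set(tok, (v.get(tok) ?? 0) + 1);
  });
  return v;
}

// Standard cosine similarity over the sparse vectors: 1.0 means the two
// outputs use identical token distributions, 0.0 means fully disjoint.
function cosineSim(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  a.forEach((x, tok) => {
    dot += x * (b.get(tok) ?? 0);
    na += x * x;
  });
  b.forEach((y) => {
    nb += y * y;
  });
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}
```

A threshold on this score is what turns "two models agreed" into a number the gate can assert on.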

TOTAL TIME: 13.2s
MODELS USED: Gemma-12B + DeepSeek-33B
BURNED BOOK REFS: 3 anti-patterns / 70 occurrences
OUTPUT: 84/84 fields · 0 hallucinations

▸ NEXT STEP

Send us your worst copybook. We'll send back the trace.

No NDA required for the demo. We run it on local hardware in your presence. If the trace doesn't verify, you don't pay.

Send Your Worst Copybook