For teams building, fine-tuning & shipping coding agents

Veyl builds environments, evals, and audits for coding agents.

Hard, diverse SWE tasks with hidden, reward-hack-resistant graders — proven on a trust loop before they count.

Request a walkthrough · See Foundry Lite →

Every task is proven on a trust loop before it counts.

Tasks passed the trust loop: 47
Flake rate across repeats: 0.0
Vendors · author ≠ solver: 2

The Veyl stack

One factory, three surfaces: training environments, an independent grade, and audits of the CI you already trust.

Environments

Hard, diverse SWE task environments with hidden, reward-hack-resistant graders — verifiable enough to train on.

Evals

A deterministic, contamination-audited grade for your agent — the same verdict every run, from a vendor with no stake in the result.

CI Gap Reports

Real defects seeded into real codebases. The repo’s own suite stays green; the report shows what it missed.

Task environment — veyl_swe_ach_reversal_001

Workspace · Grader (hidden) · Adversaries

schema_version: "0.1"
workspace: moov-io/ach @ cc95789 (pruned, vendored)
grader:
  type: hidden_tests   # overlay-injected at grade time

✓ noop fails
✓ reference passes
✓ adversaries rejected
✓ repeat-stable


Grade

INDEPENDENT GRADE ↗ 94/100

The same verdict, every run.

Hidden behavioral contracts, replayed until stable.

1. author ≠ solver (different vendors)
2. flake_rate 0.0 across repeats
3. exports redact grader material
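
As a rough illustration of the third point, an export step could blank out grader material before a task leaves the factory. This is a minimal sketch, not Veyl's exporter: the `export_task` helper, the `REDACTED_KEYS` tuple, and the `"[redacted]"` placeholder are all assumptions; only the three top-level keys come from the task file shown above.

```python
import copy

# Assumption: everything under "grader" counts as grader material.
REDACTED_KEYS = ("grader",)

def export_task(task: dict) -> dict:
    """Return a copy of the task with grader material replaced by a placeholder."""
    exported = copy.deepcopy(task)  # never mutate the private task record
    for key in REDACTED_KEYS:
        if key in exported:
            exported[key] = "[redacted]"
    return exported

# Hypothetical in-memory form of the task file shown above.
task = {
    "schema_version": "0.1",
    "workspace": "moov-io/ach @ cc95789 (pruned, vendored)",
    "grader": {"type": "hidden_tests"},
}
exported = export_task(task)
```

The private record keeps its grader block; only the exported copy is redacted.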

CI Gap Report — moov-io/ach 20/32 SURVIVED

Seeded defect (one identifier dropped)	Their CI	Actually broken
isDebitTransactionCode − LoanDebit	✓ green	✗
isDebitTransactionCode − SavingsReturnNOCDebit	✓ green	✗
isCreditTransactionCode − GLReturnNOCCredit	✓ green	✗
calculateBatchAmounts − SavingsCredit	✓ green	✗
isDebitTransactionCode − CheckingDebit	✗ caught	✗
isCreditTransactionCode − SavingsCredit	✗ caught	✗

green CI · still broken
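
The idea behind the report can be sketched in a few lines: drop one identifier at a time from a classifier, rerun the visible suite, and record which mutants stay green. The code below is a toy in Python, not the moov-io/ach code (which is Go) and not Veyl's tooling. The code names `SavingsDebit` and `CheckingCredit`, the `weak_suite` function, and every helper name are assumptions made for illustration.

```python
# Toy classifier data; not the real moov-io/ach transaction code table.
DEBIT_CODES = frozenset({"CheckingDebit", "SavingsDebit", "LoanDebit"})

def make_classifier(codes):
    """Return an is-debit predicate backed by the given set of codes."""
    return lambda code: code in codes

def weak_suite(is_debit):
    """A visible suite that exercises only two of the three debit codes."""
    return (
        is_debit("CheckingDebit")
        and is_debit("SavingsDebit")
        and not is_debit("CheckingCredit")
    )

def seed_defects(codes):
    """Yield (dropped_identifier, mutant), with one identifier dropped per mutant."""
    for dropped in sorted(codes):
        yield dropped, make_classifier(codes - {dropped})

def gap_report(codes, suite):
    """Map each dropped identifier to 'green' (defect survived) or 'caught'."""
    return {
        dropped: ("green" if suite(mutant) else "caught")
        for dropped, mutant in seed_defects(codes)
    }

report = gap_report(DEBIT_CODES, weak_suite)
survived = [name for name, verdict in report.items() if verdict == "green"]
```

In this toy, dropping `LoanDebit` survives because no test ever asks about it, which is the same shape of gap the table shows.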

The gate

A hidden, adversary-tested grader scores every task.

Hidden behavioral contracts
Graders the agent never sees.
Adversary-tested
Plausible-but-wrong patches must fail.
Deterministic replay
Same verdict, every run — flake rate 0.0.
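
One way to make "flake rate 0.0" concrete, assuming flake rate means the share of repeated runs that disagree with the most common verdict. That definition and the function names below are assumptions; the page does not define the metric.

```python
from collections import Counter

def flake_rate(verdicts):
    """Fraction of repeats that disagree with the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one repeat")
    _, majority_count = Counter(verdicts).most_common(1)[0]
    return 1.0 - majority_count / len(verdicts)

def is_repeat_stable(verdicts):
    """A task is repeat-stable only if every replay gave the same verdict."""
    return flake_rate(verdicts) == 0.0

stable_runs = ["pass"] * 5
flaky_runs = ["pass", "pass", "fail", "pass"]
```

Under this definition `stable_runs` has a flake rate of 0.0 and `flaky_runs` has 0.25, so only the first would clear the gate.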

How it works

The trust loop

Author a task
Problem, workspace, and a hidden grader.
Hidden grader
Behavioral contracts, not visible unit tests.
Trust loop
noop fails · reference passes · repeat-stable.
Differential oracle
A different-vendor solver re-derives the fix.
Promote
Survivors copied in, re-gated, archived.
Export
SWE-bench-ready, grader material redacted.

task.yaml

The trust loop runs on every task — hand- or AI-authored. Contact us to run the gate on your own tasks.
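
The gate checks named above (noop fails, reference passes, adversaries rejected, repeat-stable) can be sketched as a single function. This is a simplified model under stated assumptions: patches are plain strings, the grader is any callable that returns True when it accepts a patch, and the `Task` and `trust_loop` names are hypothetical rather than Veyl's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Patch = str
Grader = Callable[[Patch], bool]  # True means the hidden grader accepts the patch

@dataclass(frozen=True)
class Task:
    noop: Patch                   # the untouched workspace
    reference: Patch              # the known-good fix
    adversaries: Sequence[Patch]  # plausible-but-wrong patches

def trust_loop(task: Task, grader: Grader, repeats: int = 5) -> Dict[str, bool]:
    """Replay the grader on every patch and report the four gate checks."""
    def replay(patch: Patch):
        return [grader(patch) for _ in range(repeats)]

    noop = replay(task.noop)
    reference = replay(task.reference)
    adversaries = [replay(p) for p in task.adversaries]
    return {
        "noop_fails": not any(noop),
        "reference_passes": all(reference),
        "adversaries_rejected": all(not any(run) for run in adversaries),
        "repeat_stable": all(
            len(set(run)) == 1 for run in [noop, reference, *adversaries]
        ),
    }

task = Task(noop="", reference="correct fix", adversaries=["hardcode expected output"])
strict = trust_loop(task, grader=lambda patch: patch == "correct fix")
lax = trust_loop(task, grader=lambda patch: patch != "")
```

With the strict grader all four checks come back True. The lax grader accepts any non-empty patch, so `adversaries_rejected` is False and the task would not be promoted.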

Why

Generation is cheap. Verification is scarce.

AI authors
produce candidates in bulk.
The trust gate
filters every one.
Humans move up
to judgment, not production.

Evidence

We audited a real codebase’s CI. It stayed green while broken.

20 of 32 seeded defects survived the repo’s entire 71-package test suite. One example: a one-line classification bug passed every test the repo has — and broke loan-credit reversals at runtime.

moov-io/ach is a public Apache-2.0 repository; this is our audit, not a customer engagement.

Get a walkthrough

See the full factory run on your tasks.

A private corpus of hardened environments, an independent-vendor oracle, and real-repo CI audits — run on your agent in a walkthrough.

Talk to us
