When one company hires another to build software, a familiar problem arises at the point of delivery: does the work actually meet the specification? Today, that question is answered through manual review — meetings, back-and-forth, and sometimes expensive third-party arbitration. It’s slow, subjective, and scales poorly.
We built a proof of concept that takes a different approach. It combines blockchain-based smart contract escrow with an AI agent acting as a third-party verifier. The client deposits funds into a smart contract. The contractor submits their work. An AI agent evaluates the deliverable against the agreed requirements — building it, running it, testing it, and scoring its compliance — then records the result on-chain. Depending on the score and the AI’s confidence, the contract automatically releases funds, notifies the contractor to fix and resubmit, or escalates to a human arbiter.
We’re publishing a research paper with the full concept, architecture, and results, alongside the source code on GitHub.
How It Works
The system has three components: a Solidity smart contract that manages escrow funds and project state, a Python verification pipeline that builds and tests deliverables in Docker, and an AI evaluator (powered by OpenAI) that assesses the test evidence and produces a structured score.
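The evaluator's output is a structured score rather than free-form text. As a rough sketch of what that record looks like (the field names here are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class VerificationResult:
    """Structured output of the AI evaluator (illustrative fields only)."""
    score: int             # 0-100 compliance score against the requirements
    confidence: str        # "high" or "low" -- drives automatic vs. escalated handling
    passed_tests: int
    total_tests: int
    findings: list = field(default_factory=list)  # human-readable evidence notes

# The buggy-subtraction case described below would produce something like:
result = VerificationResult(score=65, confidence="high",
                            passed_tests=13, total_tests=16,
                            findings=["subtraction endpoint returns the sum"])
```

Keeping the result machine-readable is what lets the smart contract act on it without human interpretation.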
The flow is straightforward:
- The client and contractor agree on requirements and a pass threshold. The client deposits funds into the smart contract.
- The contractor submits their deliverable. A cryptographic hash is recorded on-chain to create a tamper-proof record of what was submitted.
- The AI verifier retrieves the deliverable, builds and runs it in a container, tests it against the requirements (currently an OpenAPI specification), and asks an LLM to evaluate the evidence.
- The smart contract acts on the result:
  - High confidence, passing score — funds are released automatically.
  - High confidence, failing score — the contractor is notified and can resubmit.
  - Low confidence — the case is escalated to a human arbiter with a detailed report.
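The tamper-proof record in the submission step is simply a content hash. A minimal sketch of that step, assuming the deliverable is submitted as a single archive:

```python
import hashlib

def deliverable_hash(archive_bytes: bytes) -> str:
    """SHA-256 digest of the submitted archive; this digest (in bytes32
    form) is what gets recorded on-chain."""
    return hashlib.sha256(archive_bytes).hexdigest()

digest = deliverable_hash(b"example deliverable contents")
# Any later change to the archive changes the digest, so the on-chain
# record pins exactly what was submitted and verified.
```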
The key idea is graduated autonomy: clear-cut cases are resolved automatically, while ambiguous ones are escalated rather than guessed at. The AI doesn’t pretend to be infallible — it knows when it’s uncertain.
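The dispatch logic behind graduated autonomy fits in a few lines. A sketch in Python for readability (the real contract encodes this in Solidity, and the exact threshold semantics are illustrative):

```python
def decide(score: int, threshold: int, confidence: str) -> str:
    """Map an evaluation to a contract action under graduated autonomy."""
    if confidence != "high":
        return "escalate"   # ambiguous: hand off to a human arbiter
    if score >= threshold:
        return "release"    # clear pass: pay out the escrowed funds
    return "resubmit"       # clear fail: notify the contractor to fix

decide(92, 80, "high")   # release
decide(65, 80, "high")   # resubmit
decide(20, 80, "low")    # escalate
```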
What We Found
We tested the system with a simple calculator API specified by an OpenAPI document, using four deliberately crafted deliverable variants: a correct implementation, one with a missing endpoint, one with a logic bug, and one that crashes on an edge case.
The results matched expectations. When we submitted a buggy variant — where the subtraction endpoint returned the sum instead of the difference — the AI correctly identified the issue, scored the deliverable at 65/100 with high confidence, and the contract recorded a failure. The verification report pinpointed the exact problem: three of sixteen tests failed, all on the subtraction endpoint, with clear evidence showing expected vs. actual values.
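The logic-bug variant is easy to picture. A toy version of the faulty handler and the check that catches it, simplified here from an HTTP round-trip to a direct function call:

```python
def subtract_buggy(a: float, b: float) -> float:
    # The injected bug: returns the sum instead of the difference.
    return a + b

def check_subtraction(impl) -> dict:
    """One test case, in the style of the evidence the report records."""
    a, b = 10, 4
    actual = impl(a, b)
    return {"endpoint": "subtract", "expected": a - b,
            "actual": actual, "passed": actual == a - b}

evidence = check_subtraction(subtract_buggy)
# evidence["passed"] is False: expected 6, got 14
```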
More interesting was what happened when an environment issue arose during one run: the deliverable’s container started, but the test runner couldn’t connect in time. Rather than guessing, the AI reported a score of 20/100 with low confidence and the contract escalated to a human arbiter. The generated report explained exactly what went wrong and included instructions for the arbiter to resolve the case on-chain. This is the graduated autonomy model working as designed — when the AI can’t make a reliable determination, it says so.
Challenges and Honest Limitations
This is a proof of concept, not a production system. Several hard problems remain:
Non-determinism. LLMs can produce different scores for the same input. Two evaluations of the same deliverable might not agree exactly. We mitigate this with low-temperature settings and structured prompts, but it can’t be eliminated entirely.
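Part of the structured-prompt mitigation is refusing to act on malformed model output. A sketch of a strict validator (the schema is illustrative) in which anything outside the expected shape or ranges is downgraded to low confidence, and therefore escalated rather than acted on:

```python
import json

def parse_evaluation(raw: str) -> dict:
    """Parse the LLM's JSON reply; fall back to escalation on any output
    that doesn't match the expected shape or value ranges."""
    escalate = {"score": 0, "confidence": "low", "reason": "unparseable output"}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return escalate
    score, conf = data.get("score"), data.get("confidence")
    if not (isinstance(score, int) and 0 <= score <= 100):
        return escalate
    if conf not in ("high", "low"):
        return escalate
    return {"score": score, "confidence": conf, "reason": data.get("reason", "")}
```

Validation doesn’t make the model deterministic, but it guarantees that a nonsense reply can never trigger an automatic fund release.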
Adversarial gaming. If real money is at stake, contractors have incentive to write code that passes automated tests while being subtly broken — the software equivalent of teaching to the test.
Spirit vs. letter. The most valuable human judgment in software delivery is whether the work matches the intent of the requirements, not just the technical specification. AI is better at the letter than the spirit, though this gap is narrowing.
Legal liability. If the AI makes the wrong call, who is responsible? This question is currently unresolved and will need to be addressed before any real-money deployment.
What’s Next
The current PoC verifies REST APIs against OpenAPI specs, but the architecture is designed to support other requirement types — test suites, visual designs, performance SLAs — through pluggable runner and evaluator modules.
A natural extension is multi-oracle consensus: multiple independent AI verifiers evaluating the same deliverable, with funds released only when a majority agree. This provides redundancy against any single model’s blind spots and makes adversarial gaming significantly harder. The contract architecture already supports different verifier addresses per project.
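The consensus rule itself is simple. A sketch of majority voting over independent verifier scores, with each verifier's pass/fail vote derived from the agreed threshold (naming is illustrative):

```python
def consensus(scores: list, threshold: int) -> bool:
    """Release funds only if a strict majority of independent verifiers
    score the deliverable at or above the agreed threshold."""
    votes = [s >= threshold for s in scores]
    return sum(votes) * 2 > len(votes)

consensus([85, 90, 40], 80)  # True: 2 of 3 verifiers pass it
```

An attacker would then need to fool a majority of independently run models at once, rather than a single evaluator.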
The full discussion of the concept, architecture, and future directions is in the whitepaper (PDF). The source code, including the smart contract, verification pipeline, and sample deliverables, is available on GitHub.
