How long does a review take?

A typical ADO Pilot review completes in 2 to 5 minutes via the Anthropic Batch API. If a batch hits its 10-minute timeout, the orchestrator falls back to the synchronous Messages API. In rare cases where both the Batch and the Messages API calls have to retry, a review can take longer. This page explains what affects duration and what each step is doing while you wait.

Typical timing by PR size

PR size                                         Typical duration    Worst case at the 95th percentile
Small: 1 to 3 files, fewer than 500 lines       2 to 3 minutes      about 5 minutes
Medium: 5 to 15 files, 500 to 1,500 lines       3 to 5 minutes      about 7 minutes
Large: 20 or more files, 1,500 or more lines    5 to 10 minutes     up to ~10 minutes (Batch timeout)

These ranges assume a healthy provider. Actual times vary with code complexity and Anthropic API load.
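
As a rough illustration, the table above can be turned into a lookup keyed on changed lines. This is only a sketch, not ADO Pilot code: the name `estimate_review_minutes` is hypothetical, it buckets on line count alone and ignores file count, and it assumes a PR of exactly 1,500 lines counts as large, which the table leaves open.

```python
from typing import NamedTuple, Tuple


class Estimate(NamedTuple):
    bucket: str
    typical_minutes: Tuple[int, int]  # (low, high) from the "Typical duration" column
    p95_minutes: int                  # from the worst-case column


def estimate_review_minutes(changed_lines: int) -> Estimate:
    """Map a changed-line count to the table's timing ranges (illustrative only)."""
    if changed_lines < 0:
        raise ValueError("changed_lines must be non-negative")
    if changed_lines < 500:
        return Estimate("small", (2, 3), 5)
    if changed_lines < 1500:  # assumption: exactly 1,500 lines is treated as large
        return Estimate("medium", (3, 5), 7)
    return Estimate("large", (5, 10), 10)
```

The numbers are the table's healthy-provider ranges, so treat the result as a planning hint rather than a guarantee.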

What the review is doing while you wait

Queued                        under a second
Fetch diff and enrichment     ~10 to 30 seconds
Pass 1 (Batch API)            1 to 3 minutes
Pass 2 (Batch API)            1 to 3 minutes
Finalize and post             under a second

  • Queued. The orchestrator records the review and acquires the per-org concurrency slot.
  • Fetch and enrich. The diff comes from Azure DevOps. Tree-sitter syntax summaries and Semgrep findings are computed locally and attached to the prompt.
  • Pass 1. The Anthropic Batch API runs the high-recall sweep. Batch is asynchronous and cost-optimized — the orchestrator polls for completion rather than holding an open connection.
  • Pass 2. The same Batch path runs the critical re-check, reusing the cached diff and system prompt to keep the cost low. See Why two passes for the design rationale.
  • Finalize. Confirmed findings post as inline comments, the tracking comment finalizes to PASS, ADVISORY, or FAIL, and the ai-pr-review status check updates.
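
The polling behavior described for Pass 1 and Pass 2 can be sketched as a loop with a deadline. This is a minimal illustration, not the orchestrator's actual code: `poll_until_done` and the `is_done` callback are hypothetical names, and the 5-second interval is an assumption. Only the 10-minute timeout comes from this page.

```python
import time
from typing import Callable


def poll_until_done(
    is_done: Callable[[], bool],
    timeout_s: float = 600.0,   # the 10-minute batch timeout
    interval_s: float = 5.0,    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if is_done() reports completion before the deadline, else False."""
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False  # caller decides what to do about the timeout
        sleep(interval_s)
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting in real time. In a real deployment `is_done` would wrap whatever status check the batch provider exposes.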

Why some reviews are slower

Things that make a review faster:

  • Smaller diff (fewer files, fewer lines).
  • Simple code without deep cross-file dependencies.
  • A healthy Anthropic Batch API.

Things that make a review slower:

  • Larger diffs, especially over 1,500 changed lines.
  • High provider load — Anthropic queues batch requests under heavy traffic.
  • Rare: Azure DevOps API delays when fetching the diff or posting comments.

Fallback to the Messages API

Most reviews run end to end on the Anthropic Batch API. When a batch does not complete within roughly 10 minutes, the orchestrator falls back:

  • It cancels the batch.
  • It re-runs the same pass through the synchronous Messages API instead.
  • The review still completes, with the same findings format and the same tracking-comment lifecycle. You will not see a difference in the output.

The tradeoff is internal: Messages-API requests cost more and add real-time request latency, so the fallback is reserved for the cases where batch is genuinely stuck. In normal operation fewer than 5 percent of reviews hit this path.
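
The fallback steps above can be sketched as a small control-flow function. Everything named here is hypothetical (`run_pass`, `PassResult`, and the three callbacks stand in for the orchestrator's real batch and Messages API calls). The point is only the ordering: wait for the batch, cancel it on timeout, then re-run the same pass synchronously.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PassResult:
    findings: List[str]
    path: str  # "batch" or "messages"


def run_pass(
    wait_for_batch: Callable[[], Optional[List[str]]],  # returns None on timeout
    cancel_batch: Callable[[], None],
    run_sync: Callable[[], List[str]],
) -> PassResult:
    """Run one review pass, falling back to the synchronous path on batch timeout."""
    findings = wait_for_batch()
    if findings is not None:
        return PassResult(findings, "batch")
    cancel_batch()  # stop the stuck batch before re-running the pass
    return PassResult(run_sync(), "messages")
```

Both paths return the same `PassResult` shape, mirroring the statement that the output looks the same whichever path ran.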

Very small PRs (under roughly 500 changed lines on Sonnet, or 1,000 on Opus) skip the Messages API fallback because synchronous calls aren't cost-effective at that size; instead the orchestrator re-queues onto a fresh batch. The user-visible outcome is the same — the review eventually completes — but it may take longer when this re-queue path triggers.
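
The size rule in the previous paragraph amounts to a threshold check per model family. The sketch below is illustrative: `recovery_path` and the string labels are hypothetical, and the thresholds are the approximate figures quoted above, not exact cut-offs.

```python
# Approximate changed-line thresholds below which the orchestrator re-queues
# onto a fresh batch instead of falling back to the Messages API.
REQUEUE_BELOW_LINES = {"sonnet": 500, "opus": 1000}


def recovery_path(model_family: str, changed_lines: int) -> str:
    """Pick the recovery path after a batch timeout (illustrative only)."""
    threshold = REQUEUE_BELOW_LINES.get(model_family.lower())
    if threshold is not None and changed_lines < threshold:
        return "requeue_batch"
    return "messages_fallback"
```

For example, `recovery_path("sonnet", 300)` returns `"requeue_batch"`, while `recovery_path("sonnet", 800)` returns `"messages_fallback"`. Unknown model families default to the Messages API fallback in this sketch, which is an assumption.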

Where to see the duration

The tracking comment's footer carries the wall-clock duration of the run:

<sub>ADO Pilot v0.1.5 · [full review](https://app.adopilot.dev/reviews/{reviewId}) · model: claude-sonnet-4-6 · took 2m 41s</sub>

The full review detail page in the admin portal breaks the duration down by phase (queued, Pass 1, Pass 2, finalize) so you can see where the time went.
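
If you want to track durations yourself, the `took 2m 41s` fragment of the footer is easy to parse. The sketch below assumes the footer always uses the `Nm Ns` shape shown above, possibly with only minutes or only seconds. That shape is inferred from the single example on this page, and `footer_duration_seconds` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches "took 2m 41s", "took 45s", or "took 3m" (assumed variants).
_TOOK = re.compile(r"took\s+(?:(\d+)m)?\s*(?:(\d+)s)?")


def footer_duration_seconds(footer: str) -> Optional[int]:
    """Extract the wall-clock duration in seconds, or None if it is absent."""
    match = _TOOK.search(footer)
    if match is None or (match.group(1) is None and match.group(2) is None):
        return None
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2) or 0)
    return minutes * 60 + seconds
```

Applied to the example footer above, this returns 161 seconds.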

Speeding up your reviews

The biggest lever is PR size. A 500-line review is meaningfully faster than a 5,000-line review and almost always produces better findings, because the model can hold more of the change in working context. If you have a large refactor to ship, split it into per-subsystem PRs. Each one reviews quickly, and the rolling feedback is more useful than one slow verdict on the whole change.
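
One way to plan such a split is to group the changed files by top-level directory and open one PR per group. This is a hedged sketch: treating a top-level directory as a "subsystem" is an assumption that will not fit every repository, and `group_by_top_level` is a hypothetical helper, not part of ADO Pilot.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_top_level(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group repository-relative paths by their first path segment."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        # Files at the repository root have no directory segment.
        top = path.split("/", 1)[0] if "/" in path else "(root)"
        groups[top].append(path)
    return dict(groups)
```

Feeding it the output of `git diff --name-only` gives a first cut at which files could travel together. Cross-directory changes that must land atomically still need to stay in one PR.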