How Docker's Virtual Agent Fleet Accelerates Development and Testing


Introduction

At Docker, the Coding Agent Sandboxes team (known as “sbx”) has pioneered a novel approach to software testing and release management. They built a secure, microVM-based isolation layer for running AI coding agents like Claude Code, Gemini, Codex, Docker Agent, and Kiro. Each agent runs with full autonomy inside its own sandbox—complete Docker daemon, network, and filesystem—without touching the host system. But the team didn’t stop there. Over the past few weeks, they created a virtual team of seven AI agent roles that autonomously test the product, triage issues, post release notes, and even fix bugs. This fleet of agents runs entirely in CI, yet it was designed to work seamlessly on a developer’s laptop first. This article explores how the Fleet was built, its design principles, and why it lets the sbx team ship faster.

Source: www.docker.com

The Fleet: Seven AI Agent Roles

The Fleet is not a single agent but a collection of specialized roles, each with a distinct persona and set of responsibilities. One of the key roles is the /cli-tester, an exploratory tester that builds binaries, exercises CLI commands, and reports issues. Other roles include a build engineer, a release manager, a bug fixer, a triage specialist, a documentation writer, and a QA coordinator. Each role is defined by a skill file—a markdown document written for Claude Code. The skill file describes the agent's persona, its knowledge, the tools it can use, and how it should make decisions.
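The article doesn't publish the team's actual skill format, but based on the description above, a minimal sketch of what a skill file might contain could look like the following. Every section name and line of content here is an illustrative assumption, not the sbx team's real file:

```markdown
# /cli-tester — Exploratory CLI Tester (illustrative sketch)

## Persona
You are an exploratory tester for the sbx CLI. You are skeptical and
thorough, and you investigate surprises instead of stopping at them.

## Knowledge
- How to build the sbx binaries from source
- The documented CLI commands and their expected behavior

## Tools
- Shell access for building and running the CLI
- Issue-tracker access for filing reports

## Decision-making
- If a command fails unexpectedly, reproduce it, narrow the cause,
  and file an issue with exact steps and output.
- Prefer exercising edge cases over repeating happy paths.
```

The point of the format is that it encodes judgment (the "Decision-making" section) rather than a fixed test script.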

How Skills Work

Think of a skill not as a script that says “run these steps,” but as a role description that says “you are the build engineer, here’s what you know and how you make decisions.” This distinction matters because agents need judgment, not just instructions. When a test fails unexpectedly, a script stops. A role investigates. The same skill file produces the same behavior whether it runs on a developer’s laptop or in a CI workflow. This consistency is a cornerstone of the Fleet’s reliability.

Local First, CI Second

The design principle behind the Fleet is elegantly simple: every skill runs on your machine first. When the team built the /cli-tester skill, they didn’t start by writing a GitHub workflow. They invoked it locally, watched it build binaries, exercise CLI commands, find issues, and report them. They tweaked the skill until it did the right thing in their terminal. Only then did they wire it into a CI workflow.
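In practice, "invoked it locally" might look like the terminal session below. This is a hedged sketch: the skill name comes from the article, but the exact invocation through the Claude Code CLI (`claude -p` for non-interactive runs) is an assumption about the team's setup:

```shell
# Run the /cli-tester skill from a terminal (invocation is an assumption).
# -p runs Claude Code non-interactively and prints the result.
claude -p "/cli-tester"

# Watch the agent's output, edit the skill's markdown file,
# then re-invoke. The edit-run loop takes seconds, not a CI round trip.
```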

Debugging and Iteration Speed

This approach avoids the painful cycle of commit-push-wait-read-logs. When a skill runs locally first, iteration takes seconds, not minutes. You see the agent think. You see where it gets confused. You fix the skill file, re-invoke, and try again. CI becomes just another runtime for the same skill. The /cli-tester that runs nightly on macOS, Linux, and Windows runners is the exact skill the team invokes from their terminals. The workflow sets up the environment, checks out the code, and calls the skill. That’s it. No separate “CI version,” no translation layer. One skill, two runtimes.
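A workflow that "sets up the environment, checks out the code, and calls the skill" could be sketched as a GitHub Actions job like the one below. The file name, the nightly schedule, and the skill-invocation step are all assumptions for illustration; only the three-platform matrix and the call-the-same-skill idea come from the article:

```yaml
# .github/workflows/cli-tester.yml (illustrative; names and steps are assumptions)
name: nightly-cli-tester
on:
  schedule:
    - cron: "0 3 * * *"   # nightly run
jobs:
  cli-tester:
    strategy:
      matrix:
        os: [macos-latest, ubuntu-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # Invoke the exact same skill a developer would run locally.
      - name: Run /cli-tester skill
        run: claude -p "/cli-tester"
```

The workflow contains no testing logic of its own; it is only a runtime for the skill, which is what keeps local and CI behavior identical.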

Real-World Impact

The Coding Agent Sandboxes CLI tool (sbx) runs on macOS, Linux, and Windows. Every release needs testing across all platforms, across upgrade paths between versions, and under sustained load to catch resource leaks. The team also needs daily visibility into what shipped and a way to triage the growing issue backlog without it becoming a full-time job. The Fleet handles these tasks autonomously.


This automation frees the human team to focus on higher-level design and complex problems. As one team member put it, “The Fleet does the repetitive work so we can think about the future.”

The Future of Autonomous Testing

The Fleet is not a one-off experiment. The team plans to expand the number of roles and integrate them more deeply with their development pipeline. They see a future where every pull request triggers a suite of agent-driven tests, where triage is fully automated, and where release management is a hands-off process. The key lesson is that autonomous agents should be treated as team members, not scripts. By giving them judgment through skill files and by iterating locally before deploying to CI, the sbx team has created a virtual workforce that ships faster, with fewer regressions.

To learn more about the /cli-tester or how to contribute new skills, check out the Fleet overview or the design principles. The complete code and skill files are available in the Docker Coding Agent Sandboxes repository.

In summary, the Fleet demonstrates that when you treat AI agents as roles with personality and context, they can become reliable, autonomous contributors—as long as you let them run on your laptop first.
