Overview
Chaos Agents deploys AI-powered agents into your infrastructure that autonomously discover resources, inject faults, observe the blast radius, and cleanly revert. Instead of writing brittle test scripts, you declare what to stress and the agents figure out how.
Every chaos action returns a rollback handle. When the experiment window expires — or on any failure — all actions revert in LIFO order. No residue.
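Conceptually, the rollback log is just a stack of undo actions. A minimal sketch, assuming simplified synchronous closures (the real RollbackLog in chaos-core is richer):

```rust
// Sketch of LIFO rollback: each successful action pushes an undo closure;
// on window expiry or any failure, they are popped and applied in reverse.
struct RollbackLog {
    handles: Vec<Box<dyn FnOnce()>>,
}

impl RollbackLog {
    fn new() -> Self {
        Self { handles: Vec::new() }
    }

    // Record how to undo an action that just succeeded.
    fn push(&mut self, undo: Box<dyn FnOnce()>) {
        self.handles.push(undo);
    }

    // Revert everything, most recent action first.
    fn unwind(mut self) {
        while let Some(undo) = self.handles.pop() {
            undo();
        }
    }
}
```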
Architecture
- chaos-core — The brain. Agent trait, Skill trait, Orchestrator, RollbackLog, EventSink, Config.
- chaos-db — Database chaos agent. Connects to PostgreSQL/MySQL, introspects via information_schema.
- chaos-k8s — Kubernetes chaos agent. Discovers workloads via the Kubernetes API.
- chaos-server — Server chaos agent. Connects over SSH for host-level disruptions.
- chaos-llm — LLM integration layer. Natural language experiment definitions.
How It Works
Quickstart
Write an experiment config, then run it. Here's a minimal database chaos example that inserts load into PostgreSQL and changes a config parameter:
```yaml
experiments:
  - name: "postgres-quick-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/mydb"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
          tables: ["orders", "payments"]
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "8MB"
    duration: "5m"
```
Or skip the config entirely and describe what you want in plain English:
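For example (the prompt below is illustrative; phrase it however you like):

```shell
chaos agent "insert 5000 rows into the orders table on my staging Postgres, then revert after 5 minutes"
```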
CLI Commands
Chaos Agents provides six CLI commands for running, planning, scheduling, and validating experiments.
| Command | Description |
|---|---|
| chaos run | Execute experiments from a YAML config file. Supports --dry-run to validate without executing. |
| chaos agent | Describe chaos in plain English. The LLM generates a config, you review, then it executes. Use -y to auto-approve or --save to export. |
| chaos plan | LLM-driven planning only (no execution). Outputs a generated experiment config. |
| chaos daemon | Run experiments on a cron schedule with concurrency control and graceful shutdown. |
| chaos validate | Validate a config file without executing. Checks YAML parsing, skill existence, and parameters. |
| chaos list-skills | List all available skills. Filter by --target database|kubernetes|server. |
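Typical invocations, assuming the config file is passed as a positional argument (file names are placeholders):

```shell
chaos validate my-experiment.yaml
chaos run my-experiment.yaml --dry-run
chaos run my-experiment.yaml
chaos list-skills --target database
```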
Experiment Config
Define experiments declaratively in YAML. Each experiment specifies a target (database, kubernetes, or server), a target_config with connection details, a list of skills to invoke, and a duration window.
```yaml
experiments:
  - name: "postgres-load-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/postgres"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 10000
          tables: ["users", "orders"]
          count: 1
      - skill_name: "db.select_load"
        params:
          query_count: 500
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "4MB"
    duration: "5m"
    parallel: false
```
Check a config without executing it: chaos validate my-experiment.yaml.
Use --dry-run to see the execution plan without applying changes.
Daemon / Scheduling
Run chaos on a cron schedule. The daemon enforces concurrency limits via a semaphore and supports graceful shutdown via SIGTERM. Each scheduled run gets a fresh orchestrator.
```yaml
settings:
  max_concurrent: 2

experiments:
  - experiment:
      name: "nightly-db-chaos"
      target: database
      target_config:
        connection_url: "postgres://chaos:pw@db.internal:5432/staging"
        db_type: postgres
      skills:
        - skill_name: "db.insert_load"
          params:
            rows_per_table: 5000
      duration: "15m"
    schedule: "0 0 2 * * *"   # Daily at 2 AM
    enabled: true

  - experiment:
      name: "hourly-pod-chaos"
      target: kubernetes
      target_config:
        namespace: "staging"
        label_selector: "app=web"
      skills:
        - skill_name: "k8s.pod_kill"
          params:
            namespace: "staging"
            count: 1
      duration: "5m"
    schedule: "0 0 * * * *"   # Every hour
    enabled: true
```
Cron format: sec min hour day_of_month month day_of_week (6 fields). Use --pid-file /var/run/chaos.pid for process management.
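Starting the daemon might look like this (the positional schedule-file argument is an assumption; the --pid-file flag is documented above):

```shell
chaos daemon schedule.yaml --pid-file /var/run/chaos.pid
```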
Database Agent
Connects to PostgreSQL, MySQL, CockroachDB, YugabyteDB, or MongoDB. Auto-discovers tables via information_schema (or collection listing for Mongo) and runs chaos skills against them.
```yaml
experiments:
  - name: "postgres-load-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/postgres"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 10000
          tables: ["users", "orders"]
      - skill_name: "db.select_load"
        params:
          query_count: 500
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "4MB"
    duration: "5m"
    parallel: false

  - name: "mysql-update-chaos"
    target: database
    target_config:
      connection_url: "mysql://chaos:password@localhost:3306/mydb"
      db_type: mysql
    skills:
      - skill_name: "db.update_load"
        params:
          rows: 200
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
    duration: "3m"
```
Supported db_type values: postgres, mysql, cockroach, yugabyte, mongo.
| Skill | Parameters | Rollback |
|---|---|---|
| db.insert_load | rows_per_table, tables | Deletes all inserted rows by tracked IDs |
| db.update_load | rows | Restores original values from snapshot |
| db.select_load | query_count | No rollback needed (read-only) |
| db.config_change | changes[].param, changes[].value | Reverts to original config values |
| mongo.insert_load | documents_per_collection, collections | Deletes inserted documents |
| mongo.update_load | documents | Restores original documents |
| mongo.find_load | query_count | No rollback needed (read-only) |
Kubernetes Agent
Discovers pods via the Kubernetes API (kubeconfig or in-cluster auth) and injects cluster-level faults. Supports namespace filtering and label selectors.
```yaml
experiments:
  - name: "k8s-pod-chaos"
    target: kubernetes
    target_config:
      namespace: "staging"
      label_selector: "app=web"
    skills:
      - skill_name: "k8s.pod_kill"
        params:
          namespace: "staging"
          label_selector: "app=web"
          count: 2
      - skill_name: "k8s.network_chaos"
        params:
          namespace: "staging"
          pod_selector:
            app: "web"
    duration: "5m"

  - name: "k8s-node-drain"
    target: kubernetes
    target_config:
      namespace: "default"
    skills:
      - skill_name: "k8s.node_drain"
        params: {}
      - skill_name: "k8s.resource_stress"
        params:
          namespace: "default"
          cpu_workers: 4
          memory: "512M"
    duration: "10m"
```
| Skill | Parameters | Rollback |
|---|---|---|
| k8s.pod_kill | namespace, label_selector, count | Waits for replacement pod to run |
| k8s.node_drain | (none required) | Uncordons the node |
| k8s.network_chaos | namespace, pod_selector | Deletes the deny-all NetworkPolicy |
| k8s.resource_stress | namespace, cpu_workers, memory | Deletes the stress-ng pod |
Server Agent
Connects over SSH (key or password auth) and creates host-level disruptions. Service discovery identifies running services via systemctl, with configurable exclusions to protect critical processes.
```yaml
experiments:
  - name: "server-chaos"
    target: server
    target_config:
      hosts:
        - host: "10.0.1.50"
          port: 22
          username: "chaos-agent"
          auth:
            type: key
            private_key_path: "~/.ssh/id_ed25519"
      discovery:
        enabled: true
        exclude_services: ["docker", "containerd"]
    skills:
      - skill_name: "server.service_stop"
        params:
          max_services: 2
      - skill_name: "server.disk_fill"
        params:
          size: "5GB"
          target_mount: "/tmp"
      - skill_name: "server.cpu_stress"
        params:
          workers: 4
      - skill_name: "server.memory_stress"
        params:
          memory: "512M"
          workers: 2
    duration: "10m"
    resource_filters:
      - "nginx.*"
      - "postgres.*"
```
SSH authentication supports both key-based (type: key) and password-based (type: password) auth. Use resource_filters to limit which discovered services are targeted.
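A password-based variant might look like the fragment below (the password field name is an assumption; prefer key auth where possible, and source the secret from the environment rather than the file):

```yaml
auth:
  type: password
  password: "${SSH_PASSWORD}"
```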
| Skill | Parameters | Rollback |
|---|---|---|
| server.service_stop | max_services | Restarts stopped services via systemctl |
| server.disk_fill | size, target_mount | Removes the generated file |
| server.cpu_stress | workers | Kills the stress-ng process |
| server.memory_stress | memory, workers | Kills the stress-ng process |
| server.permission_change | (target-dependent) | Restores original file permissions |
LLM Planning
Describe chaos in plain English. The LLM analyzes your infrastructure using built-in tools (list skills, discover resources, run experiments) and generates a runnable config. Supports Anthropic, OpenAI, and Ollama providers, plus MCP servers for additional tooling.
```yaml
llm:
  provider: anthropic
  api_key: "${ANTHROPIC_API_KEY}"
  model: "claude-sonnet-4-5-20250929"
  max_tokens: 4096

# Optional: OpenAI
# llm:
#   provider: openai
#   api_key: "${OPENAI_API_KEY}"
#   model: "gpt-4o"

# Optional: Ollama (local, no API key)
# llm:
#   provider: ollama
#   base_url: "http://localhost:11434"
#   model: "llama3.1"

# Connect MCP servers for richer planning
mcp_servers:
# - name: "prometheus-mcp"
#   transport:
#     type: stdio
#     command: "npx"
#     args: ["-y", "@modelcontextprotocol/server-prometheus"]
#     env:
#       PROMETHEUS_URL: "http://prometheus:9090"

max_turns: 10
If no --provider flag is given, the CLI checks for the ANTHROPIC_API_KEY and then the OPENAI_API_KEY environment variable, and falls back to Ollama (local, no key needed).
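Forcing a provider explicitly might look like this (the prompt is illustrative, and passing it as a positional argument is an assumption):

```shell
chaos plan --provider ollama "stress CPU on the staging web hosts for ten minutes"
```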
Core Concepts
| Concept | Description |
|---|---|
| Agent | A target-specific adapter (database, k8s, server) that discovers resources and executes skills. |
| Skill | A reversible chaos action. Every skill returns a RollbackHandle on success. |
| Experiment | A named configuration of skills to run against a target, with a duration window. |
| Orchestrator | Coordinates agent lifecycle: init, discover, execute, wait, rollback, complete. |
| RollbackHandle | Opaque undo state returned by each skill. Popped in LIFO order during rollback. |
| EventSink | Async consumer for experiment events — plug in tracing, Prometheus, or custom backends. |
| TargetDomain | Enum: Database, Kubernetes, Server. Determines which agent handles the experiment. |
Agent Lifecycle
Implement the Agent trait's lifecycle methods; the orchestrator drives the state machine for you.
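A simplified sketch of the trait's shape, with method names inferred from the orchestrator phases above (the real trait in chaos-core is async and its signatures differ):

```rust
// Hypothetical, simplified Agent trait: the orchestrator calls these in
// order -- init, discover, execute (once per skill), then rollback on exit.
struct Resource {
    name: String, // a table, pod, or service discovered on the target
}

struct RollbackHandle {
    // Opaque undo state; only the skill that produced it knows how to apply it.
    token: String,
}

trait Agent {
    /// Establish connections and verify access to the target.
    fn init(&mut self) -> Result<(), String>;
    /// Enumerate resources (tables, pods, services) to act on.
    fn discover(&mut self) -> Result<Vec<Resource>, String>;
    /// Run one skill against discovered resources; return undo state.
    fn execute(&mut self, skill_name: &str) -> Result<RollbackHandle, String>;
    /// Apply one rollback handle, reverting a single action.
    fn rollback(&mut self, handle: RollbackHandle) -> Result<(), String>;
}
```

Each target crate (chaos-db, chaos-k8s, chaos-server) supplies its own implementation; the orchestrator stays target-agnostic.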