Chaos Agents Documentation

Overview

Chaos Agents deploys AI-powered agents into your infrastructure that autonomously discover resources, inject faults, observe the blast radius, and cleanly revert. Instead of writing brittle test scripts, you declare what to stress and the agents figure out how.

Every chaos action returns a rollback handle. When the experiment window expires — or on any failure — all actions revert in LIFO order. No residue.

Databases

Schema-aware load injection, concurrent updates, config mutation. PostgreSQL and MySQL.

Kubernetes

Pod kills, node eviction, network partitions, DNS corruption. Full namespace discovery.

Servers

Disk fill, process termination, permission scramble, port saturation over SSH.

Architecture

chaos-cli CLI binary & daemon scheduler

|

chaos-core Orchestrator, traits, rollback engine

|

+--------+-----------+

| | |

chaos-db chaos-k8s chaos-server

chaos-core — The brain. Agent trait, Skill trait, Orchestrator, RollbackLog, EventSink, Config.
chaos-db — Database chaos agent. Connects to PostgreSQL/MySQL, introspects via information_schema.
chaos-k8s — Kubernetes chaos agent. Discovers workloads via the Kubernetes API.
chaos-server — Server chaos agent. Connects over SSH for host-level disruptions.
chaos-llm — LLM integration layer. Natural language experiment definitions.

How It Works

STEP 1

Discover

Agent connects and maps every resource — tables, pods, services, processes.

STEP 2

Plan

Orchestrator selects skills, parameterizes, and builds an execution plan.

STEP 3

Execute

Each skill runs and returns a RollbackHandle pushed onto a LIFO stack.

STEP 4

Observe

Events stream to sinks in real time — tracing, Prometheus, Datadog.

STEP 5

Rollback

Handles popped in LIFO order. Best-effort: if one fails, the rest continue.

Quickstart

Write an experiment config, then run it. Here's a minimal database chaos example that inserts load into PostgreSQL and changes a config parameter:

                            my-experiment.yaml
                            yaml
                        

experiments:
  - name: "postgres-quick-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/mydb"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
          tables: ["orders", "payments"]
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "8MB"
    duration: "5m"
                        

$ chaos validate my-experiment.yaml

Config valid. 1 experiment, 2 skills

$ chaos run my-experiment.yaml

[DISCOVER] Connected to PostgreSQL, found 14 tables

[EXECUTE] db.insert_load: 5000 rows into orders, payments

[EXECUTE] db.config_change: work_mem = 8MB

[WAIT] Duration window: 5m remaining

[ROLLBACK] db.config_change reverted

[ROLLBACK] db.insert_load data cleaned

[COMPLETE] Experiment finished. All actions rolled back.

Or skip the config entirely and describe what you want in plain English:

$ chaos agent "Stress test PostgreSQL with heavy writes for 5 minutes"

[LLM] Planning experiment with Anthropic...

[LLM] Generated config: postgres-stress-test (2 skills, 5m duration)

Proceed? [y/N] y

[EXECUTE] Running experiment...

[COMPLETE] All actions executed and rolled back.

CLI Commands

Chaos Agents provides six CLI commands for running, planning, scheduling, and validating experiments.

Command	Description
chaos run	Execute experiments from a YAML config file. Supports `--dry-run` to validate without executing.
chaos agent	Describe chaos in plain English. The LLM generates a config, you review, then it executes. Use `-y` to auto-approve or `--save` to export.
chaos plan	LLM-driven planning only (no execution). Outputs a generated experiment config.
chaos daemon	Run experiments on a cron schedule with concurrency control and graceful shutdown.
chaos validate	Validate a config file without executing. Checks YAML parsing, skill existence, and parameters.
chaos list-skills	List all available skills. Filter by `--target database\|kubernetes\|server`.

# Run an experiment config

$ chaos run config/example-db.yaml

# Dry run (validate + show plan, no execution)

$ chaos run config/example-db.yaml --dry-run

# LLM agent: describe, review, execute

$ chaos agent "Kill 2 pods in staging and observe recovery"

# LLM agent: auto-approve

$ chaos agent "Stress test the web servers" -y

# Save generated config without executing

$ chaos agent "Fill disk on 10.0.1.50" --save plan.yaml

# Plan only (no execution)

$ chaos plan "Test database failover" --provider anthropic

# List skills for a specific target

$ chaos list-skills --target database

Experiment Config

Define experiments declaratively in YAML. Each experiment specifies a target (database, kubernetes, or server), a target_config with connection details, a list of skills to invoke, and a duration window.

                            example-db.yaml
                            yaml
                        

experiments:
  - name: "postgres-load-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/postgres"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 10000
          tables: ["users", "orders"]
        count: 1
      - skill_name: "db.select_load"
        params:
          query_count: 500
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "4MB"
    duration: "5m"
    parallel: false
                        

Tip: Always validate your config before running: chaos validate my-experiment.yaml. Use --dry-run to see the execution plan without applying changes.

Daemon / Scheduling

Run chaos on a cron schedule. The daemon enforces concurrency limits via a semaphore and supports graceful shutdown via SIGTERM. Each scheduled run gets a fresh orchestrator.

                            daemon.yaml
                            yaml
                        

settings:
  max_concurrent: 2

experiments:
  - experiment:
      name: "nightly-db-chaos"
      target: database
      target_config:
        connection_url: "postgres://chaos:pw@db.internal:5432/staging"
        db_type: postgres
      skills:
        - skill_name: "db.insert_load"
          params:
            rows_per_table: 5000
      duration: "15m"
    schedule: "0 0 2 * * *"   # Daily at 2 AM
    enabled: true

  - experiment:
      name: "hourly-pod-chaos"
      target: kubernetes
      target_config:
        namespace: "staging"
        label_selector: "app=web"
      skills:
        - skill_name: "k8s.pod_kill"
          params:
            namespace: "staging"
            count: 1
      duration: "5m"
    schedule: "0 0 * * * *"   # Every hour
    enabled: true
                        

$ chaos daemon daemon.yaml

[DAEMON] Loaded 2 scheduled experiments (max_concurrent: 2)

[DAEMON] nightly-db-chaos: "0 0 2 * * *"

[DAEMON] hourly-pod-chaos: "0 0 * * * *"

[DAEMON] Waiting for next run..._

Cron format: sec min hour day_of_month month day_of_week (6 fields). Use --pid-file /var/run/chaos.pid for process management.

Database Agent

Connects to PostgreSQL, MySQL, CockroachDB, YugabyteDB, or MongoDB. Auto-discovers tables via information_schema (or collection listing for Mongo) and runs chaos skills against them.

                            example-db.yaml
                            yaml
                        

experiments:
  - name: "postgres-load-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/postgres"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 10000
          tables: ["users", "orders"]
      - skill_name: "db.select_load"
        params:
          query_count: 500
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "4MB"
    duration: "5m"
    parallel: false

  - name: "mysql-update-chaos"
    target: database
    target_config:
      connection_url: "mysql://chaos:password@localhost:3306/mydb"
      db_type: mysql
    skills:
      - skill_name: "db.update_load"
        params:
          rows: 200
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
    duration: "3m"
                        

Supported db_type values: postgres, mysql, cockroach, yugabyte, mongo

Skill	Parameters	Rollback
db.insert_load	rows_per_table, tables	Deletes all inserted rows by tracked IDs
db.update_load	rows	Restores original values from snapshot
db.select_load	query_count	No rollback needed (read-only)
db.config_change	changes[].param, changes[].value	Reverts to original config values
mongo.insert_load	documents_per_collection, collections	Deletes inserted documents
mongo.update_load	documents	Restores original documents
mongo.find_load	query_count	No rollback needed (read-only)

Kubernetes Agent

Discovers pods via the Kubernetes API (kubeconfig or in-cluster auth) and injects cluster-level faults. Supports namespace filtering and label selectors.

                            example-k8s.yaml
                            yaml
                        

experiments:
  - name: "k8s-pod-chaos"
    target: kubernetes
    target_config:
      namespace: "staging"
      label_selector: "app=web"
    skills:
      - skill_name: "k8s.pod_kill"
        params:
          namespace: "staging"
          label_selector: "app=web"
          count: 2
      - skill_name: "k8s.network_chaos"
        params:
          namespace: "staging"
          pod_selector:
            app: "web"
    duration: "5m"

  - name: "k8s-node-drain"
    target: kubernetes
    target_config:
      namespace: "default"
    skills:
      - skill_name: "k8s.node_drain"
        params: {}
      - skill_name: "k8s.resource_stress"
        params:
          namespace: "default"
          cpu_workers: 4
          memory: "512M"
    duration: "10m"
                        

Skill	Parameters	Rollback
k8s.pod_kill	namespace, label_selector, count	Waits for replacement pod to run
k8s.node_drain	(none required)	Uncordons the node
k8s.network_chaos	namespace, pod_selector	Deletes the deny-all NetworkPolicy
k8s.resource_stress	namespace, cpu_workers, memory	Deletes the stress-ng pod

Server Agent

Connects over SSH (key or password auth) and creates host-level disruptions. Service discovery identifies running services via systemctl, with configurable exclusions to protect critical processes.

                            example-server.yaml
                            yaml
                        

experiments:
  - name: "server-chaos"
    target: server
    target_config:
      hosts:
        - host: "10.0.1.50"
          port: 22
          username: "chaos-agent"
          auth:
            type: key
            private_key_path: "~/.ssh/id_ed25519"
      discovery:
        enabled: true
        exclude_services: ["docker", "containerd"]
    skills:
      - skill_name: "server.service_stop"
        params:
          max_services: 2
      - skill_name: "server.disk_fill"
        params:
          size: "5GB"
          target_mount: "/tmp"
      - skill_name: "server.cpu_stress"
        params:
          workers: 4
      - skill_name: "server.memory_stress"
        params:
          memory: "512M"
          workers: 2
    duration: "10m"
    resource_filters:
      - "nginx.*"
      - "postgres.*"
                        

SSH authentication supports both key-based (type: key) and password-based (type: password) auth. Use resource_filters to limit which discovered services are targeted.

Skill	Parameters	Rollback
server.service_stop	max_services	Restarts stopped services via systemctl
server.disk_fill	size, target_mount	Removes the generated file
server.cpu_stress	workers	Kills the stress-ng process
server.memory_stress	memory, workers	Kills the stress-ng process
server.permission_change	(target-dependent)	Restores original file permissions

LLM Planning

Describe chaos in plain English. The LLM analyzes your infrastructure using built-in tools (list skills, discover resources, run experiments) and generates a runnable config. Supports Anthropic, OpenAI, and Ollama providers, plus MCP servers for additional tooling.

                            example-llm.yaml
                            yaml
                        

llm:
  provider: anthropic
  api_key: "${ANTHROPIC_API_KEY}"
  model: "claude-sonnet-4-5-20250929"
  max_tokens: 4096

# Optional: OpenAI
# llm:
#   provider: openai
#   api_key: "${OPENAI_API_KEY}"
#   model: "gpt-4o"

# Optional: Ollama (local, no API key)
# llm:
#   provider: ollama
#   base_url: "http://localhost:11434"
#   model: "llama3.1"

# Connect MCP servers for richer planning
mcp_servers:
  # - name: "prometheus-mcp"
  #   transport:
  #     type: stdio
  #     command: "npx"
  #     args: ["-y", "@modelcontextprotocol/server-prometheus"]
  #   env:
  #     PROMETHEUS_URL: "http://prometheus:9090"

max_turns: 10
                        

# Plan only (outputs YAML config)

$ chaos plan "Test PostgreSQL resilience under heavy write load" --config config/example-llm.yaml

# Plan + execute interactively

$ chaos agent "Kill random pods in staging" --provider anthropic

# Auto-approve (skip confirmation)

$ chaos agent "Stress test web servers" -y

# Save generated plan for later

$ chaos agent "Fill disk on 10.0.1.50" --save plan.yaml

# Use a different provider

$ chaos plan "Break the database" --provider openai --model gpt-4o

Provider auto-detection: If no --provider flag is given, the CLI checks for ANTHROPIC_API_KEY then OPENAI_API_KEY environment variables, and falls back to Ollama (local, no key needed).

Core Concepts

Concept	Description
Agent	A target-specific adapter (database, k8s, server) that discovers resources and executes skills.
Skill	A reversible chaos action. Every skill returns a RollbackHandle on success.
Experiment	A named configuration of skills to run against a target, with a duration window.
Orchestrator	Coordinates agent lifecycle: init, discover, execute, wait, rollback, complete.
RollbackHandle	Opaque undo state returned by each skill. Popped in LIFO order during rollback.
EventSink	Async consumer for experiment events — plug in tracing, Prometheus, or custom backends.
TargetDomain	Enum: Database, Kubernetes, Server. Determines which agent handles the experiment.

Agent Lifecycle

Initializing Discovering Ready Executing RollingBack Idle

Executing Failed RollingBack

Tip: The orchestrator automatically transitions agents through these states. You only need to implement the Agent trait methods — the state machine is handled for you.

Roadmap

Adaptive Chaos

Agents that learn from past experiments and automatically escalate intensity toward failure boundaries.

Multi-Target Experiments

Coordinated chaos across database + cluster + server in a single experiment.

Observability Integrations

Stream events to Prometheus, Grafana, Datadog, and PagerDuty out of the box.

Cloud-Native Targets

AWS, GCP, Azure fault injection — Lambda throttling, S3 latency, IAM revocation.

Distributed Agent Mesh

Agents coordinating across regions to simulate real-world cascading failures.

Natural Language Experiments

Describe chaos in plain English — the LLM agent compiles it into a runnable plan.