System32 Agents

Chaos Agents

The first autonomous agent from System32. It discovers your infrastructure, injects faults, and cleanly reverts — so you can break things safely before production does it for you.

v0.1.0
Built with Rust
Open Source

Overview

Chaos Agents deploys AI-powered agents into your infrastructure that autonomously discover resources, inject faults, observe the blast radius, and cleanly revert. Instead of writing brittle test scripts, you declare what to stress and the agents figure out how.

Every chaos action returns a rollback handle. When the experiment window expires — or on any failure — all actions revert in LIFO order. No residue.
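
To make the rollback model concrete, here is a minimal Rust sketch of a LIFO rollback log. The types and signatures are illustrative assumptions, not the actual chaos-core API; they only show handles being pushed as actions succeed and popped in reverse order, with failed undos logged and skipped.

rollback-sketch.rs rust
// Conceptual sketch only: not the real chaos-core RollbackLog.
struct RollbackHandle {
    description: String,
    undo: Box<dyn Fn() -> Result<(), String>>,
}

struct RollbackLog {
    stack: Vec<RollbackHandle>,
}

impl RollbackLog {
    fn push(&mut self, handle: RollbackHandle) {
        self.stack.push(handle);
    }

    /// Pop handles in LIFO order. Best-effort: a failed undo is logged
    /// and the remaining handles still run.
    fn rollback_all(&mut self) {
        while let Some(handle) = self.stack.pop() {
            if let Err(e) = (handle.undo)() {
                eprintln!("rollback failed for {}: {e}", handle.description);
            }
        }
    }
}

fn main() {
    let mut log = RollbackLog { stack: Vec::new() };
    log.push(RollbackHandle {
        description: "db.insert_load".into(),
        undo: Box::new(|| Ok(())),
    });
    log.push(RollbackHandle {
        description: "db.config_change".into(),
        undo: Box::new(|| Ok(())),
    });
    // db.config_change reverts first, then db.insert_load.
    log.rollback_all();
}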

  • Databases: Schema-aware load injection, concurrent updates, config mutation. PostgreSQL, MySQL, CockroachDB, YugabyteDB, and MongoDB.
  • Kubernetes: Pod kills, node eviction, network partitions, DNS corruption. Full namespace discovery.
  • Servers: Disk fill, process termination, permission scramble, port saturation over SSH.

Architecture

crate-structure
chaos-cli       CLI binary & daemon scheduler
     |
chaos-core      Orchestrator, traits, rollback engine
     |
     +---------------+---------------+
     |               |               |
 chaos-db        chaos-k8s      chaos-server
  • chaos-core — The brain. Agent trait, Skill trait, Orchestrator, RollbackLog, EventSink, Config (see the trait sketch after this list).
  • chaos-db — Database chaos agent. Connects to PostgreSQL/MySQL, introspects via information_schema.
  • chaos-k8s — Kubernetes chaos agent. Discovers workloads via the Kubernetes API.
  • chaos-server — Server chaos agent. Connects over SSH for host-level disruptions.
  • chaos-llm — LLM integration layer. Natural language experiment definitions.
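
The crate split above implies a pair of core abstractions: target agents that discover resources, and skills that execute and revert. The Rust sketch below is an assumption about their shape (the actual chaos-core traits are richer and may well be async); it is only meant to show how chaos-db, chaos-k8s, and chaos-server plug into chaos-core.

traits-sketch.rs rust
use std::collections::HashMap;

// Illustrative shapes only; not the actual chaos-core definitions.

/// Opaque undo state returned by a successful skill execution.
pub struct RollbackHandle {
    pub skill_name: String,
    pub undo_token: String,
}

/// A reversible chaos action, e.g. db.insert_load or k8s.pod_kill.
pub trait Skill {
    fn name(&self) -> &str;
    fn execute(&self, params: &HashMap<String, String>) -> Result<RollbackHandle, String>;
    fn rollback(&self, handle: &RollbackHandle) -> Result<(), String>;
}

/// A target-specific adapter. chaos-db, chaos-k8s, and chaos-server each
/// implement this and register their skills with the orchestrator.
pub trait Agent {
    fn discover(&mut self) -> Result<Vec<String>, String>;
    fn skills(&self) -> Vec<Box<dyn Skill>>;
}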

How It Works

1. Discover: Agent connects and maps every resource — tables, pods, services, processes.
2. Plan: Orchestrator selects skills, parameterizes them, and builds an execution plan.
3. Execute: Each skill runs and returns a RollbackHandle that is pushed onto a LIFO stack.
4. Observe: Events stream to sinks in real time — tracing, Prometheus, Datadog (see the sink sketch after this list).
5. Rollback: Handles are popped in LIFO order. Best-effort: if one fails, the rest continue.
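
Step 4 is where observability plugs in. As an assumption about what a sink looks like (the real EventSink in chaos-core is an async consumer with a richer event type), here is a minimal synchronous sketch:

event-sink-sketch.rs rust
// Illustrative only; not the actual chaos-core EventSink trait.
#[derive(Debug)]
enum ExperimentEvent {
    Discovered { resource: String },
    SkillExecuted { skill: String },
    RolledBack { skill: String },
}

trait EventSink {
    fn emit(&self, event: &ExperimentEvent);
}

/// The simplest possible sink: print to stdout. A Prometheus or Datadog
/// sink would implement the same trait and forward events to its backend.
struct StdoutSink;

impl EventSink for StdoutSink {
    fn emit(&self, event: &ExperimentEvent) {
        println!("[EVENT] {event:?}");
    }
}

fn main() {
    let sink = StdoutSink;
    sink.emit(&ExperimentEvent::Discovered { resource: "table:orders".into() });
    sink.emit(&ExperimentEvent::SkillExecuted { skill: "db.insert_load".into() });
    sink.emit(&ExperimentEvent::RolledBack { skill: "db.insert_load".into() });
}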

Quickstart

Write an experiment config, then run it. Here's a minimal database chaos example that inserts load into PostgreSQL and changes a config parameter:

my-experiment.yaml yaml
experiments:
  - name: "postgres-quick-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/mydb"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
          tables: ["orders", "payments"]
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "8MB"
    duration: "5m"
terminal
$ chaos validate my-experiment.yaml
Config valid. 1 experiment, 2 skills
$ chaos run my-experiment.yaml
[DISCOVER] Connected to PostgreSQL, found 14 tables
[EXECUTE] db.insert_load: 5000 rows into orders, payments
[EXECUTE] db.config_change: work_mem = 8MB
[WAIT] Duration window: 5m remaining
[ROLLBACK] db.config_change reverted
[ROLLBACK] db.insert_load data cleaned
[COMPLETE] Experiment finished. All actions rolled back.

Or skip the config entirely and describe what you want in plain English:

terminal
$ chaos agent "Stress test PostgreSQL with heavy writes for 5 minutes"
[LLM] Planning experiment with Anthropic...
[LLM] Generated config: postgres-stress-test (2 skills, 5m duration)
Proceed? [y/N] y
[EXECUTE] Running experiment...
[COMPLETE] All actions executed and rolled back.

CLI Commands

Chaos Agents provides six CLI commands for running, planning, scheduling, and validating experiments.

Command            | Description
chaos run          | Execute experiments from a YAML config file. Supports --dry-run to validate without executing.
chaos agent        | Describe chaos in plain English. The LLM generates a config, you review, then it executes. Use -y to auto-approve or --save to export.
chaos plan         | LLM-driven planning only (no execution). Outputs a generated experiment config.
chaos daemon       | Run experiments on a cron schedule with concurrency control and graceful shutdown.
chaos validate     | Validate a config file without executing. Checks YAML parsing, skill existence, and parameters.
chaos list-skills  | List all available skills. Filter by --target database|kubernetes|server.
terminal
# Run an experiment config
$ chaos run config/example-db.yaml
# Dry run (validate + show plan, no execution)
$ chaos run config/example-db.yaml --dry-run
# LLM agent: describe, review, execute
$ chaos agent "Kill 2 pods in staging and observe recovery"
# LLM agent: auto-approve
$ chaos agent "Stress test the web servers" -y
# Save generated config without executing
$ chaos agent "Fill disk on 10.0.1.50" --save plan.yaml
# Plan only (no execution)
$ chaos plan "Test database failover" --provider anthropic
# List skills for a specific target
$ chaos list-skills --target database

Experiment Config

Define experiments declaratively in YAML. Each experiment specifies a target (database, kubernetes, or server), a target_config with connection details, a list of skills to invoke, and a duration window.

example-db.yaml yaml
experiments:
  - name: "postgres-load-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/postgres"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 10000
          tables: ["users", "orders"]
        count: 1
      - skill_name: "db.select_load"
        params:
          query_count: 500
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "4MB"
    duration: "5m"
    parallel: false
Tip: Always validate your config before running: chaos validate my-experiment.yaml. Use --dry-run to see the execution plan without applying changes.

Daemon / Scheduling

Run chaos on a cron schedule. The daemon enforces concurrency limits via a semaphore and supports graceful shutdown via SIGTERM. Each scheduled run gets a fresh orchestrator.

daemon.yaml yaml
settings:
  max_concurrent: 2

experiments:
  - experiment:
      name: "nightly-db-chaos"
      target: database
      target_config:
        connection_url: "postgres://chaos:pw@db.internal:5432/staging"
        db_type: postgres
      skills:
        - skill_name: "db.insert_load"
          params:
            rows_per_table: 5000
      duration: "15m"
    schedule: "0 0 2 * * *"   # Daily at 2 AM
    enabled: true

  - experiment:
      name: "hourly-pod-chaos"
      target: kubernetes
      target_config:
        namespace: "staging"
        label_selector: "app=web"
      skills:
        - skill_name: "k8s.pod_kill"
          params:
            namespace: "staging"
            count: 1
      duration: "5m"
    schedule: "0 0 * * * *"   # Every hour
    enabled: true
terminal
$ chaos daemon daemon.yaml
[DAEMON] Loaded 2 scheduled experiments (max_concurrent: 2)
[DAEMON] nightly-db-chaos: "0 0 2 * * *"
[DAEMON] hourly-pod-chaos: "0 0 * * * *"
[DAEMON] Waiting for next run...

Cron format: sec min hour day_of_month month day_of_week (6 fields). Use --pid-file /var/run/chaos.pid for process management.
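
To make the 6-field ordering concrete, the sketch below parses the nightly schedule above and prints its next fire times. It assumes the cron and chrono crates purely for illustration; the daemon's actual scheduler implementation isn't specified here.

cron-check.rs rust
use chrono::Utc;
use cron::Schedule;
use std::str::FromStr;

fn main() {
    // 6 fields: sec min hour day_of_month month day_of_week.
    // "0 0 2 * * *" fires every day at 02:00:00.
    let schedule = Schedule::from_str("0 0 2 * * *").expect("valid cron expression");
    for next in schedule.upcoming(Utc).take(3) {
        println!("next run: {next}");
    }
}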

Database Agent

Connects to PostgreSQL, MySQL, CockroachDB, YugabyteDB, or MongoDB. Auto-discovers tables via information_schema (or collection listing for Mongo) and runs chaos skills against them.

example-db.yaml yaml
experiments:
  - name: "postgres-load-test"
    target: database
    target_config:
      connection_url: "postgres://postgres:password@localhost:5432/postgres"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 10000
          tables: ["users", "orders"]
      - skill_name: "db.select_load"
        params:
          query_count: 500
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "4MB"
    duration: "5m"
    parallel: false

  - name: "mysql-update-chaos"
    target: database
    target_config:
      connection_url: "mysql://chaos:password@localhost:3306/mydb"
      db_type: mysql
    skills:
      - skill_name: "db.update_load"
        params:
          rows: 200
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
    duration: "3m"

Supported db_type values: postgres, mysql, cockroach, yugabyte, mongo

Skill              | Parameters                             | Rollback
db.insert_load     | rows_per_table, tables                 | Deletes all inserted rows by tracked IDs
db.update_load     | rows                                   | Restores original values from snapshot
db.select_load     | query_count                            | No rollback needed (read-only)
db.config_change   | changes[].param, changes[].value       | Reverts to original config values
mongo.insert_load  | documents_per_collection, collections  | Deletes inserted documents
mongo.update_load  | documents                              | Restores original documents
mongo.find_load    | query_count                            | No rollback needed (read-only)
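
For a feel of what the discovery step does, here is a hedged sketch of the information_schema introspection described above. It assumes the sqlx and tokio crates and the quickstart's connection URL; the crates chaos-db actually uses aren't stated in this document.

db-discovery-sketch.rs rust
use sqlx::postgres::PgPoolOptions;
use sqlx::Row;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .max_connections(5)
        .connect("postgres://postgres:password@localhost:5432/mydb")
        .await?;

    // Discover user tables the way the agent is described to:
    // by querying information_schema.
    let rows = sqlx::query(
        "SELECT table_name::text AS table_name \
         FROM information_schema.tables WHERE table_schema = 'public'",
    )
    .fetch_all(&pool)
    .await?;

    for row in &rows {
        let table: String = row.get("table_name");
        println!("discovered table: {table}");
    }
    Ok(())
}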

Kubernetes Agent

Discovers pods via the Kubernetes API (kubeconfig or in-cluster auth) and injects cluster-level faults. Supports namespace filtering and label selectors.

example-k8s.yaml yaml
experiments:
  - name: "k8s-pod-chaos"
    target: kubernetes
    target_config:
      namespace: "staging"
      label_selector: "app=web"
    skills:
      - skill_name: "k8s.pod_kill"
        params:
          namespace: "staging"
          label_selector: "app=web"
          count: 2
      - skill_name: "k8s.network_chaos"
        params:
          namespace: "staging"
          pod_selector:
            app: "web"
    duration: "5m"

  - name: "k8s-node-drain"
    target: kubernetes
    target_config:
      namespace: "default"
    skills:
      - skill_name: "k8s.node_drain"
        params: {}
      - skill_name: "k8s.resource_stress"
        params:
          namespace: "default"
          cpu_workers: 4
          memory: "512M"
    duration: "10m"

Skill                | Parameters                        | Rollback
k8s.pod_kill         | namespace, label_selector, count  | Waits for replacement pod to run
k8s.node_drain       | (none required)                   | Uncordons the node
k8s.network_chaos    | namespace, pod_selector           | Deletes the deny-all NetworkPolicy
k8s.resource_stress  | namespace, cpu_workers, memory    | Deletes the stress-ng pod
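
As a rough picture of what k8s.pod_kill involves, here is a hedged sketch that lists pods by label selector and deletes one. It assumes the kube, k8s-openapi, and tokio crates; whether chaos-k8s is built on these exact crates isn't stated here.

pod-kill-sketch.rs rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, DeleteParams, ListParams};
use kube::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Uses kubeconfig or in-cluster auth, whichever is available.
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::namespaced(client, "staging");

    // Discover pods matching the label selector.
    let selected = pods.list(&ListParams::default().labels("app=web")).await?;

    // Kill one pod; the Deployment/ReplicaSet is expected to replace it,
    // which is what the rollback step waits for.
    if let Some(pod) = selected.items.first() {
        let name = pod.metadata.name.clone().unwrap_or_default();
        pods.delete(&name, &DeleteParams::default()).await?;
        println!("killed pod {name}");
    }
    Ok(())
}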

Server Agent

Connects over SSH (key or password auth) and creates host-level disruptions. Service discovery identifies running services via systemctl, with configurable exclusions to protect critical processes.

example-server.yaml yaml
experiments:
  - name: "server-chaos"
    target: server
    target_config:
      hosts:
        - host: "10.0.1.50"
          port: 22
          username: "chaos-agent"
          auth:
            type: key
            private_key_path: "~/.ssh/id_ed25519"
      discovery:
        enabled: true
        exclude_services: ["docker", "containerd"]
    skills:
      - skill_name: "server.service_stop"
        params:
          max_services: 2
      - skill_name: "server.disk_fill"
        params:
          size: "5GB"
          target_mount: "/tmp"
      - skill_name: "server.cpu_stress"
        params:
          workers: 4
      - skill_name: "server.memory_stress"
        params:
          memory: "512M"
          workers: 2
    duration: "10m"
    resource_filters:
      - "nginx.*"
      - "postgres.*"

SSH authentication supports both key-based (type: key) and password-based (type: password) auth. Use resource_filters to limit which discovered services are targeted.

Skill                     | Parameters          | Rollback
server.service_stop       | max_services        | Restarts stopped services via systemctl
server.disk_fill          | size, target_mount  | Removes the generated file
server.cpu_stress         | workers             | Kills the stress-ng process
server.memory_stress      | memory, workers     | Kills the stress-ng process
server.permission_change  | (target-dependent)  | Restores original file permissions
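
The discovery step on a host boils down to running systemctl over SSH and filtering out excluded services. The sketch below assumes the ssh2 crate and key-based auth, and is illustrative only, not necessarily how chaos-server is implemented.

server-discovery-sketch.rs rust
use ssh2::Session;
use std::io::Read;
use std::net::TcpStream;
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect and authenticate with a private key (password auth also exists).
    let tcp = TcpStream::connect("10.0.1.50:22")?;
    let mut sess = Session::new()?;
    sess.set_tcp_stream(tcp);
    sess.handshake()?;
    sess.userauth_pubkey_file(
        "chaos-agent",
        None,
        Path::new("/home/chaos-agent/.ssh/id_ed25519"),
        None,
    )?;

    // List running services: the raw material for service discovery.
    let mut channel = sess.channel_session()?;
    channel.exec("systemctl list-units --type=service --state=running --no-legend")?;
    let mut output = String::new();
    channel.read_to_string(&mut output)?;
    channel.wait_close()?;

    // Apply the configured exclusions before selecting targets.
    let excluded = ["docker", "containerd"];
    for line in output.lines() {
        let unit = line.split_whitespace().next().unwrap_or_default();
        if !excluded.iter().any(|e| unit.starts_with(*e)) {
            println!("candidate service: {unit}");
        }
    }
    Ok(())
}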

LLM Planning

Describe chaos in plain English. The LLM analyzes your infrastructure using built-in tools (list skills, discover resources, run experiments) and generates a runnable config. Supports Anthropic, OpenAI, and Ollama providers, plus MCP servers for additional tooling.

example-llm.yaml yaml
llm:
  provider: anthropic
  api_key: "${ANTHROPIC_API_KEY}"
  model: "claude-sonnet-4-5-20250929"
  max_tokens: 4096

# Optional: OpenAI
# llm:
#   provider: openai
#   api_key: "${OPENAI_API_KEY}"
#   model: "gpt-4o"

# Optional: Ollama (local, no API key)
# llm:
#   provider: ollama
#   base_url: "http://localhost:11434"
#   model: "llama3.1"

# Connect MCP servers for richer planning
mcp_servers:
  # - name: "prometheus-mcp"
  #   transport:
  #     type: stdio
  #     command: "npx"
  #     args: ["-y", "@modelcontextprotocol/server-prometheus"]
  #   env:
  #     PROMETHEUS_URL: "http://prometheus:9090"

max_turns: 10
terminal
# Plan only (outputs YAML config)
$ chaos plan "Test PostgreSQL resilience under heavy write load" --config config/example-llm.yaml
# Plan + execute interactively
$ chaos agent "Kill random pods in staging" --provider anthropic
# Auto-approve (skip confirmation)
$ chaos agent "Stress test web servers" -y
# Save generated plan for later
$ chaos agent "Fill disk on 10.0.1.50" --save plan.yaml
# Use a different provider
$ chaos plan "Break the database" --provider openai --model gpt-4o
Provider auto-detection: If no --provider flag is given, the CLI checks for ANTHROPIC_API_KEY then OPENAI_API_KEY environment variables, and falls back to Ollama (local, no key needed).
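
The precedence above is simple enough to sketch. The function below illustrates the documented order (Anthropic key, then OpenAI key, then the Ollama fallback); it is not the CLI's actual code.

provider-detect-sketch.rs rust
use std::env;

#[derive(Debug)]
enum Provider {
    Anthropic,
    OpenAi,
    Ollama,
}

fn detect_provider() -> Provider {
    if env::var("ANTHROPIC_API_KEY").is_ok() {
        Provider::Anthropic
    } else if env::var("OPENAI_API_KEY").is_ok() {
        Provider::OpenAi
    } else {
        // Local fallback: Ollama needs no API key.
        Provider::Ollama
    }
}

fn main() {
    println!("selected provider: {:?}", detect_provider());
}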

Core Concepts

Concept         | Description
Agent           | A target-specific adapter (database, k8s, server) that discovers resources and executes skills.
Skill           | A reversible chaos action. Every skill returns a RollbackHandle on success.
Experiment      | A named configuration of skills to run against a target, with a duration window.
Orchestrator    | Coordinates agent lifecycle: init, discover, execute, wait, rollback, complete.
RollbackHandle  | Opaque undo state returned by each skill. Popped in LIFO order during rollback.
EventSink       | Async consumer for experiment events — plug in tracing, Prometheus, or custom backends.
TargetDomain    | Enum: Database, Kubernetes, Server. Determines which agent handles the experiment.
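
A hedged sketch of how a few of these concepts might map onto Rust types. Field names are assumptions for illustration; only TargetDomain's variants (Database, Kubernetes, Server) come straight from the table above.

core-types-sketch.rs rust
use std::collections::HashMap;
use std::time::Duration;

// Illustrative types, not the actual chaos-core definitions.
#[allow(dead_code)]
#[derive(Debug)]
enum TargetDomain {
    Database,
    Kubernetes,
    Server,
}

#[derive(Debug)]
struct SkillInvocation {
    skill_name: String,
    params: HashMap<String, String>,
}

#[derive(Debug)]
struct Experiment {
    name: String,
    target: TargetDomain,
    skills: Vec<SkillInvocation>,
    duration: Duration,
}

fn main() {
    let experiment = Experiment {
        name: "postgres-quick-test".into(),
        target: TargetDomain::Database,
        skills: vec![SkillInvocation {
            skill_name: "db.insert_load".into(),
            params: HashMap::from([("rows_per_table".to_string(), "5000".to_string())]),
        }],
        duration: Duration::from_secs(5 * 60),
    };
    println!("{} -> {:?} ({} skills)", experiment.name, experiment.target, experiment.skills.len());
}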

Agent Lifecycle

Happy path:   Initializing → Discovering → Ready → Executing → RollingBack → Idle
Failure path: Executing → Failed → RollingBack
Tip: The orchestrator automatically transitions agents through these states. You only need to implement the Agent trait methods — the state machine is handled for you.
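
Here is a sketch of that state machine as a plain enum with a transition function. The real chaos-core state type and transition rules may differ; this only mirrors the happy path and the Executing → Failed branch shown above.

agent-state-sketch.rs rust
// Sketch of the lifecycle described above; not the actual chaos-core type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AgentState {
    Initializing,
    Discovering,
    Ready,
    Executing,
    RollingBack,
    Failed,
    Idle,
}

/// Happy path, plus the failure branch: Executing -> Failed -> RollingBack.
fn next_state(current: AgentState, failed: bool) -> AgentState {
    use AgentState::*;
    match (current, failed) {
        (Initializing, _) => Discovering,
        (Discovering, _) => Ready,
        (Ready, _) => Executing,
        (Executing, false) => RollingBack,
        (Executing, true) => Failed,
        (Failed, _) => RollingBack,
        (RollingBack, _) => Idle,
        (Idle, _) => Idle,
    }
}

fn main() {
    let mut state = AgentState::Initializing;
    while state != AgentState::Idle {
        println!("{state:?}");
        state = next_state(state, false);
    }
    println!("{state:?}");
}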

Roadmap

Adaptive Chaos
Agents that learn from past experiments and automatically escalate intensity toward failure boundaries.
Multi-Target Experiments
Coordinated chaos across database + cluster + server in a single experiment.
Observability Integrations
Stream events to Prometheus, Grafana, Datadog, and PagerDuty out of the box.
Cloud-Native Targets
AWS, GCP, Azure fault injection — Lambda throttling, S3 latency, IAM revocation.
Distributed Agent Mesh
Agents coordinating across regions to simulate real-world cascading failures.
Natural Language Experiments
Describe chaos in plain English — the LLM agent compiles it into a runnable plan.