Event-Driven Agents
Why reactive architectures are the missing primitive for multi-agent systems
The ReAct loop hits a wall
The dominant agent architecture today — ReAct (Reasoning + Acting) — runs a synchronous loop: observe state, invoke the LLM to reason, select a tool, execute it, feed the result back, repeat. Frameworks like LangGraph, CrewAI, and AutoGen all implement variations of this pattern. It works well for single-agent, single-task workflows. But it has a fundamental limitation: it's pull-based.
A pull-based agent must actively poll for changes. It checks a database, reads a queue, calls an API on a timer. Between polls, it's either idle (wasting a running process) or sleeping (introducing latency). When multiple agents need to coordinate, you end up building ad-hoc shared state — a database row that Agent A writes to and Agent B polls against. This is the exact coordination problem that distributed systems engineering solved decades ago with event-driven architecture. Agents just haven't caught up.
Event-driven architecture: a quick primer
Event-driven architecture (EDA) is a design pattern where components communicate by producing and consuming immutable events through a message broker. The core primitives are:
Events — immutable records of something that happened. An event has a type, a timestamp, a source, and a payload. "PR #472 opened on repo X" is an event. "Deployment d-8f3a failed on service Y" is an event. Events are facts, not commands.
Producers — components that emit events. A CI/CD pipeline produces deployment events. A monitoring system produces alert events. Producers don't know or care who consumes their events.
Consumers — components that subscribe to event types and react when events arrive. A consumer declares interest in specific event patterns and receives matching events via push delivery. No polling.
Broker — the intermediary that decouples producers from consumers. Apache Kafka, NATS JetStream, Redis Streams, AWS EventBridge, and RabbitMQ are all implementations of this pattern. The broker handles routing, buffering, replay, and delivery guarantees.
The key property is temporal decoupling. Producers and consumers don't need to be running at the same time. Events are persisted in the broker's log. A consumer that crashes and restarts can resume from its last committed offset without losing events. This is fundamentally different from synchronous RPC, where both parties must be available simultaneously.
Applying EDA to agents
An event-driven agent replaces the ReAct polling loop with an event subscription model. Instead of while True: observe() → think() → act(), the agent registers handlers for specific event types and remains dormant until an event arrives.
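The contrast can be sketched in a few lines. This is an illustrative sketch, not a real framework API: the function names, the 30-second interval, and the event shape are all assumptions.

```python
import time

# Pull-based: the process must stay alive and actively check for work.
# `fetch_new_events` and `handle` are hypothetical callables.
def polling_agent(fetch_new_events, handle):
    while True:
        for event in fetch_new_events():  # active poll
            handle(event)
        time.sleep(30)  # idle compute and added latency between polls

# Event-driven: a dormant handler the broker invokes on push delivery.
# No loop, no sleep; the process need not exist between events.
def on_pull_request_opened(event):
    diff = event["payload"]["diff"]
    # ... construct prompt, call the LLM, execute tool calls ...
    return f"reviewed {len(diff.splitlines())} changed lines"
```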
Concretely, an event-driven agent has three components:
1. Event subscriptions. The agent declares which event types it handles. A code-review agent subscribes to pull_request.opened and pull_request.synchronize. A deployment agent subscribes to ci.pipeline.passed. An incident responder subscribes to alert.firing with severity ≥ critical. These subscriptions are the agent's "trigger conditions" — equivalent to event triggers in AWS Lambda or Cloud Functions.
2. Event handler. When a matching event arrives, the agent's handler function is invoked. The handler receives the full event payload, constructs a prompt with the relevant context, calls the LLM, and executes any resulting tool calls. The handler is stateless by default — all state comes from the event payload and any external stores the agent queries. This is critical for horizontal scaling: any instance of the agent can handle any event.
3. Event emission. After processing, the agent can emit new events. A code-review agent that finishes reviewing a PR emits a review.completed event. A test agent that detects a flaky test emits test.flaky_detected. These downstream events can trigger other agents, creating event-driven workflows without explicit orchestration.
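The three components above fit in a minimal in-process sketch. The bus, the event names, and the review heuristic are illustrative stand-ins for a real broker and a real LLM call, not an actual API.

```python
from collections import defaultdict

subscriptions = defaultdict(list)  # 1. Subscriptions: event type -> handlers
emitted = []                       # stands in for the broker's event log

def subscribe(event_type):
    def register(handler):
        subscriptions[event_type].append(handler)
        return handler
    return register

def emit(event):
    emitted.append(event)
    for handler in subscriptions[event["type"]]:
        handler(event)

@subscribe("pull_request.opened")
def review_agent(event):
    # 2. Handler: stateless — everything it needs is in the payload.
    # (A size check stands in for the LLM call here.)
    verdict = "approve" if event["payload"]["diff_lines"] < 500 else "request-changes"
    # 3. Emission: publish a downstream fact, not a command.
    emit({"type": "review.completed", "payload": {"verdict": verdict}})

emit({"type": "pull_request.opened", "payload": {"diff_lines": 120}})
```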
Event-driven agents use choreography, not orchestration. No central controller sequences the workflow — agents react to events and emit events, and the workflow emerges from their composition.
Technical advantages over polling
The shift from poll-based to event-driven agents yields concrete technical gains.
Eliminated idle compute. A polling agent running a while True loop with a 30-second sleep consumes a process (or container) 24/7, even when there's nothing to do. An event-driven agent can use a serverless execution model — spin up on event arrival, process, shut down. For agents monitoring low-frequency signals (weekly reports, nightly builds), this reduces compute costs by orders of magnitude. The model maps directly onto FaaS platforms: AWS Lambda, Google Cloud Run, or Kubernetes KEDA-scaled deployments where pod count scales to zero when the event queue is empty.
Sub-second latency. With a polling interval of t seconds, average response latency is t/2. With event push delivery, latency is bounded by broker-to-consumer network time plus agent startup: typically under 50ms with warm instances, while serverless cold starts add the penalties discussed under failure modes below. For real-time use cases (incident response, live code review, production alerts), this difference matters.
Natural fan-out and fan-in. A single event can trigger multiple consumers in parallel — this is native pub/sub fan-out. A deployment.completed event can simultaneously trigger a smoke-test agent, a notification agent, and a metrics agent. No orchestrator needed. Fan-in (waiting for multiple events before proceeding) is handled via event aggregation patterns: the broker or a stateful processor collects events until a completion condition is met, then emits a composite event.
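A fan-in aggregator can be sketched as a small stateful processor. The event types, the composite event name, and the completion condition (one event of each required type) are assumptions for illustration; real brokers offer richer aggregation (windows, counts, timeouts).

```python
class FanInAggregator:
    """Collect one event per required type, then emit a composite event."""

    def __init__(self, required_types, on_complete):
        self.required = set(required_types)
        self.seen = {}
        self.on_complete = on_complete

    def receive(self, event):
        self.seen[event["type"]] = event
        # Completion condition: all required types observed at least once.
        if self.required <= set(self.seen):
            composite = {"type": "release.ready", "payload": dict(self.seen)}
            self.seen.clear()
            self.on_complete(composite)

results = []
agg = FanInAggregator({"smoke_test.passed", "metrics.healthy"}, results.append)
agg.receive({"type": "smoke_test.passed"})
agg.receive({"type": "metrics.healthy"})  # completes the set, emits composite
```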
Built-in durability and replay. Log-based brokers like Kafka and NATS JetStream persist events to disk with configurable retention. If an agent crashes mid-processing, it resumes from its last committed consumer offset on restart. If you deploy a new agent version, you can replay historical events against it for testing. This is impossible with polling architectures where "checking for changes" produces no durable record.
Backpressure handling. When events arrive faster than the agent can process them, the broker buffers them. The agent processes events at its own pace without dropping any. Consumer groups (Kafka) or queue-based subscriptions (NATS) allow horizontal scaling: add more agent instances to increase throughput. The broker handles partition assignment and rebalancing automatically.
Event schema design for agents
The quality of your event schemas directly determines agent effectiveness. A poorly designed event forces the agent to make additional API calls to gather context, increasing latency and cost. A well-designed event carries everything the agent needs to reason and act.
Use CloudEvents. The CloudEvents specification provides a standard envelope for events with required fields: id, source, type, specversion, and time. This gives you interoperability across brokers and tooling for free. Your agent doesn't need custom parsing logic for each event source.
Include full context in the payload. An event of type pull_request.opened should include the diff, file list, PR description, author, and base branch — not just a PR number that the agent has to resolve via the GitHub API. This is the "fat event" pattern. It trades storage for latency and reduces the agent's external dependencies at processing time.
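Put together, a fat pull_request.opened event in a CloudEvents-style envelope might look like this. All field values are illustrative; the point is that the agent can build its prompt from the payload alone, with no follow-up API call.

```python
# A "fat" event: CloudEvents envelope fields plus full review context.
event = {
    # CloudEvents required envelope fields
    "specversion": "1.0",
    "id": "a1b2c3d4",
    "source": "github.com/acme/repo-x",
    "type": "pull_request.opened",
    "time": "2025-01-15T09:30:00Z",
    # Fat payload: everything the agent needs, not just a PR number
    "data": {
        "number": 472,
        "title": "Add retry logic to uploader",
        "author": "jdoe",
        "base_branch": "main",
        "files": ["uploader.py", "tests/test_uploader.py"],
        "diff": "--- a/uploader.py\n+++ b/uploader.py\n...",
        "description": "Retries transient failures with backoff.",
    },
}

def prompt_for(event):
    # Prompt construction needs no external lookups.
    d = event["data"]
    return f"Review PR #{d['number']} ({d['title']}) against {d['base_branch']}:\n{d['diff']}"
```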
Use schema registries. As your event types grow, enforce schemas with a registry (Confluent Schema Registry for Kafka, or JSON Schema validation at the broker layer). Schema evolution rules (backward/forward compatibility) prevent breaking consumers when producers add or remove fields. Agents can trust that events conform to a known shape without defensive parsing.
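At consume time, schema enforcement can be as simple as rejecting events that lack required fields before the handler runs. The hand-rolled check below stands in for a real JSON Schema validator or registry client; the schema shape and field names are assumptions.

```python
# Minimal schema check standing in for a registry-backed validator.
REVIEW_COMPLETED_SCHEMA = {
    "required": {"id", "type", "source", "data"},
    "data_required": {"verdict", "pr_number"},
}

def validate(event, schema):
    missing = schema["required"] - set(event)
    data = event.get("data", {})
    missing |= {f"data.{k}" for k in schema["data_required"] - set(data)}
    if missing:
        # Reject before the handler runs, instead of defensive parsing inside it.
        raise ValueError(f"event violates schema, missing: {sorted(missing)}")
    return event
```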
Failure modes and mitigation
Event-driven agents introduce failure modes that don't exist in simple polling loops. Understanding them is essential.
Poison events. An event with a malformed payload or edge-case data can crash the agent's handler repeatedly. Without mitigation, the consumer retries the same event forever, blocking all subsequent events in the partition. The fix: implement a dead letter queue (DLQ). After n failed processing attempts, route the event to a DLQ topic for manual inspection and advance the consumer offset. In the Kafka ecosystem this is typically handled at the framework layer — Kafka Streams offers a pluggable DeserializationExceptionHandler, and client frameworks provide dead-letter-publishing error handlers. NATS JetStream has native MaxDeliver configuration.
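The retry-then-park logic can be sketched broker-agnostically. MAX_ATTEMPTS, the in-memory DLQ list, and the handler are all illustrative; in production the broker or framework does this for you.

```python
# Dead-letter sketch: bounded retries, then park the event and move on.
MAX_ATTEMPTS = 3
dead_letter_queue = []  # stands in for a DLQ topic

def consume(event, handler, attempts=MAX_ATTEMPTS):
    for _ in range(attempts):
        try:
            return handler(event)
        except Exception:
            continue  # retry up to the attempt budget
    # Poison event: park it for inspection instead of blocking the
    # partition forever, and let the consumer offset advance.
    dead_letter_queue.append(event)
    return None

def fragile_handler(event):
    return event["payload"]["value"]  # KeyError on malformed payloads

consume({"payload": {}}, fragile_handler)  # malformed -> lands in the DLQ
```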
Exactly-once processing. In distributed systems, exactly-once semantics are notoriously hard. If an agent processes an event, emits a downstream event, but crashes before committing its consumer offset, it will reprocess the same event on restart. For LLM-backed agents, this means duplicate inference calls and potentially duplicate side effects (duplicate PR comments, duplicate deployments). Mitigation: make handlers idempotent. Use the event's unique id field as a deduplication key. Before processing, check if the event ID has already been handled. Store processed event IDs in a fast lookup store (Redis SET or a Bloom filter for high-throughput scenarios).
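The dedup wrapper can be sketched as follows; an in-memory set stands in for Redis, and the handler is hypothetical. Note that marking the ID after the side effect still leaves a small crash window, so real systems pair this with an atomic check-and-set.

```python
# Idempotent handler sketch: dedup on the event's unique id.
processed_ids = set()  # stands in for a Redis SET
side_effects = []

def handle_once(event, handler):
    if event["id"] in processed_ids:
        return "skipped"  # already handled; redelivery is safe
    result = handler(event)
    processed_ids.add(event["id"])  # mark only after processing succeeds
    return result

def post_review_comment(event):
    side_effects.append(f"comment on PR {event['data']['pr']}")
    return "processed"

event = {"id": "evt-42", "data": {"pr": 472}}
handle_once(event, post_review_comment)
handle_once(event, post_review_comment)  # redelivery after a crash: no-op
```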
Event ordering. Kafka guarantees ordering within a partition, but not across partitions. If a PR receives three rapid pushes, the corresponding pull_request.synchronize events must be processed in order to avoid reviewing stale code. Solution: partition events by a stable key (repository + PR number) so all events for the same PR land in the same partition and are processed sequentially by a single consumer instance.
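Stable-key routing reduces to hashing the key to a partition number. The hash function and partition count below are illustrative (Kafka's default partitioner uses murmur2 on the key bytes); the property that matters is determinism.

```python
import zlib

NUM_PARTITIONS = 12  # illustrative partition count

def partition_for(repo, pr_number):
    # Stable key: same repo + PR always hashes to the same partition,
    # so that PR's events are consumed in order by one instance.
    key = f"{repo}#{pr_number}".encode()
    return zlib.crc32(key) % NUM_PARTITIONS

p1 = partition_for("acme/repo-x", 472)
p2 = partition_for("acme/repo-x", 472)  # deterministic: p1 == p2
```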
Cold start latency. Serverless execution models introduce cold start penalties — 1-5 seconds for container-based runtimes, 100-500ms for lightweight runtimes. For latency-sensitive agents, maintain a warm pool of pre-initialized instances. KEDA's minReplicaCount: 1 keeps at least one instance warm. For FaaS, provisioned concurrency (Lambda) or min-instances (Cloud Run) eliminate cold starts at the cost of baseline compute.
Multi-agent choreography
The real power of event-driven agents emerges in multi-agent systems. Consider a concrete workflow: automated code review and deployment.
Step 1: A GitHub webhook fires a pull_request.opened event. The broker receives it and routes it to subscribed consumers.
Step 2: The code-review agent receives the event, pulls the diff from the payload, invokes the LLM with the diff and coding standards, posts review comments via the GitHub API, and emits a review.completed event with verdict: approve/request-changes.
Step 3: If the verdict is "approve," the test-runner agent (subscribed to review.completed where verdict = approve) triggers the CI pipeline and waits for results. When CI completes, it emits ci.completed with pass/fail status.
Step 4: The deploy agent (subscribed to ci.completed where status = pass and branch = main) initiates a canary deployment, monitors error rates for 5 minutes, and emits deployment.completed or deployment.rolled_back.
No central orchestrator manages this workflow. Each agent operates independently, triggered by events from the agent before it. Adding a new agent (say, a security scanner after code review) requires zero changes to existing agents — you subscribe the new agent to pull_request.opened and it runs in parallel with the code-review agent. This is the open/closed principle applied to agent systems: the workflow is open for extension but closed for modification.
Our take
At System32, we treat events as the fundamental coordination primitive for agent systems. Our infrastructure is built on NATS JetStream for event routing (at-least-once delivery with message deduplication, paired with idempotent handlers to approximate exactly-once processing), CloudEvents for schema standardization, and KEDA for scaling agent instances to zero when idle and up under load.
The agent loop isn't dead — individual agents still reason and act in a loop within a single event handler invocation. But the system-level architecture is event-driven. Agents don't poll each other. They don't share mutable state. They emit events and subscribe to events. The result is a system that scales horizontally, recovers from failures automatically, and can be extended without modifying existing agents.
If you're building multi-agent systems with shared databases and polling loops, you're re-inventing problems that event-driven architecture solved twenty years ago. Use events. Your agents will be faster, cheaper, and easier to operate.
Join the conversation
Discuss event-driven agents and reactive AI systems with us on Discord.