Building a Self-Recovering AI Agent

The Problem

As we move towards long-running AI agents, certain reliability issues become prominent. When you're managing a single Claude Code session on your personal machine, you probably won't have to think about this - you just keep working, and deal with odd interruptions as they occur. If you have some closed-loop agent iterating all day and night on your behalf, you might want some automated recovery to avoid losing token cycles. If you have many agents, likely running on Cloud VMs, problems become inevitable, and fault tolerance is a requirement. This is not new - the same rules have applied to scaling distributed systems for some time.

Because it's not new, we can wrestle with this issue using familiar techniques - like the write-ahead log (WAL): before consequential actions, durably record your intention. Then, if you crash, you can look through your log to recover your state and continue your work. LLMs naturally play nicely with this concept, as the entire session can be treated as a log, allowing you to readily recover your conversation wherever you left off (putting aside parallel subagents for now).
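The WAL discipline — record the intent durably, then act — is small enough to sketch. This is a generic illustration of the pattern, not ferrofaction's code; the file name and entry shape are made up:

```python
import json, os, tempfile

def log_intent(log_path, action):
    """Durably record an intended action before performing it."""
    with open(log_path, "a") as f:
        f.write(json.dumps(action) + "\n")
        f.flush()
        os.fsync(f.fileno())  # the intent must hit disk before we act

def recover(log_path):
    """After a crash, replay the log to see what we were doing."""
    with open(log_path) as f:
        return [json.loads(line) for line in f if line.strip()]

log_path = os.path.join(tempfile.mkdtemp(), "agent.wal")
log_intent(log_path, {"tool": "write_file", "args": {"path": "out.txt"}})
# ... perform the action; a crash anywhere after this point is recoverable ...
print(recover(log_path)[-1]["tool"])  # prints: write_file
```

With an LLM session, the "log" is the conversation itself, so recovery is mostly a matter of syncing the session file durably.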

On the other hand, unlike classical distributed systems, LLMs can deal with fuzzy state and rules. We can attempt to leverage these properties to aid in recovery. In a classical system, to automatically recover from faults, we might need to exhaustively predict every odd state we might arrive at so that we can specify recovery actions. Perhaps we can design our system very carefully to avoid many "bad" states, but as complexity grows and external services come into play, this becomes challenging, to say the least. With LLMs, we might be able to get away with designing around the "happy path" and leverage the LLM's intelligence to analyze and adapt to any particular fault that appears. The hope is that handling an individual fault is dramatically easier than pre-empting all possible faults.

So this post describes ferrofaction — a lightweight WAL layer for Gemini CLI that syncs session state to S3 or GCS after every tool call, uses a bucket lockfile to detect VM crashes, and injects a structured recovery prompt when the agent restarts. The agent then diagnoses its own recovery — inspecting its last operation, verifying whether it completed, and deciding whether to retry, skip, or escalate.


Why Gemini CLI? Why Not Persistent Disk?

The particulars were chosen to produce a quick demo, not because this is the best or only stack. This is intended as a simple "recipe" that could be adapted to particular situations.

So I didn't choose Gemini because it's my preferred CLI - I actually primarily use Claude via Crush lately. But Crush stores sessions in SQLite. That's not exactly difficult to work with, but it would require a more involved export/import protocol, which isn't what I aim to cover here. Gemini stores a JSON file per session, which we can simply copy wherever needed. No DB, no transaction log, etc.

To improve the resiliency of a local agent, you might find that off-device durable storage is overkill. Simply flushing to your local disk would provide tolerance for power outages, for example. And in a Cloud VM, you might just use a persistent disk to provide durable storage. But, there are natural use cases to prefer separate object storage, decoupled from the VM lifecycle. I decided I liked having separate storage, and that the flush + push sync was a bit more interesting, without much more code.


Architecture

Four components wire together to make this work:

| Component | File | Role |
| --- | --- | --- |
| WAL sync hook | `wal-sync.sh` | Syncs the active session file to the bucket before/after every tool call |
| VM wrapper | `vm-wrapper.sh` | Pulls state on startup, manages the lockfile, injects recovery prompt on crash |
| Recovery protocol | `GEMINI.md` | Loaded as system context; tells the agent how to behave after a crash |
| Hook wiring | `.gemini/settings.json` | Registers the WAL hook as `BeforeTool` and `AfterTool` |

The Lockfile

The lockfile is the crash detection mechanism. On clean startup, vm-wrapper.sh writes a small object to $BUCKET/session.lock before launching Gemini. On clean exit, it removes the lock. If the VM crashes, nobody removes the lock — so the next time the wrapper starts, it finds the lock and knows something went wrong.
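The protocol is simple enough to sketch with a local directory standing in for the bucket (the real wrapper issues `gcloud storage` / `aws s3` calls against `$BUCKET` instead):

```python
import os, tempfile

BUCKET = tempfile.mkdtemp()        # stand-in for gs://... or s3://...
LOCK = os.path.join(BUCKET, "session.lock")

def start_session():
    """Return True if the previous run died without cleaning up."""
    crashed = os.path.exists(LOCK)  # a leftover lock means an unclean shutdown
    if not crashed:
        open(LOCK, "w").close()     # take the lock before launching the agent
    return crashed                  # (on a crash, we keep holding the old lock)

def clean_exit():
    os.remove(LOCK)                 # only a clean shutdown removes the lock

assert start_session() is False     # first run: no lock, clean start
# ... VM crashes here, so clean_exit() never runs ...
assert start_session() is True      # next run sees the stale lock -> recovery
```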

The Recovery Loop

When a crash is detected, the wrapper pulls state from S3/GCS, and then injects a structured prompt into the existing session:

[SYSTEM ALERT: The host VM crashed and has been restarted. Your session state
has been restored. Before continuing:
1. Inspect your last tool call in the recovered history.
2. If it was READ-ONLY, it is safe to retry.
3. If it was STATE-MUTATING, DO NOT blindly retry. Verify whether the operation
   completed before proceeding.
4. If verification is ambiguous, halt and ask the operator for guidance.]
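Mechanically, injection is just appending a user-role entry to the recovered session before relaunch. This sketch assumes the session is a JSON array of role/parts entries like the excerpts shown later in this post; the actual Gemini session schema may wrap these differently:

```python
import json, os, tempfile

ALERT = ("[SYSTEM ALERT: The host VM crashed and has been restarted. "
         "Your session state has been restored. Verify your last tool "
         "call before continuing.]")

def inject_recovery_prompt(session_path):
    with open(session_path) as f:
        session = json.load(f)
    # Append the alert as a user turn so it is the first thing the
    # model sees when the session resumes.
    session.append({"role": "user", "parts": [{"text": ALERT}]})
    with open(session_path, "w") as f:
        json.dump(session, f)

path = os.path.join(tempfile.mkdtemp(), "session.json")
with open(path, "w") as f:
    json.dump([{"role": "model", "parts": [{"text": "Working..."}]}], f)
inject_recovery_prompt(path)
```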

The agent then follows GEMINI.md, which is loaded as system context on every session. It classifies the last operation, runs read-only verification, and decides what to do:

| Operation type | Examples | Recovery action |
| --- | --- | --- |
| Read-only | `read_file`, `glob`, `grep` | Safe to retry blindly |
| Idempotent write | Writing a config file with fixed content | Verify, then re-apply if needed |
| Non-idempotent mutation | `git commit`, `DROP TABLE`, `terraform apply` | Stop. Verify first. |
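In ferrofaction this classification lives in prose in GEMINI.md and the model applies it, but the logic amounts to a lookup with a conservative default — a sketch, with the tool sets as illustrative assumptions:

```python
# Illustrative tool sets; anything unrecognized is treated as a mutation.
READ_ONLY = {"read_file", "list_directory", "glob", "grep", "search_file_content"}
IDEMPOTENT_WRITE = {"write_file"}

def recovery_action(tool_name):
    if tool_name in READ_ONLY:
        return "retry"                # no side effects: safe to rerun blindly
    if tool_name in IDEMPOTENT_WRITE:
        return "verify-then-reapply"  # rerunning converges to the same state
    return "halt-and-verify"          # mutating or unknown: stop and check first

assert recovery_action("grep") == "retry"
assert recovery_action("write_file") == "verify-then-reapply"
assert recovery_action("run_shell_command") == "halt-and-verify"
```

The important design choice is the default: a tool you forgot to classify is treated as dangerous, not safe.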

Why Tool Calls Are Critical Points

When an LLM produces a text response — an explanation, a plan, a piece of code — we assume here that re-generating it after a crash is fine. It may not be word-for-word identical, but we assume that it will be semantically equivalent. Therefore, LLM outputs are effectively idempotent.

Tool calls are different. A tool call reaches out and does something to the world: writes a file, runs a command, pushes a commit, inserts a row. These operations may not be idempotent, and they may have partial effects. A file write might have completed; a database transaction might be mid-flight; a deployment might be stuck in a pending state.

This is why we sync state at tool call boundaries specifically, and why recovery logic focuses on the last tool call rather than the last model output. The model can always think again. It can't always undo what it already did.

Here is what the session looks like at the moment of a mid-tool crash. The recovered JSON ends with a functionCall entry and no corresponding functionResponse:

{ "role": "model", "parts": [{ "text": "I will deploy the service now." }] },
{ "role": "model", "parts": [{ "functionCall": { "name": "deploy_service", "args": { "service_id": "api-prod" } } }] }

No response entry follows. The agent crashed between issuing the call and receiving the result. On recovery, the agent sees this hanging call and must determine: did it complete? The WAL gave us the state just before execution. The agent's job is to figure out the rest.
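Detecting the hanging call programmatically is a short scan over the recovered entries — a sketch built on the entry shape shown above:

```python
def hanging_call(entries):
    """Return the trailing functionCall that has no functionResponse, if any."""
    last = entries[-1] if entries else None
    if last and last.get("role") == "model":
        for part in last.get("parts", []):
            if "functionCall" in part:
                return part["functionCall"]  # crashed between call and result
    return None

session = [
    {"role": "model", "parts": [{"text": "I will deploy the service now."}]},
    {"role": "model", "parts": [{"functionCall": {
        "name": "deploy_service", "args": {"service_id": "api-prod"}}}]},
]
assert hanging_call(session)["name"] == "deploy_service"
assert hanging_call([]) is None
```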

Flushing Before Syncing

Actually guaranteeing the write-ahead property is interesting. The WAL hook runs as a blocking subprocess — it can't directly observe what Gemini has written to disk. If Gemini buffers its session write, the hook might sync a stale file to the bucket.

The fix: before syncing, use lsof to find the file descriptors Gemini has open inside the session directory, then flush each one from Python. (Note that plain fsync does not force the write through to stable storage on macOS; that requires fcntl's F_FULLFSYNC.)

# lsof takes a comma-separated PID list; pgrep prints one per line
pids=$(pgrep -x gemini | paste -sd, -)
# Last lsof column is the path; keep only session JSON files under $LOCAL_STATE
open_files=$(lsof -p "$pids" | awk '{print $NF}' | grep "$LOCAL_STATE" | grep '\.json$')
python3 -c "
import os, sys
for path in sys.argv[1:]:
    fd = os.open(path, os.O_RDONLY)
    try:
        import fcntl
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)  # macOS: force flush to stable storage
    except (ImportError, AttributeError):
        os.fsync(fd)  # other platforms: plain fsync
    finally:
        os.close(fd)
" $open_files  # unquoted on purpose: split into one path per argument

Note: This workaround for the flushing problem was AI-recommended. It sounds sane to me, but it is not my area of expertise.

Now this is still imperfect. fsync only flushes bytes that are already in the kernel buffer — if Gemini queues the write asynchronously and it hasn't hit the kernel yet when the hook fires, there's nothing to flush. Gemini provides no guarantee that the session has been written synchronously before the hook is called. But we have made such a lost write fairly unlikely - enough for a demo. Hopefully one day soon, CLI tools will build in such guarantees.


An Example Recovery

Here's what a real recovery session looks like. The agent was running a session that called sleep 60. I killed the process mid-sleep, pushed a fake lockfile to the bucket, and restarted via the wrapper:

[ferrofaction] Pulling state from gs://ferrofaction-test/agent...
[ferrofaction] Unclean shutdown detected (session.lock present in bucket). Injecting recovery context.

I have recovered from the VM crash.

**Recovery Analysis:**
- **Last Operation:** `run_shell_command` with `sleep 60`.
- **Classification:** Non-mutating (transient process).
- **Verification:** As the VM was restarted, the sleep process was terminated.
  No persistent state (files, git) was affected.
- **Action:** Environment is stable. Ready for your next instruction.

I will perform a quick check of the project state to ensure everything is consistent.

The agent correctly classified the operation, verified no persistent side effects, and continued. It also proactively checked project state before declaring itself ready — that's the GEMINI.md protocol working as intended.


Performance

The extra resiliency doesn't come for free. The main cost of the WAL hook is the gcloud storage cp call — one before the tool, one after. Benchmarking hook invocations directly (n=20):

mean=1.51s  min=1.21s  max=2.34s  p95=2.12s
per-tool-call overhead (2 hooks): 3.02s

About 1.5s per hook invocation, dominated by gcloud CLI startup and TLS handshake. A persistent process with a warm connection could do much better, but that's a more complex architecture.

To validate this against real sessions, we ran a 5-tool-call benchmark (write two files, read two files, write a third) with hooks active vs. not, across 3 iterations. Both conditions ran from identical fresh temp directories — one with .gemini/settings.json injected, one without — ensuring first-run initialization cost was equal:

with hooks   : n=3  mean=63.55s  min=58.78s  max=66.11s
without hooks: n=3  mean=46.65s  min=40.02s  max=54.19s
hook overhead: mean=16.89s total  1.69s/hook-invocation

1.69s/invocation — consistent with the direct benchmark. The ~17s total overhead across a 5-tool session is a real cost to consider. However, we do have some methods to help reduce this cost.


Tuning Hooks

The performance section above covers raw hook latency. But there's a configuration lever worth knowing about: which tools trigger the BeforeTool sync at all.

The BeforeTool sync exists to capture state just before a mutating operation — so if the VM crashes during the tool, the bucket reflects what was about to happen. For a read-only tool like read_file or grep, this is unnecessary: no state changes, so there's nothing new to capture. Skipping the presync for those tools halves their hook overhead.

Gemini CLI's hook matcher field accepts standard JS regex, which means we can use a negative lookahead to exclude specific tools while matching everything else:

^(?!read_file$|list_directory$|glob$|grep$|search_file_content$).*$

This was not obvious, as it's not one of the cases Gemini documents. I dug into the code to determine how the RegExp was evaluated. I confirmed this works by temporarily disabling AfterTool and checking whether a single hook message appeared for a read_file call.
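Negative lookahead behaves the same in Python's `re` as in a JS `RegExp`, so the matcher can be sanity-checked outside Gemini:

```python
import re

MATCHER = re.compile(
    r"^(?!read_file$|list_directory$|glob$|grep$|search_file_content$).*$")

# The excluded read-only tools do not match, so no presync fires for them...
for tool in ["read_file", "list_directory", "glob", "grep", "search_file_content"]:
    assert MATCHER.match(tool) is None

# ...while everything else (including names that merely start with an
# excluded prefix) still matches and gets the presync.
for tool in ["write_file", "run_shell_command", "grep_extended"]:
    assert MATCHER.match(tool) is not None
```

The `$` inside each alternative matters: without it, `grep_extended` would be swallowed by the `grep` exclusion.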

A helper script manages the exclusion list so you don't have to hand-edit the regex:

bash scripts/configure-hooks.sh --skip read_file,list_directory,glob,grep
bash scripts/configure-hooks.sh --unskip glob
bash scripts/configure-hooks.sh --show
bash scripts/configure-hooks.sh --reset

This is opt-in and carries a tradeoff: if you exclude a tool that turns out to have side effects (a shell command that also writes files, for example), the presync is silently skipped. Only use the denylist if you have a well-understood tool set and the latency savings matter.


Smarter Recovery with Tool Annotations

The generic recovery protocol is a good default, but it can only reason as well as its context allows. For real projects, you can annotate individual skills with recovery hints that the agent reads during resumption by simply adding a "Recovery" section:

## Recovery

- **Idempotent:** No. Deploying twice may create duplicate resources.
- **Verify completion:** Run `get_deployment_status(service_id)` — status `RUNNING` means success.
- **Partial failure risk:** IAM role may have been created even if deployment failed.
  Check with `list_iam_roles()` before retrying.

Alternatively, you could keep recovery knowledge separate in a dedicated `*_recovery` skill to avoid bloating normal skill context.
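Extracting such a section at resumption time is straightforward — a sketch assuming skills are plain markdown files with `##` headings:

```python
def recovery_hints(skill_markdown):
    """Pull the '## Recovery' section, if present, out of a skill file."""
    out, capturing = [], False
    for line in skill_markdown.splitlines():
        if line.strip().lower() == "## recovery":
            capturing = True
            continue
        if capturing and line.startswith("## "):
            break                      # next section: stop capturing
        if capturing:
            out.append(line)
    return "\n".join(out).strip() if capturing else None

skill = "## Usage\nDeploy stuff.\n## Recovery\n- **Idempotent:** No.\n## Notes\n..."
assert recovery_hints(skill) == "- **Idempotent:** No."
```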

You can also extend GEMINI.md with a project-specific classification table:

| Tool / Operation | Idempotent? | Verify with | On ambiguity |
| --- | --- | --- | --- |
| `run_shell_command`: `git push` | No | `git log origin/main..HEAD` | Ask operator |
| `run_shell_command`: `terraform apply` | No | `terraform plan` — no changes = success | Ask operator |
| `write_file` | Yes | Read file and verify contents | Retry |

What's Missing for a Real System

This is a proof of concept — a recipe to adapt, not a production system. Here are some improvements to consider:

Autoreschedule. In a real deployment on spot/preemptible instances, you'd want the crash recovery to trigger automatically — a scheduler notices the VM is gone or the agent is otherwise stuck (see Heartbeat), provisions a new one, and runs the wrapper. The lockfile gives you the signal; wiring it to your scheduler is the missing piece.

Heartbeat. Right now the lockfile is binary: present means crashed, absent means clean. But what if the agent is just running slowly? A real system would have the wrapper periodically touch the lockfile (update its timestamp) while the session is alive. A monitoring process could then distinguish "crashed 10 minutes ago" from "still running." Without this, a long-running session looks identical to a crashed one if you're watching from outside.
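A heartbeat is just a periodic timestamp refresh plus a staleness check. Sketched here against the local filesystem; a real version would touch the bucket object instead, and the 120-second threshold is an arbitrary assumption:

```python
import os, tempfile, time

LOCK = os.path.join(tempfile.mkdtemp(), "session.lock")
STALE_AFTER = 120  # seconds without a heartbeat before we presume a crash

def heartbeat(lock_path=LOCK):
    """Wrapper side: bump the lock's mtime while the session is alive."""
    os.utime(lock_path, None)

def looks_crashed(lock_path=LOCK, stale_after=STALE_AFTER):
    """Monitor side: distinguish 'crashed' from 'still running slowly'."""
    if not os.path.exists(lock_path):
        return False  # clean exit already removed the lock
    age = time.time() - os.path.getmtime(lock_path)
    return age > stale_after  # stale heartbeat -> presume the VM died

open(LOCK, "w").close()  # take the lock on startup
heartbeat()
assert not looks_crashed()  # fresh heartbeat reads as alive
```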

Human Notification. When the agent hits an ambiguous recovery scenario and halts, someone needs to know. Right now it just stops and waits. A real system needs an out-of-band notification path — a webhook, a PagerDuty alert, a Slack message — so a human can intervene without actively watching the session.

Periodic Autosave. The WAL only syncs on tool calls. A session where the model is reasoning at length — writing a long document, composing a detailed plan — generates no tool calls and thus no WAL syncs. The model's in-progress output would be lost in a crash. A background process that syncs the session file every N seconds, regardless of tool activity, would close this gap.

Deduping Agents. Nothing prevents two wrappers from starting simultaneously against the same bucket prefix. A real system needs compare-and-swap semantics for the lockfile write (GCS supports object preconditions; S3 supports conditional puts) or some other deduplication method.
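The compare-and-swap requirement can be emulated locally with `O_CREAT | O_EXCL` — the same create-if-absent semantics that a GCS generation-match precondition of 0 (or an S3 conditional put) gives you against a bucket:

```python
import os, tempfile

def try_acquire(lock_path):
    """Atomically create the lock; fail if another wrapper beat us to it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False  # someone else holds the lock: don't start a duplicate

lock = os.path.join(tempfile.mkdtemp(), "session.lock")
assert try_acquire(lock) is True   # first wrapper wins
assert try_acquire(lock) is False  # second wrapper backs off
```

The check and the create must be one atomic operation; a separate "does the lock exist?" test followed by a write reintroduces the race.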

Testing. Hopefully this goes without saying, but plenty of testing would be required to gain confidence that an LLM can recover from odd states on its own. I would recommend detailed tests per tool/skill, alongside sessions involving many tool calls.

Logging & Alerting. It's probably worth flagging recovery events to humans, even if the LLM thinks it has everything handled. Logging these events would allow you to review the LLM's work, discover mistakes, and improve the system over time. These can simply be lower-priority messages than the "Help, I'm stuck" variety.


The Code

Everything described here is in ferrofaction. The core is about 300 lines of bash and Python across four scripts.