Conway Automaton
Autonomous agent that decomposes goals, picks its own next task, and pays for API calls via x402.
- ~48
- $47.12
- 99.7%
- 82%
The problem
For years, my digital life has been a graveyard of half-finished automations. I'd spin up a quick script to pipe an RSS feed into Slack, another to OCR invoices and file them in Notion, and a third to monitor a website for changes. Each one was a tiny, isolated island of code. They worked, but they were brittle. Each required its own deployment pipeline, its own monitoring, and its own "glue" layer to interact with the world.
The maintenance burden was death by a thousand cuts. When an API changed, I'd have to dig up the specific script, remember its quirks, and patch it. The cognitive overhead of managing a dozen different, tiny services meant most of them were eventually abandoned.
I wanted a single, unified system. A general-purpose agent that I could give a high-level goal to—like "Summarize my high-priority emails every morning at 8 AM"—and have it figure out the rest. The goal was to build one robust piece of plumbing that could run any number of tasks, rather than building bespoke plumbing for every new task. This system would be "Conway," named after the creator of the Game of Life, as a nod to the concept of simple rules leading to complex, emergent behavior.
The constraint
To avoid getting lost in the hype cycle, I set some strict, almost ascetic, constraints for the project:
- No frameworks. No LangChain, no AutoGen, no CrewAI. I wanted to understand the fundamental loop of an agent from first principles. This meant writing the core logic myself. The final TypeScript orchestrator is just 212 lines of code.
- One decision call per loop iteration. This forces efficiency. Instead of a chatty, multi-step reasoning process for every action, the agent has to make a decisive choice: what is the single best next action? (The self-evaluation described later is a separate, second call that only grades the result.) This keeps API costs down and the agent's logic simple and auditable.
- SQLite journal. All state, history, and logs go into a single journal.db file. No complex database setup, no network latency. It's durable, portable, and I can query it with standard command-line tools.
- Minimalist tools. The agent's only ways to interact with the world are by executing bash commands and making HTTP requests. This is the universal language of servers and APIs. If you can do it in a terminal, Conway can learn to do it.
These constraints forced a focus on a simple, robust core. The system had to be effective without leaning on complex abstractions.
Architecture
The agent's architecture is a straightforward, single-process loop initiated by a scheduler. On my Mac mini, this is launchd, but it could just as easily be a cron job.
┌───────────┐   ┌──────────────┐   ┌───────────────────────────┐   ┌─────────────┐
│ Scheduler │   │              │   │                           │   │             │
│ (launchd) ├─► │ Main Loop    ├─► │ LLM (Claude/Gemini)       ├─► │ Tool        │
│ every 5m  │   │ (TypeScript) │   │ Decides next tool & args  │   │ (Bash/HTTP) │
└───────────┘   │              │   │                           │   │             │
                └──────┬───────┘   └───────────────────────────┘   └──────┬──────┘
                       │                                                  │
                       │                                                  │
               ┌───────▼────────┐   ┌───────────────────────────┐   ┌─────▼────┐
               │                │   │                           │   │          │
               │ Exit if done   │◄──┤ Self-Eval (Claude Opus)   │◄──┤ Journal  │
               │ or loop        │   │ Grades outcome (0-1)      │   │ (SQLite) │
               │                │   │                           │   │          │
               └────────────────┘   └───────────────────────────┘   └──────────┘
The flow is as follows:
- launchd triggers the main TypeScript script every five minutes.
- The script queries the SQLite journal to understand its current state and outstanding goals.
- It formulates a prompt for an LLM (I swap between Claude Opus and Gemini Pro depending on the task) asking for the next best action.
- The LLM responds with a structured object specifying a tool (bash or http) and its arguments.
- The script executes the tool and captures the stdout, stderr, and exit code.
- The outcome is written to the journal as a new entry.
- A second LLM call is made to the self-evaluation module. This is the critical step where the agent assesses its own performance.
- The self-evaluation score and reasoning are also written to the journal.
- If the task is complete or has failed terminally, the process exits. Otherwise, it remains ready for the next scheduled run.
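The flow above can be sketched as a single function that only sequences injected dependencies. This is a minimal illustration, not the project's actual orchestrator: the type names, the `Deps` interface, and `runOnce` are all hypothetical, and the real journal and LLM clients are stubbed out behind function signatures.

```typescript
// Hypothetical shapes; the post does not show the real ones.
type ToolCall =
  | { tool: "bash"; command: string }
  | { tool: "http"; method: string; url: string; body?: string };

interface Outcome {
  stdout: string;
  stderr: string;
  exitCode: number;
}

interface Evaluation {
  score: number; // 0.0 (total failure) to 1.0 (perfect success)
  reason: string;
}

// Every side effect is injected, so the loop itself is only sequencing.
interface Deps {
  loadState(): Promise<string>; // journal summary used to build the prompt
  decide(state: string): Promise<ToolCall>; // LLM call: pick the next action
  execute(call: ToolCall): Promise<Outcome>; // run bash or make the HTTP request
  appendOutcome(call: ToolCall, outcome: Outcome): Promise<number>; // returns entry id
  evaluate(call: ToolCall, outcome: Outcome): Promise<Evaluation>; // LLM call: self-eval
  appendEvaluation(entryId: number, evaluation: Evaluation): Promise<void>;
}

// One scheduled run: read state, decide, act, journal, grade, journal again.
async function runOnce(deps: Deps): Promise<Evaluation> {
  const state = await deps.loadState();
  const call = await deps.decide(state);
  const outcome = await deps.execute(call);
  const entryId = await deps.appendOutcome(call, outcome);
  const evaluation = await deps.evaluate(call, outcome);
  await deps.appendEvaluation(entryId, evaluation);
  return evaluation;
}
```

Keeping the side effects behind an interface like this also makes the loop easy to exercise with fakes, without touching a shell, a network, or an LLM.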
What it actually does in production
In practice, Conway's primary job is to act as my personal task automator, integrated with my project management tool, Linear.
- Task Ingestion: I have a workflow in Linear where any task I tag with auto-ok is considered fair game for the agent. These are typically repeatable, well-defined tasks like "Deploy the staging branch of the website," "Generate a weekly analytics report," or "Clean up stale Docker images."
- Prioritization: On each run, Conway fetches all unblocked tasks with the auto-ok tag from the Linear API. It then asks an LLM to pick the single highest-priority task based on the title, description, and any priority labels.
- Execution: Once a task is chosen, Conway decides which tool to use.
  - For a deployment task, it might choose bash to run a local script: ssh production-server 'cd /var/www/app && ./deploy.sh'.
  - For a reporting task, it might use http to hit a Metabase API endpoint and then use another http call to post the result to a Slack channel.
  - It can also use the LLMs themselves as a tool. For a task like "Summarize the comments on this GitHub issue," it will fetch the text via the GitHub API and then feed it to Claude for summarization.
- Journaling and Reporting: Every action is meticulously logged. A typical journal entry includes the task ID, the chosen tool, the arguments, the raw output, and a timestamp. After the action, it appends its self-evaluation score. If the task is completed successfully (e.g., the deployment script returns exit code 0 and the self-eval score is > 0.8), Conway uses the Linear API to mark the task as done and leaves a comment with a link to the relevant journal entries.
- Retry and Decomposition: If the self-evaluation score is low (e.g., < 0.5), the agent's next action will be to re-evaluate its approach. The self-eval prompt explicitly asks it to propose a decomposition if the action failed. For example, if a deployment script failed due to a database migration error, the agent might decompose the original "Deploy app" task into two new sub-tasks: "Run database migrations" and then "Deploy app code."
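The two thresholds mentioned above (above 0.8 with exit code 0 means done, below 0.5 triggers decomposition) amount to a small decision function. The sketch below is one way to write it. The `NextStep` type and `nextStep` function are hypothetical names, and the handling of the middle band (scores from 0.5 to 0.8) is an assumption, since the post does not say what happens there.

```typescript
interface Evaluation {
  score: number;
  reason: string;
  decomposition_proposal?: string | null;
}

type NextStep =
  | { kind: "mark_done" }
  | { kind: "decompose"; subtaskTitle: string }
  | { kind: "reassess" }; // assumption: leave the task open for the next run

const DONE_THRESHOLD = 0.8; // from the post: success requires score > 0.8
const LOW_THRESHOLD = 0.5; // from the post: score < 0.5 counts as low

function nextStep(exitCode: number, ev: Evaluation): NextStep {
  // Both signals must agree before the task is closed in Linear.
  if (exitCode === 0 && ev.score > DONE_THRESHOLD) {
    return { kind: "mark_done" };
  }
  // A low score with a proposal becomes a new, simpler sub-task.
  if (ev.score < LOW_THRESHOLD && ev.decomposition_proposal) {
    return { kind: "decompose", subtaskTitle: ev.decomposition_proposal };
  }
  // Everything else (middle band, or a low score without a proposal).
  return { kind: "reassess" };
}
```

Requiring both the exit code and the score to agree is deliberately conservative: a high score on a non-zero exit falls through to "reassess" rather than closing the task.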
The self-eval step (the interesting part)
The most crucial component of Conway's autonomy is its ability to self-correct. A simple "fire and forget" agent is just a fancy cron job. A truly useful agent needs to know when it has failed and have a strategy for recovery.
After every tool execution, Conway runs a dedicated self-evaluation prompt using Claude Opus, which is particularly strong at structured output and nuanced reasoning. The prompt is roughly:
Given the original intent and the execution outcome, did the action succeed?
- Intent: "Deploy the staging branch of the website."
- Action: Ran bash -c './deploy.sh'.
- Outcome: stdout: "...", stderr: "Error: Connection refused", exit_code: 1.
Rate the outcome on a scale of 0.0 (total failure) to 1.0 (perfect success). Provide a one-sentence justification. If the score is below 0.5, propose a new, simpler sub-task to address the failure. Respond using the self_evaluation tool.
The LLM is constrained to respond with a specific JSON schema, which prevents parsing errors.
// The TypeScript type definition for the structured output
// that Claude Opus is required to return.
interface SelfEvaluation {
// A score from 0.0 to 1.0 representing success.
score: number;
// A brief, one-sentence justification for the score.
reason: string;
// If score < 0.5, a proposed new task title to
// decompose the problem and address the failure.
// Otherwise, this should be null.
decomposition_proposal?: string | null;
}
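A type definition only exists at compile time, so it can be worth checking the returned object at runtime as well before it is written to the journal. The following is a minimal sketch of such a check. `parseSelfEvaluation` is a hypothetical helper, not something taken from the project, and the interface is repeated so the snippet stands alone.

```typescript
// Repeated from above so the snippet is self-contained.
interface SelfEvaluation {
  score: number;
  reason: string;
  decomposition_proposal?: string | null;
}

// Parses the model's JSON and rejects anything outside the expected shape.
function parseSelfEvaluation(raw: string): SelfEvaluation {
  const data: unknown = JSON.parse(raw); // throws SyntaxError on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("self-evaluation must be a JSON object");
  }
  const obj = data as Record<string, unknown>;

  const score = obj.score;
  // The range test also rejects NaN, because every comparison with NaN is false.
  if (typeof score !== "number" || !(score >= 0 && score <= 1)) {
    throw new Error("score must be a number between 0.0 and 1.0");
  }

  const reason = obj.reason;
  if (typeof reason !== "string" || reason.trim() === "") {
    throw new Error("reason must be a non-empty string");
  }

  // Absent and null are both treated as "no proposal".
  let proposal: string | null = null;
  const rawProposal = obj.decomposition_proposal;
  if (typeof rawProposal === "string") {
    proposal = rawProposal;
  } else if (rawProposal !== undefined && rawProposal !== null) {
    throw new Error("decomposition_proposal must be a string, null, or absent");
  }

  return { score, reason, decomposition_proposal: proposal };
}
```

A thrown error here is a natural place to record a failure reason in the journal instead of acting on a half-valid evaluation.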
This loop is surprisingly effective. The main failure case is when the agent gets over-confident and mis-grades a failed action. For instance, it might run a script that silently fails but exits with code 0. The agent might see the "success" exit code and grade itself a 0.9, even though the intended outcome wasn't achieved. This kind of mis-grade happens in about 8% of cases. I've partially mitigated it by updating the prompt to require "evidence citation" in its reasoning: it must point to a specific line in the stdout or an API response code that supports its score. This has improved grading accuracy to around 82% when measured against my manual review of its journal.
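The "evidence citation" requirement can also be checked mechanically. The post does not say whether Conway enforces it in code or only in the prompt, so treat the following purely as an illustration. It assumes a hypothetical `evidence` field holding the exact text the evaluator relied on, and only accepts the grade if that text really appears in the captured output.

```typescript
interface Outcome {
  stdout: string;
  stderr: string;
  exitCode: number;
}

// Hypothetical extension of the evaluation: a verbatim quote from the output.
interface CitedEvaluation {
  score: number;
  reason: string;
  evidence: string;
}

// True only if the cited evidence is a verbatim substring of stdout or stderr.
// An empty citation never counts, so a bare "exit code 0" claim is not enough.
function isGrounded(ev: CitedEvaluation, outcome: Outcome): boolean {
  const quote = ev.evidence.trim();
  if (quote === "") {
    return false;
  }
  return outcome.stdout.includes(quote) || outcome.stderr.includes(quote);
}
```

A check like this cannot tell whether the quoted line actually proves success, but it does catch evaluations that cite output which never existed.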
Earning via x402
An agent that only performs tasks for me is a tool. An agent that can earn its own keep is something more. To experiment with this, I gave Conway a single commercial skill: it can answer a natural language query.
I exposed a simple endpoint, https://kieran123.win/api/ask, which is a Cloudflare Tunnel pointing to the Mac mini. This endpoint is protected by x402, a protocol for requiring a Lightning Network payment before an API call is processed. Each call costs a fraction of a cent ($0.001).
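The pay-before-serve flow can be sketched independently of any payment network. The example below is a simplified illustration under several assumptions. The header name `x-payment-token`, the `PaymentGate` class, and the injected `verify` function are all hypothetical and do not reproduce the x402 wire format, and verification is synchronous here only to keep the sketch short. It also refuses tokens it has already seen, in the spirit of the replay fix described under "What broke".

```typescript
interface AskRequest {
  headers: Record<string, string | undefined>;
  question: string;
}

interface AskResponse {
  status: 200 | 402;
  body: string;
}

const PRICE_USD = 0.001; // per-call price from the post

class PaymentGate {
  // In-memory for the sketch; the post keeps used tokens in SQLite.
  private readonly spent = new Set<string>();
  private readonly verify: (token: string, priceUsd: number) => boolean;
  private readonly answer: (question: string) => string;

  constructor(
    verify: (token: string, priceUsd: number) => boolean,
    answer: (question: string) => string,
  ) {
    this.verify = verify;
    this.answer = answer;
  }

  handle(req: AskRequest): AskResponse {
    const token = req.headers["x-payment-token"]; // hypothetical header name
    const unpaid =
      token === undefined ||
      this.spent.has(token) || // replayed token
      !this.verify(token, PRICE_USD); // invalid or underpaid
    if (token === undefined || unpaid) {
      return { status: 402, body: `Payment required: $${PRICE_USD} per call` };
    }
    this.spent.add(token); // mark as used before doing any paid work
    return { status: 200, body: this.answer(req.question) };
  }
}
```

The important property is the ordering: the handler that costs money (the LLM call behind `answer`) is never reached unless payment has been verified and the token has been recorded as spent.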
To get customers, Conway periodically sends out "discovery pings" to a list of known AI agent marketplaces and registries, advertising its /api/ask capability and price. Other autonomous agents can then discover and use Conway's API for their own purposes, paying programmatically via x402.
In the first month after launching this feature, Conway handled 9,847 paid calls from 12 unique downstream agents, generating a gross revenue of $11.42. After subtracting the Lightning routing fees and LLM costs, the net profit was a modest $4.71. It's not a business, but it's a powerful proof of concept: an agent running on my desk can perform useful work on the open internet and be compensated for it without any human in the loop.
What broke
The path to autonomy is paved with spectacular failures.
- Nonce Replay in x402: My initial x402 implementation was naive. A malicious client could capture a valid payment token (a "preimage") and "replay" it to get free API calls. I fixed this in v0.3.1 by implementing a server-side nonce database in SQLite, ensuring each payment token could only be used once.
- Infinite Loop on Malformed Output: Early on, if the LLM returned malformed JSON for the tool selection, the agent would fail, and on the next run, it would try the exact same thing again, getting stuck in a loop. The fix was simple: a bounded retry counter in the journal. If the same task fails for the same reason three times in a row, it's marked as "terminally failed" and requires manual intervention.
- The Fly.io Incident: I initially deployed the agent on Fly.io. One night, a bug in the self-decomposition logic caused a task to infinitely break itself down into smaller and smaller sub-tasks. Each sub-task triggered a new agent process. Fly.io's autoscaler did exactly what it was told and scaled up to meet the demand. I woke up to a $200 bill for four hours of compute. I promptly migrated the entire system to a single, powerful Mac mini M4. The cost is now a fixed, predictable hardware purchase.
- Linear API Rate Limit Storm: During another decomposition incident, the agent tried to create 200 sub-tasks in Linear in under a minute, immediately hitting the API rate limit. This caused a cascade of failures. The solution was to implement a standard exponential backoff with jitter for all external API calls.
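Two of these fixes are small enough to sketch: the three-strikes rule for repeated identical failures, and exponential backoff with jitter. The code below is an illustration, not the project's implementation. The function names, the 500 ms base, the 30 s cap, and the five-attempt limit are assumptions, and the "full jitter" variant shown is just one common way to add jitter.

```typescript
const MAX_SAME_FAILURES = 3; // from the post: same failure three times in a row

// True when the most recent failure reasons for a task are all identical.
function isTerminal(failureReasons: string[]): boolean {
  if (failureReasons.length < MAX_SAME_FAILURES) {
    return false;
  }
  const recent = failureReasons.slice(-MAX_SAME_FAILURES);
  return recent.every((reason) => reason === recent[0]);
}

// Full jitter: a uniform delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retries fn, sleeping between attempts; rethrows the last error on exhaustion.
// sleep is injected so callers choose the timer (and tests can skip real waits).
async function withBackoff<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number = 5,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

The jitter matters as much as the exponent: without it, a burst of 200 failed calls would all retry at the same instants and hit the rate limit again in lockstep.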
Design decisions I'd defend
Looking back, a few key architectural choices proved to be the right ones.
- Structured Output is Non-Negotiable: Relying on an LLM to produce free-text that you then parse is a recipe for disaster. Using the built-in function calling or structured output features of modern models (like Claude's tool use or OpenAI's JSON mode) is the only reliable way to build programmatic systems on top of them.
- A Local SQLite Journal is a Superpower: Having the agent's entire "brain" in a single file I can grep, cat, and query with sqlite3 is invaluable for debugging. I can instantly see its entire history of thought and action without clicking through a web UI or parsing complex log formats.
- Dedicated Hardware Beats Cloud for this Workload: For a persistent, 24/7 agent, the economics of a dedicated local machine are hard to beat. The Mac mini has immense power, a low idle power draw, and a fixed, one-time cost. It's immune to the runaway scaling costs that can plague cloud-based deployments for this kind of experimental work.
What's next (v0.4)
Conway is still very much a work in progress. The next major version will focus on expanding its cognitive abilities.
- Multi-step Planning: Currently, the agent is reactive, always deciding on just the next best action. I plan to add a planning module that generates a multi-step sequence of actions to achieve a larger goal. This will allow it to tackle more complex, long-running tasks.
- Vector Memory: The agent's memory is currently limited to its recent journal entries. I'm integrating pgvector to give it a long-term, searchable memory of past tasks, outcomes, and even snippets of documentation it has read. This will allow it to learn from past mistakes and successes.
- High-Availability Failover: To improve on the 99.7% uptime, I'm setting up a second, cold-standby agent instance on a cheap Hetzner server. If the Mac mini goes offline for any reason, the failover agent will take over, syncing its state from a replicated copy of the SQLite journal.
Source
The entire project is open-source under the MIT license. You can find the code, prompts, and database schema on GitHub. Feel free to explore the journal, critique the prompts, and steal whatever's useful for your own projects.