Service · 02 — AI agents
AI agents that do work, not demos.
Most “AI agents” are a system prompt and a single API call dressed up as a product. Real agents close loops: they read the inputs you actually have, call the tools you actually use, and leave behind the receipts so you can trust what they did.
What you get
- An agent designed for one job — not a kitchen sink. Specific inputs, specific tools, specific success criteria, specific failure handling.
- A real eval harness — not vibes. Test cases for the inputs you care about, regression checks, and a confidence number you can put in a PR description.
- Tool calls you can audit — every external call logged with inputs, outputs, and reasoning. No black box (a minimal sketch of the logging and eval pattern follows this list).
- Local-first where the data is sensitive — agents that run on your hardware (or via Termux on your phone) instead of streaming everything to a vendor.
- A SKILL.md or README a junior engineer can extend — the next person on this should not need to call the studio.
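To make the eval and audit bullets concrete, here is a minimal sketch of the pattern in Python. Every name in it (`log_call`, `run_evals`, `agent_audit.jsonl`) is illustrative, not the studio's actual code:

```python
import json
import time

def log_call(tool, args, result, reasoning, path="agent_audit.jsonl"):
    """Append one auditable record per external tool call: what ran, with what, and why."""
    record = {"ts": time.time(), "tool": tool, "args": args,
              "result": result, "reasoning": reasoning}
    with open(path, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")  # default=str keeps odd types loggable

def run_evals(agent, cases):
    """cases is a list of (input, check) pairs; returns the pass rate you would quote in a PR."""
    passed = sum(1 for inp, check in cases if check(agent(inp)))
    return passed / len(cases)
```

In practice the check functions assert on the artifacts the agent leaves behind (the database row, the drafted reply), not on the model's prose.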
Recent work in this space
Real, in-the-open references — not hypotheticals:
- OpenClaw Droid — an OpenClaw AI Gateway optimized to run inside Termux on Android. Agentic workflows on a phone, no cloud round-trip required for the orchestration layer.
- Presearch Search Skill — a clean, minimal SKILL.md wrapping the Presearch API so an agent can do privacy-preserving web search without leaking queries to ad-tech.
- Custom client agents — covered by NDA, but we’ll walk you through the architecture on a call.
How scoping works
Most agent projects fail at scoping, not engineering. The first conversation is short and concrete:
- What inputs does the agent get? Email body? CSV? A user’s typed prompt? An API event?
- What does “done” look like? A row in a database? A reply to the human? A PR opened? A Slack message? Something the user clicks “approve” on?
- What’s the worst-case failure? Wrong answer? Spends money? Sends a bad email? This drives how much guardrail logic the agent needs.
- Who watches it? Fully autonomous, human-in-the-loop, or batch-with-review? (A minimal approval-gate sketch follows below.)
Bring rough answers to those four and you have most of a scope already.
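The last two questions usually become an approval gate: low-risk actions run autonomously, and anything that spends money or messages a human waits for sign-off. A sketch of that shape, with a hypothetical `action` dict and callbacks:

```python
def gated(action, execute, notify, approve):
    """Run low-risk actions directly; hold anything risky for a human decision."""
    if action.get("risk") == "low":
        return execute(action)                      # autonomous path
    notify(f"Agent wants to: {action['summary']}")  # Slack message, email, review queue
    return execute(action) if approve(action) else None
```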
Honest answers
Will it use OpenAI / Anthropic / Google?
Whatever fits the job. The studio has no allegiance — Claude for long context and careful reasoning, GPT-4 class for tool use, smaller open models for privacy-sensitive work. We’ll recommend the cheapest model that passes the eval, not the flashiest.
What about hallucinations?
Hallucinations are a scoping problem more often than a model problem. Agents constrained to call a verified tool (search, database, calculator) instead of generating a fact freehand hallucinate dramatically less. We design the constraints in.
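As a sketch of what “designing the constraints in” means (`lookup` and the return shape are hypothetical), the model chooses the query while a verified tool supplies the fact:

```python
def answer_fact(question, lookup):
    """Answer only from a verified tool; never let the model freehand a fact."""
    hit = lookup(question)  # e.g., a database query or a search-tool call
    if hit is None:
        return "No verified source found; escalating to a human."
    return f"{hit['answer']} (source: {hit['source']})"
```

The model’s job shrinks from “know the fact” to “pick the right query,” which is where hallucination rates fall.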
Cost to operate?
Estimated up-front based on your expected throughput. Most agents we build run for cents to dollars per task. If your workload would make the model bill bigger than the engineering bill, we’ll say so on the first call.
Code ownership?
Yours. Repository transferred to your GitHub org. Generic, reusable pieces (a clean SKILL.md, a logging helper) often get extracted into open-source repos with your permission — that part’s optional.
- Stack: OpenClaw + Python
- Turnaround: 3–8 weeks
- Format: Fixed scope
- Location: Remote, US
Have a workflow that should be agentic?
Send the rough shape — inputs, outputs, what done looks like, what scares you about it. We’ll reply with whether it’s a fit and what a sensible first cut would be.
Start a project conversation →