OpenClaw · March 28, 2026 · 10 min read

How I pick the right AI model for every job

Running 26 AI agents means picking models carefully. Local Llama and Qwen handle 70% of the work for free. Here's how I decide what runs where and why.


David Bakke

Founder, Bakke & Co

Cover image: OpenClaw

The question that comes up every time I add an agent

Every time I spin up a new agent in OpenClaw, I have to answer the same question: which model should this thing run on? It sounds trivial. It's not. The wrong answer either costs too much or produces work that isn't good enough.

I'm running 26 agents right now. They range from Nyx, my primary orchestrator handling strategic decisions, to Synapse, a background worker that runs cron jobs and syncs knowledge between agents. The idea that they should all run on the same model is like saying every employee at a company should have the same salary. The work is different. The requirements are different. The budget should reflect that.

Here's how I think about it, after about six weeks of trial and error.

The four tiers

I've landed on four tiers of models, each matched to a different class of work.

Local models (free): Llama 3.3 70B, Qwen 32B, and GLM Flash. These run on my Mac Studio through Ollama. The Mac Studio has 98GB of unified memory, which is enough to run the 70B parameter Llama model comfortably. These models cost nothing to run beyond electricity.

Cloud utility models (cheap): GPT-4o-mini for lightweight marketing and business tasks. It's fast, inexpensive, and good enough for structured outputs that don't need deep reasoning.

Cloud workhorse (moderate): Gemini 2.5 Flash for batch processing, research, and analytics. This is my go-to when I need cloud-grade intelligence at volume. It handles the kind of work where you need to process a lot of data but the output doesn't need to be perfect on the first pass.

Cloud premium (expensive): Claude Sonnet 4.6 for most agents that need strong reasoning, with Claude Opus 4.6 reserved exclusively for Forge, my lead engineer agent. Opus is the best coding model I've tested, and Forge is the agent where code quality matters most.
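The four tiers map cleanly onto a lookup table. Here's a minimal sketch in Python; the tier keys and dict structure are my own shorthand, while the model names come straight from the list above:

```python
# The four model tiers described above, as a simple lookup table.
# Tier names and model-identifier strings are illustrative shorthand.
MODEL_TIERS = {
    "local": ["llama3.3:70b", "qwen:32b", "glm-flash"],       # free, via Ollama
    "cloud_utility": ["gpt-4o-mini"],                          # cheap structured work
    "cloud_workhorse": ["gemini-2.5-flash"],                   # batch, research
    "cloud_premium": ["claude-sonnet-4.6", "claude-opus-4.6"], # reasoning, code
}

def tier_of(model: str) -> str:
    """Return the tier a model belongs to, or raise KeyError if unknown."""
    for tier, models in MODEL_TIERS.items():
        if model in models:
            return tier
    raise KeyError(model)
```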

What runs locally and why

The local tier handles roughly 70% of all compute across the system. That number surprised me too.

Synapse, the operations agent, runs on Llama 3.1 8B. Synapse does cron jobs: nightly broadcast sweeps, knowledge syncing, cleanup tasks. It doesn't need to reason about complex problems. It needs to follow instructions reliably. A small local model is perfect for this.

For routine summarization, formatting, and simple classification tasks, I route to Llama 3.3 70B or Qwen 32B. These models are good at structured work. Give them a clear prompt with a defined output format and they perform well. They stumble on open-ended reasoning or tasks requiring nuance, but that's fine. Not every job needs nuance.
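The "clear prompt with a defined output format" pattern can be made concrete. A minimal sketch, with a hypothetical helper of my own (none of these names come from OpenClaw):

```python
# Sketch of the structured-prompt pattern local models handle well:
# state the task, then pin the output to an explicit schema so the
# model has nothing open-ended to reason about.
def build_structured_prompt(task: str, fields: list[str]) -> str:
    schema = ", ".join(f'"{f}": <{f}>' for f in fields)
    return (
        f"{task}\n"
        "Respond with ONLY a JSON object, no prose, in exactly this shape:\n"
        "{" + schema + "}"
    )

prompt = build_structured_prompt(
    "Classify the sentiment of: 'The rollout went smoothly.'",
    ["sentiment", "confidence"],
)
```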

GLM Flash is the smallest of the local models and handles the most trivial work. Simple Q&A, formatting transforms, template expansion. Think of it as the intern who's great at repetitive tasks, but not someone you'd put in front of a client.

The key insight is that most work in a multi-agent system is mundane. It's moving data between formats. It's checking if a file exists. It's extracting a date from a string. You don't need Opus for that. You don't even need Sonnet.

The Mac Studio as a local inference server

Running local models requires decent hardware. I'm on a Mac Studio with Apple Silicon and 98GB unified memory. This is important because large language models are memory-bound, not compute-bound. The 70B parameter Llama model needs roughly 40GB of memory when quantized. With 98GB, I can run it alongside smaller models without swapping.
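The ~40GB figure matches what a back-of-envelope estimate predicts for 4-bit quantization. A quick check; the ~15% runtime overhead factor (KV cache, activations, buffers) is my own rough assumption:

```python
# Back-of-envelope memory estimate for a quantized LLM:
# 4-bit weights = 0.5 bytes per parameter.
def quantized_weight_gb(params_billion: float, bits: int = 4) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

weights = quantized_weight_gb(70)   # 35.0 GB of weights alone
with_overhead = weights * 1.15      # ~40 GB with assumed runtime overhead
```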

Ollama makes this almost too easy. Install it, pull a model, and you have a local inference endpoint. I can switch models by changing a model alias in the agent config. No API keys, no rate limits, no per-token billing.
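"Changing a model alias in the agent config" might look something like the fragment below. The field names here are hypothetical illustrations, not OpenClaw's actual schema:

```json
{
  "agent": "synapse",
  "model": "ollama/llama3.1:8b",
  "fallbacks": ["gemini-2.5-flash"]
}
```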

The downside of local models is latency. They're slower than cloud APIs. A response that takes 2 seconds from Anthropic's API might take 8 seconds from Llama 70B running locally. For background tasks, this doesn't matter. For interactive conversations, it would be painful. That's why my interactive agents run on cloud models while batch and background work stays local.

Where cloud models earn their cost

Cloud models are necessary for four categories of work.

Code generation. Forge runs on Claude Opus 4.6 because I've tested every major model on real coding tasks, and Opus consistently produces the most correct, well-structured code on the first attempt. When Forge builds a feature, a single session might generate hundreds of lines across multiple files. Mistakes compound. I'd rather pay for Opus and get clean code than save money on a cheaper model and spend my time debugging.

Strategic reasoning. Nyx runs on Claude Sonnet 4.6. As the orchestrator, Nyx needs to understand context, prioritize tasks, and make judgment calls about which agent should handle what. This requires a model that can hold complex context and reason about dependencies. Local models can't do this reliably.

User-facing communication. Hermes, the email agent, runs on Sonnet. When an agent drafts email responses or classifies incoming mail, the quality needs to be high because actual humans will see the output. Same logic applies to Content (blog writing) and Counsel (legal review).

Research and analysis. Scout and Atlas both run on Gemini 2.5 Flash. These agents do web research and market analysis. Gemini Flash is fast, handles long contexts well, and costs significantly less than Anthropic for the same volume of work. When you're searching the web and summarizing dozens of articles, you want a model that's cheap enough to call frequently.

The marketing tier

My seven marketing agents (CMO, SEO, Ads, Email Marketing, Partnerships, Analytics, and Content) run a split strategy. Content runs Sonnet because the writing quality ceiling matters. The rest run GPT-4o-mini.

This might seem like I'm shortchanging marketing. I'm not. What GPT-4o-mini does well is structured work with clear prompts. Generate ad copy from this template. Analyze these SEO keywords against this target list. Format this email campaign. These are well-defined tasks where a cheaper model with a good prompt produces equivalent output to a premium model.

Content is the exception because blog posts and long-form writing require a voice that small models can't consistently maintain. Ask GPT-4o-mini to write 1,500 words in a specific tone and it starts drifting by paragraph four. Sonnet doesn't.

The Anthropic rate limit lesson

This is the part I wish I'd known before the incident.

On March 12, I had multiple agents running Claude simultaneously. Nyx was handling an interactive session. Forge was building a feature. The Mission Control quality loop kicked off on a cron job, which spawned another Forge session. That's three concurrent Anthropic sessions, and the API returned 529 errors. The gateway restarted. I lost context on two sessions.

The rule I established after that: maximum two parallel Anthropic jobs at any time. If you need a third, wait. If you're running batch work, use Gemini Flash or GPT-4o-mini instead.
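The two-job cap is easy to enforce mechanically rather than by discipline. A minimal sketch using a counting semaphore; the names are mine, not OpenClaw's:

```python
import threading

# Cap concurrent Anthropic calls at two, per the rule above. A third
# caller blocks until a slot frees up instead of triggering 529s.
ANTHROPIC_SLOTS = threading.Semaphore(2)

def call_anthropic(run_job):
    """Run a job while holding one of the two Anthropic slots."""
    with ANTHROPIC_SLOTS:
        return run_job()
```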

This forced me to be more intentional about model assignment. Before the incident, I defaulted to Claude for everything because it produces the best results. After, I started asking: does this task actually need Anthropic? Or am I just being lazy about prompt engineering for a cheaper model?

Turns out, a lot of tasks I was running on Claude worked fine on Gemini or GPT-4o-mini once I adjusted the prompts. The premium models are more forgiving of sloppy prompts. Cheaper models need better instructions. That's a prompt engineering problem, not a model capability problem.

The real cost breakdown

I don't have exact numbers because some of the cost is amortized into hardware I already owned. But directionally, it looks like this.

Local models (70% of compute): effectively free. The Mac Studio runs 24/7 anyway. The incremental electricity cost is negligible.

GPT-4o-mini (marketing, business agents): a few dollars per month. These agents don't run constantly, and when they do, mini is cheap.

Gemini 2.5 Flash (research, analytics, batch work): moderate. Maybe $15-25 a month. This is my highest-volume cloud model because it handles all the cron-based batch work that used to hit Anthropic.

Claude Sonnet 4.6 (most interactive agents): this is where the real spending happens. Maybe $50-80 a month depending on how much interactive work I'm doing. Heavy coding weeks cost more.

Claude Opus 4.6 (Forge only): the most expensive per-token model, but Forge only runs when there's significant engineering work to do. In quiet weeks, it barely costs anything. During a sprint, it can add up.

Total cloud API spend: roughly $100-150 a month. That's for running 26 agents. Given that local models absorb 70% of the work, the cloud spend is surprisingly manageable.

How I decide which model for a new agent

When I'm setting up a new agent, I ask four questions in order.

Can a local model do this? If the task is structured, repetitive, or doesn't require strong reasoning, start with Llama 70B or Qwen 32B. Test it. If the output quality is acceptable, stop there.

Does the output face a human? If real people will see the agent's work (emails, blog posts, client deliverables), it needs a cloud model. Probably Sonnet or better.

Is this batch or interactive? Batch work goes to Gemini Flash. Interactive conversations go to Anthropic. This keeps Anthropic utilization low and costs predictable.

Is code quality critical? If the agent writes code that goes into production, seriously consider Opus. The cost per token is higher, but the cost of debugging bad code is higher still.
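Those four questions translate almost directly into a routing function. A sketch with my own flag names, checking the questions in the order above, first match wins:

```python
# The four routing questions from this section, in order.
# Flag names are mine; model identifiers mirror the post.
def pick_model(local_ok: bool, human_facing: bool,
               batch: bool, code_critical: bool) -> str:
    if local_ok:
        return "llama3.3:70b"       # Q1: a local model is good enough, stop here
    if human_facing:
        return "claude-sonnet-4.6"  # Q2: real people see the output
    if batch:
        return "gemini-2.5-flash"   # Q3: batch work stays off Anthropic
    if code_critical:
        return "claude-opus-4.6"    # Q4: production code justifies Opus
    return "claude-sonnet-4.6"      # default: interactive reasoning
```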

The final step is setting up fallback chains. Every agent config includes fallback models in case the primary is unavailable. If Anthropic returns a 529 error, the agent automatically tries Gemini. If Gemini fails, it tries GPT-4o. No agent should ever be completely stuck because of a rate limit or outage on a single provider.
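A fallback chain like that can be sketched in a few lines. Here `call(model, prompt)` is a hypothetical provider wrapper that raises on errors, not a real API:

```python
# Try each model in order until one succeeds; mirrors the
# Anthropic -> Gemini -> GPT-4o chain described above.
def call_with_fallback(prompt: str, chain: list[str], call):
    """`call(model, prompt)` is assumed to raise on 529s or outages
    and return text on success."""
    last_error = None
    for model in chain:
        try:
            return call(model, prompt)
        except Exception as err:  # real code would catch provider-specific errors
            last_error = err
    raise RuntimeError("all providers failed") from last_error

# Example chain from the post:
CHAIN = ["claude-sonnet-4.6", "gemini-2.5-flash", "gpt-4o"]
```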

What I've learned

The biggest lesson is that model selection isn't a one-time decision. It's an ongoing optimization problem. I've changed model assignments multiple times based on actual usage patterns.

Scout started on Sonnet and moved to Gemini Flash when I realized research tasks didn't need Anthropic's reasoning depth. The marketing agents started on GPT-4o and moved to GPT-4o-mini once it was clear the smaller model could handle the same tasks. Synapse started on a cloud model and moved to local Llama when I realized background cron jobs don't justify API spend.

The other lesson is that the model tier system only works if you also invest in prompt quality. A well-prompted GPT-4o-mini will outperform a lazily prompted Sonnet for many tasks. The time I spend tuning prompts for cheaper models pays for itself in reduced API costs within weeks.

There's no universal right answer for model selection. Your hardware, your budget, your quality requirements, and your risk tolerance will all be different from mine. But the framework of asking "what's the cheapest model that can do this job well enough?" has served me well. Start cheap, promote to a more expensive model only when the output quality genuinely requires it.

Tags: openclaw, models, cost, llm, ollama