245 Million Tokens Later: What OpenClaw's Usage Data Reveals About Agent Efficiency

At $229 across 300 sessions, you might think there's nothing interesting to analyze. You'd be wrong. When you log every API call at the individual token level, patterns emerge that are invisible in aggregate dashboards. This is not a tutorial on how to reduce your API bill by 10%. This is an investigation into what the data actually shows.

// Data patterns hide in volume. 245M tokens across 300 sessions reveals things a monthly summary never could.

01. THE DATA

What Was Captured

The dataset spans approximately 6 weeks of active agent operation on OpenClaw, calling the Claude API (primarily Sonnet and Opus models). Every session was logged with: session start/end timestamps, input tokens, output tokens, cache write tokens, cache read tokens, model used, and total cost. No sampling. No aggregation. Every call.
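For readers who want to capture the same fields, here is a minimal sketch of a per-session record. The class name, field names, and the cache hit rate definition (cache reads as a share of all input-side tokens) are assumptions made for illustration; they are not OpenClaw's actual schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
import json


@dataclass
class SessionRecord:
    """One logged session. Field names are illustrative, not OpenClaw's schema."""
    session_id: str
    started_at: datetime
    ended_at: datetime
    model: str
    input_tokens: int        # fresh (uncached) input
    output_tokens: int
    cache_write_tokens: int
    cache_read_tokens: int
    total_cost_usd: float

    @property
    def total_tokens(self) -> int:
        return (self.input_tokens + self.output_tokens
                + self.cache_write_tokens + self.cache_read_tokens)

    @property
    def cache_hit_rate(self) -> float:
        """Assumed definition: cache reads / all input-side tokens."""
        input_side = (self.input_tokens + self.cache_write_tokens
                      + self.cache_read_tokens)
        return self.cache_read_tokens / input_side if input_side else 0.0

    def to_json_line(self) -> str:
        """Serialize to one JSON line, suitable for an append-only log file."""
        row = asdict(self)
        row["started_at"] = self.started_at.isoformat()
        row["ended_at"] = self.ended_at.isoformat()
        return json.dumps(row)


# Hypothetical values, for illustration only.
record = SessionRecord("s-001", datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 9, 40),
                       "sonnet", 300_000, 60_000, 40_000, 400_000, 0.52)
print(record.to_json_line())
```

Writing one JSON line per session keeps the log append-only, which makes the later per-bucket and per-model breakdowns a matter of filtering rows.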

300 TOTAL SESSIONS
245M TOTAL TOKENS
$229 TOTAL SPEND
$0.76 MEAN SESSION COST
$0.52 MEDIAN SESSION COST
816K TOKENS PER SESSION (AVG)

The gap between mean ($0.76) and median ($0.52) is the first signal. The distribution is right-skewed: a small number of expensive sessions pull the mean above the median. More on that later.

Token Composition

Total token volume breaks down as follows:

TOKEN TYPE	VOLUME	% OF TOTAL	EFFECTIVE PRICE
Cache Read (Input)	97.0M	39.6%	$0.30/M
Fresh Input	82.0M	33.5%	$3.00/M
Context Files / Overhead	22.0M	9.0%	$3.00/M
Output Tokens	44.0M	18.0%	$75.00/M (Opus) / $15.00/M (Sonnet)
TOTAL	245M	100%	$0.93/M blended

The blended rate is the key number. List price for Claude Sonnet output is $15/M tokens. The blended effective rate across all 245M tokens is $0.93/M ($229 / 245M) - 94% below that list price, a reduction driven almost entirely by cache reuse on input tokens.
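The arithmetic behind the composition table and the blended rate can be reproduced in a few lines. This is a sketch using only the figures quoted above; the variable names are made up for the example.

```python
# Token volumes (millions) from the composition table above.
TOKENS_M = {
    "cache_read": 97.0,
    "fresh_input": 82.0,
    "context_overhead": 22.0,
    "output": 44.0,
}
TOTAL_SPEND_USD = 229.0
SONNET_OUTPUT_LIST_PRICE = 15.0  # $/M, the list price quoted in the text

total_m = sum(TOKENS_M.values())
shares = {name: volume / total_m for name, volume in TOKENS_M.items()}
blended_rate = TOTAL_SPEND_USD / total_m
reduction = 1 - blended_rate / SONNET_OUTPUT_LIST_PRICE

print(f"total tokens: {total_m:.0f}M")
for name, share in shares.items():
    print(f"{name:>16}: {share:.1%}")
print(f"blended rate: ${blended_rate:.2f}/M")
print(f"vs output list price: {reduction:.0%} lower")
```

Note that the comparison sets a blended rate over all token types against an output-only list price, so it is best read as a headline figure rather than a like-for-like discount.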

[Figure: token breakdown, 245M tokens by type (cache read, fresh input, context, output)]

// Token composition across 300 sessions. Cache read tokens (green) represent 39.6% of all tokens at 10x lower cost.

02. TOKEN PATTERNS

Input/Output Ratio

The input-to-output ratio sits at roughly 5.6:1 by token count. For every token the model generates, 5.6 tokens went in. This ratio is consistent across session types but shifts meaningfully by model:

MODEL	INPUT:OUTPUT RATIO	AVG SESSION COST	SESSIONS
Claude Sonnet 4	4.8:1	$0.41	198
Claude Opus 4	6.9:1	$1.84	102

Opus sessions have a higher input ratio because they tend to run longer with more multi-turn context accumulation before the final output. The model doesn't change how many output tokens it generates per question - it changes how much context gets loaded first.

Cache Performance by Session Phase

Cache hit rate is not uniform across a session. It follows a curve:

SESSION PHASE	CACHE HIT RATE	INTERPRETATION
First 2 turns	8-12%	New session, cold cache
Turns 3-8	28-38%	Workspace files cached
Turns 9-20	51-64%	Context accumulating
Turns 20+	68-74%	Stable long-running session

This is the most important pattern in the dataset. Cache does almost nothing for short sessions. It becomes dramatically valuable for long sessions. The economic implication: session management strategy matters more than model selection for cost.

Counter-intuitive finding: Splitting one 20-turn session into four 5-turn sessions costs roughly 2.3x more. The overhead of cold cache restarts costs more than any benefit from shorter context windows.

// Cache hit rate climbs with session length. Sessions under 100K tokens have 18% cache hit rates; sessions over 2M tokens hit 71%.
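To see what the hit-rate curve means in dollars, here is a rough sketch that blends the cached and fresh input prices from the token table at the midpoint of each phase's hit-rate range. It is a simplification: it ignores any surcharge on cache writes and assumes Sonnet-level input pricing throughout.

```python
FRESH_INPUT_PRICE = 3.00   # $/M, from the token composition table
CACHE_READ_PRICE = 0.30    # $/M, from the token composition table

# Midpoints of the cache hit ranges reported per session phase.
PHASE_HIT_RATE = {
    "turns 1-2": 0.10,
    "turns 3-8": 0.33,
    "turns 9-20": 0.575,
    "turns 20+": 0.71,
}


def effective_input_price(hit_rate: float) -> float:
    """Blend cached and fresh input prices; ignores cache-write surcharges."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHE_READ_PRICE + (1 - hit_rate) * FRESH_INPUT_PRICE


for phase, rate in PHASE_HIT_RATE.items():
    print(f"{phase:>10}: ${effective_input_price(rate):.2f}/M input")
```

Under these assumptions the effective input price falls from about $2.73/M in the first two turns to about $1.08/M past turn 20, which is why restarting cold is costly.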

When Cache Helps Most

Not all tasks benefit equally from caching. Breaking down by session type:

TASK TYPE	CACHE HIT RATE	COST VS NO-CACHE EQUIV
Code review (same codebase)	71%	-68%
Article writing (with sources)	64%	-61%
Multi-step research	52%	-49%
Single-shot Q&A	14%	-13%
Image/media generation prompts	9%	-8%

Single-shot Q&A is the worst use case for an agent running with large context files. You pay full price to load everything, use maybe 15% of it, and get almost no cache benefit. For tasks like these, smaller specialized contexts or a stripped-down system prompt would have cost 3-4x less.

// Cache architecture is infrastructure, not a feature. Design around it or pay the penalty at scale.

03. COST OPTIMIZATION

Model Selection Impact

The data shows 198 Sonnet sessions vs 102 Opus sessions. Total spend attribution:

MODEL	SESSIONS	TOTAL SPEND	% OF BUDGET
Claude Sonnet 4	198 (66%)	$81.20	35.5%
Claude Opus 4	102 (34%)	$147.80	64.5%

Opus sessions represent 34% of session count but 64.5% of cost. Per-session average: Sonnet at $0.41, Opus at $1.84 - a 4.5x gap. The question is whether Opus tasks actually require Opus capability, or whether routing was suboptimal.

Analysis of the 102 Opus sessions by output quality requirements:

~38 sessions were clearly Opus-justified (complex reasoning, nuanced writing, multi-step planning)
~41 sessions were borderline (Sonnet would likely have been adequate)
~23 sessions were clearly Sonnet-level tasks routed to Opus by default

Conservative estimate: 23-41 sessions could have run on Sonnet. At $0.41 avg Sonnet cost vs $1.84 Opus, this represents $33-$75 in recoverable spend - 14-33% of total budget. Without changing a single line of task logic.

Session Cost Distribution

The distribution is right-skewed with a heavy tail:

// Most sessions cluster under $0.50. Eight sessions over $5 accounted for 14% of total spend.

COST BUCKET	SESSION COUNT	% OF SESSIONS	% OF TOTAL SPEND
$0.00 - $0.25	89	29.7%	5.8%
$0.25 - $0.50	62	20.7%	8.4%
$0.50 - $1.00	71	23.7%	19.1%
$1.00 - $2.00	48	16.0%	26.5%
$2.00 - $5.00	22	7.3%	26.2%
>$5.00	8	2.7%	14.0%

Eight sessions - 2.7% of total - consumed 14% of the budget. These were not anomalies: they were identifiable task types (large codebase reviews, long multi-tool research sessions, iterative generation tasks). They are predictable and can be budgeted for or restructured.
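The tail concentration can be read straight off the bucket table by accumulating from the most expensive bucket down. A small sketch, using only the table's session counts and spend shares:

```python
# (bucket label, session count, share of total spend in %) from the table above.
BUCKETS = [
    ("$0.00-$0.25", 89, 5.8),
    ("$0.25-$0.50", 62, 8.4),
    ("$0.50-$1.00", 71, 19.1),
    ("$1.00-$2.00", 48, 26.5),
    ("$2.00-$5.00", 22, 26.2),
    (">$5.00", 8, 14.0),
]
TOTAL_SESSIONS = sum(count for _, count, _ in BUCKETS)


def tail_concentration(buckets):
    """Walk from the most expensive bucket down, accumulating shares."""
    rows, sessions, spend = [], 0, 0.0
    for label, count, spend_pct in reversed(buckets):
        sessions += count
        spend += spend_pct
        rows.append((label, 100 * sessions / TOTAL_SESSIONS, spend))
    return rows


for label, session_pct, spend_pct in tail_concentration(BUCKETS):
    print(f"{label:>12} and up: {session_pct:5.1f}% of sessions, {spend_pct:5.1f}% of spend")
```

On the table's figures, the top two buckets together (30 sessions, 10% of the total) account for about 40% of spend.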

Batch vs Individual Calls

The data contains 42 identifiable "batch" patterns: sequences of related tasks run back to back in a single session, which can be compared against the same kinds of tasks spread across multiple sessions. Comparing equivalent workloads:

APPROACH	EXAMPLE TASKS	TOTAL TOKENS	TOTAL COST
Batched (single session)	5 article drafts	1.2M	$0.89
Individual (5 sessions)	Same 5 articles	2.1M	$2.14
Delta	-	-43%	-58%

The token count difference (1.2M vs 2.1M for equivalent work) comes from context file loading. Each new session reloads SOUL.md, USER.md, AGENTS.md, and any relevant memory files. In the single-session batch, those files load once. In five separate sessions, they load five times. At ~22K tokens per full context load, that's 110K tokens of loading in the individual approach vs 22K in the batch: 88K tokens of pure overhead from the four extra loads.
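The deltas in the table and the extra context loads can be checked with a short sketch. The helper names are invented for the example; the numbers are the ones quoted above.

```python
CONTEXT_LOAD_TOKENS = 22_000  # approximate full context load quoted above


def context_overhead(sessions: int, load_tokens: int = CONTEXT_LOAD_TOKENS) -> int:
    """Tokens spent loading workspace files, one load per session."""
    return sessions * load_tokens


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, as a percentage."""
    return 100 * (new - old) / old


batched = {"tokens": 1_200_000, "cost": 0.89}      # single session
individual = {"tokens": 2_100_000, "cost": 2.14}   # five sessions

extra_loads = context_overhead(5) - context_overhead(1)
print(f"extra context tokens from 4 more loads: {extra_loads:,}")
print(f"token delta: {pct_change(batched['tokens'], individual['tokens']):.0f}%")
print(f"cost delta:  {pct_change(batched['cost'], individual['cost']):.0f}%")
```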

// Model routing is the highest-leverage cost lever. Misrouting 23-41 of the 102 Opus sessions means roughly 14-33% budget waste.

04. EFFICIENCY DISCOVERIES

The Long Session Paradox

The intuitive assumption: long sessions are expensive. The data says the opposite, once you control for task volume. Sessions over 1M tokens (roughly 20+ turns) have the lowest cost-per-useful-output of any session bucket.

Why: cache hit rates above 60% mean each additional turn costs a fraction of what the first turns cost. The session overhead - context loading, system prompt, workspace files - is amortized over more output. The "expensive" sessions in the tail are not expensive because they're long. They're expensive because they ran Opus on tasks that could have been Sonnet.

Context File Size Overhead

Tracking the impact of context file evolution over the 6-week period:

WEEK	CONTEXT LOAD (TOKENS)	SESSIONS	OVERHEAD COST
Week 1-2	14,200	89	$1.89
Week 3-4	19,800	112	$3.96
Week 5-6	26,400	99	$7.22

Context files grew by 86% over 6 weeks as SOUL.md, TOOLS.md, and AGENTS.md accumulated new entries. The cost of loading these files per session increased proportionally. By week 5-6, context overhead alone was $7.22 across 99 sessions - 14.6% of that period's API spend just to load files that were mostly unchanged from the prior day.

The bloat tax is real and compounding. Every line added to workspace files has a cost that repeats on every session, forever. The payoff threshold for adding context information: the new information must improve at least one task per X sessions, where X = (dollar value of one improved task) / (additional tokens per session * price per token).
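One way to make the payoff threshold concrete is to price the added tokens per session and divide that into the value of one improved task. This is a sketch under two stated assumptions: you can put a dollar value on an improved task (the dataset does not measure this), and the added tokens are billed at the $3/M fresh-input price with no cache discount.

```python
def added_context_cost_per_session(extra_tokens: int,
                                   price_per_million: float = 3.00) -> float:
    """Dollar cost of carrying extra context tokens into one session."""
    return extra_tokens * price_per_million / 1_000_000


def breakeven_sessions(extra_tokens: int, value_per_improved_task: float,
                       price_per_million: float = 3.00) -> float:
    """Sessions an addition can ride along for before one improved task stops paying for it."""
    per_session = added_context_cost_per_session(extra_tokens, price_per_million)
    if per_session <= 0:
        raise ValueError("extra_tokens and price_per_million must be positive")
    return value_per_improved_task / per_session


# Hypothetical: a 500-token note that is worth $0.25 each time it helps.
print(f"{breakeven_sessions(500, 0.25):.0f} sessions")
```

In this hypothetical, the note has to improve at least one task in roughly every 167 sessions to pay for itself. Cached loads would lower the per-session cost and push the threshold out further.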

Unexpected Weekly Pattern

Session distribution by day of week:

DAY	SESSIONS	AVG SESSION COST	PATTERN
Monday	61	$0.94	High Opus, complex planning
Tuesday-Thursday	144	$0.58	Execution, Sonnet-dominant
Friday	38	$0.72	Mixed review tasks
Weekend	57	$0.91	Exploratory, research-heavy

Monday and weekend sessions cost 57-62% more per session than the Tuesday-Thursday execution sessions. Monday sessions are predictably expensive: they involve strategic planning, long research threads, and Opus-by-default routing. Weekend sessions are exploratory - research rabbit holes, multi-source investigations, long-form writing - which naturally use more tokens without the execution discipline of weekday tasks.

This pattern is useful for forecasting. If Monday is a planning day and weekends are exploration days, those are the periods where session cost monitoring pays off most.
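For forecasting, the day-of-week table reduces to a premium over the Tuesday-Thursday baseline plus an approximate spend per period. A sketch using only the table's figures:

```python
# (sessions, average session cost in $) from the day-of-week table above.
WEEKLY = {
    "Monday": (61, 0.94),
    "Tuesday-Thursday": (144, 0.58),
    "Friday": (38, 0.72),
    "Weekend": (57, 0.91),
}
BASELINE = "Tuesday-Thursday"


def premium_over_baseline(avg_cost: float, baseline_cost: float) -> float:
    """Percent by which a period's average session cost exceeds the baseline."""
    return 100 * (avg_cost / baseline_cost - 1)


baseline_cost = WEEKLY[BASELINE][1]
premiums = {day: premium_over_baseline(cost, baseline_cost)
            for day, (_, cost) in WEEKLY.items()}
period_spend = {day: sessions * cost for day, (sessions, cost) in WEEKLY.items()}

for day in WEEKLY:
    print(f"{day:>16}: +{premiums[day]:.0f}% vs baseline, ~${period_spend[day]:.2f} total")
```

The per-period totals are approximations built from rounded averages, so they will not sum exactly to the $229 total.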

Sub-Agent Cost Behavior

Approximately 40 of the 300 sessions were spawned as sub-agents from a parent session. Their cost profile:

SESSION TYPE	AVG COST	AVG TOKENS	CACHE HIT RATE
Main sessions	$0.82	892K	41%
Sub-agent sessions	$0.54	621K	31%

Sub-agents are cheaper per session ($0.54 vs $0.82) because they run focused tasks without the full workspace context load. But they have lower cache hit rates (31% vs 41%) because each sub-agent starts cold. For short, well-scoped tasks, this is the right trade. For tasks that iterate on the same codebase or document set, a single long main session outperforms multiple sub-agents.

// The efficiency gap between well-structured and poorly-structured sessions is not marginal. It compounds over hundreds of sessions into material budget differences.

05. RECOMMENDATIONS

1. Design for Long Sessions on Repetitive Workloads

If you're doing five related coding tasks, do them in one session. If you're writing three articles on the same topic, do them in one session. The cache hit rate jump from turn 2 to turn 15 is the biggest per-token cost reduction available, and it costs nothing to implement.

// Cost comparison: batched vs fragmented
Same 5 tasks, Session approach A:
  - 5 separate sessions
  - Context load x5: 22K tokens * 5 = 110K tokens
  - Cache hit rate: ~18% average
  - Effective blended cost: ~$1.80/M

Same 5 tasks, Session approach B:
  - 1 continuous session
  - Context load x1: 22K tokens * 1 = 22K tokens  
  - Cache hit rate: ~58% by task 4
  - Effective blended cost: ~$0.72/M

// Delta: 60% cost reduction. Zero code changes.

2. Implement Model Routing

The 23 clearly-misrouted Opus sessions represent recoverable budget. Routing logic doesn't need to be complex:

Single Q&A, lookup, formatting: Sonnet by default
Code generation under 200 lines: Sonnet
Multi-step planning, strategic decisions: Opus
Long-form creative writing with nuance requirements: Opus
Anything that's gone wrong before and needs deeper reasoning: Opus

The goal isn't to minimize Opus usage. It's to not use Opus by default when Sonnet is sufficient. Even a rough heuristic catches the obvious cases.
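The rules above can be sketched as a small routing function. The task-type labels and the function name are hypothetical; only the rules themselves come from the list. The list does not say what to do with code generation of 200 lines or more, so escalating it to Opus is this sketch's own assumption.

```python
# Hypothetical task-type labels for the Opus rules listed above.
OPUS_TASKS = {"multi_step_planning", "strategic_decision", "long_form_creative"}


def choose_model(task_type: str, code_lines: int = 0, failed_before: bool = False) -> str:
    """Return "opus" or "sonnet" using the rough heuristic described above."""
    if failed_before:
        return "opus"  # an earlier attempt went wrong: escalate
    if task_type in OPUS_TASKS:
        return "opus"
    if task_type == "code_generation":
        # The heuristic only covers code under 200 lines; sending larger
        # jobs to Opus is an assumption of this sketch.
        return "sonnet" if code_lines < 200 else "opus"
    # Single Q&A, lookup, formatting, and anything unrecognized default to Sonnet.
    return "sonnet"


print(choose_model("lookup"))
print(choose_model("code_generation", code_lines=120))
print(choose_model("multi_step_planning"))
```

Defaulting unknown task types to Sonnet is the point of the exercise: Opus becomes something a task has to qualify for, not the fallback.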

3. Audit Context Files Every Two Weeks

The data showed 86% growth in context file size over 6 weeks. Not all of that growth was valuable. A two-week audit cadence:

Check SOUL.md, USER.md, AGENTS.md, TOOLS.md for stale entries
Remove instructions that are no longer relevant (old projects, deprecated tools)
Keep MEMORY.md under 200 lines - distill, don't append
Target: context load stays under 20K tokens total

Every 1K tokens removed from context files saves (1K * sessions_per_day * 365) tokens per year. At $3/M input: 1K tokens cut, 2 sessions/day = $2.19/year. Small per removal. Large in aggregate when you're trimming 6,000 tokens of bloat.
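The savings arithmetic generalizes to any trim size and session rate. A minimal sketch, assuming every context load is billed at the $3/M fresh-input price (cached loads would save less):

```python
def annual_savings_usd(tokens_removed: int, sessions_per_day: float,
                       price_per_million: float = 3.00) -> float:
    """Yearly input cost avoided by trimming context files."""
    tokens_per_year = tokens_removed * sessions_per_day * 365
    return tokens_per_year * price_per_million / 1_000_000


print(f"${annual_savings_usd(1_000, 2):.2f}/year")  # the 1K-token example in the text
print(f"${annual_savings_usd(6_000, 2):.2f}/year")  # trimming 6,000 tokens of bloat
```

At the same 2 sessions/day, trimming 6,000 tokens works out to about $13 per year. The figure scales linearly with session volume.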

4. Flag High-Cost Sessions in Real Time

The 8 sessions over $5 were identifiable in advance by task type and model selection. A simple threshold check:

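# Session cost alerts as a small Python function
def session_cost_warnings(model, estimated_turns, task_type, context_tokens):
    """Return a list of warning strings for a planned session."""
    warnings = []
    if model == "opus" and estimated_turns > 15:
        warnings.append("This session profile averages $2.80+")
    if task_type in ("codebase_review", "multi_source_research"):
        warnings.append("Historical average: $3.20 for this task type")
    if context_tokens > 50_000:
        warnings.append("Large context: consider batching related tasks")
    return warnings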

5. Treat Cache as Architecture, Not Accident

Cache hit rates don't just happen. They're the result of consistent context structure: same files, same order, same prefix patterns. Because caching matches on the prompt prefix, a change to a context file between sessions resets the cache for that file and for everything loaded after it. The implication: stability in context files has a direct dollar value.

Specific patterns that improve cache efficiency:

Put static content (identity, constants) before dynamic content (recent events, current tasks)
Load workspace files in consistent order every session
Avoid updating SOUL.md or AGENTS.md during a session if you want cache hits for the rest of that session
Don't inline large blocks of external content (URLs, code files) into system prompts - load them as separate messages so they can cache independently
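The ordering rules above can be sketched as a function that assembles system prompt blocks: static files first, always in the same order, with volatile content last. The function name and file order are assumptions for illustration. The `cache_control` marker follows the shape of Anthropic's prompt caching format as commonly documented, but treat it as an assumption and check it against the current API reference before relying on it.

```python
STATIC_FILES = ["SOUL.md", "USER.md", "AGENTS.md", "TOOLS.md"]  # fixed load order


def build_system_blocks(files: dict[str, str], dynamic_notes: str) -> list[dict]:
    """Static files first, in a fixed order; volatile content last."""
    blocks = [
        {"type": "text", "text": f"# {name}\n{files[name]}"}
        for name in STATIC_FILES
        if name in files
    ]
    if blocks:
        # Cache breakpoint after the last static block, so the stable prefix
        # can be reused even when the dynamic notes change.
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
    if dynamic_notes:
        blocks.append({"type": "text", "text": dynamic_notes})
    return blocks


# Hypothetical file contents; the dict's own order does not affect the output.
blocks = build_system_blocks({"AGENTS.md": "agent rules", "SOUL.md": "identity"},
                             "Current task: review pull request")
for block in blocks:
    print(block["text"].splitlines()[0], "cache_control" in block)
```

Because the order comes from `STATIC_FILES` and not from the caller, two sessions with the same file contents produce a byte-identical prefix, which is what a prefix cache needs.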

6. Measure Efficiency, Not Just Spend

Total spend is a bad optimization target. The right metric is useful output per dollar. A $2 session that produces three finished articles is more efficient than a $0.20 session that produces one paragraph. Track:

Cost per completed task (not per session)
Cache hit rate by session type (not aggregate)
Output tokens per dollar (quality-weighted if possible)
Overhead ratio: context load tokens / useful output tokens

The 245M token dataset shows an overhead ratio of roughly 0.18. For every token of useful output, 0.18 tokens went to system overhead (context files, boilerplate prompts). Good systems sit under 0.20. Bad systems hit 0.50+ when context files bloat and task batching is absent.
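The tracked metrics can be computed from per-session records with a few sums. This is a sketch: the record fields and the two sample sessions are made up to echo the $2 vs $0.20 comparison above, and they are not drawn from the dataset.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Hypothetical per-session record; field names are illustrative."""
    cost_usd: float
    completed_tasks: int
    output_tokens: int
    context_load_tokens: int


def efficiency_metrics(sessions: list[SessionStats]) -> dict[str, float]:
    """Aggregate cost per task, output per dollar, and overhead ratio."""
    cost = sum(s.cost_usd for s in sessions)
    tasks = sum(s.completed_tasks for s in sessions)
    output = sum(s.output_tokens for s in sessions)
    overhead = sum(s.context_load_tokens for s in sessions)
    return {
        "cost_per_task": cost / tasks if tasks else float("inf"),
        "output_tokens_per_dollar": output / cost if cost else 0.0,
        "overhead_ratio": overhead / output if output else float("inf"),
    }


# Made-up sessions: one productive $2 session, one $0.20 session with no finished task.
sessions = [
    SessionStats(cost_usd=2.00, completed_tasks=3, output_tokens=120_000, context_load_tokens=22_000),
    SessionStats(cost_usd=0.20, completed_tasks=0, output_tokens=2_000, context_load_tokens=22_000),
]
print(efficiency_metrics(sessions))
```

Grouping the same computation by session type or model, rather than over the whole log, gives the per-type breakdown recommended above.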

06. SUMMARY TABLE

FINDING	IMPACT	DIFFICULTY TO FIX
Opus misrouting (23-41 sessions)	$33-75 / period	Low
Context file bloat (86% growth)	14% of late-period budget	Low
Session fragmentation vs batching	2.3x cost multiplier	Medium
Cold cache restarts on repetitive tasks	50-68% cache savings forgone	Medium
Sub-optimal context ordering for cache	~8-12% cache hit improvement available	Low
No real-time session cost signals	Top 2.7% of sessions = 14% of spend	Medium

The total recoverable budget from implementing all of the above: conservatively $60-90 out of $229, or 26-39%. If the same patterns hold at larger scale, the percentages stay the same and the absolute dollars scale proportionally: at 10x volume ($2,290/period), that leaves $600-900 on the table per period.

07. FINAL NOTE

This investigation was possible because the data existed. Most agent deployments don't log at this granularity. They get a monthly invoice and guess at what drove costs. The first optimization is the logging infrastructure itself - not because the data is interesting (though it is), but because you can't route what you can't measure, and you can't measure what you don't log.

The patterns here are specific to this dataset and model version. Anthropic adjusts cache pricing, model capabilities, and context window behavior periodically. These findings should be re-validated every 60-90 days against fresh data. The methodology is durable. The specific numbers will drift.

245M tokens. $229. 300 sessions. Not expensive. But the gap between efficient and inefficient use of that budget was 26-39% - real money on any serious deployment, and compounding money as usage scales.