Updated May 19, 2026.
Your team ships a feature. Copilot burns tokens across dozens of completions. Claude runs an agentic session or three. The monthly bill arrives as a lump sum. Which feature cost what? Nobody knows.
Think of it like planning poker with blank cards. Everyone holds up a number, nobody checks after the game. Except the stakes are much higher now — and the whales at this table can always fold and walk away from Copilot.
This is the FinOps gap for AI-assisted development: work is metered at the API level, invisible at the feature level. Cloud FinOps solved the same problem for infrastructure years ago — attribute cost to whatever consumed it. AI coding needs the same treatment.
The practice this article argues for: measure AI cost per work item, feature, or PR. The data is already there — token counts on every API response, rates on every model. The gap is just between the API and git log. Closing it is a tooling problem with several reasonable answers. I’ll suggest one I built; the practice is the point.
Pick whichever unit lets your team compare tasks apples-to-apples — USD, AIC (GitHub’s AI Credits, 1 AIC = $0.01), or raw tokens — and stay consistent. From here on I’ll use “AI cost” generically; switch units in your head as needed.
Why It’s a Real Number Now
On June 1, 2026, GitHub Copilot moves to usage-based billing in AI Credits (AIC). Every plan ships with a monthly AIC allowance, and consumption is metered per token against a published per-model rate card. A single agentic session can burn 600 AIC ($6) or more. The sticker price is the allowance; the question is whether your team’s actual consumption fits inside it.
Until now, “how much will this feature cost in AI?” was a curiosity. After June 1, it’s a planning input — a finite, hard-cap budget like any cloud quota.
The good news: unlike human effort, AI work comes with a receipt. Every AI call has a token count. Every token has a published rate. When Claude Sonnet generates an auth module, the cost isn’t “about three days of senior engineer time” — it’s 47,000 input tokens and 12,000 output at $3/$15 per million: 32 AIC ($0.32). The data exists at the API level. It just doesn’t reach the commit.
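The arithmetic behind that receipt is short enough to write out. The sketch below reproduces the figures from the paragraph above; the function name and signature are illustrative and not taken from any tool, and the rates are the $3/$15 per million tokens quoted in the text.

```python
def cost_in_aic(input_tokens, output_tokens,
                input_usd_per_million, output_usd_per_million):
    """Return (usd, aic) for one request, using 1 AIC = $0.01."""
    usd = (input_tokens * input_usd_per_million
           + output_tokens * output_usd_per_million) / 1_000_000
    return usd, usd * 100


# The auth-module example: 47,000 input and 12,000 output tokens at $3/$15.
usd, aic = cost_in_aic(47_000, 12_000, 3.0, 15.0)
print(f"${usd:.2f} = {aic:.0f} AIC")  # prints "$0.32 = 32 AIC"
```

Input contributes $0.141 and output $0.18, which is why output tokens dominate the bill even when the prompt is four times longer.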
The Commit Receipt
The cleanest place to attribute AI cost is the commit itself. Capture per-request token counts from your assistant, compute cost against the published rate card, and append the total as a git trailer on the commit it belongs to. Then git log is the cost ledger.
A commit with all three trailers attached looks like this:
feat: add OAuth2 login flow

Copilot-Est-Cost: $0.41
Copilot-AI-Credits: 41.30
Copilot-AI-Credits-Models: Claude Sonnet 4.6=39.45,GPT-4.1=1.85
(Copilot-AI-Credits is the default trailer; the per-model breakdown and USD-equivalent are opt-in via settings.)
Cost is denominated in AI Credits (1 AIC = $0.01) — the same unit GitHub bills in starting June 1, plan-invariant across Pro, Business, and Enterprise. Copilot Budget reads raw token counts (input, output, cache reads, cache creation) per request and maps them to AIC against the published per-model rate card. The math doesn’t change on June 1, so trailers written today match what GitHub will start billing — no migration step. Whatever produces the trailers, the trailers are the artifact: aggregate across a branch for cost per feature, across a week for a number you can verify against the AIC allowance.
You can wire this up several ways: a CI step that scrapes API logs, a post-commit hook reading provider session files, or an editor extension doing capture inside the IDE. For GitHub Copilot, I wrote one — Copilot Budget, a VS Code extension that reads Copilot’s session metadata and writes these trailers via prepare-commit-msg. MIT-licensed, works on any plan. If you’re on Claude Code, Cursor, or a custom API, the practice is the same; the connector will be different.
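For a custom connector, the hook side is mostly string handling. The following is a minimal sketch of appending a trailer to a commit message, not Copilot Budget's actual implementation: append_trailer is a hypothetical helper, and it assumes the message contains no comment lines from a commit template. Git expects trailers in the final paragraph, separated from the body by a blank line, which is the one rule the function enforces.

```python
import re

# A trailer line looks like "Key: value"; keys use letters, digits, and hyphens.
TRAILER_RE = re.compile(r"^[A-Za-z0-9-]+: \S")


def append_trailer(message, key, value):
    """Append "key: value" to the trailer block, creating the block if needed."""
    body = message.rstrip("\n")
    last_paragraph = body.split("\n\n")[-1].split("\n")
    # Only treat the last paragraph as a trailer block if it is not the
    # subject line (a subject such as "feat: ..." also looks like a trailer).
    in_trailer_block = "\n\n" in body and all(
        TRAILER_RE.match(line) for line in last_paragraph
    )
    separator = "\n" if in_trailer_block else "\n\n"
    return f"{body}{separator}{key}: {value}\n"


message = append_trailer("feat: add OAuth2 login flow\n",
                         "Copilot-AI-Credits", "41.30")
message = append_trailer(message, "Copilot-Est-Cost", "$0.41")
print(message)
```

A prepare-commit-msg hook receives the path to the message file as its first argument, so a real hook would read that file, apply the function, and write the result back. Delegating to git interpret-trailers is a reasonable alternative that also copes with template comments.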
The pattern shows fast. A clean auth flow takes 3 commits and $6.15. A dashboard refactor takes 14 commits and $43.05 — the model had to hold the whole design system in context for each variation. That’s a prompt-architecture problem you can only diagnose from the data.
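Per-feature totals like these come from summing trailers across a branch. Here is one way to do it, as a sketch: sum_credits is a hypothetical helper that parses commit message text, and the second commit in the sample is invented for illustration. The input could come from something like git log --format=%B over the branch range.

```python
import re
from collections import defaultdict

# Match the total trailer only; the "-Models" trailer is handled separately.
TOTAL_RE = re.compile(r"^Copilot-AI-Credits:[ \t]*([0-9]+(?:\.[0-9]+)?)[ \t]*$",
                      re.MULTILINE)
MODELS_RE = re.compile(r"^Copilot-AI-Credits-Models:[ \t]*(.+?)[ \t]*$",
                       re.MULTILINE)


def sum_credits(log_text):
    """Return (total_aic, aic_by_model) for all trailers found in log_text."""
    total = sum(float(value) for value in TOTAL_RE.findall(log_text))
    per_model = defaultdict(float)
    for line in MODELS_RE.findall(log_text):
        for entry in line.split(","):
            # Split on the last "=" so model names may contain any other character.
            name, _, value = entry.rpartition("=")
            if name:
                per_model[name.strip()] += float(value)
    return total, dict(per_model)


SAMPLE_LOG = """feat: add OAuth2 login flow

Copilot-AI-Credits: 41.30
Copilot-AI-Credits-Models: Claude Sonnet 4.6=39.45,GPT-4.1=1.85

fix: handle token refresh

Copilot-AI-Credits: 12.50
Copilot-AI-Credits-Models: Claude Sonnet 4.6=12.50
"""

total, per_model = sum_credits(SAMPLE_LOG)
print(round(total, 2), {name: round(aic, 2) for name, aic in per_model.items()})
```

Keeping the parser separate from the git call means the same function works on a branch, a PR range, or a week of history.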
Cost Hotspots
The receipt does more than total spend — it surfaces cost hotspots: work items where the model burns disproportionate tokens. Treat them as investment signals.
A feature that runs up $5 of AI cost across a week of agentic sessions isn’t just costly. The model is doing the same shape of work repeatedly, with no scaffolding to reuse, and you’re paying tokens to rebuild context every session. That’s a tooling gap with a price tag attached.
The response is the same kind of work you’d do to onboard a new engineer faster — write down the implicit:
- Codify recurring tasks as Claude Code Skills, Copilot custom instructions, or snippet libraries. The model loads the pattern once instead of re-deriving it per session.
- Tighten types and contracts. A 50,000-token refactor that’s mostly “what does this function return?” is a vote for stricter type definitions or schema-first APIs.
- Stand up an MCP server for repeat-context domains. If your agent keeps re-reading the same vendor datasheet or internal API spec, expose it as structured queries instead.
- Add reference tests as living docs. Tests are the cheapest context — small, executable, unambiguous. A handful in an expensive area pays back every session that touches it.
The hotspots point at where this investment is most likely to pay back. You’re not refactoring for aesthetics — you’re refactoring because the meter says so. The next week’s receipts tell you whether it worked.
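What counts as "disproportionate" is a judgment call, but a rough first pass can be automated. The sketch below flags work items whose cost exceeds a multiple of the median; find_hotspots and the factor of 3 are assumptions made for illustration, and only the auth-flow and dashboard figures come from this article (the other two items are invented).

```python
from statistics import median


def find_hotspots(cost_by_item, factor=3.0):
    """Return (item, cost) pairs costing more than factor x the median, highest first."""
    if not cost_by_item:
        return []
    typical = median(cost_by_item.values())
    hot = [(item, cost) for item, cost in cost_by_item.items()
           if cost > factor * typical]
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


costs_usd = {
    "auth-flow": 6.15,            # from the example earlier in the article
    "dashboard-refactor": 43.05,  # from the example earlier in the article
    "settings-page": 4.80,        # invented
    "csv-export": 7.20,           # invented
}
print(find_hotspots(costs_usd))
```

The median is used instead of the mean so that one very expensive item does not raise the baseline it is being compared against.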
Where This Falls Short
The receipts aren’t perfect. Honest limitations:
- Cache reporting gap is closing. When Copilot reports cacheReadTokens per request, the math is exact. When it doesn’t (some chat requests today), Copilot Budget falls back to a turn-based heuristic (turn 1 = 0% cached, turn 2+ = 75%), which lands within ±5–10% of actual. microsoft/vscode-copilot-chat#5076 propagates cacheReadTokens and cacheCreationTokens through the chat extension’s telemetry to align with Copilot CLI; when it merges, the heuristic stops applying.
- Copilot only, for now. Copilot Budget reads VS Code’s Copilot session files. Other assistants (Claude Code, Cursor, Codex, Gemini, and a long tail) are already covered by multi-assistant trackers like tokscale; piping their output into a git trailer post-commit is the remaining bridge. The trailer format is the interoperable part; the capture layer is per-assistant.
- PR-level aggregation is DIY. Trailers live on individual commits. Summing across a PR or sprint is a git log --grep='Copilot-AI-Credits' | awk one-liner today, or a CI step if you want it on every PR page. No built-in aggregation yet.
- Rate card is bundled, not live. The extension ships a mirror of GitHub’s pricing YAML. When GitHub updates rates, refresh with npm run update-rates or accept drift.
- Cost ≠ value. A $5 agentic session that ships a critical fix is cheap; a $0.20 session that produces noise is expensive. The receipts shape decisions; they don’t replace judgment about what work was worth.
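The cache fallback in the first limitation is simple to state in code. This is a sketch of the rule as described above, not the extension's source: estimate_cache_reads is a hypothetical name, and it assumes the 75% share applies to the request's input token count.

```python
def estimate_cache_reads(input_tokens, turn, cache_read_tokens=None):
    """Return (cached_tokens, is_exact) for one request.

    Uses the reported cacheReadTokens value when present; otherwise falls
    back to the turn-based heuristic (turn 1 = 0% cached, turn 2+ = 75%).
    """
    if cache_read_tokens is not None:
        return cache_read_tokens, True
    cached_share = 0.0 if turn <= 1 else 0.75
    return int(input_tokens * cached_share), False


print(estimate_cache_reads(40_000, turn=1))                            # first turn, nothing cached
print(estimate_cache_reads(40_000, turn=3))                            # heuristic applies
print(estimate_cache_reads(40_000, turn=3, cache_read_tokens=28_500))  # reported value wins
```

Returning the is_exact flag alongside the number lets a report mark which receipts are estimates.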
What to Do
Three habits, no process overhaul:
- Get AI cost trailers on your commits. Copilot Budget for GitHub Copilot (commit hook is opt-in: set commitHook.enabled: true or run the Install Commit Hook command); tokscale for Claude Code, Cursor, Codex, Gemini, and others (bridge its output into a trailer); a small script for custom APIs. Run it for a week without changing anything else. Just observe.
- Estimate per task in your chosen unit. Six bands work well: $0.25, $0.50, $1, $2, $5, $10. Write the number down before work starts, whether you do planning poker or just a column in your tracker. After the iteration, compare estimates to the trailers in git log. Where they diverged, ask why: wrong model, missing context, recurring task without scaffolding?
- Invest where the meter points. Expensive tasks are candidates for the Skills, types, and MCP servers from the previous section. The next iteration’s receipts tell you whether the investment paid back.
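The estimate-versus-actual comparison in the second habit can be reduced to a band lookup. The sketch below uses the six bands listed above; snapping an actual cost to the nearest band by absolute distance is an assumption, and the function names are illustrative.

```python
BANDS_USD = [0.25, 0.50, 1.0, 2.0, 5.0, 10.0]


def nearest_band(cost_usd):
    """Snap a cost to the closest estimation band."""
    return min(BANDS_USD, key=lambda band: abs(band - cost_usd))


def compare(estimate_usd, actual_usd):
    """Return (actual_band, steps); steps > 0 means the task ran over the estimate."""
    actual_band = nearest_band(actual_usd)
    steps = BANDS_USD.index(actual_band) - BANDS_USD.index(nearest_band(estimate_usd))
    return actual_band, steps


print(compare(1.0, 0.41))  # estimated $1, receipts say $0.41
print(compare(2.0, 6.15))  # estimated $2, receipts say $6.15
```

Counting the gap in band steps, not dollars, keeps a miss on a $0.25 task comparable to a miss on a $5 one.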
A few principles that age well:
- Visibility, not control. Showback over chargeback. People economize when the number is visible — they don’t need a hard limit.
- Team-owned data. Spend per engineer becomes a stick the moment it’s reviewed up the chain. Keep it in the team’s own retros — compare this week to last week, not engineer to engineer.
- Not every task burns tokens. Architecture conversations, customer interviews, hardware debugging — no receipts. Completions and Next Edit suggestions are often free across assistants too. Set budgets for the work that does burn cost; keep whatever works for the rest.
- Don’t optimize prematurely. Per-token rates fall every model generation. Build the measurement habit now; the data sharpens as costs grow.
Close the Loop
After June 1, the burndown chart that matters isn’t story points against sprint days — it’s AIC consumed against your monthly allowance. That’s the chart with a hard wall at the bottom. Sprint burndowns hide cost hotspots; AIC burndowns surface them. Story points were a target you could miss without consequence; the AIC ceiling is where the meter starts running on overage. The team that tracks it from day one chooses where to spend; the team that doesn’t gets surprised at month-end.
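An AIC burndown needs two numbers per day: what is left, and where the month is heading. The sketch below computes both; the 3,000 AIC allowance and the daily figures are invented for illustration, and the straight-line projection is an assumption that ignores weekends and release crunches.

```python
def burndown(allowance_aic, daily_aic, days_in_month):
    """Return (remaining_aic, projected_month_total) from daily consumption so far."""
    used = sum(daily_aic)
    remaining = allowance_aic - used
    # Linear projection: average daily burn so far, extended over the whole month.
    projected = used / len(daily_aic) * days_in_month if daily_aic else 0.0
    return remaining, projected


# Hypothetical team: 3,000 AIC allowance, five days of trailer totals.
remaining, projected = burndown(3_000, [120, 95, 210, 80, 145], days_in_month=30)
print(f"remaining: {remaining} AIC, projected month total: {projected:.0f} AIC")
```

In this made-up case the team has used under a quarter of the allowance but is on pace to exceed it, which is the kind of early signal a month-end invoice cannot give.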
Planning poker had the right instinct: estimate before you build, compare after you ship. It just never had real chips. Now every AI-assisted task comes with a receipt — and the monthly AIC allowance is the chip stack you’re playing with. Same game, higher stakes.
Install the tracker. Run a week. Read the receipts. The rest follows from the data.
References
- GitHub (2026). GitHub Copilot is moving to usage-based billing
- GitHub Docs (2026). Copilot models and pricing
- Vantage (2026). The Hidden Cost Driver in Agentic Coding: It's Not the Per-Token Price
- Finout (2026). OpenAI vs Anthropic API Pricing Comparison 2026
- FinOps Foundation (2025). How to Build a Generative AI Cost and Usage Tracker