
FinOps for AI: Track What Your Code Actually Costs Per Commit

68% of executives acknowledge overspending on AI. The problem isn't the size of the bill. It's that nobody can tell you what any of it bought.

Think of it like planning poker with blank cards. Everyone holds up a number, nobody checks after the game. Except now the chips are real. Every AI call has a price tag. You’re just not reading it.


Your team ships a feature. Copilot burns tokens across dozens of completions. Claude runs an agentic session or three. The monthly bill arrives as a lump sum. Which feature cost what? Nobody knows.

This is the FinOps gap for AI-assisted development: the work is metered at the API level, but invisible at the feature level. Cloud FinOps solved this for infrastructure years ago: attribute cost to the service that consumed it. AI coding costs need the same treatment, and the data is already there. You’re just not capturing it.

Here’s how to get a receipt on every commit.

The Missing Feedback Loop

Software engineering has never had a reliable way to connect estimates to outcomes.

Story points, arguably the most widely used estimation unit among agile teams, measure relative effort, not time, not cost. The Fibonacci sequence prevents false precision. Cross-team comparison was deliberately discouraged. Ron Jeffries, credited with inventing them: “I like to say that I may have invented story points, and if I did, I’m sorry now.”

In practice, teams estimate in hours first, then convert to points. Management extracts velocity, divides team cost by it, and gets "cost per story point," the exact use the inventors warned against. Velocity becomes a target. Goodhart's Law kicks in. What was a "3" quietly becomes a "5."

A UCL study of 37,440 user stories across 32 open-source projects found that story points showed a strong correlation with development time in only 7% of projects (medium in 58%, low in 35%). The correlation was statistically significant in nearly all projects. It just wasn’t useful.

The numbers floated free of verifiable reality. You can’t audit an abstraction deliberately decoupled from measurable outcomes. When all work is done by humans, there’s no meter for the work itself.

AI changed that. We built Copilot Budget to make it visible, but first, here’s why the economics make it urgent.

Copilot Budget: track GitHub Copilot token usage and optionally append AI budget info to git commit messages. Available on the VS Code Marketplace.

Tokens Changed the Economics

One Scrum.org team reported completing “over 150 story points of work” in a single sprint using AI tools. Their velocity baseline was meaningless overnight. Story points can’t survive a 10x productivity shift. But the work wasn’t free. AI coding tools are moving to credit-based pricing, and a single agentic session can cost $6 or more, while the subscription sticker price hides the real cost.

Unlike human effort, AI work comes with a receipt. Every API call to an LLM has a token count. Every token has a price. When Claude Sonnet generates your authentication module, the cost isn’t “about three days of a senior engineer’s time.” It’s 47,000 input tokens and 12,000 output tokens at $3 and $15 per million, respectively: $0.32.
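The arithmetic behind that figure fits in a few lines. A minimal sketch; the per-million prices are the ones quoted above, so substitute your provider's current rate card:

```python
# Estimate the dollar cost of one LLM call from its token counts.
# Prices are per million tokens; the figures below match the example
# in the text (Claude Sonnet: $3 input / $15 output per million).

def call_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Return the estimated cost in dollars for one call."""
    return (input_tokens * input_price
            + output_tokens * output_price) / 1_000_000

cost = call_cost(47_000, 12_000, input_price=3.0, output_price=15.0)
print(f"${cost:.2f}")  # → $0.32
```

The same function covers any model: only the two prices change.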

The industry noticed. The FinOps Foundation added AI workloads as a first-class cost domain in their 2025 Framework. Inference alone accounts for 80-90% of total GenAI spend. 98% of organizations now manage AI costs, up from 31% two years ago. Yet most still can’t attribute those costs to features.

That’s the gap. Not “how many hours will this take?” but “how many tokens will this burn?” And which feature burned them. Unlike hours, tokens are metered. Per-model. Per-request. Per-feature, if you track them.

Tokens vary by developer. A senior engineer with sharp prompts might finish a task in 5,000 tokens that a junior burns 50,000 on. But when token counts diverge, you can see why. Wrong prompt strategy? Wrong model? Context window too large? Each cause is visible in commit trailers and diagnosable after the fact.

Story points gave you a divergent number with no explanation. Tokens give you a divergent number with a traceable cause. That closes the feedback loop.

The Commit Receipt

When you use AI coding tools, the cost is invisible. Copilot runs in the background burning premium requests you can’t see. Your cloud dashboard shows a monthly lump sum. Your IDE shows nothing. The data exists at the API level (token counts, model IDs, billing multipliers), but none of it reaches the place where engineering decisions happen: the commit.

Copilot Budget bridges that gap. It’s a VS Code extension that tracks GitHub Copilot token usage and embeds cost data directly in git commits.

The extension scans Copilot’s session files every two minutes, extracts token counts per model, calculates premium request consumption using GitHub’s billing multipliers, and writes totals to a tracking file. On commit, a prepare-commit-msg hook appends cost trailers automatically. Here’s what that looks like in git log:

feat: add OAuth2 login flow

Copilot-Premium-Requests: 12.40
Copilot-Est-Cost: $0.41
Copilot-Model: claude_sonnet_4 18200/6800/4.00
Copilot-Model: gpt_4o 3100/1200/0.00

Every trailer is generated by Copilot Budget from actual Copilot usage, not an estimate, not a guess. Premium requests, estimated cost, and per-model token breakdown (input/output/requests), all attached to the commit that consumed them.

Aggregate these across a branch and you get cost per feature. Aggregate across the sprint and you get a budget you can actually verify.
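Summing those trailers over a commit range is a small scripting job. A sketch, assuming the trailer names shown above; feed it the output of `git log main..feature-branch`:

```python
import re

# Sum Copilot cost trailers across a range of commits.
# `log_text` is the raw output of `git log <range>`.

COST_RE = re.compile(r"^\s*Copilot-Est-Cost:\s*\$([0-9.]+)", re.MULTILINE)
REQ_RE = re.compile(r"^\s*Copilot-Premium-Requests:\s*([0-9.]+)", re.MULTILINE)

def branch_totals(log_text: str) -> dict:
    """Return total estimated cost and premium requests for the range."""
    return {
        "est_cost": sum(float(m) for m in COST_RE.findall(log_text)),
        "premium_requests": sum(float(m) for m in REQ_RE.findall(log_text)),
    }

# Typical use:
#   import subprocess
#   log = subprocess.run(["git", "log", "main..feature/oauth"],
#                        capture_output=True, text=True).stdout
#   print(branch_totals(log))
```

Run it per branch for cost per feature, or over the sprint's merge window for the sprint total.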

The pattern becomes visible fast. An auth flow takes 3 commits and $0.41 — straightforward and low context. A dashboard refactor takes 14 commits and $2.87. The model had to hold the entire design system in context for each variation. That’s not a “try harder” problem. It’s a prompt architecture problem: extract a reusable context prefix, or switch to a model with cheaper context. You can only discover that from the data.

The receipts won’t tell you the full cost of a feature. Human time is still unmetered. But they give you the first auditable signal across your backlog: which tasks burn disproportionate AI resources, and why. The costs are low today — most tasks cost cents. But agentic AI patterns consume 5-30x more tokens per task than simple completions. The AI share of work grows every quarter. Teams that build the measurement habit now, when stakes are low, will have calibrated intuition when the bill matters.

Planning poker gave you a number you couldn’t verify. Token tracking gives you a number you can act on. Same game, real chips.

Implement It

You don’t need to overhaul your process. You need a measurement layer.

Step 1: Start tracking. Install Copilot Budget or a similar tracker. Run it for one sprint without changing anything else. Just observe.

Step 2: Set budgets per task. Before each sprint, estimate the AI cost per task in dollars: $0.25, $0.50, $1, $2, $5, $10. If your team already does planning poker, swap the Fibonacci cards for these. Same simultaneous reveal, same discussion when estimates diverge, but now the unit is auditable. If your team doesn’t do planning poker, a simple column in your ticket tracker works fine.

When someone estimates “$2” and someone else estimates “$10,” the conversation is unchanged: “what do you know that I don’t?” The difference: after the sprint, you check who was right.

Step 3: Read the receipts. After the sprint, run git log with cost trailers. Compare estimated budgets against actuals. Find outliers. Ask why.
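The outlier check itself can be scripted. A hypothetical sketch: give it each task's estimated and actual cost (actuals summed from commit trailers) and it returns the tasks worth a retro conversation.

```python
# Flag tasks whose actual AI cost overran the estimate by `ratio`x.
# Task ids and dollar figures below are illustrative.

def flag_outliers(tasks: dict[str, tuple[float, float]],
                  ratio: float = 2.0) -> list[str]:
    """tasks maps task id -> (estimated_cost, actual_cost) in dollars."""
    return [tid for tid, (est, actual) in tasks.items()
            if est > 0 and actual / est >= ratio]  # est > 0 avoids /0

sprint = {
    "AUTH-12": (0.50, 0.41),   # under budget, nothing to discuss
    "DASH-7":  (1.00, 2.87),   # 2.9x overrun: ask why in retro
}
print(flag_outliers(sprint))  # → ['DASH-7']
```

The threshold is a team choice; 2x is a reasonable starting point for separating noise from a genuine miss.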

Step 4: Calibrate. By sprint three, your team has enough data to spot patterns: which task types consistently overrun, which models are cost-effective for which work, and where prompt architecture saves money.

Same process. Finally with a feedback loop.

Practical Notes

Start with visibility, not control. The FinOps community calls this "showback": showing teams what things cost without imposing hard limits. People economize when the number is visible.

Keep it team-owned. Precise estimates historically get weaponized. It’s why story points were invented as an abstraction. If a VP asks “why did your team burn 200 premium requests when Team B only burned 80,” you’ve recreated the exact dysfunction. Token data lives in the team’s retro, not on a management dashboard. The moment it flows upward as a performance metric, you’ve lost.

Don’t optimize prematurely. Token costs drop fast. Per-token API prices have fallen consistently year over year as providers compete and new model generations launch. Build the measurement habit now. The data you collect today becomes your baseline tomorrow.

Not every task burns tokens. Architecture whiteboarding, customer interviews, deployment debugging, manual QA. No AI receipts, and that’s fine. Set token budgets for AI-assisted tasks and keep whatever works for the rest. The AI-assisted share grows every quarter. You don’t need 100% coverage on day one.

Track consumption against your limit. Every Copilot plan has a monthly premium request allowance (check current limits). Plot consumed requests against that ceiling. When the line approaches the limit, the team adjusts: cheaper models, tighter prompts, deferred AI-heavy work. A budget ceiling you can actually hit creates real trade-offs.
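A burn-rate check against the allowance is one function. A sketch with illustrative numbers; the monthly allowance depends on your Copilot plan:

```python
# Project month-to-date premium-request consumption against the
# plan's monthly allowance. Numbers in the example are illustrative.

def budget_status(consumed: float, allowance: float,
                  day: int, days_in_month: int = 30) -> str:
    """Linear projection of end-of-month consumption vs. the limit."""
    projected = consumed / day * days_in_month
    if projected <= allowance:
        return f"on track: projected {projected:.0f}/{allowance:.0f}"
    return (f"over pace: projected {projected:.0f}/{allowance:.0f}, "
            f"throttle AI-heavy work")

# 180 requests consumed by day 12 against a 300-request allowance
# projects to 450 by month end: time to adjust.
print(budget_status(consumed=180, allowance=300, day=12))
```

A linear projection is crude, but it surfaces the trade-off early enough to act on it.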

Close the Loop

FinOps for cloud infrastructure took years to mature. FinOps for AI is just starting, and the stakes are higher because AI costs scale with usage, not provisioning. You don’t reserve an LLM instance. You pay per thought.

The good news: the data is already there. The missing piece was attribution: connecting cost to the feature that produced it. Commit-level tracking solves that.

Planning poker had the right instinct: estimate before you build, compare after you ship. It just never had real chips. Now every AI-assisted task comes with a receipt. Same game, real money.

Install the tracker. Run one sprint. Read the receipts. The rest follows from the data.


Copilot Budget is MIT-licensed and works with any GitHub Copilot plan. Install it, run a sprint, and see what your code actually costs. Contributions welcome on GitHub.

References

FinOps & AI Cost Management

  1. FinOps Foundation (2025). Optimizing GenAI Usage
  2. FinOps Foundation (2025). 2025 FinOps Framework
  3. FinOps Foundation (2026). State of FinOps 2026
  4. FinOps Foundation (2025). How to Build a Generative AI Cost and Usage Tracker

Story Points & Estimation

  1. Cohn, M. What Are Story Points? Mountain Goat Software
  2. Cohn, M. Don't Equate Story Points to Hours. Mountain Goat Software
  3. Jeffries, R. (2019). Story Points Revisited. ronjeffries.com
  4. Age of Product. 12 Common Strategies of Gaming Velocity
  5. Tawosi, V., Moussa, R. & Sarro, F. (2022). On the Relationship Between Story Points and Development Effort in Agile Open-Source Software. ACM/IEEE ESEM

Tools & Pricing

  1. Finout (2026). OpenAI vs Anthropic API Pricing Comparison 2026