Inference operations for Kubernetes
Trace every request through gateway, scheduler, prefill, KV cache, and decode. Enforce token-level SLOs. See what latency is costing you — in real time, in your cluster.
Roll up spend and SLOs by team, cluster, and model — numbers platform teams can debug and leadership can use in budget review.
Every millisecond of inference latency has a price tag.
Gateway routing 12 ms $0.00006
Scheduler queue 47 ms $0.00021
Prefill (KV hit, forward) 80 ms $0.00080
Inter-pod KV transfer 1 ms
Decode (64 tokens) 200 ms $0.00120
BLOCKED: KV eviction wait 174 ms $0.00056
One request · full stack · every span priced
The problem
Production inference on Kubernetes is flying blind — at every level of the org.
You can't see why p99 TTFT blew up.
Generic observability shows one span for the whole request. You can't tell queue wait from prefill backlog from KV eviction — so every incident starts with guesswork.
Reliability and cost run on different clocks.
PagerDuty fires when SLOs break. Finance sees the bill weeks later. Nobody connects a TTFT regression, in real time, to the retries and token blowups it caused.
GPU spend is growing. Nobody can explain it by team.
Leadership asks which team or cluster is driving the bill. Generic FinOps tools don't understand inference workloads — so chargeback spreadsheets never get trusted.
SaaS tools can't see inside the model server.
App-layer LLM observability stops at the API boundary. Regulated teams can't send prompts to a vendor anyway. You need self-hosted, infrastructure-deep visibility.
What we do
Performance, reliability, and cost — one problem. One platform.
Deep enough for SREs to debug a single request. Clear enough for leadership to see who is spending what — by team, cluster, and model.
Decompose any production request into gateway, scheduler, prefill, KV cache, and decode spans — with a dollar amount on each. Find the 174 ms of KV eviction wait that every other tool rolls into a single latency number.
- → Per-span breakdown across the inference path
- → Cost overlaid on every span, not just the request total
- → Cross-stack: works wherever your serving layer emits traces
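As a rough illustration of the per-span breakdown, the example request from the top of this page can be totalled in a few lines. The span names and figures are copied from that example. The `Span` type and the helper functions are hypothetical names used only for this sketch and are not KubeBurner's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    cost_usd: Optional[float]  # None where the example shows no price


# Figures copied from the example request on this page.
SPANS = [
    Span("Gateway routing", 12, 0.00006),
    Span("Scheduler queue", 47, 0.00021),
    Span("Prefill (KV hit, forward)", 80, 0.00080),
    Span("Inter-pod KV transfer", 1, None),
    Span("Decode (64 tokens)", 200, 0.00120),
    Span("BLOCKED: KV eviction wait", 174, 0.00056),
]


def total_latency_ms(spans: List[Span]) -> float:
    """Sum of all span durations for one request."""
    return sum(s.duration_ms for s in spans)


def total_cost_usd(spans: List[Span]) -> float:
    """Sum of span costs; unpriced spans count as zero."""
    return sum(s.cost_usd or 0.0 for s in spans)


def latency_share(spans: List[Span], name: str) -> float:
    """Fraction of request latency spent in the named span."""
    span = next(s for s in spans if s.name == name)
    return span.duration_ms / total_latency_ms(spans)
```

On these figures, the KV eviction wait accounts for roughly a third of the request's latency, which is the detail a single request-level span would hide.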
SLOs that cover prompt-length bands and the invisible token time of reasoning models, with burn-rate alerts that fire with a root cause, not just "p99 is high." Error budgets are token-weighted, so a failed 32k-token call burns more budget than a 200-token chat.
- → TTFT bands by prompt length, not one meaningless threshold
- → TPOT alerts with decode-bottleneck root cause
- → Multi-window burn-rate paging adapted from Google SRE
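One way to picture how banded TTFT thresholds, token-weighted budgets, and multi-window paging fit together is the sketch below. The band boundaries and thresholds in `TTFT_BANDS` are made-up illustrative values, and the function names are hypothetical. The 14.4 burn-rate threshold is the paging value the Google SRE Workbook suggests for a 1-hour window paired with a 5-minute window; everything else is an assumption, not a description of the product's internals.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical bands: (max prompt tokens, TTFT threshold in ms).
TTFT_BANDS = [(1_000, 300), (8_000, 800), (32_000, 2_500)]


@dataclass
class Request:
    prompt_tokens: int
    total_tokens: int
    ttft_ms: float


def ttft_threshold_ms(prompt_tokens: int) -> float:
    """Pick the TTFT threshold for the band this prompt length falls in."""
    for max_tokens, threshold in TTFT_BANDS:
        if prompt_tokens <= max_tokens:
            return threshold
    return TTFT_BANDS[-1][1]  # longer prompts fall back to the last band


def burn_rate(requests: List[Request], slo_target: float) -> float:
    """Token-weighted burn rate: bad-token fraction divided by the error budget."""
    total = sum(r.total_tokens for r in requests)
    if total == 0:
        return 0.0
    bad = sum(
        r.total_tokens
        for r in requests
        if r.ttft_ms > ttft_threshold_ms(r.prompt_tokens)
    )
    return (bad / total) / (1.0 - slo_target)


def should_page(long_window: List[Request], short_window: List[Request],
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn too fast."""
    return (burn_rate(long_window, slo_target) > threshold
            and burn_rate(short_window, slo_target) > threshold)
```

Because each request is weighted by its token count, a single slow 32k-token call moves the burn rate far more than a slow 200-token chat in the same window.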
The same chart shows TTFT p99 and $/1k tokens. When latency regresses, you see retries, session blowups, and reasoning-loop cost in real time — not on next month's invoice.
- → Coupled latency and cost time series
- → Slack digests with estimated dollar impact
- → Leadership sees cost impact while SREs still have root cause
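A coupled series like this can be derived from per-request records. The sketch below buckets requests by time and computes nearest-rank p99 TTFT alongside $/1k tokens for each bucket. The record layout `(timestamp_s, ttft_ms, tokens, cost_usd)` and the function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def coupled_series(records, bucket_s=60):
    """Group (timestamp_s, ttft_ms, tokens, cost_usd) records into time buckets."""
    buckets = defaultdict(list)
    for ts, ttft_ms, tokens, cost_usd in records:
        start = int(ts // bucket_s) * bucket_s
        buckets[start].append((ttft_ms, tokens, cost_usd))

    series = []
    for start in sorted(buckets):
        rows = buckets[start]
        tokens = sum(r[1] for r in rows)
        cost = sum(r[2] for r in rows)
        series.append({
            "bucket_start_s": start,
            "ttft_p99_ms": p99([r[0] for r in rows]),
            # Undefined when a bucket produced no tokens.
            "usd_per_1k_tokens": 1000 * cost / tokens if tokens else None,
        })
    return series
```

Plotting both fields against `bucket_start_s` gives the shared view: a latency regression and its cost effect land on the same time axis.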
Cost per token by team, cluster, workload, and model — rolled up the way inference actually runs, not as a generic GPU line item. Daily chargeback with CSV export, grounded in the same trace data your operators trust.
- → Spend by team · cluster · workload · model
- → Per-team chargeback with CSV export for finance
- → Idle GPU waste tracked separately — never folded into workload cost
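A chargeback rollup of this shape might look like the following sketch. The input row fields, the column names, and the `(idle)` label are hypothetical; the point is only that usage is grouped by team, cluster, workload, and model, and that idle GPU cost gets its own row instead of being spread across workloads.

```python
import csv
import io
from collections import defaultdict


def chargeback_csv(usage_rows, idle_gpu_cost_usd):
    """Roll up usage by (team, cluster, workload, model) and return CSV text.

    usage_rows: iterable of dicts with keys team, cluster, workload, model,
    tokens, cost_usd.
    """
    totals = defaultdict(lambda: [0, 0.0])  # key -> [tokens, cost_usd]
    for row in usage_rows:
        key = (row["team"], row["cluster"], row["workload"], row["model"])
        totals[key][0] += row["tokens"]
        totals[key][1] += row["cost_usd"]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["team", "cluster", "workload", "model",
                     "tokens", "cost_usd", "usd_per_1k_tokens"])
    for key in sorted(totals):
        tokens, cost = totals[key]
        per_1k = 1000 * cost / tokens if tokens else 0.0
        writer.writerow([*key, tokens, f"{cost:.6f}", f"{per_1k:.6f}"])

    # Idle GPU cost is reported on its own row, never added to a workload.
    writer.writerow(["(idle)", "", "", "", 0, f"{idle_gpu_cost_usd:.6f}", ""])
    return out.getvalue()
```

Keeping the idle row separate means a team's cost per token reflects only the work it actually ran.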
Also available
Execution-aware gateway
OpenAI-compatible, self-hosted, on the request path inside your cluster. Per-tenant policy, intelligent routing, and first-party visibility — for teams that want a drop-in gateway alongside Inference SRE.
Who it's for
One platform for the teams that run inference — and the ones who pay for it.
Platform & SRE
Flame graphs, token SLOs, and burn-rate alerts with root cause — so incidents start with evidence, not guesswork.
Engineering leadership
Inference health and spend in one place: rollups by team, cluster, and model that tie reliability to dollars.
FinOps & finance
Defensible chargeback — cost per token by team and workload, daily rollups, CSV export ready for budget review.
Why us
Others trace the API call. KubeBurner traces the inference path.
Other tools capture prompts, responses, and token counts. That is useful for product teams, but the whole request is still one span: no queue wait, no KV eviction, no prefill vs. decode. Cost and latency stay in separate tools.
Infrastructure-deep, self-hosted Inference SRE. Every request decomposed across the inference stack, with token SLOs and a price tag on every span. The level where reliability and cost are actually decided.
Approach
Four principles. No surprises.
Self-hosted, end to end.
Runs entirely inside your Kubernetes cluster. Your prompts, traces, and billing data — none of it leaves your network. Built for regulated and air-gapped environments.
OpenTelemetry-native.
Collectors run in-cluster, with semantic conventions aligned to distributed inference. Trace gateway, scheduler, prefill, KV cache, and decode, correlated with the metrics you already run.
Cross-stack by design.
Works wherever your serving layer emits traces — not locked to one engine or one vendor stack. The trace layer is the product; FinOps and chargeback plug in on top.
Always human-in-the-loop.
Every recommendation is a suggestion until you accept it. Preview, apply, roll back. Nothing is auto-changed in your cluster without an explicit decision.
Where we are
Private preview. Working with a small group of design partners.
- → Private install in your own cluster
- → Early flame graph, SLO, and chargeback dashboards
- → Direct line to the team — influence the roadmap
- → Apache 2.0 — your install is yours, forever
We're a fit if you're running production LLM inference on Kubernetes — whether you're the platform team debugging TTFT, or leadership trying to get ahead of GPU spend with defensible numbers by team and cluster. Self-hosted, infrastructure-deep, Apache 2.0.
Send a note to hello@kubeburner.ai and we'll set up a short call.