Private preview · Inference SRE · Apache 2.0

Inference operations for Kubernetes

Trace every request through gateway, scheduler, prefill, KV cache, and decode. Enforce token-level SLOs. See what latency is costing you — in real time, in your cluster.

Roll up spend and SLOs by team, cluster, and model — numbers platform teams can debug and leadership can use in budget review.

Every millisecond of inference latency has a price tag.

req_8a3f2c1b 502 ms · $0.0023
Gateway routing              12 ms  $0.00006
Scheduler queue              47 ms  $0.00021
Prefill (KV hit, forward)    80 ms  $0.00080
Inter-pod KV transfer         1 ms
Decode (64 tokens)          200 ms  $0.00120
BLOCKED: KV eviction wait   174 ms  $0.00056

one request · full stack · every span priced

The problem

Production inference on Kubernetes is flying blind — at every level of the org.

502ms

You can't see why p99 TTFT blew up.

Generic observability shows one span for the whole request. You can't tell queue wait from prefill backlog from KV eviction — so every incident starts with guesswork.

$↔ms

Reliability and cost run on different clocks.

PagerDuty fires when SLOs break. Finance sees the bill weeks later. Nobody connects, in real time, a TTFT regression to the retries and token blowups it caused.

$?

GPU spend is growing. Nobody can explain it by team.

Leadership asks which team or cluster is driving the bill. Generic FinOps tools don't understand inference workloads — so chargeback spreadsheets never get trusted.

SaaS tools can't see inside the model server.

App-layer LLM observability stops at the API boundary. Regulated teams can't send prompts to a vendor anyway. You need self-hosted, infrastructure-deep visibility.

What we do

Performance, reliability, and cost — one problem. One platform.

Deep enough for SREs to debug a single request. Clear enough for leadership to see who is spending what — by team, cluster, and model.

01
Inference Flame Graph
One request. Full stack.

Decompose any production request into gateway, scheduler, prefill, KV cache, and decode spans — with a dollar amount on each. Find the 174 ms of KV eviction wait that every other tool rolls into a single latency number.

  • Per-span breakdown across the inference path
  • Cost overlaid on every span, not just the request total
  • Cross-stack: works wherever your serving layer emits traces
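
As a rough illustration of the per-span view, the sketch below ranks the spans of one request by duration and reports how much of the time was spent blocked. The durations and costs are copied from the sample trace above; the `Span` fields, the `blocked` flag, and the function names are assumptions made for this example, not the product's schema. Shares are computed against the sum of the listed spans, which need not equal end-to-end latency if spans overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    cost_usd: Optional[float] = None  # None when no cost is attributed
    blocked: bool = False             # hypothetical flag for pure wait time


# Values taken from the sample trace on this page.
SPANS = [
    Span("Gateway routing", 12, 0.00006),
    Span("Scheduler queue", 47, 0.00021),
    Span("Prefill (KV hit, forward)", 80, 0.00080),
    Span("Inter-pod KV transfer", 1),
    Span("Decode (64 tokens)", 200, 0.00120),
    Span("KV eviction wait", 174, 0.00056, blocked=True),
]


def breakdown(spans):
    """Return (name, duration_ms, share_of_time, cost_usd) sorted by duration."""
    total_ms = sum(s.duration_ms for s in spans)
    ordered = sorted(spans, key=lambda s: s.duration_ms, reverse=True)
    return [(s.name, s.duration_ms, s.duration_ms / total_ms, s.cost_usd)
            for s in ordered]


def blocked_share(spans):
    """Fraction of summed span time spent in spans flagged as blocked."""
    total_ms = sum(s.duration_ms for s in spans)
    blocked_ms = sum(s.duration_ms for s in spans if s.blocked)
    return blocked_ms / total_ms if total_ms else 0.0


if __name__ == "__main__":
    for name, ms, share, cost in breakdown(SPANS):
        cost_text = f"${cost:.5f}" if cost is not None else "n/a"
        print(f"{name:<28}{ms:>6.0f} ms  {share:>6.1%}  {cost_text}")
    print(f"blocked share: {blocked_share(SPANS):.1%}")
```

A single latency number hides the wait span entirely; listing spans side by side makes it the second-largest line in the request.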

02
Token-level SLOs
TTFT and TPOT that understand inference.

SLOs with prompt-length bands, accounting for the hidden reasoning-token time of reasoning models, and burn-rate alerts that fire with a root cause, not just "p99 is high." Token-weighted error budgets mean a failed 32k-token call burns more budget than a 200-token chat.

  • TTFT bands by prompt length, not one meaningless threshold
  • TPOT alerts with decode-bottleneck root cause
  • Multi-window burn-rate paging adapted from Google SRE
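
To make the ideas in this card concrete, here is a minimal sketch of banded TTFT thresholds, a token-weighted error ratio, and a two-window burn-rate check. The band boundaries, thresholds, the 99% target, and every name in the code are assumptions chosen for illustration. The 14.4 paging threshold is the value the Google SRE Workbook pairs with a 1-hour long window and a 5-minute short window; the windows are simply passed in as lists here.

```python
from bisect import bisect_left
from dataclasses import dataclass

# Hypothetical bands: (inclusive upper bound on prompt tokens, TTFT limit in ms).
TTFT_BANDS = [(1_024, 300.0), (8_192, 800.0), (32_768, 2_500.0)]
SLO_TARGET = 0.99  # assumed target: 99% of token-weighted traffic is good


@dataclass
class Request:
    prompt_tokens: int
    output_tokens: int
    ttft_ms: float
    failed: bool = False


def ttft_threshold_ms(prompt_tokens: int) -> float:
    """Pick the TTFT limit for the band this prompt length falls into."""
    bounds = [upper for upper, _ in TTFT_BANDS]
    idx = min(bisect_left(bounds, prompt_tokens), len(TTFT_BANDS) - 1)
    return TTFT_BANDS[idx][1]


def is_bad(req: Request) -> bool:
    """A request is bad if it failed or missed the TTFT limit for its band."""
    return req.failed or req.ttft_ms > ttft_threshold_ms(req.prompt_tokens)


def token_weighted_error_ratio(requests) -> float:
    """Weight each request by its token count instead of counting it once."""
    total = sum(r.prompt_tokens + r.output_tokens for r in requests)
    bad = sum(r.prompt_tokens + r.output_tokens for r in requests if is_bad(r))
    return bad / total if total else 0.0


def burn_rate(requests) -> float:
    """Error ratio divided by the error budget (1 - target)."""
    return token_weighted_error_ratio(requests) / (1.0 - SLO_TARGET)


def should_page(long_window, short_window, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn too fast."""
    return (burn_rate(long_window) >= threshold
            and burn_rate(short_window) >= threshold)


if __name__ == "__main__":
    window = [
        Request(prompt_tokens=200, output_tokens=50, ttft_ms=120.0),
        Request(prompt_tokens=32_000, output_tokens=500, ttft_ms=1_800.0,
                failed=True),
    ]
    print(f"token-weighted error ratio: {token_weighted_error_ratio(window):.3f}")
    print(f"burn rate: {burn_rate(window):.1f}")
```

Counted per request, the example window has a 50% error rate; weighted by tokens, the failed 32k-token call dominates, which is the behavior the card describes.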

03
Latency–cost coupling
When SLOs break, the bill moves in minutes.

The same chart shows TTFT p99 and $/1k tokens. When latency regresses, you see retries, session blowups, and reasoning-loop cost in real time — not on next month's invoice.

  • Coupled latency and cost time series
  • Slack digests with estimated dollar impact
  • Leadership sees cost impact while SREs still have root cause
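
One way to picture a coupled series is to bucket request events by minute and emit p99 TTFT next to $/1k tokens for the same bucket. The sketch below does that with a nearest-rank percentile; the event tuple layout and function names are hypothetical, and the sample values are invented for the example.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def couple(events):
    """Build a per-minute series of (minute, p99 TTFT in ms, USD per 1k tokens).

    events: iterable of (minute, ttft_ms, tokens, cost_usd) tuples.
    """
    buckets = defaultdict(lambda: {"ttft": [], "tokens": 0, "cost": 0.0})
    for minute, ttft_ms, tokens, cost_usd in events:
        bucket = buckets[minute]
        bucket["ttft"].append(ttft_ms)
        bucket["tokens"] += tokens
        bucket["cost"] += cost_usd

    series = []
    for minute in sorted(buckets):
        bucket = buckets[minute]
        usd_per_1k = (1000 * bucket["cost"] / bucket["tokens"]
                      if bucket["tokens"] else 0.0)
        series.append((minute, p99(bucket["ttft"]), usd_per_1k))
    return series


# Invented sample: minute 1 shows a TTFT regression and a higher unit cost.
EVENTS = [
    (0, 100.0, 1000, 0.002),
    (0, 120.0, 1000, 0.002),
    (1, 900.0, 500, 0.004),
]

if __name__ == "__main__":
    for minute, ttft, usd in couple(EVENTS):
        print(f"minute {minute}: p99 TTFT {ttft:.0f} ms, ${usd:.4f}/1k tokens")
```

Keeping both values in one row per time bucket is what lets a latency regression and its cost effect be read off the same chart.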

04
Chargeback & rollups
Numbers you can defend in budget review.

Cost per token by team, cluster, workload, and model — rolled up the way inference actually runs, not as a generic GPU line item. Daily chargeback with CSV export, grounded in the same trace data your operators trust.

  • Spend by team · cluster · workload · model
  • Per-team chargeback with CSV export for finance
  • Idle GPU waste tracked separately — never folded into workload cost
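
A chargeback rollup of this shape can be sketched in a few lines: group per-request cost records by team, cluster, and model, derive cost per 1k tokens, and write a CSV with idle GPU cost on its own row rather than spread across workloads. The record fields, the `(idle)` row label, and the column names are assumptions for this example, not the product's export format.

```python
import csv
import io
from collections import defaultdict


def rollup(records):
    """Sum tokens and cost per (team, cluster, model) key."""
    totals = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0})
    for rec in records:
        key = (rec["team"], rec["cluster"], rec["model"])
        totals[key]["tokens"] += rec["tokens"]
        totals[key]["cost_usd"] += rec["cost_usd"]
    return dict(totals)


def to_csv(totals, idle_cost_usd):
    """Render the rollup as CSV text, with idle GPU cost as a separate row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["team", "cluster", "model", "tokens",
                     "cost_usd", "usd_per_1k_tokens"])
    for (team, cluster, model), t in sorted(totals.items()):
        per_1k = 1000 * t["cost_usd"] / t["tokens"] if t["tokens"] else 0.0
        writer.writerow([team, cluster, model, t["tokens"],
                         f"{t['cost_usd']:.4f}", f"{per_1k:.4f}"])
    # Idle waste is never folded into a team's workload cost.
    writer.writerow(["(idle)", "", "", 0, f"{idle_cost_usd:.4f}", ""])
    return buf.getvalue()


# Invented sample records.
RECORDS = [
    {"team": "search", "cluster": "us-east", "model": "llama",
     "tokens": 1000, "cost_usd": 0.5},
    {"team": "search", "cluster": "us-east", "model": "llama",
     "tokens": 3000, "cost_usd": 1.5},
    {"team": "support", "cluster": "eu-west", "model": "mistral",
     "tokens": 2000, "cost_usd": 0.4},
]

if __name__ == "__main__":
    print(to_csv(rollup(RECORDS), idle_cost_usd=0.75), end="")
```

Because the rows are built from the same per-request records an operator would inspect, any line in the export can be traced back to the requests behind it.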

Also available

Execution-aware gateway

OpenAI-compatible, self-hosted, on the request path inside your cluster. Per-tenant policy, intelligent routing, and first-party visibility — for teams that want a drop-in gateway alongside Inference SRE.

Who it's for

One platform for the teams that run inference — and the ones who pay for it.

Platform & SRE

Flame graphs, token SLOs, and burn-rate alerts with root cause — so incidents start with evidence, not guesswork.

Engineering leadership

Inference health and spend in one place: rollups by team, cluster, and model that tie reliability to dollars.

FinOps & finance

Defensible chargeback — cost per token by team and workload, daily rollups, CSV export ready for budget review.

Why us

Others trace the API call. KubeBurner traces the inference path.

Figure: API layer vs. inference stack. The top layer is the API boundary: choose model, send request, log result. Below it, the inference stack traces gateway, scheduler, prefill, KV cache, and decode for a single request.

App-layer LLM observability stops at the API boundary. KubeBurner traces the full inference stack: gateway, scheduler, prefill, KV cache, and decode.

App-layer observability

Prompts, responses, and token counts. Useful for product teams — but one span for the whole request. No queue wait, no KV eviction, no prefill vs decode. Cost and latency stay in separate tools.

KubeBurner

Infrastructure-deep, self-hosted Inference SRE. Every request decomposed across the inference stack, with token SLOs and a price tag on every span. The level where reliability and cost are actually decided.

Approach

Four principles. No surprises.

01

Self-hosted, end to end.

Runs entirely inside your Kubernetes cluster. Your prompts, traces, and billing data — none of it leaves your network. Built for regulated and air-gapped environments.

02

OpenTelemetry-native.

Collectors run in-cluster, with semantic conventions aligned to distributed inference. Trace gateway, scheduler, prefill, KV cache, and decode, correlated with the metrics you already run.

03

Cross-stack by design.

Works wherever your serving layer emits traces — not locked to one engine or one vendor stack. The trace layer is the product; FinOps and chargeback plug in on top.

04

Always human-in-the-loop.

Every recommendation is a suggestion until you accept it. Preview, apply, roll back. Nothing is auto-changed in your cluster without an explicit decision.

Where we are

Private preview. Working with a small group of design partners.

What design partners get
  • Private install in your own cluster
  • Early flame graph, SLO, and chargeback dashboards
  • Direct line to the team — influence the roadmap
  • Apache 2.0 — your install is yours, forever

Get in touch

We're a fit if you're running production LLM inference on Kubernetes — whether you're the platform team debugging TTFT, or leadership trying to get ahead of GPU spend with defensible numbers by team and cluster. Self-hosted, infrastructure-deep, Apache 2.0.

Send a note to hello@kubeburner.ai and we'll set up a short call.