The execution-aware AI gateway for Kubernetes.
Self-hosted. OpenAI-compatible. Optimizes how tokens execute on your GPUs. Tells you who owes the bill.
To adopt it, change one line.
The problem
LLM inference on Kubernetes is leaking money in three places.
GPUs sit idle.
Even on leading inference platforms, GPU utilization rarely cracks 40% in production. KV-cache fragmentation, naive scheduling, and over-provisioning quietly burn the budget.
Nobody knows who owes the bill.
Engineering, product, and finance argue every quarter. Generic FinOps tools don't understand the workload-level details that actually drive LLM cost — so the chargeback report never gets trusted.
Prompts leak to a third party.
SaaS gateways hit a procurement wall for any team handling PII, financial data, healthcare records, or proprietary code. Without a self-hosted option, there is no path forward.
What we do
One install. Three load-bearing pillars.
A drop-in OpenAI-compatible gateway that lives inside your cluster, in front of your inference workloads. Streaming responses pass through byte-for-byte. Per-tenant policy, intelligent routing, and full request-level visibility — without your prompts ever leaving your network.
- → OpenAI-compatible — change a base URL, nothing else (sketched after this list)
- → Per-tenant entitlements, rate limits, and quotas
- → First-party visibility into every request
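A minimal sketch of what that one-line change looks like with the official OpenAI Python SDK. The in-cluster service address, port, model name, and key are assumptions for illustration, not fixed KubeBurner values.

```python
# Point the existing OpenAI client at the in-cluster gateway instead of api.openai.com.
# The service name, namespace, port, model, and key below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://kubeburner-gateway.kubeburner.svc.cluster.local:8080/v1",  # was https://api.openai.com/v1
    api_key="per-tenant-key-issued-by-your-platform-team",  # placeholder tenant credential
)

# Everything else stays as it was, including streaming.
stream = client.chat.completions.create(
    model="llama-3-70b",  # whatever model your inference backend serves
    messages=[{"role": "user", "content": "Summarize today's on-call handoff."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```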
Generic Kubernetes FinOps tools treat an LLM pod as a generic pod. We don't. We report exact cost-per-token by workload, model, and team — broken down the way LLM inference actually behaves. Daily chargeback rollups with CSV export, ready for finance (a toy rollup is sketched after this list).
- → Cost-per-token by model · team · workload
- → Per-team chargeback with CSV export
- → Idle GPU waste tracked separately, never folded in
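To make the breakdown concrete, here is a toy rollup in Python of the kind the report describes: cost per token grouped by team, model, and workload, with idle GPU cost kept on its own line. The field names, GPU rate, and sample rows are illustrative assumptions, not KubeBurner's schema or pricing.

```python
# Toy daily chargeback rollup: cost per token by team, model, and workload,
# with idle GPU waste tracked separately instead of being folded into team costs.
import csv
from collections import defaultdict

GPU_DOLLARS_PER_SECOND = 2.50 / 3600  # assume a $2.50/hr GPU

requests = [
    # (team, model, workload, tokens, gpu_seconds_attributed) -- sample data
    ("search",  "llama-3-70b", "rag-api",       120_000, 340.0),
    ("support", "llama-3-70b", "ticket-triage",  45_000, 130.0),
    ("search",  "mistral-7b",  "rag-api",         80_000,  60.0),
]
idle_gpu_seconds = 5_400.0  # measured idle time, billed to no one

rollup = defaultdict(lambda: {"tokens": 0, "cost": 0.0})
for team, model, workload, tokens, gpu_s in requests:
    key = (team, model, workload)
    rollup[key]["tokens"] += tokens
    rollup[key]["cost"] += gpu_s * GPU_DOLLARS_PER_SECOND

with open("chargeback_daily.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["team", "model", "workload", "tokens", "cost_usd", "cost_per_1k_tokens_usd"])
    for (team, model, workload), agg in sorted(rollup.items()):
        per_1k = agg["cost"] / (agg["tokens"] / 1000)
        writer.writerow([team, model, workload, agg["tokens"], f"{agg['cost']:.2f}", f"{per_1k:.4f}"])
    # Idle waste gets its own line so finance sees it, but no team is charged for it.
    writer.writerow(["(unattributed)", "-", "idle-gpu", 0,
                     f"{idle_gpu_seconds * GPU_DOLLARS_PER_SECOND:.2f}", "-"])
```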
Concrete, dollar-quantified recommendations for cutting waste, not a wall of dashboards waiting for someone to act. Preview the change, apply with one click, roll back on the same screen (a sketch of that contract follows the list). Every change is snapshotted and every action logged. Always human-in-the-loop.
- → Dollar-range savings with confidence notes
- → One-click preview · apply · rollback
- → Snapshot and audit log on every change
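To illustrate the contract rather than the implementation, here is a hypothetical Python sketch of the preview, apply, and rollback flow: nothing changes without an explicit action, every apply snapshots the prior state, and every action lands in an audit log. All class and method names are made up for the example.

```python
# Hypothetical sketch of the preview -> apply -> rollback contract described above.
import json
import time
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    summary: str
    estimated_savings_usd: tuple  # (low, high) dollar range with confidence notes elsewhere
    proposed_change: dict         # desired state, e.g. new replica counts

@dataclass
class AuditedApply:
    audit_log: list = field(default_factory=list)

    def preview(self, rec: Recommendation, current: dict) -> dict:
        # Show the diff; touch nothing.
        return {"before": current, "after": {**current, **rec.proposed_change}}

    def apply(self, rec: Recommendation, current: dict) -> dict:
        snapshot = json.loads(json.dumps(current))  # snapshot before any change
        self.audit_log.append({"ts": time.time(), "action": "apply",
                               "change": rec.proposed_change, "snapshot": snapshot})
        return {**current, **rec.proposed_change}

    def rollback(self) -> dict:
        # Restore the state captured by the most recent apply.
        last_apply = next(e for e in reversed(self.audit_log) if e["action"] == "apply")
        self.audit_log.append({"ts": time.time(), "action": "rollback"})
        return last_apply["snapshot"]
```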
Why us
For platform teams who need an AI gateway and cannot send prompts to a SaaS, KubeBurner is the only one that's also execution-aware on the GPUs underneath.
Other gateways operate at the control plane: pick a model, send the request, log the result. Useful, but they leave the harder problem (what actually happens on your GPUs) to someone else.
KubeBurner operates one level deeper: which GPU the request lands on, whether the KV cache is reused, how the workload is balanced, whether resources are fairly shared. That is the level where cost and latency are actually decided.
Approach
Four principles. No surprises.
Self-hosted, end to end.
Runs entirely inside your Kubernetes cluster. Your prompts, your data, your billing details — none of it leaves your network.
Drop-in compatible.
OpenAI-compatible from day one. Point your applications at our gateway and the rest of your code continues to work.
First-party measurement.
We sit on the request path, so the numbers are what actually happened — not estimates, not inferred from indirect metrics.
Always human-in-the-loop.
Every recommendation is a suggestion until you accept it. Preview, apply, roll back. Nothing is auto-changed in your cluster.
Where we are
Private preview. Working with a small group of design partners.
- → Private install in your own cluster
- → Direct line to the team
- → Influence over the roadmap
- → Apache 2.0 — your install is yours, forever
We're a fit if you're running production LLM inference on Kubernetes, care about cost and latency, and want to keep your prompts in your own cluster.
Send a note to hello@kubeburner.ai and we'll set up a short call.